This is the first post in a series on Big Data. I intend to discuss it from several angles, including but not limited to requirements, workflow, development techniques, and security.
Big Data is one of the biggest buzzwords around lately. From it, we keep inventing concepts like Open Data and Dark Data. The concept of Big Data has been discussed thoroughly, though only from a business perspective. So how should it be classified from a technical perspective?
Big Data, a Technical Definition
A simple definition of Big Data is a gigantic collection of data WITHOUT predetermined relationships or segmentation. The general goal is to find the hidden relationships or segmentation. While this is correct, the definition does not give technical personnel any specification to work with. The constraint is simply too vague.
To me, as a software developer, the definition of Big Data is as follows. Big Data is a collection of scattered data, classified by data density expressed in the range [0, 1]. Think of a data set as a physical object. The higher the data density, the more internal connection, or force, there is within the data set; a low-density data set has weaker internal connection. For example, a table in a relational database has a data density of 1, because the relationships among all of its data are determined when the table is constructed. As another example, if we gather 100 names at random locations and random times, the data set has a data density of 0, because none of the data is related to any other.
The second example is somewhat misleading. In fact, a data set with a density of 0 exists only in theory. Implied relationships, such as when or where the data was collected, always exist in the physical world. As long as we preserve them carefully, any data set has some degree of internal connection, which moves it away from a density of 0.
The advantage of this definition is that we now have a logical model to follow when processing a big data set: permute the existing low-density data set toward a higher-density one. How this can be done is a topic for a later post.
Beyond data density, another important aspect of Big Data is the interconnection between different clustered data sets. The higher the density, the weaker this cross-plane relationship will be. It is essential for the permutation of data density, as the cross-plane relationship is what guides the transformation and combines discrete data sets.
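To make the density idea a little more concrete, here is a minimal Python sketch. It rests on an assumption of mine that is not part of the definition above: density is measured as the fraction of record pairs that share at least one attribute value. The names `data_density` and `shares_any`, and the sample records, are hypothetical and exist only for illustration.

```python
from itertools import combinations


def shares_any(keys):
    """Build a predicate: two records are related if they agree on any of `keys`."""
    def related(a, b):
        return any(k in a and k in b and a[k] == b[k] for k in keys)
    return related


def data_density(records, related):
    """Assumed metric: fraction of record pairs that `related` connects, in [0, 1]."""
    pairs = list(combinations(records, 2))
    if not pairs:
        return 0.0
    linked = sum(1 for a, b in pairs if related(a, b))
    return linked / len(pairs)


# Names gathered with no context preserved: no pair shares anything.
names_only = [{"name": "Ann"}, {"name": "Bob"}, {"name": "Cy"}, {"name": "Di"}]

# The same names, but the collection day was preserved.
with_context = [
    {"name": "Ann", "day": "Mon"},
    {"name": "Bob", "day": "Mon"},
    {"name": "Cy", "day": "Tue"},
    {"name": "Di", "day": "Tue"},
]

related = shares_any(["name", "day"])
d_names = data_density(names_only, related)      # no linked pairs
d_context = data_density(with_context, related)  # 2 of 6 pairs linked
print(d_names, d_context)
```

Under this assumed metric, the bare list of names sits at a density of 0, while simply keeping the collection day links 2 of the 6 pairs and lifts the density to about 0.33. Other metrics are possible; the point is only that preserved context moves a data set away from 0.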
By bringing the cross-plane relationship into the picture, Big Data effectively becomes a three-dimensional concept in terms of data relations. To keep control of such a complex cluster of data, extra care is needed when working with it. Specifically, because the data have “crossed” relationships, we should make no assumption as to whether two data sets are connected or not.
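The advice to make no assumption about whether two data sets are connected can be sketched in Python as an explicit check. This is only an illustration under my own assumptions: a cross-plane relationship is modeled as two records from different data sets sharing the same value for one attribute, and the function `cross_links` and the sample data are hypothetical.

```python
def cross_links(set_a, set_b, key):
    """Return (a, b) pairs from two different data sets that share a value for `key`."""
    # Index the second data set by the key's value so each lookup is cheap.
    index = {}
    for b in set_b:
        if key in b:
            index.setdefault(b[key], []).append(b)
    return [(a, b) for a in set_a if key in a for b in index.get(a[key], [])]


orders = [{"order": 1, "city": "Oslo"}, {"order": 2, "city": "Lima"}]
sensors = [{"sensor": "s1", "city": "Oslo"}, {"sensor": "s2", "city": "Kyiv"}]

# Test for a connection instead of assuming one exists (or does not).
links = cross_links(orders, sensors, "city")
connected = bool(links)
print(connected, links)
```

Here the two data sets look unrelated at first glance, yet one order and one sensor share a city. Checking a different attribute could just as easily return no links at all, which is why the connection has to be tested rather than assumed.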
Relationship with the 3 Vs
Gartner presented the 3V definition of Big Data from a business perspective. This section discusses each of the Vs from a technical perspective, along with its relation to our technical Big Data definition.
The sheer size of the data is represented by Volume. Without the volume, it is not “Big” Data after all. From a technical perspective, without the volume there is no statistical game to play. To be frank, the initial permutation of the original data set will most likely come from a statistical approach: to create a denser data set from the current one, statistical permutation is the safest bet. As a result, we can conclude that volume is the prerequisite for Big Data analytics from a technical standpoint.
The speed of data accumulation is represented by Velocity. Translating this into a technical guideline gives us two key hints for Big Data analytics. First, because data accumulates fast, the analytic process needs to be equally aggressive, if not more so. Second, because the accumulating data most likely shares similar traits, due to the nature of the data collection method we use, algorithms such as greedy algorithms and methodologies such as recursion are obvious choices. We will get into these requirements and hints later on, as they are not related to the definition and classification of Big Data.
The number of isolated data sets is represented by Variety. This maps well onto the density model of our technical representation. It is also the only one of the 3 Vs that we truly care about when defining Big Data for technical purposes.
As we can see, our technical definition is hardly a redefinition of Big Data. It is more of a reinterpretation, meant to produce a more technical guideline for software and hardware design and development. In the end, the business definition is not so far off. It does include more information about other aspects of product design and engineering, though.
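As a small illustration of the first hint, here is a Python sketch of an analysis that keeps pace with accumulating data by folding each new batch into the previous result instead of reprocessing everything. The choice of a running mean is my own assumption, picked only because it is the simplest recurrence of this kind; `update_mean` and the sample batches are hypothetical.

```python
def update_mean(count, mean, batch):
    """Fold a new batch into a running (count, mean) without revisiting old data."""
    for x in batch:
        count += 1
        # Recurrence: the new mean is the old mean plus a correction for x.
        mean += (x - mean) / count
    return count, mean


count, mean = 0, 0.0
for batch in ([2.0, 4.0], [6.0], [8.0, 10.0]):
    count, mean = update_mean(count, mean, batch)

print(count, mean)
```

Each step depends only on the previous state and the newly arrived values, so the cost per batch does not grow with the total volume already collected.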