When people start thinking about big data and how they can use it to run their operation or to bring in business profits, they should really be thinking about the four Vs, which are Volume, Velocity, Variety and Veracity. Most people tend to focus on the data Volume more than anything else, and in my mind this is the wrong approach.
Big data is such a ubiquitous topic these days, and what’s funny is that the people who are combing through the data are asking a lot of questions about how best to use it, and in a strange way, I think they are asking the wrong questions. Right now people are asking, “What sort of big data analytics do I want to use?” “Do I want to use real time versus batch versus streaming analytics?” “Do I want to use Google MapReduce or do I want to standardize on Splunk?” “Do I want to use Hadoop architecture or do I want to use some Spark architecture?” These are all interesting decisions to make, but to me those are just other tools in the tool kit. What it really comes down to is not limiting the scope of available options that can help you do what you need to do.
Aside from Volume, there are three other Vs that are extremely important and must be considered. In big data, more than just the volume needs to scale up and scale down. Your application may be real time, or there may be aspects of what you’re looking at that are not real time. How you analyze the data, and how the data behaves across all four of those parameters, is what’s really important if you are going to run a successful operation.
Data Volume – Data is growing all the time. We used to look at megabytes, then gigabytes, then terabytes, and now it’s petabytes of information. The Human Genome Project (HGP), which is a big data analytics project, continues to grow because the more we understand, the more we need to capture. That’s fine, but if you build your data center or your architecture with a mind to the gigabyte world and you’re suddenly in the petabyte world, you’ll need to shift, and if you have built a data center around those original assumptions, you may be stranded in time.
Data Velocity – How quickly is that data being generated and how quickly does it need to be analyzed? Time-sensitive processes, like catching fraud, must be handled in real time, and your big data system needs to be able to use that data as it streams into your enterprise, so that you can make real time decisions.
Data Variety – Variety of data is something that people tend to not think about. In traditional business analytics, we know we are using certain types of customer data and we’re looking at it all together to figure out how that customer works. In the big data world, we are now looking at data that is not intuitively or obviously connected, but we are realizing there are connections there and we can analyze them. Where business analytics used to run against a structured database, now we are taking any type of data and mashing it all together: structured and unstructured data, text information, payment information, GPS data, audio, video. We are looking at all of it together in the complicated application world that we live in, and seeing what it really means. We are seeing what questions can be asked of this aggregate data to better understand what’s happening in our business and in our applications and ecosystems. Variety is very important.
Data Veracity – How trustworthy is the data that you are using? Especially in the big data world where we are collecting massive amounts of data…there’s going to be some fudge in there. How do we deal with the veracity of the data? We are making educated decisions based on data that may or may not be really verifiable. How do we plan and control for that?
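To make the Veracity question a little more concrete, here is a minimal sketch in Python of one way you might measure how much “fudge” is in a batch of records before you base decisions on it. Everything in it is an assumption for illustration: the `Payment` record, its fields, and the three validity rules are hypothetical, and a real operation would choose its own checks.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Payment:
    """Hypothetical record mixing payment and GPS data (illustrative only)."""
    customer_id: Optional[str]
    amount: Optional[float]
    latitude: Optional[float]  # None means no GPS reading was captured


def is_trustworthy(p: Payment) -> bool:
    """Apply a few assumed sanity rules; real rules depend on your data."""
    if not p.customer_id:
        return False  # cannot tie the record to a customer
    if p.amount is None or p.amount <= 0:
        return False  # missing or non-positive payment amount
    if p.latitude is not None and not -90.0 <= p.latitude <= 90.0:
        return False  # GPS reading outside the valid latitude range
    return True


def veracity_ratio(records: Iterable[Payment]) -> float:
    """Return the share of records that pass every check (0.0 if empty)."""
    items = list(records)
    if not items:
        return 0.0
    return sum(is_trustworthy(r) for r in items) / len(items)


if __name__ == "__main__":
    batch = [
        Payment("c1", 10.0, 45.0),   # passes all checks
        Payment(None, 5.0, None),    # no customer id
        Payment("c3", -1.0, 10.0),   # negative amount
        Payment("c4", 3.0, 120.0),   # impossible latitude
    ]
    print(f"trustworthy share: {veracity_ratio(batch):.2f}")
```

Tracking a number like this over time is one way to plan and control for data that may not be fully verifiable: if the share of trustworthy records drops, you know to be more cautious about the decisions built on top of it.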
Because people are thinking too much about which big data product they should use, they are not really thinking about what data they are looking at, what the characteristics of the questions they want to ask are, and how that affects the kind of infrastructure they require to run big data. This is very important inside of big data…understanding how things are going to change and how to deal with all the elasticity that is required. Well, Cloud is the perfect place to build that elasticity.
That said, what you really need to be looking at when it comes to big data is your infrastructure and whether it meets your needs and can scale over time. One of the difficult things is that big data, more than a lot of traditional applications that we’ve seen out there, is subject to a shifting scope where we are adding more and more data to it.
One of the nice things about Cloud infrastructure is that you get away from being stranded because you’ve got an open-ended system that will mature over time with you, and it’s designed to scale up and scale down the way that you need, depending on the sort of projects that you are doing. So, Cloud lends itself well to this kind of space.
Cloud in particular is not just about scaling the data and scaling the compute that can access the data; it’s also about scaling the network. When it comes to big data, people have data all over the place that they need to pull. Some of that data may be on premises because it’s customer data that you need to keep close by, some will be in your cloud application, and some won’t even be your data. It will be third party data and mash up data. You may be doing customer analysis, and part of that analysis is taking a look at where a purchase was made, so you’re using GPS that’s not provided in your data, you’re using Google Maps, you’re using weather reports to find out if someone bought an umbrella because it’s raining. These are all different types of data that need to suddenly feed into your system, and you need to be aware of how that impacts things, so that you can actually meet your requirements.
The first question you really should be asking is what kind of data infrastructure do I need for now and for later? Can Cloud infrastructure allow me to be elastic enough to make that happen in a cost effective way, and do we understand what our infrastructure really needs to be over a longer time horizon?
You also need to figure out the data and data types you are looking at now, and what you might need to factor in later on. You need to figure out your data flows and application flows, and how you need to scale up and down.
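As a rough illustration of thinking about infrastructure “for now and for later,” here is a small Python sketch that estimates how many months you have before growing data outgrows a fixed amount of capacity. It assumes steady compound monthly growth, which is a simplification, and the function name and the figures in the example are hypothetical.

```python
import math


def months_until_full(current_tb: float, capacity_tb: float,
                      monthly_growth: float) -> int:
    """Estimate whole months until data volume reaches capacity.

    Assumes compound growth: volume after n months is
    current_tb * (1 + monthly_growth) ** n. This is an illustrative
    model, not a forecast method taken from any particular product.
    """
    if current_tb <= 0 or capacity_tb <= 0:
        raise ValueError("volumes must be positive")
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    if current_tb >= capacity_tb:
        return 0  # already at or over capacity
    # Solve current * (1 + g) ** n >= capacity for the smallest whole n.
    months = math.log(capacity_tb / current_tb) / math.log(1 + monthly_growth)
    return math.ceil(months)


if __name__ == "__main__":
    # Hypothetical figures: 100 TB today, 1,000 TB (1 PB) of fixed
    # capacity, and data growing 10% per month.
    print(months_until_full(100, 1000, 0.10))
```

If the answer comes back as a couple of years or less, that is a sign that fixed capacity is likely to leave you stranded, and that being able to scale up and down matters for your operation.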
The last thing people really need to ask, and this could be a tough one, is: we’ve got all this data, so what do we ask of it? Anybody can capture the data, and most people can analyze it at this point. The problem is recognizing what kind of questions to ask and how to pull real intelligence out of it. There are some businesses that are better at figuring it out than others, and that’s one of the biggest hurdles to surmount. If you put all this infrastructure and effort into place, but you’re asking the wrong questions, what kind of business value are you deriving from it? None.
Large retailers have been quite successful at this and I’m certain it has a direct impact on their bottom line. These companies have figured out the Volume, Velocity, Variety and Veracity of the different questions that they need to ask, so that they can make really intelligent decisions about how their customers are buying and how to earn loyalty, and then they were able to actually act on it in a timely way. They did so by building an infrastructure that could scale with what they want, and by understanding what data types they needed to get and from what locations. Whether it’s their data, third party data or mash up data, they were able to figure out what questions to ask. In a retail business, all of that depends on being able to scale up and down as buyer activity changes. When buyer activity changes, it means you need better processing. You also need better data and analytics to take a look at what customers actually bought. Cloud is a perfect place to build that sort of elasticity.
What is big data and why is it important? There are many questions that people need to consider, and they also need to think about what questions they should be asking before those questions. Understanding the data, the data types and the questions to ask is necessary in the world of big data.