- October 16, 2012
It’s likely that, like me, you have heard again and again about “big data,” its three V’s, and the Hadoop brand. Yes, the volume, velocity, and variety of data are making it difficult to use traditional data solutions like BI cubes, relational databases, and bespoke data pipelines. The world needs new superheroes like Hadoop, NoSQL, NewSQL, and DevOps to solve our woes.
However, these new technologies and approaches have done much more than just solve the problems around petabytes of data and thousands of events per second. They are the right way to do data. That’s why I’m not convinced the term “big data” was a good choice for us to land on as an industry. It’s really “smart data” or “scalable data.” And despite my distaste for adding a version number to buzz phrases, even “Data 2.0” would be more apt.
If you are a CTO/CIO, system architect, manager, consultant, developer, sys admin, or simply an interested professional, my goal is to offer some initial points on why big data is a good approach to data management and analytics, regardless of the speed and quantity of your data.
Scalable Data: Multi-Node Architecture and Infrastructure-as-Code
Distributed, horizontally scalable multi-node systems are always the right way to do infrastructure, no matter the size of your data or the size of your IT team. This wasn’t always the case, but multi-node systems are now as easy to manage as single-node solutions. That is because monitoring, logging, management software, and more are all baked right in; systems come to life in a coordinated fashion that hides the complexity and scales as needed. And because the infrastructure is defined as code, you can test it the same way you test programming code. Manually testing a multi-node system may be difficult, but testing a piece of code is straightforward.
One of the worst things that can happen to an IT team is having to manage major architecture changes. Using open source, multi-node technologies with an infrastructure-as-code foundation lets organizations grow organically and swap tools and software in and out as needed. Simply modify your infrastructure definitions, test your code, and deploy. This kind of framework also works well with the DevOps approach to system management. Code repositories are collaborative and iterative, empowering individual developers to manage infrastructure directly while keeping safeguards and tests in place to ensure reliability and quality.
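To make the "modify, test, deploy" loop concrete, here is a minimal sketch of an infrastructure definition that can be checked like any other code. It is not the format of any particular tool; the `Node` class, the role names, and the rules in `validate_cluster` are assumptions I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One machine in a hypothetical cluster definition."""
    name: str
    role: str               # illustrative roles: "database", "worker"
    monitored: bool = True  # assume monitoring is on unless disabled


def validate_cluster(nodes):
    """Return a list of problems found in a cluster definition."""
    problems = []
    names = [n.name for n in nodes]
    if len(names) != len(set(names)):
        problems.append("duplicate node names")
    # Illustrative rule: every role needs at least two nodes for redundancy.
    for role in ("database", "worker"):
        if sum(1 for n in nodes if n.role == role) < 2:
            problems.append(f"fewer than two '{role}' nodes")
    unmonitored = [n.name for n in nodes if not n.monitored]
    if unmonitored:
        problems.append("unmonitored nodes: " + ", ".join(unmonitored))
    return problems


def scale_out(nodes, role, count):
    """Return a new definition with `count` extra nodes of the given role."""
    existing = sum(1 for n in nodes if n.role == role)
    extra = [Node(f"{role}-{existing + i + 1}", role) for i in range(count)]
    return list(nodes) + extra


CLUSTER = [
    Node("database-1", "database"),
    Node("database-2", "database"),
    Node("worker-1", "worker"),
    Node("worker-2", "worker"),
]

if __name__ == "__main__":
    # Change the definition, then test it before anything is deployed.
    bigger = scale_out(CLUSTER, "worker", 2)
    assert validate_cluster(bigger) == []
    print(len(bigger), "nodes defined")
```

The point of the sketch is that a growth step such as `scale_out` is just a change to data, and the same checks run against the new definition before it reaches real machines.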
Smart Data: Machine Learning and Data Science
You don’t need petabytes of data to begin implementing smart algorithms. To run your business more efficiently, you need to be predictive. You must forecast business and market trends before they happen so you can anticipate how to steer your organization. The companies that win will be the ones analyzing and understanding as much data as possible, building data science as a key competency. The big data ecosystem makes this easier with tools like Mahout for machine learning, Hive for business intelligence queries, and R for statistical analysis, all of which can interface with Hadoop. Because of big data architecture, you can keep data fresh, use a larger swath of data, and use the newest, most powerful tools to perform the analysis and processing.
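As a small illustration that being predictive does not require much data, here is a sketch of a straight-line trend forecast using ordinary least squares. It is plain Python rather than Mahout, Hive, or R, and the monthly sales figures are invented for the example.

```python
def fit_trend(values):
    """Fit y = intercept + slope * x by least squares, with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, steps_ahead=1):
    """Extend the fitted line `steps_ahead` periods past the last observation."""
    slope, intercept = fit_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


# Invented monthly sales figures, for illustration only.
monthly_sales = [10, 12, 14, 16]
next_month = forecast(monthly_sales)
```

A straight line is about the simplest model there is; the heavier tools named above earn their keep when the data is larger or the patterns are less tidy.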
Agnostic Data: The Right Database for Each Job
New data pipelining frameworks enable real-time stream processing with multi-node scalability and the ability to fork or merge flows. That means you can easily support multiple databases for multiple problems: columnar stores as primary data stores, relational databases for reporting, search databases for data exploration, graph databases for relationship data, document stores for unstructured data, and so on. Because the pipeline can split and merge data, and because your DevOps infrastructure ensures your databases have integrated monitoring and logging, the added burden of running more than one database is minimal. You just have to learn how to interface with the data through easy-to-use APIs and client libraries.
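The fork and merge idea can be sketched in a few lines. This is a toy, single-process illustration and not the API of any real pipelining framework: the store names, the routing rules, and the sample events are all assumptions, and plain Python lists stand in for the databases.

```python
from collections import defaultdict


def merge(*sources):
    """Merge several event sources into one flow (simple concatenation)."""
    for source in sources:
        yield from source


def fork(events, routes):
    """Copy each event to every sink whose predicate accepts it."""
    sinks = defaultdict(list)
    for event in events:
        for sink_name, accepts in routes.items():
            if accepts(event):
                sinks[sink_name].append(event)
    return dict(sinks)


# Hypothetical routing rules: one predicate per destination store.
ROUTES = {
    "columnar_store": lambda e: True,         # primary store keeps everything
    "search_index": lambda e: "text" in e,    # free text for data exploration
    "graph_store": lambda e: "follows" in e,  # relationship data
}

# Invented sample events from two separate sources.
web_events = [{"user": "ann", "text": "hello"}]
social_events = [{"user": "bob", "follows": "ann"}]

stores = fork(merge(web_events, social_events), ROUTES)
```

Each store receives only the events it is suited for, while the primary store keeps a complete copy. Adding another database is one more entry in `ROUTES`.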
Holistic Data: Hadoop Is Not the End-All, Be-All
Finally, let’s tackle Hadoop specifically. Hadoop is oriented around large-scale batch processing of data. But big data also covers databases, data integration and collection, real-time stream processing, and data exploration. Hadoop is not a one-trick pony, but it’s also not the answer to every data problem known to man.
Frameworks like Flume, Storm, and S4 are making it easier to perform stream processing, such as collecting hundreds of tweets per second, handling thousands of ad impressions per second, or processing data in near real-time as it flows to its destination (whether a database, the Hadoop filesystem, or elsewhere). New database technologies are providing more powerful ways of querying data and building applications. R, Hive, Mahout, and more are providing better data science tools. Tableau, Pentaho, GoodData, and others are pushing the envelope with data visualization and big data dashboarding.
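To show what processing data as it flows looks like in miniature, here is a sketch that counts events in one-second windows as they arrive. It runs in a single process and is not how Flume, Storm, or S4 expose this; the function name, the event format, and the sample ad impressions are assumptions for illustration.

```python
from collections import Counter


def count_per_second(events):
    """Yield (second, counts) each time a one-second window closes.

    `events` is an iterable of (timestamp_seconds, key) pairs and is
    assumed to arrive in non-decreasing timestamp order.
    """
    current, counts = None, Counter()
    for timestamp, key in events:
        second = int(timestamp)
        if current is not None and second != current:
            # The previous window is complete: emit it and start a new one.
            yield current, dict(counts)
            counts = Counter()
        current = second
        counts[key] += 1
    if current is not None:
        # Emit the final, still-open window when the input ends.
        yield current, dict(counts)


# Invented ad impressions as (timestamp in seconds, ad id).
impressions = [(0.1, "ad-a"), (0.4, "ad-b"), (0.9, "ad-a"), (1.2, "ad-a"), (3.0, "ad-b")]
windows = list(count_per_second(impressions))
```

Because the function is a generator, each window's result is available as soon as that second has passed, without waiting for a batch job over the full dataset. Seconds with no events simply produce no window.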
Big data software and frameworks are the right foundation for data + data integration and collection + data science + statistical analysis + infrastructure management and administration + IT scaling + data-centric applications + data exploration and visualization. Often regardless of data size.
Your organization benefits from adopting these best practices early and working with vendors that understand your company’s problem isn’t just “oh no, I have too much data”. It’s all about return on investment. The big data approach lowers overhead, enables faster and more efficient IT infrastructure management, generates better insights, and puts them to work in your organization.