Which vendors do you engage with, and for what? How do you make sense of the landscape?
Jeff Kelly @ Wikibon presents a nice Big Data market segmentation landscape graphic that I found quite interesting.
Lots of new entrants in this space are raising the innovation quotient. According to GigaOM:
“Cloudera which is synonymous with Hadoop has raised $76 million since 2009 [Has raised another $65 million in December 2012]. Newcomers MapR and Hortonworks have raised $29 million and $50 million. And that’s just at the distribution layer, which is the foundation of any Hadoop deployment. Up the stack, Datameer, Karmasphere and Hadapt have each raised around $10 million, and then [there] are newer funded companies such as Zettaset, Odiago and Platfora. Accel Partners has started a $100 million big data fund to feed applications utilizing Hadoop and other core big data technologies. If anything, funding around Hadoop should increase in 2012, or at least cover a lot more startups.”
Hadoop usage/penetration is growing as more analysts, programmers and – increasingly – processes “use” data. Accelerating data growth drives performance challenges, load time challenges and hardware cost optimization.
The data growth chart below gives you a sense of how quickly we are creating digital data. A few years ago Terabytes were considered a big deal. Now Exabytes are the new Terabytes. Making sense of large data volumes at real-time speed is where we are heading.
- 1000 Kilobytes = 1 Megabyte
- 1000 Megabytes = 1 Gigabyte
- 1000 Gigabytes = 1 Terabyte
- 1000 Terabytes = 1 Petabyte [where most SME corporations are?]
- 1000 Petabytes = 1 Exabyte [where most large corporations are?]
- 1000 Exabytes = 1 Zettabyte [where leaders like Facebook and Google are]
- 1000 Zettabytes = 1 Yottabyte
- 1000 Yottabytes = 1 Brontobyte
- 1000 Brontobytes = 1 Geopbyte
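As a rough illustration, the decimal (1000x) prefixes in the list above can be computed programmatically. This is a minimal sketch; the `humanize` function name and its formatting choices are my own, not part of any library:

```python
# Convert a raw byte count into a human-readable unit name,
# using the decimal (1000x) prefixes from the list above.
UNITS = ["Bytes", "Kilobytes", "Megabytes", "Gigabytes", "Terabytes",
         "Petabytes", "Exabytes", "Zettabytes", "Yottabytes",
         "Brontobytes", "Geopbytes"]

def humanize(num_bytes):
    value = float(num_bytes)
    for unit in UNITS:
        # Stop once the value fits under the next prefix
        # (or we run out of named units).
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000.0

print(humanize(10**12))        # 1 Terabytes
print(humanize(2.5 * 10**18))  # 2.5 Exabytes
```

So a petabyte-scale shop sits six divisions by 1000 above a single byte, which is why "Exabytes are the new Terabytes" is less hyperbole than it sounds.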
Hadoop based analytic complexity grows as data mining, predictive modeling and advanced statistics become the norm. Usage growth is driving the need for more analytical sophistication.
Hadoop’s framework brings a new set of challenges related to the compute infrastructure and underlying network architectures. As Hadoop graduates from pilots to a mission-critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in enterprise RDBMSs becomes imperative.
Finally, adoption of Hadoop in the enterprise will not be an easy journey, and the hardest steps are often the first. Then, they get harder! Weaning IT organizations off traditional DB and EDW models and onto a new approach can be compared to moving the moon out of its orbit with a spatula… but it can be done.
2013-2014 may be the year Hadoop crosses into mainstream IT.
Sources and References
1) Hadoop is a Java-based software framework for distributed processing of data-intensive transformations and analyses.
Apache Hadoop = HDFS + MapReduce
- Hadoop Distributed File System (HDFS) for storing massive datasets using low-cost storage
- MapReduce, the algorithm on which Google built its empire
MapReduce breaks a big data problem into subproblems; distributes them onto tens, hundreds, or even thousands of processing nodes; and then combines the results into a smaller, easy-to-analyze data set. Think of this as an efficient parallel assembly line for data analysis. MapReduce was first presented to the world in a 2004 white paper by Google. Yahoo re-implemented the technique and open sourced it via the Apache Software Foundation.
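The map/shuffle/reduce idea can be sketched in-process with plain Python, no Hadoop required. This toy word-count example only mimics what the framework distributes across nodes; the function names are illustrative:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine each key's list of values into a single result.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data", "big deal"]
result = reduce_phase(shuffle_phase(map_phase(docs)))
print(result)  # {'big': 2, 'data': 1, 'deal': 1}
```

In a real cluster, the map and reduce functions run on different machines and the shuffle moves data over the network; the programming model, however, is this simple.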
2) Related components often deployed with Hadoop – HBase, Hive, Pig, Oozie, Flume and Sqoop. These components form the core Hadoop Stack.
- HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s BigTable architecture. HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant.
- Hive is a SQL-like query language for interrogating data stored in Apache Hadoop
- Pig, a high-level query language for large-scale data processing
- ZooKeeper, a toolkit of coordination primitives for building distributed systems
3) The Hadoop ecosystem is evolving constantly, which makes enterprise IT adoption tricky; enterprise IT tends to like stable, proven models with a long maintenance tail.
4) It’s important to understand what Hadoop doesn’t do. Big Data technology like the Hadoop stack does not itself deliver insight; insights come from analytics that combine the results of things like Hadoop MapReduce jobs with the manageable “small data” already in the data warehouse (DW).
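A hedged sketch of that point: joining an aggregate produced by a hypothetical MapReduce job with a small dimension table already in the warehouse. Every name and number here is invented for illustration:

```python
# Output of a hypothetical MapReduce job: page views per product id.
mapreduce_output = {"p1": 120000, "p2": 45000, "p3": 9000}

# "Small data" dimension table already sitting in the warehouse.
warehouse_products = {
    "p1": {"name": "Widget", "category": "Hardware"},
    "p2": {"name": "Gadget", "category": "Hardware"},
    "p3": {"name": "Gizmo",  "category": "Toys"},
}

# The insight comes from the join, not from either side alone:
report = [
    {"product": warehouse_products[pid]["name"],
     "category": warehouse_products[pid]["category"],
     "views": views}
    for pid, views in mapreduce_output.items()
]
top = max(report, key=lambda row: row["views"])
print(top["product"])  # Widget
```

Neither the raw view counts nor the product table answers "which product is hottest" on its own; the combination does.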
5) Foursquare and Hadoop case study writeup by Matthew Rathbone of the Foursquare Engineering team: http://engineering.foursquare.com/2011/02/28/how-we-found-the-rudest-cities-in-the-world-analytics-foursquare/
6) Presentation Big Data at FourSquare: http://engineering.foursquare.com/2011/03/24/big-data-foursquare-slides-from-our-recent-talk/
7) For more background info on Hadoop check out our article: New Tools for New Times: Primer on Big Data, Hadoop and In-Memory
8) Businessweek article on Hadoop uses… http://www.businessweek.com/technology/getting-a-handle-on-big-data-with-hadoop-09072011.html
9) Data mining leveraging distributed file systems is a field with multiple techniques. These include: Hadoop, map-reduce; PageRank, topic-sensitive PageRank, spam detection, hubs-and-authorities; similarity search; shingling, minhashing, random hyperplanes, locality-sensitive hashing; analysis of social-network graphs; association rules; dimensionality reduction: UV, SVD, and CUR decompositions; algorithms for very-large-scale mining: clustering, nearest-neighbor search, gradient descent, support-vector machines, classification, and regression; and submodular function optimization.
10) Cloudera distribution for Hadoop (CDH) bundles the following components:
- HDFS – Hadoop Distributed File System
- MapReduce – Parallel data-processing framework
- Hadoop Common – A set of utilities that support the Hadoop subprojects
- HBase – Hadoop database for random read/write access
- Hive – SQL-like queries and tables on large datasets
- Pig – Data flow language and compiler
- Oozie – Workflow for interdependent Hadoop jobs
- Sqoop – Integration of databases and data warehouses with Hadoop
- Flume – Configurable streaming data collection
- ZooKeeper – Coordination service for distributed applications
- Hue – User interface framework and software development kit (SDK) for visual Hadoop applications