Hadoop and Big Data are practically synonymous these days, but as the Big Data hype machine gears up, there’s a lot of confusion about where Hadoop actually fits into the overall Big Data landscape.
Hadoop is an open-source software framework that stores and analyzes large data sets distributed across multiple off-the-shelf servers. Responsible for much of the heavy lifting involved in analyzing data from such varied sources as mobile phones, email, social media, sensor networks and pretty much anything that can offer up actionable data, Hadoop is often considered the operating system of Big Data.
And that’s where the first myth creeps in:
It’s not. Whether you prefer to call it a “framework” or a “platform,” just don’t think Hadoop will solve all of your Big Data problems.
“There is no standard Hadoop stack,” said Phil Simon, author of Too Big to Ignore: The Business Case for Big Data. “It’s not like going to IBM or SAP to get a standard database.”
However, Simon doesn’t think that will be a long-term problem. First, since Hadoop is an open-source project, many other Hadoop-related projects, such as Cassandra and HBase, can address specific needs. HBase, for instance, offers a distributed database that supports structured data storage for large tables.
Moreover, just as Red Hat, IBM and plenty of other vendors packaged Linux into a variety of user-friendly products, Big Data startups are emerging to do the same with Hadoop.
So, while Hadoop isn’t a complete solution in and of itself, most enterprises will actually encounter it as something packaged in larger Big Data suites.
Hadoop is often talked about like it’s a database, but it isn’t. “There’s nothing in the core Hadoop platform like a query or an index,” said Marshall Bockrath-Vandegrift, a software engineer with Damballa, a security company. Damballa uses Hadoop to analyze real-time security threats.
“We use HBase to give our threat analysts the ability to run real-time queries against passive DNS data. HBase and the other real-time technologies are not only complementary to Hadoop, but most depend on the core Hadoop distributed storage technology (HDFS) to provide performant access to distributed datasets,” he added.
Or, as Prateek Gupta, a data scientist with marketing analytics firm BloomReach said: “Hadoop is not a replacement for a database system, but you can use it to build one.”
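To make the "no query or index" point concrete, here is a minimal Python sketch. It is not Hadoop or HBase code, and the records, function names and IP addresses are hypothetical stand-ins for the kind of passive DNS data described above. It only illustrates the general difference between scanning every record to answer a question and building an index first so that later lookups are direct.

```python
from collections import defaultdict

# Hypothetical (domain, ip) records standing in for passive DNS data.
# The addresses come from the range reserved for documentation.
RECORDS = [
    ("example.com", "192.0.2.10"),
    ("example.org", "192.0.2.20"),
    ("example.com", "192.0.2.30"),
]


def scan_lookup(records, domain):
    """Without an index: read every record and keep the matches."""
    return [ip for d, ip in records if d == domain]


def build_index(records):
    """Make one pass over the data to build a domain -> [ip, ...] index."""
    index = defaultdict(list)
    for d, ip in records:
        index[d].append(ip)
    return dict(index)


def index_lookup(index, domain):
    """With an index: go straight to the matching entries."""
    return index.get(domain, [])


INDEX = build_index(RECORDS)
print(scan_lookup(RECORDS, "example.com"))
print(index_lookup(INDEX, "example.com"))
```

Both lookups return the same answer. The difference is that the scan touches every record on every question, while the index pays that cost once up front, which is roughly the kind of work a database layer takes on.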
Many organizations fear that Hadoop is too new and untested to be suited for the enterprise. Nothing could be further from the truth.
Remember, Hadoop was modeled on the designs Google published for the Google File System (GFS), a distributed storage platform, and Google MapReduce, a data analytics tool that runs on top of GFS. Yahoo actually put the time and money behind Hadoop, and in 2008 launched its first major Hadoop application, a search “webmap,” which indexed all known webpages and the corresponding meta-data needed to search those pages.
Today, Hadoop is used by everyone from Netflix to Twitter to eBay, and major vendors including Microsoft, IBM and Oracle all sell Hadoop tools.
It’s too early to call Hadoop a “mature” technology – which is the case with any Big Data platform – but it has been adopted and tested by major enterprises.
That doesn’t mean it’s a risk-free platform. Security is a sticking point, for instance, but businesses shouldn’t be scared off by Hadoop’s youthful veneer.
Depending on what you plan to do, this myth may come true. If you plan to build the next great Hadoop-based Big Data suite, you’ll need programmers who can write in Java and understand specialized MapReduce programming.
However, if you’re content to build on the work of others, programming shouldn’t scare you off. Data integration vendor Syncsort recommends leaning on Hadoop-compatible data integration tools that will allow analysts to run advanced queries without having to do any coding.
Most data integration tools will have GUIs that abstract MapReduce programming complexity, and many come with pre-built templates.
Moreover, startups including Alpine Data Labs, Continuuity and Hortonworks offer tools to simplify Big Data in general, and Hadoop in particular.
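For readers curious about what the MapReduce programming mentioned above involves, here is a minimal in-memory sketch of the model in Python. It does not use the Hadoop API (real Hadoop jobs are typically written in Java against Hadoop's own classes), and the function names and sample sentences are assumptions made for illustration. It shows only the three conceptual steps: map each input to key-value pairs, group the pairs by key, and reduce each group to a result.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for one document.
SAMPLE_DOCUMENTS = ["big data big hype", "hadoop stores big data"]


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this across machines;
    here it is a plain in-memory dictionary.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return (key, sum(values))


def word_count(documents):
    """Run the three steps in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(SAMPLE_DOCUMENTS)
print(counts)
```

The appeal of the model is that the map and reduce functions stay this small while the framework handles distribution. The difficulty is that every analysis has to be recast into this shape, which is the complexity the GUI tools aim to hide.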
Many SMBs fear that they’ll be locked out of the Big Data trend. The big vendors, the IBMs and Oracles, predictably peddle big, expensive solutions. That doesn’t mean there aren’t SMB-friendly tools out there.
Cloud computing is rapidly democratizing access to sophisticated technologies. “The cloud is turning Capex into Opex,” Big Data author Phil Simon notes. “You can take advantage of the same cloud services that Netflix does, and the same thing is starting to happen with Big Data. A company of five can use Kaggle.”
Kaggle calls itself a “marketplace that bridges the gap between data problems and data solutions.” For instance, startup Jetpac offered $5,000 to someone who could come up with an algorithm that would identify compelling vacation photographs. Most vacation photos are pretty awful, after all, and separating the wheat from the chaff is a tedious, time-consuming process.
Jetpac had people manually rate 30,000 photos, and sought an algorithm that would rank photos the same way actual humans did, just by analyzing metadata (photo size, captions, descriptions, etc.). If Jetpac had tried to develop this itself, the company would have spent a heck of a lot more than $5,000, and it would have had a single solution, not its pick of several.
In fact, Jetpac’s image processing tool helped them land $2.4 million in VC funding from Khosla Ventures and Yahoo co-founder Jerry Yang.
This is a common misconception associated with anything open source. Just because you’re able to reduce or eliminate the initial costs of purchasing software doesn’t mean you’ll necessarily save money. One of the problems with the cloud, for instance, is that it’s so easy to run a science project on Amazon that developers of all sorts throw projects up in AWS, forget about them, but keep paying for them.
And virtual server sprawl already makes physical server sprawl look quaint.
While Hadoop helps you store and analyze data, how will you get legacy data into the system? How will you visualize the data? How will you share it? How will you secure data as it is shared more often across the enterprise?
A Hadoop solution is actually a patchwork of solutions. You can turn to a company like Cloudera for a complete enterprise solution, or you can start putting together a highly customized solution yourself. Whatever route you choose, you’ll need to budget carefully because free software is never really free.
Jeff Vance is a Santa Monica-based writer. He’s the founder of Startup50, a site devoted to emerging tech startups. Connect with him on Twitter @JWVance.