Recently, Hadoop has been getting a lot of buzz in database and content management circles, but many people in the industry still don't really know what it is or how it can best be applied. The underlying technology for Hadoop was invented by Google so it could index all the rich textual and structural information it was collecting from the internet, and then present meaningful and actionable results to users in its search engine. There was nothing on the market that would let Google do that, so it built its own platform. Google's innovations were incorporated into an open source project named Nutch, and Hadoop was later spun off from that project. Yahoo also played a key role in developing Hadoop for enterprise applications.
Hadoop was designed as a platform to solve problems where you have a lot of complex data that doesn't fit nicely into traditional tables. It's for situations where you want to run analytics that are deep and computationally intensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
Hadoop applies to plenty of markets outside of internet search engines. It can be used to analyze financial data, to power online ordering systems that show additional items a customer might want, and in other applications where you need to analyze a lot of data in just a few seconds.
Hadoop was designed to be installed on a large number of machines that don't share any memory or hard drives. This allows you to use a bunch of cheap servers with no special requirements and run the Hadoop software on each one. When you push a workload into Hadoop, the system breaks that data into pieces and spreads them across your different servers. There's no one place where you go to see all of your data; Hadoop keeps track of which server each piece resides on. And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy on a different server.
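The split-and-replicate idea can be sketched in a few lines of Python. This is a toy model, not Hadoop's actual block-placement policy; names like `place_blocks` and the server labels are invented for illustration, and the replication factor of 3 matches Hadoop's default:

```python
import random

REPLICATION_FACTOR = 3  # Hadoop's default: three copies of every block

def place_blocks(blocks, servers):
    """Assign each block to REPLICATION_FACTOR distinct servers."""
    return {block: set(random.sample(servers, REPLICATION_FACTOR))
            for block in blocks}

def handle_failure(placement, dead_server, live_servers):
    """Re-replicate any block that lost a copy when dead_server died."""
    for block, holders in placement.items():
        if dead_server in holders:
            holders.discard(dead_server)
            # copy from a surviving holder to a server that lacks this block
            candidates = [s for s in live_servers if s not in holders]
            holders.add(random.choice(candidates))
    return placement

servers = [f"server-{i}" for i in range(6)]
blocks = [f"block-{i}" for i in range(10)]
placement = place_blocks(blocks, servers)

# simulate server-0 dying: every block still ends up with 3 live copies
live = [s for s in servers if s != "server-0"]
placement = handle_failure(placement, "server-0", live)
```

The point of the sketch is the invariant: no matter which single server dies, every block can be rebuilt to full replication from the surviving copies.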
In a centralized database system like SQL Server, you could have one large disk connected to one server with multiple CPU cores. This architecture limits the amount of horsepower you can bring to bear on that instance of your database. In a Hadoop cluster, every one of those cheap servers can have two, four or even 16 CPU cores. You can run your indexing job by sending your code to each of the dozens of servers in your Hadoop cluster, and each server operates on its own little piece of the data. Results are then delivered back to you as a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
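The MapReduce pattern described above can be shown with a minimal single-process sketch in Python. This is the classic word-count shape, not the Hadoop API itself: `map_phase` stands in for the code shipped to each server, and `reduce_phase` stands in for the shuffle-and-combine step:

```python
from collections import defaultdict

def map_phase(documents):
    """Each 'server' turns its piece of the data into (key, value) pairs."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group pairs by key, then collapse each group into one result."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

# data split into pieces, as if spread across servers
shards = ["big data big cluster", "big cluster"]
counts = reduce_phase(map_phase(shards))
# counts == {"big": 3, "data": 1, "cluster": 2}
```

Because each mapper only ever touches its own shard, the map phase parallelizes trivially across machines; the reduce phase is where partial results are merged back into the single answer set the text describes.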
Architecturally, the reason you’re able to deal with lots of data is because Hadoop spreads it out. And the reason you’re able to ask complicated computational questions is because you’ve got all of these processors, working in parallel, harnessed together.
There are numerous Apache Software Foundation projects that provide the services an enterprise needs to deploy, integrate, and work with Hadoop. Each has been developed to deliver a specific function, and each has its own community of developers and its own release cycle.
Hadoop Distributed File System (HDFS) is the core technology for the efficient scale-out storage layer, and is designed to run across low-cost commodity hardware. Apache Hadoop YARN is the prerequisite for Enterprise Hadoop, as it provides the resource management and pluggable architecture that enable a wide variety of data access methods to operate on data stored in Hadoop with predictable performance and service levels.
Apache Hive is the most widely adopted data access technology, though there are many specialized engines. For instance, Apache Pig provides scripting capabilities, Apache Storm offers real-time processing, Apache HBase offers columnar NoSQL storage and Apache Accumulo offers cell-level access control. All of these engines can work across one set of data and resources thanks to YARN. YARN also provides flexibility for new and emerging data access methods, such as search, and for programming frameworks such as Cascading.
Apache Falcon provides policy-based workflows for governance, while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
Security is provided at every layer of the Hadoop stack, from HDFS and YARN to Hive and the other data access components, and extends to the perimeter of the cluster via Apache Knox.