One of the technologies often evaluated for HPC (High Performance Computing) or Cloud infrastructure is Hadoop, which is named after a stuffed toy elephant.
Hadoop is a framework for distributing data and running applications across a large cluster of servers built on commodity hardware. Today, I’m going to put away my service management, operations and facility hats and put on my old software hat to tell you a little story about this elephant and explain, in layman’s terms, what the Hadoop framework is composed of.
It is made up of a number of components supported by a set of common utilities called Hadoop Common. The other components are Avro (a data serialization system providing dynamic integration with scripting languages), Chukwa (a data collection system for managing large distributed systems), HBase (a scalable, distributed database supporting structured data storage for large tables), HDFS (the Hadoop Distributed File System, providing high-throughput access to application data), Hive (a data warehouse framework), Map/Reduce (a software framework for distributed processing of large data sets on compute clusters), Pig (a high-level data-flow language for parallel computation) and ZooKeeper (a high-performance coordination service for distributed applications).
Let me zoom in on two of these components: the Hadoop Distributed File System (HDFS) and the Map/Reduce programming model.
To explain how these two components can be used to build a distributed architecture for fast computation and a distributed file system that can handle large data sets, let’s use an example. Take a simplistic case: you’re modeling the sales projection for your company’s products based on inputs specific to your customers’ demographics, buying behavior / trends, your product marketing plans, R&D plans, etc. You would like to derive the model using historical data over the last 50 years, and you want it to be as granular as possible, i.e. determining the sales pattern by hour and day, and by state, city, town, street, shop, etc. So, you take all the data available globally for your product and mash it all up into a huge data set, perhaps petabytes (PB) in size, and you’re ready to run the simulation as a batch process. But you know this is going to take time, and you don’t have sufficient internal compute resources to do it.
Well, it is possible to “lease” the compute and storage resources from a Cloud service provider. You just need to pay for what you use. How would such a Cloud service provider set up an infrastructure architecture to support such a need?
For a Cloud service provider offering an IaaS service, the architecture would typically be a cluster of commodity hardware running a file system that provides a fast and efficient means of breaking down the data and computations into small chunks, so that they can be distributed and executed in parallel. One way to do this is with Hadoop.
Let me first explain a bit about HDFS. It is a master/slave architecture made up of a master server called the NameNode, which regulates and manages access to files by clients, i.e. the opening / closing of files; and a number of DataNodes, which manage the storage attached to them and serve the read and write requests from the file system’s clients. An HDFS installation may be made up of hundreds or thousands of servers, each storing part of the file system’s data set. It is tuned to support large files, i.e. from GB to TB to even PB sizes. The theory behind HDFS, and Hadoop in general, is that a computation requested by an application is much more efficient if it is executed near the data it operates on, especially if the data size is huge, because moving the computation is much cheaper than moving the data (less bandwidth).
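To make the split between the two roles concrete, here is a minimal toy sketch in Python. It is not the real HDFS API: the class names, the `put` / `get` helpers, the tiny block size and the round-robin placement are all assumptions made for illustration. The point is only that the NameNode holds metadata (which blocks make up a file and where they live), while the DataNodes hold the actual bytes.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 64  # bytes, purely illustrative; real HDFS blocks are far larger


@dataclass
class DataNode:
    """Stores block contents; knows nothing about file names."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds only metadata: which blocks make up a file and where each one lives."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}  # path -> list of (block id, DataNode name)

    def allocate(self, path, size):
        n_blocks = (size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
        placement = []
        for i in range(n_blocks):
            node = self.datanodes[i % len(self.datanodes)]  # assumed round-robin placement
            placement.append((f"{path}#blk{i}", node.name))
        self.block_map[path] = placement
        return placement

    def locate(self, path):
        return self.block_map[path]


def put(namenode, nodes_by_name, path, data):
    """Client write: ask the NameNode where blocks go, then send bytes to DataNodes."""
    for i, (block_id, node_name) in enumerate(namenode.allocate(path, len(data))):
        chunk = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        nodes_by_name[node_name].write(block_id, chunk)


def get(namenode, nodes_by_name, path):
    """Client read: look up block locations, then read each block from its DataNode."""
    return b"".join(
        nodes_by_name[node_name].read(block_id)
        for block_id, node_name in namenode.locate(path)
    )
```

Notice that file contents never pass through the `NameNode` object in this sketch; the client only asks it where the blocks are and then talks to the DataNodes directly.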
Hadoop provides the capability and interfaces for distributing the work of an application so that it can be executed in parallel. This comes in the form of a programming model and execution framework called Map/Reduce, which uses HDFS for storage. It allows you to easily write applications that reliably process large amounts of data (e.g. multi-terabyte data sets) in parallel on large clusters of servers. So, imagine you have this cluster of thousands of servers. You might be thinking, how do you manage the parallel execution of the tasks on these servers? Well, the Map/Reduce framework consists of a master (JobTracker) and, for each cluster node, a slave (TaskTracker). The master is the boss, i.e. it schedules the jobs’ component tasks on the slaves, monitors them and re-executes them if they fail. And the slaves are just slaves, i.e. they obey and execute the tasks as directed by the master.
Usually, the compute node (i.e. the Map/Reduce framework) and the storage node (i.e. the Hadoop Distributed File System / HDFS) run on the same server. The reason is that it is more efficient to run your computational logic nearer to the data.
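The master/slave relationship can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Hadoop's implementation: the class names simply borrow the JobTracker and TaskTracker labels, failures are simulated through a hypothetical `fail_on` set, and tasks run one after another here, whereas the real framework runs them in parallel across the cluster.

```python
import itertools


class TaskFailed(Exception):
    """Raised by a worker when a task does not complete."""


class TaskTracker:
    """A worker that runs whatever task it is handed."""

    def __init__(self, name, fail_on=()):
        self.name = name
        self.fail_on = set(fail_on)  # task ids this worker fails on (simulated faults)

    def run(self, task_id, func, arg):
        if task_id in self.fail_on:
            raise TaskFailed(f"{self.name} failed task {task_id}")
        return func(arg)


class JobTracker:
    """The master: hands tasks to workers, watches the outcome, retries failures."""

    def __init__(self, trackers, max_attempts=3):
        self.trackers = trackers
        self.max_attempts = max_attempts
        self.log = []  # (task id, worker name, "ok" or "failed")

    def run_job(self, func, inputs):
        workers = itertools.cycle(self.trackers)  # hand out work in rotation
        results = {}
        for task_id, arg in enumerate(inputs):
            for _ in range(self.max_attempts):
                worker = next(workers)
                try:
                    results[task_id] = worker.run(task_id, func, arg)
                    self.log.append((task_id, worker.name, "ok"))
                    break
                except TaskFailed:
                    # Record the failure; the next attempt goes to the next worker.
                    self.log.append((task_id, worker.name, "failed"))
            else:
                raise RuntimeError(f"task {task_id} failed {self.max_attempts} times")
        return [results[i] for i in range(len(inputs))]
```

For example, with `JobTracker([TaskTracker("tt0", fail_on={2}), TaskTracker("tt1")])`, a job over three inputs still completes: the third task fails on `tt0`, is logged as failed, and is re-executed on `tt1`.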
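Here is a rough sketch of what "run the computation near the data" can mean for a scheduler. The function name `assign_tasks`, the notion of task "slots" and the two-pass strategy are my own assumptions for illustration; Hadoop's actual scheduler is more sophisticated than this.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring the node that already stores the block.

    block_locations: dict of block id -> name of the node holding that block
    free_slots: dict of node name -> number of tasks the node can still accept
    Returns a dict of block id -> (node name, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment = {}
    remote = []

    # First pass: run the task where the data already is, if that node has capacity.
    for block, node in block_locations.items():
        if slots.get(node, 0) > 0:
            slots[node] -= 1
            assignment[block] = (node, True)
        else:
            remote.append(block)

    # Second pass: leftovers go to the node with the most spare capacity.
    # These tasks are not local, so their data would have to travel over the network.
    for block in remote:
        node = max(slots, key=slots.get, default=None)
        if node is None or slots[node] == 0:
            raise RuntimeError("no free task slots left")
        slots[node] -= 1
        assignment[block] = (node, False)

    return assignment
```

Every assignment flagged `is_local=True` is a task whose input never has to cross the network, which is exactly the saving the co-location of compute and storage is after.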
Now that we have the foundation sorted out, let’s move on a little further. For our earlier example, say the data set is 20PB in size. This input file (data set) is split into a large number of independent fragments, which are stored across the many DataNodes, and during the Map phase each fragment is assigned to a map task. Each map task takes its fragment of the data set (input), processes the records in it and emits intermediate results as key/value pairs. The framework then sorts these map outputs by key, so that all values associated with a particular key appear together, and passes them on to the reduce tasks. Because of the way the computational work is distributed, this model is suitable for batch processes where each computation has no dependencies on the other computations. Then, in the Reduce phase, each reduce task consumes the set of intermediate results assigned to it and invokes a reduce function that combines them into the output. Both inputs and outputs are stored in a file system, and the framework manages the scheduling of these tasks, monitors them and re-executes them if they fail. The advantage of Map/Reduce is that it allows for distributed processing of a computational operation, with each fragment processed in parallel. It also brings the computational operation nearer to the data, which reduces the bandwidth and overhead of moving large sets of data.
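For the sales example, the Map and Reduce phases could look roughly like the sketch below. It runs in a single Python process and only mimics the programming model; the record layout (`state,shop,amount`) and all function names are hypothetical, and a real Hadoop job would normally be written against Hadoop's own interfaces.

```python
from collections import defaultdict
from itertools import chain


def map_sales(fragment):
    """Map: turn each raw record in one fragment into a (key, value) pair."""
    for line in fragment:
        state, _shop, amount = line.split(",")
        yield state, float(amount)


def shuffle(mapped_pairs):
    """Framework step: group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_sales(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def run_job(fragments):
    """Run map over every fragment, group by key, then reduce each group."""
    mapped = chain.from_iterable(map_sales(f) for f in fragments)
    return [reduce_sales(key, values) for key, values in shuffle(mapped)]
```

Given two fragments such as `["CA,shop1,10.0", "NY,shop2,5.5"]` and `["CA,shop3,2.5", "TX,shop4,1.0"]`, `run_job` returns the total sales per state. Each call to `map_sales` touches only its own fragment, which is why the map tasks can run independently on different servers.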
So, what kinds of applications are suitable for Hadoop as a platform?
- General complex computational applications, such as web page analytics processing;
- General statistical computational applications where the Map task outputs are small. Hadoop is not suitable in situations where the Map task outputs are large, as these create OS/disk and I/O bottlenecks and consume lots of resources during the sorting and shuffling stage.
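To get a feel for the difference between small and large Map outputs, here is a tiny Python sketch that measures how much intermediate data a mapper emits. Both mappers and the log-style records are hypothetical, and the `repr`-based byte count is only a crude stand-in for the real serialized size.

```python
def intermediate_bytes(mapper, records):
    """Approximate total size of the key/value pairs a mapper emits for the records."""
    return sum(len(repr(pair)) for record in records for pair in mapper(record))


def filtering_mapper(record):
    """Emits at most one small pair per record (here: only for error lines)."""
    if "ERROR" in record:
        yield "ERROR", 1


def exploding_mapper(record):
    """Emits one pair for every ordered pair of words, so output grows quadratically."""
    words = record.split()
    for a in words:
        for b in words:
            yield (a, b), 1
```

Running `intermediate_bytes` with each mapper over the same records shows the second one producing far more data to sort and shuffle than the first, even though the input is identical.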