Hadoop 101: Learning the Core Elements of Hadoop

According to 451 Research, the Hadoop market is expected to grow to $2.7 billion by 2018, and one of the key challenges accompanying this growth is the lack of widespread Hadoop configuration and development skills. A background in Hadoop can serve as a valuable differentiator for professionals interested in building new areas of expertise in managing and analyzing data, but finding an educational program that fits conveniently into a busy work schedule can be difficult or cost prohibitive.

To address this need, MapR launched a free, comprehensive On-Demand Training program earlier this year, designed to help data analysts, developers, and administrators attain the skills and expertise that can lead to becoming a certified Hadoop professional. In this article, we’ll take a look at the two core elements of Hadoop: the distributed file system and execution engines. This content is based on the curriculum from course HDE 100: Hadoop Essentials.

Types of File Systems

You’re probably reading this article right now in a web browser. A web browser is a file. It’s an application or executable file, but it is still a file, stored on the hard drive of the computer that you are using. The files on our local hard drive are stored using a hard drive file system, such as HFS or NTFS. Most of us have no idea how these file systems work, but we don’t have to; we just need to know that we can open files, create files, save files, modify files, and delete files. Since I can read existing files, create new files, or write to existing files, my local hard drive file system is considered a read-write file system.
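
Those everyday operations are exactly what a read-write file system exposes. As a small illustration, here is a minimal Python sketch that creates, modifies, reads, and then deletes a file (the file name notes.txt is just a placeholder):

import os

with open("notes.txt", "w") as f:       # create a new file
    f.write("first draft\n")

with open("notes.txt", "a") as f:       # modify it by appending
    f.write("second thought\n")

with open("notes.txt") as f:            # read it back
    print(f.read())

os.remove("notes.txt")                  # delete it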

When I pop a CD-ROM into the CD-ROM drive, I can access its files through a CD file system, such as CDFS. Behind the scenes, CDFS uses the very same utilities as those used for my hard drive files. This set of commonly used file system utilities is part of the POSIX standards. As with HFS or NTFS, I don’t have to know how CDFS works to use the files stored on my CD. When working with files on a CD-ROM, though, I cannot modify files, create new ones, or delete them. Therefore, CDFS is referred to as a read-only file system.

HFS and CDFS are both local to my laptop, and we call them local file systems. Each file in a file system is composed of an inode and a set of data blocks. The inode stores metadata about the file, such as the file type, permissions, owner, file name, and the time it was last modified. The inode also stores pointers to the data blocks in which the file is stored; the data blocks hold the actual contents of the file.
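
On a POSIX system you can peek at much of this inode metadata yourself. The short Python sketch below prints a few of those fields for an example file (the file name notes.txt is again just a placeholder):

import os, stat, time

info = os.stat("notes.txt")                           # placeholder path
print("inode number:", info.st_ino)                   # which inode describes this file
print("type and permissions:", stat.filemode(info.st_mode))
print("owner uid:", info.st_uid)
print("last modified:", time.ctime(info.st_mtime))
print("size in bytes:", info.st_size)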

Common Problems with a Local File System

Hard Drive Failure

What happens when the hard drive on your laptop crashes and you lose all of your files?

You can install a second local drive and mirror the two. That way, if one disk fails, you have an exact replica of the entire file system on the other drive. All of your music, photos and working files are available as if nothing happened. You can then go replace the failed drive at your convenience.

Lose Your Hard Drive

What if someone breaks into your car and steals your laptop? Mirroring will not help, as both disks are part of the same laptop and are now in the hands of the perpetrator. You can mirror important files in the cloud. That way, when disaster strikes your laptop, you can retrieve the most important data quickly. Further, your cloud provider likely has local and remote mirrors of all their customer data.

Accidentally Delete a File

What about simple human error? Anyone who has worked around a computer for any considerable length of time has accidentally deleted an important file or directory. Having a mirror copy, local or remote, will not help, because both mirrors are exact replicas of each other. Deleting from one deletes from the other.

What we should do in this case is make regular backups of our important files onto a backup drive. That way, when any disaster strikes, natural or man-made, we can restore the files from the backup. Of course, it depends on how regular “regular” is. A nightly backup would mean you would lose at most the last 24 hours of work.

It is common for an operating system to allow you to make incremental backups as often as every hour, protecting you from accidental deletion of files. With this approach, the first incremental backup copies all of the data that you want to back up. Each subsequent backup, however, only copies files that are new or have been modified since the last backup. Since we are usually only saving a small subset of our data, incremental backups are much more efficient in both the space and time required to store them.
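
As a rough illustration of the idea, here is a minimal Python sketch of an incremental backup that copies only files modified since the last backup; the paths and the way the last backup time is tracked are assumptions made for the example, not a recommendation for any particular backup tool.

import os, shutil, time

def incremental_backup(src, dst, last_backup_time):
    """Copy to dst only the files under src modified after last_backup_time."""
    for root, _, files in os.walk(src):
        for name in files:
            src_path = os.path.join(root, name)
            if os.path.getmtime(src_path) > last_backup_time:    # new or changed since last run
                rel = os.path.relpath(src_path, src)
                dst_path = os.path.join(dst, rel)
                os.makedirs(os.path.dirname(dst_path), exist_ok=True)
                shutil.copy2(src_path, dst_path)                  # copy2 preserves timestamps

# Example: back up everything changed in the last hour (both paths are placeholders).
incremental_backup("/home/me/documents", "/backup/documents", time.time() - 3600)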

Run Out of Space

Another common experience is running out of space on your local drive. This could be your personal drive filling up with home movies of your kids and pets, or your work drive filling up with archived project files. Drives fill up quickly, and we often want to keep access to the old data while giving ourselves more room for new data.

Generally, our local operating system will have a volume manager that allows us to grow our file system on the fly. For a desktop computer, or a set of servers at the office, we can often just add a new drive and run some commands to extend the file system onto that new hard drive. The volume manager implements what is called RAID, which stands for redundant array of inexpensive disks.

Distributed File Systems

A distributed file system behaves similarly to a RAID-0 file system in which the disks are spread across multiple servers. The Network File System, or NFS, is a distributed file system protocol developed by Sun Microsystems that is still commonly used to store and retrieve data across a network.

Distributed file systems aim for transparency when handling data. The distribution is invisible to client programs, which see a system that looks much like a local file system. Behind the scenes, the distributed file system handles locating files and transporting data. Since the data is spread out across many computers on a network, a distributed file system uses a data locator to keep track of where data lives. Much like the inode in a local file system, the data locator points to the location of data stored in the distributed file system.
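
To make the idea concrete, here is an illustrative Python sketch of a data locator; the structure and the names (blk_0001, server1, and so on) are invented for the example and do not reflect any particular distributed file system’s implementation.

# Maps each file to the blocks that hold its contents and the
# servers where each block (and its replicas) can be found.
data_locator = {
    "/user/alice/sales.csv": [
        {"block": "blk_0001", "nodes": ["server1", "server3"]},
        {"block": "blk_0002", "nodes": ["server2", "server4"]},
    ],
}

def locate(path):
    """Return the block-to-server layout for a file, as a client library might."""
    return data_locator.get(path, [])

for block in locate("/user/alice/sales.csv"):
    print(block["block"], "is stored on", ", ".join(block["nodes"]))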

Execution Engines

Once data is ingested into the distributed file system, Hadoop uses execution engines to process it. MapReduce is the standard execution engine in Hadoop, and newer engines such as Apache Spark are also available.

MapReduce

There are three phases in MapReduce: map, shuffle and reduce.

The map phase is split among the task tracker nodes, with each task assigned to a node where that task’s input data is located. Each map task reads its input one record at a time and emits key-value pairs.

The shuffle phase is handled by the Hadoop framework: output from the mappers is partitioned by key, sorted, and sent to the reducers.

In the reduce phase, each reducer receives one or more partitions and, for each key, reads the key along with an iterable list of the values associated with it. Reducers emit zero or more key-value pairs based on the application logic.
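
As a concrete (if simplified) illustration of these three phases, below is a sketch of the classic word count written in Python for Hadoop Streaming; the file names mapper.py and reducer.py are placeholders, and the On-Demand Training courses may use Java instead. The mapper emits (word, 1) pairs one record at a time, the framework’s shuffle groups the pairs by key, and the reducer sums the values for each key.

# mapper.py -- the map phase: emit a (word, 1) pair for every word in each input record
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- the reduce phase: the shuffle delivers lines grouped by key,
# so we can sum the counts for each word as we stream through the input
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

Outside of a cluster, the same flow can be simulated on the command line with: cat input.txt | python mapper.py | sort | python reducer.py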

Apache Spark

Tutorials explaining Spark in detail will be added to the On-Demand Training curriculum in the coming months. Below is a preview of what students will learn.

Spark is another execution framework that works with the file system to process data within a cluster. Spark can keep intermediate results in memory, whereas MapReduce shuffles data in and out of disk. Like MapReduce, Spark has map and reduce operations, but it adds others such as filter, join, and group-by for easier application development. Another difference is that MapReduce is limited to a batch execution model, whereas Spark supports batch, interactive, and streaming execution models.
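
For comparison, here is a minimal PySpark sketch of a word count (a preview of the style of code involved, not material from the upcoming tutorials); the input and output paths are placeholders, and the trailing filter is included only to show one of Spark’s additional operators.

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///data/input.txt")            # placeholder input path
            .flatMap(lambda line: line.split())            # split records into words
            .map(lambda word: (word, 1))                   # emit (word, 1) pairs, like the map phase
            .reduceByKey(lambda a, b: a + b)               # sum per word, like the reduce phase
            .filter(lambda pair: pair[1] > 1))             # extra operator: keep words seen more than once

counts.saveAsTextFile("hdfs:///data/output")               # placeholder output path
sc.stop()

Because the transformations are chained on resilient distributed datasets, Spark can keep the intermediate results in memory instead of writing them to disk between phases.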

To learn more about Hadoop and other big data technologies through free Hadoop On-Demand Training classes, please visit HERE.

Contributed by Suzanne Ferry, Vice President of Global Training and Enablement for MapR Technologies. She is responsible for the company’s worldwide technical education services. Previously, Suzanne held senior education and training positions with Marketo and Hosting.com.

