Components of Hadoop

The previous article gave you an overview of Hadoop and its two core components, HDFS and the MapReduce framework. This article now gives a brief explanation of their architecture and how they function.

HDFS:

The Hadoop Distributed File System (HDFS) is self-healing, high-bandwidth, clustered storage. HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on.

HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories.

It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
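
To make the block and replication ideas concrete, here is a minimal, purely illustrative Ruby sketch. It is not the real HDFS code: the DataNode names, the block size, and the round-robin placement are assumptions (the real NameNode uses rack-aware placement).

# Purely illustrative: split a file into fixed-size blocks and
# assign each block to several DataNodes (replication factor 3).
BLOCK_SIZE  = 128 * 1024 * 1024          # 128 MB, a common HDFS block size
REPLICATION = 3
DATA_NODES  = %w[dn1 dn2 dn3 dn4 dn5]    # hypothetical DataNode names

def plan_blocks(file_size)
  block_count = (file_size.to_f / BLOCK_SIZE).ceil
  (0...block_count).map do |i|
    # The real NameNode considers rack awareness; here we simply rotate nodes.
    replicas = (0...REPLICATION).map { |r| DATA_NODES[(i + r) % DATA_NODES.size] }
    { block: i, replicas: replicas }
  end
end

plan_blocks(400 * 1024 * 1024).each do |b|   # a 400 MB file -> 4 blocks
  puts "block #{b[:block]} -> #{b[:replicas].join(', ')}"
end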


MapReduce:

The other core component of Hadoop is MapReduce. MapReduce provides distributed, fault-tolerant resource management and scheduling, coupled with a scalable data-programming abstraction.

It is a parallel data-processing framework. The MapReduce framework is used to get data out of the various files and DataNodes available in the system. The first step is to push the data onto the different servers, where the files are replicated; in short, it is to store the data.

In the second step, once the data is stored, the code is pushed onto the Hadoop cluster via the NameNode and distributed across the different DataNodes, which become the compute nodes; the end user then receives the final output.

MapReduce in Hadoop is not a single function; there are different stages involved, such as record reader, map, combiner, partitioner, shuffle and sort, and reduce, which finally produce the output. The framework splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.

The framework sorts the outputs of the maps, which are then passed as input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework also takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
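
To make the map, shuffle and sort, and reduce stages concrete, here is a minimal, framework-free Ruby sketch that simulates the flow for a word count entirely in memory. It does not use the real Hadoop APIs, and the variable names are only illustrative.

# Purely illustrative, in-memory simulation of the MapReduce flow.
input = ["Hello World Bye World"]

# Map: emit a (word, 1) pair for every word in every input line.
pairs = input.flat_map do |line|
  line.split.map { |word| [word, 1] }
end

# Shuffle and sort: group the pairs by key and order the keys.
grouped = pairs.group_by { |word, _| word }.sort

# Reduce: sum the values for each key and emit the result.
grouped.each do |word, word_pairs|
  puts "#{word} #{word_pairs.sum { |_, count| count }}"
end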

MapReduce key-value pairs:

Mappers and reducers always use key-value pairs as input and output. A reducer reduces values per key only. A mapper or reducer may emit 0, 1, or more key-value pairs for every input. Mappers and reducers may emit any arbitrary keys or values, not just subsets or transformations of those in the input.

Example:

def map(key, value, context)
  value.to_s.split.each do |word|
    word.gsub!(/\W/, '')   # strip non-word characters
    word.downcase!
    unless word.empty?
      # emit (word, 1) for every word found
      context.write(Hadoop::Io::Text.new(word), Hadoop::Io::IntWritable.new(1))
    end
  end
end

def reduce(key, values, context)
  sum = 0
  values.each { |value| sum += value.get }
  # emit (word, total count)
  context.write(key, Hadoop::Io::IntWritable.new(sum))
end

The mapper method splits the input on whitespace, removes all non-word characters, and downcases each word, emitting the word as the key and a one as the value. The reducer method iterates over the values, adds up the numbers, and outputs the input key and the sum.

Input file: Hello World Bye World

Output file:
Bye 1
Hello 1
World 2

This concludes the briefing on the components of Hadoop, their architecture and functioning, and the steps involved in the different processes happening in both systems.

Like a coin with two faces, Hadoop also has its pros and cons, which are discussed below. Complete knowledge of any concept is only possible once you know its merits and demerits.

To acquire complete knowledge of Hadoop, keep following the upcoming posts on this blog.

The Two Faces of Hadoop

Pros:

  • Hadoop is a platform that provides both distributed storage and computational capabilities.
  • Hadoop is extremely scalable. In fact, Hadoop was first conceived to fix a scalability issue in Nutch: start at 1 TB on 3 nodes and grow to petabytes on thousands of nodes.
  • One of the major components of Hadoop is HDFS (the storage component), which is optimized for high throughput.
  • HDFS uses large block sizes, which ultimately means it works best when manipulating large files (gigabytes, petabytes…).
  • Scalability and availability are distinguishing features of HDFS, achieved through data replication and fault tolerance.
  • HDFS can replicate files a specified number of times (the default is 3 replicas), which makes it tolerant of software and hardware failure; moreover, it automatically re-replicates data blocks from nodes that have failed.
  • Hadoop uses the MapReduce framework, a batch-based, distributed computing framework that allows parallelized work over large amounts of data.
  • MapReduce lets developers focus on addressing business needs rather than getting involved in distributed-system complications.
  • To achieve parallel and faster execution of a job, MapReduce decomposes the job into map and reduce tasks and schedules them for remote execution on the slave (data) nodes of the Hadoop cluster.
  • Hadoop also has the capability to work with MapReduce jobs created in other languages; this is called streaming (see the Ruby sketch after this list).
  • It is well suited to analyzing big data.
  • When running Hadoop in the cloud, Amazon S3 can be the ultimate source of truth, with HDFS acting as ephemeral storage. You don’t have to worry about reliability; Amazon S3 takes care of that for you. This also means you don’t need a high replication factor in HDFS.
  • You can take advantage of archiving features like Amazon Glacier.
  • You also pay for compute only when you need it. It is well known that most Hadoop installations struggle to hit even 40% utilization [3], [4]. If your utilization is low, spinning up clusters on demand may be a winner for you.
  • Another key point is that your workloads may have spikes (say, at the end of the week or month) or may be growing every month. You can launch larger clusters when you need them and stick with smaller ones otherwise.
  • You don’t have to provision for peak workload all the time. Similarly, you don’t need to plan your hardware 2-3 years upfront, as is common practice with in-house clusters. You can pay as you go and grow as you please. This reduces the risk involved with Big Data projects considerably.
  • Your administration costs can be significantly lower, reducing your TCO.
  • There are no up-front equipment costs. You can spin up as many nodes as you like, for as long as you need them, then shut them down. It’s getting easier to run Hadoop on cloud infrastructure.
  • Economics – cost per TB at a fraction of traditional options.
  • Flexibility – store any data, run any analysis.
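
As a sketch of the streaming model mentioned above, the two small Ruby scripts below read from standard input and write tab-separated key-value pairs to standard output, which is how Hadoop Streaming communicates with external programs. The file names mapper.rb and reducer.rb are just placeholders, and the job-submission details are omitted.

# mapper.rb — emit "word<TAB>1" for every word on standard input.
STDIN.each_line do |line|
  line.split.each { |word| puts "#{word}\t1" }
end

# reducer.rb — Hadoop Streaming delivers the mapper output sorted by key,
# so consecutive lines with the same word can simply be summed.
current_word = nil
count = 0
STDIN.each_line do |line|
  word, value = line.chomp.split("\t")
  if word == current_word
    count += value.to_i
  else
    puts "#{current_word}\t#{count}" if current_word
    current_word = word
    count = value.to_i
  end
end
puts "#{current_word}\t#{count}" if current_word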

Cons:

  • As you know, Hadoop uses HDFS and MapReduce, and both of their master processes are single points of failure, although there is active work going on for high-availability versions.
  • Until the Hadoop 2.x release, HDFS and MapReduce use single-master models, which can result in single points of failure.
  • Security is also a major concern: Hadoop does offer a security model, but by default it is disabled because of its high complexity.
  • Hadoop does not offer storage- or network-level encryption, which is a very big concern for government-sector application data.
  • HDFS is inefficient at handling small files, and it lacks transparent compression. HDFS is not designed to work well with random reads over small files because of its optimization for sustained throughput.
  • MapReduce is a batch-based architecture, which means it does not lend itself to use cases that need real-time data access.
  • MapReduce is a shared-nothing architecture, so tasks that require global synchronization or sharing of mutable data are not a good fit, which can pose challenges for some algorithms.
  • S3 is not very fast, and vanilla Apache Hadoop’s S3 performance is not great. We at Qubole have done some work on Hadoop’s performance with the S3 filesystem.
  • S3, of course, comes with its own storage cost.
  • If you want to keep the machines (or data) around for a long time, it’s not as economical a solution as a physical cluster.

Here ends the briefing on Big Data and Hadoop, its various systems, and their pros and cons. We hope you have got an overview of the concept of Big Data and Hadoop.

Get in touch with us.

Manasa Heggere

Senior Ruby on Rails Developer
