Tuesday 30 June 2015

MapReduce Technique : Hadoop Big Data

As a batch processing architecture, the major value of Hadoop is that it enables ad hoc queries to run against an entire data set and return results within a reasonable time frame. Distributed computing across a multi-node cluster is what allows this level of data processing to take place.
MapReduce applications can process vast amounts (multiple terabytes) of data in parallel on large clusters in a reliable, fault-tolerant manner. MapReduce is a computational paradigm in which an application is divided into self-contained units of work. Each of these units of work can be issued on any node in the cluster.

http://bigdataconcept.blogspot.in/2015/06/mapreduce-hadoop-big-data.html


A MapReduce job splits the input data set into independent chunks that are processed by map tasks in parallel. The framework sorts the map outputs, which are then input to reduce tasks. Job inputs and outputs are stored in the file system. The MapReduce framework and the HDFS (Hadoop Distributed File System) are typically on the same set of nodes, which enables the framework to schedule tasks on nodes that contain data.
The MapReduce framework consists of a single primary JobTracker and one secondary TaskTracker per node. The primary node schedules job component tasks, monitors tasks, and re-executes failed tasks. The secondary node runs tasks as directed by the primary node.

MapReduce is composed of the following phases:

i)Map
ii)Reduce

The map phase

The map phase is the first part of the data processing sequence within the MapReduce framework. Map functions serve as worker nodes that can process several smaller snippets of the entire data set. The MapReduce framework is responsible for dividing the data set input into smaller chunks, and feeding them to a corresponding map function. When you write a map function, there is no need to incorporate logic to enable the function to create multiple maps that can use the distributed computing architecture of Hadoop. These functions are oblivious to both data volume and the cluster in which they are operating. As such, they can be used unchanged for both small and large data sets (which is most common for those using Hadoop).

Important: Hadoop is a great engine for batch processing. However, if the data volume is small, the processor usage that is incurred by using the MapReduce framework might negate the benefits of using this approach.
Based on the data set that one is working with, a programmer must construct a map function to use a series of key/value pairs. After processing the chunk of data that is assigned to it, each map function also generates zero or more output key/value pairs to be passed forward to the next phase of the data processing sequence in Hadoop. The input and output types of
the map can be (and often are) different from each other.

The reduce phase

As with the map function, developers also must create a reduce function. The key/value pairs from map outputs must correspond to the appropriate reducer partition such that the final results are aggregates of the appropriately corresponding data. This process of moving map
outputs to the reducers is known as shuffling. When the shuffle process is completed and the reducer copies all of the map task outputs, the
reducers can go into what is known as a merge process. During this part of the reduce phase, all map outputs can be merged together to maintain their sort ordering that is established during the map phase. When the final merge is complete (because this process is done in rounds for performance optimization purposes), the final reduce task of consolidating results
for every key within the merged output (and the final result set), is written to the disk on the HDFS.

Development languages: Java is a common language that is used to develop these functions. However, there is support for a host of other development languages and frameworks, which include Ruby, Python, and C++.

12 comments:

  1. Nice post ! Thanks for sharing valuable information with us. Keep sharing..Big data hadoop online training

    ReplyDelete
  2. I have to voice my passion for your kindness giving support to those people that should have guidance on this important matter.
    safety courses in chennai

    ReplyDelete
  3. It’s great to come across a blog every once in a while that isn’t the same out of date rehashed material. Fantastic read.
    Data science Course Training in Chennai |Best Data Science Training Institute in Chennai
    matlab training chennai | Matlab course in chennai

    ReplyDelete
  4. Useful blog.I have learned a lot of informative stuff from your blog.Thank you so much for sharing this wonderful post. Keep posting such valuable contents. https://suryainformatics.com

    ReplyDelete
  5. Amazon Web Services (AWS) is the most popular and most widely used Infrastructure as a Service (IaaS) cloud in the world. AWS has four core feature buckets—Compute, Storage & Content Delivery, Databases, and Networking.
    AWS training in chennai | AWS training in anna nagar | AWS training in omr | AWS training in porur | AWS training in tambaram | AWS training in velachery

    ReplyDelete

  6. Great Article
    Cloud Computing Projects


    Networking Projects

    Final Year Projects for CSE


    JavaScript Training in Chennai

    JavaScript Training in Chennai

    The Angular Training covers a wide range of topics including Components, Angular Directives, Angular Services, Pipes, security fundamentals, Routing, and Angular programmability. The new Angular TRaining will lay the foundation you need to specialise in Single Page Application developer. Angular Training

    ReplyDelete
  7. Just saying thanks will not just be sufficient, for the fantasti c lucidity in your writing. I will instantly grab your rss feed to stay informed of any updates. business analytics course in mysore

    ReplyDelete
  8. Took me time to understand all of the comments, but I seriously enjoyed the write-up. It proved to be really helpful to me and I'm positive to all of the commenters right here! Its constantly nice when you can not only be informed, but also entertained! I am certain you had enjoyable writing this write-up.
    data analytics course in hyderabad

    ReplyDelete
  9. Your blog provided us with valuable information to work with. Each & every tips of your post are awesome. Thanks a lot for sharing. Keep blogging,
    data science training in hyderabad

    ReplyDelete