
Tuesday, 30 June 2015

MapReduce Technique : Hadoop Big Data

As a batch processing architecture, the major value of Hadoop is that it enables ad hoc queries to run against an entire data set and return results within a reasonable time frame. Distributed computing across a multi-node cluster is what allows this level of data processing to take place.
MapReduce applications can process vast amounts (multiple terabytes) of data in parallel on large clusters in a reliable, fault-tolerant manner. MapReduce is a computational paradigm in which an application is divided into self-contained units of work. Each of these units of work can be issued on any node in the cluster.



A MapReduce job splits the input data set into independent chunks that are processed by map tasks in parallel. The framework sorts the map outputs, which are then input to reduce tasks. Job inputs and outputs are stored in the file system. The MapReduce framework and the HDFS (Hadoop Distributed File System) are typically on the same set of nodes, which enables the framework to schedule tasks on nodes that contain data.
The MapReduce framework consists of a single primary JobTracker and one secondary TaskTracker per cluster node. The JobTracker schedules the job's component tasks on the TaskTrackers, monitors them, and re-executes any tasks that fail. The TaskTrackers run the tasks as directed by the JobTracker.

MapReduce is composed of the following phases:

i) Map
ii) Reduce

The map phase

The map phase is the first part of the data processing sequence within the MapReduce framework. Map functions are the units of work that process smaller snippets of the entire data set on the cluster's worker nodes. The MapReduce framework is responsible for dividing the input data set into smaller chunks and feeding each chunk to a corresponding map function. When you write a map function, there is no need to incorporate logic that creates multiple map tasks to exploit the distributed computing architecture of Hadoop. Map functions are oblivious to both the data volume and the cluster in which they are operating, so they can be used unchanged for both small and large data sets (the latter being the more common case for Hadoop).

Important: Hadoop is a great engine for batch processing. However, if the data volume is small, the processor usage that is incurred by using the MapReduce framework might negate the benefits of using this approach.
Based on the data set being processed, the programmer constructs a map function that consumes a series of input key/value pairs. After processing the chunk of data assigned to it, each map function emits zero or more output key/value pairs, which are passed forward to the next phase of the data processing sequence in Hadoop. The input and output types of the map can be (and often are) different from each other.
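
For illustration, here is a minimal map function in Java, based on the classic word-count example (the class name TokenizerMapper is chosen just for this sketch). It receives one line of text as its input value and emits a (word, 1) pair for every token, leaving the splitting of input and the scheduling of map tasks entirely to the framework.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper: emits (word, 1) for every token in each input line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input key is the byte offset of the line; the value is the line of text.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE); // zero or more output pairs per input record
        }
    }
}

Note that the input types (offset and line of text) differ from the output types (word and count), as described above.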

The reduce phase

As with the map function, developers must also create a reduce function. Map output key/value pairs are routed to the appropriate reducer partition so that the final results are aggregates over the correctly grouped data. This process of moving map outputs to the reducers is known as shuffling. When the shuffle is complete and the reducer has copied all of the map task outputs, the reducer begins what is known as the merge process. During this part of the reduce phase, the map outputs are merged together in a way that maintains the sort ordering established during the map phase. Merging is done in rounds for performance reasons; when the final merge is complete, the reduce task consolidates the results for every key in the merged output, and the final result set is written to disk on HDFS.
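
Continuing the word-count sketch, a matching reduce function in Java might look like the following (the class name IntSumReducer is again illustrative). After the shuffle and merge, it receives every count emitted for a given word and writes the consolidated total, which ends up in the job output on HDFS.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count reducer: sums all counts received for a given word.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // one consolidated record per key
    }
}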

Development languages: Java is the language most commonly used to develop these functions. However, a host of other development languages and frameworks are also supported, including Ruby, Python, and C++.
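
To show how the pieces fit together in Java, here is a sketch of a driver class (WordCountDriver is an illustrative name) that wires the map and reduce functions above into a single job and submits it; the input and output paths are assumed to be HDFS directories passed on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: configures the job and submits it to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job is typically packaged into a JAR and launched with the hadoop jar command, for example: hadoop jar wordcount.jar WordCountDriver /input /output.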

Sunday, 28 June 2015

Operational Vs Analytical : Big Data Technology

Big Data technology falls into two classes: operational and analytical. Operational capabilities include capturing and storing data in real time, whereas analytical capabilities cover complex analysis of all the data. The two are complementary and are therefore often deployed together.




Operational and analytical Big Data technologies have different requirements, and different architectures have evolved to address them. Operational systems, such as NoSQL databases, focus on responding to concurrent requests with low latency. Analytical systems focus on complex queries that touch most or all of the data. The two kinds of system work in tandem and can manage hundreds of terabytes of data spanning billions of records.

Operational Big Data

For operational Big Data, NoSQL databases are generally used. They were developed to address the shortcomings of traditional databases; they are faster and can handle large quantities of data spread across multiple servers. Cloud computing architectures are also used so that massive computations can run effectively and cost-efficiently, which has made Big Data workloads easier to manage, faster to implement, and cheaper to run.
In addition to handling user interactions, these systems can provide intelligence about the active data. For example, in games the user's moves are studied and the next course of action is suggested. NoSQL systems can analyse real-time data and draw conclusions from it.

Analytical Big Data

Analytical Big Data is addressed by MPP (massively parallel processing) database systems and by MapReduce. These technologies evolved in response to the shortcomings of traditional databases, which scale only to a single server. MapReduce, in turn, provides a method of analysing data that is beyond the scope of SQL.

As the volume of data generated by users grows, the real-time analytical workload grows with it. MapReduce has emerged as a leading choice for Big Data analytics because it scales out across large clusters of commodity hardware. NoSQL systems also provide limited MapReduce capabilities, but data is generally copied from the NoSQL system into an analytical system such as Hadoop for MapReduce processing.