Thursday, 2 July 2015

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is the file system used to store data in Hadoop, and the way it stores that data is distinctive. When a file is saved in HDFS, it is first broken down into blocks, with any remainder occupying the final block. The block size depends on how HDFS is configured. At the time of writing, the default block size for Hadoop is 64 megabytes (MB); to improve performance for larger files, Hadoop changes this setting at installation time to 128 MB per block. Each block is then sent to a data node and written to its hard disk drive (HDD). As the first data node writes the block to disk, it streams the data to a second data node, which writes it in turn and forwards it to a third. The third node confirms completion of the write back to the second, which confirms back to the first. The NameNode is then notified, and the block write is complete. After all blocks are written successfully, the result is a file broken into blocks, with a copy of each block on three data nodes. The location of all of this data is held in memory by the NameNode.
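The splitting and placement described above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, assuming the 128 MB block size and replication factor of 3 mentioned in the text; real HDFS placement is rack-aware and decided by the NameNode.

```python
# Sketch of HDFS-style block splitting and replica placement.
# Assumes a 128 MB block size and a replication factor of 3.
BLOCK_SIZE = 128 * 1024 * 1024  # bytes
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies.
    The last block holds whatever remainder is left over."""
    full, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full
    if remainder:
        blocks.append(remainder)
    return blocks

def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct data nodes, round-robin.
    Real HDFS placement is rack-aware; this is only illustrative."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [data_nodes[(b + i) % len(data_nodes)]
                        for i in range(replication)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                   # 3 blocks: 128 MB, 128 MB, 44 MB
print(blocks[-1] // (1024 * 1024))   # 44
```

A 300 MB file thus yields two full blocks plus a 44 MB remainder block, and each of the three blocks ends up on three different data nodes.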


Hadoop is designed to run on many commodity servers. The Hadoop software architecture also lends itself to scaling within each server. HDFS can deal with individual files that are terabytes in size, and Hadoop clusters can be petabytes in size if required. Individual nodes can be added to Hadoop at any time. The only cost to the system is the input/output (I/O) of redistributing the data across all of the available nodes, which ultimately might speed up access. The upper limit on how large you can make your cluster is likely to depend on the hardware you have assembled. For example, the NameNode stores metadata in random access memory (RAM) amounting to roughly 1 GB for every 1 TB of data in the cluster.
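The sizing rule of thumb at the end of the paragraph can be turned into a quick back-of-the-envelope calculation. The 1 GB-per-TB ratio is an approximation from the text, not an exact HDFS guarantee:

```python
# Rule-of-thumb NameNode sizing: roughly 1 GB of NameNode RAM
# per 1 TB of data stored in the cluster (approximate).
RAM_GB_PER_TB = 1.0

def namenode_ram_needed_gb(cluster_data_tb, ratio=RAM_GB_PER_TB):
    """RAM a NameNode needs to track this much cluster data."""
    return cluster_data_tb * ratio

def max_cluster_tb(namenode_ram_gb, ratio=RAM_GB_PER_TB):
    """Largest data volume a NameNode with this much RAM can track."""
    return namenode_ram_gb / ratio

print(namenode_ram_needed_gb(500))  # a 500 TB cluster -> ~500 GB of RAM
print(max_cluster_tb(256))          # a 256 GB NameNode -> ~256 TB of data
```

In other words, NameNode memory, not disk, is often the practical ceiling on cluster size.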

Tuesday, 30 June 2015

MapReduce Technique: Hadoop Big Data

As a batch processing architecture, the major value of Hadoop is that it enables ad hoc queries to run against an entire data set and return results within a reasonable time frame. Distributed computing across a multi-node cluster is what allows this level of data processing to take place.
MapReduce applications can process vast amounts (multiple terabytes) of data in parallel on large clusters in a reliable, fault-tolerant manner. MapReduce is a computational paradigm in which an application is divided into self-contained units of work. Each of these units of work can be issued on any node in the cluster.

A MapReduce job splits the input data set into independent chunks that are processed by map tasks in parallel. The framework sorts the map outputs, which are then input to reduce tasks. Job inputs and outputs are stored in the file system. The MapReduce framework and the HDFS (Hadoop Distributed File System) are typically on the same set of nodes, which enables the framework to schedule tasks on nodes that contain data.
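To make that flow concrete, here is a minimal in-memory sketch in plain Python (not the Hadoop API): the input is split into chunks, map tasks emit key/value pairs, the framework sorts them, and reduce tasks aggregate each key. Word count stands in for a real job.

```python
# Minimal in-memory sketch of the MapReduce flow: map the chunks,
# sort/shuffle the intermediate pairs, then reduce each key group.
from itertools import groupby
from operator import itemgetter

def map_task(chunk):
    # Emit a (word, 1) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_task(key, values):
    # Aggregate all counts that the shuffle grouped under one key.
    return (key, sum(values))

def run_job(chunks):
    intermediate = []
    for chunk in chunks:                  # map phase (parallel in Hadoop)
        intermediate.extend(map_task(chunk))
    intermediate.sort(key=itemgetter(0))  # the framework sorts map outputs
    return [reduce_task(k, [v for _, v in group])  # reduce phase
            for k, group in groupby(intermediate, key=itemgetter(0))]

print(run_job(["big data", "big hadoop data", "data"]))
# [('big', 2), ('data', 3), ('hadoop', 1)]
```

In a real cluster each chunk would be an HDFS block processed on the node that stores it; here the "cluster" is just a loop.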
The MapReduce framework consists of a single primary JobTracker and one TaskTracker per worker node. The JobTracker schedules a job's component tasks, monitors them, and re-executes failed tasks; the TaskTrackers run tasks as the JobTracker directs.

MapReduce is composed of the following phases:


The map phase

The map phase is the first part of the data processing sequence within the MapReduce framework. Map functions run on worker nodes, each processing a smaller snippet of the entire data set. The MapReduce framework is responsible for dividing the data set input into smaller chunks and feeding each chunk to a map function. When you write a map function, there is no need to incorporate logic to create multiple maps that exploit the distributed computing architecture of Hadoop. The functions are oblivious to both the data volume and the cluster in which they are operating, so they can be used unchanged for both small and large data sets (the latter being most common for those using Hadoop).

Important: Hadoop is a great engine for batch processing. However, if the data volume is small, the processor usage that is incurred by using the MapReduce framework might negate the benefits of using this approach.
Based on the data set being worked with, a programmer must construct a map function that operates on a series of key/value pairs. After processing the chunk of data assigned to it, each map function generates zero or more output key/value pairs, which are passed forward to the next phase of the data processing sequence in Hadoop. The input and output types of the map can be (and often are) different from each other.

The reduce phase

As with the map function, developers must also create a reduce function. The key/value pairs output by the maps must be routed to the appropriate reducer partition so that the final results are aggregates of the correctly corresponding data. This process of moving map outputs to the reducers is known as shuffling. When the shuffle is complete and the reducer has copied all of the map task outputs, the reducer enters what is known as the merge process. During this part of the reduce phase, the map outputs are merged together, maintaining the sort ordering established during the map phase; the merge is done in rounds for performance reasons. When the final merge is complete, the reduce task consolidates the results for every key within the merged output, and the final result set is written to disk on HDFS.

Development languages: Java is a common language that is used to develop these functions. However, there is support for a host of other development languages and frameworks, which include Ruby, Python, and C++.

Sunday, 28 June 2015

Operational vs Analytical: Big Data Technology

There are two classes of Big Data technology: operational and analytical. Operational capabilities include capturing and storing data in real time, whereas analytical capabilities include complex analysis of all the data. The two are complementary and are therefore often deployed together.

Operational and analytical Big Data technologies have different requirements, and different architectures have evolved to address them. Operational systems, such as NoSQL databases, focus on responding to many concurrent requests. Analytical systems focus on complex queries that touch almost all of the data. The two work in tandem, and together they manage hundreds of terabytes of data spanning billions of records.

Operational Big Data

For operational Big Data, NoSQL databases are generally used. They were developed to address the shortcomings of traditional databases: they are faster and can handle large quantities of data spread over many servers. Cloud computing architectures are also used to let massive computations run effectively and cost-efficiently. This has made operational Big Data workloads easier to manage, faster to implement, and cheaper to run.
In addition to serving user interactions, these systems can provide intelligence about the active data. For example, in games, a user's moves can be analysed and the next course of action suggested. NoSQL systems can analyse real-time data and draw conclusions from it.

Analytical Big Data

Analytical Big Data is addressed by massively parallel processing (MPP) database systems and MapReduce. These technologies evolved in response to the limitations of traditional databases, which deal with only one server. MapReduce, in turn, provides a method of analysing data that is beyond the scope of SQL.

As the volume of data generated by users increases, so does the real-time analytical workload, and MapReduce has emerged as a leading choice for Big Data analytics because its model scales out across many servers. NoSQL systems also provide limited MapReduce capabilities, but data is generally copied from NoSQL systems into analytical systems such as Hadoop for MapReduce processing.

SpaceX's Falcon 9 rocket explodes.

On 28 June 2015, SpaceX's Falcon 9 rocket, on an unmanned resupply mission to the International Space Station, exploded just minutes after lift-off. NASA officials are not sure what caused the explosion and are investigating the matter.

The rocket was launched from Cape Canaveral, Florida, at 10:21 a.m. Things were going smoothly when, about two minutes in, it suddenly exploded. It was carrying more than 4,000 pounds of food and supplies to the space station. Because it was unmanned, no astronauts were aboard. American astronaut Scott Kelly is on the space station, and the supplies were intended for him.

Two earlier resupply missions to the station had also failed: an Orbital Antares rocket and the Russian Progress 59.

Big Data Introduction

What is Big Data?

Big Data refers to the enormous collections of data that organisations now hold. The amounts are so huge that managing them has become a challenge, and worse, they are growing exponentially. For example:

i) 200 of London's traffic cams collect 8 TB of data per day.
ii) One day of instant messaging in 2002 consumed 750 GB of data.
iii) Annual email traffic, excluding spam, consumes more than 300 PB of data.
iv) In 2004, Walmart's transaction database contained 200 TB of data.
v) The total digital data created in 2012 is estimated at 270,000 PB.

According to one report, this data will grow at a rate of 40% annually. Big Data techniques are therefore getting a lot of attention from organisations that need to handle that data and use it for business growth.
Big Data as a technology deals with data that is diverse and huge and requires special skills to handle; in other words, conventional technology cannot handle it effectively. It covers data that is too large in volume, too complex to handle, too variable (not all of one type), and of uncertain veracity (quality).

But today we have technologies that can draw conclusions from Big Data. For example, a retailer can track users and identify their behaviour to draw conclusions about their preferences and the prices they are searching for, and stock products accordingly. Social media signals can be used to detect events such as the outbreak of a disease or unrest in some part of the country.
So, basically, Big Data refers to a set of data so voluminous that it is impossible to manage with traditional tools.
Working with Big Data consists of creating usable data from raw data, storing it, retrieving it when necessary, and then drawing conclusions by analysing it.

Some of the terms used in Big Data are:

Volume: We might have 500 GB of storage in a personal system, but Facebook consumes 500 TB of new data every day. The spread of smartphones with new technologies such as sensors creates additional data, including location information and videos.

Velocity: Data is created very fast. Online games are played by millions of users simultaneously, stock trading algorithms generate huge quantities of data every second, sensors generate data in real time, and ad impressions capture user behaviour at millions of events per second. Data is created at a rapid pace, and we need effective technologies to deal with it.

Variety: The data comes in many types. Some may be audio, video, or text, which may be unstructured; it is not only numbers, dates, and strings.

Traditional databases dealt with smaller volumes of data that were predictable and consistent. But with the advent of new technologies and techniques, the amount of data generated is so large that we have to deal with it differently, and for that we need data analysts and data scientists.

Saturday, 27 June 2015

Big Data Analytics a high paying career

The next phase of demand for IT professionals will come from Big Data. The jobs pay well and the demand is huge, which is great news for IT professionals.

Making money from Big Data is a challenge, and this is where data analytics comes in. Data analysts can come from different backgrounds, including data science, data mining, web analytics, and statistics. IT professionals have to work in tandem with data analysts to get something meaningful out of huge quantities of data. One of the major complaints of data analysts is that they do not get enough support from their IT teams, which is a major obstacle to their work. Another major problem data analysts face is the quality of the data given to them: it is poorly documented, and they have to spend a huge amount of time reformatting it. IT professionals must understand what data analysts need and prepare data accordingly, so that analysts can spend their time analysing the data instead of reformatting it.

What data analysts need
Data analysts need all data of a common type grouped together so that they can draw conclusions. For example, the transactions in a retail shop should cover the whole day, so that analysts can work out what customers are buying, how much they are willing to pay, and what the shop should stock to attract more customers.

Raw data must be complete, with full details.
All data is important, and previous data always comes in handy for drawing conclusions, so no data should be deleted and as much data as possible should be put to use. The more data, the better the conclusion.

Data should be fed to analysts on time.
Reaching a conclusion is a time-consuming activity, so data analysts must be supplied with raw data well in advance, not at the last moment.

Analysts need a lot of space to store data.
Analysts always want to save data that may come in handy in the future; they do not want to destroy valuable data for lack of space.

Current data that is complete, correct, and consistent should be supplied to analysts.
Data analysts should spend all their time analysing the data, so the raw data must be properly documented with complete detail. Data should be given in the form of a table with rows and columns, each row representing one transaction.

Proper tools must be provided to analysts so they can draw conclusions.
Analysts need various products to work properly, and they also need to take care of those assets. Often they have no idea how to maintain these valuable assets, and this is where Big Data professionals can step in and help.

For the betterment of the organisation, IT professionals must work in tandem with data analysts and give them full support.