
Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is the file system used to store data in Hadoop, and the way it stores that data is special. When a file is saved in HDFS, it is first broken into blocks, with any remainder occupying the final block. The block size depends on how HDFS is configured. At the time of writing, the default block size for Hadoop is 64 megabytes (MB); to improve performance for larger files, Hadoop changes this setting at installation time to 128 MB per block. Each block is then sent to a different DataNode and written to the hard disk drive (HDD). When a DataNode writes a block to disk, it forwards the data to a second DataNode, which writes it as well. When this completes, the second DataNode forwards the data to a third DataNode. The third node confirms completion of the write back to the second, which confirms back to the first. The NameNode is then notified and the block write is complete. After all blocks are w...
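The block-splitting arithmetic described above can be sketched in a few lines. This is an illustrative calculation only, not Hadoop's actual implementation; the 128 MB figure is the post-installation block size mentioned in the post.

```python
# Sketch of how HDFS divides a file into fixed-size blocks.
# Illustrative only -- not Hadoop's real code.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the post-install default mentioned above

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the size of each block a file of file_size_bytes would occupy.

    Every block is full except possibly the last, which holds the remainder.
    """
    full_blocks, remainder = divmod(file_size_bytes, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final, partially filled block
    return sizes

# A 300 MB file becomes two full 128 MB blocks plus a 44 MB remainder block.
mb = 1024 * 1024
print([s // mb for s in split_into_blocks(300 * mb)])  # [128, 128, 44]
```

Each of these blocks would then be replicated across three DataNodes, as the write pipeline above describes.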

MapReduce Technique: Hadoop Big Data

As a batch processing architecture, the major value of Hadoop is that it enables ad hoc queries to run against an entire data set and return results within a reasonable time frame. Distributed computing across a multi-node cluster is what allows this level of data processing to take place. MapReduce applications can process vast amounts (multiple terabytes) of data in parallel on large clusters in a reliable, fault-tolerant manner. MapReduce is a computational paradigm in which an application is divided into self-contained units of work. Each of these units of work can be issued on any node in the cluster. A MapReduce job splits the input data set into independent chunks that are processed by map tasks in parallel. The framework sorts the map outputs, which are then input to reduce tasks. Job inputs and outputs are stored in the file system. The MapReduce framework and the HDFS (Hadoop Distributed File System) are typically on the same set of nodes, which enabl...
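The map/sort/reduce flow described above can be sketched in miniature with a word-count job: map tasks emit (word, 1) pairs from independent input chunks, the framework sorts and groups the intermediate output by key, and reduce tasks combine each group. This is a hypothetical single-process sketch of the paradigm, not Hadoop's actual Java API.

```python
# Minimal single-process sketch of the MapReduce flow described above:
# map over independent chunks, sort/group by key, then reduce each group.
from itertools import groupby
from operator import itemgetter

def map_task(chunk):
    # Emit one intermediate (key, value) pair per word in this chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_task(key, values):
    # Combine all values emitted for one key.
    return (key, sum(values))

def run_job(chunks):
    # Map phase: in a real cluster, each chunk could run on any node.
    intermediate = [pair for chunk in chunks for pair in map_task(chunk)]
    # Shuffle/sort phase: the framework sorts map outputs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce call per distinct key.
    return dict(
        reduce_task(key, (v for _, v in group))
        for key, group in groupby(intermediate, key=itemgetter(0))
    )

counts = run_job(["big data big", "data jobs"])
print(counts)  # {'big': 2, 'data': 2, 'jobs': 1}
```

In Hadoop itself the chunks are HDFS blocks and the map tasks are scheduled on the nodes that already hold the data, which is what makes the co-location of MapReduce and HDFS valuable.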

Big Data Introduction

What is Big Data? Big Data refers to the very large collections of data that organisations now accumulate. The volume of this data is so huge that managing it has become a challenge, and worse, it is growing exponentially. For example: i) 200 of London's traffic cameras collect 8 TB of data per day. ii) One day of instant messaging in 2002 consumed 750 GB of data. iii) Annual email traffic, excluding spam, consumes more than 300 PB of data. iv) In 2004, Walmart's transaction database contained 200 TB of data. v) The total digital data created in 2012 is estimated at 270,000 PB. According to one report, this data will grow at a rate of 40% annually. Big Data techniques are therefore getting a lot of attention from organisations that need to handle this data and use it for business growth. Big Data deals with data that is diverse and huge and that requires special skills to handle; in other words, conventional technology cannot handle it effectively....

Big Data Analytics a high paying career

The next phase of demand for IT professionals will come from Big Data. It is a high-paying field and demand is huge, which is great news for IT professionals. Making money from Big Data is a challenge, and this is where data analytics comes in handy. Data analysts can come from different backgrounds, including data science, data mining, web analytics, or even statistics. IT professionals have to work in tandem with data analysts to get something meaningful from the huge quantity of data. One of the major complaints of data analysts is that they don't get enough support from their IT teams, which is a major deterrent to their work. The other major problem data analysts face is the quality of the data given to them: it is poorly documented, and they have to spend a huge amount of time reformatting it. IT professionals must understand the needs of data analysts and prepare data accordingly, so that analysts can spend their time analysing the data instead of reformatting it...