Showing posts with label Big Data. Show all posts

Friday, 4 March 2016

What is Big Data

We have been dealing with data for many years, but in today's landscape the emphasis has shifted to analytics and Big Data.

Analytics gives its best results only when it is fed a large quantity of high-quality data. The more data we have, the better the decisions we can make. The data we deal with today is measured in petabytes and will scale to zettabytes in the future. As technology has evolved over the years, we have become proficient at handling massive databases, data marts and data warehouses. But now things have changed. We are getting data from many different sources, and much of it is unstructured. The new challenge for organizations is how to handle this vast amount of data, both structured and unstructured. Big Data is the response to this situation.

We have reached a point of data explosion. Where is all this data coming from? The diagram below explains this.
[Diagram: What is Big Data]



The data comes from multiple sources: sensors that gather climate information, content posted on social media, online transaction records, call detail records, cell phone GPS signals, and CCTV cameras.

Characteristics of Big Data

Big Data is characterized by four V's.

i) Volume : As data volume increases, traditional infrastructure becomes unable to handle it. Managing such huge volumes within existing budgets is not feasible. Organisations are flooded with growing data, sometimes in the range of petabytes.

ii) Velocity : We now have multiple data sources. Some of them, such as sensors, generate data at such a pace and in such volume that retaining it has become a challenge. Response times have to improve, because some real-time workloads, such as fraud detection, must be processed immediately.

iii) Variety : We now have both structured and unstructured data, such as text, sensor data, and audio and video clips. Analysing both together requires a new approach. And the irony is that 80% of data is unstructured.

iv) Veracity : Establishing trust in the data is also a challenge, because bad input results in bad output. Since we devote so much time to analysing the data, it must be trustworthy.





Big Data Strategy

An organization must fully exploit all of its data sources. When making decisions, executives should consider not only operational data and customer demographics, but also customer feedback, details in contracts and agreements, and other types of unstructured data and content.

Factors for Big Data Strategy

i) Integrate and manage the full variety, velocity and volume of data
ii) Apply advanced analytics to information in its native form
iii) Visualize all available data for ad hoc analysis
iv) Provide a development environment for building new analytic applications
v) Workload optimization and scheduling
vi) Security and governance



People often mistake Big Data for just a technology. It is not only a technology; it is a business strategy for making use of information resources. Success at each entry point is accelerated by products within the Big Data platform, which help build the foundation for future requirements as the organization expands further into the platform.

Big Data Tools
i) Hadoop
ii) Cloudera
iii) MongoDB
iv) Talend

Hadoop - "Hadoop is Big Data and Big Data is Hadoop" is what many people think, but that is not the case. Hadoop is just one flavour of Big Data technology. It is an open-source software framework for storing and processing very large data sets. It offers enormous storage capacity for any kind of data, coupled with an efficient processing system, and it can handle many concurrent tasks.

Cloudera - Cloudera is an enterprise solution in which Hadoop can be deployed. It adds features that give people across an organisation better access to the data, and it is more secure, which matters because sensitive data is being stored.

MongoDB - A document-oriented NoSQL database that helps in storing unstructured and semi-structured data in an organised way.

Talend - An open-source company with a number of data integration products.

Thursday, 3 March 2016

Real Time Analytics of Big Data

Big Data technologies store enormous amounts of structured and unstructured data coming from different sources, such as sensors. In this post I am going to explain real-time analytics of Big Data.

The data that we deal with can be analyzed in two ways.
  1. While the data is in motion, that is, while it is still flowing and has not yet been inserted into a database.
  2. After the data has been inserted into a database.

The world now moves so fast that if we wait for data to be inserted into a database before analyzing it, the result is sometimes useless by the time we get it.

Let me give an example. We have CCTV cameras at every traffic signal, generating huge amounts of data every second. Traditionally, when a crime happens, we analyze the stored data afterwards and try to identify the criminal. This is the bottom-up approach. The better option is to analyze everything at the source in real time. We put a face scanner at every source, and the moment it finds a suspect it alerts the nearest crime control system. In this case we don't have to wait for the data to be inserted into a database, so we can identify suspects and catch them before they can commit a crime.

There are other areas also where we can use real time analytics.

Nowadays there is so much data everywhere that it is practically impossible to store all of it. So we analyze the data before storing it in a database and discard the unwanted data. In this way we store only the important data.

Real-time analytics tools

i) IBM InfoSphere Streams
ii) Apache Spark
iii) Apache Storm

IBM InfoSphere Streams is an IBM product that focuses on real-time analysis of Big Data. The aim is to analyze data in real time and come up with meaningful conclusions. It works on the principle of a graph. Just as a graph is a set of vertices and edges, an InfoSphere Streams application is a set of operators (the vertices) connected by streams (the edges). We write our code in the operators, and tuples flow through the streams. A tuple is simply a row of data. There are different types of operators, each with a specific function.

Operators

Source Type : Any outside data first comes into this operator. This is the entry point of data. It is capable of interacting with external devices, so it is the meeting point between software and hardware. This operator parses external data and creates tuples from it.

Sink Type : The main work of this operator is to load the data into a database.

Filter : It does the tuple filtering. Tuples that do not meet the criteria are dropped.

Punctor : A Punctor operator inserts punctuation into the output stream based on a user-supplied condition.

Aggregate : An Aggregate operator is used for grouping and summarizing incoming tuples.

Join : A Join operator is used for correlating two streams.

Sort : A Sort operator is used for imposing an order on incoming tuples.



So we have a source operator and a sink operator. The source interacts with the outside world and gets the required data from hardware or a file. The sink loads the final data into a database. In between, different operators are linked to each other by edges, known as streams in our case. All the data flows through these streams.
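The operator graph described above can be sketched in plain Python, with generators standing in for streams and dictionaries standing in for tuples. This is only an illustration of the source, filter, aggregate and sink idea. It is not InfoSphere Streams code, and the function names, the camera readings and the speed threshold are all made up for the example.

```python
from collections import defaultdict


def source(raw_records):
    """Source operator: turn raw external records into tuples (dicts here)."""
    for camera_id, speed_kmh in raw_records:
        yield {"camera_id": camera_id, "speed_kmh": speed_kmh}


def filter_op(stream, min_speed_kmh):
    """Filter operator: drop tuples that do not meet the criteria."""
    for tup in stream:
        if tup["speed_kmh"] >= min_speed_kmh:
            yield tup


def aggregate(stream):
    """Aggregate operator: group tuples by camera and count them."""
    counts = defaultdict(int)
    for tup in stream:
        counts[tup["camera_id"]] += 1
    for camera_id in sorted(counts):
        yield {"camera_id": camera_id, "violations": counts[camera_id]}


def sink(stream, table):
    """Sink operator: load the final tuples into a 'database' (a list here)."""
    for tup in stream:
        table.append(tup)


def run_pipeline(raw_records, min_speed_kmh=80):
    """Wire the operators together: source -> filter -> aggregate -> sink."""
    table = []
    sink(aggregate(filter_op(source(raw_records), min_speed_kmh)), table)
    return table


if __name__ == "__main__":
    readings = [("cam1", 90), ("cam2", 60), ("cam1", 85), ("cam2", 100)]
    print(run_pipeline(readings))
```

Note that this sketch aggregates over a finite list. A real streaming system would normally aggregate over windows, because the stream never ends.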

Some cases where real-time analytics of data is useful
i) Crime detection and prevention
ii) Stock market - In the stock market, trading happens so fast that a fraction of a second can change everything. If we analyse the patterns in real time, we can draw meaningful conclusions.
iii) Telecommunication - Nowadays the world is so densely connected that managing call detail records (CDRs) has become a headache for telecom companies. One can imagine the vast quantity of data present in CDRs, and not all of it is relevant. To store it efficiently, InfoSphere Streams can be used to parse all the details and remove the irrelevant ones.
iv) Health monitoring - The system can also be used for health monitoring. Data from devices can be monitored and studied in order to find out whether the patient is suffering from some disease.
v) Transportation - Real-time data about the movement of buses or other vehicles can be made available, and customers can benefit from it.

InfoSphere Streams and IoT (Internet of Things)

IoT is one of the technologies of the future, and many companies are investing heavily in this field. Streaming technology can be used to implement IoT.

A successful IoT implementation requires two things: the system must be capable of handling large amounts of data, and it must be capable of communicating with hardware. InfoSphere Streams qualifies on both counts, so it is one of the technologies with which IoT can be implemented.

Let me give an example of IoT:

With the onset of IoT everything will become smart, so we will have a smart chair. I will be able to find out from anywhere in the world whether someone is sitting in my chair. For this we give the chair a unique ID and connect it to a network. We use a sensor, such as a pressure sensor, to determine whether someone is sitting in the chair. The pressure sensor generates data continuously at fixed intervals. Our source operator communicates with the sensor and generates the required tuples, which are then parsed to find out whether someone is occupying the chair. So from anywhere in the world we can tell whether someone has occupied my chair.
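Here is a rough Python sketch of the smart chair flow: a source turns sensor readings into tuples, a parser decides whether the chair is occupied, and a last step reports only the changes. The chair ID, the 20 kg threshold and the sample readings are assumptions made up for illustration. A real deployment would read from actual sensor hardware.

```python
def chair_source(chair_id, raw_readings):
    """Source: wrap each (timestamp, pressure) sensor reading in a tuple."""
    for timestamp, pressure_kg in raw_readings:
        yield {"chair_id": chair_id, "timestamp": timestamp, "pressure_kg": pressure_kg}


def parse_occupancy(stream, threshold_kg=20.0):
    """Parser: mark a tuple as occupied when pressure reaches the threshold."""
    for tup in stream:
        yield {**tup, "occupied": tup["pressure_kg"] >= threshold_kg}


def occupancy_changes(stream):
    """Emit (timestamp, occupied) only when the occupancy state changes."""
    last_state = None
    for tup in stream:
        if tup["occupied"] != last_state:
            last_state = tup["occupied"]
            yield (tup["timestamp"], tup["occupied"])


if __name__ == "__main__":
    # Hypothetical readings taken every 5 seconds: (seconds, kilograms).
    readings = [(0, 0.0), (5, 62.5), (10, 63.0), (15, 0.5)]
    stream = parse_occupancy(chair_source("chair-001", readings))
    for timestamp, occupied in occupancy_changes(stream):
        print(timestamp, "occupied" if occupied else "empty")
```

Reporting only the state changes is one way to keep the amount of stored data small, in line with the idea of discarding unwanted data before it reaches the database.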

Thursday, 2 July 2015

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is the file system that is used to store data in Hadoop. The way it stores data is special. When a file is saved in HDFS, it is first broken down into blocks, with any remainder data occupying the final block. The size of the block depends on how HDFS is configured. In Hadoop 1.x the default block size is 64 megabytes (MB); Hadoop 2.x and later default to 128 MB per block, which improves performance for larger files.

Each block is then sent to a different data node and written to the hard disk drive (HDD). When the first data node writes the block to disk, it sends the data on to a second data node, where the block is written again. When this completes, the second data node sends the data to a third data node. The third node confirms completion of the write back to the second, which confirms back to the first. The NameNode is then notified and the block write is complete. After all blocks are written successfully, the result is a file that is broken down into blocks, with a copy of each block on three data nodes. The location of all of this data is stored in memory by the NameNode.
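The block arithmetic above can be sketched in a few lines of Python. This is only an illustration and not HDFS code. The round-robin replica placement is a simplifying assumption (real HDFS placement takes racks into account), and the node names are hypothetical.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size of each block; the last block holds the remainder."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks


def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes.

    Simplified round-robin placement, assumed for illustration only.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + offset) % len(data_nodes)]
            for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB)
    print([size // MB for size in blocks])
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

With a 128 MB block size, a 300 MB file becomes two full blocks plus one 44 MB block, and each block ends up on three data nodes.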

http://techniquetechnology.blogspot.in/2015/07/hadoop-distributed-file-system.html

Scalability

Hadoop is designed to run on many commodity servers. The Hadoop software architecture also lends itself to be scalable within each server. HDFS can deal with individual files that are terabytes in size and Hadoop clusters can be petabytes in size if required. Individual nodes can be added to Hadoop at any time. The only cost to the system is the input/output (I/O) of redistributing the data across all of the available nodes, which ultimately might speed up access. The upper limit of how large you can make your cluster is likely to depend on the hardware that you have assembled. For example, the NameNode stores metadata in random access memory (RAM) that is roughly equivalent to a GB for every TB of data in the cluster.

Tuesday, 30 June 2015

MapReduce Technique : Hadoop Big Data

As a batch processing architecture, the major value of Hadoop is that it enables ad hoc queries to run against an entire data set and return results within a reasonable time frame. Distributed computing across a multi-node cluster is what allows this level of data processing to take place.
MapReduce applications can process vast amounts (multiple terabytes) of data in parallel on large clusters in a reliable, fault-tolerant manner. MapReduce is a computational paradigm in which an application is divided into self-contained units of work. Each of these units of work can be issued on any node in the cluster.

http://bigdataconcept.blogspot.in/2015/06/mapreduce-hadoop-big-data.html


A MapReduce job splits the input data set into independent chunks that are processed by map tasks in parallel. The framework sorts the map outputs, which are then input to reduce tasks. Job inputs and outputs are stored in the file system. The MapReduce framework and the HDFS (Hadoop Distributed File System) are typically on the same set of nodes, which enables the framework to schedule tasks on nodes that contain data.
The MapReduce framework consists of a single primary JobTracker and one secondary TaskTracker per node. The primary node schedules job component tasks, monitors tasks, and re-executes failed tasks. The secondary node runs tasks as directed by the primary node.

MapReduce is composed of the following phases:

i)Map
ii)Reduce

The map phase

The map phase is the first part of the data processing sequence within the MapReduce framework. Map functions serve as worker nodes that can process several smaller snippets of the entire data set. The MapReduce framework is responsible for dividing the data set input into smaller chunks, and feeding them to a corresponding map function. When you write a map function, there is no need to incorporate logic to enable the function to create multiple maps that can use the distributed computing architecture of Hadoop. These functions are oblivious to both data volume and the cluster in which they are operating. As such, they can be used unchanged for both small and large data sets (which is most common for those using Hadoop).

Important: Hadoop is a great engine for batch processing. However, if the data volume is small, the overhead that is incurred by using the MapReduce framework might negate the benefits of using this approach.

Based on the data set that one is working with, a programmer must construct a map function to use a series of key/value pairs. After processing the chunk of data that is assigned to it, each map function also generates zero or more output key/value pairs to be passed forward to the next phase of the data processing sequence in Hadoop. The input and output types of the map can be (and often are) different from each other.
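As an illustration of a map function, here is the classic word count mapper sketched in plain Python. It is not written against the Hadoop API. The function name and the choice of (line offset, line text) as the input pair are assumptions for the example.

```python
def map_word_count(key, value):
    """Map function for word count.

    Input pair:  key = line offset (int, ignored), value = line text (str).
    Output pairs: (word, 1) for every word, so zero or more pairs per input.
    """
    for word in value.lower().split():
        yield (word, 1)


if __name__ == "__main__":
    print(list(map_word_count(0, "Big data big")))
    print(list(map_word_count(13, "")))  # an empty line yields no pairs
```

Notice that the function knows nothing about the size of the data set or the cluster. It sees one input pair at a time, and its input types (int, str) differ from its output types (str, int).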

The reduce phase

As with the map function, developers also must create a reduce function. The key/value pairs from map outputs must correspond to the appropriate reducer partition such that the final results are aggregates of the appropriately corresponding data. This process of moving map outputs to the reducers is known as shuffling. When the shuffle process is completed and the reducer has copied all of the map task outputs, the reducers can go into what is known as a merge process. During this part of the reduce phase, all map outputs are merged together in a way that maintains the sort ordering established during the map phase. When the final merge is complete (this process is done in rounds for performance optimization purposes), the final reduce task consolidates the results for every key within the merged output, and the final result set is written to disk on HDFS.
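The shuffle, merge and reduce steps can be sketched in plain Python as follows. This is a single-process illustration and not Hadoop code. The CRC32-based partitioner and the function names are assumptions, and the map outputs are hard-coded word count pairs.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a reducer for a key (a deterministic hash, assumed for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Move every (key, value) pair to the partition of its reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[partition_for(key, num_reducers)][key].append(value)
    return partitions


def reduce_sum(key, values):
    """Reduce function: consolidate all values for one key."""
    return (key, sum(values))


def run_reduce_phase(map_outputs, num_reducers=2):
    """Shuffle, then process each partition's keys in sorted order."""
    results = {}
    for partition in shuffle(map_outputs, num_reducers):
        for key in sorted(partition):  # stands in for the sorted merge
            word, total = reduce_sum(key, partition[key])
            results[word] = total
    return results


if __name__ == "__main__":
    # Hypothetical outputs of two map tasks.
    map_outputs = [
        [("big", 1), ("data", 1)],
        [("big", 1), ("hadoop", 1)],
    ]
    print(run_reduce_phase(map_outputs))
```

The important property is that all values for a given key land in the same partition, so a single reducer sees every count for that word.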

Development languages: Java is a common language that is used to develop these functions. However, there is support for a host of other development languages and frameworks, which include Ruby, Python, and C++.

Sunday, 28 June 2015

Operational Vs Analytical : Big Data Technology

Two kinds of technology are used in Big Data: operational and analytical. Operational capabilities include capturing and storing data in real time, whereas analytical capabilities include complex analysis of all the data. The two are complementary and are therefore deployed together.




Operational and analytical Big Data technologies have different requirements, and different architectures have evolved to address them. Operational systems include NoSQL databases, which focus on responding to many concurrent requests. Analytical systems focus on complex queries that touch almost all of the data. Both kinds of system work in tandem and manage hundreds of terabytes of data spanning billions of records.

Operational Big Data

For operational Big Data, NoSQL databases are generally used. They were developed to address the shortcomings of traditional databases: they are faster and can deal with large quantities of data spread over multiple servers. Cloud computing architectures are also used, because they allow massive computations to run effectively and cost-efficiently. This has made Big Data workloads easier to manage, faster to implement and cheaper to run.
In addition to serving user interactions, operational systems can provide intelligence about the active data. For example, in games a user's moves are studied and the next course of action is suggested. NoSQL systems can analyse real-time data and draw conclusions from it.

Analytical Big Data

Analytical Big Data is addressed by massively parallel processing (MPP) database systems and MapReduce. These technologies evolved in response to the shortcomings of traditional databases, which run on a single server only. MapReduce, in turn, provides a new method of analyzing data that goes beyond the scope of SQL.

As the volume of data generated by users increases, the analytical workload increases too. MapReduce has emerged as a first choice for Big Data analytics because it suits analyses that go beyond simple queries. NoSQL systems also provide limited MapReduce capabilities, but generally we prefer copying data from the NoSQL system to an analytical system such as Hadoop for MapReduce.

Big Data Introduction

What is Big Data?

Big Data is the large collection of data that is available to every organisation. The amount of data is so huge that managing it has become a challenge. Worse, the data is growing exponentially. For example:

i) 200 of London's traffic cameras collect 8 TB of data per day.
ii) One day of instant messaging in 2002 consumed 750 GB of data.
iii) Annual email traffic, excluding spam, consumes 300+ PB of data.
iv) In 2004, Walmart's transaction database contained 200 TB of data.
v) The total digital data created in 2012 is estimated at 270,000 PB.

http://techniquetechnology.blogspot.in/2015/06/big-data-introduction.html

According to one report, this data will grow at a rate of 40% annually. Big Data techniques are getting a lot of attention from organisations nowadays, both for handling this data and for using it to drive business growth.
Big Data is a technology that deals with data that is diverse, huge and requires special skills to handle. In other words, conventional technology cannot handle it effectively. The data is too large in volume, complex to handle, variable (that is, not all of the same type), and uncertain in quality (veracity).
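To get a feel for what 40% annual growth means, here is a small compound growth calculation in Python. It simply applies the 40% rate and the 270,000 PB figure quoted above. The function names are made up, and the projection assumes the rate stays constant, which is a simplification.

```python
import math


def projected_volume(initial_pb, annual_growth=0.40, years=1):
    """Compound growth: volume after `years` at a constant annual rate."""
    return initial_pb * (1 + annual_growth) ** years


def years_to_double(annual_growth=0.40):
    """Number of years for the volume to double at a constant annual rate."""
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    for years in (1, 3, 5):
        print(years, round(projected_volume(270_000, 0.40, years)), "PB")
    print("doubling time:", round(years_to_double(0.40), 2), "years")
```

At this rate the volume doubles roughly every two years, which is why storage planned for today's data fills up so quickly.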

But today we have technologies that can be used to draw conclusions from Big Data. For example, a retailer can track its customers and identify their behaviour to learn their preferences and the prices they are looking for, and stock its products accordingly. One can use social media signals to detect things like the outbreak of a disease or unrest in some part of the country.
So, basically, Big Data refers to a set of data so voluminous that it is impossible to manage with traditional tools.
Working with Big Data consists of creating useful data from raw data, storing it, retrieving it when necessary, and then drawing conclusions by analysing it.


Some of the terms used in Big Data are:

Volume : We might have 500 GB of storage in our personal system, but Facebook consumes 500 TB of new data every day. Heavy use of smartphones with new technologies such as sensors creates additional data, including location, videos and other information.

Velocity : The data is created very fast. Online games are played by millions of users simultaneously, stock trading algorithms generate huge quantities of data every second, sensors generate data in real time, and ad impressions capture user behaviour at millions of events per second. So data is created at a rapid pace, and we need effective technologies to deal with it.

Variety : The data comes in different types. Some of it may be audio, video or text, which may be unstructured. It is not only numbers, dates and strings.

Traditional databases used to deal with smaller volumes of data that were predictable and consistent. But with the advent of new technologies and techniques, the amount of data generated is so large that we have to deal with it separately. For that we need data analysts and data scientists.

Saturday, 27 June 2015

Big Data Analytics a high paying career

The next phase of demand for IT professionals will come from Big Data. These are high-paying jobs and the demand is huge, so this is great news for IT professionals.

Making money from Big Data is a challenge. This is where data analysts come in handy. Data analysts can come from different backgrounds, including data science, data mining, web analytics or even statistics. IT professionals have to work in tandem with data analysts in order to get something meaningful out of the huge quantity of data. One of the major complaints of data analysts is that they don't get enough support from their IT team, which is a major hindrance in their work. The other major problem data analysts face is the quality of the data given to them. It is poorly documented, and they have to spend a huge amount of time reformatting it. IT professionals must understand the needs of data analysts and prepare data accordingly, so that analysts can spend their time analysing the data instead of reformatting it.

http://techniquetechnology.blogspot.in/2015/06/big-data-analytics-high-paying-career.html


What Data Analysts need
Data analysts need all data of a common type grouped together so that they can draw conclusions. For example, transaction data from a retail shop must cover the whole day so that they can work out what customers are buying, how much they are willing to pay and what the shop must stock to attract more customers.



Raw data must be complete, with full details.
All data is important. Past data always comes in handy when drawing a conclusion. So no data should be deleted, and as much data as possible should be put to use. The more data, the better the conclusion.

Data should be fed to analysts on time.
Reaching a conclusion is a time-consuming activity. So data analysts must be supplied with raw data well in advance, not at the last moment.

Analysts need a lot of space to store the data.
Analysts always want to save data that may come in handy in the future. They don't want to destroy valuable data due to lack of space.

Current data that is complete, correct and consistent should be supplied to analysts.
Data analysts should be able to spend all their time analysing the data, so the raw data must be properly documented, with complete detail. Data must be given in the form of a table with rows and columns, each row representing one particular transaction.

Proper tools must be provided to analysts so that they can draw conclusions.
Analysts need various tools in order to work properly. They also need to look after their data assets, and most of the time they have no idea how to do so. Here Big Data professionals can come in handy and help them.

For the betterment of the organisation, IT professionals must work in tandem with data analysts and provide them with full support.