Thursday, February 28, 2013

Trying out Hadoop on Microsoft Windows Azure


Windows Azure HDInsight Preview

To gain the full value of Big Data, you need a modern data platform that manages data of any type, whether structured or unstructured, and of any size. This preview of Windows Azure HDInsight enables you to get up and running with your own Apache Hadoop™ cluster in the cloud in just minutes.

Microsoft’s end-to-end roadmap for Big Data embraces Apache Hadoop™ by distributing enterprise-class, Hadoop-based solutions on both Windows Server and Windows Azure. To learn more about Microsoft’s roadmap for Big Data, see http://www.microsoft.com/bigdata/

Saturday, February 23, 2013

Testing Hadoop Jobs


Advice on QA Testing Your MapReduce Jobs

Traditional Unit Tests – JUnit, PyUnit, Etc.

MRUnit – Unit Testing for MR Jobs (see the example after this list)

Local Job Runner Testing – Running MR Jobs on a Single Machine in a Single JVM

Pseudo-distributed Testing – Running MR Jobs on a Single Machine Using Daemons

Full Integration Testing – Running MR Jobs on a QA Cluster
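
As a concrete illustration of the MRUnit stage, here is a minimal sketch of a mapper test. The mapper itself, the class names, and the whitespace tokenization are illustrative assumptions rather than code from any particular job:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    // A tiny illustrative mapper: emits (word, 1) for every
    // whitespace-separated token in the input line.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    @Test
    public void emitsOneCountPerToken() throws IOException {
        // MapDriver feeds one record to the mapper in memory and
        // verifies the emitted (key, value) pairs, in order.
        MapDriver.newMapDriver(new WordCountMapper())
                 .withInput(new LongWritable(0), new Text("big data big"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .runTest();
    }
}

Because MRUnit drives the mapper entirely in memory, no daemons, HDFS, or cluster configuration are involved, which is what separates this stage from the local job runner and pseudo-distributed stages above.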


Thursday, February 21, 2013

Massive Dataset & Data Mining

The focus here is data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of the examples are about the Web or data derived from the Web. In this view, data mining means applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort.

Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.

Knowledge discovery - Principal Techniques

·         Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
·         Similarity search, including the key techniques of minhashing and locality-sensitive hashing (see the sketch after this list).
·         Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
·         The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
·         Frequent-itemset mining, including association rules, market baskets, the A-Priori Algorithm and its improvements.
·         Algorithms for clustering very large, high-dimensional datasets.
·         Two key problems for Web applications: managing advertising and recommendation systems, the latter often built on a technique known as collaborative filtering.
·         Algorithms for analyzing and mining the structure of very large graphs, especially social-network graphs.
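
As a small taste of the similarity-search item above, here is a minimal minhashing sketch in Java. The hash family h(x) = (a*x + b) mod p, the signature length of 100, and the representation of sets as arrays of non-negative integer item ids are all illustrative assumptions:

import java.util.Arrays;
import java.util.Random;

public class MinHashSketch {

    private static final int K = 100;          // number of hash functions
    private static final long P = 2147483647L; // a large prime (2^31 - 1)

    private final long[] a = new long[K];
    private final long[] b = new long[K];

    public MinHashSketch(long seed) {
        Random rnd = new Random(seed);
        for (int i = 0; i < K; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // a != 0
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // The signature keeps, for each hash function, the minimum hash
    // value over all items in the set.
    public long[] signature(int[] set) {
        long[] sig = new long[K];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int x : set) {
            for (int i = 0; i < K; i++) {
                long h = (a[i] * x + b[i]) % P;
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // The fraction of positions where two signatures agree
    // approximates the Jaccard similarity of the two sets.
    public static double estimateSimilarity(long[] s1, long[] s2) {
        int matches = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) matches++;
        }
        return (double) matches / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(42L);
        long[] s1 = mh.signature(new int[] {1, 2, 3, 4});
        long[] s2 = mh.signature(new int[] {2, 3, 4, 5});
        // True Jaccard similarity is 3/5 = 0.6; the estimate should be close.
        System.out.println(estimateSimilarity(s1, s2));
    }
}

Locality-sensitive hashing then bands these signatures so that only pairs likely to be similar are compared at all.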

Tuesday, February 5, 2013

Using MapReduce to Process Large Data Sets

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster. Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. The MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the typical functional programming map and reduce combination.
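
To make the (key, value) flow concrete, here is a minimal word-count sketch written against the Hadoop Java MapReduce API. It is the canonical introductory example, not code from this post; the class names and the two command-line paths are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: the framework groups values by key, so each call
    // receives one word and all of its 1s; sum them into a total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The shuffle-and-sort phase between the two steps is what delivers every (word, 1) pair for the same word to the same reduce call.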

Big Data Course Summary



  • Introduction to Big Data and Hadoop
  • Hadoop ecosystem - Concepts
  • Hadoop Map-reduce concepts and features
  • Developing map-reduce applications
  • Pig concepts
  • Hive concepts
  • Oozie workflow concepts
  • HBASE Concepts
  • Real Life Use Cases

Introduction to Big Data and Hadoop
  • What is Big Data?
  • What are the challenges for processing big data?
  • What technologies support big data?
  • What is Hadoop?
  • Why Hadoop?
  • History of Hadoop
  • Use Cases of Hadoop
  • Hadoop ecosystem
  • HDFS
  • Map Reduce
  • Statistics


Understanding the Cluster
  • Typical workflow
  • Writing files to HDFS
  • Reading files from HDFS
  • Rack Awareness
  • The 5 daemons

Map Reduce
  • Before Map Reduce
  • Map Reduce Overview
  • Word Count Problem
  • Word Count Flow and Solution
  • Map Reduce Flow
  • Algorithms for simple problems
  • Algorithms for complex problems


Developing the Map Reduce Application
  • Data Types
  • File Formats
  • Explaining the Driver, Mapper and Reducer code
  • Configuring the development environment - Eclipse
  • Writing Unit Tests
  • Running locally
  • Running on a Cluster
  • Hands-on Exercises

How Map-Reduce Works
  • Anatomy of a Map Reduce Job run
  • Job Submission
  • Job Initialization
  • Task Assignment
  • Job Completion
  • Job Scheduling
  • Job Failures
  • Shuffle and Sort
  • Oozie Workflows
  • Hands-on Exercises

Map Reduce Types and Formats
  • MapReduce Types
  • Input Formats - input splits & records, text input, binary input, multiple inputs & database input
  • Output Formats - text output, binary output, multiple outputs, lazy output and database output
  • Hands-on Exercises

Map Reduce Features
  • Counters
  • Sorting
  • Joins - Map Side and Reduce Side
  • Side Data Distribution
  • MapReduce Combiner
  • MapReduce Partitioner
  • MapReduce Distributed Cache
  • Hands-on Exercises

Hive and PIG
  • Fundamentals
  • When to Use PIG and HIVE
  • Concepts
  • Hands-on Exercises

HBASE
  • CAP Theorem
  • HBase Architecture and concepts
  • Programming and Hands-on Exercises

BIG DATA ANALYTICS

Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue.

The primary goal of big data analytics is to help companies make better business decisions by enabling data scientists and other users to analyze huge volumes of transaction data as well as other data sources that may be left untapped by conventional business intelligence (BI) programs. These other data sources may include Web server logs and Internet clickstream data, social media activity reports, mobile-phone call detail records and information captured by sensors.

Big data analytics can be done with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics and data mining. But the unstructured data sources used for big data analytics may not fit in traditional data warehouses. Furthermore, traditional data warehouses may not be able to handle the processing demands posed by big data. As a result, a new class of big data technology has emerged and is being used in many big data analytics environments. The technologies associated with big data analytics include NoSQL databases, Hadoop and MapReduce.