Thursday, February 28, 2013

Trying out Hadoop on Microsoft Windows Azure


Windows Azure HDInsight Preview

To gain the full value of Big Data, you need a modern data platform that manages data of any type, whether structured or unstructured, and of any size. This preview of Windows Azure HDInsight enables you to get up and running with your own Apache Hadoop™ cluster in the cloud in just minutes.

Microsoft’s end-to-end roadmap for Big Data embraces Apache Hadoop™ by distributing enterprise-class, Hadoop-based solutions on both Windows Server and Windows Azure. To learn more about Microsoft’s roadmap for Big Data, see http://www.microsoft.com/bigdata/

Saturday, February 23, 2013

Testing Hadoop Jobs


Advice on QA Testing Your MapReduce Jobs

Traditional Unit Tests – JUnit, PyUnit, Etc.

MRUnit – Unit Testing for MR Jobs (see the example after this list)

Local Job Runner Testing – Running MR Jobs on a Single Machine in a Single JVM

Pseudo-distributed Testing – Running MR Jobs on a Single Machine Using Daemons

Full Integration Testing – Running MR Jobs on a QA Cluster
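
As a concrete illustration of the MRUnit stage, here is a minimal sketch of a mapper test. The mapper itself, the class names, and the whitespace tokenization are illustrative assumptions rather than code from any particular job:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    // A tiny illustrative mapper: emits (word, 1) for every
    // whitespace-separated token in the input line.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    @Test
    public void emitsOneCountPerToken() throws IOException {
        // MapDriver feeds one record to the mapper in memory and
        // verifies the emitted (key, value) pairs, in order.
        MapDriver.newMapDriver(new WordCountMapper())
                 .withInput(new LongWritable(0), new Text("big data big"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .runTest();
    }
}

Because MRUnit drives the mapper entirely in memory, no daemons, HDFS, or cluster configuration are involved, which is what separates this stage from the local job runner and pseudo-distributed stages above.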


Thursday, February 21, 2013

Massive Dataset & Data Mining

The focus here is data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of the examples are about the Web or data derived from the Web. In this view, data mining means applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort.

Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.

Knowledge discovery - Principal Techniques

·         Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
·         Similarity search, including the key techniques of minhashing and locality-sensitive hashing (see the sketch after this list).
·         Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
·         The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
·         Frequent-itemset mining, including association rules, market baskets, the A-Priori Algorithm and its improvements.
·         Algorithms for clustering very large, high-dimensional datasets.
·         Two key problems for Web applications: managing advertising and recommendation systems, the latter often built on a technique known as collaborative filtering.
·         Algorithms for analyzing and mining the structure of very large graphs, especially social-network graphs.
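
As a small taste of the similarity-search item above, here is a minimal minhashing sketch in Java. The hash family h(x) = (a*x + b) mod p, the signature length of 100, and the representation of sets as arrays of non-negative integer item ids are all illustrative assumptions:

import java.util.Arrays;
import java.util.Random;

public class MinHashSketch {

    private static final int K = 100;          // number of hash functions
    private static final long P = 2147483647L; // a large prime (2^31 - 1)

    private final long[] a = new long[K];
    private final long[] b = new long[K];

    public MinHashSketch(long seed) {
        Random rnd = new Random(seed);
        for (int i = 0; i < K; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // a != 0
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // The signature keeps, for each hash function, the minimum hash
    // value over all items in the set.
    public long[] signature(int[] set) {
        long[] sig = new long[K];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int x : set) {
            for (int i = 0; i < K; i++) {
                long h = (a[i] * x + b[i]) % P;
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // The fraction of positions where two signatures agree
    // approximates the Jaccard similarity of the two sets.
    public static double estimateSimilarity(long[] s1, long[] s2) {
        int matches = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) matches++;
        }
        return (double) matches / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(42L);
        long[] s1 = mh.signature(new int[] {1, 2, 3, 4});
        long[] s2 = mh.signature(new int[] {2, 3, 4, 5});
        // True Jaccard similarity is 3/5 = 0.6; the estimate should be close.
        System.out.println(estimateSimilarity(s1, s2));
    }
}

Locality-sensitive hashing then bands these signatures so that only pairs likely to be similar are compared at all.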

Tuesday, February 5, 2013

Using MapReduce to Process Large Data Sets

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster. Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. The MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the typical functional programming map and reduce combination.
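
To make the (key, value) flow concrete, here is a minimal word-count sketch written against the Hadoop Java MapReduce API. It is the canonical introductory example, not code from this post; the class names and the two command-line paths are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: the framework groups values by key, so each call
    // receives one word and all of its 1s; sum them into a total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The shuffle-and-sort phase between the two steps is what delivers every (word, 1) pair for the same word to the same reduce call.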

Big Data Course Summary



  • Introduction to Big Data and Hadoop
  • Hadoop ecosystem - Concepts
  • Hadoop Map-reduce concepts and features
  • Developing map-reduce applications
  • Pig concepts
  • Hive concepts
  • Oozie workflow concepts
  • HBASE Concepts
  • Real Life Use Cases

Introduction to Big Data and Hadoop
  • What is Big Data?
  • What are the challenges for processing big data?
  • What technologies support big data?
  • What is Hadoop?
  • Why Hadoop?
  • History of Hadoop
  • Use Cases of Hadoop
  • Hadoop ecosystem
  • HDFS
  • Map Reduce
  • Statistics


Understanding the Cluster
  • Typical workflow
  • Writing files to HDFS
  • Reading files from HDFS
  • Rack Awareness
  • The 5 daemons

Map Reduce
  • Before Map Reduce
  • Map Reduce Overview
  • Word Count Problem
  • Word Count Flow and Solution
  • Map Reduce Flow
  • Algorithms for simple problems
  • Algorithms for complex problems


Developing the Map Reduce Application
  • Data Types
  • File Formats
  • Explaining the Driver, Mapper and Reducer code
  • Configuring the development environment - Eclipse
  • Writing Unit Tests
  • Running locally
  • Running on a Cluster
  • Hands-on Exercises

How Map-Reduce Works
  • Anatomy of a Map Reduce Job run
  • Job Submission
  • Job Initialization
  • Task Assignment
  • Job Completion
  • Job Scheduling
  • Job Failures
  • Shuffle and Sort
  • Oozie Workflows
  • Hands-on Exercises

Map Reduce Types and Formats
  • MapReduce Types
  • Input Formats - input splits & records, text input, binary input, multiple inputs & database input
  • Output Formats - text output, binary output, multiple outputs, lazy output and database output
  • Hands-on Exercises

Map Reduce Features
  • Counters
  • Sorting
  • Joins - Map Side and Reduce Side
  • Side Data Distribution
  • MapReduce Combiner
  • MapReduce Partitioner
  • MapReduce Distributed Cache
  • Hands-on Exercises

Hive and PIG
  • Fundamentals
  • When to Use PIG and HIVE
  • Concepts
  • Hands-on Exercises

HBASE
  • CAP Theorem
  • HBase Architecture and concepts
  • Programming and Hands-on Exercises

BIG DATA ANALYTICS

Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue.

The primary goal of big data analytics is to help companies make better business decisions by enabling data scientists and other users to analyze huge volumes of transaction data as well as other data sources that may be left untapped by conventional business intelligence (BI) programs. These other data sources may include Web server logs and Internet clickstream data, social media activity reports, mobile-phone call detail records and information captured by sensors.

Big data analytics can be done with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics and data mining. But the unstructured data sources used for big data analytics may not fit in traditional data warehouses. Furthermore, traditional data warehouses may not be able to handle the processing demands posed by big data. As a result, a new class of big data technology has emerged and is being used in many big data analytics environments. The technologies associated with big data analytics include NoSQL databases, Hadoop and MapReduce.