Thursday, November 14, 2013

Thrive School - learn to thrive: Hive Getting Started

Thrive School - learn to thrive: Hive Getting Started: In my previous post, we saw how we can execute MapReduce jobs using Java. Java is the most flexible and powerful method for doing all MapReduce...

Tuesday, November 12, 2013

Thrive School - learn to thrive: Executing Java MapReduce Program

Thrive School - learn to thrive: Executing Java MapReduce Program: In my previous post, we explored the MapReduce concept using a bash shell script. In this post we will compile and execute a MapReduce program...

Thrive School - learn to thrive: Getting a portable Hadoop environment

Thrive School - learn to thrive: Getting a portable Hadoop environment: Before we start learning the individual components of the Hadoop ecosystem, it is good to set up your own portable Hadoop environment. There are various o...

Wednesday, November 6, 2013

DATA SCIENCE RESOURCES: Data Science

Resources: Data Science


Data Science is an inherently multidisciplinary field that requires a myriad of skills to be a proficient practitioner. The necessary curriculum has not fit into traditional course offerings, but as awareness of the need for individuals who have such abilities grows, universities and private companies are creating custom classes.

Resources: Others

  • Data Beta: Professor Joe Hellerstein’s blog about education, computing, and data.
  • Dataists: Hilary Mason and Vince Buffalo’s old blog, which has a wealth of information and resources about the field and practice of data science.
  • Five Thirty Eight: Nate Silver’s famous NYT blog where he discusses predictive modeling and political forecasts.
  • grep alex: Alex Holmes’s blog about distributed computing and the intricacies of Hadoop.
  • Data Science 101: One man’s personal journey to becoming a data scientist (with plenty of resources).
  • no free hunch: Kaggle’s blog about the practice of data science and its competition highlights.

DATA SCIENCE RESOURCES: Large Scale Computations

Large Scale Computations

When you start operating with data at the scale of the web, the fundamental approach and process of analysis must change. To combat the ever-increasing amount of data, Google developed the MapReduce paradigm. This programming model has become the de facto standard for large-scale batch processing since the release of Apache Hadoop, the open-source MapReduce framework, in 2007.
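To make the MapReduce paradigm mentioned above a little more concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are hypothetical and chosen only for illustration. A real framework would run the map and reduce steps in parallel across many machines and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    In a real framework this grouping happens between map and reduce;
    here it is simulated with an in-memory dictionary.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    docs = ["big data big ideas", "data at web scale"]
    print(word_count(docs))
```

The point of the split is that each map call looks at only one record and each reduce call looks at only one key, which is what allows a framework to distribute the work across a cluster.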

DATA SCIENCE RESOURCES: Visualization

Visualization

  • Courses
  • Tools
    • D3.js: Data-Driven Documents. Declarative manipulation of DOM elements with data-dependent functions (with a Python port).
    • Vega: A visualization grammar built on top of D3 for declarative visualizations in JSON. Released by the dream team at Trifacta, it provides a higher-level abstraction than D3 for creating canvas- or SVG-based graphics.
    • Rickshaw: A charting library built on top of D3 with a focus on interactive time series graphs.
    • modest maps: A lightweight library with a simple interface for working with maps in the browser (with ports to multiple languages).
    • Chart.js: Very simple (only six chart types) HTML5 canvas-based plotting library with beautiful styling and animation.

DATA SCIENCE RESOURCES: Statistics

Statistics

DATA SCIENCE RESOURCES: Machine Learning and Algorithms

Machine Learning and Algorithms