Big Insights with Big Data: November 2013

Thursday, November 14, 2013

Thrive School - learn to thrive: Hive Getting Started

Thrive School - learn to thrive: Hive Getting Started: In my previous post, we saw how we can execute MapReduce jobs using Java. Java is the most flexible and powerful method for doing all MapReduce...

Tuesday, November 12, 2013

Thrive School - learn to thrive: Executing Java MapReduce Program

Thrive School - learn to thrive: Executing Java MapReduce Program: In my previous post, we explored MapReduce concept using a bash shell script. In this post we will compile and execute a MapReduce program...

Thrive School - learn to thrive: Getting a portable Hadoop environment

Thrive School - learn to thrive: Getting a portable Hadoop environment: Before we start learning the individual components of the Hadoop ecosystem, it is good to get your own portable Hadoop environment. There are various o...

Wednesday, November 6, 2013

DATA SCIENCE RESOURCES : Data Science

Resources : Data Science

Data Science is an inherently multidisciplinary field that requires a myriad of skills to be a proficient practitioner. The necessary curriculum has not fit into traditional course offerings, but as awareness of the need for individuals with such abilities grows, universities and private companies are creating custom classes.

Books

An Introduction to Data Science: The companion textbook to Syracuse University’s flagship course for their new Data Science program.

Courses

UC Berkeley: Introduction to Data Science: A course taught by Jeff Hammerbacher and Mike Franklin that highlights each of the varied skills that a Data Scientist must be proficient with.
How to Process, Analyze and Visualize Data: A lab oriented course that teaches you the entire pipeline of data science, from acquiring datasets and analyzing them at scale to effectively visualizing the results.
Coursera: Introduction to Data Science: A tour of the basic techniques for Data Science including SQL and NoSQL databases, MapReduce on Hadoop, ML algorithms, and data visualization.
Columbia: Introduction to Data Science: A very comprehensive course that covers all aspects of data science, with a humanistic treatment of the field.
Columbia: Applied Data Science (with book): Another Columbia course — teaches applied software development fundamentals using real data, targeted towards people with mathematical backgrounds.
Coursera: Data Analysis (with notes and lectures): An applied statistics course that covers algorithms and techniques for analyzing data and interpreting the results to communicate your findings.
Kaggle: Getting Started with Python for Data Science: A guided tour of setting up a development environment, an introduction to making your first competition submission, and validating your results.
http://ischool.syr.edu/future/cas/applieddatasciencemooc.aspx

Resources : Others

Data Beta: Professor Joe Hellerstein’s blog about education, computing, and data.
Dataists: Hilary Mason and Vince Buffalo’s old blog that has a wealth of information and resources about the field and practice of data science.
Five Thirty Eight: Nate Silver’s famous NYT blog where he discusses predictive modeling and political forecasts.
grep alex: Alex Holmes’s blog about distributed computing and the intricacies of Hadoop.
Data Science 101: One man’s personal journey to becoming a data scientist (with plenty of resources).
no free hunch: Kaggle’s blog about the practice of data science and its competition highlights.

Berkeley: Introduction to Data Science: One of the most comprehensive lists of resources about all things data science.
Cloudera: New to Data Science: Resources about data science from Cloudera’s introduction to data science course/certification.
Kaggle: Tutorials: A set of tutorials, books, courses, and competitions for statistics, data analysis, and machine learning.

http://dataiap.github.io/dataiap

http://cs229.stanford.edu/materials.html

http://www-stat.stanford.edu/~naras/stat290/Stat290_Website/Stat_290.html

http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1

http://www.ischool.berkeley.edu/courses/i290-abdt

http://hackershelf.com/topic/machine-learning/

http://www.e-booksdirectory.com/listing.php?category=284

http://www.intechopen.com/books/machine-learning

http://pages.cs.wisc.edu/~shavlik/cs760.html

http://www.realtechsupport.org/UB/MRIII/papers/MachineLearning/Alppaydin_MachineLearning_2010.pdf

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/index.html

http://www3.nd.edu/~steve/Rcourse/Rnotes.html

http://alex.smola.org/teaching/cmu2013-10-701/

http://www.cmpe.boun.edu.tr/~ethem/i2ml2e/

http://courses.ischool.berkeley.edu/i296a-dsa/s12/

http://datascienc.es/spring-2011-course/

DATA SCIENCE RESOURCES: Large Scale Computations

Large Scale Computations

When you start operating with data at the scale of the web, the fundamental approach and process of analysis must change. To combat the ever increasing amount of data, Google developed the MapReduce paradigm. This programming model has become the de facto standard for large scale batch processing since the release in 2007 of Apache Hadoop, the open-source MapReduce framework.
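To make the paradigm concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen for illustration, and the grouping step that a real framework performs across a cluster is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big insights", "data at web scale"]
    print(word_count(docs))
```

The point of the model is that map_phase and reduce_phase each look at only one document or one key at a time, which is what allows a framework such as Hadoop to run many copies of them in parallel on different machines.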

Books

Mining Massive Datasets: Stanford course resources on large scale machine learning and MapReduce with accompanying book.
Data-Intensive Text Processing with MapReduce: An introduction to algorithms for the indexing and processing of text that teaches you to “think in MapReduce.”
Hadoop: The Definitive Guide: The most thorough treatment of the Hadoop framework, a great tutorial and reference alike.
Programming Pig: An introduction to the Pig framework for programming data flows on Hadoop.

Courses

UC Berkeley: Analyzing Big Data with Twitter: A course — taught in close collaboration with Twitter — that focuses on the tools and algorithms for data analysis as applied to Twitter microblog data (with project based curriculum).
Coursera: Web Intelligence and Big Data: An introduction to dealing with large quantities of data from the web; how the tools and techniques for acquiring, manipulating, querying, and analyzing data change at scale.
CMU: Machine Learning with Large Datasets: A course on scaling machine learning algorithms on Hadoop to handle massive datasets.
U of Chicago: Large Scale Learning: A treatment of handling large datasets through dimensionality reduction, classification, feature parametrization, and efficient data structures.
UC Berkeley: Scalable Machine Learning: A broad introduction to the systems, algorithms, models, and optimizations necessary at scale.

DATA SCIENCE RESOURCES: Visualization

Visualization

Books

Tufte: The Visual Display of Quantitative Information: Not freely available, but perhaps the most influential text for the subject of data visualization. A classic that defined the field.

Courses

UC Berkeley: Visualization: Graduate class on the techniques and algorithms for creating effective visualizations.
Rice: Data Visualization: A treatment of data visualization and how to meaningfully present information from the perspective of Statistics.
Harvard: Introduction to Computing, Modeling, and Visualization: Connects the concepts of computing with data to the process of interactively visualizing results.
School of Data: From Data to Diagrams: A gentle introduction to plotting and charting data, with exercises.
Predictive Analytics: Overview and Data visualization: An introduction to the process of predictive modeling, and a treatment of the visualization of its results.

Tools

D3.js: Data-Driven Documents. Declarative manipulation of DOM elements with data dependent functions (with Python port).
Vega: A visualization grammar built on top of D3 for declarative visualizations in JSON. Released by the dream team at Trifacta, it provides a higher level abstraction than D3 for creating canvas or SVG based graphics.
Rickshaw: A charting library built on top of D3 with a focus on interactive time series graphs.
modest maps: A lightweight library with a simple interface for working with maps in the browser (with ports to multiple languages).
Chart.js: Very simple (only six charts) HTML5 canvas based plotting library with beautiful styling and animation.

DATA SCIENCE RESOURCES: Statistics

Statistics

Books

O’Reilly: Think Stats: An introduction to Probability and Statistics for Python programmers.
Introduction to Probability: Textbook for Berkeley’s Stats 134 class, an introductory treatment of probability with complementary exercises.
Lecture notes for Introduction to Probability: Compiled lecture notes of above textbook, complete with exercises.
OpenIntro: Statistics: Introductory textbook with supplementary exercises and labs in an online portal.
Think Bayes: A simple introduction to Bayesian Statistics with Python code examples.

Courses

edx: Introduction to Statistics: A basic introductory statistics course.
Coursera: Statistics One: A first course in Statistics from Andrew Conway of Princeton University.
Coursera: Statistics, Making Sense of Data: An applied Statistics course that teaches the complete pipeline of statistical analysis.
MIT: Statistical Thinking and Data Analysis: Introduction to probability, sampling, regression, common distributions, and inference.
Khan Academy’s Statistics: A wonderful and very lucid introduction to all things statistics.

DATA SCIENCE RESOURCES: Machine Learning and Algorithms

Machine Learning and Algorithms

Books

A first encounter with Machine Learning: An introduction to machine learning concepts focusing on the intuition and explanation behind why they work.
A Programmer’s Guide to Data Mining: A web based book complete with code samples (in Python) and exercises.
Data Structures and Algorithms: An introduction to computer science with code examples in Python — covers algorithm analysis, data structures, sorting algorithms, and object oriented design.
An Introduction to Data Mining: An interactive Decision Tree guide (with hyperlinked lectures) to learning data mining and ML.
Elements of Statistical Learning: One of the most comprehensive treatments of data mining and ML, often used as a university textbook.
An Introduction to Information Retrieval: Textbook from a Stanford course on NLP and information retrieval with sections on text classification, clustering, indexing, and web crawling.

Courses

Coursera: Machine Learning: Stanford’s famous machine learning course taught by Andrew Ng.
Coursera: Computational Methods for Data Analysis: Statistical methods and data analysis applied to physical, engineering, and biological sciences.
MIT: Data Mining: An introduction to the techniques of data mining and how to apply ML algorithms to garner insights.
edx: Introduction to Artificial Intelligence: The first half of Berkeley’s popular AI course that teaches you to build autonomous agents to efficiently make decisions in stochastic and adversarial settings.
edx: Introduction to Computer Science and Programming: MIT’s introductory course to the theory and application of Computer Science.