Tuesday, January 29, 2013

Hadoop, Big Data Analytics and the Challenge to Old Business Intelligence


It's not that old-school business intelligence software tools are going away, the Hadoop upstarts grant. But they portray batch-oriented extract-transform-load (ETL) data integration, relational data warehousing, and old-school analytics as too slow, rigid, and expensive to keep up in the big-data era.

Hadoop is the future because it's a massively scalable data-management and analysis environment that can handle variably structured data from many sources--log files, clickstreams, sensor data, social media sources and so on--without the delays inherent in dealing with the static schemas of relational databases.

If companies want to look at recent point-of-sale transactions alongside Web site clickstreams, recent online enrollments, email campaign results, and social media chatter, for example, it would be difficult if not impossible to quickly put all that data into a relational data warehouse and look for correlations.

Data analysis and data visualization can run directly on Hadoop. ETL, data warehousing, and BI are fine for looking at transactions here and there, but they offer little chance of bringing everything together to look at the interactions across all of those islands of information.
Hadoop-based platforms provide modules for data integration (including connectors to mainframes, databases, social sources such as Facebook and Twitter, and more), a spreadsheet-style data-analysis environment, and a dashboarding and data-visualization environment.
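To make the "many sources, no fixed schema" idea concrete, here is a minimal, hypothetical Hadoop mapper sketch that tags raw point-of-sale and clickstream records with a customer ID so a later reduce or join step can correlate them. The record layouts, field positions, and class name are assumptions for illustration, not any vendor's actual schema.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical mapper that normalizes heterogeneous input records to
 * (customerId, "source:details") pairs so point-of-sale and clickstream
 * data can be correlated in a later reduce or join step.
 */
public class MultiSourceEventMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString().trim();
        if (record.isEmpty()) {
            return;
        }
        // Assumed, illustrative record layouts:
        //   POS export:      POS,customerId,storeId,amount
        //   Clickstream log: WEB,customerId,url,timestamp
        String[] fields = record.split(",");
        if (fields.length < 3) {
            return; // skip malformed lines instead of failing the whole job
        }
        String source = fields[0];
        String customerId = fields[1];
        String details = String.join(",",
                Arrays.copyOfRange(fields, 2, fields.length));
        // Emit (customerId, "source:details") so a reducer can group all
        // events for one customer, regardless of which system they came from.
        context.write(new Text(customerId), new Text(source + ":" + details));
    }
}
```

A reducer (not shown) would then receive all of a customer's POS and web events together, which is exactly the kind of cross-source correlation described above.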

Sunday, January 27, 2013

Keys to the future of analytics and big data in healthcare

Increase Quality and Drive Down Costs
From both a cost perspective and a quality-of-care point of view, a number of areas will be affected. For example, if a patient is injured while staying in a facility, the organization is not reimbursed for the care related to that injury. Analytics gives a system the ability to see that such an event is likely and alert staff, so that it does not happen to the patient in the first place. One way the US and other governments are applying pressure is by no longer paying for preventable events the way they did in the old days; providers will have to tap into their data to avoid them. That is just one example. There is a whole class of adverse events that are preventable and should be completely avoidable, and tapping into the information to prevent them will drive down the cost of healthcare.

Grow Physician-Patient Relationship
The physician-patient relationship will grow with the help of social media and mobile apps, and this stems from the need for hospitals to keep patients healthy and out of their facilities. In the old days, hospitals made money on admissions: the longer they kept a patient there, the more they made. Because that model is changing, analysts predict an "explosion" of mobile applications and even social media use, giving patients easier access to nurses and physicians. It's about keeping patients healthy and driving down costs.

Tuesday, January 15, 2013

BIG DATA & BIG ENTERPRISES

For decades, companies have been making business decisions based on transactional data stored in relational databases. Beyond that critical data, however, is a potential treasure trove of non-traditional, less structured data: weblogs, social media, photographs etc. that can be mined for useful information. Decreases in the cost of both storage and compute power have made it feasible to collect this data - which would have been thrown away only a few years ago. As a result, more and more companies are looking to include non-traditional yet potentially very valuable data in their business intelligence analysis. Some of those big enterprises that have entered the world of Big Data with their flagship technologies/products are listed below:

  • IBM BigInsights - IBM BigInsights brings the power of Hadoop to the enterprise.
  • Microsoft HDInsight - HDInsight enables you to get up and running with your own Apache Hadoop cluster on the Windows Azure cloud in just minutes.
  • Oracle - Oracle offers the broadest and most integrated portfolio of products to help you acquire and organize diverse data sources and analyze them alongside your existing data to find new insights and capitalize on hidden relationships.
  • Google BigQuery - Analyze Big Data in the cloud using SQL and get real-time business insights in seconds using Google BigQuery.
  • Hortonworks Data Platform (HDP) - Hortonworks Data Platform (HDP) combines the power and cost-effectiveness of Apache Hadoop with the advanced services and reliability required for enterprise deployments.
  • Cloudera - Cloudera offers enterprises a powerful data platform built on the popular Apache Hadoop open-source software package.

Big Data: The moving parts



Source: http://cdn-static.zdnet.com/i/story/60/39/001648/big_data_the_moving_parts_large.png

Monday, January 14, 2013

Daily data ... very big



Source: http://www.capgemini.com/technology-blog/files/2011/11/what_is_bigdata.jpg

Saturday, January 12, 2013

Web Analytics



The success of your website is largely dependent on the overall user experience of your visitors. Here are some of the important drivers for using web analytics:
  1. Web analytics provides insight into the online behavior of your visitors. The visitor data can be used to optimize your website to best meet the requirements of your audience.
  2. Web analytics can provide business intelligence in the context of customer segmentation, trend identification, product development, targeted marketing, etc.
  3. Web analytics can be used to track a wide variety of metrics to measure the overall success of your website.
Note: Unlike web reporting, web analytics is actionable. Web analytics helps in making informed decisions about changing your online strategy.
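To make the idea of tracking such metrics concrete, here is a small, illustrative Java sketch (not tied to any particular analytics product) that parses a hypothetical access.log in the common log format and reports page views per URL and unique visitors. The file name and log layout are assumptions for the sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Toy web-analytics sketch: reads a web server access log in the common
 * log format ("host - user [date] \"GET /page HTTP/1.1\" status bytes")
 * and reports page views per URL plus unique visitors (by host).
 */
public class SimpleWebAnalytics {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> pageViews = new HashMap<>();
        Set<String> uniqueVisitors = new HashSet<>();

        for (String line : Files.readAllLines(Paths.get("access.log"))) {
            String[] parts = line.split(" ");
            if (parts.length < 7) {
                continue; // skip malformed lines
            }
            String host = parts[0];          // visitor identifier
            String requestedUrl = parts[6];  // path inside the quoted request
            uniqueVisitors.add(host);
            pageViews.merge(requestedUrl, 1, Integer::sum);
        }

        System.out.println("Unique visitors: " + uniqueVisitors.size());
        pageViews.forEach((url, views) ->
                System.out.println(url + " -> " + views + " views"));
    }
}
```

Real web analytics tools add sessionization, campaign attribution, and segmentation on top of counts like these, but the raw inputs are the same kind of visitor-level events.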

Big Data research papers

http://www.greenplum.com/industry-buzz/big-data/research-papers

Analyzing Big Data with Twitter



How Big Data is Changing the College Experience


Tuesday, January 8, 2013

Gartner: Six Best Practices for Apache Hadoop Pilot

Gartner outlines some best practices that help cross-functional teams deploy a Hadoop pilot project, and assist IT and business leaders in avoiding common pitfalls.

1. Define the use case(s) well
2. Enlist and build a competent team
3. Choose the appropriate distribution vendor
4. Pilot, Test and Scale for Price/Performance
5. Plan for Data Integration
6. Perform a Thorough Postpilot Analysis

Key Challenges
Key challenges for undertaking Apache Hadoop pilot projects include:
  • Finding an appropriate use case that aligns well with the goals of business teams and is feasible to implement.
  • Enlisting a competent team in the face of an acute shortage of Hadoop-related skills.
  • Choosing an appropriate distribution, given the multitude of Hadoop projects and version releases.
  • Dealing with data ingestion and integration challenges that can result in poor analytical outcomes.
Recommendations
  • Identify current skunkworks projects to find skills and experience within the organization, and build a cross-functional team to tackle a pilot.
  • Define a use case that leverages Hadoop's strengths and has measurable business outcomes.
  • Identify skill gaps that should be mitigated by either training or engaging external consultants.
  • Choose Hadoop software distribution based on use case rather than vice versa, and consider future scalability when running pilot projects.
  • Identify future integration requirements and opportunities to connect newly exploited data with existing analytics teams and tools.

(A skunkworks project is one typically developed by a small and loosely structured group of people who research and develop a project primarily for the sake of radical innovation. Source: Wikipedia.)

What's that BIG ..?

Big Volume
  • with simple SQL analytics
  • with complex non-SQL analytics

Big Velocity
  • Drink from the fire hose

Big Variety
  • Large number of diverse data sources to integrate


  • Data volume: terabytes, petabytes, billions of rows per day, hour, or minute.
  • Data variety: mixing point-of-sale records, call data records, machine-generated data, scanned documents, social networking data, smart metering data, structured and unstructured data. Variety also brings complexity: some data, such as video or binary data from M2M communications, is genuinely hard to analyze.
  • Need for velocity: large amounts of data go out of date very quickly, so it is important to use data as fast as possible.


What is Big Data?

The generally accepted definition of Big Data is data that’s too big to work with – which is to some degree a fallacy on two levels. First, as hardware and software improves, the limit of what’s “too big” is constantly increasing. Second, when people talk about Big Data, it typically isn’t in the context of throwing up their hands in futility, but in finding ways to use existing hardware and software technology to manipulate their data.
This is why Big Data is often just the other side of the coin from data analytics, because analytics uses different ways of slicing, dicing, and otherwise picking through vast amounts of data to find the bits that are interesting and relevant to a particular task.
Not that long ago, organizations that wanted to analyze their data were limited by what could fit on a floppy disk. But not only is the potential size of a database getting bigger – particularly with new technologies such as the cloud – the industry is figuring out new ways to link together multiple databases into what appears as a single whole. “Logical data warehouses bringing together information from multiple sources as needed will replace the single data warehouse model,” predicted Gartner recently when it named Big Data as one of its Top 10 Strategic Technologies for 2012.

HADOOP & ITS TECHNOLOGY STACK


HADOOP is an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage. It enables applications to work with thousands of nodes and petabytes of data, and as such is a great tool for research and business operations. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

The HADOOP stack includes more than a dozen components, or subprojects, that are complex to deploy and manage. Installation, configuration, and production deployment at scale are challenging.

The main components include:

  • Hadoop -- A Java software framework to support data-intensive distributed applications.
  • ZooKeeper -- A highly reliable distributed coordination system.
  • MapReduce -- A flexible parallel data processing framework for large data sets (see the word-count sketch after this list).
  • HDFS -- Hadoop Distributed File System.
  • Oozie -- A workflow scheduler for Hadoop (MapReduce) jobs.
  • HBase -- A key-value, column-oriented database built on HDFS.
  • Hive -- A SQL-like high-level language built on top of MapReduce for analyzing large data sets.
  • Pig -- Enables the analysis of large data sets using Pig Latin.
  • Pig Latin -- A high-level language compiled into MapReduce for parallel data processing.
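As a standard illustration of the MapReduce component listed above, here is a minimal word-count job sketch using Hadoop's Java MapReduce API; the input and output paths come from the command line, and the class and jar names are just placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Classic word count: the "hello world" of MapReduce. */
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit (word, 1) for each token
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();         // add up all the 1s for this word
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, this would typically be submitted with something like "hadoop jar wordcount.jar WordCount <input dir> <output dir>"; HDFS stores the input and output, and the MapReduce framework handles distribution across the cluster.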



Image Source: http://www.capgemini.com/technology-blog/files/2011/12/hadoop.jpg