Unit – 2
(Learning Notes)
SYLLABUS:
Introduction to Big Data Analytics: Big Data Overview
State of Practice in Analytics, Role of Data Scientists,
Big Data Analytics in Industry Verticals
Key Roles for a Successful Analytics Project
Business User: Someone who understands the domain area and
usually benefits from the results. This person can consult and advise
the project team on the context of the project, the value of the results,
and how the outputs will be operationalized. Usually a business
analyst, line manager, or deep subject matter expert in the project
domain fulfils this role.
Project Sponsor: Responsible for the genesis of the project. Provides
the impetus and requirements for the project and defines the core
business problem. Generally provides the funding and gauges the
degree of value from the final outputs of the working team. This
person sets the priorities for the project and clarifies the desired
outputs.
Project Manager: Ensures that key milestones and objectives are met
on time and at the expected quality.
Business Intelligence Analyst: Provides business domain expertise
based on a deep understanding of the data, key performance
indicators (KPIs), key metrics, and business intelligence from a
reporting perspective. Business Intelligence Analysts generally create
dashboards and reports and have knowledge of the data feeds and
sources.
Database Administrator (DBA): Provisions and configures the
database environment to support the analytics needs of the working
team. These responsibilities may include providing access to key
databases or tables and ensuring the appropriate security levels are in
place related to the data repositories.
Data Engineer: Leverages deep technical skills to assist with tuning
SQL queries for data management and data extraction, and provides
support for data ingestion into the analytic sandbox, which was
discussed in Chapter 1, "Introduction to Big Data Analytics." Whereas
the DBA sets up and configures the databases to be used, the data
engineer executes the actual data extractions and performs
substantial data manipulation to facilitate the analytics. The data
engineer works closely with the data scientist to help shape data in
the right ways for analyses.
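As an illustration of this extraction work, the short Python sketch
below pushes filtering and aggregation into a SQL query and hands the
result to the analysts as a DataFrame. The table and column names
(transactions, customer_id, amount, order_date) are hypothetical, and
an in-memory SQLite database stands in for a real enterprise source.

import sqlite3
import pandas as pd

# In-memory database stands in for the enterprise SQL source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customer_id INT, amount REAL, order_date TEXT);
    INSERT INTO transactions VALUES
        (1, 120.0, '2023-03-01'),
        (1,  80.0, '2023-04-15'),
        (2, 200.0, '2023-02-20');
""")

# Push filtering and aggregation into the database so only the data
# needed for analysis moves into the sandbox.
query = """
    SELECT customer_id,
           SUM(amount) AS total_spend,
           COUNT(*)    AS num_orders
    FROM transactions
    WHERE order_date >= '2023-01-01'
    GROUP BY customer_id
"""
df = pd.read_sql_query(query, conn)
print(df)
conn.close()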
Data Scientist: Provides subject matter expertise for analytical
techniques, data modelling, and applying valid analytical techniques
to given business problems. Ensures overall analytics objectives are
met. Designs and executes analytical methods and approaches with
the data available to the project.
Overview of Data Analytics Lifecycle
Phase 1- Discovery: In Phase 1, the team learns the business
domain, including relevant history such as whether the organization
or business unit has attempted similar projects in the past from
which they can learn. The team assesses the resources available to
support the project in terms of people, technology, time, and data.
Important activities in this phase include framing the business
problem as an analytics challenge that can be addressed in
subsequent phases and formulating initial hypotheses (IHs) to test
and begin learning the data.
Phase 2- Data preparation: Phase 2 requires the presence of an
analytic sandbox, in which the team can work with data and perform
analytics for the duration of the project. The team needs to execute
extract, load, and transform (ELT) or extract, transform, and load
(ETL) processes to get data into the sandbox. ELT and ETL are
sometimes abbreviated together as ETLT. Data should be transformed
in the ETLT process so the team can work with it and analyze it. In
this phase, the team also needs to familiarize itself with the data
thoroughly and take steps to condition the data.
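A minimal Python sketch of this ETLT flow is given below. The raw
feed, column names, and SQLite sandbox file are illustrative
assumptions, not prescribed tools; a real project would substitute
its own sources and sandbox platform.

import sqlite3
import pandas as pd

# Extract: a tiny stand-in for a raw feed (in practice, read from
# files or source systems, e.g. with pd.read_csv).
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "signup_date": ["2023-01-05", "2023-01-05", "bad-date", "2023-02-10"],
    "age": [34, 34, None, 51],
})

# Transform / condition: fix types, drop duplicates, handle missing values.
raw["signup_date"] = pd.to_datetime(raw["signup_date"], errors="coerce")
raw = raw.drop_duplicates(subset="customer_id")
raw["age"] = raw["age"].fillna(raw["age"].median())

# Load: write the conditioned table into the sandbox for later phases.
sandbox = sqlite3.connect("analytic_sandbox.sqlite")
raw.to_sql("customers", sandbox, if_exists="replace", index=False)
sandbox.close()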
Phase 3-Model planning: Phase 3 is model planning, where the team
determines the methods, techniques, and workflow it intends to follow
for the subsequent model building phase. The team explores the data
to learn about the relationships between variables and subsequently
selects key variables and the most suitable models.
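The exploration step can be as simple as the Python sketch below,
which uses synthetic data (the variables age, num_orders, and
total_spend are invented for illustration) to examine correlations
and distributions before selecting variables and candidate models.

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=100),
    "num_orders": rng.poisson(5, size=100),
})
df["total_spend"] = 30 * df["num_orders"] + rng.normal(0, 20, size=100)

# Pairwise correlations highlight candidate predictors and redundancy.
print(df.corr())

# Distribution summaries help decide between continuous and discrete
# treatments of each variable.
print(df.describe())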
Phase 4-Model building: In Phase 4, the team develops data sets for
testing, training, and production purposes. In addition, in this phase
the team builds and executes models based on the work done in the
model planning phase. The team also considers whether its existing
tools will suffice for running the models, or if it will need a more
robust environment for executing models and workflows (for example,
fast hardware and parallel processing, if applicable).
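The sketch below shows one possible shape of this phase using
scikit-learn, one of the Python toolkits listed later in these notes.
The generated data set and the choice of a random forest classifier
are illustrative assumptions only.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; in practice this comes from the analytic sandbox.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Develop separate training and testing sets, as Phase 4 describes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on held-out data before considering production deployment.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))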
Phase 5-Communicate results: In Phase 5, the team, in collaboration
with major stakeholders, determines if the results of the project are a
success or a failure based on the criteria developed in Phase 1. The
team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to stakeholders.
Phase 6-Operationalize: In Phase 6, the team delivers final reports,
briefings, code, and technical documents. In addition, the team may
run a pilot project to implement the models in a production
environment.
Common Tools for the Model Building Phase
SAS Enterprise Miner allows users to run predictive and descriptive
models based on large volumes of data from across the enterprise. It
interoperates with other large data stores, has many partnerships,
and is built for enterprise-level computing and analytics.
SPSS Modeler (provided by IBM and now called IBM SPSS Modeler)
offers methods to explore and analyze data through a GUI.
Matlab provides a high-level language for data analytics, algorithm
development, and data exploration.
Alpine Miner provides a GUI front end for users to develop analytic
workflows and interact with Big Data tools and platforms on the back
end.
STATISTICA and Mathematica are also popular and well-regarded
data mining and analytics tools.
R and PL/R: R was described earlier in the model planning phase, and
PL/R is a procedural language for PostgreSQL with R. Using this
approach means that R commands can be executed in-database. This
technique provides higher performance and is more scalable than
running R in memory.
Octave, a free software programming language for computational
modelling, has some of the functionality of Matlab. Because it is freely
available, Octave is used in major universities when teaching machine
learning.
WEKA is a free data mining software package with an analytic
workbench. The functions created in WEKA can be executed within
Java code.
Python is a programming language that provides toolkits for machine
learning and analysis, such as scikit-learn, numpy, scipy, pandas,
and related data visualization using matplotlib.
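A brief sketch of these toolkits working together, on synthetic data
invented purely for illustration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=200)

print(df.describe())                 # pandas: quick numeric summary

plt.scatter(df["x"], df["y"], s=10)  # matplotlib: visualize the relationship
plt.xlabel("x")
plt.ylabel("y")
plt.title("Synthetic linear relationship")
plt.show()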
SQL in-database implementations, such as MADlib, provide an
alternative to in-memory desktop analytical tools. MADlib provides an
open-source machine learning library of algorithms that can be
executed in-database, for PostgreSQL or Greenplum.
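As a hedged illustration of in-database execution, the sketch below
calls MADlib's linear regression from Python via psycopg2. It assumes
a reachable PostgreSQL database with the MADlib extension installed
and a hypothetical houses(price, size, bedrooms) table; the
connection string is a placeholder.

import psycopg2

conn = psycopg2.connect("dbname=analytics user=analyst")  # assumed DSN
cur = conn.cursor()

# Train a linear regression entirely in-database; only the small model
# summary leaves the database. Assumes houses_linregr does not yet exist.
cur.execute("""
    SELECT madlib.linregr_train(
        'houses',                       -- source table
        'houses_linregr',               -- output (model) table
        'price',                        -- dependent variable
        'ARRAY[1, size, bedrooms]'      -- independent variables
    );
""")
cur.execute("SELECT coef FROM houses_linregr;")
print(cur.fetchone())
conn.commit()
cur.close()
conn.close()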
Key Outputs from a Successful Analytic Project
Phase – 1: Discovery
Learning Business Domain
Resources – Technology, Tools, Systems, Data & People
Problem Formulation
Identify the key stakeholders
Interview stakeholders (prepare questions and ask open-ended ones)
Develop initial hypotheses
Identify Data sources
Phase – 2: Prepare Data
Prepare Analytics Sandbox
Perform ETLT (move data from OLTP and other sources into the sandbox)
Learning Data
Data Conditioning (Tables – Columns)
Survey & Visualize
Phase – 3: Model Planning
Variable Selection
Model Selection (regression, classification, clustering) | Input / Output –
Continuous & Discrete
Phase – 4: Model Building
Building
Accuracy Analysis
Deploy
Testing
Phase – 5: Communicate Results
Build Reports
Communicate Reports
Summary Presentation
Phase – 6: Operationalize
Closure of the Project