Foundation Module – BDA (Indicative duration: 240 hrs.)
Foundation Module – Pre-requisites
Basics of Information technology
Hardware and software components
Operating system
Computational thinking and problem solving skills
Basics of programming (Python)
Basics of Object oriented programming concepts
Database concepts
Foundational Curriculum – Big Data Analytics
Foundational Curriculum for Big Data Analytics is aimed at up-skilling those who have a basic
understanding of programming and data sequences, to help them expand their knowledge and learn the
fundamentals of Big Data Analytics technologies at a beginner level. This Curriculum has been divided
into three modules, of which the first is an introductory module.
Curriculum Details
Scope and Objective
Enable students to explore the fundamentals of Big Data Analytics and give them a base from
which they can upskill for specific Big Data Analytics job roles.
Intended Audience
University students enrolled in streams such as
Engineering, Computer Science, Statistics, Sciences or
Mathematics
Employed professionals who wish to explore their
career options and interests with regard to Big Data
Analytics
Enthusiasts curious about understanding the hype behind
Big Data Analytics
Pre-requisites
Knowledge of the fundamentals of programming, including data sequences such as stacks, queues,
strings, arrays, linked lists, trees, and maps, and the concepts of Object-Oriented Programming
Key Learning Outcomes
1. Evaluate trends in Big Data and discuss how Big
Data is transforming businesses
2. Evaluate the different platforms used for
processing Big Data
3. Evaluate the features of databases
4. Write Map and Reduce code for distributed
processing of data
5. Understand key concepts behind Big Data
modelling and management and gain practical skills
needed for modelling Big Data projects
6. Select appropriate data models that suit the
requirements of data
7. Differentiate between a traditional Database
Management System and a Big Data Management
System
8. Retrieve data from Big Data management systems
9. Execute simple Big Data integration and processing
operations
List of Tools Suggested (Indicative)
SQL, MongoDB, Hadoop, MapReduce, HDFS, Apache Spark, PySpark, SparkR, Java, Apache Pig, DynamoDB,
Spark MLlib, GraphX, PostgreSQL, Pandas
Indicative TOC
Data Analytics
Module 1: Data Analytics: An Overview
What & Why - Data Analytics?
Different components of a modern data ecosystem, and the role Data Analysts play in this
ecosystem.
Different types of data analysis and the key steps in a data analysis process.
Roles, responsibilities, and skillsets required to be a Data Analyst
Data Analytics Tools
Module 2: The Data Ecosystem
Different types of data structures, file formats, sources of data
Understanding of various types of data repositories such as Databases, Data Warehouses, Data
Marts, Data Lakes, and Data Pipelines.
The Extract, Transform, and Load (ETL) process, which extracts data from its sources, transforms
it into a usable form, and loads it into data repositories.
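As a minimal sketch of the ETL process, the example below extracts rows from a small in-memory CSV, transforms them, and loads them into an SQLite table. The sample data, the `orders` table, and the function names are hypothetical and chosen only for illustration; a real pipeline would read from actual sources and write to a proper repository.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: a small CSV held in memory instead of a real file.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""


def extract(text):
    """Extract: read rows from the CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, tidy names, convert types."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a repository table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl():
    """Run the three steps in order against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(RAW_CSV)), conn)
    return conn


if __name__ == "__main__":
    for row in run_etl().execute("SELECT * FROM orders ORDER BY order_id"):
        print(row)
```

Keeping the three steps as separate functions mirrors the way the stages are usually discussed and makes each one easy to test on its own.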
Chapter 1: Introduction to Big Data-Hadoop framework
o Big Data Overview, What is Big Data Analytics
o Overview of Hadoop Ecosystem
o What is Big Data & Role of Hadoop in Big Data; Overview of other Big Data Systems
o Hadoop Integrations into Existing Software Products
o Current Scenario in Hadoop Ecosystem
o Installation & Configuration
o Use Cases of Hadoop (HealthCare, Retail, Telecom)
Chapter 2: HDFS
o HDFS Concepts & Design
o Architecture, HDFS Daemons
o Overview of the Hadoop Distributed File System
NameNodes
DataNodes
The Command-Line Interface
o Data Flow (File Read, File Write)
o Fault Tolerance, Shell Commands
o Data Flow, Archives, Coherency, Data Integrity
o Role of Secondary NameNode
Chapter 3: Hadoop Components - MapReduce
o Anatomy of MapReduce & Theory
o Data Flow (Map – Shuffle – Reduce)
o MapRed vs MapReduce APIs
o Programming [Mapper, Reducer, Combiner, Partitioner]
o Writables
o Input and Output format
o Streaming API using Python
o Magic of Shuffle Phase
o File Formats, Sequence Files
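The Map – Shuffle – Reduce data flow listed in this chapter can be illustrated with a word count. The sketch below is a single-process simulation in plain Python, not the Hadoop API: the `mapper`, `shuffle`, `reducer`, and `word_count` names are illustrative, and on a real cluster the shuffle step is performed by the framework between the map and reduce phases.

```python
from collections import defaultdict


def mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters process big data"]))
```

With Hadoop Streaming, the mapper and reducer would typically be separate scripts that read from standard input and write tab-separated key/value pairs to standard output; the in-memory version above only shows the logic of each phase.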
Chapter 4: Extended subjects on HBase
o Introduction to NoSQL
o CAP Theorem
o HBase and RDBMS
o HBase and HDFS
o Architecture (Read Path, Write Path, Compactions, Splits)
o Installation & Configuration
o Role of ZooKeeper
o HBase Shell, Introduction to Filters
o Row Key Design, What's New in HBase, Hands-On
Chapter 5: Extended subjects on Hive
o Architecture
o Installation & Configuration
o Hive vs RDBMS
o Working on Hive Beeline
o Hive: HQL, Tables
o DDL, DML
o UDF
o Partitioning, Bucketing
o Hive functions, Date functions, String functions
o Joins, Sub Queries and other Aggregations
Chapter 6: Apache Spark (5 hrs)
o Introduction to Spark - Getting started
o Resilient Distributed Dataset and DataFrames
o Spark application programming
o Introduction to Spark libraries
o Spark configuration, monitoring and tuning
Module 3: Gathering, Wrangling & Visualizing Data with
Advanced Python Libraries [Pandas, NumPy & Matplotlib]
o Introduction to Pandas
o Data Structures in Pandas (Series, DataFrame)
o DataFrame creation from Series, lists, dictionaries, and a NumPy 2D array
o Identify and Handle Missing Values
o Data Formatting
o Data Normalization Sets
o Binning
o Indicator variables
o CSV file handling
o Exporting data from DataFrame to CSV File
o EDA & Data Visualization using matplotlib library
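Several of the wrangling topics above (missing values, normalization, binning, indicator variables, and CSV export) can be sketched in a few lines of pandas. The sample data and column names below are hypothetical, and the choices made (mean imputation, min-max scaling, three equal-width bins) are only one reasonable option for each step.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "age": [22.0, 35.0, None, 58.0],
        "fuel": ["gas", "diesel", "gas", "gas"],
    }
)

# Identify and handle missing values: count them, then fill with the column mean.
missing_count = int(df["age"].isna().sum())
df["age"] = df["age"].fillna(df["age"].mean())

# Normalization: min-max scaling of the column to the range 0..1.
age_range = df["age"].max() - df["age"].min()
df["age_scaled"] = (df["age"] - df["age"].min()) / age_range

# Binning: group the continuous values into three labelled, equal-width intervals.
df["age_group"] = pd.cut(df["age"], bins=3, labels=["low", "medium", "high"])

# Indicator (dummy) variables for the categorical column.
df = pd.concat([df, pd.get_dummies(df["fuel"], prefix="fuel")], axis=1)

# Export to CSV; an in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

if __name__ == "__main__":
    print(df)
```

Passing a file path to `to_csv` instead of the buffer writes the same content to disk.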
Tableau
o What is Tableau?
o Tableau Architecture
o Workspace & Navigation
o Tableau Data Connections
o Filter data in Tableau
o Tableau Sort Data
o Data Visualization with Tableau
o Dynamic Data Manipulation and Presentation in Tableau
Module 4: Mining & Visualizing Data and Communicating
Results
Chapter 1 - Introduction to Statistical Modelling
o What is a Statistical Model?
o Why do we need Statistical Modelling?
o Estimation
o Confidence Interval
o Hypothesis Testing
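As a small illustration of estimation and confidence intervals, the sketch below computes an approximate confidence interval for a sample mean using only the Python standard library. It assumes a normal approximation; for small samples a t-distribution would usually be more appropriate. The function name and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    centre = mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    std_err = stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * std_err, centre + z * std_err


if __name__ == "__main__":
    # Hypothetical measurements.
    data = [10, 12, 9, 11, 13, 10, 12, 11]
    low, high = mean_confidence_interval(data)
    print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

A higher confidence level gives a wider interval, which is the trade-off between certainty and precision that estimation makes explicit.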
Chapter 2 - Introduction to Statistical Modelling
o Linear Regression
Simple Linear Regression
Multiple Linear Regression
o Classification
Logistic Regression
Discriminant Analysis
o Resampling Methods
Bootstrapping
Cross-Validation
o Tree-based Methods
Bagging
Boosting
o Unsupervised Learning
Principal Component Analysis
K-Means Clustering
Hierarchical Clustering
o Types of Variables
Dependent Variable, also known as Response Variable
Explanatory Variable, also known as Independent Variable
o Model Parameters and Model Residuals
Chapter 3 - Difference between Statistical Modelling and Machine Learning
Chapter 4 - The Difference from a Statistical Modelling Perspective
Chapter 5 - The Difference from a Machine Learning Perspective
R Programming
o Understanding R as a programming environment
o R basics
Math, Variables, and Strings
Vectors and Factors
Vector operations
o Data structures in R
o Arrays & Matrices
o Lists
o Data frames
o R programming fundamentals
Conditions and loops
Functions in R
Objects and Classes
Debugging
o Working with data in R
Reading CSV and Excel Files
Reading text files
Writing and saving data objects to file in R
o Strings and Dates in R
String operations in R
Regular Expressions
Dates in R
o Descriptive Statistics using R
o Data Visualization using R
o Exploratory Data Analysis (EDA) using R
o A comprehensive analysis of a sample data set using a Machine Learning technique
Module 5: Career Opportunities and Data Analysis in
Action
o Different career opportunities in the field of Data Analysis and the different paths that
you can take for getting skilled as a Data Analyst.
o Hands-on project with scenario-based use cases in gathering, wrangling, mining,
analyzing, and visualizing data.