Module Detailed Contents Hours
01 Introduction to Big Data & Hadoop 06
1.1 Introduction to Big Data, 1.2 Big Data characteristics, types of Big
Data, 1.3 Traditional vs. Big Data business approach, 1.4 Case Study
of Big Data Solutions. 1.5 Concept of Hadoop, 1.6 Core Hadoop
Components; Hadoop Ecosystem
02 Hadoop HDFS and MapReduce 10
2.1 Distributed File Systems: Physical Organization of Compute Nodes,
Large-Scale File-System Organization. 2.2 MapReduce: The Map
Tasks, Grouping by Key, The Reduce Tasks, Combiners, Details of
MapReduce Execution, Coping With Node Failures. 2.3 Algorithms
Using MapReduce: Matrix-Vector Multiplication by MapReduce,
Relational-Algebra Operations, Computing Selections by MapReduce,
Computing Projections by MapReduce, Union, Intersection, and
Difference by MapReduce. 2.4 Hadoop Limitations.
03 NoSQL 06
3.1 Introduction to NoSQL, NoSQL Business Drivers, 3.2 NoSQL Data
Architecture Patterns: Key-value stores, Graph stores, Column family
(Bigtable) stores, Document stores, Variations of NoSQL architectural
patterns, NoSQL Case Study. 3.3 NoSQL solution for big data,
Understanding the types of big data problems; Analyzing big data with
a shared-nothing architecture; Choosing distribution models:
master-slave versus peer-to-peer; Four ways that NoSQL systems handle
big data problems.
04 Mining Data Streams 12
4.1 The Stream Data Model: A Data-Stream-Management System,
Examples of Stream Sources, Stream Queries, Issues in Stream
Processing. 4.2 Sampling Data Techniques in a Stream. 4.3 Filtering
Streams: Bloom Filter with Analysis. 4.4 Counting Distinct Elements in
a Stream, Count-Distinct Problem, Flajolet-Martin Algorithm,
Combining Estimates, Space Requirements. 4.5 Counting Frequent Items
in a Stream, Sampling Methods for Streams, Frequent Itemsets in
Decaying Windows. 4.6 Counting Ones in a Window: The Cost of Exact
Counts, The Datar-Gionis-Indyk-Motwani Algorithm, Query Answering
in the DGIM Algorithm, Decaying Windows.
05 Finding Similar Items and Clustering 08
5.1 Distance Measures: Definition of a Distance Measure, Euclidean
Distances, Jaccard Distance, Cosine Distance, Edit Distance, Hamming
Distance. 5.2 CURE Algorithm, Stream-Computing, A
Stream-Clustering Algorithm, Initializing & Merging Buckets,
Answering Queries.
06 Real-Time Big Data Models 10
6.1 PageRank Overview, Efficient computation of PageRank:
PageRank Iteration Using MapReduce, Use of Combiners to
Consolidate the Result Vector. 6.2 A Model for Recommendation
Systems, Content-Based Recommendations, Collaborative Filtering. 6.3 Social
Networks as Graphs, Clustering of Social-Network Graphs, Direct
Discovery of Communities in a social graph.
Textbooks:
1 Anand Rajaraman and Jeff Ullman, "Mining of Massive Datasets", Cambridge
University Press.
2 Alex Holmes, "Hadoop in Practice", Manning Press, Dreamtech Press.
3 Dan McCreary and Ann Kelly, "Making Sense of NoSQL: A Guide for Managers and the
Rest of Us", Manning Press.
References:
1 Bill Franks, "Taming The Big Data Tidal Wave: Finding Opportunities In Huge
Data Streams With Advanced Analytics", Wiley.
2 Chuck Lam, "Hadoop in Action", Dreamtech Press.
3 Jared Dean, "Big Data, Data Mining, and Machine Learning: Value Creation for
Business Leaders and Practitioners", Wiley India Private Limited, 2014.
4 Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques",
Morgan Kaufmann Publishers, 3rd ed, 2010.
5 Lior Rokach and Oded Maimon, "Data Mining and Knowledge Discovery
Handbook", Springer, 2nd edition, 2010.
6 Ronen Feldman and James Sanger, "The Text Mining Handbook: Advanced Approaches
in Analyzing Unstructured Data", Cambridge University Press, 2006.
7 Vojislav Kecman, "Learning and Soft Computing", MIT Press, 2010.