Data Analytics
(CS61061)
Lecture #1
Introduction to Data
Dr. Debasis Samanta
Associate Professor
Department of Computer Science & Engineering
Quote of the day..
It is very easy to be a teacher, but very difficult to be a
student.
A good student has to learn many concepts, perform in
examinations, and be loyal to his teacher and others.
Quote from Hichki, a Hindi feature film directed by Siddharth P.
Malhotra.
@DSamanta, IIT Kharagpur Data Analytics (CS61061) 2
In this discussion…
Introduction to data
Current trend
Data and Big Data
Big Data vs. small data
Tools and techniques
Introduction to Data
Introduction to data
Example:
10, 25, …, Kharagpur, 10CS3002,
[email protected]
Anything else?
Data versus Information
100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0
Is there any information?
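One way to see the difference: the numbers above become information only once a context and a summary are attached. The sketch below is illustrative only. The day labels are an assumption (the slide gives no context for the values), and `summarize` is a hypothetical helper.

```python
from statistics import mean

# The raw values from the slide; on their own they are just data.
values = [100.0, 0.0, 250.0, 150.0, 220.0, 300.0, 110.0]

# Hypothetical context (an assumption, not from the slide):
# one reading per day of a week, Monday first.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def summarize(readings, labels):
    """Turn raw readings into a few statements a person can act on."""
    by_label = dict(zip(labels, readings))
    return {
        "total": sum(readings),
        "average": mean(readings),
        "peak": max(by_label, key=by_label.get),
        "lowest": min(by_label, key=by_label.get),
    }


summary = summarize(values, days)
print(summary)
```

With a context attached, the same seven numbers answer questions such as "which day was busiest?", which the bare list cannot.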
Current Trend
How large is your data?
What is the maximum file size you have
dealt with so far?
Movies/files/streaming video that you have
used?
What is the maximum download speed you
get?
To retrieve data stored in distant locations?
How fast is your computation?
How much time does it take just to transfer the data from you,
process it, and get the result?
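The transfer part of these questions is simple arithmetic: time equals size divided by speed. A minimal sketch follows, assuming a hypothetical 4 GB file and a 100 Mbit/s link (both figures are illustrative, not from the lecture) and ignoring latency and protocol overhead.

```python
def transfer_seconds(size_bytes: float, speed_bits_per_second: float) -> float:
    """Ideal transfer time: file size in bits divided by link speed."""
    if speed_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    return size_bytes * 8 / speed_bits_per_second


GB = 10**9    # decimal gigabyte, in bytes
MBPS = 10**6  # megabit per second, in bits per second

file_size = 4 * GB       # hypothetical 4 GB movie file
link_speed = 100 * MBPS  # hypothetical 100 Mbit/s download speed

seconds = transfer_seconds(file_size, link_speed)
print(f"{seconds:.0f} s, about {seconds / 60:.1f} min")
```

Note the factor of 8: file sizes are usually quoted in bytes, link speeds in bits per second.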
Growth of data
Sources of data
“Every day, we create 2.5 quintillion bytes of data,
so much that 90% of the data in the world today has been created in the
last two years alone.”
The data come from several sources:
sensors used to gather climate information
posts to social media sites
digital pictures and videos
purchase transaction records
cell phone GPS signals
etc., to name a few!
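To get a feel for the figure quoted above, the arithmetic can be sketched as follows. The sketch assumes the short-scale quintillion (10^18) and decimal units (1 EB = 10^18 bytes, 1 TB = 10^12 bytes).

```python
QUINTILLION = 10**18  # short-scale quintillion (assumed)
EXABYTE = 10**18      # decimal exabyte, in bytes
TERABYTE = 10**12     # decimal terabyte, in bytes
SECONDS_PER_DAY = 24 * 60 * 60

# 2.5 quintillion bytes per day, kept as an exact integer.
bytes_per_day = 25 * QUINTILLION // 10

exabytes_per_day = bytes_per_day / EXABYTE
terabytes_per_second = bytes_per_day / SECONDS_PER_DAY / TERABYTE

print(f"{exabytes_per_day} EB per day")
print(f"{terabytes_per_second:.1f} TB per second")
```

In other words, the quoted daily volume is 2.5 exabytes, or roughly 29 terabytes every second.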
Examples
Social media and networks
(All of us are generating data)
Scientific instruments
(Collecting all sorts of data)
Mobile devices
(Tracking all objects all the time)
Sensor technology and networks
(Measuring all kinds of data)
Big Data
Now data is Big Data!
No single standard definition!
‘Big-data’ is similar to ‘Small-data’, but bigger
…but having bigger data consequently requires different approaches:
techniques, tools and architectures
…to solve: new problems
…and, of course, in a better way
Big data is data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and extract
value and hidden knowledge from it…
Characteristics of Big Data: V3
V3: V for Volume
The volume of data that needs to
be processed is increasing
rapidly
More storage capacity
More computation
More tools and techniques
Exponential increase in
collected/generated data
V3: V for Variety
Various formats, types, and
structures
Text, numerical, images, audio,
video, sequences, time series, social
media data, multi-dimensional
arrays, etc…
Static data vs. streaming data
A single application can be
generating/collecting many types of
data
To extract knowledge, all these types of
data need to be linked together
V3: V for Velocity
Data is being generated fast and needs to be
processed fast
For time-sensitive processes such as catching
fraud, big data must be used as it streams into
your enterprise in order to maximize its value
Scrutinize 5 million trade events created each
day to identify potential fraud
Analyze 500 million daily call detail records in real-
time to predict customer churn faster
Sometimes, 2 minutes is too late!
The latest we have heard is that a delay of 10 ns
(nanoseconds) is too much
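The idea of using data "as it streams in" can be sketched with a generator that inspects each event on arrival instead of waiting for a batch. This is a toy illustration only: the event shape, the `flag_suspicious` helper, and the "5 times the running mean" rule are all assumptions made for the example, not a real fraud-detection method.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Trade = Tuple[str, float]  # (account id, amount); a hypothetical event shape


def flag_suspicious(trades: Iterable[Trade], factor: float = 5.0) -> Iterator[Trade]:
    """Yield each trade that exceeds `factor` times its account's running mean.

    Events are examined one at a time as they arrive, so a flag can be
    raised immediately rather than after a later batch job.
    """
    count: Dict[str, int] = defaultdict(int)
    total: Dict[str, float] = defaultdict(float)
    for account, amount in trades:
        # Compare against the mean of trades seen so far for this account.
        if count[account] > 0 and amount > factor * (total[account] / count[account]):
            yield (account, amount)
        # Update the running statistics after the check.
        count[account] += 1
        total[account] += amount


# A tiny in-memory list stands in for a live event stream.
stream = [("A", 100.0), ("B", 40.0), ("A", 120.0), ("A", 900.0), ("B", 45.0)]
flagged = list(flag_suspicious(stream))
print(flagged)
```

Because the function keeps only a count and a total per account, memory use does not grow with the number of events, which is what makes this style workable on an unbounded stream.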
Big Data vs. small data
Big Data:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time nature
Small data:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
Big Data vs. small data
Big Data is more real-time in nature
than traditional applications
Big Data architecture
Traditional architectures are not
well-suited for big data applications
(e.g. Exadata, Teradata)
Massively parallel processing, scale-out
architectures are well-suited for
big data applications
Tools and Techniques
Challenges ahead…
The bottleneck is in technology
New architectures, algorithms, and techniques are needed
Also in technical skills
Experts in using the new technology and dealing with Big Data
Who are the major players in the
world of Big Data?
Big data players
Major players…
Google
Hadoop
MapReduce
Mahout
Apache HBase
Cassandra
Tools available
NoSQL Databases
MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak,
ZooKeeper
MapReduce
Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban,
Oozie, Greenplum
Storage
S3, HDFS, GDFS
Servers
EC2, Google App Engine, Elastic Beanstalk, Heroku
Processing
R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
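MapReduce, listed among the tools above, is a programming model with three steps: map each input to key/value pairs, group the pairs by key, and reduce each group to a result. The word count below is a minimal single-machine sketch in plain Python. It does not use the Hadoop API, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data is big", "small data is small"]

# Each document is mapped independently of the others.
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = reduce_phase(shuffle_phase(pairs))
print(counts)
```

Since each call to `map_phase` touches only its own document, and each key is reduced on its own, both phases could in principle be spread over many machines, which is what frameworks such as Hadoop arrange.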
Any questions?
Questions of the day…
1. What are the smallest and largest units for measuring the size
of data?
2. How big is a quintillion?
3. Give examples of the smallest and the largest entities of
data.
4. Give FIVE parameters with which data can be
categorized as i) simple, ii) moderately complex and iii)
complex.
5. What types of data are involved in the following
applications?
   1. Weather forecasting
   2. Mobile usage of all customers of a service provider
   3. Anomaly (e.g. fraud) detection in a bank
   4. Person categorization, that is, identifying a human
   5. Air traffic control at an airport
   6. Streaming data from all flying Boeing aircraft