Chapter 2
Data Science
Contents:
Overview of Data Science
Data Types and Their Representation
Data Value Chain
Basic Concepts of Big Data
Overview of Data Science
• Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and insights
from structured, semi-structured and unstructured data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.
What are data and information?
• Data can be defined as a representation of facts, concepts, or instructions in a
formalized manner.
• It can be described as unprocessed facts and figures.
• It is represented with the help of:
• alphabets (A-Z, a-z),
• digits (0-9) or
• special characters (+, -, /, *, <,>, =, etc.)
• Information is the processed data on which decisions and actions are based.
• It is interpreted data, created from organized, structured, and processed data
in a particular context.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people
or machines to increase its usefulness and add value for a particular
purpose.
• Data processing consists of the following basic steps - input,
processing, and output.
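To make the cycle concrete, here is a minimal Python sketch (the exam-score data and the pass/fail rule are invented for the example, not taken from the slides): raw facts go in, a processing step interprets them, and information comes out.

def process(scores):
    """Processing step: turn raw facts (scores) into information (pass/fail)."""
    return {name: ("pass" if score >= 50 else "fail") for name, score in scores.items()}

raw_data = {"Abebe": 78, "Sara": 45, "Chaltu": 90}   # input: unprocessed facts (hypothetical)
information = process(raw_data)                       # processing
print(information)                                    # output: interpreted data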
Data types and their representation
1. Data types from a computer programming perspective
• Integers (int): used to store whole numbers, mathematically known as integers.
• Booleans (bool): used to represent a value restricted to one of two states: true or false.
• Characters (char): used to store a single character.
• Floating-point numbers (float): used to store real numbers.
• Alphanumeric strings (string): used to store a combination of characters and numbers.
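The snippet below is an illustrative Python sketch of these types; the variable names and values are invented for the example, and since Python has no separate char type, a one-character string stands in for it.

count = 42                 # integer (int): whole number
is_valid = True            # boolean (bool): true or false
grade = "A"                # character: a one-character string (Python has no char type)
price = 19.99              # floating-point number (float): real number
student_id = "ETS0421/11"  # alphanumeric string: letters, digits, and symbols (hypothetical value)
print(type(count), type(is_valid), type(grade), type(price), type(student_id))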
2. Data types from a data analytics perspective
Structured Data
• Has a pre-defined data model.
• Straightforward to analyze.
• Placed in tabular format.
• Example: Excel files or SQL databases.
Unstructured Data
• Has no pre-defined data model.
• May contain data such as dates, numbers, and facts.
• Difficult to understand using traditional programs.
• Example: audio and video files.
Semi-structured Data
• Contains tags or other markers to separate semantic elements.
• Known as a self-describing structure.
• Example: JSON and XML.
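As a small illustration of semi-structured data, the sketch below parses a hypothetical JSON record with Python's standard json module; the keys act as the self-describing tags mentioned above.

import json

# Hypothetical semi-structured record: the keys (tags) describe the data itself.
record = '{"name": "Abebe", "age": 21, "courses": ["Math", "Physics"]}'

parsed = json.loads(record)            # parse the JSON text into a Python dict
print(parsed["name"], parsed["courses"])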
Metadata
• Metadata is not a separate data type, but it is one of the most important elements
for Big Data analysis and solutions.
• It is often described as data about data.
• In a set of photographs, for example, metadata could describe when and where the
photos were taken.
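A minimal sketch of what such metadata might look like in Python; all field names and values below are hypothetical, for illustration only.

# Hypothetical metadata describing a photograph ("data about data").
photo_metadata = {
    "file_name": "IMG_0042.jpg",
    "date_taken": "2020-03-15 14:02:11",
    "location": "Addis Ababa",
    "camera": "Canon EOS 1300D",
    "resolution": "5184x3456",
}
# The photo itself is the data; this dictionary only describes it.
print(photo_metadata["date_taken"], photo_metadata["location"])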
Data Value Chain
• The Data Value Chain is introduced to describe the information flow
within a big data system.
• It describes the full data lifecycle from collection to analysis and usage.
• The Big Data Value Chain identifies the following key high-level
activities: data acquisition, data analysis, data curation, data storage, and data usage.
Basic concepts of big data
• Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
• According to IBM, big data is characterized by the 3Vs and more:
• Volume (amount of data): dealing with large scales of data within
data processing (e.g. Global Supply Chains, Global Financial
Analysis, Large Hadron Collider).
• Velocity (speed of data): dealing with streams of high frequency of
incoming real-time data (e.g. Sensors, Pervasive Environments, Electronic
Trading, Internet of Things).
• Variety (range of data types/sources): dealing with data using differing
syntactic formats (e.g. Spreadsheets, XML, DBMS), schemas, and meanings
(e.g. Enterprise Data Integration).
• Veracity: can we trust the data? How accurate is it? etc.
Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Cluster computing refers to connecting many computers on a network so that they
perform like a single entity.
• Because of the qualities of big data, individual computers are often inadequate for handling
the data at most stages.
• Computer clusters are a better fit for the high storage and computational needs of
big data.
• Big data clustering software combines the resources of many smaller machines, seeking to
provide a number of benefits:
For example, suppose you have a big file containing more than 500 MB of data and you need to count
the number of words, but your computer has only 100 MB of memory. How can you handle it?
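One single-machine answer, sketched below, is to stream the file in small chunks instead of loading it all at once; cluster software generalizes the same idea by splitting the work across many machines. The file name and chunk size are assumptions for the example.

from collections import Counter

def count_words(path, chunk_size=1024 * 1024):
    """Count word frequencies in a file larger than memory by reading
    it one ~1 MB chunk at a time instead of loading it all at once."""
    counts = Counter()
    leftover = ""                        # a word that was cut at a chunk boundary
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            words = chunk.split()
            # keep the (possibly incomplete) last word for the next chunk
            if words and not chunk[-1].isspace():
                leftover = words.pop()
            else:
                leftover = ""
            counts.update(words)
    if leftover:
        counts.update([leftover])
    return counts

# total_words = sum(count_words("big_file.txt").values())  # hypothetical file name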
• Resource Pooling: Combining the available storage space to hold data
is a clear benefit, but CPU and memory pooling are also extremely
important.
• Processing large datasets requires large amounts of all three of these
resources.
• High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
• Cluster membership and resource allocation can be handled by software
like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
• Hadoop is an open-source framework intended to make interaction with big
data easier.
• It is a framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models.
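As a sketch of the "simple programming models" idea, the word-count mapper and reducer below are written in the style of Hadoop Streaming, where Hadoop pipes input splits to the mapper's stdin and grouped, sorted key/value pairs to the reducer's stdin; the job-submission commands themselves are omitted here.

import sys

def mapper():
    """Emit one tab-separated (word, 1) pair per word read from stdin."""
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    """Sum the counts for each word; input arrives grouped and sorted by word."""
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

# In practice, mapper() and reducer() would live in two separate scripts that
# Hadoop Streaming runs on every node of the cluster.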
The four key characteristics of Hadoop are:
• Economical: Its systems are highly economical as ordinary computers can be used for data
processing.
• Reliable: It is reliable as it stores copies of the data on different machines and is resistant
to hardware failure.
• Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in
scaling up the framework.
• Flexible: It is flexible, so you can store as much structured and unstructured data as you
need and decide how to use it later.
• Hadoop has an ecosystem that has evolved from its four core
components: data management, data access, data processing, and data
storage.
Hadoop ecosystem
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
• The first stage of Big Data processing is Ingest.
• The data is ingested or transferred to Hadoop from various sources
such as relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers
event data.
2. Processing the data in storage
• The second stage is Processing. In this stage, the data is stored and
processed.
• The data is stored in the distributed file system HDFS and in the NoSQL
distributed database HBase; Spark and MapReduce perform the data
processing.
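The sketch below shows the same word-count idea expressed as a Spark job over data stored in HDFS; the HDFS path and application name are hypothetical, and it assumes a reachable cluster with the pyspark package installed.

from pyspark import SparkContext

# Classic word count over a file stored in HDFS (path below is hypothetical).
sc = SparkContext(appName="WordCount")

counts = (
    sc.textFile("hdfs:///user/student/big_file.txt")   # read the file from HDFS
      .flatMap(lambda line: line.split())              # split each line into words
      .map(lambda word: (word, 1))                     # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)                 # sum the counts per word
)

for word, count in counts.take(10):                    # show a small sample of results
    print(word, count)

sc.stop()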
3. Computing and analyzing data
• The third stage is to Analyze. Here, the data is analyzed by processing
frameworks such as Pig, Hive, and Impala.
• Pig converts the data using map and reduce steps and then analyzes it.
• Hive is also based on map and reduce programming and is most
suitable for structured data.
4. Visualizing the results
• The fourth stage is Access, which is performed by tools such as Hue
and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
Thank you!!!