• Defining data science and big data
• Recognizing the different types of data
• Gaining insight into the data science process
Big data is a blanket term for any collection of data sets so large or
complex that it becomes difficult to process them using traditional
data management techniques such as relational database management
systems (RDBMSs). The widely adopted RDBMS has long been regarded
as a one-size-fits-all solution, but the
demands of handling big data have shown otherwise. Data
science involves using methods to analyze massive amounts of data
and extract the knowledge it contains. You can think of the
relationship between big data and data science as being like the
relationship between crude oil and an oil refinery. Data science and big
data evolved from statistics and traditional data management but are
now considered to be distinct disciplines.
The characteristics of big data are often referred to as the three Vs:
• Volume —How much data is there?
• Variety —How diverse are different types of data?
• Velocity —At what speed is new data generated?
Often these characteristics are complemented with a fourth V, veracity: How
accurate is the data? These four properties make big data different from the
data found in traditional data management tools. Consequently, the
challenges they bring can be felt in almost every aspect: data capture,
curation, storage, search, sharing, transfer, and visualization. In addition, big
data calls for specialized techniques to extract the insights.
Data science is an evolutionary extension of statistics capable of dealing with
the massive amounts of data produced today. It adds methods from
computer science to the repertoire of statistics.
The main things that set a data scientist apart from a statistician are the
ability to work with big data and experience in machine learning, computing,
and algorithm building. Their tools tend to differ too, with data scientist job
descriptions more frequently mentioning the ability to use Hadoop, Pig,
Spark, R, Python, and Java, among others.
BENEFITS AND USES OF DATA SCIENCE AND BIG DATA
Commercial companies in almost every industry use data science and big
data to gain insights into their customers, processes, staff, competition, and
products. Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their offerings.
Human resource professionals use people analytics and text mining to
screen candidates, monitor the mood of employees, and study informal
networks among coworkers.
Financial institutions use data science to predict stock markets, determine
the risk of lending money, and learn how to attract new clients for their
services.
A data scientist in a governmental organization gets to work on diverse
projects such as detecting fraud and other criminal activity or optimizing
project funding.
The rise of massive open online courses (MOOCs) produces a lot of data,
which allows universities to study how this type of learning can complement
traditional classes.
FACETS OF DATA (ALSO REFER TO THE PPT)
The main categories of data are these:
• Structured
• Unstructured
• Machine-generated
• Graph-based
• Audio, video, and images
Structured data
Structured data is data that depends on a data model and resides in a fixed
field within a record. As such, it’s often easy to store structured data in tables
within databases or Excel files.
Figure 1.1. An Excel table is an example of structured data.
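To make this concrete, here is a minimal Python sketch (pandas is assumed to be installed, and the file name is hypothetical) showing structured data as a fixed-schema table:

```python
# A minimal sketch of structured data using pandas (assumed installed).
# Each record has the same fixed fields, so it fits a table naturally.
import pandas as pd

people = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [34, 45, 29],
        "city": ["London", "Paris", "Berlin"],
    }
)

print(people)
# Because every record follows the same schema, storing it in a
# database or an Excel file is straightforward, e.g.:
# people.to_excel("people.xlsx", index=False)  # requires openpyxl
```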
The world isn’t naturally made up of structured data, though; structure is
imposed on it by humans and machines. More often, data comes unstructured.
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is
your regular email (figure 1.2). Although email contains structured elements
such as the sender, title, and body text, it’s a challenge to find the number of
people who have written an email complaint about a specific employee
because so many ways exist to refer to a person, for example. The thousands
of different languages and dialects out there further complicate this.
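As a small illustration, the Python standard library can extract an email's structured elements, while the free-text body resists simple queries; the message below is made up:

```python
# A sketch of why email is only partly structured: the headers parse
# cleanly, but the body is free text. Uses only the standard library.
from email import message_from_string

raw = """From: jane.doe@example.com
Subject: Complaint
To: hr@example.com

I want to report an issue with J. Smith (also known as 'John' or
'Mr. Smith'). The way he handled my request was unacceptable.
"""

msg = message_from_string(raw)
print(msg["From"])      # structured element: easy to extract
print(msg["Subject"])   # structured element: easy to extract

body = msg.get_payload()
# Finding complaints about a specific employee in the body is hard:
# "J. Smith", "John", and "Mr. Smith" all name the same person.
print("Smith" in body)  # True here, but a mail saying only "John" is missed
```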
Machine-generated data
Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention.
The analysis of machine data relies on highly scalable tools, due to its high
volume and speed. Examples of machine data are web server logs, call detail
records, network event logs, and telemetry (figure 1.3).
Figure 1.3. Example of machine-generated data
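As an illustration, here is a hedged Python sketch of parsing a single web server log line in the common log format (the log line itself is made up):

```python
# A sketch of handling machine-generated data: parsing one line of a
# web server access log with a regular expression.
import re

line = '203.0.113.7 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5123'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

match = pattern.match(line)
if match:
    print(match.group("ip"), match.group("status"), match.group("path"))
# At web-server volumes (millions of lines per day), this kind of
# parsing is typically run on scalable tools rather than one machine.
```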
Graph-based or network data
Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects. Graph structures use nodes, edges, and properties
to represent and store graph data.
Figure 1.4. Friends in a social network are an example of graph-
based data.
Graph databases are used to store graph-based data.
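As a small sketch, the friends-in-a-social-network example can be represented with the networkx library (an assumption; any graph library would do):

```python
# A minimal sketch of graph-based data using networkx (assumed
# installed): friends in a social network as nodes and edges.
import networkx as nx

g = nx.Graph()
g.add_edge("Alice", "Bob")    # Alice and Bob are friends
g.add_edge("Bob", "Carol")
g.add_edge("Alice", "Dave")

# The relationship ("who is adjacent to whom") is the data itself.
print(list(g.neighbors("Alice")))             # ['Bob', 'Dave']
print(nx.shortest_path(g, "Alice", "Carol"))  # ['Alice', 'Bob', 'Carol']
```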
Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in
pictures, turn out to be challenging for computers.
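One way to see why: to a program, an image is just an array of numbers. A sketch using Pillow and numpy (both assumed installed; photo.jpg is a hypothetical file):

```python
# A sketch of why images challenge computers: a picture arrives as a
# grid of numbers, not as recognizable objects.
import numpy as np
from PIL import Image

img = Image.open("photo.jpg")   # hypothetical file
pixels = np.asarray(img)

print(pixels.shape)  # e.g. (480, 640, 3): height, width, RGB channels
print(pixels[0, 0])  # the top-left pixel is just three numbers
# Recognizing an object in this grid is trivial for a human eye but
# requires sophisticated models for a computer.
```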
THE DATA SCIENCE PROCESS
The data science process typically consists of six steps: setting the
research goal, retrieving data, data preparation, data exploration, data
modeling, and presentation and automation.
Setting the research goal
Data science is mostly applied in the context of an organization. When the
business asks you to perform a data science project, you’ll first prepare a
project charter. This charter contains information such as what you’re going
to research, how the company benefits from that, what data and resources
you need, a timetable, and deliverables.
Retrieving data
The second step is to collect data. You’ve stated in the project charter which
data you need and where you can find it. In this step you ensure that you can
use the data in your program, which means checking the existence of the
data, its quality, and your access to it. Data can also be delivered by
third-party companies
and takes many forms ranging from Excel spreadsheets to different types of
databases.
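A minimal Python sketch of this step (pandas is assumed to be installed; the file, database, and table names are hypothetical):

```python
# A sketch of the retrieval step: check that the delivered data exists
# and is accessible before loading it.
import os
import sqlite3

import pandas as pd

# Data delivered as an Excel spreadsheet by a third party:
if os.path.exists("sales.xlsx"):
    sales = pd.read_excel("sales.xlsx")  # reading .xlsx requires openpyxl
    print(sales.shape)                   # quick check: rows and columns

# The same data might instead live in a database:
if os.path.exists("company.db"):
    conn = sqlite3.connect("company.db")
    customers = pd.read_sql("SELECT * FROM customers", conn)
    conn.close()
```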
Data preparation
Data collection is an error-prone process; in this phase you enhance the
quality of the data and prepare it for use in subsequent steps. This phase
consists of three subphases, sketched in the example below:
• Data cleansing removes false values from a data source and
inconsistencies across data sources.
• Data integration enriches data sources by combining information from
multiple data sources.
• Data transformation ensures that the data is in a suitable format for use
in your models.
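A small pandas sketch of the three subphases on made-up data:

```python
# A sketch of data preparation with pandas: cleansing, integration,
# and transformation on toy data.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [100.0, -1.0, -1.0, 250.0],  # -1 marks a false value
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["UK", "FR", "DE"],
})

# Data cleansing: remove false values and duplicate records.
orders = orders[orders["amount"] > 0].drop_duplicates()

# Data integration: enrich one source with another.
enriched = orders.merge(customers, on="customer_id")

# Data transformation: put the data in a model-friendly format,
# e.g. one-hot encode the country column.
model_input = pd.get_dummies(enriched, columns=["country"])
print(model_input)
```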
Data exploration
Data exploration is concerned with building a deeper understanding of your
data. You try to understand how variables interact with each other, the
distribution of the data, and whether there are outliers. To achieve this you
mainly use descriptive statistics, visual techniques, and simple modeling.
This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
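A minimal sketch of EDA with pandas and matplotlib (both assumed installed), on made-up data:

```python
# A sketch of exploratory data analysis: descriptive statistics plus a
# simple visual technique to spot an outlier.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"income": [21, 25, 30, 32, 35, 38, 41, 45, 200]})

# Descriptive statistics summarize the distribution:
print(df["income"].describe())  # the max of 200 stands out

# A box plot makes the outlier obvious at a glance:
df["income"].plot(kind="box")
plt.show()
```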
Data modeling or model building
In this phase you use models, domain knowledge, and insights about the data
you found in the previous steps to answer the research question. You select
a technique from the fields of statistics, machine learning, operations
research, and so on. Building a model is an iterative process that involves
selecting the variables for the model, executing the model, and model
diagnostics.
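A hedged sketch of this iterative loop using scikit-learn (an assumption; the data is made up):

```python
# A sketch of model building: select variables, execute the model,
# and run diagnostics, using scikit-learn (assumed installed).
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "ads": [10, 20, 30, 40, 50, 60, 70, 80],
    "price": [5, 5, 4, 4, 3, 3, 2, 2],
    "sales": [12, 22, 35, 41, 55, 58, 74, 79],
})

X = df[["ads", "price"]]  # variable selection
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # execute the model

# Model diagnostics: how well does it generalize? A poor score sends
# you back to variable selection, which makes the process iterative.
print(model.score(X_test, y_test))
```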
Presentation and automation
Finally, you present the results to your business. These results can take many
forms, ranging from presentations to research reports. Sometimes you’ll
need to automate the execution of the process because the business will
want to use the insights you gained in another project or enable an
operational process to use the outcome from your model.
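As a closing sketch, one common way to automate reuse of a result is to persist the fitted model so another process can score new data; this example assumes scikit-learn and joblib are installed, and the file name is made up:

```python
# A sketch of automating model reuse with joblib (assumed installed):
# save the fitted model at the end of the project, then reload it
# inside an operational process without rerunning the analysis.
from joblib import dump, load
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[1], [2], [3]], [2, 4, 6])
dump(model, "sales_model.joblib")   # saved when the project ends

# Later, in an automated or operational process:
reloaded = load("sales_model.joblib")
print(reloaded.predict([[4]]))      # ~8.0
```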