• Defining data science and big data
• Recognizing the different types of data
• Gaining insight into the data science process
Big data is a blanket term for any collection of data sets so large or
complex that it becomes difficult to process them using traditional
data management techniques such as relational database management
systems (RDBMSs). The widely adopted RDBMS has long been regarded
as a one-size-fits-all solution, but the
demands of handling big data have shown otherwise. Data
science involves using methods to analyze massive amounts of data
and extract the knowledge it contains. You can think of the
relationship between big data and data science as being like the
relationship between crude oil and an oil refinery. Data science and big
data evolved from statistics and traditional data management but are
now considered to be distinct disciplines.
The characteristics of big data are often referred to as the three Vs:
• Volume —How much data is there?
• Variety —How diverse are different types of data?
• Velocity —At what speed is new data generated?
Often these characteristics are complemented with a fourth V, veracity: How
accurate is the data? These four properties make big data different from the
data found in traditional data management tools. Consequently, the
challenges they bring can be felt in almost every aspect: data capture,
curation, storage, search, sharing, transfer, and visualization. In addition, big
data calls for specialized techniques to extract the insights.
Data science is an evolutionary extension of statistics capable of dealing with
the massive amounts of data produced today. It adds methods from
computer science to the repertoire of statistics.
The main things that set a data scientist apart from a statistician are the
ability to work with big data and experience in machine learning, computing,
and algorithm building. Their tools tend to differ too, with data scientist job
descriptions more frequently mentioning the ability to use Hadoop, Pig,
Spark, R, Python, and Java, among others.
BENEFITS AND USES OF DATA SCIENCE AND BIG DATA
Commercial companies in almost every industry use data science and big
data to gain insights into their customers, processes, staff, competition, and
products. Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their offerings.
Human resource professionals use people analytics and text mining to
screen candidates, monitor the mood of employees, and study informal
networks among coworkers.
Financial institutions use data science to predict stock markets, determine
the risk of lending money, and learn how to attract new clients for their
services.
A data scientist in a governmental organization gets to work on diverse
projects such as detecting fraud and other criminal activity or optimizing
project funding.
The rise of massive open online courses (MOOCs) produces a lot of data,
which allows universities to study how this type of learning can complement
traditional classes.
FACETS OF DATA (ALSO REFER TO THE PPT)
The main categories of data are these:
• Structured
• Unstructured
• Machine-generated
• Graph-based
• Audio, video, and images
Structured data
Structured data is data that depends on a data model and resides in a fixed
field within a record. As such, it’s often easy to store structured data in tables
within databases or Excel files.
Figure 1.1. An Excel table is an example of structured data.
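To make this concrete, here is a minimal Python sketch (pandas is assumed to be installed, and the file name is hypothetical) showing structured data as a fixed-schema table:

```python
# A minimal sketch of structured data using pandas (assumed installed).
# Each record has the same fixed fields, so it fits a table naturally.
import pandas as pd

people = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [34, 45, 29],
        "city": ["London", "Paris", "Berlin"],
    }
)

print(people)
# Because every record follows the same schema, storing it in a
# database or an Excel file is straightforward, e.g.:
# people.to_excel("people.xlsx", index=False)  # requires openpyxl
```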
The world isn’t naturally made up of structured data, though; structure is
imposed on it by humans and machines. More often, data comes unstructured.
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is
your regular email (figure 1.2). Although email contains structured elements
such as the sender, title, and body text, it’s a challenge to find the number of
people who have written an email complaint about a specific employee
because so many ways exist to refer to a person, for example. The thousands
of different languages and dialects out there further complicate this.
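As a small illustration, the Python standard library can extract an email's structured elements, while the free-text body resists simple queries; the message below is made up:

```python
# A sketch of why email is only partly structured: the headers parse
# cleanly, but the body is free text. Uses only the standard library.
from email import message_from_string

raw = """From: jane.doe@example.com
Subject: Complaint
To: hr@example.com

I want to report an issue with J. Smith (also known as 'John' or
'Mr. Smith'). The way he handled my request was unacceptable.
"""

msg = message_from_string(raw)
print(msg["From"])      # structured element: easy to extract
print(msg["Subject"])   # structured element: easy to extract

body = msg.get_payload()
# Finding complaints about a specific employee in the body is hard:
# "J. Smith", "John", and "Mr. Smith" all name the same person.
print("Smith" in body)  # True here, but a mail saying only "John" is missed
```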
Machine-generated data
Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention.
The analysis of machine data relies on highly scalable tools, due to its high
volume and speed. Examples of machine data are web server logs, call detail
records, network event logs, and telemetry (figure 1.3).
Figure 1.3. Example of machine-generated data
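As an illustration, here is a hedged Python sketch of parsing a single web server log line in the common log format (the log line itself is made up):

```python
# A sketch of handling machine-generated data: parsing one line of a
# web server access log with a regular expression.
import re

line = '203.0.113.7 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5123'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

match = pattern.match(line)
if match:
    print(match.group("ip"), match.group("status"), match.group("path"))
# At web-server volumes (millions of lines per day), this kind of
# parsing is typically run on scalable tools rather than one machine.
```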
Graph-based or network data
Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects. Graph structures use nodes, edges, and properties
to represent and store graph data.
Figure 1.4. Friends in a social network are an example of graph-
based data.
Graph databases are used to store graph-based data.
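As a small sketch, the friends-in-a-social-network example can be represented with the networkx library (an assumption; any graph library would do):

```python
# A minimal sketch of graph-based data using networkx (assumed
# installed): friends in a social network as nodes and edges.
import networkx as nx

g = nx.Graph()
g.add_edge("Alice", "Bob")    # Alice and Bob are friends
g.add_edge("Bob", "Carol")
g.add_edge("Alice", "Dave")

# The relationship ("who is adjacent to whom") is the data itself.
print(list(g.neighbors("Alice")))             # ['Bob', 'Dave']
print(nx.shortest_path(g, "Alice", "Carol"))  # ['Alice', 'Bob', 'Carol']
```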
Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in
pictures, turn out to be challenging for computers.
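One way to see why: to a program, an image is just an array of numbers. A sketch using Pillow and numpy (both assumed installed; photo.jpg is a hypothetical file):

```python
# A sketch of why images challenge computers: a picture arrives as a
# grid of numbers, not as recognizable objects.
import numpy as np
from PIL import Image

img = Image.open("photo.jpg")   # hypothetical file
pixels = np.asarray(img)

print(pixels.shape)  # e.g. (480, 640, 3): height, width, RGB channels
print(pixels[0, 0])  # the top-left pixel is just three numbers
# Recognizing an object in this grid is trivial for a human eye but
# requires sophisticated models for a computer.
```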
THE DATA SCIENCE PROCESS
The data science process typically consists of six steps: setting the
research goal, retrieving data, data preparation, data exploration, data
modeling, and presentation and automation.
Setting the research goal
Data science is mostly applied in the context of an organization. When the
business asks you to perform a data science project, you’ll first prepare a
project charter. This charter contains information such as what you’re going
to research, how the company benefits from that, what data and resources
you need, a timetable, and deliverables.
Retrieving data
The second step is to collect data. You’ve stated in the project charter which
data you need and where you can find it. In this step you ensure that you can
use the data in your program, which means checking the existence of the
data, its quality, and your access to it. Data can also be delivered by
third-party companies
and takes many forms ranging from Excel spreadsheets to different types of
databases.
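A minimal Python sketch of this step (pandas is assumed to be installed; the file, database, and table names are hypothetical):

```python
# A sketch of the retrieval step: check that the delivered data exists
# and is accessible before loading it.
import os
import sqlite3

import pandas as pd

# Data delivered as an Excel spreadsheet by a third party:
if os.path.exists("sales.xlsx"):
    sales = pd.read_excel("sales.xlsx")  # reading .xlsx requires openpyxl
    print(sales.shape)                   # quick check: rows and columns

# The same data might instead live in a database:
if os.path.exists("company.db"):
    conn = sqlite3.connect("company.db")
    customers = pd.read_sql("SELECT * FROM customers", conn)
    conn.close()
```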
Data preparation
Data collection is an error-prone process; in this phase you enhance the
quality of the data and prepare it for use in subsequent steps. This phase
consists of three subphases, sketched in the example below:
• Data cleansing removes false values from a data source and
inconsistencies across data sources.
• Data integration enriches data sources by combining information from
multiple data sources.
• Data transformation ensures that the data is in a suitable format for use
in your models.
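A small pandas sketch of the three subphases on made-up data:

```python
# A sketch of data preparation with pandas: cleansing, integration,
# and transformation on toy data.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [100.0, -1.0, -1.0, 250.0],  # -1 marks a false value
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["UK", "FR", "DE"],
})

# Data cleansing: remove false values and duplicate records.
orders = orders[orders["amount"] > 0].drop_duplicates()

# Data integration: enrich one source with another.
enriched = orders.merge(customers, on="customer_id")

# Data transformation: put the data in a model-friendly format,
# e.g. one-hot encode the country column.
model_input = pd.get_dummies(enriched, columns=["country"])
print(model_input)
```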
Data exploration
Data exploration is concerned with building a deeper understanding of your
data. You try to understand how variables interact with each other, the
distribution of the data, and whether there are outliers. To achieve this you
mainly use descriptive statistics, visual techniques, and simple modeling.
This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
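A minimal sketch of EDA with pandas and matplotlib (both assumed installed), on made-up data:

```python
# A sketch of exploratory data analysis: descriptive statistics plus a
# simple visual technique to spot an outlier.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"income": [21, 25, 30, 32, 35, 38, 41, 45, 200]})

# Descriptive statistics summarize the distribution:
print(df["income"].describe())  # the max of 200 stands out

# A box plot makes the outlier obvious at a glance:
df["income"].plot(kind="box")
plt.show()
```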
Data modeling or model building
In this phase you use models, domain knowledge, and insights about the data
you found in the previous steps to answer the research question. You select
a technique from the fields of statistics, machine learning, operations
research, and so on. Building a model is an iterative process that involves
selecting the variables for the model, executing the model, and model
diagnostics.
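A hedged sketch of this iterative loop using scikit-learn (an assumption; the data is made up):

```python
# A sketch of model building: select variables, execute the model,
# and run diagnostics, using scikit-learn (assumed installed).
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "ads": [10, 20, 30, 40, 50, 60, 70, 80],
    "price": [5, 5, 4, 4, 3, 3, 2, 2],
    "sales": [12, 22, 35, 41, 55, 58, 74, 79],
})

X = df[["ads", "price"]]  # variable selection
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # execute the model

# Model diagnostics: how well does it generalize? A poor score sends
# you back to variable selection, which makes the process iterative.
print(model.score(X_test, y_test))
```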
Presentation and automation
Finally, you present the results to your business. These results can take many
forms, ranging from presentations to research reports. Sometimes you’ll
need to automate the execution of the process because the business will
want to use the insights you gained in another project or enable an
operational process to use the outcome from your model.
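As a closing sketch, one common way to automate reuse of a result is to persist the fitted model so another process can score new data; this example assumes scikit-learn and joblib are installed, and the file name is made up:

```python
# A sketch of automating model reuse with joblib (assumed installed):
# save the fitted model at the end of the project, then reload it
# inside an operational process without rerunning the analysis.
from joblib import dump, load
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[1], [2], [3]], [2, 4, 6])
dump(model, "sales_model.joblib")   # saved when the project ends

# Later, in an automated or operational process:
reloaded = load("sales_model.joblib")
print(reloaded.predict([[4]]))      # ~8.0
```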