UNIT I DATA SCIENCE
Data Science
Data science is the in-depth study of massive amounts of data. It involves extracting meaningful
insights from raw, structured, and unstructured data using scientific methods, different
technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate the data so that you can
find something new and meaningful.
Data science uses powerful hardware, programming systems, and efficient algorithms to solve
data-related problems. It is the future of artificial intelligence.
In short, we can say that data science is all about:
o Asking the correct questions and analyzing the raw data.
o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and finding the final result.
Example:
Let us suppose we want to travel from station A to station B by car. Now, we need to make some
decisions, such as which route will get us to the destination fastest, which route is likely to have
no traffic jam, and which one will be cost-effective. All these decision factors act as input data,
and from them we arrive at an appropriate answer; this analysis of data is called data analysis,
which is a part of data science.
What is Data Science & Advantages and disadvantages of Data Science
Data science has become an essential part of almost every industry today. It is a method for
transforming business data into assets that help organizations improve revenue, reduce costs, seize
business opportunities, improve customer experience, and more. Data science is one of the most
discussed topics in industry these days. Its popularity has grown over the years, and companies
have started implementing data science techniques to grow their business and increase customer
satisfaction. Data science is the domain of study that deals with vast volumes of data, using
modern tools and techniques to find unseen patterns, derive meaningful information, and make
business decisions.
Advantages of Data Science :- In today’s world, data is being generated at an astonishing rate. Every
second, a huge amount of data is produced, whether from the users of Facebook or other social
networking sites, from the calls that people make, or from the data generated within different
organizations. Because of this huge amount of data, the field of Data Science has become highly
valuable and offers a number of advantages. Some of the advantages are mentioned below :-
• Multiple Job Options :- Being in demand, it has given rise to a large number of career
opportunities in its various fields. Some of them are Data Scientist, Data Analyst, Research
Analyst, Business Analyst, Analytics Manager, Big Data Engineer, etc.
• Business benefits :- Data Science helps organizations know how and when their products sell
best, so products can always be delivered to the right place at the right time. It also enables
organizations to take faster and better decisions, improving efficiency and earning higher profits.
• Highly paid jobs & career opportunities :- Data Scientist continues to be called the “sexiest job
of the 21st century,” and the salaries for this position are also substantial. According to a Dice
Salary Survey, the average annual salary of a Data Scientist is $106,000.
• Hiring benefits :- Data Science has made it comparatively easier to sort through data and look for
the best candidates for an organization. Big Data and data mining have made the processing and
selection of CVs, aptitude tests, and games easier for recruitment teams.
Disadvantages of Data Science :- Everything that comes with a number of benefits also has
some drawbacks. So let’s have a look at some of the disadvantages of Data Science :-
• Data Privacy :- Data is the core component that can increase the productivity and revenue of an
industry by enabling game-changing business decisions. However, the information or insights
obtained from the data can be misused against an organization, a group of people, or a
community. Information extracted from structured as well as unstructured data for further use can
likewise be misused against a group of people of a country or a community.
• Cost :- The tools used for data science and analytics can cost an organization a great deal, as
some of the tools are complex and require people to undergo training in order to use them. Also, it
is very difficult to select the right tools for the circumstances, because the selection depends on
proper knowledge of the tools as well as their accuracy in analyzing the data and extracting
information.
Data Science Facts :
A massive amount of data is produced every day as a result of the growth in the number
of mobile users, rising internet penetration rates, and the accessibility of different eCommerce
apps. Data science is the discipline responsible for gathering, processing, modeling, and
analyzing data in order to acquire a better understanding of it. Businesses use data science
to improve decision-making, boost revenues, and achieve growth.
Here are some updated facts connected to data science:
• If we take into account all of the data that is currently available internationally,
around 70% of it is user-generated, according to a DM News report.
User-generated content (UGC) includes all types of content, such as photos, videos, reels, text,
and audio, that is published anywhere online or on social media, including blogs, forums,
websites, and online reviews. These data science statistics let us
get a good understanding of how much data is produced globally and how unprepared we are to
process it.
• According to one estimate, 1.145 trillion megabytes of data are produced daily.
• Statista estimates that in the previous year (2021), around 79 zettabytes of data/information
were created, consumed, collected, and duplicated globally.
• According to forecasts made by CrowdFlower in its Data Scientist Report, text data makes up
91% of the data utilized in data science. According to the same survey, unstructured data
consists of 33% images, 11% audio, 15% video, and 20% other types of data in addition to
text.
• The global data sphere has 90% replicated data and 10% unique data.
• In the worldwide digital universe, between 80 and 90% of the data is unstructured, according
to one of the articles published on CIO.
• A user of the internet today would need 181 million years to download all the data from the
internet.
• In 2020, about two professionals joined LinkedIn per second.
• The United States had 2,670 data centers in 2021, the largest number of any country in the world.
• In 2020, according to Domo, people worldwide collectively generated almost 2.5 quintillion bytes
of data each day.
• According to the same report from DOMO, in 2020, each person generated around 1.7 MB of
data each second.
Let us now look at some of the Benefits of Data Science in 2023.
Data Science Benefits
There are several benefits of Data Science, and every major and minor company in the world
relies on its data to run its business. Let us look at some quick facts to understand better:
• The BCG-WEF project report details the findings that 72 percent of manufacturing
organizations use advanced data analytics to increase productivity.
• By 2025, the market for big data analytics in healthcare might be worth $67.82 billion.
• About 68% of international travel brands made significant investments in business
intelligence and predictive analytics capabilities in 2019, according to Statista Research
Department.
• By 2023, the big data analytics market is anticipated to grow to $103 billion.
• Around 1,400 colleges and universities worldwide use predictive analytics to improve low
graduation rates, redefine the college experience, and guide students down a direct, data-
driven road to graduation with fewer dead ends and wrong turns.
• 95% of companies say that managing unstructured data is a challenge for their industry.
• The competition in their industry has changed as a result of data analytics, according to
about 47% of McKinsey survey respondents, and data science has helped businesses gain a
competitive advantage.
What Is the Data Science Process?
The data science process is a systematic approach to solving a data problem. It provides a
structured framework for articulating your problem as a question, deciding how to solve it, and
then presenting the solution to stakeholders.
Data Science Life Cycle
Another term for the data science process is the data science life cycle. The terms can be used
interchangeably, and both describe a workflow process that begins with collecting data, and ends
with deploying a model that will hopefully answer your questions. The steps include:
Framing the Problem
Understanding and framing the problem is the first step of the data science life cycle. This
framing will help you build an effective model that will have a positive impact on your
organization.
Collecting Data
The next step is to collect the right set of data. High-quality, targeted data—and the mechanisms
to collect them—are crucial to obtaining meaningful results. Since much of the roughly 2.5
quintillion bytes of data created every day come in unstructured formats, you’ll likely need to
extract the data and export it into a usable format, such as a CSV or JSON file.
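To make this concrete, here is a minimal sketch showing how extracted records might be exported to CSV and JSON with Python's standard library. The record values and file names are only illustrative, not part of any particular project.

```python
import csv
import json

# A few extracted records (hypothetical example data) to persist in a usable format.
records = [
    {"user_id": 1, "channel": "web", "amount": 120.0},
    {"user_id": 2, "channel": "mobile", "amount": 75.5},
]

# Export to CSV: one row per record, column names taken from the dict keys.
with open("records.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

# Export to JSON: the whole list serialized as a JSON array.
with open("records.json", "w") as f:
    json.dump(records, f, indent=2)
```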
Cleaning Data
Most of the data you collect during the collection phase will be unstructured, irrelevant, and
unfiltered. Bad data produces bad results, so the accuracy and efficacy of your analysis will
depend heavily on the quality of your data.
Cleaning data eliminates duplicate and null values, corrupt data, inconsistent data types, invalid
entries, missing data, and improper formatting.
This step is the most time-intensive process, but finding and resolving flaws in your data is
essential to building effective models.
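As a rough illustration, the following pandas sketch shows typical cleaning operations: removing duplicates and null values, fixing types and formatting, and dropping invalid entries. The file and column names (channel, amount) are hypothetical and carried over from the collection example above.

```python
import pandas as pd

# Load the raw collected data (hypothetical file name).
df = pd.read_csv("records.csv")

# Drop exact duplicate rows and rows where every value is missing.
df = df.drop_duplicates()
df = df.dropna(how="all")

# Fill remaining gaps in a numeric column with a sensible default.
df["amount"] = df["amount"].fillna(df["amount"].median())

# Enforce consistent formatting and data types.
df["channel"] = df["channel"].str.strip().str.lower()
df["amount"] = df["amount"].astype(float)

# Remove obviously invalid entries (e.g., negative amounts).
df = df[df["amount"] >= 0]

df.to_csv("records_clean.csv", index=False)
```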
Exploratory Data Analysis (EDA)
Now that you have a large amount of organized, high-quality data, you can begin conducting
an exploratory data analysis (EDA). Effective EDA lets you uncover valuable insights that will
be useful in the next phase of the data science lifecycle.
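A minimal EDA sketch with pandas might look like the following; it assumes the cleaned file and columns from the previous step and simply prints a few summaries that hint at patterns worth modeling.

```python
import pandas as pd

df = pd.read_csv("records_clean.csv")

# Summary statistics for numeric columns (count, mean, std, quartiles).
print(df.describe())

# How many observations fall in each category, and the average amount per channel.
print(df["channel"].value_counts())
print(df.groupby("channel")["amount"].mean())

# Pairwise correlations between numeric columns suggest relationships
# worth examining in the modeling phase.
print(df.select_dtypes("number").corr())
```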
Model Building and Deployment
Next, you’ll do the actual data modeling. This is where you’ll use machine learning, statistical
models, and algorithms to extract high-value insights and predictions.
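As an illustrative sketch only, the following uses scikit-learn with a built-in toy dataset (rather than your project data) to show the fit-and-evaluate pattern this step relies on. Deployment would then typically mean serializing the trained model (for example with joblib) and serving it from an application or service.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small, well-known dataset in place of your cleaned project data.
X, y = load_iris(return_X_y=True)

# Hold out a test set so the evaluation reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a simple statistical model and evaluate it.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```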
The big data ecosystem and data science
Distributed file systems
A distributed file system is similar to a normal file system, except that it runs on multiple servers
at once. Because it is a file system, you can do almost all the same things you would do on a normal
file system. Actions such as storing, reading, and deleting files and adding security to files are at
the core of every file system, including distributed ones. Distributed file systems have some
significant advantages:
• They can contain files larger than any one computer disk.
• Files get automatically replicated across multiple servers for redundancy or parallel operations
while hiding the complexity of doing so from the user.
• The system scales easily: you are no longer bound by the memory or storage restrictions of a
single server.
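As a hedged illustration, the sketch below uses the pyarrow library to store, read, and delete a file on HDFS, the best-known distributed file system. The host, port, and path are assumptions, and pyarrow must be able to locate the Hadoop client libraries on your machine.

```python
from pyarrow import fs

# Connect to an HDFS cluster (host and port are assumptions; adjust to your cluster).
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# Store a file: the same call works whether the file is replicated
# across one server or many, because the file system hides that complexity.
with hdfs.open_output_stream("/data/example.txt") as f:
    f.write(b"hello distributed world\n")

# Read the file back.
with hdfs.open_input_stream("/data/example.txt") as f:
    print(f.read())

# Delete it, just like on a local file system.
hdfs.delete_file("/data/example.txt")
```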
Distributed programming framework
Once you have the data stored on the distributed file system, you want to exploit it. An important
aspect of working on a distributed hard disk is that you will not move your data to your program,
but rather you will move your program to the data. When you start from scratch with a normal
general-purpose programming language such as C, Python, or Java, you need to deal with the
complexities that come with distributed programming such as restarting jobs that have failed,
tracking the results from the different subprocesses, and so on. Luckily, the open-source
community has developed many frameworks to handle this for you and give you a much better
experience working with distributed data and dealing with many of the challenges it carries.
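For example, the classic word count in Apache Spark (PySpark) lets the framework ship the program to the nodes that hold the data, restart failed tasks, and collect the partial results. The input path below is an assumption; it could be a local file or an HDFS directory.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on a cluster, Spark distributes the work for you.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read text files from a distributed file system (path is illustrative).
lines = spark.sparkContext.textFile("hdfs:///data/docs/*.txt")

counts = (
    lines.flatMap(lambda line: line.split())   # split lines into words
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # sum the counts per word
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```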
Data integration framework
Once you have a distributed file system in place, you need to add some data. This means that you
need to move data from one source to another, and this is where the data integration frameworks
such as Apache Sqoop and Apache Flume excel. The process is similar to an extract, transform,
and load process in a traditional data warehouse.
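The sketch below is not Sqoop or Flume itself, but it illustrates the same extract, transform, and load pattern in Python, using a local SQLite database as a stand-in for the source system. The table and column names are hypothetical, and writing Parquet requires pyarrow or fastparquet to be installed.

```python
import sqlite3

import pandas as pd

# Extract: pull rows out of a source relational database.
with sqlite3.connect("source.db") as conn:
    orders = pd.read_sql("SELECT id, customer, amount, created_at FROM orders", conn)

# Transform: light cleanup before loading into the analytical store.
orders["created_at"] = pd.to_datetime(orders["created_at"])
orders = orders[orders["amount"] > 0]

# Load: write the result as a Parquet file, a format that distributed
# file systems and query engines handle well.
orders.to_parquet("warehouse/orders.parquet", index=False)
```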
Machine learning frameworks
When you have the data in place, it’s time to extract the coveted insights. This is where you rely
on the fields of machine learning, statistics, and applied mathematics. Before World War II,
everything needed to be calculated by hand, which severely limited the possibilities of data
analysis. After World War II, computers and scientific computing were developed; a single
computer could do all the counting and calculations, and a world of opportunities opened up. Ever
since this breakthrough, people only need to derive the mathematical formulas, write them as an
algorithm, and load their data into the software that does the work.
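As a small illustration of deriving a formula and writing it as an algorithm, the following sketch fits a straight line with the ordinary least-squares formulas using NumPy. The data points are invented for the example.

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 57.0, 61.0, 68.0, 72.0])

# Least-squares formulas for a straight line y = a*x + b,
# written out directly as an algorithm.
a = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - a * x.mean()

print(f"fitted line: y = {a:.2f}*x + {b:.2f}")
print("prediction for 6 hours of study:", a * 6 + b)
```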
NoSQL databases
If you need to store huge amounts of data, you require software that is specialized in managing
and querying this data. Traditionally this has been the playing field of relational databases such as
Oracle SQL, MySQL, Sybase IQ, and others. While they still are the go-to technology for many
use cases, new types of databases have emerged under the grouping of NoSQL databases.
The name of this group can be misleading as “No” in this context stands for “Not Only.” A lack
of functionality in SQL is not the biggest reason for the paradigm shift, and many of the NoSQL
databases have implemented a version of SQL themselves. But traditional databases had some
shortcomings that did not allow them to scale well. By solving some of the problems of traditional
databases, NoSQL databases allow for a virtually endless growth of data.
Many different types of databases have arisen, but they can be categorized into the following types:
• Column databases: Data is stored in columns, which allows some algorithms to perform much
faster queries. Newer technologies use cell-wise storage. Table-like structures are still very
important.
• Document stores: Document stores no longer use tables but store every observation in a
document. This allows for a much more flexible data scheme.
• Streaming data: Data is collected, transformed, and aggregated not in batches but in real time.
Although we have categorized it here as a database to help you in tool selection, it is more a
particular type of problem that drove the creation of technologies such as Storm.
• Key-value stores: Data is not stored in a table; rather, you assign a key to every value, such as
org.marketing.sales.2015: 20000. This scales very well but places almost all the
implementation on the developer (see the sketch after this list).
• SQL on Hadoop: Batch queries on Hadoop are expressed in a SQL-like language that uses the
map-reduce framework in the background.
• New SQL: This class combines the scalability of NoSQL databases with the advantages of a
relational database. They all have a SQL interface and a relational data model.
• Graph databases: Not every problem is best stored in a table. Some problems are more naturally
translated into graph theory and stored in graph databases. A classic example of this is a social
network.
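As a minimal sketch of the key-value idea, the following uses the redis-py client against a locally running Redis server; the running server and its address are assumptions, and the key is the illustrative one from the list above.

```python
import redis

# Connect to a locally running Redis server (host and port are assumptions).
r = redis.Redis(host="localhost", port=6379)

# Key-value stores have no tables: you choose a key and store a value under it.
r.set("org.marketing.sales.2015", 20000)

# Retrieving the value requires knowing the exact key; any richer structure
# (hierarchies, relations) is the developer's responsibility.
value = int(r.get("org.marketing.sales.2015"))
print(value)  # 20000
```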
Service programming
Suppose that you have made a world-class soccer prediction application on Hadoop, and you want
to allow others to use the predictions made by your application. However, you have no idea of the
architecture or technology of everyone keen on using your predictions. Service tools excel here by
exposing big data applications to other applications as a service. Data scientists sometimes need
to expose their models through services. The best-known example is the REST service, where REST
stands for Representational State Transfer; it is often used to feed websites with data.
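A minimal sketch of such a service, using Flask and a placeholder prediction function (the endpoint name and the prediction logic are hypothetical), might look like this. A client could then call, for example, GET /predict?home=Ajax&away=PSV over HTTP and receive the prediction as JSON, regardless of its own architecture or technology.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_score(home_team, away_team):
    # Placeholder for the real prediction model running on the cluster.
    return {"home": home_team, "away": away_team, "home_win_probability": 0.55}


# Expose the prediction as a REST endpoint: anything that speaks HTTP can consume it.
@app.route("/predict", methods=["GET"])
def predict():
    home = request.args.get("home")
    away = request.args.get("away")
    return jsonify(predict_score(home, away))


if __name__ == "__main__":
    app.run(port=5000)
```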
Security
Do you want everybody to have access to all of your data? Probably not, so you need to have
fine-grained control over access to the data, but you don't want to manage this on an application-by-
application basis. Big data security tools allow you to have central and fine-grained control over
access to the data. Keeping your data as secure as possible matters to you and to any business you
own, so that no one can gain unwanted access to your files. Big data security has become a topic in
its own right; data scientists will usually be confronted with it only as data consumers, and will
seldom implement the security themselves.