UNIT I DATA SCIENCE
Data Science
Data science is the in-depth study of massive amounts of data. It involves extracting meaningful
insights from raw, structured, and unstructured data using scientific methods, different
technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate the data so that you can
find something new and meaningful.
Data science uses powerful hardware, programming systems, and efficient algorithms to solve
data-related problems. It is the future of artificial intelligence.
In short, we can say that data science is all about:
o Asking the correct questions and analyzing the raw data.
o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and finding the final result.
Example:
Let us suppose we want to travel from station A to station B by car. Now, we need to make some
decisions, such as which route will get us to the destination fastest, which route is likely to have
no traffic jam, and which one will be cost-effective. All these decision factors act as input data,
and from them we arrive at an appropriate answer; this analysis of data is called data analysis,
which is a part of data science.
What is Data Science & Advantages and disadvantages of Data Science
Data science has become an essential part of almost every industry today. It is a method for
transforming business data into assets that help organizations improve revenue, reduce costs, seize
business opportunities, improve customer experience, and more. Data science is one of the most
discussed topics in industry these days. Its popularity has grown over the years, and companies
have started implementing data science techniques to grow their business and increase customer
satisfaction. Data science is the domain of study that deals with vast volumes of data, using
modern tools and techniques to find unseen patterns, derive meaningful information, and make
business decisions.
Advantages of Data Science :- In today’s world, data is being generated at an astonishing rate. Every
second, a huge amount of data is produced, whether from the users of Facebook or other social
networking sites, from the calls that people make, or from the data generated within different
organizations. Because of this huge amount of data, the field of Data Science has become highly
valuable and offers a number of advantages. Some of the advantages are mentioned below :-
• Multiple Job Options :- Being in demand, it has given rise to a large number of career
opportunities in its various fields. Some of them are Data Scientist, Data Analyst, Research
Analyst, Business Analyst, Analytics Manager, Big Data Engineer, etc.
• Business benefits :- Data Science helps organizations know how and when their products sell
best, so products can always be delivered to the right place at the right time. It also enables
organizations to take faster and better decisions, improving efficiency and earning higher profits.
• Highly paid jobs & career opportunities :- Data Scientist continues to be called the “sexiest job
of the 21st century,” and the salaries for this position are also substantial. According to a Dice
Salary Survey, the average annual salary of a Data Scientist is $106,000.
• Hiring benefits :- Data Science has made it comparatively easier to sort through data and look for
the best candidates for an organization. Big Data and data mining have made the processing and
selection of CVs, aptitude tests, and games easier for recruitment teams.
Disadvantages of Data Science :- Everything that comes with a number of benefits also has
some drawbacks. So let’s have a look at some of the disadvantages of Data Science :-
• Data Privacy :- Data is the core component that can increase the productivity and revenue of an
industry by enabling game-changing business decisions. However, the information or insights
obtained from the data can be misused against an organization, a group of people, or a
community. Information extracted from structured as well as unstructured data for further use can
likewise be misused against a group of people of a country or a community.
• Cost :- The tools used for data science and analytics can cost an organization a great deal, as
some of the tools are complex and require people to undergo training in order to use them. Also, it
is very difficult to select the right tools for the circumstances, because the selection depends on
proper knowledge of the tools as well as their accuracy in analyzing the data and extracting
information.
Data Science Facts :
A massive amount of data is produced every day as a result of the growth in the number
of mobile users, rising internet penetration rates, and the accessibility of different eCommerce
apps. Data science is the discipline responsible for gathering, processing, modeling, and
analyzing data in order to acquire a better understanding of it. Businesses use data science
to improve decision-making, boost revenues, and achieve growth.
Here are some updated facts connected to data science:
• If we take into account all of the data that is currently available internationally,
around 70% of it is user-generated, according to a DM News report.
User-generated content (UGC) includes all types of content, such as photos, videos, reels, text,
and audio, that is published anywhere online or on social media, including blogs, forums,
websites, and online reviews. These data science statistics let us
get a good understanding of how much data is produced globally and how unprepared we are to
process it.
• According to one estimate, 1.145 trillion megabytes of data are produced daily.
• Statista estimates that in the previous year (2021), around 79 zettabytes of data/information
were created, consumed, collected, and duplicated globally.
• According to forecasts made by CrowdFlower in its Data Scientist Report, text data makes up
91% of the data utilized in data science. According to the same survey, unstructured data
consists of 33% images, 11% audio, 15% video, and 20% other types of data in addition to
text.
• The global data sphere has 90% replicated data and 10% unique data.
• In the worldwide digital universe, between 80 and 90% of the data is unstructured, according
to one of the articles published on CIO.
• A user of the internet today would need 181 million years to download all the data from the
internet.
• In 2020, about two professionals joined LinkedIn per second.
• The United States had 2,670 data centers in 2021, the largest number of any country in the world.
• In 2020, according to Domo, people worldwide collectively generated almost 2.5 quintillion bytes
of data each day.
• According to the same report from DOMO, in 2020, each person generated around 1.7 MB of
data each second.
Let us now look at some of the Benefits of Data Science in 2023.
Data Science Benefits
There are several benefits of Data Science, and every major and minor company in the world
relies on its data to run its business. Let us look at some quick facts to understand better:
• The BCG-WEF project report details the findings that 72 percent of manufacturing
organizations use advanced data analytics to increase productivity.
• By 2025, the market for big data analytics in healthcare might be worth $67.82 billion.
• About 68% of international travel brands made significant investments in business
intelligence and predictive analytics capabilities in 2019, according to Statista Research
Department.
• By 2023, the big data analytics market is anticipated to grow to $103 billion.
• Around 1,400 colleges and universities worldwide use predictive analytics to improve low
graduation rates, redefine the college experience, and guide students down a direct, data-
driven road to graduation with fewer dead ends and wrong turns.
• 95% of companies say that managing unstructured data is a challenge for their industry.
• The competition in their industry has changed as a result of data analytics, according to
about 47% of McKinsey survey respondents, and data science has helped businesses gain a
competitive advantage.
What Is the Data Science Process?
The data science process is a systematic approach to solving a data problem. It provides a
structured framework for articulating your problem as a question, deciding how to solve it, and
then presenting the solution to stakeholders.
Data Science Life Cycle
Another term for the data science process is the data science life cycle. The terms can be used
interchangeably, and both describe a workflow process that begins with collecting data, and ends
with deploying a model that will hopefully answer your questions. The steps include:
Framing the Problem
Understanding and framing the problem is the first step of the data science life cycle. This
framing will help you build an effective model that will have a positive impact on your
organization.
Collecting Data
The next step is to collect the right set of data. High-quality, targeted data—and the mechanisms
to collect them—are crucial to obtaining meaningful results. Since much of the roughly 2.5
quintillion bytes of data created every day come in unstructured formats, you’ll likely need to
extract the data and export it into a usable format, such as a CSV or JSON file.
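To make this concrete, here is a minimal sketch showing how extracted records might be exported to CSV and JSON with Python's standard library. The record values and file names are only illustrative, not part of any particular project.

```python
import csv
import json

# A few extracted records (hypothetical example data) to persist in a usable format.
records = [
    {"user_id": 1, "channel": "web", "amount": 120.0},
    {"user_id": 2, "channel": "mobile", "amount": 75.5},
]

# Export to CSV: one row per record, column names taken from the dict keys.
with open("records.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

# Export to JSON: the whole list serialized as a JSON array.
with open("records.json", "w") as f:
    json.dump(records, f, indent=2)
```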
Cleaning Data
Most of the data you collect during the collection phase will be unstructured, irrelevant, and
unfiltered. Bad data produces bad results, so the accuracy and efficacy of your analysis will
depend heavily on the quality of your data.
Cleaning data eliminates duplicate and null values, corrupt data, inconsistent data types, invalid
entries, missing data, and improper formatting.
This step is the most time-intensive process, but finding and resolving flaws in your data is
essential to building effective models.
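As a rough illustration, the following pandas sketch shows typical cleaning operations: removing duplicates and null values, fixing types and formatting, and dropping invalid entries. The file and column names (channel, amount) are hypothetical and carried over from the collection example above.

```python
import pandas as pd

# Load the raw collected data (hypothetical file name).
df = pd.read_csv("records.csv")

# Drop exact duplicate rows and rows where every value is missing.
df = df.drop_duplicates()
df = df.dropna(how="all")

# Fill remaining gaps in a numeric column with a sensible default.
df["amount"] = df["amount"].fillna(df["amount"].median())

# Enforce consistent formatting and data types.
df["channel"] = df["channel"].str.strip().str.lower()
df["amount"] = df["amount"].astype(float)

# Remove obviously invalid entries (e.g., negative amounts).
df = df[df["amount"] >= 0]

df.to_csv("records_clean.csv", index=False)
```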
Exploratory Data Analysis (EDA)
Now that you have a large amount of organized, high-quality data, you can begin conducting
an exploratory data analysis (EDA). Effective EDA lets you uncover valuable insights that will
be useful in the next phase of the data science lifecycle.
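A minimal EDA sketch with pandas might look like the following; it assumes the cleaned file and columns from the previous step and simply prints a few summaries that hint at patterns worth modeling.

```python
import pandas as pd

df = pd.read_csv("records_clean.csv")

# Summary statistics for numeric columns (count, mean, std, quartiles).
print(df.describe())

# How many observations fall in each category, and the average amount per channel.
print(df["channel"].value_counts())
print(df.groupby("channel")["amount"].mean())

# Pairwise correlations between numeric columns suggest relationships
# worth examining in the modeling phase.
print(df.select_dtypes("number").corr())
```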
Model Building and Deployment
Next, you’ll do the actual data modeling. This is where you’ll use machine learning, statistical
models, and algorithms to extract high-value insights and predictions.
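As an illustrative sketch only, the following uses scikit-learn with a built-in toy dataset (rather than your project data) to show the fit-and-evaluate pattern this step relies on. Deployment would then typically mean serializing the trained model (for example with joblib) and serving it from an application or service.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small, well-known dataset in place of your cleaned project data.
X, y = load_iris(return_X_y=True)

# Hold out a test set so the evaluation reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a simple statistical model and evaluate it.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```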
The big data ecosystem and data science
Distributed file systems
A distributed file system is similar to a normal file system, except that it runs on multiple servers
at once. Because it is a file system, you can do almost all the same things you would do on a normal
file system. Actions such as storing, reading, and deleting files and adding security to files are at
the core of every file system, including distributed ones. Distributed file systems have some
significant advantages:
• They can contain files larger than any one computer disk.
• Files get automatically replicated across multiple servers for redundancy or parallel operations
while hiding the complexity of doing so from the user.
• The system scales easily: you are no longer bound by the memory or storage restrictions of a
single server.
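As a hedged illustration, the sketch below uses the pyarrow library to store, read, and delete a file on HDFS, the best-known distributed file system. The host, port, and path are assumptions, and pyarrow must be able to locate the Hadoop client libraries on your machine.

```python
from pyarrow import fs

# Connect to an HDFS cluster (host and port are assumptions; adjust to your cluster).
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# Store a file: the same call works whether the file is replicated
# across one server or many, because the file system hides that complexity.
with hdfs.open_output_stream("/data/example.txt") as f:
    f.write(b"hello distributed world\n")

# Read the file back.
with hdfs.open_input_stream("/data/example.txt") as f:
    print(f.read())

# Delete it, just like on a local file system.
hdfs.delete_file("/data/example.txt")
```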
Distributed programming framework
Once you have the data stored on the distributed file system, you want to exploit it. An important
aspect of working on a distributed hard disk is that you will not move your data to your program,
but rather you will move your program to the data. When you start from scratch with a normal
general-purpose programming language such as C, Python, or Java, you need to deal with the
complexities that come with distributed programming such as restarting jobs that have failed,
tracking the results from the different subprocesses, and so on. Luckily, the open-source
community has developed many frameworks to handle this for you and give you a much better
experience working with distributed data and dealing with many of the challenges it carries.
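For example, the classic word count in Apache Spark (PySpark) lets the framework ship the program to the nodes that hold the data, restart failed tasks, and collect the partial results. The input path below is an assumption; it could be a local file or an HDFS directory.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on a cluster, Spark distributes the work for you.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read text files from a distributed file system (path is illustrative).
lines = spark.sparkContext.textFile("hdfs:///data/docs/*.txt")

counts = (
    lines.flatMap(lambda line: line.split())   # split lines into words
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # sum the counts per word
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```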
Data integration framework
Once you have a distributed file system in place, you need to add some data. This means that you
need to move data from one source to another, and this is where the data integration frameworks
such as Apache Sqoop and Apache Flume excel. The process is similar to an extract, transform,
and load process in a traditional data warehouse.
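The sketch below is not Sqoop or Flume itself, but it illustrates the same extract, transform, and load pattern in Python, using a local SQLite database as a stand-in for the source system. The table and column names are hypothetical, and writing Parquet requires pyarrow or fastparquet to be installed.

```python
import sqlite3

import pandas as pd

# Extract: pull rows out of a source relational database.
with sqlite3.connect("source.db") as conn:
    orders = pd.read_sql("SELECT id, customer, amount, created_at FROM orders", conn)

# Transform: light cleanup before loading into the analytical store.
orders["created_at"] = pd.to_datetime(orders["created_at"])
orders = orders[orders["amount"] > 0]

# Load: write the result as a Parquet file, a format that distributed
# file systems and query engines handle well.
orders.to_parquet("warehouse/orders.parquet", index=False)
```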
Machine learning frameworks
When you have the data in place, it’s time to extract the coveted insights. This is where you rely
on the fields of machine learning, statistics, and applied mathematics. Before World War II,
everything needed to be calculated by hand, which severely limited the possibilities of data
analysis. After World War II, computers and scientific computing were developed; a single
computer could do all the counting and calculations, and a world of opportunities opened up. Ever
since this breakthrough, people only need to derive the mathematical formulas, write them as an
algorithm, and load their data into the software that does the work.
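As a small illustration of deriving a formula and writing it as an algorithm, the following sketch fits a straight line with the ordinary least-squares formulas using NumPy. The data points are invented for the example.

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 57.0, 61.0, 68.0, 72.0])

# Least-squares formulas for a straight line y = a*x + b,
# written out directly as an algorithm.
a = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - a * x.mean()

print(f"fitted line: y = {a:.2f}*x + {b:.2f}")
print("prediction for 6 hours of study:", a * 6 + b)
```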
NoSQL databases
If you need to store huge amounts of data, you require software that is specialized in managing
and querying this data. Traditionally this has been the playing field of relational databases such as
Oracle SQL, MySQL, Sybase IQ, and others. While they still are the go-to technology for many
use cases, new types of databases have emerged under the grouping of NoSQL databases.
The name of this group can be misleading as “No” in this context stands for “Not Only.” A lack
of functionality in SQL is not the biggest reason for the paradigm shift, and many of the NoSQL
databases have implemented a version of SQL themselves. But traditional databases had some
shortcomings that did not allow them to scale well. By solving some of the problems of traditional
databases, NoSQL databases allow for a virtually endless growth of data.
Many different types of databases have arisen, but they can be categorized into the following types:
• Column databases: Data is stored in columns, which allows some algorithms to perform much
faster queries. Newer technologies use cell-wise storage. Table-like structures are still very
important.
• Document stores: Document stores no longer use tables but store every observation in a
document. This allows for a much more flexible data scheme.
• Streaming data: Data is collected, transformed, and aggregated not in batches but in real time.
Although we have categorized it here as a database to help you in tool selection, it is more a
particular type of problem that drove the creation of technologies such as Storm.
• Key-value stores: Data is not stored in a table; rather, you assign a key to every value, such as
org.marketing.sales.2015: 20000. This scales very well but places almost all the
implementation on the developer (see the sketch after this list).
• SQL on Hadoop: Batch queries on Hadoop are expressed in a SQL-like language that uses the
map-reduce framework in the background.
• New SQL: This class combines the scalability of NoSQL databases with the advantages of a
relational database. They all have a SQL interface and a relational data model.
• Graph databases: Not every problem is best stored in a table. Some problems are more naturally
translated into graph theory and stored in graph databases. A classic example of this is a social
network.
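As a minimal sketch of the key-value idea, the following uses the redis-py client against a locally running Redis server; the running server and its address are assumptions, and the key is the illustrative one from the list above.

```python
import redis

# Connect to a locally running Redis server (host and port are assumptions).
r = redis.Redis(host="localhost", port=6379)

# Key-value stores have no tables: you choose a key and store a value under it.
r.set("org.marketing.sales.2015", 20000)

# Retrieving the value requires knowing the exact key; any richer structure
# (hierarchies, relations) is the developer's responsibility.
value = int(r.get("org.marketing.sales.2015"))
print(value)  # 20000
```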
Service programming
Suppose that you have made a world-class soccer prediction application on Hadoop, and you want
to allow others to use the predictions made by your application. However, you have no idea of the
architecture or technology of everyone keen on using your predictions. Service tools excel here by
exposing big data applications to other applications as a service. Data scientists sometimes need
to expose their models through services. The best-known example is the REST service, where REST
stands for Representational State Transfer; it is often used to feed websites with data.
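A minimal sketch of such a service, using Flask and a placeholder prediction function (the endpoint name and the prediction logic are hypothetical), might look like this. A client could then call, for example, GET /predict?home=Ajax&away=PSV over HTTP and receive the prediction as JSON, regardless of its own architecture or technology.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_score(home_team, away_team):
    # Placeholder for the real prediction model running on the cluster.
    return {"home": home_team, "away": away_team, "home_win_probability": 0.55}


# Expose the prediction as a REST endpoint: anything that speaks HTTP can consume it.
@app.route("/predict", methods=["GET"])
def predict():
    home = request.args.get("home")
    away = request.args.get("away")
    return jsonify(predict_score(home, away))


if __name__ == "__main__":
    app.run(port=5000)
```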
Security
Do you want everybody to have access to all of your data? Probably not, so you need to have
fine-grained control over access to the data, but you don't want to manage this on an application-by-
application basis. Big data security tools allow you to have central and fine-grained control over
access to the data. Keeping your data as secure as possible matters to you and to any business you
own, so that no one can gain unwanted access to your files. Big data security has become a topic in
its own right; data scientists will usually be confronted with it only as data consumers, and will
seldom implement the security themselves.