
UNIT-1 FUNDAMENTALS OF DATA SCIENCE: INTRODUCTION

CONTENTS TO BE COVERED IN THIS UNIT

 INTRODUCTION: WHAT IS DATA SCIENCE?
 BIG DATA AND DATA SCIENCE HYPE
 GETTING PAST THE HYPE
 WHY NOW?
 DATAFICATION
 CURRENT LANDSCAPE OF PERSPECTIVES
 A DATA SCIENCE PROFILE
 STATISTICAL INFERENCE
 POPULATIONS AND SAMPLES
 POPULATIONS AND SAMPLES OF BIG DATA
 BIG DATA CAN MEAN BIG ASSUMPTIONS
 MODELING
 PHILOSOPHY OF EXPLORATORY DATA ANALYSIS
 THE DATA SCIENCE PROCESS
 A DATA SCIENTIST'S ROLE IN THIS PROCESS
 QUESTION BANK

INTRODUCTION: WHAT IS DATA SCIENCE?

First, let's start by understanding what data science is.

1. Have you ever wondered how Amazon and eBay suggest items for you to buy?
2. How does Gmail filter your emails into spam and non-spam categories?
3. How does Netflix predict the shows you will like?

How do they do it? These are questions we ponder from time to time. In reality, such tasks would be impossible without the availability of data. Data science is all about using data to solve problems.


The problem could be decision making, such as identifying which email is spam and which is not; a product recommendation, such as which movie to watch; or predicting an outcome, such as who will be the next President of the USA. So the core job of a data scientist is to understand the data, extract useful information from it, and apply it to solving problems.

Over the past few years, there’s been a lot of hype in the media about “Data
Science” and “Big Data”. Today, data rules the world. This has resulted in a huge
demand for Data Scientists.

A Data Scientist helps companies make data-driven decisions to improve their business. Data Science is a field that deals with structured, semi-structured, and unstructured data. It involves practices like Data Cleaning, Data Preparation, Data Analysis, and much more.

DEFINITION: Data Science is the combination of statistics, mathematics, programming, and problem solving; capturing data in ingenious ways; the ability to look at things differently; and the activity of cleaning, preparing, and aligning data. It encompasses the various techniques used to extract insights and information from data.

ADVANTAGES:

 It helps in making business decisions, such as assessing the health of companies with whom one plans to collaborate.
 It may help in making better predictions for the future, such as building a company's strategic plans based on present trends.
 It may identify similarities among various data patterns, leading to applications like fraud detection, targeted marketing, etc.

APPLICATIONS:

 Route planning: to discover the best routes to ship goods
 To foresee delays for flights, ships, trains, etc. (through predictive analysis)
 To create promotional offers
 To find the best-suited time to deliver goods
 To forecast the next year's revenue for a company
 To analyze the health benefits of training
 To predict who will win elections

Big Data refers to significant volumes of data that cannot be processed effectively
with the traditional applications that are currently used. The processing of big data
begins with raw data that isn’t aggregated and is most often impossible to store in
the memory of a single computer.

Data Analytics is the science of examining raw data to reach certain conclusions. It involves applying an algorithmic or mechanical process to derive insights, and running through several data sets to look for meaningful correlations.

It is used in several industries, enabling organizations and data analytics companies to make more informed decisions, as well as to verify or disprove existing theories or models. The focus of data analytics lies in inference, the process of deriving conclusions based solely on what the researcher already knows.

HOW DOES A DATA SCIENTIST WORK?

A Data Scientist requires expertise in several backgrounds:

 Machine Learning
 Statistics
 Programming (Python or R)
 Mathematics
 Databases

A Data Scientist must find patterns within the data. Before he/she can find the
patterns, he/she must organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.
2. Explore and collect data - From databases, web logs, customer feedback, etc.
3. Extract the data - Transform the data into a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and replace them with a suitable value (e.g. an average value).
6. Normalize data - Scale the values into a practical range (e.g. 140 cm is smaller than 1.8 m, but the number 140 is larger than 1.8, so scaling is important; see the sketch after this list).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a way the "company" can understand.

BIG DATA AND DATA SCIENCE HYPE

Data science enables companies not only to understand data from multiple sources but also to enhance decision making. As a result, data science is widely used in almost every industry, including health care, finance, marketing, banking, city planning, and more. If you are interested in the field, you probably have something useful to contribute to making data science a more legitimate field, one that has the power to have a positive impact on society. So, what is eyebrow-raising (surprising) about Big Data and data science? Let's count the ways.

There's a lack of definitions around the most basic terminology

What is "Big Data" anyway? What does "data science" mean? What is the relationship between Big Data and data science? These terms are so ambiguous, they're more or less meaningless.

There's a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of thing for years, and whose work is based on decades of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types.

The hype is crazy

The longer the hype goes on, the more many of us will get turned off by it, and the harder it will be to see what's good underneath it all, if anything.

Statisticians already feel that they are studying and working on the "Science of Data." That's their bread and butter. Although we will make the case that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the media often describes data science in a way that makes it sound as if it's simply statistics or machine learning in the context of the tech industry.

People have said to us, "Anything that has to call itself a 'science' probably isn't." Although there might be truth in that, it doesn't mean that the term "data science" represents nothing; of course, what it represents may not be science but more of a craft.

GETTING PAST THE HYPE


Around all the hype, in other words, there is a ring of truth: this is something
new. But at the same time, it’s a fragile, nascent idea at real risk of being rejected
prematurely. For one thing, it’s being paraded around as a magic bullet, raising
unrealistic expectations that will surely be disappointed.
Rachel gave herself the task of understanding the cultural phenomenon of data
science and how others were experiencing it. She started meeting with people at
Google, at startups and tech companies, and at universities, mostly from within
statistics departments.
From those meetings she started to form a clearer picture of the new thing that’s
emerging. She ultimately decided to continue the investigation by giving a course
at Columbia called “Introduction to Data Science,” which Cathy covered on her
blog. We figured that by the end of the semester, we, and hopefully the students, would know what all this actually meant. And now, with this book, we hope to do the same for many more people.

WHY NOW?
Data Science helps businesses to comprehend vast amounts of data from different
sources, extract useful insights, and make better data-driven choices. Data Science
is used extensively in several industrial fields such as marketing, healthcare,
finance, banking, and policy work.

It’s not only the massiveness that makes all this new data interesting (or poses
challenges). It’s that the data itself, often in real time, becomes the building blocks
of data products.

On the Internet, this means Amazon recommendation systems, friend recommendations on Facebook, film and music recommendations, and so on.

DATAFICATION

Datafication can be defined as a process that "aims to transform most aspects of a business into quantifiable data (data that can be counted or measured in numerical values) that can be tracked, monitored, and analysed."

Cukier and Mayer-Schoenberger describe datafication as a process of "taking all aspects of life and turning them into data." As examples, they mention that:

"Google's augmented-reality glasses datafy the gaze.
Twitter datafies stray thoughts.
LinkedIn datafies professional networks."

CURRENT LANDSCAPE OF PERSPECTIVES

Data science is the process of extracting information, understanding, and learning from raw data to inform decision making in a proactive and systematic fashion that can be generalized.


So, what is data science? Is it new, or is it just statistics or analytics rebranded? Is it real, or is it pure hype? And if it's new and if it's real, what does that mean?

This is an ongoing discussion, but one way to understand what's going on in this industry is to look online and see what current discussions are taking place. This doesn't necessarily tell us what data science is, but it at least tells us what other people think it is, or how they're perceiving it. For example, on Quora there's a discussion from 2010 about "What is Data Science?", and here's Metamarkets CEO Mike Driscoll's answer:

Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.

But data science is not merely hacking, because when hackers finish debugging their Bash one-liners and Pig scripts, few of them care about non-Euclidean distance metrics.

And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a tab-delimited file into R if their job depended on it.

Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what's possible.

Figure 1-1. Drew Conway’s Venn diagram of data science


A DATA SCIENCE PROFILE

A data science profile needs skill levels in the following domains:
 Computer science
 Mathematics
 Statistics
 Machine learning
 Domain expertise
 Communication and presentation skills
 Data visualization

Perhaps the most concrete approach is to define data science by its usage—e.g., what data scientists get paid to do.

In Academia
In academia, hardly anyone officially calls themselves a data scientist. Usually,
people take on that title only if they're part of a special "data science institute" or
need it for funding. But a lot of folks in academia are interested in becoming data
scientists or using data science techniques.

For instance, in an Intro to Data Science class at Columbia, there were 60 students.
Originally, the teacher thought it would mostly attract people from statistics, math,
and computer science backgrounds. But it turned out, it was a mix of people from
various fields like sociology, journalism, political science, and more. They all
wanted to use data to solve important problems, especially social ones.

To make "data science" a common term in academia, we need to define it better.


Basically, an academic data scientist is someone from any scientific background
who deals with lots of data. They have to tackle tricky computational problems
because data can be messy and complex. But they're also focused on solving real-
world problems.

Why define it like this? Because no matter what field you're in, dealing with big data
brings similar challenges. If researchers from different areas team up, they can
solve all sorts of real-world problems together.

In Industry
In the tech industry, a chief data scientist is in charge of all things data-related in a company. They make the big decisions about how data is collected, used, and protected. This involves setting up the technology to gather data, making sure it's used in the right way, and even deciding what data should be visible to users.

They lead a team of experts who work with data, like engineers, scientists, and
analysts. They also talk to the big shots in the company, like the CEO and other
leaders, to explain how data can help the business.

A data scientist in general is someone who's really good at understanding data. They use tools and methods from statistics and machine learning to make sense of it all. A big part of their job is cleaning up messy data and finding patterns in it. They might also create models and experiments to help make important decisions for the company.

They're good at explaining what they find to their teammates and bosses, even if
those people aren't experts in data themselves. And they use visuals, like charts
and graphs, to make things easier to understand.

STATISTICAL INFERENCE
Data represents the traces of real-world processes. After separating the process from the data collection, we can see clearly that there are two sources of randomness and uncertainty: the randomness and uncertainty underlying the process itself, and the uncertainty associated with the underlying data collection methods.

Once you have all this data, you can’t go walking around with a huge Excel
spreadsheet or database of millions of transactions and look at it and, with a snap
of a finger, understand the world and process that generated it.

So you need a new idea, and that’s to simplify those captured traces into
something more comprehensible, to something that somehow captures it all in a
much more concise way, and that something could be mathematical models or
functions of the data, known as statistical estimators. This overall process of going
from the world to the data, and then from the data back to the world, is the field of
statistical inference.


POPULATIONS AND SAMPLES

Population:
A population is the entire group that you want to draw conclusions about.

Sample:
A sample is the specific group that you will collect data from; its size is smaller than the size of the population. In everyday usage, "population" refers to the people who live in a particular area at a specific time, but in statistics, a population refers to the data in your study of interest. It can be a group of individuals, objects, events, organizations, etc.

If you had to collect the same data from a larger population, say the entire country of India, it would be impossible to draw reliable conclusions because of geographical and accessibility constraints, not to mention time and resource constraints. A lot of data would be missing or might be unreliable. Furthermore, due to accessibility issues, marginalized tribes or villages might not provide data at all, making the data biased towards certain regions or groups.

Samples are used when:
 The population is too large to collect data from.
 The data collected is not reliable.
 The population is hypothetical and unlimited in size.

POPULATIONS AND SAMPLES OF BIG DATA

In the age of Big Data, we can record all users' actions all the time, so it might seem that sampling is no longer needed. Yet at Google, for example, software engineers, data scientists, and statisticians sample all the time.

How much data you need at hand really depends on what your goal is: for analysis or inference purposes, you typically don't need to store all the data all the time. On the other hand, for serving purposes you might: in order to render the correct information in a UI for a user, you need to have all the information for that particular user, for example.

So if we think of all the emails at BigCorp as the population, and we randomly sample from that population by reading some but not all emails, then that sampling process would create one particular sample. However, if we resampled, we'd get a different set of observations. The uncertainty created by such a sampling process has a name: the sampling distribution.
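
To make the sampling distribution concrete, here is a small Python simulation sketch (not from the text): a synthetic population of emails stands in for BigCorp's, and we watch how the estimated spam rate varies from sample to sample. The population size, sample size, and spam rate are arbitrary choices for illustration.

import random

random.seed(0)
# Synthetic population: 100,000 emails, roughly 20% spam (1 = spam, 0 = not).
population = [1 if random.random() < 0.20 else 0 for _ in range(100_000)]

# Draw many random samples and record the spam proportion in each one.
sample_props = []
for _ in range(1000):
    sample = random.sample(population, 500)
    sample_props.append(sum(sample) / len(sample))

# The spread of these proportions across resamples is the sampling distribution.
mean_prop = sum(sample_props) / len(sample_props)
print(f"mean of sample proportions: {mean_prop:.3f}")
print(f"min/max across samples: {min(sample_props):.3f} / {max(sample_props):.3f}")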

BIG DATA CAN MEAN BIG ASSUMPTIONS

The Big Data revolution is said to consist of three things:
• Collecting and using a lot of data rather than small samples
• Accepting messiness in your data
• Giving up on knowing the causes

The concept of “N=ALL” in the context of Big Data refers to the idea that, with the
massive volume of data available, we can effectively analyze and understand the
entire population (or universe) rather than relying on sampling.

Big Data challenges the need for sampling because it deals with vast amounts of
data (structured, unstructured, and semi-structured) collected from various sources
(e.g., social media, sensors, devices).

n=1
In the old days, a sample size of 1 would be ridiculous; you would never want to draw inferences about an entire population by looking at a single individual. But the concept of n=1 takes on new meaning in the age of Big Data, where for a single person we can actually record tons of information about them, and in fact we might even sample from all the events or actions they took (for example, phone calls or keystrokes) in order to make inferences about them.
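
A minimal sketch of that idea, with entirely made-up data: one user, thousands of recorded events, and a sample of those events used to estimate something about that single person.

import random

random.seed(1)
# n = 1: a single user, but thousands of recorded events for that user.
# Hypothetical log of one user's call durations, in seconds.
call_durations = [random.expovariate(1 / 180) for _ in range(10_000)]

# Sample from this one user's events to estimate their average call length.
sample = random.sample(call_durations, 200)
print(f"estimate from sample: {sum(sample) / len(sample):.1f} s")
print(f"mean over full log:   {sum(call_durations) / len(call_durations):.1f} s")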

MODELING
A model is our attempt to understand and represent the nature of reality through a
particular lens, be it architectural, biological, or mathematical.

A model is an artificial construction where all extraneous detail has been removed
or abstracted. Attention must always be paid to these abstracted details after a
model has been analyzed to see what might have been overlooked.

Humans try to understand the world around them by representing it in different ways.

 Architects capture attributes of buildings through blueprints and three-dimensional, scaled-down versions.
 Molecular biologists capture protein structure with three-dimensional visualizations of the connections between amino acids.
 Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions that express the shape and structure of the data itself.

Statistical Modeling:
Data modeling is the process of creating a conceptual representation of data objects and their relationships to one another. It typically involves several steps, including requirements gathering, conceptual design, logical design, physical design, and implementation.

Before you get too involved with the data and start coding, it’s useful to draw a
picture of what you think the underlying process might be with your model. What
comes first? What influences what? What causes what? What’s a test of that?

But different people think in different ways. Some prefer to express these kinds of
relationships in terms of math.

So, for example, if you have two columns of data, x and y, and you think there's a linear relationship, you'd write down

y = mx + b

Other people prefer pictures and will first draw a diagram of data flow, possibly with
arrows, showing how things affect other things or what happens over time.

But how do you build a model?

One place to start is exploratory data analysis (EDA). This entails making plots and
building intuition for your particular dataset. EDA helps out a lot, as well as trial and
error and iteration.

To be honest, until you’ve done it a lot, it seems very mysterious. The best thing to
do is start simply and then build in complexity.

For example, you can (and should) plot histograms and look at scatterplots to
start getting a feel for the data. Then you just try writing something down, even if
it’s wrong first (it will probably be wrong first, but that doesn’t matter).
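
As a concrete illustration of that first look, here is a minimal Python plotting sketch using matplotlib; the dataset is synthetic and the variable names are arbitrary.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Synthetic dataset: x has a roughly linear effect on y, plus noise.
x = rng.uniform(0, 10, 300)
y = 2.5 * x + rng.normal(0, 3, 300)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(y, bins=30)        # distribution of a single variable
axes[0].set_title("Histogram of y")
axes[1].scatter(x, y, s=8)      # pairwise relationship between two variables
axes[1].set_title("Scatterplot of x vs y")
plt.tight_layout()
plt.show()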

So try writing down a linear function. When you write it down, you force yourself to
think: does this make any sense? If not, why? What would make more sense? You
start simply and keep building it up in complexity, making assumptions, and writing
your assumptions down.

Remember, it's always good to start simply. There is a trade-off in modeling between simple and accurate. Simple models may be easier to interpret and understand. Oftentimes the crude, simple model gets you 90% of the way there and only takes a few hours to build and fit, whereas getting a more complex model might take months and only get you to 92%. You'll start building up your arsenal of potential models throughout this book. Some of the building blocks of these models are probability distributions.

Probability distributions

Probability distributions are the foundation of statistical models. When we get to linear regression and Naive Bayes, you will see how this happens in practice.

Back in the day, before computers, scientists observed real-world phenomena, took measurements, and noticed that certain mathematical shapes kept reappearing. The classical example is the height of humans, which follows a normal distribution: a bell-shaped curve, also called a Gaussian distribution, named after Gauss.

Other common shapes have been named after their observers as well (e.g., the
Poisson distribution and the Weibull distribution), while other shapes such as
Gamma distributions or exponential distributions are named after associated
mathematical objects.

A figure illustrating the various common shapes would remind you that they only have names because someone observed them enough times to think they deserved names; there is actually an infinite number of possible distributions.

They are to be interpreted as assigning a probability to a subset of possible outcomes, and they have corresponding functions. For example, the normal distribution is written as:

p(x | μ, σ) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

The parameter μ is the mean and median, and controls where the distribution is centered (because this is a symmetric distribution); the parameter σ controls how spread out the distribution is. This is the general functional form, but for specific real-world phenomena these parameters have actual numbers as values, which we can estimate from the data.
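
A short Python sketch of that estimation step, with simulated measurements standing in for real data; scipy's norm.pdf evaluates the fitted density, and all the numbers are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical height data in centimeters, assumed roughly normal.
heights = rng.normal(loc=170, scale=8, size=5000)

# Estimate mu and sigma from the data (the maximum likelihood estimates).
mu_hat, sigma_hat = heights.mean(), heights.std()
print(f"mu ≈ {mu_hat:.1f}, sigma ≈ {sigma_hat:.1f}")

# Evaluate the fitted density at a point, e.g. the density at 180 cm.
print(f"p(180) ≈ {stats.norm.pdf(180, mu_hat, sigma_hat):.4f}")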

i) random variable
A random variable, denoted by x or y, can be assumed to have a corresponding probability distribution p(x), which maps x to a positive real number. The values of the random variable correspond to the outcomes of a random experiment.

For example, let x be the amount of time until the next bus arrives (measured in
seconds). x is a random variable because there is variation and uncertainty in the
amount of time until the next bus.

ii) exponential distribution
If we want to know the likelihood of the next bus arriving in between 12 and 13 minutes, we can conduct an experiment where we show up at the bus stop at a random time, measure how much time passes until the next bus, and repeat this experiment over and over again. Then we look at the measurements, plot them, and approximate the function as discussed. Or, because "waiting time" is a common enough real-world phenomenon, we can use the distribution that is known to describe it: the exponential distribution.
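
A brief sketch of that calculation, assuming (purely for illustration) that the average wait between buses is 10 minutes; scipy's expon distribution does the work.

from scipy import stats

# Assumed mean waiting time of 10 minutes (the scale parameter is the mean).
wait = stats.expon(scale=10)

# Likelihood of the next bus arriving between 12 and 13 minutes from now.
prob = wait.cdf(13) - wait.cdf(12)
print(f"P(12 <= X <= 13) ≈ {prob:.3f}")  # about 0.029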

iii) joint distributions
Instead of denoting distributions of single random variables with functions of one variable, we use multivariate functions called joint distributions to do the same thing for more than one random variable. So in the case of two random variables, for example, we could denote our distribution by a function p(x, y); it would take values in the plane and give us nonnegative values, and in keeping with its interpretation as a probability, its (double) integral over the whole plane would be 1.

iv) conditional distribution
We also have what is called a conditional distribution, p(x|y), which is to be interpreted as the density function of x given a particular value of y.
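
Here is a tiny discrete sketch of both ideas; the joint table values are invented so the arithmetic is easy to check by hand.

import numpy as np

# Toy joint distribution p(x, y) over two binary random variables, given as
# a table; entries are nonnegative and sum to 1.
#                    y=0   y=1
joint = np.array([[0.30, 0.10],   # x=0
                  [0.20, 0.40]])  # x=1
assert np.isclose(joint.sum(), 1.0)

# Marginal p(y): sum the joint over x.
p_y = joint.sum(axis=0)

# Conditional p(x | y=1): slice the joint at y=1 and renormalize.
p_x_given_y1 = joint[:, 1] / p_y[1]
print(p_x_given_y1)  # [0.2 0.8]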

Fitting a model

Fitting a model means that you estimate the parameters of the model using the
observed data.

Fitting the model often involves optimization methods and algorithms, such as
maximum likelihood estimation, to help get the parameters.

In fact, when you estimate the parameters, they are actually estimators, meaning
they themselves are functions of the data. Once you fit the model, you actually can
write it as y = 7.2 + 4.5x, for example, which means that your best guess is that this equation or functional form expresses the relationship between your two variables, based on your assumption that the data followed a linear pattern.

Fitting the model is when you start actually coding: your code will read in the data,
and you’ll specify the functional form. Then R or Python will use built-in
optimization methods to give you the most likely values of the parameters given the
data.
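
A minimal sketch of that step, assuming numpy is available: we simulate data that truly follows y = 7.2 + 4.5x plus noise, then let a least-squares fit recover the parameters from the data alone.

import numpy as np

rng = np.random.default_rng(7)
# Simulate data from y = 7.2 + 4.5x with added noise.
x = rng.uniform(0, 10, 500)
y = 7.2 + 4.5 * x + rng.normal(0, 1.0, 500)

# Least-squares fit of a degree-1 polynomial returns (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)
print(f"fitted model: y = {intercept:.1f} + {slope:.1f}x")  # ≈ y = 7.2 + 4.5x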

overfitting
Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn't that good at capturing reality beyond your sampled data. You might know this because you have tried to use it to predict labels for another set of data that you didn't use to fit the model, and it doesn't do a good job, as measured by an evaluation metric such as accuracy.
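
A small simulation sketch of that failure mode, with made-up data: polynomials of increasing degree are fit to 30 noisy points, and the error is measured both on the data used for fitting and on held-out data the model never saw.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
x_test = rng.uniform(0, 1, 30)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 30)

def fit_and_score(degree):
    coeffs = np.polyfit(x, y, degree)          # fit on training data only
    train = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train, test

for degree in (1, 3, 15):
    train, test = fit_and_score(degree)
    print(f"degree {degree}: train MSE {train:.3f}, test MSE {test:.3f}")
# The highest-degree fit typically scores best on the training data but worst
# on the held-out data; that gap is overfitting.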

EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis (EDA) as the first step toward building a model. John
Tukey, a mathematician at Bell Labs, developed exploratory data analysis. The
“exploratory” aspect means that your understanding of the problem you are solving,
or might solve, is changing as you go.

The basic tools of EDA are plots, graphs, and summary statistics. Generally speaking, it's a method of systematically going through the data: plotting distributions of all variables (using box plots), plotting time series of data, transforming variables, looking at all pairwise relationships between variables using scatterplot matrices, and generating summary statistics for all of them. At the very least, that would mean computing their mean, minimum, maximum, and upper and lower quartiles, and identifying outliers.
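
A brief pandas sketch of those summary statistics, on an invented variable with a few injected outliers; the 1.5 × IQR rule used to flag outliers is a common convention, not something the text prescribes.

import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Hypothetical measurements with a few extreme values mixed in.
values = np.concatenate([rng.normal(50, 10, 500), [150, 160, -40]])
s = pd.Series(values, name="measurement")

# Mean, min, max, and quartiles in one call.
print(s.describe())

# Flag outliers using the 1.5 * IQR rule of thumb.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(f"{len(outliers)} outliers flagged")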

PHILOSOPHY OF EXPLORATORY DATA ANALYSIS

EDA is done for some of the same reasons it’s done with smaller datasets, but there
are additional reasons to do it with data that has been generated from logs.

There are important reasons anyone working with data should do EDA. Namely, to
gain intuition about the data; to make comparisons between distributions; for sanity
checking (making sure the data is on the scale you expect, in the format you thought it should be); to find out where data is missing or if there are outliers; and to summarize the data.

In the context of data generated from logs, EDA also helps with debugging the
logging process. For example, “patterns” you find in the data could actually be
something wrong in the logging process that needs to be fixed. If you never go to
the trouble of debugging, you’ll continue to think your patterns are real. The
engineers we’ve worked with are always grateful for help in this area.

THE DATA SCIENCE PROCESS

First, we have the Real World. Inside the Real World are lots of people busy at
various activities. Some people are using Google+, others are competing in the
Olympics; there are spammers sending spam, and there are people getting their
blood drawn. Say we have data on one of these things.

Specifically, we’ll start with raw data—logs, Olympics records, Enron employee
emails, or recorded genetic material (note there are lots of aspects to these activities
already lost even when we have that raw data). We want to process this to make it
clean for analysis. So we build and use pipelines of data munging: joining, scraping,
wrangling, or whatever you want to call it. To do this we use tools such as Python,
shell scripts, R, or SQL, or all of the above. Eventually we get the data down to a nice
format, like something with columns:

name | event | year | gender | event time
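
As an illustration of that munging step, here is a small pandas sketch that turns hypothetical raw pipe-delimited records into the clean columnar format above; the records and values are invented.

import pandas as pd

# Hypothetical raw records, as they might come out of scraping or joining logs.
raw = [
    "Bolt|100m|2012|M|9.63",
    "Felix|200m|2012|F|21.88",
    "Bolt|100m|2012|M|9.63",   # a duplicate to be dropped during cleaning
]

rows = [line.split("|") for line in raw]
df = pd.DataFrame(rows, columns=["name", "event", "year", "gender", "event_time"])
df = df.drop_duplicates()                          # basic cleaning
df["year"] = df["year"].astype(int)                # fix types column by column
df["event_time"] = df["event_time"].astype(float)
print(df)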


Once we have this clean dataset, we should be doing some kind of EDA. In the
course of doing EDA, we may realize that it isn’t actually clean because of duplicates,
missing values, absurd outliers, and data that wasn’t actually logged or incorrectly
logged. If that’s the case, we may have to go back to collect more data, or spend
more time cleaning the dataset.

Next, we design our model to use some algorithm like k-nearest neighbor (k-NN),
linear regression, Naive Bayes, or something else. The model we choose depends on
the type of problem we’re trying to solve, of course, which could be a classification
problem, a prediction problem, or a basic description problem.
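
As one concrete example of choosing and applying such a model, here is a minimal scikit-learn sketch using k-nearest neighbors on the classic iris dataset; the dataset and the choice of k = 5 are illustrative, not prescribed by the text.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# A classification problem solved with k-NN: predict a label from features
# by majority vote among the k closest training points.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")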

We then can interpret, visualize, report, or communicate our results. This could
take the form of reporting the results up to our boss or coworkers, or publishing a
paper in a journal and going out and giving academic talks about it.

A DATA SCIENTIST’S ROLE IN THIS PROCESS

This model so far seems to suggest this will all magically happen without human
intervention. By “human” here, we mean “data scientist.” Someone has to make the
decisions about what data to collect, and why. That person needs to be formulating
questions and hypotheses and making a plan for how the problem will be attacked.
And that someone is the data scientist or our beloved data science team.


Let’s revise or at least add an overlay to make clear that the data scientist needs to
be involved in this process throughout, meaning they are involved in the actual
coding as well as in the higher-level process, as shown in Figure.

We can think of the data science process as an extension of, or variation on, the scientific method:
• Ask a question.
• Do background research.
• Construct a hypothesis.
• Test your hypothesis by doing an experiment.
• Analyze your data and draw a conclusion.
• Communicate your results.

In both the data science process and the scientific method, not every problem
requires one to go through all the steps, but almost all problems can be solved with
some combination of the stages. For example, if your end goal is a data
visualization (which itself could be thought of as a data product), it’s possible you
might not do any machine learning or statistical modeling, but you’d want to get all
the way to a clean dataset, do some exploratory analysis, and then create the
visualization.

QUESTION BANK

UNIT-I

2 MARKS QUESTIONS

1. Define Data Science.
2. Define Big Data.
3. Define Data Analytics.
4. Define EDA.
5. Expand EDA.
6. Define Datafication.
7. Define Population.
8. Define Sample.
9. When are samples used?
10. Define Model.
11. Define Random Variable.
12. Define Exponential Distribution.
13. Define Joint Distribution.
14. Define Conditional Distribution.

5 MARKS QUESTIONS
1. What is Data Science? Explain in detail.
2. State the advantages of Data Science.
3. State the applications of Data Science.
4. A Data Scientist requires expertise in which backgrounds?
5. Explain "Getting Past the Hype."
6. Big Data and Data Science hype: why now?
7. What is statistical inference?
8. Explain populations and samples of Big Data.
9. What is fitting a model?
10. What is overfitting?
11. What is statistical modeling? Explain.
12. How do you build a model? Explain in detail.

10 MARKS QUESTIONS

1. How does a Data Scientist work? Explain.
2. Explain in detail Big Data and Data Science hype.
3. What is the current landscape of perspectives on Data Science?
4. What is a data science profile? Explain.
5. "Big Data can mean big assumptions." Explain.
6. Explain the concept of probability distributions, with types, in detail.
7. With a neat diagram, explain the Data Science Process.
8. With a neat diagram, explain a Data Scientist's role in this process.
