FDS Unit 1: FUNDAMENTALS OF DATA SCIENCE - INTRODUCTION
1. Have you ever wondered how Amazon and eBay suggest items for you to buy?
2. How does Gmail filter your emails into spam and non-spam categories?
3. How does Netflix predict the shows you will like?
How do they do it? These are a few questions we ponder from time to time. In reality, doing such tasks is impossible without the availability of data. Data science is all about using data to solve problems.
The problem could be decision making, such as identifying which email is spam and which is not; or a product recommendation, such as which movie to watch; or predicting an outcome, such as who will be the next President of the USA. So, the core job of a data scientist is to understand the data, extract useful information out of it, and apply this in solving problems.
Over the past few years, there’s been a lot of hype in the media about “Data
Science” and “Big Data”. Today, data rules the world. This has resulted in a huge
demand for Data Scientists.
ADVANTAGES:
APPLICATIONS:
Big Data refers to significant volumes of data that cannot be processed effectively
with the traditional applications that are currently used. The processing of big data
begins with raw data that isn’t aggregated and is most often impossible to store in
the memory of a single computer.
Data Analytics is the science of examining raw data to reach certain conclusions.
Data Analytics involves applying an algorithmic or mechanical process to derive
insights and running through several data sets to look for meaningful correlations.
A data scientist typically requires expertise in the following areas:
Machine Learning
Statistics
Programming (Python or R)
Mathematics
Databases
A Data Scientist must find patterns within the data. Before he/she can find the
patterns, he/she must organize the data in a standard format.
Data science enables companies not only to understand data from multiple
sources but also to enhance decision making. As a result, data science is widely
used in almost every industry, including health care, finance, marketing, banking, city
planning, and more. If you are reading this, it probably means you have something useful to contribute to making data science a more legitimate field, one that has the power to have a positive impact on society. So, what is eyebrow-raising (surprising) about Big Data and data science? Let's count the ways.
There is a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of stuff for years, and whose work is based on decades of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types.
The hype is crazy. The longer the hype goes on, the more many of us will get turned off by it, and the harder it will be to see what's good underneath it all, if anything.
Statisticians already feel that they are studying and working on the "Science of Data." That is their bread and butter. Although we will make the case that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the media often describes data science in a way that makes it sound as if it is simply statistics or machine learning in the context of the tech industry.
People have said to us, "Anything that has to call itself a 'science' probably isn't." Although there might be truth in that, it doesn't mean that the term "data science" represents nothing; of course, what it represents may not be science but more of a craft (creating documents and products that make an impact).
WHY NOW?
Data Science helps businesses to comprehend vast amounts of data from different
sources, extract useful insights, and make better data-driven choices. Data Science
is used extensively in several industrial fields such as marketing, healthcare,
finance, banking, and policy work.
It’s not only the massiveness that makes all this new data interesting (or poses
challenges). It’s that the data itself, often in real time, becomes the building blocks
of data products.
DATAFICATION
Perhaps the most concrete approach is to define data science by its usage, e.g., what data scientists get paid to do.
In Academia
In academia, hardly anyone officially calls themselves a data scientist. Usually,
people take on that title only if they're part of a special "data science institute" or
need it for funding. But a lot of folks in academia are interested in becoming data
scientists or using data science techniques.
For instance, in an Intro to Data Science class at Columbia, there were 60 students.
Originally, the teacher thought it would mostly attract people from statistics, math,
and computer science backgrounds. But it turned out, it was a mix of people from
various fields like sociology, journalism, political science, and more. They all
wanted to use data to solve important problems, especially social ones.
Why define it like this? Because no matter what field you're in, dealing with big data
brings similar challenges. If researchers from different areas team up, they can
solve all sorts of real-world problems together.
In Industry
In the tech industry, a chief data scientist is in charge of all things data-related in a
company. They make the big decisions about how data is collected, used, and
protected. This involves setting up the technology to gather data, making sure it's
used in the right way, and even deciding what data should be visible to users.
They lead a team of experts who work with data, like engineers, scientists, and
analysts. They also talk to the big shots in the company, like the CEO and other
leaders, to explain how data can help the business.
They're good at explaining what they find to their teammates and bosses, even if
those people aren't experts in data themselves. And they use visuals, like charts
and graphs, to make things easier to understand.
STATISTICAL INFERENCE
Data represents the traces of real-world processes. After separating the process from the data collection, we can see clearly that there are two sources of randomness and uncertainty: the randomness and uncertainty underlying the process itself, and the uncertainty associated with the data collection methods.
Once you have all this data, you can’t go walking around with a huge Excel
spreadsheet or database of millions of transactions and look at it and, with a snap
of a finger, understand the world and process that generated it.
So you need a new idea, and that’s to simplify those captured traces into
something more comprehensible, to something that somehow captures it all in a
much more concise way, and that something could be mathematical models or
functions of the data, known as statistical estimators. This overall process of going
from the world to the data, and then from the data back to the world, is the field of
statistical inference.
Population:
A population is the entire group that you want to draw conclusions about.
Sample:
A sample is the specific group that you will collect data from; the sample size is smaller than the size of the population. In everyday usage, population refers to the people who live in a particular area at a specific time, but in statistics, population refers to the data relevant to your study of interest. It can be a group of individuals, objects, events, organizations, etc.
If you had to collect the same data from a larger population, say the entire country of India, it would be impossible to draw reliable conclusions because of geographical and accessibility constraints, not to mention time and resource constraints. A lot of data would be missing or might be unreliable. Furthermore, due to accessibility issues, marginalized tribes or villages might not provide data at all, making the data biased towards certain regions or groups.
Samples are used when:
The population is too large to collect data.
The data collected is not reliable.
The population is hypothetical and is unlimited in size.
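To make the idea concrete, here is a minimal sketch (Python with NumPy, using a made-up synthetic population) of drawing a sample and using the sample mean to estimate the population mean:

```python
import numpy as np

# Hypothetical "population": daily commute times (minutes) for 1,000,000 people.
rng = np.random.default_rng(seed=42)
population = rng.normal(loc=35.0, scale=10.0, size=1_000_000)

# Draw a much smaller random sample without replacement.
sample = rng.choice(population, size=500, replace=False)

# The sample mean is an estimator of the (usually unknown) population mean.
print("population mean:", population.mean())
print("sample mean    :", sample.mean())
```

In practice the full population is rarely available; the point is that a well-drawn sample lets you estimate its properties without observing every member.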
Even in the age of Big Data, where we can record all users' actions all the time, sampling still matters. At Google, for example, software engineers, data scientists, and statisticians sample all the time.
How much data you need at hand really depends on what your goal is: for
analysis or inference purposes, you typically don’t need to store all the data all the
time. On the other hand, for serving purposes you might: in order to render the
correct information in a UI for a user, you need to have all the information for that
particular user, for example.
The concept of “N=ALL” in the context of Big Data refers to the idea that, with the
massive volume of data available, we can effectively analyze and understand the
entire population (or universe) rather than relying on sampling.
Big Data challenges the need for sampling because it deals with vast amounts of
data (structured, unstructured, and semi-structured) collected from various sources
(e.g., social media, sensors, devices).
n=1
In the old days, a sample size of 1 would have been ridiculous; you would never want to draw inferences about an entire population by looking at a single individual. But the concept of n=1 takes on new meaning in the age of Big Data, where, for a single person, we can actually record tons of information about them; in fact, we might even sample from all the events or actions they took (for example, phone calls or keystrokes) in order to make inferences about them.
MODELING
A model is our attempt to understand and represent the nature of reality through a
particular lens, be it architectural, biological, or mathematical.
A model is an artificial construction where all extraneous detail has been removed
or abstracted. Attention must always be paid to these abstracted details after a
model has been analyzed to see what might have been overlooked.
Statistical Modeling:
Data modeling is a process of creating a conceptual representation of data objects and their relationships to one another. The process of data modeling typically involves several steps, including requirements gathering, conceptual design, logical design, physical design, and implementation.
Before you get too involved with the data and start coding, it’s useful to draw a
picture of what you think the underlying process might be with your model. What
comes first? What influences what? What causes what? What’s a test of that?
But different people think in different ways. Some prefer to express these kinds of
relationships in terms of math.
So, for example, if you have two columns of data, x and y, and you think there's a linear relationship, you'd write down
y = mx + b
Other people prefer pictures and will first draw a diagram of data flow, possibly with
arrows, showing how things affect other things or what happens over time.
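Either way, the hypothesized relationship can be written down directly. A minimal sketch in Python (the slope and intercept values here are placeholders chosen only for illustration, not estimates from any real data):

```python
# A hypothesized linear model: y = m*x + b.
# m and b are made-up values just to express the assumed form of the relationship.
def linear_model(x, m=2.0, b=1.0):
    return m * x + b

# Evaluate the hypothesized model at a few points to see what it predicts.
for x in [0, 1, 2, 3]:
    print(x, linear_model(x))
```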
One place to start is exploratory data analysis (EDA). This entails making plots and
building intuition for your particular dataset. EDA helps out a lot, as well as trial and
error and iteration.
To be honest, until you’ve done it a lot, it seems very mysterious. The best thing to
do is start simply and then build in complexity.
For example, you can (and should) plot histograms and look at scatterplots to
start getting a feel for the data. Then you just try writing something down, even if
it’s wrong first (it will probably be wrong first, but that doesn’t matter).
So try writing down a linear function. When you write it down, you force yourself to
think: does this make any sense? If not, why? What would make more sense? You
start simply and keep building it up in complexity, making assumptions, and writing
your assumptions down.
Probability distributions
Other common shapes have been named after their observers as well (e.g., the
Poisson distribution and the Weibull distribution), while other shapes such as
Gamma distributions or exponential distributions are named after associated
mathematical objects.
The figure is an illustration of the various common shapes, and a reminder that they only have names because someone observed them enough times to think they deserved names. There is actually an infinite number of possible distributions.
The parameter μ is the mean and median and controls where the distribution is centered (because this is a symmetric distribution), and the parameter σ controls how spread out the distribution is. This is the general functional form, but for specific real-world phenomena, these parameters have actual numbers as values, which we can estimate from the data.
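For reference, the functional form being described is the normal (Gaussian) density. A minimal sketch (Python, with synthetic data standing in for real measurements) of writing it down and estimating μ and σ from data:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Normal density: p(x) = exp(-(x - mu)^2 / (2*sigma^2)) / sqrt(2*pi*sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Synthetic data standing in for observations of a real-world phenomenon
# (e.g., hypothetical heights in cm).
rng = np.random.default_rng(0)
data = rng.normal(loc=170.0, scale=8.0, size=10_000)

# Estimate the parameters from the data.
mu_hat = data.mean()
sigma_hat = data.std()

print("estimated mu    :", mu_hat)
print("estimated sigma :", sigma_hat)
print("density at x=170:", normal_pdf(170.0, mu_hat, sigma_hat))
```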
i) Random variable
A random variable, denoted by x or y, can be assumed to have a corresponding probability distribution p(x), which maps x to a positive real number. Equivalently, the values of the random variable correspond to the outcomes of a random experiment.
For example, let x be the amount of time until the next bus arrives (measured in
seconds). x is a random variable because there is variation and uncertainty in the
amount of time until the next bus.
ii) Exponential distribution
If we want to know the likelihood of the next bus arriving in between 12 and 13 minutes, we can conduct an experiment where we show up at the bus stop at a random time, measure how much time passes until the next bus, and repeat this experiment over and over again. Then we look at the measurements, plot them, and approximate the function as discussed. Or, because "waiting time" is a common enough real-world phenomenon, we can simply use the distribution that is known to model it well: the exponential distribution.
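A minimal sketch (Python, with a made-up average waiting time) of using an exponential distribution to answer the bus question:

```python
import math

# Hypothetical assumption: buses arrive on average every 10 minutes,
# so waiting time is modeled as Exponential with rate lam = 1/10 per minute.
lam = 1.0 / 10.0

def expo_cdf(t, lam):
    """P(waiting time <= t) for an exponential distribution with rate lam."""
    return 1.0 - math.exp(-lam * t)

# Probability the next bus arrives between 12 and 13 minutes from now.
p = expo_cdf(13, lam) - expo_cdf(12, lam)
print("P(12 <= wait <= 13 minutes):", round(p, 4))
```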
iii)joint distributions
denoting distributions of single random variables with functions of one variable, we
use multivariate functions called joint distributions
iv)conditional distribution
to do the same thing for more than one random variable. So in the case of two
random variables, for example, we could denote our distribution by a function
p(x,y) and it would take values in the plane and give us nonnegative values. In
keeping with its interpretation as a probability, its (double) integral over the
whole plane would be 1. We also have what is called a conditional distribution,
p(x|y) which is to be interpreted as the density function of x given a particular
value of y.
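A minimal sketch (Python, with a tiny made-up discrete joint distribution) showing how a conditional distribution is obtained from a joint one:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary random variables,
# stored as a 2x2 table; entries sum to 1.
# rows: x in {0, 1}, columns: y in {0, 1}
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

# Marginal distribution of y: sum p(x, y) over x.
p_y = p_xy.sum(axis=0)

# Conditional distribution p(x | y=1): joint divided by the marginal.
p_x_given_y1 = p_xy[:, 1] / p_y[1]

print("p(y):        ", p_y)            # [0.3, 0.7]
print("p(x | y = 1):", p_x_given_y1)   # [0.30/0.70, 0.40/0.70]
```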
Fitting a model
Fitting a model means that you estimate the parameters of the model using the
observed data.
Fitting the model often involves optimization methods and algorithms, such as
maximum likelihood estimation, to help get the parameters.
In fact, when you estimate the parameters, they are actually estimators, meaning they themselves are functions of the data. Once you fit the model, you can actually write it as y = 7.2 + 4.5x, for example, which means that your best guess is that this equation expresses the relationship between your two variables, based on the observed data.
Fitting the model is when you start actually coding: your code will read in the data,
and you’ll specify the functional form. Then R or Python will use built-in
optimization methods to give you the most likely values of the parameters given the
data.
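A minimal sketch (Python, with synthetic data generated from made-up "true" parameters) of fitting a linear model by least squares, which coincides with maximum likelihood estimation under Gaussian noise:

```python
import numpy as np

# Synthetic data: the "true" process is y = 7.2 + 4.5*x plus noise.
# (These numbers are made up to mirror the example in the text.)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 7.2 + 4.5 * x + rng.normal(scale=2.0, size=200)

# Fit y = b + m*x by least squares (degree-1 polynomial fit).
m_hat, b_hat = np.polyfit(x, y, deg=1)

print(f"fitted model: y = {b_hat:.2f} + {m_hat:.2f}x")
```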
Overfitting
Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn't that good at capturing reality beyond your sampled data. You might know this because you have tried to use it to predict labels for another set of data that you didn't use to fit the model, and it doesn't do a good job, as measured by an evaluation metric such as accuracy.
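A minimal sketch (Python, using scikit-learn on synthetic data) of the standard way to detect overfitting: hold out data the model never saw during fitting and compare accuracies:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A deep, unconstrained tree can memorize the training data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))  # typically near 1.0
print("test accuracy :", accuracy_score(y_test, model.predict(X_test)))    # typically noticeably lower
```

A large gap between training and test accuracy is the usual symptom that the model has fit the sampled data rather than the underlying reality.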
Exploratory Data Analysis (EDA) is the first step toward building a model. John Tukey, a mathematician at Bell Labs, developed exploratory data analysis. The "exploratory" aspect means that your understanding of the problem you are solving, or might solve, is changing as you go.
The basic tools of EDA are plots, graphs, and summary statistics. Generally speaking, it's a method of systematically going through the data, plotting distributions of all variables (using box plots), plotting time series of data, transforming variables, looking at all pairwise relationships between variables using scatterplot matrices, and generating summary statistics for all of them. At the very least, that would mean computing their mean, minimum, maximum, and upper and lower quartiles, and identifying outliers.
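A minimal sketch (Python, using pandas and matplotlib on a small made-up dataset) of these first EDA steps:

```python
import pandas as pd
import matplotlib.pyplot as plt

# A tiny made-up dataset standing in for a real one.
df = pd.DataFrame({
    "age":    [23, 35, 31, 45, 29, 52, 38, 27, 61, 33],
    "income": [28, 52, 47, 80, 39, 95, 60, 33, 110, 50],  # in thousands
})

# Summary statistics: mean, min, max, quartiles for every numeric column.
print(df.describe())

# Distributions of all variables, a box plot, and a pairwise scatterplot matrix.
df.hist()
plt.figure()
df.boxplot()
pd.plotting.scatter_matrix(df)
plt.show()
```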
EDA is done for some of the same reasons it’s done with smaller datasets, but there
are additional reasons to do it with data that has been generated from logs.
There are important reasons anyone working with data should do EDA. Namely, to
gain intuition about the data; to make comparisons between distributions; for sanity
checking (making sure the data is on the scale you expect, in the format you thought
it should be); to find out where data is missing or if there are outliers; and to summarize the data.
In the context of data generated from logs, EDA also helps with debugging the
logging process. For example, “patterns” you find in the data could actually be
something wrong in the logging process that needs to be fixed. If you never go to
the trouble of debugging, you’ll continue to think your patterns are real. The
engineers we’ve worked with are always grateful for help in this area.
First, we have the Real World. Inside the Real World are lots of people busy at
various activities. Some people are using Google+, others are competing in the
Olympics; there are spammers sending spam, and there are people getting their
blood drawn. Say we have data on one of these things.
Specifically, we’ll start with raw data—logs, Olympics records, Enron employee
emails, or recorded genetic material (note there are lots of aspects to these activities
already lost even when we have that raw data). We want to process this to make it
clean for analysis. So we build and use pipelines of data munging: joining, scraping,
wrangling, or whatever you want to call it. To do this we use tools such as Python,
shell scripts, R, or SQL, or all of the above. Eventually we get the data down to a nice
format, like something with columns:
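For illustration, a minimal sketch (Python with pandas, on made-up raw records) of munging messy data into a clean, columnar format:

```python
import pandas as pd

# Made-up raw records, messy on purpose: mixed case, missing values, duplicates.
raw = pd.DataFrame({
    "user":      ["Alice", "alice", "Bob", "Bob", None],
    "timestamp": ["2024-01-05", "2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
    "amount":    ["12.5", "12.5", "7", None, "3.2"],
})

clean = (
    raw
    .dropna(subset=["user"])                          # drop records with no user
    .assign(
        user=lambda d: d["user"].str.lower(),         # normalize names
        timestamp=lambda d: pd.to_datetime(d["timestamp"]),
        amount=lambda d: pd.to_numeric(d["amount"]),  # strings -> numbers
    )
    .drop_duplicates()
    .reset_index(drop=True)
)
print(clean)
```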
Once we have this clean dataset, we should be doing some kind of EDA. In the
course of doing EDA, we may realize that it isn’t actually clean because of duplicates,
missing values, absurd outliers, and data that wasn’t actually logged or incorrectly
logged. If that’s the case, we may have to go back to collect more data, or spend
more time cleaning the dataset.
Next, we design our model to use some algorithm like k-nearest neighbor (k-NN),
linear regression, Naive Bayes, or something else. The model we choose depends on
the type of problem we’re trying to solve, of course, which could be a classification
problem, a prediction problem, or a basic description problem.
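A minimal sketch (Python, scikit-learn, synthetic data) of this step: choosing and fitting a simple model once the dataset is clean. The choice of k-NN here is just for illustration; the problem type drives the choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, already-clean dataset standing in for the output of the munging step.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A classification problem, so pick a classifier; k-NN is one simple choice.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

print("held-out accuracy:", clf.score(X_test, y_test))
```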
We then can interpret, visualize, report, or communicate our results. This could
take the form of reporting the results up to our boss or coworkers, or publishing a
paper in a journal and going out and giving academic talks about it.
This model so far seems to suggest this will all magically happen without human
intervention. By “human” here, we mean “data scientist.” Someone has to make the
decisions about what data to collect, and why. That person needs to be formulating
questions and hypotheses and making a plan for how the problem will be attacked.
And that someone is the data scientist or our beloved data science team.
Let’s revise or at least add an overlay to make clear that the data scientist needs to
be involved in this process throughout, meaning they are involved in the actual
coding as well as in the higher-level process, as shown in Figure.
In both the data science process and the scientific method, not every problem
requires one to go through all the steps, but almost all problems can be solved with
some combination of the stages. For example, if your end goal is a data
visualization (which itself could be thought of as a data product), it’s possible you
might not do any machine learning or statistical modeling, but you’d want to get all
the way to a clean dataset, do some exploratory analysis, and then create the
visualization.
QUESTION BANK
UNIT-I
2 MARKS QUESTION
5 MARKS QUESTION
1. What is Data Science? Explain in detail.
2. State the advantages of Data Science.
3. State the applications of Data Science.
4. A Data Scientist requires expertise in which backgrounds?
5. What is meant by getting past the hype?
6. Big Data and Data Science hype: why now?
7. What is statistical inference?
8. Explain populations and samples in the context of Big Data.
9. What is fitting a model?
10. What is overfitting?
11. What is statistical modeling? Explain.
12. How do you build a model? Explain in detail.
10 MARKS QUESTION