Foundations of Data Science UNIT 1
The data used for analysis can come from many different sources and be presented in
various formats.
Data Science can be applied in nearly every part of a business where data is
available. Examples are:
• Consumer goods
• Stock markets
• Industry
• Politics
• Logistic companies
• E-commerce
A data scientist must find patterns within the data. Before the patterns can be
found, the data must be organized in a standard format.
Data science was born from the idea of merging applied statistics with computer
science. The resulting field of study would use the extraordinary power of
modern computing. Scientists realized they could not only collect data and solve statistical
problems but also use that data to solve real-world problems and make reliable fact-driven predictions.
1962: American mathematician John W. Tukey first articulated the data science
dream. In his now-famous article “The Future of Data Analysis,” he foresaw the
inevitable emergence of a new field nearly two decades before the first personal
computers. While Tukey was ahead of his time, he was not alone in his early vision of the field.
1977: The theories and predictions of “pre” data scientists like Tukey and Naur became more
concrete with the establishment of the International Association for Statistical Computing
(IASC), whose mission was “to link traditional statistical methodology and modern
computer technology, and the knowledge of domain experts in order to convert data
into information and knowledge.”
1980s and 1990s: Data science began taking more significant strides with the emergence
of the first Knowledge Discovery in Databases (KDD) workshop and the founding of the
International Federation of Classification Societies (IFCS).
1994: Business Week published a story on the new phenomenon of “Database Marketing.”
It described the process by which businesses were collecting and leveraging enormous
amounts of data to learn more about their customers, competition, or advertising
techniques.
1990s and early 2000s: Data science emerged as a recognized and specialized field.
Several data science academic journals began to circulate, and data science proponents
like Jeff Wu and William S. Cleveland continued to develop and expound upon the
necessity and potential of data science.
Q. Explain the Data Science process.
The data science process consists of eight steps:
1. Business Objective 2. Data Requirement
3. Data Collection 4. Exploratory Data Analysis (EDA)
5. Modeling 6. Evaluation
7. Deployment 8. Monitoring
1. Business Objective
• Define the problem that needs to be solved.
• Ask the right questions to understand the business context and goals.
2. Data Requirement
• Identify the data needed to answer the business questions.
• Determine data sources and ensure relevance.
3. Data Collection
• Gather data from various sources (databases, APIs, sensors, etc.).
• Ensure data completeness and accessibility.
4. Exploratory Data Analysis (EDA)
• Explore, clean, and preprocess the data.
• Handle missing values, identify outliers, and understand patterns.
• Ensure data quality before modeling.
5. Modeling
• Select appropriate machine learning algorithms or statistical models.
• Train the model using prepared data.
• Aim to predict outcomes or generate insights.
6. Evaluation
• Assess model performance using relevant metrics (accuracy, precision, recall,
etc.); a minimal code sketch follows this list.
• Ensure the model meets business objectives.
7. Deployment
• Integrate the model into a production environment or business workflow.
• Make the model available for real-time decision-making.
8. Monitoring
• Continuously monitor model performance.
• Detect data drifts and changes in behavior.
• Re-evaluate earlier steps if necessary.
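To make steps 5 and 6 concrete, here is a minimal sketch assuming scikit-learn is installed; the bundled dataset and the logistic-regression model are placeholder choices for illustration, not a prescribed workflow.

# Minimal sketch of the Modeling and Evaluation steps (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)   # placeholder example dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000)    # placeholder model choice
model.fit(X_train, y_train)                  # step 5: Modeling

y_pred = model.predict(X_test)               # step 6: Evaluation
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))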
Q. What is Business Intelligence?
Business intelligence (BI) is a set of technologies, applications, and processes
that are used by enterprises for business data analysis. It is used for the
conversion of raw data into meaningful information which is then used for
business decision-making and profitable actions. It deals with the analysis of
structured and sometimes unstructured data which paves the way for new and
profitable business opportunities. It supports decision-making based on facts
rather than assumption-based decision-making. Thus it has a direct impact on
the business decisions of an enterprise. Business intelligence tools enhance the
chances of an enterprise to enter a new market as well as help in studying the
impact of marketing efforts.
Data Science vs. Business Intelligence:

S.No. | Factor | Data Science | Business Intelligence
1 | Concept | It is a field that uses mathematics, statistics, and various other tools to discover the hidden patterns in the data. | It is basically a set of technologies, applications, and processes that are used by enterprises for business data analysis.
4 | Flexibility | Data science is much more flexible, as data sources can be added as per requirement. | It is less flexible, as data sources need to be pre-planned.
13 | Business Value | Greater business value is achieved with data science, as it anticipates future events. | Business intelligence has lesser business value, as the extraction of business value is carried out statically by plotting charts and KPIs (Key Performance Indicators).
1. Mathematics
Linear Algebra:
• Understanding algorithms, especially those involved in data transformation
and machine learning, requires a solid understanding of linear algebra.
• Important subjects include eigenvalues, vectors, matrices, and matrix
multiplication. These ideas are essential for algorithms that reduce the
dimensions of data, such as Principal Component Analysis (PCA) and
Singular Value Decomposition (SVD).
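As a small illustration of how these ideas are used, here is a minimal sketch of PCA via SVD with NumPy; the tiny data matrix is made up for the example.

import numpy as np

# Toy data: 5 samples, 3 features (made-up numbers).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.7],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.4]])

Xc = X - X.mean(axis=0)                 # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T               # project onto the top 2 principal components
print(X_reduced.shape)                  # (5, 2): dimensions reduced from 3 to 2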
Calculus:
• Understanding the operation of optimization algorithms requires a strong
understanding of calculus, particularly differential calculus. For example,
derivatives play a key role in gradient descent, which reduces mistakes in
machine learning models.
• Important subjects that help in understanding and improving model behaviour
are derivatives, integrals, and partial derivatives.
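For instance, gradient descent uses the derivative to step toward a minimum. A minimal sketch on the one-variable function f(w) = (w - 3)^2, whose derivative is f'(w) = 2(w - 3):

# Gradient descent on f(w) = (w - 3)^2.
w, lr = 0.0, 0.1            # starting point and learning rate
for _ in range(100):
    grad = 2 * (w - 3)      # derivative of f at the current w
    w -= lr * grad          # step against the gradient
print(round(w, 4))          # converges to the minimum at w = 3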
Statistics and Probability:
• Data scientists use statistics to test hypotheses, draw conclusions, and make
inferences from data.
• The central limit theorem, hypothesis testing, confidence intervals, probability
distributions, and Bayesian inference are important subjects. These ideas
enable data scientists to make data-driven decisions, analyze estimates, and
measure uncertainty.
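As one example of these ideas in practice, here is a minimal hypothesis-testing sketch assuming SciPy is available; the sample values are made up.

from scipy import stats

sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1]              # made-up measurements
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)  # H0: population mean is 5.0
print(p_value)   # a large p-value means we fail to reject H0 at the usual 5% level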
2. Programming Skills
Data scientists need to know how to program. It enables efficient data handling,
analysis, and modelling.
Python:
• Python is the preferred language for data science because of its vast libraries,
including Pandas, NumPy, Scikit-learn, and TensorFlow, as well as its ease of
use and readability.
• Use cases: Python can be applied to statistical analysis, machine learning
model development, data cleansing, and visualization.
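A minimal data-cleansing sketch with Pandas (the tiny DataFrame is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50000, 62000, np.nan]})
df["age"] = df["age"].fillna(df["age"].median())   # impute missing ages
df = df.dropna(subset=["income"])                  # drop rows lacking income
print(df.describe())                               # quick statistical summary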
R:
• R is particularly well-liked in educational institutions and businesses that
depend significantly on statistical analysis. It is renowned for its strong data
analysis and visualization packages.
• Use cases: R is perfect for research-based data analysis, statistical modelling,
and visualizations.
SQL:
• In relational databases, SQL (Structured Query Language) is essential for
accessing and modifying data. Since databases hold the majority of real-world
data, SQL is an essential tool for data extraction.
• Examples of use: SQL allows you to work with massive datasets by
performing aggregations, joining tables, and searching databases.
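A minimal sketch of these SQL operations, run from Python with the built-in sqlite3 module (the table and rows are made up):

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('alice', 120.0), ('bob', 80.0), ('alice', 60.0);
""")
# Aggregation: total spend per customer.
for row in conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"):
    print(row)   # e.g. ('alice', 180.0) and ('bob', 80.0)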
Q. Tools and skills required for data science, and applications of data science in
various fields
Data scientists require a diverse set of technical skills,
including programming, statistics, machine learning, data visualization, and
database management. Proficiency in languages like Python and R is crucial for
data manipulation and analysis, while strong statistical knowledge is essential for
modeling and interpreting
results. Additionally, understanding data visualization techniques, including tools
like Tableau and Power BI, allows for effective communication of insights.
Key Technical Skills for Data Scientists:
• Programming:
Proficiency in languages like Python and R is essential for data manipulation,
analysis, and building machine learning models.
• Statistics and Mathematics:
A solid understanding of statistical concepts, probability, and mathematical
principles is necessary for building and evaluating models.
• Machine Learning:
Knowledge of various machine learning algorithms, including regression,
classification, and clustering, is crucial for building predictive models and
gaining insights from data.
• Data Visualization:
The ability to create clear and informative visualizations using tools like Tableau
and Power BI is vital for communicating data findings.
• Database Management:
Understanding database systems, including SQL and NoSQL databases, is essential
for managing and accessing large datasets.
• Data Wrangling:
The ability to clean, transform, and prepare data for analysis is a crucial skill for
data scientists.
• Big Data Technologies:
Familiarity with big data technologies like Hadoop and Spark is increasingly
important for handling large datasets.
• Cloud Computing:
Understanding and using cloud computing platforms like AWS, Azure, and
Google Cloud is essential for storing, processing, and analyzing data.
• Deep Learning:
Knowledge of deep learning techniques and frameworks is becoming increasingly
important for tackling complex problems.
• Natural Language Processing (NLP):
For tasks involving text data, familiarity with NLP techniques is beneficial.
Applications of data science
Data science has a wide range of applications across various industries, including healthcare,
finance, e-commerce, and transportation. It's used for predictive analytics, machine learning, data
visualization, recommendation systems, fraud detection, and more, helping businesses and
organizations make better decisions and improve their operations.
2. Finance:
Risk management:
Data science helps assess and manage various financial risks, including credit risk and market
risk.
Personalized financial advice:
Data science can analyze financial data to provide tailored recommendations to customers.
3. E-commerce:
Personalized recommendations:
Data science algorithms analyze user behavior to suggest relevant products to customers, boosting
sales and improving user experience.
Inventory optimization:
Data science helps e-commerce businesses manage their inventory efficiently, minimizing storage
costs and preventing stockouts.
Targeted advertising:
Data science enables businesses to target specific customer segments with personalized ads,
improving campaign effectiveness.
4. Transportation:
Route optimization:
Data science algorithms can optimize delivery routes for logistics companies, reducing fuel costs
and delivery times.
Predictive maintenance:
Data science can analyze sensor data from vehicles to predict potential maintenance needs,
preventing costly breakdowns.
Traffic management:
Data science can be used to analyze traffic patterns and optimize traffic flow, reducing
congestion.
5. Other Applications:
1. Search engines
2. Sports
3. Social media
4. Education.
5. Environmental monitoring.
Q. Data science security issues and risks:
The main data science security issues are: 1. Data breaches and unauthorized access
2. Data privacy
3. Data poisoning
4. False positives
5. Lack of skilled professionals
6. Insecure APIs
1. Data breaches and unauthorized access
Data science systems contain large datasets with sensitive information, making them
attractive targets for cybercriminals. Unauthorized access can lead to data theft,
identity theft, and financial fraud.
2. Data privacy: The protection of personal data from unauthorized access and use.
Data privacy in data science refers to the ethical and responsible handling of personal
information while collecting and analysing data.
3. Data poisoning: Data poisoning is a type of cyber-attack that targets the training
data of AI and ML models.
Example: Attackers can introduce new data, modify existing data, or delete data points
from the training set.
4. False positives: A false positive in data science occurs when a model incorrectly
predicts a positive outcome for a data point.
Example: In medical testing, a false positive means a test incorrectly indicates a
patient has a disease when they do not (a counting sketch follows this list).
5. Lack of skilled professionals: Securing data science environments and addressing
security issues requires specialized skills and expertise.
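A minimal sketch of counting false positives, assuming labels where 1 = positive and 0 = negative (both vectors are made up):

y_true = [0, 1, 0, 0, 1, 0]   # actual labels (made-up)
y_pred = [1, 1, 0, 1, 1, 0]   # model predictions (made-up)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
print(fp)   # 2: cases predicted positive where the truth was negative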
Data Collection Strategies
Data collection is the process of gathering and measuring information from various sources to use in
data science projects. The quality of analysis and models heavily depends on the quality and relevance of
the data collected.
Data collection is collecting information from various sources to answer specific questions or achieve
a goal. It plays a vital role in decision-making across industries like business, healthcare, education and
research. Whether you want to analyze trending data, solve any complex problem or make predictions data
collection is the foundation of many processes that shape our everyday lives.
Types of Data Collection: Data collection methods can be broadly classified into two categories:
Primary Data Collection and Secondary Data Collection. It is as follows:
1. Primary Data Collection: Primary data is information gathered directly from the source. It is
first-hand data, meaning it is collected for a specific purpose and is usually more accurate, reliable, and relevant.
Surveys & Questionnaires
Interviews
Observations
Focus Groups
Experiments & A/B Testing.
2. Secondary Data Collection: Secondary data refers to pre-existing information collected by third
parties. It is cost-effective, quick to access, and often used for trend analysis, competitor research, and
historical data studies.
a) Public Records & Government Data
Governments and organizations release census reports, economic statistics, and demographic insights
that help businesses plan strategies.
Examples: World Bank reports, WHO health statistics, or industry growth trends.
b) Industry Reports & Market Research Studies
Research firms like Gartner, Forrester, and Statista publish industry insights, competitor analysis, and
data-driven predictions.
Companies use these reports to understand market trends and customer behavior.
c) Academic & Scientific Research Papers
Universities and research institutions provide peer-reviewed studies and experiments that contribute
to fields like medicine, AI, and cybersecurity.
Google Scholar and IEEE Xplore are common platforms for accessing academic data.
d) Social Media & Web Analytics
Platforms like Facebook, Twitter, and Google Analytics collect user engagement data, helping
businesses understand customer preferences.
Brands track online trends, hashtags, and audience demographics to fine-tune marketing strategies.
e) Business Databases & Third-Party Data Providers
Companies buy pre-collected datasets from sources like Nielsen, Experian, or Crunchbase to enhance
marketing and business intelligence.
Used for competitor research, audience targeting, and industry benchmarking.
Methods of Data Collection
Data collection methods vary depending on the type of data, purpose, and industry requirements.
These methods help businesses, researchers, and analysts gather accurate and actionable insights.
1. Quantitative Data Collection Methods:
Quantitative data collection focuses on numerical data that can be measured, analyzed, and used for
statistical insights.
Surveys & Questionnaires: Structured multiple-choice questions or rating scales to collect data from
a large audience.
Used in customer feedback, employee satisfaction, and product research.
Tools: Google Forms, Typeform, SurveyMonkey.
Online Polls & Forms
Quick single-question or short-form surveys to gauge user opinions instantly.
Used on social media, websites, and apps for marketing insights.
Experiments & A/B Testing
Comparing two or more versions of a product, website, or campaign to find the best-performing
option.
Common in digital marketing, app development, and UX testing.
Web & App Analytics
Tracks user behavior on websites, mobile apps, and digital platforms.
Example: Google Analytics tracks page views, bounce rates, and session durations.
Sensor & IoT-Based Data Collection
Automated collection of real-time data using IoT devices, smart sensors, and GPS tracking.
Used in smart homes, healthcare, and industrial automation.
2. Qualitative Data Collection Methods: Qualitative data collection focuses on opinions, behaviors,
and experiences, providing deeper insights.
Interviews (Face-to-Face, Phone, or Video)
Direct one-on-one conversations with open-ended questions.
Used for market research, employee feedback, and case studies.
Focus Groups
A small group discussion moderated by a researcher to collect in-depth opinions.
Common in brand perception studies, product testing, and media research.
Observational Research
Monitoring real-world user behavior without direct interaction.
Examples: Retail stores tracking customer movement, social media trend analysis.
Case Studies & Customer Stories
Real-life examples of customer experiences to understand patterns and behaviors.
Used in marketing, psychology, and product development.
3. Automated & Digital Data Collection Methods
With advancements in AI, machine learning, and automation, data collection has become faster,
scalable, and more efficient.
Web Scraping & Data Mining
Automated tools extract data from websites, social media, and online directories.
Used in price tracking, competitor research, and sentiment analysis.
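A minimal web-scraping sketch, assuming the requests and BeautifulSoup (bs4) libraries; the URL and the h2/"title" tag structure are hypothetical, and real scraping should respect a site's robots.txt and terms of service.

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products", timeout=10).text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")
# Extract product names, assuming they sit in <h2 class="title"> tags.
names = [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="title")]
print(names)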
API-Based Data Collection
Companies use APIs (Application Programming Interfaces) to pull data from platforms like Google,
Facebook, or weather services.
Used in real-time analytics, third-party integrations, and automation.
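A minimal API-based collection sketch using requests; the endpoint and parameters are hypothetical, since every real API documents its own URLs and fields.

import requests

resp = requests.get(
    "https://api.example.com/v1/weather",            # hypothetical endpoint
    params={"city": "Hyderabad", "units": "metric"},
    timeout=10,
)
resp.raise_for_status()   # fail loudly on HTTP errors
data = resp.json()        # most APIs return JSON
print(data)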
AI-Powered Sentiment Analysis
AI tools analyze customer reviews, social media comments, and feedback to gauge sentiment.
Helps brands understand public perception and market trends.
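Real sentiment tools rely on trained models, but a toy lexicon-based scorer shows the basic idea; the word lists below are made up for illustration.

POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment(text):
    # Positive words add 1, negative words subtract 1; the sign is the verdict.
    words = text.lower().split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

print(sentiment("great product, love it"))   # 2 -> positive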
Data Transformation
Data transformation in data mining refers to the process of converting raw data into a
format that is suitable for analysis and modeling, organizing it so that it is suitable for
machine learning algorithms and other data-driven tasks.
There are different methods of data transformation:
1. Data smoothing
2. Data aggregation
3. Data discretization
4. Data generalization
5. Data normalization.
1. Data smoothing: It is a process used to remove noise from the dataset using some
algorithm. It allows important features present in the dataset to be highlighted.
Example:
Imagine you have noisy data like this: [5, 7, 6, 20, 7, 8, 6]
Here, the value 20 is an outlier that makes the data look jagged.
One simple smoothing method is replacing each value with the average of its neighbors to reduce
sharp jumps. For example, replacing 20 with the average of its neighbors (6 + 7) / 2 = 6.5 gives
smoother data:
[5, 7, 6, 6.5, 7, 8, 6]
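A minimal Python sketch of this neighbor-averaging idea (the "twice the neighbor average" threshold is an arbitrary choice for the example):

data = [5, 7, 6, 20, 7, 8, 6]
smoothed = data[:]
for i in range(1, len(data) - 1):
    neighbor_avg = (data[i - 1] + data[i + 1]) / 2
    if data[i] > 2 * neighbor_avg:   # value jumps far above its neighbors
        smoothed[i] = neighbor_avg   # replace it with the neighbor average
print(smoothed)   # [5, 7, 6, 6.5, 7, 8, 6]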
2. Data aggregation: Data aggregation is the method of storing and presenting data in a
summary format.
The data may be obtained from multiple data sources and integrated into a single
summary for analysis.
Example: Sales data may be aggregated to compute monthly and annual total amounts.
January: 1,000 February: 1,200 March: 1,300
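A minimal Pandas sketch of this aggregation (the individual sales records are made up so that the monthly totals match the figures above):

import pandas as pd

sales = pd.DataFrame({
    "month":  ["January", "January", "February", "March", "March"],
    "amount": [400, 600, 1200, 700, 600],
})
monthly = sales.groupby("month")["amount"].sum()   # sum amounts per month
print(monthly)   # February 1200, January 1000, March 1300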
Data filtering: Filtering requires you to specify a rule or logic to identify the cases you
want to include in your analysis. Filtering can also be referred to as “subsetting” data, or a
data “drill-down”. For example, you might filter survey responses down to a single
customer segment before analysing them, as sketched below.
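A minimal filtering ("subsetting") sketch with Pandas; the survey DataFrame and the rule are made up for illustration.

import pandas as pd

df = pd.DataFrame({"age": [22, 35, 41, 29], "region": ["N", "S", "N", "S"]})
# Keep only respondents from region "N" who are at least 30 years old.
subset = df[(df["region"] == "N") & (df["age"] >= 30)]
print(subset)   # one row: age 41, region N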