Foundations of Data Science UNIT 1

Data science is a field focused on analyzing large volumes of data to uncover patterns and inform business decisions, utilizing techniques from machine learning and statistics. It is applicable across various industries such as healthcare, finance, and e-commerce, and involves a structured process from defining business objectives to monitoring outcomes. Key prerequisites for data scientists include strong skills in mathematics, programming, and data visualization, while also addressing security issues like data breaches and privacy concerns.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
41 views23 pages

Foundations of Data Science UNIT 1

Data science is a field focused on analyzing large volumes of data to uncover patterns and inform business decisions, utilizing techniques from machine learning and statistics. It is applicable across various industries such as healthcare, finance, and e-commerce, and involves a structured process from defining business objectives to monitoring outcomes. Key prerequisites for data scientists include strong skills in mathematics, programming, and data visualization, while also addressing security issues like data breaches and privacy concerns.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Unit-1

Q. What is Data Science? Introduction and the need for data science.


Data science is the domain of study that deals with vast volumes of data using
modern tools and techniques to find unseen patterns, derive meaningful
information, and make business decisions. Data science uses complex machine
learning algorithms to build predictive models.

The data used for analysis can come from many different sources and be presented in
various formats.

Where is Data Science Needed?


Data Science is used in many industries in the world today, e.g. banking,
consultancy, healthcare, and manufacturing.

Examples of where Data Science is needed:

• For route planning: To discover the best routes to ship


• To foresee delays for flight/ship/train etc. (through predictive analysis)
• To create promotional offers
• To find the best suited time to deliver goods
• To forecast next year's revenue for a company
• To analyze the health benefits of training
• To predict who will win elections

Data Science can be applied in nearly every part of a business where data is
available. Examples are:

• Consumer goods
• Stock markets
• Industry
• Politics
• Logistic companies
• E-commerce

How Does a Data Scientist Work?


A Data Scientist requires expertise in several backgrounds:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases

A Data Scientist must find patterns within the data. Before he/she can find the
patterns, he/she must organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.


2. Explore and collect data - From database, web logs, customer feedback,
etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and replace
them with a suitable value (e.g. an average value).
6. Normalize data - Scale the values into a practical range (e.g. 140 cm is smaller
than 1.8 m, but the number 140 is larger than 1.8, so scaling is important; see the
sketch after this list).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a way the
"company" can understand.

Q. Evolution of Data Science: Growth & Innovation

Data science was born from the idea of merging applied statistics with computer
science. The resulting field of study would use the extraordinary power of
modern computing. Scientists realized they could not only collect data and solve statistical
problems but also use that data to solve real-world problems and make reliable fact-
driven predictions.

1962: American mathematician John W. Tukey first articulated the data science dream.
In his now-famous article "The Future of Data Analysis," he foresaw the inevitable
emergence of a new field nearly two decades before the first personal computers. While
Tukey was ahead of his time, he was not alone in his early appreciation of what would
come to be known as "data science."

1977: The theories and predictions of "pre" data scientists like Tukey and Naur became
more concrete with the establishment of the International Association for Statistical
Computing (IASC), whose mission was "to link traditional statistical methodology and
modern computer technology, and the knowledge of domain experts in order to convert
data into information and knowledge."

1980s and 1990s: Data science began taking more significant strides with the emergence
of the first Knowledge Discovery in Databases (KDD) workshop and the founding of the
International Federation of Classification Societies (IFCS).

1994: Business Week published a story on the new phenomenon of “Database Marketing.”
It described the process by which businesses were collecting and leveraging enormous
amounts of data to learn more about their customers, competition, or advertising
techniques.

1990s and early 2000s: Data science emerged as a recognized and specialized field.
Several data science academic journals began to circulate, and data science proponents
like Jeff Wu and William S. Cleveland continued to help develop and expound upon the
necessity and potential of data science.

2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.

2005: Big data enters the scene.

Q. What is the Data Science Process?

Data Science Process
The Data Science Process involves a sequence of structured steps aimed
at solving business problems through data-driven decision-making. The
process can be broken down into eight steps:

1. Business Objective, 2. Data Requirement, 3. Data Collection, 4. Exploratory Data
Analysis (EDA), 5. Modeling, 6. Evaluation, 7. Deployment, 8. Monitoring
1. Business Objective
• Define the problem that needs to be solved.
• Ask the right questions to understand the business context and goals.
2. Data Requirement
• Identify the data needed to answer the business questions.
• Determine data sources and ensure relevance.
3. Data Collection
• Gather data from various sources (databases, APIs, sensors, etc.).
• Ensure data completeness and accessibility.
4. Exploratory Data Analysis (EDA)
• Explore, clean, and preprocess the data.
• Handle missing values, identify outliers, and understand patterns.
• Ensure data quality before modeling.
5. Modeling
• Select appropriate machine learning algorithms or statistical models.
• Train the model using prepared data.
• Aim to predict outcomes or generate insights.
6. Evaluation
• Assess model performance using relevant metrics (accuracy, precision, recall, etc.;
see the sketch after this list).
• Ensure the model meets business objectives.
7. Deployment
• Integrate the model into a production environment or business workflow.
• Make the model available for real-time decision-making.
8. Monitoring
• Continuously monitor model performance.
• Detect data drifts and changes in behavior.
• Re-evaluate earlier steps if necessary.
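A minimal sketch of steps 5 and 6 (Modeling and Evaluation) using scikit-learn; the synthetic dataset and the choice of logistic regression are assumptions for illustration, not a prescribed model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic data standing in for the prepared dataset from the EDA step
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Modeling - train a simple classifier
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 6: Evaluation - compute the metrics mentioned above
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```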
Q. What is Business Intelligence?
Business intelligence (BI) is a set of technologies, applications, and processes
that are used by enterprises for business data analysis. It is used for the
conversion of raw data into meaningful information which is thus used for
business decision-making and profitable actions. It deals with the analysis of
structured and sometimes unstructured data which paves the way for new and
profitable business opportunities. It supports decision-making based on facts
rather than assumption-based decision-making. Thus it has a direct impact on
the business decisions of an enterprise. Business intelligence tools enhance the
chances of an enterprise to enter a new market as well as help in studying the
impact of marketing efforts.

Below are the differences between Data Science and Business Intelligence, factor by factor:

1. Concept: Data science is a field that uses mathematics, statistics and various other tools to discover the hidden patterns in data. Business intelligence is basically a set of technologies, applications and processes used by enterprises for business data analysis.
2. Focus: Data science focuses on the future. Business intelligence focuses on the past and present.
3. Data: Data science deals with both structured and unstructured data. Business intelligence mainly deals only with structured data.
4. Flexibility: Data science is much more flexible, as data sources can be added as per requirement. Business intelligence is less flexible, as data sources need to be pre-planned.
5. Method: Data science makes use of the scientific method. Business intelligence makes use of the analytic method.
6. Complexity: Data science has a higher complexity in comparison to business intelligence. Business intelligence is much simpler when compared to data science.
7. Expertise: The expert user of data science is the data scientist. The expert user of business intelligence is the business user.
8. Questions: Data science deals with the questions of what will happen and what if. Business intelligence deals with the question of what happened.
9. Storage: In data science, the data to be used is disseminated in real-time clusters. In business intelligence, a data warehouse is utilized to hold data.
10. Integration of data: The ELT (Extract-Load-Transform) process is generally used to integrate data for data science applications. The ETL (Extract-Transform-Load) process is generally used to integrate data for business intelligence applications.
11. Tools: Data science tools include SAS, BigML, MATLAB, Excel, etc. Business intelligence tools include InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, etc.
12. Usage: With data science, companies can harness their potential by anticipating future scenarios in order to reduce risk and increase income. Business intelligence helps in performing root cause analysis on a failure or in understanding the current status.
13. Business value: Greater business value is achieved with data science, as it anticipates future events. Business intelligence has lesser business value, as business value is extracted statically by plotting charts and KPIs (Key Performance Indicators).
14. Handling data sets: For data science, technologies such as Hadoop are available, and others are evolving, for handling large data sets. For business intelligence, sufficient tools and technologies are not available for handling large data sets.

Q. Prerequisites for Data Science


The most important prerequisites for data science are organized below by skill category.

1. Mathematics and Statistics


The basis of data science is made up of statistics and mathematics. They offer the
framework for understanding data, creating models, and analyzing results.

Linear Algebra:
• Understanding algorithms, especially those involved in data transformation
and machine learning, requires a solid understanding of linear algebra.
• Important subjects include eigenvalues, vectors, matrices, and matrix
multiplication. These ideas are essential for algorithms that reduce the
dimensions of data, such as Principal Component Analysis (PCA) and
Singular Value Decomposition (SVD).
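A minimal sketch of dimensionality reduction with PCA, which rests on the eigenvalue/SVD ideas above; the randomly generated data is an assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples with 5 features, one of which is nearly redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)

# Reduce to 2 dimensions; internally this uses the SVD of the centered data
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # how much variance each component keeps
```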

Calculus:
• Understanding the operation of optimization algorithms requires a strong
understanding of calculus, particularly differential calculus. For example,
derivatives play a key role in gradient descent, which minimizes the error of
machine learning models (see the sketch after this list).
• Important subjects that help in understanding and improving model behaviour
are derivatives, integrals, and partial derivatives.
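A minimal sketch of gradient descent on a one-parameter squared-error function, showing how the derivative drives each update; the loss function, learning rate, and iteration count are illustrative assumptions.

```python
# Minimize f(w) = (w - 3)^2, whose derivative is f'(w) = 2 * (w - 3)
def grad(w):
    return 2 * (w - 3)

w = 0.0              # initial guess
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * grad(w)   # step against the gradient

print(round(w, 4))   # converges towards the minimum at w = 3
```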
Statistics and Probability:
• Data scientists use statistics to test hypotheses, draw conclusions, and make
inferences from data.
• The central limit theorem, hypothesis testing, confidence intervals, probability
distributions, and Bayesian inference are important subjects. These ideas
enable data scientists to make data-driven decisions, analyze estimates, and
measure uncertainty.
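A minimal sketch of a hypothesis test and a confidence interval using SciPy; the two simulated samples are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=5, size=40)   # e.g. control group
group_b = rng.normal(loc=53, scale=5, size=40)   # e.g. treatment group

# Hypothesis test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("p-value:", p_value)

# 95% confidence interval for the mean of group_b
mean = group_b.mean()
sem = stats.sem(group_b)
low, high = stats.t.interval(0.95, len(group_b) - 1, loc=mean, scale=sem)
print("95% CI:", (low, high))
```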

2. Programming Skills
Data scientists need to know how to program. It enables efficient data handling,
analysis, and modelling.

Python:
• Python is the preferred language for data science because of its vast libraries,
including Pandas, NumPy, Scikit-learn, and TensorFlow, as well as its ease of
use and readability.
• Use cases: Python can be applied to statistical analysis, machine learning
model development, data cleansing, and visualization.

R:
• R is particularly well-liked in educational institutions and businesses that
depend significantly on statistical analysis. It is renowned for its strong data
analysis and visualization packages.
• Use cases: R is perfect for research-based data analysis, statistical modelling,
and visualizations.

SQL:
• In relational databases, SQL (Structured Query Language) is essential for
accessing and modifying data. Since databases hold the majority of real-world
data, SQL is an essential tool for data extraction.
• Examples of use: SQL allows you to work with massive datasets by performing
aggregations, joining tables, and searching databases (a small sketch follows below).
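A minimal sketch of the kind of SQL described above (aggregation with GROUP BY), run through Python's built-in sqlite3 module; the table name, columns, and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)],
)

# Aggregation: total order amount per customer
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall()
print(rows)   # e.g. [('alice', 150.0), ('bob', 75.5)]
conn.close()
```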
Q. Tools required for data science and applications of data science in various fields
Data scientists require a diverse set of technical skills,
including programming, statistics, machine learning, data visualization, and
database management. Proficiency in languages like Python and R is crucial for
data manipulation and analysis, while strong statistical knowledge is essential for
modeling and interpreting
results. Additionally, understanding data visualization techniques, including tools
like Tableau and Power BI, allows for effective communication of insights.
Key Technical Skills for Data Scientists:
• Programming:
Proficiency in languages like Python and R is essential for data manipulation,
analysis, and building machine learning models.
• Statistics and Mathematics:
A solid understanding of statistical concepts, probability, and mathematical
principles is necessary for building and evaluating models.
• Machine Learning:
Knowledge of various machine learning algorithms, including regression,
classification, and clustering, is crucial for building predictive models and
gaining insights from data.
• Data Visualization:
The ability to create clear and informative visualizations using tools like Tableau
and Power BI is vital for communicating data findings.
• Database Management:
Understanding database systems, including SQL and NoSQL databases, is essential
for managing and accessing large datasets.
• Data Wrangling:
The ability to clean, transform, and prepare data for analysis is a crucial skill for
data scientists.
• Big Data Technologies:
Familiarity with big data technologies like Hadoop and Spark is increasingly
important for handling large datasets.
• Cloud Computing:
Understanding and using cloud computing platforms like AWS, Azure, and
Google Cloud is essential for storing, processing, and analyzing data.
• Deep Learning:
Knowledge of deep learning techniques and frameworks is becoming increasingly
important for tackling complex problems.
• Natural Language Processing (NLP):
For tasks involving text data, familiarity with NLP techniques is beneficial.
Applications of data science
Data science has a wide range of applications across various industries, including healthcare,
finance, e-commerce, and transportation. It's used for predictive analytics, machine learning, data
visualization, recommendation systems, fraud detection, and more, helping businesses and
organizations make better decisions and improve their operations.

Here's a more detailed look at some key areas:


1. Healthcare:
Disease prediction and diagnosis:
Data science can analyze patient data to predict the likelihood of developing certain diseases and
assist in early diagnosis using techniques like image recognition (e.g., analyzing X-rays and CT
scans).
Personalized medicine:
By analyzing genetic information and patient data, data science can help create tailored treatment
plans for individuals.
Optimizing hospital operations:
Data science can be used to improve resource allocation, predict patient flow, and optimize
staffing levels.
Drug discovery:
Data science techniques are accelerating the process of finding new drugs and therapies.
2. Finance:
Fraud detection:
Data science algorithms can identify fraudulent transactions in real-time, helping financial
institutions prevent losses.

Risk management:
Data science helps assess and manage various financial risks, including credit risk and market
risk.
Personalized financial advice:
Data science can analyze financial data to provide tailored recommendations to customers.
3. E-commerce:
Personalized recommendations:
Data science algorithms analyze user behavior to suggest relevant products to customers, boosting
sales and improving user experience.
Inventory optimization:
Data science helps e-commerce businesses manage their inventory efficiently, minimizing storage
costs and preventing stockouts.
Targeted advertising:
Data science enables businesses to target specific customer segments with personalized ads,
improving campaign effectiveness.
4. Transportation:
Route optimization:
Data science algorithms can optimize delivery routes for logistics companies, reducing fuel costs
and delivery times.
Predictive maintenance:
Data science can analyze sensor data from vehicles to predict potential maintenance needs,
preventing costly breakdowns.
Traffic management:
Data science can be used to analyze traffic patterns and optimize traffic flow, reducing
congestion.
5. Other Applications:
1. Search engines
2. Sports
3. Social media
4. Education.
5. Environmental monitoring.
Q. Data science security issues and risks:
The main data science security issues are:
1. Data breaches and unauthorized access
2. Data privacy
3. Data poisoning
4. False positives
5. Lack of skilled professionals
6. Insecure APIs
1. Data breaches and unauthorized access: Data science work involves large data sets containing sensitive information, making them attractive targets for cyber criminals. Unauthorized access can lead to data theft, identity theft and financial fraud.
2. Data privacy: The protection of personal data from unauthorized access and use. Data privacy in data science refers to the ethical and responsible handling of personal information while collecting and analysing data.
3. Data poisoning: Data poisoning is a type of cyber-attack that targets the training data of AI and ML models. Ex: attackers can introduce new data, modify existing data, or delete data points from the training set.
4. False positives: A false positive in data science occurs when a model incorrectly predicts a positive outcome for a data point. Ex: a false positive might mean a test incorrectly indicates a patient has a disease when they don't.
5. Lack of skilled professionals: Securing data science environments and addressing security issues requires specialized skills and expertise.
Data Collection Strategies
Data collection is the process of gathering and measuring information from various sources to use in
data science projects. The quality of analysis and models heavily depends on the quality and relevance of
the data collected.
Data collection is collecting information from various sources to answer specific questions or achieve
a goal. It plays a vital role in decision-making across industries like business, healthcare, education and
research. Whether you want to analyze trends, solve a complex problem or make predictions, data
collection is the foundation of many processes that shape our everyday lives.
Types of Data Collection: Data collection methods can be broadly classified into two categories:
Primary Data Collection and Secondary Data Collection. It is as follows:
1. Primary Data Collection: Primary data is information gathered directly from the source. It is first-
hand data, meaning it is collected for a specific purpose and is usually more accurate, reliable, and relevant.
Surveys & Questionnaires
Interviews
Observations
Focus Groups
Experiments & A/B Testing.
2. Secondary Data Collection: Secondary data refers to pre-existing information collected by third
parties. It is cost-effective, quick to access, and often used for trend analysis, competitor research, and
historical data studies.
a) Public Records & Government Data
Governments and organizations release census reports, economic statistics, and demographic insights
that help businesses plan strategies.
Examples: World Bank reports, WHO health statistics, or industry growth trends.
b) Industry Reports & Market Research Studies
Research firms like Gartner, Forrester, and Statista publish industry insights, competitor analysis, and
data-driven predictions.
Companies use these reports to understand market trends and customer behavior.
c) Academic & Scientific Research Papers
Universities and research institutions provide peer-reviewed studies and experiments that contribute
to fields like medicine, AI, and cybersecurity.
Google Scholar and IEEE Xplore are common platforms for accessing academic data.
d) Social Media & Web Analytics
Platforms like Facebook, Twitter, and Google Analytics collect user engagement data, helping
businesses understand customer preferences.
Brands track online trends, hashtags, and audience demographics to fine-tune marketing strategies.
e) Business Databases & Third-Party Data Providers
Companies buy pre-collected datasets from sources like Nielsen, Experian, or Crunchbase to enhance
marketing and business intelligence.
Used for competitor research, audience targeting, and industry benchmarking.
Methods of Data Collection
Data collection methods vary depending on the type of data, purpose, and industry requirements.
These methods help businesses, researchers, and analysts gather accurate and actionable insights.
1. Quantitative Data Collection Methods:
Quantitative data collection focuses on numerical data that can be measured, analyzed, and used for
statistical insights.
Surveys & Questionnaires: Structured multiple-choice questions or rating scales to collect data from
a large audience.
Used in customer feedback, employee satisfaction, and product research.
Tools: Google Forms, Typeform, SurveyMonkey.
Online Polls & Forms
Quick single-question or short-form surveys to gauge user opinions instantly.
Used on social media, websites, and apps for marketing insights.
Experiments & A/B Testing
Comparing two or more versions of a product, website, or campaign to find the best-performing
option.
Common in digital marketing, app development, and UX testing.
Web & App Analytics
Tracks user behavior on websites, mobile apps, and digital platforms.
Example: Google Analytics tracks page views, bounce rates, and session durations.
Sensor & IoT-Based Data Collection
Automated collection of real-time data using IoT devices, smart sensors, and GPS tracking.
Used in smart homes, healthcare, and industrial automation.
2. Qualitative Data Collection Methods: Qualitative data collection focuses on opinions, behaviors,
and experiences, providing deeper insights.
Interviews (Face-to-Face, Phone, or Video)
Direct one-on-one conversations with open-ended questions.
Used for market research, employee feedback, and case studies.
Focus Groups
A small group discussion moderated by a researcher to collect in-depth opinions.
Common in brand perception studies, product testing, and media research.
Observational Research
Monitoring real-world user behavior without direct interaction.
Examples: Retail stores tracking customer movement, social media trend analysis.
Case Studies & Customer Stories
Real-life examples of customer experiences to understand patterns and behaviors.
Used in marketing, psychology, and product development.
3. Automated & Digital Data Collection Methods
With advancements in AI, machine learning, and automation, data collection has become faster,
scalable, and more efficient.
Web Scraping & Data Mining
Automated tools extract data from websites, social media, and online directories.
Used in price tracking, competitor research, and sentiment analysis.
API-Based Data Collection
Companies use APIs (Application Programming Interfaces) to pull data from platforms like Google,
Facebook, or weather services.
Used in real-time analytics, third-party integrations, and automation.
AI-Powered Sentiment Analysis
AI tools analyze customer reviews, social media comments, and feedback to gauge sentiment.
Helps brands understand public perception and market trends.

Data preprocessing in data science


Data preprocessing is the process of transforming raw data into a clean, structured, and usable
format for analysis and modeling. It involves cleaning, transforming, and organizing data to make it
suitable for machine learning algorithms and other data-driven tasks:
Data preprocessing methods:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
METHOD-1
1. Data cleaning: Data cleaning is the process of removing noisy, inconsistent and
incomplete data, in order to improve the quality of the data.
Methods of data cleaning
1. Handling of missing values.
2. Handling noise and outliers.
3. Remove unwanted data.
1. Handling the missing values (see the sketch after this list)
➢ Ignore the data row.
➢ Fill the missing values manually.
➢ Use a global constant to fill in the place of missing data.
➢ Use the attribute mean or median.
➢ Use the forward fill or backward fill method.
➢ Use algorithms.
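A minimal sketch of several of these strategies with pandas; the small DataFrame and column name are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age": [18, None, 17, 19, None, 36]})

dropped   = df.dropna()                   # ignore rows with missing values
constant  = df.fillna(0)                  # fill with a global constant
mean_fill = df.fillna(df["age"].mean())   # fill with the attribute mean
ffill     = df.ffill()                    # forward fill from the previous row

print(mean_fill)
```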
2. Handling noisy data and outliers:
Noise means an error or fault in the data that occurs during data collection, data entry
or data transformation.
➢ Noise in the data is eliminated by using “binning and smoothing”.
➢ Data is distributed among bins and smoothing techniques are applied to eliminate noise.
➢ Outliers are identified by using clustering techniques.
➢ Ex: for the ages of high school graduates [18, 17, 16, 1, 19, 18, 17, 36, 17], the
values 1 and 36 are outliers (see the sketch below).
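A minimal sketch of smoothing by bin means plus a simple outlier check on the ages from the example above; the bin count and the z-score threshold are illustrative assumptions, and the z-score check stands in for the clustering approach mentioned above.

```python
import pandas as pd

ages = pd.Series([18, 17, 16, 1, 19, 18, 17, 36, 17])

# Binning and smoothing: sort values into equal-width bins,
# then replace each value by the mean of its bin
bins = pd.cut(ages, bins=3)
smoothed = ages.groupby(bins).transform("mean")
print(smoothed.tolist())

# Simple outlier check: values far from the mean in standard-deviation units
z = (ages - ages.mean()) / ages.std()
outliers = ages[z.abs() > 1.5]   # illustrative threshold
print(outliers.tolist())         # [1, 36], matching the example above
```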
3. Remove unwanted data: Remove duplicate and irrelevant data to improve data quality.
METHOD-2
DATA INTEGRATION: Data integration is the process of merging data from different sources
(like multiple databases, data cubes and flat files) to avoid duplicates and inconsistency
in the data, which improves the accuracy and speed of data mining.
Data integration approaches: 1. Tight coupling 2. Loose coupling
1. Tight coupling: In this method the data warehouse is treated as an information
retrieval component. Data is combined from various sources into a single physical
location via the process of ETL (Extraction, Transformation, Loading).
2. Loose coupling: In loose coupling, the data remains only in the actual source
databases. An interface is provided that takes a query from the user, transforms it in a
way the source database can understand, and then sends the query directly to the source
database to obtain the result.

Issues in data integration:


➢ Entity identity trouble.
➢ Redundancy and correlation analysis.
➢ Tuple duplication
➢ Data conflict detection and resolution.
1. Entity identity trouble
Since data is integrated from different databases, matching of entities becomes a problem.
Ex: custid in one database and custnumber in another database may refer to the same
attribute, but how does a system or user understand this?
To avoid this problem, metadata is used to match the structure of the attributes; during
integration, functional dependencies and referential constraints are used.
2. Redundancy and correlation analysis
Correlation is one of the methods of studying the relationship between two variables,
where a change in the value of one variable produces a change in the other variable.
Ex: cost and demand.
Based on correlation analysis, redundancy in the data should be identified and eliminated.
3. Data conflict detection and resolution:
The same entity's attribute values from different sources may differ in representation.
Ex: a weight attribute may be represented in metric units in one system and in British
imperial units in another.
4. Tuple duplication: If the relations in the database have denormalized tables, this
also causes data redundancy. Differences in the naming of attributes across entities can
also cause redundancies in the data set. Duplicate data tuples are also present in
relational databases when the same real-world entity is expressed with different
attribute names. This may be caused by differences in the representation or scaling of data.
Method-3
Data reduction Data reduction is a technique used in data mining to reduce the size of a dataset
while still preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount of
irrelevant or redundant information
Methods of data reduction:
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
1. Dimensionality reduction: Encoding mechanisms are used to reduce the data set size.
In dimensionality reduction, data encoding or transformations are applied to obtain a
reduced or “compressed” representation of the original data. If the original data can be
reconstructed from the compressed data without any loss of information, the data
reduction is called lossless.
2. Numerosity reduction: The data are replaced or estimated by alternative, smaller data
representations, including parametric models (which need to store only the model
parameters rather than the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
3. Data compression
Data compression is defined as the process whereby information is encoded in fewer bits
than it originally occupied. This mainly happens through methods that eliminate
duplication and other extraneous information.
• Transform coding: Uses mathematical transforms that shrink the data, as in JPEG compression.
• Quantization: Reducing the precision of data; it is common in image and video compression.
Method-4

Data transformation Data transformation in data mining refers to the process of converting raw
data into a format that is suitable for analysis and modeling.
There are different methods in data transformation
1. Data smoothing
2. Data aggregation
3. Data discretization
4. Data generalization
5. Data normalization.
1. Data smoothing: It is a process that is used to remove noise from the dataset using
some algorithms. It allows for highlighting the important features present in the dataset.
Example:
Imagine you have noisy data like this: [5, 7, 6, 20, 7, 8, 6]
Here, the value 20 is an outlier that makes the data look jagged.
One simple smoothing method is replacing each value with the average of its neighbors to reduce
sharp jumps. For example, replacing 20 with the average of its neighbors (6 + 7) / 2 = 6.5 gives
smoother data:

[5, 7, 6, 6.5, 7, 8, 6]
Data Aggregation:
Data aggregation is the method of storing and presenting data in a summary format. The
data may be obtained from multiple data sources and integrated into a single data
analysis description.
Example: sales data may be aggregated to compute monthly and annual total amounts.
January: $1,000   February: $1,200   March: $1,300
Monthly sales can be aggregated to calculate quarterly sales: Total Sales = $3,500
Data discretization: It is a process of transforming continuous data into a set of small
intervals. Most data mining activities in the real world involve continuous attributes,
which often need to be discretized (see the sketch after this example).
Example:
Continuous age values: 22, 25, 37, 60
Discretized into categories:
• 0-25 → "Young"
• 26–50 → "Middle-aged"
• 50+ → "Senior"
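A minimal sketch of this age discretization with pandas; the upper bin edge of 120 is an assumption for illustration.

```python
import pandas as pd

ages = pd.Series([22, 25, 37, 60])

# Cut continuous ages into the intervals described above
categories = pd.cut(
    ages,
    bins=[0, 25, 50, 120],
    labels=["Young", "Middle-aged", "Senior"],
)
print(categories.tolist())   # ['Young', 'Young', 'Middle-aged', 'Senior']
```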
Data generalization: It converts low-level data attributes to high-level data attributes using
concept hierarchy. For Example, Age initially in Numerical form (22, 25) is converted into
categorical value (young, old). Like, categorical attributes such as house addresses, may be
generalized to higher-level definitions, such as town or country.
Example:

Low-level attribute: Age = 24


Generalized as:
• Age group: "Young"
Low-level attribute: City = San Francisco
Generalized as:
• State: "California"
Normalization
Data normalization involves converting all data variables into a given range. Techniques that are
used for normalization are:
• Min-Max Normalization:
o This transforms the original data linearly.
o Suppose that min_A is the minimum and max_A is the maximum of an attribute A.
o v is the value you want to map into the new range.
o v' is the new value you get after normalizing the old value.
v' = (v - min_A) / (max_A - min_A)
• Z-Score Normalization:
o In z-score normalization (or zero-mean normalization) the values of an attribute (A), are
normalized based on the mean of A and its standard deviation
o A value v of attribute A is normalized to v' by computing using below formula-
v' = (v - mean(A)) / (standard deviation(A))
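A minimal sketch of both normalization formulas in NumPy; the sample values are assumptions for illustration.

```python
import numpy as np

values = np.array([140.0, 155.0, 160.0, 180.0])

# Min-max normalization: v' = (v - min_A) / (max_A - min_A)
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: v' = (v - mean(A)) / (standard deviation(A))
z_score = (values - values.mean()) / values.std()

print(min_max)   # every value now lies between 0 and 1
print(z_score)   # values now have mean 0
```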

Q. Explain Data filtering and munging


Data filtering is the process of choosing a smaller part of your data set and using that subset for
viewing or analysis. Filtering is generally (but not always) temporary – the complete data set is
kept, but only part of it is used for the calculation.

Filtering may be used to:


• Look at results for a particular period of time.
• Calculate results for particular groups of interest.
• Exclude erroneous or "bad" observations from an analysis.
• Train and validate statistical models.

Filtering requires you to specify a rule or logic to identify the cases you want to include in
your analysis. Filtering can also be referred to as “subsetting” data, or a data “drill-down”.
Below, we illustrate a filtered data set and discuss how you might use filtering.

Example of data filtering


The table below shows some of the rows of a data set from a survey about people’s preferred Cola.
The survey data contains demographic information about the respondents as well as each person’s
preferred cola and that person’s rating (out of 5) for each of six varieties of cola.
Filtering this data involves:
1. Coming up with a rule for the observations needed.
2. Selecting the observations that fit the rule.
3. Conducting the analysis using only the information contained in those selected observations.
For example, the table below shows the data filtered for Males only. The darker colored rows are
kept in the analysis, while the remaining rows are excluded. Results computed for Males are then
calculated based on the highlighted rows (ID’s 2, 9, 11, 12, 13, 14). If we want to know the average
rating for Coca-Cola among males, we would compute that as (5 + 5 + 4 + 5 + 5 + 3) / 6 = 4.5.
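A minimal sketch of the same filter in pandas; the column names and the small slice of survey data are assumptions reconstructed from the description above.

```python
import pandas as pd

# Hypothetical slice of the cola survey described above
survey = pd.DataFrame({
    "id":        [2, 5, 9, 11, 12, 13, 14],
    "gender":    ["Male", "Female", "Male", "Male", "Male", "Male", "Male"],
    "coca_cola": [5, 2, 5, 4, 5, 5, 3],
})

# Filtering rule: keep only the Male respondents
males = survey[survey["gender"] == "Male"]

# Average Coca-Cola rating among males: (5 + 5 + 4 + 5 + 5 + 3) / 6 = 4.5
print(males["coca_cola"].mean())
```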
Data munging
Data Munging (also known as Data Wrangling) in data science refers to the process of cleaning,
transforming, and organizing raw data into a usable format for analysis and modeling.

Why is Data Munging Important?


Raw data is often:
• Incomplete
• Inconsistent
• Noisy (contains errors or irrelevant information)
• In different formats (JSON, CSV, Excel, databases, etc.)
Data munging ensures that the data is:
• Clean
• Consistent
• Structured
• Ready for machine learning or statistical analysis

Common Steps in Data Munging


1. Loading the data
Read data from various sources like CSV files, databases, APIs, Excel files, etc.
2. Cleaning the data
o Handling missing values (e.g., filling, dropping)
o Removing duplicates
o Correcting data types (e.g., converting strings to dates)
o Fixing inconsistent formatting (e.g., capitalization, extra spaces)
3. Transforming the data
o Normalization or scaling
o Aggregating data
o Pivoting or melting tables
o Feature engineering (creating new columns from existing ones)
4. Filtering and selecting data
o Removing irrelevant rows or columns
o Keeping only the data you need for analysis
5. Merging and joining datasets
o Combining data from multiple sources (e.g., SQL joins, Pandas merge)
6. Validating the data
o Ensuring data integrity and consistency
o Verifying data types and expected ranges
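A minimal sketch that strings several of the steps above into one pandas workflow; the file name, column names, and cleaning rules are all assumptions for illustration.

```python
import pandas as pd

# 1. Load the data (hypothetical CSV file)
df = pd.read_csv("sales_raw.csv")

# 2. Clean: drop duplicates, fix types and formatting, handle missing values
df = df.drop_duplicates()
df["order_date"] = pd.to_datetime(df["order_date"])
df["region"] = df["region"].str.strip().str.title()
df["amount"] = df["amount"].fillna(df["amount"].median())

# 3. Transform: feature engineering (month column) and aggregation
df["month"] = df["order_date"].dt.to_period("M")
monthly = df.groupby(["region", "month"], as_index=False)["amount"].sum()

# 4. Filter: keep only the rows needed for the analysis
recent = monthly[monthly["month"] >= pd.Period("2024-01", freq="M")]

# 6. Validate: basic sanity checks on integrity and expected ranges
assert recent["amount"].ge(0).all(), "negative sales amounts found"
print(recent.head())
```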

Popular Tools for Data Munging


• Python (Pandas, NumPy)
• R (dplyr, tidyr)
• Excel/Google Sheets
• SQL
• Apache Spark (for large-scale data)
• ETL tools like Talend, Alteryx, or Apache NiFi
