DS Unit 1
Data Science is the process of collecting, analyzing, and interpreting data to find useful patterns
and insights. It combines mathematics, statistics, programming, and domain knowledge to
solve real-world problems.
In today’s world, industries generate a huge amount of data. Data Science helps in making
better decisions, improving efficiency, and predicting future trends based on this data.
1. Data Collection: Gathering data from various sources like databases, websites, sensors,
and social media.
2. Data Cleaning: Removing errors, missing values, and duplicates to improve data quality.
3. Data Analysis: Finding patterns, trends, and relationships in the data using statistics and
visualization.
4. Machine Learning: Building models to make predictions or automate decision-making.
5. Data Visualization: Creating charts, graphs, and dashboards to present insights in an
understandable way.
6. Decision-Making: Using extracted knowledge to solve problems and improve business
strategies.
1. Healthcare – Disease Prediction & Diagnosis
• Data Science helps analyze medical records, genetic data, and patient history to predict
diseases like cancer or diabetes.
• Machine learning models assist doctors in diagnosing diseases early and suggesting
personalized treatments.
• Example: AI-powered tools like IBM Watson Health help doctors make better decisions.
2. Finance – Fraud Detection & Risk Management
• Banks and financial institutions use Data Science to detect suspicious transactions and
prevent fraud.
• Credit scoring models assess a person’s financial history to determine loan eligibility.
• Example: PayPal uses machine learning to detect fraudulent transactions in real-time.
3. Retail – Personalized Recommendations
• E-commerce platforms analyze browsing and purchase history to recommend products customers are likely to buy.
• Example: Amazon and Netflix use recommendation engines to personalize what each user sees.
Data Warehousing (DW), Data Mining (DM), and Data Science are all related fields focused on
extracting useful insights from data, but they differ in scope, methods, and applications. Let’s
explore their similarities and differences.
• Data Warehousing (DW): A system for storing and managing structured data from
multiple sources to support business intelligence (BI) and reporting.
• Data Mining (DM): The process of discovering patterns, correlations, and trends in large
datasets using techniques like classification, clustering, and association rule mining.
• Data Science (DS): A multidisciplinary field that involves data processing, statistical
analysis, machine learning (ML), and predictive modeling to derive insights and make
data-driven decisions.
1. Data Utilization: Both DW-DM and Data Science work with large datasets to extract
valuable insights.
2. Analytical Techniques: Data Mining and Data Science employ machine learning,
statistical modeling, and pattern recognition.
3. Decision-Making Support: They help businesses make informed decisions, whether
through BI reports (DW-DM) or predictive analytics (DS).
4. Automation & AI: Both can leverage automation for data processing, though Data
Science extends further into AI and deep learning.
Discuss the importance of Data Preprocessing in the Data Science pipeline and
its impact on the quality of analysis and modeling outcomes.
Data preprocessing involves several steps aimed at improving the quality of data before it is used
in machine learning (ML) models. It typically includes:
• Data Cleaning: Handling missing values, removing duplicates, and correcting errors.
• Data Transformation: Normalization, standardization, and feature engineering.
• Data Reduction: Dimensionality reduction and feature selection.
• Data Integration: Merging multiple data sources to create a unified dataset.
• Reduced and optimized datasets speed up computation, making training models more
efficient.
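As a rough illustration of several of these steps together, the following is a minimal sketch using pandas and scikit-learn; the file name "data.csv" and its columns are hypothetical.

# Minimal preprocessing sketch; "data.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")

# Data cleaning: drop duplicate rows and fill missing numeric values with column means
df = df.drop_duplicates()
df = df.fillna(df.mean(numeric_only=True))

# Data transformation: standardize all numeric columns
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

print(df.head())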
Define structured data and provide examples of structured datasets. Describe
the characteristics of structured data.
Structured data is highly organized data that follows a specific format, making it easy to
store, search, and analyze in relational databases (like SQL).
Structured data refers to highly organized data that is stored in a predefined format, typically in
relational databases or spreadsheets. It follows a tabular structure with rows and columns, where
each column represents a specific attribute, and each row contains values for these attributes.
Structured data is easily searchable and processable using SQL and other query languages.
1. Customer Database
o Columns: Customer ID, Name, Email, Phone Number, Address, Purchase
History
o Rows: Each row represents a unique customer’s details.
2. Employee Records in HR Systems
o Columns: Employee ID, Name, Department, Salary, Date of Joining
o Rows: Each row represents an employee’s information.
3. Sales Transactions
o Columns: Transaction ID, Product ID, Customer ID, Date, Amount
o Rows: Each row records an individual sales transaction.
4. Stock Market Data
o Columns: Stock Symbol, Date, Open Price, Close Price, Volume Traded
o Rows: Each row represents stock price details for a specific day.
5. Bank Account Information
o Columns: Account Number, Customer Name, Balance, Account Type, Last
Transaction Date
o Rows: Each row corresponds to an individual bank account.
1. Predefined Schema: The data follows a fixed format, with defined relationships between
tables.
2. Organized in Tables: Stored in relational databases (SQL-based) where rows represent
records and columns define attributes.
3. Easily Searchable: Query languages like SQL enable efficient data retrieval and
manipulation.
4. Highly Scalable: Can be managed efficiently using database management systems
(DBMS).
5. Consistent and Accurate: Data integrity is maintained through constraints (e.g., unique
keys, foreign keys).
6. Efficient Storage and Processing: Optimized for quick access, indexing, and structured
querying.
7. Supports Transactions: Used in systems that require ACID (Atomicity, Consistency,
Isolation, Durability) compliance, like banking and enterprise applications.
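Because structured data lives in tables with a fixed schema, it can be queried directly with SQL. Below is a minimal sketch using Python's built-in sqlite3 module; the customers table and its values are made up for illustration.

# Tiny in-memory example of storing and querying structured data with SQL (values are made up).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "asha@example.com"), (2, "Ravi", "ravi@example.com")],
)
# The predefined schema makes retrieval a simple, efficient query
for row in conn.execute("SELECT name, email FROM customers WHERE customer_id = 2"):
    print(row)
conn.close()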
Data in Data Science can be categorized into three types based on its format, organization, and
ease of processing: Structured, Unstructured, and Semi-Structured Data. Each type plays a
critical role in data storage, analysis, and decision-making.
1. Structured Data
Definition:
• Data that is highly organized and stored in a fixed format (tables, rows, and columns).
• It follows a predefined schema and is easy to search using SQL queries.
Examples: Relational database tables, Excel spreadsheets, and online transaction records.
2. Unstructured Data
Definition:
• Data that does not have a predefined format or structure.
• It is difficult to store in relational databases and requires special processing techniques.
Examples: Images, videos, audio files, free-form text, and social media posts.
3. Semi-Structured Data
Definition:
• Data that does not follow a strict structure like structured data but has some level of
organization.
• It contains tags, labels, or metadata that help define relationships.
Examples: JSON and XML files, emails, and HTML web pages.
Unstructured data, such as text, images, videos, and social media posts, is difficult to manage due
to its lack of a predefined format. Below are the key challenges and their solutions:
• Challenge: Unstructured data requires large storage space and does not fit into traditional
relational databases.
• Solution: Use NoSQL databases (MongoDB, Cassandra) and cloud storage solutions
(AWS S3, Google Cloud Storage) for scalable and cost-effective storage.
• Challenge: Unstructured data often includes sensitive information (emails, chat logs,
medical records), making it vulnerable to breaches.
• Solution: Apply encryption, access controls, and compliance frameworks (GDPR,
HIPAA) to protect data.
Data can be categorized into three types: structured, semi-structured, and unstructured,
based on its format and organization.
1. Structured Data
• Definition: Data that is highly organized and stored in a fixed format, typically in tables
with rows and columns.
• Example: Databases, Spreadsheets, Online Transaction Records
• Characteristics:
o Stored in relational databases (SQL)
o Easily searchable using queries
o Has a predefined schema
Example: A customer table with columns such as Customer ID, Name, and Email, where each row stores one customer’s details.
2. Semi-Structured Data
• Definition: Data that does not follow a strict table structure but still has tags, labels, or
markers to organize elements.
• Example: JSON, XML, Email, HTML Web Pages
• Characteristics:
o Partially organized with flexible structure
o Uses tags/keys to define elements
o Easier to store and analyze than unstructured data
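To see how tags/keys carry the structure, here is a minimal sketch of reading a semi-structured JSON record in Python; the record itself is made up.

# Parsing a small, made-up semi-structured JSON record: keys act as tags that define the structure.
import json

record = '{"customer_id": 1, "name": "Asha", "orders": [{"item": "book", "amount": 250}]}'
data = json.loads(record)
print(data["name"], data["orders"][0]["amount"])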
3. Unstructured Data
• Definition: Data with no predefined format or structure; it is difficult to store in relational databases and requires specialized processing techniques (e.g., NLP for text, computer vision for images).
• Example: Images, videos, audio files, social media posts, and free text.
For instance, a YouTube video or a scanned handwritten document – both contain information but lack a
structured format for easy analysis.
1. Databases
Advantages:
• Efficient Storage & Retrieval: Databases handle large datasets efficiently with indexing
and querying.
• Data Integrity & Consistency: Ensure structured and reliable data with constraints and
relationships.
• Concurrency & Security: Support multiple users and provide access control.
Disadvantages:
• Setup and Maintenance: Requires a DBMS, schema design, and ongoing administration.
• Cost and Complexity: Servers, licenses, and skilled administrators add overhead.
2. Flat Files (CSV, JSON, Excel)
Advantages:
• Simple & Portable: Easy to store, share, and use across platforms.
• No Setup Required: Does not need a dedicated server or software.
• Human-Readable: Formats like CSV and JSON are easy to read and edit.
Disadvantages:
• Limited Scalability: Not ideal for large datasets due to slow processing.
• Lack of Security & Integrity: Files can be easily modified or lost.
• Data Handling Issues: No direct support for concurrent access.
3. APIs (Web Services)
Advantages:
• Real-time Data Access: Fetches the latest data from online sources.
• Integration with Web Services: Connects with multiple platforms and services.
• Dynamic & Scalable: Can provide large amounts of up-to-date data without manual
downloads.
Disadvantages:
• Dependency on External Services: API changes or downtimes can affect data access.
• Rate Limits & Costs: Some APIs have usage restrictions and may require payment.
• Data Cleaning Needs: API data may require preprocessing to fit analysis needs.
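As a concrete sketch of API-based collection: the endpoint URL and parameters below are hypothetical, and the requests library is one common choice.

# Fetching JSON data from a hypothetical API endpoint with the requests library.
import requests

response = requests.get("https://api.example.com/data", params={"limit": 10}, timeout=10)
response.raise_for_status()   # fail loudly on rate limits or downtime
records = response.json()     # API data often still needs cleaning before analysis
print(len(records))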
Describe the process of data collection through web scraping and its
importance in data acquisition.
Web scraping is the process of extracting data from websites using automated scripts or tools. It
enables Data Scientists to collect large amounts of publicly available information for analysis,
research, and business applications. The extracted data can be structured and stored in a database
or file for further processing.
Process of Data Collection through Web Scraping
Before scraping, the first step is to select a website that contains the required data. Then:
• Use browser developer tools (F12 → Inspect Element) to examine the HTML
structure of the webpage.
• Identify relevant tags (e.g., <div>, <span>, <table>) and attributes (e.g., class, id) that
contain the target data.
• Use libraries like requests in Python to send a GET request to the website’s URL and
retrieve the HTML source code.
• Example in Python (requests fetches the page; BeautifulSoup, a common choice not named above, parses it):
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
html_content = response.text

# Parse the HTML and print the text of each matching element
soup = BeautifulSoup(html_content, "html.parser")
for item in soup.find_all("div"):
    print(item.text)
Illustrate how data from social media platforms can be leveraged for
sentiment analysis and market research purposes.
Leveraging Social Media Data for Sentiment Analysis and Market Research
Social media platforms like Twitter, Facebook, Instagram, and LinkedIn generate vast
amounts of user-generated content. This data can be analyzed to understand public
sentiment, track trends, and improve marketing strategies.
1. Sentiment Analysis
Definition: Sentiment analysis helps determine whether a post, comment, or tweet expresses a
positive, negative, or neutral opinion.
1. Data Collection: Gather data from social media using APIs (e.g., Twitter API, Facebook
Graph API).
2. Preprocessing: Clean text by removing stop words, special characters, and unnecessary
symbols.
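For illustration, a minimal sentiment-scoring sketch follows; it assumes NLTK's VADER analyzer (one possible tool, not prescribed above), and the posts are made-up examples.

# Scoring made-up posts with NLTK's VADER sentiment analyzer (requires nltk to be installed).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)   # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

posts = [
    "I love the new phone, the camera is amazing!",    # made-up positive post
    "Terrible customer service, never buying again.",  # made-up negative post
]
for post in posts:
    scores = analyzer.polarity_scores(post)            # neg/neu/pos/compound scores
    label = "positive" if scores["compound"] > 0 else "negative"
    print(label, round(scores["compound"], 2), post)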
2. Market Research
Definition: Market research involves analyzing trends, customer preferences, and competitor
strategies using social media data.
Key Use Cases: tracking trending topics and hashtags, monitoring brand mentions, analyzing customer preferences, and benchmarking against competitor activity.
Both sensor data (from IoT devices, industrial sensors, wearables, etc.) and social media data
(from platforms like Twitter, Facebook, Instagram) present unique challenges in data collection,
storage, processing, and analysis. Effective strategies are essential for handling and extracting
meaningful insights from these data sources.
Challenges in Sensor Data
Challenges:
1. High Volume and Velocity – Sensors produce data continuously, leading to storage and
processing issues.
2. Data Quality Issues – Noisy, missing, or inaccurate readings can affect analysis.
3. Heterogeneous Data Formats – Different sensors generate data in different formats,
requiring standardization.
4. Real-time Processing – Many applications require instant analysis, which can be
computationally expensive.
5. Security and Privacy – Sensitive data from healthcare or industrial sensors must be
protected.
Challenges in Social Media Data
Social media data is unstructured, diverse (text, images, videos), and highly dynamic. It is
mainly used for sentiment analysis, trend detection, and market research.
Challenges:
1. Data Noise and Spam – Fake accounts and irrelevant posts create misleading insights.
2. Data Volume and Velocity – Millions of posts are generated per second, making real-time
processing difficult.
3. Unstructured and Multimodal Data – Text, images, and videos require different processing
techniques.
4. Sentiment Misinterpretation – Sarcasm and ambiguous words can lead to incorrect sentiment
classification.
5. Privacy Concerns – User-generated content must be handled ethically to comply with
regulations.
• Dirty data (missing values, duplicates, or incorrect entries) can lead to wrong
conclusions.
• Cleaning ensures data is correct, reliable, and consistent.
• Example: Handling missing values by replacing them with the mean.
import pandas as pd
df = pd.read_csv("data.csv")
df.fillna(df.mean(numeric_only=True), inplace=True)  # replace missing values with the column mean
• Machine learning models perform better with clean and well-structured data.
• Example: A model trained on noisy data may give inaccurate predictions.
• Merging different datasets often results in inconsistencies due to variations in formats, units, or
missing values.
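A small sketch of why merging needs care; both DataFrames and their values are hypothetical.

# Merging two hypothetical datasets: the left join exposes typical inconsistencies.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ravi"]})
purchases = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [250, 100, 75]})

# Customer 2 has no purchases (NaN appears) and customer 3 is unknown (its row is dropped),
# illustrating the inconsistencies that arise when combining sources.
merged = customers.merge(purchases, on="customer_id", how="left")
print(merged)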
Ans: Data cleaning is an essential step in Data Science to ensure high-quality and reliable data
for analysis and modeling. Below are the key steps involved in data cleaning along with
techniques to handle missing values, outliers, and duplicates.
Data cleaning involves identifying and addressing issues like missing values, outliers, and
duplicates to ensure data quality and accuracy for analysis or modeling. Common techniques
include imputation for missing values, outlier removal or transformation, and deduplication
methods for handling duplicates.
1. Steps in Data Cleaning
import pandas as pd
df = pd.read_csv("data.csv")
print(df.isnull().sum()) # Count missing values in each column
• Missing values can lead to inaccurate results and must be handled properly.
• Techniques to Handle Missing Values:
1. Removal: Drop rows or columns with many missing values.
2. Imputation: Replace missing values with the mean, median, or mode.
• Outliers are extreme values that can distort analysis and model performance.
• Techniques to Handle Outliers:
1. Removing Outliers: Use the Interquartile Range (IQR) method.
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df_cleaned = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
2. Removing Duplicates and Standardizing Text:
df.drop_duplicates(inplace=True)             # remove duplicate rows
df['Category'] = df['Category'].str.lower()  # standardize text formatting
df.to_csv("cleaned_data.csv", index=False)   # save the cleaned dataset
Standardization (Z-score Scaling)
• Rescales features to have zero mean and unit variance.
• Applied in algorithms like Logistic Regression, Support Vector Machines (SVM), and
Principal Component Analysis (PCA).
Formula: x_scaled = (x − μ) / σ, where μ is the feature mean and σ its standard deviation.
3. Robust Scaling
• Uses median and interquartile range (IQR) instead of mean and standard deviation.
• Used in robust regression and outlier-resistant models.
Formula: x_scaled = (x − median) / IQR
2. Normalization
Normalization transforms features to follow a specific statistical distribution, typically between
0 and 1 or -1 and 1. It helps with:
• Ensuring uniformity in datasets where variables have different ranges.
• Preventing dominance of large-scale features over small-scale ones.
• Speeding up convergence in gradient descent algorithms (e.g., Deep Learning).
Min-Max Scaling (Normalization)
• Scales data between 0 and 1.
• Useful for models like Neural Networks and K-Nearest Neighbors (KNN) that are
distance-based.
Formula: x_scaled = (x − x_min) / (x_max − x_min)
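To make the three scaling approaches concrete, here is a minimal scikit-learn sketch on a tiny made-up feature that contains one outlier.

# Comparing StandardScaler, RobustScaler, and MinMaxScaler on a made-up feature with an outlier.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

x = np.array([[10.0], [12.0], [11.0], [300.0]])    # 300 acts as an outlier

print(StandardScaler().fit_transform(x).ravel())   # zero mean, unit variance (outlier-sensitive)
print(RobustScaler().fit_transform(x).ravel())     # median/IQR based, outlier-resistant
print(MinMaxScaler().fit_transform(x).ravel())     # squeezed into the [0, 1] range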
1. One-Hot Encoding
• Converts each category into a separate binary column. For a City feature with the values Paris, London, and Berlin:
Paris → (1, 0, 0)
London → (0, 1, 0)
Berlin → (0, 0, 1)
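A one-line pandas sketch of the same idea; the City column is hypothetical.

# One-hot encoding a made-up City column with pandas.
import pandas as pd

df = pd.DataFrame({"City": ["Paris", "London", "Berlin", "Paris"]})
print(pd.get_dummies(df["City"]))   # one binary indicator column per category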
2. Label Encoding
• Assigns each category a unique integer (e.g., "Red" = 0, "Green" = 1, "Blue" = 2).
• Introduces an ordinal relationship, which may be misleading for unordered categories.
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
data['education_level_encoded'] = label_encoder.fit_transform(data['education_level'])
Discuss the importance of feature selection in machine learning models and the criteria
used for selecting relevant features.
Importance of Feature Selection in Machine Learning
Feature selection is the process of identifying and selecting the most relevant features (variables)
from a dataset for training a machine learning model. It plays a crucial role in improving model
performance, reducing complexity, and preventing overfitting.
Criteria for Selecting Relevant Features
1. Statistical Significance:
o Features that have a strong correlation with the target variable are more useful for
making predictions.
2. Variance Threshold:
o Features with very little variation across samples may not contribute much
information and can be removed.
3. Correlation Analysis:
o Highly correlated features provide redundant information, and one of them can be
removed to avoid multicollinearity.
4. Recursive Feature Elimination (RFE):
o This method systematically removes less important features to improve model
performance.
5. Feature Importance Scores:
o Machine learning models such as decision trees and random forests assign
importance scores to features, helping in selecting the most significant ones.
6. Chi-Square Test:
o This statistical test is used to determine the relevance of categorical features in
classification problems.
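A minimal scikit-learn sketch of Recursive Feature Elimination (criterion 4 above); the built-in Iris dataset is used only as a stand-in.

# Recursive Feature Elimination: iteratively drop the weakest features (Iris used as a stand-in dataset).
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True, as_frame=True)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)
print(X.columns[selector.support_])   # names of the two retained features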
Outline the process of data merging and the challenges associated with combining multiple
datasets for analysis.
Discuss the challenges and strategies involved in data merging when combining multiple
datasets for analysis.
Analyze the impact of data preprocessing on the quality and effectiveness of machine
learning algorithms.
Data preprocessing is a crucial step in machine learning that involves transforming raw data into a clean
and structured format. It directly influences the quality, accuracy, and efficiency of machine learning
models. Poorly processed data can lead to biased results, low accuracy, and inefficiencies.
Define data wrangling and explain its role in preparing raw data for analysis.
Data Wrangling is the process of cleaning, transforming, and organizing raw data into a
structured format suitable for analysis. It involves converting messy and complex data into a
more understandable and usable form.
1. Data Collection:
o Gathering data from various sources such as databases, files, APIs, or web
scraping.
o Example: Collecting sales data from different stores.
2. Data Cleaning:
o Removing or correcting inaccurate, incomplete, or duplicate data.
o Example: Handling missing values, fixing typos, and removing duplicate records.
3. Data Transformation:
o Converting data into a consistent format.
o Example: Changing date formats, converting text to lowercase, or scaling
numerical values.
4. Data Merging:
o Combining data from multiple sources into a single dataset.
o Example: Merging customer details with purchase history for better analysis.
5. Data Filtering:
o Selecting only relevant data for analysis.
o Example: Filtering out customers below a certain age or products with zero sales.
6. Data Enrichment:
o Adding new information to the dataset to improve analysis.
o Example: Calculating customer age based on birth date.
7. Data Validation:
o Checking data consistency and accuracy.
o Example: Ensuring no negative values exist in the price column.
Data wrangling involves various techniques to organize and prepare raw data for analysis. Some
of the most common techniques include reshaping, pivoting, and aggregating.
1. Reshaping
Reshaping refers to changing the structure or format of a dataset to make it suitable for analysis.
Python Example:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Math': [90, 85], 'Science': [88, 80]})   # wide format
df_melted = df.melt(id_vars='ID', var_name='Subject', value_name='Score')  # wide -> long
print(df_melted)
Output:
ID Subject Score
0 1 Math 90
1 2 Math 85
2 1 Science 88
3 2 Science 80
2. Pivoting
Pivoting is the reverse of reshaping. It transforms a long dataset into a wide format. This is
useful when summarizing data in a tabular format.
Python Example:
df_pivoted = df_melted.pivot(index='ID', columns='Subject', values='Score')  # long -> wide
print(df_pivoted)
Output:
Subject Math Science
ID
1 90 88
2 85 80
3. Aggregating
Aggregating summarizes data by computing statistics (such as sum, mean, or count) over groups of rows, typically with groupby operations.
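A short sketch using the same student scores as above.

# Aggregating the melted scores by Subject (same made-up data as the reshaping example).
import pandas as pd

df_melted = pd.DataFrame({
    'ID': [1, 2, 1, 2],
    'Subject': ['Math', 'Math', 'Science', 'Science'],
    'Score': [90, 85, 88, 80],
})
print(df_melted.groupby('Subject')['Score'].mean())   # Math 87.5, Science 84.0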
Feature engineering is the process of creating new features or modifying existing ones to
improve the performance of machine learning models. Well-engineered features help models
learn patterns more effectively, leading to better accuracy and efficiency.
New features can be derived from existing data to enhance model learning, for example by combining existing columns, as in the sketch below.
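A minimal sketch; the Price and Quantity columns are hypothetical.

# Deriving a new Total_Spend feature from two hypothetical columns.
import pandas as pd

df = pd.DataFrame({'Price': [100, 250], 'Quantity': [2, 1]})
df['Total_Spend'] = df['Price'] * df['Quantity']   # new feature built from existing ones
print(df)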
Time-series data requires special feature engineering techniques, such as extracting date
components or lag features.
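The table below could be produced by a snippet like the following (reconstructed to match the output; the Sales values come from the table itself).

# Creating a lag feature: each row gets the previous period's Sales value.
import pandas as pd

df = pd.DataFrame({'Sales': [200, 220, 250]})
df['Sales_Lag_1'] = df['Sales'].shift(1)
print(df)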
Sales Sales_Lag_1
0 200 NaN
1 220 200.0
2 250 220.0
Lag features help capture time-dependent relationships in the data.
Data Science relies on powerful libraries that help with data manipulation, numerical
computations, and machine learning. Three of the most commonly used libraries are Pandas,
NumPy, and Scikit-Learn.
1. Pandas (Data Manipulation and Analysis)
Pandas is a library used for handling and analyzing structured data. It provides DataFrames and
Series objects to store and process tabular data efficiently.
Key Functionalities:
• Data Loading – Importing datasets from CSV, Excel, JSON, SQL, etc.
• Data Cleaning – Handling missing values, duplicates, and inconsistencies.
• Data Transformation – Filtering, grouping, merging, and reshaping datasets.
• Statistical Analysis – Calculating mean, median, mode, and other statistics.
2. NumPy (Numerical Computing)
NumPy (Numerical Python) is essential for performing mathematical and statistical operations
on large datasets. It provides n-dimensional arrays (ndarrays) and functions optimized for
numerical calculations.
Key Functionalities:
• N-Dimensional Arrays – Creating and manipulating ndarrays efficiently.
• Vectorized Operations – Element-wise arithmetic without explicit Python loops.
• Mathematical and Statistical Functions – Linear algebra, aggregation (mean, standard deviation), and random number generation.
• Broadcasting – Combining arrays of different shapes in a single operation.
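A small sketch of these ideas on a made-up array.

# Vectorized NumPy operations on a small made-up array.
import numpy as np

a = np.array([1, 2, 3, 4])
print(a * 2)               # element-wise multiplication: [2 4 6 8]
print(a.mean(), a.std())   # fast aggregate statistics
b = a.reshape(2, 2)        # n-dimensional reshaping
print(b @ b)               # matrix multiplication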
3. Scikit-Learn (Machine Learning)
Scikit-Learn (sklearn) is a powerful machine learning library used for training, evaluating,
and deploying models. It provides built-in functions for supervised and unsupervised
learning.
Key Functionalities:
• Data Preprocessing – Scaling, encoding categorical variables, and handling missing values.
• Supervised Learning – Classification and regression algorithms.
• Unsupervised Learning – Clustering and dimensionality reduction.
• Model Evaluation – Accuracy, precision, recall, F1-score, and cross-validation utilities.
Describe how Pandas facilitates data manipulation tasks such as reading, cleaning, and
transforming datasets
Pandas is a powerful Python library used for handling structured data efficiently. It provides
tools for reading, cleaning, and transforming datasets, making it essential for Data Science
and Machine Learning tasks.
1. Reading Data in Pandas
Pandas allows importing data from various file formats, such as CSV, Excel, JSON, and SQL
databases.
Example: df = pd.read_csv("data.csv")
Other formats: pd.read_excel(), pd.read_json(), and pd.read_sql().
Cleaning data ensures accuracy and removes inconsistencies like missing values, duplicates, and
incorrect data types.
Removing Duplicates
df.drop_duplicates(inplace=True)
Transformation involves modifying, reshaping, and aggregating data for better analysis.
Discuss the advantages of using NumPy for numerical computing and its role in scientific
computing applications. OR Discuss the role of NumPy in numerical computing and its
advantages over traditional Python lists.
Explain how Sci-kit Learn facilitates machine learning tasks such as model training,
evaluation, and deployment.
Scikit-learn is a widely used Python library for machine learning. It provides efficient tools for
data preprocessing, model training, evaluation, and deployment. It supports various machine
learning algorithms, including classification, regression, clustering, and dimensionality
reduction.
1. Model Training
Scikit-learn simplifies the process of training machine learning models. It provides ready-to-use
implementations of algorithms like Logistic Regression, Decision Trees, Support Vector
Machines, and Random Forests. Users can easily fit a model to training data using simple
functions.
2. Model Evaluation
The library includes built-in functions to evaluate model performance using metrics such as
accuracy, precision, recall, and F1-score. These evaluation methods help in assessing the
effectiveness of a model and improving its accuracy.
3. Data Preprocessing
Scikit-learn provides tools for handling missing values, feature scaling, and encoding categorical
variables. These preprocessing techniques ensure that data is clean and suitable for machine
learning models.
4. Model Deployment
After training, models can be saved and reloaded for future use. Scikit-learn allows users to
deploy trained models in real-world applications, making it useful for industry and research.
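A compact sketch of this train–evaluate–save workflow; the Iris dataset is a stand-in, and joblib (a common companion library, not named above) is assumed for saving the model.

# Train, evaluate, and persist a model with scikit-learn (Iris used as a stand-in dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)        # model training
model.fit(X_train, y_train)

preds = model.predict(X_test)                    # model evaluation
print("Accuracy:", accuracy_score(y_test, preds))

joblib.dump(model, "model.joblib")               # save the trained model for reuse/deployment
reloaded = joblib.load("model.joblib")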
Discuss the importance of using libraries and technologies in Data Science projects for
efficient and scalable data analysis.
In Data Science, libraries and technologies play a crucial role in making data analysis efficient,
scalable, and easier to implement. They provide pre-built functions and optimized algorithms
that help data scientists process large datasets quickly and accurately.
1. Faster Data Manipulation
Libraries like Pandas and NumPy enable quick data manipulation, reducing the time required
for operations like filtering, sorting, and aggregation.
2. Easier Machine Learning Implementation
Technologies like Scikit-learn and TensorFlow offer pre-built models and algorithms, making it
easier to implement and train machine learning models without writing complex code from
scratch.
3. Scalability for Large Datasets
Frameworks like Apache Spark and Dask allow handling large datasets efficiently, enabling
parallel processing and distributed computing for better scalability.
4. Clear Data Visualization
Libraries like Matplotlib and Seaborn help create clear visual representations of data, making it
easier to understand patterns, trends, and correlations.
5. Enhanced Data Cleaning and Transformation
Tools such as OpenRefine and Pandas assist in handling missing values, normalizing data, and
performing feature engineering, ensuring high-quality datasets.
6. Automation of Repetitive Tasks
Using libraries ensures that repetitive tasks, such as data preprocessing and model training, can
be automated, improving efficiency and consistency across projects.