Module 2
Modelling
Data Collection
Data collection is the process of gathering and analyzing accurate data from various sources to
answer research questions, identify trends and probabilities, support efficient decision making,
and evaluate possible outcomes.
The methods used to collect data can vary widely depending on the nature of the
project, the type of data required, and the sources of the data.
Primary Data Collection Methods
● Primary data collection involves gathering data directly from original sources
for a specific purpose.
● Primary data collection methods are crucial when existing datasets are
inadequate for the analysis or when specific data points are required.
3. Observations
● Description: Collecting data by observing subjects in their natural environment.
● Types: Participant observation, non-participant observation.
● Tools: Observation checklists, video recordings.
● Use Cases: Behavioral studies, usability testing.
4. Experiments
● Description: Collecting data by conducting controlled experiments.
● Types: Laboratory experiments, field experiments.
● Tools: Experimental setups, statistical software.
● Use Cases: Scientific research, A/B testing in marketing.
Secondary Data Collection Methods
3. Commercial Datasets
● Description: Data collected and sold by companies.
● Sources: Nielsen, Experian, Statista.
● Use Cases: Market research, business analytics.
6. Web Data
● Description: Data collected from websites using web scraping.
● Sources: Web scraping, RSS feeds.
● Use Cases: Competitive analysis, price monitoring.
Example: Building a Dataset to Predict Housing Prices
1. Define the Objective
Clearly define what you want to achieve with your dataset. For this example, the objective is to predict housing prices.
2. Identify Data Sources
Identify all possible sources of data that can help achieve your objective. For predicting housing prices, these might include real estate listing websites, public datasets, and APIs that provide demographic and historical price data.
3. Collect Web Data
Collect current real estate listings from websites such as Zillow using Python libraries like BeautifulSoup (a minimal sketch follows below).
4. Collect Data via APIs
Use APIs to collect additional data such as demographic information and historical prices.
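A minimal sketch of step 3, assuming the requests library alongside BeautifulSoup and a hypothetical listings page with made-up CSS classes; production sites such as Zillow generally require an official API or explicit permission for automated collection.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical listings page used purely for illustration.
URL = "https://example.com/listings"

response = requests.get(URL, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

rows = []
# The CSS selectors below are assumptions about the page structure.
for card in soup.select("div.listing-card"):
    rows.append({
        "price": card.select_one("span.price").get_text(strip=True),
        "size_sqft": card.select_one("span.size").get_text(strip=True),
        "bedrooms": card.select_one("span.beds").get_text(strip=True),
    })

listings = pd.DataFrame(rows)
print(listings.head())
```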
5. Integrate Data
Combine all the collected data into a single dataset.
6. Final Dataset Example
Here's a simplified example of what your final dataset might look like:
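A pandas sketch, with hypothetical column names and values, showing both the integration step and the kind of table the final dataset might resemble:

```python
import pandas as pd

# Hypothetical frames standing in for the scraped listings, API data, and public datasets.
listings = pd.DataFrame({
    "zip_code": ["02139", "02139", "94103"],
    "size_sqft": [850, 1200, 640],
    "bedrooms": [2, 3, 1],
    "sale_price": [620000, 815000, 705000],
})
demographics = pd.DataFrame({
    "zip_code": ["02139", "94103"],
    "median_income": [89000, 104000],
})
price_history = pd.DataFrame({
    "zip_code": ["02139", "94103"],
    "avg_price_last_year": [598000, 688000],
})

# Integrate everything into a single dataset keyed on zip_code.
final_dataset = listings.merge(demographics, on="zip_code").merge(price_history, on="zip_code")
print(final_dataset)
```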
Considerations
● Data Quality: Ensure all data is accurate and up-to-date.
● Privacy: Be mindful of any privacy concerns and ensure compliance with regulations.
● Relevance: Ensure the data is relevant to your objective.
● Scalability: Ensure your data collection and integration process can handle large volumes of data.
Data Preparation
● Data preparation is a fundamental step in the data science process that
involves transforming raw data into a format suitable for analysis.
● Within this process, ensuring data cleaning, validity, and quality are critical
components that significantly impact the accuracy and reliability of any
analysis or model.
Importance of Data Cleaning in Data Preparation
Data cleaning is a crucial step in the data preparation process, which involves identifying and
rectifying errors, inconsistencies, and inaccuracies in the data.
This step is fundamental for ensuring that the data is of high quality and suitable for analysis,
ultimately leading to more reliable and accurate results.
1. Improves Data Quality
Data Quality: High-quality data is accurate, complete, and reliable. Cleaning data improves its
quality by removing errors and filling in missing values, which leads to better decision-making.
● Example: In a customer database, correcting misspelled names and addresses ensures that
communication reaches the intended recipients, thereby improving customer service and
satisfaction.
2. Improves Accuracy of Analysis
Accuracy: Clean data provides a true representation of the underlying phenomena. Errors and
inconsistencies in data can lead to incorrect conclusions and flawed analysis.
● Example: In financial analysis, ensuring that transaction data is free from duplicates and
incorrect entries results in accurate financial reports and forecasts.
3. Supports Effective Decision-Making
Decision-Making: Decisions based on clean data are more likely to be correct and effective.
Inaccurate data can lead to poor decisions that may have significant negative consequences.
● Example: In a supply chain, accurate inventory data ensures that stock levels are maintained
appropriately, preventing both stockouts and overstock situations.
4. Ensures Consistency Across Datasets
Consistency: Clean data ensures that all datasets used in analysis follow the same standards and
formats, making it easier to combine and compare data from different sources.
● Example: In a healthcare setting, ensuring that patient data from various departments uses
consistent formats for dates and measurements allows for more straightforward integration
and comprehensive patient care analysis.
5. Increases Efficiency
Efficiency: Clean data reduces the complexity and time required for data processing and analysis.
It minimizes the need for extensive data preprocessing steps, enabling faster insights.
● Example: In a machine learning project, having clean data from the start reduces the need
for extensive data wrangling, allowing data scientists to focus more on model building and
tuning.
6. Enhances Model Performance
Model Performance: In machine learning and predictive analytics, the quality of the input data
significantly impacts the model's performance. Clean data leads to more accurate and
generalizable models.
● Example: In a predictive maintenance application, clean sensor data from machinery leads to
more accurate predictions of equipment failures, enabling timely maintenance and reducing
downtime.
7. Prevents Data Corruption
Data Corruption: Ensuring that data is clean prevents the propagation of errors through the
system. Unclean data can corrupt subsequent data processing steps, leading to widespread
issues.
● Example: In a database migration project, ensuring that the data is clean before the
migration prevents data corruption and loss during the transfer process.
Steps in Data Cleaning
1. Removing Duplicates:
○ Identify and remove duplicate records to ensure each entity is uniquely represented.
○ Example: Removing duplicate customer entries in a CRM system.
2. Handling Missing Values:
○ Identify missing values and decide on a strategy to handle them, such as imputation or
deletion.
○ Example: Filling missing age values in a demographic dataset using the mean or median age.
3. Correcting Errors:
○ Identify and correct errors and inaccuracies in the data.
○ Example: Correcting incorrect product prices in a sales database.
4. Standardizing Data:
○ Ensure consistency in data formats and standards.
○ Example: Standardizing date formats across different datasets.
5. Validating Data:
○ Ensure that the data values fall within the expected ranges and adhere to business rules.
○ Example: Validating that all postal codes in an address dataset are valid.
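A minimal pandas sketch of these five steps, assuming a hypothetical customers.csv with name, age, signup_date, and postal_code columns:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file and columns

# 1. Removing duplicates: keep a single row per customer record.
df = df.drop_duplicates()

# 2. Handling missing values: impute missing ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())

# 3. Correcting errors: drop rows with impossible ages.
df = df[(df["age"] > 0) & (df["age"] < 120)]

# 4. Standardizing data: one consistent date format across the dataset.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# 5. Validating data: flag postal codes that do not match a 5-digit pattern.
df["valid_postal"] = df["postal_code"].astype(str).str.match(r"^\d{5}$")
```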
Importance of Data Validity in Data Preparation
1. Reliability of Analysis:
● Impact: Valid data ensures the results of any analysis are trustworthy and reliable, leading to
accurate conclusions.
● Example: In clinical research, if patient data, such as age, gender, and medical history, is
valid, the study's findings about a drug’s effectiveness will be reliable and can be confidently
used for further research or approvals.
2. Accuracy of Insights:
● Impact: Valid data leads to precise and accurate insights, avoiding misleading outcomes and
ensuring sound decision-making.
● Example: In a market analysis, if survey responses are valid and accurately reflect consumer
preferences, the insights drawn about market trends will be correct, guiding effective
marketing strategies.
3. Compliance with Regulations:
● Impact: Ensuring data validity helps organizations comply with industry regulations and
standards, avoiding legal and financial repercussions.
● Example: In financial reporting, valid transaction records ensure compliance with regulations.
4. Effective Decision-Making:
● Impact: Decisions based on valid data are more likely to be correct and effective, reducing
the risk of errors and improving outcomes.
● Example: In supply chain management, if inventory data is valid and accurately represents
stock levels, decisions about restocking and inventory management will be effective, reducing
the risk of stockouts or overstock situations.
5. Building Trust and Credibility:
● Impact: Valid data builds trust among stakeholders and users, ensuring the credibility of the
data and the analyses derived from it.
● Example: In healthcare, valid patient data ensures that medical professionals trust the data
for making treatment decisions, enhancing the credibility of the healthcare institution.
Example of Ensuring Data Validity
Scenario: A financial institution is preparing data for assessing the credit risk of its customers.
1. Accuracy:
○ Action: Verify that customer income data is accurately recorded from reliable sources, such as tax
returns or pay stubs.
○ Impact: Ensures the credit risk assessment reflects the true financial status of the customers.
2. Consistency:
○ Action: Ensure credit scores are consistently formatted and within valid ranges, using standardized
scales (e.g., FICO scores).
○ Impact: Prevents discrepancies in credit risk assessment due to varied scoring methods.
3. Completeness:
○ Action: Ensure all required fields, such as age, employment status, and credit history, are fully completed
in the dataset.
○ Impact: Provides a comprehensive view of each customer’s creditworthiness, avoiding incomplete
assessments.
4. Timeliness:
○ Action: Ensure the data is up-to-date, reflecting the latest customer information, such as recent changes
in employment or financial status.
○ Impact: Ensures the credit risk assessment is based on current and relevant data, leading to accurate
evaluations.
Impact of Valid Data:
● Reliable Credit Risk Model: A credit risk model built on valid data accurately
assesses the creditworthiness of customers, reducing the likelihood of defaults.
● Regulatory Compliance: Ensures the financial institution complies with regulatory
requirements for accurate and transparent credit assessments.
● Trust and Credibility: Builds trust with customers and stakeholders, as decisions
and assessments are based on valid and reliable data.
Data validity is a cornerstone of effective data preparation, ensuring that the data used in
analysis and decision-making is accurate, reliable, and representative of real-world
scenarios.
By prioritizing data validity, organizations can produce precise insights, comply with
regulations, make effective decisions, and build trust among stakeholders.
Importance of Data Quality in Data Preparation
Data quality refers to the condition of data based on factors such as accuracy,
completeness, consistency, and timeliness.
High-quality data is essential for effective data analysis, decision-making, and reliable
outcomes in any data-driven project.
Importance of Data Quality
1. Accuracy of Analysis and Insights
● Impact: Ensures that the results of data analysis are correct, leading to reliable
insights and sound decision-making.
● Example: In a healthcare setting, accurate patient data is critical for diagnosing
diseases and recommending treatments.
2. Reliability of Models and Predictions
● Impact: High-quality data enhances the performance of predictive models, making them
more reliable and generalizable.
● Example: In financial services, accurate historical transaction data improves the reliability of
credit risk models.
3. Efficiency in Data Processing
● Impact: Reduces the time and effort required for data cleaning and preprocessing, making
data processing more efficient.
● Example: In business intelligence, clean and well-structured sales data speeds up the
generation of reports and dashboards.
4. Consistency Across Systems
● Impact: Ensures seamless integration and comparison across different systems and
datasets.
● Example: In a global corporation, consistent financial data across regions allows for
accurate consolidated financial reporting.
5. Trust and Credibility
● Impact: Builds trust among stakeholders and users, ensuring the credibility of the data and
the analyses derived from it.
● Example: In scientific research, high-quality data ensures that research findings are credible
and can be replicated.
6. Minimization of Errors
● Impact: Reduces the risk of errors in analysis and decision-making, leading to better
outcomes.
● Example: In logistics, high-quality inventory data minimizes errors in stock management,
reducing the risk of overstocking or stockouts.
Example of Ensuring Data Quality
Scenario: An e-commerce company is preparing data for a sales forecasting model.
1. Accuracy:
○ Action: Cross-check sales transaction records with receipts to ensure data accuracy.
○ Impact: Ensures that the sales data accurately reflects the actual transactions, leading to
precise sales forecasts.
2. Completeness:
○ Action: Ensure all relevant fields, such as product ID, quantity sold, and sale date, are filled.
○ Impact: Provides a comprehensive dataset for analysis, avoiding incomplete forecasts.
3. Consistency:
● Action: Standardize data formats, such as using a consistent date format (e.g.,
YYYY-MM-DD) across all records.
● Impact: Facilitates seamless data integration and comparison, ensuring consistent analysis.
4. Timeliness:
● Action: Ensure that the sales data is up-to-date and includes the most recent
transactions.
● Impact: Ensures that the sales forecasts are based on current data, leading to accurate
and relevant predictions.
5. Removing Duplicates:
● Action: Identify and remove duplicate sales transaction records.
● Impact: Prevents double-counting of transactions, which would distort the forecast.
Impact of High-Quality Data:
● Reliable Sales Forecasting Model: A sales forecasting model built on high-quality data
accurately predicts future sales, helping the company manage inventory and optimize supply
chain operations.
● Effective Decision-Making: Ensures that business decisions, such as promotional strategies
and pricing adjustments, are based on accurate sales data.
● Enhanced Customer Satisfaction: Accurate sales forecasts help the company maintain
optimal stock levels, reducing the likelihood of stockouts and overstock situations, thereby
enhancing customer satisfaction.
Probability and Statistics basics
What is Data?
● Data refers to any information that is collected, stored, and used for various
purposes.
● It can be numbers, text, images, or other forms of information that can be
processed by computers.
● Data — a collection of facts (numbers, words, measurements, observations,
etc) that has been translated into a form that computers can process.
Types of Data
1. Quantitative Data / Numerical Data
Quantitative data represents quantities and is numerical.
● Discrete Data: This data can take on only specific, distinct values and is often counted.
○ Example: Number of students in a class, number of cars in a parking lot, number
of books on a shelf.
● Continuous Data: This data can take on any value within a given range and is often
measured.
○ Example: Height of individuals, temperature, time, weight.
2. Qualitative Data
Qualitative data represents qualities or characteristics and is descriptive.
● Nominal Data: This data consists of names or labels without any specific order.
○ Example: Types of fruits (apple, banana, cherry), gender, eye color (blue, green,
brown).
● Ordinal Data: This data consists of ordered categories, where the order matters but the
differences between the categories are not necessarily uniform.
○ Example: Movie ratings (poor, fair, good, excellent), education level (high school,
bachelor's, master's, doctorate), customer satisfaction.
Statistics Definition
Statistics is the science of collecting, analyzing, interpreting, presenting, and
organizing data. It provides methodologies and tools for making inferences and
predictions based on data.
Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset, such as measures of central tendency, spread, and association.
Covariance is a statistical measure that indicates the extent to which two variables
change together.
A positive covariance indicates that the two variables tend to increase
together.
A negative covariance indicates that one variable tends to increase when the
other decreases.
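A small NumPy sketch computing the covariance of two illustrative variables (the values are made up):

```python
import numpy as np

# Hypothetical paired measurements, e.g. hours studied and exam score.
x = np.array([2, 4, 6, 8, 10])
y = np.array([65, 70, 74, 82, 90])

# np.cov returns the covariance matrix; the off-diagonal entry is cov(x, y).
cov_xy = np.cov(x, y)[0, 1]
print(cov_xy)  # positive here, so x and y tend to increase together
```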
Data format refers to the structure and type of data available for analysis. Common formats
include structured data (like databases and spreadsheets), semi-structured data (like JSON
and XML files), and unstructured data (like text, images, and videos).
● Structured Data:
○ Method Suitability: Structured data is highly organized and easily searchable,
making it suitable for traditional statistical analysis, machine learning algorithms,
and database queries.
○ Tools: SQL, Excel, R, Python (Pandas, NumPy), Tableau.
○ Example: Company sales data stored in an RDBMS can be analyzed with SQL queries to
identify trends and generate reports (see the sketch below).
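A brief sketch of that example from Python, assuming a hypothetical SQLite database with a sales table containing region and amount columns:

```python
import sqlite3
import pandas as pd

# Hypothetical SQLite database holding structured sales records.
conn = sqlite3.connect("sales.db")

query = """
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
"""
report = pd.read_sql_query(query, conn)  # results arrive as a DataFrame for reporting
print(report)
conn.close()
```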
Semi-Structured Data:
Unstructured Data:
Data volume refers to the amount of data available for analysis. It ranges from
small datasets that can be handled on a single machine to large-scale datasets
that require distributed computing environments.
5. Splitting the Data:
● Divide the data into training and testing sets, typically using an 80/20 or 70/30
split.
6. Choosing a Model:
7. Training the Model:
● Use the training dataset to train the chosen model, adjusting parameters to
minimize prediction error.
8. Model Evaluation:
10. Validation:
● Validate the model with a different dataset or through k-fold cross-validation to ensure
robustness.
11. Deployment:
● Implement the model into a production environment where it can make real-time predictions.
12. Communicating Results:
● Interpret the results and communicate the insights to stakeholders in a clear and actionable
manner.
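A condensed scikit-learn sketch of splitting, training, evaluating, and cross-validating, assuming a hypothetical housing.csv with a numeric price target:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical tabular dataset with a 'price' target column.
data = pd.read_csv("housing.csv")
X = data.drop(columns=["price"])
y = data["price"]

# Step 5: 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 6-7: choose and train a simple model.
model = LinearRegression()
model.fit(X_train, y_train)

# Step 8: evaluate on the held-out test set.
print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))

# Step 10: 5-fold cross-validation for a more robust estimate.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("CV MAE:", -scores.mean())
```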
Feature Engineering
Feature engineering is the process of transforming raw data into features that are
suitable for machine learning models. In other words, it is the process of selecting,
extracting, and transforming the most relevant features from the available data to build
more accurate and efficient machine learning models.
What is a Feature?
A feature is an individual measurable property or characteristic of the data, typically represented as a column in a dataset, that serves as an input to a model.
The choice and quality of features are critical in machine learning, as they can greatly
impact the accuracy and performance of the model.
1. Understand the Data
● Domain Knowledge: Gain a deep understanding of the dataset and the domain from
which it comes. Know what each feature represents and how it might impact the target
variable.
● Exploratory Data Analysis (EDA): Conduct thorough EDA to understand the
distribution, relationships, and patterns in the data.
2. Data Cleaning
● Handling Missing Values: Decide how to deal with missing data, either by imputing,
removing, or using advanced techniques like KNN imputation.
● Removing Duplicates: Ensure there are no duplicate records that could skew the
analysis.
● Correcting Errors: Identify and correct any obvious data entry errors or
inconsistencies.
3. Create New Features
● Feature Generation: Create new features based on existing ones.
This could involve mathematical operations, aggregations, or
domain-specific transformations.
○ Example: Calculate the age of a house as Current Year -
Year Built.
○ Example: Price_per_SqFt = Sale_Price / Size
● Domain Knowledge: Use domain-specific knowledge to create meaningful features.
● Example: BMI = Weight / (Height^2) in healthcare.
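A short pandas sketch of these generated features, using made-up housing records:

```python
import pandas as pd
from datetime import datetime

# Hypothetical housing records.
df = pd.DataFrame({
    "Year_Built": [1995, 2010, 1978],
    "Sale_Price": [320000, 450000, 260000],
    "Size": [1600, 2100, 1400],  # square feet
})

# Age of the house: Current Year - Year Built.
df["House_Age"] = datetime.now().year - df["Year_Built"]

# Price per square foot.
df["Price_per_SqFt"] = df["Sale_Price"] / df["Size"]

print(df)
```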
4. Feature Transformation
Feature transformation involves changing the format or distribution of features to
make them more suitable for modeling. This helps in normalizing the data,
reducing skewness, and improving model performance.
Examples:
● Log Transformation: Reducing skewness in features with a heavy-tailed
distribution.
○ Example: Log_Size = log(Size)
● Polynomial Features: Capturing non-linear relationships.
○ Example: Size^2, Size^3
● One-Hot Encoding: Convert categorical variables into binary columns.
● Label Encoding: Assign a unique integer to each category.
● Target Encoding: Replace each category with the mean of the target
variable for that category.
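A brief pandas/NumPy sketch of several of these transformations (column names are hypothetical); target encoding would additionally require the target column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Size": [1600, 2100, 1400, 3000],
    "Neighborhood": ["North", "South", "North", "East"],
})

# Log transformation: reduce skew in a heavy-tailed feature.
df["Log_Size"] = np.log(df["Size"])

# Polynomial feature: capture a non-linear relationship.
df["Size_Squared"] = df["Size"] ** 2

# Label encoding: a unique integer per category.
df["Neighborhood_Code"] = df["Neighborhood"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
df = pd.get_dummies(df, columns=["Neighborhood"], prefix="Hood")

print(df)
```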
5. Feature Extraction
Feature extraction involves reducing the dimensionality of data while
retaining important information. This is particularly useful for
high-dimensional data like text or images.
Examples:
● Principal Component Analysis (PCA): Reducing dimensionality by
projecting data onto a lower-dimensional space.
○ Example: Reducing a dataset with 100 features to a dataset with 10
principal components.
● Text Features: Extracting n-grams or TF-IDF scores from text data.
○ Example: Extracting bi-grams from customer reviews.
● Extract Year and Month from Date :
● Sale_Year = Year(Sale_Date)
● Sale_Month = Month(Sale_Date)
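A short sketch of PCA and date-part extraction, assuming scikit-learn and illustrative random data:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# PCA: project 100-dimensional data onto 10 principal components.
X = np.random.rand(500, 100)  # illustrative high-dimensional data
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (500, 10)

# Extract Year and Month from a date column.
sales = pd.DataFrame({"Sale_Date": pd.to_datetime(["2023-01-15", "2023-06-30"])})
sales["Sale_Year"] = sales["Sale_Date"].dt.year
sales["Sale_Month"] = sales["Sale_Date"].dt.month
print(sales)
```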
6. Feature Selection
Feature selection involves identifying and selecting the most important features that
contribute to the target variable. This helps in reducing overfitting, improving model
interpretability, and reducing computational cost.
Examples:
● Correlation Analysis: Selecting features highly correlated with the target
variable.
○ Example: Selecting features with a correlation coefficient above a certain
threshold.
● Model-Based Selection: Using models like Lasso regression to identify
important features.
○ Example: Using a decision tree to rank feature importance.
● Variance Threshold: Remove features with low variance as they might not
provide much information.
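A minimal scikit-learn sketch of correlation-based, variance-based, and model-based selection; the data and thresholds are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Lasso

# Illustrative dataset: f0 and f1 drive the target, the rest are noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
y = 3 * X["f0"] + 0.5 * X["f1"] + rng.normal(size=200)

# Correlation analysis: keep features whose |correlation| with y exceeds 0.3.
corr = X.corrwith(y).abs()
selected_by_corr = corr[corr > 0.3].index.tolist()

# Variance threshold: drop near-constant features.
X_high_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# Model-based selection: Lasso shrinks unimportant coefficients toward zero.
lasso = Lasso(alpha=0.1).fit(X, y)
important = X.columns[lasso.coef_ != 0].tolist()

print(selected_by_corr, X_high_var.shape, important)
```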
7. Feature Scaling
Feature scaling involves normalizing or standardizing features to ensure they contribute
equally to the model. This is crucial for models that rely on distance metrics.
Examples:
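As examples, a minimal scikit-learn sketch of the two most common approaches, standardization and min-max normalization (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Illustrative feature matrix, e.g. size in square feet and number of bedrooms.
X = np.array([[1600.0, 3], [2100.0, 4], [1400.0, 2]])

# Standardization: zero mean and unit variance per feature.
X_standardized = StandardScaler().fit_transform(X)

# Min-max normalization: rescale each feature to the [0, 1] range.
X_normalized = MinMaxScaler().fit_transform(X)

print(X_standardized)
print(X_normalized)
```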
Model Evaluation and Selection Techniques
2. Holdout Method
3. Bootstrapping
4. Information Criteria
● Akaike Information Criterion (AIC): Used for model comparison where models
with lower AIC values are preferred. AIC balances model fit and complexity.
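A short sketch of comparing two candidate models by AIC, assuming statsmodels and simulated data:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data where y depends linearly on x.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Model 1: intercept + x.
X1 = sm.add_constant(x)
m1 = sm.OLS(y, X1).fit()

# Model 2: intercept + x + x^2 (extra, unneeded complexity).
X2 = sm.add_constant(np.column_stack([x, x ** 2]))
m2 = sm.OLS(y, X2).fit()

# Lower AIC is preferred; the penalty on the extra parameter typically favors the simpler model here.
print("AIC model 1:", m1.aic)
print("AIC model 2:", m2.aic)
```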
● Develop an API using Flask or Django in Python that allows external systems
to interact with the model.
● The API might accept input data in JSON format (e.g., current weather
conditions) and return predictions (e.g., expected temperature, rain
probability) in real-time.
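A minimal Flask sketch of such an API, assuming a model trained earlier and saved as model.pkl; the endpoint path and JSON field names are hypothetical:

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical pre-trained weather model serialized during training.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Example payload: {"humidity": 0.7, "pressure": 1012, "wind_speed": 5.2}
    payload = request.get_json()
    features = [[payload["humidity"], payload["pressure"], payload["wind_speed"]]]
    prediction = model.predict(features)[0]
    return jsonify({"expected_temperature": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```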
Batch Processing:
Containerization:
● The model and its API can be containerized using Docker and deployed
as a microservice on a platform like Kubernetes or AWS ECS for
scalability and independent updates.