Let’s break down each topic step by step in simple terms for better understanding.
---
### **2.a) Constraints and Influences on Data Architecture Design**
#### **Constraints:**
1. **Scalability**:
- You need to ensure that the data system can handle growth in data size.
- Example: If your business grows and you collect more data, the architecture should not slow down.
2. **Performance**:
- The system must provide fast access to data, even when a lot of users request it simultaneously.
- Example: A website that takes too long to load might lose customers.
3. **Cost**:
- Building and maintaining data systems costs money. You need to design within your budget.
- Example: Cloud storage may be more cost-efficient than buying physical servers.
4. **Compliance**:
- Laws and regulations may limit what data you can store and how you process it.
- Example: GDPR requires customer data protection in the European Union.
5. **Integration**:
- The system must work smoothly with other tools or software you already use.
- Example: A new database must connect easily with your existing reporting tools.
#### **Influences:**
1. **Business Goals**:
- The architecture should align with what the business wants to achieve.
- Example: If the goal is better customer analysis, the design should focus on customer-related data.
2. **Technology Trends**:
- New technologies, like artificial intelligence or real-time data analysis, may shape the design.
- Example: A cloud-based system might replace older local servers.
3. **Stakeholder Needs**:
- People in different roles (managers, analysts) need different types of data.
- Example: Managers may need summaries, while analysts need detailed data.
4. **Security**:
- Protecting sensitive data is critical.
- Example: Encrypt customer data to prevent unauthorized access.
5. **Data Sources**:
- The design must accommodate the type and speed of data.
- Example: A system handling live data (e.g., weather updates) needs real-time capabilities.
---
### **2.b) Data Reduction as a Preprocessing Step**
Data reduction means reducing the size of your dataset while keeping the important parts. This helps make analysis faster
and easier.
#### Steps:
1. **Dimensionality Reduction**:
- Reduce the number of columns or features in the data.
- Example: Instead of using "Height" and "Weight," use "BMI" as a combined measure.
2. **Numerosity Reduction**:
- Represent data with fewer values.
- Example: Use averages or ranges instead of listing every single value.
3. **Data Compression**:
- Compress data to save storage space.
- Example: Store an image as a smaller file without losing important details.
4. **Feature Selection**:
- Select only the most important columns for analysis.
- Example: For predicting house prices, focus on "Location" and "Size," ignoring irrelevant columns like "Owner’s Name."
5. **Sampling**:
- Use a smaller, representative portion of the dataset instead of the entire dataset.
- Example: Instead of analyzing all customer data, use a sample of 1,000 customers.
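The feature-selection and sampling steps above can be sketched in a few lines of Python. The column names and dataset are purely illustrative:

```python
import random

# Illustrative dataset: each row is a dict of features (hypothetical names).
rows = [{"Location": i % 5, "Size": 50 + i, "OwnerName": f"p{i}", "Price": 1000 + 10 * i}
        for i in range(10_000)]

# Feature selection: keep only the columns believed to matter for price,
# dropping irrelevant ones like the owner's name.
selected = [{k: r[k] for k in ("Location", "Size", "Price")} for r in rows]

# Sampling: analyze a representative random subset instead of every row.
random.seed(0)
sample = random.sample(selected, 1000)

print(len(sample), sorted(sample[0].keys()))
```

The same idea scales up directly with a dataframe library, where column selection and `sample()` replace the comprehensions.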
---
### **3.a) Secondary Data and Its Sources**
#### **Secondary Data**:
- Data collected by someone else, for another purpose, that you reuse for your own analysis.
#### **Sources of Secondary Data**:
1. **Internal Sources**:
- Data already available in your organization.
- Example: Sales records, employee details, or past reports.
2. **External Sources**:
- Data obtained from outside your organization.
- Examples:
- **Government Publications**: Census data, crime rates.
- **Industry Reports**: Market trends or consumer behavior studies.
- **Online Sources**: Websites, blogs, or freely available datasets.
- **Databases**: Paid data services like Statista or research databases like JSTOR.
---
### **3.b) Data Quality Assessment**
Assessing data quality depends on how you plan to use it.
#### Key Factors:
1. **Accuracy**:
- The data must be correct.
- Example: Financial data must be accurate to avoid mistakes in tax calculations.
2. **Timeliness**:
- Data should be up-to-date.
- Example: Stock prices must be current for real-time trading platforms.
3. **Completeness**:
- The dataset should not have missing values.
- Example: In medical records, missing diagnosis data can lead to incorrect treatment.
4. **Consistency**:
- The data must be uniform across the system.
- Example: A customer’s address must be the same in billing and shipping systems.
5. **Relevance**:
- The data must be useful for its intended purpose.
- Example: Weather data is irrelevant for analyzing shopping trends.
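Two of these checks, completeness and consistency, can be automated directly. A minimal sketch on a toy record set (field names are illustrative):

```python
# Minimal data-quality checks on a toy record set (field names are illustrative).
records = [
    {"id": 1, "address_billing": "12 Oak St", "address_shipping": "12 Oak St", "amount": 100.0},
    {"id": 2, "address_billing": "5 Elm Rd", "address_shipping": "7 Pine Ave", "amount": None},
]

# Completeness: count records with any missing (None) field.
incomplete = sum(any(v is None for v in r.values()) for r in records)

# Consistency: flag records whose billing and shipping addresses disagree.
inconsistent = [r["id"] for r in records
                if r["address_billing"] != r["address_shipping"]]

print(incomplete, inconsistent)
```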
---
### **4.a) Analytics Techniques**
1. **Descriptive Analytics**:
- Summarizes past data to tell you “what happened.”
- Example: A sales dashboard showing last month’s revenue.
2. **Diagnostic Analytics**:
- Explains “why it happened” by identifying relationships or causes.
- Example: Finding that sales dropped due to bad weather.
3. **Predictive Analytics**:
- Uses historical data to predict future outcomes.
- Example: Forecasting next month’s sales using trends.
4. **Prescriptive Analytics**:
- Suggests actions to achieve a goal.
- Example: Recommending discount offers to increase sales.
5. **Cognitive Analytics**:
- Mimics human thinking and reasoning.
- Example: Chatbots that understand customer questions.
---
### **4.b) Identifying Gaps in Data**
#### Steps:
1. **Look for Missing Values**:
- Check if any cells or fields are empty.
2. **Detect Outliers**:
- Identify values that are too high or low compared to others.
3. **Analyze Metadata**:
- Look at the structure and completeness of the data.
#### Handling Gaps:
1. **Fill Missing Data**:
- Use averages, median, or predictive models.
2. **Remove Inconsistent Data**:
- Delete data points that don’t make sense.
3. **Supplement with New Data**:
- Collect additional data to fill gaps.
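The first handling strategy, filling gaps with an average, can be sketched without any libraries (the values are illustrative):

```python
# Fill missing values (None) with the mean of the observed values --
# a simple imputation sketch.
values = [10.0, None, 14.0, None, 12.0]

observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)  # (10 + 14 + 12) / 3 = 12.0

filled = [mean if v is None else v for v in values]
print(filled)
```

Median imputation works the same way and is less sensitive to outliers; predictive models estimate each missing value from the other columns.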
---
### **5.a) Importance of Data Modeling**
1. **Provides Structure**:
- A clear design helps manage data efficiently.
- Example: Organizing customer data into tables.
2. **Optimizes Performance**:
- Well-structured data speeds up queries and analysis.
- Example: Indexing customer names for faster searches.
3. **Supports Decision-Making**:
- Ensures reliable and accurate insights.
- Example: Helps analyze sales trends to set prices.
4. **Scalable Design**:
- Makes it easy to add new data types as the business grows.
---
### **5.b) NoSQL Tools Based on Data Models**
1. **Key-Value Stores**:
- Simple systems storing data as key-value pairs.
- Example: Redis, DynamoDB.
2. **Document Stores**:
- Store complex data like JSON documents.
- Example: MongoDB.
3. **Column-Family Stores**:
- Store data in columns instead of rows for fast querying.
- Example: Cassandra, HBase.
4. **Graph Databases**:
- Focus on relationships between data points.
- Example: Neo4j for social networks.
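The key-value model is simple enough to sketch in a few lines. This toy in-memory class mimics the basic set/get/delete operations of a store like Redis; it is an illustration of the data model, not a real database:

```python
# A toy in-memory key-value store -- the simplest NoSQL data model.
class KVStore:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.set("user:42", {"name": "Ada"})   # values can be arbitrary objects
print(store.get("user:42"))
```

Document stores extend this idea by making the value a queryable document (e.g. JSON) rather than an opaque blob.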
---
### **6.a) What is meant by BLUE property? What are the BLUE properties of OLS?**
**BLUE** stands for **Best Linear Unbiased Estimator**. It describes the desirable properties of the **Ordinary Least
Squares (OLS)** estimator used in regression analysis.
#### **BLUE Properties of OLS:**
1. **Best**: Among all linear unbiased estimators, the OLS estimator has the smallest variance. This minimum-variance property is also called **efficiency**.
2. **Linear**: The estimator is a linear function of the observed data.
3. **Unbiased**: The expected value of the OLS estimate equals the true population parameter (no systematic error).
(The "E" in BLUE stands for "Estimator"; efficiency is not a separate fourth property but what "Best" means. Together, these properties are what the Gauss-Markov theorem guarantees.)
#### **Conditions for OLS to be BLUE**:
- The errors (residuals) have a mean of zero.
- Errors are uncorrelated with each other (no autocorrelation).
- Errors have constant variance (homoscedasticity).
- Errors follow a normal distribution (not required for BLUE itself, but needed for exact hypothesis tests and confidence intervals).
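Unbiasedness can be checked empirically: simulate many datasets from a known model and verify that the average OLS estimate recovers the true parameter. A sketch with synthetic data (true intercept 2, true slope 3 are chosen arbitrarily):

```python
import numpy as np

# Simulate y = 2 + 3x + noise many times; under the Gauss-Markov conditions
# (zero-mean, constant-variance, uncorrelated errors) the average OLS slope
# should recover the true value 3 -- that is unbiasedness.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])  # design matrix: intercept + slope

slopes = []
for _ in range(500):
    y = 2 + 3 * x + rng.normal(0, 0.5, size=x.size)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS fit
    slopes.append(beta[1])

print(round(float(np.mean(slopes)), 2))  # close to the true slope 3
```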
---
### **6.b) Multinomial Logistic Regression**
Multinomial Logistic Regression is an extension of Logistic Regression used when the outcome variable has **more than
two categories**.
#### **How it works:**
1. Instead of predicting binary outcomes (e.g., 0 or 1), it predicts probabilities for **multiple categories**.
2. For each category, the model calculates the log-odds of the outcome relative to a baseline category.
3. Probabilities are assigned to each category using the **softmax function**.
#### **Applications**:
- **Customer Segmentation**: Predicting whether a customer prefers Product A, B, or C.
- **Medical Diagnosis**: Predicting disease type based on symptoms.
- **Marketing**: Identifying which product category a customer is likely to buy.
#### **Steps in Multinomial Logistic Regression**:
1. Define the target variable (categorical with >2 classes).
2. Convert features into numerical format.
3. Use a multinomial logit model to fit the data.
4. Interpret the coefficients for each class relative to the baseline.
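Step 3 above, turning per-class scores (log-odds) into probabilities, is done by the softmax function. A self-contained sketch with illustrative scores for three classes:

```python
import math

# Softmax: convert per-class log-odds (scores) into probabilities that sum
# to 1, as in the multinomial logit model. Scores here are illustrative.
def softmax(scores):
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # scores for hypothetical classes A, B, C
print([round(p, 3) for p in probs])
```

The class with the largest score gets the largest probability, and the probabilities always sum to 1, which is why softmax is the natural link for multi-class outcomes.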
---
### **7.a) Purpose of Least Squares Estimation in Regression**
**Least Squares Estimation (LSE)** is a method used in regression analysis to find the best-fitting line (or curve) for a
given dataset. It minimizes the **sum of the squared differences** between observed and predicted values.
#### **Purpose**:
- To determine the regression line that best explains the relationship between the independent (X) and dependent (Y)
variables.
#### **Steps**:
1. Compute the predicted value for each observation based on a guessed line.
2. Calculate the error (difference) between observed and predicted values.
3. Square the errors and sum them.
4. Adjust the line to minimize this total error.
#### **Example**:
- For predicting house prices:
- Input: Size of the house (independent variable).
- Output: Predicted price (dependent variable).
- LSE finds the line that minimizes the differences between actual and predicted prices.
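For a straight line, the minimizing coefficients have a closed form, the normal equations. A sketch of the house-price example with made-up data:

```python
import numpy as np

# Fit price = b0 + b1 * size by minimizing the sum of squared errors.
# Data is illustrative; the closed-form solution is the normal equations.
size = np.array([50.0, 70.0, 90.0, 110.0])
price = np.array([150.0, 200.0, 260.0, 310.0])

X = np.column_stack([np.ones_like(size), size])   # intercept + size
beta = np.linalg.solve(X.T @ X, X.T @ price)      # solve (X'X) b = X'y

residuals = price - X @ beta                      # observed minus predicted
sse = float(residuals @ residuals)                # the minimized total error
print(beta, round(sse, 4))
```

No other line achieves a smaller sum of squared residuals on this data; that is exactly what "least squares" means.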
---
### **7.b) Model Fit Statistics**
#### **i) Hosmer-Lemeshow Test**:
- Used in **logistic regression** to test the goodness of fit.
- It divides data into groups based on predicted probabilities and compares observed vs. predicted values.
- A p-value above 0.05 means there is no significant evidence of poor fit (it does not by itself prove the model is good).
#### **ii) Error Matrix (Confusion Matrix)**:
- A table used to evaluate the performance of classification models.
- Components:
- **True Positive (TP)**: Correctly predicted positives.
- **True Negative (TN)**: Correctly predicted negatives.
- **False Positive (FP)**: Predicted positive when actual is negative.
- **False Negative (FN)**: Predicted negative when actual is positive.
- Metrics derived:
- **Accuracy**: (TP + TN) / Total Predictions.
- **Precision**: TP / (TP + FP).
- **Recall**: TP / (TP + FN).
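The three metrics follow directly from the four counts. A quick sketch with illustrative counts:

```python
# Metrics derived from confusion-matrix counts (counts are illustrative).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # share of all predictions correct
precision = TP / (TP + FP)                   # of predicted positives, how many real
recall    = TP / (TP + FN)                   # of real positives, how many found

print(accuracy, round(precision, 3), recall)
```

High precision with low recall means the model is cautious (few false alarms but many misses); the reverse means it over-predicts the positive class.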
---
### **8.a) What is Overfitting? How to Prevent Overfitting?**
#### **What is Overfitting?**
- Overfitting happens when a model learns the **noise and details** in the training data instead of the actual pattern.
This leads to poor performance on new data.
#### **How to Prevent Overfitting?**
1. **Simplify the Model**:
- Use fewer features or simpler algorithms.
2. **Cross-Validation**:
- Test the model on different subsets of data to ensure generalization.
3. **Regularization**:
- Add a penalty term to the loss function to prevent over-complex models (e.g., L1, L2 regularization).
4. **Pruning**:
- Reduce the complexity of decision trees by removing unnecessary branches.
5. **Early Stopping**:
- Stop training the model once performance on validation data stops improving.
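Regularization (item 3) can be sketched concretely with ridge (L2) regression, where the penalty has a closed form. The data and the penalty strength `lam` are illustrative:

```python
import numpy as np

# Ridge (L2) regularization sketch: adding a penalty term lam * I to the
# normal equations shrinks the coefficients, which curbs overfitting.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
true_beta = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # only one feature matters
y = X @ true_beta + rng.normal(0, 0.1, size=30)

lam = 10.0
ols   = np.linalg.solve(X.T @ X, X.T @ y)                     # unpenalized fit
ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)   # penalized fit

# The penalty always shrinks the coefficient vector toward zero.
print(np.linalg.norm(ridge) < np.linalg.norm(ols))
```

L1 (lasso) regularization works similarly but can shrink coefficients exactly to zero, performing feature selection as a side effect.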
---
### **8.b) Compare ARMA and ARIMA**
#### **ARMA (Autoregressive Moving Average):**
- Combines **AR (Autoregressive)** and **MA (Moving Average)** components.
- Used for modeling stationary time series (constant mean and variance).
#### **ARIMA (Autoregressive Integrated Moving Average):**
- Adds an **Integration (I)** step to make non-stationary data stationary.
- Handles trends by differencing the data (seasonal patterns need the seasonal extension, SARIMA).
#### **Key Differences**:
| **Aspect** | **ARMA** | **ARIMA** |
|----------------------|--------------------------|----------------------------------|
| Handles Stationarity | Only stationary data | Both stationary and non-stationary data |
| Differencing | Not required | Required to remove trends |
| Use Cases | Simple time series | Complex, trending time series |
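The "I" (differencing) step is easy to see on a synthetic trending series: one first difference turns a linear trend into a constant.

```python
import numpy as np

# Differencing removes a linear trend -- the "I" step in ARIMA.
t = np.arange(10)
series = 5 + 2 * t        # steadily trending series (illustrative)

diffed = np.diff(series)  # first difference: series[i+1] - series[i]
print(diffed)             # constant 2s: the trend is gone
```

After differencing, the series is stationary and an ARMA model can be fitted to it; that combination is exactly what ARIMA does.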
---
### **9.a) Supervised vs. Unsupervised Learning**
| **Aspect** | **Supervised Learning** | **Unsupervised Learning** |
|-----------------------|---------------------------------|-----------------------------------|
| **Definition** | Learns from labeled data | Learns from unlabeled data |
| **Goal** | Predict or classify | Find patterns or structure |
| **Examples** | Classification, regression | Clustering, dimensionality reduction |
| **Use Cases** | Spam detection, price prediction | Customer segmentation, anomaly detection |
---
### **9.b) STL Approach for Time Series Decomposition**
**STL (Seasonal and Trend Decomposition using Loess)**:
- Breaks down a time series into three components:
1. **Trend**: Long-term direction of the data.
2. **Seasonality**: Regular patterns that repeat over time.
3. **Residual**: Noise or random variation.
**Steps**:
1. Identify the frequency of the seasonal pattern.
2. Use smoothing techniques (like Loess) to extract the trend.
3. Subtract trend and seasonality to get residuals.
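The three steps can be sketched on a synthetic series. Note this is a simplified stand-in: real STL uses loess smoothing for the trend, whereas here a straight-line fit plays that role; the series, period, and amplitudes are all illustrative.

```python
import numpy as np

# Simplified additive decomposition (real STL uses loess; a fitted line
# stands in for the trend smoother here). Synthetic series: trend + seasonality.
period = 4
t = np.arange(40)
series = 0.5 * t + np.tile([2.0, 0.0, -2.0, 0.0], 10)

# 1) Trend: smooth the series (here: a straight-line fit).
coef = np.polyfit(t, series, 1)
trend = np.polyval(coef, t)

# 2) Seasonality: average the detrended values at each position in the cycle.
detrended = series - trend
seasonal = np.array([detrended[i::period].mean() for i in range(period)])

# 3) Residual: what remains after removing trend and seasonality.
residual = detrended - np.tile(seasonal, len(series) // period)
print(np.round(seasonal, 2))  # recovers roughly [2, 0, -2, 0]
```

For real work, `statsmodels.tsa.seasonal.STL` implements the full loess-based procedure and handles slowly changing seasonal patterns, which this sketch cannot.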
---
### **10.a) Geometric Projection Visualization Techniques**
These techniques visualize high-dimensional data in 2D or 3D.
#### **Examples**:
1. **Scatter Plots**:
- Shows the relationship between two variables.
2. **PCA (Principal Component Analysis)**:
- Projects high-dimensional data into fewer dimensions while retaining variance.
3. **t-SNE**:
- Groups similar points together for better visualization of clusters.
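PCA can be sketched directly with an SVD: center the data, then project onto the leading right singular vectors. The synthetic data below makes the third column nearly a copy of the first, so two components capture almost all the variance:

```python
import numpy as np

# PCA sketch: project 3-D points onto the 2 directions of greatest variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)  # third column nearly redundant

Xc = X - X.mean(axis=0)                  # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:2].T                # keep the top-2 principal components

explained = (S**2) / (S**2).sum()        # variance explained per component
print(projected.shape, round(float(explained[:2].sum()), 4))
```

The `projected` array is what gets scattered in 2D; the explained-variance ratio tells you how faithful that 2D picture is.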
---
### **10.b) Applications of Data Visualization**
1. **Business Intelligence**: Dashboards for sales and KPIs.
2. **Scientific Research**: Analyzing experimental data.
3. **Healthcare**: Visualizing patient statistics or disease patterns.
4. **Education**: Creating interactive learning materials.
---
### **11.a) Steps in Tableau Data Visualization**
1. **Connect to Data**:
- Import data from files, databases, or APIs.
2. **Prepare Data**:
- Clean data, remove duplicates, and handle missing values.
3. **Create Visualizations**:
- Drag and drop fields to create charts like bar graphs or maps.
4. **Build Dashboards**:
- Combine multiple visualizations into one interactive display.
5. **Share**:
- Publish the dashboard or export it as a file.
---
### **11.b) Tag Cloud Visualization**
**Tag Cloud**:
- A way to visualize text data where the size of a word reflects its frequency or importance.
#### **Applications**:
- Analyzing customer feedback (common words in reviews).
- Summarizing social media trends.
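A tag cloud starts from word frequencies, which are then mapped to font sizes. A minimal sketch with illustrative review text and an arbitrary size scale:

```python
from collections import Counter

# Count word frequencies in review text (text and scale are illustrative);
# a tag cloud then draws each word at a size proportional to its count.
text = "great service great price slow delivery great value slow support"
counts = Counter(text.split())

# Map each word to a font size proportional to its frequency (arbitrary scale).
sizes = {word: 10 + 5 * n for word, n in counts.items()}
print(counts.most_common(2))
```

Real tag-cloud tools add stop-word removal and layout, but the frequency counting at the core is exactly this.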