The document outlines key concepts in data architecture design, data reduction techniques, secondary data sources, data quality assessment, analytics techniques, and model evaluation methods. It discusses the importance of data modeling, the properties of the Ordinary Least Squares (OLS) estimator, and the differences between supervised and unsupervised learning. Additionally, it covers visualization techniques and applications in data analysis, including the use of tools like Tableau.

Let’s break down each topic step by step in simple terms for better understanding.

---

### **2.a) Constraints and Influences on Data Architecture Design**

#### **Constraints:**

1. **Scalability**:

- You need to ensure that the data system can handle growth in data size.

- Example: If your business grows and you collect more data, the architecture should not slow down.

2. **Performance**:

- The system must provide fast access to data, even when a lot of users request it simultaneously.

- Example: A website that takes too long to load might lose customers.

3. **Cost**:

- Building and maintaining data systems costs money. You need to design within your budget.

- Example: Cloud storage may be more cost-efficient than buying physical servers.

4. **Compliance**:

- Laws and regulations may limit what data you can store and how you process it.

- Example: GDPR requires customer data protection in the European Union.

5. **Integration**:

- The system must work smoothly with other tools or software you already use.

- Example: A new database must connect easily with your existing reporting tools.

#### **Influences:**

1. **Business Goals**:

- The architecture should align with what the business wants to achieve.

- Example: If the goal is better customer analysis, the design should focus on customer-related data.

2. **Technology Trends**:

- New technologies, like artificial intelligence or real-time data analysis, may shape the design.

- Example: A cloud-based system might replace older local servers.


3. **Stakeholder Needs**:

- People in different roles (managers, analysts) need different types of data.

- Example: Managers may need summaries, while analysts need detailed data.

4. **Security**:

- Protecting sensitive data is critical.

- Example: Encrypt customer data to prevent unauthorized access.

5. **Data Sources**:

- The design must accommodate the type and speed of data.

- Example: A system handling live data (e.g., weather updates) needs real-time capabilities.

---

### **2.b) Data Reduction as a Preprocessing Step**

Data reduction means reducing the size of your dataset while keeping the important parts. This helps make analysis faster
and easier.

#### Steps:

1. **Dimensionality Reduction**:

- Reduce the number of columns or features in the data.

- Example: Instead of using "Height" and "Weight," use "BMI" as a combined measure.

2. **Numerosity Reduction**:

- Represent data with fewer values.

- Example: Use averages or ranges instead of listing every single value.

3. **Data Compression**:

- Compress data to save storage space.

- Example: Store an image as a smaller file without losing important details.

4. **Feature Selection**:

- Select only the most important columns for analysis.


- Example: For predicting house prices, focus on "Location" and "Size," ignoring irrelevant columns like "Owner’s Name."

5. **Sampling**:

- Use a smaller, representative portion of the dataset instead of the entire dataset.

- Example: Instead of analyzing all customer data, use a sample of 1,000 customers.
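The sampling and feature-selection steps above can be sketched in a few lines. This is a toy illustration: the `customers` list and its fields are hypothetical stand-ins for a real dataset.

```python
import random

# Hypothetical customer records standing in for a real dataset.
customers = [{"id": i, "spend": i * 10, "name": f"c{i}"} for i in range(100_000)]

# Sampling: analyze a representative subset instead of every record.
random.seed(42)
sample = random.sample(customers, 1_000)

# Feature selection: keep only the column relevant to the analysis,
# dropping irrelevant ones like "name".
reduced = [{"spend": c["spend"]} for c in sample]

print(len(sample), len(reduced[0]))  # 1000 rows, 1 feature each
```

Seeding the random generator makes the sample reproducible, which matters when the analysis has to be rerun or audited.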

---

### **3.a) Secondary Data and Its Sources**

#### **Secondary Data**:

- Data that was collected by someone else, but you use it for your purposes.

#### **Sources of Secondary Data**:

1. **Internal Sources**:

- Data already available in your organization.

- Example: Sales records, employee details, or past reports.

2. **External Sources**:

- Data obtained from outside your organization.

- Examples:

- **Government Publications**: Census data, crime rates.

- **Industry Reports**: Market trends or consumer behavior studies.

- **Online Sources**: Websites, blogs, or freely available datasets.

- **Databases**: Paid data services like Statista or research databases like JSTOR.

---

### **3.b) Data Quality Assessment**

Assessing data quality depends on how you plan to use it.

#### Key Factors:

1. **Accuracy**:

- The data must be correct.


- Example: Financial data must be accurate to avoid mistakes in tax calculations.

2. **Timeliness**:

- Data should be up-to-date.

- Example: Stock prices must be current for real-time trading platforms.

3. **Completeness**:

- The dataset should not have missing values.

- Example: In medical records, missing diagnosis data can lead to incorrect treatment.

4. **Consistency**:

- The data must be uniform across the system.

- Example: A customer’s address must be the same in billing and shipping systems.

5. **Relevance**:

- The data must be useful for its intended purpose.

- Example: Weather data is irrelevant for analyzing shopping trends.

---

### **4.a) Analytics Techniques**

1. **Descriptive Analytics**:

- Summarizes past data to tell you “what happened.”

- Example: A sales dashboard showing last month’s revenue.

2. **Diagnostic Analytics**:

- Explains “why it happened” by identifying relationships or causes.

- Example: Finding that sales dropped due to bad weather.

3. **Predictive Analytics**:

- Uses historical data to predict future outcomes.

- Example: Forecasting next month’s sales using trends.

4. **Prescriptive Analytics**:
- Suggests actions to achieve a goal.

- Example: Recommending discount offers to increase sales.

5. **Cognitive Analytics**:

- Mimics human thinking and reasoning.

- Example: Chatbots that understand customer questions.

---

### **4.b) Identifying Gaps in Data**

#### Steps:

1. **Look for Missing Values**:

- Check if any cells or fields are empty.

2. **Detect Outliers**:

- Identify values that are too high or low compared to others.

3. **Analyze Metadata**:

- Look at the structure and completeness of the data.

#### Handling Gaps:

1. **Fill Missing Data**:

- Use averages, median, or predictive models.

2. **Remove Inconsistent Data**:

- Delete data points that don’t make sense.

3. **Supplement with New Data**:

- Collect additional data to fill gaps.
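A minimal sketch of the first handling strategy, mean imputation, using only the standard library; the `ages` column is a made-up example with `None` marking a gap.

```python
from statistics import mean

# Toy column with gaps (None marks a missing value).
ages = [25, 30, None, 41, None, 28]

# Fill missing entries with the mean of the observed values.
observed = [a for a in ages if a is not None]
fill = mean(observed)  # mean of 25, 30, 41, 28 is 31
filled = [a if a is not None else fill for a in ages]

print(filled)  # [25, 30, 31, 41, 31, 28]
```

Using the median instead of the mean is a one-line change and is more robust when the column contains outliers.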

---

### **5.a) Importance of Data Modeling**

1. **Provides Structure**:

- A clear design helps manage data efficiently.

- Example: Organizing customer data into tables.


2. **Optimizes Performance**:

- Well-structured data speeds up queries and analysis.

- Example: Indexing customer names for faster searches.

3. **Supports Decision-Making**:

- Ensures reliable and accurate insights.

- Example: Helps analyze sales trends to set prices.

4. **Scalable Design**:

- Makes it easy to add new data types as the business grows.

---

### **5.b) NoSQL Tools Based on Data Models**

1. **Key-Value Stores**:

- Simple systems storing data as key-value pairs.

- Example: Redis, DynamoDB.

2. **Document Stores**:

- Store complex data like JSON documents.

- Example: MongoDB.

3. **Column-Family Stores**:

- Store data in columns instead of rows for fast querying.

- Example: Cassandra, HBase.

4. **Graph Databases**:

- Focus on relationships between data points.

- Example: Neo4j for social networks.

Here’s an in-depth explanation for the topics in simple terms:

---
### **6.a) What is meant by BLUE property? What are the BLUE properties of OLS?**

**BLUE** stands for **Best Linear Unbiased Estimator**. It describes the desirable properties of the **Ordinary Least
Squares (OLS)** estimator used in regression analysis.

#### **BLUE Properties of OLS:**

1. **Best**: The OLS estimator has the smallest variance among all *linear unbiased* estimators. This minimum-variance property is also called efficiency.

2. **Linear**: The estimator is a linear function of the observed data.

3. **Unbiased**: The expected value of the OLS estimate equals the true population parameter (no systematic error).

#### **Conditions for OLS to be BLUE (Gauss–Markov assumptions)**:

- The model is linear in its parameters, and the errors (residuals) have a mean of zero.

- Errors are uncorrelated with each other (no autocorrelation) and with the regressors.

- Errors have constant variance (homoscedasticity).

- Normality of the errors is *not* required for BLUE; it is only needed for exact hypothesis tests and confidence intervals.
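For the linear model $y = X\beta + \varepsilon$, the estimator these properties describe has a compact closed form, and under the assumptions above its first two moments are:

```latex
\hat{\beta}_{\text{OLS}} = (X^\top X)^{-1} X^\top y,
\qquad
\mathbb{E}[\hat{\beta}_{\text{OLS}}] = \beta,
\qquad
\operatorname{Var}(\hat{\beta}_{\text{OLS}}) = \sigma^2 (X^\top X)^{-1}.
```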

---

### **6.b) Multinomial Logistic Regression**

Multinomial Logistic Regression is an extension of Logistic Regression used when the outcome variable has **more than
two categories**.

#### **How it works:**

1. Instead of predicting binary outcomes (e.g., 0 or 1), it predicts probabilities for **multiple categories**.

2. For each category, the model calculates the log-odds of the outcome relative to a baseline category.

3. Probabilities are assigned to each category using the **softmax function**.

#### **Applications**:

- **Customer Segmentation**: Predicting whether a customer prefers Product A, B, or C.

- **Medical Diagnosis**: Predicting disease type based on symptoms.

- **Marketing**: Identifying which product category a customer is likely to buy.

#### **Steps in Multinomial Logistic Regression**:


1. Define the target variable (categorical with >2 classes).

2. Convert features into numerical format.

3. Use a multinomial logit model to fit the data.

4. Interpret the coefficients for each class relative to the baseline.
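Step 3 above, assigning probabilities with the softmax function, can be sketched as follows. The `logits` values are hypothetical log-odds that a fitted multinomial model might produce for three categories.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical log-odds for three product categories (A, B, C).
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

print([round(p, 3) for p in probs])  # three probabilities summing to 1
```

The category with the largest logit receives the largest probability, and the probabilities always sum to 1, which is what makes the output interpretable as a distribution over classes.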

---

### **7.a) Purpose of Least Squares Estimation in Regression**

The **Least Squares Estimation (LSE)** is a method used in regression analysis to find the best-fitting line (or curve) for a
given dataset. It minimizes the **sum of the squared differences** between observed values and predicted values.

#### **Purpose**:

- To determine the regression line that best explains the relationship between the independent (X) and dependent (Y)
variables.

#### **Steps**:

1. Compute the predicted value for each observation based on a guessed line.

2. Calculate the error (difference) between observed and predicted values.

3. Square the errors and sum them.

4. Adjust the line to minimize this total error.

#### **Example**:

- For predicting house prices:

- Input: Size of the house (independent variable).

- Output: Predicted price (dependent variable).

- LSE finds the line that minimizes the differences between actual and predicted prices.
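For a single predictor, the line that minimizes the sum of squared errors has a closed form, so the four steps above collapse into direct formulas. The data values here are made up for illustration.

```python
# Least squares for one predictor: slope and intercept in closed form.
xs = [1.0, 2.0, 3.0, 4.0]   # e.g. house size (hypothetical units)
ys = [2.1, 4.1, 6.2, 8.0]   # e.g. observed price

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# slope = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²)  minimizes squared error
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

print(round(slope, 3), round(intercept, 3))  # 1.98 0.15
```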

---

### **7.b) Model Fit Statistics**

#### **i) Hosmer-Lemeshow Test**:

- Used in **logistic regression** to test the goodness of fit.

- It divides data into groups based on predicted probabilities and compares observed vs. predicted values.
- A high p-value (> 0.05) means there is no evidence of poor fit, so the model is considered adequate.

#### **ii) Error Matrix (Confusion Matrix)**:

- A table used to evaluate the performance of classification models.

- Components:

- **True Positive (TP)**: Correctly predicted positives.

- **True Negative (TN)**: Correctly predicted negatives.

- **False Positive (FP)**: Predicted positive when actual is negative.

- **False Negative (FN)**: Predicted negative when actual is positive.

- Metrics derived:

- **Accuracy**: (TP + TN) / Total Predictions.

- **Precision**: TP / (TP + FP).

- **Recall**: TP / (TP + FN).
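The four cell counts and the derived metrics can be computed directly from predicted and actual labels; the label vectors below are a small made-up example.

```python
# Confusion-matrix metrics from predicted vs. actual labels (1 = positive).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)

accuracy  = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

print(tp, tn, fp, fn)                       # 3 3 1 1
print(accuracy, precision, recall)          # 0.75 0.75 0.75
```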

---

### **8.a) What is Overfitting? How to Prevent Overfitting?**

#### **What is Overfitting?**

- Overfitting happens when a model learns the **noise and details** in the training data instead of the actual pattern.
This leads to poor performance on new data.

#### **How to Prevent Overfitting?**

1. **Simplify the Model**:

- Use fewer features or simpler algorithms.

2. **Cross-Validation**:

- Test the model on different subsets of data to ensure generalization.

3. **Regularization**:

- Add a penalty term to the loss function to prevent over-complex models (e.g., L1, L2 regularization).

4. **Pruning**:

- Reduce the complexity of decision trees by removing unnecessary branches.

5. **Early Stopping**:

- Stop training the model once performance on validation data stops improving.
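Regularization (point 3) can be seen in miniature with a one-parameter ridge regression: adding an L2 penalty to the squared-error loss shrinks the fitted coefficient toward zero. The data and penalty value are illustrative only.

```python
# One-parameter ridge: minimizing  sum((y - b*x)^2) + lam * b^2
# has the closed form  b = sum(x*y) / (sum(x^2) + lam).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def ridge_beta(lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

print(ridge_beta(0.0))   # lam = 0 is plain least squares: 2.0
print(ridge_beta(10.0))  # a positive penalty shrinks the coefficient
```

Larger penalties mean simpler (smaller-coefficient) models, which is exactly the trade-off that limits overfitting.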

---
### **8.b) Compare ARMA and ARIMA**

#### **ARMA (Autoregressive Moving Average):**

- Combines **AR (Autoregressive)** and **MA (Moving Average)** components.

- Used for modeling stationary time series (constant mean and variance).

#### **ARIMA (Autoregressive Integrated Moving Average):**

- Adds an **Integration (I)** step to make non-stationary data stationary.

- Handles trends by differencing the data; seasonal patterns require the seasonal extension (SARIMA).

#### **Key Differences**:

| **Aspect**           | **ARMA**             | **ARIMA**                               |
|----------------------|----------------------|-----------------------------------------|
| Handles stationarity | Only stationary data | Both stationary and non-stationary data |
| Differencing         | Not required         | Required to remove trends               |
| Use cases            | Simple time series   | Complex, trending time series           |
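The "I" in ARIMA is just differencing, which can be shown in one line on a toy trending series:

```python
# First-order differencing turns a trending (non-stationary) series into one
# with a roughly constant mean, which an ARMA model can then fit.
series = [10, 12, 15, 19, 24, 30]   # trending upward, non-stationary

diffed = [b - a for a, b in zip(series, series[1:])]
print(diffed)  # [2, 3, 4, 5, 6]
```

If one round of differencing is not enough to remove the trend, the step can be applied again (second-order differencing).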

---

### **9.a) Supervised vs. Unsupervised Learning**

| **Aspect**     | **Supervised Learning**          | **Unsupervised Learning**                |
|----------------|----------------------------------|------------------------------------------|
| **Definition** | Learns from labeled data         | Learns from unlabeled data               |
| **Goal**       | Predict or classify              | Find patterns or structure               |
| **Examples**   | Classification, regression       | Clustering, dimensionality reduction     |
| **Use Cases**  | Spam detection, price prediction | Customer segmentation, anomaly detection |

---

### **9.b) STL Approach for Time Series Decomposition**

**STL (Seasonal and Trend Decomposition using Loess)**:

- Breaks down a time series into three components:


1. **Trend**: Long-term direction of the data.

2. **Seasonality**: Regular patterns that repeat over time.

3. **Residual**: Noise or random variation.

**Steps**:

1. Identify the frequency of the seasonal pattern.

2. Use smoothing techniques (like Loess) to extract the trend.

3. Subtract trend and seasonality to get residuals.
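The steps above can be sketched in simplified form: a centered moving average stands in for the Loess smoother that STL actually uses, and the toy series has an obvious period-2 seasonal pattern.

```python
# Simplified decomposition sketch (moving average instead of Loess).
series = [10, 14, 12, 16, 14, 18]   # toy series with period-2 seasonality

window = 3
trend = [
    sum(series[i - 1:i + 2]) / window   # centered 3-point average
    for i in range(1, len(series) - 1)
]
# Subtracting the trend leaves seasonality plus residual noise.
detrended = [series[i] - trend[i - 1] for i in range(1, len(series) - 1)]

print(trend)      # [12.0, 14.0, 14.0, 16.0]
print(detrended)  # [2.0, -2.0, 2.0, -2.0] — the seasonal swing
```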

---

### **10.a) Geometric Projection Visualization Techniques**

These techniques visualize high-dimensional data in 2D or 3D.

#### **Examples**:

1. **Scatter Plots**:

- Shows the relationship between two variables.

2. **PCA (Principal Component Analysis)**:

- Projects high-dimensional data into fewer dimensions while retaining variance.

3. **t-SNE**:

- Groups similar points together for better visualization of clusters.

---

### **10.b) Applications of Data Visualization**

1. **Business Intelligence**: Dashboards for sales and KPIs.

2. **Scientific Research**: Analyzing experimental data.

3. **Healthcare**: Visualizing patient statistics or disease patterns.

4. **Education**: Creating interactive learning materials.

---

### **11.a) Steps in Tableau Data Visualization**


1. **Connect to Data**:

- Import data from files, databases, or APIs.

2. **Prepare Data**:

- Clean data, remove duplicates, and handle missing values.

3. **Create Visualizations**:

- Drag and drop fields to create charts like bar graphs or maps.

4. **Build Dashboards**:

- Combine multiple visualizations into one interactive display.

5. **Share**:

- Publish the dashboard or export it as a file.

---

### **11.b) Tag Cloud Visualization**

**Tag Cloud**:

- A way to visualize text data where the size of a word reflects its frequency or importance.

#### **Applications**:

- Analyzing customer feedback (common words in reviews).

- Summarizing social media trends.
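The word frequencies that drive word size in a tag cloud are a simple count; the review text below is a made-up example.

```python
from collections import Counter

# Word frequency determines word size in a tag cloud.
reviews = "great service great price slow delivery great value slow refund"
counts = Counter(reviews.split())

# The most frequent words would be rendered largest.
print(counts.most_common(2))  # [('great', 3), ('slow', 2)]
```

A real pipeline would also lowercase the text and drop stop words ("the", "a", ...) before counting, so that only meaningful words dominate the cloud.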
