Unit 1: Data Warehousing

Overview and Definition:

Data warehousing refers to the process of collecting, managing, and analyzing large volumes of data
from different sources to support decision-making. It enables organizations to consolidate data into a
central repository for efficient querying and reporting.

Components:

1. Data Sources: Data is extracted from heterogeneous sources such as transactional
databases, flat files, and external systems.

2. ETL (Extract, Transform, Load) Tools: These tools are used to extract data, transform it into a
suitable format, and load it into the data warehouse.

3. Data Warehouse Database: A centralized repository where transformed data is stored.

4. Metadata: Information about data structure, source, and usage, essential for data
management.

5. Query Tools: Tools that allow users to retrieve and analyze data, including reporting tools,
OLAP tools, and data mining tools.
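To make the ETL component concrete, here is a minimal sketch using only the Python standard library. The sample CSV, the `sales` table, and the function names `extract`, `transform`, and `load` are illustrative assumptions, not part of any particular ETL tool; an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source with inconsistent formatting and a missing amount.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,20.00
3,alice,
"""

def extract(text):
    """Extract: read rows from a CSV flat file (here an in-memory string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names, convert types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse database
load(transform(extract(RAW_CSV)), conn)

# A simple query-tool step: total sales per customer.
total_by_customer = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(total_by_customer)
```

The three stages are kept as separate functions so that each can be tested or replaced on its own, which mirrors how the components are listed above.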

Building a Data Warehouse:

1. Requirements gathering.

2. Designing the warehouse schema.

3. ETL process implementation.

4. Testing and validation.

5. Deployment and maintenance.

Mapping to Multiprocessor Architecture:

Data warehouses are mapped to multiprocessor systems to enhance performance. Common
architectures include:

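• Shared-nothing: Each processor has its own memory and disks; nodes communicate over a network.

• Shared-disk: Each processor has private memory, but all processors share the disks.

• Shared-memory: All processors share a common memory and the disks.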
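As a rough illustration of the shared-nothing idea, the sketch below hash-partitions rows across a few simulated nodes, lets each node aggregate only its own partition, and then merges the partial results. The function names and sample rows are hypothetical, and plain Python lists stand in for separate machines.

```python
import zlib
from collections import defaultdict

def node_for(key, num_nodes):
    """Pick a node for a key with a deterministic hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def partition(rows, num_nodes):
    """Distribute (key, value) rows so each key lives on exactly one node."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes

def local_sum(rows):
    """Aggregation performed independently on one node's partition."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

def merge(partials):
    """Combine per-node results; keys never overlap across nodes."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged

rows = [("east", 10.0), ("west", 5.0), ("east", 20.0), ("north", 7.5)]
nodes = partition(rows, num_nodes=3)
totals = merge(local_sum(node_rows) for node_rows in nodes)
print(totals)
```

Because rows are partitioned on the grouping key, no node needs data held by another node to compute its part of the answer, which is the property that lets shared-nothing systems scale out.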

Difference Between Database System and Data Warehouse:

Database System                                | Data Warehouse
Optimized for transaction processing (OLTP).   | Optimized for analytical processing (OLAP).
Stores current data.                           | Stores historical data.
Normalized schema.                             | Denormalized schema.

Multi-Dimensional Data Model:


• Organizes data in a cube structure to support OLAP operations.

• Data Cubes: Represent data dimensions and measures.

• Schemas:

o Star Schema: Simplified structure with fact tables linked to dimension tables.

o Snowflake Schema: Normalized dimensions for complex hierarchies.

o Fact Constellations: Multiple fact tables sharing dimension tables.
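A small star schema can be sketched with SQLite from the Python standard library. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who/what/when" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

-- The fact table holds the measures plus a foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 2, 2000.0), (1, 2, 1, 1000.0), (2, 1, 3, 450.0);
""")

# A typical star-join query: aggregate a measure by a dimension attribute.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)  # [('Electronics', 3000.0), ('Furniture', 450.0)]
```

In a snowflake variant, `category` would move out of `dim_product` into its own table referenced by a key; in a fact constellation, a second fact table (for example, shipments) would reuse the same dimension tables.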

Unit 2: Data Warehouse Process and Technology

Warehousing Strategy:

• Align warehouse design with business goals.

• Consider scalability, performance, and data governance.

Warehouse Management and Support Processes:

• Include data extraction, transformation, loading, backup, recovery, and security.

Planning and Implementation:

1. Define objectives.

2. Design architecture.

3. Select tools and technologies.

4. Build and test the system.

Hardware and Operating Systems:

• Use parallel processors, cluster systems, and distributed DBMS for performance.

Client/Server Computing Model:

• Supports distributed access and processing.

Software and Schema Design:

• Use warehousing software for efficient query processing.

• Design schemas (star, snowflake) to organize data logically.

Unit 3: Data Mining

Overview, Motivation, and Definition:

Data mining involves discovering patterns, correlations, and insights from large datasets using
algorithms and statistical techniques.

Data Processing:

1. Data Cleaning: Fill in missing values, smooth noisy data, and resolve inconsistencies. Common techniques for noisy data include:

o Binning.

o Clustering.

o Regression.

o Combined computer and human inspection.

2. Data Integration and Transformation: Combine data from multiple sources and standardize
it.

3. Data Reduction: Techniques include:

o Data Cube Aggregation.

o Dimensionality Reduction.

o Data Compression.

o Numerosity Reduction.

o Discretization and Concept Hierarchy Generation.
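Two of the cleaning steps above can be illustrated in a few lines: filling missing values with the attribute mean, and smoothing noisy values by bin means with equal-frequency bins. The function names and sample values are assumptions chosen for illustration; other fill strategies and binning variants exist.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-frequency bins, and replace
    each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, bin_size=3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]

print(fill_missing_with_mean([10, None, 20]))  # [10, 15.0, 20]
```

Smoothing by bin means also acts as a simple form of discretization, since nine distinct prices collapse into three representative values.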

Decision Tree:

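• A tree-based model for classification and decision-making: internal nodes test an attribute, branches represent test outcomes, and leaf nodes hold class labels.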
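The notes do not say how a decision tree chooses which attribute to test at each node. One common criterion, assumed here for illustration, is information gain based on entropy. The sketch below computes it for a tiny made-up dataset; the attribute names and rows are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(
        len(group) / len(rows) * entropy(group) for group in groups.values()
    )
    return base - remainder

rows = [
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "rainy", "windy": True, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
]

gains = {attr: information_gain(rows, attr, "play") for attr in ("outlook", "windy")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

Here `outlook` separates the classes perfectly while `windy` tells us nothing, so a tree built with this criterion would test `outlook` at the root.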

Unit 4: Classification and Clustering

Classification:

1. Definition: Predictive analysis that assigns data items to predefined classes using a model learned from labeled examples.

2. Key Steps:

o Data Generalization.

o Analytical Characterization.

o Attribute Relevance Analysis.

3. Algorithms:

o Statistical-Based Algorithms.

o Distance-Based Algorithms.

o Decision Tree-Based Algorithms.
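As one example of a distance-based algorithm, the sketch below implements k-nearest neighbors, which the notes do not name explicitly. The training points, labels, and the function name `knn_predict` are assumptions for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]

print(knn_predict(train, (2, 2)))      # A
print(knn_predict(train, (8.5, 8.5)))  # B
```

k-NN builds no explicit model; all the work happens at prediction time, which keeps the code short but makes each prediction cost a scan of the training set.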

Clustering:

1. Definition: Grouping data points based on similarity.

2. Similarity and Distance Measures: Basis for clustering.

3. Algorithms:

o Hierarchical (e.g., CURE, Chameleon).

o Density-Based (e.g., DBSCAN, OPTICS).

o Grid-Based (e.g., STING, CLIQUE).

o Model-Based (e.g., Statistical Approach).
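To show how a density-based method groups points, here is a compact, unoptimized sketch of DBSCAN in pure Python. The parameter names `eps` and `min_pts`, the sample points, and the use of -1 for noise are conventions assumed for this example; a production implementation would use a spatial index instead of scanning all points.

```python
import math

NOISE = -1

def region_query(points, index, eps):
    """Indices of all points within `eps` of points[index] (including itself)."""
    return [
        j for j, p in enumerate(points) if math.dist(points[index], p) <= eps
    ]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels

points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Unlike hierarchical or grid-based methods, this approach needs no cluster count up front and labels the isolated point as noise rather than forcing it into a cluster.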

Association Rules:

• Discover relationships among items by finding large (frequent) item sets in transaction data.

• Methods include basic, parallel, and distributed algorithms as well as neural networks.
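Association rules are usually evaluated with two measures, support and confidence. The sketch below computes both for a made-up set of market-basket transactions; the item names and function names are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of 3 bread baskets
print(rule_support, rule_confidence)
```

An item set is called large (or frequent) when its support meets a chosen minimum threshold; rules are then kept only if their confidence also meets a minimum.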

Unit 5: Data Visualization and Warehousing Trends

Data Visualization:

• Key features include aggregation, historical data presentation, and querying capabilities.

• OLAP tools (ROLAP, MOLAP, HOLAP) enhance data exploration.
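Two basic operations that OLAP tools offer for data exploration are roll-up (aggregating away dimensions) and slice (fixing one dimension to a single value). The sketch below models a tiny cube as a Python dictionary; the dimension names, cell values, and function names are hypothetical and not tied to any specific ROLAP, MOLAP, or HOLAP product.

```python
from collections import defaultdict

DIMENSIONS = ("year", "region", "product")

# Hypothetical cube cells: (year, region, product) -> sales measure.
cube = {
    (2023, "North", "Laptop"): 100,
    (2023, "South", "Laptop"): 80,
    (2023, "North", "Desk"): 40,
    (2024, "North", "Laptop"): 120,
}

def roll_up(cells, keep):
    """Aggregate the measure over every dimension not listed in `keep`."""
    indexes = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for coords, measure in cells.items():
        totals[tuple(coords[i] for i in indexes)] += measure
    return dict(totals)

def slice_cube(cells, dimension, value):
    """Keep only the cells where `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return {coords: m for coords, m in cells.items() if coords[i] == value}

sales_by_year = roll_up(cube, keep=("year",))
north_only = slice_cube(cube, "region", "North")
print(sales_by_year)  # {(2023,): 220, (2024,): 120}
print(roll_up(north_only, keep=("product",)))  # {('Laptop',): 220, ('Desk',): 40}
```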

Security and Maintenance:

• Implement robust security measures and ensure regular backups and recovery.

• Optimize query performance and test the warehouse periodically.

Warehousing Applications:

1. Types: Business intelligence, financial analysis, and supply chain management.

2. Emerging Fields: Web Mining, Spatial Mining, and Temporal Mining.

Summary:

1. Data warehousing is crucial for centralized data storage and analysis.

2. ETL processes and schema designs are foundational to warehouse functionality.

3. Data mining enhances decision-making through pattern recognition and insights.

4. Classification and clustering methods are pivotal for organizing and understanding data.

5. Advances in visualization and mining applications drive industry innovation.
