
UNIT 1

INTRODUCTION TO BUSINESS INTELLIGENCE


Introduction to business intelligence and business decisions – Data warehouses and their role in Business Intelligence – Creating a corporate data warehouse – Data Warehousing architecture – OLAP vs. OLTP – ETL process – Tools for Data Warehousing – Data Mining – KDD Process

Characteristics and Functions of Data warehouse


Subject-oriented
Integrated
Time-Variant
Non-Volatile
Functions of Data warehouse

Characteristics and Functions of Data warehouse

A data warehouse is a centralized repository for storing and managing large amounts of data from various sources for analysis and reporting. It is optimized for fast querying and analysis, enabling organizations to make informed decisions by providing a single source of truth for data. Data warehousing typically involves transforming and integrating data from multiple sources into a unified, organized, and consistent format.
A data warehouse is most useful when its users share a common way of describing the trends they analyse around a specific subject. The major characteristics of a data warehouse are described below.

1. Subject-oriented –
A data warehouse is always subject-oriented, as it delivers information about a theme rather than the organization's current operations. The data warehousing process is designed to handle a specific, well-defined theme, such as sales, distribution, or marketing.
A data warehouse never puts emphasis only on current operations. Instead, it focuses on presenting and analyzing data to support various decisions. It also delivers an easy and precise view of a particular theme by eliminating data that is not required to make those decisions.
2. Integrated – Like subject orientation, integration means that data is held in a consistent format. Integration involves establishing a common way of representing similar data that comes from different databases, so that the data resides in the warehouse in a shared and generally accepted manner.
A data warehouse is built by integrating data from various sources, such as mainframe files and relational databases. It must use consistent naming conventions, formats, and codes. Integration of the data warehouse enables effective analysis of data, so consistency in naming conventions, attribute measures, encoding structures, etc. must be ensured. An integrated data warehouse brings together the various subject-related data it holds.
3. Time-Variant – Data is maintained over different intervals of time, such as weekly, monthly, or annually. The time horizon of a data warehouse is significantly longer than that of operational (OLTP) systems, which hold only current data. Data residing in the data warehouse is associated with a particular period of time and delivers information from a historical perspective; every record contains an element of time, either explicitly or implicitly.
Another feature of time-variance is that once data is stored in the data warehouse it is not modified, altered, or updated. Data is stored with a time dimension, allowing for analysis of data over time.
4. Non-Volatile – As the name suggests, the data residing in the data warehouse is permanent: data is not erased or deleted when new data is inserted. The warehouse accumulates a very large quantity of data, and changes in the operational data are recorded as new entries rather than by overwriting existing ones. Data is not updated once it is stored in the data warehouse, in order to preserve the historical data.
In a data warehouse, data is read-only and is refreshed at particular intervals. This is beneficial for analyzing historical data and for understanding how the business functions. A data warehouse does not need transaction processing, recovery, or concurrency control mechanisms. Functionalities such as delete, update, and insert that are performed in an operational application are absent in the data warehouse environment. The only two types of data operations performed in the data warehouse (illustrated in the sketch after this list) are:
● Data Loading
● Data Access
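A minimal sketch of this load-and-access, non-volatile behaviour, using Python's built-in sqlite3 module purely for illustration; the sales_fact table and its columns are invented for this example.

import sqlite3

# Hypothetical warehouse table: rows are only ever appended (data loading)
# and queried (data access); they are never updated or deleted in place.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        sale_date TEXT,   -- explicit time element (time-variant)
        region    TEXT,
        amount    REAL
    )
""")

# Data loading: periodic, append-only refresh from the staging area.
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("2023-01-15", "North", 120.0),
     ("2023-02-15", "North", 140.0),
     ("2023-02-20", "South", 90.0)],
)

# Data access: read-only analysis over the accumulated historical data.
for row in conn.execute(
        "SELECT strftime('%Y-%m', sale_date) AS month, SUM(amount) "
        "FROM sales_fact GROUP BY month ORDER BY month"):
    print(row)   # e.g. ('2023-01', 120.0), ('2023-02', 230.0)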
1. Subject Oriented: Focuses on a specific area or subject such as
sales, customers, or inventory.
2. Integrated: Integrates data from multiple sources into a single,
consistent format.
3. Read-Optimized: Designed for fast querying and analysis, with
indexing and aggregations to support reporting.
4. Summary Data: Data is summarized and aggregated for faster
querying and analysis.
5. Historical Data: Stores large amounts of historical data, making it
possible to analyze trends and patterns over time.
6. Schema-on-Write: Data is transformed and structured according to
a predefined schema before it is loaded into the data warehouse.
7. Query-Driven: Supports ad-hoc querying and reporting by business
users, without the need for technical support.
Functions of Data warehouse:
A data warehouse works as an organized, central collection of data whose structure supports retrieving and analysing that data. It stores facts, including tables with high transaction volumes, that are observed in order to apply data warehousing techniques. The major functions involved are described below (a short consolidation sketch in Python follows this list):
1. Data Consolidation: The process of combining multiple data sources
into a single data repository in a data warehouse. This ensures a
consistent and accurate view of the data.
2. Data Cleaning: The process of identifying and removing errors,
inconsistencies, and irrelevant data from the data sources before
they are integrated into the data warehouse. This helps ensure the
data is accurate and trustworthy.
3. Data Integration: The process of combining data from multiple
sources into a single, unified data repository in a data warehouse.
This involves transforming the data into a consistent format and
resolving any conflicts or discrepancies between the data sources.
Data integration is an essential step in the data warehousing
process to ensure that the data is accurate and usable for analysis.
Data from multiple sources can be integrated into a single data
repository for analysis.
4. Data Storage: A data warehouse can store large amounts of
historical data and make it easily accessible for analysis.
5. Data Transformation: Data can be transformed and cleaned to
remove inconsistencies, duplicate data, or irrelevant information.
6. Data Analysis: Data can be analyzed and visualized in various ways
to gain insights and make informed decisions.
7. Data Reporting: A data warehouse can provide various reports and
dashboards for different departments and stakeholders.
8. Data Mining: Data can be mined for patterns and trends to support
decision-making and strategic planning.
9. Performance Optimization: Data warehouse systems are optimized
for fast querying and analysis, providing quick access to data.
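To illustrate the consolidation, cleaning, and integration functions listed above, here is a minimal sketch using pandas; the two source tables and their column names are hypothetical.

import pandas as pd

# Hypothetical operational sources with differing conventions.
crm = pd.DataFrame({"cust_id": [1, 2], "country": ["U.S.A", "India"]})
erp = pd.DataFrame({"cust_id": [1, 2], "revenue": [1200.0, None]})

# Data cleaning: standardise codes and fill missing values with defaults.
crm["country"] = crm["country"].replace({"U.S.A": "USA", "United States": "USA"})
erp["revenue"] = erp["revenue"].fillna(0.0)

# Data integration / consolidation: merge the sources into one unified view.
warehouse_view = crm.merge(erp, on="cust_id", how="inner")
print(warehouse_view)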

Implementation and Components in Data Warehouse


Data Warehouse
types of systems required for a data warehouse –
1. Source Systems
2. Data Staging Area
3. Presentation Server
Components of Data Warehouse Architecture and their tasks :
1. Operational Source
2. Load Manager
3. Warehouse Manager
4. Query Manager
5. Detailed Data
6. Summarized Data
7. Archive and Backup Data
8. Metadata
9. End User Access Tools

Advantages and disadvantages of the components commonly found in data warehouses

Implementation and Components in Data Warehouse


Data Warehouse is used to store historical data which helps to make
strategic decisions for the business. It is used for Online Analytical Processing
(OLAP) which helps to analyze the data.
The data warehouse helps business executives to systematically organize, understand, and use their data to make strategic decisions.
Data Warehouse
Data Warehouse has been defined in many ways, making it difficult to formulate a rigorous definition.
Loosely speaking, a data warehouse is a data repository that is kept separate from an organization’s operational databases.
Data warehouse systems allow the integration of a wide variety of
application systems.
They support information processing by providing a solid plan of
aggregated historical data for analysis.
Data in a data warehouse comes from the organization’s operational systems
as well as other external sources.
These are collectively referred to as the source systems. The data extracted from the source systems is stored in an area called the data staging area, where the data is cleaned, transformed, assembled, and de-duplicated to prepare it for loading into the data warehouse.
There are three different types of systems required for a data warehouse:
1. Source Systems
2. Data Staging Area
3. Presentation Server
Data Warehouse Architecture
The data moves from the data source area through the staging area to the
presentation server. The entire process is better known as ETL (extract,
transform, and load) or ETT (extract, transform, and transfer).
Components of Data Warehouse Architecture and their tasks :
1. Operational Source –
● An operational source is a data source that consists of operational data and external data.
● Data can come from a relational DBMS such as Informix or Oracle.
2. Load Manager –
● The Load Manager performs all operations associated with the
extraction of loading data in the data warehouse.
● These tasks include the simple transformation of data to prepare data
for entry into the warehouse.
3. Warehouse Manager –
● The warehouse manager is responsible for the warehouse
management process.
● The operations performed by the warehouse manager are the
analysis, aggregation, backup and collection of data, de-normalization
of the data.
4. Query Manager –
● Query Manager performs all the tasks associated with the
management of user queries.
● The complexity of the query manager is determined by the end-user
access operations tool and the features provided by the database.
5. Detailed Data –
● It is used to store all the detailed data in the database schema.
● Detailed data is loaded into the data warehouse to complement the
data collected.
6. Summarized Data –
● Summarized Data is a part of the data warehouse that stores
predefined aggregations
● These aggregations are generated by the warehouse manager.
7. Archive and Backup Data –
● The Detailed and Summarized Data are stored for the purpose of
archiving and backup.
● The data is relocated to storage archives such as magnetic tapes or
optical disks.
8. Metadata –
● Metadata is essentially data about data.
● It is used in the extraction and loading process, the warehouse management process, and the query management process.
9. End User Access Tools –
● End-user access tools include analysis, reporting, and data mining tools.
● By using end-user access tools, users can interact with the warehouse.
Advantages and disadvantages of the components commonly found in
data warehouses:
Data sources: Data sources are the systems or databases that provide data to
the data warehouse. Advantages of using multiple data sources include
increased data coverage and the ability to integrate diverse data types.
However, disadvantages include potential data quality issues, data
inconsistencies, and increased complexity in data integration.
ETL (Extract, Transform, Load) processes: ETL processes are used to extract
data from source systems, transform it to conform to the data warehouse
schema, and load it into the data warehouse. Advantages of ETL processes
include efficient data integration and improved data quality. However,
disadvantages include potential data loss or corruption, increased processing
time and complexity, and potential data inconsistency due to data
transformations.
Data storage: Data storage is the component of the data warehouse that
stores the data. Advantages of data storage in a data warehouse include the
ability to store large amounts of data in a single location, fast and efficient
data retrieval, and improved data quality due to data cleansing and
standardization. Disadvantages include the high cost of data storage, potential
data loss or corruption, and potential security risks associated with storing
large amounts of sensitive data in a single location.
Data modeling: Data modeling is the process of designing the structure of the
data warehouse. Advantages of data modeling include the ability to organize
and structure data in a way that is optimized for BI activities, improved data
quality due to data cleansing and standardization, and increased scalability
and flexibility. However, disadvantages include the potential for complex data
relationships and the need for specialized skills and knowledge to design and
implement an effective data model.
Data access tools: Data access tools are used to access and analyze data in
the data warehouse. Advantages of data access tools include the ability to
easily access and analyze data, improved data quality due to data cleansing
and standardization, and increased speed and efficiency of BI activities.
Disadvantages include the potential for user error, the need for specialized
skills and knowledge to use the tools effectively, and potential security risks
associated with data access.

ETL Process in Data Warehouse


● Extraction
● Transformation
● Loading
● ETL Tools
● Advantages of ETL process in data warehousing
● Disadvantages of ETL process in data warehousing

1. ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract data from various sources, transform it into a format suitable for loading into a data warehouse, and then load it into the warehouse. The process of ETL can be broken down into the following three stages:

2. Extract: The first stage in the ETL process is to extract data from
various sources such as transactional systems, spreadsheets, and flat
files. This step involves reading data from the source systems and
storing it in a staging area.

3. Transform: In this stage, the extracted data is transformed into a format that is suitable for loading into the data warehouse. This may involve cleaning and validating the data, converting data types, combining data from multiple sources, and creating new data fields.
4. Load: After the data is transformed, it is loaded into the data
warehouse. This step involves creating the physical data structures
and loading the data into the warehouse.

5. The ETL process is an iterative process that is repeated as new data is added to the warehouse. The process is important because it ensures that the data in the data warehouse is accurate, complete, and up-to-date. It also helps to ensure that the data is in the format required for data mining and reporting.

Additionally, there are many different ETL tools and technologies available,
such as Informatica, Talend, DataStage, and others, that can automate and
simplify the ETL process.

ETL is a process in Data Warehousing and it stands for Extract, Transform and
Load. It is a process in which an ETL tool extracts the data from various data
source systems, transforms it in the staging area, and then finally, loads it into
the Data Warehouse system.

Let us understand each step of the ETL process in-depth:


1. Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which may be in different formats such as relational databases, NoSQL stores, XML, and flat files, into the staging area. It is important to extract the data into the staging area first rather than directly into the data warehouse, because the extracted data arrives in various formats and can also be corrupted. Loading it directly into the data warehouse could damage it, and rollback would be much more difficult. Therefore, this is one of the most important steps of the ETL process.

2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks:

● Filtering – loading only certain attributes into the data warehouse.
● Cleaning – filling up NULL values with some default values, mapping U.S.A, United States, and America into USA, etc.
● Joining – joining multiple attributes into one.
● Splitting – splitting a single attribute into multiple attributes.
● Sorting – sorting tuples on the basis of some attribute (generally a key attribute).

3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data is loaded into the data warehouse very frequently, and sometimes it is done at longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system. (A minimal end-to-end sketch of the three steps follows below.)
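A minimal end-to-end sketch of the three ETL stages using pandas and sqlite3; the sample records, column names, and transformation rules are assumptions made for the example and do not reflect any particular ETL tool.

import sqlite3
import pandas as pd

# Extract: in a real pipeline this would be pd.read_csv(...) or a query
# against a source system; a small in-memory frame stands in for the
# extracted data held in the staging area.
staging = pd.DataFrame({
    "customer": ["Alice", "Bob", "Carol"],
    "country":  ["U.S.A", "United States", None],
    "amount":   [100.0, 250.0, 80.0],
})

# Transform: clean, standardise, filter, and sort in the staging area.
staging["country"] = staging["country"].fillna("Unknown")
staging["country"] = staging["country"].replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})
staging = staging[staging["amount"] > 90]      # filtering
staging = staging.sort_values("amount")        # sorting

# Load: write the transformed data into the warehouse table and read it back.
warehouse = sqlite3.connect(":memory:")
staging.to_sql("sales_fact", warehouse, if_exists="append", index=False)
print(warehouse.execute("SELECT * FROM sales_fact").fetchall())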

ETL processes can also use the pipelining concept: as soon as some data is extracted, it can be transformed, and during that period new data can be extracted. Similarly, while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed in parallel.
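One way to picture this pipelining is with Python generators, where extraction, transformation, and loading run as chained stages so that each record can be transformed while later records are still being extracted; this is a simplified, single-process sketch, not a real ETL engine.

def extract(rows):
    # Yield one raw record at a time, as if reading from a source system.
    for row in rows:
        yield row

def transform(records):
    # Transform each record as soon as it has been extracted.
    for rec in records:
        yield {"name": rec["name"].strip().title(), "amount": float(rec["amount"])}

def load(records, target):
    # Load each transformed record into the (here, in-memory) warehouse.
    for rec in records:
        target.append(rec)

source = [{"name": " alice ", "amount": "100"}, {"name": "BOB", "amount": "250"}]
warehouse = []
load(transform(extract(source)), warehouse)
print(warehouse)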

ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse
builder, CloverETL, and MarkLogic.

Data Warehouses: Most commonly used Data Warehouses are Snowflake, Redshift, BigQuery, and Firebolt.
Advantages of ETL process in data warehousing:

1. Improved data quality: The ETL process ensures that the data in the data warehouse is accurate, complete, and up-to-date.

2. Better data integration: The ETL process helps to integrate data from multiple sources and systems, making it more accessible and usable.

3. Increased data security: The ETL process can help to improve data security by controlling access to the data warehouse and ensuring that only authorized users can access the data.

4. Improved scalability: The ETL process can help to improve scalability by providing a way to manage and analyze large amounts of data.

5. Increased automation: ETL tools and technologies can automate and simplify the ETL process, reducing the time and effort required to load and update data in the warehouse.

Disadvantages of ETL process in data warehousing:

1. High cost: The ETL process can be expensive to implement and maintain, especially for organizations with limited resources.

2. Complexity: The ETL process can be complex and difficult to implement, especially for organizations that lack the necessary expertise or resources.

3. Limited flexibility: The ETL process can be limited in terms of flexibility, as it may not be able to handle unstructured data or real-time data streams.

4. Limited scalability: The ETL process can be limited in terms of scalability, as it may not be able to handle very large amounts of data.

5. Data privacy concerns: The ETL process can raise concerns about data privacy, as large amounts of data are collected, stored, and analyzed.

Overall, the ETL process is an essential process in data warehousing that helps to ensure that the data in the data warehouse is accurate, complete, and up-to-date. However, it also comes with its own set of challenges and limitations, and organizations need to carefully consider the costs and benefits before implementing it.

KDD Process in Data Mining


KDD Process
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Representation
Advantages and Disadvantages of KDD

KDD Process in Data Mining


In the context of computer science, “Data Mining” can be referred to as
knowledge mining from data, knowledge extraction, data/pattern analysis,
data archaeology, and data dredging.
Data Mining, also known as Knowledge Discovery in Databases, refers to
the nontrivial extraction of implicit, previously unknown and potentially useful
information from data stored in databases.
The need for data mining is to extract useful information from large datasets and use it to make predictions or support better decision-making.
Nowadays, data mining is used in almost all places where a large amount of data is stored and processed.
For example: the banking sector, market basket analysis, and network intrusion detection.

KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets.
The KDD process is an iterative process and requires multiple iterations of the following steps to extract accurate knowledge from the data.
The following steps are included in the KDD process:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
1. Cleaning in case of missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Cleaning with data discrepancy detection and data transformation tools.

Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common source (a data warehouse). Data integration is performed using data migration tools, data synchronization tools, and the ETL (Extract-Transform-Load) process.

Data Selection
Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection. For this, we can use methods such as neural networks, decision trees, Naive Bayes, clustering, and regression.
Data Transformation
Data Transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. Data transformation is a two-step process:
1. Data Mapping: assigning elements from the source base to the destination to capture transformations.
2. Code generation: creation of the actual transformation program.

Data Mining
Data mining is defined as the application of techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model, using classification or characterization.

Pattern Evaluation
Pattern Evaluation is defined as identifying patterns that represent knowledge based on given interestingness measures. It finds the interestingness score of each pattern and uses summarization and visualization to make the data understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used
to make decisions.
KDD is an iterative process in which evaluation measures can be enhanced, mining can be refined, and new data can be integrated and transformed in order to get different and more appropriate results. Preprocessing of databases consists of data cleaning and data integration.
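A compact sketch of the KDD steps on a toy dataset, using pandas for the cleaning, selection, and transformation steps and scikit-learn's k-means for the mining step; the data values, column names, and the choice of clustering are assumptions made only for illustration.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy "database" with a missing value and an irrelevant attribute.
data = pd.DataFrame({
    "age":    [25, 32, None, 47, 51, 29],
    "income": [30000, 42000, 39000, 88000, 91000, 35000],
    "note":   ["a", "b", "c", "d", "e", "f"],   # irrelevant attribute
})

# 1. Data cleaning: drop irrelevant data and fill missing values.
cleaned = data.drop(columns=["note"])
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# 2./3. Data integration and selection: a single source here, selecting
#       only the attributes relevant to the analysis.
selected = cleaned[["age", "income"]]

# 4. Data transformation: scale attributes into a common range.
transformed = StandardScaler().fit_transform(selected)

# 5. Data mining: extract patterns (two clusters of customers).
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(transformed)

# 6. Pattern evaluation / knowledge representation: inspect the result.
print(model.labels_)           # cluster membership per record
print(model.cluster_centers_)  # cluster profiles in scaled space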

Advantages of KDD
Improves decision-making: KDD provides valuable insights and knowledge that
can help organizations make better decisions.
Increased efficiency: KDD automates repetitive and time-consuming tasks and
makes the data ready for analysis, which saves time and money.
Better customer service: KDD helps organizations gain a better understanding
of their customers’ needs and preferences, which can help them provide better
customer service.
Fraud detection: KDD can be used to detect fraudulent activities by identifying
patterns and anomalies in the data that may indicate fraud.
Predictive modeling: KDD can be used to build predictive models that can
forecast future trends and patterns.
Disadvantages of KDD
Privacy concerns: KDD can raise privacy concerns as it involves collecting and
analyzing large amounts of data, which can include sensitive information about
individuals.
Complexity: KDD can be a complex process that requires specialized skills and
knowledge to implement and interpret the results.
Unintended consequences: KDD can lead to unintended consequences, such as
bias or discrimination, if the data or models are not properly understood or
used.
Data Quality: KDD process heavily depends on the quality of data, if data is not
accurate or consistent, the results can be misleading
High cost: KDD can be an expensive process, requiring significant investments
in hardware, software, and personnel.
Overfitting: KDD process can lead to overfitting, which is a common problem in
machine learning where a model learns the detail and noise in the training data
to the extent that it negatively impacts the performance of the model on new
unseen data.
Difference between KDD and Data Mining

Definition –
KDD: KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data.
Data Mining: Data mining refers to a process of extracting useful and valuable information or patterns from large data sets.

Objective –
KDD: To find useful knowledge from data.
Data Mining: To extract useful information from data.

Techniques Used –
KDD: Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization.
Data Mining: Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.

Output –
KDD: Structured information, such as rules and models, that can be used to make decisions or predictions.
Data Mining: Patterns, associations, or insights that can be used to improve decision-making or understanding.

Focus –
KDD: Focus is on the discovery of useful knowledge, rather than simply finding patterns in data.
Data Mining: Focus is on the discovery of patterns or relationships in data.

Role of domain expertise –
KDD: Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results.
Data Mining: Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.

OLAP and OLTP


OLAP stands for Online Analytical Processing. OLAP systems can analyze database information from multiple systems at the same time. The primary goal of an OLAP service is data analysis, not data processing.

OLTP stands for Online Transaction Processing. The job of OLTP is to administer the day-to-day transactions of an organization. The main goal of OLTP is data processing, not data analysis.

Online Analytical Processing (OLAP)


Online Analytical Processing (OLAP) refers to a type of software tool used for data analysis to support business decisions. OLAP provides an environment for gaining insights from data retrieved from multiple database systems at one time.

OLAP Examples

Any type of Data Warehouse System is an OLAP system. The uses of the OLAP
System are described below.

● Spotify analyzes the songs played by users to come up with a personalized homepage of their songs and playlists.

● Netflix's movie recommendation system.

Benefits of OLAP Services

● OLAP services help in keeping calculations consistent.

● We can store planning, analysis, and budgeting for business analytics within one platform.

● OLAP services help in handling large volumes of data, which helps in enterprise-level business applications.

● OLAP services help in applying security restrictions for data protection.

● OLAP services provide a multidimensional view of data, which helps in applying operations on data in various ways.
Drawbacks of OLAP Services

● OLAP services require professionals to handle the data because of the complex modeling procedure.

● OLAP services are expensive to implement and maintain when datasets are large.

● We can perform analysis of data only after extraction and transformation of the data in the case of OLAP, which delays the system.

● OLAP services are not efficient for real-time decision-making, as the data is updated only on a periodic basis.

Online Transaction Processing (OLTP)


Online transaction processing provides transaction-oriented applications in a 3-
tier architecture. OLTP administers the day-to-day transactions of an
organization.

OLTP Examples

A common example of an OLTP system is an ATM center: the person who authenticates first is served first, and the condition is that the amount to be withdrawn must be available in the ATM. The uses of the OLTP System are described below.

● An ATM center is an OLTP application.

● OLTP handles the ACID properties during data transactions via the application.

● It is also used for online banking, online airline ticket booking, sending a text message, and adding a book to a shopping cart.
Benefits of OLTP Services

● OLTP services allow users to perform read, write, and delete data operations quickly.

● OLTP services support increasing numbers of users and transactions, which helps provide real-time access to data.

● OLTP services help to provide better security by applying multiple security features.

● OLTP services support better decision making by providing accurate, current data.

● OLTP services provide data integrity, consistency, and high availability of the data.

Drawbacks of OLTP Services

● OLTP has limited analysis capability, as it is not intended for complex analysis or reporting.

● OLTP has high maintenance costs because of frequent maintenance, backups, and recovery.

● OLTP services are hampered whenever there is a hardware failure, which leads to the failure of online transactions.

● OLTP services often experience issues such as duplicate or inconsistent data.
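Before the tabular comparison below, here is a minimal sketch of the difference in workload, using Python's sqlite3 module purely as a stand-in; real OLTP and OLAP workloads would normally run on separate, specialised systems, and the table and column names are invented.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, "
           "order_date TEXT, amount REAL)")

# OLTP-style work: short, frequent write transactions on current data.
with db:   # each block commits as one small transaction
    db.execute("INSERT INTO orders VALUES (1, 'Alice', '2023-03-01', 40.0)")
with db:
    db.execute("INSERT INTO orders VALUES (2, 'Bob', '2023-03-01', 65.0)")

# OLAP-style work: a read-only analytical query that scans the history
# and aggregates it along a dimension (here, month).
query = ("SELECT strftime('%Y-%m', order_date) AS month, "
         "COUNT(*) AS orders, SUM(amount) AS revenue "
         "FROM orders GROUP BY month")
print(db.execute(query).fetchall())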
Difference between OLAP and OLTP

Definition –
OLAP: It is well-known as an online database query management system.
OLTP: It is well-known as an online database modifying system.

Data source –
OLAP: Consists of historical data from various databases.
OLTP: Consists of only operational current data.

Method used –
OLAP: It makes use of a data warehouse.
OLTP: It makes use of a standard database management system (DBMS).

Application –
OLAP: It is subject-oriented. Used for data mining, analytics, decision making, etc.
OLTP: It is application-oriented. Used for business tasks.

Normalized –
OLAP: In an OLAP database, tables are not normalized.
OLTP: In an OLTP database, tables are normalized (3NF).

Usage of data –
OLAP: The data is used in planning, problem-solving, and decision-making.
OLTP: The data is used to perform day-to-day fundamental operations.

Task –
OLAP: It provides a multi-dimensional view of different business tasks.
OLTP: It reveals a snapshot of present business tasks.

Purpose –
OLAP: It serves the purpose to extract information for analysis and decision-making.
OLTP: It serves the purpose to insert, update, and delete information from the database.

Volume of data –
OLAP: A large amount of data is stored, typically in TB or PB.
OLTP: The size of the data is relatively small, as the historical data is archived, in MB and GB.

Queries –
OLAP: Relatively slow, as the amount of data involved is large; queries may take hours.
OLTP: Very fast, as the queries operate on around 5% of the data.

Update –
OLAP: The OLAP database is not often updated. As a result, data integrity is unaffected.
OLTP: The data integrity constraint must be maintained in an OLTP database.

Backup and Recovery –
OLAP: It only needs backup from time to time as compared to OLTP.
OLTP: The backup and recovery process is maintained rigorously.

Processing time –
OLAP: The processing of complex queries can take a lengthy time.
OLTP: It is comparatively fast in processing because of simple and straightforward queries.

Types of users –
OLAP: This data is generally managed by the CEO, MD, and GM.
OLTP: This data is managed by clerks and managers.

Operations –
OLAP: Only read and rarely write operations.
OLTP: Both read and write operations.

Updates –
OLAP: With lengthy, scheduled batch operations, data is refreshed on a regular basis.
OLTP: The user initiates data updates, which are brief and quick.

Nature of audience –
OLAP: The process is focused on the market.
OLTP: The process is focused on the customer.

Database Design –
OLAP: Design with a focus on the subject.
OLTP: Design that is focused on the application.

Productivity –
OLAP: Improves the efficiency of business analysts.
OLTP: Enhances the user’s productivity.

Data Mining

Data mining has applications in multiple fields, such as science and research. It makes predictions based on likely outcomes and focuses on large data sets. Data mining is the procedure of mining knowledge from data. The knowledge extracted can be used for applications such as production control, market analysis, and science exploration. Data mining is the practice of searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining is also known as knowledge discovery; it focuses on large data sets and databases and on the creation of actionable information. It is the automatic discovery of patterns.
Data mining deals with the kinds of patterns that can be mined. Categories of patterns include description, classification, prediction, cluster analysis, and evolution analysis.
1. Knowledge base :
The knowledge base is the domain knowledge. It is used to guide the search and to evaluate the interestingness of resulting patterns. It supports knowledge presentation, data integration, etc.
2. Data transformation :
Data is transformed into forms appropriate for mining, by performing
summary operations.
3. Clusters :
A cluster is a group of similar kinds of objects; clustering forms groups of objects that are very similar to each other.
4. Data cleaning :
It is the process of preparing data for data mining activities. Techniques are applied to remove noisy data and correct inconsistencies in the data. It is performed as a data pre-processing step.
5. Data selection :
It is the process where data relevant to the analysis task is retrieved from the database.
6. Data integration :
Data integration is a data processing technique used to merge data from multiple heterogeneous data sources.
7. User interface :
The user interface is the module of the data mining system that visualizes patterns and enables communication between users and the system, providing information to help focus the search.
8. Data :
It is defined as facts, transactions, and figures.
9. GUI :
Graphical user Interface.
10. Data mining :
It extracts information from huge sets of data. This information is used for applications such as market analysis, science exploration, and production control.
11. Associations :
Association is a type of algorithm used to create rules that describe how events occur together.
12. Classification :
Classification refers to the data mining task of predicting the category of categorical data by building a model based on some predictor variables.
13. Continuous :
A continuous attribute can have any value in an interval of real numbers; the value does not have to be an integer. Continuous is the opposite of categorical.
14. DBMS :
Database management system.
15. Interaction :
Two independent variables interact when changes in the value of one change the effect of the other on the dependent variable.
Data Mining Process
● Data Mining Process
● Data Preprocessing
● Major issues in Data Mining
● Advantages of Data Mining
● Disadvantages of Data Mining
● Data Mining Models
● Types of Data Mining Models
1. Predictive Models
2. Descriptive Models
● Description Model

The data mining process typically involves the following steps:


Business Understanding: This step involves understanding the problem
that needs to be solved and defining the objectives of the data mining project.
This includes identifying the business problem, understanding the goals and
objectives of the project, and defining the KPIs that will be used to measure
success. This step is important because it helps ensure that the data mining
project is aligned with business goals and objectives.
Data Understanding: This step involves collecting and exploring the data
to gain a better understanding of its structure, quality, and content. This
includes understanding the sources of the data, identifying any data quality
issues, and exploring the data to identify patterns and relationships. This step is
important because it helps ensure that the data is suitable for analysis.
Data Preparation: This step involves preparing the data for analysis. This
includes cleaning the data to remove any errors or inconsistencies,
transforming the data to make it suitable for analysis, and integrating the data
from different sources to create a single dataset. This step is important because
it ensures that the data is in a format that can be used for modeling.
Modeling: This step involves building a predictive model using machine
learning algorithms. This includes selecting an appropriate algorithm, training
the model on the data, and evaluating its performance. This step is important
because it is the heart of the data mining process and involves developing a
model that can accurately predict outcomes on new data.
Evaluation: This step involves evaluating the performance of the model.
This includes using statistical measures to assess how well the model is able to
predict outcomes on new data. This step is important because it helps ensure
that the model is accurate and can be used in the real world.
Deployment: This step involves deploying the model into the production
environment. This includes integrating the model into existing systems and
processes to make predictions in real-time. This step is important because it
allows the model to be used in a practical setting and to generate value for the
organization.
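A minimal sketch of the modeling and evaluation steps using scikit-learn; the synthetic dataset, the decision tree algorithm, and accuracy as the evaluation measure are illustrative choices, not a prescribed method.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data preparation: a synthetic, already-cleaned dataset.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Modeling: train a predictive model on the prepared data.
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Evaluation: measure how well the model predicts outcomes on new data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Deployment would mean embedding model.predict(...) in a production system.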
Data Mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer; data mining would more appropriately be named knowledge mining, which emphasizes mining knowledge from large amounts of data. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. It is also defined as the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. Data mining is a rapidly growing field that is concerned with developing techniques to assist managers and decision-makers in making intelligent use of huge data repositories. Alternative names for Data Mining :
1. Knowledge discovery (mining) in databases (KDD)
2. Knowledge extraction
3. Data/pattern analysis
4. Data archaeology
5. Data dredging
6. Information harvesting
7. Business intelligence
Data Mining and Business Intelligence :
Key properties of Data Mining :
1. Automatic discovery of patterns
2. Prediction of likely outcomes
3. Creation of actionable information
4. Focus on large datasets and databases
Data Mining : Confluence of Multiple Disciplines –
Data Mining Process : Data Mining is a process of discovering various models,
summaries, and derived values from a given collection of data. The general
experimental procedure adapted to data-mining problem involves following
steps :
State problem and formulate hypothesis – In this step, a modeler usually specifies a group of variables for an unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. There may be several hypotheses formulated for one problem at this stage. This first step requires the combined expertise of an application domain and a data-mining model. In practice, it usually means in-depth interaction between a data-mining expert and an application expert. In successful data-mining applications, this cooperation does not stop in the initial phase; it continues during the whole data-mining process.
Collect data – This step concerns how the data is generated and collected. Generally, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler); this approach is known as a designed experiment. The second possibility is when the expert cannot influence the data generation process; this is referred to as the observational approach. An observational setting, namely random data generation, is assumed in most data-mining applications. Typically, the sampling distribution is completely unknown after data are collected, or it is partially and implicitly given within the data-collection procedure. It is vital, however, to know how data collection affects its theoretical distribution, since such prior knowledge is often useful for modeling and, later, for the ultimate interpretation of results. It is also important to make sure that the data used for estimating a model and the data used later for testing and applying the model come from the same, unknown, sampling distribution. If this is not the case, the estimated model cannot be successfully used in the final application of the results.
Data Preprocessing – In the observational setting, data is usually “collected” from existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks :
(i) Outlier Detection (and removal) : Outliers are unusual data values that are not consistent with most observations. Commonly, outliers result from measurement errors, coding and recording errors, and, sometimes, natural, abnormal values. Such non-representative samples can seriously affect the models produced later. There are two strategies for handling outliers : detect and eventually remove outliers as a part of the preprocessing phase, or develop robust modeling methods that are insensitive to outliers.
(ii) Scaling, encoding, and selecting features : Data preprocessing includes several steps, such as variable scaling and different types of encoding. For instance, one feature with range [0, 1] and another with range [100, 1000] will not have the same weight in the applied technique, and they will also influence the final data-mining results differently. Therefore, it is recommended to scale them and bring both features to the same weight for further analysis. Also, application-specific encoding methods usually achieve dimensionality reduction by providing a smaller number of informative features for subsequent data modeling.
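A short sketch of these two preprocessing tasks on a made-up feature matrix; the z-score rule for outliers and min-max scaling to [0, 1] are just one possible choice of methods.

import numpy as np

# Two features on very different ranges, plus one obvious outlier.
X = np.array([[0.2, 150.0],
              [0.4, 300.0],
              [0.5, 450.0],
              [0.9, 9000.0]])   # last row: abnormal second feature

# Outlier detection (and removal): keep only rows whose z-score stays
# below 1.5 on every feature; the threshold is arbitrary for this example.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
X_clean = X[(z < 1.5).all(axis=1)]

# Scaling: bring both features to the [0, 1] range so they carry
# equal weight in the subsequent data-mining technique.
X_scaled = (X_clean - X_clean.min(axis=0)) / (X_clean.max(axis=0) - X_clean.min(axis=0))
print(X_scaled)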
Estimate model – The selection and implementation of the appropriate data-mining technique is the main task in this phase. This process is not straightforward; usually, in practice, the implementation is based on several models, and selecting the best one is an additional task.
Interpret models and draw conclusions – In most cases, data-mining models should help in decision-making. Hence, such models need to be interpretable in order to be useful, because humans are not likely to base their decisions on complex “black-box” models. Note that the goals of accuracy of the model and accuracy of its interpretation are somewhat contradictory. Usually, simple models are more interpretable, but they are also less accurate. Modern data-mining methods are expected to yield highly accurate results using high-dimensional models. The problem of interpreting these models, which is also vital, is considered a separate task, with specific techniques to validate the results.
Classification of Data Mining Systems :
1. Database Technology
2. Statistics
3. Machine Learning
4. Information Science
5. Visualization
Major issues in Data Mining :
Mining different kinds of knowledge in databases – The need for different
users is not the same. Different users may be interested in different kinds of
knowledge. Therefore it is necessary for data mining to cover a broad range of
knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction – The
data mining process needs to be interactive because it allows users to focus on
searching for patterns, providing and refining data mining requests based on
returned results.
Incorporation of background knowledge – To guide discovery process
and to express discovered patterns, background knowledge can be used to
express discovered patterns not only in concise terms but at multiple levels of
abstraction.
Data mining query languages and ad-hoc data mining – Data Mining
Query language that allows users to describe ad-hoc mining tasks should be
integrated with a data warehouse query language and optimized for efficient
and flexible data mining.
Presentation and visualization of data mining results – Once patterns are
discovered it needs to be expressed in high-level languages, visual
representations. These representations should be easily understandable by
users.
Handling noisy or incomplete data – The data cleaning methods are
required that can handle noise, incomplete objects while mining data
regularities. If data cleaning methods are not there then accuracy of
discovered patterns will be poor.
Pattern evaluation – It refers to the interestingness of the discovered patterns. The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are not considered interesting.
Efficiency and scalability of data mining algorithms – In order to
effectively extract information from huge amounts of data in databases, data
mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms – Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions that are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update databases without having to mine the data again from scratch.

Advantages of Data Mining:


Improved decision making: Data mining can help organizations make
better decisions by providing them with valuable insights and knowledge about
their data.
Increased efficiency: Data mining can automate repetitive and time-
consuming tasks, such as data cleaning and data preparation, which can help
organizations save time and money.
Better customer service: Data mining can help organizations gain a better
understanding of their customers’ needs and preferences, which can help them
provide better customer service.
Fraud detection: Data mining can be used to detect fraudulent activities
by identifying patterns and anomalies in the data that may indicate fraud.
Predictive modeling: Data mining can be used to build predictive models
that can be used to forecast future trends and patterns.
Disadvantages of Data Mining:
Privacy concerns: Data mining can raise privacy concerns as it involves
collecting and analyzing large amounts of data, which can include sensitive
information about individuals.
Complexity: Data mining can be a complex process that requires
specialized skills and knowledge to implement and interpret the results.
Unintended consequences: Data mining can lead to unintended
consequences, such as bias or discrimination, if the data or models are not
properly understood or used.
Data Quality: Data mining process heavily depends on the quality of
data, if data is not accurate or consistent, the results can be misleading
High cost: Data mining can be an expensive process, requiring significant
investments in hardware, software, and personnel.

Data mining
The motive of data mining is to recognize valid, potentially useful, and understandable connections and patterns in existing data. Database technology has developed to the point where huge amounts of data can be stored in databases, and the wealth of knowledge hidden in those datasets is seen by business people as a usable tool for making vital business decisions. Data mining therefore attracts increasing attention, as it is expected to extract valuable information from raw data that businesses can use to widen their advantage through profitable decision-making.
Data mining is used to discover knowledge in databases; it is a procedure of extracting and recognizing useful information and subsequent knowledge from databases using mathematical, statistical, artificial intelligence, and machine learning techniques. Data mining brings together many different algorithms to carry out different tasks. All these algorithms attempt to fit a model to the data: they examine the data and determine the model that is closest to the characteristics of the data being examined. Data mining algorithms can be described as consisting of three parts.
Model – The objective is to fit a model to the data.
Preference – Some criterion must be used to prefer one model over another.
Search – All algorithms require a technique for searching the data.

Types of Data Mining Models –

1. Predictive Models
2. Descriptive Models
Predictive Model :
A predictive model makes predictions about values of data using known results found from other data. Predictive modeling may be based on the use of other historical data. Predictive-model data mining tasks comprise regression, time series analysis, classification, and prediction.
A well-known predictive technique is statistical regression. It is a supervised learning technique that incorporates an explanation of the dependency of some attribute values upon the values of other attributes in the same item, and the development of a model that can predict these attribute values for new cases.
Classification –
It is the act of assigning objects to one of several predefined categories. We can also define classification as learning a target function that maps each attribute set to a predefined class label.
Regression –
It is used to fit the data. It is a technique that estimates data values for a continuous function. There are two types of regression –
1. Linear Regression is associated with the search for the optimal line to fit two attributes, so that one attribute can be used to predict the other.
2. Multiple Linear Regression involves more than two attributes, and the data are fit to a multidimensional space.
Time Series Analysis –
It analyzes a set of data points ordered in time. In time series analysis, time serves as the independent variable used to estimate the dependent variable.
Prediction –
It predicts some missing or unknown values.
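An illustrative sketch of two predictive-model tasks, regression and classification, on tiny invented datasets using scikit-learn; the numbers and feature names are assumptions made for the example.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Regression: fit a line so one attribute can predict the other.
years_experience = np.array([[1], [3], [5], [7]])
salary = np.array([30, 45, 60, 75])
reg = LinearRegression().fit(years_experience, salary)
print(reg.predict([[4]]))      # predicted salary for 4 years of experience

# Classification: learn a target function mapping attributes to a class label.
features = np.array([[25, 0], [40, 1], [35, 1], [22, 0]])  # [age, owns_home]
labels = np.array(["no", "yes", "yes", "no"])              # bought_product
clf = DecisionTreeClassifier(random_state=0).fit(features, labels)
print(clf.predict([[30, 1]]))  # predicted class for a new case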
Description Model :
A descriptive model identifies relationships or patterns in data. Unlike a predictive model, a descriptive model serves as a way to explore the properties of the data being examined, not to predict new properties. Clustering, summarization, association rules, and sequence discovery are descriptive-model data mining tasks.
Descriptive analytics concentrates on summarizing and converting the data into meaningful information for monitoring and reporting.
Clustering –
It is the technique of organizing a group of abstract objects into classes of similar objects.
Summarization –
It presents a set of data in a more compact, easy-to-understand form.
Association Rules –
They find interesting correlations or causal relationships among a large set of data items.
Sequence Discovery –
It is the discovery of interesting patterns in the data in relation to some objective or subjective measurement of how interesting they are.
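A small sketch of two descriptive-model tasks, clustering and summarization, on invented purchase data; k-means and a simple group-by summary are convenient illustrative choices rather than prescribed methods.

import pandas as pd
from sklearn.cluster import KMeans

purchases = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "visits":   [2, 3, 20, 22],
    "spend":    [50.0, 60.0, 800.0, 900.0],
})

# Clustering: group similar objects together (casual vs. heavy buyers).
km = KMeans(n_clusters=2, n_init=10, random_state=0)
purchases["segment"] = km.fit_predict(purchases[["visits", "spend"]])

# Summarization: present the data in a more compact, understandable form.
print(purchases.groupby("segment")[["visits", "spend"]].mean())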
