Itdw
Name : ………………………
Register Number : ……………………………
Semester : ……………………………
CERTIFICATE
This is to certify that this is a bonafide record of the practical work done by ………………………………
1. Data exploration and integration with WEKA
2. Apply WEKA tool for data validation
3. Plan the architecture for real time application
4. Write the query for schema definition
5. Design data warehouse for real time applications
6. Analyse the dimensional modeling
7. Case study using OLAP
8. Case study using OLTP
9. Implementation of warehouse testing
EX.NO.:1
DATE: DATA EXPLORATION AND INTEGRATION WITH WEKA
AIM:
To explore the data and perform integration with WEKA.
PROCEDURE:
To install WEKA on your machine, visit WEKA’s official website and download the installation
file. WEKA supports installation on Windows, Mac OS X and Linux. You just need to follow the
instructions on this page to install WEKA for your OS.
The WEKA GUI Chooser application will start and you will see the following screen.
The GUI Chooser application allows you to run five different types of applications as listed
here:
Explorer
Experimenter
Knowledge Flow
Workbench
Simple CLI
Loading Data
We will open the file from a public URL. Type the following URL in the popup box:
https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff
You may specify any other URL where your data is stored. The Explorer will load the data from
the remote site into its environment.
Loading Data from DB
Once you click on the Open DB button, you can see a window as follows:
Set the connection string to your database, set up the query for data selection, process the query
and load the selected records in WEKA.
WEKA File Formats
WEKA supports a large number of file formats for the data. Here is the complete list:
arff
arff.gz
bsi
csv
dat
data
json
json.gz
libsvm
m
names
xrff
xrff.gz
The types of files that it supports are listed in the drop-down list box at the bottom of the screen.
This is shown in the screenshot given below.
As you would notice it supports several formats including CSV and JSON. The default file type is
Arff.
Arff Format
An Arff file contains two sections - header and data.
The header describes the attribute types.
The data section contains a comma separated list of data.
As an example of the Arff format, the weather data file loaded from the WEKA sample databases is
shown below.
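The screenshot is not reproduced here. As a rough sketch of the file's contents (the attribute declarations and the first few of the 14 data rows follow the standard weather.nominal sample; the exact relation name may differ slightly in the downloaded file):

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
...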
From this file, you can infer the following points:
The @relation tag defines the name of the database.
The @attribute tag defines the attributes.
The @data tag starts the list of data rows each containing the comma separated
fields.
The attributes can take nominal values as in the case of outlook shown here:
@attribute outlook {sunny, overcast, rainy}
The attributes can also take real values, for example:
@attribute temperature real
You can also set a Target or a Class variable called play as shown here:
@attribute play {yes, no}
The Target assumes two nominal values yes or no.
Understanding Data
Let us first look at the highlighted Current relation sub window. It shows the name of the database
that is currently loaded. You can infer two points from this sub window:
There are 14 instances - the number of rows in the table.
The table contains 5 attributes - the fields, which are discussed in the upcoming
sections.
On the left side, notice the Attributes sub window that displays the various fields in the
database.
The weather database contains five fields - outlook, temperature, humidity, windy and play.
When you select an attribute from this list by clicking on it, further details of the attribute
are displayed on the right-hand side.
Let us select the temperature attribute first. When you click on it, you would see the following
screen:
In the Selected Attribute subwindow, you can observe the following:
The name and the type of the attribute are displayed.
The type for the temperature attribute is Nominal.
The number of Missing values is zero.
There are three distinct values with no unique value.
The table underneath this information shows the nominal values for this field as
hot, mild and cool.
It also shows the count and weight in terms of a percentage for each nominal value.
At the bottom of the window, you see the visual representation of the class values.
If you click on the Visualize All button, you will be able to see all features in one single window
as shown here:
Removing Attributes
Many a time, the data that you want to use for model building comes with many irrelevant fields.
For example, a customer database may contain the customer's mobile number, which is irrelevant in
analysing his credit rating.
To remove attributes, select them and click on the Remove button at the bottom.
The selected attributes would be removed from the database. After you fully preprocess the data,
you can save it for model building.
Next, you will learn to preprocess the data by applying filters on this data.
Data Integration
Suppose you have two datasets, loaded from two separate files, and need to merge them together.
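The two example datasets are not reproduced here. Assuming the installed WEKA release provides the command-line helpers on weka.core.Instances described in the WEKA documentation, two ARFF files can be combined from a terminal (the file names below are placeholders):

java -cp weka.jar weka.core.Instances merge dataset1.arff dataset2.arff > merged.arff

Here merge joins the two files side by side and expects the same number of rows in each; if the helper is not available in your version, the datasets can also be combined by editing the ARFF headers and data sections by hand.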
RESULT:
Thus the WEKA software was installed and data exploration and integration were performed successfully.
EX.NO.:2
DATE: APPLY WEKA TOOL FOR DATA VALIDATION
AIM:
To apply the WEKA tool for data validation.
PROCEDURE:
Data validation is the process of verifying and validating data that is collected before it is
used. Any type of data handling task, whether it is gathering data, analyzing it, or structuring it
for presentation, must include data validation to ensure accurate results.
1. Data Sampling
Click on Choose (certain sample datasets do not allow this operation; I used the breast-cancer
dataset for this experiment).
Filters -> supervised -> instance -> Resample
Click on the name of the filter to change its parameters.
Change biasToUniformClass to obtain a biased sample. If you set it to 1, the resulting dataset
will have an equal number of instances for each class, e.g. breast-cancer: 20 positive and 20
negative.
Change noReplacement accordingly.
Change sampleSizePercent accordingly (self-explanatory).
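The same sampling can also be run from the command line. A sketch, assuming weka.jar and the breast-cancer.arff sample file are in the working directory and using the filter's documented option letters (-B for biasToUniformClass, -Z for sampleSizePercent, -S for the random seed):

java -cp weka.jar weka.filters.supervised.instance.Resample -c last -B 1.0 -Z 100.0 -S 1 -i breast-cancer.arff -o resampled.arff

Here -c last marks the last attribute as the class, which the supervised filter needs; add -no-replacement for sampling without replacement if the installed version supports that flag.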
2. Removing duplicates
Filters -> unsupervised -> instance -> RemoveDuplicates (removes identical rows from the loaded dataset).
3. Data Reduction
PCA
Load the iris dataset.
Filters -> unsupervised -> attribute -> PrincipalComponents
The original iris dataset has 5 columns (4 data + 1 class). Let us reduce that to 3 columns (2
data + 1 class).
4. Data transformation
Normalization
Load the iris dataset.
Filters -> unsupervised -> attribute -> Normalize
Normalization is important when you don't know the distribution of the data beforehand.
Scale is the length of the number line and translation is the lower bound.
E.g. scale 2 and translation -1 gives the range -1 to 1; scale 4 and translation -2 gives -2 to 2.
This filter is applied to all numeric columns; you can't selectively normalize.
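A command-line sketch of the same filter, assuming weka.jar and iris.arff are in the working directory (-S is the scale and -T the translation discussed above):

java -cp weka.jar weka.filters.unsupervised.attribute.Normalize -S 2.0 -T -1.0 -i iris.arff -o iris_normalized.arff

With scale 2 and translation -1, every numeric attribute is rescaled to the range -1 to 1.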
Standardization
Load the iris dataset.
Standardization is used when the dataset is known to have a Gaussian (bell curve) distribution.
Filters -> unsupervised -> attribute -> Standardize
This filter is applied to all numeric columns; you can't selectively standardize.
Discretization
Load the diabetes dataset.
Discretization comes in handy when using decision trees.
Suppose you need to change the weight column to two values such as low and high.
Set column number 6 in attributeIndices.
Set bins to 2 (Low/High).
When you set equal frequency to true, there will be an equal number of high and low entries
in the final column.
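A command-line sketch of the same discretization, assuming weka.jar and diabetes.arff are in the working directory (-R selects the attribute index, -B the number of bins, -F switches to equal-frequency binning):

java -cp weka.jar weka.filters.unsupervised.attribute.Discretize -R 6 -B 2 -F -i diabetes.arff -o diabetes_discretized.arff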
RESULT:
Thus the WEKA tool was applied for data validation and the data was validated successfully.
EX.NO.:3
DATE: PLAN THE ARCHITECTURE FOR REAL TIME APPLICATION
AIM:
To plan the architecture for real time application.
PROCEDURE:
DESIGN STEPS:
1. Gather Requirements: Aligning the business goals and needs of different departments
with the overall data warehouse project.
2. Set Up Environments: This step is about creating three environments for data warehouse
development, testing, and production, each running on separate servers.
3. Data Modeling: Design the data warehouse schema, including the fact tables and
dimension tables, to support the business requirements.
4. Develop Your ETL Process: ETL stands for Extract, Transform, and Load. This process
is how data gets moved from its source into your warehouse.
5. OLAP Cube Design: Design OLAP cubes to support analysis and reporting requirements.
6. Reporting & Analysis: Developing and deploying the reporting and analytics tools that
will be used to extract insights and knowledge from the data warehouse.
7. Optimize Queries: Optimizing queries ensures that the system can handle large amounts
of data and respond quickly to queries.
8. Establish a Rollout Plan: Determine how the data warehouse will be introduced to the
organization, which groups or individuals will have access to it, and how the data will be
presented to these users.
OUTPUT
RESULT:
Thus, the architecture for classifying and testing a real time application (data set) was designed
successfully
EX.NO.:4 QUERY FOR SCHEMA DEFINITION
DATE:
AIM:
To write queries for the Star, Snowflake and Galaxy schema definitions.
PROCEDURE:
STAR SCHEMA
SNOWFLAKE SCHEMA
A fact constellation has multiple fact tables. It is also known as a galaxy schema.
The sales fact table is the same as that in the star schema.
The shipping fact table also contains two measures, namely dollars sold and units sold.
SAMPLE PROGRAM:
Table Creation: (Galaxy Schema)
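The original table-creation statements are not included in the listing. A minimal sketch of what they might look like, using the table and column names assumed by the query below (the data types and the shipping fact table are illustrative guesses, the latter following the description above):

CREATE TABLE product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(50)
);

CREATE TABLE customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(50)
);

-- Sales fact table (shared by the star and galaxy schemas)
CREATE TABLE sales (
    date         DATE,
    product_id   INT REFERENCES product(product_id),
    customer_id  INT REFERENCES customer(customer_id),
    sales_amount DECIMAL(10,2)
);

-- Second fact table that turns the star into a galaxy (fact constellation)
CREATE TABLE shipping (
    date         DATE,
    product_id   INT REFERENCES product(product_id),
    customer_id  INT REFERENCES customer(customer_id),
    dollars_sold DECIMAL(10,2),
    units_sold   INT
);

The query below then joins the sales fact with its product and customer dimensions.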
SELECT
s.date,
s.sales_amount,
p.product_name,
c.customer_name
FROM
sales s
JOIN
product p ON s.product_id = p.product_id
JOIN
customer c ON s.customer_id = c.customer_id;
OUTPUT:
+------------+--------------+--------------+---------------+
| date       | sales_amount | product_name | customer_name |
+------------+--------------+--------------+---------------+
| 2024-02-01 |       500.00 | Product A    | Customer X    |
| 2024-02-02 |       750.00 | Product B    | Customer Y    |
| 2024-02-03 |       600.00 | Product C    | Customer Z    |
+------------+--------------+--------------+---------------+
3 rows in set (0.00 sec)
SNOWFLAKE SCHEMA DEFINITION
SELECT
s.Date,
s.SalesAmount,
p.ProductName,
pc.CategoryName,
c.CustomerName,
cl.City
FROM
Sales s
JOIN
Product p ON s.ProductID = p.ProductID
JOIN
ProductCategory pc ON p.CategoryID = pc.CategoryID
JOIN
Customer c ON s.CustomerID = c.CustomerID
JOIN
CustomerLocation cl ON c.LocationID = cl.LocationID;
OUTPUT:
GALAXY SCHEMA DEFINITION
SELECT
s.Date AS SalesDate,
s.SalesAmount,
o.Date AS OrderDate,
o.Quantity,
p.ProductName,
pc.CategoryName,
c.CustomerName,
cl.City
FROM
Sales s
JOIN
Product p ON s.ProductID = p.ProductID
JOIN
ProductCategory pc ON p.CategoryID = pc.CategoryID
JOIN
Orders o ON o.ProductID = s.ProductID AND o.CustomerID = s.CustomerID -- remaining joins reconstructed from the selected columns; adjust the keys to the actual schema
JOIN
Customer c ON s.CustomerID = c.CustomerID
JOIN
CustomerLocation cl ON c.LocationID = cl.LocationID;
OUTPUT:
+------------+-------------+------------+----------+--------------+---------------+--------------+-------------+
| SalesDate  | SalesAmount | OrderDate  | Quantity | ProductName  | CategoryName  | CustomerName | City        |
+------------+-------------+------------+----------+--------------+---------------+--------------+-------------+
| 2024-02-01 |      500.00 | 2024-02-01 |        2 | Smartphone   | Electronics   | John Doe     | New York    |
| 2024-02-04 |      800.00 | 2024-02-04 |        5 | Laptop       | Electronics   | John Doe     | New York    |
| 2024-02-02 |       30.00 | 2024-02-02 |        3 | T-Shirt      | Clothing      | Jane Smith   | Los Angeles |
| 2024-02-05 |       50.00 | 2024-02-05 |        2 | Jeans        | Clothing      | Jane Smith   | Los Angeles |
| 2024-02-03 |      150.00 | 2024-02-03 |        1 | Coffee Maker | Home & Garden | Bob Johnson  | Chicago     |
+------------+-------------+------------+----------+--------------+---------------+--------------+-------------+
5 rows in set (0.00 sec)
(Star schema and snowflake schema diagrams)
RESULT:
Thus the queries for the Star, Snowflake and Galaxy schemas were written successfully.
EX.NO.:5
DATE:
DESIGN DATA WAREHOUSE FOR REAL TIME APPLICATIONS
AIM:
To design a data warehouse for a real time application using the PostgreSQL tool.
PROCEDURE:
1. Click Start- AllPrograms -PostgreSQL 16 - Open pgAdmin4.
2. Click this icon, enter the name, host and password as postgres.
3. Double click PostgreSQL 16.
4. Right click databases (1) and choose Create and type database name as dwftp and
Save.
5. Double click dwftp and click schemas (1) - Right click and select Create and type
schema name as dw and Save.
6. Double click dw, right click Tables and select Create -> Table to create a table for Employee
as emp1 with the following columns (an equivalent SQL sketch is given after the procedure):
Eno integer PRIMARY KEY
Empname VARCHAR(20)
Age integer
Salary integer
Job Char
Deptno integer
and Save.
7. To insert values into table right click the table emp1 select View/edit data -> All rows
and then add the number of rows, insert values by double clicking each attribute and
Save.
8. Right click on table emp1, select Query tool and perform the query operations:
(a) To list the records in the emp1 table ordered by salary in descending order:
select * from dw.emp1 order by salary desc;
OUTPUT:
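The output screenshot is not reproduced here. For reference, steps 6 to 8 above correspond roughly to the following SQL, which could be run from the Query Tool instead of the GUI (the CHAR length and the sample rows are assumptions):

-- Step 6: create the employee table in the dw schema
CREATE TABLE dw.emp1 (
    eno     INTEGER PRIMARY KEY,
    empname VARCHAR(20),
    age     INTEGER,
    salary  INTEGER,
    job     CHAR(10),      -- length assumed; the GUI step only says Char
    deptno  INTEGER
);

-- Step 7: insert a few sample rows (values are illustrative)
INSERT INTO dw.emp1 (eno, empname, age, salary, job, deptno) VALUES
    (1, 'Arun',  28, 45000, 'CLERK',   10),
    (2, 'Divya', 32, 60000, 'MANAGER', 20),
    (3, 'Kumar', 26, 38000, 'ANALYST', 10);

-- Step 8(a): list the records ordered by salary in descending order
SELECT * FROM dw.emp1 ORDER BY salary DESC;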
RESULT:
Thus the data warehouse for a real time application was designed successfully using PostgreSQL.
EX.NO.:6
DATE: ANALYSE THE DIMENSIONAL MODELING
AIM:
To analyse the dimensional modeling.
PROCEDURE:
To analyze the dimensional modeling, you can follow this procedure:
PROGRAM:
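The original program listing is not reproduced here. As a placeholder illustration only, a minimal dimensional model in SQL (one fact table with two dimension tables, in the spirit of the schemas used in the other exercises; every name and type below is an assumption):

-- Dimension tables: descriptive attributes used for slicing and dicing
CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(50),
    category     VARCHAR(30)
);

CREATE TABLE dim_store (
    store_id INT PRIMARY KEY,
    region   VARCHAR(30)
);

-- Fact table: one row per sale, measures plus foreign keys to the dimensions
CREATE TABLE fact_sales (
    sale_id      INT PRIMARY KEY,
    product_id   INT REFERENCES dim_product(product_id),
    store_id     INT REFERENCES dim_store(store_id),
    sale_date    DATE,
    quantity     INT,
    sales_amount DECIMAL(10,2)
);

-- Analysing the model: total sales by category and region across the dimensions
SELECT p.category, s.region, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_store   s ON f.store_id   = s.store_id
GROUP BY p.category, s.region;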
RESULT:
Thus the program was written and executed successfully to analyse the dimensional modeling.
EX.NO: 7
DATE: CASE STUDY USING OLAP
AIM:
To study a case scenario for using OLAP (Online Analytical Processing) in a data warehousing
environment.
Background:
A retail company operates multiple stores across different regions. They have been collecting
transactional data from their point-of-sale (POS) systems for several years. The company wants to
gain insights into their sales performance, customer behavior, and product trends to make informed
business decisions.
Objective:
The objective is to design a data warehousing solution using OLAP for analyzing retail sales data to
uncover actionable insights.
Data Sources:
1. Transactional data: Includes information such as sales date, store ID, product ID, quantity sold,
unit price, and total sales amount.
2. Customer data: Includes demographics, loyalty program membership status, and purchase
history.
3. Product data: Includes product attributes such as category, brand, and price.
Solution:
1. Data Integration:
Extract data from various sources and load it into a centralized data warehouse. Transform
and cleanse the data to ensure consistency and quality.
2. Dimensional Modeling:
Design a star schema or snowflake schema to organize the data into fact tables (e.g., sales
transactions) and dimension tables (e.g., store, product, time, customer).
3. OLAP Cube Creation:
Build OLAP cubes based on the dimensional model to provide multi-dimensional views of
the data. Dimensions such as time, product, store, and customer can be sliced and diced for
analysis.
4. Analysis:
Analyze total sales revenue, units sold, and average transaction value by store, region, product
category, and time period (an example query is sketched after this list).
5. Visualization:
Create interactive dashboards and reports using OLAP cube data to present insights to business
users. Visualization tools like Tableau, Power BI, or custom-built dashboards can be used.
6. Decision Making:
Use insights gained from analysis to make data-driven decisions such as inventory
management, marketing campaigns, and product assortment planning.
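As an illustration of step 4, the kind of aggregation an OLAP cube answers can also be expressed directly in SQL over the dimensional model of step 2 (all table and column names here are assumptions, following the dim_time table used in exercise 9):

-- Total revenue, units sold and average transaction value
-- by region, product category, year and quarter
SELECT
    st.region,
    p.category,
    t.year,
    t.quarter,
    SUM(f.sales_amount) AS total_revenue,
    SUM(f.quantity)     AS units_sold,
    AVG(f.sales_amount) AS avg_transaction_value
FROM fact_sales f
JOIN dim_store   st ON f.store_id   = st.store_id
JOIN dim_product p  ON f.product_id = p.product_id
JOIN dim_time    t  ON f.time_id    = t.time_id
GROUP BY st.region, p.category, t.year, t.quarter;

Rolling this result up to coarser levels (region only, or year only) is what slicing and dicing the cube does interactively.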
Benefits:
1. Improved Decision Making: Provides timely and relevant insights to stakeholders for making
informed decisions.
2. Enhanced Operational Efficiency: Optimizes inventory management, marketing strategies, and
resource allocation based on data-driven insights.
3. Competitive Advantage: Enables the company to stay ahead of competitors by understanding
customer preferences and market trends.
4. Scalability: The OLAP solution can scale to handle large volumes of data and accommodate
evolving business needs.
Conclusion:
By leveraging OLAP technology within a data warehousing environment, the retail company can
gain deeper insights into their sales data, customer behavior, and product performance, ultimately
driving business growth and profitability.
EX.NO: 8
DATE: CASE STUDY USING OLTP
AIM:
To study a case scenario for using OLTP (Online Transaction Processing) in a data warehousing
environment.
Background:
A retail company operates an e-commerce platform where customers can purchase products online.
They need to manage a high volume of transactions efficiently while ensuring data integrity and
real-time processing.
Objective:
The objective is to design an OLTP system within a data warehousing environment to handle online
retail orders, manage inventory, process payments, and maintain customer information.
Solution:
1. Database Design:
3. Inventory Management:
4. Payment Processing:
5. Customer Management:
- Maintain customer profiles with information such as contact details, shipping addresses, and
order history.
- Enable customers to update their profiles and track order statuses.
- Implement authentication and authorization mechanisms to ensure data security.
6. Scalability and Performance:
- Optimize database performance for handling concurrent transactions and high throughput.
- Implement indexing, partitioning, and caching strategies to improve query performance.
- Scale the system horizontally or vertically to accommodate increasing transaction volumes.
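As a small illustration of the transactional behaviour described in the solution above, a hypothetical order-placement transaction (the orders and inventory tables and their columns are assumptions, not part of the case study):

-- Place an order and reserve stock atomically; either both changes
-- commit or neither does, preserving data integrity.
BEGIN;

INSERT INTO orders (order_id, customer_id, product_id, quantity, order_date)
VALUES (1001, 42, 7, 2, CURRENT_DATE);

UPDATE inventory
SET    quantity_on_hand = quantity_on_hand - 2
WHERE  product_id = 7
AND    quantity_on_hand >= 2;   -- guard against overselling

COMMIT;

If the stock check fails, the application would issue ROLLBACK instead of COMMIT.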
Benefits:
Conclusion:
By implementing an OLTP system within a data warehousing environment, the retail company can
effectively manage online retail operations, process transactions in real-time, and maintain data
integrity, ultimately driving customer satisfaction and business growth.
EX.NO: 9
IMPLEMENTATION OF WAREHOUSE TESTING
DATE:
AIM:
To implement warehouse testing, which involves several steps to ensure the efficiency and accuracy
of warehouse operations.
PROCEDURE:
1. Determine the specific objectives and goals of the warehouse testing. This could include ensuring
inventory accuracy, optimizing picking and packing processes, improving order fulfillment times, etc.
2. Define a set of test scenarios that cover different aspects of warehouse operations, such as receiving
goods, put-away, picking, packing, shipping, and inventory counts. These scenarios should be based
on real-world scenarios and should cover both normal and edge cases.
3. Establish criteria for evaluating the success of each test scenario. This could include accuracy rates,
time taken to complete tasks, error rates, etc.
4. Gather or generate the necessary test data to simulate real-world warehouse operations. This could
include product data, inventory levels, customer orders, shipping information, etc.
5. Assign personnel and equipment necessary to conduct the warehouse testing. This may involve
coordinating with warehouse staff, IT personnel, and any external vendors or consultants as needed.
6. Conduct the warehouse testing by executing the predefined test scenarios using the allocated
resources. Ensure that each scenario is executed according to the defined criteria, and record the
results of each test.
7. Analyze the results of the warehouse testing to identify any areas of improvement or areas where
issues were encountered. Determine the root causes of any issues and prioritize them based on their
impact on warehouse operations.
8. Take corrective actions to address any issues or deficiencies identified during the testing process.
This could involve updating procedures, modifying system configurations, providing additional
training to warehouse staff, etc.
9. Document Findings: Document the findings of the warehouse testing process, including test results,
corrective actions taken, and any recommendations for future improvements. This documentation
will serve as a reference for future testing cycles and continuous improvement efforts.
10. Continuously iterate and refine the warehouse testing process based on feedback and results from
previous testing cycles. Make adjustments as needed to improve the efficiency and effectiveness of
warehouse operations.
PROGRAM:
-- Insert sample data into dim_time and fact_sales tables (replace with your own data)
INSERT INTO dim_time (date, year, quarter, month, day) VALUES
('2024-01-01', 2024, 1, 1, 1),
('2024-01-02', 2024, 1, 1, 2);
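The listing above also mentions a fact_sales table that is not defined in the fragment. A sketch of how the table definitions and a couple of simple warehouse tests might look (every name and type beyond dim_time's columns is an assumption; SERIAL assumes PostgreSQL, as in exercise 5):

-- Table definitions (these would be created before the INSERT above)
CREATE TABLE dim_time (
    time_id SERIAL PRIMARY KEY,
    date    DATE,
    year    INT,
    quarter INT,
    month   INT,
    day     INT
);

CREATE TABLE fact_sales (
    sale_id      SERIAL PRIMARY KEY,
    time_id      INT REFERENCES dim_time(time_id),
    sales_amount DECIMAL(10,2)
);

-- Sample fact rows referencing the two time rows inserted above
INSERT INTO fact_sales (time_id, sales_amount) VALUES (1, 500.00), (2, 750.00);

-- Test 1: referential integrity - no fact row should point at a missing time row
SELECT COUNT(*) AS orphan_rows
FROM fact_sales f
LEFT JOIN dim_time t ON f.time_id = t.time_id
WHERE t.time_id IS NULL;

-- Test 2: row counts match the expected load volumes
SELECT (SELECT COUNT(*) FROM dim_time)   AS dim_time_rows,
       (SELECT COUNT(*) FROM fact_sales) AS fact_sales_rows;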
RESULT:
Thus the program for the implementation of warehouse testing was written and executed successfully.