Data Warehousing Lab Exercise
Ex.No:1
Date:
Study of WEKA Tool
Introduction
Weka (pronounced to rhyme with Mecca) is a workbench that contains a collection of
visualization tools and algorithms for data analysis and predictive modeling, together with
graphical user interfaces for easy access to these functions. The original non-Java version of
Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms implemented in
other programming languages, plus data preprocessing utilities in C and a Makefile-based
system for running machine learning experiments. This original version was primarily
designed as a tool for analyzing data from agricultural domains, but the more recent fully
Java-based version (Weka 3), for which development started in 1997, is now used in many
different application areas, in particular for educational purposes and research. Advantages
of Weka include:
Free availability under the GNU General Public License.
Portability, since it is fully implemented in the Java programming language and thus
runs on almost any modern computing platform
A comprehensive collection of data preprocessing and modeling techniques
Ease of use due to its graphical user interfaces
Description
Open the program. Once the program has been loaded on the user's machine, it is opened by
navigating to the Programs/Start menu; the exact path depends on the user's operating system.
Figure 1.1 is an example of the initial opening screen on a computer.
There are four options available on this initial screen:
1. Explorer - the graphical interface used to conduct experimentation on raw data. After
clicking the Explorer button, the Weka Explorer interface appears.
Fig: 1.2 Pre-processor
Inside the Weka Explorer window there are six tabs:
i) Preprocess - used to choose the data file to be used by the application.
Open file - allows the user to select files residing on the local machine or on recorded
media
Open URL - provides a mechanism to locate a file or data source at a different location
specified by the user
Open Database - allows the user to retrieve files or data from a database source provided by
the user
ii) Classify- used to test and train different learning schemes on the preprocessed data file
under experimentation
iii) Cluster- used to apply different tools that identify clusters within the data file.
The Cluster tab opens the process that is used to identify commonalties or clusters of
occurrences within the data set and produce information for the user to analyze.
iv) Associate - used to apply different rules to the data file that identify associations within
the data. The Associate tab opens a window to select the options for mining associations within
the dataset.
v) Select attributes - used to apply attribute (feature) selection methods that identify the most
relevant attributes in the data file.
vi) Visualize - used to display scatter plots of every attribute pair for visual exploration of the
dataset.
Result:
The core features, general characteristics, and applications of the Weka tool
have been studied.
Ex.No:2
Date:
Data exploration and integration with Weka
Aim:
To implement data exploration and integration with Weka
Procedure:
Step 1: Launch Weka Explorer
- Open Weka and select the "Explorer" from the Weka GUI Chooser.
Step 2: Load the dataset
- Click on the "Open file" button and select "datasets" > "iris.arff" from the Weka
installation directory. This will load the Iris dataset.
Step 3: To know more about the Iris dataset, open iris.arff in Notepad++ or a similar tool
and read the comments.
Step 4: Fill in the following tables:

Flower type        Count
-----------        -----

Attribute
---------
Sepal length
Sepal width
Petal length
Petal width
Result:
Thus data exploration and integration with Weka was performed.
Ex.No:3
Date:
Data Validation Using Weka
Aim:
To implement data validation using Weka
Procedure:
Step 1: Launch Weka Explorer
- Open Weka and select the "Explorer" from the Weka GUI Chooser.
Step 2: Load the dataset
- Click on the "Open file" button and select "datasets" > "iris.arff" from the Weka
installation directory. This will load the Iris dataset.
Step 3: Split your data into training and testing sets. Under the "Classify" tab, in the
"Test options" area, select a testing method. Weka offers options like cross-validation,
percentage split, and a supplied (user-defined) test set. Configure the
options according to your needs.
Step 4: Select a classifier algorithm. Weka offers a wide range of algorithms for
classification, regression, clustering, and other tasks. Under the "Classify" tab, click on the
"Choose" button next to the "Classifier" area and choose an algorithm. Configure its
parameters, if needed.
Step 5: Click on the "Start" button under the "Classify" tab to run the training and testing
process. Weka will train the model on the training set and test its performance on the testing
set using the selected algorithm.
Validation Techniques:
Cross-Validation: Go to the "Classify" tab and choose a classifier. Then, under the "Test
options," select the type of cross-validation you want to perform (e.g., 10-fold cross-
validation). Click "Start" to run the validation.
Train-Test Split: You can also split your data into a training set and a test set by choosing the
"Percentage split" option under "Test options" in the "Classify" tab; the model is trained on the
chosen percentage of the data and evaluated on the remainder.
Step 6: Evaluate the model's performance. Once the process finishes, Weka will display
various performance measures like accuracy, precision, recall, and the ROC curve (for
classification tasks) or RMSE and MAE (for regression tasks). These measures appear in the
"Classifier output" pane, and each completed run is added to the "Result list"; standard
definitions of these measures are given after the procedure.
Step 7: Analyze the results and interpret them. Examine the performance measures to assess
the model's quality and suitability for your dataset. Compare different models or validation
methods if you have tried more than one.
Step 8: Repeat steps 4-7 with different algorithms or validation methods if desired. This will
help you compare the performance of different models and choose the best one.
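For reference, the standard definitions of the main evaluation measures reported by Weka
(general statistics, not specific to this dataset) are:

Accuracy  = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
RMSE      = sqrt( (1/n) * Σ (actual_i - predicted_i)^2 )
MAE       = (1/n) * Σ | actual_i - predicted_i |

where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives,
and false negatives, and actual_i and predicted_i are the actual and predicted values.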
Output
Result:
Thus simple data validation and testing of the dataset using Weka was implemented.
Ex.No:4
Date:
Training the Given Dataset for an Application
Aim:
To apply the concept of Linear Regression for training the given dataset.
Procedure:
Step 1: Open the weka tool.
Step 2: Download a dataset from the UCI Machine Learning Repository.
Step 3: Apply replace missing values.
Step 4: Apply normalize filter.
Step 5: Click the Classify Tab.
Step 6: Choose the Simple Linear Regression option.
Step 7: Select the training set of data.
Step 8: Start the validation process.
Step 9: Note the output.
Linear Regression:
In statistics, linear regression is an approach for modeling the relationship between a scalar
dependent variable Y and one or more explanatory variables denoted X. The case of a single
explanatory variable is called Simple Linear Regression.
The regression equation is given by: Y = aX + b, where a is the regression coefficient (slope)
and b is the intercept.
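For reference, the least-squares estimates of the slope a and intercept b (standard formulas,
not specific to this dataset) are:

a = Σ (x_i - mean_x)(y_i - mean_y) / Σ (x_i - mean_x)^2
b = mean_y - a * mean_x

where mean_x and mean_y are the means of the x (experience) and y (salary) values.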
Problem:
Consider the dataset below, where x is the number of years of working experience of a college
graduate and y is the corresponding salary of the graduate. Build a regression equation and
predict the salary of a college graduate whose experience is 10 years.
Input:
Output:
Result: Thus the concept of Linear Regression for training the given dataset was applied and
implemented.
Ex.No:5
Date:
Testing the Given Dataset for an Application
Aim:
To apply Naive Bayes classification for testing the given dataset.
Procedure:
Step 1: Open the weka tool.
Step 2: Download a dataset from the UCI Machine Learning Repository.
Step 3: Apply replace missing values.
Step 4: Apply normalize filter.
Step 5: Click the Classify tab.
Step 6: Apply the Naive Bayes classifier.
Step 7: Find the Classified Value.
Step 8: Note the output.
Example: Predict whether a customer will buy a computer or not. Customers are described by
two attributes: age and income. X is a 35-year-old customer with an income of 40k. H is the
hypothesis that the customer will buy a computer. P(H|X) reflects the probability that customer
X will buy a computer given that we know the customer's age and income.
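For reference, Naive Bayes computes this posterior probability using Bayes' theorem, with the
naive assumption that the attributes are conditionally independent given the hypothesis:

P(H|X) = P(X|H) * P(H) / P(X)
P(X|H) = P(age = 35 | H) * P(income = 40k | H)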
Input:
Output:
Result:
Thus Naive Bayes classification for testing the given dataset was implemented.
Ex.No:6
Date:
Write the Query for Schema Definition
Ex.No.6.1 Query for Star schema using SQL Server Management Studio
Aim:
To execute and verify query for star schema using SQL Server Management Studio
Procedure:
Step 1: Install SQL Server Express (SQLEXPR) and SQL Server Management Studio
Step 2: Launch SQL Server Management Studio
Step 3: Create a new database and write the query for creating the star schema tables
Step 4: Execute the query for schema
Step 5: Explore the database diagram for Star schema
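Sample Query (a minimal sketch only; the table and column names below are hypothetical,
assuming a sales fact table with Date, Product, and Salesperson dimensions):

CREATE TABLE DimDate (
    DateKey        INT PRIMARY KEY,
    FullDate       DATE,
    MonthName      VARCHAR(20),
    CalendarYear   INT
);

CREATE TABLE DimProduct (
    ProductKey     INT PRIMARY KEY,
    ProductName    VARCHAR(100),
    Category       VARCHAR(50)
);

CREATE TABLE DimSalesperson (
    SalespersonKey  INT PRIMARY KEY,
    SalespersonName VARCHAR(100),
    Region          VARCHAR(50)
);

-- Fact table at the centre of the star, holding measures and dimension keys
CREATE TABLE FactSales (
    SalesKey        INT IDENTITY(1,1) PRIMARY KEY,
    DateKey         INT FOREIGN KEY REFERENCES DimDate(DateKey),
    ProductKey      INT FOREIGN KEY REFERENCES DimProduct(ProductKey),
    SalespersonKey  INT FOREIGN KEY REFERENCES DimSalesperson(SalespersonKey),
    Quantity        INT,
    SalesAmount     DECIMAL(10,2)
);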
Result:
Thus the Query for Star Schema was created and executed successfully
Ex.No.6.2 Query for SnowFlake schema using SQL Server Management Studio
Aim:
To execute and verify query for SnowFlake schema using SQL Server Management Studio
Procedure:
Step 1: Install SQL Server Express (SQLEXPR) and SQL Server Management Studio
Step 2: Launch SQL Server Management Studio
Step 3: Create a new database and write the query for creating the snowflake schema tables
Step 4: Execute the query
Step 5: Explore the database diagram for the snowflake schema
Step 6: Connect the Geography table to the Salesperson and Product tables through the Geography key
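Sample Query (a minimal sketch; it assumes the hypothetical star-schema tables from
Ex.No.6.1 and normalizes the geography attributes into a separate table, as in Step 6):

-- Geography is split out into its own dimension table (snowflaking)
CREATE TABLE DimGeography (
    GeographyKey  INT PRIMARY KEY,
    City          VARCHAR(50),
    Region        VARCHAR(50),
    Country       VARCHAR(50)
);

-- The Salesperson and Product dimensions reference Geography through a key
ALTER TABLE DimSalesperson ADD GeographyKey INT
    FOREIGN KEY REFERENCES DimGeography(GeographyKey);

ALTER TABLE DimProduct ADD GeographyKey INT
    FOREIGN KEY REFERENCES DimGeography(GeographyKey);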
Output
Result:
Thus the Query for SnowFlake Schema was created and executed successfully
Ex.No:7
Date:
Design Data Warehouse for Real Time Applications
Aim:
To design and execute a data warehouse for a real-time application using SQL Server Management
Studio
Procedure:
Step 1: Launch SQL Server Management Studio
Step 2: Explore the created database
Step 3: 3.1 Right-click on the table name and click on the Edit top 200 rows option.
3.2. Enter the data inside the table or use the top 1000 rows option and enter the query.
Step 4: Execute the query, and the data will be updated in the table.
Step 5: Right-click on the database and click on the tasks option. Use the import data option to
import files to the database.
Sample Query
INSERT INTO dbo.person(first_name,last_name,gender) VALUES
('Kavi','S','M'), ('Nila','V','F'), ('Nirmal','B','M'), ('Kaviya','M','F');
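The INSERT above assumes that a dbo.person table already exists; a minimal, hypothetical
definition that would make the sample runnable is:

CREATE TABLE dbo.person (
    person_id   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    first_name  VARCHAR(50) NOT NULL,
    last_name   VARCHAR(50) NOT NULL,
    gender      CHAR(1)                         -- 'M' or 'F'
);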
Result:
Thus, the data warehouse for real-time applications was designed successfully.
Ex.No:8
Date:
Case Study Using OLAP
Aim:
To evaluate the implementation and impact of OLAP technology in a real-world business
context, analyzing its effectiveness in enhancing data analysis, decision-making, and overall
operational efficiency.
Introduction:
OLAP stands for On-Line Analytical Processing. OLAP is a category of
software technology that enables analysts, managers, and executives to gain insight into
information through fast, consistent, interactive access to a wide variety of possible views of
data that has been transformed from raw information to reflect the real dimensionality of the
enterprise as understood by its users. It is used to analyze business data from different
points of view. Organizations collect and store data from multiple data sources, such as
websites, applications, smart meters, and internal systems.
Methodology
OLAP (Online Analytical Processing) methodology refers to the approach and techniques
used to design, create, and use OLAP systems for efficient multidimensional data analysis. Here
are the key components and steps involved in the OLAP methodology:
1. Requirement Analysis:
The process begins with understanding the specific analytical requirements of the
users. Analysts and stakeholders define the dimensions, measures, hierarchies, and data sources
that will be part of the OLAP system. This step is crucial to ensure that the OLAP system meets
the business needs.
2. Dimensional Modeling:
Dimension tables are designed to represent attributes like time, geography, and
product categories. Fact tables contain the numerical data (measures) and the keys to
dimension tables.
3. Star Schema:
This is a common design in OLAP systems where the fact table is at the center, connected to
dimension tables.
Operations in OLAP
In OLAP (Online Analytical Processing), operations are the fundamental actions performed on
multidimensional data cubes to retrieve, analyze, and present data in a way that facilitates
decision-making and data exploration. The main operations in OLAP are:
1. Slice: Slicing selects a single value for one dimension of the cube, producing a sub-cube
with one dimension less. For example, you can slice the cube to view sales data for a single
year across all products and regions.
2. Dice: Dicing is the process of selecting specific values from two or more dimensions to
create a subcube. It allows you to focus on a particular combination of attributes. For
example, you can dice the cube to view sales data for a specific product category and region
within a certain time frame.
3. Roll-up (Drill-up): Roll-up allows you to move from a more detailed level of data to a
higher-level summary. For instance, you can roll up from daily sales data to monthly or yearly
sales data, aggregating the information.
4. Drill-down (Drill-through): Drill-down is the opposite of roll-up, where you move from
a higher-level summary to a more detailed view of the data. For example, you can drill
down from yearly sales data to quarterly, monthly, and daily data, getting more granularity.
5. Pivot (Rotate): Pivoting involves changing the orientation of the cube, which means
swapping dimensions to view the data from a different perspective. This operation is useful for
exploring data in various ways.
6. Slice and Dice: Combining slicing and dicing allows you to select specific values from
different dimensions to create subcubes. This operation helps you focus on a highly specific
subset of the data.
7. Drill-across: Drill-across involves navigating between cubes that are related but have
different dimensions or hierarchies. It allows users to explore data across different OLAP cubes.
8. Data Filtering: In OLAP, you can filter data to view only specific data points or subsets
that meet certain criteria. This operation is useful for narrowing down data to what is most
relevant for analysis.
(Figures: illustrations of the Slice, Dice, Roll-Up, Pivot, and Drill-Down operations.)
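On a relational warehouse, several of these operations map directly onto SQL. The following
sketches assume the hypothetical FactSales, DimProduct, and DimDate tables from Ex.No.6.1:

-- Slice: fix one dimension value (sales for a single year only)
SELECT p.Category, SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimProduct p ON f.ProductKey = p.ProductKey
JOIN DimDate d    ON f.DateKey    = d.DateKey
WHERE d.CalendarYear = 2023
GROUP BY p.Category;

-- Roll-up: aggregate from product level up to category level and a grand total
SELECT p.Category, p.ProductName, SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimProduct p ON f.ProductKey = p.ProductKey
GROUP BY ROLLUP (p.Category, p.ProductName);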
3. Data Loading:
Load the integrated and preprocessed transaction data into the OLAP cube. Ensure that the
cube is regularly updated to reflect the most recent data.
4. OLAP Cube Design:
Define hierarchies and relationships within the cube to enable effective analysis. For instance,
you might have hierarchies that allow drilling down from product categories to individual
products.
5. Market Basket Analysis:
Although OLAP cubes are not designed for direct market basket analysis, they can
facilitate it in several ways:
Conclusion
OLAP is a powerful technology for businesses and organizations seeking data insights,
informed decisions, and performance improvement. It enables multidimensional data
analysis, especially in complex, data-intensive environments, and it empowers businesses to
analyze data efficiently and effectively, offering a competitive advantage in today's
data-driven world.
Ex.No:9
Date:
Case Study Using OLTP
Aim:
To develop an OLTP system that enables the e-commerce company to process a high volume of
online orders, track inventory, manage customer information, and handle financial
transactions in real-time, ensuring data integrity and providing a seamless shopping
experience for customers.
Introduction:
In today's digital age, businesses across various industries are relying heavily on technology to
streamline their operations and provide seamless services to their customers. One crucial
aspect of this technological transformation is the development and implementation of
efficient Online Transaction Processing (OLTP) systems. This case study delves into the
design and implementation of an OLTP system for a fictional e-commerce company,
"TechTrend Electronics," and examines the key considerations, challenges, and aims
associated with such a project.
This case study aims to showcase the process of developing an OLTP system tailored to
TechTrend Electronics' unique requirements. The objective is to ensure that the company can
efficiently handle a multitude of real-time transactions while maintaining data accuracy and
providing a seamless shopping experience for its customers.
Methodology:
The methodology for developing an OLTP (Online Transaction Processing) system for a case
study involves a systematic approach to designing, implementing, and testing the system.
Below is a step-by-step methodology for creating an OLTP system for a case study, using the
fictional e-commerce company "TechTrend Electronics" as an example:
1. Database Design:
Develop a well-structured relational database schema that aligns with the business
requirements.
Normalize the data to eliminate redundancy and ensure data consistency.
Create entity-relationship diagrams and define data models for key entities like customers,
products, orders, payments, and inventory (a minimal SQL sketch is given after this methodology).
2. Technology Selection:
Choose appropriate technologies for the database management system (e.g., MySQL,
PostgreSQL, Oracle) and programming languages (e.g., Java, Python, C#) for the OLTP
system.
Evaluate and select suitable frameworks, libraries, and tools that align with the chosen
technologies.
3. System Architecture:
Design the system's architecture, which may include multiple application layers, a web
interface, and a database layer.
Implement a layered architecture, separating concerns for scalability, maintainability, and
security.
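A minimal sketch of the database-design step, assuming hypothetical products and orders tables
for TechTrend Electronics; the transaction at the end shows how an order and the matching stock
update are committed as one atomic unit of work:

CREATE TABLE products (
    product_id    INT IDENTITY(1,1) PRIMARY KEY,
    product_name  VARCHAR(100) NOT NULL,
    unit_price    DECIMAL(10,2) NOT NULL,
    stock_qty     INT NOT NULL
);

CREATE TABLE orders (
    order_id     INT IDENTITY(1,1) PRIMARY KEY,
    customer_id  INT NOT NULL,
    product_id   INT NOT NULL FOREIGN KEY REFERENCES products(product_id),
    quantity     INT NOT NULL,
    order_date   DATETIME NOT NULL DEFAULT GETDATE()
);

-- Record an order and decrement stock atomically (OLTP transaction)
BEGIN TRANSACTION;
    INSERT INTO orders (customer_id, product_id, quantity)
    VALUES (1, 1, 2);

    UPDATE products
    SET stock_qty = stock_qty - 2
    WHERE product_id = 1;
COMMIT TRANSACTION;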
Conclusion:
In conclusion, OLTP systems play a pivotal role in modern business operations, facilitating
real-time transaction processing, data integrity, and customer interactions. These systems are
designed for high concurrency, low-latency, and consistent data access, making them
essential for day-to-day operations in various industries, such as finance, e-commerce,
healthcare, and more.
Overall, OLTP systems are the backbone of modern business operations, ensuring the
seamless execution of day-to-day transactions and delivering a positive customer experience.
Ex.No:10
Date:
Implementation of Warehouse Testing.
Aim:
To perform load testing using JMeter and interact with a SQL Server database using SQL
Management Studio, you'll need to set up JMeter to send SQL queries to the database
and collect the results for analysis.
Procedure:
1. Install Required Software:
Install JMeter: Download and install JMeter from the official Apache JMeter website.
Install SQL Server and SQL Management Studio: If you haven't already, set up SQL
Server and SQL Management Studio to manage your database.
2. Create a Test Plan in JMeter:
Launch JMeter and create a new Test Plan.
3. Add Thread Group:
Add a Thread Group to your Test Plan to simulate the number of users and requests.
4. Add JDBC Connection Configuration:
Add a JDBC Connection Configuration element to your Thread Group. Configure it
with the database connection details, such as the JDBC URL, username, and password.
This element will allow JMeter to connect to your SQL Server database.
5. Add a JDBC Request Sampler:
Add a JDBC Request sampler to the Thread Group, set its pool variable name to match the
one defined in the JDBC Connection Configuration, and enter the SQL query to be executed
during the test (a sample query is given after this procedure).
6. Add Listeners:
Add listeners to your Test Plan to collect and view the test results. Common
listeners include View Results Tree, Summary Report, and Response Times
Over Time.
7. Configure Your Test Plan:
Configure the number of threads (virtual users), ramp-up time, and loop count in the
Thread Group to simulate the desired load.
8. Run the Test:
Start the test by clicking the Start button (or Run > Start) in JMeter.
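Sample Query (hypothetical; it reuses the dbo.person table from Ex.No.7, and the JDBC URL in
the comment follows the standard Microsoft SQL Server driver format):

-- JDBC URL for the JDBC Connection Configuration (the database name is an assumption):
--   jdbc:sqlserver://localhost:1433;databaseName=TestDB;encrypt=false
-- JDBC Driver class: com.microsoft.sqlserver.jdbc.SQLServerDriver
-- Query entered in the JDBC Request sampler:
SELECT first_name, last_name, gender
FROM dbo.person
WHERE gender = 'F';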
Conclusion
Using JMeter in conjunction with SQL Management Studio can be a powerful combination
for load testing and performance analysis of applications that rely on SQL Server databases.
This approach allows you to simulate a realistic user load, send SQL queries to the database,
and evaluate the system's performance under various conditions.
JMeter in combination with SQL Management Studio provides a robust solution for assessing
the performance of applications that rely on SQL Server databases. Through thorough testing,
analysis, and optimization, you can ensure your application is capable of delivering a reliable
and responsive experience to users even under heavy load conditions.