

Databases For Data Mining

The document outlines key databases used for data mining benchmarking, including the UCI Machine Learning Repository, Kaggle Datasets, and ImageNet, among others. It emphasizes the importance of dataset diversity, size, complexity, and reproducibility in evaluating data mining algorithms. Additionally, it provides URLs and instructions for accessing these benchmark databases.

Uploaded by

Miguel Barth

DATABASES FOR DATA MINING SOFTWARE BENCHMARKING

Data mining benchmarking involves evaluating the performance of various data mining algorithms
and systems on standard datasets. To facilitate this, certain benchmark databases are commonly
used due to their well-defined characteristics and wide acceptance in the research community.

Here are some of the key databases used for data mining benchmarking:

1. UCI Machine Learning Repository


The UCI Machine Learning Repository is one of the most popular sources of datasets for data
mining and machine learning. It contains a wide variety of datasets from different domains,
covering tasks such as classification, regression, and clustering.
Notable datasets: Iris, Wine, Adult, and Breast Cancer.
Features: Provides metadata for each dataset, including attribute information and data
types.
2. Kaggle Datasets
Kaggle offers a diverse collection of datasets contributed by the community, often used in
competitions and for research purposes.
Notable datasets: Titanic, House Prices, MNIST (digit recognition).
Features: Datasets are often accompanied by detailed descriptions, kernels (notebooks), and
discussions, making them useful for benchmarking and experimentation.
3. KEEL Repository
The Knowledge Extraction based on Evolutionary Learning (KEEL) repository is designed for
benchmarking evolutionary algorithms in data mining. It includes datasets for classification,
regression, clustering, and more.
Notable datasets: Various small to medium-sized datasets specifically prepared for algorithm
testing.
Features: Provides tools and software for experimental setups.
4. StatLib
Hosted by Carnegie Mellon University, StatLib offers a collection of datasets primarily used in
statistics and machine learning research.
Notable datasets: Boston Housing, COIL Challenge 2000.
Features: Emphasizes statistical datasets, often used for regression analysis.
5. OpenML
OpenML is an open platform for sharing datasets, algorithms, and experiments. It provides a
wide variety of datasets that can be used for benchmarking data mining algorithms.
Notable datasets: MNIST, CIFAR-10, Adult.
Features: Integrates with various machine learning tools and platforms, allowing for easy
experiment sharing and comparison.
6. TREC
The Text REtrieval Conference (TREC) provides datasets for benchmarking information
retrieval systems.
Notable datasets: Various collections related to web search, question answering (QA), and more.
Features: Focuses on large-scale text data and associated retrieval tasks.
7. ImageNet
Primarily used for benchmarking image classification algorithms, ImageNet is a large-scale
image database organized according to the WordNet hierarchy.
Notable datasets: ImageNet Large Scale Visual Recognition Challenge (ILSVRC) datasets.
Features: Contains millions of labeled images for training and benchmarking deep learning
models.
8. LIBSVM Datasets
The LIBSVM library offers a collection of datasets used for benchmarking support vector
machine (SVM) algorithms.
Notable datasets: Adult, Heart Disease, and other standard SVM datasets.
Features: Suitable for evaluating SVM and other related algorithms.

Considerations for Benchmarking

Diversity of Data: Choose datasets from various domains (text, image, numerical) to
comprehensively evaluate the performance of data mining algorithms.
Size and Complexity: Include both small and large datasets to test the scalability of
algorithms.
Data Characteristics: Consider the characteristics such as missing values, noise, and class
imbalance to test the robustness of algorithms.
Reproducibility: Ensure that the datasets and experimental setups are well-documented
to facilitate reproducibility.
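
The data characteristics listed above can be checked before any algorithm is run. The following is a minimal sketch, not a standard tool: the function name `profile_dataset`, the choice of missing-value markers, and the toy rows (loosely styled on the Adult dataset) are all assumptions made for illustration.

```python
from collections import Counter

def profile_dataset(rows, label_key, missing_markers=("", "?", None)):
    """Summarize size, missing values, and class balance for a list of dict rows."""
    n_rows = len(rows)
    n_cells = sum(len(row) for row in rows)
    # Count every cell whose value matches one of the assumed missing markers.
    n_missing = sum(
        1 for row in rows for value in row.values() if value in missing_markers
    )
    class_counts = Counter(
        row[label_key] for row in rows if row[label_key] not in missing_markers
    )
    # Imbalance ratio: majority class size divided by minority class size.
    if class_counts:
        imbalance = max(class_counts.values()) / min(class_counts.values())
    else:
        imbalance = float("nan")
    return {
        "rows": n_rows,
        "missing_fraction": n_missing / n_cells if n_cells else 0.0,
        "class_counts": dict(class_counts),
        "imbalance_ratio": imbalance,
    }

# Hypothetical toy rows, made up for this example (not real benchmark data).
toy_rows = [
    {"age": "39", "workclass": "State-gov", "income": "<=50K"},
    {"age": "50", "workclass": "?", "income": "<=50K"},
    {"age": "38", "workclass": "Private", "income": "<=50K"},
    {"age": "52", "workclass": "Self-emp", "income": ">50K"},
]

summary = profile_dataset(toy_rows, label_key="income")
print(summary)
```

Reporting these numbers alongside benchmark results makes it easier to judge whether two algorithms were compared on data of similar difficulty.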

Using these benchmark databases allows researchers and practitioners to objectively
compare the performance of different data mining techniques and contribute to the
advancement of the field.
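
For the reproducibility consideration, one lightweight option is to store a small manifest next to each experiment that records where the dataset came from, a checksum of the exact file used, and the random seed. The sketch below is one possible approach under stated assumptions: the manifest fields and the helper names `file_sha256`, `build_manifest`, and `save_manifest` are hypothetical and not part of any benchmark repository's tooling.

```python
import hashlib
import json

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(dataset_path, source_url, random_seed, notes=""):
    """Collect the facts needed to rerun an experiment on the same data."""
    return {
        "dataset_path": str(dataset_path),
        "source_url": source_url,
        "sha256": file_sha256(dataset_path),
        "random_seed": random_seed,
        "notes": notes,
    }

def save_manifest(manifest, manifest_path):
    """Write the manifest as pretty-printed JSON."""
    with open(manifest_path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

A later run can recompute the checksum and compare it with the stored value to confirm that the dataset file has not changed.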

The URLs and instructions for accessing and downloading datasets from these benchmark
databases are:
1. UCI Machine Learning Repository
URL: https://archive.ics.uci.edu/ml/index.php
Instructions: Navigate to the website, select a dataset, and download the data files
(commonly comma-separated text such as CSV or .data files; some datasets also ship in ARFF).
2. Kaggle Datasets
URL: https://www.kaggle.com/datasets
Instructions: Create a Kaggle account, log in, search for a dataset, and download it. You
can also use Kaggle’s API to download datasets programmatically.
3. KEEL Repository
URL: http://sci2s.ugr.es/keel/datasets.php
Instructions: Browse the datasets by category, select a dataset, and download the data
files in KEEL format. You may need to convert them to other formats like CSV.
4. StatLib
URL: http://lib.stat.cmu.edu/datasets/
Instructions: Browse the datasets available on the website, click on a dataset to view its
description and download the data files.

5. OpenML
URL: https://www.openml.org/
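
The KEEL instructions above mention converting downloaded files to CSV. The following is a rough sketch of such a conversion, assuming the ARFF-like KEEL layout in which header lines begin with `@` (`@relation`, `@attribute`, `@inputs`, `@outputs`) and comma-separated rows follow an `@data` line. The function name `keel_to_csv` and the sample text are made up for illustration, and real files may need extra handling (for example, missing-value markers).

```python
import csv
import io

def keel_to_csv(keel_text):
    """Convert KEEL-style text (@attribute header lines, then @data rows) to CSV text."""
    columns = []
    rows = []
    in_data = False
    for raw_line in keel_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if in_data:
            # Data rows are comma-separated; trim the padding around each cell.
            rows.append([cell.strip() for cell in line.split(",")])
        elif line.lower().startswith("@attribute"):
            # The attribute name is the second token; drop any attached "{...}" part.
            columns.append(line.split()[1].split("{")[0])
        elif line.lower().startswith("@data"):
            in_data = True
        # Other header lines (@relation, @inputs, @outputs) are ignored here.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical sample in the assumed KEEL layout (not a real repository file).
sample = """@relation toy
@attribute sepalLength real [4.3, 7.9]
@attribute class {Iris-setosa, Iris-versicolor}
@inputs sepalLength
@outputs class
@data
5.1, Iris-setosa
7.0, Iris-versicolor
"""

print(keel_to_csv(sample))
```

The resulting CSV text has one header row taken from the attribute names, so it can be written to disk and opened with common data mining tools.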
