Uci Dataset

The document provides an overview of various medical conditions, datasets, and their characteristics, including hepatitis, breast cancer, and lung cancer, as well as multiple datasets related to health, marketing, and environmental factors. Each dataset is described with its size, features, and target variables, covering topics from bike sharing to student performance. Additionally, it highlights the importance of these datasets for machine learning and data analysis.

Uploaded by

Himanshu Harsh

UCI

Hepatitis
Hepatitis means inflammation of the liver. The liver is a vital organ
that processes nutrients, filters the blood, and fights infections.
When the liver is inflamed or damaged, its function can be affected.
Heavy alcohol use, toxins, some medications, and certain medical
conditions can cause hepatitis. However, hepatitis is often caused by
a virus. In the United States, the most common types of viral
hepatitis are hepatitis A, hepatitis B, and hepatitis C.

Breast cancer
Breast cancer forms in tissues of the breast. The most common type of
breast cancer is ductal carcinoma, which begins in the lining of the
milk ducts (thin tubes that carry milk from the lobules of the breast
to the nipple). Another type of breast cancer is lobular carcinoma,
which begins in the lobules (milk glands) of the breast. Invasive
breast cancer is breast cancer that has spread from where it began in
the breast ducts or lobules to surrounding normal tissue. Breast
cancer occurs in both men and women, although male breast cancer
is rare.
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer

Statlog (Heart)
The Statlog (Heart) dataset is a heart disease database containing
270 instances, each described by 13 attributes, including age, sex,
chest pain type (4 values), resting blood pressure, serum cholesterol
in mg/dL, fasting blood sugar > 120 mg/dL, resting
electrocardiographic results (values 0, 1, and 2), and maximum heart
rate achieved.
https://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29

Parkinsons
Parkinson’s disease is a brain disorder that causes unintended or
uncontrollable movements, such as shaking, stiffness, and difficulty
with balance and coordination.
https://archive.ics.uci.edu/ml/datasets/Parkinsons

Lung cancer
Lung cancer is a type of cancer that begins in the lungs. Your lungs
are two spongy organs in your chest that take in oxygen when you
inhale and release carbon dioxide when you exhale. Lung cancer is
the leading cause of cancer deaths worldwide.
https://archive.ics.uci.edu/ml/datasets/Lung+Cancer

Blood-transfusion
A blood transfusion is a common procedure in which donated
blood or blood components are given to you through an
intravenous line (IV). A blood transfusion is given to replace blood
and blood components that may be too low.

https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center

Amazon Commerce Reviews Set Data Set
The dataset is derived from customers' reviews on the Amazon
Commerce website and is intended for authorship identification. Most
previous studies conducted identification experiments for two to ten
authors. In the online context, however, reviews to be identified
usually have more potential authors, and classification algorithms are
normally not adapted to a large number of target classes. To examine
the robustness of classification algorithms, we identified 50 of the
most active users (represented by a unique ID and username) who
frequently posted reviews in these newsgroups. The number of
reviews we collected for each author is 30.
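
The figures above fix the size of the corpus and imply a very low chance-level accuracy, which is what makes the 50-author setting demanding. A minimal sketch of that arithmetic, assuming a perfectly balanced corpus; the function names are illustrative and not part of the dataset:

```python
def corpus_size(n_authors: int, reviews_per_author: int) -> int:
    """Total number of reviews in a balanced authorship corpus."""
    return n_authors * reviews_per_author


def chance_accuracy(n_authors: int) -> float:
    """Expected accuracy of guessing uniformly at random among the authors."""
    return 1.0 / n_authors


if __name__ == "__main__":
    print(corpus_size(50, 30))   # 1500
    print(chance_accuracy(50))   # 0.02
```

Any classifier evaluated on this data should be compared against that 2% baseline rather than the 50% baseline of a two-author experiment.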

Bank Marketing Data Set


The data is related to direct marketing campaigns of a Portuguese
banking institution. The marketing campaigns were based on phone
calls. Often, more than one contact with the same client was required
in order to assess whether the product (bank term deposit) would be
subscribed ('yes') or not ('no').

There are four datasets:


1) bank-additional-full.csv with all examples (41188) and 20 inputs,
ordered by date (from May 2008 to November 2010), very close to
the data analyzed in [Moro et al., 2014]
2) bank-additional.csv with 10% of the examples (4119), randomly
selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date
(older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly
selected from 3) (older version of this dataset with fewer inputs).
The smallest datasets are provided to test more computationally
demanding machine learning algorithms (e.g., SVM).
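
The 10% files can be thought of as random subsets of the full files. The sketch below shows one way such a subset could be drawn with the standard library; it is an assumption about the general approach, not the procedure the dataset authors used, and `sample_fraction` is a hypothetical helper.

```python
import random


def sample_fraction(rows, fraction, seed=0):
    """Return a random subset holding roughly `fraction` of `rows`, without replacement."""
    k = round(len(rows) * fraction)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(rows, k)


if __name__ == "__main__":
    # Stand-in for the 41188 rows of bank-additional-full.csv.
    full = list(range(41188))
    subset = sample_fraction(full, 0.10)
    print(len(subset))  # 4119
```

Rounding 10% of 41188 gives the 4119 examples listed for bank-additional.csv.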

Fertility Data Set


Fertility is the ability to conceive a child. The fertility rate is
the average number of children born during an individual's
lifetime and is quantified demographically.
Conversely, infertility is the difficulty or inability
to reproduce naturally. In general, infertility is defined as not
being able to conceive a child after one year (or longer)
of unprotected sex [1]. Infertility is widespread, with fertility
specialists available all over the world to assist parents and
couples who experience difficulties conceiving a baby.

Wine Dataset
Contains the results of a chemical analysis of wines grown in a
particular region in Italy. The dataset contains 178 samples,
with each sample representing one wine. Each sample
contains 13 features, including measurements of alcohol
content, acidity, and color intensity. The target variable is the
type of wine, with three possible values: class 1, class 2, and
class 3.

Car Evaluation Dataset


Contains data on cars and their features, along with
evaluations from experts. The dataset contains 1,728
samples, with each sample representing one car. Each
sample contains six features, including the price,
maintenance cost, and number of doors. The target variable
is the evaluation of the car, with four possible values: unacc
(unacceptable), acc (acceptable), good, and vgood (very
good).
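
Because the four evaluation values are ordered, they can be mapped to integer ranks before modelling. A minimal sketch, assuming the order unacc < acc < good < vgood given above; the names `CAR_CLASSES` and `encode_labels` are illustrative:

```python
# Order taken from the description above: unacc < acc < good < vgood.
CAR_CLASSES = ["unacc", "acc", "good", "vgood"]
CLASS_TO_RANK = {label: rank for rank, label in enumerate(CAR_CLASSES)}


def encode_labels(labels):
    """Map evaluation strings to integer ranks; unknown labels raise KeyError."""
    return [CLASS_TO_RANK[label] for label in labels]


if __name__ == "__main__":
    print(encode_labels(["unacc", "vgood", "acc"]))  # [0, 3, 1]
```

An ordinal encoding keeps the ranking information that a plain one-hot encoding of the target would discard.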

Diabetes Dataset
Contains data on patients with diabetes and their health metrics. The
dataset contains 768 samples, with each sample representing one
patient. Each sample contains eight features, including age, body
mass index, and blood pressure. The target variable is whether the
patient has diabetes, with two possible values: yes or no.

Titanic Dataset
Contains data on passengers aboard the Titanic, including whether
they survived. The dataset contains 891 samples, with each sample
representing one passenger. Each sample contains 12 features,
including age, sex, and ticket class. The target variable is whether the
passenger survived, with two possible values: yes or no.

Abalone Dataset
Contains data on the age, sex, and physical measurements of
abalone snails. The dataset contains 4,177 samples, with each
sample representing one abalone snail. Each sample contains eight
features, including the length, diameter, and weight of the snail. The
target variable is the age of the snail, which is a continuous value.

Forest Fires Dataset


Contains data on the spatial location and various metrics of forest
fires. The dataset contains 517 samples, with each sample
representing one forest fire. Each sample contains 12 features,
including the month, day, and weather conditions such as temperature
and wind speed. The target variable is
the burned area of the forest (in hectares), which is a continuous
value.

Seeds Dataset
Contains data on three different varieties of wheat seeds. The
dataset contains 210 samples, with each sample representing one
wheat seed. Each sample contains seven features, including
measurements of the area, perimeter, and compactness of the seed.
The target variable is the variety of the wheat seed, with three
possible values: Kama, Rosa, and Canadian.

Abalone Dataset
Contains data on the physical characteristics of abalone, a type of
shellfish. The dataset contains 4,177 samples, with each sample
representing one abalone. Each sample contains eight features,
including measurements of the length, diameter, and weight of the
abalone. The target variable is the age of the abalone, which is a
continuous value.

Bike Sharing Dataset


Contains data on bike rentals, including various weather and
seasonal factors. The dataset contains 17,379 samples, with each
sample representing one hour of bike rentals. Each sample contains
16 features, including the temperature, humidity, and wind speed.
The target variable is the number of bike rentals, which is a
continuous value.

Letter Recognition Dataset


Contains data on the recognition of capital letters. The dataset
contains 20,000 samples, with each sample representing one letter.
Each sample contains 16 features, including measurements of the
diagonal length and the width of the letter. The target variable is the
letter that was recognized, with 26 possible values: A to Z.

Superconductivity Dataset
Contains data on the critical temperature of superconductors, based
on various material properties. The dataset contains 21,263 samples,
with each sample representing one superconductor. Each sample
contains 81 features, including measurements of the atomic mass
and electronegativity. The target variable is the critical temperature
of the superconductor, which is a continuous value.

Dermatology Dataset
Contains data on the diagnosis of various skin diseases. The dataset
contains 366 samples, with each sample representing one patient.
Each sample contains 34 features, including the age, family history,
and various clinical and histopathological skin lesion features. The
target variable is the diagnosis of the skin disease, with six possible
values: psoriasis, seborrheic dermatitis, lichen planus, pityriasis
rosea, chronic dermatitis, and pityriasis rubra pilaris.

Gas Sensor Array Drift Dataset


Contains data on the drift behavior of gas sensor arrays, based on
various concentration levels of gas mixtures. The dataset contains
13,910 samples, with each sample representing one gas sensor array
measurement. Each sample contains 128 features, including
measurements of the response of the sensor array to different gases.
The target variable is the concentration level of the gas mixture,
which is a continuous value.

Gesture Recognition Dataset


Contains data on the recognition of hand gestures, captured using a
Kinect sensor. The dataset contains 8,080 samples, with each sample
representing one gesture. Each sample contains 20 features,
including measurements of the position and velocity of the hand. The
target variable is the type of gesture, with five possible values: swipe
left, swipe right, wave, clap, and arm cross.

Covertype Dataset
Contains data on predicting forest cover type based on various
cartographic variables. The dataset contains 581,012 samples, with
each sample representing one 30m x 30m patch of forest land. Each
sample contains 54 features, including measurements of elevation,
slope, and distance to water. The target variable is the forest cover
type, with seven possible values: spruce/fir, lodgepole pine,
ponderosa pine, cottonwood/willow, aspen, Douglas-fir, or
krummholz.

Credit Approval Dataset


Contains data on credit card applications, with a focus on approving
or rejecting the applications. The dataset contains 690 samples, with
each sample representing one credit card application. Each sample
contains 15 features, including the age, income, and employment
status of the applicant. The target variable is whether or not the
application was approved, with two possible values: + (approved) or -
(rejected).

Human Activity Recognition Using Smartphones Dataset
Contains data on the recognition of human activities using data from
smartphones. The dataset contains 10,299 samples, with each
sample representing one 2.56-second window of data. Each sample
contains 561 features, including measurements of the accelerometer
and gyroscope readings from the smartphone. The target variable is
the type of activity, with six possible values: walking, walking
upstairs, walking downstairs, sitting, standing, and laying.
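
The fixed-length windows described above can be illustrated with a small sliding-window helper. This is a sketch under two assumptions that are not stated in the text: a 50 Hz sampling rate and a 50% overlap between consecutive windows. `sliding_windows` is a hypothetical helper, not part of the dataset's tooling.

```python
def sliding_windows(signal, size, step):
    """Split a sequence into fixed-size windows, advancing `step` samples each time."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


if __name__ == "__main__":
    sampling_rate_hz = 50     # assumption, not stated in the description
    window_seconds = 2.56     # from the description above
    size = round(window_seconds * sampling_rate_hz)
    step = size // 2          # assumed 50% overlap
    windows = sliding_windows(list(range(384)), size, step)
    print(size, len(windows))  # 128 5
```

Under these assumptions each window holds 128 raw readings per sensor axis; the 561 features would then be summary statistics computed from such windows.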

Mushroom Dataset
Contains data on classifying mushrooms as edible or poisonous,
based on various physical characteristics. The dataset contains 8,124
samples, with each sample representing one mushroom. Each
sample contains 22 features, including measurements of the cap
shape, color, and odor. The target variable is the edibility of the
mushroom, with two possible values: edible or poisonous.
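
Since the mushroom features are categorical, they usually need to be turned into numbers before most learning algorithms can use them. A minimal one-hot encoding sketch in plain Python follows; the rows are made up for illustration and are not taken from the dataset, and `one_hot_encode` is a hypothetical helper.

```python
def one_hot_encode(rows):
    """One-hot encode rows of categorical values, one output column per (position, value) pair."""
    n_cols = len(rows[0])
    # Sorted set of values observed in each column.
    categories = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    encoded = []
    for row in rows:
        vector = []
        for j, value in enumerate(row):
            vector.extend(1 if value == c else 0 for c in categories[j])
        encoded.append(vector)
    return categories, encoded


if __name__ == "__main__":
    # Made-up rows: (cap shape, cap color, odor).
    toy = [
        ("convex", "brown", "almond"),
        ("flat", "white", "foul"),
        ("convex", "white", "none"),
    ]
    cats, enc = one_hot_encode(toy)
    print(cats)    # [['convex', 'flat'], ['brown', 'white'], ['almond', 'foul', 'none']]
    print(enc[0])  # [1, 0, 1, 0, 1, 0, 0]
```

With 22 categorical features the encoded vectors become much wider than 22 columns, which is worth keeping in mind when choosing a model.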

Student Performance Dataset


Contains data on predicting student performance in math and
Portuguese language classes, based on various personal, social, and
school-related factors. The Portuguese language file contains 649
samples and the math file contains 395, with each sample representing
one student. Each sample contains 30 features, including the
student's age, family background, and study habits. The target
variable is the final grade in the class,
with values ranging from 0 to 20.
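
The 0 to 20 grade can be used directly as a regression target or collapsed into classes. The sketch below shows a pass/fail split; the threshold of 10 is an assumption chosen for illustration, and `to_pass_fail` is a hypothetical helper.

```python
def to_pass_fail(grade, threshold=10):
    """Convert a 0-20 final grade into a 'pass' or 'fail' label."""
    if not 0 <= grade <= 20:
        raise ValueError(f"grade out of range: {grade}")
    return "pass" if grade >= threshold else "fail"


if __name__ == "__main__":
    print([to_pass_fail(g) for g in (0, 9, 10, 20)])  # ['fail', 'fail', 'pass', 'pass']
```

Turning the grade into two classes changes the task from regression to binary classification, so the evaluation metric has to change with it.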

Car Evaluation Dataset


Contains data on evaluating the acceptability of cars based on various
attributes. The dataset contains 1,728 samples, with each sample
representing one car. Each sample contains six features, including
measurements of the buying price, maintenance price, and number of
doors. The target variable is the car's acceptability, with four possible
values: unacceptable, acceptable, good, or very good.

Climate Model Simulation Crashes Dataset


Contains data on predicting whether a climate model simulation
crashes, based on the values of the model's input parameters. The
dataset contains 540 samples, with each sample representing one
simulation. Each sample contains 18 features, each a scaled model
parameter value. The target variable is the simulation outcome, with
two possible values: failure or success.

Energy Efficiency Dataset


Contains data on predicting the energy efficiency of buildings, based
on various building and environmental characteristics. The dataset
contains 768 samples, with each sample representing one building.
Each sample contains eight features, including measurements of the
building's surface area, roof area, and overall height. The target
variables are the heating load and cooling load, which reach maximum
values of 43.1 and 48.03, respectively.
