Data Preprocessing

The document serves as an exam guide for data preprocessing in biomedical analysis, outlining objectives such as locating datasets and using R for data cleaning, normalization, and discretization. It emphasizes the importance of data preprocessing to ensure accuracy and completeness, detailing methods for handling missing values and various normalization techniques. Additionally, it introduces R as a tool for statistical computing and provides resources for setup and usage.

Uploaded by

Mike Tresford
Copyright
© All Rights Reserved

Data Preprocessing - Exam Guide

MODULE: Preparing Data for Analysis

Learning Objectives

1. Locate and download biomedical/medical datasets.

2. Preprocess data using R.

3. Write R scripts to:

- Replace missing values

- Normalize data

- Discretize data

- Sample data

KEY CONCEPTS & DEFINITIONS

Information

- Anything that changes the uncertainty in a system.

- Technical definition: a stored or transmitted symbol, i.e., data.

Variables, Datasets, and Databases

- Variable: Temporary container (e.g., Excel cell)

- File: Persistent data storage (e.g., .csv, .txt)

- Database: Shared collection of logically related persistent data.

Dataset Format

- Rows = Samples (individuals/patients)

- Columns = Variables/Features/Attributes

- Files are often .csv or .txt (delimited)
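As a minimal sketch of this layout, the snippet below writes a small made-up table to a temporary .csv file and reads it back with read.csv(). The column names and values are hypothetical, chosen only to show rows as samples and columns as variables.

```r
# Hypothetical toy data written to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("patient_id,age,sex",
             "P1,54,female",
             "P2,61,male",
             "P3,47,female"), csv_path)

# Each row is one sample (patient); each column is one variable.
patients <- read.csv(csv_path, stringsAsFactors = FALSE)

dim(patients)     # number of rows (samples) and columns (variables)
names(patients)   # variable names
str(patients)     # type of each column
```

For tab-delimited .txt files, read.delim() or read.table() with sep = "\t" plays the same role.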

Data Sources

- Proprietary: EMRs, clinical studies.

- Public: TCGA, ADNI, HRS, UK Biobank, UCI ML Repository, etc.

Importance of Data Preprocessing

> Garbage In, Garbage Out (GIGO)

Fix issues like:

- Missing values

- Noisy/inaccurate data

- Wrong data types

- Incomplete data

Goal: Make data accurate, precise, complete, interpretable, and correct

Data Preprocessing Tasks

1. Data Cleaning: Handling missing/erroneous data

2. Data Transformation: Changing types, normalization, adding variables

3. Data Reduction: Feature selection, sampling
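The transformation task (changing types, adding variables) might look like the sketch below. The data frame, its columns, and the derived variable are hypothetical and serve only as an illustration.

```r
# Hypothetical data in which every column was read in as text.
df <- data.frame(age   = c("54", "61", "47"),
                 stage = c("II", "I", "III"),
                 stringsAsFactors = FALSE)

# Change types: text to numeric, text to an ordered factor.
df$age   <- as.numeric(df$age)
df$stage <- factor(df$stage, levels = c("I", "II", "III"), ordered = TRUE)

# Add a derived variable computed from an existing one.
df$age_decades <- df$age / 10

str(df)
```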

Missing Values

Represented as: blank, ".", "n/a", "?" (R itself uses NA)

Replacing Missing Values


- Delete row/column

- Replace with constant/statistic/neighbor/likelihood/random
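A small sketch of two of these options in R: deleting incomplete rows and replacing missing values with a statistic (here the column mean). The data are made up, and the choice of the mean is an assumption for illustration; the other strategies (constant, neighbor, likelihood, random) follow the same pattern.

```r
# Hypothetical raw file content using several missing-value markers.
raw <- c("id,age,bmi",
         "1,54,27.1",
         "2,?,31.4",
         "3,61,n/a",
         "4,,24.8",
         "5,47,.")

# Tell read.csv which strings should be treated as missing (NA).
df <- read.csv(text = raw, na.strings = c("", ".", "n/a", "?"))
colSums(is.na(df))        # count of missing values per column

# Option 1: delete every row that has a missing value.
complete <- na.omit(df)

# Option 2: replace missing values with the column mean.
df_imputed <- df
for (col in c("age", "bmi")) {
  col_mean <- mean(df[[col]], na.rm = TRUE)
  df_imputed[[col]][is.na(df_imputed[[col]])] <- col_mean
}
df_imputed
```

Deleting rows is the simplest option but here keeps only one of five patients, which is why replacement is often preferred on small datasets.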

Normalization (Scaling)

Why? Mixed scales distort results

Min-Max Normalization:

val' = (val - min) / (max - min) * (new_max - new_min) + new_min
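A minimal sketch of min-max normalization as an R function. The function name min_max and the example vector are hypothetical; the defaults new_min = 0 and new_max = 1 are an assumed convenience.

```r
# Rescale x linearly so that min(x) maps to new_min and max(x) maps to new_max.
# Assumes x has at least two distinct values (otherwise max - min is zero).
min_max <- function(x, new_min = 0, new_max = 1) {
  rng <- range(x, na.rm = TRUE)   # rng[1] = min, rng[2] = max
  (x - rng[1]) / (rng[2] - rng[1]) * (new_max - new_min) + new_min
}

x <- c(10, 20, 30, 50)
min_max(x)           # 0.00 0.25 0.50 1.00
min_max(x, -1, 1)    # -1.0 -0.5  0.0  1.0
```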

Z-Score Normalization:

val' = (val - mean) / std
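A minimal sketch of z-score normalization in R. The function name z_score and the example values are hypothetical. Note that sd() in R is the sample standard deviation (divides by n - 1), which is an assumption about which "std" is meant.

```r
# Subtract the mean and divide by the (sample) standard deviation.
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
z <- z_score(x)

mean(z)   # approximately 0
sd(z)     # approximately 1

# The built-in scale() function gives the same values.
all.equal(z, as.numeric(scale(x)))
```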

Decimal Scaling:

val' = val / 10^n, where n is the smallest integer such that max(|val'|) < 1
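A minimal sketch of decimal scaling in R. It takes n to be the smallest integer for which every scaled value has absolute value below 1, which is the usual textbook convention. The function name decimal_scale and the example values are hypothetical.

```r
# Divide by a power of ten large enough that all |values| fall below 1.
decimal_scale <- function(x) {
  max_abs <- max(abs(x), na.rm = TRUE)
  # Number of digits before the decimal point of the largest absolute value.
  n <- if (max_abs == 0) 0 else floor(log10(max_abs)) + 1
  x / 10^n
}

x <- c(-986, 217, 45)
decimal_scale(x)   # -0.986  0.217  0.045
```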

Comparison of Normalization

- Decimal: Preserves distribution

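- Z-Score: Centers to mean 0 and scales to standard deviation 1; does not change the shape of the distribution (skewed data stay skewed)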

- Min-Max: Flexible range

Discretization

Numeric → Nominal

Discretization Methods:

- Manual

- Automatic: Equal-width, Equal-depth, Regression, Clustering


Binning Comparison

- Equal-width: Simple, sensitive to outliers

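- Equal-depth (equal-frequency): Each bin holds about the same number of values, so it handles skew and outliers better; bin boundaries are less intuitive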
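The two binning methods can be compared with a short sketch using cut(). The ages, the number of bins (3), and the labels are hypothetical; the value 95 is included on purpose as an outlier.

```r
ages <- c(21, 25, 30, 34, 41, 47, 52, 58, 63, 95)
bin_labels <- c("low", "mid", "high")

# Equal-width: split the range of the data into 3 intervals of equal length.
width_bins <- cut(ages, breaks = 3, labels = bin_labels)
table(width_bins)   # low 5, mid 4, high 1: the outlier gets a bin to itself

# Equal-depth: use quantiles as cut points so bins hold similar counts.
depth_breaks <- quantile(ages, probs = seq(0, 1, length.out = 4))
depth_bins <- cut(ages, breaks = depth_breaks,
                  include.lowest = TRUE, labels = bin_labels)
table(depth_bins)   # counts are roughly equal across bins
```

Quantile cut points must be distinct for cut() to work, so heavily tied data may need fewer bins.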

Data Reduction

1. Feature Selection: Dimensionality reduction (e.g., keeping only a subset of genes)

2. Sampling: Representative subset

- Simple random (with/without replacement)

- Stratified sampling
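The sampling schemes above can be sketched with base R's sample(). The data frame, group sizes, sample sizes, the 40% fraction, and the seed are all hypothetical choices for illustration.

```r
set.seed(42)  # arbitrary seed so the random draws are repeatable

# Hypothetical dataset: 5 cases and 15 controls.
df <- data.frame(id = 1:20,
                 group = rep(c("case", "control"), times = c(5, 15)))

# Simple random sample without replacement: each row appears at most once.
srs <- df[sample(nrow(df), size = 8), ]

# Simple random sample with replacement: rows may repeat.
srs_repl <- df[sample(nrow(df), size = 8, replace = TRUE), ]

# Stratified sample: draw 40% from each group separately,
# so the case/control proportions of the full data are preserved.
strata <- split(df, df$group)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = round(0.4 * nrow(s))), , drop = FALSE]
}))
table(stratified$group)   # case 2, control 6
```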

Introduction to R Language

- R: Statistical computing, graphics, open-source

- Functions, packages, interpreter, scripts

R Setup & Tools

- Download: https://cran.r-project.org

- Tools: Rgui, RStudio, Notepad++, Jupyter

Using R

- ls(), rm(), q(), summary(), class()

- install.packages(), library()
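A short session using the functions listed above might look like the sketch below. The variable name heart_rate and its values are hypothetical, and the package named in the commented install line is only an example choice.

```r
heart_rate <- c(72, 65, 88, 91)   # hypothetical measurements

class(heart_rate)     # "numeric"
summary(heart_rate)   # min, quartiles, mean, max
ls()                  # names of objects in the current workspace

rm(heart_rate)        # remove the object from the workspace
exists("heart_rate", inherits = FALSE)   # FALSE after rm()

# install.packages("ggplot2")   # downloads from CRAN; run once, needs internet
library(stats)                  # load an installed package (stats ships with R)

# q()   # ends the R session; left commented out here
```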

R Distributions

- Bioconductor, Anaconda (with Jupyter support)


R Preprocessing

- Video tutorial & Jupyter link (from course)
