Data Preprocessing - Exam Guide
MODULE: Preparing Data for Analysis
Learning Objectives
1. Locate and download biomedical/medical datasets.
2. Preprocess data using R.
3. Write R scripts to:
- Replace missing values
- Normalize data
- Discretize data
- Sample data
KEY CONCEPTS & DEFINITIONS
Information
- Anything that changes the uncertainty in a system.
- Technical definition: Data = stored or transmitted symbols.
Variables, Datasets, and Databases
- Variable: Temporary container (e.g., Excel cell)
- File: Persistent data storage (e.g., .csv, .txt)
- Database: Shared collection of logically related persistent data.
Dataset Format
- Rows = Samples (individuals/patients)
- Columns = Variables/Features/Attributes
- Files are often .csv or .txt (delimited)
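Example (sketch): reading a delimited dataset into R. The inline text and the column names (patient_id, age, sex) are made up for illustration; with a real file you would pass its path to read.csv() instead of the text argument.

```r
# Hypothetical comma-delimited data; in practice this would be a .csv file on disk.
csv_text <- "patient_id,age,sex
P01,54,female
P02,61,male
P03,47,female"

# read.csv() returns a data frame: one row per sample, one column per variable.
patients <- read.csv(text = csv_text, stringsAsFactors = FALSE)

nrow(patients)   # number of samples (rows)
ncol(patients)   # number of variables (columns)
names(patients)  # variable names taken from the header line
```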
Data Sources
- Proprietary: EMRs, clinical studies.
- Public: TCGA, ADNI, HRS, UK Biobank, UCI ML Repository, etc.
Importance of Data Preprocessing
> Garbage In, Garbage Out (GIGO)
Fix issues like:
- Missing values
- Noisy/inaccurate data
- Wrong data types
- Incomplete data
Goal: Make data accurate, precise, complete, interpretable, and correct
Data Preprocessing Tasks
1. Data Cleaning: Handling missing/erroneous data
2. Data Transformation: Changing data types, normalization, adding variables
3. Data Reduction: Feature selection, sampling
Missing Values
Represented as: Blank, ., n/a, ?
Replacing Missing Values
- Delete row/column
- Replace with constant/statistic/neighbor/likelihood/random
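Example (sketch): handling missing values in R. The data are made up. The sketch assumes the markers listed above should all be read as NA, then shows two of the options: deleting incomplete rows and replacing with a statistic (the column mean). Other replacement strategies would follow the same pattern.

```r
# Hypothetical data using the missing-value markers listed above.
csv_text <- "id,age,bmi
1,54,27.1
2,?,31.4
3,61,n/a
4,47,."

# na.strings tells read.csv() which tokens should become NA.
df <- read.csv(text = csv_text, na.strings = c("", ".", "n/a", "?"))

# Option 1: delete every row that contains at least one missing value.
complete_rows <- na.omit(df)

# Option 2: replace missing values with a statistic (here, the column mean).
imputed <- df
for (col in c("age", "bmi")) {
  col_mean <- mean(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- col_mean
}

colSums(is.na(df))       # missing values per column before replacement
colSums(is.na(imputed))  # all zero after replacement
```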
Normalization (Scaling)
Why? Mixed scales distort results
Min-Max Normalization:
val' = (val - min) / (max - min) * (new_max - new_min) + new_min
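Example (sketch): min-max normalization in R. The helper name min_max and the sample values are made up; the function assumes a numeric vector whose values are not all identical (otherwise max - min is zero).

```r
# Min-max normalization to the range [new_min, new_max].
# Assumes x is numeric and not constant (max > min).
min_max <- function(x, new_min = 0, new_max = 1) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  (x - lo) / (hi - lo) * (new_max - new_min) + new_min
}

ages <- c(20, 35, 50, 80)        # made-up values
scaled_01 <- min_max(ages)       # rescaled to [0, 1]
scaled_pm <- min_max(ages, -1, 1) # rescaled to [-1, 1]
```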
Z-Score Normalization:
val' = (val - mean) / std
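Example (sketch): z-score normalization in R. The helper name z_score and the sample values are made up. Note that sd() computes the sample standard deviation (divides by n - 1); base R's scale() uses the same convention.

```r
# Z-score normalization: subtract the mean, divide by the standard deviation.
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

ages <- c(20, 35, 50, 80)  # made-up values
z <- z_score(ages)

mean(z)  # approximately 0
sd(z)    # approximately 1

# The built-in scale() gives the same values (returned as a one-column matrix).
z_builtin <- as.numeric(scale(ages))
```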
Decimal Scaling:
val' = val / 10^n
(n = smallest integer such that max(|val'|) < 1)
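Example (sketch): decimal scaling in R. The helper name decimal_scale and the sample values are made up. The sketch assumes the usual textbook convention that n is the smallest integer for which every scaled value has absolute value below 1; computing n with log10() can be affected by floating-point rounding in edge cases.

```r
# Decimal scaling: divide by a power of 10 so that max(|val'|) < 1.
decimal_scale <- function(x) {
  max_abs <- max(abs(x), na.rm = TRUE)
  # Number of digits to shift; 0 if all values are zero.
  n <- if (max_abs == 0) 0 else floor(log10(max_abs)) + 1
  x / 10^n
}

readings <- c(-986, 120, 917)      # made-up values, so n = 3
scaled <- decimal_scale(readings)  # -0.986, 0.120, 0.917
```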
Comparison of Normalization
- Decimal: Preserves distribution
- Z-Score: Rescales to mean 0 and standard deviation 1 (does not by itself make data normally distributed)
- Min-Max: Flexible range
Discretization
Numeric → Nominal: converts continuous values into categories
Discretization Methods:
- Manual
- Automatic: Equal-width, Equal-depth, Regression, Clustering
Binning Comparison
- Equal-width: Simple, sensitive to outliers
- Equal-depth (equal-frequency): Same number of samples per bin, follows the data distribution, bin boundaries less intuitive
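Example (sketch): equal-width vs. equal-depth binning in R. The values and the bin labels (low, mid, high) are made up. The data include one outlier to show why equal-width bins are sensitive to it. The equal-depth version assumes the quantile breakpoints are distinct; heavily tied data can produce duplicate breaks, which cut() rejects.

```r
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)  # made-up data with one outlier

# Equal-width: an integer 'breaks' splits the range into equal-length intervals.
equal_width <- cut(values, breaks = 3, labels = c("low", "mid", "high"))

# Equal-depth: use quantiles as breakpoints so each bin gets about the same count.
q <- quantile(values, probs = seq(0, 1, length.out = 4))
equal_depth <- cut(values, breaks = q, labels = c("low", "mid", "high"),
                   include.lowest = TRUE)

table(equal_width)  # the outlier stretches the range: 8 values land in "low"
table(equal_depth)  # 3 values per bin
```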
Data Reduction
1. Feature Selection: Dimensionality reduction (e.g., keeping a subset of genes)
2. Sampling: Representative subset
- Simple random (with/without replacement)
- Stratified sampling
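Example (sketch): sampling in base R. The data frame, the group labels, and the 10% sampling fraction are made up for illustration. Stratified sampling is done here by splitting on the group column and sampling within each group; add-on packages offer shortcuts, but this version uses base R only.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Made-up dataset: 100 patients, 80 controls and 20 cases.
df <- data.frame(id = 1:100,
                 group = rep(c("control", "case"), times = c(80, 20)))

# Simple random sampling of 10 rows.
srs_without <- df[sample(nrow(df), size = 10), ]                  # without replacement
srs_with    <- df[sample(nrow(df), size = 10, replace = TRUE), ]  # with replacement

# Stratified sampling: draw 10% from each group so group proportions are preserved.
strata <- split(df, df$group)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = round(0.10 * nrow(s))), , drop = FALSE]
}))

table(stratified$group)  # 2 cases and 8 controls
```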
Introduction to R Language
- R: Statistical computing, graphics, open-source
- Functions, packages, interpreter, scripts
R Setup & Tools
- Download: https://cran.r-project.org
- Tools: Rgui, RStudio, Notepad++, Jupyter
Using R
- ls(), rm(), q(), summary(), class()
- install.packages(), library()
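Example (sketch): the commands above in a short session. The object name heart_rates and its values are made up, and "somepackage" is a placeholder, not a real package name. The last three calls are commented out because they need internet access or end the session.

```r
heart_rates <- c(62, 71, 88)      # create a numeric vector in the workspace
obj_class <- class(heart_rates)   # "numeric"
summary(heart_rates)              # min, quartiles, mean, max
ls()                              # names of objects in the workspace
rm(heart_rates)                   # remove the object from the workspace

# install.packages("somepackage")  # one-time download from CRAN (placeholder name)
# library(somepackage)             # load an installed package into the session
# q()                              # quit the R session
```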
R Distributions
- Bioconductor, Anaconda (with Jupyter support)
R Preprocessing
- Video tutorial & Jupyter link (from course)