Python implementation of common RNA-seq normalization methods:
- CPM (Counts per million)
- FPKM (Fragments per kilobase of transcript per million mapped fragments)
- TPM (Transcripts per million)
- UQ (Upper quartile)
- CUF (Counts adjusted with UQ factors)
- TMM (Trimmed mean of M-values)
- CTF (Counts adjusted with TMM factors)
For an in-depth description of the methods, see the documentation.
- Pure Python implementation (no need for R, etc.)
- Compatible with Scikit-learn
- Command line interface
- Verbose documentation
- Validated method implementation
We recommend installing RNAnorm with pip:
pip install rnanorm
The implemented methods can be executed from Python or from the command line.
The most common use case is to run normalization from Python:
>>> from rnanorm.datasets import load_toy_data
>>> from rnanorm import FPKM
>>> dataset = load_toy_data()
>>> # Expressions need to have genes in columns and samples in rows
>>> dataset.exp
          Gene_1  Gene_2  Gene_3  Gene_4  Gene_5
Sample_1     200     300     500    2000    7000
Sample_2     400     600    1000    4000   14000
Sample_3     200     300     500    2000   17000
Sample_4     200     300     500    2000    2000
>>> fpkm = FPKM(dataset.gtf_path).set_output(transform="pandas")
>>> fpkm.fit_transform(dataset.exp)
            Gene_1    Gene_2    Gene_3    Gene_4    Gene_5
Sample_1  100000.0  100000.0  100000.0  200000.0  700000.0
Sample_2  100000.0  100000.0  100000.0  200000.0  700000.0
Sample_3   50000.0   50000.0   50000.0  100000.0  850000.0
Sample_4  200000.0  200000.0  200000.0  400000.0  400000.0
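The values above can be checked by hand. The sketch below recomputes them in plain Python, using the common definition FPKM = count * 10^9 / (gene length in bp * total counts in the sample). It is an illustration only, not RNAnorm's implementation: the gene lengths are an assumption inferred from the printed output (the toy GTF file is not reproduced here), and the helper name fpkm_row is hypothetical.

```python
# Raw counts from the toy dataset (rows: samples, columns: Gene_1..Gene_5).
counts = {
    "Sample_1": [200, 300, 500, 2000, 7000],
    "Sample_2": [400, 600, 1000, 4000, 14000],
    "Sample_3": [200, 300, 500, 2000, 17000],
    "Sample_4": [200, 300, 500, 2000, 2000],
}

# ASSUMPTION: gene lengths in base pairs, inferred from the FPKM output
# shown above; they are not read from the toy GTF file.
gene_lengths_bp = [200, 300, 500, 1000, 1000]


def fpkm_row(row, lengths_bp):
    """Hypothetical helper: FPKM for one sample.

    FPKM = count * 1e9 / (gene length in bp * total counts in the sample)
    """
    library_size = sum(row)
    return [c * 1e9 / (length * library_size) for c, length in zip(row, lengths_bp)]


fpkm_values = {sample: fpkm_row(row, gene_lengths_bp) for sample, row in counts.items()}

for sample, values in fpkm_values.items():
    print(sample, values)
```

Under these assumed lengths, Sample_1 and Sample_2 give identical results because Sample_2's counts are exactly twice Sample_1's, and the library-size term cancels the difference.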
Normalization from the command line is also supported. To list available methods and general help:
rnanorm --help
Get info about a particular method, e.g., CPM:
rnanorm cpm --help
To normalize with CPM:
rnanorm cpm exp.csv --out exp_cpm.csv
The file exp.csv needs to be a comma-separated file with genes in columns and
samples in rows. Values should be raw counts. The output is saved to
exp_cpm.csv. Example of an input file:
cat exp.csv
,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5
Sample_1,200,300,500,2000,7000
Sample_2,400,600,1000,4000,14000
Sample_3,200,300,500,2000,17000
Sample_4,200,300,500,2000,2000
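To make concrete what CPM does to an input like this, here is a minimal sketch using only the Python standard library. It applies the usual definition, CPM = count * 10^6 / total counts in the sample, to the example file contents. The function name cpm_from_csv is hypothetical, and this is not how RNAnorm itself is implemented.

```python
import csv
import io

# Contents of the example exp.csv (genes in columns, samples in rows).
CSV_TEXT = """\
,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5
Sample_1,200,300,500,2000,7000
Sample_2,400,600,1000,4000,14000
Sample_3,200,300,500,2000,17000
Sample_4,200,300,500,2000,2000
"""


def cpm_from_csv(text):
    """Hypothetical helper: parse a counts CSV and return CPM per sample."""
    reader = csv.reader(io.StringIO(text))
    genes = next(reader)[1:]  # first header cell is the empty index label
    result = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        sample = row[0]
        counts = [int(value) for value in row[1:]]
        total = sum(counts)  # library size of this sample
        result[sample] = {gene: c * 1e6 / total for gene, c in zip(genes, counts)}
    return result


cpm = cpm_from_csv(CSV_TEXT)
for sample, values in cpm.items():
    print(sample, values)
```

Each sample's CPM values sum to one million, which is what makes samples with different sequencing depths comparable.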
One can also provide input through standard input:
cat exp.csv | rnanorm cpm --out exp_cpm.csv
If the file specified with --out already exists, the command will fail. If you
are sure that you wish to overwrite it, use the --force flag:
cat exp.csv | rnanorm cpm --force --out exp_cpm.csv
If no file is specified with the --out parameter, the output is printed to
standard output:
cat exp.csv | rnanorm cpm > exp_cpm.csv
The TPM and FPKM methods require gene lengths. These can be provided either with
a GTF file or with a "gene lengths" file. The latter is a two-column file. The
first column should include the genes in the header of exp.csv and the second
column should contain gene lengths computed by the union exon model:
# Use GTF file
rnanorm tpm exp.csv --gtf annotations.gtf > exp_out.csv
# Use gene lengths file
rnanorm tpm exp.csv --gene-lengths lengths.csv > exp_out.csv
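To show why gene lengths are needed, the sketch below computes TPM in plain Python: counts are first divided by gene length, and the resulting rates are then rescaled so that each sample sums to one million. This is a hedged illustration, not RNAnorm's code. The gene lengths are hypothetical values chosen for the example, the helper name tpm_row is made up, and the lengths are kept in a dictionary because the exact on-disk layout of the gene lengths file is not shown here.

```python
genes = ["Gene_1", "Gene_2", "Gene_3", "Gene_4", "Gene_5"]

# Raw counts, same layout as exp.csv (samples in rows, genes in columns).
counts = {
    "Sample_1": [200, 300, 500, 2000, 7000],
    "Sample_4": [200, 300, 500, 2000, 2000],
}

# ASSUMPTION: hypothetical gene lengths in base pairs (gene -> length),
# standing in for what a GTF or gene lengths file would provide.
gene_lengths_bp = {
    "Gene_1": 200,
    "Gene_2": 300,
    "Gene_3": 500,
    "Gene_4": 1000,
    "Gene_5": 1000,
}


def tpm_row(row, gene_names, lengths_bp):
    """Hypothetical helper: TPM for one sample.

    1. rate = count / gene length
    2. TPM  = rate / sum(rates) * 1e6
    """
    rates = [c / lengths_bp[g] for g, c in zip(gene_names, row)]
    scale = 1e6 / sum(rates)
    return {g: r * scale for g, r in zip(gene_names, rates)}


tpm = {sample: tpm_row(row, genes, gene_lengths_bp) for sample, row in counts.items()}
for sample, values in tpm.items():
    print(sample, values)
```

Unlike FPKM, the rescaling happens after the length correction, so TPM values always sum to one million within a sample.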
To learn about contributing to the code base, read the Contributing section.