IsoMEX is a stand-alone Python tool that converts IsoSeq/pigeon make-seurat output files into the standardized Matrix Exchange (MEX) format, matching the filtered feature-barcode matrix produced by the 10x Genomics CellRanger pipeline. It fixes the duplicate gene entry issue found in the existing IsoSeq workflow (e.g. isoseq.how), ensuring consistent gene and transcript annotations across multiple samples. This allows integration with downstream single-cell/single-nucleus tools, including multi-sample UMAP clustering.
The isomex.py script:
- Loads two tab-delimited files, outputs from `pigeon make-seurat`, with a common basename: `<basename>.info.csv` and `<basename>.annotated.info.csv`. Note that although they have the `.csv` extension, they are actually tab-delimited.
- Merges these files on the `id` column.
- Optionally filters rows based on SQANTI3 isoform categories provided via command-line arguments.
- Aggregates counts by grouping on the `gene` or `transcript` column and by cell barcode (`BC`).
- Sums the counts and converts the result into a MEX-format directory (containing `matrix.mtx`, `features.tsv`, and `barcodes.tsv`).
- Gzips the output files (producing files such as `matrix.mtx.gz`).
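The merge, filter, and aggregate steps listed above can be sketched in plain Python. This is a minimal illustration and not the code in `isomex.py`: the column layout (`gene`, `transcript`, and `category` in one table, `BC` and `count` in the other) and the helper names `read_tsv` and `aggregate_counts` are assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical column layout (an assumption for this sketch):
#   info table:      id, gene, transcript, category
#   annotated table: id, BC, count
INFO_TSV = (
    "id\tgene\ttranscript\tcategory\n"
    "iso1\tGeneA\tTxA1\tfull-splice_match\n"
    "iso2\tGeneA\tTxA2\tnovel_in_catalog\n"
    "iso3\tGeneB\tTxB1\tantisense\n"
)
ANNOTATED_TSV = (
    "id\tBC\tcount\n"
    "iso1\tAAAC\t3\n"
    "iso2\tAAAC\t2\n"
    "iso3\tAAAC\t1\n"
    "iso1\tTTTG\t4\n"
)


def read_tsv(text):
    """Parse tab-delimited text (whatever the file extension) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def aggregate_counts(info_rows, annotated_rows, feature_col="gene", keep_categories=None):
    """Merge on `id`, optionally filter by category, and sum counts per (feature, barcode)."""
    # Index one table by id so the merge becomes a dictionary lookup.
    info_by_id = {row["id"]: row for row in info_rows}
    totals = Counter()
    for row in annotated_rows:
        info = info_by_id.get(row["id"])
        if info is None:
            continue  # inner join: skip ids that are missing from the info table
        if keep_categories and info["category"] not in keep_categories:
            continue  # optional isoform-category filter
        totals[(info[feature_col], row["BC"])] += int(row["count"])
    return totals


info_rows = read_tsv(INFO_TSV)
annotated_rows = read_tsv(ANNOTATED_TSV)

# Gene-level and transcript-level summaries from the same merged rows.
gene_totals = aggregate_counts(info_rows, annotated_rows, feature_col="gene")
transcript_totals = aggregate_counts(info_rows, annotated_rows, feature_col="transcript")
# Keep only full-splice_match isoforms before summing.
fsm_totals = aggregate_counts(
    info_rows, annotated_rows, feature_col="gene", keep_categories={"full-splice_match"}
)
```

Grouping on a `(feature, barcode)` key means the same rows can feed both the gene-level and the transcript-level summary by changing only `feature_col`.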
IsoMEX produces two output directories, each holding a filtered feature-barcode matrix in the 10× Genomics cellranger MEX format (i.e. a directory containing `matrix.mtx.gz`, `features.tsv.gz`, and `barcodes.tsv.gz`). One output is a gene-level summary and the other a transcript-level summary.
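To make the directory layout concrete, here is a rough sketch of writing summed counts as a gzipped MEX directory using only the standard library. It follows the Matrix Market coordinate format (1-based indices, features as rows, barcodes as columns) and the three-column `features.tsv` layout used by Cell Ranger. The function names, the use of the feature label as both ID and name, and the `Gene Expression` feature type are assumptions of this example, not details confirmed for IsoMEX.

```python
import gzip
import os
import tempfile


def write_mex(totals, out_dir, feature_type="Gene Expression"):
    """Write {(feature, barcode): count} as matrix.mtx.gz, features.tsv.gz, barcodes.tsv.gz."""
    features = sorted({feature for feature, _ in totals})
    barcodes = sorted({barcode for _, barcode in totals})
    row_of = {f: i + 1 for i, f in enumerate(features)}  # Matrix Market indices are 1-based
    col_of = {b: j + 1 for j, b in enumerate(barcodes)}
    os.makedirs(out_dir, exist_ok=True)

    with gzip.open(os.path.join(out_dir, "matrix.mtx.gz"), "wt") as fh:
        fh.write("%%MatrixMarket matrix coordinate integer general\n")
        fh.write(f"{len(features)} {len(barcodes)} {len(totals)}\n")
        # Emit entries column by column, then by row, for a stable order.
        entries = sorted((col_of[b], row_of[f], n) for (f, b), n in totals.items())
        for col, row, count in entries:
            fh.write(f"{row} {col} {count}\n")

    with gzip.open(os.path.join(out_dir, "features.tsv.gz"), "wt") as fh:
        for feature in features:
            # Assumption: the label is used as both ID and name in this sketch.
            fh.write(f"{feature}\t{feature}\t{feature_type}\n")

    with gzip.open(os.path.join(out_dir, "barcodes.tsv.gz"), "wt") as fh:
        for barcode in barcodes:
            fh.write(barcode + "\n")


def read_gz_lines(path):
    """Return the lines of a gzipped text file without trailing newlines."""
    with gzip.open(path, "rt") as fh:
        return fh.read().splitlines()


demo_totals = {("GeneA", "AAAC"): 5, ("GeneB", "AAAC"): 1, ("GeneA", "TTTG"): 4}
demo_dir = os.path.join(tempfile.mkdtemp(), "gene_filtered_feature_bc_matrix")
write_mex(demo_totals, demo_dir)
matrix_lines = read_gz_lines(os.path.join(demo_dir, "matrix.mtx.gz"))
```

Only non-zero entries are stored, which is why the sparse MEX layout stays small compared with a dense table of counts.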
- Fixes Duplicate Gene Entries: Ensures consistent gene and transcript annotations across samples, preventing issues caused by duplicate gene entries in the original Iso-Seq workflow.
- 10× Genomics Compatible: Generates output files (`matrix.mtx.gz`, `features.tsv.gz`, `barcodes.tsv.gz`) that conform to CellRanger's MEX format, making outputs comparable to the standard 10x Genomics pipeline.
- Filter by Isoform Classification Category: Allows filtering based on SQANTI3 isoform classification categories, such as `full-splice_match`, `novel_in_catalog`, or `antisense`. See the full list of categories: Iso-Seq Classification.
- Consistent Gene and Transcript Feature Annotation: Uses gene and transcript mapping files to ensure feature names and IDs are preserved while still allowing novel genes and transcripts to be included in the output.
- Compressed Output: Outputs are gzipped (`.gz`) for efficient storage, and using MEX files instead of CSVs significantly reduces file size.
- Cluster-Ready: Includes example SLURM array job scripts for running IsoMEX on HPC systems like Biowulf.
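The consistent-annotation feature in the list above can be pictured as a lookup with a fallback for novel features. The sketch below is one plausible way to do this and is not taken from `isomex.py`: it assumes the map files have two tab-separated columns with no header row, and the helper names `load_map` and `feature_rows` are hypothetical.

```python
def load_map(lines):
    """Parse two-column tab-delimited lines into {id: name}. Assumes no header row."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, name = line.split("\t")[:2]
        mapping[ident] = name
    return mapping


def feature_rows(labels, id_to_name, feature_type="Gene Expression"):
    """Resolve each label to an (id, name, type) row, keeping unknown labels as-is."""
    name_to_id = {name: ident for ident, name in id_to_name.items()}
    rows = []
    for label in labels:
        if label in id_to_name:        # label is a known ID
            ident, name = label, id_to_name[label]
        elif label in name_to_id:      # label is a known name
            ident, name = name_to_id[label], label
        else:                          # novel feature: reuse the label for both columns
            ident = name = label
        rows.append((ident, name, feature_type))
    return rows


gene_map = load_map(["ENSG01\tGeneA\n", "ENSG02\tGeneB\n"])
rows = feature_rows(["GeneA", "ENSG02", "novelGene_1"], gene_map)
```

Because every sample is resolved against the same map, a given gene ends up with the same ID and name in each sample's `features.tsv.gz`, which is what multi-sample merging relies on.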
Install dependencies using:

    pip install pandas gffutils

The basic command-line usage is:

    python isomex.py <basename> --gene_map <gene_map.txt> --transcript_map <transcript_map.txt> [--filter_category "cat1,cat2"] --output_dir <output_directory>

If you have:
- `sample1.info.csv` and `sample1.annotated.info.csv`
- A gene mapping file: `gene_map.txt`
- A transcript mapping file: `transcript_map.txt`

Run:

    python isomex.py sample1 --gene_map gene_map.txt --transcript_map transcript_map.txt --output_dir filtered_feature_bc_matrix

This will generate two directories (`gene_filtered_feature_bc_matrix/` and `transcript_filtered_feature_bc_matrix/`) with MEX-formatted output:
- `matrix.mtx.gz`
- `features.tsv.gz`
- `barcodes.tsv.gz`
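For readers who want to see how the documented options fit together, the following is a hedged `argparse` sketch that accepts the same arguments as the usage line above. It is not the parser from `isomex.py`: treating the three non-bracketed options as required, splitting `--filter_category` on commas, and deriving the two directory names by prefixing `gene_` and `transcript_` are inferences from the examples in this README.

```python
import argparse
import os


def build_parser():
    """Build a parser mirroring the documented command line (sketch only)."""
    parser = argparse.ArgumentParser(prog="isomex.py")
    parser.add_argument("basename", help="common basename of the two pigeon make-seurat files")
    parser.add_argument("--gene_map", required=True, help="tab-delimited gene_id/gene_name file")
    parser.add_argument("--transcript_map", required=True,
                        help="tab-delimited transcript_id/transcript_name file")
    parser.add_argument("--filter_category", default=None,
                        help='comma-separated isoform categories, e.g. "cat1,cat2"')
    parser.add_argument("--output_dir", required=True, help="name used for the output directories")
    return parser


def parse_categories(value):
    """Turn "cat1, cat2" into {"cat1", "cat2"}; return None when no filter is given."""
    if not value:
        return None
    return {item.strip() for item in value.split(",") if item.strip()}


def output_dirs(output_dir):
    """Assumed naming rule: prefix the directory name with gene_ and transcript_."""
    parent, name = os.path.split(output_dir.rstrip("/"))
    return tuple(os.path.join(parent, f"{level}_{name}") for level in ("gene", "transcript"))


args = build_parser().parse_args([
    "sample1",
    "--gene_map", "gene_map.txt",
    "--transcript_map", "transcript_map.txt",
    "--filter_category", "full-splice_match, novel_in_catalog",
    "--output_dir", "filtered_feature_bc_matrix",
])
categories = parse_categories(args.filter_category)
gene_dir, transcript_dir = output_dirs(args.output_dir)
```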
IsoMEX requires two separate mapping files before running:
- Gene Map File (`gene_map.txt`): a tab-delimited file containing `gene_id` and `gene_name`.
- Transcript Map File (`transcript_map.txt`): a tab-delimited file containing `transcript_id` and `transcript_name`.

To create these files from an annotation GTF file, use the included utility script:

    python utils/generate_map.py annotation.forPigeon.gtf gene_map.txt transcript_map.txt

This will extract:
- `gene_id` and `gene_name` → `gene_map.txt`
- `transcript_id` and `transcript_name` → `transcript_map.txt`
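The extraction described above amounts to reading the attribute column of each GTF record. The sketch below shows the idea with a regular expression and no third-party packages. It is an illustration only and may differ from `utils/generate_map.py` (which may rely on gffutils); the function names and the fallback to the ID when a name attribute is absent are assumptions.

```python
import re

# GTF attributes look like: gene_id "ENSG01"; gene_name "GeneA";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Parse the ninth GTF column into a dict of quoted key/value attributes."""
    return dict(ATTR_RE.findall(field))


def build_maps(gtf_lines):
    """Collect {gene_id: gene_name} and {transcript_id: transcript_name} from GTF lines."""
    gene_map, transcript_map = {}, {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(columns[8])
        gene_id = attrs.get("gene_id")
        if gene_id:
            # Assumption: fall back to the ID when no gene_name is present.
            gene_map.setdefault(gene_id, attrs.get("gene_name", gene_id))
        transcript_id = attrs.get("transcript_id")
        if transcript_id:
            transcript_map.setdefault(
                transcript_id, attrs.get("transcript_name", transcript_id)
            )
    return gene_map, transcript_map


def format_map(mapping):
    """Render a mapping as tab-delimited lines ready to be written to a map file."""
    return [f"{ident}\t{name}" for ident, name in mapping.items()]


demo_gtf = [
    "##description: toy annotation\n",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG01"; gene_name "GeneA";\n',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSG01"; '
    'transcript_id "ENST01"; gene_name "GeneA"; transcript_name "GeneA-201";\n',
]
demo_gene_map, demo_transcript_map = build_maps(demo_gtf)
```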
IsoMEX includes a SLURM job array script for processing multiple samples efficiently.
- Prepare a sample list (`samples.txt`) with one sample per line:

      sample1
      sample2
      sample3

- Submit the SLURM job array:

      sbatch slurm/isomex.sh
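A job array typically maps each task to one line of the sample list. The Python sketch below shows that mapping using the `SLURM_ARRAY_TASK_ID` environment variable that SLURM sets for array tasks. It does not reproduce `slurm/isomex.sh`: the 1-based array range (for example `--array=1-3`), the helper names, and the per-sample output directory name are assumptions for illustration.

```python
import os


def pick_sample(sample_lines, task_id):
    """Return the sample for a 1-based array task ID, ignoring blank lines."""
    samples = [line.strip() for line in sample_lines if line.strip()]
    if not 1 <= task_id <= len(samples):
        raise IndexError(f"task ID {task_id} is outside 1..{len(samples)}")
    return samples[task_id - 1]


def build_command(sample, output_dir, gene_map="gene_map.txt", transcript_map="transcript_map.txt"):
    """Assemble the documented isomex.py command line for one sample."""
    return [
        "python", "isomex.py", sample,
        "--gene_map", gene_map,
        "--transcript_map", transcript_map,
        "--output_dir", output_dir,
    ]


sample_lines = ["sample1\n", "sample2\n", "sample3\n"]
# Outside SLURM the variable is unset, so default to the first task for this demo.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
sample = pick_sample(sample_lines, min(task_id, len(sample_lines)))
# Hypothetical per-sample directory name so that samples do not overwrite each other.
command = build_command(sample, f"{sample}_filtered_feature_bc_matrix")
```

The resulting list could be passed to `subprocess.run(command, check=True)` inside each array task.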