A novel framework to speed up short read alignment using Apache Spark. SparkMap produces large speed increases for:
- Bowtie2
- HISAT2
- BBMap
SparkMap can also function with TopHat and works well with STAR when large numbers of machines are available.*
SparkMap requires the following dependencies to run:
- Python 3.7 with numpy (>1.16.4), progressbar2 (>3.50.1), pydoop (>2.0.0), py4j (>0.10.7), pyinstaller (>3.6), python-utils (>2.4.0)
- Apache Spark (>2.4.3) with findspark (>1.3.0)
- Hadoop (>3.1.2)
- Unix sort. Install the GNU core utilities if running on macOS.
It is also recommended that you run SparkMap on Linux and on a compute cluster.
To download SparkMap, make sure you have the appropriate permissions and then follow these instructions.
First, download the SparkMap repository as a zip file. If needed, send the zip file from your local machine to your compute cluster using scp:
scp /path/to/SparkMap-master.zip username@IP:/path/to/directory
Unzip it with the following command.
unzip SparkMap-master.zip
Next, configure your system to make the dependencies accessible. You can install the dependencies either system-wide or through Pipenv. With Pipenv, instead of running python ..., you will run pipenv run python ... (see the example below).
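A minimal Pipenv sketch, assuming you manage the Python dependencies listed above inside a project environment (version pins mirror the dependency list; adjust as needed):

```bash
# Create a Python 3.7 environment and install the SparkMap Python dependencies
pipenv --python 3.7
pipenv install "numpy>=1.16.4" "progressbar2>=3.50.1" "pydoop>=2.0.0" "py4j>=0.10.7" \
               "pyinstaller>=3.6" "python-utils>=2.4.0" "findspark>=1.3.0"

# Run SparkMap scripts inside the environment
pipenv run python singlespark.py ...
```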
First, download Hadoop and Spark, if you have not done so already. We recommend this guide for Spark installation: https://www.linode.com/docs/databases/hadoop/install-configure-run-spark-on-top-of-hadoop-yarn-cluster/ and this guide for Hadoop installation: https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/.
Add these user-specific configurations to your .bash_profile.
| Variable | Value |
|---|---|
| PATH | Add the SparkMap installation folder and the Spark and Hadoop installation folders to PATH |
| PYTHONPATH | Path to Python 3.7 |
| JAVA_HOME | Path to the Java JDK used by Hadoop |
| HADOOP_CONF_DIR | $HADOOP_HOME/etc/hadoop |
| SPARK_HOME | Path to the Spark installation |
| LD_LIBRARY_PATH | $HADOOP_HOME/lib/native:$LD_LIBRARY_PATH |
| HADOOP_HOME | Path to the Hadoop installation |
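For example, the corresponding .bash_profile entries might look like the following sketch (every path is a placeholder for your own installation locations):

```bash
# Example ~/.bash_profile entries (paths are placeholders; adjust to your system)
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export PYTHONPATH=/usr/bin/python3.7
export PATH=/path/to/SparkMap-master:$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH
```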
Add these user-specific configurations to your .bashrc.
| Variable | Value |
|---|---|
| PATH | Add $HADOOP_HOME/bin and $HADOOP_HOME/sbin to PATH |
| HADOOP_MAPRED_HOME | $HADOOP_HOME |
| HADOOP_COMMON_HOME | $HADOOP_HOME |
| HADOOP_HDFS_HOME | $HADOOP_HOME |
| YARN_HOME | $HADOOP_HOME |
| HADOOP_HOME | Path to the Hadoop installation |
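Again as an illustration, the corresponding .bashrc entries might look like this (placeholder path; adjust to your system):

```bash
# Example ~/.bashrc entries (placeholder path; adjust to your system)
export HADOOP_HOME=/opt/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
```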
Check out a full listing of Spark configurations here: https://spark.apache.org/docs/latest/configuration.html
The config directory should be used only as a supplementary reference when editing the config files in your $HADOOP_HOME/etc/hadoop directory and your $SPARK_HOME/conf directory.
Skip this step if you already have a reference genome and FASTQ/FASTA paired-end reads from an experiment. Otherwise, continue if you are using data downloaded from an online archive. Input can start in SRA format, but should be converted to FASTQ/FASTA with fastq-dump. Ex: fastq-dump --split-files --fasta {Accession #}
Find a prebuilt genome index online (they are widely available, e.g. for Bowtie2) or build your own from a reference genome (see the sketch below).
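If you build your own, a minimal Bowtie2 sketch looks like the following (file and index names are placeholders):

```bash
# Build a Bowtie2 index named "genome" from a reference FASTA
bowtie2-build hg19.fa genome
```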
To use the Hadoop Distributed File System (HDFS), run start-all.sh in the $HADOOP_HOME/sbin directory to start all Hadoop daemons.
Then start the Spark driver by running start-all.sh in the $SPARK_HOME/sbin directory, which starts the Spark master and all Spark workers.
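Concretely, this amounts to something like the following (assuming the environment variables set above; jps is just an optional check):

```bash
# Start the Hadoop daemons, then the Spark master and workers
$HADOOP_HOME/sbin/start-all.sh
$SPARK_HOME/sbin/start-all.sh

# Optional: list the running Java daemons to confirm everything started
jps
```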
Please look at the mapper-specific manuals (linked above) for specific mapper syntax.
However, all mappers should be configured to accept input from STDIN and write their output to STDOUT. They should also be run locally as an executable, and the number of parallel search threads to launch should be specified explicitly. For paired-end mapping, remember to specify interleaved FASTQ input for your mapper.
If you are running Bowtie2 or HISAT2, explicitly specify the number of parallel search threads using the -p flag. If you are using BBMap, remember to explicitly specify the number of search threads, the Java minimum and maximum heap space, the build type, and the path to your prebuilt index. This means that the genome index for BBMap must be created beforehand rather than in memory. Additionally, make sure that there is a SPACE between the / and the align2.BBMap class call, as shown in scripts/singlespark.sh for BBMap; a rough sketch is given below.
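As a hedged illustration only (paths, heap sizes, thread counts, and the index location are placeholders; consult scripts/singlespark.sh and the BBMap documentation for the authoritative syntax), BBMap mapper-specific options follow this general shape:

```bash
# Illustrative BBMap invocation reading FASTQ from STDIN and writing SAM to STDOUT.
# Note the space between "current/" and the align2.BBMap class name.
java -Xms20g -Xmx20g -cp /path/to/bbmap/current/ align2.BBMap \
     build=1 path=/path/to/prebuilt/index t=2 in=stdin.fq out=stdout.sam
```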
*If you are using STAR with SparkMap, make sure to set the number of executor instances equal to the number of machines/nodes in your compute cluster. If you do not, you will run into executor-level errors!
- Edit the singlespark.sh file with parameters in the following format:
python singlespark.py full_path_to_fastq_directory full_path_to_sam_output_directory memory_to_Executor(in GB) driver_Memory(in GB) max_cores_for_process executor_instances mapper_specific_options mapper_type (max_BBMAP) logging
NOTE: If you are using BBMap, make sure to specify an additional parameter, max_BBMAP, which indicates the maximum number of partitions you would like to use for your run. This is needed because BBMap is RAM intensive and may run out of RAM if mapping is done on a large dataset with a large number of partitions. For reference, with 3 executor instances, 3 machines, and 55 cores/250 GB of free RAM per machine, a maximum of 30 partitions could be used for mapping. The maximum number of partitions depends on how many partitions are sent to each executor instance and on the amount of RAM available. In the setup above, 10 partitions were sent to each executor instance/machine, and loading the genome index and mapping those 10 partitions with 2 parallel search threads per partition used around 200 GB of RAM per machine.
Make sure that the full_path_to_sam_output_directory contains the prefix file: and that the SAM output directory does not already exist (remove it if it does). Executor and driver memory should end with G to indicate gigabytes or MB to indicate megabytes. The logging parameter should be passed either a Y or N for Yes or No.
Example: python singlespark.py /s1/snagaraj/project_env/SRR639031_1.fastq file:/s1/snagaraj/output/single 20G 100G 100 2 "/s1/snagaraj/bowtie2/bowtie2 --no-hd --no-sq -p 2 -x /s1/snagaraj/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome -"
See the sample run file scripts/singlespark.sh for further examples.
- Run "chmod +x singlespark.sh" to give the script execute permissions.
- Run ./singlespark.sh to run Spark as an interactive process, or run "nohup ./singlespark.sh &" to run Spark as a background process.
- Go into your local output directory and run cat * > combined_sam_file to combine the output blocks into a single SAM file.
- Edit the pairspark.sh file with parameters in the following format:
python pairspark.py full_path_to_fastq_mate1_directory full_path_to_fastq_mate2_directory full_path_to_sam_output_directory memory_to_Executor(in GB) driver_Memory(in GB) max_cores_for_process executor_instances mapper_specific_options logging
Make sure that the full_path_to_sam_output_directory contains the prefix file: and that the SAM output directory does not already exist (remove it if it does). Executor and driver memory should end with G to indicate gigabytes or MB to indicate megabytes. The logging parameter should be passed either a Y or N for Yes or No.
Example: python pairspark.py /s1/snagaraj/project_env/SRR639031_1.fastq /s1/snagaraj/project_env/SRR639031_2.fastq file:/s1/snagaraj/output/pair 20G 100G 100 2 "/s1/snagaraj/bowtie2/bowtie2 --no-hd --no-sq -p 2 -x /s1/snagaraj/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome --interleaved -"
- Run "chmod +x pairspark.sh" to give the script execute permissions.
- Run ./pairspark.sh to run Spark as an interactive process, or run "nohup ./pairspark.sh &" to run Spark as a background process.
- Go into your local output directory and run cat * > combined_sam_file to combine the output blocks into a single SAM file.
We have found that setting the number of Spark executor instances equal to the number of worker nodes/machines on your cluster produces optimal results. Furthermore, with smaller numbers of cores per machine (< 20), performance was optimized with 2-3 parallel search threads in the mapper-specific options for HISAT2, TopHat, and Bowtie2; for BBMap, use 1-2 search threads in this case. With larger numbers of cores per machine (> 20), we found that 1-2 parallel search threads worked optimally for HISAT2, TopHat, and Bowtie2. With STAR, set the number of threads equal to the number of cores available on each machine, because when running STAR the number of data partitions used for mapping should equal the number of executor instances (which, as above, should equal the number of machines on your cluster).
If you are familiar with Spark, you can also edit your spark-defaults.conf file and specify the spark.executor.cores parameter for further optimization (see the example below).
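As an illustration only, a spark-defaults.conf tuned along these lines might contain entries like the following (the values are placeholders for a hypothetical 3-machine cluster with 20 cores per node; adjust to your hardware):

```
spark.executor.instances   3
spark.executor.cores       20
spark.executor.memory      20g
spark.driver.memory        100g
```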
- Valid_reads.sh/.py: used to create new SAM files containing only mapped reads. Can only be used for single-end mapping.
- Reorder.sh/.py: used to create new SAM files ordered by read ID. Can only be used for single-end mapping.
- Alternatively, use samtools sort to sort by chromosome position or by read name (a sketch is given below).
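A minimal samtools sketch, assuming samtools is installed and that your combined SAM file carries a valid header (file names are placeholders):

```bash
# Sort by chromosome and position (coordinate sort)
samtools sort -o combined_sorted.bam combined_sam_file

# Or sort by read name instead
samtools sort -n -o combined_sorted_by_name.bam combined_sam_file
```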
Use interactions.sh for single-end/locally aligned SAM files and interactions_pair.sh for paired-end SAM files. These scripts are useful to create interactions data (Hi-C) in the form:
Chr1 pos1 direction1 (0 or 16 for Watson/Crick strand) Chr2 pos2 direction2
interactions.sh input example: ./interactions.sh test.sam test_interactions.txt
interactions_pair.sh input example: ./interactions_pair.sh test.sam test1.sam test_interactions.txt
If you request a header in your mapper-specific options, run awk '!seen[$0]++' orig_file_name > new_file_name to eliminate the duplicate headers. Do this BEFORE running any further analyses or you will receive errors. However, we recommend generating the header separately and joining it to the mapping file, as this is faster (one possible way is sketched below).
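A hedged sketch of that approach, assuming Spark wrote the mapping output as part-* blocks and that each block begins with its own copy of the SAM header (file names are placeholders):

```bash
# Keep the header from one block only (SAM header lines start with '@'),
# then append every block with its header lines stripped (-v inverts the match, -h hides file names)
grep '^@' part-00000 > combined_with_header.sam
grep -vh '^@' part-* >> combined_with_header.sam
```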