This repository provides:
- How to set up a DESC python environment with Apache Spark at NERSC (batch/interactive + JupyterLab).
- Basic tutorials to discover Apache Spark.
- Links to Apache Spark developments in DESC (DC2 data access, 3x2pt, ...).
To come:
- Stack environment + Apache Spark.
- Bootcamp for DESC members.
Apache Spark is a cluster computing framework, that is a set of tools to perform computation on a network of many machines. Spark started in 2009 as a research project, and it had a huge success so far in the industry. It is based on the so-called MapReduce cluster computing paradigm, popularized by the Hadoop framework using implicit data parallelism and fault tolerance.
Spark provides many functionalities exposed through Scala/Python/Java/R API (Scala is the native one). As far as DESC is concerned, I would advocate to use the Python API (called pyspark) for obvious reasons. But feel free to put your hands on Scala, it's worth it.
We provide a kernel to work with Apache Spark and DESC shipped with your DESC
environment (see LSSTDESC/nersc). You just need to select the desc-pyspark kernel in the JupyterLab interface. Note that this kernel is automatically installed in your environment when you use the kernel setup script:
source /global/common/software/lsst/common/miniconda/kernels/setup.shThe previous JupyterLab mode is limited to 4 cores with a total 8GB memory. For more demanding work you may use the NERSC interactive (or batch see next part) queues. Then each node has 100GB memory and 32 cores. Note that the available cache is about 60% of the total memory.
To ease the Spark python usage (pyspark) we provide some scripts located in
the /script directory:
- on a cori interactive node, first run:
source scripts/spark-interactive.sh NODES TIMEwhere NODES is the number of requested nodes and TIME the session time (in minutes).
- once logged in:
source scripts/init_spark.sh
- you can then run the pysparkshell with:scripts/run_psyspark
You enter an ipython shell from which you can run interactive
commands and/or standard python scripts with the usual %run magic
command.
The pyspark version is a customized one including Anaconda + a few
other packages (as healpy). If you need other packages submit an issue.
NERSC provides support to run Spark at scale. Note that for Spark version 2.3.0+, Spark runs inside of Shifter. Complete information is available here. Example batch scripts will follow soon.
We started the AstroLab Software project as a platform for big data developments in astronomy, with focus on Apache Spark. There are several projects of interest for LSST, but they largely go beyond.
In case you want to generate your own DESC python + Apache Spark kernel, follow these steps:
# Clone the repo
git clone https://github.com/astrolabsoftware/spark-kernel-nersc.git
cd spark-kernel-nersc
# Create the kernel - it will be stored under
# $HOME/.local/share/jupyter/kernels/
python desc-kernel.py \
  -kernelname desc-pyspark-custom \
  -pyspark_args "--master local[4] \
  --driver-memory 32g --executor-memory 32g \
  --packages com.github.astrolabsoftware:spark-fits_2.11:0.7.1"And then select the kernel desc-pyspark-custom in the JupyterLab interface.
More information can be found at spark-kernel-nersc.