Kowalski is an API-driven multi-survey data archive and alert broker. Its main focus is the Zwicky Transient Facility.
A schematic overview of the functional aspects of Kowalski and how they interact is shown below:
- A non-relational (NoSQL) database, MongoDB, powers the data archive, the alert stream sink, and the alert handling service.
- An API layer provides an interface for interaction with the backend. It is built using a python asynchronous web framework, aiohttp, and the standard python async event loop serves as a simple, fast, and robust job queue (a generic sketch of this pattern follows the list). Multiple instances of the API service are maintained using the Gunicorn WSGI HTTP Server.
- A programmatic python client is also available to interact with Kowalski's API.
- Incoming and outgoing traffic can be routed through traefik, which acts as a simple and performant reverse proxy/load balancer.
- An alert brokering layer listens to Kafka alert streams and uses a dask.distributed cluster for distributed alert packet processing, which includes data preprocessing, execution of machine learning models, catalog cross-matching, and ingestion into MongoDB. It also executes user-defined filters based on the augmented alert data and posts the filtering results to a SkyPortal instance.
- Kowalski is containerized using Docker and orchestrated with docker-compose, allowing for simple and efficient deployment in the cloud and/or on-premise. However, it can also run without Docker, especially for development purposes.
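As a purely generic illustration of the API-layer pattern (not Kowalski's actual code, and with an endpoint path chosen only for the example), an aiohttp service handling requests on the async event loop looks roughly like this:

```
# Generic sketch of the aiohttp + asyncio pattern described above; not Kowalski's API code.
import asyncio

from aiohttp import web


async def run_query(payload: dict) -> dict:
    # placeholder for database / catalog work running on the event loop
    await asyncio.sleep(0.1)
    return {"echo": payload}


async def handle_query(request: web.Request) -> web.Response:
    payload = await request.json()
    # a slow job awaits here without blocking other requests
    result = await run_query(payload)
    return web.json_response({"status": "success", "data": result})


app = web.Application()
app.add_routes([web.post("/api/queries", handle_query)])

if __name__ == "__main__":
    web.run_app(app, port=4000)
```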
Kowalski is an API-first system. The full OpenAPI specs can be found here. Most users will only need the queries section of the specs.
The easiest way to interact with a Kowalski instance is via the python client, penquins.
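For instance, a cone search through penquins might look like the sketch below. The host, port, token, and exact query schema are assumptions for illustration; consult the OpenAPI specs and the penquins documentation for the authoritative format:

```
# Minimal penquins sketch; host, port, token, and catalog name are assumptions.
from penquins import Kowalski

kowalski = Kowalski(
    token="<your_api_token>",
    protocol="http",
    host="localhost",
    port=4000,
)

query = {
    "query_type": "cone_search",
    "query": {
        "object_coordinates": {
            "cone_search_radius": 2,
            "cone_search_unit": "arcsec",
            "radec": {"target": [68.578209, 49.0871395]},  # [RA, Dec] in degrees
        },
        "catalogs": {
            "ZTF_alerts": {"filter": {}, "projection": {"_id": 0, "objectId": 1}},
        },
    },
}

response = kowalski.query(query=query)
print(response)
```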
Start off by creating your own kowalski fork on GitHub and cloning it, then cd into the cloned directory:
```
git clone https://github.com/<your_github_id>/kowalski.git
cd kowalski
```

Make sure you have a python environment that meets the requirements to run Kowalski. You can use either conda or virtualenv. Using virtualenv, you can do:

```
virtualenv env
source env/bin/activate
pip install -r requirements.txt
```

You need config files in order to run Kowalski. You can start off by copying the default config/secrets over:
```
cp config.defaults.yaml config.yaml
cp docker-compose.defaults.yaml docker-compose.yaml
```

config.yaml contains the API and ingester configs, the supervisord config for the API and ingester containers, together with all the secrets, so be careful when committing code / pushing docker images.
If you want to run in a production setting, be sure to modify config.yaml and choose strong passwords!
docker-compose.yaml serves as a config file for docker-compose, and can be used for different Kowalski deployment modes.
Kowalski comes with several template docker-compose configs (see below for more info).
Finally, once you've set the config files, you can build an instance of Kowalski. You can do this with the following command:
```
./kowalski.py up --build
```

You have now successfully built a Kowalski instance!
Any time you want to rebuild kowalski, you need to re-run this command.
If you want to just interact with a Kowalski instance that has already been built, you can drop the --build flag:
```
./kowalski.py up
```

to start up a pre-built Kowalski instance, or

```
./kowalski.py down
```

to shut down a pre-built Kowalski instance.
You can check that a running docker Kowalski instance is working by using the Kowalski test suite:
```
./kowalski.py test
```

To shut down the instance when you are done:

```
./kowalski.py down
```

Similar to the Docker setup, you need config files in order to run Kowalski without Docker. You can start off by copying the default config/secrets over. Here however, the default config file is config.local.yaml:

```
cp config.local.yaml config.yaml
```

The difference between config.local.yaml and config.defaults.yaml is that the former has all the path variables set to local paths relative to the kowalski repo. This is useful if you want to run Kowalski without Docker, without having to change many path variables in config.yaml.
You will need to edit the database section to point to your local mongodb instance, or to a mongodb atlas cluster, in which case you should set database.srv to true, and database.replica_set to the name of your cluster or simply null.
If you are using a MongoDB Atlas cluster, kowalski won't be able to create admin users, so you will need to do so manually through the cluster's web interface. You will need to create 2 users: the admin user and the regular user, with the usernames and passwords you've set in the config file.
If you are running your own MongoDB cluster locally (not using MongoDB Atlas or any remote server), it is likely that database.host should be 127.0.0.1 or similar. For simplicity, we also set database.replica_set to null. We also need to set the admin and user roles for the database. To do so, log in to mongodb and run (using the default values from the config):
```
mongosh --host 127.0.0.1 --port 27017
```

and then, from within the mongo terminal:

```
use kowalski
db.createUser( { user: "mongoadmin", pwd: "mongoadminsecret", roles: [ { role: "userAdmin", db: "admin" } ] } )
db.createUser( { user: "ztf", pwd: "ztf", roles: [ { role: "readWrite", db: "admin" } ] } )
db.createUser( { user: "mongoadmin", pwd: "mongoadminsecret", roles: [ { role: "userAdmin", db: "kowalski" } ] } )
db.createUser( { user: "ztf", pwd: "ztf", roles: [ { role: "readWrite", db: "kowalski" } ] } )
```

We recommend using MongoDB Atlas when running locally to avoid having to set up a mongodb cluster. Don't forget to set database.host to your Atlas URL.
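Whichever route you choose, you can quickly verify that the credentials work with a short pymongo check. This is a hypothetical snippet using the default local credentials from the config; swap in your Atlas connection string if applicable:

```
# Hypothetical connectivity check; credentials and host are the config defaults, adjust as needed.
from pymongo import MongoClient

client = MongoClient("mongodb://ztf:ztf@127.0.0.1:27017/?authSource=kowalski")
db = client["kowalski"]
print(db.list_collection_names())  # should return without an authentication error
```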
First, you'll need to install a few system dependencies:

```
sudo apt install -y default-jdk wget
```

Make sure you have a version of python that is 3.8 or above before following the next steps.
Now, in the same terminal, run:
```
sudo pip install virtualenv
virtualenv env
source env/bin/activate
```

to create your virtual environment. If you are told that pip is not found, try using pip3 instead. For the following steps however (in the virtualenv), pip should work.
Then, run:
```
pip install -r requirements.txt
pip install -r kowalski/requirements_api.txt
pip install -r kowalski/requirements_ingester.txt
pip install -r kowalski/requirements_tools.txt
```

On macOS, you first need to install several system dependencies using homebrew:

```
brew install java librdkafka wget
```

After installing java, run the following to make sure it is accessible by kafka later on:
```
sudo ln -sfn /opt/homebrew/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk
echo 'export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"' >> ~/.zshrc
```
Separately, we install hdf5:

```
brew install hdf5
```

At the end of hdf5's installation, the path where it has been installed will be displayed in your terminal. Copy that path and make sure you save it somewhere (as a comment in config.yaml, for example). You will need it when installing or updating python dependencies.
Next, run:
```
export HDF5_DIR=<path_to_hdf5>
```

Make sure you have a version of python that is 3.8 or above before following the next steps. You can consider installing a newer version with homebrew if needed.
Now, in the same terminal, run:
```
sudo pip install virtualenv
virtualenv env
source env/bin/activate
```

to create your virtual environment. If you are told that pip is not found, try using pip3 instead. For the following steps however (in the virtualenv), pip should work.
We do not suggest using conda, as we experienced trouble installing arm64-specific packages like tensorflow-macos and tensorflow-metal.
Finally, run the following to install the python dependencies:
```
pip install -r requirements.txt
pip install -r kowalski/requirements_api.txt
pip install -r kowalski/requirements_ingester_macos.txt
pip install -r kowalski/requirements_tools.txt
```

Then, you'll need to install kafka and zookeeper:

```
export scala_version=2.13
export kafka_version=3.4.0
wget https://downloads.apache.org/kafka/$kafka_version/kafka_$scala_version-$kafka_version.tgz
tar -xzf kafka_$scala_version-$kafka_version.tgz
```

Installed in this way, path.kafka in the config should be set to ./kafka_2.13-3.4.0.
The last step of the kafka setup is to replace its kafka_$scala_version-$kafka_version/config/server.properties file with the one in kowalski/server.properties. As this setup is meant to run locally outside docker, you should modify that server.properties file to change log.dirs=/data/logs/kafka-logs to log.dirs=./data/logs/kafka-logs (the addition is the . at the beginning of the path) to avoid permission issues.
The ingester we will run later on needs the different models to be downloaded. To do so, run:
```
cd kowalski && mkdir models && cd models && \
braai_version=d6_m9 && acai_h_version=d1_dnn_20201130 && \
acai_v_version=d1_dnn_20201130 && acai_o_version=d1_dnn_20201130 && \
acai_n_version=d1_dnn_20201130 && acai_b_version=d1_dnn_20201130 && \
wget https://github.com/dmitryduev/braai/raw/master/models/braai_$braai_version.h5 -O braai.$braai_version.h5 && \
wget https://github.com/dmitryduev/acai/raw/master/models/acai_h.$acai_h_version.h5 && \
wget https://github.com/dmitryduev/acai/raw/master/models/acai_v.$acai_v_version.h5 && \
wget https://github.com/dmitryduev/acai/raw/master/models/acai_o.$acai_o_version.h5 && \
wget https://github.com/dmitryduev/acai/raw/master/models/acai_n.$acai_n_version.h5 && \
wget https://github.com/dmitryduev/acai/raw/master/models/acai_b.$acai_b_version.h5
```

Kowalski's paths have been designed to be easy to use in Docker. To avoid running into any problems accessing the models, we suggest copying them to the root of your kowalski repo as well, and not only into the kowalski directory (which is where the command above put them).
The API app can then be run with
```
KOWALSKI_APP_PATH=./ KOWALSKI_PATH=kowalski python kowalski/api.py
```

Then tests can be run by going into the kowalski/ directory:

```
cd kowalski
```

and running:

```
KOWALSKI_APP_PATH=../ python -m pytest -s api.py ../tests/test_api.py
```

which should complete.
You might get 2 failed tests the first time. This is because some of the API tests rely on the ingester tests to have run first to populate the database. If you run the tests again after the ingester tests, they should all pass.
Next, you need to start the dask scheduler and workers. These are the processes that will run the machine learning models and the alert filtering when running the ingester and/or the broker.
Before starting it, you might want to consider lowering or increasing the number of workers and threads per worker in the config file, depending on your config. The default values are set to 4 workers and 4 threads per worker. This means that the dask cluster will be able to process 4 alerts at the same time. If you have a GPU available with a lot of VRAM, you might want to increase the number of workers and threads per worker to take advantage of it.
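For reference, these knobs map onto a dask.distributed cluster roughly as in the sketch below (illustrative only; the actual setup lives in kowalski/dask_cluster.py and reads its values from the config):

```
# Illustrative sketch of a dask.distributed cluster using the default values
# discussed above; this is not the actual kowalski/dask_cluster.py.
import time

from dask.distributed import LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=4,           # number of workers (from the config)
        threads_per_worker=4,  # threads per worker (from the config)
        scheduler_port=8786,   # dask's default scheduler port
    )
    print(f"scheduler listening at {cluster.scheduler_address}")
    # keep the cluster alive so the broker/ingester can submit alert-processing tasks
    while True:
        time.sleep(60)
```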
To start it, go into the kowalski/ directory:

```
cd kowalski
```

and run:

```
KOWALSKI_APP_PATH=../ python dask_cluster.py
```

If you have a GPU available, tensorflow might use it by default. However, you might run into some issues running the ml models when following the instructions below if your GPU does not have much memory. To avoid this, you can set the environment variable CUDA_VISIBLE_DEVICES=-1 before running the dask cluster:

```
CUDA_VISIBLE_DEVICES=-1 KOWALSKI_APP_PATH=../ python dask_cluster.py
```

If you have access to a live ZTF alert stream, make sure it is configured accordingly in the config. If you intend to simulate an alert stream locally, you should change the kafka config to set bootstrap.servers to localhost:9092 and zookeeper to localhost:2181. Then, you can run the broker with:

```
KOWALSKI_APP_PATH=./ python kowalski/alert_broker_ztf.py
```

When running locally for testing/development purposes, you can simply run the tests, which will create a mock alert stream and run the broker on it.
Tests can be run by going into the kowalski/ directory (when running the tests, you do not need to start the broker as instructed in the previous step; the tests will take care of it):

```
cd kowalski
```

and running:

```
KOWALSKI_APP_PATH=../ python -m pytest -s alert_broker_ztf.py ../tests/test_alert_broker_ztf.py
```

We also provide an option USE_TENSORFLOW=False for users who cannot install Tensorflow for whatever reason.
Once the broker is running, you might want to create a local kafka stream of alerts to test it. To do so, you can run the tools/kafka_stream.py script by going into the kowalski/ directory:

```
cd kowalski
```

and running:

```
PYTHONPATH=. KOWALSKI_APP_PATH=../ python ../tools/kafka_stream.py --topic="<topic_listened_by_your_broker>" --path="<path_to_alerts_in_KOWALSKI_APP_PATH/data/>" --test=True
```

where <topic_listened_by_your_broker> is the topic your broker listens to (e.g. ztf_20200301_programid3 for the ztf broker) and <path_to_alerts_in_KOWALSKI_APP_PATH/data/> is the path to the alerts in the data/ directory of the kowalski app (e.g. ztf_alerts/20200202 for the ztf broker).
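Under the hood, this boils down to producing raw Avro alert packets onto a local Kafka topic. A minimal hand-rolled sketch of that idea is shown below; it assumes confluent-kafka is installed and a local directory of .avro packets, and it is not the actual tools/kafka_stream.py:

```
# Minimal sketch: push locally stored Avro alert packets onto a Kafka topic.
# Assumes a broker at localhost:9092 and a directory of .avro files; not the real kafka_stream.py.
import glob

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
topic = "ztf_20200301_programid3"  # must match the topic your broker listens to

for packet in glob.glob("data/ztf_alerts/20200202/*.avro"):
    with open(packet, "rb") as f:
        producer.produce(topic, f.read())

producer.flush()  # block until all packets have been delivered
```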
Otherwise, you can test both ingestion and broker at the same time by running the ingester tests:
Tests can be run by going into the kowalski/ directory (similarly to the broker tests, you do not need to start the broker manually as instructed in the previous step; the tests will take care of it). Make sure that the dask cluster is running before running the ingester tests.

```
cd kowalski
```

and running:

```
PYTHONPATH=. KOWALSKI_APP_PATH=../ python -m pytest ../tests/test_ingester_ztf.py
```

The ingester tests can take a while to complete, so be patient! If they encounter an error with kafka, it is likely that you did not modify the server.properties file as instructed above, meaning that the kafka logs can't be created, which blocks the test without showing an error.
If that happens, you can simply run the ingester without pytest so that you can see the logs and debug the issue:
```
PYTHONPATH=. KOWALSKI_APP_PATH=../ python ../tests/test_ingester_ztf.py
```

Another common problem is that if you stop the ingester test while it's running, or if it fails, it might leave a lock file in the kafka logs directory, which will prevent kafka from starting the next time you try to do so. If that happens, you can either delete the lock file or simply retry starting the test, as a failed test attempt is supposed to remove the lock file. The test should then work on the following attempt.
We strongly advise you to open the server.log file found in kafka_$scala_version-$kafka_version/logs/ to see what is going on with the kafka server. It will be particularly useful if you encounter errors with the ingester tests or the broker.
Tests for the tools can be run by going into the kowalski/ directory:

```
cd kowalski
```

and running:

```
KOWALSKI_APP_PATH=../ python -m pytest -s ../tools/istarmap.py ../tests/test_tools.py
```

Kowalski uses docker-compose under the hood and requires a docker-compose.yaml file.
There are several available deployment scenarios:
- Bare-bones
- Bare-bones + broker for SkyPortal / Fritz
- Behind traefik
Use docker-compose.defaults.yaml as a template for docker-compose.yaml.
Note that the environment variables for the mongo service must match
admin_* under kowalski.database in config.yaml.
Use docker-compose.fritz.defaults.yaml as a template for docker-compose.yaml.
If you want the alert ingester to post (filtered) alerts to SkyPortal, make sure
{"misc": {"broker": true}} is set in config.yaml.
Use docker-compose.traefik.defaults.yaml as a template for docker-compose.yaml.
If you have a publicly accessible host allowing connections on port 443 and a DNS record with the domain
you want to expose pointing to this host, you can deploy kowalski behind traefik,
which will act as the edge router -- it can do many things including load-balancing and
getting a TLS certificate from letsencrypt.
In docker-compose.yaml:
- Replace [email protected] with your email.
- Replace private.caltech.edu with your domain.
OpenAPI specs are to be found under /docs/api/ once Kowalski is up and running.
Contributions to Kowalski are made through GitHub Pull Requests, a set of proposed commits (or patches).
To prepare, you should:
- Create your own fork of the kowalski repository by clicking the "fork" button.

- Clone (download) your copy of the repository, and set up a remote called upstream that points to the main Kowalski repository.

  ```
  git clone [email protected]:<yourname>/kowalski
  git remote add upstream [email protected]:dmitryduev/kowalski
  ```
Then, for each feature you wish to contribute, create a pull request:
- Download the latest version of Kowalski, and create a new branch for your work.
  Here, let's say we want to contribute some documentation fixes; we'll call our branch rewrite-contributor-guide.

  ```
  git checkout master
  git pull upstream master
  git checkout -b rewrite-contributor-guide
  ```

- Make modifications to Kowalski and commit your changes using git add and git commit.
  Each commit message should consist of a summary line and a longer description, e.g.:

  ```
  Rewrite the contributor guide

  While reading through the contributor guide, I noticed several places
  in which instructions were out of order. I therefore reorganized all
  sections to follow logically, and fixed several grammar mistakes along
  the way.
  ```

- When ready, push your branch to GitHub:

  ```
  git push origin rewrite-contributor-guide
  ```

  Once the branch is uploaded, GitHub should print a URL for turning your branch into a pull request. Open that URL in your browser, write an informative title and description for your pull request, and submit it. There, you can also request a review from a team member and link your PR with an existing issue.

- The team will now review your contribution, and suggest changes. To simplify review, please limit pull requests to one logical set of changes. To incorporate changes recommended by the reviewers, commit edits to your branch, and push to the branch again (there is no need to re-create the pull request, it will automatically track modifications to your branch).

- Sometimes, while you were working on your feature, the master branch is updated with new commits, potentially resulting in conflicts with your feature branch. To fix this, please merge in the latest upstream/master branch:

  ```
  git merge rewrite-contributor-guide upstream/master
  ```

  Developers may merge master into their branch as many times as they want to.

- Once the pull request has been reviewed and approved by at least two team members, it will be merged into Kowalski.
Install our pre-commit hook as follows:
```
pip install pre-commit
pre-commit install
```
This will check your changes before each commit to ensure that they
conform with our code style standards. We use black to reformat Python
code and flake8 to verify that code complies with PEP8.
To add a new alert stream to kowalski, see the PR associated with the addition of WINTER to Kowalski. A brief summary of the required changes (made to add WINTER to Kowalski, but hopefully extensible to other surveys) is given below:
- A new kowalski/alert_broker_<winter>.py needs to be created for the new alert stream. This can be modelled off the existing alert_broker_ztf.py or alert_broker_pgir.py scripts, with the following main changes:

  a. watchdog needs to be pointed to pull from the correct topic associated with the new stream.

  b. topic_listener needs to be updated to use the correct dask ports associated with the new stream from the config file (every alert stream should have different dask ports to avoid conflicts). topic_listener also needs to be updated to use the <WNTR>AlertConsumer associated with the new stream.

  c. <WNTR>AlertConsumer needs to be updated per the requirements of the survey. For example, WINTER does not require MLing prior to ingestion, so that step is excluded, unlike in the ZTFAlertConsumer. The WNTRAlertConsumer also does a cross-match to the ZTF alert stream, a step that is obviously not present in ZTFAlertConsumer.

  d. <WNTR>AlertWorker needs to be updated to use the correct stream from SkyPortal. alert_filter__xmatch_ztf_alerts needs to be updated with the new survey-specific cross-match radius (2 arcsec for WINTER).

- In kowalski/alert_broker.py, make_photometry needs to be updated with the filterlist and zeropoint system appropriate for the new stream.

- A new kowalski/dask_cluster_<winter>.py needs to be created, modeled on dask_cluster.py but using the ports for the new stream from the config file.

- The config file config.defaults.yaml needs to be updated to include the collections, upstream filters, crossmatches, dask ports, and ml_models (if MLing is necessary) for the new stream. No two streams should use the same dask ports, to avoid conflicts. Entries also need to be made in the supervisord section of the config file so that alert_broker_<winter>.py and dask_cluster_<winter>.py can be run through supervisor.

- Some alerts need to be added to data/ for testing. Tests for alert ingestion (tests/test_ingester_<wntr>.py) and alert processing (tests/test_alert_broker_wntr.py) can be modeled on the ZTF tests, with appropriate changes for the new stream. The ingester test is where you will be able to create a mock kafka stream to test your broker.

- ingester.Dockerfile needs to be edited so that all new files are copied into the docker container (add or modify the COPY lines).
For now, only the ZTF alert stream has a method implemented to run ML models on the alerts. However, this can be extended or reused as a basis to run ML models on other streams as well.
To add a new ML model to run on the ZTF alert stream, you simply need to add the model to the models directory, and add the model to ml_models.ZTF in the config file. The model will then be automatically loaded and run on the alerts.
Here are the exact steps to add a new ML model to Kowalski:
- Add the model in .h5 format, or, if you are using the .pb format, add the model files and directories in a folder called <model_name.model_version> in the models directory.

- Add the model name to ml_models.ZTF in the config file. All models need to have at least the following fields:

  - triplet: True or False, whether the model uses the triplet (images) as an input or not
  - feature_names: list of features used by the model as a tuple; they need to be a subset of the ZTF_ALERT_NUMERICAL_FEATURES found in kowalski/utils.py. Ex: ('drb', 'diffmaglim', 'ra', 'dec', 'magpsf', 'sigmapsf')
  - version: version of the model

Then, you might want to provide additional information about the model, such as:

- feature_norms: dictionary of feature names and their normalization values, if the model was trained with normalized features
- order: the order in which the triplet and features need to be passed to the model, e.g. ['triplet', 'features'] or ['features', 'triplet']
- format: format of the model, either h5 or pb. If not provided, the default is h5.
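To illustrate the interface these fields describe, below is a rough sketch of how such a model ends up being called on an alert. The model file name, input shapes, and feature list are hypothetical; the actual loading and inference code lives in the ZTF alert broker:

```
# Rough sketch of how a configured model is fed triplets and/or features.
# Model file name, input shapes, and feature list are hypothetical.
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("models/mynewmodel.d1_20230101.h5")

triplet = np.zeros((1, 63, 63, 3), dtype=np.float32)  # science / reference / difference cutouts
features = np.zeros((1, 6), dtype=np.float32)         # e.g. ('drb', 'diffmaglim', 'ra', 'dec', 'magpsf', 'sigmapsf')

# the 'order' field in the config determines how inputs are passed to the model
score = model.predict([triplet, features])
print(score)
```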
The best way to see if the model is being loaded correctly is to run the broker tests mentioned earlier. These tests will show you the models that are running, and the errors encountered when loading the models (if any).