This is the repo for NFInsight's ETL server.
This repo contains the Docker application code needed to run the ETL layer shown in the diagram above.
In the future, we hope to incorporate analytics workflows with these Cassandra clusters, using TensorFlow and Spark.
Developed by @SeeuSim and @JamesLiuZx
To run, simply follow these steps:
- Clone the repo.
- Create a virtual environment within the directory. We recommend using Python 3.9.
- Activate the virtual environment.
- Grab your API connection strings and populate them in a `.env` file in the `/etl/fastapi_app` folder.
  - You may use the `.env.example` as a guideline.
  - You should also generate an app secret for authenticating JWT tokens (see the sketch below).
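If you need a quick way to generate such a secret, the snippet below is a minimal sketch; the key name `APP_SECRET` and the 32-byte length are assumptions, so match whatever your `.env.example` actually expects:

```python
# Minimal sketch: generate a random hex string to use as the JWT app secret.
# The name APP_SECRET and the 32-byte length are illustrative assumptions;
# use whatever key name your .env.example defines.
import secrets

print(f'APP_SECRET="{secrets.token_hex(32)}"')
```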
Additional care needs to be taken when setting up the database. Please refer to `etl/database.py`:

- For Datastax Astra: (you may also refer to the Astra set up section below)
  - It uses a zip file which we store in the `etl` folder and reference in `etl/database.py`. To get this package, go to your Datastax Astra console, grab the "connection bundle", and save it in the `etl` folder.
  - For its other variables, populate them according to the `etl/.env.example` file.
- For Azure Cosmos:
  - The code in `database.py` will need to be modified, as well as the environment variables needed.
  - This applies for other Cassandra clusters as well.
- Using your Cassandra CQL shell, create all the tables in `etl/celery_app/db/models.py` with those commands.
- Insert an admin user with a username and bcrypt-hashed password into the `admin_user` table.
6.1. Generate the password hash with Python:

```python
from passlib.context import CryptContext

context = CryptContext(schemes=['bcrypt'], deprecated='auto')
password = "{PASSWORD}"
hashed_password = context.hash(password)
print(hashed_password)
# >> 'hashed password'
```

6.2. Insert the user into your database with its CQL shell:

```sql
INSERT INTO admin_user (username, hashed_password, disabled)
VALUES ('{username}', '{hashed_password}', false);
-- >> Values inserted
```

- Run these commands:
```sh
# Build the app from the local code.
docker compose up --build
```

The server should be up and running. To visit the OpenAPI spec, simply go to `127.0.0.1/docs` in your browser.
- To trigger authenticated routes, key in the admin credentials you inserted in the earlier step and click `authenticate` in the OpenAPI spec.
- To spin down the server, simply run Cmd + C in the Docker Compose terminal.
You first need to create a Datastax Astra database and navigate to your database admin console on the web.
To set up, simply download your `secure-connect-{database_name}` bundle and place it in the `etl` folder.
Reference that bundle (including the database name) in the `etl/database.py` file by setting the variables in the `etl/.env` file:
```sh
ASTRA_DB_NAME="<value>"
ASTRA_CLIENT_ID="<value>"
ASTRA_CLIENT_SECRET="<value>"
ASTRA_TOKEN="<value>"
ASTRA_KEYSPACE="<keyspace>"
```

These values can be obtained from your Astra DB web console.
The Python ORM by DataStax has some flaws. Hence, we execute our queries using only its raw CQL execution engine.
To connect to the database, create a Datastax Astra account and database, and use either their web CQL shell or execute raw queries with its various drivers.
Within the web shell, you should be able to test and execute queries using CQL.
Once your queries have been validated, use `session.prepare` and `session.execute` in your Python code to execute database statements, as in the sketch below.
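For orientation, here is a minimal sketch of that flow with the DataStax Python driver. The environment variable names mirror the `etl/.env` example above, but the bundle path, table, and column names are illustrative assumptions rather than the repo's actual code:

```python
import os

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Connect to Astra using the secure connect bundle saved in the etl folder.
# The bundle path below is an assumption; point it at your actual zip file.
cloud_config = {
    "secure_connect_bundle": f"./secure-connect-{os.environ['ASTRA_DB_NAME']}.zip"
}
auth_provider = PlainTextAuthProvider(
    os.environ["ASTRA_CLIENT_ID"], os.environ["ASTRA_CLIENT_SECRET"]
)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect(os.environ["ASTRA_KEYSPACE"])

# Prepared statements bind values safely, with no f-string interpolation needed.
# 'admin_user' and 'username' follow the table described earlier; adjust as needed.
stmt = session.prepare("SELECT * FROM admin_user WHERE username = ?")
rows = session.execute(stmt, ("admin",))
for row in rows:
    print(row)
```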
Within our code, there are multiple CQL injection vulnerabilities with raw f-string queries. However, as we are not storing sensitive data within the database and are optimising our queries for batch performance, we will leave them as such for now.
Fixes proposed are welcome, via our issues section.
If you're wondering what Celery is, it is a distributed task queue that can be used to run background tasks.
We've configured Celery in this project to use a RabbitMQ broker with py-amqp, and a Redis in-memory results backend that can be used for an access lock if needed.
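As a rough illustration of that wiring (the broker and backend URLs and environment variable names below are placeholders, not necessarily the exact values this repo uses):

```python
import os

from celery import Celery

# Illustrative broker/backend wiring: RabbitMQ over AMQP as the broker,
# Redis as the results backend. URLs and env var names are placeholders.
app = Celery(
    "celery_app",
    broker=os.getenv("CELERY_BROKER_URL", "amqp://guest:guest@localhost:5672//"),
    backend=os.getenv("CELERY_RESULT_BACKEND", "redis://localhost:6379/0"),
)
```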
Here's how to run the demo:
- If you haven't already, ensure that your system has:
  - Docker installed, and that the Docker daemon is up and running. On OSes with a GUI, you may simply launch the Docker Desktop client.
- Ensure all your environment variables are set. You may follow the respective `.env.example` files.
- Run this command in your terminal:

  ```sh
  docker-compose up
  ```

- Now, your app may call any function denoted with `@app.task` in `app/celery.py` (see the task sketch after this list). This should run in the background.
- To illustrate, open a separate shell with the same venv activated, and run this:

  ```sh
  # Start a REPL environment for testing
  python3
  >> from celery_app.celery import app
  >> app.send_task('task_name', args=(...), kwargs={...})
  ```

  You should be able to see the Celery worker handle and execute the task. In the future, we hope to be able to implement the necessary APIs to manage, start and stop tasks.
- You may also run the script `./scripts/flower.sh` in another terminal to see a GUI to view task running statuses at `localhost:5556`. Remember to run the same chmod command on the flower script. Alternatively, you can run `find ./scripts -type f -exec chmod +x {} +` to enable permissions for all current script files within the folder.
- To spin down the Celery app and related resources, perform these actions in this sequence:
  - Terminate the flower script by running Cmd+C in the `flower.sh` terminal.
  - Terminate the containers by running `docker-compose down` from another terminal.
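To make the `@app.task` / `send_task` relationship concrete, here is a minimal sketch of a task module. The task name `demo.add` and its arguments are purely illustrative and do not correspond to tasks that actually exist in this repo:

```python
# Illustrative only: a task registered on the app instance from celery_app/celery.py.
# The explicit name 'demo.add' is a made-up example, not a real task in this repo.
from celery_app.celery import app


@app.task(name="demo.add")
def add(x: int, y: int) -> int:
    """Toy background task: add two numbers."""
    return x + y
```

With the worker running, `app.send_task('demo.add', args=(1, 2))` would queue this task and return an `AsyncResult` whose `.get()` yields `3` once the worker finishes.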
If Kubernetes is more your thing, we also provide a Kubernetes workflow under `k8s/Setup.md`.
Pre-requisites: Modify the image tags in the Kubernetes manifests for the Celery deployment and the FastAPI deployment to point to your local images that were previously built with Docker Compose.
They may be found under `./k8s/resources/celery-worker-deployment.yaml` and `./k8s/resources/fastapi-application-deployment.yaml`.
- Ensure that you have `minikube` and `kubectl` on your system.
- Ensure that the Docker daemon is running.
- Run the commands below:
```sh
# Start the local control plane with minikube
minikube start

# Create the necessary namespaces
kubectl apply -f k8s/namespace.yaml

# Create the resources
kubectl apply -f k8s/resources

# Mirror the ports
minikube tunnel
```

Now, you will be able to interact with the containers and the FastAPI application as if you were running docker-compose. To terminate, simply run `kubectl delete -f k8s/resources`.
NOTE: Do NOT delete the namespace.
- Model Training Scripts
- CI to build and deploy to ACA