The g2s application is a multi-stage data processing pipeline that extracts data from a Qlever SPARQL endpoint, processes it, and loads it into a Solr index. The process involves the following stages:
- Query: Executes a SPARQL query against a Qlever endpoint and stores the results in a Parquet file and a LanceDB table (see the sketch after this list).
- Group: Groups the data in the LanceDB table based on a SQL query and stores the result in a new LanceDB table.
- Augment: Augments the grouped data with additional information, such as geospatial data and temporal data.
- JSONL: Converts the augmented data into a JSONL file suitable for Solr.
- Batch: Loads the JSONL file into a Solr index in batches.
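As an illustration of the query stage, here is a minimal sketch that runs a SPARQL SELECT query, writes the result to a Parquet file, and registers it as a LanceDB table. The libraries used (SPARQLWrapper, pandas, lancedb), the paths, and the function name are assumptions for illustration only; the project's actual implementation lives in `defs/etl_query.py`.

```python
# Hypothetical sketch of the query stage; the real logic lives in defs/etl_query.py.
import pandas as pd
import lancedb
from SPARQLWrapper import SPARQLWrapper, JSON

def run_query(endpoint: str, query_file: str, parquet_path: str, table_name: str) -> None:
    # Read the SPARQL query and execute it against the endpoint.
    sparql = SPARQLWrapper(endpoint)
    with open(query_file) as f:
        sparql.setQuery(f.read())
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    # Flatten the SPARQL JSON bindings into plain rows.
    rows = [
        {var: b[var]["value"] for var in b}
        for b in results["results"]["bindings"]
    ]
    df = pd.DataFrame(rows)

    # Persist to Parquet and to a LanceDB table for the later stages.
    df.to_parquet(parquet_path)
    db = lancedb.connect("./stores/lancedb")
    db.create_table(table_name, data=df, mode="overwrite")

if __name__ == "__main__":
    run_query(
        "http://ghost.lan:7007",
        "./SPARQL/unionByType/dataCatalog.rq",
        "./stores/files/results_sparql.parquet",
        "sparql_results",
    )
```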
The repository is laid out as follows:

```
/scratch/g2s/
├───.dockerignore
├───.gitignore
├───.python-version
├───Dockerfile
├───main.py
├───pyproject.toml
├───README.md
├───uv.lock
├───workflow.sh
├───defs/
│   ├───datashaping.py
│   ├───etl_augment.py
│   ├───etl_batch.py
│   ├───etl_group.py
│   ├───etl_jsonl.py
│   ├───etl_query.py
│   └───...
├───SPARQL/
│   ├───duckdbSQL.sql
│   └───...
└───stores/
    ├───files/
    ├───lancedb/
    └───solrInputFiles/
```
- `main.py`: The main entry point for the application. It uses `argparse` to handle the different processing stages (modes); see the sketch below.
- `defs/`: Contains the Python modules for each processing stage (e.g., `etl_query.py`, `etl_group.py`).
- `SPARQL/`: Contains the SPARQL and SQL queries used in the pipeline.
- `stores/`: The default directory for storing intermediate and final data files.
  - `files/`: Stores Parquet files.
  - `lancedb/`: Stores LanceDB tables.
  - `solrInputFiles/`: Stores JSONL files for Solr.
- `Dockerfile`: For building the Docker image.
- `workflow.sh`: A shell script that demonstrates the full pipeline execution.
- `pyproject.toml`: Defines the project dependencies.
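The dispatch in `main.py` can be pictured roughly as an `argparse` subcommand parser, one subcommand per stage. The sketch below is illustrative only; the stage function names and exact flags are assumptions (see `main.py` for the real interface).

```python
# Illustrative argparse skeleton; stage function names and flags are assumptions.
import argparse

def run_query(args):
    # In the real project this would call into defs/etl_query.py.
    print(f"query: {args.source} -> {args.sink} (table {args.table})")

def main():
    parser = argparse.ArgumentParser(description="g2s: Qlever SPARQL -> Solr pipeline")
    sub = parser.add_subparsers(dest="mode", required=True)

    q = sub.add_parser("query", help="run a SPARQL query and store the results")
    q.add_argument("--source", help="SPARQL endpoint URL")
    q.add_argument("--sink", help="output Parquet file")
    q.add_argument("--query", help="path to the .rq query file")
    q.add_argument("--table", help="LanceDB table name")
    q.set_defaults(func=run_query)

    # group, augment, jsonl and batch subparsers would follow the same pattern.

    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```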
To build the Docker image for this project, run the following command from the project root:
```bash
docker build -t g2s-app .
```

The default command executes the `query` stage:

```bash
docker run --rm g2s-app
```

You can override the default arguments by appending them to the `docker run` command. For example, to run the `query` command with a different source URL:
```bash
docker run --rm g2s-app query --source "http://another-source.com" --sink "./stores/files/results.parquet" --query "./SPARQL/some_other_query.rq" --table "my_results"
```

To use a local SPARQL query collection and a local directory for storing the generated files, mount them into the container:

```bash
docker run --rm \
  -v $PWD/SPARQL:/app/SPARQL \
  -v $PWD/stores:/app/stores \
  g2s-app query --source "http://ghost.lan:7007" --sink "./stores/files/results_sparql.parquet" --query "./SPARQL/unionByType/dataCatalog.rq" --table "sparql_results"
```

Each of the commands (`query`, `group`, `augment`, `jsonl`, and `batch`) can be run individually. Here are examples of how to run each command locally using `uv`:
```bash
# Query
uv run main.py query --source "http://ghost.lan:7007" --sink "./stores/files/results_sparql.parquet" --query "./SPARQL/unionByType/dataCatalog.rq" --table "sparql_results"

# Group
uv run main.py group --source "sparql_results" --sink "./stores/files/results_long_grouped.csv"

# Augment
uv run main.py augment --source "sparql_results_grouped"

# JSONL
uv run main.py jsonl --source "sparql_results_grouped_augmented"

# Batch
uv run main.py batch --source "./stores/solrInputFiles/sparql_results_grouped_augmented.jsonl" --sink "http://localhost:8983/solr/my_core"
```

The `workflow.sh` script provides an example of how to run the entire pipeline for different data types. It executes the `query`, `group`, `augment`, `jsonl`, and `batch` stages in sequence for each data type:
```bash
#!/bin/bash

SPARQL_ENDPOINT="http://localhost:7007"
SOLR_ENDPOINT="http://oih.ioc-africa.org:8983/solr/ckan"

echo "----------> dataCatalog"
python main.py query --source "${SPARQL_ENDPOINT}" --sink "./stores/files/results_sparql.parquet" --query "./SPARQL/unionByType/dataCatalog.rq" --table "sparql_results"
python main.py group --source "sparql_results" --sink "./stores/files/results_long_grouped.csv"
python main.py augment --source "sparql_results_grouped"
python main.py jsonl --source "sparql_results_grouped_augmented"
python main.py batch --source "./stores/solrInputFiles/sparql_results_grouped_augmented.jsonl" --sink "${SOLR_ENDPOINT}"

# ... (repeated for other data types)
```
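The final `batch` step above pushes the JSONL documents into Solr. As a rough sketch of what such a loader can look like, assuming the `requests` library and a reachable Solr core (the batch size and function names are illustrative, not the project's actual code):

```python
# Illustrative Solr batch loader; batch size and function names are assumptions.
import json
import requests

def post_batch(solr_core_url: str, docs: list) -> None:
    # Solr's /update handler accepts a JSON array of documents.
    r = requests.post(f"{solr_core_url}/update", params={"commit": "true"}, json=docs)
    r.raise_for_status()

def load_jsonl(jsonl_path: str, solr_core_url: str, batch_size: int = 500) -> None:
    batch = []
    with open(jsonl_path) as f:
        for line in f:
            if not line.strip():
                continue
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                post_batch(solr_core_url, batch)
                batch = []
    if batch:
        post_batch(solr_core_url, batch)

if __name__ == "__main__":
    load_jsonl(
        "./stores/solrInputFiles/sparql_results_grouped_augmented.jsonl",
        "http://localhost:8983/solr/my_core",
    )
```

Posting in fixed-size batches keeps memory use flat on large exports and makes a failed upload easier to retry.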