OGD4All is a prototype system that enables easy-to-use, transparent interaction with Geospatial Open Government Data through natural language. It was developed and tested with 430 tabular and geospatial datasets from the city of Zurich.
*(Demo video: `OGD4All_Simple_Demo.mp4`)*
Run the following commands with Python 3.12 to create a virtual environment that contains all required dependencies:
```
python -m venv .venv            # Create a new virtual environment named .venv (or any name you like)

# Activate the virtual environment depending on your platform
.venv\Scripts\activate.bat      # Windows, cmd.exe
.venv\Scripts\Activate.ps1      # Windows, PowerShell
source .venv/bin/activate       # Linux/macOS

pip install -r requirements.txt
```

Configure the following environment variables: `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_ENDPOINT_EMBEDDING_LARGE`.
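For example, on Linux/macOS you could export them like this (all values are placeholders for your own Azure deployment):

```
export AZURE_OPENAI_API_KEY="<your-api-key>"
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
export AZURE_OPENAI_ENDPOINT_EMBEDDING_LARGE="https://<your-embedding-resource>.openai.azure.com/"
```

On Windows (cmd.exe), use `set NAME=value` instead.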
The system was tested with 430 tabular and geospatial datasets of the city of Zurich, but could also be used with other types of data (with some additional steps). While the associated metadata is already contained in this repository, you need to perform a few additional steps to obtain the actual data:
- Download the datasets here. Unfortunately, Zurich's open data platform does not offer an API for downloading datasets, so they were exported manually. Note that this was done in March and May of 2025, meaning some datasets may be out of date. All extraction timestamps, dataset titles, and filenames are listed in `data\opendata\50000006\downloads.csv`; feel free to replace datasets with more recent versions.
- Extract the ZIP folder into the directory `data\opendata\50000006\extracted` (make sure it is not further nested).
- If you added the data correctly, you should see the following messages at startup:
```
[info ] Will use 430 files for group owner 50000006.
[info ] 430/548 metadata embeddings remaining after filtering for existing data.
```
By default, OGD4All uses OpenAI's `text-embedding-3-large` for semantic search. Corresponding embeddings were generated for the metadata of all datasets and are stored in `data\metadata\50000006\processed_metadata_embeddings.csv`. You have two options:
- **Keep `text-embedding-3-large`**: You'll need to set either the environment variable `OPENAI_API_KEY` (with your OpenAI API key) or both `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT_EMBEDDING_LARGE` (with your Azure OpenAI API key and endpoint).
- **Use custom embeddings**: First, update the `processed_metadata_embeddings.csv` file by modifying the `scraping\generate_embeddings.py` script with your own method. Then, update the `get_embedding` function in `retrieval\retriever.py` (see the sketch below).
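If you go the custom route, a minimal sketch of a replacement could look like the following. This assumes `get_embedding` maps a query string to a plain list of floats (verify the actual signature in `retrieval\retriever.py`) and uses `sentence-transformers` purely as an example:

```python
# Hypothetical replacement for get_embedding in retrieval\retriever.py.
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # example local embedding model

def get_embedding(text: str) -> list[float]:
    # encode() returns a numpy array; convert it to a plain list of floats
    return _model.encode(text).tolist()
```

Whichever model you pick, use the same one in `scraping\generate_embeddings.py` so that query embeddings and the stored metadata embeddings live in the same vector space.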
Support for LLMs provided by Azure OpenAI, Azure, OpenAI, OpenRouter, and Ollama is implemented.
Depending on which one you use, set the corresponding environment variable(s): `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` (Azure OpenAI), `AZURE_API_KEY` (Azure), `OPENAI_API_KEY` (OpenAI), or `OPENROUTER_API_KEY` (OpenRouter).
If your desired model is not listed in `SUPPORTED_LLMS` in `utils.py`, just add it to the list and adapt the `get_llm_client` function (a rough sketch follows below).
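As a rough illustration only (the real `get_llm_client` in `utils.py` is structured differently, and the model ID below is a placeholder), registering a new model served via OpenRouter's OpenAI-compatible API could look like this:

```python
import os

from openai import OpenAI  # the OpenAI SDK also works with OpenAI-compatible APIs

SUPPORTED_LLMS = ['gpt-4.1', 'my-provider/my-new-model']  # append your model ID to the real list

def get_llm_client(model: str):
    # Hypothetical dispatch; adapt it to the existing provider branches in utils.py.
    if model == 'my-provider/my-new-model':
        return OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=os.environ["OPENROUTER_API_KEY"],
        )
    raise ValueError(f"Model {model!r} is not configured.")
```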
Start the system with:

```
python main.py
```

This starts a local Gradio server where you can interact with the 430 geospatial and tabular datasets of the city of Zurich through a simple interface.
| Argument | Default | Choices | Description |
|---|---|---|---|
| `--groupOwner` | `50000006` | – | The groupOwner ID whose metadata should be queried (e.g., the city of Zurich). |
| `--top_n` | `10` | Any number > 0 | The number of documents to retrieve for a single KNN search. Cost considerations and your model's context window should determine how high you push this. |
| `--retriever` | `agentic` | `agentic`, `knn`, `verified` | The retrieval strategy to use. |
| `--analyzer` | `iterative_local` | `simple_local_v2`, `simple`, `iterative_local` | The analyzer type to use. |
| `--retrieval_llm` | `gpt-4.1` | `SUPPORTED_LLMS` | The LLM to use for retrieval tasks. |
| `--retrieval_check_llm` | `gpt-4o` | `SUPPORTED_LLMS` | The LLM to use for checking whether a follow-up retrieval is needed. Should ideally be fast. |
| `--coding_llm` | `gpt-4.1` | `SUPPORTED_LLMS` | The LLM to use for coding tasks/analysis. |
| `--hybrid_search` | Not set | – | Enable hybrid search with Milvus (default is semantic search). |
| `--bm25_search` | Not set | – | Enable BM25 search with Milvus (default is semantic search). |
| `--no_streaming` | Not set | – | Disable streaming for the coding LLM. This enables validation of LLM responses and token counting, but makes the system feel less responsive. |
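For example, to run with plain KNN retrieval, BM25 search, and a different coding model (all flags taken from the table above):

```
python main.py --retriever knn --coding_llm gpt-4o --bm25_search
```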
For reference, `SUPPORTED_LLMS` currently contains:

```python
SUPPORTED_LLMS = ['gpt-5', 'gpt-5-mini', 'gpt-5-nano', 'gpt-4.1', 'gpt-4.1-mini', 'gpt-4o', 'gpt-o1', 'gpt-o3-mini-preview', 'gemini-2.5-pro', 'gemini-2.5-flash', 'gemini-2.5-flash-lite-preview-06-17', 'meta-llama-70B-instruct-latest', 'mistral-codestral', 'mistral-large', 'openai/gpt-oss-120b', 'meta-llama/llama-4-maverick']
```
This prototype was implemented as part of Michael Siebenmann's Master's thesis at the Professorship of Computational Social Science and was further supported by the Esri R&D Center Zurich.