YAADA is a data architecture and analytics platform developed to support the full analytics development lifecycle, from prototyping in local Python to operational deployment as containerized microservices. YAADA’s primary focus is ingesting, storing, and analyzing semi-structured document-oriented data and training, persisting, and applying analytic models. It leverages industry-hardened cloud technologies such as OpenSearch and OpenSearch Dashboards for document storage and visualization and Jupyter Notebook for exploratory data analysis and analytic prototyping. It provides an analytic plugin API that allows analytic developers to focus on the algorithms, while YAADA handles the details of data management and analytic invocation through REST and message-based APIs. In addition, YAADA includes pre-built analytic wrappers for popular open source libraries for NLP and web scraping.
This README contains instructions for developing the core YAADA project. For instructions on using YAADA for your project, consult the Getting Started documentation.
- Docker
- docker-compose (not always installed with Docker on Linux, so you may need to install it separately)
- Python 3.10
- PIP
- pipenv (optional but recommended)
- cookiecutter (if you would like to create your project from a template)
If you run into pip version issues when installing pipenv, consider installing pipenv through pipx.
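The prerequisites above can be checked before installing anything. The following is a minimal sketch, not part of YAADA: the helper names are hypothetical, and the executable names it searches for are assumptions (for example, some Docker installs expose Compose as `docker compose` rather than `docker-compose`).

```python
import shutil
import sys

# Executable names are assumptions based on the prerequisites list.
REQUIRED_TOOLS = ["docker", "docker-compose", "pip"]
OPTIONAL_TOOLS = ["pipenv", "cookiecutter"]
MIN_PYTHON = (3, 10)


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    if not python_ok():
        print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or newer is required")
    for tool in missing_tools(REQUIRED_TOOLS):
        print(f"missing required tool: {tool}")
    for tool in missing_tools(OPTIONAL_TOOLS):
        print(f"missing optional tool: {tool}")
```

The script prints nothing when every prerequisite is found.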
- On Mac, use Docker for Mac
- On Linux, use Docker for Linux
- On Windows 10, use Docker for Windows (WSL2 install preferred)
- Note: Windows 11 is currently untested
A full YAADA-based system running infrastructure and YAADA services in development mode should ideally be allocated at least 8GB of memory. If using Docker for Mac, the default virtual machine has 2GB allocated, so this will need to be adjusted.
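One way to confirm the memory allocation is to ask the Docker daemon directly. The sketch below is an illustration under assumptions rather than a YAADA tool: it relies on `docker info --format '{{.MemTotal}}'` reporting total memory in bytes, and the function names are hypothetical. The reported total can be slightly lower than the value configured in Docker Desktop, so treat a near miss as a hint, not a failure.

```python
import subprocess

GIB = 1024**3
RECOMMENDED_GIB = 8  # recommended minimum from the paragraph above


def docker_memory_bytes():
    """Return the total memory visible to the Docker daemon, in bytes."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{.MemTotal}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def has_enough_memory(mem_bytes, minimum_gib=RECOMMENDED_GIB):
    """Return True if mem_bytes is at least minimum_gib GiB."""
    return mem_bytes >= minimum_gib * GIB


if __name__ == "__main__":
    try:
        total = docker_memory_bytes()
    except (OSError, subprocess.CalledProcessError, ValueError) as err:
        print(f"could not query Docker: {err}")
    else:
        status = "ok" if has_enough_memory(total) else "below recommendation"
        print(f"Docker memory: {total / GIB:.1f} GiB ({status})")
```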
To install and run the stock YAADA system from this repo, follow the instructions in this section. If you want to create your own project that uses YAADA, consult the Getting Started documentation.
Activate virtual environment with:
pipenv shell
Install yaada packages with:
pipenv install
yda run download-nlp-resources
$ yda build
Bring up services and infrastructure:
$ yda up
You can now access the services provided by YAADA. Here is a table of the services and how to access them locally:
| Server Access Points |
|---|
| OpenAPI REST UI |
| Jupyter Lab |
| OpenSearch Dashboards |
| MinIO |
Bring down services and infrastructure:
$ yda down
This section will cover some useful builtin commands and tools for local development.
To see what docker containers are running, run:
$ yda ps
If you are having difficulty with a service, view its logs with:
$ yda logs <service_name>
The following command is commonly used to check what documents are currently in OpenSearch:
$ yda data counts
Launch an IPython shell with a YAADA context already constructed and available and live reload setup:
$ yda run ipython
Launch Jupyter Lab locally (stopping the Docker-based instance that gets launched automatically):
$ yda run jupyter
Please see Contributing Guide for details.
Install yaada packages with:
pipenv install --dev
Update dependencies after modifying a package's install_requires:
pipenv update
Update the package lock:
pipenv lock
YAADA uses black and isort for code reformatting and flake8 for linting. Before committing any code to the YAADA repo, you should make sure that the code is formatted and passes the linting check:
make lint
Add the Zscaler root cert. See: https://community.zscaler.com/t/installing-tls-ssl-root-certificates-to-non-standard-environments/7261
cat cert/ZscalerRootCertificate-2048-SHA256.crt >> $(python -m certifi)
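The shell command above appends the certificate every time it runs, so repeated runs duplicate it in the bundle. A minimal Python sketch of an append that skips the write when the certificate is already present might look like the following; `append_cert` is a hypothetical helper, not part of YAADA or certifi.

```python
from pathlib import Path


def append_cert(bundle_path, cert_path):
    """Append the PEM certificate at cert_path to bundle_path once.

    Returns True if the bundle was modified, or False if the
    certificate text was already present in the bundle.
    """
    bundle = Path(bundle_path)
    cert = Path(cert_path).read_text(encoding="utf-8").strip()
    existing = bundle.read_text(encoding="utf-8")
    if cert in existing:
        return False
    # Ensure the new certificate starts on its own line.
    separator = "" if existing.endswith("\n") else "\n"
    with bundle.open("a", encoding="utf-8") as handle:
        handle.write(f"{separator}{cert}\n")
    return True
```

To target the bundle used by the active environment, pass the path returned by `certifi.where()` as `bundle_path` and `cert/ZscalerRootCertificate-2048-SHA256.crt` as `cert_path`.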
- Drop support for Elasticsearch and OpenSearch 1.X. Upgrade OpenSearch to the latest 2.X version.
- Minimum Python version changed from 3.8 to 3.10
- update pinned dependencies for Docker images
- fix model downloads for spacy 3.7.0
- add yaada.yml script for model downloads
- relax pinned regex dependency.
- fix docker-compose network configuration.
- Fix bug in artifact download API due to use of deprecated Flask API.
- Add support for environment variable substitution to yaada.yml variables section
Adding cookiecutter templates to the main yaada repo and moving away from using the yaada-core Docker image. Going forward, projects will have fully self-contained Docker images and install yaada from the public GitHub repo.
Add plugins for:
- Neo4j graph database
- Weaviate vector database
Because current versions of Elasticsearch and MinIO are incompatible with M1 Macs, and because both products have had recent license changes, YAADA has replaced Elasticsearch with OpenSearch and MinIO with Zenko CloudServer. Additionally, all use of the minio Python client has been replaced with the boto3 AWS client.
OpenSearch and Zenko CloudServer both have Apache 2.0 licenses.
This will affect all project docker-compose.yml configurations, but the Python API is unchanged.