Yet Another Analytic Dataflow Architecture

YAADA is a data architecture and analytics platform developed to support the full analytics development lifecycle, from prototyping in local Python to operational deployment as containerized microservices. YAADA’s primary focus is ingesting, storing, and analyzing semi-structured document-oriented data and training, persisting, and applying analytic models. It leverages industry-hardened cloud technologies such as OpenSearch and OpenSearch Dashboards for document storage and visualization and Jupyter Notebook for exploratory data analysis and analytic prototyping. It provides an analytic plugin API that allows analytic developers to focus on the algorithms, while YAADA handles the details of data management and analytic invocation through REST and message-based APIs. In addition, YAADA includes pre-built analytic wrappers for popular open source libraries for NLP and web scraping.

Getting started

This README contains instructions for developing the core YAADA project. For instructions on using YAADA for your project, consult the Getting Started documentation.

Prerequisites

Docker
docker-compose (not always installed with Docker on Linux, so it may need to be installed separately)
Python 3.10
PIP
pipenv
cookiecutter (optional but recommended, if you would like to create your project from a template)

If you run into pip version issues when installing pipenv, consider installing pipenv through pipx.
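
A quick way to confirm the prerequisites above before installing is to look for each tool on the PATH. The following is a minimal sketch, not part of YAADA: the executable names (`docker`, `docker-compose`, `pip`, `pipenv`, `cookiecutter`) are assumed from the list above, and the helper names are hypothetical.

```python
import shutil
import sys

# Executable names assumed from the prerequisites list; adjust for your system.
REQUIRED_TOOLS = ["docker", "docker-compose", "pip", "pipenv"]
OPTIONAL_TOOLS = ["cookiecutter"]
MIN_PYTHON = (3, 10)


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    for tool in missing_tools(REQUIRED_TOOLS):
        print(f"missing required tool: {tool}")
    for tool in missing_tools(OPTIONAL_TOOLS):
        print(f"missing optional tool: {tool}")
    if not python_ok():
        print("Python 3.10 or newer is required")
```

The script prints nothing when everything is found. It only checks that the tools exist, not which versions are installed.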

Docker Setup

On Mac, use Docker for Mac
On Linux, use Docker for Linux
On Windows 10, use Docker for Windows (WSL2 install preferred)
- Note: Windows 11 is currently untested

YAADA memory requirements

A full YAADA-based system running infrastructure and YAADA services in development mode should ideally be allocated at least 8GB of memory. If using Docker for Mac, the default virtual machine has 2GB allocated, so this will need to be adjusted.
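
To see how much memory Docker currently has available, you can ask the daemon directly. The sketch below is an illustration rather than a YAADA tool: it relies on `docker info --format '{{.MemTotal}}'`, which reports the total in bytes, and the function names are hypothetical.

```python
import subprocess

REQUIRED_GIB = 8  # recommended minimum from the paragraph above


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / (1024 ** 3)


def has_enough_memory(mem_bytes, required_gib=REQUIRED_GIB):
    """Return True if `mem_bytes` is at least `required_gib` GiB."""
    return bytes_to_gib(mem_bytes) >= required_gib


def docker_memory_bytes():
    """Ask the Docker daemon for its total memory, in bytes."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{.MemTotal}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```

With Docker running, `has_enough_memory(docker_memory_bytes())` gives a rough answer. The value Docker reports can be slightly lower than the figure configured for the virtual machine, so treat a near miss as a prompt to check the setting rather than as a hard failure.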

Installing and running

To install and run the stock YAADA system from this repo, follow the instructions in this section. If you want to create your own project that uses YAADA, consult the Getting Started documentation.

Activate virtual environment with:

pipenv shell

Install yaada packages with:

pipenv install

Downloading NLP resources

yda run download-nlp-resources

Building and running through Docker

$ yda build

Launching

Bring up services and infrastructure:

$ yda up

You can now access the services provided by YAADA. The following server access points are available locally:
OpenAPI REST UI
Jupyter Lab
OpenSearch Dashboards
MinIO

Bring down services and infrastructure:

$ yda down

Useful CLI Commands

This section covers some useful built-in commands and tools for local development.

To see what docker containers are running, run:

$ yda ps

When having difficulty with a service, to see the logs of that service, run:

$ yda logs <service_name>

The following command is commonly used to check what documents are currently in OpenSearch:

$ yda data counts

Launch an IPython shell with a YAADA context already constructed and live reload set up:

$ yda run ipython

Launch Jupyter Lab locally (stopping the Docker-based instance that gets launched automatically):

$ yda run jupyter

Contributing

Please see Contributing Guide for details.

Developing YAADA

Install yaada packages with:

pipenv install --dev

Update dependencies after modifying a package's install_requires:

pipenv update

Update the package lock:

pipenv lock

Linting

YAADA uses black and isort for code reformatting and flake8 for linting. Before committing any code to the YAADA repo, you should make sure that the code is formatted and passes the linting check:

make lint
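
The contents of the `lint` target are not shown here. As a rough idea of what a check-only pass with the three tools named above can look like, here is a sketch that runs them from Python. The `src` default path and the helper names are assumptions, and the project's Makefile may use different options.

```python
import subprocess


def lint_commands(path="src"):
    """Build check-only commands for black, isort, and flake8."""
    return [
        ["black", "--check", path],
        ["isort", "--check-only", path],
        ["flake8", path],
    ]


def run_lint(path="src", runner=subprocess.run):
    """Run each check and return the names of the tools that failed."""
    failed = []
    for cmd in lint_commands(path):
        # A non-zero exit code means the tool found something to report.
        if runner(cmd).returncode != 0:
            failed.append(cmd[0])
    return failed
```

An empty list from `run_lint()` means all three checks passed. The `--check` and `--check-only` flags make black and isort report problems without rewriting files.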

Troubleshooting zscaler issues

Add the Zscaler root cert. See: https://community.zscaler.com/t/installing-tls-ssl-root-certificates-to-non-standard-environments/7261

Mac/Linux

cat cert/ZscalerRootCertificate-2048-SHA256.crt >> $(python -m certifi)
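
The command above appends the certificate every time it is run, so repeated runs leave duplicate entries in the bundle. A possible alternative, sketched below with a hypothetical `append_cert` helper, skips the append when the certificate is already present.

```python
from pathlib import Path


def append_cert(cert_path, bundle_path):
    """Append a PEM certificate to a CA bundle unless it is already there.

    Returns True if the bundle was modified, False otherwise.
    """
    cert = Path(cert_path).read_text().strip()
    bundle = Path(bundle_path)
    existing = bundle.read_text()
    if cert in existing:
        return False
    # Make sure the new certificate starts on its own line.
    separator = "" if not existing or existing.endswith("\n") else "\n"
    bundle.write_text(existing + separator + cert + "\n")
    return True
```

To target the same bundle as the shell command, pass the path printed by `python -m certifi` (also available as `certifi.where()`), for example `append_cert("cert/ZscalerRootCertificate-2048-SHA256.crt", certifi.where())`.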

Release Notes

7.0.0

Dropped support for Elasticsearch and OpenSearch 1.X. Upgraded OpenSearch to the latest 2.X version.
Minimum Python version changed from 3.8 to 3.10.

6.2.3

Update pinned dependencies for Docker images.
Fix model downloads for spacy 3.7.0.
Add yaada.yml script for model downloads.

6.2.2

Relax pinned regex dependency.
Fix docker-compose network configuration.

6.2.1

Fix bug in artifact download API due to use of deprecated Flask API.
Add support for environment variable substitution to yaada.yml variables section.

6.2.0

Added cookiecutter templates to the main yaada repo and moved away from using the yaada-core Docker image. Going forward, projects will have fully self-contained Docker images and install yaada from the public GitHub repo.

Add plugins for:

Neo4j graph database
Weaviate vector database

6.1.0

Because current versions of Elasticsearch and MinIO are incompatible with M1 Macs, and because both products have had recent license changes, YAADA has replaced Elasticsearch with OpenSearch and MinIO with Zenko CloudServer. Additionally, all use of the minio Python client has been replaced with the boto3 AWS client.

OpenSearch and Zenko CloudServer both have Apache 2.0 licenses.

This will affect all project docker-compose.yml configurations, but the Python API is unchanged.
