Note: to open links in a new tab, use Ctrl+click (Windows and Linux) or Cmd+click (macOS).
The Agricultural Data Extract, Transform, and Load framework (AgETL) is a set of Python functions for processing data files from different agricultural and plant science experiments and aggregating them into a standard database table in a central repository, which makes the data available for a variety of analyses.
The functions are organized into two notebook files, each with its own configuration file:
- Extraction and Transformation processes:
Runs the Extraction and Transformation processes, and the user gets a CSV file where the data from different source files are aggregated and standardized into a single format.
Notebook file: extract-transform.ipynb
Configuration file: config_extract-transform.yml
- Load processes:
Loads the data into a single table in a data warehouse.
Notebook file: load.ipynb
Configuration file: config_load.yml
If you are working on plant phenotyping experiments, we encourage you to follow the MIAPPE standards (https://www.miappe.org/) for creating your database tables.
- Option 1: Do a basic installation of either JupyterLab or Jupyter Notebook. You can also install them through an environment manager such as conda, mamba, or pipenv.
- Option 2: Use a JupyterHub environment.
- Option 1: Use the requirements file.
pip install -r requirements.txt
- Option 2: Install the required libraries with pip, the package installer for Python.
pip install pyyaml
pip install pandas
pip install psycopg2
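After installation, a quick check from a notebook cell can confirm that the libraries are importable. The following is a minimal sketch, not part of AgETL: the helper `missing_modules` is hypothetical, and it assumes the usual import names (`yaml` for pyyaml, `pandas`, and `psycopg2`).

```python
import importlib.util

# Import names for the three libraries listed above.
# Note that the pyyaml package is imported as "yaml".
REQUIRED_MODULES = ["yaml", "pandas", "psycopg2"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If any module is reported as missing, rerun the corresponding `pip install` command in the same environment that Jupyter uses.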
- Clone option
  - Open a new Jupyter Notebook terminal (New > Terminal).
  - Clone the GitHub repository:
  git clone https://github.com/Purdue-LuisVargas/agETL.git
- Download option
  - Download AgETL from the GitHub repository: https://github.com/Purdue-LuisVargas/agETL.
  - Unzip the entire folder, then copy the files (if running Jupyter locally) or upload them (if using the JupyterHub environment) into your Jupyter Notebook directory.
To run the functions in AgETL, open the files in Jupyter Notebook: first modify the configuration file (.yml), then run the Python functions in the notebook (.ipynb). The process is divided into two tasks, as indicated below:
Raw data files (input) --> Extraction and transformation --> standardized dataframe (output) --> Load
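The flow above can be sketched end to end with the Python standard library alone. This is only an illustration under stated assumptions, not AgETL's implementation: the file names, column names, `COLUMN_MAP`, the `observations` table, and the three helper functions are hypothetical, and an in-memory SQLite database stands in for the actual database so that the example runs without a server.

```python
import csv
import io
import sqlite3

# Two hypothetical raw files that record the same traits under different headers.
RAW_FILES = {
    "site_a.csv": "Plot,Height_cm\n101,85.2\n102,90.1\n",
    "site_b.csv": "plot_id,plant height\n201,78.4\n",
}

# Hypothetical mapping from each source header to a standard column name.
COLUMN_MAP = {
    "Plot": "plot_id",
    "plot_id": "plot_id",
    "Height_cm": "plant_height_cm",
    "plant height": "plant_height_cm",
}
STANDARD_COLUMNS = ["source_file", "plot_id", "plant_height_cm"]


def extract_transform(raw_files, column_map):
    """Read every source file and return rows that share one set of column names."""
    rows = []
    for name, text in raw_files.items():
        for record in csv.DictReader(io.StringIO(text)):
            row = {column_map[key]: value for key, value in record.items()}
            row["source_file"] = name  # keep track of where each row came from
            rows.append(row)
    return rows


def to_csv(rows, columns):
    """Serialize the standardized rows as CSV text (the output of the first task)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def load(csv_text, connection, table="observations"):
    """Insert the standardized CSV rows into a single table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    # Table and column names come from trusted code here, not from user input.
    connection.execute(f"CREATE TABLE IF NOT EXISTS {table} ({column_list})")
    connection.executemany(
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})",
        ([record[c] for c in columns] for record in reader),
    )
    connection.commit()


standardized = to_csv(extract_transform(RAW_FILES, COLUMN_MAP), STANDARD_COLUMNS)
conn = sqlite3.connect(":memory:")  # stand-in for the real database
load(standardized, conn)
row_count = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
print(standardized)
print("rows loaded:", row_count)
```

The split into two functions mirrors the two tasks: the standardized CSV is a plain file that can be inspected before anything is written to the database.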
- Extraction and Transformation: The first set of functions runs the Extract and Transform processes. It outputs a CSV file in which the data from different source files have been aggregated and standardized into a single format.
  You need the following files: extract-transform.ipynb and config_extract-transform.yml.
- Loading: The second group of functions loads the data into a single table in the database.
  You need the following files: load.ipynb and config_load.yml.
To make the database connection, update the following information in the configuration file (config_load.yml), as in the following examples:
- Localhost database:
DATABASE_CREDENTIALS:
  Host: localhost
  Dbname: wanglab
  user: postgres
  port: 5432
  password: **************WAdxm1
- Cloud server database:
DATABASE_CREDENTIALS:
  Host: containers-us-west-187.railway.app
  Dbname: railway
  user: postgres
  port: 7895
  password: **************WAdxm1
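To illustrate how credentials like these could reach the database driver, the sketch below maps a `DATABASE_CREDENTIALS`-style mapping onto the keyword arguments accepted by `psycopg2.connect` (`host`, `dbname`, `user`, `port`, `password`). It is an assumption-laden example, not AgETL's own code: the helper `connection_kwargs` is hypothetical, the lower-casing of keys is a choice made here to cope with the mixed-case keys in the examples, and the password is a placeholder.

```python
# Keyword arguments that psycopg2.connect accepts for a basic connection.
ALLOWED_KEYS = {"host", "dbname", "user", "port", "password"}


def connection_kwargs(credentials):
    """Map a DATABASE_CREDENTIALS-style mapping to psycopg2.connect keyword arguments."""
    kwargs = {}
    for key, value in credentials.items():
        name = key.lower()  # "Host" -> "host", "Dbname" -> "dbname"
        if name not in ALLOWED_KEYS:
            raise KeyError(f"unexpected credential key: {key!r}")
        kwargs[name] = value
    missing = ALLOWED_KEYS - kwargs.keys()
    if missing:
        raise KeyError(f"missing credential keys: {sorted(missing)}")
    return kwargs


# The localhost example from above, with a placeholder password.
example = {
    "Host": "localhost",
    "Dbname": "wanglab",
    "user": "postgres",
    "port": 5432,
    "password": "change-me",
}
kwargs = connection_kwargs(example)
print(kwargs)

# With psycopg2 installed and the server reachable, a connection could be opened with:
#   import psycopg2
#   conn = psycopg2.connect(**kwargs)
```

Validating the keys before connecting turns a typo in the configuration file into an immediate, readable error instead of a failed connection attempt.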
Vargas-Rojas L, Ting T-C, Rainey KM, Reynolds M and Wang DR (2024) AgTC and AgETL: open-source tools to enhance data collection and management for plant science research. Front. Plant Sci. 15:1265073. doi: 10.3389/fpls.2024.1265073.
Diane Wang - [email protected]
Luis Vargas Rojas - [email protected]
Purdue University, Wang Lab dianewanglab.com