earth-speaks

Tools used for research in curation and analysis of Earth data.

Installation

  • Install Python 3
  • Make a new Python 3 environment (with Python 3.6 this is python3 -m venv /path/to/new/virtual/environment)
  • Activate the environment with source /path/to/new/virtual/environment/bin/activate
  • Update pip with pip install --upgrade pip
  • Install dependencies with pip install -r requirements.txt
  • Set Jupyter notebooks to use the environment's python with python -m ipykernel install --user
  • Fill in the private credentials/ folder (a loading sketch follows this list)
  • Some scripts communicate with cloud instances that you may need to set up - see amazon.md
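
The credentials/ folder (see Organization below) holds JSON files that the mining and ingestion scripts read. A minimal sketch of loading one such file - the filename elastic.json and the helper are hypothetical placeholders, not names defined by the repo:

    # Minimal sketch of reading a credentials file from credentials/.
    # "elastic.json" is a hypothetical placeholder filename.
    import json
    from pathlib import Path

    CREDENTIALS_DIR = Path(__file__).resolve().parent / "credentials"

    def load_credentials(name: str) -> dict:
        """Load one JSON credentials file, e.g. load_credentials("elastic.json")."""
        with open(CREDENTIALS_DIR / name) as f:
            return json.load(f)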

Organization

analysis/
  ... code for running various gensim models on collected data
cleaning/
  serialize.py         - methods to turn JSON-like structures into text (sketched below)
  nature/
     convert.py        - methods to turn Nature articles into documents
common.py              - common functions
credentials/           - stores JSON files of various formats with credentials
                       - not checked into repo
data/
  ... data collected by the mining scripts goes in this unchecked directory
elastic/
  elastic-manage.ipynb - tools for creating indices
  elastic-ingest.py    - tool for massively parallel data ingestion
mining/
  authenticate.ipynb   - run to refresh credentials for cookie-based mining
  ... each data source has a subdirectory with methods to populate data/
resources/
  ... misc files such as information needs observed from Nature papers,
      and AWS code
turk/
  ... scripts for creating MTurk HITs to collect ground-truth relevance judgments
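
As a rough illustration of what serialize.py in cleaning/ does (not the actual implementation), flattening a JSON-like structure into indexable text can be as simple as recursively joining its leaf values:

    # Rough sketch of turning a JSON-like structure into text; the real
    # serialize.py methods may differ.
    def flatten_to_text(obj) -> str:
        """Recursively join the leaf values of dicts and lists into one string."""
        if isinstance(obj, dict):
            return " ".join(flatten_to_text(v) for v in obj.values())
        if isinstance(obj, (list, tuple)):
            return " ".join(flatten_to_text(v) for v in obj)
        return str(obj)

    # flatten_to_text({"title": "Sea ice extent", "tags": ["arctic", "NASA"]})
    # -> "Sea ice extent arctic NASA"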

Data sources

The system currently includes two data sources: NASA EarthData and Nature.com journal papers on climate. To answer human queries about the Earth more accurately, we may want to add other sources.

Amazon Machine Images (AMIs)

Amazon AMIs are cloneable machine images - cloud computers frozen in place but ready to spring back to life as EC2 instances when needed. This project relies on certain types of AWS instances that have been preserved as AMIs:

  • Repository AMI: Contains this repository and some mined data from Nature.com papers.
  • Elasticsearch Node AMI: An instance that has been optimized to perform as an Elasticsearch node with Kibana.
  • Elasticsearch Proxy AMI: An instance running NGINX, which sits between system users and the Elasticsearch node(s) for protection. It also serves the analyzer-debugging web page.

For more details about bringing these AMIs to life and allowing them to talk to each other, see amazon.md.

Intended Workflow

  1. Run the scripts in mining/ to populate the data/ directory, either locally or from EC2 instances set up from the Repository AMI.
  2. Use the Elasticsearch Node AMI to set up one or more EC2 instances to function as an Elasticsearch cluster. Then set up the Elasticsearch Proxy AMI to authenticate and proxy traffic to the cluster (see amazon.md).
  3. Use elastic-manage.ipynb to set up one or more indices on the newly created node(s). Test analyzers using the analysis tester page (already included in the Proxy AMI; source is in resources/). A sketch of index creation with the Python client follows this list.
  4. Either locally or in parallel across Repository AMI EC2 instances, ingest data into the Elasticsearch cluster (or use a reindex if an existing index already holds the data). A bulk-ingestion sketch also follows this list.
  5. Given a budget for collecting ground truth, use the scripts in turk/ to create MTurk HITs (small paid web tasks) from the existing systems using pooling strategies.
  6. Use scripts in analysis/ to investigate ways of improving the system. Use the continuous evaluation notebook to estimate the effectiveness and validation cost of new and improved systems.
  7. Return to step 3. The aim is to iterate until the system returns resources that can actually answer the natural-language queries.
  8. Optionally set up a billing alarm using resources/aws-lambda-watch.py to keep a closer eye on charges incurred by the system; a CloudWatch-based sketch follows this list.
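
For step 3, a minimal sketch of creating an index with a custom analyzer through the official Python Elasticsearch client. The endpoint, index name, analyzer, and field names are placeholders, and the exact request body depends on your Elasticsearch version; elastic-manage.ipynb remains the actual tool:

    # Sketch only: placeholder endpoint, index name, and mapping.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://user:password@your-proxy-host:9200")

    es.indices.create(
        index="nature-papers",
        body={
            "settings": {
                "analysis": {
                    "analyzer": {
                        "english_folded": {
                            "type": "custom",
                            "tokenizer": "standard",
                            "filter": ["lowercase", "asciifolding", "porter_stem"],
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "title": {"type": "text", "analyzer": "english_folded"},
                    "body": {"type": "text", "analyzer": "english_folded"},
                }
            },
        },
    )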
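
For step 4, a sketch of bulk ingestion with the client's parallel_bulk helper. The data path and index name are placeholders; elastic-ingest.py is the project's real, massively parallel tool:

    # Sketch only: reads placeholder JSON files from data/ and bulk-indexes them.
    import json
    from pathlib import Path

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch("https://user:password@your-proxy-host:9200")

    def actions(data_dir="data/nature"):
        for path in Path(data_dir).glob("*.json"):
            yield {"_index": "nature-papers", "_source": json.loads(path.read_text())}

    for ok, item in parallel_bulk(es, actions(), thread_count=4):
        if not ok:
            print("failed:", item)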
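
For step 8, one alternative to (or complement of) aws-lambda-watch.py is a plain CloudWatch billing alarm created with boto3. The threshold and SNS topic ARN below are placeholders, and the account must have billing alerts enabled:

    # Sketch only: alarms when estimated AWS charges exceed a placeholder threshold.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # billing metrics live in us-east-1

    cloudwatch.put_metric_alarm(
        AlarmName="earth-speaks-spend",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,
        EvaluationPeriods=1,
        Threshold=50.0,  # placeholder dollar threshold
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
    )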

Code improvements needed

  • Feed structured data to doc2vec (see the sketch after this list)
  • Redo Selenium credential handling
    • Add failure detection for file downloads
  • For NASA: store datum files individually rather than as entries in one pickled object - processing should not be so memory-reliant.
  • More robust creation of HITs
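
A rough sketch of the doc2vec item using gensim (which analysis/ already uses for its models): flatten each structured record to text, tag it, and train. The record fields and parameters below are illustrative, not the planned implementation.

    # Sketch only: illustrative records and parameters, not the project's pipeline.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    records = [
        {"id": "doc-1", "title": "Sea surface temperature", "summary": "monthly gridded SST"},
        {"id": "doc-2", "title": "Aerosol optical depth", "summary": "satellite-derived AOD"},
    ]

    corpus = [
        TaggedDocument(words=f"{r['title']} {r['summary']}".lower().split(), tags=[r["id"]])
        for r in records
    ]

    model = Doc2Vec(vector_size=100, min_count=1, epochs=20)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # Infer a vector for a new query or record
    print(model.infer_vector("sea surface temperature".split()))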
