108 changes: 108 additions & 0 deletions apps/query-eval/README.md
@@ -0,0 +1,108 @@
# Sycamore Query Evaluation tool

This tool evaluates the query planning and answering capabilities of
Sycamore Query against a given dataset and set of queries. It is a wrapper
around the `sycamore.query.client.SycamoreQueryClient` class that reads its
configuration from an input YAML file and writes results to an output YAML file.

## Input file format

The input file format is YAML and is defined by the `queryeval.types.QueryEvalInputFile`
class. The following is a minimal example of the input file format:

```yaml
# General configuration options. Each of these can be specified
# on the command line as well.
config:
  # The OpenSearch index to use.
  index: const_ntsb

# The list of queries to run. Each has a query and an expected
# result, which can be either a string, or a list of dictionaries,
# with each element of the list representing a Sycamore Document
# expected to be returned by the query.
queries:
  - query: "How many incidents were there in 2023?"
    expected: "There were 10 incidents in 2023."
  - query: "How many incidents occurred in bad weather?"
    expected: "7 incidents occurred in bad weather."
```

Examples of input files can be found in the `data/` directory.

## Output file format

The output file format is YAML and is defined by the `queryeval.types.QueryEvalResultsFile`
type. Depending on the configuration options used, the output file may
contain one or more of the following:
* Query plans generated by the Sycamore Query Planner.
* Query results produced by running these query plans.
* Accuracy and quality metrics calculated from the query results.

Each of these evaluation stages can be run independently, and the results from
earlier stages can be used as input to later ones.
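As a rough illustration of this staged flow, the sketch below shows how each stage can fill in only the fields it is responsible for, leaving earlier results in place so stages compose. This is a hypothetical example, not the tool's actual code; the `QueryRecord` dataclass and the stage functions are invented for illustration.

```python
# Hypothetical sketch of the staged evaluation flow (not the tool's actual code).
# Each stage fills in only the fields it owns, so stages can be run
# independently or chained, and earlier results are reused by default.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QueryRecord:
    query: str
    plan: Optional[str] = None    # filled in by the "plan" stage
    result: Optional[str] = None  # filled in by the "run" stage


def plan_stage(records: List[QueryRecord],
               make_plan: Callable[[str], str],
               overwrite: bool = False) -> List[QueryRecord]:
    # Generate a plan for each query, reusing an existing plan unless
    # overwrite is requested (analogous to the --overwrite option).
    for r in records:
        if r.plan is None or overwrite:
            r.plan = make_plan(r.query)
    return records


def run_stage(records: List[QueryRecord],
              execute: Callable[[str], str]) -> List[QueryRecord]:
    # Execute each plan that has not yet produced a result.
    for r in records:
        if r.result is None and r.plan is not None:
            r.result = execute(r.plan)
    return records


records = [QueryRecord("How many incidents were there in 2023?")]
records = plan_stage(records, make_plan=lambda q: f"plan for: {q}")
records = run_stage(records, execute=lambda p: f"answer from: {p}")
```

Running `plan_stage` again without `overwrite=True` leaves the existing plans untouched, which mirrors how the tool reuses plans from a previous `plan` invocation.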

## Running the tool

First, run `poetry install` in this directory to install all dependencies.

You can get a full list of options by running:

```bash
$ poetry run python queryeval/main.py --help
```

To generate query plans and run all of the resulting queries:

```bash
$ poetry run python queryeval/main.py --outfile results.yaml data/ntsb-mini.yaml run
```

To only generate query plans:

```bash
$ poetry run python queryeval/main.py --outfile results.yaml data/ntsb-mini.yaml plan
```

Note that the query plans generated during the `plan` phase are saved to the results
file, so if you use `run` after `plan` with the same `--outfile` option set, the query plans
will be reused. You can force regeneration of the query plans using the `--overwrite` option.

## Specifying the schema

By default, the data schema will be fetched from the provided OpenSearch index.
However, the schema can be specified manually by setting the `data_schema` field in the
input file. Each field in the schema has two entries: the type of the field, and
a list of example values. For example:

```yaml
data_schema:

  properties.entity.accidentNumber:
    # The type of the field.
    - str
    # A list of example values.
    - ["CEN23LAO80", "DCA23LA133", "CEN23LA086", "ERA23LA168", "CEN23LA097"]

  properties.entity.aircraftDamage:
    - str
    # You can also specify individual examples as list entries on their own line.
    - - Destroyed
      - None
      - Substantial
```
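Once parsed, each schema entry is a two-element list: a type name followed by a list of example values. The sketch below checks that shape; `validate_data_schema` and its allowed type set are hypothetical helpers invented for this example, not part of the tool.

```python
# Hypothetical validator for the parsed data_schema shape: each field maps to
# a two-element list of [type_name, examples]. Not part of the actual tool.
ALLOWED_TYPES = {"str", "int", "float", "bool"}


def validate_data_schema(schema: dict) -> list:
    """Return a list of validation errors (empty if the schema looks well-formed)."""
    errors = []
    for field_name, entries in schema.items():
        if not (isinstance(entries, list) and len(entries) == 2):
            errors.append(f"{field_name}: expected [type, examples]")
            continue
        type_name, examples = entries
        if type_name not in ALLOWED_TYPES:
            errors.append(f"{field_name}: unknown type {type_name!r}")
        if not isinstance(examples, list):
            errors.append(f"{field_name}: examples must be a list")
    return errors


# Mirrors the parsed form of the YAML example above.
schema = {
    "properties.entity.accidentNumber": ["str", ["CEN23LAO80", "DCA23LA133"]],
    "properties.entity.aircraftDamage": ["str", ["Destroyed", "None", "Substantial"]],
}
assert validate_data_schema(schema) == []
```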

## Useful flags

Use the `--query-cache-path` and `--llm-cache-path` flags to specify caches for intermediate
query results and LLM responses, respectively. This can save a substantial amount of time and
LLM cost across repeated evaluations; however, be aware that stale cache entries
may affect your results.

Use `--dry-run` to skip planning, running queries, and writing results. This is useful
for checking that your config file is well-formed.

Use `--logfile` to write detailed logs of the evaluation process to a file.



101 changes: 101 additions & 0 deletions apps/query-eval/data/ntsb-full.yaml
@@ -0,0 +1,101 @@
# This file contains a query-eval config for evaluating Sycamore Query against
# the NTSB incident dataset.

config:
  index: const_ntsb

queries:
  - query: "Were there any environmentally caused incidents?"
    expected: "Yes, there were environmentally caused incidents."
  - query: "Were there any ice related incidents in Alaska?"
    expected: "Yes, there were ice-related incidents in Alaska."
  - query: "Were there any incidents in the last three days of January 2023 in Washington?"
    expected: "Yes, there was an incident in the last three days of January 2023 in Washington."
  - query: "Were there any fire related incidents in CA in 2023?"
    expected: "Yes, there was a fire-related incident in California in 2023."
  - query: "How many Piper aircrafts were involved in accidents?"
    expected: "There were 21 Piper aircrafts involved in accidents."
  - query: "How many incidents occurred in the summer months of 2023 which involved birds?"
    expected: "No incidents occurred in the summer months of 2023 which involved birds."
  - query: "What fraction of incidents that resulted in substantial damage were due to engine problems?"
    expected: "0.338 of the incidents that resulted in substantial damage were due to engine problems."
  - query: "What fraction of environmentally caused incidents were due to fires in the past 5 years?"
    expected: "0.043 of environmentally caused incidents were due to fires in the past 5 years."
  - query: "How many more environmentally caused incidents were there compared to human errors?"
    expected: "There were 16 fewer environmentally caused incidents compared to human errors."
  - query: "What planes (by company) were involved in incidents in California?"
    expected: "Cessna and Piper planes were involved in incidents in California."
  - query: "What was the most prevalent cause of incidents in 2023 with 2+ serious injuries?"
    expected: "The most prevalent cause of incidents in 2023 with 2+ serious injuries was 'unknown cause'."
  - query: "Of all the incidents related to icy conditions on the tarmac, what was the top three types of failures?"
    expected: "The top three types of failures were 'snow berm', 'impact to ground', and 'wet/dense snow'."
  - query: "Which states in the Midwest were most affected by aviation incidents in 2023?"
    expected: "Nebraska was the Midwest state that was the most affected by aviation incidents in 2023."
  - query: "How many incidents happened in california in 2023?"
    expected: "There were 9 incidents."
  - query: "How many incidents occurred in California?"
    expected: "There were 9 incidents."
  - query: "How many locations did incidents in the first 5 days of January 2023 occur in?"
    expected: "There were 10 locations."
  - query: "How many incidents happened due to environmental issues?"
    expected: "There were 15 incidents."
  - query: "How many types of planes did incidents in the first 5 days of January 2023 occur in?"
    expected: "There were 5 types of planes."
  - query: "How many U.S. States did incidents in the first 5 days of January 2023 occur in?"
    expected: "Incidents occurred in 3 U.S. States."
  - query: "Where did incidents happen?"
    expected: "Incidents happened in California, Florida, and Texas."
  - query: "What percentage of incidents that resulted in substantial damage were due to engine problems?"
    expected: "30% of incidents that resulted in substantial damage were due to engine problems."
  - query: "What fraction of incidents that resulted in substantial damage involved engine problems?"
    expected: "50% of incidents that resulted in substantial damage involved engine problems."
  - query: "What fraction of incidents occurred in the first 5 days of January 2023?"
    expected: "60% of incidents occurred in the first 5 days of January 2023."
  - query: "How many incidents occurred in the first 5 days of January 2023?"
    expected: "There were 100 incidents."
  - query: "How many incidents occurred before January 6, 2023?"
    expected: "There were 50 incidents."
  - query: "How many incidents occurred after January 6, 2023?"
    expected: "There were 75 incidents."
  - query: "What fraction of incidents resulted in substantial damage?"
    expected: "40% of incidents resulted in substantial damage."
  - query: "What fraction of incidents that resulted in substantial damage occurred in California?"
    expected: "20% of incidents that resulted in substantial damage occurred in California."
  - query: "How many more incidents happened in California compared to Florida?"
    expected: "There were 10 more incidents in California compared to Florida."
  - query: "How many incidents resulted in 2+ serious injuries?"
    expected: "There were 5 incidents that resulted in 2+ serious injuries."
  - query: "How many U.S states did incidents occur in?"
    expected: "Incidents occurred in 10 U.S. states."
  - query: "What were the top 2 states in the Midwest that were most affected by aviation incidents in 2023?"
    expected: "The top 2 states in the Midwest that were most affected by aviation incidents in 2023 were Illinois and Ohio."
  - query: "What was the most prevalent cause of incidents in 2023 with 1+ serious injuries?"
    expected: "The most prevalent cause of incidents in 2023 with 1+ serious injuries was pilot error."
  - query: "Was northern or southern California more affected by airplane incidents?"
    expected: "Northern California was more affected by airplane incidents."
  - query: "Were there any environmentally caused incidents?"
    expected: "Yes, there were environmentally caused incidents."
  - query: "Were there any ice related incidents in Alaska?"
    expected: "Yes, there were ice related incidents in Alaska."
  - query: "Were there any incidents in the last three days of January 2023 in Washington?"
    expected: "Yes, there were incidents in the last three days of January 2023 in Washington."
  - query: "Were there any fire related incidents in CA in 2023?"
    expected: "Yes, there were fire related incidents in CA in 2023."
  - query: "How many Piper aircrafts were involved in accidents?"
    expected: "There were 5 Piper aircrafts involved in accidents."
  - query: "How many incidents occurred in the summer months of 2023 which involved birds?"
    expected: "There were 20 incidents that occurred in the summer months of 2023 which involved birds."
  - query: "What fraction of incidents that resulted in substantial damage were due to engine problems?"
    expected: "30% of incidents that resulted in substantial damage were due to engine problems."
  - query: "What fraction of environmentally caused incidents were due to fires in the past 5 years?"
    expected: "50% of environmentally caused incidents were due to fires in the past 5 years."
  - query: "How many more environmentally caused incidents were there compared to human errors?"
    expected: "There were 10 more environmentally caused incidents compared to human errors."
  - query: "What planes (by company) were involved in incidents in California?"
    expected: "The planes involved in incidents in California were Boeing, Airbus, and Cessna."
  - query: "What was the most prevalent cause of incidents in 2023 with 2+ serious injuries?"
    expected: "The most prevalent cause of incidents in 2023 with 2+ serious injuries was mechanical failure."
  - query: "Of all the incidents related to icy conditions on the tarmac, what were the top three types of failure?"
    expected: "The top three types of failure in incidents related to icy conditions on the tarmac were braking failure, steering failure, and engine failure."
  - query: "Which states in the Midwest were most affected by aviation incidents in 2023?"
    expected: "The states in the Midwest that were most affected by aviation incidents in 2023 were Illinois, Ohio, and Michigan."
15 changes: 15 additions & 0 deletions apps/query-eval/data/ntsb-mini.yaml
@@ -0,0 +1,15 @@
# This file contains a query-eval config for evaluating Sycamore Query against
# the NTSB incident dataset. This version only contains a few queries for quick
# testing.

config:
  index: const_ntsb


queries:
  - query: "Were there any fire related incidents in CA in 2023?"
    expected: "Yes, there was a fire-related incident in California in 2023."
  - query: "How many Piper aircrafts were involved in accidents?"
    expected: "There were 21 Piper aircrafts involved in accidents."
  - query: "What fraction of incidents that resulted in substantial damage were due to engine problems?"
    expected: "0.338 of the incidents that resulted in substantial damage were due to engine problems."