jist attempts to find the complete JSON value (string, number, bool, or JSON object) for a given search key as fast as possible. You can also summarize a full json/ndjson file into a schema to, you know... get the "jist" of it ;)
jist uses the state-of-the-art simdjson library for parsing JSON input/files smaller than 4.2GB in size.
| 3.3GB input (get last element) | jist | jq |
|---|---|---|
| Time | 2.05s | 34.17s |
| Memory | 4.5GB | 18GB 😱 |
| Throughput | 1600MB/s ✅ | 96MB/s |
For files larger than 4.2GB, files that don't fit into memory, or when streaming mode is forced (option -s or --streaming), jist falls back to an earlier implementation that uses a simple character-based lexer from the json-tools crate. While the fallback is slower than simdjson, it is still really fast at a throughput of ~300MB/s and uses almost no memory (around 10MB generally) for virtually any size of file (B / KB / MB / GB / TB / PB / etc.).
| 28.9GB input (get last element) | jist | jq (not enough RAM) |
|---|---|---|
| Time | 1m 30s ✅ | ❌ |
| Memory | 9.7MB ✅ | ❌ |
| Throughput | 321MB/s ✅ | ❌ |
(Test machine: Intel i7-12700H, 64GB DDR5 @ 4800 MT/s RAM)
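To force the low-memory streaming path even on a file that simdjson could handle, pass the streaming flag (shown here against the generated benchmark file described below):
$ jist -f output.json -p "[9999999].bar.baz" -s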
The benchmark file has the following shape:
[
{
"bar": {
"baz": "65gBJtrk7B1YrQVqgo9jxw4TXvS2UQ5upIiXPwI6Vtx36eQvHS",
"bizbizbiz": "SCGgrAumMpZkfD7BWgryfka5Q",
"bouou": [
91,
55
],
"poo": "true"
},
"foo": 45
},
...
]
To create your own file of similar structure:
# cd to the project directory
cd benchmarking
cargo build --release # things will go way faster with release especially for really large files
./target/release/genearator -n 100000000 -o ../output.json
# n = number of records with the shape above - e.g. 100M records will result in a 28.9GB file
cd ../ # go back to project root directory
cargo build --release # (optional - build from source or use binary) again things will be much faster with release
./target/release/jist -f output.json -p "[9999999].bar.baz"
You can of course modify the shape of the data as well by updating jist/benchmarking/src/main.rs.
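As a purely hypothetical illustration (not the actual generator code), if the record template were built with serde_json's json! macro, changing the shape would just mean editing that template:
```rust
// Hypothetical sketch only; the real template lives in jist/benchmarking/src/main.rs.
use serde_json::{json, Value};

// Build one record; tweak the fields here to change the generated shape.
fn make_record(i: u64) -> Value {
    json!({
        "bar": {
            "baz": format!("record-{i}"),
            "bouou": [i % 100, (i + 7) % 100],
            "poo": "true"
        },
        "foo": i
    })
}

fn main() {
    // Print a couple of sample records to sanity-check the shape.
    for i in 0..2 {
        println!("{}", make_record(i));
    }
}
```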
$ jist --data '{"a":"b", "c": {"d": ["e", "f", "g"]}}' --path "c.d"
["e", "f", "g"]
Or
$ jist -d '[{"a": "b"}, {"c": {"d": "e"}}]' -p "[1].c"
{"d": "e"}
Or
$ jist -f my.json -p "[1054041].c"
{"d": "e"}
One of the use cases I had in mind was extracting values like access tokens from JSON responses programmatically, so you can set up config files easily without having to perform jq gymnastics. You know the JSON data shape and the key you're looking for; just declare what you want.
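For example (the endpoint and key names here are hypothetical; substitute your own):
$ TOKEN=$(curl -s https://auth.example.com/token | jist -p "access_token")
$ echo "api_token = $TOKEN" >> app.conf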
- Find the value of an exact match
jist can take any valid JSON as input, including an array root type. It expects the search key to be valid for the supplied input.
$ curl https://api.github.com/repos/adelamodwala/rustbook/commits?per_page=1 | jist -p "[0].commit.author"
{
"name": "adelamodwala",
"email": "[email protected]",
"date": "2023-11-06T20:36:53Z"
}
- You can find values for keys that are deeply nested
$ wget -qO- https://api.github.com/repos/adelamodwala/rustbook/commits?per_page=1 | jist -p "[0].commit.author.name"
adelamodwala
- If the root object is an array, then it's named root by default. All arrays use JavaScript array syntax.
$ wget -qO- https://api.github.com/repos/adelamodwala/rustbook/commits?per_page=1 | jist -p "[0].parents"
[]
- To get the schema of a json/ndjson file:
$ jist -f "path_to_file"
If you know the file is an array of objects that should share the same schema, use the "unionize" flag:
$ jist -f "path_to_file" -u
For example, this content
[{"a":"b","f":12}, {"a":"c","d":"c"}]should have schema
$ echo '[{"a":"b","f":12}, {"a":"c","d":"c"}]' | jist -u
[{"a":"string","d":"string","f":"number"}]jist uses simdjson, a C++ library, over a rust-C++ bridge. While a pure rust implementation of simdjson exists, it performed twice as slow as the native C++ version in my testing.
When the JSON input file is too large or streaming mode is forced, jist falls back to a streaming approach that keeps memory usage low, using the json-tools crate to get a lexer iterator. This lets jist scan through a JSON string/file from the top and track the current depth against the target depth without ever unmarshalling JSON into memory. Once the target depth is reached and all the expected indices/keys match, jist returns the result.
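This is a rough, hypothetical sketch of that idea rather than the actual implementation (which uses the json-tools lexer instead of raw characters): scan once, track depth, and return the slice for the element at a target index of a root array.
```rust
// Minimal sketch of the streaming idea, using raw characters instead of the
// json-tools lexer: scan once, track nesting depth, and slice out the element
// at a target index of a root array without ever building a DOM.
// (Only object/array elements are indexed here; strings are skipped so that
// braces inside them don't affect the depth counter.)
fn nth_root_array_element(input: &str, target: usize) -> Option<&str> {
    let mut depth = 0usize; // current nesting depth; the root array is depth 1
    let mut index = 0usize; // index of the current element inside the root array
    let mut start = None;   // byte offset where the target element starts
    let mut in_string = false;
    let mut escaped = false;

    for (pos, c) in input.char_indices() {
        if in_string {
            // Inside a string literal: only escapes and the closing quote matter.
            match c {
                '\\' if !escaped => escaped = true,
                '"' if !escaped => in_string = false,
                _ => escaped = false,
            }
            continue;
        }
        match c {
            '"' => in_string = true,
            '{' | '[' => {
                depth += 1;
                // An opener directly inside the root array starts element `index`.
                if depth == 2 && index == target && start.is_none() {
                    start = Some(pos);
                }
            }
            '}' | ']' => {
                // A closer at depth 2 ends the current root-array element.
                if depth == 2 {
                    if let Some(s) = start {
                        return Some(&input[s..=pos]);
                    }
                    index += 1;
                }
                depth = depth.checked_sub(1)?; // unbalanced input -> None
            }
            _ => {}
        }
    }
    None
}

fn main() {
    let json = r#"[{"a": "b"}, {"c": {"d": "e"}}]"#;
    // Prints: Some("{\"c\": {\"d\": \"e\"}}")
    println!("{:?}", nth_root_array_element(json, 1));
}
```
The same bookkeeping extends to object keys: remember the most recently seen key at each depth and compare it against the requested path instead of counting elements.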
- It should find the full JSON value of a given search key (streaming mode or when file is too large only). If the JSON data supplied provides an incomplete JSON value, the program should return an error.
- JSON object size should not impact memory usage while fully utilizing a single CPU core (streaming mode or when file is too large only)
- As long as the search key is appropriate and a complete JSON value can be found, the input JSON object does not need to be complete or correctly formed (streaming mode or when file is too large only)
- Parsing the entire input JSON object is not necessary, simply finding the search key path using JSON format is sufficient (streaming mode or when file is too large only)
- Streaming the JSON input should be possible, though it will not be part of the starting design
- SIMD: the final frontier
- Feature: generate JSON schema, like super fast
- Search over compressed files like gzip and bgzip