A highly flexible, scalable, fault-tolerant document ingestion system designed for search.
Builds are run on infrastructure kindly donated by 
Frequently, search projects start by feeding a few documents manually to a search engine, often via the "just for testing" built-in processing features of Solr such as SolrCell or post.jar. These features are documented and included to help users get a feel for what they can do with Solr with minimal painful setup.
This is good, and that's how it should be for first explorations. Unfortunately, it's also a potential trap. Large-scale ingestion of documents for search is non-trivial. Many projects outgrow these simple tools and have to throw away their early exploratory work. Nobody likes setting aside valuable work, and it's natural to resist, but the longer one clings to an insufficient tool, the bigger, more difficult, and more expensive the migration is.
Common problems are:
- It works "ok" for a small test corpus and then becomes unstable on a larger production corpus.
- The code written to feed into such interfaces (hopefully) reproduces standard solutions to problems that have been solved many times by other search engineers over the last 20 years.
- There is no way to recover if indexing fails or is disrupted partway through. One is forced to start again from the beginning.
- If failure is related to the size of a growing corpus, failures become increasingly common, and eventually, the search index cannot be reindexed or upgraded at all.
- Leveraging the power of modern multicore machines requires developers skilled at threading and concurrency, and the resulting bugs can be very expensive to troubleshoot, fix, and test.
- Reliance on outdated, unmaintained, or poorly maintained features such as the Data Import Handler (which has been removed from Solr 9.0+). Such features are not used by any major companies (where committers often work), and consequently receive less attention and support.
JesterJ makes it super easy to start with a robust, full-featured indexing infrastructure, so that you don't have to re-invent the wheel, and you don't have to throw away your early work.
The key aspects for achieving this are simplicity, robustness, flexibility, and scalability:
- A variety of re-usable processing components are provided (flexibility)
- Scanners (active connectors) for database and filesystem data sources (simplicity)
- Custom processors only require a 4-method interface (simplicity)
- Specialized classloading allows any version of a library in your custom code (flexibility, simplicity)
- Simplified startup: `java -jar jesterj.jar <id> <secret>` (simplicity)
- Built-in embedded Cassandra for performant persistent storage (simplicity)
- Optional auto-detection of changes to documents (flexibility, simplicity)
- Automatic fault-tolerant restart skipping previously seen documents (robustness, scalability)
- Multithreaded processing to leverage modern machines with large numbers of cores (scalability)
- Explicit and direct control of threading. Easy to ensure more threads working on heavy steps (scalability)
- Single system handling multiple data sources (flexibility, scalability, simplicity)
- Pre-baked batching of documents for efficient transmission to the search engine (scalability, simplicity)
- Directed acyclic graph (DAG) capable processing model, and graphical visualization (flexibility, simplicity)
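
To make the batching item above concrete, here is a minimal sketch of size-based batching. The class `DocumentBatcher`, its constructor arguments, and the `sender` callback are hypothetical names chosen for illustration; this is not JesterJ's actual implementation or API, and a real system would typically also flush on a timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical batching helper for illustration only; not part of JesterJ's API. */
class DocumentBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sender; // assumed callback that transmits one batch
    private final List<T> pending = new ArrayList<>();

    DocumentBatcher(int batchSize, Consumer<List<T>> sender) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Queue one document and send the batch as soon as it is full. */
    synchronized void add(T doc) {
        pending.add(doc);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Send whatever is pending, for example at shutdown. Does nothing if empty. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(pending); // copy so the sender owns its list
        pending.clear();
        sender.accept(batch);
    }
}
```

Sending documents in groups amortizes the per-request overhead of talking to the search engine, which is why a batching step is usually placed immediately before the final send.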
DAG-structured processing is a key feature that is not provided by other tools. Most other tools require a linear pipeline structure, which can become limiting. As time passes, features and enhancements often add complexity. Multiple data sources are also a common dimension for growth. With other systems, you wind up deploying a system per data source. JesterJ is designed to handle complex indexing scenarios.
Consider the following hypothetical indexing workflow, where the system has evolved from a simple linear ingestion into a single index:
- The source data format changed, effectively creating a new data source (old data may need reindexing)
- An external system needed to know that the document was received
- Product features required a faster, optimized line-item-only search index
- New features were added to the product that required block-join indexing, but old features couldn't be migrated, so a new index was required.
- Two new systems also wanted to be notified
In other tools, this will mean six indexing processes (two sources times three indexes), all of which need to send messages, none of which are coordinated if one fails. In JesterJ, it is all one coherent system.
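
As a rough illustration of how the workflow above fits into one graph, the sketch below models the two sources, three notifications, and three indexes as a single DAG and derives a valid processing order with Kahn's algorithm. `PlanSketch`, its methods, and all step names are hypothetical; JesterJ's real plan-building API is described in its documentation and will differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plan model for illustration only; this is not JesterJ's API. */
class PlanSketch {
    // step name -> names of the steps it feeds, kept in insertion order
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    PlanSketch step(String name) {
        edges.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    PlanSketch route(String from, String to) {
        step(from);
        step(to);
        edges.get(from).add(to);
        return this;
    }

    /** Kahn's algorithm: returns a valid processing order, or throws if there is a cycle. */
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String s : edges.keySet()) {
            inDegree.put(s, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String t : targets) {
                inDegree.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((s, d) -> { if (d == 0) ready.add(s); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String s = ready.remove();
            order.add(s);
            for (String t : edges.get(s)) {
                if (inDegree.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("plan contains a cycle");
        }
        return order;
    }

    /** The hypothetical order workflow: 2 sources, 3 notifications, 3 indexes, 1 plan. */
    static PlanSketch orderWorkflow() {
        PlanSketch plan = new PlanSketch()
            .route("oldFormatScanner", "normalize")
            .route("newFormatScanner", "normalize");
        for (String sink : new String[] {
                "notifySystemA", "notifySystemB", "notifySystemC",
                "sendToMainIndex", "sendToLineItemIndex", "sendToBlockJoinIndex"}) {
            plan.route("normalize", sink);
        }
        return plan;
    }
}
```

Calling `PlanSketch.orderWorkflow().topologicalOrder()` yields an order in which both scanners precede `normalize`, which in turn precedes every notification and index step. A strictly linear pipeline cannot express this fan-in and fan-out without duplicating steps per source.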
JesterJ handles such scenarios with a single centralized processing plan, and there is no need to deploy new indexing infrastructure. Furthermore, JesterJ will ensure that if the system is unplugged partway through indexing, you won't get a second message about an order received for everything it processed previously (fault tolerance). The default mode for JesterJ is to ensure at-most-once delivery for steps that are not marked safe or idempotent. Safe steps do not have external effects, and idempotent steps may be repeated en route to the final processing end point.
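
The delivery rule described above can be sketched as a small decision function. Everything here is an assumption made for illustration: the names `RestartPolicySketch`, `StepKind`, and `shouldExecute` are hypothetical, and the in-memory set stands in for persistent storage. It is not JesterJ's actual mechanism.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical restart policy for illustration only; not JesterJ's implementation. */
class RestartPolicySketch {
    /** SAFE: no external effects. IDEMPOTENT: repeatable. NON_REPEATABLE: everything else. */
    enum StepKind { SAFE, IDEMPOTENT, NON_REPEATABLE }

    // Stand-in for a persistent store; a real system must survive a process restart.
    private final Set<String> attempted = new HashSet<>();

    /**
     * Records the attempt BEFORE the step runs. If the process dies mid-step, a
     * NON_REPEATABLE step is skipped on restart (at-most-once), while SAFE and
     * IDEMPOTENT steps may run again.
     */
    synchronized boolean shouldExecute(String docId, String stepName, StepKind kind) {
        boolean firstAttempt = attempted.add(docId + "|" + stepName);
        return kind != StepKind.NON_REPEATABLE || firstAttempt;
    }
}
```

Recording the attempt before execution is what makes the guarantee at-most-once rather than at-least-once: a crash between the record and the effect loses the message instead of duplicating it, which is the right trade-off for something like an "order received" notification.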
The best place to start learning more is the documentation in the wiki.
Current release: 1.0.0 (recommended)
Next Release: 1.1.0
Presently, only JDK 11 has been tested regularly. Unit tests have passed on JDK 17, but the initial system startup and custom class loading are the most JDK-sensitive parts, so we welcome feedback on experiences with more recent JDK versions. Any distribution of JDK 11 should work. Support for Java 17 and future LTS versions is among our highest priorities for future releases. Building with the latest uno-jar version may be sufficient, but this is not yet certified (see nsoft/uno-jar#37).
Discuss features, ask questions, etc., on Discord. https://discord.gg/RmdTYvpXr9
Please see the RELEASE_NOTES.adoc file for full details on features for each available version.