# Knowgly

Entity Indexing and Retrieval in Knowledge Graphs, via an information-based and fully unsupervised index schema construction.
García, S., Bobed, C. (2025). Information-Aware Entity Indexing in Knowledge Graphs to Enable Semantic Search. In: Curry, E., et al. The Semantic Web. ESWC 2025. Lecture Notes in Computer Science, vol 15718. Springer, Cham.
https://doi.org/10.1007/978-3-031-94575-5_12
The system consists of a Java project implementing the whole pipeline presented in the paper. There are also several helper scripts written in both Python and bash for evaluation and handling external systems, such as Terrier or Galago. The requirements are the following:
- Maven
- openJDK >= 17. `pom.xml` assumes openJDK 20, but this can be easily changed in the `maven.compiler.source` and `maven.compiler.target` settings.
- If using `galago` (recommended):
  - A valid installation of galago under `../galago/galago-3.16/`, and openJDK 8. It is also possible to use newer versions. See `how_to_setup_galago.txt` in `../galago/` for more information.
- If using `pyTerrier`:
  - A valid installation of pyTerrier under `../terrier/`. See `how_to_setup_terrier.txt` in `../terrier/` for more information.
- If using `elastic`:
  - An accessible local or remote elastic instance. We have tested our system under version 8.6.2. See `configuration/examples/elasticEndpointConfiguration.json` for more details.
- If using `Lucene`:
  - Nothing. The Lucene libraries are already included in Knowgly, and it will use them to create a local index.
- If performing evaluations:
  - A compiled executable of https://github.com/usnistgov/trec_eval, and a Python environment with `numpy` and `scikit-learn` installed. See `Knowgly/evaluation/metrics_testing/README.txt` for more details.
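A quick way to verify that the toolchain requirements are met before compiling (a sketch; the exact version strings will differ on your machine):

```bash
# openJDK >= 17 is required (pom.xml assumes 20), plus Maven:
java -version
mvn -version
```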
To compile the system, simply run `compile.sh`. It will compile the project and move all required .jar files, with all dependencies statically linked, to this folder.
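For example (a sketch; the exact set of produced .jar files depends on the build, but it should include `RunEvaluator.jar`, mentioned below):

```bash
./compile.sh
ls *.jar   # the statically linked executables should now be in this folder
```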
> [!WARNING]
> Some systems have limitations:
>
> - `Lucene` and `elastic` cannot use weights below 1.0 in BM25F queries.
> - `pyTerrier` does not properly tune the k1 and b parameters, despite exposing them.
> - `galago` requires running `build_galago_index.sh` after performing indexing within Knowgly, as it requires calling a Java 8 executable.
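For galago, the resulting two-step flow looks roughly like this (a sketch; `execute.sh` runs the Knowgly pipelines demoed in `Main.java`, as described below, and a Java 8 runtime must be available for the second step):

```bash
./execute.sh               # 1. run the Knowgly indexing pipeline first
./build_galago_index.sh    # 2. then build the galago index itself; this calls a Java 8 executable
```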
> [!TIP]
> We recommend using galago for reproducing our results, or Lucene for considerably faster indexing and retrieval (albeit with slightly worse performance, due to its field weighting limitations).
If you want to:

- Run Knowgly's metrics generation and indexing pipelines (an end-to-end sketch is shown after this list):
  - Freely edit the demo shown in the `Main.java` file, and run `execute.sh` (or use Knowgly as a library). A simple demo of how to run each pipeline is already provided in the file.
- Perform individual queries:
  - Please refer to the demo shown in the `Main.java` file.
- Perform a full evaluation (multiple queries and .run file generation):
  - Execute the `RunEvaluator.jar` file, which has been prepared as a CLI tool for any system. Our evaluation scripts employ this executable too.
- Perform Coordinate Ascent (note: we allow all systems, but it has only been tested on galago):
  - Run the `ca*.py` scripts in the `CA` folder. There are currently scripts for 3 and 5 fields.
- Evaluate .run files:
  - See `Knowgly/evaluation/metrics_testing/README.txt` for more details.
- Build the datasets we used for evaluation (`DBpedia` and `IMDb`):
  - See `Knowgly/datasets/README.txt` for more details.
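As a minimal end-to-end sketch of the workflow above: the `RunEvaluator.jar` arguments are omitted on purpose, since they depend on the retrieval system being evaluated, and the qrels/.run file names below are placeholders, not files shipped with the repository.

```bash
./compile.sh                 # build all statically linked .jar files
./execute.sh                 # run the metrics generation and indexing pipelines (Main.java demo)
java -jar RunEvaluator.jar   # generate a .run file (arguments depend on the chosen system)

# Score the generated .run file with trec_eval (file names are placeholders):
./trec_eval -m map -m ndcg_cut.10 queries.qrels results.run
```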
> [!IMPORTANT]
> - Please check all configuration files under `configuration/examples/`. All necessary configuration files should be placed under `configuration` before running any of the pipelines (a sketch is shown below).
> - Although some parts of the pipelines may support classic SPARQL endpoints (local Jena/Jena-Fuseki models/endpoints and remote SPARQL endpoints), all functionalities are only feature-complete and tested for local HDT endpoints. In particular, metrics generation is too computationally expensive when done naively over SPARQL, and is thus not fully implemented nor tested on non-HDT endpoints.
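For instance, a minimal setup could start from the provided examples and adapt them (a sketch; in practice, copy and edit only the files relevant to your chosen backend):

```bash
# Place the (adapted) example configuration files where the pipelines expect them:
mkdir -p configuration
cp configuration/examples/* configuration/
```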
Aside from Knowgly's implementation, we also provide additional documentation, as mentioned throughout the paper:
- `dataset_analysis.md`: An overview of the dataset details and Predicate-Type distributions.
- `implementation_details.md`: An analysis of Importance Metrics calculation times for both datasets (DBpedia and IMDb) and additional clustering (KMeans) details.
- `fields_and_weight_scheme_analysis.md`: An analysis of the effect of different numbers of fields, and an overview of alternative weight scheme strategies, such as directly assigning normalized centroid values.
Additionally, the paper's figures are also available. A small subset of VDoc templates is available in `example_vdocs`, and the best results are available as .run files in `best_runs`.
This software is licensed under the GNU Affero General Public License v3.0.