Knowgly
Entity Indexing and Retrieval in Knowledge Graphs via an information-based and fully unsupervised index schema construction
García, S., Bobed, C. (2025). Information-Aware Entity Indexing in Knowledge Graphs to Enable Semantic Search. In: Curry, E., et al. The Semantic Web. ESWC 2025. Lecture Notes in Computer Science, vol 15718. Springer, Cham.
https://doi.org/10.1007/978-3-031-94575-5_12
The system consists of a Java project implementing the whole pipeline presented in the paper. There are also several helper scripts, written in both Python and bash, for evaluation and for handling external systems such as Terrier or Galago. The requirements are as follows:
- Maven
- openJDK >= 17. `pom.xml` assumes openJDK 20, but this can be easily changed in the `maven.compiler.source` and `maven.compiler.target` settings.
- If using `galago` (recommended):
  - A valid installation of galago under `../galago/galago-3.16/`, and openJDK 8. It is also possible to use newer versions. See `how_to_setup_galago.txt` in `../galago/` for more information.
- If using `pyTerrier`:
  - A valid installation of pyTerrier under `../terrier/`. See `how_to_setup_terrier.txt` in `../terrier/` for more information.
- If using `elastic`:
  - An accessible local or remote elastic instance. We have tested our system under version 8.6.2. See `configuration/examples/elasticEndpointConfiguration.json` for more details.
- If using `Lucene`:
  - Nothing. The Lucene libraries are already included in Knowgly, and it will use them to create a local index.
- If performing evaluations:
  - A compiled executable of https://github.com/usnistgov/trec_eval, and a Python environment with `numpy` and `scikit-learn` installed. See `Knowgly/evaluation/metrics_testing/README.txt` for more details. A setup sketch is shown right after this list.
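As a minimal sketch of the evaluation prerequisites (trec_eval builds with a plain `make`; the virtual environment name and the elastic port are assumptions, not part of Knowgly):

```bash
# Build trec_eval from source (produces a ./trec_eval binary inside the repo)
git clone https://github.com/usnistgov/trec_eval.git
make -C trec_eval

# Python environment for the evaluation scripts
python3 -m venv .venv && source .venv/bin/activate
pip install numpy scikit-learn

# Optional: check that a local elastic instance is reachable
# (port 9200 is the elastic default and an assumption here; match it to
#  configuration/examples/elasticEndpointConfiguration.json)
curl -s http://localhost:9200
```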
To compile the system, simply run `compile.sh`. It will compile the project and move all required .jar files to this folder, with all dependencies statically linked.
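For example, from the repository root (jar names other than `RunEvaluator.jar` depend on the build and are not listed here):

```bash
# Compile the project; the resulting fat .jar files end up in this folder
./compile.sh

# The jars, e.g. RunEvaluator.jar, should now be present
ls *.jar
```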
**Warning**
Some systems have limitations:
- `Lucene` and `elastic` cannot use weights below 1.0 in BM25F queries
- `pyTerrier` does not properly tune the k1 and b parameters, despite exposing them
- `galago` requires running `build_galago_index.sh` after performing indexing within Knowgly, as it requires calling a Java 8 executable (see the sketch below).
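For galago this makes indexing a two-step process; a sketch, assuming the demo in `Main.java` is configured to index with galago:

```bash
# 1) Run Knowgly's indexing pipeline (galago backend)
./execute.sh

# 2) Build the actual galago index afterwards; this script calls the
#    Java 8 galago executable under ../galago/galago-3.16/
./build_galago_index.sh
```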
**Tip**
We recommend using `galago` for reproducing our results, or `Lucene` for considerably faster indexing and retrieval (albeit with slightly worse performance due to its field weighting limitations).
If you want to:

- Run Knowgly's metrics generation and indexing pipelines:
  - Freely edit the demo shown in the `Main.java` file, and run `execute.sh` (or use Knowgly as a library). A simple demo of how to run each pipeline is already provided in the file.
- Perform individual queries:
  - Please refer to the demo shown in the `Main.java` file.
- Perform a full evaluation (multiple queries and .run file generation):
  - Execute the `RunEvaluator.jar` file, which has been prepared as a CLI tool for any system. Our evaluation scripts employ this executable too (see the sketch after this list).
- Perform Coordinate Ascent (note: we allow all systems, but it has only been tested on galago):
  - Run the `ca*.py` scripts in the `CA` folder. There are currently scripts for 3 and 5 fields.
- Evaluate .run files:
  - See `Knowgly/evaluation/metrics_testing/README.txt` for more details.
- Build the datasets we used for evaluation (`DBpedia` and `IMDb`):
  - See `Knowgly/datasets/README.txt` for more details.
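A hedged sketch of the evaluation flow above. `RunEvaluator.jar`'s exact arguments are not documented here (invoking it without arguments to get usage information is an assumption), and the qrels/run file names are placeholders:

```bash
# Generate .run files with the evaluation CLI; run it without arguments
# first to inspect the expected options (assumption)
java -jar RunEvaluator.jar

# Score a generated run against relevance judgments with trec_eval
./trec_eval/trec_eval qrels.txt knowgly.run
```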
**Important**
- Please check all configuration files under `configuration/examples/`. All necessary configuration files should be placed under `configuration` before running any of the pipelines (see the sketch below).
- Although some parts of the pipelines may support classic SPARQL endpoints (local Jena/Jena-Fuseki models/endpoints and remote SPARQL endpoints), all functionalities are only feature-complete and tested for local HDT endpoints. In particular, metrics generation is too computationally expensive when done naively on SPARQL, and is thus not fully implemented nor tested on non-HDT endpoints.
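A minimal setup sketch, assuming the examples are JSON files (as `elasticEndpointConfiguration.json` suggests) and can serve as starting points:

```bash
# Copy the example configuration files into place, then edit them to match
# your endpoints and paths
mkdir -p configuration
cp configuration/examples/*.json configuration/
```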
Aside from Knowgly's implementation, we also provide additional documentation, as mentioned throughout the paper:

- `dataset_analysis.md`: An overview of the dataset details and Predicate-Type distributions.
- `implementation_details.md`: An analysis of Importance Metrics calculation times for both datasets (DBpedia and IMDb), and additional clustering (KMeans) details.
- `fields_and_weight_scheme_analysis.md`: An analysis of the effect of different numbers of fields, and an overview of alternative weight scheme strategies, such as directly assigning normalized centroid values.

Additionally, the paper's figures are also available. A small subset of VDoc templates is also available in `example_vdocs`, and the best results are available as `.run` files in `best_runs`.
This software is licensed under the GNU Affero General Public License v3.0