OpenTapioca is a simple and fast Named Entity Linking system for Wikidata. It is kept in sync with Wikidata in real time, encouraging users to improve the results of their entity linking tasks by contributing back to Wikidata.
A live instance is running at https://opentapioca.org/. To run it on a server that is powerful enough, I would need 50€/month: please help fund the service if you can.
A NIF endpoint is available at:
- https://opentapioca.org/api/nif (only exposing the matches that are deemed good enough)
- https://opentapioca.org/api/nif?only_matching=false (also exposing all the other matches regardless of their score)
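The two URLs differ only in the `only_matching` query parameter. As a minimal sketch that assumes nothing beyond the URLs listed above, the helper below builds either one. The name `nif_endpoint` is hypothetical and not part of OpenTapioca.

```python
from urllib.parse import urlencode

NIF_BASE = "https://opentapioca.org/api/nif"  # endpoint listed above


def nif_endpoint(only_matching: bool = True, base: str = NIF_BASE) -> str:
    """Return the NIF endpoint URL (hypothetical helper, for illustration)."""
    if only_matching:
        # Default form: only the matches deemed good enough are exposed.
        return base
    # Ask for all other matches too, regardless of their score.
    return base + "?" + urlencode({"only_matching": "false"})
```

Passing a different `base`, such as the address of a self-hosted instance, should work the same way, assuming that instance exposes the same path.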
See the docs for more information about how it works and how to run it. See the paper for some more motivation about the design of the system.
OpenTapioca is released under the Apache-2.0 license.
Follow the instructions below to set up the OpenTapioca environment.
- Clone the repository:
git clone https://github.com/ehrhart/opentapioca
- Create a Docker network:
docker network create opentapioca-network
- Start a Solr container:
docker run --name=opentapioca-solr --env='SOLR_JAVA_MEM=-Xms10g -Xmx10g' --volume=opentapioca-solr-data:/var/solr --volume=./configsets:/configsets --network=opentapioca-network -p 8983:8983 --restart=unless-stopped --detach=true solr:8 -c -m 8g
- Upload the tapioca configset to Solr:
docker exec -it opentapioca-solr bin/solr zk -upconfig -z localhost:9983 -n tapioca -d /configsets/tapioca
- Create the settings file from the template:
cp settings_template.py settings.py
- Build the OpenTapioca Docker image:
docker build -t opentapioca .
- Create a directory for the data dump:
mkdir dump
cd dump/
- Download the latest Wikidata JSON dump:
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
- Index the dump using the OpenTapioca profile:
docker run --rm --detach=true --network=opentapioca-network --volume=./dump:/app/dump --volume=./profiles:/app/profiles opentapioca bash -c "bunzip2 < dump/latest-all.json.bz2 | tapioca index-dump wd_2019-02-24 - --profile profiles/human_organization_location.json"
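The indexing command streams the decompressed dump into `tapioca index-dump` through standard input. To illustrate what that stream contains, here is a minimal sketch that reads the same file in Python. It relies on the layout of Wikidata JSON dumps (a single JSON array with one entity per line, each followed by a comma). The function name `iter_entities` is hypothetical, and this is not how OpenTapioca itself parses the dump.

```python
import bz2
import json
from typing import Iterator


def iter_entities(path: str) -> Iterator[dict]:
    """Yield entities one by one from a bz2-compressed Wikidata JSON dump."""
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            # Drop the newline and the comma that separates array elements.
            line = line.strip().rstrip(",")
            # Skip blank lines and the "[" / "]" that delimit the array.
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)
```

Because the function is a generator, calling `next(iter_entities("dump/latest-all.json.bz2"))` would return the first entity without decompressing the whole file.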
Download the precomputed data files:
wget -c https://github.com/wetneb/opentapioca/releases/download/v0.1.0/wd_2019-02-24.bow.pkl -O data/wd_2019-02-24.bow.pkl
wget -c https://github.com/wetneb/opentapioca/releases/download/v0.1.0/wd_2019-02-24.pgrank.npy -O data/wd_2019-02-24.pgrank.npy
wget -c https://github.com/wetneb/opentapioca/releases/download/v0.1.0/sample_classifier.pkl -O data/rss_istex_classifier.pkl
- Set up a Python virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python setup.py install
- Train and preprocess the data:
tapioca train-bow latest-all.json.bz2
tapioca preprocess latest-all.json.bz2
- Sort and compile the graph:
sort -n -k 1 latest-all.unsorted.tsv > wikidata_graph.tsv
tapioca compile wikidata_graph.tsv
- Compute PageRank:
tapioca compute-pagerank wikidata_graph.npz
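For readers unfamiliar with the last step, the sketch below shows PageRank computed by power iteration on a tiny graph. It is a generic illustration only: the function name `pagerank`, the damping factor of 0.85 and the dict-based graph are assumptions made for the example, and it does not reproduce the implementation or the file formats behind `tapioca compute-pagerank`.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over an adjacency list {node: [targets]}."""
    # Collect every node, including those that only appear as link targets.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Every other node splits its rank equally among its targets.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

With `pagerank({"Q1": ["Q3"], "Q2": ["Q3"], "Q3": []})`, the node `Q3` ends up with the highest score because both other nodes link to it, and the scores always sum to 1.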
Start the OpenTapioca application, replacing the /home/cixty/Services/opentapioca paths with the location of your own checkout:
docker run --name=opentapioca-app --env=SOLR_ENDPOINT=http://opentapioca-solr:8983/solr --volume=/home/cixty/Services/opentapioca/dump:/app/dump --volume=/home/cixty/Services/opentapioca/data:/app/data --network=opentapioca-network -p 8457:8457 --restart=unless-stopped --detach=true opentapioca
Open your browser and navigate to http://localhost:8457.
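Since the container is started in detached mode, the web interface may need a moment before it answers. The following sketch polls the address until it responds. The helper names `is_up` and `wait_until_up`, the number of attempts and the delay are all assumptions chosen for illustration. Nothing here is provided by OpenTapioca.

```python
import time
import urllib.request
from typing import Callable


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on `url` succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers refused connections, timeouts and HTTP error statuses
        # (urllib.error.URLError is a subclass of OSError).
        return False


def wait_until_up(url: str, attempts: int = 10, delay: float = 3.0,
                  probe: Callable[[str], bool] = is_up) -> bool:
    """Poll `url` until `probe` succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next try
    return False
```

Calling `wait_until_up("http://localhost:8457")` returns `True` as soon as the page answers and `False` if it never does within the allotted attempts. The `probe` argument exists only so that the polling logic can be exercised without a running server.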