OpenTapioca

OpenTapioca is a simple and fast Named Entity Linking system for Wikidata. It is kept in sync with Wikidata in real time, which encourages users to improve the results of their entity linking tasks by contributing back to Wikidata.

A live instance is running at https://opentapioca.org/. To run it on a server that is powerful enough, I would need 50€/month: please help fund the service if you can.

A NIF endpoint is available at:

https://opentapioca.org/api/nif (only exposing the matches that are deemed good enough)
https://opentapioca.org/api/nif?only_matching=false (also exposing all the other matches regardless of their score)
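
As a rough illustration of how a client might address these two variants, the Python sketch below builds (but does not send) a POST request carrying a minimal NIF context. Only the endpoint URL and the `only_matching` parameter come from the text above. The Turtle serialization, the `text/turtle` content type, the `http://example.org/doc` document URI, and the helper names `nif_url`, `escape_turtle` and `build_nif_request` are assumptions made for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

NIF_ENDPOINT = "https://opentapioca.org/api/nif"  # endpoint given in this README
NIF_CORE = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"


def nif_url(only_matching=True):
    """Return the endpoint URL, adding only_matching=false when all matches are wanted."""
    if only_matching:
        return NIF_ENDPOINT
    return NIF_ENDPOINT + "?" + urlencode({"only_matching": "false"})


def escape_turtle(text):
    """Escape characters that cannot appear raw in a short Turtle string literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def build_nif_request(text, only_matching=True, doc_uri="http://example.org/doc"):
    """Build a POST request whose body is a one-context NIF document (assumed Turtle)."""
    end = len(text)  # NIF offsets count Unicode code points, as Python's len() does
    body = (
        f"@prefix nif: <{NIF_CORE}> .\n"
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
        f"<{doc_uri}#char=0,{end}> a nif:Context ;\n"
        f'    nif:isString "{escape_turtle(text)}" ;\n'
        '    nif:beginIndex "0"^^xsd:nonNegativeInteger ;\n'
        f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger .\n'
    )
    return Request(
        nif_url(only_matching),
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},  # assumed content type
        method="POST",
    )


if __name__ == "__main__":
    request = build_nif_request("Associated Press is based in New York.", only_matching=False)
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it. Check the docs for the serializations the server actually accepts before relying on this.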

See the docs for more information about how it works and how to run it. See the paper for some more motivation about the design of the system.

OpenTapioca is released under the Apache-2.0 license.

Set Up

Follow the instructions below to set up the OpenTapioca environment.

Step 1: Clone the Repository

git clone https://github.com/ehrhart/opentapioca
cd opentapioca

Step 2: Create a Docker Network

docker network create opentapioca-network

Step 3: Set Up the Solr Server

docker run --name=opentapioca-solr --env='SOLR_JAVA_MEM=-Xms10g -Xmx10g' --volume=opentapioca-solr-data:/var/solr --volume=./configsets:/configsets --network=opentapioca-network -p 8983:8983 --restart=unless-stopped --detach=true solr:8 -c -m 8g
docker exec -it opentapioca-solr bin/solr zk -upconfig -z localhost:9983 -n tapioca -d /configsets/tapioca
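
To confirm that the configset upload worked, one option is to ask Solr's Configsets API for the list of known configsets. The sketch below assumes Solr is reachable on `localhost:8983` (the port published above) and that the list is returned under a `configSets` key; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"  # port published by the docker run command


def configsets_url(base=SOLR_BASE):
    """URL of the Configsets API call that lists all uploaded configsets."""
    return base + "/admin/configs?" + urlencode({"action": "LIST", "wt": "json"})


def has_configset(response_text, name="tapioca"):
    """Return True if the JSON response lists a configset with the given name."""
    payload = json.loads(response_text)
    return name in payload.get("configSets", [])


def check_solr(base=SOLR_BASE, name="tapioca"):
    """Query a running Solr instance and report whether the configset exists."""
    with urlopen(configsets_url(base), timeout=10) as response:
        return has_configset(response.read().decode("utf-8"), name)


if __name__ == "__main__":
    print("tapioca configset present:", check_solr())
```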

Step 4: Build the Application

cp settings_template.py settings.py
docker build -t opentapioca .

Step 5: Prepare the Data Dump

Create a directory for the data dump:
```
mkdir dump
cd dump/
```

Download the latest Wikidata JSON dump:

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2

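Index the dump into a Solr collection (named wd_2019-02-24 here) using the human_organization_location profile. Run this from the repository root (cd .. if you are still inside dump/), since the volume paths are relative: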

docker run --rm --detach=true --network=opentapioca-network --volume=./dump:/app/dump --volume=./profiles:/app/profiles opentapioca bash -c "bunzip2 < dump/latest-all.json.bz2 | tapioca index-dump wd_2019-02-24 - --profile profiles/human_organization_location.json"
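
The command above decompresses the dump and streams it to tapioca on standard input (the `-` argument), so the full JSON is never written to disk. The same streaming idea can be used to inspect a dump before indexing it. The sketch below is not OpenTapioca's reader; it relies only on the layout of Wikidata JSON dumps, which are one large array with one entity per line. The function names are hypothetical.

```python
import bz2
import json
from itertools import islice


def iter_entities(path):
    """Yield one entity dict per line of a bz2-compressed Wikidata JSON dump."""
    with bz2.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip().rstrip(",")  # entity lines end with a comma
            if line in ("[", "]", ""):
                continue  # skip the brackets that wrap the whole array
            yield json.loads(line)


def first_ids(path, count=5):
    """Return the ids of the first `count` entities without reading the rest."""
    return [entity["id"] for entity in islice(iter_entities(path), count)]


if __name__ == "__main__":
    print(first_ids("dump/latest-all.json.bz2"))
```

Because `iter_entities` is a generator, memory use stays flat no matter how large the dump is.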

Step 6: Compute or Download Supporting Data

Option A: Use Precomputed Data

Download the precomputed data files:

wget -c https://github.com/wetneb/opentapioca/releases/download/v0.1.0/wd_2019-02-24.bow.pkl -O data/wd_2019-02-24.bow.pkl
wget -c https://github.com/wetneb/opentapioca/releases/download/v0.1.0/wd_2019-02-24.pgrank.npy -O data/wd_2019-02-24.pgrank.npy
wget -c https://github.com/wetneb/opentapioca/releases/download/v0.1.0/sample_classifier.pkl -O data/rss_istex_classifier.pkl
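
Because `wget -O` creates the target file even when a download fails, it can be worth checking that the three files exist and are non-empty before starting the application. This is a minimal sketch; the file names are the ones used in the commands above, and `missing_data_files` is a hypothetical helper.

```python
from pathlib import Path

# Target names passed to `wget -O` above.
EXPECTED_FILES = (
    "wd_2019-02-24.bow.pkl",
    "wd_2019-02-24.pgrank.npy",
    "rss_istex_classifier.pkl",
)


def missing_data_files(data_dir="data"):
    """Return the expected file names that are absent or empty in data_dir."""
    root = Path(data_dir)
    return [
        name for name in EXPECTED_FILES
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_data_files()
    print("all files present" if not missing else f"missing or empty: {missing}")
```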

Option B: Compute the Data Yourself

Set up a Python virtual environment:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python setup.py install

Train the bag-of-words language model and extract the entity graph from the dump:

tapioca train-bow latest-all.json.bz2
tapioca preprocess latest-all.json.bz2

Sort and compile the graph:

sort -n -k 1 latest-all.unsorted.tsv > wikidata_graph.tsv
tapioca compile wikidata_graph.tsv

Compute PageRank:

tapioca compute-pagerank wikidata_graph.npz
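
For readers unfamiliar with PageRank, the sketch below shows the textbook power-iteration form on a tiny edge list. It is not OpenTapioca's implementation, which operates on the compiled graph file; the damping factor of 0.85 and the iteration count are conventional values chosen for the example.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a list of (source, target) edges."""
    out_links = [[] for _ in range(num_nodes)]
    for source, target in edges:
        out_links[source].append(target)

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[n] for n in range(num_nodes) if not out_links[n])
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        for source, targets in enumerate(out_links):
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Nodes 0 and 1 both link to node 2, which links back to node 0.
    for node, score in enumerate(pagerank([(0, 2), (1, 2), (2, 0)], num_nodes=3)):
        print(node, round(score, 4))
```

In this toy graph node 2 receives the most incoming links and ends up with the highest score, which is the intuition behind using PageRank as a measure of entity prominence.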

Step 7: Run the OpenTapioca Application

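Start the OpenTapioca application. The host paths in the volume mounts below are specific to the original author's machine; replace /home/cixty/Services/opentapioca with the absolute path of your own clone: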

docker run --name=opentapioca-app --env=SOLR_ENDPOINT=http://opentapioca-solr:8983/solr --volume=/home/cixty/Services/opentapioca/dump:/app/dump --volume=/home/cixty/Services/opentapioca/data:/app/data --network=opentapioca-network -p 8457:8457 --restart=unless-stopped --detach=true opentapioca

Open your browser and navigate to http://localhost:8457.
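
Since the container starts detached, the web server may need some time before it answers. A small polling loop such as the one below can tell you when the page is reachable. It assumes the application is published on `localhost:8457` as in the command above; `is_up` and `wait_until_up` are hypothetical helper names.

```python
import time
from urllib.request import urlopen

APP_URL = "http://localhost:8457/"  # port published by the docker run command


def is_up(url=APP_URL, timeout=5):
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are both subclasses of OSError
        return False


def wait_until_up(probe=is_up, attempts=30, delay=2.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    print("application reachable:", wait_until_up())
```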
