opentapioca long-due contributions #53

Hello @wetneb 

I have worked intensively on OpenTapioca over the last few days and want to summarize the main points. The general aim is to add proper multilingual support to OpenTapioca, improve the performance of its bootstrap steps (such as creating the index), and include more entities in the index by relying on type inheritance.

As I started this work a long time ago and only recently got back to it, some of the things might not make 100% sense :-)

In short:

1. Multiple language support (Solr configset and document structure)
2. Performance (train-bow and index-dump)
3. Wikidata type inheritance

### Multiple language support

To properly support multiple languages, I created a different document structure in which labels are stored in `tag_{language}_name` and aliases in `tag_{language}_alias`.

This is fundamental to ensure that I can tag text targeting a specific language (set as a parameter of the annotate API).

To enable this, I created a custom Solr image (based on the official one) which includes the required extra libraries and support files:
https://github.com/wordlift/docker-solr-opentapioca

You can see the managed-schema configuration here:
https://github.com/wordlift/docker-solr-opentapioca/blob/master/tapioca/conf/managed-schema

The image is published to docker hub as well:
https://hub.docker.com/r/wordlift/solr-opentapioca/tags

You can spot the updated structure in my own fork:
* Indexing: https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/indexingprofile.py#L204-L211
* Tagging: https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/tagger.py#L47-L62
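
To make the field layout concrete, here is a minimal sketch of how per-language labels and aliases could be mapped to those field names. The helper names (`build_language_fields`, `tagging_fields`) and the sample values are hypothetical and are not taken from the fork; only the `tag_{language}_name` / `tag_{language}_alias` naming comes from the description above.

```python
from typing import Dict, List


def build_language_fields(
    labels: Dict[str, str], aliases: Dict[str, List[str]]
) -> Dict[str, object]:
    """Map per-language labels and aliases to language-specific field names."""
    doc: Dict[str, object] = {}
    for lang, label in labels.items():
        doc[f"tag_{lang}_name"] = label
    for lang, names in aliases.items():
        doc[f"tag_{lang}_alias"] = list(names)
    return doc


def tagging_fields(lang: str) -> List[str]:
    """Fields a tagger would match against for one target language."""
    return [f"tag_{lang}_name", f"tag_{lang}_alias"]


# Illustrative entity with labels in two languages and one English alias.
doc = build_language_fields(
    labels={"en": "tapioca", "it": "tapioca"},
    aliases={"en": ["tapioca starch"]},
)
print(doc)
print(tagging_fields("it"))
```

Keeping one field pair per language means each field can use its own analyzer in the schema, and a tagging request only has to look at the pair for the requested language.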


### Performance (train-bow and index-dump)

I tried to speed up `train-bow` and `index-dump` as much as possible, since I would like to run many tests and can't wait a week every time I need to recreate an environment.

I used `cProfile` to profile the app and remove bottlenecks, working in particular on the readers (removing the `json.loads` calls from them) and replacing the standard `json.loads` with the more efficient `msgspec` package, while defining [the schema that we need](https://github.com/ziodave/opentapioca/blob/master/opentapioca/wikidata/model.py).
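
For anyone who wants to repeat this kind of measurement, this is roughly what the profiling step looks like. It is a minimal sketch using only the standard library: `read_entities` and the synthetic input are hypothetical stand-ins, not the actual OpenTapioca readers.

```python
import cProfile
import io
import json
import pstats


def read_entities(lines):
    """Hypothetical reader: parse one JSON entity per line."""
    return [json.loads(line) for line in lines]


# Synthetic input standing in for lines of a Wikidata dump.
sample_lines = [
    json.dumps({"id": f"Q{i}", "labels": {"en": f"item {i}"}})
    for i in range(10_000)
]

profiler = cProfile.Profile()
profiler.enable()
entities = read_entities(sample_lines)
profiler.disable()

# Sort by cumulative time and print the ten most expensive calls.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

In a report like this, time spent under `json.loads` shows up directly, which is the kind of signal that points to the JSON parsing as a bottleneck.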

I also added RxPY and multithreading support for `index-dump` (i.e. not waiting for Solr to reply before moving on to process new docs):
https://github.com/ziodave/opentapioca/blob/master/opentapioca/taggerfactory.py#L146-L159
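
The idea of not blocking on Solr can be sketched with the standard library alone. Note that this uses `concurrent.futures` rather than RxPY, so it is an illustration of the technique and not the code in the fork; `send_to_solr` is a hypothetical stand-in that only simulates a request.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def send_to_solr(batch):
    """Hypothetical stand-in for an HTTP update request to Solr."""
    time.sleep(0.01)  # simulate network latency
    return len(batch)


def index_documents(docs, batch_size=100, workers=4):
    """Submit batches to a thread pool so new docs are prepared while requests are pending."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(send_to_solr, batch)
            for batch in batches(docs, batch_size)
        ]
        # Collect the number of documents acknowledged by each request.
        return sum(future.result() for future in futures)


docs = ({"id": f"Q{i}"} for i in range(1000))
indexed = index_documents(docs)
print(indexed)
```

This sketch queues every batch up front, which is fine for a small input; for a full dump the number of in-flight batches would need to be bounded to keep memory under control.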

As a result, I believe I would be able to index the full Wikidata dump in ~4 hours (compared to the 3-4 days I needed before) on my M1.

This is in conjunction with the type inheritance cache (more on this in the next section).


### Wikidata type inheritance

More than a year ago I noticed that some of the entities I needed were not in the index (e.g. ingredients). On investigating, I found that we needed to run some SPARQL queries against Wikidata to resolve the type inheritance. TBH I don't recall how much was already in OpenTapioca and what I contributed :-)

Anyway, as SPARQL queries are what they are, I added a `@cache` higher-order function here: https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/sparqlwikidata.py#L9 which stores the results to disk for future reference.

And then I created a GH repo holding this cache:
https://github.com/wordlift/opentapioca-wikidata-query-cache

The speed-up is drastic, and I think the Wikidata folks may also be happy that I don't repeat the same queries.
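
For readers unfamiliar with the pattern, a disk-backed cache decorator can be sketched as follows. This is not the `@cache` implementation from the fork: the name `disk_cache`, the SHA-256 file naming, and the JSON storage format are all assumptions made for illustration, and `run_query` is a hypothetical stand-in that does not contact Wikidata.

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path


def disk_cache(cache_dir):
    """Decorator factory: persist results of a one-argument function as JSON files."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(query):
            # One file per distinct query, named after the query's hash.
            key = hashlib.sha256(query.encode("utf-8")).hexdigest()
            entry = cache_path / f"{key}.json"
            if entry.exists():
                return json.loads(entry.read_text(encoding="utf-8"))
            result = func(query)
            entry.write_text(json.dumps(result), encoding="utf-8")
            return result

        return wrapper

    return decorator


calls = []


@disk_cache(tempfile.mkdtemp())
def run_query(query):
    """Hypothetical stand-in for a SPARQL request to Wikidata."""
    calls.append(query)
    return {"query": query, "results": ["Q2095"]}


first = run_query("SELECT ...")
second = run_query("SELECT ...")
print(len(calls))
```

Because each result is an ordinary file, the cache directory can be committed to a repository and shared, so a fresh environment does not have to re-run the queries.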


### Summary

In short, I really like OpenTapioca and I am trying to grow it according to my needs (and, from some conversations I had, other people's needs too).

My aim now would be to share this work with you, see if you have the time to review it and eventually merge it upstream.

Thanks for your time 🙏

