opentapioca long-due contributions #53

@ziodave

Hello @wetneb

I have worked intensively on OpenTapioca over the last few days and want to summarize the major points. The general aim is to add proper multilingual support to OpenTapioca, improve its bootstrap performance (for example, creating the index), and include more entities in the index by relying on type inheritance.

As I started this work a long time ago and only recently got back to it, some of the things might not make 100% sense :-)

In short:

  1. Multiple language support (Solr configset and document structure)
  2. Performance (train-bow and index-dump)
  3. Wikidata type inheritance

Multiple language support

To properly support multiple languages, I created a different document structure in which the labels are stored in tag_{language}_name and the aliases in tag_{language}_alias.

This is essential for tagging in a specific language, which is passed as a parameter to the annotate API.
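As an illustration of this layout, here is a minimal sketch of a per-language document. The field names assume the tag_{language}_name and tag_{language}_alias pattern; the helper names build_solr_doc and tag_fields, the id field, and the sample data are hypothetical and are not taken from the fork.

```python
from typing import Dict, List


def build_solr_doc(qid: str,
                   labels: Dict[str, str],
                   aliases: Dict[str, List[str]]) -> dict:
    """Flatten per-language labels and aliases into language-specific fields.

    Hypothetical helper: the real document structure lives in the fork.
    """
    doc = {"id": qid}
    for lang, label in labels.items():
        doc[f"tag_{lang}_name"] = label
    for lang, names in aliases.items():
        doc[f"tag_{lang}_alias"] = list(names)
    return doc


def tag_fields(lang: str) -> List[str]:
    """Fields a tagger could match against when annotating in `lang`."""
    return [f"tag_{lang}_name", f"tag_{lang}_alias"]


doc = build_solr_doc(
    "Q42",
    labels={"en": "Douglas Adams", "it": "Douglas Adams"},
    aliases={"en": ["Douglas Noel Adams"]},
)
print(doc)
print(tag_fields("it"))
```

Keeping one pair of fields per language means a request for a given language only needs to look at that language's fields, and each field can get its own analyzer in the schema.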

To enable this, I created a custom Solr image (based on the official one) that includes the required extra libraries and support files:
https://github.com/wordlift/docker-solr-opentapioca

You can see the managed-schema configuration here:
https://github.com/wordlift/docker-solr-opentapioca/blob/master/tapioca/conf/managed-schema

The image is published to docker hub as well:
https://hub.docker.com/r/wordlift/solr-opentapioca/tags

You can spot the updated structure in my own fork:

Performance (train-bow and index-dump)

I tried to speed up train-bow and index-dump as much as possible, since I would like to run many tests and I can't wait a week every time I recreate an environment.

I used cProfile to profile the app and remove bottlenecks, working in particular on the readers (removing the JSON loading from them) and replacing the standard library's json.loads with the more efficient msgspec package, while defining the schema that we need.
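For readers who want to repeat this kind of measurement, a minimal sketch of profiling a line-by-line JSON reader with cProfile follows. The parse_lines and profile_call helpers and the synthetic sample lines are hypothetical stand-ins, not the actual OpenTapioca readers.

```python
import cProfile
import io
import json
import pstats


def parse_lines(lines):
    """Decode one JSON entity per line, as a dump reader might."""
    return [json.loads(line) for line in lines]


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


# Synthetic input: 1000 small entity-like JSON lines.
sample = ['{"id": "Q%d", "labels": {"en": {"value": "item %d"}}}' % (i, i)
          for i in range(1000)]
entities, report = profile_call(parse_lines, sample)
print(report)
```

If the report shows most of the cumulative time inside the JSON decoder, that is the signal that moving decoding out of the reader or swapping the decoder is likely to pay off.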

I also added RxPY and multithreading support to index-dump, so that processing of new docs moves forward without waiting for Solr to reply:
https://github.com/ziodave/opentapioca/blob/master/opentapioca/taggerfactory.py#L146-L159
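The idea of not blocking on Solr can be sketched with the standard library alone. Note that this uses concurrent.futures rather than RxPY, so it is not the code at the link above; index_all, batched, fake_send, and the batch sizes are hypothetical names and values chosen for the example.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_all(docs, send_batch, batch_size=500, workers=4, max_pending=8):
    """Send batches from worker threads while the caller keeps reading docs.

    At most `max_pending` batches are in flight, which bounds memory use.
    Returns the sum of the values returned by `send_batch`.
    """
    sent = 0
    pending = deque()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(docs, batch_size):
            if len(pending) >= max_pending:
                # Wait for the oldest request; re-raises worker errors.
                sent += pending.popleft().result()
            pending.append(pool.submit(send_batch, batch))
        while pending:
            sent += pending.popleft().result()
    return sent


def fake_send(batch):
    """Stand-in for an HTTP update request to Solr."""
    time.sleep(0.01)  # simulate network latency
    return len(batch)


total = index_all(({"id": f"Q{i}"} for i in range(2500)), fake_send)
print(total)
```

Threads are enough here because the time is spent waiting on the network, during which the main thread is free to prepare the next batch.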

As a result, I believe I can index the full Wikidata dump in ~4 hours on my M1, compared to the 3-4 days I needed before.

This is in conjunction with the type inheritance cache (more on this in the next section).

Wikidata type inheritance

More than a year ago I noticed that some of the entities I needed were not in the index (e.g. ingredients). Investigating, I found that we needed to run some SPARQL queries against Wikidata to resolve the type inheritance. TBH I don't recall how much was already in OpenTapioca and what I contributed :-)

Anyway, as SPARQL queries are what they are, I added a @cache higher-order function here https://github.com/ziodave/opentapioca/blob/e44fa870ca12373a5bb0f51694a756974c600799/opentapioca/sparqlwikidata.py#L9 which stores the results to disk for future reference.
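A disk-backed cache decorator of this kind can be sketched as follows. This is an assumption about the general shape, not a copy of the linked implementation; disk_cache, run_query, the file naming scheme, and the canned result are hypothetical, and the example does not contact Wikidata.

```python
import functools
import hashlib
import json
import tempfile
from pathlib import Path


def disk_cache(cache_dir):
    """Decorator factory: store JSON-serialisable results on disk, keyed by
    a hash of the function name and its arguments."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key_source = json.dumps([func.__name__, args, kwargs],
                                    sort_keys=True)
            key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
            entry = cache_path / f"{key}.json"
            if entry.exists():
                # Cache hit: skip the expensive call entirely.
                return json.loads(entry.read_text(encoding="utf-8"))
            result = func(*args, **kwargs)
            entry.write_text(json.dumps(result), encoding="utf-8")
            return result
        return wrapper
    return decorator


calls = []


@disk_cache(tempfile.mkdtemp())
def run_query(query):
    """Stand-in for a SPARQL request; records each real invocation."""
    calls.append(query)
    return {"query": query, "rows": ["Q2095"]}


first = run_query("SELECT ?item WHERE { ?item wdt:P279* wd:Q2095 }")
second = run_query("SELECT ?item WHERE { ?item wdt:P279* wd:Q2095 }")
print(len(calls))
```

Because the cache is a directory of plain files keyed by query, it can be committed to a repository and shared, which is what makes a separate cache repo practical.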

And then I created a GH repo holding this cache:
https://github.com/wordlift/opentapioca-wikidata-query-cache

The speed-up is drastic, and I think the Wikidata folks may also be happy that I don't repeat the same queries.

Summary

In short, I really like OpenTapioca and I am trying to grow it according to my needs (and, from some conversations I had, other people's needs as well).

My aim now is to share this work with you, see if you have the time to review it, and possibly merge it upstream.

Thanks for your time 🙏
