
Conversation

@cthoyt (Contributor) commented Aug 5, 2019

This piggybacks on the last PR to INDRA (gyorilab/indra#928) and uses the JSON resources generated by that new code to extract the names, identifiers, and synonyms for terms in three ontologies (via OBO): EFO, DOID, and HP.

@bgyori (Member) left a comment

One thing I noticed when generating grounding_terms.tsv locally is that EFO produces some duplicate rows, e.g.,

facial nerve disease»···facial nerve disease»···EFO»1002051»facial nerve disease»···synonym»efo
facial nerve disease»···facial nerve disease»···EFO»1002051»facial nerve disease»···synonym»efo
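
A dedup pass over the generated rows could be as simple as this sketch (plain tuples standing in for the real Term rows, not Gilda's actual code):

```python
# Sketch: drop exact duplicate term rows (like the duplicated EFO
# "facial nerve disease" rows above) while preserving first-seen order.
def dedup_rows(rows):
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

rows = [
    ('facial nerve disease', 'EFO', '1002051', 'synonym', 'efo'),
    ('facial nerve disease', 'EFO', '1002051', 'synonym', 'efo'),
]
unique = list(dedup_rows(rows))
# -> a single ('facial nerve disease', 'EFO', '1002051', ...) row
```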

A bigger-picture issue we need to think about is: if any of these terms have cross references to existing namespaces integrated in Gilda, then we should ideally integrate them through that namespace, and make sure there are no redundancies. For instance, when integrating human proteins from UniProt, I mapped UP IDs to HGNC IDs to create HGNC-based Terms, and filtered out ones that were redundant with what we already got from HGNC. An example is

lipocalin2»·Lipocalin-2»HGNC»···6526»···LCN2»···synonym»uniprot

Here the db and id are HGNC and 6526, with uniprot indicated as the source.

Similarly, I imagine that where EFO/DOID/HP overlap with MeSH, we could create Terms based on MeSH IDs, filter out any redundant entries, and set the source attribute as needed. What do you think?
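
The remap-then-dedup step might look roughly like this (hypothetical Term tuples and UP-to-HGNC mapping dict, not Gilda's actual API):

```python
# Sketch of cross-reference-based remapping: terms whose namespace
# maps onto an already-integrated one (e.g. UP -> HGNC) are re-keyed,
# and entries redundant with existing terms are dropped.
from collections import namedtuple

Term = namedtuple('Term', 'norm_text text db id entry_name status source')

UP_TO_HGNC = {'P80188': '6526'}  # example mapping for LCN2

def remap_and_dedup(new_terms, existing):
    seen = {(t.norm_text, t.db, t.id) for t in existing}
    for t in new_terms:
        hgnc_id = UP_TO_HGNC.get(t.id)
        if hgnc_id is not None:
            t = t._replace(db='HGNC', id=hgnc_id)
        key = (t.norm_text, t.db, t.id)
        if key not in seen:
            seen.add(key)
            yield t

existing = [Term('lcn2', 'LCN2', 'HGNC', '6526', 'LCN2', 'name', 'hgnc')]
new = [
    Term('lcn2', 'LCN2', 'UP', 'P80188', 'Lipocalin-2', 'name', 'uniprot'),
    Term('lipocalin2', 'Lipocalin-2', 'UP', 'P80188', 'Lipocalin-2',
         'synonym', 'uniprot'),
]
kept = list(remap_and_dedup(new, existing))
# Only the synonym survives, now keyed by HGNC:6526 with source uniprot
```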

@cthoyt (Contributor, Author) commented Aug 5, 2019

One thing we could do is iterate through all of the cross-references for each entry and add many more terms with the alternate db and db_id (while keeping the source?). Then there would need to be some post-processing logic over all of the terms to remove redundancies.

Maybe something like this?

def _generate_obo_terms(prefix, keep_xrefs=None):
    filename = os.path.join(resources, '{prefix}.json'.format(prefix=prefix))
    logger.info('Loading %s', filename)
    with open(filename) as file:
        entries = json.load(file)

    if keep_xrefs is None:
        keep_xrefs = set()
    for entry in entries:
        name = entry['name']
        # Emit terms under the entry's own namespace plus each kept xref
        curies = [(prefix.upper(), entry['id'])]
        for xref_db, xref_id in _get_xrefs(entry, keep_xrefs):
            curies.append((xref_db, xref_id))

        for db, db_id in curies:
            yield Term(normalize(name), name, db, db_id, name, 'name', prefix)
            for synonym in set(entry['synonyms']):
                yield Term(normalize(synonym), synonym, db, db_id, name,
                           'synonym', prefix)


def _get_xrefs(entry, keep_xrefs):
    for xref in entry['xrefs']:
        try:
            db, db_id = xref.split(':', 1)
        except ValueError:
            continue
        if db in keep_xrefs:
            yield db, db_id

@cthoyt (Contributor, Author) commented Aug 6, 2019

Okay, now I'm checking out the consequences of adding this loop for the xrefs - it slows things down from under 1 second to more than 30 minutes for each ontology. Investigating now...
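
A blow-up like this often comes from doing a linear scan per lookup; building a dict index once makes each lookup O(1). A generic sketch (not INDRA's actual fix):

```python
# Sketch: a per-query linear scan over identifier records vs. a dict
# index built once up front. The index turns each lookup from O(n)
# into O(1), which at ontology scale is the difference between
# seconds and tens of minutes.
records = [('D%06d' % i, 'term %d' % i) for i in range(5000)]

def lookup_linear(query):
    for id_, name in records:   # O(n) scan, repeated for every query
        if id_ == query:
            return name

index = dict(records)           # O(n) once; each lookup is O(1) after

assert lookup_linear('D004321') == index['D004321'] == 'term 4321'
```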

@cthoyt (Contributor, Author) commented Aug 6, 2019

Lookup of MeSH identifiers solved with gyorilab/indra#933

@cthoyt (Contributor, Author) commented Aug 6, 2019

Remaining missing DOID xrefs seem to be due to alternate identifiers. Looking into that now.

@bgyori force-pushed the add-more-resources branch from 7b74185 to cc1e76f on March 17, 2020
@cthoyt force-pushed the add-more-resources branch from 05fa96a to a3bb411 on March 17, 2020
cthoyt added 16 commits on April 5, 2020, including:

  • In case the generate...() functions ever start returning more information (such as xref mappings or name-synonym relationships), this will make it easier to handle them uniformly
  • This will make it easier to add more terms if we want to use the xrefs
  • This could now be extended to generate terms for other databases too, with similar client lookups
  • Don't add a redundant "name" term when there's an xref
  • Using str.startswith() gives a speedup. Might need to pre-parse all of these into the JSON in the INDRA resources
  • Add tqdm!
@bgyori force-pushed the add-more-resources branch from e30c07b to eafe521 on April 6, 2020
@bgyori (Member) commented Apr 6, 2020

Alright, thanks again for this @cthoyt, I finally had a chance to work through this and merge it. I made the following changes:

  • Migrated the code from the extra_terms file into the main generate_terms file.
  • Reimplemented the xref mapping logic to apply once to a single entry (instead of generating all xref combinations for each synonym) and replace its db/id with a MESH or DOID ID. I reviewed the types of all the xrefs reported by EFO, HP and DOID, and from what I can tell, only these two mappings are relevant in this context. I also got rid of mappings to supplementary MESH IDs, which were often to outdated entries - in these cases we stick with the original source instead of MESH.
  • Reviewed the entries we get from each source and found a number of corner cases that need to be post-processed or removed. There are still some weird synonyms but I handled most of the low-hanging-fruit issues.
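
The once-per-entry remapping described above might look roughly like this hypothetical helper (the actual merged code may differ; supplementary MeSH record IDs start with 'C', descriptors with 'D'):

```python
# Sketch: pick a single preferred namespace for an entry (MESH
# descriptor first, then DOID), skipping supplementary MeSH records
# (IDs starting with 'C'), instead of expanding every xref combination.
def choose_namespace(prefix, entry_id, xrefs):
    for db, db_id in xrefs:
        if db == 'MESH' and db_id.startswith('D'):
            return db, db_id
    for db, db_id in xrefs:
        if db == 'DOID':
            return db, db_id
    return prefix.upper(), entry_id  # fall back to the original source

print(choose_namespace('efo', '0002618',
                       [('MESH', 'C536108'), ('MESH', 'D010190')]))
# -> ('MESH', 'D010190')
```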

@bgyori bgyori merged commit fb991af into gyorilab:master Apr 6, 2020
@bgyori (Member) commented Apr 6, 2020

As a follow-up, not unexpectedly, there are redundancies not resolved by the source-provided mappings. For instance:

In [11]: ground('pancreatic cancer')                                                                                                          
Out[11]: 
[ScoredMatch(Term(pancreatic cancer,pancreatic cancer,DOID,DOID:1793,pancreatic cancer,name,efo),0.7777777777777778,Match(query=pancreatic cancer,ref=pancreatic cancer,exact=True,space_mismatch=False,dash_mismatches={},cap_combos=[])),
 ScoredMatch(Term(pancreatic cancer,pancreatic cancer,EFO,0002618,pancreatic carcinoma,synonym,efo),0.5555555555555556,Match(query=pancreatic cancer,ref=pancreatic cancer,exact=True,space_mismatch=False,dash_mismatches={},cap_combos=[])),
 ScoredMatch(Term(pancreatic cancer,pancreatic cancer,MESH,D010190,Pancreatic Neoplasms,synonym,efo),0.5555555555555556,Match(query=pancreatic cancer,ref=pancreatic cancer,exact=True,space_mismatch=False,dash_mismatches={},cap_combos=[]))]

In a separate PR, I will attempt to extend the automated MeSH mappings finder concept to these new redundancies.
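
Finding candidates for such mappings could start with something as simple as grouping terms by shared text (a sketch, not the actual mappings finder):

```python
# Sketch: flag groups of terms that share a normalized text but ground
# to different (db, id) pairs - candidates for mapping-based merging.
from collections import defaultdict

terms = [
    ('pancreatic cancer', 'DOID', 'DOID:1793'),
    ('pancreatic cancer', 'EFO', '0002618'),
    ('pancreatic cancer', 'MESH', 'D010190'),
    ('lipocalin 2', 'HGNC', '6526'),
]

groups = defaultdict(set)
for text, db, db_id in terms:
    groups[text].add((db, db_id))

redundant = {text: ids for text, ids in groups.items() if len(ids) > 1}
# -> only 'pancreatic cancer' is flagged, with three groundings
```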

return _generate_obo_terms('hp')


def _generate_obo_terms(prefix):
@cthoyt (Contributor, Author) commented Apr 6, 2020

I've written a lot more rules for normalizing terms and xrefs in the https://github.com/pyobo/pyobo repo (see https://github.com/pyobo/pyobo/blob/master/src/pyobo/identifier_utils.py). I'm still improving it, but right now it loads all of the ontologies in OBO Foundry plus some I converted myself, and extracts identifiers, labels, synonyms, and xrefs in a pretty general way. Throughout the process I've also started curating a database of prefixes, synonyms for prefixes, etc. that would supplement the Identifiers.org database.

I don't think you'll want to have gilda depend on that code, but maybe you can get some ideas from it. I'd be happy to explain further.
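
The prefix-synonym idea could be sketched like this (a hypothetical synonym table in the spirit of pyobo's identifier_utils, not its actual API):

```python
# Sketch: normalize a curie's prefix through a table of known prefix
# synonyms before splitting off the local identifier.
PREFIX_SYNONYMS = {
    'MSH': 'MESH',
    'MeSH': 'MESH',
    'Orphanet': 'ORPHANET',
}

def normalize_curie(curie):
    try:
        prefix, identifier = curie.split(':', 1)
    except ValueError:
        return None  # not a curie; mirrors skipping malformed xrefs
    return PREFIX_SYNONYMS.get(prefix, prefix), identifier

print(normalize_curie('MSH:D010190'))  # -> ('MESH', 'D010190')
```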

@cthoyt cthoyt deleted the add-more-resources branch September 12, 2021 12:23