
Conversation

@cthoyt (Contributor) commented Aug 5, 2019

This piggybacks on the last PR to INDRA (gyorilab/indra#928) and uses the JSON resources generated by that new code to extract the names, identifiers, and synonyms for terms in three ontologies (via OBO): EFO, DOID, and HP.

@bgyori (Member) left a comment

One thing I noticed when generating grounding_terms.tsv locally is that EFO produces some duplicate rows, e.g.,

facial nerve disease»···facial nerve disease»···EFO»1002051»facial nerve disease»···synonym»efo
facial nerve disease»···facial nerve disease»···EFO»1002051»facial nerve disease»···synonym»efo
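
A dedup pass over the generated rows could be as simple as this sketch (plain tuples standing in for the real Term rows, not Gilda's actual code):

```python
# Sketch: drop exact duplicate term rows (like the duplicated EFO
# "facial nerve disease" rows above) while preserving first-seen order.
def dedup_rows(rows):
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

rows = [
    ('facial nerve disease', 'EFO', '1002051', 'synonym', 'efo'),
    ('facial nerve disease', 'EFO', '1002051', 'synonym', 'efo'),
]
unique = list(dedup_rows(rows))
# -> a single ('facial nerve disease', 'EFO', '1002051', ...) row
```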

A bigger-picture issue we need to think about is: if any of these terms have cross references to existing namespaces integrated in Gilda, then we should ideally integrate them through that namespace, and make sure there are no redundancies. For instance, when integrating human proteins from UniProt, I mapped UP IDs to HGNC IDs to create HGNC-based Terms, and filtered out ones that were redundant with what we already got from HGNC. An example is

lipocalin2»·Lipocalin-2»HGNC»···6526»···LCN2»···synonym»uniprot

Here the db and id are HGNC and 6526, with uniprot indicated as the source.

Similarly, I imagine that where EFO/DOID/HP overlap with MeSH, we could create Terms based on MeSH IDs, filter out any redundant entries, and set the source attribute as needed. What do you think?
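
The remap-then-dedup step might look roughly like this (hypothetical Term tuples and UP-to-HGNC mapping dict, not Gilda's actual API):

```python
# Sketch of cross-reference-based remapping: terms whose namespace
# maps onto an already-integrated one (e.g. UP -> HGNC) are re-keyed,
# and entries redundant with existing terms are dropped.
from collections import namedtuple

Term = namedtuple('Term', 'norm_text text db id entry_name status source')

UP_TO_HGNC = {'P80188': '6526'}  # example mapping for LCN2

def remap_and_dedup(new_terms, existing):
    seen = {(t.norm_text, t.db, t.id) for t in existing}
    for t in new_terms:
        hgnc_id = UP_TO_HGNC.get(t.id)
        if hgnc_id is not None:
            t = t._replace(db='HGNC', id=hgnc_id)
        key = (t.norm_text, t.db, t.id)
        if key not in seen:
            seen.add(key)
            yield t

existing = [Term('lcn2', 'LCN2', 'HGNC', '6526', 'LCN2', 'name', 'hgnc')]
new = [
    Term('lcn2', 'LCN2', 'UP', 'P80188', 'Lipocalin-2', 'name', 'uniprot'),
    Term('lipocalin2', 'Lipocalin-2', 'UP', 'P80188', 'Lipocalin-2',
         'synonym', 'uniprot'),
]
kept = list(remap_and_dedup(new, existing))
# Only the synonym survives, now keyed by HGNC:6526 with source uniprot
```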

@cthoyt (Contributor, Author) commented Aug 5, 2019

One thing we could do is iterate through all of the cross-references for each entry and add many more terms with the alternate db and db_id (while keeping the source?). Then there would need to be some post-processing logic over all of the terms to remove redundancies.

Maybe something like this?

def _generate_obo_terms(prefix, keep_xrefs=None):
    filename = os.path.join(resources, '{prefix}.json'.format(prefix=prefix))
    logger.info('Loading %s', filename)
    with open(filename) as file:
        entries = json.load(file)

    if keep_xrefs is None:
        keep_xrefs = set()
    for entry in entries:
        name = entry['name']
        # Emit terms under the entry's own namespace plus each kept xref
        curies = [(prefix.upper(), entry['id'])]
        for xref_db, xref_id in _get_xrefs(entry, keep_xrefs):
            curies.append((xref_db, xref_id))

        for db, db_id in curies:
            yield Term(normalize(name), name, db, db_id, name, 'name', prefix)
            for synonym in set(entry['synonyms']):
                yield Term(normalize(synonym), synonym, db, db_id, name,
                           'synonym', prefix)


def _get_xrefs(entry, keep_xrefs):
    for xref in entry['xrefs']:
        try:
            db, db_id = xref.split(':', 1)
        except ValueError:
            continue
        if db in keep_xrefs:
            yield db, db_id

@cthoyt (Contributor, Author) commented Aug 6, 2019

Okay, now I'm checking out the consequences of adding this loop for the xrefs - it slows things down from under 1 second to more than 30 minutes for each ontology. Investigating now...
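
A blow-up like this often comes from doing a linear scan per lookup; building a dict index once makes each lookup O(1). A generic sketch (not INDRA's actual fix):

```python
# Sketch: a per-query linear scan over identifier records vs. a dict
# index built once up front. The index turns each lookup from O(n)
# into O(1), which at ontology scale is the difference between
# seconds and tens of minutes.
records = [('D%06d' % i, 'term %d' % i) for i in range(5000)]

def lookup_linear(query):
    for id_, name in records:   # O(n) scan, repeated for every query
        if id_ == query:
            return name

index = dict(records)           # O(n) once; each lookup is O(1) after

assert lookup_linear('D004321') == index['D004321'] == 'term 4321'
```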

@cthoyt (Contributor, Author) commented Aug 6, 2019

Lookup of MeSH identifiers solved with gyorilab/indra#933

@cthoyt (Contributor, Author) commented Aug 6, 2019

Remaining missing DOID xrefs seem to be due to alternate identifiers. Looking into that now.

@bgyori force-pushed the add-more-resources branch from 7b74185 to cc1e76f on March 17, 2020
@cthoyt force-pushed the add-more-resources branch from 05fa96a to a3bb411 on March 17, 2020
cthoyt added 16 commits on April 5, 2020, including:

  • In case the generate...() functions ever start returning more information (such as xref mappings or name-synonym relationships), this will make it easier to handle them uniformly
  • This will make it easier to add more terms if we want to use the xrefs
  • This could now be extended to generate terms for other databases too, with similar client lookups
  • Don't add a redundant "name" term when there's an xref
  • Using str.startswith() gives a speedup. Might need to pre-parse all of these into the JSON in the INDRA resources
  • Add tqdm!
@bgyori force-pushed the add-more-resources branch from e30c07b to eafe521 on April 6, 2020
@bgyori (Member) commented Apr 6, 2020

Alright, thanks again for this @cthoyt, I finally had a chance to work through this and merge it. I made the following changes:

  • Migrated the code from the extra_terms file into the main generate_terms file.
  • Reimplemented the xref mapping logic to apply once to a single entry (instead of generating all xref combinations for each synonym) and replace its db/id with a MESH or DOID ID. I reviewed the types of all the xrefs reported by EFO, HP and DOID, and from what I can tell, only these two mappings are relevant in this context. I also got rid of mappings to supplementary MESH IDs, which were often to outdated entries - in these cases we stick with the original source instead of MESH.
  • Reviewed the entries we get from each source and found a number of corner cases that need to be post-processed or removed. There are still some weird synonyms but I handled most of the low-hanging-fruit issues.
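
The once-per-entry remapping described above might look roughly like this hypothetical helper (the actual merged code may differ; supplementary MeSH record IDs start with 'C', descriptors with 'D'):

```python
# Sketch: pick a single preferred namespace for an entry (MESH
# descriptor first, then DOID), skipping supplementary MeSH records
# (IDs starting with 'C'), instead of expanding every xref combination.
def choose_namespace(prefix, entry_id, xrefs):
    for db, db_id in xrefs:
        if db == 'MESH' and db_id.startswith('D'):
            return db, db_id
    for db, db_id in xrefs:
        if db == 'DOID':
            return db, db_id
    return prefix.upper(), entry_id  # fall back to the original source

print(choose_namespace('efo', '0002618',
                       [('MESH', 'C536108'), ('MESH', 'D010190')]))
# -> ('MESH', 'D010190')
```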

@bgyori bgyori merged commit fb991af into gyorilab:master Apr 6, 2020
@bgyori (Member) commented Apr 6, 2020

As a follow-up, not unexpectedly, there are redundancies not resolved by the source-provided mappings. For instance:

In [11]: ground('pancreatic cancer')                                                                                                          
Out[11]: 
[ScoredMatch(Term(pancreatic cancer,pancreatic cancer,DOID,DOID:1793,pancreatic cancer,name,efo),0.7777777777777778,Match(query=pancreatic cancer,ref=pancreatic cancer,exact=True,space_mismatch=False,dash_mismatches={},cap_combos=[])),
 ScoredMatch(Term(pancreatic cancer,pancreatic cancer,EFO,0002618,pancreatic carcinoma,synonym,efo),0.5555555555555556,Match(query=pancreatic cancer,ref=pancreatic cancer,exact=True,space_mismatch=False,dash_mismatches={},cap_combos=[])),
 ScoredMatch(Term(pancreatic cancer,pancreatic cancer,MESH,D010190,Pancreatic Neoplasms,synonym,efo),0.5555555555555556,Match(query=pancreatic cancer,ref=pancreatic cancer,exact=True,space_mismatch=False,dash_mismatches={},cap_combos=[]))]

In a separate PR, I will attempt to extend the automated MeSH mappings finder concept to these new redundancies.
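
Finding candidates for such mappings could start with something as simple as grouping terms by shared text (a sketch, not the actual mappings finder):

```python
# Sketch: flag groups of terms that share a normalized text but ground
# to different (db, id) pairs - candidates for mapping-based merging.
from collections import defaultdict

terms = [
    ('pancreatic cancer', 'DOID', 'DOID:1793'),
    ('pancreatic cancer', 'EFO', '0002618'),
    ('pancreatic cancer', 'MESH', 'D010190'),
    ('lipocalin 2', 'HGNC', '6526'),
]

groups = defaultdict(set)
for text, db, db_id in terms:
    groups[text].add((db, db_id))

redundant = {text: ids for text, ids in groups.items() if len(ids) > 1}
# -> only 'pancreatic cancer' is flagged, with three groundings
```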

return _generate_obo_terms('hp')


def _generate_obo_terms(prefix):
@cthoyt (Contributor, Author) commented Apr 6, 2020

I've written a lot more rules for normalizing terms and xrefs in the https://github.com/pyobo/pyobo repo (see https://github.com/pyobo/pyobo/blob/master/src/pyobo/identifier_utils.py). I'm still improving it, but right now it loads all of the ontologies in OBO Foundry plus some I converted myself, and extracts identifiers, labels, synonyms, and xrefs in a pretty general way. Throughout the process I've also started curating a database of prefixes, synonyms for prefixes, etc. that would supplement the Identifiers.org database.

I don't think you'll want to have gilda depend on that code, but maybe you can get some ideas from it. I'd be happy to explain further.
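
The prefix-synonym idea could be sketched like this (a hypothetical synonym table in the spirit of pyobo's identifier_utils, not its actual API):

```python
# Sketch: normalize a curie's prefix through a table of known prefix
# synonyms before splitting off the local identifier.
PREFIX_SYNONYMS = {
    'MSH': 'MESH',
    'MeSH': 'MESH',
    'Orphanet': 'ORPHANET',
}

def normalize_curie(curie):
    try:
        prefix, identifier = curie.split(':', 1)
    except ValueError:
        return None  # not a curie; mirrors skipping malformed xrefs
    return PREFIX_SYNONYMS.get(prefix, prefix), identifier

print(normalize_curie('MSH:D010190'))  # -> ('MESH', 'D010190')
```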

@cthoyt cthoyt deleted the add-more-resources branch September 12, 2021 12:23