Add EFO, DOID, and HP names and synonyms #4
Conversation
bgyori left a comment
One thing I noticed when generating grounding_terms.tsv locally is that EFO produces some duplicate rows, e.g.,
facial nerve disease	facial nerve disease	EFO	1002051	facial nerve disease	synonym	efo
facial nerve disease	facial nerve disease	EFO	1002051	facial nerve disease	synonym	efo
A bigger-picture issue we need to think about: if any of these terms have cross-references to existing namespaces integrated in Gilda, then we should ideally integrate them through that namespace, and make sure there are no redundancies. For instance, when integrating human proteins from UniProt, I mapped UP IDs to HGNC IDs to create HGNC-based Terms, and filtered out ones that were redundant with what we already got from HGNC. An example is
lipocalin2	Lipocalin-2	HGNC	6526	LCN2	synonym	uniprot
Here the db:id is HGNC:6526, with uniprot indicated as the source.
Similarly, I imagine that where EFO/DOID/HP overlap with MeSH, we could create Terms based on MeSH IDs, filter out any redundant entries, and set the source attribute as needed. What do you think?
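A minimal sketch of what that remapping could look like, assuming gilda's Term keeps its constructor arguments as attributes; mesh_xrefs and existing_mesh_keys are hypothetical inputs, not anything gilda provides today:

def remap_to_mesh(terms, mesh_xrefs, existing_mesh_keys):
    """Re-key terms onto MeSH where an xref exists, dropping redundancies.

    mesh_xrefs: hypothetical dict mapping (db, id) -> MeSH ID.
    existing_mesh_keys: hypothetical set of (db, id, norm_text) tuples
    for Terms already generated directly from MeSH.
    """
    for term in terms:
        mesh_id = mesh_xrefs.get((term.db, term.id))
        if mesh_id is None:
            # No MeSH mapping; keep the Term in its original namespace
            yield term
            continue
        if ('MESH', mesh_id, term.norm_text) in existing_mesh_keys:
            # Redundant with a Term we already got from MeSH itself
            continue
        # Re-key onto MeSH, keeping the original ontology as the source
        yield Term(term.norm_text, term.text, 'MESH', mesh_id,
                   term.entry_name, term.status, term.source)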
One thing we could do is iterate through all of the cross-references for each entry and add lots more terms with the alternate db and db_id (while keeping the source?). There would then need to be some post-processing logic on all of the terms to remove redundancies. Maybe something like this?

import json
import logging
import os

logger = logging.getLogger(__name__)

# Term, normalize, and resources are assumed to come from the
# surrounding generate_terms module.
def _generate_obo_terms(prefix, keep_xrefs=None):
    filename = os.path.join(resources, '{prefix}.json'.format(prefix=prefix))
    logger.info('Loading %s', filename)
    with open(filename) as file:
        entries = json.load(file)
    if keep_xrefs is None:
        keep_xrefs = set()
    for entry in entries:
        name = entry['name']
        # The entry's own namespace plus any kept xrefs, each as a
        # (db, db_id, source) triple so the loop below treats them uniformly
        curies = [(prefix.upper(), entry['id'], prefix)]
        for xref_db, xref_id in _get_xrefs(entry, keep_xrefs):
            curies.append((xref_db, xref_id, prefix))
        for db, db_id, source in curies:
            yield Term(normalize(name), name, db, db_id, name, 'name', source)
            for synonym in set(entry['synonyms']):
                yield Term(normalize(synonym), synonym, db, db_id, name,
                           'synonym', source)


def _get_xrefs(entry, keep_xrefs):
    for xref in entry['xrefs']:
        try:
            db, db_id = xref.split(':', 1)
        except ValueError:
            continue
        if db in keep_xrefs:
            yield db, db_id
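If we went this route, the call for DOID might look like this (the keep_xrefs value is just illustrative):

terms = list(_generate_obo_terms('doid', keep_xrefs={'MESH'}))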
Okay, now I'm checking out the consequences of adding this loop for the xrefs: it slows generation down from under 1 second to more than 30 minutes for each ontology. Investigating now...
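For reference, if the slowdown comes from doing a remote lookup per xref, one common way to amortize it is to cache the resolver; lookup_mesh_name below is a hypothetical stand-in for whatever expensive call is being made per identifier:

from functools import lru_cache

@lru_cache(maxsize=None)
def lookup_mesh_name_cached(mesh_id):
    # Hit the slow path at most once per unique MeSH ID
    return lookup_mesh_name(mesh_id)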
Lookup of MeSH identifiers solved with gyorilab/indra#933
Remaining missing DOID xrefs seem to be due to alternate identifiers. Looking into that now.
Force-pushed from 7b74185 to cc1e76f
Force-pushed from 05fa96a to a3bb411
In case the generate...() functions ever start returning more information (such as xref mappings or name-synonym relationships), this will make it easier to handle them uniformly.
This will make it easier to add more terms if we want to use the xrefs.
This could now be extended to generate terms for other databases too with similar client lookups.
- Don't add a redundant "name" term when there's an xref
- Use str.startswith(); it gives a speedup. Might need to pre-parse all of these into the JSON in the INDRA resources.
- Add tqdm!
[skip ci]
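A sketch of the str.startswith() filtering described above (the helper name and prefix set are hypothetical); checking the prefix first avoids splitting and dispatching on every xref string:

def _get_mesh_xrefs(entry):
    # Only split strings that can possibly match, e.g. 'MESH:D012345'
    for xref in entry.get('xrefs', []):
        if xref.startswith('MESH:'):
            yield 'MESH', xref[len('MESH:'):]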
Force-pushed from e30c07b to eafe521
Alright, thanks again for this @cthoyt, I finally had a chance to work through this and merge it. I made the following changes:
As a follow-up: not unexpectedly, there are redundancies not resolved by the source-provided mappings. In a separate PR, I will attempt to extend the automated MeSH mappings finder concept to these new redundancies.
Review comment on these lines of the diff:

    return _generate_obo_terms('hp')
    ...
    def _generate_obo_terms(prefix):
I've written a lot more rules for normalizing terms and xrefs in the https://github.com/pyobo/pyobo repo (see https://github.com/pyobo/pyobo/blob/master/src/pyobo/identifier_utils.py). I'm still improving it, but right now it loads all of the ontologies in OBO Foundry plus some that I converted myself, and extracts identifiers, labels, synonyms, and xrefs in a pretty general way. Throughout the process I've also started curating a database of prefixes, synonyms for prefixes, etc. that would supplement the Identifiers.org database.
I don't think you'll want to have gilda depend on that code, but maybe you can get some ideas from it. Would be happy to explain further.
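For a flavor of what that normalization involves, here's a toy version; the synonym table is illustrative, not pyobo's actual data:

PREFIX_SYNONYMS = {
    'MSH': 'mesh',
    'MESH': 'mesh',
    'UMLS_CUI': 'umls',
    'SNOMEDCT_US': 'snomedct',
}

def normalize_curie(curie):
    """Split a CURIE and map its prefix to a canonical, lowercase form."""
    try:
        prefix, identifier = curie.split(':', 1)
    except ValueError:
        return None  # not a CURIE; skip, or log for curation
    return PREFIX_SYNONYMS.get(prefix, prefix.lower()), identifier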
This piggybacks on the last PR to INDRA (gyorilab/indra#928) and uses the JSON resources generated by that new code to extract the names, identifiers, and synonyms for terms in three ontologies (via OBO): EFO, DOID, and HP.
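For context, the OBO-derived JSON entries consumed above are assumed to look roughly like this; the synonym and xref values are made up, and the exact schema comes from the INDRA-side code in gyorilab/indra#928:

entry = {
    'id': '1002051',                         # local identifier within EFO
    'name': 'facial nerve disease',
    'synonyms': ['disease of facial nerve'],  # hypothetical synonym
    'xrefs': ['MESH:D012345'],                # hypothetical xref CURIE
}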