Conversation

@mingjiecn
Contributor

  1. Add IGVF-related schemas for the different filters. For files in the IGVF portal, some fields are no longer required, and the IGVF schema adds a pattern for read_id.
  2. Make sure versions higher than 0.3.0 still work with seqspec check and seqspec upgrade.

@sbooeshaghi
Collaborator

Hi Mingjie, thank you for the PR. It’s important that seqspec works well for the IGVF consortium. I have a concern about the schema changes you proposed.

The schema defined in the schema folder is meant to describe the structure of any seqspec file. Adding a new schema specific to IGVF introduces a second definition of what a valid seqspec file looks like. This creates ambiguity; should files be validated against the IGVF schema or the standard seqspec schema?

Maintaining two diverging schemas makes future development harder and risks breaking consistency across the tool. For that reason, I think we should avoid adding a schema that is specific to IGVF.

Could you clarify specific attributes about the seqspec file that need to be changed to support IGVF? Maybe there is a possibility of adding additional tooling that allows standard seqspec files to work with the IGVF portal.

@mingjiecn
Contributor Author

Hi Sina. For the schema seqspec_igvf_onlist_skip.schema.json, we changed some fields from required to not required. For seqspec_igvf.schema.json, we made the same change and also added a pattern for read_id: https://github.com/pachterlab/seqspec/blob/b1f71df6650220f89def56f7f6c2e97a5945bef5/seqspec/schema/seqspec_igvf.schema.json#L314C12-L314C19.
I feel that having separate schemas is the easiest way to run the schema check if those files really do have different schemas. I can also see a potential issue with using the errobj generated in check_schema to filter errors. error_object is taken from the last item in err_elements. If that last item is a very general name, you cannot tell whether it is the error we want to filter. For example, many objects may have a field called name; if we want to filter errors for object A's name and a name error occurs, we cannot tell whether it is the name error we want to filter. I also see some errors with a number as error_object. Those errors are on array elements, so the number by itself is not meaningful at all. That is why I want to avoid filtering errors using the error_object generated in check_schema.
Also, instead of only filtering errors, we now need to check the read_id pattern for IGVF. Since pattern checking is already implemented in jsonschema, I feel we should just update the schema to add the pattern and let jsonschema check it, so we don’t need to write extra code to check read_id.
Most of the content of the three schemas is the same, so maybe we can create a base schema and then extend it. Let me know what you think.
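
To illustrate the ambiguity described above, here is a minimal sketch. It assumes each validation error carries a path of keys and array indices, similar to what jsonschema exposes on a validation error. The sample paths and both helper functions are hypothetical and are not taken from the seqspec code.

```python
from typing import List, Union

# A path element is either an object key (str) or an array index (int).
PathElem = Union[str, int]


def last_element_label(path: List[PathElem]) -> str:
    """Label an error by the last path element only (the approach questioned above)."""
    return str(path[-1]) if path else ""


def full_path_label(path: List[PathElem]) -> str:
    """Label an error by its whole path, which stays unique per location."""
    return "/".join(str(p) for p in path)


# Hypothetical error paths, made up for illustration.
errors: List[List[PathElem]] = [
    ["library_spec", 0, "name"],   # a name error on one object
    ["sequence_spec", 2, "name"],  # a name error on a different object
    ["sequence_spec", 1],          # an error on an array element
]

for path in errors:
    print(last_element_label(path), "->", full_path_label(path))
```

With the last element alone, the first two errors share a label and the third is a bare index. Filtering on the full path would avoid both problems, at the cost of filters that depend on array positions.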

@sbooeshaghi
Collaborator

Hi Mingjie, I’ve been thinking a lot about this and I have a few thoughts. The schema changes you proposed are listed below:
• Removed required fields: lib_struct, library_protocol, library_kit, sequence_protocol, sequence_kit.
• Removed all enum lists for library_protocol, library_kit, sequence_protocol, sequence_kit.
• Dropped required keys (protocol_id/kit_id, modality) from protocol/kit object definitions.
• Made md5 in region.onlist optional (was required).
• Removed md5 regex validation.
• Added regex “^IGVF.*” requirement to read.read_id (was any string).
In a sense, these changes are a “slimmed” version of the original spec. After reviewing the seqspec code, I am OK with making sequence_protocol, sequence_kit, library_protocol, and library_kit not required in the schema, since this matches the base Assay class.

Secondly, could you explain the rationale for removing the md5 requirement for the onlist files? Historically, I’ve run into issues where I’ve gotten the wrong onlist file, and having the md5 match the file helps with that. If you’d like, I could implement in seqspec format the ability to autocompute the md5 for the onlist file and populate it in the spec (keeping it required for seqspec check).
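
As a rough sketch of what autocomputing the md5 could look like, assuming the onlist is a local file and the checksum is taken over the bytes as stored on disk (so a gzipped onlist would be hashed in its compressed form): the function name `file_md5` is hypothetical and is not an existing seqspec function.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks.

    Hypothetical helper: hashes the raw bytes on disk, without decompressing.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat for large onlist files. The resulting string could then be written into the md5 field of the onlist entry.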

Lastly, I don’t think it’s a good idea to bake the IGVF regex check into the base schema. If you’d like, I could add a --filter-list parameter that optionally takes a list of function names, like check_igvf_ids, which perform the validation on the Python side. This keeps the base schema as is and allows users to validate against a custom set of validators. What do you think?
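
A minimal sketch of the idea, to make the proposal concrete. It assumes the spec is available as a plain dict with a sequence_spec list whose entries carry a read_id. The registry `VALIDATORS` and the function `run_validators` are hypothetical names, and the --filter-list flag itself is only a proposal at this point.

```python
import re
from typing import Callable, Dict, List

# The pattern proposed for IGVF read IDs in this PR.
IGVF_READ_ID = re.compile(r"^IGVF.*")


def check_igvf_ids(spec: dict) -> List[str]:
    """Return one error message per read whose read_id lacks the IGVF prefix."""
    errors = []
    for read in spec.get("sequence_spec", []):
        read_id = str(read.get("read_id", ""))
        if not IGVF_READ_ID.match(read_id):
            errors.append(f"read_id {read_id!r} does not match ^IGVF.*")
    return errors


# Hypothetical registry mapping the names given on the command line to functions.
VALIDATORS: Dict[str, Callable[[dict], List[str]]] = {
    "check_igvf_ids": check_igvf_ids,
}


def run_validators(spec: dict, names: List[str]) -> List[str]:
    """Run the named validators in order and collect their error messages."""
    errors: List[str] = []
    for name in names:
        if name not in VALIDATORS:
            raise ValueError(f"unknown validator: {name}")
        errors.extend(VALIDATORS[name](spec))
    return errors
```

The base schema check would run first, unchanged, and these validators would only run when requested, so a standard seqspec file stays valid without any IGVF-specific rules.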

3 participants