Update seqspec check for checking files in IGVF portal #74

mingjiecn · 2025-08-28T18:28:18Z

Adding IGVF-related schemas for the different filters. For files in the IGVF portal, some fields are no longer required, and the IGVF schema adds a pattern for read_id.
Make sure versions higher than 0.3.0 still work with seqspec check and seqspec upgrade.

…terlab#58) * update seqspec check * add spec parameter back to check function

* support gzipped yaml file for function load_spec * fix bug in function run_check * support gzipped yaml file for function load_spec

sbooeshaghi · 2025-08-28T19:45:27Z

Hi Mingjie, thank you for the PR. It’s important that seqspec works well for the IGVF consortium. I have a concern about the schema changes you proposed.

The schema defined in the schema folder is meant to describe the structure of any seqspec file. Adding a new schema specific to IGVF introduces a second definition of what a valid seqspec file looks like. This creates ambiguity; should files be validated against the IGVF schema or the standard seqspec schema?

Maintaining two diverging schemas makes future development harder and risks breaking consistency across the tool. For that reason, I think we should avoid adding a schema that is specific to IGVF.

Could you clarify specific attributes about the seqspec file that need to be changed to support IGVF? Maybe there is a possibility of adding additional tooling that allows standard seqspec files to work with the IGVF portal.

mingjiecn · 2025-08-28T22:54:18Z

Hi Sina. For the schema seqspec_igvf_onlist_skip.schema.json, we changed some fields from required to not required. For seqspec_igvf.schema.json, we made the same change and also added a pattern for read_id: https://github.com/pachterlab/seqspec/blob/b1f71df6650220f89def56f7f6c2e97a5945bef5/seqspec/schema/seqspec_igvf.schema.json#L314C12-L314C19.
I feel that having separate schemas is the easiest way to do the schema check if those files really do have different schemas. I can also see a potential issue with using the errobj generated in check_schema to filter errors. error_object is generated from the last item in err_elements. If the last item is a very general name, you cannot tell whether this is the error we want to filter. For example, many objects may have a field called name; if we want to filter the name error for object A and a name error occurs, we cannot tell whether it is the one we want to filter. I also see some errors with a numeric error_object. Those errors are on array elements, and a numeric error_object is not meaningful at all. That is why I want to avoid filtering errors using the error_object generated in check_schema.
Also, instead of only filtering errors, we now need to check the read_id pattern for IGVF. Since that check is already implemented in jsonschema, I feel we should just update the schema to add the pattern and let jsonschema check it, so we don’t need to write extra code to check read_id.
Most of the content of the three schemas is the same, so maybe we can create a base schema and then extend it. Let me know what you think.
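
To make the error_object concern concrete, here is a minimal sketch that compares labeling an error by the last path element with labeling it by the full path. The paths are made up for illustration and are only shaped like the key/index paths a JSON Schema validator reports; how check_schema actually builds err_elements is not shown here, and `last_element` and `full_path` are hypothetical helpers.

```python
from collections import deque

# Hypothetical error locations (keys and array indices), for illustration only.
error_paths = [
    deque(["library_spec", 0, "name"]),
    deque(["sequence_spec", 1, "name"]),
    deque(["sequence_spec", 2]),
]


def last_element(path):
    """Label an error by the final path element only."""
    return str(path[-1])


def full_path(path):
    """Label an error by its complete location in the document."""
    return "/".join(str(part) for part in path)


short_labels = [last_element(p) for p in error_paths]
long_labels = [full_path(p) for p in error_paths]

# Two different "name" fields collapse to one label, and the array
# element is reduced to a bare index.
print(short_labels)
# The full path keeps every error distinguishable.
print(long_labels)
```

With the short labels, a filter on "name" cannot target one object without also matching the other, which is the ambiguity described above.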

sbooeshaghi · 2025-08-29T18:33:13Z

Hi Mingjie, I’ve been thinking a lot about this and I have a few thoughts. The proposed changes to the schema you made are listed below:
• Removed required fields: lib_struct, library_protocol, library_kit, sequence_protocol, sequence_kit.
• Removed all enum lists for library_protocol, library_kit, sequence_protocol, sequence_kit.
• Dropped required keys (protocol_id/kit_id, modality) from protocol/kit object definitions.
• Made md5 in region.onlist optional (was required).
• Removed md5 regex validation.
• Added regex “^IGVF.*” requirement to read.read_id (was any string).
In a sense, these changes are a “slimmed” version of the original spec. After reviewing the seqspec code, I am OK with making sequence_protocol, sequence_kit, library_protocol, and library_kit not required in the schema, since this matches the base Assay class.

Secondly, could you explain the rationale for removing the md5 requirement for the onlist files? Historically I’ve run into issues where I’ve gotten the wrong onlist file, and having the md5 match the file helps with that. If you’d like, I could implement in seqspec format the ability to autocompute the md5 for the onlist file and populate it in the spec (keeping it required for seqspec check).

Lastly, I don’t think it’s a good idea to bake the IGVF regex check into the base schema. If you’d like, I could add a --filter-list parameter that optionally takes a list of function names, like check_igvf_ids, each of which performs its validation on the Python side. This keeps the base schema as is and allows users to validate against a custom set of validators. What do you think?
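
A minimal sketch of how such opt-in validators might be wired up, assuming the spec is available as a plain dict with a sequence_spec list of reads. The `VALIDATORS` registry and `run_validators` are hypothetical names, the body of check_igvf_ids is only a guess at what it would do, and the read_id values are made up.

```python
import re

# Pattern taken from the proposed IGVF schema change.
IGVF_ID = re.compile(r"^IGVF.*")


def check_igvf_ids(spec):
    """Return one error message per read whose read_id lacks the IGVF prefix."""
    errors = []
    for read in spec.get("sequence_spec", []):
        read_id = read.get("read_id", "")
        if not IGVF_ID.match(read_id):
            errors.append(f"read_id {read_id!r} does not match ^IGVF.*")
    return errors


# Hypothetical registry mapping names given on the command line to functions.
VALIDATORS = {"check_igvf_ids": check_igvf_ids}


def run_validators(spec, names):
    """Run only the validators that were requested by name."""
    errors = []
    for name in names:
        if name not in VALIDATORS:
            raise ValueError(f"unknown validator: {name}")
        errors.extend(VALIDATORS[name](spec))
    return errors


# Made-up spec fragment for illustration.
spec = {
    "sequence_spec": [
        {"read_id": "IGVFFI0001AAAA"},
        {"read_id": "R1.fastq.gz"},
    ]
}

igvf_errors = run_validators(spec, ["check_igvf_ids"])
default_errors = run_validators(spec, [])
print(igvf_errors)
print(default_errors)
```

With no names requested, nothing IGVF-specific runs, so standard seqspec files are validated exactly as before.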

mingjiecn and others added 20 commits November 1, 2024 15:42

update schema (pachterlab#52)

2e19173

update file_exsits function to check file url in igvf portal (pachter…

289e7e3

…lab#53)

adding seqspec spec tokenization

2a5df33

allow https for remote onlist (pachterlab#54)

e3a6dea

Update seqspec check so we can run it directly in python script (pach…

8e9554f

…terlab#58) * update seqspec check * add spec parameter back to check function

added python usage to docs

a280567

support gzipped yaml file for function load_spec (pachterlab#60)

c9520b4

* support gzipped yaml file for function load_spec * fix bug in function run_check * support gzipped yaml file for function load_spec

enabled skipping checks with seqspec check

1ea7239

updating seqspec-html to print read info

1e1abed

CHECK-161-onlist (#3)

1b2f345

Merge devel to dev (#7)

676c020

fix version

c62b007

remove print

255c2ca

Merge pull request #8 from IGVF-DACC/CHECK-196-version-fix

0b46fcc

CHECK-207-kb-single (#9)

17b1923

CHECK-214-region-type (#10)

5f1192f

CHECK-219-api-merge(#12)

1a29c3c

CHECK-201-read-id (#11)

f1e094b

add bead_TSO to all schema

5ca2bbe

CHECK-231-region-type (pachterlab#72) (#16)

aff8051

mingjiecn force-pushed the dev branch from b1f71df to aff8051 Compare September 19, 2025 15:46

CHECK-244-random-x (#18)

9fd3b55
