Improve featurizer flexibility #335
base: development
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##    development     #335     +/-   ##
========================================
  Coverage          ?       83.25%
========================================
  Files             ?           71
  Lines             ?         7256
  Branches          ?            0
========================================
  Hits              ?         6041
  Misses            ?         1215
  Partials          ?            0
========================================
```
If it would be possible to pass a string like

My main blocker here is the way hyperparameters are treated in the package, e.g., how could we handle the number of PCs with the PCA featurizer? I just had the idea that we could have a
Yes, honestly, featurizers were a bit of an afterthought. We have a very related problem with models that use the exact same backbone but a different input type, e.g., methylation instead of gene expression. Currently, we just inherit from the first and change the feature loading, which is also a bit ugly. Just thinking out loud: in principle, we could solve the featurization issue by adapting the models with an optional featurizer in the training, maybe something like this (we currently do something similar in a hardcoded way):

```python
cl_featurizers = self.hyperparameters.get("cl_featurizer", None)
drug_featurizers = self.hyperparameters.get("drug_featurizer", None)
for featurizer in cl_featurizers:
    cell_line_input = FEATURIZER_FACTORY[featurizer]().transform_cell_line_features(cell_line_input)
for featurizer in drug_featurizers:
    drug_input = FEATURIZER_FACTORY[featurizer]().transform_drug_features(drug_input)
```

But this would not be a global solution, and it would hyperparameter-tune over the featurizers and only show the best one in the report instead of showing all results for the different featurizers tested. So, actually, I like the more dynamic option. Was your idea to add something like

```python
def featurize(
    self,
    featurizers: List[Featurizers],
    output: DrugResponseDataset,
    cell_line_input: FeatureDataset,
    drug_input: FeatureDataset | None = None,
    is_train: bool = True,  # default added here only to keep the sketch syntactically valid
):
    ...
```

to the DrugResponseModel interface, which gets called before the training? I see a bit of a difficulty for models that use various inputs, e.g., PCA+MultiOmicsNeuralNetwork. Where do we define which input the PCA is applied to? I guess that relates to your hpam question...
I will think about this some more. Great initiative!! I think this is really something that could improve the usability of the library a lot.
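The hardcoded pattern quoted in the comment above could be generalized into a small registry. Here is a minimal sketch: `FEATURIZER_FACTORY` and `transform_cell_line_features` come from the snippet above, while `PCAFeaturizer` is only a pass-through stub and `apply_featurizers` is a hypothetical helper, not actual package code.

```python
# Minimal registry sketch. FEATURIZER_FACTORY and transform_cell_line_features
# follow the discussion above; PCAFeaturizer here is only a stand-in stub.
class PCAFeaturizer:
    def transform_cell_line_features(self, cell_line_input):
        # A real implementation would fit and apply PCA; this stub just copies.
        return dict(cell_line_input)


FEATURIZER_FACTORY = {"pca": PCAFeaturizer}


def apply_featurizers(featurizer_names, cell_line_input):
    """Apply each named featurizer in sequence to the cell line features."""
    for name in featurizer_names:
        featurizer = FEATURIZER_FACTORY[name]()
        cell_line_input = featurizer.transform_cell_line_features(cell_line_input)
    return cell_line_input
```

Chaining through a registry like this avoids the per-loop bookkeeping of the hardcoded version and makes the set of available featurizers a single lookup table.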
Hey, I summarized here what I think would be the best way of approaching this. Let me know what you think:

### Featurizers

### Implementation

### Views

### Models

#### Fixed-view models

#### Flexi-view models

### Model construction

Currently, models can be instantiated by a single string. Fixed-view models would still allow this, but flexi-view models require additional parameterization. The information that is needed for model construction is:

Here are two main strategies that could be used to tackle this.

#### String-based approach

This could look something like this:

#### File-based approach

This could be a JSON, CSV, or TSV file:

Note that the JSON approach would provide users with a lot more flexibility for parameterization; we could also allow users to enable hyperparameter optimization just for a subset of models, decide which hyperparameters should be tuned for each model, or redefine the options for each hyperparameter value.

#### A note on custom models and featurizers

Implementing this could be nice, especially for users of the nextflow pipeline. I would probably do it with an approach similar to this. But this is not a priority for me, as using the Python package provides me with all the flexibility I need.

### Some remaining issues to consider

#### Featurizer flexibility

By making the models flexible and the featurizers 'fixed', the problem of having to define a new class for each combination just moved up one level. We would still need to define a

I think the only alternative would be to make the interface for class creation even more complex; instead of

I think this makes the configuration more complex than necessary, and the added flexibility benefit will not justify it. Also, certain featurizers will work only on certain input omics (e.g., scVI only on transcriptomics), and then we would have to do something like flexi-featurizers and fixed-featurizers again.

#### Model descriptors

Currently, in figures etc., models can be referenced conveniently by the corresponding descriptive string, e.g.,

A potential idea for this would be to allow users to define 'nicknames' for provided model configurations. This could look something like this:
or
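To make the file-based approach above concrete, here is one hypothetical shape such a JSON definition file could take, together with a nickname lookup. Every key, value, and the file name `model_configs.json` are invented for illustration, not the package's actual schema.

```python
import json

# Hypothetical contents of a model-definition file (e.g. model_configs.json).
# All keys and values are illustrative, not an actual drevalpy schema.
MODEL_CONFIGS = json.loads("""
{
  "PCA-NeuralNetwork": {
    "base_model": "SimpleNeuralNetwork",
    "featurizers": {"gene_expression": "pca"},
    "hyperparameters": {"n_components": [10, 50, 100]},
    "tune_hyperparameters": true
  }
}
""")


def resolve_model(nickname):
    """Look up a model configuration by its nickname."""
    if nickname not in MODEL_CONFIGS:
        raise KeyError(f"Unknown model nickname: {nickname!r}")
    return MODEL_CONFIGS[nickname]
```

A per-model `hyperparameters` block like this is what would allow users to restrict tuning to a subset of models or redefine the search space, as suggested above.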
Note that the above description does not treat the featurizer selection itself as a hyperparameter.
I like the file-based approach. Do I understand you correctly: we could then provide users with additional models by simply adding a JSON to the codebase, without requiring any further Python code, right? And then users could use the nicknames, and we check whether they exist in model_configs.json.
Well, actually, I was thinking the user would provide the JSON file and the package would then construct the model from it. The nickname would only be used for plots etc. However, I am also fine with what you wrote, e.g., we have an internal JSON file and users select models via nicknames. This basically boils down to who should be allowed to mix and match featurizers and models and decide on the nicknames: if the JSON is package-internal, only the developers can do this; if the JSON file is a command line parameter, users can do this as well. A middle ground would be to provide a package-internal JSON file that can be extended by users through an additional CLI-provided JSON file. Whether these things should be allowed is a question of the package philosophy (user flexibility vs. central curation):
Ah, you basically said the same thing in your last sentence; sorry for not reading properly. Still, I will not delete my comment, because the part about package philosophy is still relevant IMO.
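The middle ground described above, a package-internal JSON file extended by a user-supplied one, could be sketched like this. The function and variable names are hypothetical, not part of the package.

```python
# Hypothetical merge of a package-internal config with a user-provided one.
# User entries extend the internal set; on a name clash, the user entry wins.
def merge_configs(internal_configs, user_configs):
    merged = dict(internal_configs)
    merged.update(user_configs)
    return merged


# Illustrative contents: the internal file ships with the package, the user
# file arrives via a CLI parameter.
internal = {"RandomForest": {"base_model": "RandomForest"}}
user = {"MyModel": {"base_model": "SimpleNeuralNetwork"}}
combined = merge_configs(internal, user)
```

Letting the user file override clashes is one possible policy; central curation would instead reject user entries that shadow internal nicknames.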
```python
if not self._fitted or self._scaler is None or self._pca is None:
    raise RuntimeError("PCA model is not fitted. Call generate_embeddings() or fit() first.")
```
```python
if "gene_expression" not in omics_data:
```
If we already decide to make it so general, then the omic should also be a variable, right? So that we don't have to implement a PCAFeaturizer for gene expression, methylation, mutation data, … separately.
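The reviewer's suggestion, an omics-generic PCA featurizer, could look roughly like the sketch below. This is an illustrative NumPy-only version: the class and method names are assumptions, and the real PR code likely wraps scikit-learn plus a scaler, as the `self._scaler`/`self._pca` check above suggests.

```python
import numpy as np


class PCAFeaturizer:
    """Sketch of an omics-generic PCA featurizer: the omics key is a constructor
    parameter, so one class covers gene expression, methylation, mutations, etc.
    Hypothetical; not the actual class from the PR."""

    def __init__(self, omics_key="gene_expression", n_components=2):
        self.omics_key = omics_key
        self.n_components = n_components
        self._mean = None
        self._components = None

    def fit(self, omics_data):
        X = np.asarray(omics_data[self.omics_key], dtype=float)
        self._mean = X.mean(axis=0)
        # PCA via SVD of the centered matrix; rows of vt are principal axes.
        _, _, vt = np.linalg.svd(X - self._mean, full_matrices=False)
        self._components = vt[: self.n_components]
        return self

    def transform(self, omics_data):
        if self._components is None:
            raise RuntimeError("PCA model is not fitted. Call fit() first.")
        X = np.asarray(omics_data[self.omics_key], dtype=float)
        return (X - self._mean) @ self._components.T
```

The same instance then works unchanged on `omics_data["methylation"]` or any other view, which is exactly the generality the comment asks for.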
```python
model_file = data_dir / dataset_name / self._get_model_filename()
```
```python
# Load gene expression data
ge_file = data_dir / dataset_name / "gene_expression.csv"
```
The file name could then be retrieved from OMICS_FILE_MAPPING.
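The suggestion above could be sketched as a lookup table that replaces the hardcoded `"gene_expression.csv"`. The mapping contents and the helper name are illustrative assumptions; only the name `OMICS_FILE_MAPPING` comes from the review comment.

```python
from pathlib import Path

# Hypothetical mapping from omics name to its data file; entries are
# illustrative, not the package's actual file layout.
OMICS_FILE_MAPPING = {
    "gene_expression": "gene_expression.csv",
    "methylation": "methylation.csv",
    "mutations": "mutations.csv",
}


def omics_file(data_dir, dataset_name, omics):
    """Build the data file path for a given omics type instead of hardcoding it."""
    return Path(data_dir) / dataset_name / OMICS_FILE_MAPPING[omics]
```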
@JudithBernett the code does not really reflect anymore what I have been writing in the comments, so let's not focus on it too much.

But you are right: if we do an internal JSON file with model-featurizer combinations and give them nicknames, we could also create a second JSON file (or store both in a single JSON file, it does not matter) where we can create featurizers, apply them to different omics, and give them nicknames. We would end up with something like this:

And then users could provide model names like

Thinking one step further (not relevant for me, but possibly to others), one could add a

I think this concept is in quite good shape now. If you are comfortable with this, I could start a draft implementation soon.
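The two-file idea sketched above, named featurizer configurations plus model configurations that reference them by nickname, could end up looking something like this. The entire schema and every name here are hypothetical.

```python
# Hypothetical definition files: featurizer nicknames bind a featurizer to an
# omics view and its settings; model nicknames bind a base model to featurizers.
FEATURIZER_CONFIGS = {
    "pca50-expression": {"featurizer": "pca", "omics": "gene_expression", "n_components": 50},
    "scvi-expression": {"featurizer": "scvi", "omics": "gene_expression"},
}

MODEL_CONFIGS = {
    "PCA-NeuralNetwork": {
        "base_model": "SimpleNeuralNetwork",
        "featurizers": ["pca50-expression"],
    },
}


def expand_model(nickname):
    """Resolve a model nickname into its base model and featurizer settings."""
    cfg = MODEL_CONFIGS[nickname]
    featurizers = [FEATURIZER_CONFIGS[name] for name in cfg["featurizers"]]
    return cfg["base_model"], featurizers
```

Swapping `"pca50-expression"` for `"scvi-expression"` in a model entry is then all it takes to try a different featurizer variant on the same architecture.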
Hey :) First of all, thanks a lot for these great ideas and all the code you've already written, this is great!!

I like the idea of the definition files. I think it's completely fine in terms of package philosophy. We don't have a leaderboard (yet?), but even if we did, it would be possible to state the configuration there. I think we should definitely supply JSON files for all the configurations that we have so far, but we should keep the user flexibility to encourage people to develop within our framework.

I also think we have too much redundancy already with SimpleNeuralNetwork+ChemBERTaNeuralNetwork and with RandomForest and ProteomicsRandomForest, and it seems like this would be an elegant way to get rid of that. It would then also be easier to try out different inputs for the same general architecture, which is really important for model ablation studies and the initial development phase, where users might think about which input representation they should use in their final model proposal.

I really like this last snippet, where a user could then try out different featurizer variants for the same general model architecture. Go ahead :))
Yes, great, also from my side: feel free to go ahead with this plan! Great ideas!!
Hey, just to let you know: |
Force-pushed from d2a10b8 to cd9d677.
This PR is still somewhat a WIP, but I would love to hear feedback already.
My overall goal is to create embeddings using models that have been trained on large transcriptomics datasets, and then use these as input for the drug response prediction models. The first step here is introducing the concept of featurizers also in the cell line dimension (as it already exists for the drugs). I created a PCA featurizer as an example.
Also, I tried to make the implementation a bit more elegant, to prevent redundant code when a featurizer is used in multiple locations, and to allow creating the embeddings on the fly if they have not been pre-computed.
Other possible methods for producing embeddings include BulkFormer, flow matching models like cellFlow, pre-trained scVI models from the model hub, etc.
Like I said, I am still working on this, but I would still appreciate knowing what you think about the implementation so far.
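The on-the-fly embedding creation mentioned in the description could follow a simple cache-or-compute pattern. The function name and signature here are hypothetical, not the PR's actual API.

```python
from pathlib import Path

import numpy as np


def load_or_generate_embeddings(cache_file, raw_features, generate):
    """Load pre-computed embeddings if a cache file exists; otherwise compute
    them via `generate` (any callable mapping raw features to an array) and
    cache the result for the next run. Hypothetical sketch, not package code."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return np.load(cache_file)
    embeddings = generate(raw_features)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_file, embeddings)
    return embeddings
```

With this shape, expensive featurizers (e.g. a pre-trained scVI model) run once per dataset, while cheap ones can simply skip the cache path.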