
Conversation

@nictru (Collaborator) commented Jan 9, 2026

This PR is still somewhat of a WIP, but I would already love to hear feedback.

My overall goal is to create embeddings using models that have been trained on large transcriptomics datasets, and then use these as input for the drug response prediction models. The first step is to introduce the concept of featurizers in the cell-line dimension as well (it already exists for drugs). I created a PCA featurizer as an example.

I also tried to make the implementation a bit more elegant, to avoid redundant code when a featurizer is used in multiple places, and to allow creating the embeddings on the fly if they have not been pre-computed.
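
To make the idea concrete, here is a minimal sketch of what such a cell-line featurizer could look like, with on-the-fly fitting when no pre-computed embeddings exist. The class names (`CellLineFeaturizer`, `PCACellLineFeaturizer`) and signatures are hypothetical illustrations, not the actual interfaces in this PR:

```python
from abc import ABC, abstractmethod

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


class CellLineFeaturizer(ABC):
    """Turns raw omics matrices (cell lines x features) into embeddings."""

    @abstractmethod
    def fit(self, X: np.ndarray) -> "CellLineFeaturizer":
        ...

    @abstractmethod
    def transform(self, X: np.ndarray) -> np.ndarray:
        ...

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        # On-the-fly embedding when nothing was pre-computed.
        return self.fit(X).transform(X)


class PCACellLineFeaturizer(CellLineFeaturizer):
    def __init__(self, n_components: int = 100):
        self._scaler = StandardScaler()
        self._pca = PCA(n_components=n_components)

    def fit(self, X: np.ndarray) -> "PCACellLineFeaturizer":
        # Standardize features before fitting the PCA.
        self._pca.fit(self._scaler.fit_transform(X))
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        return self._pca.transform(self._scaler.transform(X))
```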

Other possible methods for producing embeddings include BulkFormer, flow matching models like cellFlow, pre-trained scVI models from the model hub, etc.

Like I said, I am still working on this, but I would appreciate hearing what you think about the implementation so far.

@codecov-commenter commented Jan 9, 2026


Codecov Report

❌ Patch coverage is 53.25279% with 503 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (development@587cefd). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| drevalpy/datasets/featurizer/drug/molgnet.py | 24.72% | 271 Missing ⚠️ |
| drevalpy/datasets/featurizer/drug/drug_graph.py | 29.59% | 69 Missing ⚠️ |
| drevalpy/datasets/featurizer/cell_line/base.py | 43.61% | 53 Missing ⚠️ |
| drevalpy/models/drp_model.py | 42.50% | 23 Missing ⚠️ |
| drevalpy/datasets/featurizer/drug/chemberta.py | 63.63% | 20 Missing ⚠️ |
| drevalpy/experiment.py | 9.09% | 20 Missing ⚠️ |
| drevalpy/datasets/featurizer/cell_line/pca.py | 89.21% | 11 Missing ⚠️ |
| tests/test_featurizers.py | 91.30% | 10 Missing ⚠️ |
| drevalpy/datasets/featurizer/drug/base.py | 89.33% | 8 Missing ⚠️ |
| drevalpy/models/MOLIR/utils.py | 0.00% | 5 Missing ⚠️ |

... and 5 more
Additional details and impacted files
@@              Coverage Diff               @@
##             development     #335   +/-   ##
==============================================
  Coverage               ?   83.25%           
==============================================
  Files                  ?       71           
  Lines                  ?     7256           
  Branches               ?        0           
==============================================
  Hits                   ?     6041           
  Misses                 ?     1215           
  Partials               ?        0           


@nictru (Collaborator, Author) commented Jan 9, 2026

If it were possible to pass a string like PCA+SimpleNeuralNetwork to the MODEL_FACTORY and have it create the respective mixed class dynamically, that would be amazing for me, because there are a lot of different embedding-model combinations.

My main blocker here is the way hyperparameters are treated in the package, e.g. how could we handle the number of PCs for the PCA featurizer?

I just had the idea that we could have a hyperparameters.yaml file per featurizer (in addition to the model ones); then we could define the hyperparameter space as the union of the model hyperparameters and the featurizer hyperparameters. I am still thinking about whether there are better ways.
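
As a rough sketch of this union idea (file layout and key names are made up for illustration; this is not how drevalpy currently handles hyperparameters):

```python
import yaml


def combined_hyperparameter_space(model_yaml: str, featurizer_yamls: list[str]) -> dict:
    """Union of the model's space and each featurizer's space."""
    with open(model_yaml) as f:
        space = yaml.safe_load(f)  # e.g. {"dropout": [0.1, 0.3], "lr": [1e-3, 1e-4]}
    for path in featurizer_yamls:
        with open(path) as f:
            featurizer_space = yaml.safe_load(f)  # e.g. {"n_components": [50, 100, 200]}
        # Prefix keys to avoid clashes between model and featurizer hyperparameters.
        space.update({f"featurizer__{k}": v for k, v in featurizer_space.items()})
    return space
```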

@PascalIversen (Collaborator)

Yes, honestly, featurizers were a bit of an afterthought. We have a very related problem with models that use the exact same backbone, but a different input type, e.g., methylation instead of gene expression. Currently, we just inherit from the first and change the feature loading, which is also a bit ugly.

Just thinking out loud: currently, in principle, we could solve the featurization issue by giving the models an optional featurizer step during training, maybe something like this (we already do something similar in a hardcoded way, e.g. with scale_gene_expression):

        cl_featurizers = self.hyperparameters.get("cl_featurizers", [])
        drug_featurizers = self.hyperparameters.get("drug_featurizers", [])
        for featurizer in cl_featurizers:
            cell_line_input = FEATURIZER_FACTORY[featurizer]().transform_cell_line_features(cell_line_input)
        for featurizer in drug_featurizers:
            drug_input = FEATURIZER_FACTORY[featurizer]().transform_drug_features(drug_input)

But this would not be a global solution, and it would tune over the featurizers as hyperparameters, reporting only the best one instead of showing results for all the featurizers tested.

So, actually, I like the more dynamic option. Was your idea to add something like

def featurize(
        self,
        featurizers: list[Featurizer],
        output: DrugResponseDataset,
        cell_line_input: FeatureDataset,
        drug_input: FeatureDataset | None = None,
        is_train: bool = False,
)

to the DrugResponseModel interface, which gets called before the training?

I see a bit of a difficulty for models that use various inputs, e.g. PCA+MultiOmicsNeuralNetwork. Where do we define which input the PCA is applied to? I guess that relates to your hpam question...

> could define the hyperparameter space as the union of the model hyperparameters and the featurizer hyperparameters.

I guess there is a third kind of hpams, related to the model-featurizer combination: e.g., how many PCA dims for methylation vs. how many for gene expression? Apply the PCA to each feature view separately, or one PCA on all feature views?

I will think about this some more. Great initiative!! I think this is really something that could improve the usability of the library a lot.

@nictru (Collaborator, Author) commented Jan 12, 2026

Hey, I have summarized below what I think would be the best way of approaching this. Let me know what you think:

Featurizers

  • A featurizer can take one or more omics views as input and produces a representation for each cell line in its input data
    • Single view: E.g. PCA
    • Multi view: E.g. MultiVI
  • Representation must be generated without seeing the labels (drug screening data)
    • Unsupervised (e.g. PCA)
    • Pre-training/fine-tuning (e.g. scVI model from modelHub that is fine-tuned to the CL transcriptomics data)
    • Zero-shot (e.g. BulkFormer)
  • Featurizers come with their own tuneable hyperparameter space (e.g. number of PCs for PCA)

Implementation

  • A concrete featurizer comes with a defined list of input views
  • We can have an AbstractPCAFeaturizer for example, but concrete classes still need to be defined in the code (e.g. TranscriptomicsPCAFeaturizer)
  • We cannot generate featurizers dynamically
  • Each featurizer can be accessed by a single string (similar to MODEL_FACTORY)
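
For illustration, such a registry could be a simple string-to-class dict mirroring MODEL_FACTORY; the featurizer classes below are placeholder stubs, not real drevalpy classes:

```python
# Placeholder stubs for illustration only.
class TranscriptomicsPCAFeaturizer: ...
class MutationPCAFeaturizer: ...


# String-keyed registry, analogous to MODEL_FACTORY.
FEATURIZER_FACTORY: dict[str, type] = {
    "TranscriptomicsPCA": TranscriptomicsPCAFeaturizer,
    "MutationPCA": MutationPCAFeaturizer,
}

# Every featurizer is reachable via a single string.
featurizer = FEATURIZER_FACTORY["TranscriptomicsPCA"]()
```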

Views

  • A view can be either a raw single-omics view, or a featurizer output

Models

  • Distinction between flexi-view and fixed-view models
  • A model that only takes a single view as input can still be multi-omics, because featurizers can be trained on multiple omics

Fixed-view models

  • Define specific views that they require as input, can be one or many
  • This is pretty much what all models currently are
  • Can be constructed just by name

Flexi-view models

  • Can take either one (single-view) or n (multi-view) views as input
    • Multi-view models come equipped with a way of aggregating views, while single-view models don't
  • Can work with any view in each input field (raw omics or featurizer output)
  • Example: Simple neural networks, random forest, everything sklearn

Model construction

Currently, models can be instantiated from a single string. Fixed-view models would still allow this, but flexi-view models require additional parameterization. The information needed for model construction is:

  • Model class
  • Views to be used (single-view models fail if multiple are provided, and vice versa)
  • Potentially also the drug featurizer to be used (this will always be exactly one)

Here are two main strategies that could be used to tackle this:

String-based approach

This could look something like this:

  • Multi-view: <model_class>(+<drug_featurizer>)+<view_1>+...+<view_n>
  • Single-view: <model_class>(+<drug_featurizer>)+<view>
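
A sketch of how such strings could be parsed; since the '+'-separated format does not by itself distinguish the optional drug featurizer from the first view, this assumes drug featurizers are recognizable via a known set (ChemBERTa and MolGNet are used as illustrative names):

```python
# Illustrative names; a real implementation would consult the drug featurizer registry.
KNOWN_DRUG_FEATURIZERS = {"ChemBERTa", "MolGNet"}


def parse_model_string(spec: str) -> tuple[str, str | None, list[str]]:
    """Split '<model_class>(+<drug_featurizer>)+<view_1>+...+<view_n>' into parts."""
    model_class, *rest = spec.split("+")
    drug_featurizer = None
    if rest and rest[0] in KNOWN_DRUG_FEATURIZERS:
        drug_featurizer = rest.pop(0)
    return model_class, drug_featurizer, rest  # rest = list of views


print(parse_model_string("SimpleNeuralNetwork+ChemBERTa+TranscriptomicsPCA"))
# -> ('SimpleNeuralNetwork', 'ChemBERTa', ['TranscriptomicsPCA'])
```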

File-based approach

This could be a JSON, CSV, or TSV file:

{
    "models": [
        {
            "class": "<model_class>",
            ("drug_featurizer": "<drug_featurizer>",)
            "views": [
                "<view_1>",
                ...,
                "<view_n>"
            ]
        }
    ]
}

Note that the JSON approach would provide users with a lot more flexibility for parameterization; we could also allow users to enable hyperparameter optimization for just a subset of models, decide which hyperparameters should be tuned for each model, or redefine the options for each hyperparameter value.
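
For illustration, a hypothetical loader for such a file could resolve entries against the existing MODEL_FACTORY and a featurizer registry; the keys follow the sketch above, and none of this is drevalpy's current API:

```python
import json


def build_from_config(path: str, model_factory: dict, featurizer_factory: dict) -> list:
    """Turn config entries into (model class, drug featurizer, views) triples."""
    with open(path) as f:
        config = json.load(f)
    built = []
    for entry in config["models"]:
        model_cls = model_factory[entry["class"]]
        views = entry["views"]  # raw omics names or featurizer nicknames
        # drug_featurizer is optional in the schema sketched above.
        drug_feat = featurizer_factory.get(entry.get("drug_featurizer", ""), None)
        built.append((model_cls, drug_feat, views))
    return built
```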

A note on custom models and featurizers

Implementing this could be nice, especially for users of the nextflow pipeline. I would probably do it with an approach similar to this. But this is not a priority for me, as using the python package provides me with all the flexibility I need.

Some remaining issues to consider

Featurizer flexibility

By making the models flexible and the featurizers 'fixed', the problem of having to define a new class for each combination has just moved up one level. We would still need to define a TranscriptomicsPCAFeaturizer, a MutationPCAFeaturizer, etc. separately in drevalpy. This could of course be simplified using an abstract class.

I think the only alternative would be to make the interface for class creation even more complex; instead of <view_1> we would need to provide something like:

{
    "class": "PCAFeaturizer",
    "omics": ["Transcriptomics"]
}

I think this would make the configuration more complex than necessary, and the added flexibility would not justify it. Also, certain featurizers only work on certain input omics (e.g. scVI only on transcriptomics), so we would end up needing something like flexi-featurizers and fixed-featurizers again.

Model descriptors

Currently, in figures etc., models can be referenced conveniently by a descriptive string, e.g. RandomForest or SimpleNeuralNetwork. With increased configuration flexibility, these descriptive strings would grow in complexity (and length).

A potential solution would be to allow users to define 'nicknames' for model configurations. This could look something like this:

PCANeuralNetwork:SimpleNeuralNetwork(+<drug_featurizer>)+TranscriptomicsPCAFeaturizer

or

{
    "models": [
        {
            "name": "PCANeuralNetwork",
            "class": "SimpleNeuralNetwork",
            ("drug_featurizer": "<drug_featurizer>",)
            "views": [
                "TranscriptomicsPCAFeaturizer"
            ]
        }
    ]
}

@nictru (Collaborator, Author) commented Jan 12, 2026

Note that the above description does not treat the featurizer selection itself as a hyperparameter.

@PascalIversen (Collaborator)

I like the file-based approach. Do I understand you correctly: we could then provide users with additional models by simply adding a JSON to the codebase, without requiring any further Python code, right? Users could then use the nicknames, and we would check whether they exist in model_configs.json.
Additionally, users could provide a custom_model_configs.json file and run all the models defined within it.

@nictru (Collaborator, Author) commented Jan 12, 2026

Well, actually, I was thinking the user would provide the JSON file, and the package's MODEL_FACTORY would then take not a string as input but a JSON object like

{
    "models": [
        {
            "name": "PCANeuralNetwork",
            "class": "SimpleNeuralNetwork",
            ("drug_featurizer": "<drug_featurizer>",)
            "views": [
                "TranscriptomicsPCAFeaturizer"
            ]
        }
    ]
}

and then construct the model from this. The nickname would only be used for plots etc.


However, I am also fine with what you wrote, i.e. we have an internal JSON file and users select models via nicknames.

This basically boils down to who should be allowed to mix and match featurizers and models and to decide on the nicknames: if the JSON is package-internal, only the developers can do this; if the JSON file is a command-line parameter, users can do it too.

A middle ground might be to provide a package-internal JSON file that users can extend through an additional CLI-provided JSON file.
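
A sketch of this middle ground, reusing the file names from this discussion (model_configs.json shipped with the package, custom_model_configs.json supplied via the CLI); the function itself is hypothetical:

```python
import json


def load_model_configs(user_config_path: str | None = None) -> dict:
    with open("model_configs.json") as f:  # package-internal, curated
        config = json.load(f)
    if user_config_path is not None:
        with open(user_config_path) as f:  # e.g. custom_model_configs.json via CLI
            user_config = json.load(f)
        # User-defined entries extend the curated list.
        config["models"].extend(user_config.get("models", []))
    return config
```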

Whether these things should be allowed is a question of package philosophy (user flexibility vs. central curation):
If users can create arbitrary models and name them whatever they want, benchmarks may be less comparable than if only curated setups can be used. Of course, people can always fork and override, but that is more work than providing a CLI interface for overriding.

@nictru (Collaborator, Author) commented Jan 12, 2026

> Additionally, users could provide a custom_model_configs.json file and run all the models defined within it.

Ah, you basically said the same thing in your last sentence; sorry for not reading properly. Still, I will not delete my comment, because the part about package philosophy is still relevant IMO.

if not self._fitted or self._scaler is None or self._pca is None:
    raise RuntimeError("PCA model is not fitted. Call generate_embeddings() or fit() first.")

if "gene_expression" not in omics_data:
Contributor

If we already decide to make it this general, then the omic should also be a variable, right? That way we don't have to implement a PCAFeaturizer for gene expression, methylation, mutation data, … separately.
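
A minimal sketch of that suggestion, assuming the featurizer receives a dict of omics matrices; the constructor signature and names are hypothetical, not the ones in this PR:

```python
import numpy as np
from sklearn.decomposition import PCA


class PCAFeaturizer:
    def __init__(self, view: str, n_components: int = 100):
        self.view = view  # e.g. "gene_expression", "methylation", "mutations"
        self._pca = PCA(n_components=n_components)

    def fit_transform(self, omics_data: dict[str, np.ndarray]) -> np.ndarray:
        # One class covers any omics view; the view is just a parameter.
        if self.view not in omics_data:
            raise KeyError(f"View '{self.view}' not found in omics data.")
        return self._pca.fit_transform(omics_data[self.view])
```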

model_file = data_dir / dataset_name / self._get_model_filename()

# Load gene expression data
ge_file = data_dir / dataset_name / "gene_expression.csv"
Contributor

could then be retrieved from OMICS_FILE_MAPPING

@nictru (Collaborator, Author) commented Jan 12, 2026

@JudithBernett the code no longer really reflects what I have been writing in the comments, so let's not focus on it too much.

But you are right: if we have an internal JSON file with model-featurizer combinations and give them nicknames, we could also create a second JSON file (or store both in a single file, it does not matter) in which we define featurizers, apply them to different omics, and give them nicknames.

We would end up with something like this:

{
    "featurizers": [
        {
            "name": "ExpressionPCA",
            "class": "PCAFeaturizer",
            "omics": ["Expression"]
        },
        ...
    ],
    "models": [
        {
            "name": "ExpressionPCANeuralNetwork",
            "class": "SimpleNeuralNetwork",
            "views": ["ExpressionPCA"]
        },
        {
            "name": "ExpressionNeuralNetwork",
            "class": "SimpleNeuralNetwork",
            "views": ["Expression"]
        }
    ]
}

Users could then provide model names like ExpressionPCANeuralNetwork etc. They could also provide a custom JSON file with additional featurizers or models, which we would then append to the corresponding lists. Still, this would only allow creating featurizers and models from classes that exist inside the library.

Thinking one step further (not relevant for me, but possibly for others), one could add a classFile property to the model and featurizer schemas, allowing users to provide their own external classes for constructing models or featurizers, as sketched below.
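
A hypothetical sketch of how such a classFile property could be resolved with importlib; the function and module names are illustrative:

```python
import importlib.util


def load_external_class(class_file: str, class_name: str) -> type:
    """Load a user-supplied class from an external Python file."""
    spec = importlib.util.spec_from_file_location("user_module", class_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the user's file
    return getattr(module, class_name)
```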

I think this concept is in quite good shape now; if you are comfortable with it, I could start on a draft implementation soon.

@JudithBernett (Contributor)

Hey :) First of all, thanks a lot for these great ideas and all the code you've already written; this is great!!

I like the idea of the definition files. I think it's completely fine in terms of package philosophy. We don't have a leaderboard (yet?), but even if we did, it would be possible to state the configuration there. I think we should then definitely supply JSON files for all the configurations we have so far, but we should keep the user flexibility to encourage people to develop within our framework.

I also think we already have too much redundancy with SimpleNeuralNetwork/ChemBERTaNeuralNetwork and with RandomForest/ProteomicsRandomForest, and this seems like an elegant way to get rid of it. It would then also be easier to try out different inputs for the same general architecture, which is really important for model ablation studies and for the initial development phase, where users might be deciding which input representation to use in their final model proposal.

I really like this last snippet, where a user could then try out different featurizer variants for the same general model architecture. Go ahead :))

@JudithBernett linked an issue on Jan 12, 2026 that may be closed by this pull request: Baseline support for Multi-OMICs / different OMICs
@PascalIversen (Collaborator)

Yes, great, also from my side: feel free to go ahead with this plan! Great ideas!!

@nictru (Collaborator, Author) commented Jan 13, 2026

Hey, just to let you know:
While brainstorming how this new architecture could actually be implemented, I realized that doing it properly will require a lot of effort. Since I need to focus on my internship goals and produce some relevant results, I will first concentrate on implementing various methods for producing embeddings and evaluating them.
Once this works well, I will then get back to this PR and see how we can cleanly implement general interfaces for dynamic models.

@nictru force-pushed the tx-featurizer branch 3 times, most recently from d2a10b8 to cd9d677 on January 20, 2026 at 08:32.