Improve featurizer flexibility #335
base: development
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##    development     #335     +/-   ##
========================================
  Coverage          ?       83.25%
========================================
  Files             ?           71
  Lines             ?         7256
  Branches          ?            0
========================================
  Hits              ?         6041
  Misses            ?         1215
  Partials          ?            0
========================================
```
If it would be possible to pass a string like

My main blocker here is the way hyperparameters are treated in the package, e.g., how could we handle the number of PCs with the PCA featurizer? I just had the idea that we could have a
Yes, honestly, featurizers were a bit of an afterthought. We have a very related problem with models that use the exact same backbone but a different input type, e.g., methylation instead of gene expression. Currently, we just inherit from the first and change the feature loading, which is also a bit ugly. Just thinking out loud: in principle, we could solve the featurization issue by adapting the models with an optional featurizer in the training, maybe something like this (we currently do something similar in a hardcoded way):

```python
cl_featurizers = self.hyperparameters.get("cl_featurizer", None)
drug_featurizers = self.hyperparameters.get("drug_featurizer", None)
for featurizer in cl_featurizers:
    cell_line_input = FEATURIZER_FACTORY[featurizer]().transform_cell_line_features(cell_line_input)
for featurizer in drug_featurizers:
    drug_input = FEATURIZER_FACTORY[featurizer]().transform_drug_features(drug_input)
```

But this would not be a global solution, and it would hyperparameter-tune over the featurizers and only show the best one in the report instead of showing all results for the different featurizers tested. So, actually, I like the more dynamic option. Was your idea to add something like

```python
def featurize(
    self,
    featurizers: List[Featurizers],
    output: DrugResponseDataset,
    cell_line_input: FeatureDataset,
    drug_input: FeatureDataset | None = None,
    is_train: bool = True,  # default added here only to keep the sketch syntactically valid
):
    ...
```

to the DrugResponseModel interface, which gets called before the training? I see a bit of a difficulty for models that use various inputs, e.g., PCA+MultiOmicsNeuralNetwork. Where do we define which input the PCA is applied to? I guess that relates to your hpam question...
I will think about this some more. Great initiative!! I think this is really something that could improve the usability of the library a lot.
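The hardcoded pattern quoted in the comment above could be generalized into a small registry. Here is a minimal sketch: `FEATURIZER_FACTORY` and `transform_cell_line_features` come from the snippet above, while `PCAFeaturizer` is only a pass-through stub and `apply_featurizers` is a hypothetical helper, not actual package code.

```python
# Minimal registry sketch. FEATURIZER_FACTORY and transform_cell_line_features
# follow the discussion above; PCAFeaturizer here is only a stand-in stub.
class PCAFeaturizer:
    def transform_cell_line_features(self, cell_line_input):
        # A real implementation would fit and apply PCA; this stub just copies.
        return dict(cell_line_input)


FEATURIZER_FACTORY = {"pca": PCAFeaturizer}


def apply_featurizers(featurizer_names, cell_line_input):
    """Apply each named featurizer in sequence to the cell line features."""
    for name in featurizer_names:
        featurizer = FEATURIZER_FACTORY[name]()
        cell_line_input = featurizer.transform_cell_line_features(cell_line_input)
    return cell_line_input
```

Chaining through a registry like this avoids the per-loop bookkeeping of the hardcoded version and makes the set of available featurizers a single lookup table.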
Hey, I summarized here what I think would be the best way of approaching this. Let me know what you think:

### Featurizers

### Implementation

### Views

### Models

#### Fixed-view models

#### Flexi-view models

### Model construction

Currently, models can be instantiated by a single string. Fixed-view models would still allow this, but flexi-view models require additional parameterization. The information that is needed for model construction is:

Here are two main strategies that could be used to tackle this.

#### String-based approach

This could look something like this:

#### File-based approach

This could be a JSON, CSV, or TSV file:

Note that the JSON approach would provide users with a lot more flexibility for parameterization; we could also allow users to enable hyperparameter optimization just for a subset of models, decide which hyperparameters should be tuned for each model, or redefine the options for each hyperparameter value.

#### A note on custom models and featurizers

Implementing this could be nice, especially for users of the nextflow pipeline. I would probably do it with an approach similar to this. But this is not a priority for me, as using the Python package provides me with all the flexibility I need.

### Some remaining issues to consider

#### Featurizer flexibility

By making the models flexible and the featurizers 'fixed', the problem of having to define a new class for each combination just moved up one level. We would still need to define a

I think the only alternative would be to make the interface for class creation even more complex; instead of

I think this makes the configuration more complex than necessary, and the added flexibility benefit will not justify it. Also, certain featurizers will work only on certain input omics (e.g., scVI only on transcriptomics), and then we would have to do something like flexi-featurizers and fixed-featurizers again.

#### Model descriptors

Currently, in figures etc., models can be referenced conveniently by the corresponding descriptive string, e.g.,

A potential idea for this would be to allow users to define 'nicknames' for provided model configurations. This could look something like this:
or
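To make the file-based approach above concrete, here is one hypothetical shape such a JSON definition file could take, together with a nickname lookup. Every key, value, and the file name `model_configs.json` are invented for illustration, not the package's actual schema.

```python
import json

# Hypothetical contents of a model-definition file (e.g. model_configs.json).
# All keys and values are illustrative, not an actual drevalpy schema.
MODEL_CONFIGS = json.loads("""
{
  "PCA-NeuralNetwork": {
    "base_model": "SimpleNeuralNetwork",
    "featurizers": {"gene_expression": "pca"},
    "hyperparameters": {"n_components": [10, 50, 100]},
    "tune_hyperparameters": true
  }
}
""")


def resolve_model(nickname):
    """Look up a model configuration by its nickname."""
    if nickname not in MODEL_CONFIGS:
        raise KeyError(f"Unknown model nickname: {nickname!r}")
    return MODEL_CONFIGS[nickname]
```

A per-model `hyperparameters` block like this is what would allow users to restrict tuning to a subset of models or redefine the search space, as suggested above.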
Note that the above description does not treat the featurizer selection itself as a hyperparameter.
I like the file-based approach. Do I understand you correctly: we could then provide users with additional models by simply adding a JSON to the codebase, without requiring any further Python code, right? And then users could use the nicknames, and we check whether they exist in model_configs.json.
Well, actually, I was thinking the user would provide the JSON file and the package would then construct the model from it. The nickname would only be used for plots etc. However, I am also fine with what you wrote, e.g., we have an internal JSON file and users select models via nicknames. This basically boils down to who should be allowed to mix and match featurizers and models and decide on the nicknames: if the JSON is package-internal, only the developers can do this; if the JSON file is a command line parameter, users can do this as well. A middle ground would be to provide a package-internal JSON file that can be extended by users through an additional CLI-provided JSON file. Whether these things should be allowed is a question of the package philosophy (user flexibility vs. central curation):
Ah, you basically said the same thing in your last sentence; sorry for not reading properly. Still, I will not delete my comment, because the part about package philosophy is still relevant IMO.
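The middle ground described above, a package-internal JSON file extended by a user-supplied one, could be sketched like this. The function and variable names are hypothetical, not part of the package.

```python
# Hypothetical merge of a package-internal config with a user-provided one.
# User entries extend the internal set; on a name clash, the user entry wins.
def merge_configs(internal_configs, user_configs):
    merged = dict(internal_configs)
    merged.update(user_configs)
    return merged


# Illustrative contents: the internal file ships with the package, the user
# file arrives via a CLI parameter.
internal = {"RandomForest": {"base_model": "RandomForest"}}
user = {"MyModel": {"base_model": "SimpleNeuralNetwork"}}
combined = merge_configs(internal, user)
```

Letting the user file override clashes is one possible policy; central curation would instead reject user entries that shadow internal nicknames.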
```python
if not self._fitted or self._scaler is None or self._pca is None:
    raise RuntimeError("PCA model is not fitted. Call generate_embeddings() or fit() first.")
```
```python
if "gene_expression" not in omics_data:
```
If we already decide to make it so general, then the omic should also be a variable, right? So that we don't have to implement a PCAFeaturizer for gene expression, methylation, mutation data, … separately.
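The reviewer's suggestion, an omics-generic PCA featurizer, could look roughly like the sketch below. This is an illustrative NumPy-only version: the class and method names are assumptions, and the real PR code likely wraps scikit-learn plus a scaler, as the `self._scaler`/`self._pca` check above suggests.

```python
import numpy as np


class PCAFeaturizer:
    """Sketch of an omics-generic PCA featurizer: the omics key is a constructor
    parameter, so one class covers gene expression, methylation, mutations, etc.
    Hypothetical; not the actual class from the PR."""

    def __init__(self, omics_key="gene_expression", n_components=2):
        self.omics_key = omics_key
        self.n_components = n_components
        self._mean = None
        self._components = None

    def fit(self, omics_data):
        X = np.asarray(omics_data[self.omics_key], dtype=float)
        self._mean = X.mean(axis=0)
        # PCA via SVD of the centered matrix; rows of vt are principal axes.
        _, _, vt = np.linalg.svd(X - self._mean, full_matrices=False)
        self._components = vt[: self.n_components]
        return self

    def transform(self, omics_data):
        if self._components is None:
            raise RuntimeError("PCA model is not fitted. Call fit() first.")
        X = np.asarray(omics_data[self.omics_key], dtype=float)
        return (X - self._mean) @ self._components.T
```

The same instance then works unchanged on `omics_data["methylation"]` or any other view, which is exactly the generality the comment asks for.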
```python
model_file = data_dir / dataset_name / self._get_model_filename()
```
```python
# Load gene expression data
ge_file = data_dir / dataset_name / "gene_expression.csv"
```
The file name could then be retrieved from OMICS_FILE_MAPPING.
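The suggestion above could be sketched as a lookup table that replaces the hardcoded `"gene_expression.csv"`. The mapping contents and the helper name are illustrative assumptions; only the name `OMICS_FILE_MAPPING` comes from the review comment.

```python
from pathlib import Path

# Hypothetical mapping from omics name to its data file; entries are
# illustrative, not the package's actual file layout.
OMICS_FILE_MAPPING = {
    "gene_expression": "gene_expression.csv",
    "methylation": "methylation.csv",
    "mutations": "mutations.csv",
}


def omics_file(data_dir, dataset_name, omics):
    """Build the data file path for a given omics type instead of hardcoding it."""
    return Path(data_dir) / dataset_name / OMICS_FILE_MAPPING[omics]
```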
@JudithBernett the code does not really reflect anymore what I have been writing in the comments, so let's not focus on it too much.

But you are right: if we do an internal JSON file with model-featurizer combinations and give them nicknames, we could also create a second JSON file (or store both in a single JSON file, it does not matter) where we can create featurizers, apply them to different omics, and give them nicknames. We would end up with something like this:

And then users could provide model names like

Thinking one step further (not relevant for me, but possibly to others), one could add a

I think this concept is in quite good shape now. If you are comfortable with this, I could start a draft implementation soon.
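The two-file idea sketched above, named featurizer configurations plus model configurations that reference them by nickname, could end up looking something like this. The entire schema and every name here are hypothetical.

```python
# Hypothetical definition files: featurizer nicknames bind a featurizer to an
# omics view and its settings; model nicknames bind a base model to featurizers.
FEATURIZER_CONFIGS = {
    "pca50-expression": {"featurizer": "pca", "omics": "gene_expression", "n_components": 50},
    "scvi-expression": {"featurizer": "scvi", "omics": "gene_expression"},
}

MODEL_CONFIGS = {
    "PCA-NeuralNetwork": {
        "base_model": "SimpleNeuralNetwork",
        "featurizers": ["pca50-expression"],
    },
}


def expand_model(nickname):
    """Resolve a model nickname into its base model and featurizer settings."""
    cfg = MODEL_CONFIGS[nickname]
    featurizers = [FEATURIZER_CONFIGS[name] for name in cfg["featurizers"]]
    return cfg["base_model"], featurizers
```

Swapping `"pca50-expression"` for `"scvi-expression"` in a model entry is then all it takes to try a different featurizer variant on the same architecture.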
Hey :) First of all, thanks a lot for these great ideas and all the code you've already written, this is great!!

I like the idea of the definition files. I think it's completely fine in terms of package philosophy. We don't have a leaderboard (yet?), but even if we did, it would be possible to state the configuration there. I think we should definitely supply JSON files for all the configurations that we have so far, but we should keep the user flexibility to encourage people to develop within our framework.

I also think we have too much redundancy already with SimpleNeuralNetwork+ChemBERTaNeuralNetwork and with RandomForest and ProteomicsRandomForest, and it seems like this would be an elegant way to get rid of that. It would then also be easier to try out different inputs for the same general architecture, which is really important for model ablation studies and the initial development phase, where users might think about which input representation they should use in their final model proposal.

I really like this last snippet, where a user could then try out different featurizer variants for the same general model architecture. Go ahead :))
Yes, great, also from my side: feel free to go ahead with this plan! Great ideas!!
Hey, just to let you know: |
Force-pushed from d2a10b8 to cd9d677.
This PR is still somewhat a WIP, but I would love to hear feedback already.
My overall goal is to create embeddings using models that have been trained on large transcriptomics datasets, and then use these as input for the drug response prediction models. The first step here is introducing the concept of featurizers also in the cell line dimension (as it already exists for the drugs). I created a PCA featurizer as an example.
Also, I tried to make the implementation a bit more elegant, to prevent redundant code when a featurizer is used in multiple locations, and to allow creating the embeddings on the fly if they have not been pre-computed.
Other possible methods for producing embeddings include BulkFormer, flow matching models like cellFlow, pre-trained scVI models from the model hub, etc.
Like I said, I am still working on this, but I would still appreciate knowing what you think about the implementation so far.
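The on-the-fly embedding creation mentioned in the description could follow a simple cache-or-compute pattern. The function name and signature here are hypothetical, not the PR's actual API.

```python
from pathlib import Path

import numpy as np


def load_or_generate_embeddings(cache_file, raw_features, generate):
    """Load pre-computed embeddings if a cache file exists; otherwise compute
    them via `generate` (any callable mapping raw features to an array) and
    cache the result for the next run. Hypothetical sketch, not package code."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return np.load(cache_file)
    embeddings = generate(raw_features)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    np.save(cache_file, embeddings)
    return embeddings
```

With this shape, expensive featurizers (e.g. a pre-trained scVI model) run once per dataset, while cheap ones can simply skip the cache path.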