'BlockBootstrap' sometimes includes samples in neither training nor test set #790

@HannesMK

Describe the bug
If the number of samples is not evenly divisible by the block length, BlockBootstrap drops the leading samples, excluding them from both the training and the test set of every split. When a model is fitted, a warning is issued that "at least one point of training set belongs to every resamplings," which is not necessarily true. Moreover, the fix suggested in the warning, to "increase the number of resamplings," can never resolve this issue.

To Reproduce

import numpy as np
from mapie.regression import TimeSeriesRegressor
from mapie.subsample import BlockBootstrap
from sklearn.ensemble import RandomForestRegressor

random_state = 42

number_of_samples = 11

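# Seed the data generation so the example is fully reproducible
rng = np.random.default_rng(random_state)
X_train = rng.random((number_of_samples, 2))
y_train = rng.random(number_of_samples)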

cross_validation = BlockBootstrap(
    n_resamplings=6,
    n_blocks=3,
    overlapping=False,
    random_state=random_state,
)

train_indices_present_in_every_split = set(np.arange(X_train.shape[0]))  # start with all indices
test_indices_present_across_all_splits = set()  # start with no indices

for train_indices, test_indices in cross_validation.split(X_train):
    # Reduce to indices present in the current training set and all previous training sets
    train_indices_present_in_every_split = train_indices_present_in_every_split.intersection(train_indices)

    # Add indices present in the current test set
    test_indices_present_across_all_splits = test_indices_present_across_all_splits.union(set(test_indices))

    print(f"train indices: {train_indices}, test indices: {test_indices}")

print(f"There are {len(train_indices_present_in_every_split)} indices included in every training set: {train_indices_present_in_every_split}")
print(f"There are {len(test_indices_present_across_all_splits)} indices included across all test sets: {test_indices_present_across_all_splits}")

model = TimeSeriesRegressor(
    estimator=RandomForestRegressor(random_state=random_state),
    method="enbpi",
    cv=cross_validation,
)

model.fit(X_train, y_train)

Output

train indices: [ 8  9 10  2  3  4  8  9 10], test indices: [5 6 7]
train indices: [ 8  9 10  2  3  4  2  3  4], test indices: [5 6 7]
train indices: [ 8  9 10  5  6  7  8  9 10], test indices: [2 3 4]
train indices: [ 8  9 10  8  9 10  8  9 10], test indices: [2 3 4 5 6 7]
train indices: [ 2  3  4  8  9 10  5  6  7], test indices: []
train indices: [2 3 4 5 6 7 5 6 7], test indices: [ 8  9 10]
There are 0 indices included in every training set: set()
There are 9 indices included across all test sets: {2, 3, 4, 5, 6, 7, 8, 9, 10}
~\.venv\lib\site-packages\mapie\utils.py:719: UserWarning:

WARNING: at least one point of training set belongs to every resamplings.
Increase the number of resamplings

~\.venv\lib\site-packages\mapie\aggregation_functions.py:118: RuntimeWarning:

Mean of empty slice

Expected behavior
I would expect all indices that are not part of a given training set to be included in the respective test set. In the example above, I would expect indices 0 and 1 to be part of every test set (as they are part of none of the training sets). Alternatively, I would expect the warning message to accurately describe the issue (i.e., the number of samples is not evenly divisible by the block length, causing some samples to be dropped).
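
The expectation above can be checked mechanically for any split. The following is a minimal sketch, not part of MAPIE: the helper name `orphaned_indices` is hypothetical, and the arrays are copied from the first split in the output above.

```python
import numpy as np


def orphaned_indices(train_indices, test_indices, n_samples):
    """Return the sorted indices that are in neither the training nor the test set."""
    covered = np.union1d(train_indices, test_indices)
    return np.setdiff1d(np.arange(n_samples), covered)


# First split from the output above (number_of_samples = 11)
train = np.array([8, 9, 10, 2, 3, 4, 8, 9, 10])
test = np.array([5, 6, 7])

orphans = orphaned_indices(train, test, n_samples=11)
print(f"indices in neither set: {orphans}")
```

Under the expected behavior, this helper would return an empty array for every split produced by `cross_validation.split(X_train)`.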

MAPIE Version:
1.1.0

Additional context
The issue originates in line 204 of subsample.py, where indices is overwritten with a version that may exclude the first indices:

indices = indices[(n % length):]

My suggested fix would be to keep a copy of the original indices and, in line 221, to sample from these original indices. That way, any indices that are not part of the training set are included in the test set, even if they do not belong to any block. If that sounds like a good idea to you, I'd be happy to submit a pull request.
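
To illustrate the idea, here is a rough standalone sketch of the trimming step and the proposed change. It is not MAPIE's actual implementation: the function `split_once` is hypothetical, the block length `n // n_blocks` is an assumption that is consistent with the output above, and only the line `indices[(n % length):]` is taken from the library.

```python
import numpy as np


def split_once(n, n_blocks, rng):
    """Hypothetical stand-in for one non-overlapping block-bootstrap split."""
    length = n // n_blocks                      # assumed block length
    original_indices = np.arange(n)
    trimmed = original_indices[(n % length):]   # the trimming step quoted above
    blocks = trimmed.reshape(-1, length)        # one row per block

    # Draw blocks with replacement to form the training set.
    chosen = rng.integers(0, blocks.shape[0], size=blocks.shape[0])
    train = blocks[chosen].ravel()

    # Current behavior: the test set is built from the trimmed indices only.
    test_current = np.setdiff1d(trimmed, train)
    # Proposed behavior: the test set is built from the original indices.
    test_proposed = np.setdiff1d(original_indices, train)
    return train, test_current, test_proposed


train, test_current, test_proposed = split_once(
    n=11, n_blocks=3, rng=np.random.default_rng(42)
)
recovered = np.setdiff1d(test_proposed, test_current)
print(f"indices recovered by the proposed change: {recovered}")
```

With 11 samples and 3 blocks, the two test sets differ exactly by indices 0 and 1, whichever blocks are drawn.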
