Describe the bug
If the number of samples is not evenly divisible by the block length, BlockBootstrap drops the first samples, excluding them from both the training and the test sets of every split. When fitting a model, a warning is issued that "at least one point of training set belongs to every resamplings," which is not necessarily true. Moreover, the fix suggested in the warning, to "increase the number of resamplings," will never resolve this issue.
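The truncation can be illustrated with the same arithmetic (a sketch, assuming the block length is computed as n_samples // n_blocks, as in subsample.py):

```python
import numpy as np

# Sketch of the truncation, assuming block length = n_samples // n_blocks
n_samples, n_blocks = 11, 3
length = n_samples // n_blocks            # 3
indices = np.arange(n_samples)            # [0 .. 10]
# The leading remainder is dropped so the blocks tile evenly:
indices = indices[(n_samples % length):]  # drops indices 0 and 1
print(indices)  # [ 2  3  4  5  6  7  8  9 10]
```

With 11 samples and a block length of 3, the remainder is 2, so indices 0 and 1 never belong to any block.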
To Reproduce
import numpy as np
from mapie.regression import TimeSeriesRegressor
from mapie.subsample import BlockBootstrap
from sklearn.ensemble import RandomForestRegressor

random_state = 42
number_of_samples = 11
X_train = np.random.rand(number_of_samples, 2)
y_train = np.random.rand(number_of_samples)

cross_validation = BlockBootstrap(
    n_resamplings=6,
    n_blocks=3,
    overlapping=False,
    random_state=random_state,
)

train_indices_present_in_every_split = set(np.arange(X_train.shape[0]))  # start with all indices
test_indices_present_across_all_splits = set()  # start with no indices
for train_indices, test_indices in cross_validation.split(X_train):
    # Reduce to indices present in the current training set and all previous training sets
    train_indices_present_in_every_split = train_indices_present_in_every_split.intersection(train_indices)
    # Add indices present in the current test set
    test_indices_present_across_all_splits = test_indices_present_across_all_splits.union(set(test_indices))
    print(f"train indices: {train_indices}, test indices: {test_indices}")
print(f"There are {len(train_indices_present_in_every_split)} indices included in every training set: {train_indices_present_in_every_split}")
print(f"There are {len(test_indices_present_across_all_splits)} indices included across all test sets: {test_indices_present_across_all_splits}")

model = TimeSeriesRegressor(
    estimator=RandomForestRegressor(random_state=random_state),
    method="enbpi",
    cv=cross_validation,
)
model.fit(X_train, y_train)

Output
train indices: [ 8 9 10 2 3 4 8 9 10], test indices: [5 6 7]
train indices: [ 8 9 10 2 3 4 2 3 4], test indices: [5 6 7]
train indices: [ 8 9 10 5 6 7 8 9 10], test indices: [2 3 4]
train indices: [ 8 9 10 8 9 10 8 9 10], test indices: [2 3 4 5 6 7]
train indices: [ 2 3 4 8 9 10 5 6 7], test indices: []
train indices: [2 3 4 5 6 7 5 6 7], test indices: [ 8 9 10]
There are 0 indices included in every training set: set()
There are 9 indices included across all test sets: {2, 3, 4, 5, 6, 7, 8, 9, 10}
~\.venv\lib\site-packages\mapie\utils.py:719: UserWarning:
WARNING: at least one point of training set belongs to every resamplings.
Increase the number of resamplings
~\.venv\lib\site-packages\mapie\aggregation_functions.py:118: RuntimeWarning:
Mean of empty slice
Expected behavior
I would expect all indices that are not part of a given training set to be included in the respective test set. In the example above, I would expect indices 0 and 1 to be part of every test set (as they are part of none of the training sets). Alternatively, I would expect the warning message to accurately describe the issue (i.e., number of samples is not neatly divisible by block length, causing some samples to be dropped).
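Concretely, for one of the splits shown above, the expected test set could be computed as the complement of the training draw against all indices (a sketch using NumPy's setdiff1d):

```python
import numpy as np

n_samples = 11
all_indices = np.arange(n_samples)
# Training draw from the first split in the output above
train_indices = np.array([8, 9, 10, 2, 3, 4, 8, 9, 10])
# Expected test set: everything not drawn into training,
# which includes the truncated indices 0 and 1
expected_test = np.setdiff1d(all_indices, train_indices)
print(expected_test)  # [0 1 5 6 7]
```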
MAPIE Version:
1.1.0
Additional context
The issue originates in l. 204 in subsample.py, where indices is overwritten with a version that potentially excludes the first indices:

indices = indices[(n % length):]

My suggested fix would be to keep a copy of the original indices and, in l. 221, to sample from these original indices, i.e., to include any indices that are not part of the training set in the test set, even if they do not belong to any block. If that sounds like a good idea to you, I'd be happy to submit a pull request.
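The proposed fix could look roughly like this (a self-contained sketch, not the actual subsample.py code; the function name is illustrative, and block resampling is simplified to drawing blocks with replacement):

```python
import numpy as np

def split_with_full_test_set(n_samples, n_blocks, n_resamplings, rng):
    """Sketch of the proposed fix: blocks are still built from the
    truncated indices, but the test set is the complement taken
    against the original, untruncated indices."""
    length = n_samples // n_blocks
    original_indices = np.arange(n_samples)            # keep a copy
    indices = original_indices[(n_samples % length):]  # truncated, as today
    blocks = indices.reshape(n_blocks, length)
    for _ in range(n_resamplings):
        picked = rng.integers(0, n_blocks, size=n_blocks)
        train = blocks[picked].flatten()
        # Fix: complement against the *original* indices, so the
        # truncated samples (here 0 and 1) land in every test set
        test = np.setdiff1d(original_indices, train)
        yield train, test

rng = np.random.default_rng(42)
for train, test in split_with_full_test_set(11, 3, 2, rng):
    print(f"train: {train}, test: {test}")
```

With this change, indices 0 and 1 appear in every test set, since they can never be drawn into a training block.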