Describe the bug
If the number of samples is not evenly divisible by the block length, BlockBootstrap drops the first samples, excluding them from both the training and the test sets of every split. When fitting a model, a warning is issued that "at least one point of training set belongs to every resamplings," which is not necessarily true. Moreover, the fix suggested in the warning, to "increase the number of resamplings," will never resolve this issue.
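The truncation can be illustrated with the same arithmetic (a sketch, assuming the block length is computed as n_samples // n_blocks, as in subsample.py):

```python
import numpy as np

# Sketch of the truncation, assuming block length = n_samples // n_blocks
n_samples, n_blocks = 11, 3
length = n_samples // n_blocks            # 3
indices = np.arange(n_samples)            # [0 .. 10]
# The leading remainder is dropped so the blocks tile evenly:
indices = indices[(n_samples % length):]  # drops indices 0 and 1
print(indices)  # [ 2  3  4  5  6  7  8  9 10]
```

With 11 samples and a block length of 3, the remainder is 2, so indices 0 and 1 never belong to any block.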
To Reproduce
import numpy as np
from mapie.regression import TimeSeriesRegressor
from mapie.subsample import BlockBootstrap
from sklearn.ensemble import RandomForestRegressor

random_state = 42
number_of_samples = 11
X_train = np.random.rand(number_of_samples, 2)
y_train = np.random.rand(number_of_samples)

cross_validation = BlockBootstrap(
    n_resamplings=6,
    n_blocks=3,
    overlapping=False,
    random_state=random_state,
)

train_indices_present_in_every_split = set(np.arange(X_train.shape[0]))  # start with all indices
test_indices_present_across_all_splits = set()  # start with no indices
for train_indices, test_indices in cross_validation.split(X_train):
    # Reduce to indices present in the current training set and all previous training sets
    train_indices_present_in_every_split = train_indices_present_in_every_split.intersection(train_indices)
    # Add indices present in the current test set
    test_indices_present_across_all_splits = test_indices_present_across_all_splits.union(set(test_indices))
    print(f"train indices: {train_indices}, test indices: {test_indices}")
print(f"There are {len(train_indices_present_in_every_split)} indices included in every training set: {train_indices_present_in_every_split}")
print(f"There are {len(test_indices_present_across_all_splits)} indices included across all test sets: {test_indices_present_across_all_splits}")

model = TimeSeriesRegressor(
    estimator=RandomForestRegressor(random_state=random_state),
    method="enbpi",
    cv=cross_validation,
)
model.fit(X_train, y_train)

Output
train indices: [ 8 9 10 2 3 4 8 9 10], test indices: [5 6 7]
train indices: [ 8 9 10 2 3 4 2 3 4], test indices: [5 6 7]
train indices: [ 8 9 10 5 6 7 8 9 10], test indices: [2 3 4]
train indices: [ 8 9 10 8 9 10 8 9 10], test indices: [2 3 4 5 6 7]
train indices: [ 2 3 4 8 9 10 5 6 7], test indices: []
train indices: [2 3 4 5 6 7 5 6 7], test indices: [ 8 9 10]
There are 0 indices included in every training set: set()
There are 9 indices included across all test sets: {2, 3, 4, 5, 6, 7, 8, 9, 10}
~\.venv\lib\site-packages\mapie\utils.py:719: UserWarning:
WARNING: at least one point of training set belongs to every resamplings.
Increase the number of resamplings
~\.venv\lib\site-packages\mapie\aggregation_functions.py:118: RuntimeWarning:
Mean of empty slice
Expected behavior
I would expect all indices that are not part of a given training set to be included in the respective test set. In the example above, I would expect indices 0 and 1 to be part of every test set (as they are part of none of the training sets). Alternatively, I would expect the warning message to accurately describe the issue (i.e., number of samples is not neatly divisible by block length, causing some samples to be dropped).
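Concretely, for one of the splits shown above, the expected test set could be computed as the complement of the training draw against all indices (a sketch using NumPy's setdiff1d):

```python
import numpy as np

n_samples = 11
all_indices = np.arange(n_samples)
# Training draw from the first split in the output above
train_indices = np.array([8, 9, 10, 2, 3, 4, 8, 9, 10])
# Expected test set: everything not drawn into training,
# which includes the truncated indices 0 and 1
expected_test = np.setdiff1d(all_indices, train_indices)
print(expected_test)  # [0 1 5 6 7]
```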
MAPIE Version:
1.1.0
Additional context
The issue originates in l. 204 in subsample.py, where indices is overwritten with a version that potentially excludes the first indices:

indices = indices[(n % length):]

My suggested fix would be to keep a copy of the original indices and, in l. 221, to sample from these original indices, i.e., to include any indices that are not part of the training set in the test set, even if they do not belong to any block. If that sounds like a good idea to you, I'd be happy to submit a pull request.
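The proposed fix could look roughly like this (a self-contained sketch, not the actual subsample.py code; the function name is illustrative, and block resampling is simplified to drawing blocks with replacement):

```python
import numpy as np

def split_with_full_test_set(n_samples, n_blocks, n_resamplings, rng):
    """Sketch of the proposed fix: blocks are still built from the
    truncated indices, but the test set is the complement taken
    against the original, untruncated indices."""
    length = n_samples // n_blocks
    original_indices = np.arange(n_samples)            # keep a copy
    indices = original_indices[(n_samples % length):]  # truncated, as today
    blocks = indices.reshape(n_blocks, length)
    for _ in range(n_resamplings):
        picked = rng.integers(0, n_blocks, size=n_blocks)
        train = blocks[picked].flatten()
        # Fix: complement against the *original* indices, so the
        # truncated samples (here 0 and 1) land in every test set
        test = np.setdiff1d(original_indices, train)
        yield train, test

rng = np.random.default_rng(42)
for train, test in split_with_full_test_set(11, 3, 2, rng):
    print(f"train: {train}, test: {test}")
```

With this change, indices 0 and 1 appear in every test set, since they can never be drawn into a training block.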