pip install ts2mlimport pandas as pd
from ts2ml.core import add_missing_slots
from ts2ml.core import transform_ts_data_into_features_and_targetdf = pd.DataFrame({
'pickup_hour': ['2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00', '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00'],
'pickup_location_id': [1, 1, 1, 2, 2, 2],
'rides': [2, 3, 1, 1, 2, 1]
})
df| pickup_hour | pickup_location_id | rides | |
|---|---|---|---|
| 0 | 2022-01-01 00:00:00 | 1 | 2 |
| 1 | 2022-01-01 01:00:00 | 1 | 3 |
| 2 | 2022-01-01 03:00:00 | 1 | 1 |
| 3 | 2022-01-01 01:00:00 | 2 | 1 |
| 4 | 2022-01-01 02:00:00 | 2 | 2 |
| 5 | 2022-01-01 05:00:00 | 2 | 1 |
Let’s fill the missing slots with zeros
df = add_missing_slots(df, datetime_col='pickup_hour', entity_col='pickup_location_id', value_col='rides', freq='H')
df100%|██████████| 2/2 [00:00<00:00, 907.86it/s]
| pickup_hour | pickup_location_id | rides | |
|---|---|---|---|
| 0 | 2022-01-01 00:00:00 | 1 | 2 |
| 1 | 2022-01-01 01:00:00 | 1 | 3 |
| 2 | 2022-01-01 02:00:00 | 1 | 0 |
| 3 | 2022-01-01 03:00:00 | 1 | 1 |
| 4 | 2022-01-01 04:00:00 | 1 | 0 |
| 5 | 2022-01-01 05:00:00 | 1 | 0 |
| 6 | 2022-01-01 00:00:00 | 2 | 0 |
| 7 | 2022-01-01 01:00:00 | 2 | 1 |
| 8 | 2022-01-01 02:00:00 | 2 | 2 |
| 9 | 2022-01-01 03:00:00 | 2 | 0 |
| 10 | 2022-01-01 04:00:00 | 2 | 0 |
| 11 | 2022-01-01 05:00:00 | 2 | 1 |
Now, let’s build features and targets to predict the number of rides for the next hour for each location_id, by using the historical number of rides for the last 3 hours
features, targets = transform_ts_data_into_features_and_target(
df,
n_features=3,
datetime_col='pickup_hour',
entity_col='pickup_location_id',
value_col='rides',
n_targets=1,
step_size=1,
step_name='hour'
)100%|██████████| 2/2 [00:00<00:00, 597.86it/s]
features| rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id | |
|---|---|---|---|---|---|
| 0 | 2.0 | 3.0 | 0.0 | 2022-01-01 03:00:00 | 1 |
| 1 | 3.0 | 0.0 | 1.0 | 2022-01-01 04:00:00 | 1 |
| 2 | 0.0 | 1.0 | 2.0 | 2022-01-01 03:00:00 | 2 |
| 3 | 1.0 | 2.0 | 0.0 | 2022-01-01 04:00:00 | 2 |
targets| target_rides_next_hour | |
|---|---|
| 0 | 1.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |
Xy_df = pd.concat([features, targets], axis=1)
Xy_df| rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id | target_rides_next_hour | |
|---|---|---|---|---|---|---|
| 0 | 2.0 | 3.0 | 0.0 | 2022-01-01 03:00:00 | 1 | 1.0 |
| 1 | 3.0 | 0.0 | 1.0 | 2022-01-01 04:00:00 | 1 | 0.0 |
| 2 | 0.0 | 1.0 | 2.0 | 2022-01-01 03:00:00 | 2 | 0.0 |
| 3 | 1.0 | 2.0 | 0.0 | 2022-01-01 04:00:00 | 2 | 0.0 |
Montly spaced time series
import pandas as pd
import numpy as np
# Generate timestamp index with monthly frequency
date_rng = pd.date_range(start='1/1/2020', end='12/1/2022', freq='MS')
# Create list of city codes
cities = ['FOR', 'SP', 'RJ']
# Create dataframe with random sales data for each city on each month
df = pd.DataFrame({
'date': date_rng,
'city': np.repeat(cities, len(date_rng)//len(cities)),
'sales': np.random.randint(1000, 5000, size=len(date_rng))
})
df| date | city | sales | |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 3435 |
| 2 | 2020-03-01 | FOR | 4543 |
| 3 | 2020-04-01 | FOR | 3879 |
| 4 | 2020-05-01 | FOR | 2601 |
| 5 | 2020-06-01 | FOR | 2922 |
| 6 | 2020-07-01 | FOR | 4542 |
| 7 | 2020-08-01 | FOR | 1338 |
| 8 | 2020-09-01 | FOR | 2938 |
| 9 | 2020-10-01 | FOR | 2695 |
| 10 | 2020-11-01 | FOR | 4065 |
| 11 | 2020-12-01 | FOR | 3864 |
| 12 | 2021-01-01 | SP | 2652 |
| 13 | 2021-02-01 | SP | 2137 |
| 14 | 2021-03-01 | SP | 2663 |
| 15 | 2021-04-01 | SP | 1168 |
| 16 | 2021-05-01 | SP | 4523 |
| 17 | 2021-06-01 | SP | 4135 |
| 18 | 2021-07-01 | SP | 3566 |
| 19 | 2021-08-01 | SP | 2121 |
| 20 | 2021-09-01 | SP | 1070 |
| 21 | 2021-10-01 | SP | 1624 |
| 22 | 2021-11-01 | SP | 3034 |
| 23 | 2021-12-01 | SP | 4063 |
| 24 | 2022-01-01 | RJ | 2297 |
| 25 | 2022-02-01 | RJ | 3430 |
| 26 | 2022-03-01 | RJ | 2903 |
| 27 | 2022-04-01 | RJ | 4197 |
| 28 | 2022-05-01 | RJ | 4141 |
| 29 | 2022-06-01 | RJ | 2899 |
| 30 | 2022-07-01 | RJ | 4529 |
| 31 | 2022-08-01 | RJ | 3612 |
| 32 | 2022-09-01 | RJ | 1856 |
| 33 | 2022-10-01 | RJ | 4804 |
| 34 | 2022-11-01 | RJ | 1764 |
| 35 | 2022-12-01 | RJ | 4425 |
FOR city only have data for 2020 year, RJ only for 2022 and SP only for 2021. Let’s also simulate more missing slots between the years.
# Generate random indices to drop
drop_indices = np.random.choice(df.index, size=int(len(df)*0.2), replace=False)
# Drop selected rows from dataframe
df = df.drop(drop_indices)
df.reset_index(drop=True, inplace=True)
df| date | city | sales | |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 3435 |
| 2 | 2020-03-01 | FOR | 4543 |
| 3 | 2020-04-01 | FOR | 3879 |
| 4 | 2020-05-01 | FOR | 2601 |
| 5 | 2020-06-01 | FOR | 2922 |
| 6 | 2020-07-01 | FOR | 4542 |
| 7 | 2020-08-01 | FOR | 1338 |
| 8 | 2020-09-01 | FOR | 2938 |
| 9 | 2020-11-01 | FOR | 4065 |
| 10 | 2020-12-01 | FOR | 3864 |
| 11 | 2021-01-01 | SP | 2652 |
| 12 | 2021-02-01 | SP | 2137 |
| 13 | 2021-03-01 | SP | 2663 |
| 14 | 2021-07-01 | SP | 3566 |
| 15 | 2021-08-01 | SP | 2121 |
| 16 | 2021-10-01 | SP | 1624 |
| 17 | 2021-11-01 | SP | 3034 |
| 18 | 2021-12-01 | SP | 4063 |
| 19 | 2022-01-01 | RJ | 2297 |
| 20 | 2022-02-01 | RJ | 3430 |
| 21 | 2022-03-01 | RJ | 2903 |
| 22 | 2022-04-01 | RJ | 4197 |
| 23 | 2022-05-01 | RJ | 4141 |
| 24 | 2022-06-01 | RJ | 2899 |
| 25 | 2022-09-01 | RJ | 1856 |
| 26 | 2022-10-01 | RJ | 4804 |
| 27 | 2022-11-01 | RJ | 1764 |
| 28 | 2022-12-01 | RJ | 4425 |
Now lets fill the missing slots with zero values. The function will complete the missing slots with zeros:
df_full = add_missing_slots(df, datetime_col='date', entity_col='city', value_col='sales', freq='MS')
df_full100%|██████████| 3/3 [00:00<00:00, 843.70it/s]
| date | city | sales | |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 3435 |
| 2 | 2020-03-01 | FOR | 4543 |
| 3 | 2020-04-01 | FOR | 3879 |
| 4 | 2020-05-01 | FOR | 2601 |
| ... | ... | ... | ... |
| 103 | 2022-08-01 | RJ | 0 |
| 104 | 2022-09-01 | RJ | 1856 |
| 105 | 2022-10-01 | RJ | 4804 |
| 106 | 2022-11-01 | RJ | 1764 |
| 107 | 2022-12-01 | RJ | 4425 |
108 rows × 3 columns
Let’s build a dataset for training a machine learning model to predict the sales for the next 3 months, for each city, based on historical data of sales for the previous 6 months.
features, targets = transform_ts_data_into_features_and_target(
df_full,
n_features=3,
datetime_col='date',
entity_col='city',
value_col='sales',
n_targets=1,
step_size=1,
step_name='month'
)100%|██████████| 3/3 [00:00<00:00, 205.58it/s]
pd.concat([features, targets], axis=1)| sales_previous_3_month | sales_previous_2_month | sales_previous_1_month | date | city | target_sales_next_month | |
|---|---|---|---|---|---|---|
| 0 | 4944.0 | 3435.0 | 4543.0 | 2020-04-01 | FOR | 3879.0 |
| 1 | 3435.0 | 4543.0 | 3879.0 | 2020-05-01 | FOR | 2601.0 |
| 2 | 4543.0 | 3879.0 | 2601.0 | 2020-06-01 | FOR | 2922.0 |
| 3 | 3879.0 | 2601.0 | 2922.0 | 2020-07-01 | FOR | 4542.0 |
| 4 | 2601.0 | 2922.0 | 4542.0 | 2020-08-01 | FOR | 1338.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 91 | 4197.0 | 4141.0 | 2899.0 | 2022-07-01 | RJ | 0.0 |
| 92 | 4141.0 | 2899.0 | 0.0 | 2022-08-01 | RJ | 0.0 |
| 93 | 2899.0 | 0.0 | 0.0 | 2022-09-01 | RJ | 1856.0 |
| 94 | 0.0 | 0.0 | 1856.0 | 2022-10-01 | RJ | 4804.0 |
| 95 | 0.0 | 1856.0 | 4804.0 | 2022-11-01 | RJ | 1764.0 |
96 rows × 6 columns
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformeradd_missing_slots_transformer = FunctionTransformer(
add_missing_slots,
kw_args={
'datetime_col': 'date',
'entity_col': 'city',
'value_col': 'sales',
'freq': 'MS'
}
)
transform_ts_data_into_features_and_target_transformer = FunctionTransformer(
transform_ts_data_into_features_and_target,
kw_args={
'n_features': 3,
'datetime_col': 'date',
'entity_col': 'city',
'value_col': 'sales',
'n_targets': 1,
'step_size': 1,
'step_name': 'month',
'concat_Xy': True
}
)ts_data_to_features_and_target_pipeline = make_pipeline(
add_missing_slots_transformer,
transform_ts_data_into_features_and_target_transformer
)
ts_data_to_features_and_target_pipelinePipeline(steps=[('functiontransformer-1',
FunctionTransformer(func=<function add_missing_slots at 0x11f8f49d0>,
kw_args={'datetime_col': 'date',
'entity_col': 'city',
'freq': 'MS',
'value_col': 'sales'})),
('functiontransformer-2',
FunctionTransformer(func=<function transform_ts_data_into_features_and_target at 0x11f925ca0>,
kw_args={'concat_Xy': True,
'datetime_col': 'date',
'entity_col': 'city',
'n_features': 3, 'n_targets': 1,
'step_name': 'month',
'step_size': 1,
'value_col': 'sales'}))])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class="sk-container" hidden><div class="sk-item sk-dashed-wrapped"><div class="sk-label-container"><div class="sk-label sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-7" type="checkbox" ><label for="sk-estimator-id-7" class="sk-toggleable__label sk-toggleable__label-arrow">Pipeline</label><div class="sk-toggleable__content"><pre>Pipeline(steps=[('functiontransformer-1',
FunctionTransformer(func=<function add_missing_slots at 0x11f8f49d0>,
kw_args={'datetime_col': 'date',
'entity_col': 'city',
'freq': 'MS',
'value_col': 'sales'})),
('functiontransformer-2',
FunctionTransformer(func=<function transform_ts_data_into_features_and_target at 0x11f925ca0>,
kw_args={'concat_Xy': True,
'datetime_col': 'date',
'entity_col': 'city',
'n_features': 3, 'n_targets': 1,
'step_name': 'month',
'step_size': 1,
'value_col': 'sales'}))])</pre></div></div></div><div class="sk-serial"><div class="sk-item"><div class="sk-estimator sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-8" type="checkbox" ><label for="sk-estimator-id-8" class="sk-toggleable__label sk-toggleable__label-arrow">FunctionTransformer</label><div class="sk-toggleable__content"><pre>FunctionTransformer(func=<function add_missing_slots at 0x11f8f49d0>,
kw_args={'datetime_col': 'date', 'entity_col': 'city',
'freq': 'MS', 'value_col': 'sales'})</pre></div></div></div><div class="sk-item"><div class="sk-estimator sk-toggleable"><input class="sk-toggleable__control sk-hidden--visually" id="sk-estimator-id-9" type="checkbox" ><label for="sk-estimator-id-9" class="sk-toggleable__label sk-toggleable__label-arrow">FunctionTransformer</label><div class="sk-toggleable__content"><pre>FunctionTransformer(func=<function transform_ts_data_into_features_and_target at 0x11f925ca0>,
kw_args={'concat_Xy': True, 'datetime_col': 'date',
'entity_col': 'city', 'n_features': 3,
'n_targets': 1, 'step_name': 'month',
'step_size': 1, 'value_col': 'sales'})</pre></div></div></div></div></div></div></div>
Xy_df = ts_data_to_features_and_target_pipeline.fit_transform(df)
Xy_df
100%|██████████| 3/3 [00:00<00:00, 715.47it/s]
100%|██████████| 3/3 [00:00<00:00, 184.12it/s]
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
sales_previous_3_month
sales_previous_2_month
sales_previous_1_month
date
city
target_sales_next_month
0
4944.0
3435.0
4543.0
2020-04-01
FOR
3879.0
1
3435.0
4543.0
3879.0
2020-05-01
FOR
2601.0
2
4543.0
3879.0
2601.0
2020-06-01
FOR
2922.0
3
3879.0
2601.0
2922.0
2020-07-01
FOR
4542.0
4
2601.0
2922.0
4542.0
2020-08-01
FOR
1338.0
...
...
...
...
...
...
...
91
4197.0
4141.0
2899.0
2022-07-01
RJ
0.0
92
4141.0
2899.0
0.0
2022-08-01
RJ
0.0
93
2899.0
0.0
0.0
2022-09-01
RJ
1856.0
94
0.0
0.0
1856.0
2022-10-01
RJ
4804.0
95
0.0
1856.0
4804.0
2022-11-01
RJ
1764.0
96 rows × 6 columns