From c07f93853f19084503d0e5228ea150743cf1ad39 Mon Sep 17 00:00:00 2001
From: Xiaoyong Zhu
Date: Wed, 10 Aug 2022 00:25:23 -0700
Subject: [PATCH 1/3] Update materializing-features.md

---
 docs/concepts/materializing-features.md | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/docs/concepts/materializing-features.md b/docs/concepts/materializing-features.md
index 20cfe0a97..c1f717361 100644
--- a/docs/concepts/materializing-features.md
+++ b/docs/concepts/materializing-features.md
@@ -33,7 +33,7 @@ In the above example, we define a Redis table called `nycTaxiDemoFeature` and ma

 ## Feature Backfill

-It is also possible to backfill the features for a particular time range, like below. If the `BackfillTime` part is not specified, it's by default to `now()` (i.e. if not specified, it's equivalent to `BackfillTime(start=now, end=now, step=timedelta(days=1))`).
+It is also possible to backfill the features up to a particular time, as shown below. If `BackfillTime` is not specified, it defaults to `now()` (i.e. it is equivalent to `BackfillTime(start=now, end=now, step=timedelta(days=1))`).

 ```python
 client = FeathrClient()
@@ -46,9 +46,28 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
 client.materialize_features(settings)
 ```

-Note that if you don't have features available in `now`, you'd better specify a `BackfillTime` range where you have features.
+Feathr will submit a materialization job for each step for performance reasons. For example, if you have `BackfillTime(start=datetime(2022, 2, 1), end=datetime(2022, 2, 20), step=timedelta(days=1))`, Feathr will submit 20 jobs to run in parallel for maximum performance.

-Also, Feathr will submit a materialization job for each of the step for performance reasons. I.e. if you have `BackfillTime(start=datetime(2022, 2, 1), end=datetime(2022, 2, 20), step=timedelta(days=1))`, Feathr will submit 20 jobs to run in parallel for maximum performance.
+Please note that the `start` and `end` parameters are the cutoff start and end times. For example, we might have a dataset like the one below:
+
+| TrackingID | UserId | Spending | Date       |
+| ---------- | ------ | -------- | ---------- |
+| 1          | 1      | 10       | 2022/05/01 |
+| 2          | 2      | 15       | 2022/05/02 |
+| 3          | 3      | 19       | 2022/05/03 |
+| 4          | 1      | 18       | 2022/05/04 |
+| 5          | 3      | 7        | 2022/05/05 |
+
+If we call the API like this:
+`BackfillTime(start=datetime(2022, 5, 2), end=datetime(2022, 5, 4), step=timedelta(days=1))`
+
+Feathr will trigger 3 jobs:
+
+- job 1 will backfill all data up to 2022/05/02 (so features using data from 2022/05/01 will also be materialized)
+- job 2 will backfill all data up to 2022/05/03 (so features using data from 2022/05/01 and 2022/05/02 will also be materialized)
+- job 3 will backfill all data up to 2022/05/04 (so features using data from 2022/05/01, 2022/05/02, and 2022/05/03 will also be materialized)
+
+This is particularly useful for aggregated features. For example, if there is a feature defined as `user_purchase_in_last_2_days`, this guarantees that all the materialized features come with the right result.

 More reference on the APIs:

From 3792743f07e71563a2718be99de28d98fb73a5e0 Mon Sep 17 00:00:00 2001
From: Xiaoyong Zhu
Date: Wed, 10 Aug 2022 00:37:48 -0700
Subject: [PATCH 2/3] update doc

---
 docs/how-to-guides/streaming-source-ingestion.md             | 4 ++++
 feathr_project/feathr/definition/materialization_settings.py | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/how-to-guides/streaming-source-ingestion.md b/docs/how-to-guides/streaming-source-ingestion.md
index e197e1450..d82c4f5da 100644
--- a/docs/how-to-guides/streaming-source-ingestion.md
+++ b/docs/how-to-guides/streaming-source-ingestion.md
@@ -88,3 +88,7 @@ res = client.multi_get_online_features('kafkaSampleDemoFeature', ['1', '2'], ['f
 ```

 You can also refer to the [test case](../../feathr_project/test/test_azure_kafka_e2e.py) for more details.
+
+## Kafka configuration
+
+Please refer to the [Feathr Configuration Doc](./feathr-configuration-and-env.md#kafkasasljaasconfig) for more details on the credentials.
\ No newline at end of file
diff --git a/feathr_project/feathr/definition/materialization_settings.py b/feathr_project/feathr/definition/materialization_settings.py
index c21a76b14..c4e550924 100644
--- a/feathr_project/feathr/definition/materialization_settings.py
+++ b/feathr_project/feathr/definition/materialization_settings.py
@@ -5,7 +5,7 @@


 class BackfillTime:
-    """Time range to materialize/backfill feature data.
+    """Time range to materialize/backfill feature data. Please refer to https://github.com/linkedin/feathr/blob/main/docs/concepts/materializing-features.md#feature-backfill for a more detailed explanation.

     Attributes:
         start: start time of the backfill, inclusive.

From 673d27326100e6a94758b2c1237b41b76fdfa00d Mon Sep 17 00:00:00 2001
From: Xiaoyong Zhu
Date: Wed, 10 Aug 2022 07:28:59 -0700
Subject: [PATCH 3/3] resolve comments

---
 docs/concepts/materializing-features.md                      | 4 +++-
 feathr_project/feathr/definition/materialization_settings.py | 2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/concepts/materializing-features.md b/docs/concepts/materializing-features.md
index c1f717361..a1d714730 100644
--- a/docs/concepts/materializing-features.md
+++ b/docs/concepts/materializing-features.md
@@ -48,7 +48,9 @@ client.materialize_features(settings)

 Feathr will submit a materialization job for each step for performance reasons. For example, if you have `BackfillTime(start=datetime(2022, 2, 1), end=datetime(2022, 2, 20), step=timedelta(days=1))`, Feathr will submit 20 jobs to run in parallel for maximum performance.

-Please note that the `start` and `end` parameters are the cutoff start and end times. For example, we might have a dataset like the one below:
+Please note that the parameters form a closed interval, which means that both the start and end dates will be included in the materialization jobs.
+
+Please also note that the `start` and `end` parameters are the cutoff start and end times. For example, we might have a dataset like the one below:

 | TrackingID | UserId | Spending | Date       |
 | ---------- | ------ | -------- | ---------- |
diff --git a/feathr_project/feathr/definition/materialization_settings.py b/feathr_project/feathr/definition/materialization_settings.py
index c4e550924..fdc62dc5f 100644
--- a/feathr_project/feathr/definition/materialization_settings.py
+++ b/feathr_project/feathr/definition/materialization_settings.py
@@ -5,7 +5,7 @@


 class BackfillTime:
-    """Time range to materialize/backfill feature data. Please refer to https://github.com/linkedin/feathr/blob/main/docs/concepts/materializing-features.md#feature-backfill for a more detailed explanation.
+    """Time range to materialize/backfill feature data. Please refer to https://linkedin.github.io/feathr/concepts/materializing-features.html#feature-backfill for a more detailed explanation.

     Attributes:
         start: start time of the backfill, inclusive.
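
Reviewer note: the cutoff semantics documented in this series can be checked with a small standalone sketch. It does not call Feathr at all. `backfill_cutoffs` is a hypothetical helper written only for illustration, and the rule that a job sees every row dated on or before its cutoff is an assumption about how "till" should be read, not something the patch confirms.

```python
from datetime import datetime, timedelta


def backfill_cutoffs(start, end, step):
    """Hypothetical helper (not part of Feathr): one cutoff per job over the closed interval [start, end]."""
    if step <= timedelta(0):
        raise ValueError("step must be a positive timedelta")
    cutoffs = []
    current = start
    while current <= end:  # closed interval: the end date is included
        cutoffs.append(current)
        current += step
    return cutoffs


# Rows from the example table in the doc: (TrackingID, UserId, Spending, Date).
rows = [
    (1, 1, 10, datetime(2022, 5, 1)),
    (2, 2, 15, datetime(2022, 5, 2)),
    (3, 3, 19, datetime(2022, 5, 3)),
    (4, 1, 18, datetime(2022, 5, 4)),
    (5, 3, 7, datetime(2022, 5, 5)),
]

cutoffs = backfill_cutoffs(datetime(2022, 5, 2), datetime(2022, 5, 4), timedelta(days=1))

for job_number, cutoff in enumerate(cutoffs, start=1):
    # Assumption: a job sees every row dated on or before its cutoff.
    visible = [tracking_id for tracking_id, _, _, date in rows if date <= cutoff]
    print(f"job {job_number}: cutoff {cutoff:%Y/%m/%d}, visible TrackingIDs {visible}")
```

Under these assumptions the sketch yields three cutoffs for the 2022/05/02 to 2022/05/04 example, and twenty for the 2022/02/01 to 2022/02/20 example, which matches the job counts stated in the doc.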