diff --git a/docs/concepts/feature-definition.md b/docs/concepts/feature-definition.md
index 80892334c..bfdc07977 100644
--- a/docs/concepts/feature-definition.md
+++ b/docs/concepts/feature-definition.md
@@ -28,14 +28,14 @@ batch_source = HdfsSource(name="nycTaxiBatchSource",
                           timestamp_format="yyyy-MM-dd HH:mm:ss")
 ```

-See the [Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.source.HdfsSource) to get the details on each input column.
+See the [Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSource) to get the details on each input column.

 ## Step2: Define Anchors and Features

 A feature is called an anchored feature when the feature is directly extracted from the source data, rather than computed on top of other features. The latter case is called derived feature.

-Check [Feature Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.feature.Feature)
-and [Anchor Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.anchor.FeatureAnchor) to see more details.
+Check [Feature Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.Feature)
+and [Anchor Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeatureAnchor) to see more details.

 Here is a sample:

@@ -100,8 +100,7 @@ Feature(name="f_location_max_fare",
                 window="90d"))
 ```

-
-Note that the `agg_func`([API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.aggregation.Aggregation)) should be any of these:
+Note that the `agg_func`([API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.Aggregation)) should be any of these:

 | Aggregation Type | Input Type | Description |
 | --- | --- | --- |
@@ -125,9 +124,9 @@ request_anchor = FeatureAnchor(name="request_features",
 Note that if the data source is from the observation data, the `source` section should be `INPUT_CONTEXT` to indicate the source of those defined anchors.

 ## Step3: Derived Features Section
-Derived features([Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.feature_derivations.DerivedFeature))
-are the features that are computed from other features. They could be computed from anchored features, or other derived features.
+Derived features([Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.DerivedFeature))
+are the features that are computed from other features. They could be computed from anchored features, or other derived features.

 ```python
 f_trip_distance = Feature(name="f_trip_distance",
diff --git a/docs/concepts/feature-generation.md b/docs/concepts/feature-generation.md
index a04ff1719..297a1d6d4 100644
--- a/docs/concepts/feature-generation.md
+++ b/docs/concepts/feature-generation.md
@@ -20,8 +20,9 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
                                    feature_names=["f_location_avg_fare", "f_location_max_fare"])
 client.materialize_features(settings)
 ```
-([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.materialization_settings.MaterializationSettings),
-[RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.sink.RedisSink))
+
+([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings),
+[RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.RedisSink))

 In the above example, we define a Redis table called `nycTaxiDemoFeature` and materialize two features called `f_location_avg_fare` and `f_location_max_fare` to Redis.

@@ -37,8 +38,9 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
                                    backfill_time=backfill_time)
 client.materialize_features(settings)
 ```
-([BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.materialization_settings.BackfillTime),
-[client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.client.FeathrClient.materialize_features))
+
+([BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.BackfillTime),
+[client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.materialize_features))

 ## Consuming the online features

@@ -48,7 +50,8 @@ client.wait_job_to_finish(timeout_sec=600)
 res = client.get_online_features('nycTaxiDemoFeature', '265', [
                                  'f_location_avg_fare', 'f_location_max_fare'])
 ```
-([client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.client.FeathrClient.get_online_features))
+
+([client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_online_features))

 After we finish running the materialization job, we can get the online features by querying the feature name, with the
 corresponding keys. In the example above, we query the online features called `f_location_avg_fare` and
@@ -59,6 +62,7 @@ corresponding keys. In the example above, we query the online features called `f
 This is a useful when the feature transformation is computation intensive and features can be re-used. For example, you have a feature that needs more than 24 hours to compute and the feature can be reused by more than one model training pipeline. In this case, you should consider generate features to offline.

 Here is an API example:
+
 ```python
 client = FeathrClient()
 offlineSink = HdfsSink(output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/")
@@ -68,11 +72,12 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
                                    feature_names=["f_location_avg_fare", "f_location_max_fare"])
 client.materialize_features(settings)
 ```
+
 This will generate features on latest date(assuming it's `2022/05/21`) and output data to the following path:
 `abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2022/05/21`
-
 You can also specify a BackfillTime so the features will be generated for those dates. For example:
+
 ```Python
 backfill_time = BackfillTime(start=datetime(
     2020, 5, 20), end=datetime(2020, 5, 20), step=timedelta(days=1))
@@ -83,8 +88,9 @@ settings = MaterializationSettings("nycTaxiTable",
                                    "f_location_avg_fare", "f_location_max_fare"],
                                    backfill_time=backfill_time)
 ```
-This will generate features only for 2020/05/20 for me and it will be in folder:
+
+This will generate features only for 2020/05/20 for me and it will be in folder:
 `abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20`

-([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.materialization_settings.MaterializationSettings),
-[HdfsSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.sink.HdfsSink))
+([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings),
+[HdfsSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSink))
diff --git a/docs/concepts/feature-join.md b/docs/concepts/feature-join.md
index 0b7bc39a3..ce6c0434a 100644
--- a/docs/concepts/feature-join.md
+++ b/docs/concepts/feature-join.md
@@ -70,8 +70,9 @@ client.get_offline_features(observation_settings=settings,
                             output_path="abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/demo_data/output.avro")
 ```
-([ObservationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.settings.ObservationSettings),
-[client.get_offline_feature API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.client.FeathrClient.get_offline_features))
+
+([ObservationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.ObservationSettings),
+[client.get_offline_feature API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_offline_features))

 After you have defined the features (as described in the [Feature Definition](feature-definition.md)) part, you can define how you want to join them.

diff --git a/docs/dev_guide/update_python_docs.md b/docs/dev_guide/update_python_docs.md
index 690b2379a..910529adc 100644
--- a/docs/dev_guide/update_python_docs.md
+++ b/docs/dev_guide/update_python_docs.md
@@ -119,3 +119,7 @@ So that only this module is accessbile for end users.
 ### Debug and Known Issues

 * `No module named xyz`: Readthedocs need to run the code to generated the docs. So if your dependency is not specified in the docs/requirements.txt, it will fail on this. To fix it, specify the dependency in requirements.txt.
+
+## Update the Documentation Links
+
+If your change will affect the Python doc URL links, please remember to check and update the related links in the `feathr/docs` folder.
diff --git a/feathr_project/feathr/__init__.py b/feathr_project/feathr/__init__.py
index bfa4b0895..ea5363051 100644
--- a/feathr_project/feathr/__init__.py
+++ b/feathr_project/feathr/__init__.py
@@ -45,6 +45,7 @@
     'MaterializationSettings',
     'MonitoringSettings',
     'RedisSink',
+    'HdfsSink',
     'MonitoringSqlSink',
     'FeatureQuery',
     'LookupFeature',
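For context, here is a minimal, illustrative sketch of what the `__all__` addition above enables: `HdfsSink` becomes importable from the top-level `feathr` package, matching the offline materialization flow described in `feature-generation.md`. The output path is a placeholder, and the `sinks=` keyword name is assumed from the Feathr API rather than shown in this diff; treat this as a sketch, not a verified example.

```python
# Illustrative sketch (not part of the diff above); placeholder output path.
from datetime import datetime, timedelta

from feathr import (
    BackfillTime,
    FeathrClient,
    HdfsSink,              # now exported at the package top level by this change
    MaterializationSettings,
)

client = FeathrClient()

# Offline sink writing materialized features to a storage location of your choice.
offline_sink = HdfsSink(output_path="abfss://<container>@<account>.dfs.core.windows.net/materialized_features/")

# Backfill a single day, mirroring the BackfillTime example in feature-generation.md.
backfill_time = BackfillTime(start=datetime(2020, 5, 20),
                             end=datetime(2020, 5, 20),
                             step=timedelta(days=1))

settings = MaterializationSettings("nycTaxiMaterializationJob",
                                   sinks=[offline_sink],
                                   feature_names=["f_location_avg_fare", "f_location_max_fare"],
                                   backfill_time=backfill_time)
client.materialize_features(settings)
```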