13 changes: 6 additions & 7 deletions docs/concepts/feature-definition.md
@@ -28,14 +28,14 @@ batch_source = HdfsSource(name="nycTaxiBatchSource",
timestamp_format="yyyy-MM-dd HH:mm:ss")
```

See the [Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.source.HdfsSource) to get the details on each input column.
See the [Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSource) to get the details on each input column.
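
For reference, a minimal sketch of the kind of `HdfsSource` definition shown (truncated) above; the path and column names here are placeholders, and the parameter names follow the quickstart:

```python
from feathr import HdfsSource

# Batch source for raw NYC taxi data; the path below is a placeholder, not a real location.
batch_source = HdfsSource(name="nycTaxiBatchSource",
                          path="abfss://<container>@<storage-account>.dfs.core.windows.net/sample_data/green_tripdata_2020-04.csv",
                          event_timestamp_column="lpep_dropoff_datetime",
                          timestamp_format="yyyy-MM-dd HH:mm:ss")
```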

## Step2: Define Anchors and Features
A feature is called an anchored feature when it is directly
extracted from the source data rather than computed on top of other features; the latter is called a derived feature.

Check [Feature Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.feature.Feature)
and [Anchor Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.anchor.FeatureAnchor) to see more details.
Check [Feature Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.Feature)
and [Anchor Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeatureAnchor) to see more details.

Here is a sample:

@@ -100,8 +100,7 @@ Feature(name="f_location_max_fare",
window="90d"))
```
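
The aggregation snippet above is truncated by the diff hunk; for context, a fuller sketch of an aggregated feature definition could look like the following (the key definition and the `cast_float` expression are assumed from the quickstart):

```python
from feathr import Feature, TypedKey, ValueType, FLOAT, WindowAggTransformation

# Key used to group the aggregation; the column name and type are illustrative.
location_id = TypedKey(key_column="DOLocationID",
                       key_column_type=ValueType.INT32)

# A 90-day max-fare aggregation keyed by pickup location.
f_location_max_fare = Feature(name="f_location_max_fare",
                              key=location_id,
                              feature_type=FLOAT,
                              transform=WindowAggTransformation(agg_expr="cast_float(fare_amount)",
                                                                agg_func="MAX",
                                                                window="90d"))
```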


Note that the `agg_func`([API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.aggregation.Aggregation)) should be any of these:
Note that the `agg_func`([API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.Aggregation)) should be any of these:

| Aggregation Type | Input Type | Description |
| --- | --- | --- |
@@ -125,9 +124,9 @@ request_anchor = FeatureAnchor(name="request_features",
Note that if the data source is from the observation data, the `source` section should be `INPUT_CONTEXT` to indicate the source of those defined anchors.
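
For illustration, a minimal sketch of such a request-time anchor; the feature and column names are assumed:

```python
from feathr import Feature, FeatureAnchor, FLOAT, INPUT_CONTEXT

# A simple pass-through feature taken directly from a column of the observation data.
f_trip_distance = Feature(name="f_trip_distance",
                          feature_type=FLOAT,
                          transform="trip_distance")

# Features computed from the observation (request) data use INPUT_CONTEXT as the source.
request_anchor = FeatureAnchor(name="request_features",
                               source=INPUT_CONTEXT,
                               features=[f_trip_distance])
```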

## Step3: Derived Features Section
Derived features([Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.feature_derivations.DerivedFeature))
are the features that are computed from other features. They could be computed from anchored features, or other derived features.

Derived features([Python API documentation](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.DerivedFeature))
are the features that are computed from other features. They could be computed from anchored features, or other derived features.
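
The sample below shows the start of such a definition; for completeness, here is a compact, self-contained sketch of a derived feature built from two input features (names, expressions, and parameter names are assumed from the quickstart):

```python
from feathr import DerivedFeature, Feature, FLOAT

# Anchored inputs; the column names in the transforms are illustrative.
f_trip_distance = Feature(name="f_trip_distance", feature_type=FLOAT,
                          transform="trip_distance")
f_trip_time_duration = Feature(name="f_trip_time_duration", feature_type=FLOAT,
                               transform="trip_time_duration")

# A derived feature combines existing features via an expression over their names.
f_trip_time_distance = DerivedFeature(name="f_trip_time_distance",
                                      feature_type=FLOAT,
                                      input_features=[f_trip_distance, f_trip_time_duration],
                                      transform="f_trip_distance * f_trip_time_duration")
```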

```python
f_trip_distance = Feature(name="f_trip_distance",
24 changes: 15 additions & 9 deletions docs/concepts/feature-generation.md
@@ -20,8 +20,9 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
feature_names=["f_location_avg_fare", "f_location_max_fare"])
client.materialize_features(settings)
```
([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.materialization_settings.MaterializationSettings),
[RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.sink.RedisSink))

([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings),
[RedisSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.RedisSink))

In the above example, we define a Redis table called `nycTaxiDemoFeature` and materialize two features called `f_location_avg_fare` and `f_location_max_fare` to Redis.
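
A minimal end-to-end sketch of this flow, including the `RedisSink` construction that the hunk above truncates (names follow the example; the `sinks` parameter is assumed from the quickstart):

```python
from feathr import FeathrClient, MaterializationSettings, RedisSink

client = FeathrClient()

# Sink that writes the materialized feature values into a Redis table.
redisSink = RedisSink(table_name="nycTaxiDemoFeature")

settings = MaterializationSettings("nycTaxiMaterializationJob",
                                   sinks=[redisSink],
                                   feature_names=["f_location_avg_fare", "f_location_max_fare"])
client.materialize_features(settings)
```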

@@ -37,8 +38,9 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
backfill_time=backfill_time)
client.materialize_features(settings)
```
([BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.materialization_settings.BackfillTime),
[client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.client.FeathrClient.materialize_features))

([BackfillTime API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.BackfillTime),
[client.materialize_features() API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.materialize_features))

## Consuming the online features

@@ -48,7 +50,8 @@ client.wait_job_to_finish(timeout_sec=600)
res = client.get_online_features('nycTaxiDemoFeature', '265', [
'f_location_avg_fare', 'f_location_max_fare'])
```
([client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.client.FeathrClient.get_online_features))

([client.get_online_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_online_features))

After we finish running the materialization job, we can get the online features by querying the feature name, with the
corresponding keys. In the example above, we query the online features called `f_location_avg_fare` and
@@ -59,6 +62,7 @@ corresponding keys. In the example above, we query the online features called `f
This is useful when the feature transformation is computationally intensive and the features can be reused. For example, you
have a feature that takes more than 24 hours to compute and can be reused by more than one model training
pipeline. In this case, you should consider generating the features offline. Here is an API example:

```python
client = FeathrClient()
offlineSink = HdfsSink(output_path="abfss://[email protected]/materialize_offline_test_data/")
@@ -68,11 +72,12 @@ settings = MaterializationSettings("nycTaxiMaterializationJob",
feature_names=["f_location_avg_fare", "f_location_max_fare"])
client.materialize_features(settings)
```

This will generate features for the latest date (assuming it is `2022/05/21`) and write the output to the following path:
`abfss://[email protected]/materialize_offline_test_data/df0/daily/2022/05/21`


You can also specify a `BackfillTime` so that features are generated for those dates. For example:

```Python
backfill_time = BackfillTime(start=datetime(
2020, 5, 20), end=datetime(2020, 5, 20), step=timedelta(days=1))
@@ -83,8 +88,9 @@ settings = MaterializationSettings("nycTaxiTable",
"f_location_avg_fare", "f_location_max_fare"],
backfill_time=backfill_time)
```

This will generate features only for 2020/05/20, and the output will be in the following folder:
`abfss://[email protected]/materialize_offline_test_data/df0/daily/2020/05/20`
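
As a quick plain-Python sketch, this is how a backfill range maps to daily output folders (the base path below is a placeholder following the layout above):

```python
from datetime import datetime, timedelta

start, end, step = datetime(2020, 5, 20), datetime(2020, 5, 22), timedelta(days=1)
base = "abfss://<container>@<storage-account>.dfs.core.windows.net/materialize_offline_test_data/df0/daily"

# One folder per day in the backfill range, e.g. .../2020/05/20, .../2020/05/21, .../2020/05/22
day = start
while day <= end:
    print(f"{base}/{day:%Y/%m/%d}")
    day += step
```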

([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.materialization_settings.MaterializationSettings),
[HdfsSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.sink.HdfsSink))
([MaterializationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.MaterializationSettings),
[HdfsSink API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.HdfsSink))
5 changes: 3 additions & 2 deletions docs/concepts/feature-join.md
@@ -70,8 +70,9 @@ client.get_offline_features(observation_settings=settings,
output_path="abfss://[email protected]/demo_data/output.avro")

```
([ObservationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.settings.ObservationSettings),
[client.get_offline_feature API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.client.FeathrClient.get_offline_features))

([ObservationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.ObservationSettings),
[client.get_offline_feature API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_offline_features))

After you have defined the features (as described in the [Feature Definition](feature-definition.md) part), you can define how you want to join them.
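
For illustration, a hedged sketch of what such a join request can look like; the key, paths, and parameter names are assumed from the quickstart and may differ:

```python
from feathr import FeathrClient, FeatureQuery, ObservationSettings, TypedKey, ValueType

client = FeathrClient()

# Key the observation rows are joined on; the column name and type are illustrative.
location_id = TypedKey(key_column="DOLocationID", key_column_type=ValueType.INT32)

# Which features to attach to each observation row.
feature_query = FeatureQuery(feature_list=["f_location_avg_fare", "f_location_max_fare"],
                             key=location_id)

# Where the observation data lives and how its timestamps are parsed; paths are placeholders.
settings = ObservationSettings(
    observation_path="abfss://<container>@<storage-account>.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss")

client.get_offline_features(observation_settings=settings,
                            feature_query=feature_query,
                            output_path="abfss://<container>@<storage-account>.dfs.core.windows.net/demo_data/output.avro")
```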

4 changes: 4 additions & 0 deletions docs/dev_guide/update_python_docs.md
@@ -119,3 +119,7 @@ So that only this module is accessible to end users.
### Debug and Known Issues
* `No module named xyz`: Read the Docs needs to run the code to generate the docs, so if a dependency is not specified
in docs/requirements.txt, the build will fail with this error. To fix it, add the dependency to docs/requirements.txt.

## Update the Documentation Links

If your change affects the Python doc URL links, please remember to check and update the related links in the `feathr/docs` folder.
1 change: 1 addition & 0 deletions feathr_project/feathr/__init__.py
@@ -45,6 +45,7 @@
'MaterializationSettings',
'MonitoringSettings',
'RedisSink',
'HdfsSink',
@blrchen (Collaborator) commented on Jun 21, 2022:

I am curious why the materialize test can pass without this. Does that mean HdfsSink is already included somewhere else?
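
One plausible explanation, sketched generically below (the module names are hypothetical, not Feathr's actual layout): `__all__` only controls what `from feathr import *` exposes, so a test that imports `HdfsSink` directly from its defining module, or as a plain attribute of the package, still passes.

```python
# pkg/__init__.py  (hypothetical package layout)
from pkg.sink import HdfsSink, RedisSink

__all__ = ['RedisSink']          # 'HdfsSink' omitted, as before this PR

# test_materialize.py
from pkg.sink import HdfsSink    # works: bypasses __all__ entirely
from pkg import HdfsSink         # also works: __all__ only governs star imports
from pkg import *                # binds RedisSink but NOT HdfsSink
```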

'MonitoringSqlSink',
'FeatureQuery',
'LookupFeature',