Thanks to visit codestin.com
Credit goes to github.com

Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions docs/concepts/feathr-concepts-for-beginners.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,9 +107,9 @@ client.get_offline_features(observation_settings=settings,

## What is "materialization" in Feathr?

You are very likely to train a machine learning model with the features that you just queried (with `get_offline_features()`). After you have trained a machine learning model, say a fraud detection model, you are likely to put the machine learning model into an online envrionment and do online inference.
You are very likely to train a machine learning model with the features that you just queried (with `get_offline_features()`). After you have trained a machine learning model, say a fraud detection model, you are likely to put the machine learning model into an online environment and do online inference.

In that case, you will need to retrieve the features (for example the user historical spending) in real time, since the fraud detection model is very time sensitive. Usually some key-value store is used for that scenario (for example Redis), and Feathr will help you to materialize features in the online environment for faster inference. That is why you will see something like below, where you specify Redis as the online storage you want to use, and retrieve features from online envrionment using `get_online_features()` from there:
In that case, you will need to retrieve the features (for example the user historical spending) in real time, since the fraud detection model is very time sensitive. Usually some key-value store is used for that scenario (for example Redis), and Feathr will help you to materialize features in the online environment for faster inference. That is why you will see something like below, where you specify Redis as the online storage you want to use, and retrieve features from online environment using `get_online_features()` from there:

```python
redisSink = RedisSink(table_name="nycTaxiDemoFeature")
Expand Down
114 changes: 114 additions & 0 deletions docs/how-to-guides/feathr-configuration-and-env.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
---
layout: default
title: Configuration, environment variables, and store secrets in a secure way
parent: Feathr How-to Guides
---

# Configuration and environment variables in Feathr

Feathr uses a YAML file and a few environment variables to allow end users to have more flexibility. See the example of the following configurations in [this file](https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml).

In that YAML file, it contains the configurations that are used by Feathr. All the configurations can be overwritten by environment variables with concatenation of `__` for different layers of this config file. For example, `feathr_runtime_location` for databricks can be overwritten by setting this environment variable: `SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION`. For example, you can set it in python:

```python
os.environ['SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION'] = "https://azurefeathrstorage.blob.core.windows.net/public/feathr-assembly-LATEST.jar"
```

or in shell environment:

```bash
export SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION=https://azurefeathrstorage.blob.core.windows.net/public/feathr-assembly-LATEST.jar
```

This allows end users to store the configuration in a secure way, say in Kubernetes secrets, key vault, etc. All the configurations available for end users to configure are listed below.


# Default behaviors

Feathr will get the configurations in the following order:

1. If the key is set in the environment variable, Feathr will use the value of that environment variable
2. If it's not set in the environment, then a value is retrieved from the feathr_config.yaml file with the same config key.
3. If it's not available in the feathr_config.yaml file, Feathr will try to reterive the value from a key vault service. Currently only Azure Key Vault is supported.

# A list of environment variables that Feathr uses
| Environment Variable | Description | Required? |
| -------------- | -------------- | -------------- |
| SECRETS__AZURE_KEY_VAULT__NAME | Name of the Azure Key Vault service so that Feathr can get credentials from that service. | Optional |
| AZURE_CLIENT_ID | Client ID for authentication into Azure Services. Read [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.environmentcredential?view=azure-python) for more details. | This is required if you are using Service Principal to login with Feathr. |
| AZURE_TENANT_ID |Client ID for authentication into Azure Services. Read [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.environmentcredential?view=azure-python) for more details. |This is required if you are using Service Principal to login with Feathr. |
| AZURE_CLIENT_SECRET | Client ID for authentication into Azure Services. Read [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.environmentcredential?view=azure-python) for more details. |This is required if you are using Service Principal to login with Feathr. |
| OFFLINE_STORE__ADLS__ADLS_ENABLED | Whether to enable ADLS as offline store or not. |Optional |
| ADLS_ACCOUNT | ADLS account that you connect to. |Required if using ADLS as an offline store. |
| ADLS_KEY | ADLS key that you connect to. |Required if using ADLS as an offline store. |
| OFFLINE_STORE__WASB__WASB_ENABLED | Whether to enable Azure BLOB storage as offline store or not. |
| WASB_ACCOUNT | Azure BLOB Storage account that you connect to.| Required if using Azure BLOB Storage as an offline store. |
| WASB_KEY | Azure BLOB Storage key that you connect to. |Required if using Azure BLOB Storage as an offline store. |
| S3_ACCESS_KEY | AWS S3 access key for the S3 account. |Required if using AWS S3 Storage as an offline store. |
| S3_SECRET_KEY | AWS S3 secret key for the S3 account. |Required if using AWS S3 Storage as an offline store. |
| OFFLINE_STORE__S3__S3_ENABLED | Whether to enable S3 as offline store or not. |Optional |
| OFFLINE_STORE__S3__S3_ENDPOINT | S3 endpoint. If you use S3 endpoint, then you need to provide access key and secret key in the environment variable as well. |Required if using AWS S3 Storage as an offline store. |
| OFFLINE_STORE__JDBC__JDBC_ENABLED | Whether to enable JDBC as offline store or not. |Optional |
| OFFLINE_STORE__JDBC__JDBC_DATABASE | If using JDBC endpoint as offline store, this config specifies the JDBC database to read from. | Required if using JDBC sources as offline store |
| OFFLINE_STORE__JDBC__JDBC_TABLE | If using JDBC endpoint as offline store, this config specifies the JDBC table to read from. Same as `JDBC_TABLE`. |Required if using JDBC sources as offline store |
| JDBC_TABLE | If using JDBC endpoint as offline store, this config specifies the JDBC table to read from |Required if using JDBC sources as offline store |
| JDBC_USER | If using JDBC endpoint as offline store, this config specifies the JDBC user |Required if using JDBC sources as offline store |
| JDBC_PASSWORD | If using JDBC endpoint as offline store, this config specifies the JDBC password |Required if using JDBC sources as offline store |
| KAFKA_SASL_JAAS_CONFIG | If using EventHub as a streaming input source, this configures the KAFKA stream. If using EventHub, read [here](https://github.com/Azure/azure-event-hubs-for-kafka#updating-your-kafka-client-configuration) for how to get this string from the existing string in Azure Portal. | Required if using Kafka/EventHub as streaming source input.|
| PROJECT_CONFIG__PROJECT_NAME | Configures the project name. | Required|
| OFFLINE_STORE__SNOWFLAKE__URL | Configures the Snowflake URL. Usually it's something like `dqllago-ol19457.snowflakecomputing.com`. |Required if using Snowflake as an offline store. |
| OFFLINE_STORE__SNOWFLAKE__USER | Configures the Snowflake user. |Required if using Snowflake as an offline store. |
| OFFLINE_STORE__SNOWFLAKE__ROLE | Configures the Snowflake role. Usually it's something like `ACCOUNTADMIN`. |Required if using Snowflake as an offline store. |
|JDBC_SF_PASSWORD| Configurations for Snowflake password|Required if using Snowflake as an offline store. |
| SPARK_CONFIG__SPARK_CLUSTER | Choice for spark runtime. Currently support: `azure_synapse`, `databricks`. The `databricks` configs will be ignored if `azure_synapse` is set and vice versa. | Required|
| SPARK_CONFIG__SPARK_RESULT_OUTPUT_PARTS | Configure number of parts for the spark output for feature generation job | Required|
| SPARK_CONFIG__AZURE_SYNAPSE__DEV_URL | Dev URL to the synapse cluster. Usually it's something like `https://yourclustername.dev.azuresynapse.net` | Required if using Azure Synapse|
| SPARK_CONFIG__AZURE_SYNAPSE__POOL_NAME | name of the sparkpool that you are going to use |Required if using Azure Synapse|
| SPARK_CONFIG__AZURE_SYNAPSE__WORKSPACE_DIR | A location that Synapse has access to. This workspace dir stores all the required configuration files and the jar resources. All the feature definitions will be uploaded here |Required if using Azure Synapse|
| SPARK_CONFIG__AZURE_SYNAPSE__EXECUTOR_SIZE | Specifies the executor size for the Azure Synapse cluster. Currently the options are `Small`, `Medium`, `Large`. |Required if using Azure Synapse|
| SPARK_CONFIG__AZURE_SYNAPSE__EXECUTOR_NUM | Sepcifies the number of executors for the Azure Synapse cluster |Required if using Azure Synapse|
| SPARK_CONFIG__AZURE_SYNAPSE__FEATHR_RUNTIME_LOCATION | Specifies the Feathr runtime location. Support local paths, path start with `http(s)://`, and paths start with `abfss:/`. If not set, will use the [Feathr package published in Maven](https://search.maven.org/artifact/com.linkedin.feathr/feathr_2.12). |Required if using Azure Synapse|
| SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL | Workspace instance URL for your databricks cluster. Will be something like this: `https://adb-6885802458123232.12.azuredatabricks.net/` |Required if using Databricks|
| SPARK_CONFIG__DATABRICKS__CONFIG_TEMPLATE | Config string including run time information, spark version, machine size, etc. See [below](#sparkconfigdatabricksconfigtemplate) for more details. |Required if using Databricks|
| SPARK_CONFIG__DATABRICKS__WORK_DIR | Workspace dir for storing all the required configuration files and the jar resources. All the feature definitions will be uploaded here. |Required if using Databricks|
| SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION | Feathr runtime location. Support local paths, path start with `http(s)://`, and paths start with `dbfs:/`. If not set, will use the [Feathr package published in Maven](https://search.maven.org/artifact/com.linkedin.feathr/feathr_2.12). |Required if using Databricks|
| ONLINE_STORE__REDIS__HOST | Redis host name to access Redis cluster. |Required if using Redis as online store. |
| ONLINE_STORE__REDIS__PORT | Redis port number to access Redis cluster. |Required if using Redis as online store. |
| ONLINE_STORE__REDIS__SSL_ENABLED | Whether SSL is enabled to access Redis cluster. |Required if using Redis as online store. |
| REDIS_PASSWORD | Password for the Redis cluster. |Required if using Redis as online store. |
| FEATURE_REGISTRY__PURVIEW__PURVIEW_NAME | Configure the name of the purview endpoint. |Required if using Purview as the endpoint. |
| FEATURE_REGISTRY__PURVIEW__DELIMITER | See [here](#featureregistrypurviewdelimiter) for more details. | Required|
| FEATURE_REGISTRY__PURVIEW__TYPE_SYSTEM_INITIALIZATION | Controls whether the type system (think this as the "schema" for the registry) will be initialized or not. Usually this is only required to be set to `True` to initialize schema, and then you can set it to `False` to shorten the initialization time. | Required|

# Explanation for selected configurations
## SPARK_CONFIG__DATABRICKS__CONFIG_TEMPLATE

Essentially it's a compact JSON string represents the important configurations that you can configure for the databricks cluster that you use. There are parts that marked as "FEATHR_FILL_IN" that Feathr will fill in, but all the other parts are customizable.

Essentially, the config template represents what is going to be submitted to a databricks cluster, and you can see the structure of this configuration template by visiting the [Databricks job runs API](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/2.0/jobs#--runs-submit):

The most important and useful part would be the `new_cluster` section. For example, you can change`spark_version`, `node_type_id`, `num_workers`, etc. based on your environment.

```json
{"run_name":"FEATHR_FILL_IN","new_cluster":{"spark_version":"9.1.x-scala2.12","node_type_id":"Standard_D3_v2","num_workers":2,"spark_conf":{"FEATHR_FILL_IN":"FEATHR_FILL_IN"}},"libraries":[{"jar":"FEATHR_FILL_IN"}],"spark_jar_task":{"main_class_name":"FEATHR_FILL_IN","parameters":["FEATHR_FILL_IN"]}}
```

Another use case is to use `instance_pool_id`, where instead of creating the Spark cluster from scratch every time, you can reuse a pool to run the job to make the run time shorter:

```json
{"run_name":"FEATHR_FILL_IN","new_cluster":{"spark_version":"9.1.x-scala2.12","num_workers":2,"spark_conf":{"FEATHR_FILL_IN":"FEATHR_FILL_IN"},"instance_pool_id":"0403-214809-inlet434-pool-l9dj3kwz"},"libraries":[{"jar":"FEATHR_FILL_IN"}],"spark_jar_task":{"main_class_name":"FEATHR_FILL_IN","parameters":["FEATHR_FILL_IN"]}}

```

Other advanced settings includes `idempotency_token` to guarantee the idempotency of job run requests, etc.


## FEATURE_REGISTRY__PURVIEW__DELIMITER
Delimiter indicates that how the project name, feature names etc. are delimited. By default it will be '__'. this is for global reference (mainly for feature sharing). For exmaple, when we setup a project called foo, and we have an anchor called 'taxi_driver' and the feature name is called 'f_daily_trips'. the feature will have a global unique name called 'foo__taxi_driver__f_daily_trips'


# A note on using Azure Key Vault to store credentials

Feathr has native integrations with Azure Key Vault to make it more secure to access resources. However, Azure Key Vault doesn't support the secret name to have underscore `_` in the secret name. Feathr will automatically convert underscore `_` to dash `-`. For example, Feathr will look for `ONLINE-STORE--REDIS--HOST` in Azure Key Vault if the actual environment variable is `ONLINE_STORE__REDIS__HOST`.

Azure Key Vault is not case sensitive, so `online_store__redis__host` and `ONLINE_STORE__REDIS__HOST` will result in the same request to Azure Key Vault and yield the same result.
Loading