Optimize environment variable behavior #333
Merged

18 commits:
- 70a296e optimize environment variable behavior (xiaoyongzhu)
- bbf4892 Create feathr-configuration-and-env.md (xiaoyongzhu)
- 036ea1b Update feathr-configuration-and-env.md (xiaoyongzhu)
- 8b209fc Update feathr-configuration-and-env.md (xiaoyongzhu)
- 2b2d31e update log (xiaoyongzhu)
- 7295c9c update docs (xiaoyongzhu)
- 27a7144 Update feathr-configuration-and-env.md (xiaoyongzhu)
- f70cfac Update feathr-configuration-and-env.md (xiaoyongzhu)
- 4e9ca4b Update feathr-configuration-and-env.md (xiaoyongzhu)
- ba5cf7a Update feathr-configuration-and-env.md (xiaoyongzhu)
- 8546391 Adding tests (xiaoyongzhu)
- b9d3b3b Update test_secrets_read.py (xiaoyongzhu)
- e6e2cba fix test (xiaoyongzhu)
- 9fad4c6 Update test_secrets_read.py (xiaoyongzhu)
- 38aa258 fix test (xiaoyongzhu)
- a58e0fa Update feathr-configuration-and-env.md (xiaoyongzhu)
- befba39 fix comments and test (xiaoyongzhu)
- 4d35b88 fix comments (xiaoyongzhu)
---
layout: default
title: Configuration, environment variables, and store secrets in a secure way
parent: Feathr How-to Guides
---

# Configuration and environment variables in Feathr

Feathr uses a YAML file and a few environment variables to give end users more flexibility. See [this file](https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml) for an example of those configurations.

That YAML file contains the configurations used by Feathr. Any configuration in it can be overwritten by an environment variable whose name concatenates the layers of the config file with `__`. For example, `feathr_runtime_location` for Databricks can be overwritten by setting the environment variable `SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION`. You can set it in Python:

```python
import os

os.environ['SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION'] = "https://azurefeathrstorage.blob.core.windows.net/public/feathr-assembly-LATEST.jar"
```

or in a shell environment:

```bash
export SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION=https://azurefeathrstorage.blob.core.windows.net/public/feathr-assembly-LATEST.jar
```
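
As a minimal sketch (not part of the Feathr API), the naming convention above can be expressed as: join the YAML layers with `__` and upper-case the result. The helper below is hypothetical, for illustration only:

```python
# Hypothetical helper (not Feathr code) illustrating the naming convention.
def to_env_var(*layers: str) -> str:
    """Join YAML config layers with '__' and upper-case the result."""
    return "__".join(layers).upper()

print(to_env_var("spark_config", "databricks", "feathr_runtime_location"))
# Prints: SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION
```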

This allows end users to store the configuration in a secure way, for example in Kubernetes secrets or a key vault. All the configurations available to end users are listed below.

# Default behaviors

Feathr gets each configuration in the following order (a minimal sketch of this lookup follows the list):

1. If the key is set as an environment variable, Feathr uses the value of that environment variable.
2. If it's not set in the environment, the value is retrieved from the feathr_config.yaml file under the same config key.
3. If it's not available in the feathr_config.yaml file either, Feathr tries to retrieve the value from a key vault service. Currently only Azure Key Vault is supported.
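
The sketch below illustrates this lookup order; it is not Feathr's actual implementation. It assumes PyYAML is available and elides the Azure Key Vault step (see the note at the end of this page):

```python
import os

import yaml  # PyYAML, assumed installed

def get_config(key: str, yaml_path: str = "feathr_config.yaml"):
    """Illustrative lookup order; not Feathr's actual code."""
    # 1. An environment variable with the concatenated name wins.
    if key.upper() in os.environ:
        return os.environ[key.upper()]
    # 2. Otherwise walk the YAML layers, e.g. spark_config -> databricks -> ...
    with open(yaml_path) as f:
        node = yaml.safe_load(f)
    for layer in key.lower().split("__"):
        if not isinstance(node, dict) or layer not in node:
            return None  # 3. The real client would fall back to Azure Key Vault here.
        node = node[layer]
    return node
```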

# A list of environment variables that Feathr uses

| Environment Variable | Description | Required? |
| --- | --- | --- |
| SECRETS__AZURE_KEY_VAULT__NAME | Name of the Azure Key Vault service so that Feathr can get credentials from that service. | Optional |
| AZURE_CLIENT_ID | Client ID for authentication into Azure services. Read [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.environmentcredential?view=azure-python) for more details. | Required if you are using a Service Principal to log in with Feathr. |
| AZURE_TENANT_ID | Tenant ID for authentication into Azure services. Read [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.environmentcredential?view=azure-python) for more details. | Required if you are using a Service Principal to log in with Feathr. |
| AZURE_CLIENT_SECRET | Client secret for authentication into Azure services. Read [here](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.environmentcredential?view=azure-python) for more details. | Required if you are using a Service Principal to log in with Feathr. |
| OFFLINE_STORE__ADLS__ADLS_ENABLED | Whether to enable ADLS as an offline store. | Optional |
| ADLS_ACCOUNT | ADLS account that you connect to. | Required if using ADLS as an offline store. |
| ADLS_KEY | ADLS key that you connect with. | Required if using ADLS as an offline store. |
| OFFLINE_STORE__WASB__WASB_ENABLED | Whether to enable Azure Blob Storage as an offline store. | Optional |
| WASB_ACCOUNT | Azure Blob Storage account that you connect to. | Required if using Azure Blob Storage as an offline store. |
| WASB_KEY | Azure Blob Storage key that you connect with. | Required if using Azure Blob Storage as an offline store. |
| S3_ACCESS_KEY | AWS S3 access key for the S3 account. | Required if using AWS S3 as an offline store. |
| S3_SECRET_KEY | AWS S3 secret key for the S3 account. | Required if using AWS S3 as an offline store. |
| OFFLINE_STORE__S3__S3_ENABLED | Whether to enable S3 as an offline store. | Optional |
| OFFLINE_STORE__S3__S3_ENDPOINT | S3 endpoint. If you use an S3 endpoint, you also need to provide the access key and secret key as environment variables. | Required if using AWS S3 as an offline store. |
| OFFLINE_STORE__JDBC__JDBC_ENABLED | Whether to enable JDBC as an offline store. | Optional |
| OFFLINE_STORE__JDBC__JDBC_DATABASE | If using a JDBC endpoint as an offline store, this config specifies the JDBC database to read from. | Required if using JDBC sources as an offline store. |
| OFFLINE_STORE__JDBC__JDBC_TABLE | If using a JDBC endpoint as an offline store, this config specifies the JDBC table to read from. Same as `JDBC_TABLE`. | Required if using JDBC sources as an offline store. |
| JDBC_TABLE | If using a JDBC endpoint as an offline store, this config specifies the JDBC table to read from. | Required if using JDBC sources as an offline store. |
| JDBC_USER | If using a JDBC endpoint as an offline store, this config specifies the JDBC user. | Required if using JDBC sources as an offline store. |
| JDBC_PASSWORD | If using a JDBC endpoint as an offline store, this config specifies the JDBC password. | Required if using JDBC sources as an offline store. |
| KAFKA_SASL_JAAS_CONFIG | If using EventHub as a streaming input source, this configures the Kafka stream. If using EventHub, read [here](https://github.com/Azure/azure-event-hubs-for-kafka#updating-your-kafka-client-configuration) for how to get this string from the existing string in the Azure Portal. | Required if using Kafka/EventHub as a streaming source input. |
| PROJECT_CONFIG__PROJECT_NAME | Configures the project name. | Required |
| OFFLINE_STORE__SNOWFLAKE__URL | Configures the Snowflake URL. Usually it's something like `dqllago-ol19457.snowflakecomputing.com`. | Required if using Snowflake as an offline store. |
| OFFLINE_STORE__SNOWFLAKE__USER | Configures the Snowflake user. | Required if using Snowflake as an offline store. |
| OFFLINE_STORE__SNOWFLAKE__ROLE | Configures the Snowflake role. Usually it's something like `ACCOUNTADMIN`. | Required if using Snowflake as an offline store. |
| JDBC_SF_PASSWORD | Password for the Snowflake account. | Required if using Snowflake as an offline store. |
| SPARK_CONFIG__SPARK_CLUSTER | Choice of Spark runtime. Currently supported: `azure_synapse`, `databricks`. The `databricks` configs are ignored if `azure_synapse` is set, and vice versa. | Required |
| SPARK_CONFIG__SPARK_RESULT_OUTPUT_PARTS | Configures the number of parts of the Spark output for the feature generation job. | Required |
| SPARK_CONFIG__AZURE_SYNAPSE__DEV_URL | Dev URL of the Synapse cluster. Usually it's something like `https://yourclustername.dev.azuresynapse.net`. | Required if using Azure Synapse. |
| SPARK_CONFIG__AZURE_SYNAPSE__POOL_NAME | Name of the Spark pool that you are going to use. | Required if using Azure Synapse. |
| SPARK_CONFIG__AZURE_SYNAPSE__WORKSPACE_DIR | A location that Synapse has access to. This workspace dir stores all the required configuration files and the jar resources. All the feature definitions will be uploaded here. | Required if using Azure Synapse. |
| SPARK_CONFIG__AZURE_SYNAPSE__EXECUTOR_SIZE | Specifies the executor size for the Azure Synapse cluster. Currently the options are `Small`, `Medium`, and `Large`. | Required if using Azure Synapse. |
| SPARK_CONFIG__AZURE_SYNAPSE__EXECUTOR_NUM | Specifies the number of executors for the Azure Synapse cluster. | Required if using Azure Synapse. |
| SPARK_CONFIG__AZURE_SYNAPSE__FEATHR_RUNTIME_LOCATION | Specifies the Feathr runtime location. Supports local paths, paths starting with `http(s)://`, and paths starting with `abfss://`. If not set, the [Feathr package published in Maven](https://search.maven.org/artifact/com.linkedin.feathr/feathr_2.12) is used. | Required if using Azure Synapse. |
| SPARK_CONFIG__DATABRICKS__WORKSPACE_INSTANCE_URL | Workspace instance URL for your Databricks cluster. It will be something like `https://adb-6885802458123232.12.azuredatabricks.net/`. | Required if using Databricks. |
| SPARK_CONFIG__DATABRICKS__CONFIG_TEMPLATE | Config string including runtime information, Spark version, machine size, etc. See [below](#sparkconfigdatabricksconfigtemplate) for more details. | Required if using Databricks. |
| SPARK_CONFIG__DATABRICKS__WORK_DIR | Workspace dir for storing all the required configuration files and the jar resources. All the feature definitions will be uploaded here. | Required if using Databricks. |
| SPARK_CONFIG__DATABRICKS__FEATHR_RUNTIME_LOCATION | Feathr runtime location. Supports local paths, paths starting with `http(s)://`, and paths starting with `dbfs:/`. If not set, the [Feathr package published in Maven](https://search.maven.org/artifact/com.linkedin.feathr/feathr_2.12) is used. | Required if using Databricks. |
| ONLINE_STORE__REDIS__HOST | Redis host name to access the Redis cluster. | Required if using Redis as an online store. |
| ONLINE_STORE__REDIS__PORT | Redis port number to access the Redis cluster. | Required if using Redis as an online store. |
| ONLINE_STORE__REDIS__SSL_ENABLED | Whether SSL is enabled to access the Redis cluster. | Required if using Redis as an online store. |
| REDIS_PASSWORD | Password for the Redis cluster. | Required if using Redis as an online store. |
| FEATURE_REGISTRY__PURVIEW__PURVIEW_NAME | Configures the name of the Purview endpoint. | Required if using Purview as the endpoint. |
| FEATURE_REGISTRY__PURVIEW__DELIMITER | See [here](#featureregistrypurviewdelimiter) for more details. | Required |
| FEATURE_REGISTRY__PURVIEW__TYPE_SYSTEM_INITIALIZATION | Controls whether the type system (think of this as the "schema" of the registry) will be initialized. Usually this only needs to be set to `True` once to initialize the schema; after that you can set it to `False` to shorten initialization time. | Required |

# Explanation for selected configurations

## SPARK_CONFIG__DATABRICKS__CONFIG_TEMPLATE

Essentially it's a compact JSON string representing the important configurations for the Databricks cluster that you use. The parts marked as "FEATHR_FILL_IN" will be filled in by Feathr; all the other parts are customizable.

The config template represents what is going to be submitted to a Databricks cluster, and you can see the structure of this configuration template in the [Databricks job runs API](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/2.0/jobs#--runs-submit).

The most important and useful part is the `new_cluster` section. For example, you can change `spark_version`, `node_type_id`, `num_workers`, etc. based on your environment:

```json
{"run_name":"FEATHR_FILL_IN","new_cluster":{"spark_version":"9.1.x-scala2.12","node_type_id":"Standard_D3_v2","num_workers":2,"spark_conf":{"FEATHR_FILL_IN":"FEATHR_FILL_IN"}},"libraries":[{"jar":"FEATHR_FILL_IN"}],"spark_jar_task":{"main_class_name":"FEATHR_FILL_IN","parameters":["FEATHR_FILL_IN"]}}
```
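
Since the template is just compact JSON, one way to manage it is to build it as a Python dict and export it through the environment-variable override described earlier. This is a sketch, not a required workflow:

```python
import json
import os

# Build the template as a dict and serialize it compactly; the
# "FEATHR_FILL_IN" placeholders are completed by Feathr at submission time.
template = {
    "run_name": "FEATHR_FILL_IN",
    "new_cluster": {
        "spark_version": "9.1.x-scala2.12",
        "node_type_id": "Standard_D3_v2",
        "num_workers": 2,
        "spark_conf": {"FEATHR_FILL_IN": "FEATHR_FILL_IN"},
    },
    "libraries": [{"jar": "FEATHR_FILL_IN"}],
    "spark_jar_task": {"main_class_name": "FEATHR_FILL_IN", "parameters": ["FEATHR_FILL_IN"]},
}
os.environ["SPARK_CONFIG__DATABRICKS__CONFIG_TEMPLATE"] = json.dumps(template, separators=(",", ":"))
```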

Another use case is `instance_pool_id`: instead of creating the Spark cluster from scratch every time, you can reuse a pool to run the job, which makes the run time shorter:

```json
{"run_name":"FEATHR_FILL_IN","new_cluster":{"spark_version":"9.1.x-scala2.12","num_workers":2,"spark_conf":{"FEATHR_FILL_IN":"FEATHR_FILL_IN"},"instance_pool_id":"0403-214809-inlet434-pool-l9dj3kwz"},"libraries":[{"jar":"FEATHR_FILL_IN"}],"spark_jar_task":{"main_class_name":"FEATHR_FILL_IN","parameters":["FEATHR_FILL_IN"]}}
```

Other advanced settings include `idempotency_token`, which guarantees the idempotency of job run requests.

## FEATURE_REGISTRY__PURVIEW__DELIMITER

The delimiter indicates how the project name, feature names, etc. are joined to form a globally unique reference (mainly for feature sharing). By default it is `__`. For example, if we set up a project called `foo` with an anchor called `taxi_driver` and a feature called `f_daily_trips`, the feature will have the globally unique name `foo__taxi_driver__f_daily_trips`.
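
A tiny sketch of how such a global name is composed; the project, anchor, and feature names below are the hypothetical ones from the example above:

```python
import os

# Fall back to the default "__" when the delimiter is not configured.
delimiter = os.environ.get("FEATURE_REGISTRY__PURVIEW__DELIMITER", "__")
project, anchor, feature = "foo", "taxi_driver", "f_daily_trips"
print(delimiter.join([project, anchor, feature]))  # foo__taxi_driver__f_daily_trips
```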

# A note on using Azure Key Vault to store credentials

Feathr has native integration with Azure Key Vault to make accessing resources more secure. However, Azure Key Vault doesn't allow underscores (`_`) in secret names, so Feathr automatically converts underscores to dashes (`-`). For example, Feathr will look for `ONLINE-STORE--REDIS--HOST` in Azure Key Vault if the actual environment variable is `ONLINE_STORE__REDIS__HOST`.

Azure Key Vault is not case sensitive, so `online_store__redis__host` and `ONLINE_STORE__REDIS__HOST` result in the same request to Azure Key Vault and yield the same result.
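
A minimal sketch of the documented name mapping, assuming a plain underscore-to-dash replacement is all that happens:

```python
def to_key_vault_secret_name(env_var: str) -> str:
    """Replace '_' with '-' because Key Vault secret names disallow underscores.

    Key Vault lookups are case-insensitive, so the casing can be left as-is.
    """
    return env_var.replace("_", "-")

assert to_key_vault_secret_name("ONLINE_STORE__REDIS__HOST") == "ONLINE-STORE--REDIS--HOST"
```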