From f91efddbb85598ae09963b65c688882f8df0a73e Mon Sep 17 00:00:00 2001
From: Blair Chen
Date: Tue, 10 Jan 2023 13:52:57 +0800
Subject: [PATCH] Updates on docs

---
 docs/how-to-guides/jdbc-cosmos-notes.md | 78 ++-----------------------
 registry/purview-registry/README.md     |  6 +-
 registry/sql-registry/README.md         |  4 +-
 3 files changed, 9 insertions(+), 79 deletions(-)

diff --git a/docs/how-to-guides/jdbc-cosmos-notes.md b/docs/how-to-guides/jdbc-cosmos-notes.md
index 52fb493e8..726941fb6 100644
--- a/docs/how-to-guides/jdbc-cosmos-notes.md
+++ b/docs/how-to-guides/jdbc-cosmos-notes.md
@@ -1,6 +1,6 @@
 ---
 layout: default
-title: Using SQL databases, CosmosDb, and ElasticSearch with Feathr
+title: Using SQL databases and CosmosDb with Feathr
 parent: Feathr How-to Guides
 ---
 
@@ -14,7 +14,7 @@ To use SQL database as source, we need to create `JdbcSource` instead of `HdfsSo
 
-A `JdbcSource` can be created with follow statement:
+A `JdbcSource` can be created with the following statement:
 
-```
+```python
 source = feathr.JdbcSource(name, url, dbtable, query, auth)
 ```
 
@@ -43,7 +43,7 @@ When the `auth` parameter is set to `TOKEN`, you need to set following environme
 
 I.e., if you created a source:
 
-```
+```python
 src1_name="source1"
 source1 = JdbcSource(name=src1_name, url="jdbc:...", dbtable="table1", auth="USERPASS")
 anchor1 = FeatureAnchor(name="anchor_name",
@@ -87,17 +87,16 @@ credential = DefaultAzureCredential()
-token = credential.get_token("https://management.azure.com/.default").token()
+token = credential.get_token("https://management.azure.com/.default").token
 ```
-
 
 ## Using SQL database as the offline store
 
 To use SQL database as the offline store, you can use `JdbcSink` as the `output_path` parameter of `FeathrClient.get_offline_features`, e.g.:
-```
+```python
 name = 'output'
 sink = client.JdbcSink(name, some_jdbc_url, dbtable, "USERPASS")
 ```
-Then you need to set following environment variables before submitting job:
-```
+Then you need to set the following environment variables before submitting the job:
+```python
 os.environ[f"{name.upper()}_USER"] = "some_user_name"
 os.environ[f"{name.upper()}_PASSWORD"] = "some_magic_word"
 client.get_offline_features(..., output_path=sink)
 ```
@@ -106,69 +105,4 @@ client.get_offline_features(..., output_path=sink)
 
 ## Using SQL database as the online store
 
-Same as the offline, create JDBC sink and add it to the `MaterializationSettings`, set corresponding environment variables, then use it with `FeathrClient.materialize_features`.
-
-## Using CosmosDb as the online store
-
-To use CosmosDb as the online store, create `CosmosDbSink` and add it to the `MaterializationSettings`, then use it with `FeathrClient.materialize_features`, e.g..
-
-```
-name = 'cosmosdb_output'
-sink = CosmosDbSink(name, some_cosmosdb_url, some_cosmosdb_database, some_cosmosdb_collection)
-os.environ[f"{name.upper()}_KEY"] = "some_cosmosdb_api_key"
-client.materialize_features(..., materialization_settings=MaterializationSettings(..., sinks=[sink]))
-```
-
-Feathr client doesn't support getting feature values from CosmosDb, you need to use [official CosmosDb client](https://pypi.org/project/azure-cosmos/) to get the values:
-
-```
-from azure.cosmos import exceptions, CosmosClient, PartitionKey
-
-client = CosmosClient(some_cosmosdb_url, some_cosmosdb_api_key)
-db_client = client.get_database_client(some_cosmosdb_database)
-container_client = db_client.get_container_client(some_cosmosdb_collection)
-doc = container_client.read_item(some_key)
-feature_value = doc['feature_name']
-```
-
-## Using ElasticSearch as online store
-
-To use ElasticSearch as the online store, create `ElasticSearchSink` and add it to the `MaterializationSettings`, then use it with `FeathrClient.materialize_features`, e.g..
-
-```
-name = 'es_output'
-sink = ElasticSearchSink(name, host="esnode1:9200", index="someindex", ssl=False, auth=True)
-os.environ[f"{name.upper()}_USER"] = "some_user_name"
-os.environ[f"{name.upper()}_PASSWORD"] = "some_magic_word"
-client.materialize_features(..., materialization_settings=MaterializationSettings(..., sinks=[sink]))
-```
-
-Feathr client doesn't support getting feature values from ElasticSearch, you need to use [official ElasticSearch client](https://pypi.org/project/elasticsearch/) to get the values, e.g.:
-
-```
-from elasticsearch import Elasticsearch
-
-es = Elasticsearch("http://esnode1:9200")
-resp = es.get(index="someindex", id="somekey")
-print(resp['_source'])
-```
-
-The feature generation job uses `upsert` mode to write data, so after the job the index may contain stale data, the recommended way is to create a new index each time, and use index alias to seamlessly switch over, detailed information can be found from [the official doc](https://www.elastic.co/guide/en/elasticsearch/reference/master/aliases.html), currently Feathr doesn't provide any helper to do this.
-
-NOTE:
-+ You can use no auth or basic auth only, no other authentication methods are supported.
-+ If you enabled SSL, you need to make sure the certificate on ES nodes is trusted by the Spark cluster, otherwise the job will fail.
-
-## Using ElasticSearch as offline store
-
-To use ElasticSearch as the offline store, create `ElasticSearchSink` and use it with `FeathrClient.get_offline_features`, e.g..
-
-```
-name = 'es_output'
-sink = ElasticSearchSink(name, host="esnode1", index="someindex", ssl=False, auth=True)
-os.environ[f"{name.upper()}_USER"] = "some_user_name"
-os.environ[f"{name.upper()}_PASSWORD"] = "some_magic_word"
-client.get_offline_features(..., output_path=sink)
-```
-
-NOTE: The feature joining process doesn't generate meaningful keys for each document, you need to make sure the output dataset can be accessed/queried by some other ways such as full-text-search, otherwise you may have to fetch all the data from ES to get what you look for.
\ No newline at end of file
+Same as the offline store: create a JDBC sink, add it to the `MaterializationSettings`, set the corresponding environment variables, then use it with `FeathrClient.materialize_features`.
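+
+For example, here is a minimal sketch that mirrors the offline `JdbcSink` example above; the sink name, URL, and table are illustrative placeholders rather than a definitive API reference:
+
+```python
+name = 'online_output'
+sink = client.JdbcSink(name, some_jdbc_url, dbtable, "USERPASS")
+
+# Same credential convention as the offline store: the user name and password
+# are read from environment variables derived from the sink name.
+os.environ[f"{name.upper()}_USER"] = "some_user_name"
+os.environ[f"{name.upper()}_PASSWORD"] = "some_magic_word"
+
+client.materialize_features(..., materialization_settings=MaterializationSettings(..., sinks=[sink]))
+```
\ No newline at end of file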
diff --git a/registry/purview-registry/README.md b/registry/purview-registry/README.md
index f06ca7def..03c623535 100644
--- a/registry/purview-registry/README.md
+++ b/registry/purview-registry/README.md
@@ -1,5 +1,3 @@
-# SQL-Based Registry for Feathr
+# Purview-Based Registry for Feathr
 
-This is the reference implementation of [the Feathr API spec](./api-spec.md), base on SQL databases instead of PurView.
-
-Please note that this implementation uses iterations of `select` to retrieve graph lineages, this approach is very inefficient and should **not** be considered as production-ready. We only suggest to use this implementation for testing/researching purposes.
\ No newline at end of file
+This is the reference implementation of [the Feathr API spec](./api-spec.md), based on Purview.
diff --git a/registry/sql-registry/README.md b/registry/sql-registry/README.md
index f06ca7def..66342a6b0 100644
--- a/registry/sql-registry/README.md
+++ b/registry/sql-registry/README.md
@@ -1,5 +1,3 @@
 # SQL-Based Registry for Feathr
 
-This is the reference implementation of [the Feathr API spec](./api-spec.md), base on SQL databases instead of PurView.
-
-Please note that this implementation uses iterations of `select` to retrieve graph lineages, this approach is very inefficient and should **not** be considered as production-ready. We only suggest to use this implementation for testing/researching purposes.
\ No newline at end of file
+This is the reference implementation of [the Feathr API spec](./api-spec.md), based on SQL databases.