78 changes: 6 additions & 72 deletions docs/how-to-guides/jdbc-cosmos-notes.md
@@ -1,6 +1,6 @@
---
layout: default
-title: Using SQL databases, CosmosDb, and ElasticSearch with Feathr
+title: Using SQL databases, CosmosDb with Feathr
parent: Feathr How-to Guides
---

@@ -14,7 +14,7 @@

To use a SQL database as a source, we need to create a `JdbcSource` instead of an `HdfsSource`.

A `JdbcSource` can be created with the following statement:

```python
source = feathr.JdbcSource(name, url, dbtable, query, auth)
```
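
For instance, a concrete source might look like the following sketch (the JDBC URL is illustrative, and treating `dbtable` and `query` as alternative ways to select the data is an assumption based on the examples below):

```python
import feathr

# Illustrative values; `dbtable` and `query` are assumed to be alternatives
source = feathr.JdbcSource(
    name="source1",
    url="jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb",
    query="SELECT * FROM table1",
    auth="USERPASS",
)
```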

@@ -43,7 +43,7 @@

When the `auth` parameter is set to `TOKEN`, you need to set the following environment variables.

For example, if you created a source:

```python
src1_name = "source1"
source1 = JdbcSource(name=src1_name, url="jdbc:...", dbtable="table1", auth="USERPASS")
anchor1 = FeatureAnchor(name="anchor_name",
```

@@ -87,17 +87,16 @@

```python
credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token
```
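
The token can then be passed to the job through an environment variable named after the source. A sketch, assuming TOKEN-auth sources follow the same `{name.upper()}_*` naming convention as the `_USER`/`_PASSWORD` variables shown below:

```python
import os

# Assumed convention, mirroring the USER/PASSWORD pattern used elsewhere in this doc
os.environ[f"{src1_name.upper()}_TOKEN"] = token
```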


## Using SQL database as the offline store

To use a SQL database as the offline store, you can pass a `JdbcSink` as the `output_path` parameter of `FeathrClient.get_offline_features`, e.g.:
```python
name = 'output'
sink = JdbcSink(name, some_jdbc_url, dbtable, "USERPASS")
```

Then you need to set the following environment variables before submitting the job:
```python
os.environ[f"{name.upper()}_USER"] = "some_user_name"
os.environ[f"{name.upper()}_PASSWORD"] = "some_magic_word"
client.get_offline_features(..., output_path=sink)
```

@@ -106,69 +105,4 @@

## Using SQL database as the online store

Same as the offline case: create a JDBC sink and add it to the `MaterializationSettings`, set the corresponding environment variables, then use it with `FeathrClient.materialize_features`.
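
A minimal sketch, mirroring the offline example above (names and parameters are illustrative):

```python
name = 'online_output'
sink = JdbcSink(name, some_jdbc_url, dbtable, "USERPASS")
os.environ[f"{name.upper()}_USER"] = "some_user_name"
os.environ[f"{name.upper()}_PASSWORD"] = "some_magic_word"
client.materialize_features(..., materialization_settings=MaterializationSettings(..., sinks=[sink]))
```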

## Using CosmosDb as the online store

To use CosmosDb as the online store, create a `CosmosDbSink` and add it to the `MaterializationSettings`, then use it with `FeathrClient.materialize_features`, e.g.:

```python
name = 'cosmosdb_output'
sink = CosmosDbSink(name, some_cosmosdb_url, some_cosmosdb_database, some_cosmosdb_collection)
os.environ[f"{name.upper()}_KEY"] = "some_cosmosdb_api_key"
client.materialize_features(..., materialization_settings=MaterializationSettings(..., sinks=[sink]))
```

The Feathr client doesn't support getting feature values from CosmosDb; you need to use the [official CosmosDb client](https://pypi.org/project/azure-cosmos/) to get the values:

```python
from azure.cosmos import CosmosClient

client = CosmosClient(some_cosmosdb_url, some_cosmosdb_api_key)
db_client = client.get_database_client(some_cosmosdb_database)
container_client = db_client.get_container_client(some_cosmosdb_collection)
# read_item also requires the item's partition key
doc = container_client.read_item(some_key, partition_key=some_partition_key)
feature_value = doc['feature_name']
```

## Using ElasticSearch as the online store

To use ElasticSearch as the online store, create an `ElasticSearchSink` and add it to the `MaterializationSettings`, then use it with `FeathrClient.materialize_features`, e.g.:

```python
name = 'es_output'
sink = ElasticSearchSink(name, host="esnode1:9200", index="someindex", ssl=False, auth=True)
os.environ[f"{name.upper()}_USER"] = "some_user_name"
os.environ[f"{name.upper()}_PASSWORD"] = "some_magic_word"
client.materialize_features(..., materialization_settings=MaterializationSettings(..., sinks=[sink]))
```

The Feathr client doesn't support getting feature values from ElasticSearch; you need to use the [official ElasticSearch client](https://pypi.org/project/elasticsearch/) to get the values, e.g.:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://esnode1:9200")
resp = es.get(index="someindex", id="somekey")
print(resp['_source'])
```

The feature generation job uses `upsert` mode to write data, so after the job runs the index may contain stale data. The recommended approach is to create a new index each time and use an index alias to switch over seamlessly; detailed information can be found in [the official doc](https://www.elastic.co/guide/en/elasticsearch/reference/master/aliases.html). Currently Feathr doesn't provide any helper to do this.
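
For illustration, a switch-over with the official client might look like the following sketch (the index names are hypothetical, and the `body=` form targets 7.x-style clients):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://esnode1:9200")
# Atomically repoint the alias from the old index to the freshly materialized one
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "someindex_v1", "alias": "someindex"}},
        {"add": {"index": "someindex_v2", "alias": "someindex"}},
    ]
})
```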

NOTE:
+ Only no-auth and basic auth are supported; no other authentication methods are available.
+ If you enable SSL, make sure the certificate on the ES nodes is trusted by the Spark cluster; otherwise the job will fail.

## Using ElasticSearch as the offline store

To use ElasticSearch as the offline store, create an `ElasticSearchSink` and use it with `FeathrClient.get_offline_features`, e.g.:

```python
name = 'es_output'
sink = ElasticSearchSink(name, host="esnode1", index="someindex", ssl=False, auth=True)
os.environ[f"{name.upper()}_USER"] = "some_user_name"
os.environ[f"{name.upper()}_PASSWORD"] = "some_magic_word"
client.get_offline_features(..., output_path=sink)
```

NOTE: The feature joining process doesn't generate meaningful keys for the documents, so you need to make sure the output dataset can be accessed or queried in some other way, such as full-text search; otherwise you may have to fetch all the data from ES to find what you are looking for.
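
For example, a full-text query with the official client might look like this sketch (the field and index names are hypothetical, and the `body=` form targets 7.x-style clients):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://esnode1:9200")
# Search on a feature field, since document ids are not meaningful here
resp = es.search(index="someindex", body={"query": {"match": {"feature_name": "some_value"}}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```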
6 changes: 2 additions & 4 deletions registry/purview-registry/README.md
@@ -1,5 +1,3 @@
-# SQL-Based Registry for Feathr
+# Purview-Based Registry for Feathr

-This is the reference implementation of [the Feathr API spec](./api-spec.md), based on SQL databases instead of Purview.
-
-Please note that this implementation uses iterations of `select` to retrieve graph lineages. This approach is very inefficient and should **not** be considered production-ready; we suggest using it only for testing/research purposes.
+This is the reference implementation of [the Feathr API spec](./api-spec.md), based on Purview.
4 changes: 1 addition & 3 deletions registry/sql-registry/README.md
@@ -1,5 +1,3 @@
# SQL-Based Registry for Feathr

-This is the reference implementation of [the Feathr API spec](./api-spec.md), based on SQL databases instead of Purview.
-
-Please note that this implementation uses iterations of `select` to retrieve graph lineages. This approach is very inefficient and should **not** be considered production-ready; we suggest using it only for testing/research purposes.
+This is the reference implementation of [the Feathr API spec](./api-spec.md), based on SQL databases.