Description
I'm trying to extract a Parquet file from an S3 bucket, but Sling fails when it tries to load the file (using DuckDB, I guess). This only happens with the Python wrapper in combination with a local Localstack setup. Perhaps the endpoint URL isn't passed to DuckDB, but that's just a wild guess. Writing a Parquet file, on the other hand, works just fine. The extraction also works with the non-Python CLI.
I'm running Sling as part of Dagster, which might also be relevant since the Python wrapper seems to check for that (although this works with a non-local bucket).
Run Localstack
You can run Localstack using Docker or Podman:
- https://docs.localstack.cloud/getting-started/installation/#docker-compose
- https://docs.localstack.cloud/references/podman/#podman-on-windows (works on an ARM Mac, too)
Create a local bucket
aws s3 mb s3://my-test-bucket --endpoint=http://localhost:4566
Define a new Sling Connection for that bucket (in your env.yaml):
AWS_S3:
  type: s3
  bucket: my-test-bucket
  region: eu-central-1
  endpoint: http://localhost:4566
  access_key_id: localstack
  secret_access_key: localstack
Create a CSV file
printf 'Hello,World\nHello,World\n' > test.txt
Define a new Local Connection to access that CSV file (in your env.yaml):
LOCAL:
  type: local
  url: file://<root/of/text/file>
Load the file as Parquet into your bucket using a replication YAML
This is only to create a Parquet test file:
source: LOCAL
target: AWS_S3
defaults:
  mode: full-refresh
  target_options:
    format: parquet
streams:
  test.txt:
    object: test.parquet
Extract the Parquet file from the bucket to local storage
This fails:
source: AWS_S3
target: LOCAL
defaults:
  mode: full-refresh
  source_options:
    format: parquet
streams:
  test.parquet:
    object: result.txt
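One way to test the endpoint guess from the description independently of Sling might be to read the same file with DuckDB directly. The sketch below is only an assumption about which settings matter: it maps the AWS_S3 connection values onto DuckDB's httpfs S3 settings (`s3_endpoint`, `s3_use_ssl`, `s3_url_style`, and so on). The helper names `duckdb_s3_settings` and `read_parquet_via_duckdb` are made up for illustration, and I don't know how Sling actually configures DuckDB internally.

```python
from urllib.parse import urlparse

# Values copied from the AWS_S3 connection in env.yaml above.
CONN = {
    "bucket": "my-test-bucket",
    "region": "eu-central-1",
    "endpoint": "http://localhost:4566",
    "access_key_id": "localstack",
    "secret_access_key": "localstack",
}


def duckdb_s3_settings(conn):
    """Build the DuckDB httpfs SET statements for a custom S3 endpoint (hypothetical helper)."""
    parsed = urlparse(conn["endpoint"])
    use_ssl = "true" if parsed.scheme == "https" else "false"
    return [
        f"SET s3_region='{conn['region']}'",
        f"SET s3_endpoint='{parsed.netloc}'",  # host:port only, without the scheme
        f"SET s3_use_ssl={use_ssl}",           # Localstack listens on plain HTTP here
        "SET s3_url_style='path'",             # address the bucket by path, not by subdomain
        f"SET s3_access_key_id='{conn['access_key_id']}'",
        f"SET s3_secret_access_key='{conn['secret_access_key']}'",
    ]


def read_parquet_via_duckdb(conn, key):
    """Read s3://<bucket>/<key> with DuckDB and return all rows (hypothetical helper)."""
    import duckdb  # third-party package: pip install duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    for statement in duckdb_s3_settings(conn):
        con.execute(statement)
    query = f"SELECT * FROM read_parquet('s3://{conn['bucket']}/{key}')"
    return con.execute(query).fetchall()
```

With Localstack running and the test file uploaded, calling `read_parquet_via_duckdb(CONN, "test.parquet")` should return the rows if the endpoint settings are what matters. If it works with the settings and fails once the `s3_endpoint` line is removed, that would support the guess that the endpoint is not reaching DuckDB in the Python wrapper.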