Error when reading Parquet files from local S3 #29

@temminks

I'm trying to extract a Parquet file from a bucket, but Sling fails when it tries to load the file (using DuckDB, I guess). This only happens with the Python wrapper in a local setup using LocalStack. Perhaps the endpoint URL isn't passed to DuckDB, but that's just a wild guess. Writing a Parquet file, on the other hand, works just fine. The extraction also works with the non-Python CLI.

I'm running Sling as part of Dagster, which might also be relevant because you seem to check for that in the Python wrapper (although this works with a non-local bucket).
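To make the guess about the endpoint concrete, here is a minimal sketch of the settings DuckDB's httpfs extension would need in order to reach a LocalStack bucket. The helper name `duckdb_s3_settings` and the dict layout are hypothetical and only mirror the `AWS_S3` connection from the reproduction steps; this is not how Sling builds its DuckDB session, just an illustration of what would be missing if the endpoint were dropped.

```python
from urllib.parse import urlparse


def duckdb_s3_settings(conn: dict) -> list[str]:
    """Translate an S3 connection dict into DuckDB httpfs SET statements.

    Hypothetical helper for illustration; `conn` mirrors the keys of the
    AWS_S3 entry in env.yaml (endpoint, region, access_key_id, ...).
    """
    parsed = urlparse(conn["endpoint"])
    use_ssl = "true" if parsed.scheme == "https" else "false"
    return [
        # DuckDB expects host:port without the URL scheme.
        f"SET s3_endpoint='{parsed.netloc}';",
        # LocalStack is served over plain HTTP by default.
        f"SET s3_use_ssl={use_ssl};",
        # Path-style addressing: http://host:port/bucket/key
        "SET s3_url_style='path';",
        f"SET s3_region='{conn['region']}';",
        f"SET s3_access_key_id='{conn['access_key_id']}';",
        f"SET s3_secret_access_key='{conn['secret_access_key']}';",
    ]


if __name__ == "__main__":
    localstack = {
        "endpoint": "http://localhost:4566",
        "region": "eu-central-1",
        "access_key_id": "localstack",
        "secret_access_key": "localstack",
    }
    for statement in duckdb_s3_settings(localstack):
        print(statement)
```

Running these statements in a DuckDB session (with httpfs loaded) and then querying `read_parquet('s3://my-test-bucket/test.parquet')` should show whether DuckDB itself can read the file once the endpoint is set. If it can, that would support the guess that the wrapper does not forward the endpoint.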

Run Localstack

You can run LocalStack using Docker or Podman.

Create a local bucket

aws s3 mb s3://my-test-bucket --endpoint-url=http://localhost:4566

Define a new Sling Connection for that bucket (in your env.yaml):

  AWS_S3:
    type: s3
    bucket: my-test-bucket
    region: eu-central-1
    endpoint: http://localhost:4566
    access_key_id: localstack
    secret_access_key: localstack

Create a CSV file

printf 'Hello,World\nHello,World\n' > test.txt

Define a new Local Connection to access that CSV file (in your env.yaml):

  LOCAL:
    type: local
    url: file://<root/of/text/file>

Load the file as parquet into your bucket using a replication YAML

This is only to create a Parquet test file:

source: LOCAL
target: AWS_S3

defaults:
  mode: full-refresh
  target_options:
    format: parquet

streams:
  test.txt:
    object: test.parquet

Extract the parquet file from the bucket to local storage

This fails:

source: AWS_S3
target: LOCAL

defaults:
  mode: full-refresh
  source_options:
    format: parquet

streams:
  test.parquet:
    object: result.txt
