A utility to inspect Parquet files.
Pre-built binaries and packages can be found on the release page; on Mac you can install with brew:
$ brew install go-parquet-tools
Once it is installed:
$ parquet-tools
Usage: parquet-tools <command>
A utility to inspect Parquet files, for full usage see https://github.com/hangxie/parquet-tools/blob/main/README.md
Flags:
-h, --help Show context-sensitive help.
Commands:
cat Prints the content of a Parquet file, data only.
import Create Parquet file from other source data.
merge Merge multiple parquet files into one.
meta Prints the metadata.
row-count Prints the count of rows.
schema Prints the schema.
shell-completions Install/uninstall shell completions
size Prints the size.
split Split into multiple parquet files.
version Show build version.
Run "parquet-tools <command> --help" for more information on a command.
parquet-tools: error: expected one of "cat", "import", "merge", "meta", "row-count", ...
You can choose any of the installation methods below; the functionality is mostly the same.
Good for people who are familiar with Go; you need Go 1.24 or newer.
$ go install github.com/hangxie/parquet-tools@latest
The above command installs the latest released version of parquet-tools to $GOPATH/bin. parquet-tools installed from source does not report a proper version and build time, so running parquet-tools version will just print an empty line; all other functions are not affected.
Tip
If you do not set the GOPATH environment variable explicitly, its default value can be obtained by running go env GOPATH; it is usually the go/ directory under your home directory.
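For example, a quick sketch to check where go install places the binary and to put it on PATH (the path shown is illustrative):
$ go env GOPATH
/home/user/go
$ export PATH=$PATH:$(go env GOPATH)/bin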
Good for people who do not want to build from source, or when all other installation approaches do not work.
Go to the release page, pick the release and platform you want to run, download the corresponding gz/zip file, extract it to your local disk, make sure the execution bit is set if you are running on Linux, Mac, or FreeBSD, then run the program.
For Windows 10 on ARM (like Surface Pro X), use the windows-arm64 build; on Windows 11 on ARM, both the windows-arm64 and windows-amd64 builds should work.
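A minimal sketch of the manual route, assuming a Linux amd64 gz asset that contains a single binary; the file name below is a placeholder, check the release page for the actual asset name:
$ gunzip parquet-tools-vX.Y.Z-linux-amd64.gz
$ chmod +x parquet-tools-vX.Y.Z-linux-amd64
$ ./parquet-tools-vX.Y.Z-linux-amd64 version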
Mac users can use Homebrew to install:
$ brew install go-parquet-tools
To upgrade to the latest version:
$ brew upgrade go-parquet-tools
The container image supports amd64, arm64, and arm/v7, and is hosted in two registries:
You can pull the image from either location:
$ docker run --rm hangxie/parquet-tools version
v1.36.0
$ podman run --rm ghcr.io/hangxie/parquet-tools version
v1.36.0
RPM and deb packages can be found on the release page; only amd64/x86_64 and arm64/aarch64 architectures are available at this moment. Download the proper package and run the corresponding installation command:
- On Debian/Ubuntu:
$ sudo dpkg -i parquet-tools_1.36.0_amd64.deb
Preparing to unpack parquet-tools_1.36.0_amd64.deb ...
Unpacking parquet-tools (1.36.0) ...
Setting up parquet-tools (1.36.0) ...
- On CentOS/Fedora:
$ sudo rpm -Uhv parquet-tools-1.36.0-1.x86_64.rpm
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Updating / installing...
1:parquet-tools-1.36.0-1 ################################# [100%]
parquet-tools provides help information through the -h flag. Whenever you are not sure about a parameter for a command, just add -h to the end of the line and it will show all available options, for example:
$ parquet-tools meta -h
Usage: parquet-tools meta <uri> [flags]
Prints the metadata.
Arguments:
<uri> URI of Parquet file.
Flags:
-h, --help Show context-sensitive help.
-b, --base64 deprecated, will be removed in future version
--fail-on-int96 fail command if INT96 data type is present.
--anonymous (S3, GCS, and Azure only) object is publicly accessible.
--http-extra-headers= (HTTP URI only) extra HTTP headers.
--http-ignore-tls-error (HTTP URI only) ignore TLS error.
--http-multiple-connection (HTTP URI only) use multiple HTTP connection.
--object-version="" (S3, GCS, and Azure only) object version.
Most commands can output results in JSON format, which can be processed by utilities like jq or an online JSON parser.
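For example, a small sketch that pipes the meta command's JSON output into jq (assuming jq is installed) to extract a single field; the full row group structure is shown in the meta section later in this document:
$ parquet-tools meta testdata/good.parquet | jq '.RowGroups[0].NumRows'
3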
parquet-tools can read and write parquet files from these locations:
- file system
- AWS Simple Storage Service (S3) bucket
- Google Cloud Storage (GCS) bucket
- Azure Storage Container
- HDFS file
parquet-tools can read parquet files from these locations:
- HTTP/HTTPS URL
Important
You need to have proper permission on the file you are going to process.
For files on the file system, you can specify the file:// scheme or just omit it:
$ parquet-tools row-count testdata/good.parquet
3
$ parquet-tools row-count file://testdata/good.parquet
3
$ parquet-tools row-count file://./testdata/good.parquet
3
Use a full S3 URL to indicate the S3 object location; it starts with s3://. You need to make sure you have permission to read or write the S3 object; the easiest way to verify that is using the AWS CLI:
$ aws sts get-caller-identity
{
"UserId": "REDACTED",
"Account": "123456789012",
"Arn": "arn:aws:iam::123456789012:user/redacted"
}
$ aws s3 ls s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
2024-05-06 08:33:48 362267322 20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
$ parquet-tools row-count s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
2405462
If an S3 object is publicly accessible and you do not have an AWS credential, you can use the --anonymous flag to bypass AWS authentication:
$ aws sts get-caller-identity
Unable to locate credentials. You can configure credentials by running "aws configure".
$ aws s3 --no-sign-request ls s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
2024-05-06 08:33:48 362267322 20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
$ parquet-tools row-count --anonymous s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
2405462
Optionally, you can specify an object version by using --object-version when you perform a read operation (like cat, row-count, schema, etc.) on S3; parquet-tools will access the current version if this parameter is omitted.
If the specified version of the S3 object does not exist or the bucket does not have versioning enabled, parquet-tools will report an error:
$ parquet-tools row-count s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0 --object-version non-existent-version
parquet-tools: error: failed to open S3 object [s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0] version [non-existent-version]: operation error S3: HeadObject, https response error StatusCode: 400, RequestID: 75GZZ1W5M4KMAK1H, HostID: hgDGBOolDqLgH+CHRuZU+dXZXv4CB+mmSpjEfGxF5fLnKhNkJCWEAZBSS0kbT/k2gFotuoWNLX+zaWNWzHR49w==, api error BadRequest: Bad Request
Tip
According to the HeadObject and GetObject documentation, the status code for a non-existent object or version will be 403 instead of 404 if the caller does not have ListBucket permission, and 400 if the bucket does not have versioning enabled.
Thanks to parquet-go-source, parquet-tools loads only the necessary data from the S3 bucket, in most cases the footer only, so it is much faster than downloading the file from S3 and running parquet-tools on a local copy. The S3 object used in the above sample is more than 4GB, but the row-count command takes just a few seconds to finish.
Use a full gsutil URI to point to the GCS object location; it starts with gs://. You need to make sure you have permission to read or write the GCS object, either through application default credentials or GOOGLE_APPLICATION_CREDENTIALS; you can refer to the Google Cloud documentation for more details.
$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service/account/key.json
$ parquet-tools import -s testdata/csv.source -m testdata/csv.schema gs://REDACTED/csv.parquet
$ parquet-tools row-count gs://REDACTED/csv.parquet
7
Similar to S3, parquet-tools downloads only the necessary data from the GCS bucket.
If the GCS object is publicly accessible, you can use --anonymous option to indicate that anonymous access is expected:
$ parquet-tools row-count gs://cloud-samples-data/bigquery/us-states/us-states.parquet
parquet-tools: error: failed to create GCS client: dialing: google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information
$ parquet-tools row-count --anonymous gs://cloud-samples-data/bigquery/us-states/us-states.parquet
50
Optionally, you can specify an object generation by using --object-version when you perform a read operation (like cat, row-count, schema, etc.); parquet-tools will access the latest generation if this parameter is omitted.
$ parquet-tools row-count --anonymous gs://cloud-samples-data/bigquery/us-states/us-states.parquet
50
$ parquet-tools row-count --anonymous --object-version=-1 gs://cloud-samples-data/bigquery/us-states/us-states.parquet
50
parquet-tools reports an error for invalid or non-existent generations:
$ parquet-tools row-count --anonymous --object-version=123 gs://cloud-samples-data/bigquery/us-states/us-states.parquet
parquet-tools: error: unable to open file [gs://cloud-samples-data/bigquery/us-states/us-states.parquet]: failed to create new reader: storage: object doesn't exist: googleapi: Error 404: No such object: cloud-samples-data/bigquery/us-states/us-states.parquet, notFound
$ parquet-tools row-count --anonymous --object-version=foo-bar gs://cloud-samples-data/bigquery/us-states/us-states.parquet
parquet-tools: error: unable to open file [gs://cloud-samples-data/bigquery/us-states/us-states.parquet]: invalid GCS generation [foo-bar]: strconv.ParseInt: parsing "foo-bar": invalid syntax
parquet-tools uses the Azure blob URL format:
- starts with wasbs:// (wasb:// is not supported), followed by
- container as user name, followed by
- storage account as host, followed by
- blob name as path
For example:
wasbs://laborstatisticscontainer@azureopendatastorage.blob.core.windows.net/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
means the parquet file is at:
- storage account: azureopendatastorage
- container: laborstatisticscontainer
- blob: lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
parquet-tools uses the AZURE_STORAGE_ACCESS_KEY environment variable for access credentials:
$ AZURE_STORAGE_ACCESS_KEY=REDACTED parquet-tools import -s testdata/csv.source -m testdata/csv.schema wasbs://[email protected]/test/csv.parquet
$ AZURE_STORAGE_ACCESS_KEY=REDACTED parquet-tools row-count wasbs://[email protected]/test/csv.parquet
7
If the blob is publicly accessible, either unset AZURE_STORAGE_ACCESS_KEY or use the --anonymous option to indicate that anonymous access is expected:
$ AZURE_STORAGE_ACCESS_KEY= parquet-tools row-count wasbs://laborstatisticscontainer@azureopendatastorage.blob.core.windows.net/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
6582726
$ parquet-tools row-count --anonymous wasbs://laborstatisticscontainer@azureopendatastorage.blob.core.windows.net/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
6582726
Optionally, you can specify an object version by using --object-version when you perform a read operation (like cat, row-count, schema, etc.) on an Azure blob; parquet-tools will access the current version if this parameter is omitted.
Note
Azure blob returns different errors for non-existent version and invalid version id:
$ parquet-tools row-count --anonymous wasbs://laborstatisticscontainer@azureopendatastorage.blob.core.windows.net/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet --object-version foo-bar
parquet-tools: error: unable to open file [wasbs://laborstatisticscontainer@azureopendatastorage.blob.core.windows.net/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet]: HEAD https://azureopendatastorage.blob.core.windows.net/laborstatisticscontainer/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
--------------------------------------------------------------------------------
RESPONSE 400: 400 Value for one of the query parameters specified in the request URI is invalid.
ERROR CODE UNAVAILABLE
--------------------------------------------------------------------------------
Response contained no body
--------------------------------------------------------------------------------
$ parquet-tools row-count --anonymous wasbs://laborstatisticscontainer@azureopendatastorage.blob.core.windows.net/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet --object-version 2025-05-20T01:27:08.0552942Z
parquet-tools: error: unable to open file [wasbs://laborstatisticscontainer@azureopendatastorage.blob.core.windows.net/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet]: HEAD https://azureopendatastorage.blob.core.windows.net/laborstatisticscontainer/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
--------------------------------------------------------------------------------
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound
--------------------------------------------------------------------------------
Response contained no body
--------------------------------------------------------------------------------
Similar to S3 and GCS, parquet-tools downloads only the necessary data from the blob.
parquet-tools can read and write files on HDFS using the scheme hdfs://username@hostname:port/path/to/file; if username is not provided, the current OS user will be used.
$ parquet-tools import -f jsonl -m testdata/jsonl.schema -s testdata/jsonl.source hdfs://localhost:9000/temp/good.parquet
parquet-tools: error: failed to create JSON writer: failed to open HDFS source [hdfs://localhost:9000/temp/good.parquet]: create /temp/good.parquet: permission denied
$ parquet-tools import -f jsonl -m testdata/jsonl.schema -s testdata/jsonl.source hdfs://root@localhost:9000/temp/good.parquet
$ parquet-tools row-count hdfs://localhost:9000/temp/good.parquet
7
Similar to cloud storage, parquet-tools downloads only the necessary data from HDFS.
parquet-tools supports URIs with the http or https scheme; the remote server needs to support the Range header, specifically with the bytes unit.
HTTP endpoints do not support write operations, so they cannot be used as the destination of the import, merge, or split command.
These options can be used along with HTTP endpoints:
- --http-multiple-connection will enable a dedicated transport for concurrent requests, and parquet-tools will establish multiple TCP connections to the remote server. This may or may not have a performance impact depending on how the remote server handles concurrent connections; it is recommended to leave it at the default false value for all commands except cat, and to test performance carefully with the cat command.
- --http-extra-headers, in the format key1=value1,key2=value2,..., will be used as extra HTTP headers; a use case is to provide an Authorization header or JWT token required by the remote server, as shown in the sketch after this list.
- --http-ignore-tls-error will ignore TLS errors; this is generally a bad idea.
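As a hedged sketch of the --http-extra-headers use case, passing a bearer token to a remote server; the URL and token below are placeholders, not a real endpoint:
$ parquet-tools row-count --http-extra-headers "Authorization=Bearer REDACTED" https://example.com/data/file.parquet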
$ parquet-tools row-count https://azureopendatastorage.blob.core.windows.net/laborstatisticscontainer/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
6582726
$ parquet-tools size https://dpla-provider-export.s3.amazonaws.com/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet
4632041101
Similar to S3 and other remote endpoints, parquet-tools downloads only the necessary data from the remote server through the Range header.
Tip
parquet-tools will use HTTP/2 if the remote server supports it; however, you can disable this when things are not working well by setting the environment variable GODEBUG to http2client=0:
$ parquet-tools row-count https://...
2022/09/05 09:54:52 protocol error: received DATA after END_STREAM
2022/09/05 09:54:52 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
18141856
$ GODEBUG=http2client=0 parquet-tools row-count https://...
18141856
The cat command outputs the data in a parquet file; it supports JSON, JSONL, CSV, and TSV formats. Since most parquet files are rather large, you should use the row-count command to get a rough idea how many rows are in the parquet file, then use the --skip, --limit, and --sample-ratio flags to reduce the output to a manageable level; these flags can be used together.
There is a parameter that you probably will never touch: --read-page-size tells parquet-tools how many rows to read from the parquet file at a time; you can play with it if you hit a performance or resource problem.
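For illustration, a hedged example that bumps the read page size on the sample file; the value is arbitrary, the output is identical to a plain cat, and only the internal batch size changes:
$ parquet-tools cat --read-page-size 10000 --format jsonl testdata/good.parquet
{"shoe_brand":"nike","shoe_name":"air_griffey"}
{"shoe_brand":"fila","shoe_name":"grant_hill_2"}
{"shoe_brand":"steph_curry","shoe_name":"curry7"}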
$ parquet-tools cat --format jsonl testdata/good.parquet
{"shoe_brand":"nike","shoe_name":"air_griffey"}
{"shoe_brand":"fila","shoe_name":"grant_hill_2"}
{"shoe_brand":"steph_curry","shoe_name":"curry7"}
Tip
You can set the --fail-on-int96 option to make the cat command fail for parquet files that contain fields with the deprecated INT96 type. The default value for this option is false, so you can still read INT96 data, but this behavior may change in the future.
$ parquet-tools cat --fail-on-int96 testdata/all-types.parquet
parquet-tools: error: field Int96 has type INT96 which is not supported
$ parquet-tools cat testdata/all-types.parquet
[{"Bool":true,"ByteArray":"ByteArray-0","Date":1640995200,...
--skip is similar to OFFSET in SQL; parquet-tools will skip this many rows from the beginning of the parquet file before applying other logic.
$ parquet-tools cat --skip 2 --format jsonl testdata/good.parquet
{"shoe_brand":"steph_curry","shoe_name":"curry7"}
Caution
parquet-tools will not report an error if --skip is greater than the total number of rows in the parquet file.
$ parquet-tools cat --skip 20 testdata/good.parquet
[]
Warning
There is no standard for the CSV and TSV formats. parquet-tools uses Go's encoding/csv module to maximize compatibility; however, there is no guarantee that the output can be interpreted by other utilities, especially those written in other programming languages.
$ parquet-tools cat -f csv testdata/good.parquet
shoe_brand,shoe_name
nike,air_griffey
fila,grant_hill_2
steph_curry,curry7
Caution
nil values will be presented as an empty string:
$ parquet-tools cat -f csv --limit 2 testdata/int96-nil-min-max.parquet
Utf8,Int96
UTF8-0,
UTF8-1,
By default, CSV and TSV output contains a header line with field names; you can use the --no-header option to remove it from the output.
$ parquet-tools cat -f csv --no-header testdata/good.parquet
nike,air_griffey
fila,grant_hill_2
steph_curry,curry7
Important
CSV and TSV do not support parquet files with complex schema:
$ parquet-tools cat -f csv testdata/all-types.parquet
parquet-tools: error: field [Map] is not scalar type, cannot output in csv format
--limit is similar to LIMIT in SQL or head in a Linux shell; parquet-tools will stop running after outputting this many rows.
$ parquet-tools cat --limit 2 testdata/good.parquet
[{"shoe_brand":"nike","shoe_name":"air_griffey"},{"shoe_brand":"fila","shoe_name":"grant_hill_2"}]
--sample-ratio enables sampling; the ratio is a number between 0.0 and 1.0 inclusive. 1.0 means output everything in the parquet file, while 0.0 means output nothing. If you want roughly 1 row out of every 10, use 0.1.
Caution
This feature picks rows from the parquet file randomly, so only 0.0 and 1.0 produce deterministic results; any other ratio may generate a data set smaller or larger than you want.
$ parquet-tools cat --sample-ratio 0.34 testdata/good.parquet
[{"shoe_brand":"nike","shoe_name":"air_griffey"}]
$ parquet-tools cat --sample-ratio 0.34 testdata/good.parquet
[]
$ parquet-tools cat --sample-ratio 0.34 testdata/good.parquet
[{"shoe_brand":"steph_curry","shoe_name":"curry7"}]
$ parquet-tools cat --sample-ratio 0.34 testdata/good.parquet
[{"shoe_brand":"nike","shoe_name":"air_griffey"},{"shoe_brand":"fila","shoe_name":"grant_hill_2"}]
$ parquet-tools cat --sample-ratio 0.34 testdata/good.parquet
[{"shoe_brand":"fila","shoe_name":"grant_hill_2"}]
$ parquet-tools cat --sample-ratio 1.0 testdata/good.parquet
[{"shoe_brand":"nike","shoe_name":"air_griffey"},{"shoe_brand":"fila","shoe_name":"grant_hill_2"},{"shoe_brand":"steph_curry","shoe_name":"curry7"}]
$ parquet-tools cat --sample-ratio 0.0 testdata/good.parquet
[]
--skip, --limit and --sample-ratio can be used together to achieve certain goals, for example, to get the 3rd row from the parquet file:
$ parquet-tools cat --skip 2 --limit 1 testdata/good.parquet
[{"shoe_brand":"steph_curry","shoe_name":"curry7"}]
Caution
cat supports two output formats. One is the default JSON format, which wraps all JSON objects into a list; this works perfectly for small output and is compatible with most JSON toolchains. However, since almost all JSON libraries load the full JSON into memory to parse and process it, this will lead to memory pressure if you dump a huge amount of data.
$ parquet-tools cat testdata/good.parquet
[{"shoe_brand":"nike","shoe_name":"air_griffey"},{"shoe_brand":"fila","shoe_name":"grant_hill_2"},{"shoe_brand":"steph_curry","shoe_name":"curry7"}]
cat also supports the line-delimited JSON streaming format (JSONL) by specifying --format jsonl, which allows readers of the output to process it in a streaming manner and greatly reduces the memory footprint. Note that there is always a newline at the end of the output.
Tip
If you want to filter data, use JSONL format output and pipe to jq.
$ parquet-tools cat --format jsonl testdata/good.parquet
{"shoe_brand":"nike","shoe_name":"air_griffey"}
{"shoe_brand":"fila","shoe_name":"grant_hill_2"}
{"shoe_brand":"steph_curry","shoe_name":"curry7"}
You can read the data line by line and parse each line as a JSON object if you do not have a toolchain that processes the JSONL format.
If you do not care about the order of records, you can use --concurrent, which launches multiple encoders (up to the number of CPUs) to boost output speed but does not maintain the original order from the parquet file.
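Following the tip above, a sketch that filters the JSONL output with jq (assuming jq is installed), keeping only the nike row from the sample file:
$ parquet-tools cat --format jsonl testdata/good.parquet | jq -c 'select(.shoe_brand == "nike")'
{"shoe_brand":"nike","shoe_name":"air_griffey"}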
$ parquet-tools cat -f jsonl --concurrent testdata/good.parquet
{"shoe_brand":"fila","shoe_name":"grant_hill_2"}
{"shoe_brand":"nike","shoe_name":"air_griffey"}
{"shoe_brand":"steph_curry","shoe_name":"curry7"}
$ parquet-tools cat -f jsonl --concurrent testdata/good.parquet
{"shoe_brand":"nike","shoe_name":"air_griffey"}
{"shoe_brand":"fila","shoe_name":"grant_hill_2"}
{"shoe_brand":"steph_curry","shoe_name":"curry7"}
The import command creates a parquet file based on data in other formats. The target file can be on the local file system or a cloud storage object like S3; you need to have permission to write to the target location. An existing file or cloud storage object will be overwritten.
The command takes 3 parameters: --source tells which file (file system only) to load source data from, --format tells the format of the source data file (it can be json, jsonl, or csv), and --schema points to the file that holds the schema (a minimal sketch follows the list below). Optionally, you can use --compression to specify the compression codec (UNCOMPRESSED/SNAPPY/GZIP/LZ4/LZ4_RAW/ZSTD); the default is "SNAPPY". If the CSV file contains a header line, you can use --skip-header to skip the first line of the CSV file.
Each source data file format has its own dedicated schema format:
- CSV: you can refer to sample in this repo.
- JSON: you can refer to sample in this repo.
- JSONL: use the same schema as JSON format.
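As a minimal sketch of what a CSV schema file can look like, using the same name=..., type=... tag syntax that schema --format csv prints later in this document; the field names here are illustrative rather than the repo sample:
name=Id, type=INT64
name=Name, type=BYTE_ARRAY, convertedtype=UTF8
name=Age, type=INT32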
Warning
You cannot import INT96 data at this moment, more details can be found at #149.
$ parquet-tools import -f csv -s testdata/csv.source -m testdata/csv.schema /tmp/csv.parquet
$ parquet-tools row-count /tmp/csv.parquet
7
$ parquet-tools import -f json -s testdata/json.source -m testdata/json.schema -z GZIP /tmp/json.parquet
$ parquet-tools row-count /tmp/json.parquet
1
Tip
The JSON format allows only a single record to be imported; if you want to import multiple records, use JSONL as the source format.
JSONL is a line-delimited JSON streaming format; use JSONL if you want to load multiple JSON objects into parquet.
$ parquet-tools import -f jsonl -s testdata/jsonl.source -m testdata/jsonl.schema /tmp/jsonl.parquet
$ parquet-tools row-count /tmp/jsonl.parquet
7
The merge command merges multiple parquet files into one parquet file. Source parquet files need to have the same schema, except that the top-level node can have different names. The source files and the target file can be in different storage locations.
$ parquet-tools merge -s testdata/good.parquet,testdata/good.parquet /tmp/doubled.parquet
$ parquet-tools cat -f jsonl testdata/good.parquet
{"shoe_brand":"nike","shoe_name":"air_griffey"}
{"shoe_brand":"fila","shoe_name":"grant_hill_2"}
{"shoe_brand":"steph_curry","shoe_name":"curry7"}
$ parquet-tools cat -f jsonl /tmp/doubled.parquet
{"shoe_brand":"nike","shoe_name":"air_griffey"}
{"shoe_brand":"nike","shoe_name":"air_griffey"}
{"shoe_brand":"fila","shoe_name":"grant_hill_2"}
{"shoe_brand":"steph_curry","shoe_name":"curry7"}
{"shoe_brand":"fila","shoe_name":"grant_hill_2"}
{"shoe_brand":"steph_curry","shoe_name":"curry7"}
$ parquet-tools merge -s testdata/top-level-tag1.parquet -s testdata/top-level-tag2.parquet /tmp/merged.parquet
$ parquet-tools row-count /tmp/merged.parquet
6
--read-page-size configures how many rows will be read from the source files and written to the target file at a time. You can also use --compression to specify the compression codec (UNCOMPRESSED/SNAPPY/GZIP/LZ4/LZ4_RAW/ZSTD) for the target parquet file; the default is "SNAPPY". Other read options like --http-multiple-connection, --http-ignore-tls-error, --http-extra-headers, --object-version, and --anonymous can still be used, but since they are applied to all source files, some of them may not make sense, e.g. --object-version.
When the --concurrent option is specified, the merge command will read input files in parallel (up to the number of CPUs); this can bring a performance gain of between 5% and 10%. The trade-off is that the order of records in the resulting parquet file will not strictly follow the order of the input files.
You can set the --fail-on-int96 option to make the merge command fail for parquet files that contain fields with the deprecated INT96 type. The default value for this option is false, so you can still read INT96 data, but this behavior may change in the future.
The meta command shows metadata for every row group in a parquet file.
Note
PathInSchema uses the field names from the parquet file, the same as the cat command.
$ parquet-tools meta testdata/good.parquet
{"NumRowGroups":1,"RowGroups":[{"NumRows":3,"TotalByteSize":438,"Columns":[{"PathInSchema":["shoe_brand"],"Type":"BYTE_ARRAY","ConvertedType":"convertedtype=UTF8","Encodings":["RLE","BIT_PACKED","PLAIN"],"CompressedSize":269,"UncompressedSize":194,"NumValues":3,"NullCount":0,"MaxValue":"steph_curry","MinValue":"fila","CompressionCodec":"GZIP"},{"PathInSchema":["shoe_name"],"Type":"BYTE_ARRAY","ConvertedType":"convertedtype=UTF8","Encodings":["RLE","BIT_PACKED","PLAIN"],"CompressedSize":319,"UncompressedSize":244,"NumValues":3,"NullCount":0,"MaxValue":"grant_hill_2","MinValue":"air_griffey","CompressionCodec":"GZIP"}]}]}
Note
MinValue, MaxValue, and NullCount are optional; if they do not show up in the output, it means the parquet file does not have that section.
You can set the --fail-on-int96 option to make the meta command fail for parquet files that contain fields with the deprecated INT96 type. The default value for this option is false, so you can still read INT96 data, but this behavior may change in the future.
$ parquet-tools meta testdata/int96-nil-min-max.parquet
{"NumRowGroups":1,"RowGroups":[{"NumRows":10,"TotalByteSize":488,"Columns":[{"PathInSchema":["Utf8"],"Type":"BYTE_ARRAY","ConvertedType":"convertedtype=UTF8","Encodings":["RLE","BIT_PACKED","PLAIN","PLAIN_DICTIONARY","RLE_DICTIONARY"],"CompressedSize":381,"UncompressedSize":380,"NumValues":10,"NullCount":0,"MaxValue":"UTF8-9","MinValue":"UTF8-0","CompressionCodec":"ZSTD"},{"PathInSchema":["Int96"],"Type":"INT96","Encodings":["RLE","BIT_PACKED","PLAIN"],"CompressedSize":160,"UncompressedSize":108,"NumValues":10,"NullCount":10,"CompressionCodec":"ZSTD"}]}]}
$ parquet-tools meta --fail-on-int96 testdata/int96-nil-min-max.parquet
parquet-tools: error: field Int96 has type INT96 which is not supported
The row-count command provides the total number of rows in the parquet file:
$ parquet-tools row-count testdata/good.parquet
3
The schema command shows the schema of the parquet file in different formats.
The JSON format schema can be used directly in a parquet-go based Go program like this example:
$ parquet-tools schema testdata/good.parquet
{"Tag":"name=parquet_go_root","Fields":[{"Tag":"name=shoe_brand, type=BYTE_ARRAY, convertedtype=UTF8"},{"Tag":"name=shoe_name, type=BYTE_ARRAY, convertedtype=UTF8"}]}
Schema output includes the converted type and logical type when they are present in the parquet file; however, default settings are omitted to make the output shorter, e.g.:
- convertedtype=LIST
- convertedtype=MAP
- repetitiontype=REQUIRED
- type=STRUCT
Schema does not output omitstats tag as there is no reliable way to determine it.
Raw format is the schema directly dumped from the parquet file; all other formats are derived from the raw format.
$ parquet-tools schema --format raw testdata/good.parquet
{"repetition_type":"REQUIRED","name":"parquet_go_root","num_children":2,"children":[{"type":"BYTE_ARRAY","type_length":0,"repetition_type":"REQUIRED","name":"shoe_brand","converted_type":"UTF8","scale":0,"precision":0,"field_id":0,"logicalType":{"STRING":{}}},{"type":"BYTE_ARRAY","type_length":0,"repetition_type":"REQUIRED","name":"shoe_name","converted_type":"UTF8","scale":0,"precision":0,"field_id":0,"logicalType":{"STRING":{}}}]}
The go struct format generates a Go struct definition snippet that can be used in Go code:
$ parquet-tools schema --format go testdata/good.parquet
type Parquet_go_root struct {
Shoe_brand string `parquet:"name=shoe_brand, type=BYTE_ARRAY, convertedtype=UTF8"`
Shoe_name string `parquet:"name=shoe_name, type=BYTE_ARRAY, convertedtype=UTF8"`
}
You can turn on --camel-case to convert field names from snake_case_name to CamelCaseName:
$ parquet-tools schema --format go --camel-case testdata/good.parquet
type Parquet_go_root struct {
ShoeBrand string `parquet:"name=shoe_brand, type=BYTE_ARRAY, convertedtype=UTF8"`
ShoeName string `parquet:"name=shoe_name, type=BYTE_ARRAY, convertedtype=UTF8"`
}
Important
parquet-go does not support composite types as map keys or values in Go struct tags for now, so parquet-tools will report an error if there is such a field; you can still output in raw or JSON format:
$ parquet-tools schema -f go testdata/map-composite-value.parquet
parquet-tools: error: go struct does not support LIST as MAP value in Parquet_go_root.Scores
$ parquet-tools schema testdata/map-composite-value.parquet
{"Tag":"name=parquet_go_root","Fields":[{"Tag":"name=name, type=BYTE_ARRAY, convertedtype=UTF8"},{"Tag":"name=age, type=INT32"},{"Tag":"name=id, type=INT64"},{"Tag":"name=weight, type=FLOAT"},{"Tag":"name=sex, type=BOOLEAN"},{"Tag":"name=classes, type=LIST","Fields":[{"Tag":"name=element, type=BYTE_ARRAY, convertedtype=UTF8"}]},{"Tag":"name=scores, type=MAP","Fields":[{"Tag":"name=key, type=BYTE_ARRAY, convertedtype=UTF8"},{"Tag":"name=value, type=LIST","Fields":[{"Tag":"name=element, type=FLOAT"}]}]},{"Tag":"name=friends, type=LIST","Fields":[{"Tag":"name=element","Fields":[{"Tag":"name=name, type=BYTE_ARRAY, convertedtype=UTF8"},{"Tag":"name=id, type=INT64"}]}]},{"Tag":"name=teachers, repetitiontype=REPEATED","Fields":[{"Tag":"name=name, type=BYTE_ARRAY, convertedtype=UTF8"},{"Tag":"name=id, type=INT64"}]}]}
The CSV format is the schema that can be used to import from CSV files:
$ parquet-tools schema --format csv testdata/csv-good.parquet
name=Id, type=INT64
name=Name, type=BYTE_ARRAY, convertedtype=UTF8
name=Age, type=INT32
name=Temperature, type=FLOAT
name=Vaccinated, type=BOOLEAN
Note
Since CSV is a flat 2D format, we cannot generate CSV schema for nested or optional columns:
$ parquet-tools schema -f csv testdata/csv-optional.parquet
parquet-tools: error: CSV does not support optional column
$ parquet-tools schema -f csv testdata/csv-nested.parquet
parquet-tools: error: CSV supports flat schema only
The shell-completions command updates the shell's rc file with the proper shell completion settings. This is an experimental feature at this moment; only bash is tested.
To install shell completions, run:
$ parquet-tools shell-completions
You will not get output if everything runs well; you can check the shell's rc file, for example .bash_profile or .bashrc for bash, to see what was added.
This command will return an error if the same line is already in the shell's rc file.
To uninstall shell completions, run:
$ parquet-tools shell-completions --uninstall
You will not get output if everything runs well; you can check the shell's rc file, for example .bash_profile or .bashrc for bash, to see what was removed.
This command will return an error if the line does not exist in the shell's rc file.
Hit the <TAB> key on the command line when you need a hint or want to auto-complete the current option.
The size command provides various size information: raw data (compressed) size, uncompressed data size, or footer (metadata) size.
$ parquet-tools size testdata/good.parquet
588
$ parquet-tools size --query footer --json testdata/good.parquet
{"Footer":323}
$ parquet-tools size -q all -j testdata/good.parquet
{"Raw":588,"Uncompressed":438,"Footer":323}
The split command distributes data in a source file into multiple parquet files. The number of output files is either the --file-count parameter, or the total number of rows in the source file divided by the --record-count parameter.
The names of the output files are determined by --name-format, which is passed to fmt.Sprintf; the default value is result-%06d.parquet, which means output files will be under the current directory with names result-000000.parquet, result-000001.parquet, etc. You can use any file location that supports write operations, e.g. S3 or HDFS.
Other useful parameters include:
- --fail-on-int96 to fail the command if the source parquet file contains INT96 fields
- --compression to specify the compression codec for output files, default is SNAPPY
- --read-page-size to tell how many rows will be read per batch from the source
Only one verb for integers is allowed in the name format, and it has to be a variant of %b, %d, %o, %x, or %X.
$ parquet-tools split --name-format file-%0.2f.parquet --file-count 3 testdata/good.parquet
parquet-tools: error: invalid name format [file-%0.2f.parquet]: [%0.2f] is not an allowed format verb
$ parquet-tools split --name-format file.parquet --file-count 3 testdata/good.parquet
parquet-tools: error: invalid name format [file.parquet]: lack of usable verb
You can specify width and leading zeros:
$ parquet-tools split --name-format file-%04b.parquet --file-count 3 testdata/all-types.parquet
$ ls file-*
file-0000.parquet file-0001.parquet file-0010.parquet
$ parquet-tools row-count testdata/all-types.parquet
10
$ parquet-tools split --file-count 3 testdata/all-types.parquet
$ parquet-tools row-count result-000000.parquet
4
$ parquet-tools row-count result-000001.parquet
3
$ parquet-tools row-count result-000002.parquet
3
$ parquet-tools row-count testdata/all-types.parquet
10
$ parquet-tools split --record-count 3 --name-format %d.parquet testdata/all-types.parquet
$ parquet-tools row-count 0.parquet
3
$ parquet-tools row-count 1.parquet
3
$ parquet-tools row-count 2.parquet
3
$ parquet-tools row-count 3.parquet
1
The version command provides the version, build time, git hash, and source of the executable; it is quite helpful when you are troubleshooting a problem with this tool itself. The source of the executable can be "source" (or ""), which means it was built from source code; "github", which indicates it came from a github release (including container images and deb/rpm packages, as they share the same build result); or "Homebrew" if it came from Homebrew bottles.
$ parquet-tools version
v1.36.0
-a is equivalent to -bs.
$ parquet-tools version -a
v1.36.0
2025-09-19T05:00:59Z
Homebrew
$ parquet-tools version --build-time --json
{"Version":"v1.36.0","BuildTime":"2025-09-19T05:00:59Z"}
$ parquet-tools version -j
{"Version":"v1.36.0"}
Warning
This is an experimental feature that is still under development; functionality may change in the future.
parquet-tools recognizes the GEOGRAPHY and GEOMETRY logical types:
$ parquet-tools schema --format go testdata/geospatial.parquet
type Parquet_go_root struct {
Geometry string `parquet:"name=Geometry, type=BYTE_ARRAY, logicaltype=GEOMETRY"`
Geography string `parquet:"name=Geography, type=BYTE_ARRAY, logicaltype=GEOGRAPHY"`
}
parquet-tools supports the GEOGRAPHY and GEOMETRY logical types, which were introduced in Apache Parquet Format 2.11.0. However, parquet-tools does not support the GeoParquet format, as it does not provide schema information in the parquet file itself; those fields in a GeoParquet file are just BYTE_ARRAY.
parquet-tools supports different output formats for the GEOGRAPHY and GEOMETRY types:
- geojson: output in GeoJSON format
- hex: output raw data in hex format, plus crs/algorithm
- base64: output raw data in base64 format, plus crs/algorithm
You can use the --geo-format option to change the format of cat command output; the default is geojson.
$ parquet-tools cat --limit 1 testdata/geospatial.parquet
[{"Geography":{"geometry":{"coordinates":[0,0],"type":"Point"},"properties":{"algorithm":"SPHERICAL","crs":"OGC:CRS84"},"type":"Feature"},"Geometry":{"geometry":{"coordinates":[0,0],"type":"Point"},"properties":{"crs":"OGC:CRS84"},"type":"Feature"}}]
$ parquet-tools cat --limit 1 --geo-format geojson testdata/geospatial.parquet
[{"Geography":{"geometry":{"coordinates":[0,0],"type":"Point"},"properties":{"algorithm":"SPHERICAL","crs":"OGC:CRS84"},"type":"Feature"},"Geometry":{"geometry":{"coordinates":[0,0],"type":"Point"},"properties":{"crs":"OGC:CRS84"},"type":"Feature"}}]
$ parquet-tools cat --limit 1 --geo-format hex testdata/geospatial.parquet
[{"Geography":{"algorithm":"SPHERICAL","crs":"OGC:CRS84","wkb_hex":"010100000000000000000000000000000000000000"},"Geometry":{"crs":"OGC:CRS84","wkb_hex":"010100000000000000000000000000000000000000"}}]
MinValue and MaxValue of geospatial columns will be bounding box values if Geospatial Statistics are present; note that MinValue and MaxValue of the underlying BYTE_ARRAY values do not make any sense for these columns.
$ parquet-tools meta testdata/geospatial.parquet
{"NumRowGroups":1,"RowGroups":[{"NumRows":10,"TotalByteSize":4590,"Columns":[{"PathInSchema":["Geometry"],"Type":"BYTE_ARRAY","LogicalType":"logicaltype=GEOMETRY","Encodings":["RLE","BIT_PACKED","PLAIN"],"CompressedSize":1920,"UncompressedSize":2472,"NumValues":10,"NullCount":0,"MaxValue":[16,11],"MinValue":[-3,-8],"CompressionCodec":"SNAPPY"},{"PathInSchema":["Geography"],"Type":"BYTE_ARRAY","LogicalType":"logicaltype=GEOGRAPHY","Encodings":["RLE","BIT_PACKED","PLAIN"],"CompressedSize":1711,"UncompressedSize":2118,"NumValues":10,"NullCount":0,"MaxValue":[10.5,10.5],"MinValue":[0,0],"CompressionCodec":"SNAPPY"}]}]}
This project is inspired by:
- parquet-go/parquet-tools: https://github.com/xitongsys/parquet-go/tree/master/tool/parquet-tools/
- Python parquet-tools: https://pypi.org/project/parquet-tools/
- Java parquet-tools: https://mvnrepository.com/artifact/org.apache.parquet/parquet-tools
- Makefile: https://github.com/cisco-sso/kdk/blob/master/Makefile
Some test cases are from:
- https://registry.opendata.aws/binding-db/
- https://github.com/xitongsys/parquet-go/tree/master/example/
- https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
- https://azure.microsoft.com/en-us/services/open-datasets/catalog/
- https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- https://pro.dp.la/developers/bulk-download
- https://exchange.aboutamazon.com/data-initiative
- https://github.com/apache/parquet-testing/
This project is licensed under the BSD 3-Clause License.