The parquet extension enables direct querying of Parquet files. It is included
by default in the CLI, Python, and WebAssembly (Wasm) bindings.
read_parquet
Alias: parquet_scan
The read_parquet function takes a path to a Parquet file and returns a table
containing the data.
SELECT * FROM read_parquet('cities.parquet');
By default, read_parquet infers column data types from the Parquet file schema.
You can inspect the inferred column names and types using the DESCRIBE statement:
DESCRIBE read_parquet('cities.parquet');
This returns a table with the name and data type of each column.
For S3 sources, additional parameters can be provided:
SELECT * FROM read_parquet('s3://bucket-name/path/to/file.parquet',
    region='us-east-1',
    access_key_id='YOUR_ACCESS_KEY',
    secret_access_key='YOUR_SECRET_KEY');
Parquet files can also be queried directly by using the file path or URI in the FROM clause:
SELECT * FROM 'cities.parquet';
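Because the quoted path stands in for a table, the rest of the query can be written as usual. The following is a minimal sketch, assuming the shorthand accepts aggregates like any other table source (the text above only shows SELECT *):

```sql
-- Count the rows in the file without naming any columns.
-- Assumption: the quoted-path shorthand behaves like a regular table in FROM.
SELECT COUNT(*) AS row_count
FROM 'cities.parquet';
```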
parquet_file_metadata
Returns high-level metadata about a Parquet file.
Column | Description |
---|---|
filename | Name of the file being queried. |
version | Parquet format version used in the file. |
num_rows | Total number of rows in the file. |
created_by | Application or library that wrote the file. |
num_row_groups | Number of row groups contained within the file. |
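As an illustration, the columns above could be queried as follows. This sketch assumes that parquet_file_metadata is invoked as a table function taking a file path, in the same way as read_parquet; the call form is not stated above.

```sql
-- Assumption: parquet_file_metadata takes a file path, like read_parquet.
-- Returns one summary row for the file.
SELECT filename, version, num_rows, num_row_groups
FROM parquet_file_metadata('cities.parquet');
```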
parquet_rowgroup_metadata
Returns metadata for each row group within a Parquet file.
Column | Description |
---|---|
filename | Name of the file being queried. |
num_rows | Number of rows in the row group. |
num_columns | Number of columns in the row group. |
uncompressed_size | Uncompressed size of the row group in bytes. |
ordinal | Zero-based ordinal of the row group within the file. |
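A short sketch of how the row group metadata might be inspected, for example to check whether row groups are evenly sized. It assumes that parquet_rowgroup_metadata is a table function taking a file path, like read_parquet.

```sql
-- Assumption: parquet_rowgroup_metadata takes a file path, like read_parquet.
-- Lists each row group in file order with its row count and size.
SELECT ordinal, num_rows, num_columns, uncompressed_size
FROM parquet_rowgroup_metadata('cities.parquet')
ORDER BY ordinal;
```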
parquet_column_metadata
Returns metadata for each column in each row group within a Parquet file.
Column | Description |
---|---|
filename | Name of the file being queried. |
rowgroup_ordinal | Zero-based ordinal of the row group within the file. |
column_ordinal | Zero-based ordinal of the column within the row group. |
physical_type | Physical storage type of the column (e.g., INT32, BYTE_ARRAY). |
max_definition_level | Maximum definition level for the column. |
max_repetition_level | Maximum repetition level for the column. |
file_offset | Byte offset of the column chunk in the file. |
num_values | Number of values stored in the column chunk. |
total_compressed_size | Compressed size of the column chunk in bytes. |
total_uncompressed_size | Uncompressed size of the column chunk in bytes. |
data_page_offset | Byte offset from the beginning of the file to the first data page. |
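The size columns can be combined to estimate how well each column compresses. The query below is a sketch under the assumption that parquet_column_metadata is a table function taking a file path, like read_parquet; it sums the column chunks across all row groups.

```sql
-- Assumption: parquet_column_metadata takes a file path, like read_parquet.
-- Aggregates column chunks across row groups to get a per-column ratio.
SELECT column_ordinal,
       physical_type,
       SUM(total_compressed_size)   AS compressed_bytes,
       SUM(total_uncompressed_size) AS uncompressed_bytes,
       -- NULLIF avoids division by zero for empty column chunks.
       SUM(total_uncompressed_size) * 1.0
           / NULLIF(SUM(total_compressed_size), 0) AS compression_ratio
FROM parquet_column_metadata('cities.parquet')
GROUP BY column_ordinal, physical_type
ORDER BY column_ordinal;
```

A ratio close to 1 suggests the column gains little from compression, while larger values indicate more savings.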