Changelog

June 2025 (v25.6 series)

v25.6.1 (4 June 2025)

  • Per-extension schemas: Each extension (parquet, csv, iceberg) now creates a schema containing the scalar, aggregate, and table functions exported by that extension.

    For frequently used functions (read_parquet, read_csv), aliases are added to the default schema for convenience and continue to work as before. Less frequently used functions have been moved out of the default schema and into the extension-specific schema. For example, parquet_column_metadata is now parquet.column_metadata.
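
    A minimal sketch of calling through the new schema (assuming a local file named data.parquet):

    SELECT * FROM parquet.column_metadata('data.parquet');

    -- Frequently used functions keep their default-schema aliases:
    SELECT * FROM read_parquet('data.parquet');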

Behavior changes

  • All Parquet and Iceberg "metadata" functions have moved.

    • parquet_file_metadata is now parquet.file_metadata
    • parquet_rowgroup_metadata is now parquet.rowgroup_metadata
    • parquet_column_metadata is now parquet.column_metadata
    • iceberg_metadata is now iceberg.metadata
    • iceberg_snapshots is now iceberg.snapshots
    • iceberg_manifest_list is now iceberg.manifest_list
    • iceberg_data_files is now iceberg.data_files
  • read_csv and scan_csv now alias to csv.read with no functional change.

  • read_parquet and scan_parquet now alias to parquet.read with no functional change.

v25.6.0 (3 June 2025)

  • Initial metadata functions for Iceberg: Added functions for reading Iceberg table metadata:

    • iceberg_metadata: Read top-level table metadata.
    • iceberg_manifest_list: Read the manifest list for a specific table version.
    • iceberg_data_files: List info about the data files for a table.

    Behavior and columns returned for these functions are subject to change as Iceberg integration moves forward.
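
    For example, reading top-level metadata for a table (a sketch; the table path here is hypothetical):

    SELECT * FROM iceberg_metadata('s3://bucket/warehouse/my_table');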

  • Additional scalar functions: New functions regexp_instr, sign, and factorial have been added.
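
    A quick illustration (assuming the common two-argument form of regexp_instr and 1-based positions):

    SELECT regexp_instr('abcdef', 'cd'), sign(-42), factorial(5);
    -- 3, -1, 120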

Behavior changes

  • New alias_of column in list_functions: An alias_of column has been added to the list_functions table function, showing whether a function is an alias of another function.

    Function reference docs are generated from the output of list_functions, so we're now able to label function aliases as appropriate.
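
    For example, listing only functions that are aliases of another function (a sketch using just the new column):

    SELECT * FROM list_functions() WHERE alias_of IS NOT NULL;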

May 2025 (v25.5 series)

v25.5.13 (31 May 2025)

  • Additional string functions: Added translate, replace, split_part, and md5 functions for strings.

    Introduced POSITION(substring IN string) syntax as an alternative to the existing strpos function.
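
    Both forms below are equivalent, returning the 1-based position of the substring:

    SELECT position('lo' IN 'hello'), strpos('hello', 'lo');
    -- 4, 4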

  • Additional approximate aggregate function: Added an approx_quantile aggregate function for approximating values at specific quantiles. For example, to get the approximate 75th percentile for my_numeric_column:

    SELECT approx_quantile(my_numeric_column, 0.75) FROM my_table;
    

    Also added approx_unique as an alias to approx_count_distinct for easier discoverability.
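
    And with the new alias, reusing the example table above:

    SELECT approx_unique(my_numeric_column) FROM my_table;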

v25.5.12 (29 May 2025)

  • Infer function to use for TSV files: TSV files can be queried directly without the need to specify a table function:

    SELECT * FROM 'data.tsv';
    

    This will use read_csv to read the file (read_csv automatically detects tab as the delimiter).

    A motivation for this was easily querying automated benchmark results that are uploaded to GCS:

    SELECT _filename, * FROM 'gs://glaredb-bench/results/main/**/results-*.tsv';
    
  • Use u64 instead of usize for file lengths internally: This ensures we're able to read larger files on 32-bit systems (WebAssembly).

    For example, querying a 14GB parquet file using WebAssembly will now work:

    SELECT count(*) FROM 'gs://glaredb-bench/data/clickbench/hits.parquet';
    

    Note that the WebAssembly environment is more resource-constrained than the native binary, including a lack of multi-threading and limited available memory. Complex queries may run more slowly than when using the CLI.

v25.5.11 (24 May 2025)

  • Provide list of files to read: The previous release added the ability to read multiple files at once by using glob patterns. This release builds on that feature by allowing a list of files to be provided as an alternative to the glob syntax.

    SELECT * FROM read_csv(['data1.csv', 'data2.csv'])
    

    This can be used with all file scan functions (read_parquet, read_text, and read_csv).

  • Added parquet_column_metadata: Added a new table function that returns column-level metadata for columns in a parquet file, including each column's physical type, compressed and uncompressed sizes, and more.
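
    For example (assuming a local file named data.parquet):

    SELECT * FROM parquet_column_metadata('data.parquet');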

Behavior changes

  • Column renames for parquet metadata functions: parquet_rowgroup_metadata and parquet_file_metadata both returned a column named file_name. That column has been renamed to filename to be more consistent with other table functions and metadata columns.

Bug fixes

  • Parquet definition level fixes: Fixed issues with correctly reading definition levels for columns, depending on the encoding used for the column.

v25.5.10 (23 May 2025)

  • Initial support for multi-file reads: Added support for reading multiple files in the read_csv, read_text, and read_parquet table functions. Specifying multiple files can be done by providing a path with a glob pattern.

    For example, reading all parquet files in a directory:

    SELECT * FROM 'data/*.parquet'
    

    Glob patterns can also be used when querying files in S3 or GCS:

    SELECT count(*), min(numbers), max(numbers)
    FROM 's3://glaredb-public/testdata/parquet/glob_numbers/{200,400}.parquet'
    

    All files matched by the glob are currently expected to have the same schema. Better support for merging different schemas will come in a future release.

  • Metadata columns: This release also adds support for "metadata" columns. Metadata columns don't exist in the source files; they are added during scanning and only show up in the output when explicitly selected.

    All table functions using the multi-file provider (read_text, read_csv, and read_parquet) now include two metadata columns: _filename and _rowid.

    _filename returns the name of the file as reported by the filesystem implementation doing the read, and _rowid returns the 0-based index of the row within its file.

    Metadata columns can be used just like normal columns in the query. For example, we can get the min and max values for a column for each file matched by a glob pattern:

    SELECT _filename, min(numbers), max(numbers)
    FROM 'gs://glaredb-public/testdata/parquet/glob_numbers/*.parquet'
    GROUP BY _filename
    

v25.5.9 (21 May 2025)

  • Faster remote CSV reads: Reading large CSV files over HTTP or from object stores (S3 and GCS) is now faster. The internal read buffer for CSV files has been enlarged to reduce the number of requests needed to read a file.

Bug fixes

  • Erroneous range requests: When reading a file from an object store or over HTTP, a Range header is added to all requests so that we read only the parts of the file we care about. This allows us to avoid buffering the entire file in memory.

    Occasionally we generated a Range header that couldn't be fulfilled. This mostly impacted reading CSV files: the file is read in fixed-size chunks, and a chunk can end in the middle of a CSV record, which could lead us to generate ranges that tried to read beyond the end of the file. Additional checks are now in place to ensure we don't generate these incorrect ranges.
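
    For illustration, a chunked read issues HTTP requests like the following (the offsets here are hypothetical):

    GET /data.csv HTTP/1.1
    Range: bytes=1048576-2097151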

v25.5.8 (20 May 2025)

  • Function metadata updates: Updated function categories for more fine-grained categorization. Previously all aggregate functions were labeled as "aggregates". Now aggregate functions are split into "general_purpose_aggregates" and "statistics_aggregates". Functions used for operator implementations (add, sub, etc.) have also been recategorized.

    Documentation has been updated to reflect these new categorizations, as has the output of the list_functions() table function.

Bug fixes

  • Invalid table refs with IN subqueries: Occasionally, queries with IN subquery expressions would return the error "Column expr not referencing a valid table ref ...". This meant the planner produced a physical plan in which a column expression tried to reference an unreachable column. The cause was a bug in the join reorder optimizer: when reconstructing a join, its children were added in the wrong order.

    Internally, IN subqueries are implemented with semi joins. A semi join is a regular join where only one side is returned; GlareDB implements a single semi join type that returns only the "left" side. The join reorder optimizer knows this and is careful to keep the "left" and "right" inputs to the join in the correct order. However, when additional filters were pushed down to the "left" or "right" input and the inputs were flipped, the filters were not flipped with them. This produced a "Filter" node containing a predicate with an invalid column reference, since the reference actually pointed to a column on the other side of the join.

  • Decimal overflows with sum: The sum aggregate function previously returned a 64-bit decimal output when its input was a 64-bit decimal. However, this was very susceptible to overflow with high-scale decimals. sum now returns a 128-bit decimal for both 64-bit and 128-bit decimal inputs.

  • Decimal casting with large scale differences: An internal overflow occurred when attempting to cast between decimal scales where the scale difference was large. Converting from one scale to another requires computing a constant multiplier to apply to the underlying values. This multiplier was internally a 32-bit integer, which could easily overflow with larger scale differences.

    Decimal casting now uses an integer the same width as the decimal primitive to avoid the overflow: a 64-bit int for Decimal64, and a 128-bit int for Decimal128.

v25.5.7 (17 May 2025)

  • Async filesystem state loading: Changed the filesystem interface to centralize any requests required before actually reading files from S3 or GCS. Specifically, this allows reusing access tokens obtained via service accounts when querying GCS buckets, reducing the total number of requests made.

  • Labeled materializations in EXPLAIN: Explain output now labels "base" and "materializations" query pipelines, making it easier to see the shape of a query.

    This change affects the output of "unoptimized", "optimized", and "physical" plan types. The "unoptimized" and "optimized" outputs previously inlined the materialization inside the base plan, while the "physical" output previously omitted materializations altogether.

Bug fixes

  • Large joins and materialized subqueries occasionally returning truncated results: Fixed a bug where a correlated subquery that gets materialized on the left side of a large hash join occasionally emitted fewer rows than expected.

v25.5.6 (15 May 2025)

  • File-globbing interfaces: Prepares the initial globbing API for matching file patterns in S3, GCS, and local filesystems. Globbing can be tested using the glob table function. An upcoming release will integrate globbing into existing file scan functions (read_parquet, read_csv).

  • glob table function: Table function that returns file names matching a glob, built on the new file-globbing interfaces.

    For example, querying a public S3 bucket using a glob:

    SELECT filename
    FROM glob('s3://glaredb-public/testdata/csv/glob_numbers/**/{3,5}00.csv');
    

v25.5.5 (12 May 2025)

  • Authenticated GCS access: Enable accessing Google Cloud Storage buckets using service accounts.

  • GCS documentation: Add documentation for querying files in GCS buckets using the GCS File System.

v25.5.4 (11 May 2025)

  • approx_count_distinct: Add a new aggregate function for efficient estimation of distinct values.
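
    For example (the table and column names are hypothetical):

    SELECT approx_count_distinct(user_id) FROM events;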

  • Correlated subquery fix: Ensure joins used for decorrelating subqueries honor proper set semantics. See #3621.

  • Unauthenticated GCS access: Introduce read-only, unauthenticated access to public GCS buckets.

v25.5.3 (07 May 2025)

  • High-core-count performance: The partitioned aggregate hash table is now fully lock-free, reducing contention on machines with 64+ cores.

  • Parallel build optimization: Hash table initialization has been moved into the normal execution path, enabling multiple hash tables to be initialized in parallel.

  • Under-the-hood improvements: Removed Mutex locks from the build phase and deferred aggregate table allocation to execution time.

v25.5.2 (06 May 2025)

  • Parquet scan filters: Added early pruning of row-groups via scan filters, significantly reducing IO on large datasets.

  • Casting refinements: Tweaked cast rules to avoid unnecessary runtime casts and improve filter pushdown accuracy.

  • Aggregate hash table tweaks: General performance boosts, including specialized integer-sum implementations and DataType refactoring.

v25.5.1 (04 May 2025)

  • LIMIT hint pushdown: Limit hints are now pushed down into Sort operators, limiting the rows they process and speeding up queries that combine ORDER BY and LIMIT.
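
    Queries of this shape benefit (the table and column names are hypothetical):

    SELECT * FROM events ORDER BY created_at DESC LIMIT 10;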

  • Expanded bitwise/exponent operators (see the examples after this list):

    • Shift left (<<) and shift right (>>) scalar functions.
    • Bitwise AND (&), OR (|), XOR (#/XOR), and NOT (~) functions.
    • Exponentiation operator (^) for Float64.
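
    A quick tour of the new operators (integer results assume two's-complement semantics):

    SELECT 1 << 3, 16 >> 2;   -- 8, 4
    SELECT 12 & 10, 12 | 10;  -- 8, 14
    SELECT 12 # 10, ~5;       -- 6, -6
    SELECT 2.0 ^ 10;          -- 1024 (Float64 exponentiation)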

v25.5.0 (03 May 2025)

  • Optimizer rule: Introduced common sub-expression elimination for query plans.
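
    For example, in a query like the sketch below (hypothetical names), the repeated a + b sub-expression is now computed once and reused:

    SELECT (a + b) * 2, (a + b) * 3 FROM t;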

  • Versioning: Switched to a new, date-based versioning scheme.

  • Performance fixes:

    • Support for empty projection lists in scans.
    • Cast-flatten optimization in expression planning.