It’s been a little quiet around here lately—but for good reason. We’ve been deep in the weeds working on the next major version of GlareDB, and one of the biggest changes under the hood is a complete shift away from DataFusion in favor of a fully custom execution engine.
This rewrite has been a long journey, and while we’re not at the finish line just yet, we’ve made some major progress. We’ll be diving into the technical details in future posts, but for now, I wanted to share some of the core ideas behind what we’re building—and why this rewrite was necessary.
GlareDB is built for people who are fluent in SQL and constantly pushing its limits. We want to make sure the engine can handle the kinds of queries real users write—whether that means deeply nested subqueries or complex joins with a wide range of expressions and filters.
So that’s where we started: we built a growing set of test cases representing these challenging patterns, then worked backward to design an engine that could support them naturally.
At the SQL level, that means full support for lateral and correlated subqueries right out of the box. Here’s a quick example:
```sql
CREATE TEMP TABLE ranges (range_start INT, range_end INT);
INSERT INTO ranges VALUES (1, 4), (5, 7);

SELECT * FROM ranges, generate_series(range_start, range_end);
```
| range_start | range_end | generate_series |
|-------------|-----------|-----------------|
| 1           | 4         | 1               |
| 1           | 4         | 2               |
| 1           | 4         | 3               |
| 1           | 4         | 4               |
| 5           | 7         | 5               |
| 5           | 7         | 6               |
| 5           | 7         | 7               |
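The correlated case works the same way. Here’s a sketch of the kind of correlated scalar subquery the engine is designed to handle; it reuses the `ranges` table from above, and the aliases are purely illustrative:

```sql
-- For each range, count how many ranges start before it. The inner
-- query references r.range_start from the outer query, which makes
-- it a correlated scalar subquery.
SELECT r.range_start,
       (SELECT count(*)
          FROM ranges r2
         WHERE r2.range_start < r.range_start) AS earlier_ranges
FROM ranges r;
```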
From the perspective of someone building a query engine, this kind of operation is complex. But that complexity shouldn’t leak out to the user. Our job isn’t to build a system that’s easy for us; it’s to build a system that makes it easy for a BI analyst to ask the right question and get the right answer.
Alongside ergonomics, reliability is a core pillar of this rewrite. Sometimes, the most intuitive query is also the most resource-intensive to execute—but that shouldn’t be the user’s problem.
A database should either return the right result or fail gracefully. Crashing under pressure shouldn’t just be rare; it shouldn’t happen at all. Users shouldn’t have to worry about their queries overloading the system. They should be focused on getting insights, and the database should be built to support that.
To make that happen, we’re being extremely intentional about how we manage resources—everything from buffer allocation and reuse, to operator memory usage and failure points. That’s also a big reason we’ve stepped away from arrow-rs and related libraries. While our internal execution format is inspired by Arrow’s columnar layout, we’ve gone further when it comes to memory management and performance tuning.
And this attention to detail doesn’t stop at buffers. Every part of the system is being rethought and rebuilt with reliability in mind.
This has been a long and challenging rewrite—but we’re finally getting to the fun part. Over the next few weeks and months, we’ll be shipping releases so people can start trying out the new engine on real workloads. We’re excited to hear what you think.
In the short term, we’ve removed support for querying external systems like BigQuery and Postgres. We weren’t seeing enough value to justify the maintenance cost at this stage. But file formats like CSV, Parquet, and JSON will continue to be first-class citizens, and we’re investing heavily in deep integrations with Iceberg (catalogs and tables) and Delta Lake.
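For a sense of what that looks like in practice, here’s a quick sketch of querying a Parquet file directly. The path is a placeholder, and the exact syntax assumes file-path scans behave as they do in current releases:

```sql
-- './events.parquet' is a placeholder path; swap in your own file.
SELECT * FROM './events.parquet';
```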
If you’re curious to follow along or contribute, check out the GlareDB repo. We’ll be posting more updates, deep dives, and design breakdowns soon, so stay tuned.