GlareDB | GlareDB Roadmap 2024

tl;dr> This post is a high-level outline of the projects that we're working on at the core of GlareDB: execution model changes, interoperability enhancements, and ergonomic improvements.

Projects

Distributed Execution

In summer 2023 we developed hybrid execution to scale workloads that have local and remote components. Distributed execution builds on this work by scaling further and extending to workloads where all execution runs remotely. The idea is to split a query's execution into partitions and use idle capacity in a GlareDB cluster to work on them in parallel. As a result, query execution is greatly accelerated, particularly for very large workloads.
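As a rough illustration of the partition-and-merge idea only (not GlareDB's actual implementation), the sketch below computes an average by splitting rows into partitions, handing each partition to a worker, and merging the partial results. The function names are hypothetical, and a local thread pool stands in for idle nodes in a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(partition):
    """Work done on one partition: a partial aggregate (sum, row count)."""
    return sum(partition), len(partition)


def split(rows, n_partitions):
    """Split rows into contiguous partitions of roughly equal size."""
    size = max(1, -(-len(rows) // n_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def distributed_avg(rows, n_workers=4):
    """Compute AVG by fanning partitions out to workers, then merging partials.

    Assumption: threads stand in for remote workers; a real cluster would
    ship partitions (or plan fragments) over the network instead.
    """
    partitions = split(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(distributed_avg(list(range(1, 101))))
```

The key property is that the aggregate decomposes: each worker only needs its own partition, and the final merge is cheap compared with the scan.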

This is probably the most exciting feature we're working on and is currently in development. Given the model of GlareDB, horizontal scalability is possible without meaningfully increasing complexity. Queries will just get faster, and compute clusters will get much more efficient, and potentially very elastic. This enables all sorts of really cool potential future features like geographically-aware query execution, retriable sub-queries, and first-class serverless operations.

Standard Integration

The core tenet of GlareDB until now has been the fact that it integrates with your data wherever it lives and in whatever form or system it's in. A challenge is that there are always more integrations and exciting experiments that we'd like to run, each of which requires developer time to implement. To make this easier, we've been looking at adding support for three core protocols:

ADBC is an analytics (columnar) focused database connector protocol, as an analog to ODBC, that provides an in-process protocol for exposing data. It's our goal to add support for ADBC data sources both for local integration, and to streamline our own integration projects. We also want to provide ways for you to use GlareDB as an ADBC data source to improve some of the embedded cases and provide a more idiomatic API in addition to our Python and Node.js bindings.
Flight and FlightSQL provide an analytics-optimized wire protocol for interacting with databases. GlareDB currently exposes the Postgres wire protocol, which is great for its simplicity and ubiquity; however, it isn't always a great fit for our data model or typical workloads.
Substrait is a protocol for expressing relational algebra (e.g. queries) in a form that's more abstract than SQL and easier to manipulate programmatically. This means that rather than requiring clients and users to build long and complicated SQL queries (as strings), Substrait support will allow clients and users to write queries and operations in data frame libraries like Polars and have them run in GlareDB, or in another system.

We're also interested in watching the development and growth of Substrait, because it could allow us to translate queries and push more of them down to GlareDB data sources, which means faster and more efficient query execution.
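To make "easier to manipulate programmatically" concrete, here is a toy sketch of a query represented as a tree of relational operators rather than a SQL string. This is not the Substrait format or any Substrait API; the `Scan`, `Filter`, and `Project` classes and the `push_down_filter` rewrite are hypothetical names used only for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Any

# Comparison operators the toy Filter node understands.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass
class Scan:
    rows: list  # in-memory rows standing in for a data source


@dataclass
class Filter:
    source: Any
    column: str
    op: str
    value: Any


@dataclass
class Project:
    source: Any
    columns: list


def execute(plan):
    """Evaluate a plan tree against in-memory rows (lists of dicts)."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        compare = OPS[plan.op]
        return [r for r in execute(plan.source) if compare(r[plan.column], plan.value)]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.source)]
    raise TypeError(f"unknown plan node: {plan!r}")


def push_down_filter(plan):
    """Rewrite Filter(Project(x)) as Project(Filter(x)).

    The filter column must already be in the projection for the original
    plan to be valid, so moving the filter below the projection is safe.
    """
    if isinstance(plan, Filter) and isinstance(plan.source, Project):
        proj = plan.source
        pushed = Filter(proj.source, plan.column, plan.op, plan.value)
        return Project(pushed, proj.columns)
    return plan


rows = [
    {"name": "a", "stars": 10, "lang": "rust"},
    {"name": "b", "stars": 3, "lang": "go"},
]
plan = Filter(Project(Scan(rows), ["name", "stars"]), "stars", ">", 5)
rewritten = push_down_filter(plan)
```

Because the plan is data, a rewrite like this is a small tree transformation; doing the same on a SQL string would require parsing it first.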

This work is progressing in the core of GlareDB as well as in the relevant open source projects related to these protocols.

Data Models and Use Case Development

There are many small and medium-sized features with big impacts that we're excited to bring to GlareDB in the coming months:

We'd like to integrate with databases and tables sourced from OpenAPI specs as well as other popular REST APIs. We currently support line-separated JSON HTTP APIs and a collection of different object store APIs, but enabling access to APIs like HubSpot, Salesforce and GitHub would be very powerful.
We recently launched Node.js bindings for using GlareDB from JavaScript applications, and we'd like to develop the ecosystem and integrations to use GlareDB in this context.
At the end of 2023 we began adding support for Lance, a data serialization format designed to optimize vector search and support LLM integration. We want to continue to develop and expose more functionality for working with Lance, and are considering support for generating and writing Lance data.
KDL is a document serialization format that is easily human-readable (like TOML or YAML) but with the richness and semantic power of something like XML; it is well specified and has good library support. We're adding kdl_match() and kdl_select() functions for filtering and projecting data using the KQL query language from columns that store KDL data. We're exploring other ways of using KDL.
We're also working on adding additional support for BSON, MongoDB's document format. We've made some recent improvements to the way GlareDB handles data from MongoDB. This includes support for handling files in BSON format, which, while row-oriented, is a compelling interchange format for storing document-style data for use in database systems, particularly compared to JSON or CSV.

GlareDB should be able to read BSON files from object stores and the local file system, and will be able to write BSON document streams to files or object stores.
For a third document-data related feature, we'd like to add and improve support for working with JSON data. There are a number of tools that we're discussing. Some possibilities include:
- emulating or providing PostgreSQL's JSON functions.
- supporting jq-style manipulation of JSON documents as a GlareDB function.
- struct syntax for reaching into JSON documents.
If you have a lot of JSON data, we'd love to hear from you!
We'd also like to provide improved support for interacting with and manipulating time series data from metrics stores like Prometheus and TimescaleDB, as well as improved support for time series aggregation and window functions. Modern applications collect so much time series data that it is often inaccessible, or difficult to integrate efficiently with other data without a large loss of fidelity. GlareDB is uniquely suited to address this.
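Returning to the JSON item above: one way to picture "reaching into JSON documents" with a struct-like path is the small sketch below. It is only an illustration in plain Python; `json_path` is a hypothetical helper, not a GlareDB function, and the dotted-path syntax is an assumption.

```python
import json


def json_path(document: str, path: str):
    """Follow a dotted path (e.g. "repo.tags.1") into a JSON document.

    Object keys are matched by name and array elements by integer index.
    Returns None when any step of the path is missing.
    """
    value = json.loads(document)
    for part in path.split("."):
        if isinstance(value, dict) and part in value:
            value = value[part]
        elif isinstance(value, list) and part.isdigit() and int(part) < len(value):
            value = value[int(part)]
        else:
            return None
    return value


doc = '{"repo": {"name": "glaredb", "tags": ["sql", "rust"]}}'
name = json_path(doc, "repo.name")
second_tag = json_path(doc, "repo.tags.1")
```

Returning None for a missing step mirrors how SQL treats absent values as NULL, which tends to be more convenient in queries than raising an error per row.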

GlareDB Cloud

In addition to all of the work going on in the core database this year, there's a lot of exciting work for GlareDB Cloud. We're making lots of improvements to UX and UI, as well as developing integrations with other interfaces and platforms so that GlareDB Cloud is always easy to access and leverage when you need it.

We're also working on bringing some exciting features to GlareDB Cloud directly:

timers, so you can run queries/operations on timers to orchestrate background jobs.

You can, of course, get similar results by using your own orchestration system today, and we recommend these kinds of workflows, but having to maintain your own orchestration system doesn't feel very serverless.
cached/persistent cursors, so you can start long-running queries to run in GlareDB Cloud without needing a client connected. Start a query, get a query/cursor ID, disconnect, and then later use that cursor ID to get your query results.

Today, you can get a similar effect by fully materializing the data, but this might require more work.
streams, for ingesting and producing streams of data. This is perhaps the item in this post that is most likely to move to the following year, but GlareDB should be able to interact with data that's more real-time and alive than files in object storage and external systems.

We're thinking about data sources from popular message busses, access to GlareDB streams and data via similar streaming protocols, as well as larger improvements in storage protocols for saving streaming data.
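The cached/persistent cursor workflow described above (start a query, get an ID, disconnect, fetch later) can be sketched in a few lines. This is a toy, in-process model and not the GlareDB Cloud API: `CursorStore` and its methods are hypothetical, and a background thread pool stands in for server-side execution.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class CursorStore:
    """Hypothetical server-side registry mapping cursor IDs to pending results."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def start(self, query_fn):
        """Begin running a query in the background and return its cursor ID."""
        cursor_id = str(uuid.uuid4())
        self._futures[cursor_id] = self._pool.submit(query_fn)
        return cursor_id

    def is_done(self, cursor_id):
        """Report whether the query behind this cursor has finished."""
        return self._futures[cursor_id].done()

    def fetch(self, cursor_id, timeout=None):
        """Wait for the result, then release the cursor.

        Raises KeyError for an unknown or already-fetched cursor ID.
        """
        result = self._futures[cursor_id].result(timeout=timeout)
        del self._futures[cursor_id]
        return result


store = CursorStore()
# The "client" only keeps the ID; it could disconnect here and come back later.
cursor_id = store.start(lambda: sum(range(1000)))
rows = store.fetch(cursor_id)
```

The client holds nothing but an opaque ID between the two calls, which is what lets it disconnect while the query keeps running.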

Conclusion

We don't have precise timelines for any of these features, but we're excited about them and know that 2024 is going to be a great year for GlareDB.

Are you especially excited about any of these features? Are there additional features you'd like to see? Tell us! You can file a GitHub issue, join our Discord, or send an email to hello@glaredb.com.