database-fundamentals, engineering
February 14, 2024

Database Performance: On Indexing and Analytics Storage

Sam Kleinman, Engineering Lead

Database performance can be a pretty gnarly topic, and the answer to most questions is "it depends." If you dig a little deeper, database performance is really all about trade-offs. Being good at database performance as a user often centers on accurately predicting how data will be created and used long before any data (or code!) is written.

Indexes (in online-transactional systems) are the best example of this: an index provides a layer of indirection between the values of a specific field (column) and their storage. Operating like a lookup table, an index lets the database examine (and read into memory) far less of the table to satisfy a query. Because indexes are ordered, you can get ordered results by reading records in index order. They come with a trade-off: when you insert or update a record, you also have to update its index entries. Write performance is traded for a lot of read performance and much improved memory efficiency. It's a good trade.
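To make that concrete, here is a minimal sketch in Postgres-style SQL (the table and index names are hypothetical; GlareDB's native analytics storage works differently, as discussed below):

    -- Every INSERT or UPDATE on orders now also maintains this index...
    CREATE TABLE orders (
        id          BIGINT PRIMARY KEY,
        customer_id BIGINT,
        total       NUMERIC,
        created_at  TIMESTAMP
    );
    CREATE INDEX orders_customer_created_idx ON orders (customer_id, created_at);

    -- ...but this query only reads the matching index entries (plus the
    -- referenced rows) instead of scanning the whole table, and because
    -- the index is ordered it can return sorted results cheaply.
    SELECT id, total
    FROM orders
    WHERE customer_id = 42
    ORDER BY created_at;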

Different kinds of indexes are appropriate for different kinds of data and queries: scalar data, coordinate/location (geospatial) data, vector embeddings (common in LLM applications), natural language search. Some indexes are only efficient for equality queries, while others support a wider range of operators.
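As an illustration, here is a hedged Postgres-style sketch (hypothetical table and column names) of matching index types to the data and queries they serve:

    CREATE TABLE places (id BIGINT, name TEXT, location POINT, description TEXT);

    -- B-tree: equality and range queries on scalar values.
    CREATE INDEX places_name_idx ON places (name);

    -- GiST: geometric queries such as containment and nearest-neighbor.
    CREATE INDEX places_location_idx ON places USING gist (location);

    -- GIN: natural-language (full-text) search over an expression.
    CREATE INDEX places_description_idx ON places
        USING gin (to_tsvector('english', description));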

In analytics databases, indexes don't make the same kind of sense. Distributed query processing means that datasets rarely need to sit in RAM on a single machine, expected/tolerated query latency is higher, and analytics systems tend to store data in ways that make queries much more efficient, particularly for large data sets. Columnar storage writes the values of the same column from many records sequentially on disk, so queries that only touch certain columns, particularly queries that aggregate or summarize values rather than collect them, can be very efficient.
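For example, in a query like the following (hypothetical table and column names), a columnar engine only has to read the two referenced columns from disk, no matter how wide the table is:

    -- Only the region and amount columns are read; every other column in
    -- the sales table stays on disk untouched.
    SELECT region, sum(amount) AS total_sales
    FROM sales
    GROUP BY region;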

The trade-off is that queries that need to read entire records are less efficient, and writing single records is (often) quite slow. This is reasonable, particularly for analytics workloads. But just like there are many different index formats (and row-based storage engines!), there are also a lot of different ways to build a columnar storage engine, each with its own set of trade-offs and the kinds of queries it is particularly optimized for.

Indeed, if you optimize a storage format for a specific kind of query or a specific kind of data, you can probably build something that will always win. The challenge is that you rarely have only one kind of data, and there are a lot of dimensions of performance in addition to query latency: write performance, on-disk compactness, in-memory compactness, CPU utilization.

Storage engines are hard.

While analytics databases don't really have indexes that serve as lookup tables, they do keep summaries and metadata that describe ranges of data (min/max statistics over chunks of a column, for example), which can have a great impact on performance. It's also useful and quite common for analytics engines or systems to materialize (write to disk) the results of queries or sub-queries to accelerate some operations. ETL systems often use the T (transform) step to materialize data into forms that support analytics operations.
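A materialization step can be as simple as writing an expensive aggregation out once, so that downstream queries read the small summary instead of re-scanning the raw data. Here is a generic SQL sketch with hypothetical table names:

    -- Materialize the expensive aggregation once.
    CREATE TABLE daily_signups AS
    SELECT date_trunc('day', created_at) AS day,
           count(*) AS signups
    FROM events
    WHERE event_type = 'signup'
    GROUP BY 1;

    -- Later queries hit the small, precomputed summary.
    SELECT day, signups
    FROM daily_signups
    ORDER BY day DESC
    LIMIT 30;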


This leaves everything a bit muddy, doesn't it? What's the best storage engine? It depends, after all.

A slightly better answer is that while there are definitely bad engines for an analytics workload, there isn't a best storage engine for analytics workloads, and there certainly isn't a one-size-fits-all solution.

That's where GlareDB comes in:

  • GlareDB is built to support different storage engines and storage formats, and to make it possible to join data across different kinds of storage and data systems.

  • GlareDB has a "native" storage engine based on the Delta Lake protocol that uses Parquet as the storage format. Parquet is pretty good for many cases: it is reasonably compact, has solid columnar access properties, per-column compression, and many tunable properties.

  • GlareDB's COPY TO operation makes it possible to materialize entire queries, or portions of queries, to native storage where that helps optimize a workflow or operation; see the sketch after this list.
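Here is a hedged sketch of what that materialization might look like; it assumes a Postgres-style COPY (query) TO form and hypothetical table and file names, so check the GlareDB documentation for the exact syntax, options, and supported destinations:

    -- Materialize an aggregation to a Parquet file that later queries
    -- (or other tools) can read directly.
    COPY (
        SELECT customer_id, sum(total) AS lifetime_value
        FROM orders
        GROUP BY customer_id
    ) TO 'lifetime_value.parquet';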

Get started now

Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!

$ curl https://glaredb.com/install.sh | sh
Try GlareDB Cloud