April 10, 2024

7 Ways to Make Queries Faster

Sam Kleinman, Engineering Lead

tl;dr: This post dives into ways to speed up your database queries without getting too technical. It's about using what you've got more wisely, like making sure your database isn't wasting effort on unnecessary work. Key points include optimizing how much memory you're using, compressing data to save space, and smart use of indexes to skip unnecessary data scans. There's a tip on saving query results for quick access later, plus arranging your data to match how you'll query it. Also, it touches on the idea of letting the database do more of the work up front, which can mean faster results. In essence, it's about making small, smart changes for a big impact on speed.

1. Allocate More Resources.

Typically, there are only two things that are slow in a database system: network access and disk access. Every other operation is always orders of magnitude faster. So if you want to make a system fast, give it more memory, and make sure it runs on a very fast network with very short hops to whatever it connects to.

While "keep all your data in RAM" is a great way to make things fast, it's also very expensive, and inefficient data structures will still be slow. So while adding RAM is often the answer, you can't just add RAM. Anyway, since it's often not feasible to keep all data in RAM, databases--and we as users of databases--try to use RAM as efficiently as possible. The goal is to manage and access a large amount of data while using as little RAM as possible so that all of the data needed by the queries running during a specific time window--the working set--can all exist in RAM at once. This reduces the number of times that the query has to fetch data from slower storage and reduce the amount of filtering that has to happen in memory.

All of the optimizations and "tricks" in this post boil down to finding ways to more effectively use the memory you do have.
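To make the resource side concrete: in a system you manage yourself, such as Postgres, you'd raise the memory available for caching and for per-query work. The settings below are standard Postgres knobs, but the values are arbitrary examples, and managed systems like GlareDB Cloud handle this sizing for you.

    -- Illustrative Postgres settings; values are examples, not recommendations.
    ALTER SYSTEM SET shared_buffers = '8GB';  -- buffer cache size (takes effect after a restart)
    SET work_mem = '256MB';                   -- per-operation sort/hash memory for this session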

2. Compress More Data: Trade CPU for Time.

Canonically, database workloads are very I/O- and memory-intensive, but tend to under-utilize the CPU on modern systems. Therefore, taking steps like compressing data will be more expensive in terms of CPU, but can also reduce the amount of memory we use, and so it ends up being a good trade.

You have to look at the throughput and average latency of the system as a whole, though: compressing data in memory will make some specific operations slower, but it also means that if you have a lot of orthogonal queries running at once, they will all run faster.
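As one concrete (and hedged) example, Postgres 14 and later let you choose the compression method for large column values, while columnar engines generally pick encodings such as dictionary or run-length encoding on their own. The table and column here are hypothetical.

    -- Hypothetical table; Postgres syntax for compressing large (TOASTed) values.
    ALTER TABLE events ALTER COLUMN payload SET COMPRESSION lz4;

    -- Check which compression method a stored value actually uses.
    SELECT pg_column_compression(payload) FROM events LIMIT 1;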

3. Rely on Index Data.

This is particularly true for transactional (typically row- or record-oriented) systems, but it applies to Lance as well. Basically, indexes store a copy of a portion of the data with references to the complete record. That way, instead of scanning all of the records in a table, you can just scan the index, which is a much smaller amount of data, and chances are that most of the index can remain in memory.
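Here's a minimal Postgres-style sketch of that idea; the orders table and its columns are invented for illustration. The index carries just the columns the query needs, so the planner can answer it with an index-only scan instead of reading the whole table.

    -- Hypothetical table: index the filter column and include the value we aggregate.
    CREATE INDEX orders_customer_total_idx ON orders (customer_id) INCLUDE (total);

    -- This can now be answered from the index alone (an index-only scan),
    -- touching far less data than a full table scan.
    EXPLAIN SELECT sum(total) FROM orders WHERE customer_id = 42;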

The trade-off with indexes is that the total on-disk data size increases, because you're storing additional (partial) copies of some data. Additionally, indexing can slow the throughput of writes--particularly updates--because the database must update both the table and every affected index for a single logical operation. This last concern is not relevant for analytical databases that use columnar formats (because the data for a single record is already not stored in one place) but can have an impact otherwise.

Lance is an analytical/columnar data format, optimized for the kind of random access patterns that are common in AI/ML workloads. The optimization behind that fast random access? Indexes, which is great!

4. Materialize Views.

If you have a query that you end up running many times, where the results don't change, you can create a "materialized view", which, for non-database engineers, is just "writing a copy of the results to disk." That way, when you run the query, instead of processing all the input data again, you can read the results directly from disk. While I/O is slow, sequential I/O is faster, and materialized views are often much smaller than the data required to compute them. It's a good trade-off.
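A minimal sketch in Postgres-flavored SQL; the table, columns, and view name are made up for illustration.

    -- Precompute an expensive aggregation once and store the result on disk.
    CREATE MATERIALIZED VIEW daily_sales AS
    SELECT order_date, sum(total) AS revenue
    FROM orders
    GROUP BY order_date;

    -- Later queries read the small, precomputed result instead of re-scanning orders.
    SELECT revenue FROM daily_sales WHERE order_date >= '2024-01-01';

    -- Refresh when the underlying data has changed enough to matter.
    REFRESH MATERIALIZED VIEW daily_sales;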

5. Store Data in Ways that Anticipate Query Patterns.

This "trick" is really the fundamental principle of all database design:

  • relational systems provide very fast joins, and so by normalizing data and storing parts of data in different tables, they can reduce the amount of data they need to read and return to the client, which is a good fit for some transactional workloads.

  • document/non-relational databases store logical units of data together in documents, which reduces (or entirely removes) the need for joins and transactions, and means that data requires relatively little post-processing to return to the user.

  • analytics engines use columnar formats that store columns together, rather than records, which makes it possible to compute aggregates over values or fields very efficiently, and reduces the amount of churn and data that an operation needs to pull into memory.

While a lot of the "pick a data format that matches your data and use case" is really a decision that your database system makes for you, you can make choices in any system about how much normalization you want to have. In columnar stores you can often control the ordering of the records, which can greatly optimize sort operations.
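For instance, if most of your queries filter on a date, rewriting the table in that order keeps related rows physically close together. The names below are hypothetical, and the exact mechanism (a sort key, clustering, or an explicit rewrite) varies by engine.

    -- Rewrite the table sorted by the column most queries filter on.
    CREATE TABLE events_by_day AS
    SELECT * FROM events
    ORDER BY event_date;

    -- A date-range scan now reads a contiguous, much smaller slice of the data.
    SELECT count(*) FROM events_by_day
    WHERE event_date BETWEEN '2024-03-01' AND '2024-03-31';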

6. Push Operations and Logic Down as Far as Possible.

Flippantly, "pushing down operations" to lower levels of the system is "optimization by making the performance someone else's problem," and while this sounds bad, there's logic to it.

For instance, when you query a Postgres database with GlareDB, it's possible to just read all the rows from Postgres and then filter and process the data in GlareDB. But it's going to be faster if GlareDB can push down most of the query to Postgres. The result is that GlareDB has to receive much less data from Postgres and can take advantage of Postgres's indexes and other optimizations. Less data transmitted means less time waiting on the network, and less work for the higher-level systems to do.

For most second-order data sources, GlareDB will push down projections (SELECT) and predicates (WHERE), and perform aggregates, joins, and sorts locally, though many more operations can be pushed down.
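As a sketch of what that looks like in practice--assuming GlareDB's read_postgres table function, with placeholder connection details and table names--the column list and the WHERE clause can be evaluated by Postgres, while the aggregation happens in GlareDB:

    -- Placeholders throughout; check the GlareDB docs for the current syntax.
    SELECT customer_id, sum(total) AS revenue
    FROM read_postgres('postgres://user:pass@host:5432/db', 'public', 'orders')
    WHERE order_date >= '2024-01-01'  -- predicate: pushed down to Postgres
    GROUP BY customer_id;             -- aggregate: computed locally by GlareDB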

7. Manage Parallelism.

A lot of queries have orthogonal components that can be processed in parallel, which can speed up the query as a whole. Many modern database systems provide some kind of parallel query processing that can take advantage of the extra capacity on the system (the otherwise underutilized CPU) to process the query faster.

Parallelism isn't a silver bullet, and in many cases adding parallelism--or too much of it--will make queries slower. The key here is orthogonal components: if the query requires a sort or "group by" before much of the processing can occur, then there are often few opportunities for parallelism. If the components of the query aren't orthogonal, the system will have to synchronize the processing, which can slow things down. There is often some fixed cost to coordinating parallel work that serial execution avoids: for queries that are already fast, adding parallelism will often make them slower.

Some systems provide explicit controls for parallelism, while other systems (like GlareDB) provide more managed execution. Authors of code that uses databases can also examine query plans and think about the orthogonality of their queries to improve performance.
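In systems that expose the knobs directly--Postgres, for example--you can cap parallelism per query and check the plan to see whether it's actually being used; the table here is hypothetical, and GlareDB makes these decisions for you.

    -- Allow up to four parallel workers per gather step in this session.
    SET max_parallel_workers_per_gather = 4;

    -- The plan shows whether the scan and aggregate actually ran in parallel.
    EXPLAIN ANALYZE SELECT count(*) FROM events WHERE event_date >= '2024-01-01';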

Conclusion

Though databases come in a lot of different structures and formats for many different use cases, the fundamental principles remain the same. Some physical limitations will apply in any case--big data will always take longer to process than little data. But by being intentional about the end goals for your data, you can plan ahead and ensure that you're making use of all of the different possible optimizations.

Get started now

Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!

$ curl https://glaredb.com/install.sh | sh
Try GlareDB Cloud