April 18, 2024

4 Data Formats & 1 Truth

Sam Kleinman, Engineering Lead

For analytics workloads, the storage sub-system is a huge contributor to performance. If we break down the time it takes to process a query, the computations and comparisons are (often!) much quicker than network access, decoding the storage format, and encoding data. As a result, understanding the details of your storage tools can have a huge impact on how well you get results from your analytics data.

When considering any kind of storage engine or system, I tend to think about the following properties:

  • read performance and capability,
  • storage and memory compactness,
  • complexity of updating existing data,
  • complexity of writing a single record, and
  • complexity of writing larger batches.

Format One: JSON (and CSV)

JSON is a great format for a few reasons: it closely reflects the way objects (e.g. maps and arrays) are structured in most contemporary usage, and the tools to interact with JSON are native and deeply integrated into most programming languages today. This makes it easy to use with different tools, and great for "interchange" between different applications and systems. JSON is also easy for humans to look at and understand: you don't need special software just to see what some data is, print() gets the job done.

When thinking about it as a format for data in a database, it has a few shortcomings: first, it doesn't have any mechanism to enforce a schema, and all of the schema data are embedded (redundantly!) in each record. Second, if you have an enclosed structure (e.g. a string, array, or object), you have to read character by character to find the end of that structure, and arrays and objects can be nested inside themselves or other arrays and objects. This means that if it's JSON, even if it's newline-delimited or follows another convention, you can't get around the performance constraints entailed by reading everything one character at a time.
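
To make that redundancy concrete, here's a minimal Python sketch (with a few made-up records) showing how newline-delimited JSON repeats every field name in every record and has to be re-parsed in full just to pull out one value:

    import json

    # Three records in newline-delimited JSON: the field names ("id", "region",
    # "total") are repeated in every single record, and a reader has to scan
    # each line character by character to find where strings, arrays, and
    # objects end.
    records = [
        {"id": 1, "region": "us-east", "total": 19.99},
        {"id": 2, "region": "eu-west", "total": 4.50},
        {"id": 3, "region": "us-east", "total": 102.75},
    ]

    ndjson = "\n".join(json.dumps(r) for r in records)
    print(ndjson)

    # Reading it back means re-parsing the full text of every record, even if
    # we only care about one field.
    totals = [json.loads(line)["total"] for line in ndjson.splitlines()]
    print(sum(totals))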

The same mostly holds true for CSV, though I'm inclined to think that JSON is a bit easier for humans to read. As a back end for a database, JSON (and CSV) are row-based.

Format Two: BSON (and Apache Avro)

BSON is a lot like JSON: it's also row-oriented, it has a flexible schema, and conceptually the layout of a document looks a lot like JSON. Tooling isn't as widespread for BSON as it is for JSON, but support is still pervasive.

The primary difference from JSON is that BSON encodes both type information and the length of variably sized values like objects (documents), arrays, strings, and others. This means that software reading BSON doesn't have to scan values character by character or infer their types in order to parse the data. Swanky!
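
Here's a small sketch of that using the bson module that ships with the pymongo package (the document contents are made up); the point is that the length prefix and type tags are right there in the encoded bytes:

    import struct

    import bson  # provided by the pymongo package; pip install pymongo

    doc = {"id": 1, "region": "us-east", "total": 19.99}
    data = bson.encode(doc)

    # The first four bytes of a BSON document are its total length
    # (little-endian int32), so a reader can skip over a whole document
    # without parsing it.
    (doc_length,) = struct.unpack_from("<i", data, 0)
    print(doc_length, len(data))  # the two values match

    # The byte after the length is the type tag of the first element
    # (0x10 = int32, 0x02 = string, 0x01 = double, ...).
    print(hex(data[4]))

    print(bson.decode(data))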

There are downsides, though: like JSON, the structure of records is embedded in the record itself, and so field names and type data are repeated in every record. And BSON is less human-readable than JSON.

Avro solves some of those problems: the structure of the data is declared once in a schema, stored apart from the records (in the file's header or in a standalone schema file), and includes type information, field names, and order. This reduces duplicated data, making the format more compact, and in some situations it becomes possible to skip reading portions of a file entirely. This all comes at the cost of needing to manage the schema alongside the data and lacking any kind of schema flexibility. There are always trade-offs.
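
As a rough illustration, here's what declaring the schema once looks like with the third-party fastavro library (the schema and records are made up for the example):

    import io

    import fastavro  # pip install fastavro

    # The schema is declared once, up front: field names, types, and order
    # live here rather than being repeated inside every record.
    schema = fastavro.parse_schema({
        "type": "record",
        "name": "Order",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "region", "type": "string"},
            {"name": "total", "type": "double"},
        ],
    })

    records = [
        {"id": 1, "region": "us-east", "total": 19.99},
        {"id": 2, "region": "eu-west", "total": 4.50},
    ]

    buf = io.BytesIO()
    fastavro.writer(buf, schema, records)

    buf.seek(0)
    for record in fastavro.reader(buf):
        print(record)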

Row-based formats of all kinds are great for writing records sequentially, for insert-heavy workloads, and for operations that return entire single records or update specific values. The arrangement of data in record formats is poorly suited to workloads where you're interacting with only a few columns/fields of each record or computing aggregates across records: those operations require more I/O, and more random I/O, which tends to be inefficient.

Format Three: Parquet (and Lance)

Parquet is a format with two major "innovations":

  • data are stored in columns: arrays of values of a single (typed) field. The arrays are batched into groups, so you'll see each column for the first 1000 (or so, as configured) records in sequence before the next group begins. This makes it easy to compress each column individually--which is typically more effective--and to read only the columns that you need. Groups also enable writing batches of records while minimizing the amount of data you need to hold in memory while writing.

  • structural information: column names, types, and the locations of the starts of columns and column groups within the file are stored in a special metadata block at the end of the file. This means that the metadata lives in the file and can't be misplaced, and also that you can write the data in groups without needing to know the length of any group or column before you begin writing the data.

These attributes make Parquet surprisingly flexible. You get decent streaming writes as well as columnar organization and compression, which can reduce storage requirements and may help limit I/O demands for many read operations. Also, since the data are already arranged in columns, they arrive in a shape that's ready to be processed.
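
Here's a minimal sketch with pyarrow (the file name and table contents are made up) showing column groups, reading only the columns you need, and the footer metadata:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A small made-up table; each column is stored as its own typed array.
    table = pa.table({
        "id": [1, 2, 3, 4],
        "region": ["us-east", "eu-west", "us-east", "ap-south"],
        "total": [19.99, 4.50, 102.75, 7.25],
    })

    # row_group_size controls how many records go into each column group.
    pq.write_table(table, "orders.parquet", row_group_size=2)

    # Read back only the column we need; the others are never decoded.
    totals = pq.read_table("orders.parquet", columns=["total"])
    print(totals["total"].to_pylist())

    # The footer metadata describes the schema, row groups, and column offsets.
    meta = pq.read_metadata("orders.parquet")
    print(meta.num_row_groups, meta.num_columns)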

The downside of Parquet, and of columnar formats in general, is that records are stored in a single, fixed order. This means that any operation that doesn't filter or order by that column will end up reading the entire file, though the columnar format makes it possible to skip unused columns and build in-memory structures more effectively.

Lance is fundamentally very much like Parquet: a columnar serialization format, split internally into groups of columns for some number of records, with a metadata block appended to the end. But unlike Parquet, it includes indexes. (Indexes are references to the locations of values in columns other than the primary ordering column, which support queries that involve subsets of those columns.)

The indexes make Lance really well-suited for the kinds of "vector/nearest neighbor" searches that are prevalent in ML/AI workloads, and other workloads with more random access patterns. Lance is an emerging standard with only one primary implementation--and indexes do take up space--but it is often a good tradeoff.
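Here's a rough sketch using the pylance bindings with a tiny random dataset (the dataset path, dimensions, and index parameters are placeholders, not recommendations):

    import lance  # pip install pylance
    import numpy as np
    import pyarrow as pa

    # A tiny made-up dataset with an 8-dimensional embedding per row.
    dim = 8
    vectors = np.random.rand(100, dim).astype(np.float32)
    table = pa.table({
        "id": pa.array(range(100)),
        "vector": pa.array(vectors.tolist(), type=pa.list_(pa.float32(), dim)),
    })

    ds = lance.write_dataset(table, "orders.lance")

    # On a real dataset you'd build a vector index first, e.g.:
    #   ds.create_index("vector", index_type="IVF_PQ", ...)
    # For a toy dataset a brute-force nearest-neighbor scan is fine.
    query = np.random.rand(dim).astype(np.float32)
    hits = ds.to_table(nearest={"column": "vector", "q": query, "k": 5})
    print(hits["id"].to_pylist())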

"Format" Four: Delta Lake (and Iceberg and Lance)

Delta Lake isn't really a format, but rather a storage protocol. Built on top of Parquet files (as an implementation detail), Delta makes it possible to insert, update, and delete records or batches of data in a Delta table safely in an environment where there can be multiple concurrent readers, using a tried-and-true storage-engine mechanism: MVCC, or Multi-Version Concurrency Control. The storage engines of most operational databases (PostgreSQL, MongoDB, MySQL) use MVCC systems.

Storage protocols that support MVCC, which include Iceberg and Lance as well as Delta Lake, make it possible for processes to write data out in the course of an operation and only make that data visible to clients when the entire write is complete. This prevents the exposure of partial or uncommitted data to clients while minimizing the amount of time required for exclusive access. It also means that writing doesn't modify existing data, and that previous versions of the data remain available for rollback or historical views: that's the multi-version part.

The downside of MVCC engines is that saving multiple versions of the data takes up space, particularly if operations fail, or there are frequent modifications to existing data, so these engines provide a vacuum() operation to remove unreferenced versions from storage.
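
Here's a small sketch using the deltalake Python package (the paths and data are made up) showing versioned writes, reading an older version, and vacuuming:

    import pyarrow as pa
    from deltalake import DeltaTable, write_deltalake  # pip install deltalake

    batch = pa.table({"id": [1, 2], "total": [19.99, 4.50]})

    # Each committed write produces a new table version; readers only ever
    # see fully committed versions, never a half-finished write.
    write_deltalake("orders_delta", batch)                 # creates version 0
    write_deltalake("orders_delta", batch, mode="append")  # creates version 1

    dt = DeltaTable("orders_delta")
    print(dt.version())

    # The "multi-version" part: older versions stay readable until vacuumed.
    old = DeltaTable("orders_delta", version=0)
    print(old.to_pyarrow_table().num_rows)  # only the rows from the first write

    # vacuum() lists (or removes) data files no longer referenced by any
    # retained version; dry_run=True just reports what would be deleted.
    print(dt.vacuum(retention_hours=0, dry_run=True,
                    enforce_retention_duration=False))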

For data that is truly never modified, writing Parquet files directly may have slightly less overhead, and could be easier to reuse in different applications without support for the storage protocol, but should otherwise be roughly equivalent.

One Truth

This post has a structure that builds from "slow" formats to "fast" formats with larger and larger feature sets. You might be inclined to conclude that Delta Lake and Lance are "the best" and "the most advanced," and while this is true in some respects, it depends on your needs.

If your workload has a lot of insert or append operations with very small numbers of input documents, or your queries always need to access entire rows, record-based formats like BSON and JSON will serve you best. While BSON will typically be faster than JSON for most workloads, if you're accessing data from multiple applications and have existing tooling, JSON or CSV will likely be the better bet. If you're writing larger batches of data for archival purposes, writing Parquet without Delta Lake may be the better choice.

Understanding the shape and size of your data, the frequency and shape of your queries, and the requirements of the application or applications orchestrating that workload, will often make the decision clear.

Typically technologies just are, and we often go farther when we think about better and worse, faster and slower, as an aspect of the fit of the solution and problem, rather than a fundamental attribute of the technology itself.

Get started now

Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!

$ curl https://glaredb.com/install.sh | sh
Try GlareDB Cloud