Announcing BSON Support in GlareDB
In version 0.8, GlareDB added support for reading and writing BSON files, either locally in embedded operation or directly in the cloud. I wanted to take some time to explore this feature and its uses a bit more.
First, some context: BSON is MongoDB's native encoding format. Structurally it's like JSON, except it has a richer type system and some nice internal allowances (unsurprising, given that MongoDB uses it internally both for storage and for messages) that make it good for use in databases. While BSON is fundamentally row or "record" focused and GlareDB is vector or "column" focused, GlareDB aims to support many different formats and workloads. And because GlareDB supports MongoDB data sources, we already had all of the pieces in place.
Also, since I have a certain amount of familiarity with BSON and I'm interested in making GlareDB better at reading different kinds of data, this seemed like the right place to start.
The things that make BSON good for MongoDB make it good for use in GlareDB, particularly compared to JSON or CSV for row-structured data. In addition to the extra type fidelity, because all BSON documents are length-prefixed (internally and externally), a parser can skip from document to document without reading through them byte-by-byte. While BSON isn't as ubiquitous as JSON or CSV, language support for BSON is broad and the tooling is mature, so this kind of data is easy to work with even outside of tools like GlareDB and MongoDB that support it natively.
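To make the length-prefix point concrete, here's a minimal sketch in Python (stdlib only, with hand-crafted document bytes for the example) that counts the documents in a BSON byte stream by hopping from one 4-byte little-endian length prefix to the next, never decoding the contents:

```python
import struct

# Hand-crafted minimal BSON documents:
# an empty document is 5 bytes: int32 length (5) plus a trailing NUL.
empty_doc = b"\x05\x00\x00\x00\x00"
# {"a": 1} as BSON: length 12, element type 0x10 (int32), name "a", value 1.
int_doc = b"\x0c\x00\x00\x00" + b"\x10" + b"a\x00" + b"\x01\x00\x00\x00" + b"\x00"

stream = empty_doc + int_doc + empty_doc  # three documents back to back


def count_documents(buf: bytes) -> int:
    """Count BSON documents by hopping from length prefix to length prefix."""
    count, offset = 0, 0
    while offset < len(buf):
        (doc_len,) = struct.unpack_from("<i", buf, offset)  # little-endian int32
        offset += doc_len  # skip the whole document without parsing it
        count += 1
    return count


print(count_documents(stream))  # → 3
```

This is the property that lets a reader seek past documents it doesn't care about, which JSON and CSV parsers can't do without scanning every byte.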
Using BSON in GlareDB
From your perspective as a user, BSON support should work just like any other data source or format. Here are some examples:
SELECT * FROM read_bson('./path/to/file*.bson');
While this works on local file paths, you can use any of the cloud object-storage services that GlareDB supports: S3, GCS, Azure, and compatible services. You can also specify HTTP URLs, and that works just fine.
Also notice that there's a * glob pattern here: globbing fully works, selecting all matching files and treating them as a single table. This won't work with HTTP URLs, but does work on other locations. GlareDB sorts the files lexicographically by name when reading them, which is important for how it infers the schema.
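That lexicographic ordering is worth keeping in mind when naming files, since it compares strings character by character rather than numerically. A quick illustration:

```python
# Lexicographic (string) ordering, as used when reading globbed files:
files = ["user.10.bson", "user.2.bson", "user.1.bson"]
print(sorted(files))
# → ['user.1.bson', 'user.10.bson', 'user.2.bson']

# Zero-padding the numbers keeps the order intuitive:
padded = ["user.010.bson", "user.002.bson", "user.001.bson"]
print(sorted(padded))
# → ['user.001.bson', 'user.002.bson', 'user.010.bson']
```

Note that user.10 sorts before user.2, so zero-padded names keep the reading order predictable.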
You can also add external tables to GlareDB with BSON files:
CREATE EXTERNAL TABLE bson_beatles
    FROM local
    OPTIONS (location = './beatles.*.bson');
The support for file name globbing is particularly useful here, because you can create the data source, and then add or remove files that match the pattern without needing to reconfigure GlareDB.
The options other than location are, well, optional. GlareDB will infer the file type based on extension, so if your objects end in .bson you're set, but you can set file_type explicitly. The schema_sample_size option, which defaults to 100, controls how many documents are parsed to determine the schema of the GlareDB table. These options are available in the read_bson function as well.
You can also write BSON files with GlareDB using the COPY TO operation. This can be the contents of a table, if you want to export or migrate data, or the results of a query, if you want to materialize a view or create a report. Like the read operations, COPY TO will write data to local files as well as S3, GCS, Azure, and compatible services.
To write the contents of a table to a local file:
COPY users TO './user.001.bson';
That's it. GlareDB deduces the format from the file name. If you want to be explicit or won't be using the .bson extension, you can use the following:

COPY users TO './user.002.data' FORMAT bson;
You can use COPY TO to filter or transform the data before writing it:

COPY (SELECT id AS user,
             count(*) AS logins,
             sum(duration) AS total_time
      FROM sessions  -- hypothetical source table
      GROUP BY id)
  TO './user.003.data' FORMAT bson;
Just as with other formats, COPY TO supports writing data directly to cloud storage like S3, GCS, Azure, and compatible platforms (like R2 and MinIO!). See the documentation for details.
Internally, BSON tables are implemented as StreamingTables in DataFusion, which means that data is streamed off of storage and through the query engine directly: depending on the query, this can reduce the latency for receiving the first batch of results, as well as memory use in embedded cases. For tables or functions that use globbed files, each file becomes its own partition, which means the data can be read and processed in parallel.
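The time-to-first-batch benefit of streaming can be sketched with a plain Python generator (a toy model, not GlareDB's actual implementation): the first batch is available before the rest of the input is ever materialized.

```python
def scan_batches(num_batches, batch_size=3):
    """Toy stand-in for a streaming table: yields record batches lazily."""
    for b in range(num_batches):
        # In a real engine, this would read the next chunk off of storage.
        yield list(range(b * batch_size, (b + 1) * batch_size))


stream = scan_batches(1_000_000)
first = next(stream)  # returns immediately; later batches are never produced
print(first)  # → [0, 1, 2]
```

A fully materializing reader would have to produce all one million batches before returning anything; the generator hands back the first one and stops.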
Because GlareDB tables have static schemas and BSON files are a collection of documents with arbitrary schemas, when creating tables from BSON data GlareDB infers the schema of the table by taking the superset of the fields in the first 100 documents; the schema_sample_size option allows you to control this. Fields that don't appear in these documents are ignored, and fields that are in the schema but missing in some documents have null values.
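As a rough sketch of that inference rule (using plain Python dicts as stand-ins for decoded BSON documents, and simplifying away type unification):

```python
def infer_schema(docs, sample_size=100):
    """Union of field names across the first `sample_size` documents,
    in first-seen order (a simplification of the inference described above)."""
    schema = []
    for doc in docs[:sample_size]:
        for key in doc:
            if key not in schema:
                schema.append(key)
    return schema


def project(doc, schema):
    """Fields missing from a document become None (NULL);
    fields outside the schema are dropped."""
    return {key: doc.get(key) for key in schema}


docs = [{"a": 1}, {"a": 2, "b": "x"}, {"c": True}]
schema = infer_schema(docs, sample_size=2)  # only the first two docs are sampled
print(schema)                    # → ['a', 'b']
print(project(docs[2], schema))  # → {'a': None, 'b': None}
```

The third document's field c never makes it into the schema because it falls outside the sample, which is exactly why the sample size (and the file ordering that feeds it) matters.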
There are improvements that we'd like to make to the implementation in coming releases:

- Pushing down projections: data from BSON files is always read into GlareDB structures before processing and filtering. Because of the streaming properties of the implementation, the penalty is probably minimal; however, BSON really shines for these kinds of on-the-fly projections. If your query only needs a few fields from a BSON table, GlareDB will currently process and decode more data than it needs to.
- With the forthcoming distributed execution work, streaming tables, particularly in cases with several underlying files, become pretty exciting, as do the opportunities to improve query performance with parallelism.
- While there are ways that we could improve schema inference, both operationally and by adding more configuration options, we are hoping to add explicit schema declaration so that you can skip schema inference entirely and have more control (if you like). In some ways this is a special case of projection push-down, and goes hand-in-hand with that effort.
- There are a bunch of interesting things that we could do to make handling and processing document structures easier, particularly for embedded documents and arrays. I'm interested in seeing how support for other document manipulation languages (like KDL Query, or jaq for JSON) could work inside of GlareDB, and whether any of these approaches could be applied to BSON: doing row-level manipulation on BSON data before it gets into GlareDB's internal formats (say), or maybe giving GlareDB the ability to handle BSON data in fields of conventional tables.
As you may have noticed from the rest of this post, most aspects of BSON handling work like other file formats in GlareDB, so improvements to BSON handling will also improve the MongoDB data source and JSON handling. I'm also excited to see what kinds of things people build using GlareDB and BSON! Please let us know what you're working on and if there's any way to help.
Epilogue: Some Notes on BSON Files
I've mentioned "BSON files" here as if they're a common thing, and perhaps they aren't. MongoDB's backup/dump tool mongodump produces BSON files, but you don't need MongoDB or mongodump to get a BSON file. These files are just BSON documents written to a file, one after another. Since BSON is length-prefixed, you don't need to do anything special. Consider the following code snippet from one of our tests:
import random
import bson  # the bson module shipped with pymongo

beatles = ["john", "paul", "george", "ringo"]

data_path = "beatles.100.bson"
with open(data_path, "wb") as f:
    for i in range(100):
        beatle_id = random.randrange(0, len(beatles))
        f.write(bson.encode({
            "_id": bson.ObjectId(),
            "beatle_idx": beatle_id + 1,
            "beatle": beatles[beatle_id],  # the selected name
        }))
I put an _id field with an ObjectId like MongoDB would, but you don't have to. That's it. Any programming language that has a BSON library and lets you write binary data to a file can do this.
I hope you enjoy!