Announcing BSON Support in GlareDB
In version 0.8, GlareDB added support for reading and writing BSON files, either locally in embedded operation or directly in the cloud. I wanted to take some time to explore this feature and its uses a bit more.
First, some context: BSON is MongoDB's native encoding format. Structurally it's like JSON, except it has a richer type system and some nice internal allowances (unsurprising, given that MongoDB uses it internally both for storage and for messages) that make it good for use in databases. While BSON is fundamentally row or "record" focused and GlareDB is vector or "column" focused, GlareDB aims to support many different formats and workloads. And because GlareDB supports MongoDB data sources, we already had all of the pieces in place.
Also, since I have a certain amount of familiarity with BSON and I'm interested in making GlareDB better at reading different kinds of data, this seemed like the right place to start.
The things that make BSON good for MongoDB make it good for use in GlareDB, particularly compared to JSON or CSV for row-structured data. In addition to the extra type fidelity, because all BSON documents are length-prefixed (internally and externally), a parser can skip from document to document without reading through them byte-by-byte. While BSON isn't as ubiquitous as JSON or CSV, language support for BSON is broad and the tooling is mature, so this kind of data is easy to work with even outside of tools like GlareDB and MongoDB that support it natively.
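To make the length-prefix point concrete, here's a minimal sketch in Python (stdlib only, with hand-crafted document bytes for the example) that counts the documents in a BSON byte stream by hopping from one 4-byte little-endian length prefix to the next, never decoding the contents:

```python
import struct

# Hand-crafted minimal BSON documents:
# an empty document is 5 bytes: int32 length (5) plus a trailing NUL.
empty_doc = b"\x05\x00\x00\x00\x00"
# {"a": 1} as BSON: length 12, element type 0x10 (int32), name "a", value 1.
int_doc = b"\x0c\x00\x00\x00" + b"\x10" + b"a\x00" + b"\x01\x00\x00\x00" + b"\x00"

stream = empty_doc + int_doc + empty_doc  # three documents back to back


def count_documents(buf: bytes) -> int:
    """Count BSON documents by hopping from length prefix to length prefix."""
    count, offset = 0, 0
    while offset < len(buf):
        (doc_len,) = struct.unpack_from("<i", buf, offset)  # little-endian int32
        offset += doc_len  # skip the whole document without parsing it
        count += 1
    return count


print(count_documents(stream))  # → 3
```

This is the property that lets a reader seek past documents it doesn't care about, which JSON and CSV parsers can't do without scanning every byte.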
Using BSON in GlareDB
From your perspective as a user, BSON support should work just like any other data source or format. Here are some examples:
SELECT * FROM read_bson('./path/to/file*.bson');
While this works on local file paths, you can use any of the cloud object-storage services that GlareDB supports: S3, GCS, Azure, and compatible services. You can also specify HTTP URLs, and that works just fine.
Also notice that there's a * glob pattern here: globbing fully works, selecting all matching files and treating them as a single table. This won't work with HTTP URLs, but does work on other locations. GlareDB sorts the files lexicographically by name when reading them, which is important for how it infers the schema.
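That lexicographic ordering is worth keeping in mind when naming files, since it compares strings character by character rather than numerically. A quick illustration:

```python
# Lexicographic (string) ordering, as used when reading globbed files:
files = ["user.10.bson", "user.2.bson", "user.1.bson"]
print(sorted(files))
# → ['user.1.bson', 'user.10.bson', 'user.2.bson']

# Zero-padding the numbers keeps the order intuitive:
padded = ["user.010.bson", "user.002.bson", "user.001.bson"]
print(sorted(padded))
# → ['user.001.bson', 'user.002.bson', 'user.010.bson']
```

Note that user.10 sorts before user.2, so zero-padded names keep the reading order predictable.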
You can also add external tables to GlareDB with BSON files:
CREATE EXTERNAL TABLE bson_beatles
    FROM local
    OPTIONS (location = './beatles.*.bson');
The support for file name globbing is particularly useful here, because you can create the data source, and then add or remove files that match the pattern without needing to reconfigure GlareDB.
The options other than location are, well, optional. GlareDB will infer the file type based on extension, so if your objects end in .bson you're set, but you can set file_type explicitly. The schema_sample_size option, which defaults to 100, controls how many documents are parsed to determine the schema of the GlareDB table. These options are available in the read_bson function as well.
You can also write BSON files with GlareDB using the COPY TO operation. This can be the contents of a table, if you want to export or migrate data, or the results of a query, if you want to materialize a view or create a report. Like the read operations, COPY TO will write data to local files as well as S3, GCS, Azure, and compatible services.
To write the contents of a table to a local file:
COPY users TO './user.001.bson';
That's it. GlareDB deduces the format from the file name. If you want to be explicit or won't be using the .bson extension, you can use the following:

COPY users TO './user.002.data' FORMAT bson;
You can use COPY TO to filter or transform the data before writing it:

COPY (SELECT id AS user,
             count(*) AS logins,
             sum(duration) AS total_time
      FROM sessions  -- hypothetical source table
      GROUP BY id)
  TO './user.003.data' FORMAT bson;
Just as with other formats, COPY TO supports writing data directly to cloud storage like S3, GCS, Azure, and compatible platforms (like R2 and MinIO!). See the documentation for details.
Internally, BSON tables are implemented as StreamingTables in DataFusion, which means that data is streamed off of storage and through the query engine directly: depending on the query, this can reduce the latency for receiving the first batch of results, as well as memory use in embedded cases. For tables or functions that use globbed files, each file becomes its own partition, which means the data can be read and processed in parallel.
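The time-to-first-batch benefit of streaming can be sketched with a plain Python generator (a toy model, not GlareDB's actual implementation): the first batch is available before the rest of the input is ever materialized.

```python
def scan_batches(num_batches, batch_size=3):
    """Toy stand-in for a streaming table: yields record batches lazily."""
    for b in range(num_batches):
        # In a real engine, this would read the next chunk off of storage.
        yield list(range(b * batch_size, (b + 1) * batch_size))


stream = scan_batches(1_000_000)
first = next(stream)  # returns immediately; later batches are never produced
print(first)  # → [0, 1, 2]
```

A fully materializing reader would have to produce all one million batches before returning anything; the generator hands back the first one and stops.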
Because GlareDB tables have static schemas and BSON files are a collection of documents with arbitrary schemas, when creating tables from BSON data GlareDB infers the schema of the table by taking the superset of the fields in the first 100 documents; the schema_sample_size option allows you to control this. Fields that don't appear in these documents are ignored, and fields that are in the schema but missing in some documents have null values.
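As a rough sketch of that inference rule (using plain Python dicts as stand-ins for decoded BSON documents, and simplifying away type unification):

```python
def infer_schema(docs, sample_size=100):
    """Union of field names across the first `sample_size` documents,
    in first-seen order (a simplification of the inference described above)."""
    schema = []
    for doc in docs[:sample_size]:
        for key in doc:
            if key not in schema:
                schema.append(key)
    return schema


def project(doc, schema):
    """Fields missing from a document become None (NULL);
    fields outside the schema are dropped."""
    return {key: doc.get(key) for key in schema}


docs = [{"a": 1}, {"a": 2, "b": "x"}, {"c": True}]
schema = infer_schema(docs, sample_size=2)  # only the first two docs are sampled
print(schema)                    # → ['a', 'b']
print(project(docs[2], schema))  # → {'a': None, 'b': None}
```

The third document's field c never makes it into the schema because it falls outside the sample, which is exactly why the sample size (and the file ordering that feeds it) matters.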
There are improvements that we'd like to make to the implementation in coming releases:

- Pushing down projections: data from BSON files is always read into GlareDB structures before processing and filtering. Because of the streaming properties of the implementation, the penalty is probably minimal; however, BSON really shines for these kinds of on-the-fly projections. If your query only needs a few fields from a BSON table, GlareDB will currently process and decode more data than it needs to.
- With the forthcoming distributed execution work, streaming tables, particularly in cases with several underlying files, become pretty exciting, as do the opportunities to improve query performance with parallelism.
- While there are ways that we could improve schema inference, both operationally and by adding more configuration options, we are hoping to add explicit schema declaration so that you can skip schema inference entirely and have more control (if you like). In some ways this is a special case of projection push-down, and goes hand-in-hand with that effort.
- There are a bunch of interesting things that we could do to make handling and processing document structures easier, particularly for embedded documents and arrays. I'm interested in seeing how support for other document manipulation languages (like KDL Query, or jaq for JSON) could work inside of GlareDB, and whether any of these approaches could be applied to BSON: doing row-level manipulation on BSON data before it gets into GlareDB's internal formats (say), or maybe giving GlareDB the ability to handle BSON data in fields of conventional tables.
As you may have noticed from the rest of this post, most aspects of BSON handling work like other file formats in GlareDB, so improvements to BSON handling will also improve the MongoDB data source and JSON handling. I'm also excited to see what kinds of things people build using GlareDB and BSON! Please let us know what you're working on and if there's any way to help.
Epilogue: Some Notes on BSON Files
I've mentioned "BSON files" here as if they're a common thing, and perhaps they aren't. MongoDB's backup/dump tool mongodump produces BSON files, but you don't need MongoDB or mongodump to get a BSON file. These files are just BSON documents written to a file, one after another. Since BSON is length-prefixed, you don't need to do anything special. Consider the following code snippet from one of our tests:
import random
import bson  # the bson module shipped with pymongo

beatles = ["john", "paul", "george", "ringo"]

data_path = "beatles.100.bson"
with open(data_path, "wb") as f:
    for i in range(100):
        beatle_id = random.randrange(0, len(beatles))
        f.write(bson.encode({
            "_id": bson.ObjectId(),
            "beatle_idx": beatle_id + 1,
            "beatle": beatles[beatle_id],  # the selected name
        }))
I put an _id field with an ObjectId like MongoDB would, but you don't have to. That's it. Any programming language that has a BSON library and lets you write binary data to a file can do this.
I hope you enjoy!