Explaining GlareDB to My Friends: All Data Are SQL Addressable
tl;dr> Fairly often I find myself explaining what I do and what GlareDB is. I realized that my descriptions of GlareDB were all a bit different, and I thought it might be interesting to share them as a way of shining some light on what we're building.
"GlareDB imagines that all data are SQL addressable," I say.
"Oh neat," they say, expecting more.
"Imagine you can take Excel files and CSV files and data from Salesforce, and your PostgreSQL, and all your internal databases, and internal APIs, and join them all together as if they were all a single SQL database," I say.
"Huh, like everything," they say, soaking it all in a bit, and then "Wait, wait, really?" they ask. I nod. "That's pretty cool..." and we're off.
It takes a while for everything to settle in, and for GlareDB to click. That's because introducing GlareDB in this way obscures the most important implications of "all data are SQL addressable."
All Data are SQL Addressable
First, being SQL addressable means that the data are SQL addressable where they already are, even if that's not currently in a SQL database; no new infrastructure required. The "modern data stack" (Hadoop, Spark, Redshift, BigQuery, Snowflake, etc.) all grew out of the assumption that analytics tools should "own" the storage of all data, and therefore control the workload, at great expense and complexity for users. The previous generation of data processing tools (the "pre-modern data stack," which is mainframes, I suppose) required an even larger investment in infrastructure.
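To sketch what "addressable where it is" looks like in practice: a single query can join a local CSV file against a live PostgreSQL table, with no loading step in between. (The table functions below come from GlareDB's SQL interface; treat the exact names, signatures, and the connection string as assumptions to check against the docs.)

```sql
-- Join a CSV file on disk with a table in a running Postgres instance.
-- Neither data source is copied into a warehouse first.
SELECT o.order_id, o.total, c.name
FROM read_csv('./orders.csv') AS o             -- local file, read in place
JOIN read_postgres(
       'postgresql://user:pass@host:5432/db',  -- hypothetical connection string
       'public',
       'customers'
     ) AS c
  ON o.customer_id = c.id
WHERE o.total > 100;
```

The point isn't the syntax; it's that neither source had to be migrated, replicated, or "owned" by the analytics system before the join could happen.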
Second, if everything is accessible in a single system with common semantics, GlareDB radically unwinds our existing assumptions about the way we interact with databases. For the last 50 years, databases have run as big monolithic servers to which we connect over the (local) network, with an RPC system for sending queries and returning result sets. The exceptions are some (notable) embedded systems, which often mimic their larger client-server counterparts.
Because GlareDB supports SQL and imagines that all data are addressable with SQL, it's possible to build tools that wrap SQL queries and result tuples in interfaces that aren't SQL, which enables different models for interacting with data. For example, we have a project under consideration that would allow GlareDB to take Polars code, reduce it to an abstract representation, and then use that representation directly in GlareDB: suddenly, no SQL. PRQL, an alternate language for expressing queries as pipelines, is embedded in GlareDB, so you can write operations without any SQL. SQL is really important, for sure, but I also think SQL can sometimes be rather user-hostile: I don't think any one thing will replace SQL, nor do I think that SQL will stop being important, but GlareDB provides the freedom to explore different interfaces and patterns at very low computational and conceptual cost.
While GlareDB can behave like a traditional client-server database, and integrates with all of your existing tooling via the PostgreSQL protocol, it doesn't have to. The Node.js bindings and Python bindings embed the entire database in your application. This has some pretty obvious (and nifty) implications for having an embedded analytics database in your applications on the edge, in deployed software, etc. With hybrid and remote execution, it also means that the local database can act as a query cache: local files become first-class data objects alongside your remote storage, and results of previous queries cached in the local instance can be joined or referenced in new queries. The GlareDB bindings are a client library, but they are also an entire independent database that can run in your application or notebook, and with a single one-line change those same applications can run in GlareDB Cloud.
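Here's a minimal sketch of that embedded-to-cloud story, assuming the `glaredb` Python package's `connect()` and `sql()` API; the file path and the Cloud connection string are hypothetical placeholders, so check the bindings documentation for the exact shapes.

```python
import glaredb

# Embedded: the whole database runs inside this process. No server,
# no network hop, nothing to deploy.
con = glaredb.connect()

# Local files are first-class data objects; query them in place.
df = con.sql("SELECT count(*) AS n FROM './events.parquet'").to_pandas()

# The "one-line change": point connect() at GlareDB Cloud instead, and
# the same queries run with hybrid/remote execution.
# (Connection string below is a made-up example.)
# con = glaredb.connect("glaredb://user:password@org.glaredb.com/deployment")
```

The design choice worth noticing: the client library and the database are the same artifact, so "local" and "cloud" differ only in what you hand to `connect()`.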
In the last year, I've come to appreciate the way that pulling apart a problem -- storage from compute; authentication from protocol framing; state machines from consensus -- can radically reduce the total complexity of both halves of the problem.
Pulling apart concepts and reorganizing categories is a big part of what makes GlareDB powerful -- storage and compute; operational and analytic data; distributed and local processing; systems of record and archival systems -- but what makes GlareDB special is that we can put things back together, too. Can we improve the ergonomics of interacting with databases? (yes, but that one's too easy!) Can the data stack provide value without being custodial? (yes!) Can we imagine the database as an engine for interacting with data as a service on the network and within the application? (I think so!) Can we make the experience of interacting with a high-performance, globally scalable workflow engine and distributed database feel the same as working with an embedded database? (stay tuned!)