product, engineering
January 25, 2024

Put That Data Lake in a Box!

Sam Kleinman, Engineering Lead

tl;dr: There's been a lot of hype around "big data" over the last 10-20 years. But data lakes can come in a lot of sizes, and users should have dynamic tools that can help them scale their data seamlessly. This post explores how GlareDB can provide a hassle-free setup without ETL or servers, making it a versatile solution for various data sizes and computing needs.

I started working on databases at the height of big data. A much younger version of me was so excited by that: finally (what did I know?) we could collect all the data and do things with it, and the times were great. Now, looking back, it's almost funny to think of what "big" meant then, what it means now, and how size isn't always the first thing we think about when we talk about our data tooling.

What we do with data has - at least to me - been the more interesting part.

Data lakes, despite the potentially silly name, are a great innovation, and they acknowledge the realities of most institutions' "data landscapes": there are a lot of different data sources, you want to be able to access them sometimes, but they're often not "hot" and therefore not worth the cost of maintaining a pipeline for many reporting workloads. While this might be disappointing for the warehouse or distributed workflow people, it's refreshingly pragmatic: let's build a cost-effective tool for working with the data we already have that addresses the reality of our application needs.

I can get behind that.

While data lakes are often massive in terms of data size, there's nothing in the paradigm that says they have to be: why can't we have a data pond, or a data tide pool? How we use the data is more important, and a lot of our "big data" problems, even back then, were not that big.

It's really interesting to be working on GlareDB, which scales up and down as needed. If you have a data lake or a data sea or even just a data puddle, you get something that does all the data lake things and runs the same in a Python library as it does in GlareDB Cloud.

GlareDB has always been a data lake (estuary, reservoir, pond, inlet, ...) in a box. It has the ability to connect to cloud data from local instances and use these data sources as if they were local files on your system. This means that data lakes aren't infrastructure you have to set up, but just a class of operation that you include in your application code.

Here's how we set it up. First, install the Python bindings (at the time of writing they're published to PyPI as glaredb):
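
pip install glaredb

Then connect and join data across your sources: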

# Join some data!

import glaredb
import pandas as pd

# Create a GlareDB connection
con = glaredb.connect()

# Create link to an existing Postgres instance
con.execute(
"""
    CREATE EXTERNAL DATABASE external_db
    FROM postgres
    OPTIONS (
        host = 'my.postgres.host',
        port = '5432',
        user = 'glaredb',
        password = 'password',
        database = 'glaredb_test'
    );
"""
)

# Identify the Google Sheet to read (placeholders; use your own sheet's key and name)
sheet_key = "your-sheet-key"
sheet_name = "your-sheet-name"

# Read data from the Google Sheet into a Pandas DataFrame
event_sales_df = pd.read_csv(
    f"https://docs.google.com/spreadsheet/ccc?key={sheet_key}&sheet={sheet_name}&output=csv"
)

# Join a Postgres table, local parquet file and the Google Sheet DataFrame(!!)
con.execute(
"""
    SELECT *
    FROM external_db.public.events AS event_postgres
    JOIN
    parquet_scan('./path/to/my/local/user_data/*.parquet') AS user_parquet
    ON event_postgres.id = user_parquet.event_id
    JOIN event_sales_df
    ON event_postgres.id = event_sales_df.event_id
"""
).show()

That's it. No ETL. No servers. On your laptop. In your notebooks. Even inside your existing Node.js or Python application code. Of course, there are limits: data still takes compute to process, and petabytes of data are still a lot of data. There's no way around that, but most of the time there are ways to avoid petabyte-scale problems entirely.

If you have a large workload and don't want to run the computation on your laptop because the data flood is starting to top the data levee (this metaphor just keeps on giving), or you're on a slow network, or your laptop gets too hot, you can use GlareDB Cloud with the same code: just add a connection string when you call glaredb.connect().
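
Here's a minimal sketch of that; the connection string below is a placeholder, and GlareDB Cloud gives you the real one for your deployment:

import glaredb

# Passing a connection string targets a Cloud deployment instead of an
# in-process, local database. The value below is a placeholder.
con = glaredb.connect("glaredb://user:password@<your-deployment>")

# Everything else works exactly like the local example above.
con.execute("SELECT 'hello from GlareDB Cloud'").show()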

Get started now

Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!

$ curl https://glaredb.com/install.sh | sh
Try GlareDB Cloud