product · how-to
February 8, 2024

Join Data Across Space and Time with GlareDB

Tal Gluck, Developer Advocate

The modern data stack can be a tangle of tools and data sources, with costs and effort that expand rapidly. It is unlikely that organizations (companies, teams, divisions) will ever settle on a single database or analytics engine to manage all of their data. But what if you want to join across the many databases and files your organization records, or perform exploratory analysis? You shouldn't need to set up an entire ETL pipeline just to get your data in the same place, and you shouldn't need to worry about keeping your data in sync, provisioning new environments to run your pipeline scheduler, or figuring out how to join the Google Sheet a field agent just sent you with your "proper" data in Snowflake or another data warehouse.

GlareDB is built to provide access to decentralized, distributed data without complexity, and we're excited to show you how it works. In this post, we're going to build a simple pipeline with GlareDB. By the end, you will:

  • Create a table in GlareDB from a directory of Parquet files of NYC real estate sales
  • Join your data with a table in an external Postgres database as a federated query, without extracting that data
  • Read a Google Sheet as a Pandas DataFrame and insert it into your database
  • Learn how to work with your GlareDB data in Pandas or Polars

For the purposes of this demo, we'll use the GlareDB Python bindings, but most of the steps above could also be accomplished with the GlareDB CLI or the Node.js bindings.

1. Sign up / Install

Head to https://console.glaredb.com/signup to sign up for a new Cloud account or sign in to an existing account.

To install the GlareDB Python library, you can run:

pip install glaredb

2. Connect to your GlareDB Cloud deployment from Python

Once you've signed up and signed in to Cloud, click the "Connect" button in the top right to open up a pop-up window with your connection string. If this is your first time, make sure to store your credentials safely for next time.

After copying your connection string, open a notebook or other Python environment, and run:

import glaredb
con = glaredb.connect("<YOUR CONNECTION STRING>")

By calling glaredb.connect() with your Cloud connection string, you'll be able to take advantage of hybrid execution, using compute from both your local machine and GlareDB Cloud!
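If you'd rather experiment without a Cloud account first, the Python bindings can also run entirely locally: calling connect() with no arguments starts an in-memory database. A minimal sketch (note that hybrid execution requires the Cloud connection string):

import glaredb

# With no connection string, GlareDB runs as a local, in-memory database.
local_con = glaredb.connect()
local_con.sql("SELECT 1 AS smoke_test").show()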

3. Query a local Parquet file

For this demo, we'll be using Parquet files containing data for New York City real estate transactions. There is a Parquet file for each month of the years 2018-2019. To get the data files, you can navigate to the directory in which you'll be following this tutorial, and run:

git clone git@github.com:GlareDB/tutorial_data.git
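To confirm the clone worked and see how the files are laid out (one file per month, partitioned into SALE_YEAR_* directories), you can list them with the standard library; a quick sketch:

from pathlib import Path

# Print every monthly Parquet file under the partitioned directories.
for path in sorted(Path("tutorial_data/nyc_sales").rglob("*.parquet")):
    print(path)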

Once you've downloaded the files, you can query them as if they were tables in your database. In my case, I've cloned the repo into the directory I'm running Python from, so I can run this code to select from all of the 2018 Parquet files and treat them as a single table:

con.sql(
    """
        SELECT * FROM 'tutorial_data/nyc_sales/SALE_YEAR_2018/**/*.parquet'
    """
).show()
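Because the glob pattern **/*.parquet matches every monthly file, all twelve files are scanned as a single table. To sanity-check that, you can count the rows across the whole year:

con.sql(
    """
        SELECT COUNT(*) AS total_2018_sales
        FROM 'tutorial_data/nyc_sales/SALE_YEAR_2018/**/*.parquet'
    """
).show()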

4. Create a GlareDB table from the Parquet file

Perhaps someone has sent me these Parquet files, and I want to save them in my GlareDB Cloud instance so that I can reference the records in the future, or collaborate on them with a team member. I can do this with a single line of code:

con.sql(
    """
        CREATE TABLE nyc_sales AS
        SELECT * FROM 'tutorial_data/nyc_sales/SALE_YEAR_2018/**/*.parquet'
    """)

If you head back to your Cloud instance in the browser and refresh, you'll see that with that single-line query, you've created a table and loaded the data where it can be referenced in the future! And you can query from your new table with:

con.sql("SELECT * FROM nyc_sales limit 10").show()

5. Connect to a lookup table in Postgres

Now I'm feeling pretty good about my progress. But if I do a bit of exploration and run SELECT DISTINCT "BOROUGH" FROM nyc_sales, I see that my data is normalized: the BOROUGH column contains only numeric IDs. I know that NYC is made up of five boroughs, but how can I tell which borough is which?
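Here's that exploration query in full:

# The BOROUGH column holds only numeric IDs, so we need a lookup table.
con.sql('SELECT DISTINCT "BOROUGH" FROM nyc_sales').show()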

I recall that the lookup table for boroughs is in my team's Postgres database. Luckily, I can join that table all the way over there with my brand-new nyc_sales table:

con.sql(
    """
        SELECT * FROM READ_POSTGRES(
        'postgresql://demo:demo@pg.demo.glaredb.com:5432/postgres',
        'public',
        'borough_lookup'
        )
    """
).show()

With this query, I can see the results of the lookup table in my team's Postgres instance. And with that, I can join my new nyc_sales table with the borough_lookup table:

con.sql(
    """
        WITH borough_lookup AS (
            SELECT * FROM READ_POSTGRES(
                'postgresql://demo:demo@pg.demo.glaredb.com:5432/postgres',
                'public',
                'borough_lookup'
            )
        )
        SELECT "ADDRESS",
            "SALE DATE",
            "SALE PRICE",
            borough_name AS "BOROUGH_NAME"
        FROM nyc_sales
        JOIN borough_lookup
            ON nyc_sales."BOROUGH" = borough_lookup.borough_id
    """
).show()
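If I expect my team to query the joined, human-readable version often, I could optionally persist it with the same CREATE TABLE ... AS pattern from step 4. A sketch (nyc_sales_named is just an illustrative name, not part of the tutorial data):

con.sql(
    """
        CREATE TABLE nyc_sales_named AS
        SELECT s.*, l.borough_name AS "BOROUGH_NAME"
        FROM nyc_sales s
        JOIN READ_POSTGRES(
            'postgresql://demo:demo@pg.demo.glaredb.com:5432/postgres',
            'public',
            'borough_lookup'
        ) l ON s."BOROUGH" = l.borough_id
    """
)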

6. Google Sheets and DataFrames, oh my!

Let's recap! We've queried some Parquet files, loaded them into our GlareDB instance, added a connection to an external Postgres instance, and joined data from a table in Postgres with our newly loaded data in GlareDB.

Now, I've gotten a call from our front office that the latest month of NYC sales data is ready. The catch? It's in a Google Sheet. Luckily, joining it with the other work I've already done won't be a problem.

GlareDB interoperates seamlessly with Pandas and Polars, so we can do things like:

import pandas as pd
# Load our Google Sheet into a Pandas DataFrame
sheet_key = "1f5epAPxP_Yd3g1TunEMdtianpVAhKS0RG6BKRDSLtrk"
sheet_name = "January"
url = f"https://docs.google.com/spreadsheet/ccc?key={sheet_key}&sheet={sheet_name}&output=csv"
gsheet_df = pd.read_csv(url)

# Select from our DataFrame!
con.sql(
    """
        SELECT * FROM gsheet_df WHERE "BOROUGH" = 1
    """
).to_pandas()

Did you notice how we can 1) select directly from our Pandas DataFrame object in the query, and 2) take the results of our query and send them right back to a Pandas DataFrame? GlareDB makes it easy to work with your data wherever it is, and in whatever form it is.
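The same round trip works with Polars. Assuming the polars package is installed, the result of any query converts with to_polars():

# Get the query results back as a Polars DataFrame instead of Pandas.
gsheet_pl = con.sql('SELECT * FROM gsheet_df WHERE "BOROUGH" = 1').to_polars()
print(gsheet_pl.head())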

7. Load our Google Sheet data into GlareDB

As a last step, let's take our latest month of data from Google Sheets and load it into the nyc_sales table in our GlareDB instance:

con.sql("INSERT INTO nyc_sales SELECT * FROM gsheet_df")

To test it, we can run the query below to confirm that we now have 2019 data in our table as well:

con.sql(
    """
        SELECT "SALE DATE" FROM nyc_sales ORDER BY "SALE DATE" DESC LIMIT 5
    """
).show()

Conclusion

And there we have it! With a few lines of code, we've joined data from Parquet files, a Postgres table, and a Google-Sheet-turned-Pandas-DataFrame, and written it to our database where anyone on our team can query it.

The really cool thing about working in Python is how much it unlocks if you want to use any other Python data tooling. We are working on first-class integrations with some of the more common data tools, but in the meantime, if you want to run data quality checks or do some data visualization, it's trivial to convert your GlareDB tables to Pandas or Polars DataFrames with to_pandas() or to_polars() and run all sorts of other operations on them. In the coming weeks, we'll explore how to use this to run data quality checks on your tables and build a simple dashboard backed by your data.
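For example, a quick data-quality sniff test on the table we just built takes only a couple of lines (using the SALE PRICE column from the tutorial data):

# Pull the table into Pandas and summarize sale prices as a quick
# sanity check before building anything fancier.
sales_df = con.sql("SELECT * FROM nyc_sales").to_pandas()
print(sales_df["SALE PRICE"].describe())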

What else would you like to see? Let us know! You can set up a chat or email us at hello@glaredb.com.

Get started now

Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!

$ curl https://glaredb.com/install.sh | sh