Join Snowflake with Postgres Data Using Data Federation
Given all of the options for data infrastructure today, it's common to have your data spread across multiple data sources. What do you do if you want to answer questions that rely on more than one of them? You can set up a bunch of infrastructure to move all of your data to the same place, but that's not always practical or convenient. Today, there is a simpler option: data federation, also known as query federation. Data federation lets you treat all of your data as though it were in one place, so you can write queries that join data living in different locations and in different formats. In this example, we'll show you how you can quickly get started writing queries to join data between Snowflake and Postgres. You can follow along below, or check out the video walkthrough.
The Scenario
Let's say we have a table of New York City real estate sales data stored in Snowflake. This table contains various fields with information about property locations and transactions. One of the fields is the borough ID where each property is located.
However, the lookup table that maps borough IDs to borough names is stored in PostgreSQL. How can we efficiently combine this information? GlareDB makes it simple.
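For reference, that lookup table is tiny; its shape looks something like this (the column names come from the query we'll write later in this post, and the codes are the standard NYC borough IDs):

```
borough_id | borough_name
-----------+---------------
         1 | Manhattan
         2 | Bronx
         3 | Brooklyn
         4 | Queens
         5 | Staten Island
```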
Setting Up Your Environment
For this demonstration, we'll use a Jupyter Notebook, but you can also use the GlareDB Cloud workspace, CLI, or Node.js bindings. Here's how to get started:
- Install the GlareDB Python bindings:

  ```sh
  pip install glaredb
  ```

- Import GlareDB in your Python environment:

  ```python
  import glaredb
  ```

- Set up a connection to GlareDB:

  ```python
  con = glaredb.connect()
  ```
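Putting those steps together, a quick sanity check looks like this; it uses only the `con.sql()` and `.to_pandas()` calls that the rest of this post relies on:

```python
import glaredb

# Connect to a local, in-memory GlareDB session.
con = glaredb.connect()

# Run a trivial query to confirm the bindings work; no external
# data sources are needed yet.
print(con.sql("SELECT 1 AS ok").to_pandas())
```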
Querying Data from Different Sources
GlareDB provides specific functions for reading from various databases. In our case, we'll use:
- `read_postgres()` for accessing the PostgreSQL lookup table
- `read_snowflake()` for querying the Snowflake real estate data
Here's an example of how to query the PostgreSQL lookup table:
```python
# postgres_user and postgres_password are assumed to be defined earlier,
# e.g. loaded from environment variables.
postgres_table = con.sql(
    f"""
    SELECT * FROM
    read_postgres(
        'host=pg.demo.glaredb.com port=5432 user={postgres_user} password={postgres_password} dbname=postgres',
        'public',
        'borough_lookup'
    )
    """
).to_pandas()

postgres_table
```
You run your query with `con.sql()`, then convert the results to a pandas DataFrame containing the borough names and IDs by calling `.to_pandas()`.
Joining Data Across Databases
Now for the exciting part: joining data from Snowflake and Postgres. You can treat the results of your calls to the `read_...` functions as tables within the same database.
```python
# read_snowflake takes: account, username, password, database, warehouse,
# role, schema, and table. The snowflake_* and postgres_* variables are
# assumed to be defined earlier.
joined_data = con.sql(
    f"""
    SELECT postgres.borough_name, snowflake.* FROM
    read_snowflake(
        '{snowflake_account}.us-central1.gcp',
        '{snowflake_user}',
        '{snowflake_password}',
        'sandbox',
        'compute_wh',
        'accountadmin',
        'public',
        'nyc_sales'
    ) snowflake
    LEFT JOIN
    read_postgres(
        'host=pg.demo.glaredb.com port=5432 user={postgres_user} password={postgres_password} dbname=postgres',
        'public',
        'borough_lookup'
    ) postgres ON snowflake.borough = postgres.borough_id
    LIMIT 100
    """
).to_pandas()

joined_data
```
This query joins the Snowflake real estate data with the PostgreSQL borough lookup table based on the borough ID. The result is a pandas DataFrame that includes all fields from the Snowflake table plus the borough name from PostgreSQL.
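From here, the result is a regular pandas DataFrame, so the usual tooling applies. For example, here's a quick per-borough row count over the sample we pulled back (a minimal sketch; it assumes only the `borough_name` column selected above):

```python
# Count rows per borough in the 100-row sample, using the
# borough_name column we pulled in from Postgres.
print(joined_data.groupby("borough_name").size())
```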
If you find that you're often querying the same data sources, you can add your Snowflake or PostgreSQL instances to GlareDB as external data sources, making it easier to query multiple tables without repeatedly entering credentials.
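As a minimal sketch of what that looks like for the Postgres instance above: GlareDB exposes this through SQL, so you can run a `CREATE EXTERNAL DATABASE` statement over the same connection. Treat the option names and the eager `con.execute()` call below as assumptions to verify against the GlareDB docs for your version:

```python
# Register the demo Postgres instance as a named external data source so
# later queries can reference it as my_pg.public.borough_lookup instead of
# passing credentials to read_postgres() every time. (Option names assumed;
# verify against the GlareDB data source docs.)
con.execute(f"""
    CREATE EXTERNAL DATABASE my_pg
    FROM postgres
    OPTIONS (
        host = 'pg.demo.glaredb.com',
        port = '5432',
        user = '{postgres_user}',
        password = '{postgres_password}',
        database = 'postgres'
    );
""")
```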
Conclusion
By using data federation and GlareDB, joining data from different data sources becomes a straightforward process. In just a few lines of code, we've combined data from Snowflake and PostgreSQL, eliminating the need for complex ETL processes or data migrations.
Happy data joining!
Get started now
Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!