July 20, 2023

Working with GlareDB in Python

Sean Smith, Founder

Since v0.2.0, GlareDB has shipped with Python bindings, providing the ability to run fast SQL transformations over Polars and Pandas data frames.

What is GlareDB?

Before jumping into the Python bindings, let's get a quick overview of GlareDB.

GlareDB is a SQL database capable of hooking into a variety of other data sources. We set out to build a system that reduces the need to move data between databases before you can run analytics on it. We've built out connectors for some of the most popular databases and warehouses, enabling you to query data wherever it lives.

Everything is managed through SQL, so setting up your data sources is just a CREATE EXTERNAL DATABASE call away. Once data sources are added, they can be referenced in SQL queries inside of GlareDB. For example, we can join data from a Snowflake warehouse and a Postgres instance by just providing fully qualified table names:

SELECT *
  FROM my_snowflake.public.user_tracking AS t
  INNER JOIN my_postgres.public.users AS u ON t.user_id = u.id;
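The my_postgres source in that query would have been added with a statement along these lines (the host and credentials here are placeholders, not a real instance):

CREATE EXTERNAL DATABASE my_postgres FROM postgres OPTIONS (
  host = 'my-pg-host.example.com',
  port = '5432',
  user = 'user',
  password = 'password',
  database = 'postgres'
);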

Learn more about data sources by checking out our data source docs.

Getting started

GlareDB's Python bindings are hosted on PyPI, making installation with pip very straightforward:

pip install glaredb

Once installed, you can now import GlareDB and connect to an in-process database:

import glaredb
# Store everything in memory, data is lost once the script exits.
con = glaredb.connect()
# Or optionally provide a path for persisting your GlareDB database.
con = glaredb.connect('/path/to/db')

Now that we're connected, we can begin writing some queries. GlareDB's Python bindings have support for querying Pandas and Polars data frames directly. Let's check out a full example below.

import glaredb
import pandas as pd

# Our data frame that we'll be querying.
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
    }
)

# Connect to an in-memory database.
con = glaredb.connect()
# Query the `df` data frame directly, and
# return its results as another data frame.
out = con.sql("select * from df where fruits = 'banana'").to_pandas()

In this example, we query a data frame directly with SQL and get the results back as a new Pandas data frame. You can mix and match data frame operations with SQL wherever it makes sense.
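Since to_pandas returns a plain Pandas data frame, ordinary Pandas calls work on the result directly. A small sketch continuing from the example above:

# `out` is a regular Pandas data frame, so normal Pandas
# operations apply directly to the SQL result.
print(out["A"].sum())
# Output: 8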

Incrementally building queries

In the previous example, we called to_pandas on the result of our query. Under the hood, this executes the query to completion and stores the result in a data frame.

But what happens when we don't call to_pandas? When it's omitted, GlareDB will begin planning the query, but won't actually execute it. If we assign the result to a variable, the contents of the variable will be the logical plan.

Let's take a quick peek using the query from our previous example:

# The same query from before, without the call to `to_pandas`.
out = con.sql("select * from df where fruits = 'banana'")

# Let's print out this variable and see what it gives us.
print(out)

# Output
# Query(
#     Projection: df.A, df.fruits
#       Filter: df.fruits = Utf8("banana")
#         TableScan: df,
# )

out stores the logical plan for this query. If we then execute the query with to_pandas:

results = out.to_pandas()
print(results)

# Output:
#    A  fruits
# 0  1  banana
# 1  2  banana
# 2  5  banana
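And since the bindings support Polars as well, the same plan can be executed into a Polars data frame instead:

# Execute the same plan, returning a Polars data frame.
results = out.to_polars()
print(results)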

It gets better.

Since we're not executing queries until we call to_pandas (or to_polars), we're able to use intermediate variables in subsequent queries. Our previous query took the data frame, and only kept rows where "fruits" was equal to "banana". What if we also wanted to keep rows where "A" was greater than 3? Normally this would require that we rewrite the query to include both conditions. With GlareDB's Python bindings, we have an additional option: We can incrementally build up this query by reusing the previous query's logical plan.

To do this, simply reference the variables returned from sql calls in subsequent queries:

# This is the same query we've been using thus far. Note that there are no
# `to_pandas` calls, preventing execution.
out1 = con.sql("select * from df where fruits = 'banana'")

# We can then reference `out1` in this query, allowing us to add our additional
# filtering condition. Again, no `to_pandas` calls.
out2 = con.sql("""select * from out1 where "A" > 3""")

# Now we actually execute the query.
results = out2.to_pandas()

print(results)

# Output
#    A  fruits
# 0  5  banana

Only when we get to the to_pandas call will the query begin executing. Once we do execute, we can see that we get the results we expect given the filters in our queries.

Incrementally building queries lets you break up large queries into smaller, more understandable pieces. This way of writing queries scales up to more complex queries as well, making it easy to write a multi-step data transformation.
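For instance, a multi-step transformation might be sketched out like this (the stage names are illustrative, reusing the data frame from earlier):

# Each stage is only planned; nothing executes yet.
bananas = con.sql("select * from df where fruits = 'banana'")
big_bananas = con.sql('select * from bananas where "A" > 3')
summary = con.sql("select fruits, count(*) as n from big_bananas group by fruits")

# Execution happens in one shot here.
print(summary.to_pandas())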

Accessing data sources in Python

One of GlareDB's biggest features is support for working with external data sources. And since everything is exposed through SQL commands, working with data sources inside of Python scripts is straightforward.

For example, we can connect to a Postgres instance, query it, and store the output in a data frame:

import glaredb
import pandas as pd

con = glaredb.connect()

# Add our demo Postgres instance as a data source. This is loaded with
# TPC-H data.
#
# `execute` will eagerly execute the query to completion.
con.execute("""
  CREATE EXTERNAL DATABASE my_pg FROM postgres OPTIONS (
    host = 'pg.demo.glaredb.com',
    port = '5432',
    user = 'demo',
    password = 'demo',
    database = 'postgres'
  );
""")

out = con.sql("select * from my_pg.public.customer limit 10").to_pandas()

We can then take this a step further and join data stored in a Pandas data frame with data stored in Postgres.

Now let's say we have a data frame with information about customer support calls. This data frame contains the customer's phone number and the outcome of the call, but it lacks some information that we want for additional analysis. Enriching our local data frame with information stored in the customer table in Postgres is as simple as running a join:

# Data frame containing information about customer support calls.
support_calls = pd.DataFrame(
    {
        "phone_number": ["15-416-330-4175", "20-908-631-4424", "16-155-215-1315"],
        "outcome": ["RESOLVED", "RESOLVED", "PENDING"]
    }
)

# Join the `customer` table from our Postgres instance onto our local data frame
# with a left join.
customer_info = con.sql("""
  SELECT c.c_custkey, c.c_name, c.c_mktsegment, s.outcome
  FROM support_calls AS s
  LEFT JOIN my_pg.public.customer AS c ON c.c_phone = s.phone_number
""").to_pandas()

print(customer_info)
# Output
#    c_custkey              c_name c_mktsegment   outcome
# 0         42  Customer#000000042   BUILDING    RESOLVED
# 1         18  Customer#000000018   BUILDING     PENDING
# 2         49  Customer#000000049   FURNITURE   RESOLVED

And now we have our final results stored in customer_info, combining our original data frame with customer data stored in Postgres.

We've shown connecting to a Postgres database here, but any data source that we support can be used. Need to pull in data from Snowflake? Check out our docs on Snowflake to see how to connect your warehouse.
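As a rough sketch, connecting a warehouse follows the same pattern as Postgres. The account and credential values below are placeholders, and the exact option names are covered in the Snowflake data source docs:

# A sketch of adding a Snowflake warehouse as a data source. The
# account and credential values are placeholders; see the Snowflake
# docs for the exact option list.
con.execute("""
  CREATE EXTERNAL DATABASE my_snowflake FROM snowflake OPTIONS (
    account_name = 'my_account',
    login_name = 'user',
    password = 'password',
    database_name = 'my_db',
    warehouse = 'my_warehouse'
  );
""")

out = con.sql("select * from my_snowflake.public.user_tracking limit 10").to_pandas()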

What's next

We showed how to use GlareDB's Python bindings to run SQL queries over data frames and how to query external data sources and store the output in data frames. This is just scratching the surface of what GlareDB can do, and we encourage you to check out our docs for more features.

Running into an issue or have a feature request? Let us know by opening a ticket on our repo.

Get started now

Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!

$ curl https://glaredb.com/install.sh | sh