# Working with GlareDB in Python
Since v0.2.0, GlareDB has shipped with Python bindings, providing the ability to run fast SQL transformations over Polars and Pandas data frames.
## What is GlareDB?
Before jumping into the Python bindings, let's get a quick overview of GlareDB.
GlareDB is a SQL database capable of hooking into a variety of other data sources. We set out to build a system that reduces the need to move data between databases before you can run analytics on it. We've built connectors for some of the most popular databases and warehouses, enabling you to query data wherever it lives.
Everything is managed through SQL, so setting up your data sources is just a `CREATE EXTERNAL DATABASE` call away. Once data sources are added, they can be referenced in SQL queries inside of GlareDB. For example, we can join data from a Snowflake warehouse and a Postgres instance just by providing fully qualified table names:

```sql
SELECT *
FROM my_snowflake.public.user_tracking AS t
INNER JOIN my_postgres.public.users AS u ON t.user_id = u.id;
```
Learn more about data sources by checking out our data source docs.
## Getting started
GlareDB's Python bindings are hosted on PyPI, making installation with pip straightforward:

```shell
pip install glaredb
```
Once installed, you can import GlareDB and connect to an in-process database:
```python
import glaredb

# Store everything in memory; data is lost once the script exits.
con = glaredb.connect()

# Or optionally provide a path for persisting your GlareDB database.
con = glaredb.connect('/path/to/db')
```
Now that we're connected, we can begin writing some queries. GlareDB's Python bindings have support for querying Pandas and Polars data frames directly. Let's check out a full example below.
```python
import glaredb
import pandas as pd

# Our data frame that we'll be querying.
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
    }
)

# Connect to an in-memory database.
con = glaredb.connect()

# Query the `df` data frame directly, and
# return its results as another data frame.
out = con.sql("select * from df where fruits = 'banana'").to_pandas()
```
In this example, we're able to query a data frame directly with SQL, and return the results as a new Pandas data frame. Mix and match data frame operations with SQL where it makes sense.
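For instance, nothing stops you from following a SQL query with plain Pandas operations on the result. Here's a minimal sketch continuing the example above (the derived column name is ours):

```python
# Use SQL for the filter, then ordinary Pandas for a follow-up transformation.
bananas = con.sql("select * from df where fruits = 'banana'").to_pandas()
bananas["A_doubled"] = bananas["A"] * 2  # regular Pandas column math
print(bananas)
```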
## Incrementally building queries
In our previous examples, we were always calling `to_pandas` on the results of our queries. Under the hood, this executes the query to completion and stores the result in a data frame.

But what happens when we don't call `to_pandas`? When it's omitted, GlareDB will begin planning the query, but won't actually execute it. If we assign the result to a variable, the contents of that variable will be the logical plan. Let's take a quick peek using the query from our previous example:
```python
# The same query from before, without the call to `to_pandas`.
out = con.sql("select * from df where fruits = 'banana'")

# Let's print out this variable and see what it gives us.
print(out)
# Output:
# Query(
#   Projection: df.A, df.fruits
#     Filter: df.fruits = Utf8("banana")
#       TableScan: df,
# )
```
`out` is storing the logical plan for this query. Then, if we execute the query with `to_pandas`:
```python
results = out.to_pandas()
print(results)
# Output:
#    A  fruits
# 0  1  banana
# 1  2  banana
# 2  5  banana
```
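If you'd rather end up with a Polars data frame, the same plan can be executed with `to_polars` instead. A minimal sketch, assuming the `polars` package is installed:

```python
# Execute the same logical plan, materializing the result as a Polars DataFrame.
results_pl = out.to_polars()
print(results_pl)
```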
It gets better. Since we're not executing queries until we call `to_pandas` (or `to_polars`), we're able to use intermediate variables in subsequent queries. Our previous query took the data frame and only kept rows where "fruits" was equal to "banana". What if we also wanted to keep rows where "A" was greater than 3? Normally this would require that we rewrite the query to include both conditions. With GlareDB's Python bindings, we have an additional option: we can incrementally build up this query by reusing the previous query's logical plan.
To do this, simply reference the variables returned from `sql` calls in subsequent queries:
```python
# This is the same query we've been using thus far. Note that there are no
# `to_pandas` calls, so nothing executes yet.
out1 = con.sql("select * from df where fruits = 'banana'")

# We can then reference `out1` in this query, allowing us to add our additional
# filtering condition. Again, no `to_pandas` calls.
out2 = con.sql("""select * from out1 where "A" > 3""")

# Now we actually execute the query.
results = out2.to_pandas()
print(results)
# Output:
#    A  fruits
# 0  5  banana
```
Only when we get to the `to_pandas` call will the query begin executing. Once we do execute, we can see that we get the results we expect given the filters in our queries.
Incrementally building queries lets you break up large queries into smaller, more understandable pieces. This way of writing queries scales up to more complex queries as well, making it easy to write a multi-step data transformation.
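As a sketch of what that can look like, here's a small pipeline built from named steps, reusing the `df` frame from earlier (the step names are ours):

```python
# Each step is only planned; nothing executes until `to_pandas` is called.
bananas = con.sql("select * from df where fruits = 'banana'")
large = con.sql('select * from bananas where "A" > 3')
summary = con.sql("select fruits, count(*) as n from large group by fruits")

print(summary.to_pandas())
```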
## Accessing data sources in Python
One of GlareDB's biggest features is support for working with external data sources. And since everything is exposed through SQL commands, working with data sources inside of Python scripts is straightforward.
For example, we can connect to a Postgres instance, query it, and store the output in a data frame:
```python
import glaredb
import pandas as pd

con = glaredb.connect()

# Add our demo Postgres instance as a data source. This is loaded with
# TPC-H data.
#
# `execute` will eagerly execute the query to completion.
con.execute("""
    CREATE EXTERNAL DATABASE my_pg FROM postgres OPTIONS (
        host = 'pg.demo.glaredb.com',
        port = '5432',
        user = 'demo',
        password = 'demo',
        database = 'postgres',
    );
""")

out = con.sql("select * from my_pg.public.customer limit 10").to_pandas()
```
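Note the two entry points used here: `execute` runs its statement eagerly, which is what we want for set-up statements like `CREATE EXTERNAL DATABASE`, while `sql` only plans and defers execution until results are requested. A quick sketch of the difference using the table we just connected:

```python
# `sql` only plans the query...
plan = con.sql("select count(*) from my_pg.public.customer")

# ...while `to_pandas` is what actually executes it.
counts = plan.to_pandas()
print(counts)
```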
We can then take this a step further and join data stored in a Pandas data frame with data stored in Postgres.
Now let's say we have a data frame with information about customer support calls. This data frame contains the customer's phone number and the outcome of the call, but it lacks some information that we want for additional analysis. Enriching our local data frame with information stored in the `customer` table in Postgres is as simple as running a join:
```python
# Data frame containing information about customer support calls.
support_calls = pd.DataFrame(
    {
        "phone_number": ["15-416-330-4175", "20-908-631-4424", "16-155-215-1315"],
        "outcome": ["RESOLVED", "RESOLVED", "PENDING"],
    }
)

# Join the `customer` table from our Postgres instance onto our local data
# frame with a left join.
customer_info = con.sql("""
    SELECT c.c_custkey, c.c_name, c.c_mktsegment, s.outcome
    FROM support_calls AS s
    LEFT JOIN my_pg.public.customer AS c ON c.c_phone = s.phone_number
""").to_pandas()

print(customer_info)
# Output:
#    c_custkey              c_name c_mktsegment   outcome
# 0         42  Customer#000000042     BUILDING  RESOLVED
# 1         18  Customer#000000018     BUILDING   PENDING
# 2         49  Customer#000000049    FURNITURE  RESOLVED
```
And now we have our final results stored in `customer_info`, made up of data from both our original data frame and the customer data stored in Postgres.
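From here, the enriched frame is just a regular Pandas data frame, so any follow-up analysis applies. One illustrative step (the aggregation choice is ours):

```python
# Count call outcomes per market segment on the enriched frame.
print(customer_info.groupby("c_mktsegment")["outcome"].count())
```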
We're showing a Postgres connection here, but any data source that we support can be used. Need to pull in data from Snowflake? Check out our docs on Snowflake to see how to connect your warehouse.
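As a rough sketch, connecting a warehouse looks much like the Postgres example above. The option names and values below are illustrative placeholders, so refer to the Snowflake docs for the exact set:

```python
# Hypothetical Snowflake connection; option names and values are placeholders.
con.execute("""
    CREATE EXTERNAL DATABASE my_snowflake FROM snowflake OPTIONS (
        account = 'my_account',
        username = 'my_user',
        password = 'my_password',
        database = 'my_db',
        warehouse = 'my_wh'
    );
""")
```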
## What's next
We showed how to use GlareDB's Python bindings to run SQL queries over data frames and how to query external data sources and store the output in data frames. This is just scratching the surface of what GlareDB can do, and we encourage you to check out our docs for more features.
Running into an issue or have a feature request? Let us know by opening a ticket on our repo.
## Get started now
Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!