Hybrid Execution: Scale your workflow with GlareDB Cloud
We're building GlareDB to query your data, wherever it is. Your data may be on
your laptop, in other databases or in the cloud. It may exist as Parquet, CSV or
JSON formats. Regardless, 100% of the time there are multiple steps across
several destinations in getting to the end result you need. Now with
Hybrid Execution, you can SELECT
and JOIN
all of your data with SQL you
already know.
Typically, your data exists both locally and in the cloud, especially while developing and experimenting. Hybrid Execution allows you to combine local and remote data efficiently by using both your local machine and cloud resources. During query execution, GlareDB splits the query into separate pieces to run remotely or locally. Some queries may be run entirely locally, remotely, or a combination based on efficiency in query planning.
Get started with Hybrid Execution
GlareDB Cloud offers a free tier for getting started with Hybrid Execution. When you sign up, we automatically spin up a deployment for you with 10 GB of storage. You will arrive at our web-based SQL workspace where you can begin querying and working with GlareDB immediately.
By clicking the Connect button in the top right, we provide a simple command that you can copy and use when running GlareDB locally.
If you don't have GlareDB installed or haven't updated to 0.5.0 yet, simply run the following in the directory where you would like to install or update the GlareDB binary:
curl https://glaredb.com/install.sh | sh
Then, simply paste the connection command from GlareDB Cloud in the GlareDB CLI:
./glaredb
\open glaredb://user:pass@hello.remote.glaredb.com:6443/empty_feather
At this point, GlareDB is running locally on your machine and is additionally connected to your GlareDB Cloud deployment. You are ready to combine data from from the deployment as well as process local files.
Hybrid Execution in action
As an example, imagine you have a CSV on your local filesystem containing a single column of user IDs. The CSV might look like the following:
id
cd67387e-3d37-4803-850a-ff2f2d913261
7bd25c6f-c394-4cf7-9715-a5e9cc138bcd
7e168af4-7c77-4d80-90ea-17677ca9de81
0720b2cc-3906-4d7d-a822-33ead2387797
265448e5-35d9-418e-9114-c4b948abdf30
1a04aef1-a458-4b78-9953-6b90e752837b
Futhermore, imagine there's a table in your cloud deployment called user_metrics
that stores entries containing a user ID, and number of output_rows
per
entry.
You can efficiently join this data on your local machine using familiar SQL:
SELECT
m.user_id,
max(m.output_rows),
avg(m.output_rows)::int
FROM
user_metrics m
INNER JOIN './company_users.csv' u on m.user_id = u.id
GROUP BY m.user_id
LIMIT 5;
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¬āāāāāāāāāāāāāāāāāāāāā¬āāāāāāāāāāāāāāāāāāāāā
ā user_id ā MAX(m.output_rows) ā AVG(m.output_rows) ā
ā āā ā āā ā āā ā
ā Utf8 ā Int64 ā Int32 ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāŖāāāāāāāāāāāāāāāāāāāāāŖāāāāāāāāāāāāāāāāāāāāā”
ā 7bd25c6f-c394-4cf7-9715-a5e9cc138bcd ā 13575 ā 78 ā
ā 0720b2cc-3906-4d7d-a822-33ead2387797 ā 1000000 ā 302 ā
ā 7e168af4-7c77-4d80-90ea-17677ca9de81 ā 11000000 ā 1229 ā
ā cd67387e-3d37-4803-850a-ff2f2d913261 ā 10000 ā 64 ā
ā 265448e5-35d9-418e-9114-c4b948abdf30 ā 1847746 ā 4858 ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā“āāāāāāāāāāāāāāāāāāāāā“āāāāāāāāāāāāāāāāāāāāā
In this query, we are able to provide a relative path ./company_users.csv
while referencing the user_metrics
table residing in the cloud deployment
without any additional syntax. GlareDB can understand how to query the CSV file
locally while joining and remotely processing data from user_metrics
.
Hybrid Execution in Python
Hybrid Execution also works with the GlareDB Python library. Like the prior example, paste the Python connection string from GlareDB Cloud.
We can run the same query (including scanning a local file) from Python. We can also interop with DataFrames from Pandas or Polars. We'll adapt the above example with Pandas:
import glaredb
import pandas as pd
con = glaredb.connect("glaredb://user:pass@hello.remote.glaredb.com:6443/empty_feather")
company_users = pd.DataFrame(
{
"id": [
"cd67387e-3d37-4803-850a-ff2f2d913261",
"7bd25c6f-c394-4cf7-9715-a5e9cc138bcd",
"7e168af4-7c77-4d80-90ea-17677ca9de81",
"0720b2cc-3906-4d7d-a822-33ead2387797",
"265448e5-35d9-418e-9114-c4b948abdf30",
"1a04aef1-a458-4b78-9953-6b90e752837b"
]
}
)
con.sql("""
SELECT
m.user_id,
max(m.output_rows),
avg(m.output_rows)::int
FROM
execution_metrics m
INNER JOIN company_users u on m.user_id = u.id
GROUP BY m.user_id
LIMIT 5;
""").show()
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¬āāāāāāāāāāāāāāāāāāāāā¬āāāāāāāāāāāāāāāāāāāāā
ā user_id ā MAX(m.output_rows) ā AVG(m.output_rows) ā
ā āā ā āā ā āā ā
ā Utf8 ā Int64 ā Int32 ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāŖāāāāāāāāāāāāāāāāāāāāāŖāāāāāāāāāāāāāāāāāāāāā”
ā 7bd25c6f-c394-4cf7-9715-a5e9cc138bcd ā 13575 ā 78 ā
ā 0720b2cc-3906-4d7d-a822-33ead2387797 ā 1000000 ā 302 ā
ā 265448e5-35d9-418e-9114-c4b948abdf30 ā 1847746 ā 4858 ā
ā 1a04aef1-a458-4b78-9953-6b90e752837b ā 60175 ā 1468 ā
ā cd67387e-3d37-4803-850a-ff2f2d913261 ā 10000 ā 64 ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā“āāāāāāāāāāāāāāāāāāāāā“āāāāāāāāāāāāāāāāāāāāā
A look ahead: Compute and storage at scale
GlareDB Cloud is free and designed for getting started with Hybrid Execution so that you can scale your local data development workflows. When you upgrade your GlareDB Cloud account, you gain access to:
- Unlimited storage
- Dedicated compute (billed on usage)
Compute engines are customizable and cost-effective cloud compute nodes that can be scaled or powered off to suit your needs.
You can utilize compute engines in Hybrid Execution as well as our web-based SQL workspace, depending on your use case.
To use a compute engine with Hybrid Execution, simply specify it in the
connection string. From the previous example, simply qualify the deployment
name (empty_feather
) with the name of the engine, compute
:
\open glaredb://user:pass@hello.remote.glaredb.com:6443/compute.empty_feather
Data everywhere made easy
Our goal with GlareDB is to reduce time to insight. At a processing level, we accomplish this is with our performant engine built in Rust with DataFusion. At a human level, we design our interfaces to be simple and familiar. There's no need to learn new tools or frameworks: just query your data wherever it lives with SQL you already know.
See for yourself just how easy it is to query a CSV hosted on Hugging Face.
Try it out:
# install or update GlareDB
curl https://glaredb.com/install.sh | sh
# query
./glaredb --query "SELECT * FROM \
'https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/raw/main/prompts.csv';"
Hybrid Execution advances our mission by enabling you to join cloud data with local data seamlessly.
We look forward to hearing your feedback. Hit a bug, or have an idea on how to make this even better? Let us know by opening up an issue in our repo.
Get started now
Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!