Write PostgreSQL Data to Delta Tables with SQL
If you're dealing with transaction data in PostgreSQL you may be looking for ways to move it out for improved analytics or archive it without losing access. The Delta table format is a great option for this kind of storage. In this post, we'll walk you through how to use GlareDB to query PostgreSQL data and write it to Delta tables, all with SQL, creating a seamless archiving process. You can follow along below, or check out the video walkthrough:
Why Use Delta Tables?
Delta tables are an open-source storage format that brings ACID transactions to big data workloads. They're particularly useful for data lakes and offer benefits like:
- Improved query performance
- Time travel (data versioning)
- Schema evolution
- ACID transactions
The Process: From PostgreSQL to Delta Tables
We'll break this down into three main operations:
- Creating a Delta table from PostgreSQL data
- Adding the Delta table as an external table in GlareDB
- Inserting new data into the Delta table
For this guide, we've taken this CSV of web event data, and loaded it into
a Postgres instance. You can do that, or else select directly from the CSV file
on GitHub, or download it and select from it locally, using read_csv
.
Let's dive in!
Step 1: Creating a Delta Table from PostgreSQL Data
First, ensure you have GlareDB installed in your Python environment:
pip install glaredb
In a Python shell, import GlareDB and set up a GlareDB connection:
import glaredb
con = glaredb.connect()
Write a query to see the data in your Postgres table using read_postgres
:
con.sql(
"""
SELECT * FROM read_postgres(
'postgresql://postgres:postgres@localhost:5433/postgres',
'public',
'web_events'
)
LIMIT 10
""").to_pandas()
Now, use the query above with COPY TO
to select from the Postgres table write
it to a Delta table. Ensure the directory for your Delta table exists and is empty before
running this command.
con.sql("""
COPY (
SELECT * FROM read_postgres(
'postgresql://postgres:postgres@localhost:5433/postgres',
'public',
'web_events'
)
ORDER BY
event_id ASC
LIMIT 1000
)
TO 'file:///Users/path/to/my_delta/'
FORMAT delta
""")
Step 2: Adding the Delta Table as an External Table in GlareDB
Once you have your Delta table, add it as an external table in GlareDB:
con.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS delta_web_events
FROM delta
OPTIONS (LOCATION 'file:///Users/path/to/my_delta/')
""");
Now, query the Delta table as if it were any other table in our GlareDB instance:
con.sql("""
SELECT * FROM delta_web_events
""").to_pandas()
Step 3: Inserting New Data into the Delta Table
As new events are recorded in PostgreSQL, you can insert them into your Delta table. First, set the access mode to read-write:
con.sql(
"""
ALTER TABLE delta_web_events SET ACCESS_MODE TO READ_WRITE;
""").to_pandas()
Then, insert the new data:
con.sql(
"""
INSERT INTO
delta_web_events
SELECT
pg.*
FROM
read_postgres(
'postgresql://postgres:postgres@localhost:5433/postgres',
'public',
'web_events') pg
LEFT JOIN
delta_web_events d
ON
pg.event_id = d.event_id
WHERE
d.event_id IS NULL
""")
This query inserts only the new records that aren't already in the Delta table.
Verifying Your Data
You can easily verify that your data has been successfully archived by querying your external table:
con.sql("""select * from delta_web_events ORDER BY event_id""").to_pandas()
Conclusion
Using GlareDB to archive PostgreSQL data to Delta tables offers a flexible and efficient solution for data management. This approach allows you to maintain easy access to your historical data while improving your analytics capabilities and reducing the load on your transactional database.
By following these steps, you can set up a robust archiving system that grows with your data needs. Whether you're dealing with web events, financial transactions, or any other type of data, this method provides a scalable approach to data archiving and analytics.
Remember, while we've focused on PostgreSQL in this example, GlareDB's versatility allows you to apply the same techniques to query data from various data sources, and archive it to various locations, including local storage, S3, and GCS.
Happy archiving!
Links:
Get started now
Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!