What is Query Federation?
If you work with data regularly, you've probably run into this situation: Your data lives in different places, but you need to analyze it all together to answer important questions. Let's talk about how query federation can help solve this common headache.
Prefer a video? Watch it here, or continue reading below:
The All-Too-Common Dilemma of Siloed Data
Here are some typical scenarios in modern data architectures:
Picture these scenarios:
- You have a folder of CSVs with historical sales transactions, and someone just emailed you an Excel file with the latest data that you need to compare
- Your web activity data lives in PostgreSQL, your historical analytics are in Snowflake, and someone just emailed you an Excel file to compare
- You're juggling data between Databricks, a Delta Lake, Parquet files on S3, JSON from a REST API, and... yep, someone just emailed you an Excel file
I'm kidding about the Excel file... but only kind of. We've all been there!
The Traditional Solution (and Why It's Not Great)
The traditional approach to data integration involves setting up ETL pipelines to consolidate data in one location. However, this method introduces several challenges:
- Complex data infrastructure requirements
- Ongoing maintenance overhead
- Significant infrastructure costs
- Additional scheduling and monitoring systems
- Comprehensive testing requirements
- Increased dependencies between data teams and infrastructure teams
This traditional approach is kind of like going into your kitchen to cook a stir-fry and discovering you first need to:
- Drive to three different stores for ingredients
- Buy a car first
- Build the roads to get there
How Query Federation Transforms Data Access
Query federation enables real-time access, analysis, and joining of data from multiple sources within a single SQL query. Instead of executing complex ETL processes, query federation treats distributed data sources as a unified database.
Using our cooking analogy: Rather than establishing complex infrastructure to move ingredients, query federation provides instant access to all ingredients, wherever they're stored.
Implementing Query Federation: A Practical Example
Consider this real-world business analytics scenario: analyzing top products by sales volume and customer satisfaction scores.
The data sources include:
- Sales transactions recorded in PostgreSQL
- Customer survey data in S3 Parquet files
- Product information in Snowflake
Using query federation, for instance with GlareDB, a single SQL query can join these diverse data sources seamlessly. When new data arrives in any format, including Excel files, it integrates naturally into your existing queries. In this example, you could run a complex, cross-data source query to compare sales transactions in Postgres, product information in Snowflake, and customer survey data in Excel files on S3, without any need for data migration.
SELECT
p.product_id,
p.product_name,
p.category,
SUM(t.total_amount) as total_sales,
AVG(s.satisfaction_score) as avg_satisfaction,
COUNT(DISTINCT s.survey_id) as num_surveys
FROM
read_postgres(
'postgresql://user:password@location:5433/postgres',
'public',
'transactions'
) t
INNER JOIN
read_snowflake(
'snowflake_account.us-central1.gcp',
'snowflake_username',
'snowflake_password',
'sandbox',
'compute_wh',
'accountadmin',
'public',
'products'
) p
ON t.product_id = p.product_id
LEFT JOIN
read_excel(
's3://my-bucket/latest_survey_data.xlsx',
access_key_id => 'aws_access_key_id',
secret_access_key => 'aws_secret_access_key',
region=>'us-east-2'
) s ON t.product_id = s.product_id
GROUP BY
p.product_id, p.product_name, p.category
HAVING num_surveys > 3
ORDER BY
total_sales DESC
LIMIT 20;
Why Query Federation Matters
Query federation eliminates complex data infrastructure requirements for basic data analysis. The technology supports multiple workflows:
- Interactive analysis in Jupyter notebooks
- SQL-based data exploration
- dbt pipeline development
- Browser-based or terminal-based operations
The only requirement is basic Python knowledge or terminal access - no complex infrastructure setup needed.
Query Federation's Impact on Data Operations
The true value of query federation extends beyond workflow simplification. It fundamentally changes how teams interact with distributed data, shifting focus from data movement to data analysis and insight generation.
Getting Started with Query Federation
Check out our documentation to begin implementing query federation in your data operations. Join our community for support and discussion of best practices in distributed data access. Subscribe on YouTube, LinkedIn, or X to learn more about best practices for data storage, fundamentals of database internals, and product updates for GlareDB. And sign up for GlareDB Cloud to make use of hybrid execution and effortlessly collaborate with your team on your data no matter where they're stored!
Get started now
Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!