🚀 GlareDB Pro now available:Simple, powerful, flexible serverless compute
productdatabase-fundamentals
October 29, 2024

What is Query Federation?

author image
Tal Gluck, Developer Advocate
Blog post cover image

If you work with data regularly, you've probably run into this situation: Your data lives in different places, but you need to analyze it all together to answer important questions. Let's talk about how query federation can help solve this common headache.

Prefer a video? Watch it here, or continue reading below:

The All-Too-Common Dilemma of Siloed Data

Here are some typical scenarios in modern data architectures:

Picture these scenarios:

  • You have a folder of CSVs with historical sales transactions, and someone just emailed you an Excel file with the latest data that you need to compare
  • Your web activity data lives in PostgreSQL, your historical analytics are in Snowflake, and someone just emailed you an Excel file to compare
  • You're juggling data between Databricks, a Delta Lake, Parquet files on S3, JSON from a REST API, and... yep, someone just emailed you an Excel file

I'm kidding about the Excel file... but only kind of. We've all been there!

The Traditional Solution (and Why It's Not Great)

The traditional approach to data integration involves setting up ETL pipelines to consolidate data in one location. However, this method introduces several challenges:

  • Complex data infrastructure requirements
  • Ongoing maintenance overhead
  • Significant infrastructure costs
  • Additional scheduling and monitoring systems
  • Comprehensive testing requirements
  • Increased dependencies between data teams and infrastructure teams

This traditional approach is kind of like going into your kitchen to cook a stir-fry and discovering you first need to:

  1. Drive to three different stores for ingredients
  2. Buy a car first
  3. Build the roads to get there

How Query Federation Transforms Data Access

Query federation enables real-time access, analysis, and joining of data from multiple sources within a single SQL query. Instead of executing complex ETL processes, query federation treats distributed data sources as a unified database.

Using our cooking analogy: Rather than establishing complex infrastructure to move ingredients, query federation provides instant access to all ingredients, wherever they're stored.

Implementing Query Federation: A Practical Example

Consider this real-world business analytics scenario: analyzing top products by sales volume and customer satisfaction scores.

The data sources include:

  • Sales transactions recorded in PostgreSQL
  • Customer survey data in S3 Parquet files
  • Product information in Snowflake

Using query federation, for instance with GlareDB, a single SQL query can join these diverse data sources seamlessly. When new data arrives in any format, including Excel files, it integrates naturally into your existing queries. In this example, you could run a complex, cross-data source query to compare sales transactions in Postgres, product information in Snowflake, and customer survey data in Excel files on S3, without any need for data migration.

SELECT 
    p.product_id,
    p.product_name,
    p.category,
    SUM(t.total_amount) as total_sales,
    AVG(s.satisfaction_score) as avg_satisfaction,
    COUNT(DISTINCT s.survey_id) as num_surveys
FROM 
    read_postgres(
        'postgresql://user:password@location:5433/postgres', 
        'public',
        'transactions'
        ) t
INNER JOIN 
    read_snowflake(
        'snowflake_account.us-central1.gcp', 
        'snowflake_username',
        'snowflake_password',
        'sandbox',
        'compute_wh',
        'accountadmin',
        'public',
        'products'
        ) p 
ON t.product_id = p.product_id
LEFT JOIN 
    read_excel(
            's3://my-bucket/latest_survey_data.xlsx',
            access_key_id => 'aws_access_key_id',
            secret_access_key => 'aws_secret_access_key',
            region=>'us-east-2'
            ) s ON t.product_id = s.product_id
GROUP BY 
    p.product_id, p.product_name, p.category
HAVING num_surveys > 3
ORDER BY 
    total_sales DESC
LIMIT 20;

Why Query Federation Matters

Query federation eliminates complex data infrastructure requirements for basic data analysis. The technology supports multiple workflows:

  • Interactive analysis in Jupyter notebooks
  • SQL-based data exploration
  • dbt pipeline development
  • Browser-based or terminal-based operations

The only requirement is basic Python knowledge or terminal access - no complex infrastructure setup needed.

Query Federation's Impact on Data Operations

The true value of query federation extends beyond workflow simplification. It fundamentally changes how teams interact with distributed data, shifting focus from data movement to data analysis and insight generation.

Getting Started with Query Federation

Check out our documentation to begin implementing query federation in your data operations. Join our community for support and discussion of best practices in distributed data access. Subscribe on YouTube, LinkedIn, or X to learn more about best practices for data storage, fundamentals of database internals, and product updates for GlareDB. And sign up for GlareDB Cloud to make use of hybrid execution and effortlessly collaborate with your team on your data no matter where they're stored!

Get started now

Ready to get the most out of your data? Get started with GlareDB locally, or spin up a deployment on GlareDB Cloud!

$curl https://glaredb.com/install.sh | sh
Try GlareDB Cloud