GlareDB | Data Mesh: From the Database's Point of View

tl;dr: Data mesh is an architectural paradigm that has been gaining traction, but many of its core concepts have parallels in traditional database design, particularly in how we think about views and materialization. This post explores data mesh principles through the lens of database concepts we've been using for decades.

Data Mesh in Brief

Before we dive into the details, let's recap what a data mesh is. The key principles of data mesh were proposed by Zhamak Dehghani in a post on Martin Fowler's blog and popularized in her O'Reilly book on the subject. At its core, a data mesh is an architectural and organizational approach to managing analytical data at scale. It emphasizes four key principles:

Domain-oriented decentralized data ownership and architecture
Data as a product
Self-serve data infrastructure as a platform
Federated computational governance

Now, if you're like me and you've spent a good chunk of your career thinking about databases, you might be thinking, "Hey, some of this sounds familiar." And you'd be right. Let's break it down.

Domain-Oriented Decentralization: The View from Above

In a data mesh, each domain (think: marketing, sales, logistics) owns its data and is responsible for serving it to the rest of the organization. This might sound new, but it's not too far from how we've thought about views in databases for years.

Remember non-materialized views? These are essentially saved queries that other parts of the system can access as if they were tables. They function like an API layer: consumers can express their queries against the view without needing to understand the underlying table structure.

A non-materialized view stores no data of its own. It is a virtual table whose defining query runs each time the view is accessed. For example, let's say the marketing team wants to share customer engagement metrics with the sales team. They can create a non-materialized view called customer_engagement_summary that selects relevant fields like customer_id, last_login, and recent_purchases from their detailed tables. This view presents the data in a simplified format, allowing the sales team to access up-to-date information without diving into the complexities of the marketing database schema.
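To make this concrete, here is a minimal sketch of such a view using Python's built-in sqlite3 module. The underlying tables (logins, purchases), their columns, and the sample rows are assumptions made up for illustration, and recent_purchases is simplified to a plain purchase count. Only the view name and its three output columns come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical detailed tables owned by the marketing domain.
    CREATE TABLE logins (customer_id INTEGER, logged_in_at TEXT);
    CREATE TABLE purchases (customer_id INTEGER, purchased_at TEXT, amount REAL);

    INSERT INTO logins VALUES
        (1, '2024-05-01'), (1, '2024-05-09'), (2, '2024-04-20');
    INSERT INTO purchases VALUES
        (1, '2024-05-02', 40.0), (1, '2024-05-08', 15.5);

    -- The view is only a saved query; no rows are stored for it.
    CREATE VIEW customer_engagement_summary AS
    SELECT
        l.customer_id,
        MAX(l.logged_in_at) AS last_login,
        (SELECT COUNT(*) FROM purchases p
          WHERE p.customer_id = l.customer_id) AS recent_purchases
    FROM logins l
    GROUP BY l.customer_id;
""")


def engagement_summary(conn):
    """What a consumer (e.g. the sales team) would run against the view."""
    return conn.execute(
        "SELECT customer_id, last_login, recent_purchases "
        "FROM customer_engagement_summary ORDER BY customer_id"
    ).fetchall()


before = engagement_summary(conn)

# A new purchase lands in the base table; the view itself is never touched.
conn.execute("INSERT INTO purchases VALUES (2, '2024-05-10', 99.0)")
after = engagement_summary(conn)

print(before)
print(after)
```

The second query picks up the new purchase without any refresh step, because the view's query is re-run on every access.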

In a way, non-materialized database views were our first attempt at domain-oriented data products. A view could encapsulate the complexity of a particular domain's data, presenting only what's necessary to other parts of the system. The marketing team could create a customer_summary view, while the logistics team could have a shipping_status view. Each "domain" (in this case, a team or department) would be responsible for maintaining its views and ensuring they accurately represent its data.

Data as a Product: Materialization Matters

The data mesh concept of "data as a product" emphasizes treating data as a first-class citizen, with quality, documentation, and a focus on consumer needs. In database terms, both non-materialized and materialized views can serve as data products, each with its own strengths.

Non-materialized views are like "live" data products. They encapsulate business logic and provide a consistent interface for data consumers, always reflecting the current state of the underlying data.

Materialized views, on the other hand, are like "snapshot" data products. They store the computed results of their query, trading some data freshness for faster reads. To a consumer, a materialized view behaves much like an ordinary table; the difference is that its contents are derived from a query and only change when the view is refreshed.

For instance, in an online retail environment, the inventory team might use a non-materialized view called current_stock_levels that shows real-time stock information. Since stock levels change frequently with each sale or restock, the view needs to reflect the latest data every time it's queried.

Conversely, the finance team might use a materialized view (or a periodically rebuilt table) named monthly_sales_totals to analyze sales performance. Since they're dealing with aggregated data that doesn't need to be updated every second, they can afford to refresh it once a day or even once a month. The materialized view provides quick access to pre-calculated totals, improving query performance when generating reports.
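The freshness trade-off can be sketched in a few lines. SQLite has no materialized views, so this example emulates one with a regular table plus an explicit refresh function. The sales table, its columns, and the refresh_monthly_sales_totals helper are hypothetical names chosen for illustration; a database with native materialized views would replace the helper with its own refresh command.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base table owned by the sales domain.
    CREATE TABLE sales (sold_on TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2024-04-03', 100.0), ('2024-04-17', 50.0), ('2024-05-02', 75.0);

    -- Stand-in for a materialized view: a table holding precomputed results.
    CREATE TABLE monthly_sales_totals (month TEXT PRIMARY KEY, total REAL);
""")


def refresh_monthly_sales_totals(conn):
    """Full refresh: recompute every total from the base table."""
    with conn:  # commit both statements together, or roll back on error
        conn.execute("DELETE FROM monthly_sales_totals")
        conn.execute("""
            INSERT INTO monthly_sales_totals (month, total)
            SELECT substr(sold_on, 1, 7), SUM(amount)
            FROM sales
            GROUP BY substr(sold_on, 1, 7)
        """)


def monthly_totals(conn):
    return conn.execute(
        "SELECT month, total FROM monthly_sales_totals ORDER BY month"
    ).fetchall()


refresh_monthly_sales_totals(conn)
snapshot = monthly_totals(conn)

# A new sale arrives, but the stored snapshot does not change by itself.
conn.execute("INSERT INTO sales VALUES ('2024-05-20', 25.0)")
stale = monthly_totals(conn)

# Only the next scheduled refresh makes the new sale visible.
refresh_monthly_sales_totals(conn)
fresh = monthly_totals(conn)

print(snapshot, stale, fresh, sep="\n")
```

Reads between refreshes are cheap because they touch only the small precomputed table, and the staleness window is exactly the refresh interval the finance team chooses.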

In a data mesh, each domain is responsible for creating and maintaining these data products. The choice between non-materialized and materialized views depends on the specific needs of data consumers, balancing factors like data freshness, query performance, and resource utilization. It's not just about exposing raw data; it's about packaging it in a way that's useful and efficient for consumers.

Self-Serve Infrastructure: The Database as a Platform

The idea of self-serve data infrastructure in a data mesh is about providing tools and platforms that allow domain teams to manage and share their data products without heavy reliance on a central team. This principle aims to remove bottlenecks and empower domain experts.

In the database world, we've been moving in this direction for a while. The ability for users to create their own views, indexes, and even tables (with the right permissions) is a form of self-serve infrastructure. Modern database systems with their rich ecosystems of extensions and user-defined functions push this even further.
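As a small illustration of that self-serve surface, the sketch below has one team register a user-defined function and create its own index and view, again using Python's sqlite3 module. The signups table, the normalize_email rule, and the object names are all assumptions invented for the example, and permission management is left out because SQLite has no user accounts.

```python
import sqlite3


def normalize_email(value):
    """Hypothetical domain rule: trim whitespace and lowercase emails."""
    return None if value is None else value.strip().lower()


conn = sqlite3.connect(":memory:")
# Register the Python function under a SQL-callable name (1 = one argument).
conn.create_function("normalize_email", 1, normalize_email)

conn.executescript("""
    CREATE TABLE signups (id INTEGER PRIMARY KEY, email TEXT, signed_up_on TEXT);
    INSERT INTO signups (email, signed_up_on) VALUES
        ('  Ada@Example.com ', '2024-05-01'),
        ('grace@example.com',  '2024-05-03');

    -- Objects the team creates for itself, with no central team involved.
    CREATE INDEX idx_signups_signed_up_on ON signups (signed_up_on);
    CREATE VIEW signups_by_day AS
    SELECT signed_up_on, COUNT(*) AS signups
    FROM signups
    GROUP BY signed_up_on;
""")

# The user-defined function is callable from plain SQL on this connection.
emails = [row[0] for row in conn.execute(
    "SELECT normalize_email(email) FROM signups ORDER BY id")]
per_day = conn.execute(
    "SELECT signed_up_on, signups FROM signups_by_day ORDER BY signed_up_on"
).fetchall()

print(emails)
print(per_day)
```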

The challenge in both data mesh and database design is balancing this self-serve capability with governance and performance concerns. Just as a DBA might need to step in if someone creates a view that's tanking performance, a data mesh needs mechanisms to ensure that decentralized data products don't create chaos.

But this self-service model isn't without its pitfalls. If every domain team starts creating separate tables with no standards or oversight, it can lead to redundant data products, disorganization, and data inconsistencies. To prevent such chaos, organizations need to establish clear guidelines and best practices. This might include setting up naming standards or implementing a review process for new data products. The goal is to empower domain experts while maintaining overall system health and data integrity.

Federated Governance: Constraints and Contracts

In a data mesh, federated computational governance is about creating global standards for interoperability, quality, and security, while allowing domains the freedom to manage their data as they see fit. This balance of local control and global consistency is crucial.

However, finding the sweet spot between autonomy and standardization can be challenging. Too much control can stifle innovation, while too little can lead to a "Wild West" scenario with incompatible data formats and security gaps. If each domain uses a different data encoding or naming convention, integrating data across domains becomes a nightmare. To address this, organizations can implement a set of "data contracts": agreements that define the expected data formats, quality standards, and access protocols. These contracts serve as a common language and set of expectations, ensuring that while domains operate independently, they remain interoperable and secure within the larger ecosystem.
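One way to picture the format part of a data contract is as a small, machine-checkable description of what a data product promises. The sketch below is an assumption-heavy toy: the DataContract class, the violations function, and the shipping_status columns are hypothetical, and a real contract would also cover quality thresholds and access rules, which are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: required columns and their expected types."""
    name: str
    columns: dict  # column name -> expected Python type


def violations(contract, rows):
    """Return human-readable problems; an empty list means the rows conform."""
    problems = []
    for i, row in enumerate(rows):
        # Report required columns that the row does not carry at all.
        for col in sorted(contract.columns.keys() - row.keys()):
            problems.append(f"row {i}: missing column '{col}'")
        # Report present columns whose value has the wrong type.
        for col, expected in contract.columns.items():
            if col in row and not isinstance(row[col], expected):
                problems.append(
                    f"row {i}: '{col}' should be {expected.__name__}, "
                    f"got {type(row[col]).__name__}"
                )
    return problems


contract = DataContract("shipping_status", {"order_id": int, "status": str})
rows = [
    {"order_id": 1, "status": "shipped"},  # conforms
    {"order_id": "2"},                     # wrong type, and no status
]
found = violations(contract, rows)
print(found)
```

A check like this could run wherever a domain publishes its product, so consumers in other domains learn about a breaking change before they query it.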

Database designers will recognize this challenge. It's similar to deciding what constraints to enforce at the database level versus at the application level. In a database, we might use schemas, constraints, and triggers to enforce certain rules globally, while leaving other rules to the applications and teams that own individual tables or views.
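A brief sketch of database-level enforcement, using sqlite3: a CHECK constraint restricts the allowed values, and a trigger blocks an invalid state transition, so every writer is held to the same rules regardless of which application issued the statement. The shipments table, its status values, and the try_sql helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shipments (
        order_id INTEGER NOT NULL,
        status   TEXT NOT NULL
                 CHECK (status IN ('pending', 'shipped', 'delivered'))
    );

    -- Trigger: a delivered shipment may not move back to an earlier status.
    CREATE TRIGGER shipments_no_undeliver
    BEFORE UPDATE OF status ON shipments
    WHEN OLD.status = 'delivered' AND NEW.status <> 'delivered'
    BEGIN
        SELECT RAISE(ABORT, 'delivered shipments cannot change status');
    END;
""")


def try_sql(conn, sql, params=()):
    """Run one statement; True if accepted, False if the database rejected it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql, params)
        return True
    except sqlite3.DatabaseError:
        return False


ok_insert = try_sql(conn, "INSERT INTO shipments VALUES (?, ?)", (1, "delivered"))
bad_status = try_sql(conn, "INSERT INTO shipments VALUES (?, ?)", (2, "lost"))
bad_update = try_sql(
    conn, "UPDATE shipments SET status = 'pending' WHERE order_id = 1")
final_status = conn.execute(
    "SELECT status FROM shipments WHERE order_id = 1").fetchone()[0]

print(ok_insert, bad_status, bad_update, final_status)
```

Rules that vary by consumer, such as how a status should be displayed, are the kind better left to the application level.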

In a data mesh, this might translate to global standards for data formats, quality metrics, and access controls, implemented through a combination of technological guardrails and organizational policies.

Conclusion: Everything Old is New Again

The data mesh isn't about throwing out everything we've learned about data management over the years. Rather, it's about taking concepts we've used successfully at the database level and applying them to the broader challenges of enterprise-wide data architecture.

Views become domain-specific data products. Materialization strategies become ways to optimize those products for different use cases. The database's role as a platform expands to encompass a wider range of self-serve tools. And the balance between centralized control and distributed management that DBAs have grappled with for years becomes a central principle of the entire data architecture.

As we continue to evolve our approach to data management, it's worth remembering that many of the challenges we face aren't entirely new. By looking at new paradigms through the lens of familiar concepts, we can bring decades of hard-won wisdom to bear on our modern data challenges.

At the end of the day, the data mesh isn't about reinventing the wheel. By treating views as domain-specific data products, choosing materialization strategies deliberately, and balancing self-service with governance, we can build robust, scalable data architectures. The lessons from decades of database design are not just relevant to that work; they're essential to it.

In my next post, I'll explore how we're applying these principles in GlareDB to enable data mesh-style architectures: how domain teams can create and manage their own data products while governance and performance are kept in check.

You can read part 2 here.