
The Case Against Centralized Medallion Architecture

Why tailored, decentralized data quality trumps the medallion architecture

DALL-E generated

I’ve seen too many articles praising the medallion architecture as the go-to solution for enterprise data quality. At first sight, the structured, three-layered approach sounds like a no-brainer: organize your data into neat bronze, silver, and gold layers, and data quality seemingly takes care of itself.

But on closer inspection, my reservations about this architectural approach only grow. Sure, it promises consistent, scalable, and centralized improvement of information quality. In practice, however, quality problems are rectified too late and rigidly with the same tooling, regardless of the context.

Enterprises are complex adaptive systems with wildly different data sources, each with unique challenges regarding its information quality. Why impose the same rigid process on all of them? Forcing them all into the same centralized quality framework will lead to inefficiencies and unnecessary overhead.

I want to challenge the Medallion Architecture as the supposed best answer to enterprise data quality problems. I’ll make the case for a more tailored, decentralized approach – one inspired by Total Quality Management (TQM) and aligned with the decentralized approach of universal data supply.

Medallion architecture in a nutshell

The medallion architecture seeks to improve data quality through a tiered approach that incrementally enhances data downstream of its production. By dividing the data into three medals or layers (commonly referred to as bronze, silver, and gold), the architecture systematically applies data transformation and validation steps to ensure quality and usability.

The bronze layer is defined as containing raw, unprocessed data from the sources, including any inconsistencies, duplicates, or even errors. It serves as the single source of truth and can also be used to trace back to the original information.

The silver layer processes and refines the raw data to resolve issues and improve consistency. It produces cleansed and validated data in a more consistent format.

The gold layer finally delivers highly refined, domain-specific datasets ready for business use. It offers the data aggregated, enriched, and optimized for analytics or reporting.
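
To make the layering concrete, here is a minimal sketch in Python with pandas. It is not tied to Databricks or Delta Lake, and the sample columns and cleansing rules are invented purely for illustration:

```python
import pandas as pd

# Bronze: ingest the raw source data as-is, including duplicates and errors
bronze = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "amount": ["100", "100", "not_a_number", "250"],
    "country": ["DE", "DE", "de", None],
})

# Silver: cleanse and standardize - deduplicate, fix types, normalize values
silver = (
    bronze.drop_duplicates()
    .assign(
        amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),
        country=lambda df: df["country"].str.upper(),
    )
    .dropna(subset=["amount"])
)

# Gold: aggregate for a specific business purpose, e.g. revenue per country
gold = silver.groupby("country", dropna=False, as_index=False)["amount"].sum()
print(gold)
```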

The medallion architecture is actually based on technical advancements from the vendor Databricks that allowed the data warehouse to be redefined as a data lakehouse.

As the name suggests, the lakehouse offers classic data warehouse functionality, like ACID updates on structured datasets, on top of a data lake. The data lake is known for supporting the processing of unstructured big data better than data warehouses based on relational databases.

Medallion architecture is basically identical to the classic data engineering lifecycle of data collection approaches like the data warehouse and data lakehouse — Image by author

The medallion architecture addresses the business need for good information quality with these technical improvements. But does one technical improvement applied to a business requirement already make a better architecture?

Centralized, rigidly layered data collections won’t scale

If you look into my articles on universal data supply, you’ll see that I’m a strong advocate of decentralized data processing at the enterprise level.

The fundamental lesson is that no single, centralized platform can solve all the varied information requirements in a sufficiently large enterprise.

Centralized data collection approaches like data warehouse and data lakehouse therefore cannot deliver a universal data supply.

At its core, the medallion architecture just defines three standardized layers within the data lakehouse setup and is therefore not suitable as an enterprise-wide data quality solution.

Let’s dig deeper and look at the deficits.

Rigid layering

Applying a rigid three-layer data structure for all sources leads to inefficiencies when certain datasets do not require extensive cleansing or transformation.

Highly reliable internal source systems may not need extensive quality enhancements. Small-scale projects, exploratory data analysis, or non-critical data may not need gold-standard cleansing or structuring. While some data needs extensive pre-processing through multiple transformation steps, other data may be directly fit for purpose without any transformation at all.

Three fixed layers are a poor fit for such varied business requirements. Applying the same standard data quality processing can waste resources and slow down innovation in such scenarios.

Operational complexity

Maintaining and enforcing such a centralized layered system requires significant operational overhead, especially in environments with rapidly changing requirements or datasets.

Each data layer involves additional processes like ETL/ELT pipelines and validations. Monitoring and debugging these pipelines become harder as the architecture scales.

The medallion architecture suffers from the same problems as the centralized data lakehouse. In an extremely distributed application landscape, a single centralized data quality platform cannot efficiently implement all necessary data quality improvements, just as the centralized data lakehouse cannot efficiently apply all the business rules necessary to derive value from data.

Increased latency

Each data layer adds latency since data must move sequentially from one layer to the next.

Real-time or near-real-time analytics may require bypassing or optimizing the bronze/silver stages, which contradicts the layered nature of the architecture.

Overall, the forced data layers delay the delivery of insights for time-sensitive use cases like fraud detection.

One-sided focus on reactive downstream correction

The medallion architecture only improves quality after data has already been created with defects. That’s like trying to repair or optimize a car after it has been fully assembled.

In manufacturing, Total Quality Management (TQM) therefore stipulates that quality is designed into the product, starting from raw materials, processes, and components at the source. Defects are prevented rather than corrected.

The medallion architecture, in contrast, is purely reactive and always assumes error-prone raw data that has to be cleaned up and standardized layer by layer.

Total data quality management

TQM in manufacturing is proactive and focuses on preventing defects through continuous improvement, rigorous standards, and embedding quality checks at every stage of production. TQM is a holistic approach that is strongly customer-oriented regarding product requirements and design. It has been successfully applied to many industrial production processes.

The fundamental principles of business excellence for quality are:

  • Customer focus
  • Applying measurements to the outputs of processes
  • Continuous process improvement based on those measurements (see the sketch after this list)
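
Applied to data, “measuring the outputs of processes” can be as simple as computing a few quality metrics on every batch a business process produces and tracking them over time. The fields and metrics below are only illustrative:

```python
import pandas as pd

# Hypothetical output of one data-producing business process
output = pd.DataFrame({
    "customer_id": [1, 2, None, 4],
    "email": ["a@example.com", None, "c@example.com", "not-an-email"],
})

# Measure the output of the process: completeness and a simple validity check
metrics = {
    "completeness_customer_id": output["customer_id"].notna().mean(),
    "validity_email": output["email"].str.contains("@", na=False).mean(),
}
print(metrics)  # track these per run to drive continuous process improvement
```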

Because in universal data supply business processes create ‘data as a product’, we can directly apply these manufacturing quality principles.

We need to apply the TQM thinking to the creation of ‘data as a product’. We need Total Quality Data Management (TQDM).

A downstream approach like medallion inherently has higher costs and risks of missed errors compared to an upstream approach like TQDM, where issues are resolved closer to the source. Quality cannot be efficiently guaranteed by making corrections solely in the downstream systems.
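
To illustrate the upstream idea, here is a minimal sketch of prevention at the source: the producing process validates a record at creation time instead of shipping defects downstream. The record type and business rules are hypothetical.

```python
from dataclasses import dataclass

VALID_COUNTRIES = {"DE", "FR", "IT"}  # hypothetical business rule

@dataclass
class Order:
    order_id: int
    amount: float
    country: str

def create_order(order_id: int, amount: float, country: str) -> Order:
    """Quality is enforced where the data is produced, not repaired downstream."""
    if amount <= 0:
        raise ValueError(f"amount must be positive, got {amount}")
    country = country.strip().upper()
    if country not in VALID_COUNTRIES:
        raise ValueError(f"unknown country code: {country!r}")
    return Order(order_id, amount, country)

# The process either emits a valid 'data product' or fails fast at the source
order = create_order(42, 99.90, "de")
print(order)
```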

I am repeatedly confronted with the following arguments, which suggest that downstream data corrections can be more efficient than process improvements to eliminate the root cause of the quality problem:

  • Legacy systems might be difficult or expensive to modify to meet required quality standards. Always ask yourself whether the continuous correction of errors can really be cheaper than fixing the root cause.

  • Human errors from manual data entry or inconsistent formats are hard to eliminate entirely. But the difficulty of avoiding every possible error should not stop us from making every effort to prevent them.

  • Downstream corrections reduce the load on highly stressed process-owning teams and prevent over-engineering at the source. This contradicts my own experience: decoupled data teams are actually completely overloaded with correcting all conceivable errors from the business domains.

  • Setting up downstream cleansing and refinement of externally provided data is often the only practicable option and avoids unrealistically high costs from demanding perfect quality from external sources. My work for car manufacturers has shown me how far you can go in committing suppliers to minimum quality standards. The same should be possible for external data providers, although we may have to provide temporary workarounds through quality-enhancing agents.

  • Even in systems with robust quality controls, unexpected issues (e.g., system outages) can still occur. Downstream layers can act as a safety net so that source systems do not have to handle every possible edge case. That is true, but we cannot afford to build costly, blanket safety nets across the value chain either.

While it can be beneficial to refine data for specific business purposes and to have a safety net for specific system outages, a generic and rigid three-tiered approach does not meet the varied requirements at the enterprise level. Not every source needs the same ‘enhancement’, and often the arguments listed simply do not apply.

If we are in doubt, we should start measuring the real costs caused by low-quality data in the enterprise. In my experience, an internal process improvement has always been cheaper in the long run than ongoing downstream data corrections.

If downstream correction really is the only viable option, for instance because an external source cannot be fixed directly, it is much more efficient to install purpose-built quality-enhancing agents for that specific source only. This tailored approach fits well with the decentralized universal data supply, where data producers share their data on the outside with all consumers. Quality-enhancing agents can act as decoupled, selective correctives participating in the shared data infrastructure. Consumers can choose which enhancing procedures are beneficial for their individual information needs, and the process can easily be disconnected when it’s no longer needed.
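
As a rough sketch of what such a purpose-built, quality-enhancing agent could look like: a small, decoupled transformation that subscribes to one specific external source and republishes a corrected stream, which consumers may or may not opt into. The names and fixes below are hypothetical.

```python
from typing import Iterable, Iterator

def quality_enhancing_agent(records: Iterable[dict]) -> Iterator[dict]:
    """Decoupled, selective corrective for one specific external source.

    The agent participates in the shared data infrastructure like any other
    consumer/producer and can simply be unplugged when no longer needed.
    """
    for record in records:
        fixed = dict(record)
        # Source-specific fixes only - no generic, enterprise-wide layer
        fixed["country"] = (fixed.get("country") or "UNKNOWN").upper()
        if isinstance(fixed.get("amount"), str):
            fixed["amount"] = float(fixed["amount"].replace(",", "."))
        yield fixed

# Consumers that need the fixes read the enhanced stream; others read the raw source
raw_external_feed = [{"country": "de", "amount": "99,90"}]
print(list(quality_enhancing_agent(raw_external_feed)))
```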

TQDM needs to address the problem holistically and universal data supply is a perfect fit for this. ‘Data as a product’ needs to be a quality product — Image by author

We should combine centralized oversight with decentralized execution:

  • Centralized governance and tools: Define organizational quality standards and provide shared tools (e.g., data validation frameworks as part of the data fabric, i.e., the self-service data platform) to be used directly by the domain teams.
  • Decentralized implementation: Allow domain teams to customize quality processes based on their specific data sources and use cases (see the sketch after this list).
  • Selective layering: Customize medallion’s layered approach as needed and implement it only where it’s truly beneficial, avoiding over-engineering for simple and clean datasets.
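
A minimal sketch of how this combination could look in practice: the platform team provides small, reusable validation building blocks, and each domain team composes only the rules its own sources actually need. All names and rules here are hypothetical.

```python
from typing import Callable

# Centrally provided, reusable validation rules (shared self-service tooling)
Rule = Callable[[dict], bool]

def not_null(column: str) -> Rule:
    return lambda record: record.get(column) is not None

def in_range(column: str, lo: float, hi: float) -> Rule:
    return lambda record: record.get(column) is not None and lo <= record[column] <= hi

def validate(record: dict, rules: list[Rule]) -> bool:
    return all(rule(record) for rule in rules)

# Decentralized implementation: each domain team picks and tunes its own rules
sales_rules = [not_null("order_id"), in_range("amount", 0.01, 1_000_000)]
sensor_rules = [not_null("sensor_id")]  # a clean, reliable source needs fewer checks

print(validate({"order_id": 7, "amount": 120.0}, sales_rules))  # True
print(validate({"sensor_id": "A-17"}, sensor_rules))            # True
```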

Instead of setting up centralized downstream layers that all data has to pass through, we should primarily invest in improving the data generation processes in order to prevent errors as far as possible. We need high-quality ‘data as products’ across the entire value chain.

TQDM holistically addresses this problem and aligns well with the domain-specific ownership of data in universal data supply. It can adapt quickly to changing business needs without impacting unrelated processes. It emphasizes prevention over correction. Unavoidable corrections can be implemented in a selective and cost-effective manner early in the value chain.

TQDM combined with universal data supply outperforms the centralized medallion architecture in ensuring data quality at the enterprise level.


If you want to learn more about TQM and TQDM, you can read the excellent book by information quality expert Larry English.

Universal data supply is an approach based on an adapted data mesh that effectively addresses the challenges of the original data mesh as defined by Zhamak Dehghani:

Towards Universal Data Supply

Challenges and Solutions in Data Mesh

