Medallion Lakehouse Architecture in Microsoft Fabric
When managing data in the Azure cloud, extraction, transformation, and governance form the core of any organization's data management strategy. Microsoft Fabric, combined with the Medallion Architecture design pattern, lets organizations leverage the power of Azure to meet their goals by delivering reliable, governed data for business use and reporting.
In this blog, we'll explore the Medallion Lakehouse Architecture and how it works within Microsoft Fabric. I've been curious about the different data processing options available in Fabric, and I want to share what I've discovered: what works, what doesn't, and what's possible.
A Little History for Context
The journey of data management systems started in the 1960s with IBM's IMS and GE's IDS, which paved the way for structured data and early business analytics. The 1970s saw the rise of relational databases and the introduction of SQL, and the 1980s brought us data warehouses. By the 2000s, the "big data" boom was in full swing, with Hadoop leading the charge for distributed data processing. The 2010s became the age of AI, cloud, and distributed processing. The 2020s, our current era, are characterized by further advances driven by cloud technologies and Data as a Service (DaaS): automating data analytics processes and MLOps (Machine Learning Operations) is becoming more prevalent, and the market is actively adopting architectures like the Data Lakehouse and Data Mesh.

The Data Lakehouse architecture gained popularity around 2020, when Databricks introduced the concept along with the Medallion architecture. The core idea behind the Data Lakehouse is to eliminate the traditional relational Data Warehouse (DWH) and use a single Data Lake repository. All data types (structured, semi-structured, and unstructured) are loaded into the Data Lake, and all queries and reports are executed directly from it. This approach provides a unified storage solution that simplifies data management and delivers the flexibility and scalability modern business requirements demand. The Data Lakehouse ensures high data quality and efficient data processing and analysis by leveraging the Medallion architecture, which organizes data into different quality zones (Bronze, Silver, Gold).
Microsoft Fabric, brought to market in 2023, streamlines data ingestion by connecting easily to Azure data services, other cloud platforms, and on-premises data sources. With over 200 native connectors, users can quickly build insights through a user-friendly drag-and-drop interface for data transformation. The Shortcut feature allows connection to existing data without copying or moving it. Fabric standardizes on the Delta Lake format, enabling all Fabric engines to access and manipulate the same dataset stored in OneLake without data duplication.
This system supports building lakehouses using either a medallion architecture or a data mesh, with low-code, no-code, or code-first data transformation options. For data consumption, Power BI can directly pull data from the Lakehouse, and each Lakehouse includes a built-in SQL analytics endpoint for easy connectivity and querying from other reporting tools.
Medallion Terms Explained
Bronze (Raw): This is where your source systems' data lands unchanged; it is uncleansed and may contain duplicates.
Silver (Cleansed): Here, data is stored after being de-duplicated and cleansed, with missing values replaced, while maintaining the same granularity as the source bronze/raw data.
Gold (Curated): This zone stores data modeled for analysis, such as in a Star Schema or Aggregate tables. It contains the highest-quality data ready for reporting.
It is essential to know that the number of these zones and their names may vary depending on organizational needs. For example, the Bronze (Raw) zone could consist of multiple copies of the data going through different "raw" stages. You could also add a Platinum zone that holds further refined data tailored to specific business reports.
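To make the zones concrete, here is a minimal PySpark sketch of promoting data from Bronze to Silver to Gold inside a Fabric notebook (where a `spark` session is preconfigured). The table and column names are hypothetical, and a production pipeline would add incremental loads, schema enforcement, and auditing:

```python
# Minimal sketch of medallion promotion in a Fabric notebook (PySpark).
# Table and column names are hypothetical placeholders.
from pyspark.sql import functions as F

# Bronze: raw orders, stored unchanged and possibly containing duplicates.
bronze = spark.read.table("bronze_orders")

# Silver: de-duplicate and cleanse while keeping the source granularity.
silver = (
    bronze.dropDuplicates(["order_id"])
          .withColumn("amount", F.coalesce(F.col("amount"), F.lit(0.0)))
)
silver.write.mode("overwrite").format("delta").saveAsTable("silver_orders")

# Gold: model the data for analysis, e.g. an aggregate table for reporting.
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("total_spend"))
gold.write.mode("overwrite").format("delta").saveAsTable("gold_customer_spend")
```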
If you have explored basic implementations of the Medallion Architecture in Synapse Analytics, you probably wonder how this is executed in Microsoft Fabric. Synapse is Platform as a Service (PaaS), whereas Fabric is Software as a Service (SaaS), which might raise concerns about losing control over how the various zones are implemented.
Microsoft recently published an article on Implementing Medallion Lakehouse Architecture in Microsoft Fabric. To implement medallion architecture in Fabric, you can use lakehouses (one for each zone).
We have two patterns to use:
Pattern 1: Create each zone as a Lakehouse. In this case, business users access data using the SQL analytics endpoint.
Pattern 2: Create the bronze and silver zones as Lakehouses and the gold zone as a data warehouse. In this case, business users access data using the data warehouse endpoint.
The table below highlights the key differences and similarities between Warehouse and Lakehouse architectures, making it easier to understand the strengths and use cases for each.

| Aspect | Warehouse | Lakehouse |
| --- | --- | --- |
| Data volume | Unlimited | Unlimited |
| Data types | Structured | Unstructured, semi-structured, and structured |
| Primary persona | SQL engineers | Spark data engineers and data scientists |
| Development approach | SQL scripts | Data organized in folders and files |
| Security | Robust security features | Some security features |
| Multi-table transactions | Supported | Not supported |
| Query across items | Supported | Supported |
| Shortcut source and access point | Yes | Yes |

From a helicopter view, both offer unlimited data volume, but the Lakehouse offers more flexibility in handling diverse data types and integration capabilities.
An example of how these pieces fit together:
- Bronze layer tables are populated by an MS Fabric Data pipeline copying data from our operational SQL DB.
- Other database tables are defined as shortcuts (Fabric's equivalent of external tables) pointing to our operational SQL DB. The data is not copied but served from the SQL DB itself.
- Notebooks and Dataflow Gen2 are highlighted here as the primary data transformation tools, which is logical: both are designed specifically for manipulating and refining data.
- BI systems and analytics endpoints work with the Gold data.
Shortcuts are a handy way to query Lakehouse data from other Lakehouses in either the same or a different Workspace, and they are easy to set up in Fabric. You can query a Lakehouse (or Warehouse) in the same or a different Workspace, or query data that sits outside Fabric, e.g., Azure Data Lake Storage Gen2 or AWS S3.
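As a quick sketch of how transparent this is, assume a notebook's default lakehouse contains a table shortcut named `sales_external` (a hypothetical name) pointing at a table in another lakehouse, another workspace, or an external store; the shortcut then reads like any local table:

```python
# Sketch: querying a shortcut from a Fabric notebook (PySpark).
# Assumes the attached lakehouse contains a table shortcut named
# "sales_external" (hypothetical) pointing at data in another
# lakehouse, another workspace, or an external store such as ADLS Gen2.

df = spark.read.table("sales_external")  # reads like an ordinary table

# Filter and inspect without copying the underlying data.
df.where(df["order_date"] >= "2024-01-01").show(10)
```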
Data Pipelines: The Copy Data activity is great for moving or copying data between different data stores. However, it's not particularly nuanced in terms of transformation; there's not a great deal you can do with the data you're moving. It allows loading data to and from a Lakehouse, KQL Database, or Warehouse, and you can also start from the pipeline templates Fabric provides.
Dataflow Gen2 allows loading and transforming data using both UI options and the M language. It can connect to many different sources, including Lakehouse, Warehouse, Azure Data Lake, Azure SQL Database, SharePoint, and Web APIs, and load into destinations like Lakehouse, Warehouse, Azure SQL Database, and Azure Data Explorer, either overwriting or appending data in the destination.
In notebooks, we can write code to load, transform, and write data to various sources/destinations. There are custom libraries available to connect to a wide variety of services.
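As an illustration, here is a hedged sketch of a notebook cell that ingests raw CSV files from an external ADLS Gen2 path into a bronze lakehouse table. The storage account, container, and table names are placeholders, and it assumes the running identity has access to the storage account:

```python
# Sketch of a Fabric notebook cell (PySpark). The storage account,
# container, and table names below are hypothetical placeholders.

raw_path = "abfss://landing@contosostorage.dfs.core.windows.net/orders/"

# Read raw CSV files from an external ADLS Gen2 location.
orders_raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

# Land the result as a Delta table in the attached lakehouse (bronze zone).
orders_raw.write.mode("overwrite").format("delta").saveAsTable("bronze_orders")
```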
SQL itself provides robust capabilities for querying and manipulating data within Fabric. You can query a Lakehouse SQL analytics endpoint in read-only mode, but only within the same Workspace; cross-workspace querying is not supported. SQL does allow querying shortcuts in a Lakehouse SQL endpoint that point to another Lakehouse, which enables indirect querying of data across different Lakehouses. From a Warehouse, you can query a Lakehouse (or another Warehouse) in the same Workspace by referencing it with the three-part naming convention, and you can load data into a Warehouse using SQL.
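To make the three-part naming convention concrete, here is a sketch that connects to a Warehouse's SQL endpoint from Python and joins a Warehouse dimension to a Lakehouse table in the same Workspace. The server, database, and table names are all hypothetical; the actual connection string is available from the item's settings in Fabric:

```python
import pyodbc

# Sketch: querying a Fabric Warehouse from Python over its SQL endpoint.
# Server, database, and table names are hypothetical placeholders.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.datawarehouse.fabric.microsoft.com;"
    "Database=SalesWarehouse;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

# Three-part naming: join a Warehouse table to a Lakehouse table
# that lives in the same Workspace.
sql = """
SELECT TOP 10 c.CustomerName, SUM(s.Amount) AS TotalSales
FROM SalesWarehouse.dbo.DimCustomer AS c
JOIN SilverLakehouse.dbo.Sales AS s ON s.CustomerKey = c.CustomerKey
GROUP BY c.CustomerName
ORDER BY TotalSales DESC;
"""
for row in conn.cursor().execute(sql):
    print(row.CustomerName, row.TotalSales)
```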
In summary, while the Medallion Architecture’s zone names and structure may vary, the fundamental goal remains to ensure high-quality data as it moves from source systems to business analysis and reporting.
The transition from Databricks or Synapse Analytics to Microsoft Fabric might change the tools and methods, but the core principles of data quality and structured processing remain intact. I imagine workspaces as individual servers, each with its own endpoint for connecting to a SQL Lakehouse or Warehouse. Therefore, separating the bronze, silver, and gold zones across these "individual" workspaces makes sense.
Below are several reasons to believe the Lakehouse architecture can address the shortcomings of traditional data warehouses and lakes:
- Data Quality and Reliability: These are the primary issues enterprise users face. Implementing effective data processing pipelines is challenging, and the dominant data architectures that separate lakes and warehouses exacerbate this problem.
- Timeliness of Data: Business applications increasingly require up-to-date data. However, two-tier architectures increase data staleness due to the presence of a staging area for incoming data before loading it into the warehouse through periodic ETL/ELT tasks.
- Unstructured Data: Much of today's data is unstructured or only loosely structured, which traditional data warehouses and lakes handle poorly.
- Machine Learning and Data Science: Traditional data warehouses and lakes do not adequately support machine learning and data science scenarios.
Several modern industry trends also highlight customer dissatisfaction with the two-tier model:
- Support for external tables: all major big data warehouses have added support for external tables in formats like Parquet.
- Investments in SQL engines: there are significant investments in developing SQL engines that operate directly on data lakes.
Summary
Lakehouses provide organizations with a central repository for housing large amounts of raw data. Use Microsoft Fabric to deliver efficient data management and help teams make well-informed decisions.