The ETL (Extract, Transform, Load) process is a fundamental aspect of the data pipeline, helping organizations to gather, refine, and analyze data from numerous sources. In this post, we're taking a deep dive into a popular approach to this process, known as the Medallion Architecture, and how it leverages the Bronze, Silver, and Gold layering system.
Layers of the Medallion Architecture
When it comes to organizing the data pipeline, the Medallion Architecture usually breaks the process down into three broad layers: Bronze (raw), Silver (cleaned/processed), and Gold (ready-for-consumption).
Bronze Layer: Raw Data
The Bronze layer is the initial layer of the data pipeline, where the data is stored in its original form with minimal transformation. This raw data is extracted from various sources and staged for further processing. It frequently contains typos, missing values, inconsistent date formatting, and other similar issues.
Use Case: A retail company collects point-of-sale (POS) data, online sales data, and customer feedback. These diverse datasets are extracted and stored in the Bronze layer for initial processing.
POS Data
transaction_id | product_id | quantity | transaction_date |
---|---|---|---|
1 | A001 | 2 | 2023-05-01 |
2 | BO02 | 1 | 2023-05-01 |
3 | A001 | -1 | 2023-05-02 |
4 | C003 | 3 | |
5 | D004 | 1 | 2023-05-02 |
In the table above, the `product_id` in the second row ("BO02") is inconsistent with the others, which follow a one-letter-plus-three-digits pattern. The `quantity` in the third row is negative, which should not be possible, and the `transaction_date` is missing in the fourth row.
Online Sales Data
online_order_id | product_id | quantity | order_date |
---|---|---|---|
1 | A001 | 1 | 2023-05-01 |
2 | A001 | 2 | 2023-05-02 |
3 | B002 | 1 | 2023-05-01 |
4 | | 1 | 2023-05-02 |
5 | C003 | 0 | 2023-05-01 |
Here, the `product_id` is missing in the fourth row, and the `quantity` is zero in the fifth row, which should also not be possible.
Customer Feedback
feedback_id | customer_id | feedback | feedback_date |
---|---|---|---|
1 | C001 | Great service! | 2023-05-01 |
2 | C002 | Quick delivery. | 2023-05-02 |
3 | | Love the product quality! | 2023-05-01 |
4 | C004 | Excellent customer service. | |
5 | C005 | Fast shipping! | 2023-05-02 |
The `customer_id` is missing in the third row, and the `feedback_date` is missing in the fourth row.
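As a small illustration, here is a minimal sketch of what landing one of these files in the Bronze layer might look like in C#. The file names and the metadata columns are assumptions made for the example; the point is that the rows are stored exactly as they arrive, with nothing corrected yet.

```csharp
using System;
using System.IO;
using System.Linq;

// Land a raw CSV file in the Bronze layer without fixing anything:
// every field stays a string, and we only append ingestion metadata
// (source name and load timestamp) so issues can be traced later.
class BronzeLoader
{
    static void Main()
    {
        const string source = "pos_sales.csv";            // hypothetical input file
        const string bronzePath = "bronze/pos_sales.csv"; // hypothetical Bronze landing path

        Directory.CreateDirectory("bronze");

        var loadedAt = DateTime.UtcNow.ToString("o");
        var rawLines = File.ReadAllLines(source);

        // Keep the header line and add the two metadata columns to it.
        var output = new[] { rawLines[0] + ",_source,_loaded_at" }
            .Concat(rawLines.Skip(1).Select(line => $"{line},{source},{loadedAt}"));

        File.WriteAllLines(bronzePath, output);
    }
}
```

Keeping every field as text means bad values such as "BO02", a negative quantity, or a missing date survive intact, and the Silver layer can decide how to handle them.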
Silver Layer: Cleaned and Processed Data
The Silver layer holds data that has been cleansed, validated, and processed. This is where the raw data undergoes data cleaning (removal of duplicates, handling of missing values, etc.), transformation into a standardized format, and loading into a data warehouse.
Use Case: The retail company's POS data, online sales data, and customer feedback, initially in different formats and with various errors, are cleaned, formatted, and integrated into the Silver layer.
Sales Data (Integrated POS and Online Sales Data)
sales_id | product_id | quantity | sales_date | sales_type |
---|---|---|---|---|
1 | A001 | 2 | 2023-05-01 | POS |
2 | B002 | 1 | 2023-05-01 | POS |
3 | A001 | 1 | 2023-05-02 | POS |
4 | A001 | 1 | 2023-05-01 | Online |
5 | A001 | 2 | 2023-05-02 | Online |
The negative `quantity` and the missing `transaction_date` from the raw POS data have been corrected. The inconsistent `product_id` from the raw POS data and the missing `product_id` from the raw online sales data have been identified and filled in. The zero-`quantity` record from the raw online sales data has been eliminated.
Customer Feedback (Cleaned and Conformed)
feedback_id | customer_id | feedback | feedback_date |
---|---|---|---|
1 | C001 | Great service! | 2023-05-01 |
2 | C002 | Quick delivery. | 2023-05-02 |
3 | C003 | Love the product quality! | 2023-05-01 |
4 | C004 | Excellent customer service. | 2023-05-02 |
The missing `customer_id` and `feedback_date` from the raw data have been filled in here.
Let's discuss how this is usually done for real-world use cases.
Data Cleansing
The aim of data cleansing is to detect and fix (or remove) corrupt or inaccurate records from a dataset. It involves handling missing values, duplicates, and errors. This step is essential in ensuring the quality and reliability of data used in analyses and decision-making.
- Handling Missing Values: Missing values can be handled in a variety of ways, such as deleting the record, filling the missing value with a default value, using the mean/mode/median, or even more advanced techniques such as regression or a machine learning model. For example, for a missing `transaction_date` or `product_id`, we might use default values or statistical methods based on existing data.
- Detecting and Correcting Errors: This could involve detecting typos or syntax errors and correcting them based on predefined rules or lookup tables. In our use case, a `product_id` starting with "BO" instead of a letter followed by three digits might be a typo, which we corrected based on the other data.
- Outlier Treatment: Outliers, or values that are significantly different from most other values, can also be treated or removed in the data cleansing step. For example, a negative or zero sales quantity could be considered an outlier or error and would be corrected. (All three steps are sketched in the example after this list.)
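Below is a minimal sketch of these three steps in C#, assuming the raw POS rows have already been parsed into simple records. The `ProductIdFixes` lookup table, the default date, and the decision to drop non-positive quantities are illustrative choices, not fixed rules.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A raw row exactly as it arrived in the Bronze layer: everything is a string.
record RawSale(string Id, string ProductId, string Quantity, string Date);

// A cleaned row with proper types, ready for the Silver layer.
record CleanSale(int Id, string ProductId, int Quantity, DateOnly Date);

class Cleansing
{
    // Illustrative lookup table for known product_id typos.
    static readonly Dictionary<string, string> ProductIdFixes =
        new() { ["BO02"] = "B002" };

    static void Main()
    {
        var raw = new List<RawSale>
        {
            new("1", "A001", "2",  "2023-05-01"),
            new("2", "BO02", "1",  "2023-05-01"),
            new("3", "A001", "-1", "2023-05-02"),
            new("4", "C003", "3",  ""),
        };

        var cleaned = raw
            .Select(r => new RawSale(
                r.Id,
                // Correct known typos via the lookup table.
                ProductIdFixes.GetValueOrDefault(r.ProductId, r.ProductId),
                r.Quantity,
                // Fill a missing date with a default value (here a fixed business date).
                string.IsNullOrWhiteSpace(r.Date) ? "2023-05-02" : r.Date))
            // Treat non-positive quantities as outliers and drop them
            // (an alternative policy would be to route them to manual review).
            .Where(r => int.TryParse(r.Quantity, out var q) && q > 0)
            .Select(r => new CleanSale(
                int.Parse(r.Id), r.ProductId, int.Parse(r.Quantity), DateOnly.Parse(r.Date)))
            .ToList();

        cleaned.ForEach(Console.WriteLine);
    }
}
```

Keeping the fields as strings until the final projection lets the corrections run before any parsing can fail on a bad value.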
In the real world, data cleansing can be a very complex and time-consuming process, depending on the quality and consistency of the raw data. Specialized data-cleaning tools and services can automate parts of it. Sometimes, however, no matter what you do, fixing the data requires manual intervention. In such scenarios, you should devise a strategy that extracts those rows into separate files and routes them through a manual review process before they are injected back into the ETL pipeline.
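One way to set such rows aside is sketched below; the validation rule and the file paths are hypothetical stand-ins for whatever checks and storage a real pipeline would use.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Split Bronze rows into those the pipeline can handle automatically and
// those that need a human: the latter are written to a rejects file for
// manual review and re-injected once they have been corrected.
class ManualReviewSplit
{
    // Illustrative rule: a product_id that does not match the expected
    // one-letter-plus-three-digits pattern cannot be fixed automatically.
    static bool NeedsReview(string[] fields) =>
        !Regex.IsMatch(fields[1], "^[A-Z][0-9]{3}$");

    static void Main()
    {
        Directory.CreateDirectory("review");

        var rows = File.ReadAllLines("bronze/pos_sales.csv") // hypothetical Bronze path
                       .Skip(1)                              // skip the header row
                       .Select(line => line.Split(','))
                       .ToList();

        var rejects = rows.Where(NeedsReview).Select(r => string.Join(',', r));
        File.WriteAllLines("review/pos_sales_rejects.csv", rejects);

        Console.WriteLine($"{rows.Count(r => !NeedsReview(r))} rows continue through the pipeline.");
    }
}
```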
Data Conforming
Data conforming involves making the data consistent. It includes standardizing data formats, units, and conventions and aligning data from different sources to be comparable or combinable.
In our use case, we have data from two different sales channels: POS and online. These datasets might use different formats or conventions, and we needed to conform them to a common standard to combine and analyze them. This involved standardizing date formats and product IDs and creating a new `sales_type` field to distinguish between POS and online sales.
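As an illustration, the sketch below maps the two cleaned sources onto one common `Sale` shape with a `sales_type` discriminator; the record names and the sample values are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The two cleaned sources still have their own shapes and key names.
record PosSale(int TransactionId, string ProductId, int Quantity, DateOnly TransactionDate);
record OnlineSale(int OnlineOrderId, string ProductId, int Quantity, DateOnly OrderDate);

// The conformed shape used by the Silver layer, with a sales_type discriminator.
record Sale(int SalesId, string ProductId, int Quantity, DateOnly SalesDate, string SalesType);

class Conforming
{
    static void Main()
    {
        var pos    = new List<PosSale>    { new(1, "A001", 2, new DateOnly(2023, 5, 1)) };
        var online = new List<OnlineSale> { new(1, "A001", 1, new DateOnly(2023, 5, 1)) };

        // Map each source to the common shape, then renumber the keys.
        var conformed = pos
            .Select(p => (p.ProductId, p.Quantity, Date: p.TransactionDate, Type: "POS"))
            .Concat(online
                .Select(o => (o.ProductId, o.Quantity, Date: o.OrderDate, Type: "Online")))
            .Select((s, i) => new Sale(i + 1, s.ProductId, s.Quantity, s.Date, s.Type))
            .ToList();

        conformed.ForEach(Console.WriteLine);
    }
}
```

Once both sources share one shape, the integrated sales table can be written in a single pass, however many channels feed it.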
In the real world, data conformation is crucial when combining data from multiple sources. For example, a multinational company might need to conform sales data from different countries, convert currencies, align different tax systems, or translate product names to a common language.
It's important to note that data cleansing and conforming involve making decisions about handling discrepancies and inconsistencies. These decisions should be informed by a good understanding of the data, the context in which it was collected, and the subsequent analysis or use case requirements.
Gold Layer: Ready-for-consumption Data
The Gold layer consists of data that has been transformed and modeled and is ready for consumption by end users. Here, the data is further curated and optimized for querying and analysis.
Use Case: The cleaned and integrated sales data from the Silver layer are further processed into online analytical processing (OLAP) cubes or dashboards, providing end-users with readily accessible and understandable data.
Product Sales and Feedback
product_id | sales_type | total_quantity_sold | positive_feedback_count |
---|---|---|---|
A001 | POS | 3 | 2 |
A001 | Online | 3 | 2 |
B002 | POS | 1 | 0 |
This view can be used to gain insights such as which products are selling well in each sales type and which products are receiving positive customer feedback.
Daily Sales by Type
sales_date | sales_type | total_quantity_sold |
---|---|---|
2023-05-01 | POS | 3 |
2023-05-02 | POS | 1 |
2023-05-01 | Online | 1 |
2023-05-02 | Online | 2 |
This view can be used to monitor the daily sales performance by sales type (POS or online).
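As a sketch, the "Daily Sales by Type" view can be derived from the Silver-layer sales records with a simple group-and-sum; the "Product Sales and Feedback" view is built the same way, grouping by `product_id` and `sales_type` instead. The record definition and values below mirror the example tables above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

record Sale(int SalesId, string ProductId, int Quantity, DateOnly SalesDate, string SalesType);

class GoldViews
{
    static void Main()
    {
        // The Silver-layer sales data from the example above.
        var sales = new List<Sale>
        {
            new(1, "A001", 2, new DateOnly(2023, 5, 1), "POS"),
            new(2, "B002", 1, new DateOnly(2023, 5, 1), "POS"),
            new(3, "A001", 1, new DateOnly(2023, 5, 2), "POS"),
            new(4, "A001", 1, new DateOnly(2023, 5, 1), "Online"),
            new(5, "A001", 2, new DateOnly(2023, 5, 2), "Online"),
        };

        // "Daily Sales by Type": total quantity sold per day and sales channel.
        var dailyByType = sales
            .GroupBy(s => (s.SalesDate, s.SalesType))
            .Select(g => new
            {
                g.Key.SalesDate,
                g.Key.SalesType,
                TotalQuantitySold = g.Sum(s => s.Quantity)
            });

        foreach (var row in dailyByType)
            Console.WriteLine($"{row.SalesDate:yyyy-MM-dd} | {row.SalesType} | {row.TotalQuantitySold}");
    }
}
```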
Conclusion
The Medallion Architecture is a well-tested guide but not a rigid structure. Its flexible design allows you to add your own layers as needed to adapt to specific project requirements.
Consider an organization with vast, diverse datasets that demand advanced transformations and governance. Here, you might introduce 'Platinum' and 'Diamond' layers for 'super-refined' and 'analyzed' data, respectively, extending the standard model into a more complex Bronze-Silver-Gold-Platinum-Diamond model.
In this scenario, as data progresses through each layer, it undergoes transformation, cleansing, and enrichment, acquiring additional value and readiness. This demonstrates how the architecture can be adapted to meet evolving data handling and management needs.
In the next chapter, we'll start working on the Bronze layer of this flow using the ETL.NET library.