The ETL (Extract, Transform, Load) process is a fundamental aspect of the data pipeline, helping organizations to gather, refine, and analyze data from numerous sources. In this post, we're taking a deep dive into a popular approach to this process, known as the Medallion Architecture, and how it leverages the Bronze, Silver, and Gold layering system.
Layers of the Medallion Architecture
When it comes to organizing the data pipeline, the Medallion Architecture usually breaks the process down into three broad layers: Bronze (raw), Silver (cleaned/processed), and Gold (ready-for-consumption).
Bronze Layer: Raw Data
The Bronze layer is the initial layer of the data pipeline, where the data is stored in its original form with minimal transformation. This raw data is extracted from various sources and staged for further processing. It frequently contains typos, missing values, inconsistent date formatting, and other similar issues.
Use Case: A retail company collects point-of-sale (POS) data, online sales data, and customer feedback. These diverse datasets are extracted and stored in the Bronze layer for initial processing.
POS Data
transaction_id | product_id | quantity | transaction_date |
---|---|---|---|
1 | A001 | 2 | 2023-05-01 |
2 | BO02 | 1 | 2023-05-01 |
3 | A001 | -1 | 2023-05-02 |
4 | C003 | 3 | |
5 | D004 | 1 | 2023-05-02 |
In the table above, the `product_id` in the second row ("BO02") is inconsistent with the others, which follow a one-letter-plus-three-digits pattern. The `quantity` in the third row is negative, which should not be possible, and the `transaction_date` is missing in the fourth row.
Online Sales Data
online_order_id | product_id | quantity | order_date |
---|---|---|---|
1 | A001 | 1 | 2023-05-01 |
2 | A001 | 2 | 2023-05-02 |
3 | B002 | 1 | 2023-05-01 |
4 | | 1 | 2023-05-02 |
5 | C003 | 0 | 2023-05-01 |
Here, the `product_id` is missing in the fourth row, and the `quantity` is zero in the fifth row, which should also not be possible.
Customer Feedback
feedback_id | customer_id | feedback | feedback_date |
---|---|---|---|
1 | C001 | Great service! | 2023-05-01 |
2 | C002 | Quick delivery. | 2023-05-02 |
3 | | Love the product quality! | 2023-05-01 |
4 | C004 | Excellent customer service. | |
5 | C005 | Fast shipping! | 2023-05-02 |
The `customer_id` is missing in the third row, and the `feedback_date` is missing in the fourth row.
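As a small illustration, here is a minimal sketch of what landing one of these files in the Bronze layer might look like in C#. The file names and the metadata columns are assumptions made for the example; the point is that the rows are stored exactly as they arrive, with nothing corrected yet.

```csharp
using System;
using System.IO;
using System.Linq;

// Land a raw CSV file in the Bronze layer without fixing anything:
// every field stays a string, and we only append ingestion metadata
// (source name and load timestamp) so issues can be traced later.
class BronzeLoader
{
    static void Main()
    {
        const string source = "pos_sales.csv";            // hypothetical input file
        const string bronzePath = "bronze/pos_sales.csv"; // hypothetical Bronze landing path

        Directory.CreateDirectory("bronze");

        var loadedAt = DateTime.UtcNow.ToString("o");
        var rawLines = File.ReadAllLines(source);

        // Keep the header line and add the two metadata columns to it.
        var output = new[] { rawLines[0] + ",_source,_loaded_at" }
            .Concat(rawLines.Skip(1).Select(line => $"{line},{source},{loadedAt}"));

        File.WriteAllLines(bronzePath, output);
    }
}
```

Keeping every field as text means bad values such as "BO02", a negative quantity, or a missing date survive intact, and the Silver layer can decide how to handle them.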
Silver Layer: Cleaned and Processed Data
The Silver layer holds data that has been cleansed, validated, and processed. This is where the raw data undergoes data cleaning (removal of duplicates, handling of missing values, etc.), transformation into a standardized format, and loading into a data warehouse.
Use Case: The retail company's POS data, online sales data, and customer feedback, initially in different formats and with various errors, are cleaned, formatted, and integrated into the Silver layer.
Sales Data (Integrated POS and Online Sales Data)
sales_id | product_id | quantity | sales_date | sales_type |
---|---|---|---|---|
1 | A001 | 2 | 2023-05-01 | POS |
2 | B002 | 1 | 2023-05-01 | POS |
3 | A001 | 1 | 2023-05-02 | POS |
4 | A001 | 1 | 2023-05-01 | Online |
5 | A001 | 2 | 2023-05-02 | Online |
The negative `quantity` and the missing `transaction_date` from the raw POS data have been corrected. The inconsistent `product_id` from the raw POS data and the missing `product_id` from the raw online sales data have been identified and filled in. The zero-`quantity` record from the raw online sales data has been eliminated.
Customer Feedback (Cleaned and Conformed)
feedback_id | customer_id | feedback | feedback_date |
---|---|---|---|
1 | C001 | Great service! | 2023-05-01 |
2 | C002 | Quick delivery. | 2023-05-02 |
3 | C003 | Love the product quality! | 2023-05-01 |
4 | C004 | Excellent customer service. | 2023-05-02 |
The missing `customer_id` and `feedback_date` from the raw data have been filled in here.
Let's discuss how this is usually done for real-world use cases.
Data Cleansing
The aim of data cleansing is to detect and fix (or remove) corrupt or inaccurate records from a dataset. It involves handling missing values, duplicates, and errors. This step is essential in ensuring the quality and reliability of data used in analyses and decision-making.
- Handling Missing Values: Missing values can be handled in a variety of ways, such as deleting the record, filling the missing value with a default value, using the mean/mode/median, or even more advanced techniques such as regression or a machine learning model. For example, for a missing `transaction_date` or `product_id`, we might use default values or statistical methods based on existing data.
- Detecting and Correcting Errors: This could involve detecting typos or syntax errors and correcting them based on predefined rules or lookup tables. In our use case, a `product_id` starting with "BO" instead of a letter followed by three digits might be a typo, which we corrected based on the other data.
- Outlier Treatment: Outliers, or values that are significantly different from most other values, can also be treated or removed in the data cleansing step. For example, a negative or zero sales quantity could be considered an outlier or error and would be corrected. (All three steps are sketched in the example after this list.)
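Below is a minimal sketch of these three steps in C#, assuming the raw POS rows have already been parsed into simple records. The `ProductIdFixes` lookup table, the default date, and the decision to drop non-positive quantities are illustrative choices, not fixed rules.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A raw row exactly as it arrived in the Bronze layer: everything is a string.
record RawSale(string Id, string ProductId, string Quantity, string Date);

// A cleaned row with proper types, ready for the Silver layer.
record CleanSale(int Id, string ProductId, int Quantity, DateOnly Date);

class Cleansing
{
    // Illustrative lookup table for known product_id typos.
    static readonly Dictionary<string, string> ProductIdFixes =
        new() { ["BO02"] = "B002" };

    static void Main()
    {
        var raw = new List<RawSale>
        {
            new("1", "A001", "2",  "2023-05-01"),
            new("2", "BO02", "1",  "2023-05-01"),
            new("3", "A001", "-1", "2023-05-02"),
            new("4", "C003", "3",  ""),
        };

        var cleaned = raw
            .Select(r => new RawSale(
                r.Id,
                // Correct known typos via the lookup table.
                ProductIdFixes.GetValueOrDefault(r.ProductId, r.ProductId),
                r.Quantity,
                // Fill a missing date with a default value (here a fixed business date).
                string.IsNullOrWhiteSpace(r.Date) ? "2023-05-02" : r.Date))
            // Treat non-positive quantities as outliers and drop them
            // (an alternative policy would be to route them to manual review).
            .Where(r => int.TryParse(r.Quantity, out var q) && q > 0)
            .Select(r => new CleanSale(
                int.Parse(r.Id), r.ProductId, int.Parse(r.Quantity), DateOnly.Parse(r.Date)))
            .ToList();

        cleaned.ForEach(Console.WriteLine);
    }
}
```

Keeping the fields as strings until the final projection lets the corrections run before any parsing can fail on a bad value.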
In the real world, data cleansing can be a very complex and time-consuming process, depending on the quality and consistency of the raw data. Specialized data-cleaning tools and services can automate parts of it. Sometimes, however, no matter what you do, fixing the data requires manual intervention. In such scenarios, you should devise a strategy that extracts those rows into separate files and routes them through a manual review process before they are injected back into the ETL pipeline.
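One way to set such rows aside is sketched below; the validation rule and the file paths are hypothetical stand-ins for whatever checks and storage a real pipeline would use.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Split Bronze rows into those the pipeline can handle automatically and
// those that need a human: the latter are written to a rejects file for
// manual review and re-injected once they have been corrected.
class ManualReviewSplit
{
    // Illustrative rule: a product_id that does not match the expected
    // one-letter-plus-three-digits pattern cannot be fixed automatically.
    static bool NeedsReview(string[] fields) =>
        !Regex.IsMatch(fields[1], "^[A-Z][0-9]{3}$");

    static void Main()
    {
        Directory.CreateDirectory("review");

        var rows = File.ReadAllLines("bronze/pos_sales.csv") // hypothetical Bronze path
                       .Skip(1)                              // skip the header row
                       .Select(line => line.Split(','))
                       .ToList();

        var rejects = rows.Where(NeedsReview).Select(r => string.Join(',', r));
        File.WriteAllLines("review/pos_sales_rejects.csv", rejects);

        Console.WriteLine($"{rows.Count(r => !NeedsReview(r))} rows continue through the pipeline.");
    }
}
```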
Data Conforming
Data conforming involves making the data consistent. It includes standardizing data formats, units, and conventions and aligning data from different sources to be comparable or combinable.
In our use case, we have data from two different sales channels: POS and online. These datasets might use different formats or conventions, and we needed to conform them to a common standard to combine and analyze them. This involved standardizing date formats and product IDs and creating a new `sales_type` field to distinguish between POS and online sales.
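As an illustration, the sketch below maps the two cleaned sources onto one common `Sale` shape with a `sales_type` discriminator; the record names and the sample values are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The two cleaned sources still have their own shapes and key names.
record PosSale(int TransactionId, string ProductId, int Quantity, DateOnly TransactionDate);
record OnlineSale(int OnlineOrderId, string ProductId, int Quantity, DateOnly OrderDate);

// The conformed shape used by the Silver layer, with a sales_type discriminator.
record Sale(int SalesId, string ProductId, int Quantity, DateOnly SalesDate, string SalesType);

class Conforming
{
    static void Main()
    {
        var pos    = new List<PosSale>    { new(1, "A001", 2, new DateOnly(2023, 5, 1)) };
        var online = new List<OnlineSale> { new(1, "A001", 1, new DateOnly(2023, 5, 1)) };

        // Map each source to the common shape, then renumber the keys.
        var conformed = pos
            .Select(p => (p.ProductId, p.Quantity, Date: p.TransactionDate, Type: "POS"))
            .Concat(online
                .Select(o => (o.ProductId, o.Quantity, Date: o.OrderDate, Type: "Online")))
            .Select((s, i) => new Sale(i + 1, s.ProductId, s.Quantity, s.Date, s.Type))
            .ToList();

        conformed.ForEach(Console.WriteLine);
    }
}
```

Once both sources share one shape, the integrated sales table can be written in a single pass, however many channels feed it.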
In the real world, data conformation is crucial when combining data from multiple sources. For example, a multinational company might need to conform sales data from different countries, convert currencies, align different tax systems, or translate product names to a common language.
It's important to note that data cleansing and conforming involve making decisions about handling discrepancies and inconsistencies. These decisions should be informed by a good understanding of the data, the context in which it was collected, and the subsequent analysis or use case requirements.
Gold Layer: Ready-for-consumption Data
The Gold layer consists of data that has been transformed and modeled and is ready for consumption by end users. Here, the data is further curated and optimized for querying and analysis.
Use Case: The cleaned and integrated sales data from the Silver layer are further processed into online analytical processing (OLAP) cubes or dashboards, providing end-users with readily accessible and understandable data.
Product Sales and Feedback
product_id | sales_type | total_quantity_sold | positive_feedback_count |
---|---|---|---|
A001 | POS | 3 | 2 |
A001 | Online | 3 | 2 |
B002 | POS | 1 | 0 |
This view can be used to gain insights such as which products are selling well in each sales type and which products are receiving positive customer feedback.
Daily Sales by Type
sales_date | sales_type | total_quantity_sold |
---|---|---|
2023-05-01 | POS | 3 |
2023-05-02 | POS | 1 |
2023-05-01 | Online | 1 |
2023-05-02 | Online | 2 |
This view can be used to monitor the daily sales performance by sales type (POS or online).
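As a sketch, the "Daily Sales by Type" view can be derived from the Silver-layer sales records with a simple group-and-sum; the "Product Sales and Feedback" view is built the same way, grouping by `product_id` and `sales_type` instead. The record definition and values below mirror the example tables above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

record Sale(int SalesId, string ProductId, int Quantity, DateOnly SalesDate, string SalesType);

class GoldViews
{
    static void Main()
    {
        // The Silver-layer sales data from the example above.
        var sales = new List<Sale>
        {
            new(1, "A001", 2, new DateOnly(2023, 5, 1), "POS"),
            new(2, "B002", 1, new DateOnly(2023, 5, 1), "POS"),
            new(3, "A001", 1, new DateOnly(2023, 5, 2), "POS"),
            new(4, "A001", 1, new DateOnly(2023, 5, 1), "Online"),
            new(5, "A001", 2, new DateOnly(2023, 5, 2), "Online"),
        };

        // "Daily Sales by Type": total quantity sold per day and sales channel.
        var dailyByType = sales
            .GroupBy(s => (s.SalesDate, s.SalesType))
            .Select(g => new
            {
                g.Key.SalesDate,
                g.Key.SalesType,
                TotalQuantitySold = g.Sum(s => s.Quantity)
            });

        foreach (var row in dailyByType)
            Console.WriteLine($"{row.SalesDate:yyyy-MM-dd} | {row.SalesType} | {row.TotalQuantitySold}");
    }
}
```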
Conclusion
The Medallion Architecture is a well-tested guide but not a rigid structure. Its flexible design allows you to add your own layers as needed to adapt to specific project requirements.
Consider an organization with vast, diverse datasets that demand advanced transformations and governance. Here, you might introduce 'Platinum' and 'Diamond' layers for 'super-refined' and 'analyzed' data, respectively, extending the standard model into a more complex Bronze-Silver-Gold-Platinum-Diamond model.
In this scenario, as data progresses through each layer, it undergoes transformation, cleansing, and enrichment, acquiring additional value and readiness. This demonstrates how the architecture can be adapted to meet evolving data handling and management needs.
In the next chapter, we'll start working on the Bronze layer of this flow using the ETL.NET library.