Fillna() The Kimball Way – Handle Incorrect/Missing Dimension Values

In my opinion, the Kimball methodology of data warehousing is king. It’s that simple.

Developed by Ralph Kimball, this methodology emphasises the importance of dimensional modelling in designing data warehouses. One key challenge in this methodology involves dealing with missing or incorrect dimension values within fact tables. The fillna() function in Python, commonly used in data manipulation libraries like Pandas, can be leveraged to handle such scenarios efficiently.

Today, we’ll delve into the Kimball methodology’s approach to managing dimension keys in fact tables and show how Python’s fillna() function can be employed to insert placeholder values when dimension keys either do not exist or do not match entries in the dimension table.

Understanding Kimball Methodology and Dimensional Modeling

Before we dive into using fillna() with Python, let’s briefly discuss the basics of the Kimball methodology and dimensional modelling.

Key Concepts in Kimball Methodology

Fact Tables: These tables contain numerical measures/metrics and foreign keys referencing dimension tables.
Dimension Tables: These tables store descriptive information and attributes about the business entities. They are linked to fact tables via foreign keys.

Challenges with Dimension Keys in Fact Tables

Missing Dimension Keys: Sometimes, the dimension keys in fact tables may not match any entries in the dimension table.
Incorrect Dimension Keys: Dimension keys in fact tables might reference non-existent or erroneous entries in the dimension table.

Handling Missing/Incorrect Dimension Values Using fillna() in Python

To illustrate how we can apply the Kimball methodology with Python’s fillna() function, let’s create simplified examples of fact and dimension tables using Pandas.

Creating Sample Fact and Dimension Tables

import pandas as pd

# Sample Fact Table (Sales)
fact_data = {
    'date_id': [1, 2, 3, 4],
    'product_id': [101, 102, 103, 104],
    'quantity_sold': [150, 200, 180, 220]
}
fact_df = pd.DataFrame(fact_data)

# Sample Dimension Table (Products)
dimension_data = {
    'product_id': [-1, -2, 101, 103, 104],
    'product_name': ['Unknown', 'Invalid', 'Product A', 'Product C', 'Product D']
}
dimension_df = pd.DataFrame(dimension_data)

In the example above, fact_df represents a simplified sales fact table with columns for date_id, product_id, and quantity_sold. Meanwhile, dimension_df serves as a basic products dimension table containing product_id and product_name.

The key to this is the -1 and -2 entries that have been included in dimension_df, representing Unknown and Invalid entries. When joining back at the end of the process, any rows that don’t join to this dimension originally will still have useful information for us, by virtue of these values being in place.

Let’s look at how to put -1 and -2 in the fact table where necessary.

Handling Missing/Incorrect Dimension Values with fillna()

Now, let’s handle cases where the product_id in the fact table doesn’t match any entries in the products dimension table by using fillna():

# Merge fact and dimension tables to fill missing values
merged_df = fact_df.merge(dimension_df, on='product_id', how='left')

# Fill missing product_id entries in fact table with placeholder value (-1)
merged_df['product_id'].fillna(-1, inplace=True)

In this code snippet, we’re merging the fact table (fact_df) with the dimension table (dimension_df) using a left join on the product_id column. The fillna() function is then applied to replace missing (NaN) product_id entries in the merged DataFrame with the placeholder value -1.

This is where the power lies. When this fact and dimension table are now joined together as part of the final analysis, these rows have a product_name of 'Unknown', rather than just being a confusing blank or null value. This allows analysts to fully understand the data much more easily.

Similarly, we can handle cases where the dimension key exists in the fact table but doesn’t match any entries in the dimension table:

# Find incorrect product_id entries in fact table
incorrect_entries = ~merged_df['product_name'].notnull()

# Fill incorrect product_id entries with another placeholder value (-2)
merged_df.loc[incorrect_entries, 'product_id'] = -2

In the above code, we’re identifying incorrect product_id entries in the merged DataFrame where the product_name is not available (NaN). These entries are then replaced with the placeholder value -2.

Again, we can see the power here. Instead of the product_name just showing as a blank value, we can distinguish the 'Invalid' value that is produced – this can then be investigated as necessary to potentially discover the missing dimension value, or incorrect key assignment upstream.

Conclusion

The Kimball methodology offers a structured approach to data warehousing, emphasising the significance of dimensional modelling. Dealing with missing or incorrect dimension values in fact tables poses a challenge, but Python’s fillna() function can be a powerful tool in implementing placeholder values (-1, -2, etc.) to handle such scenarios efficiently.

Jammos Analytics