How to Speed Up Data Analysis with Junk Dimensions

Far more useful than this junk

In the realm of data warehousing and analytics, the concept of junk dimensions might seem counterintuitive at first glance. Contrary to their name, junk dimensions play a pivotal role in optimising data performance and simplifying complex data structures.

They act as repositories for low-cardinality attributes, reducing the overall size of a data model and enhancing query efficiency. Today, we’ll delve into what junk dimensions are, why they’re beneficial, and how to create them using Python and specifically Pandas, although they are by no means language dependent.

Before we get into what a Junk Dimension is, let’s go through the concept of Cardinality, as it plays a vital role in Junk Dimensions and is also an extremely useful concept in data as a whole.

What is Cardinality?

In the context of data, cardinality refers to the uniqueness or distinctiveness of values within a dataset column. It measures the count of unique values present in a particular attribute or field.

High cardinality means that a column has a large number of unique values, while low cardinality indicates fewer unique values or a smaller variety of data.

For example, a field named IsActive might only have possible values of True or False, making this column one of very low cardinality. On the other hand, EmailAddress will likely have a unique value for every individual person being stored in a table, so this would have a very high cardinality.

Understanding cardinality is crucial in data analysis and database design as it impacts various aspects such as query performance, storage requirements, and the effectiveness of certain data operations. High cardinality columns might pose challenges in indexing, while low cardinality columns may potentially limit analytical insights due to fewer variations in data. Analysing and managing cardinality is essential for optimising database structures and extracting meaningful insights from datasets.

Luckily, that’s exactly where Junk Dimensions can come in.

What is a Junk Dimension?

Junk dimensions consolidate various flags, indicators, or categorical attributes that have a relatively low cardinality. These attributes might not merit individual dimensions due to their limited unique values, hence referred to as “junk.” Instead of cluttering the main data model with separate small dimensions, they’re combined into a single junk dimension table, simplifying the structure and optimising queries.

For example, a Store dimension may have many fields included that are of low Cardinality, such as StoreSize, StoreType, IsOpen, IsOnline, HasParking and CanDeliver, just to name a few possible ideas.

If there are 10,000 Stores in this table, that is 60,000 possible text values that can impact performance fairly heavily. These can all be replaced with a single key, linking the Store dimension to the Junk dimension and heavily reducing the amount of data.

Junk Dimensions can be built in one of two ways – either create every possible combination of the fields in the table, or only create those that are present in the table. We’ll focus on the former in our example.

Creating Junk Dimensions with Pandas

Let’s consider an example scenario using Python’s Pandas library to showcase the creation of a junk dimension. We will use the Store dimension mentioned above:

import pandas as pd

import itertools



# Define the values for each attribute

store_sizes = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

store_types = ['Standard', 'Shopping Mall', 'Airport']

boolean_values = [True, False]



# Generate all possible combinations of the attributes

combinations = list(itertools.product(store_sizes, store_types, boolean_values, boolean_values, boolean_values, boolean_values))



# Create DataFrame for the junk dimension

junk_dimension = pd.DataFrame(combinations, columns=['StoreSize', 'StoreType', 'IsOpen', 'IsOnline', 'HasParking', 'CanDeliver'])



# Add a unique identifier for each combination

junk_dimension['JunkID'] = range(1, len(junk_dimension) + 1)



print(junk_dimension)

Running this code and viewing the output should look similar to the following:

This in total produces 336 rows, which we can also calculate by multiplying the cardinalities of each column:

7 x 3 x 2 x 2 x 2 x 2 = 336

In our Store Dimension, we would join to this Junk Dimension on each value to determine which ID should be inserted into the Junk Dimension.

Advantages of Junk Dimensions

Improved Query Performance

By consolidating low-cardinality attributes, junk dimensions streamline the data model, leading to faster query execution. They reduce the number of joins required, optimising database performance.

Simplified Data Model

Junk dimensions declutter the primary data model by grouping related attributes, enhancing its readability, and making it more manageable.

Space Efficiency

Combining multiple small dimensions into a single junk dimension table reduces storage space, particularly beneficial for large-scale datasets.

Conclusion

Despite their name, junk dimensions offer substantial advantages in data warehousing and analytics. They provide a systematic way to manage low-cardinality attributes, contributing to enhanced performance and efficiency within a data environment. By leveraging tools like Pandas in Python, creating and implementing junk dimensions becomes more accessible, enabling smoother data management and faster query processing. Incorporating junk dimensions into your data modelling strategy can significantly optimise performance and simplify complex datasets.

Jammos Analytics