Data cardinality refers to the number of distinct values that a data attribute can take in a dataset. It is an important aspect of data modeling, as it helps in designing the database schema and selecting appropriate data types for the attributes.
The cardinality of a specific attribute or column in a dataset refers to the number of unique values that exist in that particular column. For example, in a dataset of customers, the "age" column may have a cardinality of 50, which means that there are 50 distinct ages present in the column.
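As a minimal sketch of this definition, the cardinality of a column can be computed by collecting its values into a set and counting them. The `customers` records and the `cardinality` helper below are hypothetical, invented purely for illustration.

```python
def cardinality(rows, column):
    """Return the number of distinct values found in `column`."""
    return len({row[column] for row in rows})


# Hypothetical customer records, invented for illustration.
customers = [
    {"age": 34, "email": "a@example.com"},
    {"age": 34, "email": "b@example.com"},
    {"age": 51, "email": "c@example.com"},
    {"age": 27, "email": "d@example.com"},
]

age_cardinality = cardinality(customers, "age")      # distinct ages
email_cardinality = cardinality(customers, "email")  # distinct emails
```

Here two customers share the same age, so the "age" column has a lower cardinality than the "email" column even though both hold four values.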
Cardinality is usually described as either low or high. A low cardinality attribute has a small number of distinct values relative to the number of rows, whereas a high cardinality attribute has a large number of distinct values.
For instance, the "gender" attribute is an example of a low cardinality attribute because it has a limited number of distinct values, e.g., "male" and "female." On the other hand, the "email" attribute can have a high cardinality as each email is unique, and there can be a large number of distinct email addresses.
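One rough way to compare attributes is the ratio of distinct values to total values: close to 0 suggests low cardinality, and 1.0 means every value is unique. The `uniqueness_ratio` helper and the sample values below are assumptions made for illustration, and there is no standard cut-off between "low" and "high".

```python
def uniqueness_ratio(values):
    """Return distinct values divided by total values (0.0 for empty input)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical columns, invented for illustration.
gender = ["male", "female", "female", "male", "female", "male"]
email = [
    "a@example.com", "b@example.com", "c@example.com",
    "d@example.com", "e@example.com", "f@example.com",
]

gender_ratio = uniqueness_ratio(gender)  # 2 distinct values over 6 rows
email_ratio = uniqueness_ratio(email)    # every value is unique
```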
An understanding of data cardinality also helps in identifying relationships between different attributes and can aid in the selection of appropriate data models for efficient data processing.
Understanding data cardinality is crucial for data analysis, especially when performing operations such as filtering, grouping (including stratifying), or aggregating data.
For machine learning systems, it is common to transform the data into a tabular form. This involves reducing the cardinality of various related data to that of a common entity, which may involve compromises and trade-offs. These compromises can have a significant impact on overall model performance.
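One possible reading of this step is collapsing many related records into a single row per entity. The sketch below assumes a hypothetical list of `orders` and summarises it to one row per customer; the field names and the choice of aggregates (a count and a sum) are illustrative assumptions, and they show the kind of compromise involved, since the individual order amounts are lost.

```python
from collections import defaultdict

# Hypothetical one-to-many data: several orders per customer.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 12.5},
]


def flatten_orders(orders):
    """Aggregate orders so that each customer becomes exactly one row."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for order in orders:
        entry = totals[order["customer_id"]]
        entry["order_count"] += 1
        entry["total_amount"] += order["amount"]
    # Sort by customer id so the output order is deterministic.
    return [{"customer_id": cid, **agg} for cid, agg in sorted(totals.items())]


customer_table = flatten_orders(orders)
```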
Given a finite number of samples, data columns (features, or variables) with high cardinality will have some low-frequency values. This means these values occur rarely in the dataset.
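Low-frequency values can be found by counting occurrences and keeping those below a threshold. This is a minimal sketch; the `rare_values` helper, the `cities` column, and the `min_count` threshold are hypothetical, and a sensible threshold depends on the dataset.

```python
from collections import Counter


def rare_values(values, min_count):
    """Return a {value: count} mapping for values seen fewer than min_count times."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n < min_count}


# Hypothetical column, invented for illustration.
cities = ["London"] * 5 + ["Paris"] * 4 + ["Oslo", "Lima"]

rare = rare_values(cities, min_count=2)  # values that occur only once
```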
This is especially problematic with discrete, or non-numerical data (e.g. categorical data types) where the effects of rare values may be dissimilar to other values (whereas in numerical data types, similar values may help with modelling).
Often, we are interested in ensuring model performance on specific subsets of the data; if these subsets have low frequency, the model will be poorly fitted to them and its performance will vary considerably on unseen test sets. Sometimes, data scientists prefer to drop (exclude) these samples rather than include them in the analysis.
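Dropping samples whose value is rare could look like the following sketch. The `drop_rare` helper, the `plan` column, and the threshold are assumptions for illustration; whether excluding these rows is appropriate depends on the analysis.

```python
from collections import Counter


def drop_rare(rows, column, min_count):
    """Keep only rows whose value in `column` occurs at least min_count times."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] >= min_count]


# Hypothetical records, invented for illustration.
rows = (
    [{"plan": "basic"}] * 6
    + [{"plan": "premium"}] * 3
    + [{"plan": "legacy"}]  # occurs only once
)

kept = drop_rare(rows, "plan", min_count=2)
kept_plans = {row["plan"] for row in kept}
```

The single "legacy" row is excluded, so the model never sees that value; any such value appearing at prediction time would need separate handling.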
Finally, cardinality is important when sampling data, for validation or as part of a training or analysis regime. If we over-sample rare values, it artificially inflates their importance and creates misleading performance metrics. But if we under-sample those rare values, they will not be modelled as well as high frequency values.
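One common way to avoid both problems is stratified sampling, which draws the same fraction from each distinct value so that the sample keeps roughly the original proportions. This is a sketch under stated assumptions rather than a prescribed method: the `stratified_sample` helper and the data are hypothetical, and the minimum of one row per value slightly over-represents very rare values.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, column, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each distinct value of `column`."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    strata = defaultdict(list)
    for row in rows:
        strata[row[column]].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row per value so rare values are not lost entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical records: "premium" is the rare value.
rows = [{"plan": "basic"}] * 8 + [{"plan": "premium"}] * 2

sample = stratified_sample(rows, "plan", fraction=0.5)
sample_counts = Counter(row["plan"] for row in sample)
```

In this example the 8:2 ratio of the full data is preserved as 4:1 in the sample.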
High cardinality is particularly problematic in non-numeric features (e.g. categorical data types), because it creates very sparse encodings and increases the chance of low-frequency values as described above.
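The sparsity can be illustrated with one-hot encoding, used here as one example of a categorical encoding: each distinct value becomes its own column, so the encoded width equals the cardinality and each row contains a single 1. The `one_hot` and `zero_fraction` helpers and the sample columns are hypothetical.

```python
def one_hot(values):
    """Return (categories, rows) where each row has a single 1 for its category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


def zero_fraction(encoded):
    """Return the fraction of cells in the encoded table that are zero."""
    cells = [cell for row in encoded for cell in row]
    return cells.count(0) / len(cells)


# Hypothetical columns, invented for illustration.
gender = ["male", "female", "female", "male"]
email = ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]

gender_categories, gender_encoded = one_hot(gender)
email_categories, email_encoded = one_hot(email)

gender_sparsity = zero_fraction(gender_encoded)  # 2 columns
email_sparsity = zero_fraction(email_encoded)    # 4 columns, one per row
```

With k distinct values, a fraction of 1 - 1/k of the encoded cells are zero, so the table grows wider and emptier as cardinality increases.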
Cardinality, frequency and data types are all important considerations when modelling data and will constrain your approach. In particular, cardinality will impact: