Exploratory Data Analysis


Exploratory Data Analysis (EDA) involves inspecting and visualising your data in various ways to understand its properties and qualities.

What is Exploratory Data Analysis (EDA)?

Exploratory data analysis isn't guided towards a specific goal. Instead, it's an open-ended exploration of what's available. You’ll look at the data again later, often using the same techniques, but with specific questions in mind.

The first step in any machine learning or causal ML analysis should always be looking at the data. In fact, we have a guide to this process in our ML Project Builder tool here (click the “Data” tab). That guide covers a wider range of considerations than this article, which is limited to insights you can gain from data visualisation. Looking at data allows you to spot data issues which could have a big impact on any analysis.

Benefits of EDA

EDA should identify potential issues such as:

  • Missing or sparse data
  • Very uneven value distribution, or many rare values in categorical data
  • Inconsistent data types or encoding
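As a rough illustration, the three checks above can be run on a single column with a few lines of Python. This is a minimal standard-library sketch, not part of CausalWizard; the function name `column_report`, the rarity threshold and the example values are all assumptions made for illustration.

```python
from collections import Counter


def column_report(values, rare_threshold=0.05):
    """Summarise one column: missing count, value types seen, and rare values."""
    # Treat None and empty strings as missing (an assumption; adapt to your data).
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        # More than one type name suggests inconsistent encoding.
        "types": sorted({type(v).__name__ for v in present}),
        # Values whose share of the non-missing rows is below the threshold.
        "rare_values": sorted(
            str(v) for v, n in counts.items() if n / len(present) < rare_threshold
        ),
    }


# Made-up column: mostly "yes", one "no", one stray integer, two missing entries.
report = column_report(["yes"] * 18 + ["no", 1, None, ""], rare_threshold=0.1)
print(report)
```

In this made-up column the report flags two missing entries, two different value types, and two rare values, which are exactly the kinds of issue worth investigating before modelling.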

Distributions describe the range of values encountered and the frequency of each value. Very uneven distributions can be problematic because rare values are usually modelled poorly by ML models (which are statistical in nature) and therefore need careful handling and evaluation.

Consistency refers to the ease with which machines can interpret data, both statistically and practically. For example, if data encoding as text or number types is inconsistent, it may be hard to recognise equivalent values. Similarly, if there is bias or inconsistency in the way data was recorded, this will affect the quality of your ML models and solutions. Without strict instructions, human-entered data is usually quite inconsistent. Free text data has historically been especially difficult to interpret, although modern Large Language Models such as ChatGPT may make this easier in future.

There are two common solutions to missing and sparse data: imputation (replacing missing data with default or average values) and exclusion (cutting these records out of the data). Imputation affects the quality of your results, and exclusion reduces the size of your dataset - potentially fatally, if there's not enough data left. That's why it's important to examine the data as soon as possible.
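The trade-off between the two can be seen on a tiny example. The sketch below uses made-up values and mean imputation (only one of several possible imputation choices) to show that exclusion shrinks the dataset, while imputation keeps every row but artificially reduces the spread of the values.

```python
from statistics import mean, pvariance

# Made-up column with two missing entries (None).
ages = [34, None, 29, 41, None, 38]

# Exclusion: drop the incomplete records, shrinking the dataset.
excluded = [a for a in ages if a is not None]

# Imputation: fill each gap with the mean of the observed values.
fill_value = mean(excluded)
imputed = [fill_value if a is None else a for a in ages]

print(len(excluded), len(imputed))              # 4 6
print(pvariance(excluded), pvariance(imputed))  # 20.25 13.5
```

The lower variance after imputation is one concrete way in which imputed data can distort later results.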

Exclusion is often an important part of study design, reducing your samples to comparable groups. Exclusion can help to maintain Positivity in your data.

Role of EDA in Causal analysis

In a Causal study, we particularly care about the relationships between variables (columns in your CausalWizard data). This means we should aim to gain a good understanding of correlation and association between variables in our data.

  • We can use EDA to validate or invalidate our assumptions about the relationships between variables by examining bivariate association (bivariate means between pairs of variables).
  • We can use EDA to choose which features to include in our models (this is known as feature-selection).
  • Finally, we can also use EDA to validate our models and our interpretation of feature-effects (aka feature-importance).
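One simple numerical check of bivariate association is the Pearson correlation coefficient. The sketch below is a plain-Python illustration with made-up data (the helper name `pearson` is an assumption, not a CausalWizard function). It also shows why plots matter: correlation only measures linear association, so a strongly associated but non-linear pair can still score zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]  # straight-line dependence on x
curved = [x ** 2 for x in xs]     # fully determined by x, but not linearly

print(pearson(xs, linear))  # close to 1
print(pearson(xs, curved))  # 0.0: no linear correlation despite clear association
```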

Univariate EDA

To begin to understand each variable, look at its univariate distribution. You can use the histogram plot provided in the Data view and Data tab of your Studies for this. Just select one or two columns (i.e. variables) from your Dataset to examine the distribution of values:

A histogram plot is useful to see whether the data contains the expected range of values; it will be obvious if many values are bad or missing, or don’t have the expected distribution. Histograms are also useful for categorical, text data, to see which categorical values are rare.

Use the Linear and Log axis scale options to make small differences apparent when there are also large differences between values.
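To see why the log scale helps, compare relative bar heights under each scale. The following sketch uses a made-up categorical column with a few common values and two very rare ones; the column name and counts are purely illustrative.

```python
from collections import Counter
from math import log10

# Made-up categorical column: a few common values and two very rare ones.
countries = ["PRT"] * 5000 + ["GBR"] * 1200 + ["FRA"] * 900 + ["ISL"] * 4 + ["NZL"] * 2
counts = Counter(countries)
largest = max(counts.values())

for value, n in counts.most_common():
    linear_share = n / largest             # relative bar height on a linear axis
    log_share = log10(n) / log10(largest)  # relative bar height on a log axis
    print(f"{value}  count={n:<5} linear={linear_share:.3f}  log={log_share:.3f}")
```

On the linear scale the two rare values are less than a thousandth of the tallest bar and effectively invisible; on the log scale they remain clearly distinguishable from each other.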

Bivariate EDA

CausalWizard provides plots which let you examine the bivariate distribution of variables you suspect may be associated, correlated or have a causal relationship. Bivariate means between two variables. All the bivariate data visualisations are provided in the Data view and Data tab of your Studies. Select one or two columns (i.e. variables) from your Dataset, to examine their joint distribution. For efficiency, CausalWizard generates a sample of 1000 rows for these plots (this avoids waiting to transfer hundreds of thousands of rows to your browser). The sample generating algorithm always includes the most extreme values of all numerical distributions.
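CausalWizard's actual sampling algorithm is not described here, so the following is only one possible way to build a sample with the stated property: keep the rows holding the minimum and maximum of every numerical column, then fill the rest of the sample at random. The function name, parameters and data are hypothetical, and the sketch assumes the numerical columns contain no missing values.

```python
import random


def sample_with_extremes(rows, numeric_columns, k=1000, seed=0):
    """Random sample of about k rows that always keeps each column's min and max row."""
    if len(rows) <= k:
        return list(rows)
    indices = range(len(rows))
    keep = set()
    for col in numeric_columns:
        keep.add(min(indices, key=lambda i: rows[i][col]))
        keep.add(max(indices, key=lambda i: rows[i][col]))
    # Fill the remaining slots with a random draw from the other rows.
    rest = [i for i in indices if i not in keep]
    rng = random.Random(seed)
    extra = rng.sample(rest, max(0, k - len(keep)))
    return [rows[i] for i in sorted(keep | set(extra))]


# Made-up dataset: 5000 rows with two numerical columns.
rows = [{"x": i, "y": -i} for i in range(5000)]
sample = sample_with_extremes(rows, ["x", "y"], k=1000)
print(len(sample), sample[0], sample[-1])
```

Keeping the extremes means the axis ranges of a plot drawn from the sample match those of the full dataset, even though most rows are omitted.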

CausalWizard will select an appropriate plot type based on the combination of variables selected:

  • If both variables are numerical, you can choose between Scatter and Contour plots.
  • If one variable is numerical and one is categorical, you will see a Violin plot of the numerical values for each categorical value.
  • If both variables are categorical, you will see a Heatmap plot (a grid of rectangular cells).
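The selection rule in the list above can be written as a small lookup. This is a sketch of the rule as described, with a hypothetical function name and string labels; it is not CausalWizard's own code.

```python
def choose_plots(kind_a, kind_b):
    """Return the plot types offered for a pair of variable kinds."""
    kinds = {kind_a, kind_b}
    if kinds == {"numerical"}:
        return ["scatter", "contour"]
    if kinds == {"categorical"}:
        return ["heatmap"]
    if kinds == {"numerical", "categorical"}:
        return ["violin"]
    raise ValueError(f"unknown variable kinds: {kind_a!r}, {kind_b!r}")


print(choose_plots("numerical", "numerical"))    # ['scatter', 'contour']
print(choose_plots("categorical", "numerical"))  # ['violin']
print(choose_plots("categorical", "categorical"))  # ['heatmap']
```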

The CausalWizard software interface allows you to force the app to present a variable as numerical or categorical, if that's possible. By default, CausalWizard tries to detect your data types from your data file, but it doesn't always guess correctly.
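To illustrate why automatic detection can guess wrongly, here is a deliberately naive detector: a column is called numerical if every non-missing value parses as a number. This is an assumed heuristic for illustration only, not the method CausalWizard uses.

```python
def guess_kind(values):
    """Naive guess: 'numerical' if every non-missing value parses as a float."""
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return "categorical"  # nothing to parse; an arbitrary default
    try:
        for v in present:
            float(v)
    except (TypeError, ValueError):
        return "categorical"
    return "numerical"


print(guess_kind(["1.5", "2", ""]))     # numerical
print(guess_kind(["A", "B", "A"]))      # categorical
print(guess_kind(["90210", "10001"]))   # numerical, although postal codes are labels
```

The last case is the kind of column where overriding the detected type is the right thing to do: the values look like numbers but behave like categories.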

Scatter plot

A scatter plot is usually suitable for examining a bivariate distribution, although a density or contour plot may be needed to see high densities. Use the guide below to interpret scatter plots of two variables and judge whether variables are associated, correlated or independent (not associated). If you suspect a causal relationship between two variables, you would also expect to see some association or correlation between them, although a statistically significant association is not always visible.

It can be difficult to interpret the pattern of dots in a scatter plot. The figure below shows some of the ways data appears when associated, correlated, or independent (not associated):

Contour plot

A contour plot is used to view the joint distribution of two numerical variables; it is useful to detect changes in density which can't be easily detected in scatter plots. The same data as shown in the scatter plot above reveals a clear density gradient when viewed in the contour plot.
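The reason density is hard to read from a scatter plot is overplotting: many points in the same place look like one dot. Contour plots are usually drawn from a smoothed density estimate; the sketch below substitutes simple binned counts, which is enough to show the idea. The function name and data are made up for illustration.

```python
def density_grid(xs, ys, bins=10):
    """Count points per cell of a bins-by-bins grid, indexed as grid[y_bin][x_bin]."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)

    def cell(value, lo, hi):
        if hi == lo:
            return 0  # all values identical: everything falls in the first bin
        # Clamp so the maximum value lands in the last bin rather than past it.
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        grid[cell(y, y_lo, y_hi)][cell(x, x_lo, x_hi)] += 1
    return grid


# Three identical points would look like a single dot in a scatter plot.
xs = [0.0, 0.0, 0.0, 1.0]
ys = [0.0, 0.0, 0.0, 1.0]
grid = density_grid(xs, ys, bins=2)
print(grid)  # [[3, 0], [0, 1]]
```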

Violin Plot

Violin plots are used to display a combination of categorical and numerical variables. The categorical variable is always displayed on the x-axis. Box plots with mean and interquartile range are displayed on top of each violin.
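The numbers behind each box (mean and quartiles per category) can be computed directly, which is handy for checking what a violin plot appears to show. This sketch uses made-up (category, value) pairs and Python's `statistics.quantiles` with its default method; plotting tools may use a slightly different quartile convention.

```python
from collections import defaultdict
from statistics import mean, quantiles

# Made-up (category, value) pairs.
rows = [
    ("hotel", 10), ("hotel", 20), ("hotel", 30), ("hotel", 40), ("hotel", 50),
    ("resort", 5), ("resort", 7), ("resort", 9), ("resort", 100),
]

# Group the numerical values by their categorical value.
groups = defaultdict(list)
for category, value in rows:
    groups[category].append(value)

summary = {}
for category, values in groups.items():
    q1, median, q3 = quantiles(values, n=4)
    summary[category] = {"mean": mean(values), "q1": q1, "median": median, "q3": q3}
    print(category, summary[category])
```

In the made-up "resort" group a single large value pulls the mean far above the median, the kind of skew that shows up in a violin as a long thin tail.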

Heatmap plot

CausalWizard provides a heatmap plot to show the joint distribution of two categorical variables. It's useful because it can show all combinations of many values in one graphic. Brighter cells indicate higher frequency in your data. For example, in the data below we can see that far more guests arrived on 1st July than on any other date of the year:

The categorical axes are not sorted because there is no universal sort order for arbitrary categorical data. Check the order of your categorical values carefully when interpreting the visuals.
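The counts behind a heatmap are a cross-tabulation of the two variables. The sketch below builds one from made-up (meal, room type) pairs and fixes the axis order explicitly, which is one way to avoid misreading an unsorted axis. The variable names and values are illustrative assumptions.

```python
from collections import Counter

# Made-up (meal, room_type) pairs.
pairs = [
    ("BB", "A"), ("BB", "A"), ("BB", "D"),
    ("HB", "A"), ("HB", "D"), ("HB", "D"), ("HB", "D"),
]
counts = Counter(pairs)

# Choose the axis order explicitly instead of relying on order of appearance.
meals = sorted({meal for meal, _ in pairs})
rooms = sorted({room for _, room in pairs})

# One row per meal, one column per room type; combinations never seen count as 0.
grid = [[counts[(meal, room)] for room in rooms] for meal in meals]
print(meals, rooms)  # ['BB', 'HB'] ['A', 'D']
print(grid)          # [[2, 1], [1, 3]]
```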