Dataset (Tabular, Matrix)

Categories: Causal Wizard Concept, Data, Variables

Causal Wizard accepts Tabular data, such as Excel and CSV files.

Tabular Data

Tabular data files such as CSV or Excel organize data in rows and columns, which allows the data to be represented as a matrix. These files can contain various data types, including numbers, text and categorical labels.

A tabular dataset is a collection of data organized in a table format, where the data is presented in rows and columns. The data may represent historical records, a collection of events, the results of a deliberate data-gathering survey, or any other process.

Data format

Causal Wizard supports two tabular data formats. The general-purpose format is described in this article, and a more specific Panel Data format is introduced in a separate article. This article also covers fundamental concepts and terminology for describing datasets.

In a typical dataset, Causal Wizard assumes that each row represents a single record (or sample unit) and each column represents a specific attribute or feature of that record. Each column can be used as a Variable in the Causal Diagram in Causal Wizard.

Header row

You should include a header row in your data file. This allows us to use the column names in the header row as your variable names.
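As a minimal sketch of why the header row matters, the following Python snippet reads a small CSV whose first line is a header and takes the variable names from it. The column names and values are hypothetical, and this uses Python's standard csv module for illustration; it is not Causal Wizard's own loading code.

```python
import csv
import io

# Hypothetical CSV content; the first line is the header row.
raw = "age,treated,outcome\n34,yes,1.2\n51,no,0.7\n"

reader = csv.DictReader(io.StringIO(raw))
variable_names = reader.fieldnames  # column names taken from the header row
rows = list(reader)                 # one dict per record (sample unit)

print(variable_names)
print(rows[0])
```

Without a header row, the first record would be consumed as column names, so the variable names would be meaningless and one row of data would be lost.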

File type / format

Tabular datasets are commonly stored in file formats like CSV or Excel, both accepted by Causal Wizard.

Data types

Most common data types are supported. Time and date values are converted to ISO-format datetime strings (that is, text values). Causal Wizard models treat them as categorical values without ordering; the loss of ordering does not affect our models, which are not time-series models. In Fixed-Effect models, each time value is handled independently. Time and date values may be ordered in some plots, but this is not guaranteed in all plots.
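The conversion described above can be illustrated with a short sketch using Python's standard datetime module. The input strings and their format are assumptions made for this example, and the code only illustrates the idea of treating timestamps as unordered text labels; it is not the conversion code used inside Causal Wizard.

```python
from datetime import datetime

# Hypothetical raw timestamps as they might appear in a data file.
raw_values = ["2023-01-15 09:30:00", "2023-01-16 14:00:00", "2023-01-15 09:30:00"]

# Parse each value, then convert it to an ISO-format string (plain text).
parsed = [datetime.strptime(v, "%Y-%m-%d %H:%M:%S") for v in raw_values]
iso_values = [d.isoformat() for d in parsed]

# Treated as categorical: only the set of distinct labels matters, not their order.
categories = set(iso_values)

print(iso_values)
print(len(categories), "distinct categories")
```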

Matrices

The mathematical concept of a matrix is closely related to a tabular dataset. A matrix is a rectangular array of numbers arranged in rows and columns. Like a tabular dataset, a matrix can represent a collection of data, with each row representing a record and each column representing a specific attribute or feature.

However, a big difference between tabular data and a matrix is that in a matrix all types of data are encoded as numbers, via a preprocessing step. Depending on the cardinality or distribution of the values, different encodings may be appropriate.

In fact, most data analysis and machine learning algorithms operate on matrices rather than tabular datasets directly. Matrices offer a convenient and efficient way to perform mathematical operations on large amounts of data.
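To make the preprocessing step concrete, here is a minimal sketch that turns a small table with one numeric column and one categorical column into a purely numeric matrix. It uses one-hot encoding, which is one common choice and an assumption for this example; the source does not state which encodings Causal Wizard applies. The column names and values are hypothetical.

```python
# Hypothetical tabular data: one numeric column and one categorical column.
table = [
    {"age": 34, "region": "north"},
    {"age": 51, "region": "south"},
    {"age": 29, "region": "north"},
]

# Distinct category labels, sorted so that the column order is deterministic.
levels = sorted({row["region"] for row in table})

# Each matrix row: the numeric value followed by one 0/1 indicator per label.
matrix = [
    [float(row["age"])] + [1.0 if row["region"] == level else 0.0 for level in levels]
    for row in table
]

for matrix_row in matrix:
    print(matrix_row)
```

One-hot encoding adds one column per distinct label, so it suits low-cardinality columns; a column with many distinct values would produce a very wide matrix, which is one reason the choice of encoding depends on cardinality.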

Dimensions and Variables

Remember that each Column of your dataset represents an attribute, feature or dimension, which represents a statistical Variable. The number of Variables is also known as the dimensionality of the dataset.

Panel Data format

Causal Wizard also supports a data format known as Panel Data. In this format, the same entities (which could be individuals, or groups rather than individuals) are observed over a common set of time points. This data format enables use of additional methods, in particular our Fixed-Effects models. If your data matches this format, you may want to consider using the Fixed-Effects models and methodology.
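As a rough way to check whether a dataset has this shape, the sketch below tests that every entity is observed at the same set of time points. The "long" layout (one row per entity and time point) and the names used are assumptions for illustration only; the separate Panel Data article defines the layout Causal Wizard actually expects.

```python
from collections import defaultdict

# Hypothetical records in a long layout: (entity, time point, measured value).
records = [
    ("store_a", 2021, 10.0),
    ("store_a", 2022, 12.5),
    ("store_b", 2021, 7.0),
    ("store_b", 2022, 8.2),
]

# Collect the set of time points observed for each entity.
periods_by_entity = defaultdict(set)
for entity, period, _value in records:
    periods_by_entity[entity].add(period)

# The data is panel-like if all entities share one common set of time points.
all_periods = set().union(*periods_by_entity.values())
is_balanced = all(periods == all_periods for periods in periods_by_entity.values())

print("Entities:", sorted(periods_by_entity))
print("Common time points:", is_balanced)
```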

In contrast, the standard Dataset format described here is used with our Causal-Diagram and Potential Outcomes methodology.

Data types

A tabular dataset can contain not only numerical data, but also other types of data such as text or categorical variables.

Relational data

Normalized, relational data is stored in multiple tables which are related by keys. The structure depends on the cardinality of the data and the entities it represents. This structure is tabular, but it comprises many tables.

Relational databases are ideal for storing and representing data, but not ideal for machine learning and other statistical analyses. Therefore, you will need to denormalise the data into a single table to create a data file suitable for Causal Wizard (or any other machine learning tool).
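The denormalisation step can be sketched as a join that copies the attributes of a related table into each row of the main table. The two tables, their keys and their column names below are hypothetical, and the example uses only the Python standard library; in practice the same result is often produced with a SQL JOIN or a dataframe merge.

```python
import csv
import io

# Hypothetical normalized data: customers keyed by id, orders referencing them.
customers = {
    1: {"name": "Ana", "region": "north"},
    2: {"name": "Ben", "region": "south"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 2, "amount": 40.0},
    {"order_id": 102, "customer_id": 1, "amount": 15.5},
]

# Denormalise: one flat row per order, with the customer's attributes copied in.
flat = [{**order, **customers[order["customer_id"]]} for order in orders]

# Write the single table as CSV text, with a header row.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=list(flat[0].keys()))
writer.writeheader()
writer.writerows(flat)
csv_text = output.getvalue()

print(csv_text)
```

Customer attributes are repeated on every order row. That redundancy is what normalization avoids in a database, but it is what a single-table analysis file requires.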

Summary

In summary, a tabular dataset is a collection of data presented in a table format. A matrix is a mathematical concept: a rectangular array of numbers that can represent a collection of data. In Causal Wizard, the columns of the dataset represent attributes that can be used as Variables, and the rows represent the Sample of the population; each row contains one sample unit.