Statistical data types refer to the different formats of data used in exploratory data analysis and machine learning, including numeric, categorical, ordinal, and others.
Data types are an essential aspect of exploring and analyzing data, and preparing data for machine learning and other statistical models. They refer to the different formats or representations of data. Many data types, including numeric, categorical and ordinal, are related but subtly different.
Causal Wizard attempts to automatically detect the data types of the columns in your data, and adapts the user interface accordingly. The main place you'll notice the impact of data types is when defining the Treated or Control cohort assignments using your Treatment variable. The data type also affects internal calculations and methods, and how your results are displayed.
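To give a feel for what automatic detection involves, here is a minimal sketch of a type-guessing heuristic. It is not Causal Wizard's actual logic, which is not described here; the function name `infer_column_type` and the rules it applies are assumptions made purely for illustration.

```python
def _as_number(value):
    """Return the value as a float, or None if it is not numeric."""
    if isinstance(value, bool):
        return None  # treat True/False as labels, not numbers
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def infer_column_type(values):
    """Guess a coarse data type for one column (hypothetical heuristic)."""
    # Ignore missing entries before looking at the values.
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return "unknown"
    # Exactly two distinct values: treat the column as binary.
    if len(set(present)) == 2:
        return "binary"
    numbers = [_as_number(v) for v in present]
    if all(n is not None for n in numbers):
        # Whole numbers only: discrete. Any fractional value: continuous.
        if all(n.is_integer() for n in numbers):
            return "discrete"
        return "continuous"
    # Anything non-numeric is treated as categorical.
    return "categorical"


print(infer_column_type([0, 1, 1, 0]))              # binary
print(infer_column_type([1, 2, 3, 4]))              # discrete
print(infer_column_type([5.4321, 1.0, 2.5]))        # continuous
print(infer_column_type(["red", "blue", "green"]))  # categorical
```

A heuristic like this can guess wrong. It cannot tell ordinal from nominal categories, and a small sample of a continuous variable with only two distinct values would be labelled binary, so it is worth checking the detected types yourself.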
This video provides a good introduction and taxonomy of the different types of data:
https://www.youtube.com/watch?v=7bsNWq2A5gI
Numeric data refers to quantitative data represented by numbers. This type of data can be continuous or discrete.
Continuous data can take any value within a range (any real number, such as 5.4321), while discrete data can only take specific values such as 1, 2, 3 and might be represented as an integer. Examples of continuous numeric data include age and height, while examples of discrete numeric data include the number of siblings and the number of pets.
Categorical data is qualitative data that describes characteristics or attributes. It can be nominal or ordinal.
Nominal data has no inherent order: the color of a car is one example, gender another. Ordinal data has a ranking of categories, such as a rating scale for a product or service.
Ordinal data is similar to nominal categorical data, but it has a specific order or ranking. It is often used to represent subjective data or opinions. Examples of ordinal data include a rating scale from one to five for a product or service or a ranking of academic performance, such as A, B, C, D, and F.
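The academic grades example can be sketched in code. The snippet below is an illustration only: the names `GRADE_ORDER`, `GRADE_RANK` and `encode_ordinal` are hypothetical, and the ranking (F lowest, A highest) is an assumption about how the grades should be ordered.

```python
# Assumed ranking of the categories, from lowest to highest.
GRADE_ORDER = ["F", "D", "C", "B", "A"]
GRADE_RANK = {grade: rank for rank, grade in enumerate(GRADE_ORDER)}


def encode_ordinal(values, rank=GRADE_RANK):
    """Replace each category with its position in the assumed ranking."""
    return [rank[v] for v in values]


grades = ["B", "A", "F", "C"]

codes = encode_ordinal(grades)
print(codes)  # [3, 4, 0, 2]

# Sorting by rank respects the ordering; a plain alphabetical sort would not.
print(sorted(grades, key=GRADE_RANK.get))  # ['F', 'C', 'B', 'A']
print(sorted(grades))                      # ['A', 'B', 'C', 'F']
```

The integer codes preserve the order of the categories, but the gaps between them carry no meaning: nothing says the step from C to B is the same size as the step from B to A. Comparisons and medians are therefore safer than averages on data encoded this way.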
Binary data (0 or 1) might be considered a type of discrete numeric data, but it is often used to represent Boolean (True/False) values, which are a type of nominal data.
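A short sketch shows this dual nature of binary data. The `treated` list is made-up illustrative data, not taken from any real study.

```python
# Hypothetical Boolean flags: was each unit in the Treated cohort?
treated = [True, False, True, True]

# Encode the Boolean labels as 0/1 integers.
encoded = [int(flag) for flag in treated]
print(encoded)  # [1, 0, 1, 1]

# Because the codes are 0 and 1, their mean is the proportion of True values.
proportion_treated = sum(encoded) / len(encoded)
print(proportion_treated)  # 0.75
```

The 0/1 codes are labels for two categories, yet arithmetic on them still yields something interpretable, which is why binary columns are sometimes handled as numeric and sometimes as categorical.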
Other data types include text, time-series, image, and audio data. Text data refers to unstructured data in the form of sentences, paragraphs, or documents. Time-series data is data that is recorded over time, usually at regular intervals, and is used to make predictions and identify patterns over time. Image data consists of visual information in the form of pictures, graphics, or diagrams, and requires specialized techniques such as computer vision to analyze. Audio data is data in the form of sound waves at different frequencies and can be analyzed using techniques such as speech recognition and music classification.
Data types are important because the data type constrains how the data can be effectively prepared and ingested into your statistical and machine learning models. Inappropriate encoding of data types can produce biased or inaccurate models, which fail to properly exploit the data and its features.
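As one illustration of why encoding matters, consider a nominal variable such as car color. The sketch below contrasts naive integer codes with one-hot encoding, a common alternative for nominal data. The helper `one_hot` and the sample data are hypothetical, written here in plain Python rather than taken from any particular library.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]

# Naive integer codes impose an order that the data does not have.
categories, rows = one_hot(colors)
naive_codes = {c: i for i, c in enumerate(categories)}
print(naive_codes)  # {'blue': 0, 'green': 1, 'red': 2}

# One-hot rows give every category its own indicator column instead.
print(categories)  # ['blue', 'green', 'red']
for color, row in zip(colors, rows):
    print(color, row)
# red [0, 0, 1]
# green [0, 1, 0]
# blue [1, 0, 0]
# green [0, 1, 0]
```

With the integer codes, a model that treats the column as numeric would read red as "greater than" green and twice as far from blue, which is an artifact of the encoding. The one-hot form avoids implying any order, at the cost of one column per category.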