Finding and preparing data for your AI/ML project

Consider the questions below to identify and explore data that will be needed in your project.

Q4.1 Key sources

List the data which is necessary or highly desirable for the function or evaluation of the solution. For each source, note:

  • Who owns, controls or maintains the data?
  • How much data is available (e.g. date ranges, number of samples)?
  • Which version or source of the data is most definitive?
Additional context & tips

Reach out to colleagues to understand the data potentially available for use on the project. Try to identify all the key pieces of data you need, and whether those sources contain the fields needed to link them together.

Do pay attention to the quality and quantity of data in each of these sources. You can express quantity in terms of the entities described in the data, or as the transactions or date ranges stored (for example, "daily records since Dec 2005"). Quality can be captured in terms of completeness (how many fields or entries are blank) and consistency. Consistent data is typically generated according to a documented process, either by automation or software-guided procedures. Ad-hoc collection of unstructured text is generally the least consistent and hardest to use in AI/ML projects (although this is changing with the advent of LLMs, which can exploit unstructured text).
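As a concrete illustration, quantity and completeness can be summarised quickly once a source has been exported. The sketch below is a minimal example using pandas; the file name and the record_date column are hypothetical placeholders for your own data.

```python
import pandas as pd

# Hypothetical export of one key source; file and column names are placeholders.
df = pd.read_csv("daily_records.csv", parse_dates=["record_date"])

# Quantity: number of records and the date range they cover.
print(f"{len(df)} records from {df['record_date'].min():%Y-%m-%d} "
      f"to {df['record_date'].max():%Y-%m-%d}")

# Completeness: fraction of blank (null) entries in each field.
print(df.isna().mean().sort_values(ascending=False))
```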

Note down who owns, maintains or controls access to the data. They will need to be involved in your project. Their help in facilitating and exporting data in a convenient format is a huge assistance (and it can be quite time-consuming, with multiple iterations required). These people are often busy with other tasks, and may see your project as less important than their day-to-day work.

You will often encounter multiple versions of data used by different people or teams, and the provenance (origin story) of the data is frequently poorly defined: it may be produced by an undocumented, ad-hoc process. In these cases, you'll need to ensure that the data acquisition and storage process becomes documented and systematic as part of your project.

To do this, first figure out the best source of each key piece of data. Then trace each one back to its origin and engage the people who manage it. You may be able to build support for the project by delivering more accurate, reliable and timely data to others who rely on it.

Q4.2 Data structure

Explore with your team the structure of the data sources, and how they can be linked together, e.g. via unique identifiers or date ranges. List any potential issues with linking the data together, such as cardinality changes (e.g. many-to-one relations), gaps or missing data. Make notes here, and consider making an Entity-Relationship Diagram.

Additional context & tips

Relational databases

Most databases are relational, which means they store tables of data, and the relationships (links) between these tables.  However, most AI/ML methods assume that the data is a single table! Even if your data is stored in spreadsheets, you will probably also need to link these spreadsheets together.

This means that there is usually a data transformation step in any AI/ML solution development in which multiple tables or data sources are joined together, and transformed into a single table or matrix (this is also known as denormalization). This step may be difficult, so it's worth exploring it now if you can.

Figure: Linking data structures together and denormalizing them to produce a machine learning dataset

The figure above shows a tabular ML dataset which includes Subscription status and Customer age data, produced from two linked tables in a relational (SQL) database. The tables must be joined to produce the dataset. The number of rows in the two tables is probably different, creating cardinality changes that must be tackled.
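A minimal sketch of that join in pandas, assuming the two tables have already been exported from the database. The table names, column names and values here are illustrative, not taken from the figure.

```python
import pandas as pd

# Illustrative tables; in practice these would be exported from the relational database.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 52, 29],
})
subscriptions = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "status": ["active", "active", "lapsed", "active"],
})

# Denormalize: join the two tables on the shared key to get a single flat table.
# Customer 2 appears twice in subscriptions, so the joined table has duplicate
# customer rows -- a cardinality change that must be handled (see below).
ml_dataset = subscriptions.merge(customers, on="customer_id", how="left")
print(ml_dataset)
```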

Cardinality

Cardinality refers to the number of records (or rows) in one data source or table relative to another; a cardinality change means the row counts differ between linked tables (for example, one customer row linked to many subscription rows). Cardinality changes are particularly problematic when trying to denormalize data sources or tables into an ML dataset, because the resulting gaps or duplicates cause problems for ML and AI methods (and for evaluation of the solution). Where possible, it is preferable to define rules which collapse duplicates into a single record, and to explicitly create evaluation methods which account for the non-independence of linked samples.
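One way to collapse duplicates, continuing the illustrative tables above, is to keep only the most recent record per entity. The "most recent wins" rule and the updated_at column are assumptions; substitute your own business logic.

```python
import pandas as pd

# Illustrative table with a cardinality problem: customer 2 has two rows.
subscriptions = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "status": ["active", "active", "lapsed", "active"],
    "updated_at": pd.to_datetime(["2024-01-05", "2024-01-02",
                                  "2024-03-10", "2024-02-01"]),
})

# Rule (assumed): keep only the most recent record per customer, so each
# entity contributes exactly one row to the ML dataset.
latest = (subscriptions.sort_values("updated_at")
          .groupby("customer_id", as_index=False)
          .last())
print(latest)
```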

Other types of database

There are other types of data and database you might encounter, which have their own issues.

  • Timeseries data is typically large, although the structure is usually simpler. However, you will often have to join this data to less dynamic, relational data (see the sketch after this list).
  • Graph databases typically have data-defined structure, making the denormalization process even more complex unless you use methods which are explicitly designed to model graphical structures.
  • No-SQL or unstructured databases are usually very difficult to use in AI/ML solutions, as they provide few or no guarantees about the structure of the data. Usually, some structure is enforced by application-layer logic.
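For the timeseries case, a common approach is an "as-of" join, which attaches to each reading the most recent matching record from the slower-moving table. The sketch below uses pandas merge_asof; all table, column and value names are illustrative assumptions.

```python
import pandas as pd

# Illustrative time-series readings and a slowly-changing relational table.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-03-01 10:00", "2024-03-01 11:00",
                                 "2024-03-01 12:00"]),
    "sensor_id": [7, 7, 7],
    "value": [0.41, 0.39, 0.44],
})
sensor_config = pd.DataFrame({
    "effective_from": pd.to_datetime(["2024-02-01", "2024-03-01 10:30"]),
    "sensor_id": [7, 7],
    "calibration": ["v1", "v2"],
})

# Attach to each reading the configuration that was in effect at that time.
joined = pd.merge_asof(readings.sort_values("timestamp"),
                       sensor_config.sort_values("effective_from"),
                       left_on="timestamp", right_on="effective_from",
                       by="sensor_id")
print(joined)
```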

Q4.3 Origins and sources

Explore and note the process by which new data would continually be obtained, including cadence, latency, and any manual processes which might be difficult to automate. Who is responsible for this and how will continuity be guaranteed?

Additional context & tips

The need for continual integration

If you're only aiming for a one-off analysis, without a production use-case, it might be OK to perform heavily manual data cleaning and preparation. But in most cases, you'll need to continually repeat data preparation - either for incremental model re-training and re-validation as new data arrives, or to enable continual inference in production use. This means you have to figure out how data will be continually integrated from other systems or sources.
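In practice this usually means wrapping the preparation steps in a script or function that can be re-run whenever new extracts arrive, rather than repeating them by hand. A minimal sketch, with hypothetical file and column names:

```python
import pandas as pd

def prepare_dataset(customers_path: str, subscriptions_path: str) -> pd.DataFrame:
    """Re-runnable preparation: apply the same joins and cleaning rules
    every time new extracts arrive. Paths and columns are placeholders."""
    customers = pd.read_csv(customers_path)
    subscriptions = pd.read_csv(subscriptions_path, parse_dates=["updated_at"])

    # Same de-duplication and denormalization rules as used during development.
    latest = (subscriptions.sort_values("updated_at")
              .groupby("customer_id", as_index=False).last())
    return latest.merge(customers, on="customer_id", how="left")

# Re-run whenever fresh extracts land, e.g. nightly:
# dataset = prepare_dataset("customers_latest.csv", "subscriptions_latest.csv")
```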

Replacing manual processes

If data preparation involves some manual steps, it's important to consider ways around this - either automation of the manual process, or substitution with other sources (even if the substitutes are less ideal).

Ensuring truth

Aim to obtain data from the most trustworthy, correct source. Others are likely to come to rely on the more reliable, well-curated data you produce, so it's worth making it as correct as possible.

Cadence

The cadence or frequency of updates is often important to ensure up-to-date data is available. Many systems produce data in daily or hourly batches, which may be insufficient if you need to use that data earlier. It can be difficult to get these systems re-engineered to deliver data more frequently. You should consider whether you can make do with older data, or have processes to deal with the absence of the most recent values. What impact will cadence have on your solution?

Latency

Latency is the delay between the time of a real event or measurement and the data becoming available on your system. It is different to cadence - you can have regular updates of data which is still delayed by days! As with cadence, note down the potential hurdles latency could introduce and some potential means of dealing with it in key data sources.
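A simple freshness check can make cadence and latency assumptions explicit and flag when a key source falls behind. The sketch below is illustrative; the threshold, column name and DataFrame are placeholders.

```python
import pandas as pd

def check_freshness(df: pd.DataFrame, timestamp_col: str,
                    max_age: pd.Timedelta) -> bool:
    """Return True if the newest record is recent enough for the solution.

    max_age should reflect both cadence and latency, e.g. a daily batch
    that arrives six hours late needs max_age of at least 30 hours.
    """
    age = pd.Timestamp.now() - df[timestamp_col].max()
    if age > max_age:
        print(f"Stale data: newest record is {age} old (limit {max_age})")
        return False
    return True

# Example with an illustrative 30-hour threshold:
# ok = check_freshness(readings, "timestamp", pd.Timedelta(hours=30))
```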

Q4.4 Data quality

Note any known issues, concerns or risks due to data quality. Consider obtaining Exploratory Data Analysis (EDA), which should identify potential issues such as:

  • Missing or sparse data
  • Very uneven value distribution, or many rare values in categorical data
  • Inconsistent data types or encoding / recording
  • Need for dimensionality reduction
Additional context & tips

Exploratory Data Analysis (EDA)

In previous questions you have identified the data and where it might come from. Now it's time to think about its quality. This is often done as part of a PoC or an early phase of an AI/ML project rather than during planning. However, you should be aware of it at this stage, because analysis of the data might already exist, and you might be able to make use of it. This analysis is often called "Exploratory" data analysis, because it isn't guided towards a specific goal; instead, it's an open-ended exploration of what's available. EDA typically covers the following:

Missing and sparse records

Quality issues include: 

  • Missing data. Linked records are simply not present.
  • Sparse data. The data is there, but many values are defaults, blank, or otherwise incomplete. The ratio of complete to incomplete data is low, or the frequency of observations or other records is low.

There are two solutions to missing and sparse data: Imputation (replacing missing data with default or average values) and exclusion (cutting these records out of the data). Imputation affects the quality of your results, and exclusion reduces the size of your dataset - potentially fatally, if there's not enough data left. That's why it's important to examine the data as soon as possible.
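Both remedies are one-liners in most data tooling; the trade-offs described above are the hard part. A minimal pandas sketch with placeholder column names:

```python
import pandas as pd

# Illustrative dataset with missing values; column names are placeholders.
df = pd.DataFrame({"age": [34, None, 29, 41],
                   "income": [52000, 61000, None, None]})

# Option 1: exclusion -- drop incomplete rows (shrinks the dataset).
excluded = df.dropna()

# Option 2: imputation -- fill gaps with a default such as the column median
# (keeps every row, but distorts the distribution of the imputed columns).
imputed = df.fillna(df.median(numeric_only=True))

print(f"original: {len(df)} rows, after exclusion: {len(excluded)} rows")
print(imputed)
```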

Distributions

Another quality issue is the distribution of values. Distributions describe the range of values encountered and the frequency of each value. Distributions can be problematic because rare values are usually modelled poorly by AI/ML models (which are statistical in nature) and therefore need careful handling and evaluation.
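Checking value frequencies is usually enough to spot skewed distributions and rare categories early. An illustrative sketch with made-up values:

```python
import pandas as pd

# Illustrative categorical column; values and proportions are made up.
products = pd.Series(["basic"] * 950 + ["premium"] * 45 + ["legacy"] * 5)

# Frequency of each value: heavily skewed distributions and rare categories
# show up immediately and will need careful handling and evaluation.
print(products.value_counts(normalize=True))
```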

Consistency

Consistency refers to the ease with which machines can interpret data, both statistically and practically. For example, if values are inconsistently encoded as text or numeric types, it may be hard to recognise equivalent values. Similarly, if there is bias or inconsistency in the way data was recorded, this will affect the quality of your AI/ML models and solutions. Without strict instructions, human-entered data is usually quite inconsistent. Free text data has historically been especially difficult to interpret, although modern Large Language Models such as ChatGPT may make this easier in future.
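A common first step is to coerce such mixed encodings into a single type, so that inconsistent entries become explicit missing values that can then be handled like any other gap. A small illustrative sketch:

```python
import pandas as pd

# Illustrative column where numbers were sometimes recorded as text,
# with inconsistent formatting -- a typical consistency problem.
raw = pd.Series(["42", " 17 ", "n/a", 8, "8.0", None])

# Coerce everything to a numeric type; values that cannot be interpreted
# become NaN and can then be treated as missing data.
clean = pd.to_numeric(raw.astype(str).str.strip(), errors="coerce")
print(clean)
```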

Dimensionality

Dimensionality refers to the number of independently variable elements or values in the data. Most datasets have too many dimensions (also known as features) and too few samples. This can lead to over-parameterised models which are impractical to train and generalise badly. For these reasons, data scientists and ML researchers often speak of the "curse of dimensionality".

If your data has many dimensions (features) and especially if it has few samples, you might want to plan to reduce the dimensionality of your data as part of your project. This process is known as dimensionality reduction.
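Principal Component Analysis (PCA) is one widely used dimensionality reduction technique; the sketch below is a minimal example on synthetic data, not a recommendation of PCA over other methods.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 50 samples with 200 features (more features than samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))

# Project onto the 10 directions that explain the most variance.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance explained:", round(pca.explained_variance_ratio_.sum(), 2))
```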

Q4.5 Recognise the value of your data

Integrated, trustworthy and coherent datasets are very valuable, even without use in an AI/ML solution. Consider how the dataset you are producing for the project can be used in other business functions, perhaps replacing less well maintained or more inconsistent / outdated data sources. This may be a key value proposition of your project. Where can you find uses for your data?

Additional context & tips

Actively explore additional use-cases for the data you will produce in your project. Shop the data around your stakeholders and other colleagues to see who is inspired by it. Tell people it's available - put it in your internal newsletter!

In our experience, once the data for an AI/ML project has been created, other users and uses appear, emailing and phoning to ask for access to it. It will be better than their existing data, because you're forced to find the most trustworthy sources, link them together, clean up the bad values, and repeat the exercise consistently. Often, the data becomes the first early value delivered by an AI/ML project. Make sure you plan to leverage it.