Biases in Data
We work with data in a variety of different ways throughout our organisation. Some people are consumers of data and in particular data that is the output of various data analytics, machine learning or artificial intelligence applications. Being a consumer of data from these applications we (easily) made the assumption that the data used is correct and the results being presented to us (in various forms) is correct.
But all too often we hear about some adjustments being made to the data or the processing to correct “something” that was discovered. One the these “something” can be classified as a Data Bias. This kind of problem has been increasing in importance over the past couple of years. Some of this importance has been led by the people involved in creating and process this data discovering certain issues or “something” in the data. Some has been identified by the consumer when the discover “something” odd or unusual about the data. This list could get very long, but another aspect is with the introduction of EU GDPR, there is now a legal aspect to ensuring no data biases exist. Part of the problem with EU GDPR, in this aspect, is it is very vague on what is required. This in turn has caused some confusion on what is required of organisations and their staff. But with the arrival of the EU AI Regulations there is a renewed focus on identifying and addressing Data Bias. With the EU AI Regulations there is a requirement that Data Bias is addressed at each step when data is collected, processed and generated.
The following list outlines some of the typical Data Bias scenarios you or you organisation may encounter.
- Definition bias: Occurs when someone words or phrases a problem or description of data based on their own requirements, rather than based on the organisational or domain definitions. This can lead to misleading results or when commencing an analytics project can lead the project is a specific (biased) direction
- Sample bias: This occurs when the dataset created for input to the analytics or machine learning does not reflect the data from the original data sources. The sampling method used fails to attain true randomness before selection This can result in models having lower accuracy with certain sub-groups of the data (i.e. Customers) which might not have been included or under-represented in the sampled dataset. Sometimes this type of bias is referred to as selection bias.
- Measurement bias: This occurs when data collected for training differs from that collected in the original data sources. It can also occur when incorrect measurements or calculations are applied to the data. An example of this bias occurs with inconsistent annotation labeling and/or with re-coding of data to give incorrect or misleading meaning.
- Selection bias: This occurs when the dataset created for analytics is not large enough or representative enough to include all possible data combinations. This can occur due to human or algorithmic data processing biases. Sample bias plays a sub-role within Selection bias. This can happen at both record and attribute/feature selection levels. Selection bias is sometimes referred to as Exclusion bias, as certain data is excluded by the whoever is creating the dataset.
- Recall bias: This bias arises when labels (target feature) are inconsistently given based on subjective observations. This results in lower accuracy.
- Observer bias: This is the effect of seeing what you expect to see or want to see in data. The observers have subjective thoughts about their study, either conscious or unconscious. This leads to incorrectly labelled or recorded data. For example, two data scientist give different labels for an event. Their labeling is based on the subjective thoughts rather than following provided guidelines or seeking verification for their decisions. Sometimes this type of bias is referred to as Confirmation bias.
- Racial & Gender bias & Similar: Racial bias occurs when data skews in favor of particular demographics. Similar scenarios can occur for gender and other similar types of data. For example, facial recognition fails to recognize people of color as these have been under represented in the training datasets.
- Minority bias: This is similar to the previous Racial and Gender bias. This occurs when a minority group(s) are excluded from the dataset.
- Association bias: This occurs when the data reinforces or multiplies a cultural bias. Your dataset may have a collection of jobs in which all men have job X and all women have job Y. A machine learning model built using this data will preclude women from job X and men from job Y. Association bias is known for creating gender bias.
- Algorithmic bias: Occurs when the algorithm is selective on what data it uses to create a model for the data and problem. Extra validation checks and testing is needed to ensure no additional biases have been created and no biases (based on the previous types above) have been amplified by the algorithm.
- Reporting bias: Occurs when only a selection of results or outcomes are present. The person preparing the data is selective on what information they share with others. This typically leads to under reporting of certain, and somethings important, information.
- Confirmation bias: Occurs when the data/results are interpreted favoring information that confirms previously existing beliefs.
- Response / Non-Response bias: Occurs when results from surveys can be considered misleading based on the questions asked and subset of population who responded to the survey. If 95% of respondents said they link surveys, then is misleading. The quality and accuracy of the data will be poor in such situations