Missing values
Valid data values can be missing for many different reasons, namely:
- Invalid data: data entry errors, out-of-range values (errors made when collecting or entering the data), etc.
- Non-availability of data:
  - Data not collected or not published. For instance, many socioeconomic indicators are not available for
    micro-countries (like Andorra, Monaco, ...) or are missing for a particular year.
  - By design (data collection): e.g. in a national survey with some questions addressed specifically to women,
    men will have missing values for these questions; or some data are available only for developed countries
    in a data set that covers all countries in the world.
  - In a survey or experiment: recorded answers such as "refuse to answer" or "don't know"; measurement failure
    of an instrument, etc.
- Manipulation errors, logical errors or side effects when creating derived (transformed) variables. Examples are
  (1) forgetting some values when recoding a variable into a new one, (2) building composite scales from many
  variables, some of which have missing values, (3) computing residuals from a regression on many variables
  that have missing values, etc.
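Two of the manipulation side effects listed above, an incomplete recode and a composite scale built from items with gaps, can be illustrated with a minimal Python sketch. The variable names, the category labels and the use of `None` as a missing marker are assumptions made for this illustration; real statistics packages use their own missing-value markers.

```python
# Hypothetical survey answers (invented for illustration).
education = ["primary", "secondary", "tertiary", "secondary", "vocational"]

# (1) Incomplete recode: "vocational" was forgotten in the mapping,
# so dict.get() silently returns None (missing) for that case.
recode_map = {"primary": 1, "secondary": 2, "tertiary": 3}
education_level = [recode_map.get(value) for value in education]

# (2) Composite scale: in this sketch the sum of three items is only
# defined when all three items are present.
items = [
    (4, 5, 3),
    (2, None, 4),  # one item is missing for this respondent
    (5, 5, 5),
]

def scale_score(row):
    """Return the sum of the items, or None if any item is missing."""
    if any(item is None for item in row):
        return None
    return sum(row)

scale = [scale_score(row) for row in items]

print(education_level)  # [1, 2, 3, 2, None]
print(scale)            # [12, None, 15]
```

In both cases the derived variable has more missing values than a quick look at the source variables suggests, which is why it is worth counting missing values again after each transformation.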
Analysis
Data analysis software has mechanisms for declaring specific values (either user-defined or system-defined)
as missing. Such values are handled differently from valid values; usually they are automatically excluded
from the analysis.
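The idea of user-defined missing values can be sketched in a few lines of plain Python. The codes `-99` and `-98`, the variable `raw_income` and the helper `valid_values` are hypothetical and only stand in for what a statistics package does internally once such codes have been declared as missing.

```python
# Assumed user-defined missing codes (hypothetical):
# -99 = "refuse to answer", -98 = "don't know".
MISSING_CODES = {-99, -98}

# Invented income data containing two missing codes.
raw_income = [2500, -99, 3100, 4000, -98, 2800]

def valid_values(values, missing_codes=MISSING_CODES):
    """Keep only values that are neither None nor declared as missing."""
    return [v for v in values if v is not None and v not in missing_codes]

valid = valid_values(raw_income)

# Naive mean: the codes are wrongly treated as real incomes.
naive_mean = sum(raw_income) / len(raw_income)
# Valid mean: declared missing values are excluded first.
valid_mean = sum(valid) / len(valid)

print(len(valid))            # 4
print(round(naive_mean, 1))  # 2033.8
print(valid_mean)            # 3100.0
```

If the codes are not declared as missing, they enter the computation as ordinary numbers and distort the result, as the difference between the two means shows.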
Statistically speaking
In the statistical literature, the following types of missing values are distinguished:
- Missing completely at random (MCAR): whether a value is missing is independent both of observable variables
  and of unobservable variables of interest to the researcher.
  This means that the valid values are not biased in any way, as the missing values are in no way related
  to the measurement of the variable of interest. This assumption is usually impossible to verify.
- Missing at random (MAR): the missingness is not random, but it can be accounted for by observed variables,
  e.g. persons with higher levels of education being less likely to admit that they are not well informed than
  respondents with a lower level of education. Another example is that males are less likely to fill in a
  depression survey (this has nothing to do with their level of depression, after accounting for maleness).
  Again, this is an assumption that is impossible to verify statistically; we have to rely on its substantive
  plausibility.
- Missing not at random (MNAR), also called nonignorable nonresponse: the missingness is related to what
  you wish to measure. The example used in the literature is that of men failing to fill in a depression
  survey because they are depressed. Other examples are persons who are not interested in politics not
  answering questions related to politics, or persons with a low educational level not understanding some
  questions.
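The practical difference between the three types can be made visible with a small simulation, loosely following the depression survey example. This is only a sketch under invented assumptions: the population size, the score distributions, the response probabilities and the helper names `mean_score` and `respond` are all hypothetical. The comparison with the "true" mean is possible only because the complete data are simulated; with real data the missing scores are unknown.

```python
import random

random.seed(42)  # fixed seed so that the sketch is reproducible

N = 20000

# Toy population; all numbers are invented for illustration.
people = []
for _ in range(N):
    male = random.random() < 0.5
    # Assumption: men score lower on average in this toy population.
    score = random.gauss(45 if male else 55, 10)
    people.append((male, score))

def mean_score(sample):
    """Mean depression score of a list of (male, score) pairs."""
    return sum(score for _, score in sample) / len(sample)

def respond(population, p_missing):
    """Keep each person with probability 1 - p_missing(male, score)."""
    return [(m, s) for m, s in population if random.random() >= p_missing(m, s)]

# MCAR: missingness ignores both sex and score.
mcar = respond(people, lambda m, s: 0.3)
# MAR: missingness depends only on the observed variable (sex).
mar = respond(people, lambda m, s: 0.5 if m else 0.1)
# MNAR: missingness depends on the score itself.
mnar = respond(people, lambda m, s: 0.6 if s > 55 else 0.1)

true_mean = mean_score(people)
mcar_mean = mean_score(mcar)
mar_mean = mean_score(mar)
mnar_mean = mean_score(mnar)

# Under MAR, conditioning on the observed variable removes the bias.
men_true = mean_score([p for p in people if p[0]])
men_mar = mean_score([p for p in mar if p[0]])

print("true mean:     ", round(true_mean, 2))
print("MCAR mean:     ", round(mcar_mean, 2))
print("MAR mean:      ", round(mar_mean, 2))
print("MNAR mean:     ", round(mnar_mean, 2))
print("men, true mean:", round(men_true, 2))
print("men, MAR mean: ", round(men_mar, 2))
```

In this toy setting the MCAR mean should stay close to the true mean. The overall MAR mean drifts because men respond less often, but the mean computed for men only stays close to the true mean for men. The MNAR mean is pulled downwards because high scores are under-represented, and nothing in the observed data alone reveals this.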
Related documents