Data Matrix

To be analysable with statistical tools, data needs to be presented as a *rectangular data matrix*.
Each column in a data matrix contains a variable (an indicator, a measurement, a question in a survey, ...) and each row
an observation (countries [example below], individuals in a survey, subjects in an
experiment, institutions, groups of persons, ...).

Each cell contains a single value for a particular variable and observation, e.g. the GDP per capita for Albania. If the value is not available, the cell must show in some way that the value is missing (a missing value indicator); statistical software skips values marked in this way in computations.

Here's a schematic representation of a data matrix (example with country data):

Country | Continent | ContNum | GDPperCapita | Variable_i | ... | Variable_k |
---|---|---|---|---|---|---|
Afghanistan | Asia | 1 | value | value | value | value |
Albania | Europe | 3 | value | value | value | value |
... | ... | ... | ... | ... | ... | ... |
Switzerland | Europe | 3 | value | value | value | value |
country_n | ... | value | value | value | value | value |

Each column has a name (the variable name) used to refer to it. All values of a particular variable have to be of the same type (numeric, string, ...). In the example above, the country and continent names are strings, while GDP per capita contains numerical information.
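The two requirements just stated (every row has the same columns, and every column holds a single type) can be checked mechanically. The following is a minimal sketch in plain Python, assuming the data matrix is held as a list of dictionaries; the helper names `is_rectangular` and `column_types` are made up for this illustration, and the GDP figures are placeholders, not real statistics.

```python
# Each dictionary is one observation (a country); each key is one variable.
# The GDPperCapita numbers are illustrative placeholders, not real figures.
data_matrix = [
    {"Country": "Afghanistan", "Continent": "Asia", "ContNum": 1, "GDPperCapita": 500.0},
    {"Country": "Albania", "Continent": "Europe", "ContNum": 3, "GDPperCapita": None},
    {"Country": "Switzerland", "Continent": "Europe", "ContNum": 3, "GDPperCapita": 40000.0},
]


def is_rectangular(rows):
    """True if every row has exactly the same variables as the first row."""
    return all(row.keys() == rows[0].keys() for row in rows)


def column_types(rows):
    """Return {variable name: set of type names}, ignoring missing values (None)."""
    types = {}
    for row in rows:
        for name, value in row.items():
            types.setdefault(name, set())
            if value is not None:
                types[name].add(type(value).__name__)
    return types


print(is_rectangular(data_matrix))
print(column_types(data_matrix))
```

A column whose set contains more than one type name would violate the same-type rule. Here `None` plays the role of the missing value indicator and is not counted as a type of its own.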

A second example with survey data:

Id | Language | PolInt | PartyPref | Gender | Age |
---|---|---|---|---|---|
100 | Italian | 1 | 4 | 1 | 28 |
101 | German | 4 | 1 | 1 | 64 |
102 | French | 1 | 11 | 2 | 37 |
103 | German | 3 | -1 | 2 | 41 |
... | ... | ... | ... | ... | ... |
i_n | ... | value | value | value | value |

Data structure

Survey data is usually coded data, i.e. the textual answers to questions are translated into codes. For example, high political interest (PolInt) is coded as 1 and low interest as 4; a 4 on PartyPref records the answer of someone who prefers the Socialist Party. Age simply records the age of the person.

If a person does not answer a question (don't know, refuses to answer, etc.), a specific code is used. In the example, the interviewee with Id 103 did not select one of the parties for PartyPref, so a code of -1 has been entered for the missing answer. You then need to instruct the statistical software to treat -1 as missing for that variable, i.e. to exclude it from statistical computations.
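What "declaring -1 as missing" amounts to can be sketched in plain Python using the four survey rows from the table. This is an illustration only, not how any particular statistics package implements it; the helper names `recode_missing` and `valid_values` are hypothetical.

```python
from collections import Counter

# The four survey rows shown above; -1 on PartyPref stands for "no answer".
survey = [
    {"Id": 100, "Language": "Italian", "PolInt": 1, "PartyPref": 4, "Gender": 1, "Age": 28},
    {"Id": 101, "Language": "German", "PolInt": 4, "PartyPref": 1, "Gender": 1, "Age": 64},
    {"Id": 102, "Language": "French", "PolInt": 1, "PartyPref": 11, "Gender": 2, "Age": 37},
    {"Id": 103, "Language": "German", "PolInt": 3, "PartyPref": -1, "Gender": 2, "Age": 41},
]


def recode_missing(rows, variable, missing_codes):
    """Return a copy of rows where the given codes of one variable become None."""
    return [
        {**row, variable: None if row[variable] in missing_codes else row[variable]}
        for row in rows
    ]


def valid_values(rows, variable):
    """Values of one variable with missing entries (None) skipped."""
    return [row[variable] for row in rows if row[variable] is not None]


# Declare -1 as missing for PartyPref only; other variables are left untouched.
cleaned = recode_missing(survey, "PartyPref", {-1})

party_counts = Counter(valid_values(cleaned, "PartyPref"))
print(len(valid_values(cleaned, "PartyPref")))  # number of valid answers
print(party_counts)                             # frequency of each party code
```

The missing code is declared per variable because the same number can be a legitimate value elsewhere. Without the recoding step, -1 would be counted as if it were a party.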

The rectangular data matrix is mandatory for statistical analysis; if data is presented in a different way it has to be restructured first to produce a rectangular data matrix.
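As one illustration of such restructuring, suppose the data arrives as a list with one record per country and indicator pair rather than one row per country. The sketch below pivots such records into a rectangular matrix. The input layout, the `to_data_matrix` helper, the `Population` indicator and all values are assumptions made for this example.

```python
# "Long" layout: one record per (country, indicator) pair.
# Indicator names and values are hypothetical placeholders.
long_records = [
    ("Albania", "GDPperCapita", 1.0),
    ("Albania", "Population", 2.0),
    ("Switzerland", "GDPperCapita", 3.0),
]


def to_data_matrix(records):
    """Pivot (observation, variable, value) records into rectangular rows."""
    variables = sorted({variable for _, variable, _ in records})
    by_observation = {}
    for observation, variable, value in records:
        by_observation.setdefault(observation, {})[variable] = value
    # Every row gets every variable; combinations never recorded become None.
    return [
        {"Country": observation, **{v: values.get(v) for v in variables}}
        for observation, values in by_observation.items()
    ]


matrix = to_data_matrix(long_records)
for row in matrix:
    print(row)
```

After the pivot every row has the same columns, and the indicator that was never recorded for Switzerland shows up as an explicit missing value instead of an absent record.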

Documenting the data matrix

A data matrix needs to be documented to be meaningful in any analysis.

Statistical software provides a way of describing and documenting the data matrix. More specifically:

- Variables can have explanatory labels. While GDPperCapita is to some extent self-explanatory, it is incomplete and does not make a good title when shown in a displayed or printed table. A more informative label for that variable could be "GDP per capita for 2010 (constant US$ base 2000) [Source: World Bank]".
- Categorical variables, such as ContNum in the first example and PartyPref in the second, contain numerical codes. Statistical software provides a way to document each code (value), e.g. "Asia" for code 1. Software like SPSS lets you add labels to the values (codes) of a variable; the labels are then displayed alongside the codes, so that their meaning is clear.
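The two kinds of labels can be pictured as simple lookup tables kept alongside the data matrix. The sketch below is plain Python and not the mechanism of SPSS or any other package; the `describe` helper is hypothetical, and only codes whose meaning is stated in the text are given a label.

```python
# Variable labels: a longer description for each variable name.
variable_labels = {
    "GDPperCapita": "GDP per capita for 2010 (constant US$ base 2000) [Source: World Bank]",
}

# Value labels: the meaning of each code, per variable.
# Only codes mentioned in the text are listed; all others stay undocumented.
value_labels = {
    "ContNum": {1: "Asia", 3: "Europe"},
    "PolInt": {1: "High political interest", 4: "Low political interest"},
    "PartyPref": {4: "Socialist Party", -1: "No answer (missing)"},
}


def describe(variable, code):
    """Return the label for a code, falling back to the bare code if undocumented."""
    return value_labels.get(variable, {}).get(code, str(code))


print(variable_labels["GDPperCapita"])
print(describe("ContNum", 1))     # documented code
print(describe("PartyPref", 11))  # undocumented code: shown as the bare number
```

The fallback for undocumented codes shows why complete value labels matter: without them a table can only display the bare number.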

In addition to documenting the data with descriptive labels, it is essential to document the data source (who provided, collected or produced the data, how it was produced or collected, etc.). Information on data quality and the measurement process is also important, as are, if the data is a sample, the details of the sampling procedure, the sample size and non-response. For a survey, the questionnaires need to be available in all languages used for interviewing.

See also