Data Matrix

To be analysable with statistical tools, data needs to be presented as a rectangular data matrix. Each column in a data matrix contains a variable (an indicator, a measurement, a question in a survey, ...) and each row an observation (countries, as in the example below, individuals in a survey, subjects in an experiment, institutions, groups of persons, ...).

Each cell contains a single value for a particular variable and observation, e.g. the GDP per capita for Albania. If the value is not available, the cell has to show in some way that the value is missing (missing value indicator); statistically oriented software will then skip such values in computations.

Here's a schematic representation of a data matrix (example with country data):


Each column has a name (the variable name) used to refer to it. All values of a particular variable have to be of the same type (numeric, string, ...). In the country example, the country and continent names are strings, and GDP per capita contains numerical information.
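As a minimal sketch of these ideas, a data matrix can be held in plain Python as a list of rows, one dictionary per observation. The variable names, the use of `None` as missing value indicator, the helper functions, and all figures below are assumptions made for illustration; the numbers are placeholders, not real statistics.

```python
from statistics import mean

# One dict per observation (row); the keys are the variable names (columns).
# All figures are made-up placeholders. None marks a missing value.
rows = [
    {"country": "Albania", "continent": "Europe", "gdp_per_capita": 5000.0},
    {"country": "Brazil", "continent": "South America", "gdp_per_capita": 9000.0},
    {"country": "Chad", "continent": "Africa", "gdp_per_capita": None},
]


def column(rows, name):
    """Return all values of one variable, in row order."""
    return [row[name] for row in rows]


def column_type(rows, name):
    """Return the single type shared by the non-missing values of a variable.

    Raises TypeError if the column mixes types or has no valid value at all.
    """
    types = {type(v) for v in column(rows, name) if v is not None}
    if len(types) != 1:
        found = sorted(t.__name__ for t in types)
        raise TypeError(f"{name!r} does not have exactly one type: {found}")
    return types.pop()


def mean_skipping_missing(rows, name):
    """Mean of a numeric variable, ignoring cells marked as missing."""
    return mean(v for v in column(rows, name) if v is not None)


print(column_type(rows, "country").__name__)
print(mean_skipping_missing(rows, "gdp_per_capita"))
```

Statistical packages perform the equivalent of `mean_skipping_missing` internally once a value is marked as missing; the sketch only makes that step visible.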

A second example with survey data:

id               PolInt  PartyPref       Age
100   Italian    1        4          1   28
101   German     4        1          1   64
102   French     1       11          2   37
103   German     3       -1          2   41

Data structure

Survey data is usually coded data, i.e. the textual answers to questions are replaced by codes. For example, high political interest (PolInt) is coded as 1 and low interest as 4; a 4 on PartyPref records the answer of someone preferring the Socialist Party. Age simply records the age of the person.

If a person does not answer a question (don't know, refuses to answer, etc.), a specific code is used. In the example, interviewee id=103 did not select one of the parties for PartyPref, so a code of -1 has been entered for the missing answer. You then need to instruct the statistical software to treat -1 as missing for that variable, i.e. to exclude it from statistical computations.
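The idea of declaring a code as missing can be sketched in plain Python. The `MISSING_CODES` table and the `valid_values` helper are hypothetical names introduced here, not part of any statistical package, and the rows are illustrative values modelled on the survey example.

```python
from collections import Counter

# Illustrative survey rows using the variable names from the text.
# -1 is the code that was entered for a missing answer on PartyPref.
survey = [
    {"id": 100, "PolInt": 1, "PartyPref": 4, "Age": 28},
    {"id": 101, "PolInt": 4, "PartyPref": 1, "Age": 64},
    {"id": 102, "PolInt": 1, "PartyPref": 11, "Age": 37},
    {"id": 103, "PolInt": 3, "PartyPref": -1, "Age": 41},
]

# Hypothetical declaration: which codes count as missing for which variable.
# The declaration is per variable, so -1 stays a valid value elsewhere.
MISSING_CODES = {"PartyPref": {-1}}


def valid_values(rows, name, missing_codes=MISSING_CODES):
    """Return the values of one variable, leaving out declared missing codes."""
    codes = missing_codes.get(name, set())
    return [row[name] for row in rows if row[name] not in codes]


# Frequency count of party preferences; the answer of id=103 is excluded.
print(Counter(valid_values(survey, "PartyPref")))
```

Keeping the missing codes in a separate declaration, rather than deleting rows, means the other answers of interviewee 103 (PolInt, Age) remain available for analysis.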

The rectangular data matrix is mandatory for statistical analysis; if data is presented in a different way it has to be restructured first to produce a rectangular data matrix.
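One common case of restructuring is data delivered as a list of separate (observation, variable, value) records. The sketch below turns such records into a rectangular matrix; the record layout, the function name `to_data_matrix`, and the values are assumptions chosen for illustration.

```python
# Records in "long" form: (observation id, variable name, value).
# Illustrative values; PolInt was never recorded for observation 102.
records = [
    (100, "PolInt", 1),
    (100, "Age", 28),
    (101, "PolInt", 4),
    (101, "Age", 64),
    (102, "Age", 37),
]


def to_data_matrix(records, missing=None):
    """Build one row per observation with a cell for every variable.

    Cells without a record are filled with the `missing` marker so that
    the result is rectangular.
    """
    variables = []  # variable names in order of first appearance
    by_id = {}      # observation id -> {variable: value}
    for obs_id, var, value in records:
        if var not in variables:
            variables.append(var)
        by_id.setdefault(obs_id, {})[var] = value
    matrix = [
        {"id": obs_id, **{var: cells.get(var, missing) for var in variables}}
        for obs_id, cells in by_id.items()
    ]
    return variables, matrix


variables, matrix = to_data_matrix(records)
for row in matrix:
    print(row)
```

Every row of the result has the same set of variables, which is what makes the matrix rectangular.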

Documenting the data matrix

A data matrix needs to be documented to be meaningful in any analysis.

Statistical software provides ways of describing and documenting the data matrix, typically by attaching descriptive labels to the variables and to the coded values.
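One way to picture such documentation is a small codebook kept alongside the data. The structure below is a hypothetical sketch, not the format of any particular package; only the codes explained in the survey example are labelled, and the descriptive labels themselves are assumed wording.

```python
# Hypothetical codebook: a descriptive label per variable, labels for the
# coded values that are known, and the codes to be treated as missing.
CODEBOOK = {
    "PolInt": {
        "label": "Political interest",
        "values": {1: "High political interest", 4: "Low political interest"},
        "missing": set(),
    },
    "PartyPref": {
        "label": "Party preference",
        "values": {4: "Socialist Party"},
        "missing": {-1},
    },
    "Age": {
        "label": "Age of the person",
        "values": {},  # not coded: the number is the value itself
        "missing": set(),
    },
}


def describe(variable, code, codebook=CODEBOOK):
    """Return a readable description of one cell of the data matrix."""
    entry = codebook[variable]
    if code in entry["missing"]:
        return f"{entry['label']}: (missing)"
    # Fall back to the raw code when no value label is documented.
    text = entry["values"].get(code, str(code))
    return f"{entry['label']}: {text}"


print(describe("PartyPref", 4))
print(describe("PartyPref", -1))
print(describe("Age", 28))
```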

In addition to documenting the data using descriptive labels, it is essential to document the data source (who has provided, collected or produced the data, how it has been produced or collected, etc.). Information on data quality and the measurement process is also important, as are, if the data is a sample, the details of the sampling procedure, the sample size and the non-responses. For a survey, the questionnaires need to be available in all languages used for interviewing people.

See also