Data Matrix

To be analysable with statistical tools, data needs to be presented as a rectangular data matrix. Each column in a data matrix contains a variable (indicator, measurement, question in a survey, ...) and each row an observation (countries [example below], individuals in a survey, subjects in an experiment, institutions, groups of persons, ...).

Each cell contains a single value for a particular variable and observation, e.g. the GDP per capita for Albania. If the value is not available, the cell must show in some way that the value is missing (missing-value indicator); statistically oriented software typically excludes such values from computations.
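As a minimal sketch of how missing cells are left out of a computation, the following Python snippet stores one variable as a list in which `None` plays the role of the missing-value indicator. The variable name and the figures are made up for illustration, and `None` is only one possible convention; each statistical package uses its own marker.

```python
from statistics import mean

# Made-up values of one variable, one entry per observation (not real statistics).
# None stands in for the missing-value indicator.
gdp_per_capita = [520.0, None, 4100.0, 81000.0]

# Keep only the observations for which a value is available.
available = [v for v in gdp_per_capita if v is not None]

print(len(available))   # number of valid observations
print(mean(available))  # mean computed over the valid observations only
```

Note that the number of valid observations (3) differs from the number of rows (4), which is why statistical output usually reports both.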

Here's a schematic representation of a data matrix (example with country data):

Country      Continent  ContNum  GDPperCapita  Variable_i  ...    Variable_k
Afghanistan  Asia       1        value         value       value  value
Albania      Europe     3        value         value       value  value
...          ...        ...      ...           ...         ...    ...
Switzerland  Europe     3        value         value       value  value
country_n    ...        value    value         value       value  value

Each column has a name (variable name) used to refer to it. All values of a particular variable have to be of the same type (numeric, string, ...). In the example above, the country and continent names are strings, and GDP per capita contains numerical information.
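The rule that all values of a variable share one type can be checked mechanically. The sketch below is one possible way to do it in plain Python, assuming each row is stored as a dictionary keyed by variable name and `None` marks a missing cell; the helper names `column_types` and `mixed_columns` are hypothetical, and the GDP figures are placeholders rather than real data.

```python
# Rows of a small data matrix; the GDP figures are placeholders, not real data.
rows = [
    {"Country": "Afghanistan", "Continent": "Asia", "ContNum": 1, "GDPperCapita": 500.0},
    {"Country": "Albania", "Continent": "Europe", "ContNum": 3, "GDPperCapita": 5000.0},
    {"Country": "Switzerland", "Continent": "Europe", "ContNum": 3, "GDPperCapita": None},
]

def column_types(rows):
    """Return, for each variable, the set of type names found in its non-missing cells."""
    types = {}
    for row in rows:
        for name, value in row.items():
            if value is not None:  # missing cells carry no type information
                types.setdefault(name, set()).add(type(value).__name__)
    return types

def mixed_columns(rows):
    """Return the names of variables whose cells do not all share one type."""
    return sorted(name for name, found in column_types(rows).items() if len(found) > 1)

print(column_types(rows))
print(mixed_columns(rows))  # an empty list means every column is homogeneous
```

A column that mixes, say, the number 1 and the string "1" would be reported by `mixed_columns` and should be cleaned before analysis.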

A second example with survey data:

Id   Language  PolInt  PartyPref  Gender  Age
100  Italian   1       4          1       28
101  German    4       1          1       64
102  French    1       11         2       37
103  German    3       -1         2       41
...  ...       ...     ...        ...     ...
i_n  ...       value   value      value   value

Data structure

Survey data is usually coded data, i.e. the textual answers to questions are replaced by codes. For example, high political interest (PolInt) is coded as 1 and low interest as 4; a 4 on PartyPref records the answer of someone preferring the Socialist Party. Age simply records the age of the person.

If a person does not answer a question (don't know, refuses to answer, etc.), a specific code is used. In the example, interviewee id=103 did not select one of the parties for PartyPref, so a code of -1 has been entered for the missing answer. You will then need to instruct the statistical software to treat -1 as missing for that variable, i.e. not to include it in statistical computations.
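What "instructing the software" amounts to can be sketched in a few lines of Python. The example assumes rows stored as dictionaries and uses `None` as the internal missing marker; `MISSING_CODES` and `apply_missing` are hypothetical names, not part of any statistical package. The three interviewees are taken from the survey example.

```python
# Three interviewees from the survey example (ids 100, 101 and 103).
survey = [
    {"Id": 100, "PolInt": 1, "PartyPref": 4, "Age": 28},
    {"Id": 101, "PolInt": 4, "PartyPref": 1, "Age": 64},
    {"Id": 103, "PolInt": 3, "PartyPref": -1, "Age": 41},
]

# Hypothetical declaration: which codes count as missing for which variable.
MISSING_CODES = {"PartyPref": {-1}}

def apply_missing(rows, missing_codes):
    """Return copies of the rows with declared missing codes replaced by None."""
    cleaned = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            is_missing = value in missing_codes.get(name, set())
            new_row[name] = None if is_missing else value
        cleaned.append(new_row)
    return cleaned

cleaned = apply_missing(survey, MISSING_CODES)
valid_prefs = [r["PartyPref"] for r in cleaned if r["PartyPref"] is not None]
print(valid_prefs)  # [4, 1]
```

The codes are declared per variable because the same number can be a legitimate value elsewhere; a blanket rule treating every -1 as missing would be unsafe.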

The rectangular data matrix is mandatory for statistical analysis; if data is presented in a different way it has to be restructured first to produce a rectangular data matrix.
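As an illustration of such restructuring, the sketch below converts data delivered in "long" form (one line per observation, variable and value) into a rectangular matrix. The input layout and the function name `to_matrix` are assumptions made for this example; real data can arrive in many other shapes.

```python
# Hypothetical input in "long" form: one (observation, variable, value) triple per line.
long_records = [
    ("Albania", "Continent", "Europe"),
    ("Albania", "ContNum", 3),
    ("Afghanistan", "Continent", "Asia"),
    ("Afghanistan", "ContNum", 1),
    ("Switzerland", "Continent", "Europe"),  # ContNum not supplied for this observation
]

def to_matrix(records, id_name="Country"):
    """Build a rectangular matrix: one row per observation, one column per variable."""
    variables = []  # column order follows the order of first appearance
    by_obs = {}     # observation id -> {variable: value}
    for obs, var, value in records:
        if var not in variables:
            variables.append(var)
        by_obs.setdefault(obs, {})[var] = value
    header = [id_name] + variables
    # Cells without a value become None so that every row has the same length.
    body = [[obs] + [cells.get(v) for v in variables] for obs, cells in by_obs.items()]
    return header, body

header, body = to_matrix(long_records)
print(header)
for row in body:
    print(row)
```

Every row of the result has the same number of cells as the header, and the cell that had no input record is filled with the missing marker rather than being dropped.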

Documenting the data matrix

A data matrix needs to be documented to be meaningful in any analysis.

Statistical software provides a way of describing and documenting the data matrix, typically by attaching descriptive labels to it.

In addition to documenting the data using descriptive labels, it is essential to document the data source (who has provided, collected or produced the data, how it has been produced or collected, etc.). Information on data quality and the measurement process is also important, as are, if the data is a sample, details of the sampling procedure, the sample size and non-responses. For a survey, the questionnaires need to be available in all languages used for interviewing.
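One way to keep this documentation next to the data is a small codebook. The sketch below is a hypothetical structure in plain Python, not the format of any particular statistical package; only the codes mentioned in the survey example are labelled, and the source entry is a placeholder to be filled in.

```python
# Hypothetical codebook for the survey example. Only codes mentioned in the
# text are labelled; the source field is a placeholder, not real information.
codebook = {
    "source": "<who collected the data, how, and when>",
    "variables": {
        "PolInt": {
            "label": "Political interest",
            "values": {1: "High interest", 4: "Low interest"},
            "missing": set(),
        },
        "PartyPref": {
            "label": "Party preference",
            "values": {4: "Socialist Party"},
            "missing": {-1},
        },
        "Age": {"label": "Age of the person", "values": {}, "missing": set()},
    },
}

def describe(variable, code):
    """Translate a stored code into a readable description using the codebook."""
    entry = codebook["variables"][variable]
    if code in entry["missing"]:
        return f"{entry['label']}: missing"
    text = entry["values"].get(code, str(code))  # unlabelled codes are shown as-is
    return f"{entry['label']}: {text}"

print(describe("PolInt", 1))      # Political interest: High interest
print(describe("PartyPref", -1))  # Party preference: missing
print(describe("Age", 28))        # Age of the person: 28
```

Keeping variable labels, value labels and missing codes in one place means tables and charts can show readable text instead of bare codes.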

See also