Raw data (also called text data or similar) is stored in a format that is completely independent of any software and can be edited with a simple text editor. Normally a raw data file contains only the data and no information about the data, such as variable names or descriptive information. Some additional information is therefore needed to create a correct data matrix, namely information on the format of the data, or more precisely, where the data is located in the data file.

This document only covers simple data structures, i.e. rectangular data matrices (variables by observations). For more complex structures see

What does your data look like?
Fixed format

Data values for all variables for each observation are found in exactly the same column positions (in this example there are obviously two lines for each observation).

CANADA                   1CNDA
 15 14312 2140633.6 2.06.3
BAHAMAS                  1BHMS
 35999999   17099.099.09.0
CUBA                     1CUBA
 27  5033  856599.0 6.17.5
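Reading a fixed format file means cutting each line at known column positions. The following is a minimal Python sketch, not a definitive reader: the column positions and the variable names are assumptions inferred from the three sample records and from the nine column names used in the CSV header example of this document, and the function name `parse_fixed` is made up for illustration. In practice the positions come from the documentation that accompanies the data file.

```python
# Column positions (start, end) are assumptions inferred from the sample
# records; a real data file needs its own format description.
LINE1_FIELDS = {"cname": (0, 25), "continent": (25, 26), "cntry": (26, 30)}
LINE2_FIELDS = {
    "infmort": (0, 3),
    "adultpop": (3, 9),
    "totalpop": (9, 15),
    "expgov": (15, 19),
    "expmil": (19, 23),
    "expeduc": (23, 26),
}


def parse_fixed(lines):
    """Turn pairs of fixed format lines into one dict per observation."""
    records = []
    # Each observation occupies two consecutive lines.
    for first, second in zip(lines[0::2], lines[1::2]):
        record = {}
        for name, (start, end) in LINE1_FIELDS.items():
            record[name] = first[start:end].strip()
        for name, (start, end) in LINE2_FIELDS.items():
            record[name] = float(second[start:end])
        records.append(record)
    return records


# The first line of each record is padded to 25 characters so that the
# column positions are explicit in this sketch.
sample = [
    "CANADA".ljust(25) + "1CNDA",
    " 15 14312 2140633.6 2.06.3",
    "BAHAMAS".ljust(25) + "1BHMS",
    " 35999999   17099.099.09.0",
    "CUBA".ljust(25) + "1CUBA",
    " 27  5033  856599.0 6.17.5",
]

records = parse_fixed(sample)
print(records[0])
```

Note that values may touch each other without any separating blank (as in the second line of each record), which is why the column positions, and not the blanks, define the variables.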
Delimited (Free) formats

Data values for each observation appear on a single line as a sequence of values (always in the same variable order), separated by a delimiter.

Comma separated
CANADA,1,CNDA,15,14312,21406,33.6,2.0,6.3
BAHAMAS, 1,BHMS, 35,999999, 170, 99.0, 99.0,9.0
CUBA,1,CUBA,27, 5033, 8565, 99.0,6.1, 7.5
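Delimited lines can be split with the `csv` module of the Python standard library. The sketch below is one possible approach under stated assumptions: the sample rows are shortened to their first five fields, and the helper names `convert` and `read_delimited` are made up for illustration. The same reader also copes with the quoted and the space separated variants described in this section.

```python
import csv
import io


def convert(value):
    """Return a number where possible, otherwise the original string."""
    try:
        return float(value)
    except ValueError:
        return value


def read_delimited(text, delimiter=","):
    """Split delimited text into rows of converted values."""
    # skipinitialspace=True ignores blanks that follow a delimiter, so
    # "BAHAMAS, 1" and runs of several blanks are handled as well.
    reader = csv.reader(
        io.StringIO(text), delimiter=delimiter, skipinitialspace=True
    )
    return [[convert(value) for value in row] for row in reader if row]


# Sample rows shortened to five fields (an assumption made for brevity).
comma_text = 'CANADA,1,CNDA,15,14312\nBAHAMAS, 1,BHMS, 35,999999\n'
quoted_text = '"CANADA",1,"CNDA",15,14312\n"BAHAMAS", 1,"BHMS", 35,999999\n'
space_text = '"CANADA" 1 "CNDA" 15 14312\n"BAHAMAS"  1 "BHMS"  35 999999\n'

rows = read_delimited(comma_text)
same = (
    rows
    == read_delimited(quoted_text)
    == read_delimited(space_text, delimiter=" ")
)
print(rows[1], same)
```

Quoting matters as soon as a string value itself contains the delimiter, for instance a country name with a blank in a space separated file.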
Comma separated, string values are quoted
"CANADA",1,"CNDA",15,14312,21406,33.6,2.0,6.3
"BAHAMAS", 1,"BHMS", 35,999999, 170, 99.0, 99.0,9.0
"CUBA",1,"CUBA",27, 5033, 8565, 99.0,6.1, 7.5
Space separated, string values are quoted
"CANADA" 1 "CNDA" 15 14312 21406 33.6 2.0 6.3
"BAHAMAS"  1 "BHMS"  35 999999  170  99.0  99.0 9.0
"CUBA" 1 "CUBA" 27  5033  8565 99.0 6.1  7.5
Special: Comma delimited files (CSV) with column names

Some software is able to read a first line that contains a title for each data column (variable names). For spreadsheets this is simply a header that appears at the top of every column; in statistical software it is usually a variable name that, as in the example below, follows strict naming conventions (no spaces, only letters and numbers and possibly some special symbols).

"Cname","Continent","Cntry","Infmort","Adultpop","TotalPop","ExpGov","ExpMil","ExpEduc"
"CANADA",1,"CNDA",15,14312,21406,33.6,2.0,6.3
"BAHAMAS", 1,"BHMS", 35,999999, 170, 99.0, 99.0,9.0
"CUBA",1,"CUBA",27, 5033, 8565, 99.0,6.1, 7.5

Depending on how software is installed on your computer, a program such as Excel may be registered to handle files with the .csv extension automatically, i.e. a double-click on the file name will launch that program (you can tell whether a particular application handles this kind of file by looking at the file icon).
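A header line with column names can also be used directly when reading the file in a program. The following Python sketch uses `csv.DictReader` from the standard library. The sample is shortened to the first five columns, and the naming rule (a letter followed by letters, digits or underscores) is only an assumption, since each statistical package defines its own conventions.

```python
import csv
import io
import re

# Shortened sample: first five columns only (an assumption for brevity).
text = (
    '"Cname","Continent","Cntry","Infmort","Adultpop"\n'
    '"CANADA",1,"CNDA",15,14312\n'
    '"BAHAMAS", 1,"BHMS", 35,999999\n'
    '"CUBA",1,"CUBA",27, 5033\n'
)

# Assumed naming rule; check the rules of the software you actually use.
NAME_RULE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# DictReader takes the first line as the list of column names.
reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
bad_names = [n for n in reader.fieldnames if not NAME_RULE.fullmatch(n)]
rows = list(reader)

print(reader.fieldnames)
print(bad_names)
print(rows[2]["Cname"], rows[2]["Adultpop"])
```

All values arrive as strings and still have to be converted to numbers where appropriate.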

Note that this format is sometimes also used to store tables meant for printing, with titles, totals and subtotals, i.e. not a simple data matrix that you can analyze as such.
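One rough way to spot such a file is to count the fields on every line: in a rectangular data matrix all lines have the same number of fields. The sketch below is a heuristic only. The function names and the small report are hypothetical, and a printed table can still pass the test if its title and total lines happen to have the same number of fields as the data lines.

```python
import csv
import io
from collections import Counter


def field_count_profile(text):
    """Count how many non-empty lines have each number of fields."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return Counter(len(row) for row in reader if row)


def looks_rectangular(text):
    """True if every non-empty line has the same number of fields."""
    return len(field_count_profile(text)) == 1


# Hypothetical examples: a plain data matrix and a table meant for printing.
matrix = "CANADA,1,CNDA\nBAHAMAS,1,BHMS\nCUBA,1,CUBA\n"
report = "Population report\nCountry,Population\nCANADA,21406\nCUBA,8565\n"

print(looks_rectangular(matrix), field_count_profile(matrix))
print(looks_rectangular(report), field_count_profile(report))
```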

Beware

Language-specific conventions might cause trouble, namely

Related documents