Introduction and Overview

EDA Exploration de Données Agrégatives
Exploratory Data Analysis

The acronym that names the program presented here, in both of its readings, reflects the main purpose and underlying principles of this interactive data analysis program. Exploratory data analysis is an approach to data analysis inspired and pioneered by John Tukey's seminal "Exploratory Data Analysis" (Reading, Mass., 1977). The approach is understood here first of all as an attitude: an awareness of methodological problems in the specific context in which they arise. The EDA program is a general interactive tool for data analysis, specialized in the analysis of aggregate data. It grew out of teaching and research in exploratory data analysis on the one hand and multivariate analysis on the other; this experience is reflected in various aspects of the program.

This analysis tool is not a collection of EDA techniques but an environment for exploratory analysis that includes some "standard" techniques. It is interactive and easy to use, with many options left to the more advanced user. Explicit procedures are preferred to automatic ones, which makes it a tool for open-ended exploratory research. The program is designed to make it easy to build implementations for specific applications and to adapt other program components, which can then take advantage of the syntax analysis, data handling and documentation facilities as well as the common library.

A command language built from a few syntactical constructs (keywords, named values, names and variable lists) is used to communicate with the program. The variables being analyzed reside in a work area (WA), which need not be a rectangular data matrix. Variables are referred to by integer numbers or by variable names. A variable descriptor allows for more documentation of variables. Numeric information (e.g. the year of an election) can be stored in names and descriptors and extracted from them. Alphanumeric case identifications (casids) and a numeric grouping variable (column indices) are associated with each work area. Variables can be tied together as "bundles" (variable ties, groups of variables, row indices). The work area may be transposed; labels then become casids, variable ties become group memberships, and vice versa.

Associated with the work area are several matrices in which results are stored and manipulated. An area called MATRIX contains a similarity or dissimilarity matrix (e.g. correlations for a factor analysis). Two matrices called C1 and C2 ("configurations") keep the results of dimensional analyses (e.g. in the case of principal components, C1 contains the factor loadings and C2 the factor scores). These matrices are then used for displays (plots, coded lists), rotation, comparison, etc.
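The transposition rule described above can be pictured with a small Python sketch. It is not EDA's internal representation: the class WorkArea, its field names, the helper year_from_name and all data values are assumptions made up for illustration, and the sketch covers only the rectangular case.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkArea:
    """Hypothetical stand-in for a work area (rectangular case only)."""
    names: List[str]          # variable names, one per column
    casids: List[str]         # alphanumeric case identifications, one per row
    data: List[List[float]]   # data[i][j] = value of variable j for case i
    groups: List[int]         # numeric grouping code, one per case
    ties: List[int]           # bundle ("variable tie") code, one per variable

    def transpose(self) -> "WorkArea":
        """Swap cases and variables, exchanging row and column metadata."""
        flipped = [list(column) for column in zip(*self.data)]
        return WorkArea(
            names=list(self.casids),   # old casids name the new variables
            casids=list(self.names),   # old variable names become casids
            data=flipped,
            groups=list(self.ties),    # old ties become group memberships
            ties=list(self.groups),    # old group codes become variable ties
        )


def year_from_name(name: str) -> Optional[int]:
    """Extract a four-digit number (e.g. an election year) from a name."""
    match = re.search(r"(\d{4})", name)
    return int(match.group(1)) if match else None


# Invented example values, for illustration only.
wa = WorkArea(
    names=["votes_1975", "votes_1979", "turnout_1979"],
    casids=["case_a", "case_b"],
    data=[[41.0, 43.5, 52.0], [38.0, 40.0, 49.5]],
    groups=[1, 1],
    ties=[1, 1, 2],
)
wt = wa.transpose()
```

In this sketch, transposing twice returns a work area equal to the original, which is one way to check that no metadata is lost in the exchange.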

Selections (filters, group-wise analysis) can be activated without altering the current work area.
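One way to picture a selection that leaves the work area untouched is a filter consulted at analysis time, as in the following Python sketch. The class Selection, its methods and the sample rows are hypothetical and do not reproduce EDA's commands or internals.

```python
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, float]


class Selection:
    """Hypothetical filter layered over the cases; it never edits them."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.predicate: Optional[Callable[[Row], bool]] = None

    def activate(self, predicate: Callable[[Row], bool]) -> None:
        self.predicate = predicate

    def deactivate(self) -> None:
        self.predicate = None

    def active_rows(self) -> Iterator[Row]:
        """Yield only the cases that pass the current filter, if any."""
        for row in self.rows:
            if self.predicate is None or self.predicate(row):
                yield row


def mean(selection: Selection, variable: str) -> float:
    """Mean of one variable over the currently selected cases."""
    values = [row[variable] for row in selection.active_rows()]
    return sum(values) / len(values)


# Invented example values, for illustration only.
rows = [
    {"group": 1, "x": 2.0},
    {"group": 1, "x": 4.0},
    {"group": 2, "x": 9.0},
]
sel = Selection(rows)
mean_all = mean(sel, "x")                  # all three cases
sel.activate(lambda row: row["group"] == 1)
mean_group1 = mean(sel, "x")               # only cases in group 1
sel.deactivate()
mean_after = mean(sel, "x")                # all cases again
```

Because the filter is applied on read, deactivating it restores the full set of cases without any copy or restore step.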

"Documents" of any length can be used to describe data. They may be attached to variables, work areas, cases or user-defined concepts; there are provisions for levels of documentation, for searching for strings and for extracting numeric information from documents.

EDA-specific system files are used to store data and documentation. Special attention has been paid to communication with other standard packages, especially SPSS and SAS: EDA produces SPSS or SAS setups for the creation of an SPSS system file or a SAS data step.

EDA has a comprehensive set of arithmetic transformation commands (more than 100 arithmetic and logical functions, including many EDA-oriented functions). Data transformation and correction commands are provided. Arithmetic transformations may be applied to variables (conditionally or unconditionally), as well as to individual cases and matrix elements (a case reference as the target of an arithmetic expression). If the expressions specified yield scalar results, EDA may be used like a calculator. Alterations are always flagged with a modification stamp (an automatic modification of the variable descriptor) and documented by encoding the command used to alter the variable into its descriptor. Some macro capabilities (single-line or multi-line macros), together with scalar variables and control structures, facilitate repetitive tasks.
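The idea of stamping every alteration can be sketched in Python as follows. The class Variable, the function transform, the descriptor layout and the command string "SQRT(area)" are all assumptions made for illustration; they are not EDA syntax or EDA's actual descriptor format.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Variable:
    """Hypothetical variable with a descriptor and a modification flag."""
    name: str
    values: List[float]
    descriptor: str = ""
    modified: bool = False


def transform(
    var: Variable,
    func: Callable[[float], float],
    command: str,
    where: Optional[Callable[[float], bool]] = None,
) -> Variable:
    """Apply func to the values (optionally only where the condition holds),
    set the modification flag and record the command in the descriptor."""
    var.values = [
        func(v) if where is None or where(v) else v for v in var.values
    ]
    var.modified = True
    var.descriptor = f"{var.descriptor} [modified: {command}]".strip()
    return var


# Invented example values; the command text is a placeholder.
area = Variable("area", [1.0, 4.0, 9.0], "surface area")
transform(area, math.sqrt, "SQRT(area)")
```

Keeping the command text next to the data means a later reader of the variable can see how it was derived without consulting a separate log.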

The results of an EDA session may be placed into a print file, either completely or selectively. Additional options and a text editor (also used to edit documents and macros) facilitate the transfer of the results to a text formatter.

Some of the program components are adaptations of already published programs (McNeil, Interactive Data Analysis, New York, 1977; Anderberg, Cluster Analysis for Applications, New York, 1973). Implemented analysis commands include: displaying and summarizing single variables or groups of variables; a powerful plotting command (including case identification plots); bivariate and multivariate regression (Tukey); a smoothing program; simple display maps (stored in a map archive or user-defined); analyses of groups of cases; and profiles of cases. Multivariate analysis includes principal component analysis (with factor scoring, rotation and a general configuration manipulation routine), a series of hierarchical and non-hierarchical clustering techniques, T-Scale, correspondence analysis, MINISSA, MDS and canonical analysis.

EDA also contains a toolbox with many tools that are useful when working with data or text. Depending on your operating system, these tools will be more or less useful to you. They are not necessarily tied to work with EDA.

The user can also get on-line assistance on general concepts and on the syntax of each command. For ease of use there is also a command line editor that lets the user correct the current command and recall previous commands. More information on the EDA approach can be found in [Horber, 1980].