Exploration

Introduction

This section describes the data analysis tools, except the commands used to perform multidimensional analysis and the commands used to do cluster analysis, which are explained in detail in the following sections.

This section describes the following analysis commands:
ADDFIT fits additive relation to table
BOXPLOT box and whisker plot
BREAK (cross) break of variables
COMPARE compare batches
DIAGNOSTIC diagnostic routines
DISPLAY basic statistics
DLINE density line
FREQUENCY Frequencies
HISTOGRAM histogram
LINE Tukey-line, and LSQ line
LIST listing cases
LOWESS scatterplot smoothing
MAP mapping facility
MARCOM MARginal COMparisons
MDIAG multivariate diagnostic routines
PLOT plotting facility
PI see plot command
PROFILE profiles of cases
REGRESS biweight multiple regression
REEXPRESS reexpressions
QSUMMARY quick summary
SHOW list cases (conditional)
SMOOTH smoothing
STEMLEAF stem and leaf display
SUMMARY summaries and letter values
TRACES display boxplots by groups
XTAB crosstabulation

Getting started

This chapter presents, in alphabetical order, the data exploration commands used to describe single variables, compare variables, and study relationships between two or more variables.

Before reading on you should know: How to write a command line and especially how to specify a variable list.

Data for your exploration

Before issuing any of the commands shown below, you need to bring data into the work area (WA), i.e. the data matrix containing the variables you want to analyze. The data may be stored as an EDA data set, stored as raw data in a file, or only written down on a sheet of paper. EDA files are read using the GET command.

The *READ RAW command can be used to read data from a raw data file, e.g. a file written by another piece of software in a form that most other software can read. In many programs you produce such a file with a command for exporting data or writing ASCII files. Finally, the NEWVAR command lets you enter data at the keyboard.

If you want to learn how EDA works, without worrying about reading data, try first to issue a GET DEMO command, which should bring a demonstration data set into your WA. Note that this might not work, if no data set called DEMO is accessible. If you need to read data from a file please consult the section on files. The NEWVAR command is treated in a separate section.

Finally it is also important to understand that the results of all commands you issue appear only on the screen, as long as you do not activate the print file, i.e. a file keeping track of the results. Later you may print that file or treat it in other ways.

Detailed command descriptions

ADDFIT


    ADDFIT t [<options>]
    ADDFIT v TABLE=(nrows,ncols)  [<options>]

<options1> [CUTOFF=c] [EPSILON=epsi] [CENTER]

<options2> [RESIDUALS] [ROWeffects] [COLumn_effects] [COMPPLOT] [CODRES {str | ALTERNATE}]

Note: <options2> are common to the ADDFIT and MEDPOLISH commands (they are explained under MEDPOLISH). <options1> are specific to the ADDFIT command.

Fits an additive relation to a table (defined as such or specified by the TABLE= option; see MEDPOLISH for a more detailed explanation) using Tukey's biweight (algorithm as described by McNeil). The CUTOFF= option specifies the tuning constant (default 4), and EPSILON= specifies the epsilon used as the stopping criterion for the iterative process (default 0.01). The REGRESS command is based on the same biweight function; see there for a somewhat more detailed explanation of the C= and E= options.

CENTER centers the row and column effects on the row and column medians. By default this is not done.

References: McNeil 1977, chapter 7. Credit: The ADDFIT command is based on McNeil's program as published in McNeil 1977.

BOXPLOT


  BOXPLOT  vlist [SHORT] [RESCALE | STANDARDIZE]
                 [DLINE {"alt.symbols"}] [NOTCHED]

BOXPLOT vlist PARALLEL [SHORT | FULL] [NOTCHED] [HIGHLIMIT=val] [LOWLIMIT=val] [CSCALE]

BOXPLOT v1 LADDER_OF_POWER

BOXPLOT displays the boxplot (box-and-whisker plot) for each variable in vlist. The program also identifies the outliers, the adjacent values, and the extreme cases by their casid. For the definition of these values refer to the glossary. This command is sensitive to the SET DEFOUT settings.

SHORT displays a single-line boxplot and suppresses the additional outlier information.

NOTCHED This option adds two "notches" { } to the boxplot. The notches define a confidence interval around the median useful for comparison of several boxes. If the intervals for two boxes do not overlap, we can be confident at the 95% level that the two population medians differ. The notches are at:

    median + 1.58 * H-spread / sqrt(n)
    median - 1.58 * H-spread / sqrt(n)

(See Velleman & Hoaglin 1981 for an explanation of the rationale behind this.)
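The notch positions can be sketched in a few lines of Python. This is only an illustration of the formula above, not EDA's own code: the function names are hypothetical, and the hinges are assumed to be the medians of the lower and upper halves of the ordered batch, with the overall median counted in both halves when n is odd.

```python
from math import sqrt
from statistics import median


def hinges(sorted_values):
    """Lower and upper hinges of an ordered batch.

    Assumption: each hinge is the median of one half of the batch, and
    for odd n the overall median belongs to both halves.
    """
    n = len(sorted_values)
    half = (n + 1) // 2
    return median(sorted_values[:half]), median(sorted_values[n - half:])


def notches(values):
    """Return (lower notch, upper notch) around the median."""
    xs = sorted(values)
    n = len(xs)
    lower_hinge, upper_hinge = hinges(xs)
    h_spread = upper_hinge - lower_hinge
    half_width = 1.58 * h_spread / sqrt(n)
    med = median(xs)
    return med - half_width, med + half_width
```

Two batches whose notch intervals do not overlap are then candidates for having different population medians.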

DLINE adds a coded density line (single-line histogram) after the boxplot (and before the outlier identification). This is useful to identify discontinuities in a variable, i.e. situations where the boxplot might not be an appropriate tool. See the DLINE CODED command for an explanation of the "alt.symbols" option, which changes the symbols used.

STANDARDIZE/RESCALE The STANDARDIZE and RESCALE options are used to apply some form of standardization to the variable before producing the boxplot. STANDARDIZE subtracts the median from each value and divides the result by the midspread; RESCALE subtracts the minimum and divides by the range.

Hints

Please notice: because the boxplot is a graphical representation in which values are scaled to character positions, some precision is necessarily lost. Under specific circumstances a value that appears as an outlier in the list may therefore have no corresponding symbol on the graphic display; this simply means that the outlier is very close to the fences and that rounding puts it inside. The end of each whisker is normally marked with an 'x'. If it does not show (i.e. there is an 'o' or an '@', marking an out or far out value, in its place), the adjacent value and the outlier are so close that they fall on the same location in the display.

Parallel boxplots

PARALLEL produces parallel boxplots for the variables in the vlist, i.e. using the same scale for all specified variables (see also COMPARE). This is useful for comparing batches of numbers. This form of the BOXPLOT command normally does not show the additional outlier identification unless the FULL option is specified. The variable descriptor is not shown on the screen, but is added to the output in the print file.

CSCALE Normally the scale used for the parallel boxplots is computed from the actual data (possibly a selected subset) in the variable list. Sometimes you need a scale comparable to that of other plots you make. The CSCALE option uses the common scale, as set by the SET CSCALE command, instead of the actual minimum and maximum found in the data. Note that the common scale can only make the scale larger, i.e. the scale minimum must be smaller than the actual minimum and the scale maximum larger than the actual maximum; this means that the common scale cannot be used to remove observations.

HIGHLIMIT/LOWLIMIT The HIGHLIMIT and LOWLIMIT options are used to specify the minimum and the maximum for the scale used with parallel boxplots, e.g. for percent data you might wish to set them at 0 and 100 and not to some actual values. Note that the upper limit may not be set to a value smaller than the actual maximum, and the lower limit must be smaller than the actual minimum, i.e. these options can be used to show a wider range, but cannot be used to remove observations by setting the limits inside the actual data range (use a case selection command for that purpose).

NOTCHED, SHORT These options have the same meaning as with simple boxplots.

Missing values are eliminated listwise.

BOXPLOT LADDER: Tukey's ladder of powers

BOXPLOT LADDER displays all reexpressions of v1 according to Tukey's ladder of power, i.e. cubic, square, "raw", square root, logarithmic (base 10), reciprocal root, reciprocal, reciprocal square. These reexpressions are displayed as vertical boxplots.

The command is useful to search for appropriate reexpressions of the batch of numbers. See Velleman&Hoaglin 1981 for more details.

Notice that the BOXPLOT LADDER command does not actually reexpress a variable; it only shows how the various powers would affect the variable. You can use the REEXPRESS command to do the actual transformation, or perform it using the LET command, as shown in the following table (taking a variable labelled EX as an example):

    ladder of power     power     EDA transformation
    ---------------------------------------------------
    Cube                 3        LET #EX=#EX^3
    Square               2        LET #EX=#EX*#EX
    Raw                  1        (actual variable)
    Square root          1/2      LET #EX=SQRT(#EX)
    Log                  0        LET #EX=LOG(#EX)
    Reciprocal root     -1/2      LET #EX=(-1)/SQRT(#EX)
    Reciprocal          -1        LET #EX=(-1)/#EX
    Reciprocal square   -2        LET #EX=(-1)/(#EX*#EX)

The REEXPRESS command is designed for interactive re-expression: It lets you search for an appropriate reexpression, before actually applying it to the variable.
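To see what the ladder does to a batch outside EDA, the eight reexpressions can be sketched in Python. The names below are hypothetical; the sketch assumes strictly positive values and a base-10 logarithm (as stated for BOXPLOT LADDER), and it keeps the minus sign on the reciprocal rungs so that the order of the values is preserved, as in the LET expressions of the table.

```python
import math

# (name, power, function) in ladder order. The reciprocal rungs carry a
# minus sign so that larger raw values stay larger after reexpression.
LADDER = [
    ("cube", 3, lambda x: x ** 3),
    ("square", 2, lambda x: x * x),
    ("raw", 1, lambda x: x),
    ("square root", 0.5, lambda x: math.sqrt(x)),
    ("log", 0, lambda x: math.log10(x)),
    ("reciprocal root", -0.5, lambda x: -1 / math.sqrt(x)),
    ("reciprocal", -1, lambda x: -1 / x),
    ("reciprocal square", -2, lambda x: -1 / (x * x)),
]


def ladder_of_powers(values):
    """Return {rung name: reexpressed values}; assumes all values > 0."""
    return {name: [f(x) for x in values] for name, _power, f in LADDER}
```

Feeding each list of the result to a boxplot routine reproduces the kind of comparison the LADDER display offers.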

Related commands

Compare also with the COMPARE command (vertical boxplots) and the TRACES command (boxplots by groups).

(*) There is also an output procedure BXP() which may be used to produce single line boxplots. This is useful whenever you want to design a specialised macro.

References

McNeil 1977 chapter 1; Velleman & Hoaglin 1981. Credits: The basic boxplot production algorithm was originally based on McNeil 1977, and has been considerably modified and extended.

BREAK


Note: As BREAK and XTAB share most options, both commands will be explained here.

There are three distinct command forms:

 BREAK v1[,v2] <size> <opt> [INTERVALS | READ_CUT_POINTS]
 XTAB  v1[,v2] <opt>

  <size>  | TWO | [THREE] | FOUR  | FIVE
          | {XDim=xd}{YDim=yd}{ZDim=zd}|

  <opt>   [GVAR {DOUBLE}{COLUMNWISE}]
          [RESTABLE]
          [INSPECT_TABLE] or [IDENTIFY]
          [IDENT={cell}]
          [DROP=min_freq]
          [COPYTABLE{=var}]

 BREAK v1,v2 WITH=var# [<size>] [<stat>] [COPYTABLE{=var#}] [INTERVALS | READ_CUT_POINTS]
 XTAB  v1,v2 WITH=var# [<stat>] [COPYTABLE{=var#}]

  <stat>  | <stat1> [DIFFERENCES]
          | COUNT [IF{=|>|<|~}val] [Fuzz=val]
          | REFVAL[=val] [MEDIAN | MEAN ]

  <stat1> | [MEDIAN] | MIDSPRD | MEAN | SD | MIN | MAX

 XTAB  t1
 BREAK t1

The BREAK command produces frequency tables or crosstabulations from quantitative (continuous) variables by breaking them into categories corresponding to some interval.

The XTAB command displays frequency tables and crosstabulations for the specified variables. Variables are considered as integer variables, i.e. the fractional part of a variable is always discarded. The command displays simple frequency tables for one variable or crosstabulations for two or three variables, according to the number of variables specified on the vlist.

Options specific to the BREAK command

The categories (bins) are defined by cut points. The default is to break each variable into three categories, using the thirds of its distribution as cut points. If the INTERVALS option is present, the range of the variable is divided into intervals of equal width. The READ option directs the program to ask the user for the cut points instead of using the default distributional criterion. Note that in order to define e.g. five intervals, four cut points are needed.
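The relation between intervals and cut points can be sketched for the equal-width case. The function names are hypothetical, and the treatment of a value that falls exactly on a cut point (here: it goes into the upper interval) is an assumption, since this section does not say how EDA handles ties.

```python
def equal_width_cut_points(values, k):
    """k intervals of equal width over the range need k - 1 cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]


def category(x, cuts):
    """1-based category of x for ascending cut points.

    Assumption: a value equal to a cut point falls into the upper interval.
    """
    return 1 + sum(1 for c in cuts if x >= c)
```

For a variable ranging from 0 to 10 and five intervals, this gives the four cut points 2, 4, 6 and 8.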

The TWO, FOUR and FIVE options, as well as the XDim, YDim and ZDim options, let you define different numbers of categories. For example, if three variables are specified, the FOUR option displays a 4x4x4 table by breaking each variable into four pieces using the fourths (hinges) as cut points.

Options common to XTAB and BREAK

REST       Display residual table (empirical vs. expected frequency)
DROP=mf    Drop cells with counts less than mf (min-freq)
INSPECT    Enter Inspect module
IDENTIFY   Same as inspect
GVAR [DOUBLE] [COLUMNWISE] copy cell references into GVAR
COPYTABLE  Copy table as table variable into WA

The tables normally show the counts for each cell. Zero frequencies are shown as blank cells.

The RESTABLE option displays the difference between the true frequency and the expected frequency, the expected frequency being the number of cases falling into a specific category assuming a uniform distribution, i.e. the same number of cases in each cell. (Note that whole numbers are used, so some rounding errors will occur; e.g. in the case of a 4x4 table with 26 cases, the theoretical frequency used will be 26/16 rounded to the next integer, i.e. 2.)
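A minimal sketch of this residual computation follows. It is not EDA's code: the function name is hypothetical, and rounding the expected frequency half up is an assumption that merely agrees with the 26/16 example.

```python
def residual_table(counts):
    """Residuals of observed cell counts against a uniform expectation.

    counts: list of rows of observed frequencies.
    Returns (residual rows, expected frequency used).
    Assumption: the expected frequency is rounded half up to a whole number.
    """
    n_cells = sum(len(row) for row in counts)
    total = sum(sum(row) for row in counts)
    expected = int(total / n_cells + 0.5)
    residuals = [[c - expected for c in row] for row in counts]
    return residuals, expected
```

With 26 cases in a 4x4 table the expected frequency is 2, so a cell holding 1 case gets a residual of -1.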

The DROP=min-freq option causes all frequencies equal to or smaller than min-freq to be dropped from the table (i.e. be considered as zero and displayed as blank). This is useful to highlight concentrations without being disturbed by small frequencies.

You may identify observations in specific cells or inspect the table in more detail. To do so, either enter the INSPECT_TABLE module using the INSPECT or IDENTIFY option or, if you only need to identify the observations in a specific cell, use the IDENTIFY=(cell-ref) command option. See below for additional information on the INSPECT module.

The GVAR option lets you define a GVAR based on cells. The table below shows how group numbers are assigned (the numbers within the cells are the group numbers).

       1     2     3
    __________________
 1 |   1     2     3  |
 2 |   4     5     6  |
 3 |   7     8     9  |
    __________________

This is the default way of defining a GVAR (rowwise). There are other forms: COLUMNWISE numbers the cells columnwise, whereas DOUBLE uses a double index, i.e. the cases falling into cell 1,4 will have the group number 14 (this only works for tables where all dimensions have fewer than 10 categories). DOUBLE can be combined with COLUMNWISE (which produces 41 instead of 14 for a case in cell 1,4).
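The numbering schemes can be summarised in a small Python sketch. The function is hypothetical, and it assumes that in a cell reference such as 1,4 the first index is the row; this section does not state the order explicitly.

```python
def gvar_number(row, col, nrows, ncols, columnwise=False, double=False):
    """Group number assigned to the cell (row, col), both 1-based.

    Assumption: in a cell reference such as 1,4 the first index is the row.
    """
    if double:
        if nrows > 9 or ncols > 9:
            raise ValueError("DOUBLE needs fewer than 10 categories per dimension")
        return 10 * col + row if columnwise else 10 * row + col
    if columnwise:
        return (col - 1) * nrows + row
    return (row - 1) * ncols + col
```

For the 3x3 table shown above, the rowwise default numbers the first cell of the second row as group 4, whereas COLUMNWISE would number it as group 2.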

COPYTABLE This option copies the table into the WA as a table variable. (See the glossary for an explanation of table variables.)

INSPECT_TABLE

The INSPECT (or IDENTIFY) option enters, after the display of the basic table, a special module with its own commands. [Currently experimental; more to come].

Currently the following commands are available within INSPECT:

 IDentify [col,row]   Identify observations; if col/row are not
                      specified you will be asked.
 SElection            Observations in the current cell are selected.
                      Leaves table inspect mode (back to EDA mode).
 SI                   Save selection index of current cell into
                      a variable
 H/?                  Provides some help
 Q                    Quit (same as blank line)

Note that the SEL and SI commands require a current cell to be defined; this is done using an ID command to designate the current cell.

BREAK/XTAB WITH=var

The second form of the command displays statistics of a third variable specified by WITH=var#. By default medians are shown.

You may also display midspreads, means, standard deviations (SD), minima and maxima. If you add the DIFFERENCES option to one of these, the table will show the difference between the cell median (mean, SD etc.) and the overall median (mean, SD etc.).

You may also use COUNT to obtain frequency counts for the third variable based on some criterion. By default the table will show the number of positive non-zero values in each cell. IF>val and IF<val let you count the number of observations above or below some specific value, whereas IF=val and IF~val count the number of observations equal or not-equal to some value. These two forms are sensitive to the system FUZZ value, a value you may override with the FUZZ option. The default form is also sensitive to fuzz (definition of zero is 0+fuzz).

Finally, the REFVAL option displays difference tables with respect to the variable's reference value (center) or an arbitrary value specified with REFVAL=val. By default the value is compared to the cell medians of the third variable; if MEAN is present, the table shows differences to the cell means.

COPYTABLE copies the table into the WA as a table variable, suitable for display (BREAK/XTAB t) or analysis (ADDFIT, MEDPOLISH).

BREAK/XTAB t1

If the variable on the vlist is a table (variable) it will simply be displayed, i.e. no computation whatever is performed, the table is displayed as such.

COMPARE


  COMPARE vlist [<opt>]

<opt> [CSCALE] [REEXPRESS=power | REMOVEMEDIAN | STANDARDIZE]

Displays vertical box and whisker plots for all variables in vlist for comparison. A common scale is used for all boxplots. This command is sensitive to the SET DEFOUT settings.

CSCALE Instead of using the minimum/maximum found in the data specified by the vlist, the common scale as defined by the SET CSCALE command is used. Note that this scale may only extend the data minimum/maximum, i.e. CSCALE may not be used to eliminate observations.

REEXPRESS=power Before they are displayed variables are re-expressed using power transformations. See the glossary (entry power transformations) for more details.

REMOVEMEDIAN Removes the median from each variable.

STANDARDIZE Standardizes the variables before display (remove median and divide by midspread).

Related commands

See also the BOXPLOT and TRACES commands.

Credit

COMPARE is based on a Fortran program published by McNeil 1977. It has been modified for EDA; today nothing except the name is left of the original code, but thanks for the inspiration anyway.

DIAGNOSTIC


   DIAGNOSTIC  v1 <diag>  [PLOT=(xchar#,ychar#)]

Performs various diagnostic plots and summaries specified by <diag>.

 <diag>       |  TUKEY
              |  EXTREMES
              |  MEDIAN
              |  BOXES
              |  NORMAL [MEDIAN]
              |  QQ [MEDIAN]   (same as NORMAL)
              |  v1 v2 EMPIRICAL_QQ
              |  TRASH  [COPY{=var#}]
              |  AUTOCORR
              |  GAPS [WEIGHT] [COPY{=var#}] [SHOW{=n}]
              |  ANDERBERG{=sample_size}

This command contains a series of diagnostic tools for individual variables, used to assess normality, symmetry and the like. Most options display a diagnostic plot. One of the options is required.

The first three diagnostic plots (TUKEY, EXTREMES and MEDIAN) are based on the idea that, in a symmetric distribution, data points are equivalent when stepping in from the extremes to the center. With the data points ordered as x(1), x(2) .. x(n), the following plots can be produced to assess the symmetry of a distribution:

EXTREMES

The data points (x(1),x(n)) , (x(2),x(n-1)) ... are plotted. The variable is symmetric if the points fall on a straight line with slope -1 and intercept 2b, b being the center of symmetry (Wilk & Gnanadesikan, in Gnanadesikan 1977).

MEDIAN

A similar plot can be produced by displaying (M-x(1),x(n)-M) , (M-x(2),x(n-1)-M) ... where M is the median. In this case the distribution is symmetric, if the slope is 1 and the intercept 0.

TUKEY

Tukey (Tukey, in Launer & Wilkinson, 1979) proposes to plot

(x(n)-x(1),x(1)+x(n)) , (x(n-1)-x(2),x(2)+x(n-1)) ...

A horizontal cloud of points then indicates symmetry. Note that when using this technique, attention has to be paid to the scale of the diagnostic plot: if the range plotted is very small, a superficial glance at the plot might give the impression of enormous dispersion.
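The point pairs behind the three symmetry plots can be sketched as follows. This is an illustration under the definitions given in this section (with sums plotted against differences for the TUKEY form), not EDA's implementation; the function name is hypothetical.

```python
from statistics import median


def symmetry_pairs(values, kind):
    """Pairs (x, y) for the EXTREMES, MEDIAN or TUKEY symmetry plot."""
    xs = sorted(values)
    n = len(xs)
    m = median(xs)
    pairs = []
    for i in range(n // 2):          # step in from both extremes
        low, high = xs[i], xs[n - 1 - i]
        if kind == "EXTREMES":
            pairs.append((low, high))
        elif kind == "MEDIAN":
            pairs.append((m - low, high - m))
        elif kind == "TUKEY":
            pairs.append((high - low, low + high))
        else:
            raise ValueError(kind)
    return pairs
```

For the perfectly symmetric batch 1, 2, 3, 4, 5 the EXTREMES pairs lie on the line y = 6 - x, the MEDIAN pairs on y = x, and the TUKEY pairs all have y = 6.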

BOXES

This option plots seven selected quantiles:

           x               y  (empirical)
    (1)    1/16          first 16th
    (2)    1/8           first 8th
    (3)    1/4           first 4th (hinge)
    (4)    1/2           median
    (5)    3/4           last 4th (upper hinge)
    (6)    7/8           last 8th
    (7)    15/16         last 16th

NORMALITY (Theoretical Q'Q plot)

NORMAL and QQ are synonyms. Probability plots are another possibility for diagnosing distributions. The principle is very simple: on one axis, usually the y-axis, plot the ordered data values, and on the other axis plot the corresponding quantiles of some theoretical distribution at probability p. This kind of plot is often called a (theoretical) Q'Q plot (quantile quantile plot) or, for the normal distribution, a normal probability plot.

Traditionally, the i-th normal score for a sample of size n is the mean of the sampling distribution of the i-th order statistic (i.e. the i-th ordered value) in a sample of size n from a standard normal distribution. It is approximated by

   score(i) = Gauss-inverse( (i-1/2)/n )            ;  for i = 1,n

The MEDIAN option approximates the medians of the sampling distributions of the order statistics of the standard normal distribution:

   score(i) = Gauss-inverse( (i-1/3)/(n + 1/3) )    ;  for i = 1,n
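Both sets of normal scores can be sketched with Python's standard library. The sketch assumes that the plotting position, (i-1/2)/n or (i-1/3)/(n+1/3), is passed through the inverse of the standard normal distribution function; the function name is hypothetical.

```python
from statistics import NormalDist


def normal_scores(n, median_form=False):
    """Approximate normal scores for the ordered values 1..n.

    median_form=False uses p = (i - 1/2) / n,
    median_form=True  uses p = (i - 1/3) / (n + 1/3).
    """
    inverse = NormalDist().inv_cdf   # inverse standard normal CDF
    if median_form:
        return [inverse((i - 1 / 3) / (n + 1 / 3)) for i in range(1, n + 1)]
    return [inverse((i - 0.5) / n) for i in range(1, n + 1)]
```

Plotting the ordered data against these scores gives the normal probability plot; points close to a straight line suggest a roughly normal batch.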

EMPIRICAL_QQPLOT

The empirical Q'Q plot compares two empirical variables: the ordered X-variable is plotted against the ordered Y-variable.

TRASH

The TRASH (Trimmed absolute sharpness) curve is used with residuals. Sharpness is defined as follows: let r(1) .. r(n) be the ordered absolute residuals (r(1) is the smallest, r(n) the largest absolute residual), and let R be the sum of the absolute residuals.

Then the overall sharpness is defined as R/n, where n is the number of observations. It is then possible to define a number of sharpness values by omitting the largest residual, the second largest and so on.

It is possible to compute all sharpness values to build the trash curve:

     TRASH(i)  =  1 / (n-i) * sum (r (j) ),  for j = 1 to n-i

                     for i = 0 to n-1

These values are then plotted. It is also possible to COPY the sharpness values into a variable.
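The trash curve is easy to sketch in Python. The function name is hypothetical, and the sketch lets i run from 0 to n-1 so that at least one residual always remains in the average.

```python
def trash_curve(residuals):
    """Trimmed absolute sharpness values TRASH(0) .. TRASH(n-1).

    TRASH(i) is the mean of the n - i smallest absolute residuals.
    """
    r = sorted(abs(x) for x in residuals)
    n = len(r)
    running_sum = sum(r)
    curve = []
    for i in range(n):
        curve.append(running_sum / (n - i))
        running_sum -= r[n - 1 - i]   # omit the largest remaining residual
    return curve
```

For the residuals 1, -2, 3 and -6 the overall sharpness is 12/4 = 3, and omitting the largest absolute residual lowers it to 6/3 = 2.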

AUTOCORRELATION

The AUTOCORR option displays a plot of x(i) against x(i-1) to show autocorrelation (lag-1).

GAPS

GAPS performs a gapping analysis for the specified variable, which is useful when searching for possible breaking points in the data. In any ordered set of N numbers there are N-1 possible breaking points between adjacent data points. In many distributions values tend to cluster around some points, often around the central value. The gaps (i.e. the differences between adjacent values) may be studied to find suitable break points or to diagnose special clustering features present in a given data set.

Tukey (1977) (see also Thiessen & Wainer 1982) proposes to use WEIGHTed gaps, where gaps toward the dense center are weighted more heavily than gaps toward the extremes:

Weighted gap = SQRT [ gap * I * (N-I) ]

where

   N = number of data points
   I = index number of the gap in the ordered data set

With no option the gaps are plotted against the sequence. The COPY option copies the gap values into a specified variable. The SHOW option displays the top 10 gaps, or the number of top gaps requested with SHOW=n, showing case ids, gap values and data points.
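The raw and weighted gaps can be sketched as follows, using the weighting formula given above. The function name is hypothetical, and the gap with index I is taken to lie between the I-th and (I+1)-th ordered values.

```python
from math import sqrt


def gaps(values, weighted=False):
    """Gaps between adjacent ordered values, optionally weighted.

    The weighted gap with index i (1 .. N-1) is sqrt(gap * i * (N - i)).
    """
    xs = sorted(values)
    n = len(xs)
    result = []
    for i in range(1, n):
        gap = xs[i] - xs[i - 1]
        result.append(sqrt(gap * i * (n - i)) if weighted else gap)
    return result
```

The factor i * (N - i) is largest in the middle of the ordered batch, which is what gives central gaps more weight than gaps near the extremes.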

ANDERBERG

The ANDERBERG option is used to show confidence intervals for the eigenvalues for correlation and covariance matrices, i.e. to study whether eigenvalues are "really" distinct from each other. The confidence intervals are determined using theoretical results from Anderson 1963 (Laplace). If the confidence intervals of two subsequent eigenvalues overlap the "proximity" of the eigenvalues is suggested and the corresponding axes are defined as close to each other (less the rotation). The user then could limit him/herself to the interpretation of the subspace defined by the clearly distinct eigenvalues (Morineau 1983).

This diagnostic is used as follows: perform a factor analysis with the EIGEN option to copy the eigenvalues into a separate variable (or enter the values into a variable, e.g. with the NEWVAR command). Then use DIAG v ANDER with that variable. The command needs the sample size: normally, if FACTOR has been performed previously, the command takes N from that command (i.e. from the information stored with the correlation matrix). Otherwise the user is asked for the sample size, or it may be given as an option using the second format, ANDERBERG=sample_size.

Plot options

PLOT=(x#,y#) [applicable only to options where a plot is produced] may be used to control the size of the plot. x# is the number of characters across (up to 130), whereas y# is not limited.

Using EDA expressions for diagnostic plotting

Besides the options available here, it is very easy with EDA to build other probability plots using the various functions provided with expressions.

In a normal probability plot, a normal distribution shows up as a perfectly straight line.

The same plot as the one produced by the NORMAL option can also be obtained with the PLOT command, using the following transformations:

 (long form)
 >let #varx =ugrd(#varx)     ! #varx is just an example
 >GENERATE 3
 >let #3=ginv((#3-0.5)/$NOC.3)
 >PLOT #varx,3 SCAT

The first command sorts #varx in ascending order (upgrade). GENERATE generates a uniform index variable 1..N. The third line computes the theoretical quantiles: ginv() returns the value of the inverse of the Gaussian (normal) distribution function for the given probability. This example shows that any probability plot can be constructed using the EDA facilities.

EDA can do exactly the same on a single command line, using the PLT() output procedure to produce the plot. [This solution has the advantage of not using or creating intermediate variables, i.e. of not changing the current WA.]

>out plt(ugrd(#varx),ginv((idx(1,$noc.varx,1)-0.5)/$NOC.varx))

Explanation: plt(x,y) requires two arguments. The first is as above (ugrd(#varx)); the second is more complex: $noc.varx is the number of cases of the variable varx, and the idx() function is used to generate an index variable running from 1 to the number of cases in varx in increments of 1, thus 1,2,3 ... n (the GENERATE command did that in the first example).

An empirical probability plot (Empirical Q'Q plot) could be produced with the following command

>out plt(ugrd(#varx),ugrd(#vary))

This plots the first variable (varx), ordered, against the second variable (vary), also ordered in ascending order.

References

Tukey 1977; Thiessen & Wainer 1982; Lingoes 1973; Everitt 1978; Gnanadesikan 1977; Morineau 1983; Anderson 1963

DISPLAY


 DISPLAY vlist <stat>  [TRIM=case%]
                       [ASTAT{=var#}] [BSTAT{=var#}]
                       [STORE_CENTER]
                       [VSA | VSB {DESC | ASC}]

DISPLAY v BYGVAR | GVAR[=var#] <stat> [TRIM=case%] [ASTAT{=var#}] [BSTAT{=var#}]

DISPLAY v CASIDS

Displays statistics as specified by <stat>:

    MEDIAN     Median  and midspread (default)
    MEAN       Mean and standard deviation
    MAD        Median and MAD (Median absolute deviation)
    HINGES     upper and lower hinges (quartiles)
    BIWEIGHT   Tukey's biweight [C=const][E=epsi]
    IFEN       Inner fences
    OFEN       Outer fences
    RANGE      Minimum and maximum
    CENTER     The prestored center estimate (def. median)
    P=val      Any percentile specified by val (0-100).
    SUM        Sum of all cases
    SHAPE      Skewness and kurtosis
    DURBIN     Computes the Durbin-Watson coefficient (usually
                 applied to residuals).

The definition of the inner and outer fences is sensitive to the SET DEFOUTLIER settings.

All statistics may be trimmed, i.e. a percentage of cases is omitted from the variable: the variable is ordered and then trimmed at both ends, i.e. the specified percentage of cases is removed and the statistics computed as usual. Trim=5 means remove 5% of the cases at both ends, i.e. the statistics are computed on 90% of the original cases. Trimming is used especially with non-robust statistics to make them more robust (e.g. means and standard deviations).
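Trimming can be sketched in Python as follows. The function name is hypothetical, and rounding the number of dropped cases down is an assumption; this section does not say how EDA treats a percentage that does not give a whole number of cases.

```python
from statistics import mean


def trim(values, percent):
    """Drop `percent` % of the cases at each end of the ordered values.

    Assumption: the number of cases dropped per end is rounded down.
    """
    xs = sorted(values)
    k = int(len(xs) * percent / 100)
    return xs[k:len(xs) - k]


batch = list(range(1, 20)) + [1000]       # 20 cases, one wild value
trimmed_mean = mean(trim(batch, 5))       # mean of the central 18 cases
```

With TRIM=5 on 20 cases, one case is dropped at each end, so the wild value 1000 no longer affects the mean.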

Note that CENTER displays the prestored center or reference value (default median) for that variable.

The ASTAT and BSTAT options, if present, copy the first (A) and/or the second (B) statistic, if available, into a variable. For example, if the MEAN is computed for 10 variables and the ASTAT and BSTAT options are present, the program will copy the 10 means into one variable and the 10 standard deviations into a second variable (the one specified by BSTAT=var#, or a free location if only BSTAT is given).

VSA/VSB is used to build a sorted variable list based on a statistical criterion. DISPLAY 1-10 IFEN VSA will, in addition to displaying the statistics, rearrange the variable list so that the variables appear in ascending order of the lower inner fences of variables 1-10. VSA refers to the first statistic, VSB to the second if available (in the example, VSB would mean the upper inner fence). The ASCENDING or DESCENDING options are used to override the default sort order set by the SET SORT command. These options are similar to the options available with DS/VARS and DESCRIBE; see there for details.

In the case of MEAN, MEDIAN and BIWEIGHT, the STORE option replaces the stored center value by the computed value.

DISPLAY v BYGVAR

BYGVAR produces the same statistics, but computed for each group in the current GVAR or in the variable specified with the GVAR option.

With this option only one variable is used on the variable list. The STORE option, the VSA/VSB options and the CENTER statistic do not apply here.

DISPLAY v CASID

This form of the DISPLAY command displays the case ids, together with the group membership (if a GVAR is defined).

This command is useful to produce a list of the currently defined CASIDS, and, if a case selection is active, a list of the currently selected cases. The same information can be obtained from the LIST command, but in a vertical form together with at least a numerical list of a variable.

DLINE


    DLINE vlist
    DLINE vlist CODED ["alt.symbols"]

Produces a density line, i.e. a sort of single-line histogram (or one-way plot), where the density of the cases is shown either in numerical or coded form.

The variable (or each variable if a list is given) is divided into many intervals (some 72 intervals, depending on the screen width and other information to be displayed). Then the number of cases in each interval is computed and displayed at the corresponding location, either in numerical or coded form.

By default EDA shows the number of cases. As there is only one character position available for each interval, no more than 9 cases can be shown exactly: no cases appear as blanks, and 1 to 9 cases are shown as '1', '2' ... '9'; if there are between 10 and 19 cases, they appear as '*'; 20 or more as '#'.
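The numerical form of the density line can be sketched as follows. This is not EDA's code: the function name is hypothetical, the intervals are assumed to be of equal width over the range of the variable, and counts of 20 or more are assumed to be shown as '#'.

```python
def density_line(values, width=72):
    """One-line density display with one character per interval."""
    lo, hi = min(values), max(values)
    counts = [0] * width
    for x in values:
        if hi == lo:
            pos = 0
        else:
            # Assumption: equal-width intervals; the maximum is placed
            # in the last interval.
            pos = min(width - 1, int((x - lo) / (hi - lo) * width))
        counts[pos] += 1

    def symbol(count):
        if count == 0:
            return " "
        if count <= 9:
            return str(count)
        return "*" if count <= 19 else "#"

    return "".join(symbol(c) for c in counts)
```

A short line makes the coding visible: `density_line([0] * 3 + [5] * 12 + [10] * 25, width=11)` shows a '3' at the left end, a '*' in the middle and a '#' at the right end.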

The CODED form shows symbols instead of numbers. By default four density symbols are used, besides the ' ' (blank space) for no case, standing for 1, 2, 3 or 4 occurrences. If higher frequencies are encountered, each of the four symbols represents more than one occurrence; a message indicates how many cases a symbol represents. A blank always means exactly zero cases, and the first symbol represents at least one case; for the other frequencies the symbol is found by rounding to the closest symbol. For example, if the largest frequency found is 20 and the four default symbols are used, each symbol represents approximately 5 observations (20/4), so a frequency of 10 is shown using the second symbol, as is a frequency of 8 or 12 (rounding). Please notice that a frequency of 1 is not rounded down to 0 but shown with the first symbol (i.e. a blank always means no case at all).
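The coding rule can be sketched in Python. The four symbols used here are hypothetical stand-ins (the built-in symbols are not listed in this section), the function name is invented, and "rounding to the closest symbol" is interpreted as rounding the ratio frequency / cases-per-symbol half up.

```python
def coded_symbol(freq, max_freq, symbols=".:+#"):
    """Density symbol for one interval of a coded density line.

    symbols: hypothetical stand-ins for the built-in density symbols.
    """
    if freq == 0:
        return " "                                # blank: no case at all
    per_symbol = max(1.0, max_freq / len(symbols))
    index = int(freq / per_symbol + 0.5)          # round to closest symbol
    index = min(len(symbols), max(1, index))      # never below first symbol
    return symbols[index - 1]
```

With a largest frequency of 20, each symbol stands for about 5 cases, so frequencies of 8, 10 and 12 all map to the second symbol, while a frequency of 1 still gets the first symbol.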

FREQMIN If you prefer that very small frequencies are not coded (cf. the previous example) but appear as blanks, you may use the FREQMIN=freq option, where freq indicates the smallest frequency to be coded as the first non-blank symbol.

"alt.symbols" tells EDA to use the alternative symbols instead of the built-in symbols. You may specify up to 60 symbols. The first symbol corresponds to one occurrence (or more if the number of supplied symbols is not sufficient to represent all the different frequencies), the second to 2 occurrences and so on. If the highest frequency encountered exceeds the number of symbols, a symbol may represent more than a single frequency count (see above for more details on how this is done).

If a variable list is specified, EDA produces a single line for each variable with the variable name at the beginning of the line.

FREQUENCY


  FREQUENCY v [<opt>] [BARCHART | HISTOGRAM {NOFREQTABLE}]
  FREQUENCY v BYGVAR [<opt>]

<opt> [FMIN=mincount] [SLIDE=dec_pos] [CODES=var# | RANGE=maxcod | RANGE=(mincod,maxcod)] [VAR{=target#} {PERCENT | CODES} {KDLAB}] [NOTABLE]

Produces a frequency table for the current variable (category number, count and percentage). The variable is treated as an integer variable, i.e. the decimal part is not considered.

FMIN= is used to control display of categories with small counts. Default is to display all categories with count 1 or higher. FMIN may be used to set it to a higher minimal count.

SLIDE is used if you wish to work on real variables rather than integers (i.e. to study various intervals). This option avoids LET commands for simple tasks such as sliding the decimal point to the left or the right before creating categories. Use SLIDE=decimals, where the number of decimals may be positive or negative (positive meaning to the left, negative to the right). This is equivalent to transforming the variable by dividing or multiplying by powers of 10.
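As a rough illustration of how SLIDE and FMIN interact, the following Python sketch shifts the decimal point, drops the decimal part and counts the resulting categories. The function name and the choice to compute percentages over all cases are assumptions made for the example, not a description of EDA's internals.

```python
import math
from collections import Counter

def frequency_table(values, slide=0, fmin=1):
    """Return (category, count, percent) rows for shifted, truncated values."""
    # Positive slide moves the decimal point to the left (divide by 10**slide),
    # negative slide moves it to the right (multiply by 10**-slide).
    if slide >= 0:
        categories = [math.trunc(v / 10 ** slide) for v in values]
    else:
        categories = [math.trunc(v * 10 ** (-slide)) for v in values]
    counts = Counter(categories)
    total = len(categories)
    # Categories with fewer than fmin cases are left out of the table.
    return [(cat, n, 100.0 * n / total)
            for cat, n in sorted(counts.items()) if n >= fmin]
```

For example, `frequency_table([1.26, 1.21, 1.38, 2.05], slide=-1)` builds categories from the first decimal place (12, 13 and 20).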

NOTABLE inhibits the display of the table (producing only result variables and, if requested, derived variables).

BARCHART and HISTOGRAM

In addition to the frequency table, HISTOGRAM shows for each class the names (casids) of the observations. Note that cases appear in their original sequence in the work area. The NOFREQTABLE option drops the frequency table portion from the default display. Note that the display of the case names is sensitive to the setting of the SET CASID command (i.e. it might contain a group membership in addition to the name).

BARCHART produces a barchart. The NOFREQTABLE option applies also.

BYGVAR

BYGVAR is a special form of the FREQ command showing in addition to the frequency table a table with the frequencies in each group defined by the current GVAR (shown to the right of the overall frequency table). Only the absolute frequencies are shown in the group table. The maximal number of groups shown depends on the output line width. This option requires a GVAR.

CODES=var# and RANGE

The default form of the command shows only categories with non zero count. However sometimes you require a display with a predefined set of categories.

CODES=var# may be used to specify the values (codes), whose frequencies are to be counted. Normally the table shows all codes existing in a particular variable. Sometimes it is useful to produce a set of tables with the same codes. CODES=var# points to a variable where the codes to appear in the table can be found. Then these codes will appear in the table. If a code cannot be found in a particular variable, a count of 0 will be produced; on the other hand a code not in the CODES variable will never appear in the frequency table, even if it occurs in the variable.

The source variable for the codes (categories) is not sensitive to case selection.

The CHECK CODES command may be used to produce a CODES=var# by checking the occurrence of codes in a series of variables. See the CHECK command for more information.

RANGE: The RANGE option is used to specify a range of categories to appear in the table (without the need to create a variable for the CODES=var# option). RANGE comes in two forms: RANGE=(mincod,maxcod) specifies the starting and ending codes for the table; RANGE=maxcod starts with code 1 and ends with code maxcod.
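The behaviour of CODES= and RANGE= can be pictured with a small Python sketch: only the listed codes are counted, a listed code that never occurs gets a count of 0, and an unlisted code is ignored. The function name is hypothetical and the example makes no claim about how EDA implements the option.

```python
from collections import Counter

def counts_for_codes(values, codes):
    """Count only the listed codes; absent codes get 0, unlisted ones are dropped."""
    # int() drops the decimal part, as the frequency table does.
    counts = Counter(int(v) for v in values)
    return [(code, counts.get(code, 0)) for code in codes]
```

A RANGE=(1,3) request corresponds to `counts_for_codes(values, range(1, 4))`.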

Derived variables: VAR

VAR=target# will produce a new variable containing the counts (default), CODES (categories) or the percentage (PERCENT option).

KDLAB lets you keep the original label of the target variable. This is useful whenever the label of the variable where you want to copy the frequencies should be left untouched. Default is to create a label and descriptor reflecting its contents.

HISTOGRAM


 HISTOGRAM v [GVAR{=var#}] [BOTH {GVAR=var#}]
             [SYMWID=n] [VERTICAL] [SEQUENCE] <opt>
             [CODE=(v1{,v2,v3,v4}) <code>]

HISTOGRAM v BAR [GAUSS] [NSAV] <opt>

<opt> := [DETAILS=nlin | BINWIDTH=width | VELLEMAN] [{DE|AS}CENDING] [SAVE_BOUNDARIES] [VBOUNDARIES=var#] [GLOBAL] [STORE_GVAR] [UPPERLIMIT=val] [LOWERLIMIT=val]

<code> := BINS | [FRAC] | EXACT | READ ["alt.symbols"] DISTRIBUTIONAL [SIMPLE] ["alt.symbols"] REFERENCE=value [FUZZ=val] ["alt.Symbols"] MARK|=val | IF>val |IF=val| IF<val | IF~val ["alt.symbols"] [FUZZ=val] ASIS ["alt.symbols"] DICHOTOMY ["alt.symbols"]

Draws a histogram using the case identification as "leaves" (default). The BAR form, described below, displays a traditional histogram.

<opt>

The following options are common to both forms of the histogram display:

By default the number of intervals (lines) is determined using Dixon and Kronmal's rule, i.e. lines = 10 * log (n). Alternatively you might use VELLEMAN's rule [2 * sqrt(n)] or specify the number of intervals (bins) using either the DETAILS= or the BINWIDTH option. Details= controls the number of intervals (bins). Instead of specifying the number of bins the bin-width (interval width) may be specified using the BINWIDTH=width option. With the default histogram each interval may be displayed on several lines if the number of case ids exceeds the line width. With HISTO BAR however each interval uses exactly one line (a symbol then might represent more than one case).
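The two rules for the number of lines are easy to compare numerically. The sketch below assumes that "log" means the base-10 logarithm and that the result is truncated to an integer; neither detail is stated explicitly above, and the function names are invented.

```python
import math

def dixon_kronmal_lines(n):
    # Assumption: base-10 logarithm, result truncated to an integer.
    return int(10 * math.log10(n))

def velleman_lines(n):
    # Assumption: result truncated to an integer.
    return int(2 * math.sqrt(n))
```

For small batches Velleman's rule gives fewer lines (14 against 16 for n = 50); the two rules agree at n = 100.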

Note that if the specified bin width leads to more bins than the command allows, the message given reports too many bins, i.e. the same message as with the DETAILS= option.

The GLOBAL option is only useful if a case selection is active. It causes the global minimum and maximum to be used for scaling, instead of the min/max of the selected cases (default). This option is useful when studying different sub-populations where the same scale is desired.

The LOWERLIMIT and UPPERLIMIT options allow you to specify lower and upper limits for the histogram other than the default (true minimum and maximum); this may be used to specify a user-defined scale and/or to eliminate cases from the histogram.

The intervals appear in either ascending or descending order, depending upon the setting of the SET SORT switch (default ascending). The DESCENDING or ASCENDING options are used to override the default value, i.e. show the intervals in descending or ascending order.

VBOUNDARIES=var# uses a variable indicating the boundaries for building the histogram. The number of intervals created is n-1 (intervals are closed). If the first value is the largest, then the sort order is descending. The number of cases not in the histogram will be reported. Note that this option overrides all scaling options, if one of them is present.

SAVE_BOUNDARIES saves the bin-boundaries into a variable; the next free location is used and the copy is reported. This variable will contain all boundaries, i.e. it is longer by one than the number of intervals. This option is useful especially for input to the VBOUNDARIES option if you need to build a series of similar histograms.

The HISTOGRAM command may also be used to define a GVAR, i.e. cases belonging to the same interval are coded to have the same group membership. Use STORE_GVAR to do so.

HISTOGRAM (default form with case identifiers)

Default is to display case identifiers at each position; the full ids are shown followed by a blank character. EDA checks whether the CASIDS use all four possible characters by examining a sample of case identifiers. (This technique might fail on some special occasions; you will then have to use SYMWID=.) If SYMWID=n is present, where n indicates the number of characters of the CASIDS to show, this number is used (the possible range of n is 1 to 5, i.e. the four casid letters plus one blank).

The VERTICAL option can be used to display the casids or (see below) other information vertically instead of horizontally. This is useful to keep the shape information of a casid histogram closer to a standard histogram, especially when several lines would be needed to display a single bin.

Alternatively you may display GVARs instead of casids, or both of them (BOTH). In this second case the first two character positions are taken from the casid and the next two from the GVAR. This means that this form is limited to group numbers less than 100. In both cases you may specify GVAR=v#, i.e. instead of picking up the GVAR, the variable mentioned is taken for the group definition (only the integer part is used).

HISTOGRAM CODE

The CODE option replaces the casids by symbolic codes representing the variables (up to four) appearing on the CODE=var option. Various forms of <code> exist. They are explained in detail in Chapter 4 (Glossary).

SEQUENCE: Normally all cases within the same leaf (interval) appear in ascending (descending) order. This is ok as long as no values are identical; if some values are identical the sort algorithm used does not leave the cases in their initial order (as they appear in the WA). If their number is rather large this causes some trouble in locating specific cases. SEQUENCE leaves the cases within each interval in the order they appear in the WA, i.e. within an interval no sort is performed.

HISTOGRAM BAR

The BAR option displays a "classical" histogram using stars to represent values. The GAUSS option adds a Gaussian reference curve to the histogram, giving a rough idea of how close the displayed batch of numbers comes to a normal distribution.

NSAVE: this option saves the number of cases per interval to a variable, i.e. creates a variable in the WA at the next free location with a length equal to the number of intervals. Compare also to the STEMLEAF command, as well as the FREQUENCY (HISTOGRAM option) useful for integer variables.

LINE


 LINE x[,y] <method> [RESIDUALS{=v#}] [FITTED{=v#}] [TRASH]

<method> | [RLINE] [TRACE][STEPS=nstep] | [SHORT] [Tol=tolerance] | TUKEY [STEPS=nstep] | LSQ | LSQ1 | LSQORTHOGONAL

Fits a line to x (v1) and y (v2) and optionally copies the residuals and the fitted values into new variables (RESID= and FITTED= options).

You may either specify a command line containing both the x and y variable (e.g. LINE 1,2) or omit the y variable. If the y variable is omitted, the default YVARiable is used, as defined by the SET YVAR= option. This second possibility is especially useful when you are hunting for a good explanatory variable for the same variable to be explained. By default no YVAR is defined, i.e. LINE 3 without a previous SET YVAR will produce an error.

Note that if you specify e.g. LINE 3 and a SET YVAR=10 has been defined previously, the current variable list will be modified, i.e. after the LINE command the current list will contain 3 and 10. This is done because you might want to PLOT the variables; a PLOT without a new variable list will then do exactly what you want, i.e. plot x and y on the appropriate axes.

Default method

The default method used is the resistant line described by Velleman and Hoaglin. If TRACE is present, the result of each polish iteration is shown, as well as more information to assess the straightness of the line. If the procedure does not converge (a message will then be given along with the last half-slopes), more steps may be requested with the STEPS= option (default maximum=50). The tolerance value (default set to a system-defined small value) controls the precision required in computing the slope (remember that it is computed iteratively).

The program computes the half-slope ratio as in the Velleman and Hoaglin book, and also a modified version, HSR1, as proposed by Nosanchuck. HSR1 is always smaller than or equal to 1: the smaller half-slope is divided by the larger. Based on HSR1 a message is issued with the output (see Nosanchuck for the reasons):

  HSR1 0.9-1      Linear fit is appropriate
  HSR1 0.5-0.9    Linear fit may be inappropriate
  HSR1 0-0.5      Linear fit inappropriate

A serious warning is given with negative half-slopes. (The message may be suppressed with SHORT, if you don't like it.) If the fit cannot be computed, a ??? string is displayed (check your data).
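The HSR1 message logic can be sketched as follows. This is a hedged reading of the text, not EDA's code: the function names are invented, the assignment of the boundary values 0.9 and 0.5 to the upper class is an assumption, and "negative half-slopes" is interpreted here as a negative ratio (half-slopes of opposite sign).

```python
def hsr1(left_slope, right_slope):
    """Modified half-slope ratio: the smaller half-slope divided by the larger."""
    small, large = sorted((left_slope, right_slope), key=abs)
    if large == 0:
        return None  # ratio cannot be computed
    return small / large

def hsr1_message(left_slope, right_slope):
    ratio = hsr1(left_slope, right_slope)
    if ratio is None:
        return "???"
    if ratio < 0:
        # Assumption: the serious warning applies when the ratio is negative.
        return "Linear fit inappropriate (serious warning)"
    if ratio >= 0.9:
        return "Linear fit is appropriate"
    if ratio >= 0.5:
        return "Linear fit may be inappropriate"
    return "Linear fit inappropriate"
```

For half-slopes of 2.0 and 1.2 the ratio is 0.6, which falls in the "may be inappropriate" band.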

Tukey

The TUKEY option computes the Tukey-line as described in McNeil. The STEP= option may be used to control the number of iterations (default 1).

LSQ, LSQ1, LSQORTHOGONAL

LSQ computes a least squares line.

LSQ1 uses the prestored center estimate (default median) as the estimator of location, instead of the mean. As the center, stored with each variable, is a global attribute of a variable, you should not use LSQ1 with a case selection, unless you have a specific reason to compute it that way.

LSQORTHOGONAL computes statistics based on orthogonal least squares.

Copy residuals and fitted values

The RESID and FITTED options may be used in two different ways: either you specify the destination variable using RESID=v# (or FITTED=v#), or you simply specify RESID (or FITTED), in which case the program searches for a free location to store the new variable; if none is found, the residuals and/or fitted values are not copied. Note: a problem arises when residuals and/or fitted values are copied into the WA while a case selection is active (a solution will be found later; see the PUSH command). The residuals and fitted values contain only the included cases, i.e. their length differs from that of the original variables, and the casids of the residuals are therefore not correct. You should therefore use filtering whenever you wish to analyse subsets of cases.

TRASH (residual analysis)

The TRASH (TRimmed Absolute SHarpness) option displays the overall sharpness, sharpness-1 and sharpness-2 as well as the TRASH curve. Overall sharpness is the average residual (sum of the absolute values of the residuals divided by n). Sharpness-1 is the same, except that the largest (absolute value) residual is omitted. Sharpness-2 omits the two largest residuals. The TRASH curve shows the same, but computed for all residuals. This is used to check the residuals and to assess the quality of the fit. A good fit is shown by a very fast descent of the TRASH curve. See also the DIAGNOSTIC command.
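The sharpness values and the TRASH curve can be illustrated with a few lines of Python. The sketch assumes that each trimmed sharpness is the mean of the remaining absolute residuals (i.e. divided by the number of residuals kept); the text does not say whether EDA divides by the original n instead, and the function name is invented.

```python
def trash_curve(residuals):
    """Mean absolute residual after dropping the k largest, for k = 0 .. n-1."""
    ordered = sorted(abs(r) for r in residuals)  # ascending absolute residuals
    curve = []
    for k in range(len(ordered)):
        kept = ordered[:len(ordered) - k]  # drop the k largest
        curve.append(sum(kept) / len(kept))
    return curve
```

The first three entries of the returned list correspond to the overall sharpness, sharpness-1 and sharpness-2. For the residuals 1, -2, 3 and -10 the curve drops from 4.0 to 2.0 as soon as the single large residual is removed.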

References

Velleman & Hoaglin 1981; McNeil 1977; Erickson & Nosanchuck. Credits: Resistant line (default method): adapted from Velleman & Hoaglin 1981 (heavily modified); Tukey method: adapted from McNeil 1977 (modified and corrected).

LIST


 General options:

LIST vlist [START=cas#] [END=cas#] [NODISPLAY] [SORTED {KEY=keys | ALL | CASID | GVAR | C2} {ASC|DESCEND}] [ORDER {=cas# | =0.c1dim# | C1 | TABLE | MATRIX} {ASC|DESC}]

Numerical lists:

[DECIMALS=ndec] [FIX] [WIDTH=l] [INTEGER {ZERO}| ROUNDED {ZERO}] [NOFIX] [EXPONENT] [BLANK{=fuzz}] [BLANK | IF=val {fuzz=val} | IF~val {fuzz=val} | IF<val | IF>val ] [FLAG {{IFENCES} | HINGES | OFENCES} ["alt.symb"]]

Stream format: STREAM [WID=l] [DEC=ndec] [I=nitem.per.line]

Single case listing

CASE=#cas

Coded displays

CODED [MEAN] [UDIV=val] [GLOBAL][LONG]

DISTRIBUTIONAL [SIMPLE] [<o>] "interval.ids" [<o>] BIN [GLOBAL] [<o>] EXACT_FRACTILES [<o>] FRACTILES [<o>] READ_BINBOUNDARIES [<o>] ASIS [<o>] DICHOTOMY [<o>]

<o> ["alt.symbols"] [Width=CasidChars]

| MEDIAN | REFERENCE | MEAN | [FUZZ=val] | [CENTER] | ["alt.symbols"] | VALUE=v |

MARK=val [FUZZ=val] MARK | IF>val | | IF<val | ["alt.symbols"] | IF=val [FUZZ=val] | IF~val [FUZZ=VAL]

Split listing: (numerical and coded listings)

LIST v BYGVAR <params>

Sorted list with label/casid display

LIST vlist | TOPVAR | [ASCENDING] [SHOW=nvals] | TOPCAS | [LABELLENGTH=l] [NODISPLAY]

Lists all cases for the variables specified in vlist together with the case identifications and the group membership (if a GVAR is stored). The basic form of the command (default) produces a numerical display. The general options hold for all additional options, except for the CASE= option, which is limited to its simple purpose.

NODISPLAY suppresses the listing of cases on the terminal (to inhibit lengthy output on the terminal). Note that NODISPLAY is the same as the global option /D. If one of the nodisplay options is active the number of variables printed depends on the printer width. In all other cases the number of variables shown depends on the screen width.

SORT options

If the SORT option is present, the listed cases are sorted on the first variable in the list, unless different keys are specified. Default is to sort on a single key; more than one key may be specified using the KEY=(keys) option. Note that up to MXDIM keys may be specified that way. (MXDIM is an implementation-dependent option, in most cases set to 8.) The first key is the primary sort key, the second the secondary key, etc. You may also specify ALL instead of a key list, if you want to use all variables on the vlist as sort keys. Note that the variables listed on the KEY= option need not be included in the current variable list, i.e. you may sort on variables you do not list.

Instead of variables you may also sort on CASIDs, the GVAR or the first dimension in the C2 (case-related) configuration. Note to advanced users: With C2 only minimal checks are performed to make sure that the data in the WA correspond to the data in C2. In standard situations, e.g. when working with principal components, this is correct, unless you do tricky things with the configurations.

When using these options you can specify only a single sort key, unless you use the KEY=keys option. You may include references to C2, GVAR or CASID by using the following numbers:

  positive integers     variables in the WA
  0                     GVAR (must be stored)
  -1                    CASID
  0.dim                 dimension #dim in C2

The sort order depends upon the setting of the SORT switch (see SET SORT; default ascending). The ASCENDING or DESCENDING option is used to override the default sort order for a particular LIST command.

ORDER (sorting variables)

The various forms of the ORDER option rearrange the order in which the variables appear on the list.

IMPORTANT: ORDER changes the sequence of variables in the current variable list, i.e. subsequent commands without an explicit variable list will use that order.

You may use the ASCENDING or DESCENDING option (see the SORT option for details). If you use SORT and ORDER on the same command the sort order will be the same for case (SORT) and variable (ORDER) sorting.

ORDER{=case#} orders the variables according to a specified case. If the ORDER=case# form is used, the case specified is the sort key and the variables will appear in that sequence.

If ORDER is used without a key, all cases are used as sort keys (the first case is the first key, the second case the secondary key, and the last case the last sort criterion used to determine the order of the variables). If the SORT option is specified on the same command line, the sort key ordering for variable sorting will be the sorted sequence of cases. SORT (ordering of cases) is performed before the variables are ordered.

ORDER=0.c1dim# (e.g. 0.2) sorts the variables on the second dimension of C1. This option is useful when you want to sort on a dimension other than the first (which is used when the ORDER C1 option is used on the command line). Note that absolute values are used.

ORDER MATRIX orders the variables on the similarity or dissimilarity measures found in MATRIX (e.g. correlations). The first variable in the variable list determines which row of the matrix is considered: with LIST 10, 1-9 the variables are ordered according to the correlation (or distance etc.) between variable number 10 and the other variables on the list. If the variables cannot be located in C1 an error message is given. Note that absolute values are used.

ORDER C1 orders the variables in the list according to the values found in the first dimension of C1. See also ORDER=0.c1dim. If the variables cannot be located in C1 an error message is given. Note that absolute values are used.

ORDER TABLE orders the variable according to the table tie.

Note that more complex ordering of variables can be achieved in combination with the VAR command, offering a large number of selection and sorting options. E.g. If you want to list variables ordered by their respective medians you could type the following:

    VAR 1-20 SORT MEDIAN
    LIST
The first line defines a variable list by computing the median of all variables in the list and sorting them accordingly. The LIST command then takes up the current variable list.

Should you require additional facilities for ordering, refer to the SORT command (note however that the SORT command changes the order of variables/cases in the WA; the LIST command never affects the WA, only the way data is displayed).

START/ END

The START=case# and END=case# options are used to display a selected range of cases. If START is not specified it defaults to 1; END defaults to the number of cases (n). If the cases have been sorted, the START/END specification refers to the sequence AFTER sorting, i.e. LIST 1 2 3 SORTED START=12 END=20 skips the first 11 observations with the smallest values on variable 1, and then lists observations 12 to 20.

Please notice that with sorted lists you cannot reliably use case identifiers on the START= or END= option, as in the current EDA version e.g. START=CID1 will look up the sequence number of a case named 'CID1' without adjusting it for the sequence change in the list due to the sorting process.
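The interaction between sorting and START/END amounts to "sort first, then slice by position". A minimal Python sketch of that idea, with invented names and case tuples standing in for the WA:

```python
def list_sorted_range(cases, key_index=0, start=1, end=None, descending=False):
    """Sort the cases on one column, then keep positions start..end (1-based)."""
    ordered = sorted(cases, key=lambda row: row[key_index], reverse=descending)
    end = len(ordered) if end is None else end
    # START/END refer to the sequence after sorting, not to the original
    # case numbers.
    return ordered[start - 1:end]
```

This also shows why a case identifier on START= is unreliable with sorted lists: the original sequence number of a case no longer matches its position once the rows have been reordered.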

Numerical formats

Normally the column width for each variable depends on the size of each variable. In some circumstances, however, you may want to change that display format (e.g. in order to see the full variable name, which is truncated for short column widths).

The number of decimal places shown depends on the setting of SET DECIMALS. The DEC=#dec option may be used to override the default. Setting DEC=0 is the same as the INTEGER option explained below.

If you want the same column width for all variables, use the FIX option, which will use the width required to show the largest value in the WA. The WIDTH=n option is used to ask for a different width. Two more options allow for convenient display of large numbers in exponential form. The EXPONENT option forces the use of exponential form, whereas the NOFIX option uses exponential form only when the size of the number to display calls for it, given the column width. In both cases the L= and D=0 option may also be used.

The INTEGER option displays only the integer part of the variables. In this case zeros appear as blank fields. ROUNDED does the same, but instead of truncating the numbers they are rounded. A value of zero is shown as blank, unless the ZERO option is added.

BLANK

The BLANK option shows specific values as blanks instead of numbers. The simple form, BLANK without an IF specification shows all values close to zero as a blank field. If BLANK is entered without further specification the EDA wide fuzz value (see SET FUZZ) is used to determine how close a value may be to zero; if BLANK=fuzz then this value is used. If a B=0.5 is used, then all values between -0.5 and 0.5 appear as blank fields.

The IF option blanks specific cases according to some numerical criterion: IF=value blanks cases with numerical values equal or close to <value>; fuzz may be used to qualify equality. IF~value, IF>value and IF<value are used to specify inequality, greater and less than conditions.

FLAG

(Numeric displays) Values outside the inner fences (IFEN, default option), values outside the hinges (HINGES) or values outside the outer fences (OFEN) are flagged, i.e. instead of the numerical value a special symbol is shown. By default this symbol is "===" for values below the criterion (below the lower inner fence, lower hinge or lower outer fence) and "&&&" for values above the selected criterion (upper inner fence, upper hinge or upper outer fence). Note that these symbols appear right justified; if the display format is smaller than three positions, fewer symbols are shown, i.e. "==" or only "=".

You may change the default symbols using the "alt.symbol" option. The first half of the string is used for the low values, the second half for the high values, e.g. LIST 1-10 FLAG "Low High" displays "Low" for values below the (default) inner fence and "High" for values above. Note the space after "Low": it is needed to make the character count even (four characters for the first symbol and four for the second).

The FLAG option is applied after the BLANK option, i.e. values blanked might be changed into flagged values.
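The way the flag symbols are split and justified can be sketched in Python. The cut-off values `low` and `high` stand for whichever criterion is selected (inner fences, hinges or outer fences) and are simply passed in; the function name, the stripping of padding blanks and the truncation of long symbols to the column width are assumptions made for the example.

```python
def flag_value(value, low, high, width, symbols="===&&&", decimals=1):
    """Format one value, replacing it by a flag symbol if outside [low, high]."""
    half = len(symbols) // 2
    # First half of the string flags low values, second half high values.
    low_sym, high_sym = symbols[:half].strip(), symbols[half:].strip()
    if value < low:
        text = low_sym[:width]    # shortened if the column is too narrow
    elif value > high:
        text = high_sym[:width]
    else:
        text = f"{value:.{decimals}f}"
    return text.rjust(width)      # symbols appear right justified
```

With `symbols="Low High"` the string is cut after four characters, which is why the padding blank after "Low" is required.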

LIST STREAM

Normally the variables are displayed in column form, i.e. one case per output line. An alternative is STREAM, where cases are displayed horizontally using as many lines as needed to output a variable. The Width/Decimals options are the same as for the default form. The Item=nitem option controls the number of items output on a single line (default: depending on the device (terminal / printer) width); note that changing Item= together with changes in width= might not exactly produce what you want, as whenever the line is full, the next item will be shown on the next line, no matter the options you have set. Some adjustments will often be needed before you get what you want.

CASE

The CASE=case# option displays the variables on <vlist> only for the case specified. SORT, START/END and other options are meaningless, as the option refers to a single case. Note that the <case#> option always refers to the true case number, even if a case selection is active.

CODED lists

The CODED option displays the cases in coded form instead of the numeric value. The code symbolizes the position of a particular case within the overall distribution of the variable, using one '+' or '-' for each unit of the deviation measure by which the case is distant from the center of the distribution. The default center estimate is the median, with the midspread as the deviation measure; the MEAN option changes this to the mean and the standard deviation. The UDIV value (default 2) controls the unit for each "+" or "-", i.e. the default unit is 1/2 midspread (1/2 standard deviation if MEAN is specified). The GLOBAL option specifies that the coded displays should be based on the global median/mean, rather than the local one, when a case selection is active. The LONG option also prints the median (or mean) and the corresponding measure of spread for each variable.
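A small Python sketch of the CODED idea: the distance from the center is expressed in units of spread/UDIV and shown as a run of '+' or '-' characters. The function name is invented, and truncating to whole units is an assumption, since the rounding EDA applies is not stated.

```python
def coded(value, center, spread, udiv=2):
    """Return one '+' or '-' per unit (spread / udiv) of distance from center."""
    unit = spread / udiv
    steps = int(abs(value - center) / unit)  # assumption: whole units only
    return ("+" if value > center else "-") * steps
```

With a median of 10 and a midspread of 4 (unit = 2), a value of 15 is shown as "++" and a value of 4 as "---".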

A series of options produce a coded display in which each line contains a variable. Only the first cases fitting on a terminal line (typically 80 columns) or a printer line (typically 132) will be displayed. For each case a single-character symbol is displayed. All forms have a common option, WIDTH=casidchars, controlling the number of characters shown from the case identifier (the default length depends upon the length of the case identifier, see the CASID command).

Note that the coding options are explained here briefly. Refer to the Glossary (Chapter 4, The Art of Coding) for more details.

REFERENCE marks positive differences from a reference value with a "+" symbol, negative differences with a "-". (These symbols can be changed using the SET GRAPH command.) The default reference point is the CENTER of the variable. (See the glossary for an explanation of the CENTER concept/value.) Options include the MEAN, the MEDIAN or an arbitrary value given by the VALUE=val option. Values equal to the reference value appear as blank characters. Equality may be qualified using a fuzz value: for example, LIST 1-10 REFERENCE MEDIAN FUZZ=10 codes each variable relative to its median and shows values close to the median as blanks, where close means all values between median-10 and median+10 inclusive.

DISTRIBUtional indicates whether a case is "far-out", "out", "adjacent" or "in" using the standard EDA symbols. The SIMPLE option does not distinguish between upper and lower far, out and adjacent values. (See the glossary for more details). This option is sensitive to the SET DEFOUT settings.

The MARK option marks cases corresponding to a specific criterion with the default marker, other cases are blank. Criteria are greater than, less than, equal to and not equal to. EQUAL and NOTEQUAL are sensitive to the FUZZ value either specified or implied (see REFERENCE).

BIN or "string": Whenever only a string is specified, each character is used to represent an interval of equal width on the variable. The number of intervals (bins) depends upon the length of the string (number of characters specified between the double ""). If BIN is present the range is divided into four intervals corresponding to the EDA depth symbols and may show semi-graphical characters.
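The "string" form, where each character stands for one interval of equal width, can be sketched like this in Python. The function name is hypothetical, and placing the maximum value in the last interval is an assumption about how the upper boundary is handled.

```python
def code_equal_width(values, symbols):
    """Code each value by the symbol of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(symbols)  # one interval per character
    coded = []
    for v in values:
        index = int((v - lo) / width) if width > 0 else 0
        # The maximum falls on the upper boundary; keep it in the last bin.
        coded.append(symbols[min(index, len(symbols) - 1)])
    return "".join(coded)
```

A four-character string such as "abcd" therefore yields four intervals, each covering a quarter of the range.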

Other options are similar: FRACTIONAL creates bins containing approximately the same number of cases, EXACT creates bins with exactly the same number of cases (if possible, of course), and READ asks the user to enter the interval boundaries from the keyboard.

ASIS (useful for categorical variables) uses the numerical values as they are and codes them.

DICHOTOMY shows values greater than 0 as "1" (default) and 0 or less as blanks.

The GLOBAL option (intervals of equal width only) may be used to specify that the intervals are defined on the whole variable, even if a case selection is active (default is to compute the intervals on the cases included).

LIST BYGVAR

LIST BYGVAR displays the cases belonging to each group in separate columns (compare to SHOW SPLIT). Only one variable is allowed with this form. All options regarding numerical displays and coded displays are valid with this command format. Cases which are members of no group appear in group 0 (remember that group 0 means no group). If the number of groups is too large to fit the screen, all remaining cases appear in group 0 (a warning is given). A GVAR is required for this command.

Note that there is also a LST output procedure (--> expressions) which may be used to produce simple customized lists.

TOPCASE/TOPVARIABLES : displays case identifiers (TOPCAS) or labels (TOPVARIABLES) for the top values (sorted in descending order) for each case or variable.

For instance, in a data set containing the percentages obtained by 10 political parties (variables) in 26 voting districts (cases), LIST 1-10 TOPVARIABLES will show 26 lines, one for each district, containing the labels of the parties sorted within each district, thus showing how the parties performed.

Default is to show as many items as a screen line can hold (unless there are fewer items). SHOW=nitems is used to show the top nitems. ASCENDING performs an ascending sort, i.e. bottom values are shown. LLength= is used to shorten the section of the label/casid shown (thus making room for more items). Default is to show all 8 characters of the labels and all 4 characters of the casids.
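For a single case, TOPVARIABLES amounts to sorting the variable labels by that case's values. A minimal Python sketch, with invented names and made-up party labels:

```python
def top_variables(labels, row, show=None, ascending=False):
    """Return the variable labels ordered by one case's values (top first)."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=not ascending)
    ranked = [labels[i] for i in order]
    # show limits the output to the first nitems, like SHOW=nitems.
    return ranked if show is None else ranked[:show]
```

Applying the function to every row of the data matrix gives one line per case; TOPCAS does the same with the roles of cases and variables exchanged.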

NODISPLAY produces the list only to the print file (using the full print page width).

LOWESS


LOWESS vx,vy [F=proportion]  [NOROBUST]
             [SMOOTHED=var#] [RESID{=#v}] [NOPLOT]

LOWESS (locally weighted regression scatter plot smoothing) is a tool for studying the dependence of y on x by smoothing.

A smooth set of values is produced using the following procedure: each case and its neighbours are considered (the number of neighbours depends on the value of the F option). For each point, neighbourhood weights are defined for all neighbours (the weights decrease as neighbours move away from the point considered); then a line is fitted to the points using weighted least squares regression. This procedure is repeated for each point in order to produce a y-hat value for each y value.

Unless NOROBUST is specified, robustness is then achieved as follows: after proceeding as above, robustness weights are computed from the residuals, and the y-hat values are refitted using both the neighbourhood weights and the robustness weights.
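The procedure can be sketched in plain Python. The tricube neighbourhood weights, the bisquare robustness weights and the scale of six times the median absolute residual follow Cleveland (1979); they are assumptions here, since the text does not state which weight functions EDA uses, and all names are invented for the example.

```python
def tricube(u):
    """Neighbourhood weight: 1 at the point itself, 0 at distance h or beyond."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def weighted_line_at(x0, xs, ys, ws):
    """Fit a weighted least squares line and evaluate it at x0."""
    sw = sum(ws)
    if sw == 0:
        return None  # no usable neighbours
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def lowess(xs, ys, f=0.25, robust_iterations=2):
    """Return y-hat values; robust_iterations=0 corresponds to NOROBUST."""
    n = len(xs)
    k = min(n, max(2, int(f * n + 0.5)))  # neighbours used at each point
    robustness = [1.0] * n
    fitted = list(ys)
    for iteration in range(robust_iterations + 1):
        for i, x0 in enumerate(xs):
            # h is the distance to the k-th nearest point (the point itself counts).
            h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
            ws = [tricube((x - x0) / h) * r for x, r in zip(xs, robustness)]
            value = weighted_line_at(x0, xs, ys, ws)
            if value is not None:
                fitted[i] = value
        if iteration == robust_iterations:
            break
        residuals = [y - fit for y, fit in zip(ys, fitted)]
        s = sorted(abs(r) for r in residuals)[n // 2]  # (upper) median
        if s == 0:
            break  # perfect fit, nothing to downweight
        robustness = [(1 - (r / (6 * s)) ** 2) ** 2 if abs(r) < 6 * s else 0.0
                      for r in residuals]
    return fitted
```

Each pass fits one regression per point, which is why the robust form costs several times as much as the NOROBUST form.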

EDA proceeds as follows: a set of smooth values (y-hat) is produced from the x and y variables. Unless SMOOTHED=var# is present, the smoothed set of values is copied into the next free location in the WA; SMOOTHED=var# is used to specify the location of that variable. Then a plot is produced containing x, y and y-hat. NOPLOT inhibits this plot.

The F=value option is used to indicate the fraction of the data used for smoothing at each point (neighbours). F defaults to 0.25, i.e. one fourth of the cases are considered at each point.

RESID or RESID=targetvar# may be used to store the residuals into a variable in the WA.

The variables produced by LOWESS have the following names: LOWnnnnn for the smoothed values and RLnnnnn for the residuals, where nnnnn is the variable number of the original y variable. The descriptor contains additional information.

Note that on slower machines and with larger data sets this procedure will be quite slow, since a regression is performed for each point, and this is done repeatedly (unless you use the NOROBUST option).

LOWESS is also available as a command within PI, the PLOT_INSPECT module.

References

Cleveland, 1979, Chambers et al. 1983. Credit: This command has been programmed by B. Rapacchi, CICG Grenoble.

MAP


Display a map

MAP v [INTEGER] [<opt>]
MAP v ID ["idnam"] [<opt>]
MAP v GVAR [NONAMES] [<opt>]
MAP v CUT | NBINS=n | BINBOUNDARIES=(b1,b2,...) ["symbols"] [<opt>]
MAP v[,v2,v3,v4] CODE <code> [<opt>]

Advanced option

MAP CREATE ["fnam"] [N=nid]

<code> BINS | [FRAC] | EXACT | READ ["alt.symbols"]
       DISTRIBUTIONAL [SIMPLE] ["alt.symbols"]
       REFERENCE=value [FUZZ=val] ["alt.Symbols"]
       MARK|=val | IF>val |IF=val| IF<val | IF~val ["alt.symbols"] [FUZZ=val]
       ASIS ["alt.symbols"]
       DICHOTOMY ["alt.symbols"]

<opt> [NODISPLAY | RAWOUT] [BLANK_IF_MISSING]

Draws a simple display map on the screen (character terminal). In order to use this command a map needs to be defined for the current case identifiers (casids). See below for additional information on maps and how maps can be built. The explanations of all command forms displaying a map assume that a map exists for the current data in the work area. (If this is not the case, use of any of the options produces the error message <No map for these casids <name>>.)

Numerical display (default map)

By default EDA displays the (numerical) values of a variable on the map, i.e. each value appears in a predefined location on the map. If there is no value for a particular observation (e.g. due to a case selection), a ? symbol is shown. The print positions on the map are four characters long; if this field is exceeded and no simplification (dropping decimal positions) is possible, a simplified exponential representation is used, e.g. 12+3 meaning 12E+003.

The INTEGER option specifies that only the integer part of the variable should appear on the display.

ID maps

If the ID option is present, a map containing the casids is displayed. This is useful to show the location of the different cases. Note that in order to use the ID option, the case identifiers must be the identifier type for which a map exists.

It is however possible to display any case-id type on a map by indicating the map type with "id". Then the command puts the current casids on the map, regardless of the type. It is the user's responsibility to make sure that the cases are in the correct sequence, i.e. the sequence of the cases as specified in the map archive. If this is not the case, results may be completely wrong and EDA is unable to detect such a condition.

GVAR maps

GVAR shows GVAR memberships instead of the contents of a variable. If there are names (user defined names) for the groups the names will be shown (truncated to the first four characters), unless the NONAMES option is added, causing the membership numbers (integers showing group id numbers) to be shown instead.

Coded displays

There are two forms of coded maps: (1) maps with text labels (up to four letters) corresponding to specified intervals and (2) single character codes for up to four variables. The second form follows the same rules and syntactical conventions as other EDA commands using coding (e.g. LIST or HISTOGRAM).

The first form (alpha labels) is invoked using the CUT, NBINS= or BINBOUNDARIES= option. It shows a different user specified label for each interval defined on the variable you want to map. This implies that you need to supply cutpoints, i.e. numerical values indicating where the program should cut the variable in order to define the intervals, as well as a label for each interval. This information may be supplied in different ways.

If you are using the default format the program will ask you to enter the cutpoints interactively, followed by the labels for each interval. With CUT you will be asked to enter cutpoints. The number of cutpoints you enter will inform EDA how many intervals you want (number of bins = number of cutpoints + 1). EDA will check whether the supplied cutpoints are within the range of the variable. If this is not the case you will be asked to reenter the cutpoints.

The other options let you specify more information on the command line. If NBINS=nbin is present you will be asked to enter nbin-1 cutpoints used to define nbin bins. EDA will prompt you for the cutpoints and will show the minimum and the maximum value for the variable being mapped.

If BINBOUNDARIES=() is present, EDA will only ask for the labels of each bin, as the bin boundaries are specified on the command line. If, in addition to this option, you specify the labels in "symbols", no further information is needed. Note that the "symbols" option expects as many four-character labels as there are bins to be created. If "symbols" is shorter it will be padded with blanks, i.e. some intervals will be shown as blank characters and will not appear on the screen.

When EDA prompts for cutpoints you will be shown the minimum and the maximum of the current variable. The label prompt shows the bin boundaries for the intervals to be labelled.
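The relation between cutpoints, bins and labels can be illustrated with a short Python sketch. It is not EDA code: the function names are hypothetical, the labels are assumed to be packed into one string of four-character pieces, and a value equal to a cutpoint is assumed to fall into the upper bin (the manual does not say which side is used).

```python
from bisect import bisect_right


def split_labels(symbols, nbins, width=4):
    """Cut a label string into nbins pieces of `width` characters.

    A string that is too short is padded with blanks, so the
    remaining bins get blank labels.
    """
    padded = symbols.ljust(nbins * width)
    return [padded[i * width:(i + 1) * width] for i in range(nbins)]


def label_values(values, cutpoints, symbols):
    """Return the label of the bin each value falls into."""
    cuts = sorted(cutpoints)
    labels = split_labels(symbols, len(cuts) + 1)  # bins = cutpoints + 1
    # Assumption: a value equal to a cutpoint goes to the upper bin.
    return [labels[bisect_right(cuts, v)] for v in values]


# Two cutpoints define three bins.
print(label_values([5, 15, 25, 40], [10, 30], "LOW MID HIGH"))
```

With two cutpoints (10 and 30) three labels are needed; if only "LOW " were supplied, the second and third bins would be shown as blanks.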

The second form of coding is invoked with the CODE option (offering the same options as all other EDA commands using coding options; see the section on coding in the glossary). As each map position is four characters wide, you may map up to four different variables. Each character represents the observation coded according to some criterion specified by the <code> option. Default is to divide each variable into four bins, where each bin contains approximately the same number of cases. Each bin is then represented by a different symbol. Refer to the glossary (Chapter 4, the Art of coding) for more information on coding.

Obsolete options

CUT=number_of_cutpoints, while still available, is no longer documented. NBINS=number_of_bins is now used to indicate the number of bins (= number of cutpoints + 1!); alternatively you may use the BINBOUNDARIES= option to specify the bin boundaries (or cutpoints). The CUT= option has been abandoned to make the MAP command consistent with other commands performing similar tasks. Note that with the new additional options the MAP command has become more flexible.

Options common to all map display commands

BLANK_IF_MISSING: normally, data values missing for locations on a map are shown as question marks (?). BLANK_IF_MISSING replaces the question marks with blanks, i.e. nothing is shown at the particular location.

RAWOUT directs the map output to the RAWOUT file. If RAWOUT is not open (see SET RAWOUT for details) an error message is issued and the map is not produced. This option is intended for situations where the actual map is not drawn by EDA, but by some other piece of software. In this situation the EDA map might not be a real map, but a set of directives for some cartography package, with the variable information filled in by the EDA "map". This of course only makes sense from within a macro taking care of all the rest, including the call to the cartography software.

NODISPLAY: inhibits the display of the map on the screen (PF only). This option is particularly useful if the map defined does not fit on the screen (long lines). This option is obsolete as the global /d option available with all EDA commands offers a more general solution.

More information on maps (*)

This section provides some background information on maps; the MAP display options do not work unless a map has been defined previously. Note that, when you need to build maps or manage them, you will need to read additional sections of this manual (the appendix on map formats and the MA (Map-Archive) tool found in the EDA tool box).

The link between a particular data set and a map is established via the case identifiers. If you type MAP, EDA will search for a map for the current case identifiers, and if one can be found the map will be displayed along with the requested information. If no map can be found, a message informs you that no map is available for the current case identifiers. If this happens, make sure that a map exists and that the CASIDS for the current WA are the ones you intended (they may have changed during data management tasks).

In most situations maps will be stored in a map archive established by the EDA system or group administrator. It is also possible to store maps in EDA archives, i.e. whenever the GET command accesses that archive the map will be read and made available for the MAP command (as long as that archive resides in the current work area). The program searches in the following order: the currently active user defined map, then the archive. Use the CASID LIB command to find out whether there is a map for the current casids. The same command will also inform you whether there is a map archive or not. Below you will find useful hints explaining related commands and tasks.

See the CASID command for details on how to define new casids.

MAP CREATE: The CREATE option creates a map from an external file, i.e. it reads in the map you have put into a file and makes it into the currently active map (user defined map). If the file name is not specified, the default name is <id>MAP (where <id> is the identifier of the current case ids). The N=nid option is used to tell EDA how many different cases are to be displayed on the map. This option is required, unless the number is defined in the file. (See the appendix for more information.)

The currently active map may be saved on an EDA file using the MAP option on the PUT command; then this map will always be loaded and made into the active map, whenever the WA is accessed.

See HELP "FMT:MAPS" for more information on how to prepare a map suitable to be read with a MAP CREATE command; see also the HELP on the AE (archive editor) in the EDA toolbox.

For more details refer to the appendix sections on map creation and the AE (archive editor).

MARCOM


  MARCOM v1[,v2] [ GINI | INEQUALITY | LORENZ]

MARCOM (= Marginal comparison) is essentially designed for comparisons between groups, where representativity considerations are important.

Without options the Lorenz curve plus inequality measures are displayed. If LORENZ is present only the Lorenz curve is displayed. GINI and INEQUALITY display only the coefficients. If the vlist contains only one variable, the ordinary Gini coefficient is displayed. If two variables are present, an inequality measure between the two variables is computed.
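For readers unfamiliar with these quantities, the Lorenz curve and the ordinary Gini coefficient of a single variable can be sketched in Python as follows. This uses the textbook definition (one minus twice the area under the Lorenz curve) and assumes non-negative values with a positive total; whether MARCOM applies exactly this formula, and how the two-variable inequality measure is computed, is not stated here. The function names are hypothetical.

```python
def lorenz_points(values):
    """Return the Lorenz curve as (share of cases, share of total) pairs."""
    xs = sorted(values)
    total = sum(xs)
    n = len(xs)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / n, running / total))
    return points


def gini(values):
    """Gini coefficient: 1 minus twice the area under the Lorenz curve."""
    pts = lorenz_points(values)
    # Area under the piecewise linear curve (trapezoid rule).
    area = sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1.0 - 2.0 * area


print(gini([1, 1, 1, 1]))  # equal shares: no inequality
print(gini([0, 0, 0, 4]))  # one case holds everything
```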

MDIAGNOSTIC


MDIAGNOSTIC [vlist] | [PROBABILITY] [PLOT=(xchar#,ychar#)]
                    | ANDREWS [VERTICAL] [PLOT=(x#,y#)]

Multivariate diagnostic routines. Currently a probability plot and Andrews' multivariate plot are available.

The Probability plot is used to help to assess multivariate normality. A multivariate probability plot may be defined by:

D2i = trans(x(i) - mean(x)) inv(S) (x(i) - mean(x))

where S is the variance-covariance matrix, and x(i) - mean(x) is the vector of differences between the individual responses and the means of the variables. The D2i are squared distances (the Mahalanobis distance of the individual to the center of the multivariate distribution). It can be shown that, under multivariate normality, these distances approximately follow a chi-square distribution with degrees of freedom equal to the number of variables.

Here again (like in normality plots for single variables) a straight line indicates normality. (cf. Gnanadesikan 1977, Everitt 1978).
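The squared distances D2i defined above can be computed as in the following Python sketch. It is an illustration, not the EDA routine: the function names are hypothetical, S is assumed to be computed with the n-1 divisor, and the chi-square quantiles needed for the plot itself are not included.

```python
def mean_vector(rows):
    """Column means of a list of cases (each case is a sequence of values)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Variance-covariance matrix S (assumption: divisor n - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def squared_distances(rows):
    """D2i = trans(x(i) - mean(x)) inv(S) (x(i) - mean(x)) for every case."""
    m = mean_vector(rows)
    s = covariance(rows)
    result = []
    for r in rows:
        d = [x - mu for x, mu in zip(r, m)]
        # inv(S) d is obtained by solving S z = d instead of inverting S.
        result.append(sum(di * zi for di, zi in zip(d, solve(s, d))))
    return result
```

For the probability plot, the sorted D2i would be plotted against the corresponding chi-square quantiles. With the n-1 divisor the D2i always add up to (n-1) times the number of variables, which gives a simple check of the computation.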

The variables included depend on the setting of the ALLVARS switch. The PLOT option is used to control the plot size in characters: x (up to 130) is the dimension across the screen, whereas the vertical dimension y# is not limited.

ANDREWS

Andrews' multivariate plot operates on the matrix (configuration) stored in C2 (usually principal components). Up to 26 cases can be plotted simultaneously; the program queries for more cases to plot until an empty line is entered.

PLOT=(x#,y#) are used to control the dimension of the plot in characters. x may be up to 130, whereas the maximal y# depends on internal storage.

VERTICAL is used to turn the plot (default x as y,and y as x), causing the scale from -pi to +pi to appear on the x axis. This may be useful to produce more readable plots with large numbers of cases using the full printer/screen width.
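The curve drawn for each case can be sketched in Python using the function proposed by Andrews (1972), f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated for t from -pi to +pi. The function names and the number of evaluation points are assumptions; the sketch does not reproduce the character plot.

```python
import math


def andrews_value(row, t):
    """Evaluate Andrews' function for one case (row of coordinates) at t."""
    total = row[0] / math.sqrt(2.0)
    for j, x in enumerate(row[1:], start=1):
        harmonic = (j + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if j % 2 == 1:
            total += x * math.sin(harmonic * t)
        else:
            total += x * math.cos(harmonic * t)
    return total


def andrews_curve(row, npoints=61):
    """Return (t, f(t)) pairs for t running from -pi to +pi."""
    step = 2.0 * math.pi / (npoints - 1)
    return [(-math.pi + k * step, andrews_value(row, -math.pi + k * step))
            for k in range(npoints)]
```

Cases with similar coordinates produce curves that stay close together over the whole range of t, which is what makes the plot useful for spotting groups and outlying cases.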

References:

Lingoes 1973; Andrews 1972. Credits: PPCHI2 subroutine is algorithm AS 91 (1975), GAMAIN algorithm AS 32 (1970), programmed by Dominique Joye. The PROBABILITY and ANDREWS options are based on a Fortran program initially written by Dominique Joye, IREC/EPFL Lausanne, formerly University of Geneva.

MEDPOLISH


    MEDPOLISH t [<options>]
    MEDPOLISH v TABLE=(nrows,ncols)  [<options>]

<options1> [CFIRST] [HSTEPS=nhs] [NOCENTER]

<options2> [RESIDUALS] [ROWeffects] [COLumn_effects] [COMPPLOT] [CODRES {str | ALTERNATE}]

Fits an additive relation to a table using median polishing.

Note: <options2> are common to the ADDFIT and MEDPOLISH commands (they are explained below); <options1> are specific to this command.

The table is either defined as such or contained in a variable and defined with the TABLE=(nrows,ncols) option. Tables may be defined from variables with the MAKE TABLE command or created for instance with the BREAK or CROSSTABS commands using the appropriate options.

CFIRST starts the polishing iterations on columns instead of rows.

HSTEPS= lets you define the number of half steps to be performed. A half step corresponds to one iteration step across the rows or the columns. (A full step is a full row and column sweep.)

NOCENTER does not center the row and column effects on the row and column medians.

Options common to MEDPOL and ADDFIT

RESID replaces the original table by the residual table.

The program displays the row (cases) and column (variables) effects, the typical value and the mean absolute deviation. The ROW and COLUMN options copy the row and column effects, respectively, into free locations of the work area as new variables.
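Median polishing by half steps can be sketched in Python as follows. This is an illustration of the general technique, not the EDA algorithm: the function name and the default of four half steps are assumptions, the keyword arguments merely mirror HSTEPS= and CFIRST, and the NOCENTER option is not modelled (the effects are always centred on their median).

```python
from statistics import median


def median_polish(table, hsteps=4, cfirst=False):
    """Return (typical value, row effects, column effects, residuals)."""
    resid = [[float(v) for v in row] for row in table]
    nrows, ncols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * nrows
    col_eff = [0.0] * ncols
    sweep_rows = not cfirst
    for _ in range(hsteps):
        if sweep_rows:
            # Half step across the rows: remove each row median.
            for i in range(nrows):
                m = median(resid[i])
                resid[i] = [v - m for v in resid[i]]
                row_eff[i] += m
            # Centre the column effects; their median moves to the typical value.
            m = median(col_eff)
            col_eff = [c - m for c in col_eff]
            overall += m
        else:
            # Half step across the columns: remove each column median.
            for j in range(ncols):
                m = median(resid[i][j] for i in range(nrows))
                for i in range(nrows):
                    resid[i][j] -= m
                col_eff[j] += m
            m = median(row_eff)
            row_eff = [r - m for r in row_eff]
            overall += m
        sweep_rows = not sweep_rows
    return overall, row_eff, col_eff, resid
```

At every stage each cell equals typical value + row effect + column effect + residual, so the residual table shows what the additive fit leaves unexplained.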

References

McNeil 1977, chapter 7; Velleman & Hoaglin 1981 Chapter 8. Credit: The MEDPOLISH algorithm is based on Velleman and Hoaglin's Fortran program (heavily modified).

PI


PI (PLOT_INSPECT) is explained below with the PLOT command.

PLOT


The PLOT command comes in two formats. The first format is used to plot two variables; it draws a plot which may be analyzed and inspected using the PI (plot inspect) commands (see below for details). The second format is used to plot several variables against another and is explained below.

Format 1

PLOT v1,[v2] [BIG] [<type-spec>] [<axis-spec>] [<size-spec>] [<limit-spec>]

<type-spec> <symbol> [FULL | SYMWID=nc] [POSITION=charnum]

<symbol> | DOTS | CASID | NUMBER | GVAR{=var#} | THIRD=var# {"altsym"} [<code>] [NOSYMBOL]

<axis-spec> | NO{X|Y} | BOX | BA{X|Y} | [FRAME]

<size-spec> [XUNITS=val] [YUNITS=val] [FULL | SYMBWIDTH=n]

<limit-spec> | [{NO}GLOBAL | PERCENT | LIMIT{=(x1,x2,x3,x4)} | NOOUTLIER or IFENCES | NOFAROUT or OFENCES | HINGES]

<code> see below (explanation of THIRD=)

Format 2

PLOT v1,v2,v3,[v4,v5,v6] [XUNITS=val] [YUNITS=val] [BIG] [LIMIT] ["alt.symbols"|LETT] [DENSITY {"alt.symbols"}]
PLOT v1 SCAT same options
PLOT v1,v2 SCAT same options

Format-1

Plots v1 on the horizontal (x) axis and v2 on the vertical (y) axis.

If only v1 is present it is plotted on y against sequence (sequence plot). Many options provide control of the form of the plot, as well as of the type of information shown. Note that the options offered with the PLOT command here are also available as commands from within PI (see below).

If no overall default values have been changed (see below), a simple plot command will produce a screen-size plot using dots for each observation.

Markers (symbols)

The default symbol used on the plot depends on the setting of the SET PLOT TYPE command (default DOTS). The options explained in this section are used to override the default setting.

DOTS plots a dot or '*' for each observation. If more than one observation is to be plotted at the same location a '2' appears for 2, '3' for three observations, up to '9'. For more than 9 observations a '$' sign will be shown. Note that the actual symbol shown on the screen depends upon the definition of these symbols in your profile or the symbols set with the SET GRAPH command.
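The overstrike rule for DOTS plots can be illustrated with a small Python sketch. The rule itself ('*' for one observation, a digit for 2 to 9, '$' for more) is taken from the text above; the way coordinates are mapped to character cells and the function names are assumptions made for the illustration.

```python
from collections import Counter


def dot_symbol(count, single="*"):
    """Symbol shown at a plot location holding `count` observations."""
    if count <= 0:
        return " "
    if count == 1:
        return single
    if count <= 9:
        return str(count)
    return "$"


def cell(value, lo, hi, ncells):
    """Map a value to a cell index 0..ncells-1 (assumed linear mapping)."""
    if hi == lo:
        return 0
    return min(ncells - 1, int((value - lo) / (hi - lo) * ncells))


def render_dots(xs, ys, ncols=20, nrows=8):
    """Return the plot area as a list of text lines, top line first."""
    xlo, xhi, ylo, yhi = min(xs), max(xs), min(ys), max(ys)
    counts = Counter((cell(y, ylo, yhi, nrows), cell(x, xlo, xhi, ncols))
                     for x, y in zip(xs, ys))
    return ["".join(dot_symbol(counts[(r, c)]) for c in range(ncols))
            for r in range(nrows - 1, -1, -1)]


for line in render_dots([0, 0, 0, 10], [0, 0, 0, 10], ncols=5, nrows=3):
    print("|" + line + "|")
```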

CASID uses the case identification instead. If more than one case identifier occupies the same location on the plot a $ symbol appears instead. The case identifier is an up to four letter string. The number of letters shown depends upon the setting of the SYMBWID option. See also the CASID command for details on what things you might do with case identifiers in order to enhance plots to your liking.

NUMBERED uses the sequential case number as plotting symbols.

GVAR shows GVAR membership numbers. This option requires a GVAR to be stored, unless the GVAR=var# form is used; then the variable var# is taken instead. Note that only the integer part of such a variable is shown (the fractional part is truncated).

If the WA contains centroids (See CLUSTER) the case identifier is used for those centroids, instead of the group number, i.e. you will be able to see and distinguish cases and their group membership as well as the centroids.

THIRD=var# A third variable using alphanumeric symbols to represent depth is used to define plot symbols. Default is to divide the range of that variable into seven intervals; then for each interval (min to max) the following symbols will appear: '.:;o%#@', where '.' represents the first, ':' the second (...) interval. Using "altsymbols" you may change default operation; the number of alt-symbols determines the number of intervals. Intervals are specified from smallest to largest.

Other coding options can be specified. See <code> for a list of coding options and Chapter 4 (Glossary) for an explanation. Note that the default here is to define intervals (bins) of equal width.
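The default THIRD= coding (equal-width intervals over the range of the third variable, one symbol per interval, smallest interval first) can be sketched in Python. The function name is hypothetical and the treatment of values lying exactly on an interval boundary is an assumption.

```python
def third_symbols(values, symbols=".:;o%#@"):
    """Code each value by the equal-width interval it falls into.

    The number of symbols determines the number of intervals, so
    passing a different string plays the role of "altsymbols".
    """
    lo, hi = min(values), max(values)
    nbins = len(symbols)
    if hi == lo:
        return [symbols[0]] * len(values)
    width = (hi - lo) / nbins
    # Assumption: a boundary value goes to the upper interval,
    # except the maximum, which stays in the last interval.
    return [symbols[min(nbins - 1, int((v - lo) / width))] for v in values]


print("".join(third_symbols([0, 1, 2, 3, 4, 5, 6, 7])))
print("".join(third_symbols([0, 4, 10], symbols="ab")))
```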

<code> | [BINS] ["alt.symbols"]
       | FRACTIONAL
       | EXACT ["alt.symbols"]
       | READ ["alt.symbols"]
       | DISTRIBUTIONAL [SIMPLE] ["alt.symbols"]
       | REFERENCE=value [FUZZ=val] ["alt.Symbols"]
       | MARK|=val | IF>val |IF=val| IF<val | IF~val ["alt.symbols"] [FUZZ=val]
       | ASIS ["alt.symbols"]
       | DICHOTOMY ["alt.symbols"]

The NOSYMBOL option is used to force EDA to use the standard coding symbols used elsewhere in the program (default 4 bins with approx. equal number of observations).

As non-DOT plots potentially produce quite unreadable plots, you may control the number of symbols used for each plot location. A case identifier may be up to four letters long; you might want to use fewer on the plot. The default number of symbols depends upon the setting of the SET PLOT TYPE command (default usually 2, i.e. two letters appear on the plot for each case).

The FULL and SYMWID options are useful to override the current default setting. FULL asks for all four letters; SYMWID lets you specify the exact number of symbols. Note that PLOT BIG uses four letters as default.

POSITION=charnum: Normally each plot symbol type specification replaces the old symbols by new ones. With the POSITION= option you can add a symbol and place it into character position <charnum>; all other positions remain unchanged. This could be used e.g. to add symbols to "normal" casids or to encode up to four different variables, each in a different character position. If you want even more possibilities see the CASID command, where many more options are offered to modify the case identifiers. See above the explanation on the number of symbols used with symbol plots: if you add symbols you should make sure that they actually appear on the screen.

See also the SET command below. It offers a possibility to show a selected portion of the symbol built up using the facilities mentioned in this section.

Size-options

The plot size may be controlled with the X and Y options (number of columns and lines used on the display).

The default size depends upon the setting of the SET PLOT SIZE command (type ?STAT PLOT SIZE to see the actual setting). Initial defaults are set in a way to show the full plot on the screen with x and y-axis approximately the same size. You may use the X= and Y= options to override the current defaults. The units are measured in characters.

In the case of the BIG option a printer-page size plot is produced.

Sequence plots (i.e. plots with only one variable on the list): If only one variable is specified on a plot command, EDA plots the specified variable on the y-axis against the sequence (in many situations this would be a time series plot). For this reason EDA will use in this case the dimensions specified for time-series plots as default (see SET/STAT PLOT TSIZE).

FULL and SYMWID= are explained in the symbol type section. (These options are mentioned here to explain the SIZE command in the PI module.)

Limit-options

The PERCENT, GLOBAL, LIMIT, NOOUT and NOFAROUT options deal with the plotting limits (minimal and maximal value plotted). This is important for two reasons. First, the plotting scales depend on the min/max values, and it is often desirable to use a different scale, e.g. PERCENT (0..100) for percent data. Second, it is often useful to exclude some extreme values: they can be excluded by modifying the minimum and maximum (these values are simply treated as if they were at the (new) minimum and/or maximum).

Several options deal with this problem:

The NOOUT and NOFAROUT options eliminate all outliers (NOOUT) or only far out values (NOFAROUT) from the scatter plot. See the glossary for more explanation regarding the definition of out and far-out values. Remember that the user may change the definition of an outlier using a SET command option.

LIMIT: asks for new min and max for the x and y variables. A message will appear showing you the current settings and asking to replace them. Note that four values are required for xmin,xmax and ymin/ymax. If you wish to retain a current value you need not enter that value, but you need to show that you wish to retain it by typing in an additional comma. e.g. 10,,10,40 means that the second "number" is not changed, i.e. xmax will be retained as previously. Note that on such value lists commas are only required if you want to skip a number, otherwise a blank character may also be used to separate items.
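The rules for entering the four limit values can be made concrete with a short Python sketch. It only models the list syntax described above (an empty position between two commas keeps the current value; blanks may separate values that are not skipped) and the treatment of values beyond the limits; the function names are hypothetical and the actual EDA prompt is not reproduced.

```python
def parse_limits(text, current):
    """Merge a reply such as "10,,10,40" with the current (xmin, xmax, ymin, ymax)."""
    entries = []
    for part in text.split(","):
        tokens = part.split()
        if tokens:
            entries.extend(float(t) for t in tokens)
        else:
            entries.append(None)  # empty position: retain the current value
    if len(entries) != 4:
        raise ValueError("four values are required: xmin, xmax, ymin, ymax")
    return tuple(old if new is None else new
                 for old, new in zip(current, entries))


def clamp(value, lo, hi):
    """Values beyond the limits are treated as if they were at the limit."""
    return max(lo, min(hi, value))


print(parse_limits("10,,10,40", (0, 100, 0, 50)))
print(parse_limits("1 2 3 4", (0, 100, 0, 50)))
```

In the first call the second position is empty, so xmax keeps its current value of 100 while the other three limits are replaced.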

GLOBAL This option affects the min/max settings only when a case selection is active. Default operation is to use the current (i.e. the min/max of the selected cases) for scaling. If you wish to use the global min/max (i.e. for all cases including the unselected ones) specify this option. This is useful when comparing subsets of data in order to retain the same scale for all plots.

GLOBAL sets also an internal switch, i.e. on subsequent plots global mode will be used, until you specify a NOGLOBAL. However this is only meaningful for the untransformed variables.

Compare with the WINDOW command, available within PI.

Axis options

The default axis is a fully drawn axis using the standard symbols and showing min/max at the ends. If zero lies within the plotted range, the axes appear within the plot space, otherwise in the margins.

Additional options provide alternatives. FRAME requests that axes always appear outside the plot area (margins).

NOAXIS requests the suppression of the axes; NOX and NOY may be used to remove the X or Y axis, respectively.

BOX, BAX, BAY: Request a special axis form showing one row or column of boxplot information instead of the normal axes. This axis shows the median, the position of the hinges and the adjacent values, as well as ticks marking outliers. BOX requests this form for X and Y. BAX/BAY request it for one of the axes.

BIG plots

BIG produces a printer page size plot. The only difference with respect to an ordinary plot is that nothing will be displayed on the screen, i.e. the plot goes directly to the print file (the print file must be open of course); for this reason PI (plot inspect) may not be used with this option. (Note that this option applies only to print file plots; if your screen has a large size option, and EDA knows about it, you need not use this option, just adjust the size with X=/Y=.)

             *******************
             * PI Plot inspect *
             *******************

PI is a special module used to analyze in depth a bivariate plot using specialized commands. Note that only format-1 type plots may be inspected further using the PI command. PI is not available with the BIG option.

In order to enter PI mode just type PI after you have issued a PLOT command, i.e. PI is used, as the command name shows, to inspect a plot you have produced; PI is used as the very next command after the PLOT command. PI has a PRINT option used to set the initial print mode. (See below for information on PI and printing.)

If the PI command is not the very next command after a PLOT command PI works just like the PLOT command (with all options), except that after the PLOT has been produced EDA enters PI mode.

PI is a special mode where many commands may be used to further analyze the variables you are looking at on the plot. PI has a slightly different syntax, as you need not specify variables (because we are looking at the relationship of two variables).

PI will prompt with:

  pi:
inviting you to enter PI specific commands. A simplified EDA syntax is used within PI. All PI commands may be abbreviated to two letters. Options are specified as usual. Variable lists are not needed.

There are two types of commands: (1) specific PI commands and (2) options available with PLOT. The second category is specified the same way as on the PLOT command, except that you will have to add a command name. E.g. in order to request a CASID plot within PI you need to say TYPE CASID.

The following PI commands are available; a description follows below where needed:

***********************************************************
*                    PI commands                          *
***********************************************************
*  PLOT/PI options (see above)                            *
***********************************************************
*  AXIS <axis-spec>      change the axes                  *
*  SIZE <size-spec>      change the size                  *
*  SLIM <limit-spec>     redefine min/max                 *
*  TYPE <type>           change the plot marker type      *
***********************************************************
*                                                         *
* PI mode commands (specific)                             *
***********************************************************
*  <return>, Quit        leave PI, return to EDA mode     *
*  ? or HELP             syntactical information          *
*  ADdscale              add a reference scale            *
*  ALINE                 add a line to the plot           *
*  DCase [str]           display case ids                 *
*  DRAW                  Same as P                        *
*  IDENTIFY              identify cases                   *
*  INFO                  display information on the plot  *
*  LINE                  LINE command                     *
*  LOWESS                LOWESS command                   *
*  MARK                  mark a case on the plot          *
*  P or PLOT             Plot request                     *
*  POINT                 add a reference point            *
*  PRINT                 print the current plot           *
*  Q/QUIT                leave PI                         *
*  REVERSE               reverse x/y on the current plot  *
*  SAVE                  save variables                   *
*  SET                   set working conditions           *
*  STATUS                display information on the plot  *
*  TRACE                 add trace lines                  *
*  WINDOW                window on subset                 *
*  W=(x1,x2,y1,y2)       same as WI W=                    *
*  XTrans                reexpress X                      *
*  YTrans                reexpress Y                      *
*  X                     change the X variable            *
*  Y                     change the Y variable            *
***********************************************************

Some of the more complex PI commands are further detailed here; others are hopefully self-explanatory.

ADDSCALE

Adds a special form of a scale allowing for identification of the row and column numbers needed for other commands. This information will be put into the axis area just outside the plot area. It appears like this 123456789*123.... This form is helpful for locating points in rows and columns, which are needed with other commands.

ALINE

ALINE [A=val] [B=val] ["alt.symb"]
Adds a line to the plot with options A (intercept) and B (slope). If LINE has been used previously in PI and no transformation has been applied since that event PI will use A/B as defined by the line options, otherwise they default to 0/1 (reference line). As it is not possible to draw a line on a character oriented plot, you will find two ticks (default "x") at the edges of the plot, i.e. you will have to draw the line by linking the two marks.

AXIS

AXES  [FRAME] [BOX | NOAXES | [BAX | NOX] [BAY | NOY]]
Controls the axes shown with the plot. See the Axis options described with the PLOT command above for an explanation of the various options. The AXES command specified without an option restores the default axis style.

DCase

DCASE [str]
DCase displays the names of the cases shown in the current plot. Cases not selected (e.g. by a limit or window command) are never shown. If DC is entered without a string, all cases (casids) are shown. If DC is followed by <str>, only cases matching <str> are displayed. <str> may contain a wildcard specification: DC B* shows all active cases starting with the letter B. Upper and lower case letters are not distinguished. If you need to specify case ids starting with a numerical character (e.g. the default case ids) you will have to start the string with a single quote, e.g. DC '10* shows all cases starting with 10.
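The matching rule can be sketched in Python with the standard fnmatch module. This is only an approximation of the behaviour described above: the function name is hypothetical, and fnmatch also gives a special meaning to ? and [ ], which the text does not mention for DCase.

```python
from fnmatch import fnmatchcase


def matching_casids(casids, pattern=None):
    """Return the case ids selected by a DCase-style pattern."""
    if pattern is None:
        return list(casids)  # no string: all cases are shown
    if pattern.startswith("'"):
        pattern = pattern[1:]  # leading quote marks a numeric start
    wanted = pattern.upper()  # upper and lower case are not distinguished
    return [c for c in casids if fnmatchcase(c.upper(), wanted)]


casids = ["Bern", "basel", "Genf", "10A", "101", "2"]
print(matching_casids(casids, "B*"))
print(matching_casids(casids, "'10*"))
```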

DRAW

This is a synonym for P (PLOT).

IDENTIFY

 IDENTIFY  | [Col=col#] [Row=row#]
           | QUAD= 1|2|3|4
           | Neighbours=cas# [PROX=(xpos,ypos)]

The IDENTIFY command displays a list of cases (case id, group number (if defined), coordinates (x,y), as well as row and column plot positions) for the observations selected by the options.

Several forms are available:

COL=col# Row=row#: identification by row and/or column. [Use ADDSCALE to show row/column numbers if you have problems with getting the number(s) right.] You may specify only COL= or Row= to identify all cases in a specific row or column, or specify both of them to identify the case(s) in a specific row and column position in the plot (e.g. to show what's hidden behind a $ sign, saying that there are many cases at that location).

QUAD= 1|2|3|4 identifies cases in a specific quadrant of the plot. Usually the quadrants are defined with the TRACE command, but if no trace is defined, the quadrants are simply defined as implicit "traces" in the middle of the x and y axis.

NEIGHBOURS: shows the neighbours of a case given with the command. By default a neighbour is defined as a case in the same location or in an adjacent location (i.e. one position up or down and/or one position to the left or the right), i.e. a window of a single character position around the center position. The definition of a neighbour may be changed using PROX=(rowprox,colprox) [default 1,1]. E.g. PROX=(3,5) means that a case is considered a neighbour if it lies within 3 positions to the left or the right of the case considered and within 5 positions below or above.
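The neighbourhood window can be sketched in Python. The reading of PROX used here follows the worked example above (first value: positions to the left and right; second value: positions above and below); the function name and the dictionary of plot positions are hypothetical.

```python
def neighbours(positions, case, prox=(1, 1)):
    """Return the cases lying inside the window around `case`.

    positions maps a case id to its (column, row) plot position;
    prox gives the half-width and half-height of the window.
    """
    col0, row0 = positions[case]
    dx, dy = prox
    return [name for name, (col, row) in positions.items()
            if name != case
            and abs(col - col0) <= dx
            and abs(row - row0) <= dy]


positions = {"A": (10, 5), "B": (11, 5), "C": (10, 6), "D": (13, 5), "E": (10, 10)}
print(neighbours(positions, "A"))
print(neighbours(positions, "A", prox=(3, 5)))
```

With the default window only B and C are neighbours of A; widening the window to PROX=(3,5) also picks up D (three columns away) and E (five rows away).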

INFO

Display information on the current plot.

LINE

LINE is the same as the LINE command, i.e. it is used to define a robust line for the two active variables. All options apply as for the LINE command, except the YHAT and RESID options (PI holds these variables in its YHAT and R variables, temporary variables available within PI) and, of course, the variable list (PI takes the current x and current y variable).

Within PI, LINE automatically computes residuals and yhat and makes them available for further analysis (see PLOT below). However, these copies are only available to PI, i.e. if you intend to keep residuals for further analysis you should still use the RES/YHAT options in order to copy these variables into the WA.

LOWESS

(See the LOWESS command for details.) LOWESS picks up the current x and y variables and stores the smoothed values in YHAT and the residuals in R, i.e. you may use all PI commands to inspect the values.

LOWESS called from PI does not produce a variable with the smoothed values and the RESID option is not available (use SAVE to save either YHAT or RESID).

MARK

MARK=case# ["alt.symbol"]
Marks the case with the default mark symbol (default @, may be installation specific). You may specify an alternative symbol on the command line. Additionally this command displays the coordinates of the case marked.

P or PLOT or DRAW

P is used to make a specific plot request. Initially the X-axis shows the first variable specified on the EDA command line and the Y-axis the second variable. P lets you specify different variable combinations to be shown as X or Y variable, especially variables created during a PI session, like residuals, transformed variables and the like.

The general format for P is

  PLOT
  PLOT  x-var y-var
Without an option, P shows the current plot. Options specify different variables to appear on the X or Y axis. The variables specified become the new current X/Y variables. <x-var> and <y-var> are the variables you wish to see on the x-axis and on the y-axis, respectively. You may use the following specifications:
   X        the current X variable
   I        synonym for X (independent variable)
   Y        the current Y variable
   D        synonym for Y variable (dependent variable)

   X<trans> transform and plot (see below)
   Y<trans> transform and plot (see below)
   S        index variable: sequence of cases
   XT       X transformed: the current transformation of X
   IT       synonym for XT
   YT       Y transformed: the current transformation of Y
   DT       synonym for YT

If a LINE or LOWESS command has been performed:

   R        residuals
   F        fitted values
   YHAT, E  synonyms for F

With any variable specification P sets the new current variables, i.e. subsequent operations will work on the currently specified variables and in the order they appear.

Initially X and Y are the only x-var/y-var available. In order to create the other variables you need either to perform a transformation or to use LINE/LOWESS to produce residuals and fitted values.

A transformation is requested by adding its name (or UP/DOWN) to an X or Y (I or D) specification, or by using the XT/YT commands described below. This means that the variable is transformed first and then plotted. For instance:

    P XLOG YSQUARE         XT LOG
                           YT SQUARE
                           P XT YT
do exactly the same, i.e. they create variable XT by taking the log of variable X, create YT (the square of Y), set XT and YT as the current x-var/y-var, and plot XT and YT. This is important to understand, as the following example shows (suppose X/Y have not been transformed previously):
  P XUP XDOWN
will simply yield a plot of X by X (untransformed), because it means
  XT UP
  XT DOWN
  P
which is usually not very interesting.

For an explanation of the transformations, refer to the XT/YT commands below.

POINT

POINT [X=at_val] [Y=at_val] ["altsym"]
Marks an arbitrary spot on the plot with coordinates (X,Y). The default symbol used is a "+" mark; you may specify a different symbol on the command line. If you do not specify coordinates (X,Y), the median of the current X and Y selection will be used instead.

Compare to the MARK command used to mark a specific case on the current plot.

PRINT

Writes the current plot to the print file. This command requires an active print file. It is useful if you choose not to write all PI output to the print file (the default) and you wish to keep a particular plot.

QUIT

Leave PI and return to normal EDA mode.

REVERSE

Reverses X and Y on the current plot, i.e. the current X variable becomes the Y variable and the current Y variable becomes the X variable.

SAVE

SAVE spec
Saves (copies) the variable <spec>, which is internal to PI, as a variable in the WA (next free location). All specifications used on the PLOT (sub-)command can be used, including transformation specifications.

IMPORTANT: SAVE copies the currently selected cases (several PI commands make selections, e.g. WINDOW), i.e. the number of cases of the variable stored into the WA includes only cases shown on the current plot. Therefore if, e.g. you want to copy residuals (the R variable) and you used a WINDOW HINGES command do a WINDOW ALL command first, before saving.

SET

  SET SILENT                 plot only if asked for
  SET PRINT                  print toggle
  SET SYMBOL[=(start,end)]   set symbol section to show
SET SILENT: Many PI commands automatically redraw the plot to reflect the changes you made. If the SILENT switch is turned on, you have to request a plot explicitly, e.g. with the DRAW command. SET SILENT is a toggle switch, i.e. it turns silent mode on and off alternately.

SET PRINT controls printing. See below for details on how the print file works with PI. SET PRINT toggles printing on and off as explained in the section below.

SET SYMBOL is used to show only a specified section of the current marker. By default EDA shows, depending upon the marker type and other options set, the full marker or a truncated marker. For instance, if the current marker type is CASID, EDA typically displays the first two letters of each case identifier. It is possible to request the FULL case id (depending upon the defaults set by SET PLOT or your modifications using one of the PI commands). Sometimes, especially with well structured case identifiers or GVARs, you might want to show an arbitrary portion of the identifier, e.g. the third letter because it has some special meaning in your application. This option is made for such a situation. SET SYMBOL=(start,end) sets the portion you want to show; start indicates the starting position and end the end position, e.g. SET SYMBOL=(3,3) asks for the third character position. Note that markers may be up to eight letters long (see TYPE for more information). SET SYMBOL without specification returns you to the default, i.e. deactivates this option. Note that the INFO command shows the current settings.

SIZE

SIZE [FULL | SYMWID] [X=units] [Y=units]
This command is used to set various size options. They are identical to the different size options offered with the PLOT command. See there for an explanation of the options (heading size options).

SLimit

 SLimit  [{NO}GLOBAL]   | PERCENT
                        | LIMIT=(xmin,xmax,ymin,ymax)
                        | NOOUTLIERS or IFENCES
                        | NOFAR      or OFENCES
                        | HINGES

The SL command offers the same <limit-spec> options as the PLOT command. See there for an explanation.

TRACE

 TRACE | [POSITION] [X=xval] [Y=yval]
       | THROUGH=case#

Adds traces to an existing plot. A line of periods is drawn at the median position of both variables (default); the IDENTIFY Q= command may then be used to identify the cases in each of the quadrants defined this way. WINDOW can also show one of the quadrants defined this way.

X/Y may be used to put those traces at an arbitrary position (indicate the coordinates); use POSIT if you wish to specify column and row numbers on X/Y.

Instead, you may designate a case where the trace lines should cross; this is done with the TRACE THROUGH=case# command.

TYPE

TYPE is used to modify the symbol type used on the plot. The same options apply as on the PLOT command line (see above).

WINDOW

WINDOW  | HINGES
        | IFENCES
        | OFENCES
        | Window=(row1,col1,row2,col2)
        | CORNERS=(case#,case#)
        | AROUND=case# [PROX=(rows,cols)]
        | LAST
        | [ALL*]
The WI command is used to limit the display of points to a subset according to some criterion. This command is in many respects identical or similar to the SL command (e.g. SL NOOUT is exactly the same as WI IFENCES).

Without an option, WI resets the initial limits (same as WINDOW ALL), i.e. all cases are shown again in the full window without any restriction.

The HINGES, IFENCES (synonym NOOUT) and OFENCES (synonym NOFAR) options are used to limit the display according to the exploratory distribution criteria (hinges, inner and outer fences). These criteria do not define a window on the current window, but are based on the original selection (as opposed to the CORNER/AROUND options).
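
As an illustration of these criteria, the following Python sketch selects the values of a single variable that lie within the hinges, the inner fences or the outer fences. It assumes the conventional definitions (hinges as medians of the lower and upper halves, fences at 1.5 and 3 times the distance between the hinges); in EDA the actual outlier definition depends on the SET DEFOUT settings, so the constants and function names here are hypothetical.

```python
from statistics import median

def hinges(values):
    xs = sorted(values)
    half = (len(xs) + 1) // 2            # each half includes the median when n is odd
    return median(xs[:half]), median(xs[-half:])

def window_limits(values, criterion):
    low_hinge, high_hinge = hinges(values)
    spread = high_hinge - low_hinge
    step = {"HINGES": 0.0, "IFENCES": 1.5, "OFENCES": 3.0}[criterion]  # assumed multipliers
    return low_hinge - step * spread, high_hinge + step * spread

def select(values, criterion):
    low, high = window_limits(values, criterion)
    return [v for v in values if low <= v <= high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 100]
print(window_limits(data, "IFENCES"))    # (-4.0, 16.0)
print(select(data, "HINGES"))            # [4, 5, 6, 7, 8]
print(select(data, "IFENCES"))           # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(select(data, "OFENCES"))           # [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
```

The limits are always computed from the full set of values passed in, which mirrors the remark that these criteria are based on the original selection rather than on the current window.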

AROUND=case# defines a window around a specified case. You may specify the size of the window with the PROX option.

CORNER=(case#,case#) defines a window where the cases specified are the edges.

CORNER=case# with only one case specified uses the specified case as one of the corners of the window; the second corner will then be the nearest corner of the plot, e.g. if the case is within the lower left half of the plot, the corner will be the lower left corner of the plot.

LAST is used to go back to the previous window, i.e. when the CORNER and AROUND options are used to "dive" into a relationship, you may sometimes want to step back through the sequence.

XT/YT XP/YP

These commands are used to reexpress the X, resp. Y variable according to the additional option present on the line. The only difference between XT/XP and YT/YP is that with T only the reexpression is done, whereas with P a plot is shown immediately. In the first case you will have to issue a PLOT or DRAW command in order to see the transformed variables. Note that these commands change the current variables.

The possible reexpressions follow the idea of Tukey's ladder of powers (see BOXPLOT reexpress for details and reference).

The following options may be present:

   RS   reciprocal square        ^
   RE   reciprocal               |
   RR   reciprocal root          | down
   LO   log base 10              |
   SR   square root              |
   RAW  untransformed            -
   SQ   square                   | up
   CU   cubic transformation     v

   UP   one step up on the ladder
   DOWN one step down on the ladder
   SE   case sequence

You may either specify explicitly what transformation you need or ask to go one step up or down on the ladder. If you say initially UP then the square of the initial X or Y variable will be taken. If you then say again UP the third power of the initial variable will be taken. If then you say again UP a message will be displayed that you cannot go further up (use normal EDA transformations outside PI, if you need to do that).

A transformation implicitly creates the XT/YT variables and sets the current x-var/y-var. If in the course of the UPs and DOwns you hit the untransformed "raw" step the current variable will be switched back to the original X or Y and the XT/YT disappears.
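
The stepping behaviour can be sketched in Python as follows. The order of the rungs is taken from the table above; the plain (unsigned) reciprocals, the function names and the data are assumptions made for illustration, and the sketch does not reproduce EDA's messages.

```python
import math

# Ladder in the order of the table above; RAW is the starting rung.
LADDER = [
    ("RS", lambda x: 1.0 / x ** 2),
    ("RE", lambda x: 1.0 / x),
    ("RR", lambda x: 1.0 / math.sqrt(x)),
    ("LO", math.log10),
    ("SR", math.sqrt),
    ("RAW", lambda x: x),
    ("SQ", lambda x: x ** 2),
    ("CU", lambda x: x ** 3),
]
RAW = [name for name, _ in LADDER].index("RAW")

def step(index, direction):
    """Return the new rung, or the old one when the end of the ladder is reached."""
    new = index + (1 if direction == "UP" else -1)
    return new if 0 <= new < len(LADDER) else index

pos = RAW
for command in ["UP", "UP", "UP", "DOWN"]:
    pos = step(pos, command)
    print(command, "->", LADDER[pos][0])     # SQ, CU, CU (cannot go further), SQ

name, func = LADDER[pos]
print(name, [func(x) for x in [1.0, 2.0, 3.0]])   # each rung is applied to the initial values
```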

SE is a special "transformation" in the sense that it is a variable stored to XT and YT (i.e. the transformation variables). It just contains the case sequence, which may be used in sequence plots. E.g. if you want to plot X against its sequence you will issue a PLOT X XSEQ. Here again beware of the pitfalls: PLOT XLOG XSEQ will go wrong, as first XT will be the log of X, then it will be replaced by the sequence, and finally the sequence is plotted against itself. Use PLOT XLOG YSEQ instead.

X/Y

  X | Y    |  VAR=var#
           |  C=c1dim
           |  K=c2dim
           |  RESID
           |  FIT / YHAT
This command allows you to change the variables you are plotting from within PI.

VAR=var# replaces the current X or Y by variable var# from the current WA.

C=C1dim/K=C2dim uses the specified dimension from C1 or C2 as new X or Y variable.

FIT/RESIDUALS replaces X or Y by the residuals or the fit you have produced using LINE or LOWESS.

PI and the Print file

As PI is intended for close inspection of a specific relationship, it is usually not desirable to write all plots and other information into the print file, as would be the case when PRINT ALL is active. Therefore the print file behaves in a different way. Note that all the possibilities below require an active print file. If no print file is open, all options produce an error message.

Initially, when entering PI, nothing will be written to the print file, unless you enter PI with the PRINT option, in which case all output will be written to the print file.

The SET PRINT command is used to toggle printing, i.e. if printing is not active (shown on the status line) SET PRINT sets printing, if printing is active, SET PRINT sets no printing. (Note that this will only affect PI behaviour).

Furthermore, you may add a PRINT option to any command to request that the results of that command be printed, even if printing is not active (SET PRINT). Note that there is also a PRint command, which you may use to write the current plot to the print file (without displaying it on the screen).

Other EDA features available within PI

PI uses a simplified syntax and does not offer all the facilities available in normal command mode. You may however use the following EDA features:

Command line editing and multiple commands on a single line are available.

(*) The > command is implemented.

Line macros may be used. You will however need to define them outside PI using the DEFMAC command.

(*) You may define scalar variables using only simple expressions (you will find a more detailed explanation of this with the EDIT and TED commands, where the same facility is available).

While in PI the status line shows the variables plotted (names and possible transformations), as well as the number of cases plotted and (if active) the current selection, e.g. windows etc. The number of cases is shown as np/nt, where np is the number of cases plotted out of nt (total of cases). Note that the total is the total of the first plot and np the number of cases included by some PI selection mechanism (windows etc). The total number of cases (nt) is not necessarily the total number of cases in that variable, as a case selection may be active when you enter PI.

PLOT (format-2, SCAT)


  PLOT v1,v2,v3,[v4,v5,v6] [XUNITS=val] [YUNITS=val]
                           [BIG] [LIMIT] ["alt.symbols"|LETT]
                           [DENSITY {"alt.symbols"}]
  PLOT v1 SCAT             same options
  PLOT v1,v2 SCAT          same options
If more than 2 variables (max 6) are present on the vlist, then the first variable is plotted on X against the others on Y using symbols * 0 + x = for the first, second .. fifth Y variable. Positions with more than one case are plotted as 2 ... 9, 9 meaning "9 or more points".
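
The choice of the character printed at one plot position can be sketched in a few lines of Python. The symbol string follows the list above; the function name and the representation of a plot position as a list of Y-variable indices are hypothetical.

```python
Y_SYMBOLS = "*0+x="    # symbols for the first to fifth Y variable

def cell_symbol(y_indices):
    """y_indices: 0-based indices of the Y variables whose points fall on one position."""
    if not y_indices:
        return " "                          # empty position
    if len(y_indices) == 1:
        return Y_SYMBOLS[y_indices[0]]      # a single point: symbol of its Y variable
    return str(min(len(y_indices), 9))      # several points: count, 9 = "9 or more"

print(cell_symbol([0]))         # *
print(cell_symbol([2]))         # +
print(cell_symbol([0, 3]))      # 2
print(cell_symbol([1] * 12))    # 9
```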

The X and Y options have the same meaning as above. The BIG option produces a printer-paper size scatter plot to the print file (must be open).

The DENSITY option does not distinguish between Y-variables; it plots only density symbols that show how many cases fall on each plot position.

The "alt.symbols" option allows you to enter alternative symbols for each variable, whereas LETTER uses the letters a, b, c, d and e to represent the variables. With DENSITY, the alt.symbols option specifies alternative symbols for all symbols shown; the defaults are , and ' for single plot positions, : for two cases at the same place, and then 3, 4, 5, 6, 7, 8, 9 and $ for several cases at the same location.

The LIMIT option is the same as above, allowing for plotting within other limits. Points not within the range are dropped from the plot and reported.

SCAT option

The SCAT option forces this routine to be used for two variables, instead of the normal plot procedure. (Differences: SCAT produces nice scales and has different options; PI, however, is not available.)

As with the standard PLOT command PLOT SCAT with only one variable produces a sequence plot (x=sequence, e.g. time). As with the general form the default size of this plot will be taken from the SET PLOT TSIZE option (TSIZE=Time series size).

Result variables

The plot command does not define any ResVar, except that the LINE command defines the ResVars described with the LINE command called as a normal EDA command. There is also an output procedure PLT() which may be used to fit simple plots to your specific needs (within macros etc). These plots are particularly useful whenever you wish to test many different reexpressions or the like.

Credit

The procedure producing plots with more than two variables (and the SCAT option) is based on algorithms published in Applied Statistics: AS 168 (Scale Selection and Formatting), by W. Douglas Stirling, and AS 169 by the same author.

PROFILE


 PROFILE   vlist  [GROUP=group# {MEAN}] |
                  [GROUP {MEAN} {NMIN=min_members}]
                  ["alt.symbols"] [CASE=cas#]
Draws a "profile" for specified cases using the variables of the vlist. The cases are requested from the user after the command line has been entered. Up to 10 case identifications can be specified on the solicited lists.

Case=cas# may be used to specify a single case for a profile (avoids the query).

If only one case is specified, a coded value corresponding to the position in the distribution (far, out etc.) is displayed. This option is sensitive to the SET DEFOUT settings.

If more than one case is specified (a maximum of 10 is allowed), the values are not coded, but only scaled, and the symbols "0" to "9" are used to identify the cases for each variable.

If the G= option is present, the program uses the GVAR instead of querying for the case identifications. If a group contains more than 9 cases, only the first 9 are used. This option shows also the group centroid using the # symbol to mark it.

The GROUP option (requires a GVAR) shows the profiles of the groups using the group centroids. If more than 10 groups are stored the program queries for the groups to show.

In the case of the last two commands there are two definitions of the group centroids available. The default centroid is the median; the MEAN option uses the mean instead. The NMIN option is used to drop groups with nmin or fewer cases from the analysis; it defaults to 1.

"alt.symbols" replaces the normal symbols used to show group or case position by the symbols specified. Normally the symbols are "1", "2" .. "9", "0" for the 1st, 2nd .. 10th group or case. The first character in "alt.symbols" is used to show the 1st group or case, the second character the 2nd, and so on. Up to 10 symbols may be specified. If fewer symbols are specified than groups are requested, the default symbols are used. This option does not affect single case profiles, where the in/out/adj/far symbols are used.

NOTE: this command turns case selections off.

QSUMMARY


QSUMMARY [vlist] [OUTLIERS_ONLY]

Provides a one-line numerical summary for each variable in the vlist. The information provided is the name of the variable, the minimum, maximum and median of the variable (and the descriptor with the print file), as well as the number of low and high outliers. The definition of outliers is as usual, i.e. the command is sensitive to the setting of the SET DEFOUTLIER option.

The OUTLIERS_ONLY option inhibits the display of variables with no outliers, i.e. a QSUMM #0 OUTLIER command will display the summary information for all variables with outliers.

REEXPRESS


  REEXPRESS vlist
REEXPRESS is a tool for finding the appropriate reexpression of one or several variables. This tool is a module, with a number of specialised commands assisting you in that task. Initially you enter the module by typing REEXPRESS followed by a variable or a list of variables. Then you will be able to reexpress one variable after the other until you quit the module. Basically this tool is a combination of other tools available as separate commands, brought together to simplify reexpressions. With each variable you may try any number of reexpressions before applying one to the actual variable, or you may decide not to apply any.

When entering REEXPRESS you will see a summary (numerical, boxplot and density line) of the current variable. The effect of the re-expression you select will be shown together with the previous state of the variable.

There are two classes of commands: commands which cause REEXPRESS to go to the next variable and others which do something to the current variable and redisplay it (e.g. standardizing options).

Furthermore, it is important to remember that values have to be positive in order to take logs and perform similar operations. Therefore the variable is always implicitly transformed, if necessary, to contain only positive (non-zero) values, i.e. some rescaling might be done automatically because a reexpression you are requesting might not be possible otherwise. (A message will tell you that automatic rescaling took place.)

Below you will find a list of commands available within the REEXPRESS module. Commands are one or two letters only; some commands have options (specified in the normal EDA way).

Vocabulary

Make sure to understand the following two terms before reading on: The current variable is the variable you are currently re-expressing using the different commands below. The current re-expression is the current re-expressed state of the current variable, e.g. the log of a variable. The current state does not affect the (current) variable in the work area, until you apply it to it. Changing the current variable means going through the variables in the current variable list.

APPLY, NEXT, QUIT and MARK

  APply [NEW] [CONTINUE]    Apply the reexpression
  Next                      Go to the next variable, do not apply
  Quit                      Quit REEXPRESS
  MArk                      Mark, then go to the next variable

APPLY Applies the current reexpression to the current variable, then goes to the next variable in the list. If all variables have been reexpressed, APPLY brings you back to normal EDA mode. The NEW option writes the reexpression to a new variable in the WA, leaving the original variable untouched. The variable is written into the next free location in the work area. CONTINUE applies the current reexpression, but lets you continue to reexpress the current variable (this might be useful if you combine it with the NEW option to create several re-expressed variables from a single source).

Note for former users: SAVE is a synonym for APPLY (it is no longer used in the documentation because it introduced some confusion with other uses of "saving").

NEXT Go to the next variable in the list; if the current variable was the last, you will be back in normal EDA mode. The current reexpression is not applied to the current variable.

QUIT Abandon REEXPRESS; the current re-expression is not applied to the current variable and no further variable is processed. Note however that all previous variables have been re-expressed (an APPLY command changes the variable in the work area).

MARK Marks the current variable with a tag, and goes to the next variable, without applying the re-expression. This is useful when you did not succeed in finding a suitable re-expression, but want to examine the variable later using other tools.

Power transformations

Power transformations (Tukey's ladder of powers) can be specified in several ways:

 UP           one step up the ladder
 DOwn         one step down the ladder

 CU           select cube
 SQ           select square
 SR           select square root
 LO           select log
 RR           select reciprocal root
 RS           select reciprocal square
 POWER=pow    transformation using power pow (see note)

Note that power transformations are not applied to previous steps of power-transformation, i.e. they work on the initial variable or the variable transformed by other re-expression commands explained in the next section. You may however use the PT command to set a power transformed variable as a new "raw" variable.

POWER=pow: if pow>0, compute x**pow; if pow=0, compute log(x); if pow<0, compute -x**pow (x is the current variable). Note that the ladder of power oriented commands (UP/DOWN or a specific transformation from that family) do not interact with this command. POWER always works from the non-power-transformed variable (i.e. other transformations affect the variable). You should use either the ladder of power framework or POWER=pow, not both.
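
The three cases of the rule can be written down as a small Python function. This is only a sketch of the rule as stated; the base of the logarithm is not given in the text and base 10 is assumed here, and the function name is hypothetical.

```python
import math

def power_reexpress(x, pow_):
    """Re-express one positive value according to the POWER=pow rule (illustration)."""
    if x <= 0:
        raise ValueError("values must be positive")
    if pow_ > 0:
        return x ** pow_
    if pow_ == 0:
        return math.log10(x)      # assumption: base 10, as for the LO rung of the ladder
    return -(x ** pow_)           # the minus sign keeps the values in their original order

values = [1.0, 4.0, 16.0]
for p in (2, 0.5, 0, -1):
    print(p, [power_reexpress(v, p) for v in values])
```

With a negative power the plain result x**pow would reverse the order of the values; the leading minus sign restores it, so that large raw values remain large after re-expression.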

See the section on power transformations in the glossary for additional information.

Other Reexpressions

 STandardize     remove median divide by midspread
 NOrmalize       remove median
 SCale           remove minimum-1 divide by range
 SCale POSITION  remove minimum-1
 NICE Lim=([low,]high) rescale to range low-high (defaults
                 to (0,100))
 REPLACE         | FAR      | | BY=val  |
                 | OUT      | | MEDIAN  |
                 | ADJACENT | | MISSING |
                 | Depth=n  | |
 +val            Add the constant val
 -val            Subtract val
 *val            Multiply by val
 /val            Divide by val
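
Some of these reexpressions are sketched below in Python. The sketch assumes that the midspread is the distance between the hinges and that "remove minimum-1" means subtracting (minimum - 1), so that the smallest value becomes 1; the function names are hypothetical and the code is not taken from EDA.

```python
from statistics import median

def midspread(values):
    xs = sorted(values)
    half = (len(xs) + 1) // 2             # halves share the median when n is odd
    return median(xs[-half:]) - median(xs[:half])

def standardize(values):                  # STandardize: remove median, divide by midspread
    med, spread = median(values), midspread(values)
    return [(v - med) / spread for v in values]

def normalize(values):                    # NOrmalize: remove median
    med = median(values)
    return [v - med for v in values]

def scale_position(values):               # SCale POSITION: remove minimum-1 (assumed reading)
    shift = min(values) - 1
    return [v - shift for v in values]

def nice(values, low=0.0, high=100.0):    # NICE Lim=(low,high): rescale to the range low-high
    vmin, vmax = min(values), max(values)
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]

data = [2.0, 4.0, 6.0, 8.0, 10.0]
print(standardize(data))       # [-1.0, -0.5, 0.0, 0.5, 1.0]
print(normalize(data))         # [-4.0, -2.0, 0.0, 2.0, 4.0]
print(scale_position(data))    # [1.0, 3.0, 5.0, 7.0, 9.0]
print(nice(data))              # [0.0, 25.0, 50.0, 75.0, 100.0]
```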

Utility commands

 ? or HElp     Help: show commands you may enter
 <return>      show the current reexpression
 SHow          SHOW ALL | [BOX] [DLINE] [NUMBER] controls the
               information displayed with every re-expression
 PRint         ON| OFF turns printing on and off
               (copy to the print file) Default
               is OFF (requires a print file)
 SUmmary       Show letter values and related info
 SYM PLOT      Symmetry diagnostic plot
 LAdder        Show ladder of power
 LADDER REEXP  Show ladder of powers on the current power transformed
               variable
 DL            Show density line
 DC <dopt>     Show density line, coded form
                 <dopt> are DLINE CODED options
 PT            Set the current power-transformed variable as new
               "raw" variable.
 ORiginal      Back to original variable
 BAck          Back to previous reexpression
 DEscript      Enter a new descriptor for the current variable
 HIstory       Show transformation history
 INFO          Shows the current vlist

Notes

The changes applied to the variable are documented by adding some letters to the descriptor of the variable. This information is included in parentheses and has the following meaning:

(abcc) where

References

See Hoaglin, D. et al. 1983, namely the chapters written by J.D. Emerson.

REGRESS


REGRESS vlist [YVAR=var#]
              [NOCONST] [CUTOFF=val] [EPSI=val]
              [RESIDUALS{=v#}] [FITTED{=v#}] [BCOEFFS{=v#}]
              [MAXITERATIONS=num] [TRACE {FULL}]
              [TRASH]

Obsolete command form: REGRESS yvar,xvars

Performs a biweight multiple regression using a modified Gram-Schmidt procedure.

Specifying variables for the regression

In most cases you will specify the variable to be explained (Y-variable, dependent variable) with the YVAR= option and the explanatory (X-variables, independent variables) on the variable list.

For situations where you are studying a particular variable it is convenient to preset that variable using the SET YVAR= command. In this case you will not need to add the YVAR= option on the REGRESS command line (unless you would like to override the default YVAR).

(Obsolete form, kept for compatibility with previous EDA versions): If there is no previous SET YVAR= and no YVAR= option on the command line, EDA takes the first variable on the list as Y-variable, i.e. REGRESS y,xvars.

Computation options: C, EPSI and MAXIT

The procedure is controlled by three options. EPSI= (epsi, default 0.0001): epsilon is the proportion by which the sum of absolute residuals must be reduced for iterations to continue.

C= (default 4). C is a tuning constant, used in the biweight computation. If C is large (>30) then the procedure yields results very similar to LSQ regression.

MAXITERATIONS may be used to change the default number of iterations set to 999. If the MAXITERATION value is reached the algorithm stops and produces the results. This option is useful especially when you want to understand the process by e.g. stopping the iterations before the procedure converges.
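
To show how the three options interact, here is a minimal Python sketch of an iteratively reweighted biweight fit for a single explanatory variable. It is not the modified Gram-Schmidt procedure used by REGRESS: the closed-form weighted line, the use of the median absolute residual as scale and all function names are assumptions made for illustration.

```python
from statistics import median

def weighted_line(x, y, w):
    """Weighted least-squares (intercept, slope) for a single predictor."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def biweight_line(x, y, c=4.0, epsi=0.0001, maxit=999):
    """Iteratively reweighted fit; returns (intercept, slope, iterations)."""
    a, b = weighted_line(x, y, [1.0] * len(x))        # start from the LSQ line
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    sar = sum(abs(r) for r in resid)                  # sum of absolute residuals
    iterations = 0
    while iterations < maxit:
        s = median([abs(r) for r in resid])           # assumed scale estimate
        if s == 0:
            break                                     # at least half the points fit exactly
        w = [(1 - (r / (c * s)) ** 2) ** 2 if abs(r) < c * s else 0.0
             for r in resid]                          # biweight: distant points get weight 0
        a, b = weighted_line(x, y, w)
        resid = [yi - a - b * xi for xi, yi in zip(x, y)]
        new_sar = sum(abs(r) for r in resid)
        iterations += 1
        reduction = (sar - new_sar) / sar if sar > 0 else 0.0
        sar = new_sar
        if reduction <= epsi:                         # SAR no longer reduced enough
            break
    return a, b, iterations

x = list(range(10))
y = [2.0 + 3.0 * xi for xi in x]
y[9] = 100.0                                          # one wild value
print(weighted_line(x, y, [1.0] * len(x)))            # LSQ line, pulled by the wild value
print(biweight_line(x, y))                            # close to intercept 2, slope 3
print(biweight_line(x, y, c=1000.0))                  # large C: close to the LSQ line
```

Calling biweight_line with maxit=0 returns the starting LSQ line, which corresponds to stopping the iterations before the procedure converges.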

NOCONST

Performs a regression in which no constant is computed. By default a constant is included.

Copy fitted values and residuals

RESIDUALS copies residuals, FITTED the estimated y's into the WA as new variables, following normal rules regarding the target variable (if not specified with the option, EDA looks for the next free location in the WA.)

Copy regression coefficients

BCOEFF copies the coefficients. Note that the constant term is the first coefficient.

Checking residuals: TRASH

This option produces the TRASH curve (see DIAGNOSTIC or LINE for an explanation).

Getting more information TRACE

If you need information on the progress of the computation at each stage you specify the TRACE option. After each step (iteration) EDA will display the SAR (sum of absolute residuals). If you add the FULL option, regression coefficients will be shown as well.

Hints

Below you will find a handful of useful hints for a more intensive use of this command.

See the SET YVAR command, if you need to perform repeated REGRESSions always trying to explain the same variable.

If you prefer (for didactical or other reasons) to change the designation of the y- or x-variable (e.g. dependent/independent, explained/explanatory, x/y, predictor/carrier etc) you will be able to do so using the SET XNAME/YNAME options.

Variable list editing is useful if you are exploring a particular set of variables and you would like to check what happens if ... you add or remove a particular variable.

  >SET YVAR=10            ! let's concentrate on this one
  >REGRESS 1-8            ! 1-8 are explanatory variables
  >REGRESS -8             ! what happens if we remove variable 8?
  > +12 13                ! .. and if we add variables 12 and 13
SET YVAR: we always use the same dependent variable, so we set YVAR once and do not need to worry about it afterwards.

REGRESS 1-8: uses variables 1-8 as explanatory variables; the variable to be explained will be variable 10, i.e. the variable SET in the preceding command.

REGRESS -8: The current list contains variables 1-8; -8 will remove variable 8, i.e. the regression will be performed on variables 1 through 7.

+12,13 (REGRESS is omitted, we could have done this already on the preceding line): The current variable list is 1 through 7; we add variables 12 and 13 to it and perform a new regression.

Reference

McNeil 1977. Credit: Command based on a program published by McNeil 1977.

SHOW


Format 1: (first variable = criterion variable)

  SHOW v     <crit> [LONG] <opt> <list options>
  SHOW vlist <crit> <list options> <opt>
  SHOW vlist <crit> CODED {Width=wid} <opt> ["symbols"]
  SHOW v     <crit> SPLIT [SORT] <opt>

  <crit>  | FAR                   Far out values
          | OUT                   Out values
          | EXTREME               Out and far-out values
          | ADJACENT              Adjacent values
          | OUTSIDE               Outside hinges
          | INSIDE                Inside hinges
          | IF>value              Larger than value
          | IF<value              Smaller than value
          | IF=value [FUZZ=val]   Equal to value
          | IF~value [FUZZ=val]   Not equal to value
          | IF_IN=(low,high)      Within low and high value

  <opt1>  [SELECTION] [SAVEINDEX{=var#}]

Format 2:

  SHOW vlist CloseTo=cas#  <opt2>
  SHOW vlist AwayFrom=cas# <opt2>

  <opt2>  | {MIDSPREAD} [UNITS=v] [<opt2a>]
          | SDV [UNITS=v] [<opt2a>]
          | DISTANCE=value [<opt2a>]
          | NCASE=NToShow

  <opt2a> DETAILS {CountsOnly} | NOMARK [FUZZ=val] [NOALIGN]

Format 3:

  SHOW vlist   TOPVARS  [SHOW=nitems] [NOALIGN] [ASCENDING]
  SHOW vlist   TOPCASES [SHOW=nitems] [NOALIGN] [ASCENDING]
  SHOW v,vlist TOPRANKS [SHOW=nitems] [ASCENDING]

  SHOW vlist   BOTVARS  [SHOW=nitems] [NOALIGN] [DESCENDING]
  SHOW vlist   BOTCASES [SHOW=nitems] [NOALIGN] [DESCENDING]
  SHOW v,vlist BOTRANKS [SHOW=nitems] [DESCENDING]

Format 4:

  SHOW vlist RANK=case# [ASCENDING]

The purpose of the SHOW command is to list selected observations based on some criterion or concentrating on a particular aspect. The SHOW command comes in four different flavours: (1) concentrating on a criterion variable, (2) showing observations close to (or far away from) a specific observation, (3) showing top (or bottom) observations and variables, and (4) showing the ranks of a single observation.

Format 1

SHOW lists cases corresponding to some criterion (conditional list). If no criterion is specified, the command is identical to the LIST SORTED command for numerical displays (with the same formatting options), with the exception of SHOW CODED (distributional coding, with variables as columns). This command form has basically two uses (1) look at a single variable and list outliers etc. (2) select specific observations from a criterion variable (e.g. outliers) to check their values on a series of other variables.

The syntax chart lists all available criteria. All criteria referring to outliers are sensitive to the SET DEFOUTLIER settings. The various IF forms are used to display the cases larger than, smaller than, equal to or not equal to a specified value. Equality and non-equality can be qualified with a FUZZ= value (if no fuzz is present, the system fuzz is used). Finally IF_IN=(low,high) is used to specify an interval; only cases within that interval are shown.
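The IF criteria can be pictured with a short Python sketch. This is not EDA code; the function names, the default fuzz of 0.01 and the choice to treat the IF_IN interval as inclusive at both ends are assumptions made for illustration only.

```python
def satisfies(x, crit, value, fuzz=0.01):
    """Return True if the data value x meets the criterion.

    crit is one of "IF>", "IF<", "IF=", "IF~", "IF_IN" (names taken from
    the syntax chart); fuzz plays the role of the FUZZ= option.
    """
    if crit == "IF>":
        return x > value
    if crit == "IF<":
        return x < value
    if crit == "IF=":
        return abs(x - value) <= fuzz      # equal within the fuzz
    if crit == "IF~":
        return abs(x - value) > fuzz       # not equal, beyond the fuzz
    if crit == "IF_IN":
        low, high = value                  # assumption: inclusive interval
        return low <= x <= high
    raise ValueError("unknown criterion: " + crit)


def show(cases, crit, value, fuzz=0.01):
    """List the names of the cases (a dict name -> value) meeting the criterion."""
    return [name for name, x in cases.items() if satisfies(x, crit, value, fuzz)]
```

With made-up cases such as {"a": 1.0, "b": 2.004, "c": 3.5}, the criterion IF=2 selects only "b", because 2.004 lies within the fuzz of 2.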

Single variables

With a single variable specified, SHOW lists the cases corresponding to the criterion (case identifiers and, if defined, group membership). The LONG option adds the data values for each case. With no criterion, this form reverts to LIST CODED.

Several variables

The first variable of the list is always the criterion variable, i.e. the basis for the selection of the observations to be displayed.

With more than one variable SHOW always displays the numerical values for all the cases (case id, group membership and value). With no criterion, this form reverts to LIST CODED. You may use all formatting options available for numerical lists (see the LIST command for details).

SHOW CODED

(More than one variable needed) If CODED is used, the criterion (first) variable is shown with numerical values and all other variables are coded. They are coded using the standard codes for distributional coding, which distinguish upper far-out, out and adjacent values, in values, and lower adjacent, out and far-out values, thus allowing easy comparison of values on different variables. The symbols can be modified locally by using the "symbols" option or globally using the SET GRAPH DISTCODE command (see also the section on the Art of Coding in the Glossary). Note that the symbols are specified from high far-out to low far-out (7 symbols).

This option is SET DEFOUTLIER sensitive. The WIDTH option controls the column width for the coded variables. It defaults to 4, i.e. the full variable label is shown (on two lines). If W=1 is used, the minus sign for values below the median is not shown.

SHOW SPLIT

The SPLIT option is only applicable to a single variable: instead of displaying only the cases satisfying some criterion, all cases are displayed, but in separate columns. The left column shows the cases satisfying the criterion, the right column the cases not satisfying it. Data values are shown as always (i.e. sensitive to the SET DEC setting). The default is to produce an unsorted list (original order); SORT asks for a sorted list.

Format 1: Common advanced options (*)

SELECTION The SELECTION option stores the displayed cases directly as the current (case) selection (superseding a currently active selection). For details on Selections (case selections, filters) see the chapter on selections (ANALYZE, INCLUDE, EXCLUDE etc commands).

SAVEINDEX (useful in macros) saves the index to the selected observations, e.g. After a SHOW 1 FAR SAVEINDEX=10 you will find a variable #10 in your work area containing an index to all far-out values of variable 1. As always you can use the SAVEINDEX form to store the index variable into the next free location or the SAVEINDEX=var# form to indicate the destination directly. Both result variables are set to 0 initially, i.e. if $0=0 no case has been selected and if $1=0 no SAVEINDEX variable has been created.

Format 2: Concentrate on a single observation

The purpose of this command format is to concentrate on a specific observation and ask for a list of observations close to it or far away from it (for each variable of the vlist).

This command format requires either the CloseTo=cas# or the AwayFrom=cas# option, specifying the observation you want to concentrate on.

By default "close to" or "away from" is measured in units of midspreads, i.e. "close to" means within 0.5 midspread and "away from" means more than 2 midspreads away from the target observation. The default measures of "closeness" or "distance" can be changed using the UNIT=val option, where val is a multiplier for the midspread. SHOW 1-10 CLOSE=GE UNIT=1 asks for a list (for each variable) of observations within an interval of +1*midspread and -1*midspread around the observation "GE".
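The midspread rule can be illustrated with a small Python sketch. It is not EDA code: the function names are hypothetical, the hinges are computed as the medians of the lower and upper halves of the sorted data, and whether the boundary itself counts as "close" is an assumption.

```python
from statistics import median


def hinges(values):
    """Lower and upper hinge: medians of the lower and upper halves
    (the middle value belongs to both halves when n is odd)."""
    s = sorted(values)
    half = (len(s) + 1) // 2
    return median(s[:half]), median(s[-half:])


def close_to(data, target, units=0.5):
    """Names of the cases within units * midspread of the target case."""
    lower, upper = hinges(data.values())
    radius = units * (upper - lower)
    centre = data[target]
    return [name for name, x in data.items()
            if name != target and abs(x - centre) <= radius]


def away_from(data, target, units=2.0):
    """Names of the cases more than units * midspread away from the target."""
    lower, upper = hinges(data.values())
    radius = units * (upper - lower)
    centre = data[target]
    return [name for name, x in data.items() if abs(x - centre) > radius]
```

For the made-up variable {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "H": 20} the midspread is 4, so the cases close to "D" are those within 2 units of its value, and only "H" lies more than 2 midspreads away.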

Instead of using midspreads as unit, you may also use standard deviations (SDV option).

Instead of using a statistical criterion you can specify a DISTANCE=val option, where val specifies the distance from the observation (distance here is a simple difference).

Finally the NCASE=nc option lets you list the nc observations closest to (or farthest away from) the target observation.

Format 2 options

These distinct forms share a number of options affecting the way the information is displayed.

By default EDA lists cases close to the target without distinguishing whether the neighbours are above or below the target observation.

DETAILS distinguishes (+) above, (-) below and (=) equal and lists the cases separately.

FUZZ=fuzz On non-detailed lists observations with values equal to the target value are enclosed in [] on the display. On a detailed display these values appear on a separate line. Note that equality is not strict equality but depends upon the definition of the current fuzz value (See SET FUZZ for details) or the presence of the FUZZ= option on the SHOW command. Equality of course is only possible with the CloseTo=cas# option; AwayFrom=cas# will only list cases far away (low or high).

COUNTSONLY: Instead of listing all details, EDA only displays the number of cases above/below.

NOALIGN: By default observations listed are presented in tabular form (case names aligned on the same columns). NOALIGN leaves a single space between each observation (stream form).

NOMARK (applies only to the non-detailed form): Does not mark identical ("equal") observations (see above).

Note that the information displayed (observation names or variable names) can be controlled by the SET CASID and SET LABELS commands respectively.

Format 3: Top or bottom variables/observations

SHOW TOPCASES: Show the names of the largest observations for all variables in the vlist. By default EDA shows a screen row full of observations; SHOW=nitems lets you ask for more or less.

SHOW TOPVARIABLES shows for all observations the names of the variables with the largest values.

SHOW TOPRANKS shows the names of the largest observations of the first variable and the ranks of the same observations for the other variables in the vlist.

BOTCASES, BOTVARIABLES and BOTRANKS are identical to the corresponding TOP options, except that the ranking is started at the bottom. Note that the linguistic variations are supplied, together with the DESCENDING or ASCENDING options, to suit the command form to all possible uses. In many applications top means "best" or "outstanding", and "best" is sometimes associated with large values, sometimes with the smallest values (e.g. pollution data). Basically you will select the command form most suitable for your data.

NOALIGN suppresses the automatic alignment of the items shown.

ASCENDING shows the smallest values instead of the largest (TOP).

DESCENDING shows the largest values instead of the smallest (BOTTOM).

Note that the information displayed (observation names or variable names) can be controlled by the SET CASID and SET LABELS commands respectively.

Format 4: SHOW RANK=case#

SHOW RANK concentrates on a single observation and lists all its ranks (positions) for all variables of the list. The ASCENDING option is used to start ranking with the smallest value.
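What "all its ranks" means can be sketched in Python. This is an illustration only, not EDA code: the table layout (a dict of variables, each a dict of case values), the function name and the handling of ties (tied cases share the best rank) are assumptions.

```python
def ranks_of_case(table, case, ascending=False):
    """Rank of one case on every variable.

    table maps a variable name to a dict {case name: value}.
    By default rank 1 is the largest value; with ascending=True
    ranking starts with the smallest value.
    """
    ranks = {}
    for var, column in table.items():
        target = column[case]
        if ascending:
            better = sum(v < target for v in column.values())
        else:
            better = sum(v > target for v in column.values())
        ranks[var] = better + 1   # number of cases ahead of the target, plus one
    return ranks
```

For a made-up table with variables "x" and "y", a case that has the largest value on "x" and the smallest on "y" gets the ranks 1 and n by default, and n and 1 with ascending=True.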

SMOOTH


  SMOOTH v1[,vsm] "smr" <options>

<options> [TWICE] [HISTORY] [LOCAL] [EPSI=val] [ENDPOINT {DEF} | NEVER | ALWAYS] [LOMED | HIMED | MEAN] [SMOOTH=var#] [ROUGH=var#] [ZIGZAG {SCALE}] [NOPLOT] [PLOT=(xchars,ychars)][VERTICAL]

SMOOTH v1[,vsm] [PLAY] <options> [PRINT]

Smooth a variable using running medians.

In addition to the standard command mode SMOOTH can be used in PLAY mode, i.e. the command offers a special module allowing for smoothing step by step and examining the results at each step. This special mode will be explained below, after the explanation of the standard command.

Basic command

Smooth v1 according to the smoother specified by "smr" (smr must be enclosed in double quotes). If no "smr" is present the user is asked to supply one. If no other specification is present v1 will be smoothed, the final smooth will be plotted on the screen and will be copied into a free location in the WA as a new variable. If a second variable vsm or the SMOOTH=var# option are present the final smooth will be copied into that variable.

Smoother specifications

A smoother specification <smr> is a string of symbols, each individual symbol producing the action described below. Smoothing operations are performed sequentially, e.g. a smoothing specification like 3RS asks for medians of three, repeated until convergence, followed by a split.

A <smr> specification is always required (there is no default value).
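The left-to-right processing of a specification can be sketched in Python. The sketch handles only the symbols 3, R and H, copies the end values unchanged (no endpoint rule), and lets R repeat the preceding step until nothing changes; these simplifications and all function names are assumptions, not a description of the EDA implementation.

```python
from statistics import median


def median3(seq):
    """Running medians of 3; the two end values are copied unchanged."""
    out = list(seq)
    for i in range(1, len(seq) - 1):
        out[i] = median(seq[i - 1:i + 2])
    return out


def hann(seq):
    """Hanning: running weighted mean with weights 1/4, 1/2, 1/4."""
    out = list(seq)
    for i in range(1, len(seq) - 1):
        out[i] = 0.25 * seq[i - 1] + 0.5 * seq[i] + 0.25 * seq[i + 1]
    return out


STEPS = {"3": median3, "H": hann}


def smooth(seq, spec):
    """Apply the symbols of spec one after the other, from left to right."""
    current = list(seq)
    last = None
    for symbol in spec:
        if symbol == "R":
            if last is None:
                raise ValueError("R must follow a smoother")
            while True:                     # repeat the preceding step
                again = last(current)
                if again == current:        # converged: nothing changed
                    break
                current = again
        else:
            last = STEPS[symbol]            # KeyError for unsupported symbols
            current = last(current)
    return current
```

With the made-up sequence [1, 5, 2, 8, 3, 9, 4], the specification "3" alone still leaves a zig-zag, whereas "3R" irons it out completely.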

You may combine any symbol from the table below into a smoother specification. This does however not mean that all combinations are (always) meaningful. No restrictions have been introduced in order to encourage experimentation.

------------------------------------------------------------
|symbol | Smoothing specification                  | note  |
------------------------------------------------------------
| 3     | running medians of length 3              |       |
| R     | repeat until convergence (3R,SR)         |       |
| S     | split mesas (always followed by 3R)      |       |
| E     | applies endpoint rule                    |  1    |
| H     | hanning (weighted mean of 3)             |       |
| I J K | weighted means of 5 7 9 (Ianning, Janning|  4    |
|       |   Kanning)                               |       |
| 5 7 9 | running medians of span 5 7 9            |       |
| 2     | running median of span 2                 |  2    |
| 4 6 8 | running medians of span 4 6 8            |  2    |
| #     | running median of span 12                |  2    |
------------------------------------------------------------
|         Modifiers                                |       |
------------------------------------------------------------
| <     | Low medians (prefix to 4 6 8)            |  5    |
| >     | High medians (prefix to 4 6 8)           |  5    |
| =     | Use mean, prefix to 4 ..9                |  6    |
------------------------------------------------------------
|       | Plot, display and copy                           |
------------------------------------------------------------
| .     | displays sequence at specified point     |  3    |
| :     | scatter plot at specified point          |  3    |
| ;     | copy smooth values into a variable       |  3    |
------------------------------------------------------------

Notes

The <options> have the following meaning:

Computing options

TWICE commands "twicing", i.e. the smoother is also applied to the first rough, and the second smooth is added to the first. This is recommended to preserve local features better (reroughing).
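Twicing as described here fits in a few lines of Python. The sketch uses a simple hanning step as the smoother and copies the end values; the function names are hypothetical and the code is an illustration, not the EDA routine.

```python
def hann(seq):
    """Running weighted mean with weights 1/4, 1/2, 1/4; ends are copied."""
    out = list(seq)
    for i in range(1, len(seq) - 1):
        out[i] = 0.25 * seq[i - 1] + 0.5 * seq[i] + 0.25 * seq[i + 1]
    return out


def twice(seq, smoother):
    """Smooth, smooth the rough with the same smoother, add the two smooths."""
    first = smoother(seq)
    rough = [x - s for x, s in zip(seq, first)]   # data = smooth + rough
    second = smoother(rough)
    return [a + b for a, b in zip(first, second)]
```

For the made-up sequence [0, 0, 4, 0, 0] a single hanning step flattens the peak to 2, while the twiced smooth brings it back up to 2.5, which is the sense in which local features are better preserved.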

ENDPOINT controls the application of the endpoint rule (linear extrapolation). There are two possibilities to define the endpoints: copying and the end point rule. Copying leaves the endpoints unchanged, the end-point-rule extrapolates them from the adjacent values.

Default action is to apply the end-point-rule after each 3R, just before S and R (if necessary) and at the end of the compound smoother, otherwise the end-points are copied.

The ENDPOINT ALWAYS option causes the end-point rule to be applied after each smoother. The ENDPOINT NEVER option never applies the rule. Note that the 'E' symbol may be used at any point to command the application of the rule.

LOCAL (affects running medians of 3) uses an alternative smoother proposed by Tukey (see McNeil) preserving local features better than the ordinary running medians.

HISTORY: after each smoother a diagnostic statistic is displayed. D is the sum of (x(i) - y(i)), where x is the sequence before and y the sequence after the current smoother has been applied. D-aver is the average difference, i.e. D/n, where n is the length of the sequence. This statistic may be useful when assessing the impact of a single smoother.

[LOMED | HIMED | MEAN] are special options changing the default way of computing the smoothers (see notes 5 and 6 below the smoothing specification table for details). LOMED always computes low medians, HIMED high medians (the > and < smoother symbols are only valid for the span immediately following them). MEAN uses running means systematically (read note 6 please).

EPSI= (Repeated running medians) Repetition is stopped when the difference between the series from the preceding step and the current series is too small. This is measured by taking the sum of the absolute differences between the two series. By default the difference is considered too small (close to zero), when its value is below the system fuzz value (typically 0.01). EPSI=val may be used to specify a different value.
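The stopping rule can be sketched in Python for repeated running medians of 3. The function names, the step limit and the copied end values are assumptions; only the criterion itself (sum of absolute differences below a threshold, 0.01 unless stated otherwise) is taken from the text.

```python
from statistics import median


def median3(seq):
    """Running medians of 3; the two end values are copied unchanged."""
    out = list(seq)
    for i in range(1, len(seq) - 1):
        out[i] = median(seq[i - 1:i + 2])
    return out


def repeat_until_converged(seq, epsi=0.01, max_steps=100):
    """Repeat median3 until the series changes by less than epsi.

    Returns the final series and the number of passes performed.
    """
    current = list(seq)
    for step in range(1, max_steps + 1):
        nxt = median3(current)
        change = sum(abs(a - b) for a, b in zip(current, nxt))
        current = nxt
        if change < epsi:                 # difference "close to zero": stop
            return current, step
    return current, max_steps
```

A larger threshold stops the repetition earlier: with the made-up sequence [1, 5, 2, 8, 3, 9, 4] the default needs four passes, whereas epsi=2 stops after three.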

Copy results: SMOOTH and ROUGH

SMOOTH=var# copies the final smooth into the variable specified. As already explained, if SMOOTH is not present the final smooth is copied into the next free location in the WA. As an alternative you may specify a second variable on the vlist.

ROUGH[=var#] copies the rough into a variable. If =var# is not present, it is stored in the next free location in the WA.

Plotting options

By default EDA produces a single plot with the final smooth.

ZIGZAG: Instead of a standard plot EDA offers a special plot called a zig-zag plot, i.e. a plot shown on three lines only, stressing trends and changes from one step to the next. Furthermore zig-zag plots show a single case in each column. Zig-zag plots are shown after each smoothing step, i.e. you will see on the screen the evolution from step to step. If you select this option the final plot is not produced.

ZIGZAG has two options: THREE and SCALE. By default EDA uses two different symbols, - (dash) and _ (underscore), to show two distinct levels on each line (with three lines this means that the plot will distinguish 6 different levels on the vertical axis). The THREE option uses three distinct symbols, adding the ~ (tilde) symbol for the topmost position in a character (this is the best possible approximation of a dash symbol shown on top), i.e. distinguishing three levels on each line: low, middle and high.

The SCALE option is used to change the scaling of the vertical axis. By default the scaling is based on the minimum and maximum of the original variable, i.e. all zig-zag plots will have the same scale, even if a few peaks have been ironed out and the upper or lower part of the plot may appear empty. SCALE forces the vertical scale to be based on each smooth to be plotted.

The remaining options deal with the default plot.

NOPLOT suppresses the scatter plot of the final smooth at the end of the smoothing procedure.

PLOT=(xchars,ychars) controls the size of the plot(s) produced. xchars is the number of characters across (up to 130), ychars is the number of characters in the vertical direction; ychars is not limited.

VERTICAL inverts the normal x,y axes, i.e. the sequence is displayed as y instead of x. This allows (together with PLOT=) long printouts (on the screen this does usually not look too nice) with a large number of points.

Note that the size options change also the size of intermediate plots (i.e. using the : smoothing specification).

SMOOTH PLAY mode (PLAY or STEP)

Sometimes you might want to study in detail the smoothing process. For this reason EDA offers a special mode with its own syntax.

The basic options are the same, but when you specify the PLAY option you will enter the special play module, offering a number of additional commands. Basically it works as follows: After each smoothing step EDA pauses and offers you various options (plots, lists etc) used to examine the current smooth and possibly decide on a different way of smoothing.

If you are only interested in watching the smoothing process advance step by step its operation is very simple: when entering the module the first smoothing step will be performed and the result is shown in a standard plot. When you hit the <return> key (i.e. enter a blank line) EDA proceeds to the next smoothing step and again plots the result. This goes on until all smoothing specifications are processed.

Note that SMOOTH PLAY mode does not write the results to the print file, unless the PRINT option is present when invoking this mode. Then everything is written to the print file, i.e. this might produce quite a lot of output.

The table below shows the other commands you may use at each step.

-------------------------------------------------------------------
|              Smooth Play mode commands                     |Note|
-------------------------------------------------------------------
| <return>  Next smoothing step                              |    |
| ? or H    List available commands                          |    |
| ?S or HS  List possible smoothing specifications           |    |
| D         Display smoother and d                           |    |
| OP        Option Print toggle                              | 1  |
-------------------------------------------------------------------
|  Q        Quit now, do nothing more (return to EDA)        | 2  |
|  S        Stop smoothing/Save plot and copy this smooth    | 3  |
-------------------------------------------------------------------
| /smoother Start again with this smoother                   | 4  |
| =smoother Continue with this smoother                      | 5  |
-------------------------------------------------------------------
| PP         Plot previous smooth step                       |    |
| PO         Plot original variable                          |    |
| PC         Plot current (show again after PP,PO etc)       |    |
| PR         Plot rough                                      |    |
| L          List original, rough, previous and current      |    |
| LC         same, but list changed cases only               |    |
-------------------------------------------------------------------
Notes

References

McNeil 1977 Chapter 6; Velleman & Hoaglin; Rapacchi; Ladiray. Credits: Initially this command was built using components from McNeil 1977, completed by bits from Velleman & Hoaglin. After the heavy modifications due to the work of B. Rapacchi and suggestions from D. Ladiray and J. Vanpoucke, there are probably only a few Fortran code lines left from McNeil and Velleman & Hoaglin.

STEMLEAF


  STEMLEAF v     <options>

STEMLEAF v BYGVAR | GVAR{=gvar#} [NGROUPS=ng | ALL] [PARALLEL] <options>

STEMLEAF v SPLIT (log) <options> [PARALLEL] <options>

STEMLEAF v1,v2 BACKTOBACK <options>

<options> [Scale=val] [WIDTH=chars] [NOHILOSTEM] [FARONLY] [NOLINE] [ASCENDING | DESCENDING]

Displays a stem and leaf plot. By default a single stem and leaf is produced. Outliers (far and out values) are shown on a separate stem. Case identifiers are shown on the hi/low stems, unless the display is not wide enough (then only the number of observations is shown).
    26 cantons
 Stemleaf:Sucre   (  4) Economie sucriere
 Legend:  2|3 stands for     22.9;   5|6 for     55.9
  lo |BS

   2 |3
   3 |022345577889
   4 |0112567
   5 |1446

  hi |FR

The definition of the outliers can be modified with the SET OUTLIER command. The legend explains to what values the first and last symbols in the stem correspond. Note that due to scaling and rounding symbols

A number of options control the stemleaf production.

Controlling the SCALE

The SCALE option controls the length (number of intervals) of the display. SCALE defaults to 1. Smaller values produce fewer intervals, larger values more.

Controlling the width of the display

Width= (default 40) is used to control the display width. If the number of leaves is larger than the width not all cases appear, but their number is shown. The maximum width is 72.

NOHILOSTEM and FARONLY

With NOHILOSTEM no hi/low stems are shown, i.e. outliers are treated like any other cases; they will therefore influence the scaling heavily and make the display less readable.

FARONLY puts only far-out values on a hi or low stem, i.e. out values are treated as ordinary values.

NOLINE

NOLINE omits the blank line separating the hi and low stems from the other "normal" stems.

SORT ORDER

The sort order of the cases depends upon the setting of the SET SORT switch; the default is to show cases in ascending order. The ASCENDING or DESCENDING option is used to override the sort order for a particular STEMLEAF command.

BACKTOBACK stemleaf for two variables

BACKTOBACK requires two variables (more variables are ignored) and produces two stemleafs shown back-to-back using the same stems. If the PARALLEL option is used both stemleafs are shown in the standard fashion.

SPLIT Back to back stemleafs

The SPLIT option splits the dataset into two groups, displayed back to back (or PARALLEL). The split is specified by the (log) expression (in fact the SPLIT keyword is optional), which must be a simple logical expression; see the glossary for details. Scaling and filtering options also apply; note that the width option specifies the width of each stemleaf display (max. = half of the screen width).

Groupwise stemleafs: BYGVAR

BYGVAR creates a stemleaf for each group (either defined by the current GVAR or specified with GVAR=var#) using a common scale.

If the number of groups is larger than the number of stemleafs that can fit on a single screen (this depends on the width option), nothing is produced and an error message is issued. If you want to display more stemleafs (several screenfuls) the NGROUPS= or the ALL option can be used. ALL asks for all groups to be shown on a separate stemleaf. NGROUPS=ng will show the first ng groups. Note that BYGVAR does not produce all possible groups by default, to avoid producing too many stemleafs that you cannot see simultaneously on the screen.

If only two groups are found, a back-to-back stemleaf is produced, unless the PARALLEL option is specified.

Related commands

See also the HISTOGRAM command, as well as the FREQUENCY HISTO command.

  Result variables defined:
    0   number of lines used on display
    1   max number of possible lines (internal limit)

Reference

McNeil 1977 Chapter 1. Credit: Earlier versions of this command were based on McNeil 1977.

SUMMARY


  SUMMARY v [MIDS | SPREAD] [<detail>] [<lvals>]

<detail> [EIGHTS | ALL] | Detail=level

<lvals> [ONLY] [LVALS{=var} | COMPLETE]

Produces a numerical summary. Default is to produce a five number summary: median, hinges and ones (min/max). These summaries are called letter values and labelled as such on the display: M stands for the median, H for the hinges and O for the Ones (extremes).

MIDS or SPREAD: in addition to the letter values mids and spreads are displayed as well as the trimean.
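How mids, spreads and the trimean follow from the letter values can be sketched in Python. The function name and the input layout are hypothetical; the trimean is taken in its usual form (lower hinge + 2 * median + upper hinge) / 4, which is assumed to be the definition used on the display.

```python
def mids_and_spreads(median, hinges, extremes):
    """Mid and spread for each pair of letter values, plus the trimean.

    hinges and extremes are (lower, upper) pairs.
    """
    rows = {}
    for label, (lower, upper) in (("H", hinges), ("O", extremes)):
        rows[label] = {
            "mid": (lower + upper) / 2,   # centre of the pair
            "spread": upper - lower,      # distance between the pair
        }
    trimean = (hinges[0] + 2 * median + hinges[1]) / 4
    return rows, trimean
```

If the mids drift upwards from the median to the extremes, as with the made-up values median 5, hinges (3, 7) and extremes (1, 13), the batch is skewed towards high values.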

The <detail> options let you produce more detailed summaries, i.e. include more letter values on the display. EIGHT adds the eighths (labelled 'E') and ALL shows all possible letter values. Letter value labelling continues with D (16th), C (32nd), B, A, Z, Y, X. If more letter values are defined for the current variable the depth of the corresponding letter value will be displayed instead of a letter. In addition to the EIGHT and ALL options you may use the DETAIL=level option. The median is considered the first level, the hinges the second, the eighths the third and so on, i.e. DETAIL=5 would ask for a display of all levels from the median out to the 5th level (i.e. 32nd). The 'O' level is always displayed.
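The levels can be made concrete with a Python sketch that uses the customary depth rule for letter values: the median has depth (n+1)/2, and each further level has depth (floor(previous depth) + 1)/2. The function name and the output layout are hypothetical, and that EDA follows exactly this rule is an assumption.

```python
import math

LABELS = "MHEDCBAZYX"   # median, hinges, eighths, 16ths, 32nds, ...


def letter_values(values, detail=2):
    """Return rows (label, depth, lower value, upper value).

    detail counts levels starting with the median (detail=2: median and
    hinges); the extremes ('O', depth 1) are always appended.
    """
    s = sorted(values)
    n = len(s)

    def pair(depth):
        # a depth of k + 1/2 means: average the k-th and (k+1)-th values
        lo, hi = math.floor(depth), math.ceil(depth)
        lower = (s[lo - 1] + s[hi - 1]) / 2
        upper = (s[n - lo] + s[n - hi]) / 2
        return lower, upper

    rows = []
    depth = (n + 1) / 2
    level = 0
    while level < detail and depth > 1:
        label = LABELS[level] if level < len(LABELS) else str(depth)
        rows.append((label, depth) + pair(depth))
        depth = (math.floor(depth) + 1) / 2
        level += 1
    rows.append(("O", 1, s[0], s[-1]))
    return rows
```

For the nine values 1 to 9 the depths are 5 (M), 3 (H), 2 (E) and 1.5 (D); asking for more detail than that adds nothing, because the next depth would already be that of the extremes.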

Letter value copying (*)

The LVAL option lets you copy information pertaining to letter values into variables.

The LVALS option copies the letter values into the specified variable. The sequence of the letter values is 1..M..1, i.e. the first value and the last value are the extremes (depth one) the middle value is the median, the value below the median the lower hinge, the value above the upper hinge and so on.

The COMPLETE option copies three distinct variables into the current WA, i.e. the lower letter values (median, lower hinge, lower eighth ... lower one), the upper letter values (median, upper hinge, upper eighth ... upper one) and the corresponding depths.

ONLY suppresses the numerical summary, i.e. only the letter value copy takes place (useful in macros).

TRACES


TRACES v <what> <opt> [DLINES {<dopt>}| HINGES | SHINGES]
TRACES v <what> <opt> TRACES[=lvals | FULLRANGE]
                      [NOMEDIAN | LINK ]

<what>  | GVAR
        | GVAR=var#
        | GVAR=(var#,div)
        | BREAK=var#
        | [EXACT] [NSTRIPS=n_of_breaks] [STORE_GVAR{=var#}] [NOHBOX]
        | INTERVALS [NSTRIPS=n_of_breaks] [STORE_GVAR{=var#}] [NOHBOX]
        | CUTPOINTS=(val1...) [STORE_GVAR{=var#}] [NOHBOX]
        | READ_CUTPOINTS [STORE_GVAR{=var#}] [NOHBOX]

<opt> [MINFREQ=nmin] [REMOVE_MEDIAN | STANDARDIZE] [REEXPRESS=trans.power] [SORT <crit> {ASC|DESCENDING}]

<crit> [MEDIAN] | MIDSPREAD | N | MEAN | SDEVIATION | RANGE | VAR=critvar#

TRACES displays vertical boxplots, density lines or selected letter values for groups defined either by a GVAR or a second variable broken into groups (categorization).

By default TRACES displays (vertical) boxplots for the groups defined by the GVAR, as well as a boxplot for the variable (no groups distinguished). If more than 10 groups are defined a single column boxplot is displayed.

HINGES, SHINGES: HINGES shows only the box, i.e. no whiskers and outliers. SHINGES is similar to HINGES, but uses a vertical bar (not a full symbol like HINGES).

DLINES: Instead of showing vertical boxplots, vertical density lines (one-character wide histograms) are shown. <dopt> refers to the options available with the DLINE CODED command (see there for more information). Note that the default density lines shown are coded density lines (the DLINE default is numerical density lines). Numerical density lines may be produced using the TRACES DLINE NUMERIC command options.

Groups defined by a GVAR

Different options are used to define the groups shown. If none of the <what> options is present, the current GVAR defines the groups. If no GVAR exists an error message is issued and no display is produced.

GVAR: GVAR=var# may be used to designate a variable in the WA as grouping variable, i.e. the variable is taken as an integer variable (decimal part discarded). Furthermore you may specify GVAR=(var#,div), where <div> means that the variable has to be divided by <div> before discarding the decimal part. This is useful when the variable in the WA contains many different values and it is desired e.g. to divide the variable by 10, i.e. shifting the rightmost integer position. <div> could also be e.g. 0.1.
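The effect of the divisor can be sketched in one line of Python. The function name is hypothetical, and truncation towards zero is assumed for "discarding the decimal part".

```python
import math


def group_codes(values, div=1):
    """Divide by div, then discard the decimal part to obtain group numbers."""
    return [math.trunc(v / div) for v in values]
```

With div=10 the made-up values 12, 19, 27 and 103 fall into the groups 1, 1, 2 and 10, i.e. the units digit no longer separates the groups.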

Groups defined by breaks

BREAK=var# and its various forms is used to obtain the groups by dividing a continuous variable into bins according to some criterion (categorization). Default is to divide the variable into bins containing approximately the same number of observations: default three categories, i.e. taking the thirds of a distribution.

NSTRIPS: NSTRIPS=n_of_breaks is used to produce a different number of intervals. E.g. NSTRIPS=4 divides the variable into quartiles.

EXACT: Note that when dividing the variable into quartiles (or other fractiles), the program looks for the value at the quartiles and divides the variable up according to these values. This divides the variable into exactly four groups, if the values at the quartiles are unique. If, say, the first quartile falls on a '4' and there are several 4s in the data series, then all fours will go into the same group, i.e. the number of cases in each group will not necessarily be equal. In many applications this is exactly what you want, i.e. it does not make sense to put cases with identical values into different groups. In other circumstances however, if you need equal sized groups, you will need to force equal groups by adding the EXACT option.
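The difference between the two ways of cutting can be sketched in Python. This is an illustration only: how EDA locates the fractile values, and into which group a value equal to a cut point goes, are assumptions, as are the function names.

```python
import math


def break_by_value(values, nstrips=3):
    """Cut at the values found at the fractile positions; ties stay together."""
    s = sorted(values)
    n = len(s)
    cuts = [s[math.ceil(k * n / nstrips) - 1] for k in range(1, nstrips)]
    # group number = how many cut values lie strictly below the case
    return [sum(v > c for c in cuts) for v in values]


def break_exact(values, nstrips=3):
    """Force (nearly) equal group sizes by cutting on the rank of each case."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for position, i in enumerate(order):
        groups[i] = position * nstrips // len(values)
    return groups
```

With the made-up series [1, 4, 4, 4, 5, 6, 7, 8] and four strips, the value-based cut keeps the three 4s together (one group ends up empty), whereas the exact cut yields four groups of two cases and separates identical values.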

INTERVALS: This option is used to divide the variable into intervals of equal width. Again the default is three (divides the variable range into three intervals) and NSTRIPS= may be used to take a different number of intervals.
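Equal-width intervals can be sketched as follows in Python. The function name is hypothetical, and assigning the maximum to the last interval is an assumption about how the upper boundary is treated.

```python
def break_by_intervals(values, nstrips=3):
    """Assign each value to one of nstrips intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / nstrips
    if width == 0:                       # all values identical: one group
        return [0] * len(values)
    # the maximum would fall just outside the last interval, so clamp it
    return [min(int((v - low) / width), nstrips - 1) for v in values]
```

Unlike the default breaks, the resulting groups can be of very different sizes: a single large value stretches the range and leaves most cases in the first interval.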

READ and CUT: The READ and CUT= options are used to enter arbitrary cutting points in order to define the intervals. READ will ask you to enter the interval boundaries, whereas CUT=(val1,...) does the same on the command line. [Remember that the number of values in the CUT=(val..) specification is limited by an implementation constant, typically 8, sometimes 6.]

Display and NOHBOX: Whenever you use a BREAK option and you have only a few groups (this depends upon the display width, often 7), the display will include small horizontal boxplots showing the distribution of the broken variable for each group. These horizontal boxplots may be suppressed with the NOHBOX option.

STORE_GVAR (BREAK only) stores the defined groups as GVAR. Instead of storing them as a GVAR you can also save the group memberships as a variable (STORE_GVAR=var#).

TRACES: Tracing letter values

TRACES[=lettervalue] is used to trace a specific letter value, i.e. instead of showing a boxplot, only selected information is shown. Without further specification, the medians for each group are shown (plotted).

If TRACES=lval is specified, the specified letter value is shown. A <lval> of 2 asks for the hinges, 3 for the eights, 4 for the 16th etc (a value of 1=median is also possible, but is the same as no specifications). <lval> may take values between 1 and 9. Furthermore you may use FULLRANGE, meaning the minimum and maximum.

Normally the following symbols appear: a '-' for the lower letter value, a '=' for the higher letter value, a '*' for the median (if no star appears it is either at the '-' or the '=' location). If a letter value cannot be computed (e.g. a sixteenth is not defined with, say, 6 cases in a group) a '?' appears at the bottom. [Note that if you prefer other symbols, the SET GRAPH options provide a way of changing the default symbols.]

NOMEDIAN suppresses the median symbol. LINK links the upper and lower letter value, producing a display similar to the display produced with SHINGES (vertical bars link the values shown).

The TRACES command is sensitive to the SET DEFOUT settings.

For an explanation of the boxplot symbols refer to the BOXPLOT command. Refer especially to the GVAR and the BREAK commands for more details on how to define a GVAR.

Display options (sorting)

By default the groups are shown in ascending order of group numbers, i.e. natural order.

SORT: TRACES offers options to sort the boxplots (density lines or traces) using a statistical criterion: the median of each group (default), the mean, the midspread, the standard deviation (SDEVIATION), the number of cases in the groups (N), the range. The sort order can also be taken from a variable in the WA (VAR=critvar#); this variable could have been produced for instance by the DISPLAY command (using the BYGVAR option to work on groups).

The sort order is by default as defined by the SET SORT_ORDER switch (default ASCENDING). The ASCENDING or DESCENDING options may be used to override the default sort order.

Small groups: MINFREQ

MINFREQ=nmin: This option is used to exclude small groups from the display (often producing strange looking "boxplots"). By default no group is excluded, even if it contains only a single observation. MINFREQ specifies the minimal number of cases required for a group to be displayed as boxplot (trace or density line).

Remove median and standardize

REMOVE_MEDIAN: removes the group median from each group (normalization).

STANDARDIZE: each group is standardized (remove median and divide by midspread).

Note that these operations are performed before reexpression transformations or computations of the statistics used with the SORT option.
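Both operations can be sketched in Python, working group by group. The function names and the dict layout are hypothetical; the hinges are computed as medians of the lower and upper halves, which is assumed to match the midspread used by TRACES.

```python
from statistics import median


def midspread(values):
    """Upper hinge minus lower hinge (hinges = medians of the two halves)."""
    s = sorted(values)
    half = (len(s) + 1) // 2
    return median(s[-half:]) - median(s[:half])


def remove_median(groups):
    """Subtract each group's own median from its values."""
    return {g: [x - median(v) for x in v] for g, v in groups.items()}


def standardize(groups):
    """Remove the group median, then divide by the group midspread.

    A group with midspread 0 raises ZeroDivisionError in this sketch.
    """
    return {g: [(x - median(v)) / midspread(v) for x in v]
            for g, v in groups.items()}
```

After REMOVE_MEDIAN all boxes are centred on zero, so only spread and shape differ; after STANDARDIZE the boxes also have the same length.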

Reexpress: Power transformations

REEXPRESS=trans.power reexpresses the variable using trans.power, i.e. TRACES lets you see the effect of a transformation on the groups shown. See the section on power transformations in the glossary for additional information.
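A power re-expression along the usual ladder of powers might look like the Python sketch below. The conventions shown (natural logarithm for power 0, a sign change for negative powers so that the order of the values is preserved) are assumptions; the glossary section on power transformations is the authority on what EDA actually does.

```python
import math


def reexpress(values, power):
    """Re-express positive values with a power transformation."""
    if power == 0:
        return [math.log(x) for x in values]   # assumption: power 0 means log
    if power < 0:
        return [-(x ** power) for x in values] # keep the original ordering
    return [x ** power for x in values]
```

Powers below 1 pull in a long upper tail, so groups whose boxplots are stretched upwards tend to look more symmetric after, say, a square root or a log.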

Related commands

The CODE command (especially when used with the PLAY option) provides a way of interactively creating groups (GVARS). The same options are offered as with the BREAK option of this command, but the CODE PLAY command lets you experiment with various numbers of groups to achieve optimal breaks.

XTAB


 XTAB v1[,v2]  <options>
Displays frequency tables and crosstabulations for the specified variables. Variables are considered as integer variables, i.e. the fractional part of a variable is always discarded.

XTAB shares all options with the BREAK command. In fact the two commands are the same, except for the method of obtaining the frequencies, i.e. BREAK cuts variables into bins and counts the observations in each bin, whereas XTAB simply counts occurrences in the specified variables.
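The counting done by XTAB can be sketched in Python. The function name is hypothetical and the result is a plain counter rather than a formatted table; truncation towards zero is assumed for discarding the fractional part.

```python
import math
from collections import Counter


def xtab(v1, v2=None):
    """Count occurrences of the integer parts of one or two variables."""
    if v2 is None:
        return Counter(math.trunc(a) for a in v1)           # frequency table
    return Counter((math.trunc(a), math.trunc(b))            # crosstabulation
                   for a, b in zip(v1, v2))
```

For the made-up variable [1.2, 1.9, 2.0, 3.5] the values 1.2 and 1.9 are counted in the same cell, because only the integer part is used.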