Cluster analysis

Introduction

This section describes the commands used for cluster analysis. Only commands specific to cluster analysis, and commands that work specifically on cluster analysis results, are described here. Many other commands are nevertheless useful and may be used to analyze groups of cases, however those groups were defined.

Usually a cluster analysis defines groups of cases. Group memberships are stored in a location called GVAR, which is simply an integer variable defining for each case a group membership. Initially no GVAR is defined. If a GVAR exists it can be considered as an attribute of the whole WA. It is automatically saved with the WA.

Note that a GVAR may be defined by a number of commands (not only the commands shown below). The CODE command may be used to build a GVAR by cutting interval variables into pieces (interval coding). Other commands defining GVARs include HISTOGRAM, FREQ, XTAB and others. (*) Furthermore the GVAR can be referred to in expressions using either GVAR, to refer to the GVAR as a whole vector, or G[i], to access individual elements. (GVAR is an alternative form of G[], i.e. all elements.)

  (1) LET GVAR=(#VARX*#VARY)/100
  (2) IF GVAR>100 THEN GVAR=0
  (3) LET G[1]=2

Examples (1) and (2) use GVAR, i.e. they refer to all values in the GVAR (GVAR as a vector); GVAR is treated like any other variable reference. (1) computes a new GVAR (multiply VARX by VARY, divide by 100). (2) sets the GVAR to 0 for all values larger than 100. Finally, (3) refers to the first element in GVAR and sets it to 2.

Note that a membership number of zero means that the case does not belong to any group.
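
For readers more familiar with a general-purpose language, the three expressions above can be mimicked in Python. This is only an illustrative sketch: the list names and sample values are invented, and truncating the result to an integer is an assumption about how memberships are stored, not documented EDA behaviour.

```python
# Hypothetical sample data; varx and vary stand for two WA variables.
varx = [10.0, 250.0, 40.0, 120.0]
vary = [30.0, 80.0, 50.0, 90.0]

# (1) LET GVAR=(#VARX*#VARY)/100 : element-wise over the whole vector.
# Truncation to an integer is an assumption made for this sketch.
gvar = [int(x * y / 100) for x, y in zip(varx, vary)]

# (2) IF GVAR>100 THEN GVAR=0 : memberships above 100 become 0 (no group).
gvar = [0 if g > 100 else g for g in gvar]

# (3) LET G[1]=2 : EDA counts elements from 1, Python from 0.
gvar[0] = 2

print(gvar)
```

Cases two and four end up with membership 0, i.e. they belong to no group.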

Many commands use and show the GVAR: all plotting and listing facilities (whenever case identifiers are shown the group memberships will also be shown); the selection mechanism may be used to analyze groups separately (see the ANALYZE command). Whenever you see a GVAR option you know that it shows the group memberships: this is e.g. the case with PLOT GVAR, C2 PLOT GVAR, HISTOGRAM GVAR and others.

The following commands are described in detail below.

Cluster analysis techniques

    CLUSTER      non-hierarchical clustering
    HIERARCHY    hierarchical clustering
    VHIERARCHY   hierarchical clustering (variables)

Interpretation

    GANALYSIS    group analysis
    GSUMMARY     group summaries

GVAR manipulation

    GVAR         manipulate GVAR
    MEMBER       display group members

Other important commands (explanation in other sections)

    TRACES        parallel boxplots for each group
    LIST BYGVAR   group listing
    FREQ BYGVAR   frequency with groups
    CODE GVAR     define a GVAR by coding
    ANALYZE       selection: analyze a group

Analysis commands

CLUSTER

  CLUSTER [vlist] <method> [VARIABLEWISE] [<options>]

<options>

     Starting configuration:

          [NCLUSTERS=nclusters  | INIT=(c1,..,cn) |*
          [SIZES{=s1,...,sn}]   | RANDOM          |*

             *) not with MACQUEEN

     Save classification:
        (available as options or as subcommand)

       StoreGVAR  [GVAR{=var#}]  [NEW]
       StoreTIES  [TIES{=var#}] [NEW]
   Other options:

          [R=metric] [LOADCENTROIDS {"symbols"}] [R2]
          [MIN=min_of_relocations]*

  Subcommand:

  CLUSTER | StoreGVAR [<sopt>]
          | StoreTIES [<sopt>]
<sopt> : see above (Saving classification)

Performs a non-hierarchical cluster analysis on observations using the <method> indicated, controlled by additional options.

If no <vlist> is present all variables in the current WA are analyzed, unless ALLVARS is OFF; then the current variable list is taken.

VARIABLEWISE modifies the default clustering of cases to produce a classification of variables. Important: Currently you cannot analyze more than NVAR cases with the VARIABLEWISE option (this kind of analysis requires a lot of memory space; this limit will (hopefully) disappear from future versions).

Clustering methods

The first option controls the method used: <method>:

    FORGY (default method)
    JANCEY
    MACQUEEN  (K-Means)
    CONVERGENCE

Number of clusters and starting configurations

Non-hierarchical methods require the following information: (1) the number of clusters to produce and (2) an initial (starting) configuration.

By default the user is asked to enter the cases forming the initial configuration. The user then replies by typing a number of case identifiers; the number of identifiers entered determines the number of clusters to be produced.

A second method requires that the user supply the number of cases in each cluster. This is done by specifying the SIZE option. The cluster sizes must of course sum to the number of cases.

The NCLUSTER=n option may be used to give the number of clusters explicitly; the user is then asked to enter exactly n case-ids or sizes. This option is required with the MACQUEEN method if you want to change the default of 4 clusters (MACQUEEN does not need a user-specified initial configuration).

Alternatively, starting configurations may be specified on the command line, using the INIT=(c1,c2,...,cn) or the SIZE=(s1,s2,...,sn) option. Note that on the INIT option you should preferably specify case-ids.

RANDOM selects the starting configuration randomly (random selection of cases); by default 4 clusters are produced; use NCLUSTER=n to change the default.

VARIABLEWISE (classify variables instead of cases): when using this option the default method of entering an initial configuration is not available. You will have to use the INIT, SIZE or RANDOM option instead.

Saving the classification

CLUSTER shows the classes of cases or variables (VARIABLEWISE) on the screen, but in most situations you need to keep the classification for further analysis, i.e. you need to save the classification of cases as GVAR or the variables as ties (alternatively you might save the classification as a variable in the WA).

There are two ways of dealing with this: (1) add the appropriate option to the CLUSTER command, or (2) use the same options as sub-commands, i.e. after seeing the results on the screen (this lets you decide after checking the results).

StoreGVAR/StoreTIES: This option requests that the classification be saved as GVAR or variable (bundle) ties. Note that these options are in fact identical, i.e. the VARIABLEWISE option decides whether EDA saves ties or a GVAR.

If you need to save the classification as a variable in the WA you may add the GVAR{=var#} or the TIES{=var#} option: GVAR/TIES store the classification into a free location in the WA, whereas GVAR=var#, resp. TIES=var#, store the resulting partition into the target variable specified by var#.

STORE NEW

When using the STORE option as a sub-command, i.e. after a CLUSTER command, a problem arises whenever you are not satisfied with the classification and want to start a new classification that itself uses a STORE option.

    >CLUSTER                  >CLUSTER
    >CLUSTER STOREGVAR        >CLUSTER STOREGVAR NEW

In the first example a CLUSTER analysis is performed and then the result is stored as a GVAR: the second command line is a sub-command, i.e. it takes up the results produced by the first command line. In the second example a cluster analysis is performed by the first command line and a new cluster analysis is performed by the second: the NEW option tells EDA that the second command line is not to be taken as a sub-command to the first, but as a new request to perform a cluster analysis and STORE the GVAR directly.

Other options

The M=minrel option (default 0) sets the number of case relocations per iteration used as the stop criterion. 0 means that the program iterates until no cases are moved during an iteration. The option may be useful in cases where the method does not converge.
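
The role of the relocation count as a stop criterion can be sketched with a small Forgy-style loop in Python. This is a minimal illustration, not EDA's implementation: the function name, the `max_iter` safeguard and the sample points are assumptions made for the example.

```python
import math

def forgy(points, seeds, minrel=0, max_iter=100):
    """Assign every case to its nearest centroid, recompute the centroids,
    and stop once an iteration relocates no more than `minrel` cases."""
    centroids = [list(points[s]) for s in seeds]   # seeds = initial cases
    groups = [0] * len(points)                     # 0 = not yet assigned
    for _ in range(max_iter):                      # max_iter is a safeguard (assumption)
        moved = 0
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centroids]
            g = dists.index(min(dists)) + 1        # group numbers start at 1
            if g != groups[i]:
                groups[i] = g
                moved += 1
        # Recompute each centroid as the mean of its current members.
        for k in range(len(centroids)):
            members = [p for p, g in zip(points, groups) if g == k + 1]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        if moved <= minrel:
            break
    return groups, centroids

# Hypothetical data: two well separated clouds, cases 1 and 4 as seeds.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
groups, centroids = forgy(points, seeds=[0, 3])
print(groups)
```

With minrel=0 the loop ends only after a full pass in which no case changes group; a larger value stops earlier and accepts a partition that is not fully stable.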

The R=metric option controls the metric used for computing inter-individual distances. By default CLUSTER uses the distance requested by the SET POWER command (defaults to 2, i.e. Euclidean distances). Any power metric can be set using the SET POWER command (sets the default metric). The R= option may be used to override the default setting.
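
As an illustration of what a power metric computes, here is a short Python sketch. It assumes that "power metric" means the usual Minkowski distance with exponent r; the function name is invented for the example.

```python
def power_distance(a, b, r=2.0):
    """Minkowski (power) distance between two cases.
    r=2 gives the Euclidean distance, r=1 the city-block distance."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)

d_euclid = power_distance((0, 0), (3, 4))        # r=2
d_block = power_distance((0, 0), (3, 4), r=1)    # r=1
print(d_euclid, d_block)
```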

R2 produces additional output (R2 for each variable).

LOADCENTROID adds the group centroids to the variables in the WA. (modification stamp: *+*).

LOADCENTROID "symbols" (available only in casewise mode) may be used to generate single-letter casids for the new cases created by LOAD (by default the casids are of the form Cnn, where nn is the number of the cluster). All casids are followed by a '*'; this enables the PLOT GVAR command to recognize these values as centroids and shows in other situations that these are "special" case identifiers. LOAD "abcd", e.g., will create a casid of 'a' for the first cluster, 'b' for the second cluster, etc. If the number of symbols in the "string" is smaller than the number of clusters, the default casids are created, i.e. "string" is ignored (and a message is given).

Additional information

Note that the iteration history is sensitive to the setting of the SET MESSAGE switch.

Reference

Anderberg 1973. Credit: Adapted mostly by Dominique Joye from Anderberg 1973.

HIERARCHY

 HIERARCHY [vlist] <method> [TREE] <option> [NOTREE]
 HIERARCHY [vlist] VARIABLEWISE <method> [TREE] <option> [NOTREE]
 HIERARCHY C1 <method> [TREE] <option> [NOTREE]
 HIERARCHY C2 <method> [TREE] <option> [NOTREE]

Performs a hierarchical cluster analysis on all selected cases from all variables or a vlist (if present, or ALLVARS mode is turned off).

VARIABLEWISE performs a hierarchical classification of variables instead of cases.

C1/C2 perform a hierarchical classification on the C1 or C2 configuration matrix. Depending upon the contents of these matrices either variables or observations are classified. The most common and obvious analysis you are likely to perform is a classification of the results of a principal component analysis: issuing HIERARCHY C2 after the FACTOR command will produce a classification of the factor scores stored into C2 by the FACTOR command. Applied to C1, variables would be clustered.

Methods available are:

    WARD      (default method) Ward's method.
    CENTROID   centroid method
    WITHIN     minimal within group sum of squares
    MEAN       minimum mean of within group deviations
    WWARD      within group sum of squares, with WARD
               criterion reported.
    MWARD      minimal within group of sum of squares,
               WARD criterion reported.
    CWARD      centroid, Ward's criterion reported.

Compare to the VHIERARCHY command, which by default produces a classification of variables and optionally of cases (CASEWISE). Note however that there is a fundamental difference: VHIER works from a matrix stored into MATRIX, whereas HIERARCHY does the computations directly.

All other options are explained below, as they are common to the VHIERARCHY command below.

VHIERARCHY

 VHIERARCHY [vlist] <meth> [SIMIL] [TREE] <options> [NOTREE]
            [<dist>] | [NOCOMPUTE {SIMI | DISSIM}]
            [FLEV=var#]

Performs a hierarchical cluster analysis on the variables in the WA or in the vlist depending on the setting of the ALLVARS mode. Available <methods>:

    SINGLE     single linkage (default method)
    COMPlete   complete linkage
    AVERage    average linkage within new groups
    CENTROID   mean distance
    LINK       average linkage between merged groups
    MEDIAN     method of Gower
    WARD       Ward's method, minimum variance increase
    FLEXible   method described by Lance and Williams
               FLEXIBLE Parameter=beta
               FLEXIBLE FREE Parameter=(alpha1,alpha2,beta,gamma)

This last method requires additional information from the user. The first format needs only the beta parameter, the other parameters are computed internally; with the FREE option all parameters are required : alpha1, alpha2, beta, gamma
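
The four parameters correspond to the Lance and Williams distance update, which can be sketched in Python as follows. The function names are invented, and the rule used for the single-parameter form (alpha1 = alpha2 = (1 - beta)/2, gamma = 0) is the usual convention for the flexible strategy; this section does not state how EDA derives the remaining parameters, so treat that rule as an assumption.

```python
def lance_williams(d_ki, d_kj, d_ij, alpha1, alpha2, beta, gamma=0.0):
    """Distance from group k to the group obtained by merging i and j."""
    return alpha1 * d_ki + alpha2 * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)

def flexible(d_ki, d_kj, d_ij, beta):
    """Single-parameter form: alpha1 = alpha2 = (1 - beta) / 2, gamma = 0 (assumed)."""
    alpha = (1.0 - beta) / 2.0
    return lance_williams(d_ki, d_kj, d_ij, alpha, alpha, beta)

# Single linkage is the special case alpha1 = alpha2 = 1/2, beta = 0, gamma = -1/2:
# the updated distance is the smaller of d_ki and d_kj.
d_single = lance_williams(2.0, 5.0, 3.0, 0.5, 0.5, 0.0, -0.5)
d_flex = flexible(2.0, 5.0, 3.0, beta=-0.25)
print(d_single, d_flex)
```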

Unless the NOCOMPUTE option is present, VHIERARCHY computes a distance matrix using the default metric as set by SET POWER (default euclidean distances) into MATRIX. The R= option may be used to override the default settings.

You may specify all options available with the DISTANCE command on a VHIERARCHY command line, namely the R= and CASEWISE options (see there).

Despite the name of the command it is possible to analyze observations instead of variables using the CASEWISE option.

If NOCOMPUTE is present, the stored MATRIX is used instead. If the MATRIX has been produced by an EDA command, no more information is needed; if a user-defined matrix is used (i.e. MATRIX STORE), SIMIL or DISSIMIL must be specified to determine whether the analysis is to be done on similarity or dissimilarity measures.

The TREE <options> and the NOTREE option are the same as for the HIERARCHY command, because both commands display and manipulate a hierarchical tree. The only difference is that the NCUT option does not produce GVAR memberships but variable ties. Variables are matched by name (name in MATRIX and name in WA). This of course will not produce correct results if the WA no longer contains the original variables or MATRIX has been filled by some other means. See TREE for more details.

References

Anderberg 1973, Joye 1980; Joye 1983, Lance and Williams 1967, 1968. Credit: Adapted from Anderberg and the standalone adaptation by Dominique Joye; FLEX option based on modifications of D. Joye.

TREE (subcommand)

The following <options> are available:

 [HIER]  TREE  NOTREE
 [VHIER] TREE  [START=level] [END=level]
               TIDENT[=var#]
               NCUT=ncluster [VARS{=var#}]
               FUSIon_levels[=var#]
               DETAILS
               MIN_MEMBERS=nmin
               NOALLOCATE
               DETAILS=object| DETAILS "object"
               NEWTREE [WIDTH=chars#][ARROW][FULL]

The following section applies to the HIERARCHY as well as to the VHIERARCHY command and deals with hierarchical trees.

The cluster analysis produces the information needed for building the hierarchical tree; this tree is normally displayed. Various options deal with this tree. These options may be used in two ways: either they are specified with the first (V)HIERARCHY command, or they are used as sub-commands, i.e. you first issue a (V)HIER command and then, as the immediately following command, the same command with the TREE option; this tells the system that tree information already exists and the cluster analysis need not be performed again.

The NOTREE option is used to inhibit tree display (not the internal construction), if only one of the <options> is of interest to you.

Note that TREE without sub-command causes the tree to be displayed.

The S=/E= options display the sub-tree from fusion level START (default 1) to fusion level END (default: all levels). These options are useful when the details of a large tree are not completely clear.

TIDENT copies the sequence of the cases (resp. variables) as they appear on the tree as a variable into the WA for further use. E.g. you might use this variable as a sort key to produce a case listing in the same sequence as on the tree. (If no target variable is supplied, the program searches for a free location; note that an existing variable with the same label will be overwritten.)

FUSION_levels: copy the fusion levels into a variable. (see TIDENT for the target var#.)

DETAILS displays the merge information needed to build the tree: for each level of fusion the ids of the cases/variables (or "fused cases") are displayed, together with the coefficient and, for both cases/variables, the last level in which each has been involved. This option is useful if interpretation problems with the printed tree arise.

NCUT=ncluster: defines ncluster clusters by cutting the tree at the appropriate level. In the case of the HIER command a new GVAR is defined. In the case of the VHIER command variable ties (bundles) are created for the variables in the WA; in this case a problem arises when the content of the MATRIX differs from the WA (e.g. in the case of the NOCOMPUTE option). The program then displays a warning that such a variable could not be found (the match is made via the variable name).
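
The idea of cutting a tree into a fixed number of clusters can be sketched in Python. The sketch assumes a merge history given as pairs of object indices ordered by fusion level; this representation, the function name and the data are hypothetical and are not EDA's internal format.

```python
def cut_tree(n_objects, merges, ncluster):
    """Apply the first n_objects - ncluster merges, then number the groups from 1."""
    parent = list(range(n_objects))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for a, b in merges[: n_objects - ncluster]:
        parent[find(b)] = find(a)

    labels, groups = {}, []
    for i in range(n_objects):
        root = find(i)
        labels.setdefault(root, len(labels) + 1)
        groups.append(labels[root])
    return groups

# Hypothetical merge history for 5 objects, lowest fusion level first.
merges = [(0, 1), (3, 4), (0, 2), (0, 3)]
two_groups = cut_tree(5, merges, ncluster=2)
three_groups = cut_tree(5, merges, ncluster=3)
print(two_groups, three_groups)
```

Each merge reduces the number of groups by one, so stopping after n - ncluster merges leaves exactly ncluster groups.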

VARS copies the groups (observations or variables) into a variable in the WA instead of storing them as GVAR or ties. If no target variable is specified the next free location in the WA is used.

The MIN_MEMBER option does not define groups (GVAR/ties) for groups with fewer members than MIN_MEMBER; their members are set to 0, i.e. member of no group.

MIN_MEMBERS=nmin sets the minimum number of cases/variables in a group. By default, if a group (cluster) contains fewer than two members, a diagnostic message is issued (indicating the case or variable). Note that without the NOALLOCATE option MIN_MEMBERS serves only diagnostic purposes, i.e. a message is given and the case(s) or variable(s) are still allocated to their group.

NOALLOCATE: if a group (cluster) contains fewer than the number of cases or variables required by the MIN_MEMBER option (default 2), the group is not created and its members are allocated to group 0 (i.e. no group at all).

TESTONLY may be used to simulate the creation of a GVAR (or ties), i.e. the group sizes are displayed, but the results are not stored.

SHOW: this option displays not only the group sizes but also the members of each group. (This is similar to the use of the MEMBERS command, except that it is available as a subcommand, i.e. if you do not want this partition you just ask for another one, without going through the full computation process.)

DETAILS=object or DETAILS "object" produces the same information as DETAIL but only for a single object (case or variable), i.e. only the fusion levels where "object" is involved are shown. The command takes two forms DETAILS=object using a standard var#, resp. cas# reference or DETAILS "objectname". In the first case you may use either names or reference numbers; in the second case only names are allowed. Note that on the display the object's name might appear only once at the first level where it is merged to another group (even for merged groups basic object names appear).

NEWTREE is used to produce - from the same information produced by the previously specified HIER or VHIER command - a different tree with other attributes. [Note that using the S=level E=level options will also produce a new tree].

The following options are available with NEWTREE (they can also be specified on the HIER, resp. VHIER command):

WIDTH=#chars the width of the tree in characters. By default a tree fitting on the screen is produced (internal memory constraints permitting). WIDTH is used to change the default value.

FULL may be used to show all links that would be omitted from the tree to make the distinction of many merged objects on the same display level.

ARROW: in some situations it is difficult to know to which cluster an object belongs. ARROW may be used to add visual help by adding "arrowheads". An example will make this clear:

 6    |-|                                6    |-v
 8    | |-------|                        8    | |-------v
 5    |-|       |                        5    |-^       |
 3    |-|-----| |                        3    |-v-----v |
 9    |-|     | |                        9    |-^     | |
 2    |-|-|   |-|                        2    |-v-v   |-^
 7    |-| |-| |                          7    |-^ |-v |
 10   |---| |-|                          10   |---^ |-^
 1    |---|-|                            1    |---v-^
 4    |---|                              4    |---^

In this example there is a doubt whether object '9' merges with 3 or 2; the arrowheads show that it goes with '3'.

Note that ARROW is not available on implementations using line drawing characters (it will not be needed there).

GVAR and MEMBER command

GVAR

      | [DISPLAY]
      | v STORE [ASIS] [<sopt>]
      | v STORE CODE [NOSORT] [<sopt>]
              <sopt>  [DELETE] [SLIDE=decimals]
                      [POS=(begpos{,endpos})]  ["name"]
      | DROP or DELETE
      | LOAD
      | v SWAP
      | v SELECTION
 GVAR | glist DEFINE  ["name"]
      | MEMBERS <options>
      | NAMES [DEFINE | GROUP=g#]
      | SET [LENGTH=n] ["name"]
      | CASID [POS=(first,last)][N=ncas]
      | RENUMBER [INCLUDEZERO]
      | CHECK [MINFREQ=minfreq]
      | RECODE [MINFREQ=minfreq] [INTO=code]
      | RECODE GROUP=g# [INTO=code]
      | GINFO [N] [MINFREQ=minfreq]
      | COMPAREWITH=gvar.var#

Manipulates GVAR (the grouping variable). If no option is specified, the command reports whether there is a GVAR stored, and if there is one, its label and the length (number of cases).

DROP (or DELETE)

deletes an existing GVAR. This option should be used before defining new groups with the DEFINE command, except if only partial modification is desired.

LOAD

LOAD loads the current GVAR into the WA as an EDA variable. A free location is searched and a message tells in what position the GVAR has been copied.

(*) As you may LOAD the GVAR only in order to keep more than one GVAR, you might use EDIT to change the type of the variable, setting it to 2 instead of 1; the variable would then not be considered a normal variable, i.e. it would not be included in analyses.

STORE

stores a variable in the WA as the current GVAR, by either storing the variable directly or by coding it first.

By default the variable is taken as an integer variable containing group memberships. Normally only integers ranging from 1 to MCAS (i.e. the maximum number of cases in your EDA implementation) are accepted (this ensures that any group membership is always correctly displayed). Only positive values are acceptable; if negative values are found in the variable an error message is issued. Also, if all values are 0 (e.g. when truncating values larger than 0 but smaller than 1) an error message is issued, saying that the variable defines 0 groups.

If you want to store the variable as it is, i.e. without checking that the above conditions are met, use the ASIS option. (Note that the display format of the GVAR memberships depends upon MCAS: numbers larger than MCAS will be displayed as a sequence of stars, at least in some situations, so you will not always be able to read the group membership on all displays.)

STORE CODE codes the variable to define the GVAR. Different (integer) values are interpreted as different groups, the smallest value will define group 1 and so on. If you prefer to define the first value of a variable as group 1, the second different value as group 2 (and so on) use the NOSORT option (this means that groups are assigned sequentially as EDA works from the first observation to the last looking for different values). See also the CODE command (an EDA command documented separately) for other means of coding a GVAR.
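
The difference between the default (sorted) coding and NOSORT can be illustrated with a short Python sketch. The function name and the sample values are invented for the example; only the numbering rule described above is taken from the text.

```python
def code_groups(values, nosort=False):
    """Map distinct integer values to group numbers 1, 2, ..."""
    ints = [int(v) for v in values]
    if nosort:
        distinct = list(dict.fromkeys(ints))   # order of first appearance
    else:
        distinct = sorted(set(ints))           # smallest value becomes group 1
    lookup = {v: g for g, v in enumerate(distinct, start=1)}
    return [lookup[v] for v in ints]

values = [30, 10, 30, 20, 10]
sorted_coding = code_groups(values)
sequential_coding = code_groups(values, nosort=True)
print(sorted_coding, sequential_coding)
```

In the sorted coding the value 10 defines group 1; with NOSORT the first value met (30) does.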

STORE options

The options explained here are common to both GVAR STORE command forms.

The DELETE option is used to delete the variable in the WA after copying (e.g. to prevent the variable from being included in analyses while in ALLVARS mode).

"name" may be used to change the GVAR name.

The SLIDE and POSITION options are used to modify the variable or consider only part of the variable before storing it as a GVAR. Note that with GVAR CODE these operations are performed before the variable is CODED.

The SLIDE option is used to slide the decimal point of the variable to be stored to the left or to the right before storing it, i.e. this option is equivalent to a LET command in which the variable is multiplied or divided by a power of ten. SLIDE=2, e.g., will multiply the initial value by 100, i.e. shift the decimal point two positions to the right.

The POS=(begin,end) option is used to consider only a part of the variable as a GVAR. (Note that SLIDE, if present, is applied before POS=). Consider the following example:

    1112
    2212     Each column of this variable corresponds
    1211     to a specific group, but also the overall
    1212     information makes sense.
    2121

The POS option is useful to extract a column from a variable and store it as a GVAR. Columns are counted from right to left. POS=(1,2) means to consider the first and second column, i.e. in our example the result will be 12,12,11,12,21 (the two rightmost columns). If the endpos is omitted, begpos is assumed, i.e. a single column is used.
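
The arithmetic behind SLIDE and POS can be sketched in Python using the example values above. The function names are invented, and treating a negative SLIDE value as a division is an assumption; the text only says that the point can be moved in either direction.

```python
def slide(value, decimals):
    """Multiply by 10**decimals (a negative value divides; assumed convention)."""
    return value * 10 ** decimals

def pos(value, begpos, endpos=None):
    """Extract digit columns begpos..endpos, counted from the right (1-based)."""
    if endpos is None:
        endpos = begpos                     # single column
    n = int(value)
    return (n // 10 ** (begpos - 1)) % 10 ** (endpos - begpos + 1)

values = [1112, 2212, 1211, 1212, 2121]
two_rightmost = [pos(v, 1, 2) for v in values]
leftmost = [pos(v, 4) for v in values]
print(two_rightmost, leftmost)
```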

SWAP

GVAR v SWAP swaps (exchanges) the current GVAR and the variable v. Note that v should be a variable containing group memberships. When using SWAP the variable label will be lost, as well as group names if they are defined.

NAMES

GVAR NAMES is used to display and define group names (see the glossary for additional information on group names). By default the currently defined names are displayed. GVAR NAMES DEFINE will ask you to enter a name for each group in the current GVAR. If you do not supply a name the current name will be used. Note that there is an implementation-specific limit on how many groups may have names.

The GROUP=g# may be used to specify a name for a single group. In this case you may specify the name as "name" string.

DEFINE

defines groups according to <glist>. For each group in the list (the vlist is interpreted as group identifiers) the members are entered individually. Terminate a group by striking the return key and continue with the next group until the vlist is exhausted. This option can be used to define a new GVAR or to edit an existing one (all GVAR values not modified remain unchanged).

CASID

The CASID option creates a GVAR based on case identifiers. Identical casids are put into the same group. The N option specifies the number of casids to consider (length). N is required, if the WA is not rectangular. Refer to the CASID command for a large choice of pre-editing possibilities.

The POSITION=(first,last) option is used to check only character positions first through last, instead of the full four-character casid; e.g. CASID POS=(1,1) would create groups from the first letter of each casid. The CASID feature is very useful for creating a GVAR based on casids, e.g. when you wish to aggregate specific groups of cases.
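
Grouping by part of a case identifier can be sketched in Python as follows. The function name and the casids are invented, and numbering the groups in order of first appearance is an assumption; the text does not say how EDA numbers the resulting groups.

```python
def casid_groups(casids, first=1, last=4):
    """Put cases whose casid agrees in positions first..last into one group."""
    keys = [c[first - 1:last] for c in casids]
    numbering = {}
    for k in keys:
        # Groups are numbered in order of first appearance (assumption).
        numbering.setdefault(k, len(numbering) + 1)
    return [numbering[k] for k in keys]

casids = ["GE01", "GE02", "VD01", "ZH01", "VD02"]
by_prefix = casid_groups(casids, 1, 2)
by_full_id = casid_groups(casids)
print(by_prefix, by_full_id)
```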

SELECTION

Defines a new GVAR based on the current selection (required). Cases in the current selection define group 1 (name=Selected), cases not selected define group 2 (name='NonSelec').

Make sure to specify a variable on the variable list when using the MISSVAL selection command or when working with non-rectangular WAs, so that the GVAR definition is based on the right selection.

SET (*)

SET Length= sets the length of the GVAR, i.e. the number of elements the GVAR contains. This command is needed if you generate a GVAR using expressions (G[x] appearing as target of an expression). The name field defines the GVAR label.

MEMBERS

GVAR MEMBERS is the same as the MEMBERS command. All <options> of the MEMBERS command may be specified.

RENUMBER

Renumbers the current GVAR from 1, i.e. the group numbers will be consecutive integers starting from 1 to the number of currently defined groups. This command is especially useful when you remove groups with small N's or define the GVAR from a variable, where some group numbers are missing or are not regular.

Example: If the current GVAR defines groups 1 3 5 and 20, RENUMBER will modify the GVAR to define groups 1 2 3 and 4.

By default group memberships of 0 (i.e. no group membership) are not included in the RENUMBER operation, i.e. they are left unchanged. If the INCLUDEZERO option is present, 0 is included, i.e. it is renumbered to a group membership of 1; these cases will then belong to a group and will no longer be considered 'no group' members.
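
The renumbering rule, including the effect of INCLUDEZERO, can be sketched in Python. The function name is invented; the first four groups reproduce the example above (1, 3, 5 and 20 become 1, 2, 3 and 4).

```python
def renumber(gvar, include_zero=False):
    """Replace the group numbers in use by consecutive integers starting at 1."""
    present = sorted({g for g in gvar if include_zero or g != 0})
    new_number = {old: new for new, old in enumerate(present, start=1)}
    # Without include_zero, 0 is not in the table and stays 0 (no group).
    return [new_number.get(g, 0) for g in gvar]

gvar = [1, 3, 0, 5, 20, 3]
default_result = renumber(gvar)
with_zero = renumber(gvar, include_zero=True)
print(default_result, with_zero)
```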

CHECK

Checks the current GVAR for the presence of small groups and displays a diagnostic message. By default groups containing 3 members or less are considered "small groups". The MINFREQ=m option is used to indicate a different minimal frequency.

Compare to the RECODE option.

RECODE

The GVAR RECODE command has two functions (1) recode small groups and (2) recode a specific group. The default form recodes small groups (see also GVAR CHECK above): Groups containing 3 members or less are recoded, i.e. combined into a new group. By default they are recoded as 0, i.e. no group. The INTO=code option may be used to specify a different target group number. The MINFREQ=minfreq option lets you specify a different small group limit (default 3).

GVAR RECODE GROUP=g# recodes g# into 0 (no group) or, if the INTO=g# option is present into a different group number.
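
Both forms of RECODE can be sketched in Python. The function names and the data are invented; the defaults (groups with 3 members or fewer, recoded into 0) follow the description above.

```python
from collections import Counter

def recode_small(gvar, minfreq=3, into=0):
    """Recode every group with minfreq members or fewer into `into`."""
    sizes = Counter(g for g in gvar if g != 0)
    return [into if g != 0 and sizes[g] <= minfreq else g for g in gvar]

def recode_group(gvar, group, into=0):
    """Recode one specific group into `into`."""
    return [into if g == group else g for g in gvar]

gvar = [1, 1, 1, 1, 2, 2, 3, 0]
small_to_zero = recode_small(gvar)
singletons_to_9 = recode_small(gvar, minfreq=1, into=9)
merged = recode_group(gvar, group=2, into=1)
print(small_to_zero, singletons_to_9, merged)
```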

GINFO (*)

(This option is used in macro programming.) GINFO stores group ids (group numbers) and optionally the number of cases in each group (N option) into variables in the WA. The MINFREQ=min option may be used to exclude groups with too few cases. By default all groups are copied (minfreq defaults to 1).

The variables are copied into free locations in the WA. Their variable name will be 'GrpIds' for group numbers (always produced) and 'GrpFreqs' for the size of the groups (only produced when the N option is present). The descriptor will be the GVAR descriptor.

This facility helps you develop macros performing operations on groups individually. It is good practice to remove the variables created by this option before returning control to the user.

COMPARE_WITH

Compare two GVARs, i.e. compare the current GVAR and a grouping variable in the WA. This option produces a table similar to the MEMBERS/GVAR MEMBERS command; the groups shown are the groups found in the grouping variable (COMPAREWITH=gvar.var#) and with each case the group membership in the current GVAR is shown in parentheses.

Only the integer part of the gvar.var# variable is considered, the fractional part, if it exists, will be discarded.

GVAR and expressions

The GVAR may also be defined and used with expressions. (See there for more details).

   >LET GVAR=#var20
   >IF GVAR=3 & GVAR=4 INCLUDE
   >IF GVAR=3 THEN GVAR=2
   >LET G[1]=2

The first example defines a new GVAR from variable 'var20'. The second activates a selection in which groups 3 and 4 are included. The third puts all group 3 members into group 2; the last example modifies a single group membership. In fact GVAR is an alternative form of G[], i.e. a vector reference where the index has been omitted, meaning all values.

MEMBER

  MEMBER  [SHORT]
  MEMBERS GROUP=group#
  MEMBER  CASE=cas#

This command examines the current GVAR and displays each group with its members (cases). If no option is specified, all groups are displayed.

SHORT does not show the case ids, but only the number of cases in each group.

MEMBERS GROUP=g# displays a list of cases belonging to group g#.

MEMBERS CASE=c# displays the group membership of c#.

Interpretation: GANALYSIS, GSUMMARY

GANALYSIS

GSUMMARY

   GANALYSIS   | MEDIAN   | [DIVDEV=val] <options>
               | MEAN     |

   GSUMMARY   | [ MEDIAN ]  | <options>
              |   MEAN      |
              |   IFEN      |
              |   OFEN      |
              |   HINGES    |
              |   RANGE     |

  <options> [GVAR=var#]
            [NMIN=min_members]
            [KEEP_GROUP_0]
            [LONG {DIFFERENCES} {BOXPLOT}]

These two commands are used to analyze groups of cases. Therefore they need a GVAR stored with the WA or a variable containing group memberships (GVAR=var#) option. Note that with GVAR= only the integer part of the variable is used to determine group membership.

GANALYSIS represents each group center (MEDIAN or MEAN) graphically in terms of deviation from the center of the whole variable in units of a dispersion measure (midspread or standard deviation). By default half of the dispersion measure is used as unit; DIVIDE=value (default 2) is the factor by which the deviation measure is divided, i.e. with the default 2 a symbol is printed ("+" for positive, "-" for negative deviations) for each 1/2 midspread distance of the group median from the overall median. If the distance is more than 8 units, these symbols are preceded by a $ sign.
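
The coding rule for the MEDIAN form can be sketched in Python. Several details are assumptions made for the illustration and are not taken from EDA: the midspread is computed with the standard library's inclusive quartiles rather than with hinges, a symbol is printed for each complete unit, and at most 8 symbols follow the $ sign. The function names and data are invented.

```python
import statistics

def midspread(values):
    """Interquartile range (stand-in for the hinge-based midspread)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def ganalysis_code(group_values, all_values, divide=2.0, max_symbols=8):
    """Code the deviation of the group median from the overall median."""
    unit = midspread(all_values) / divide
    dev = statistics.median(group_values) - statistics.median(all_values)
    steps = int(abs(dev) / unit)            # complete units only (assumption)
    symbol = "+" if dev > 0 else "-"
    if steps > max_symbols:
        return "$" + symbol * max_symbols   # capped display (assumption)
    return symbol * steps

all_values = [1, 2, 3, 4, 5, 6, 7, 8, 50]
low = ganalysis_code([1, 2, 3], all_values)
middle = ganalysis_code([4, 5, 6], all_values)
outlier = ganalysis_code([50], all_values)
print(repr(low), repr(middle), repr(outlier))
```

A group whose median coincides with the overall median produces no symbol at all, while the group containing the outlier is flagged with the $ sign.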

GSUMMARY shows numerical summaries for each group (similar to the DISPLAY command).

If the display of more than five groups is requested, the program asks to specify which groups you want to analyze, unless the LONG format is used.

NOTE: GANAL and GSUMM turn selection off.

Other options

NMIN: The NMIN option (default 2) drops from the analysis groups which contain NMIN or fewer members.

KEEP_GROUP_0: Observations not belonging to any group have a group membership of 0, i.e. 0 means member of no group. By default group 0 is always excluded from the display. The KEEP_GROUP_0 option includes group 0 as a separate group.

LONG Format

With LONG, EDA uses a different display format: for each group and for each variable a separate line is used. Any number of groups can be shown this way. The LONG option produces the same display with GANAL and GSUMMARY. The LONG display always contains the numerical information; with MEDIAN and MEAN the coded form is always shown in addition.

LONG DIFFERENCES: By default LONG displays (e.g. with the default median/midspread summaries) a column containing group medians and a second column containing group midspreads. DIFFERENCES displays four summary columns: the first two contain the median and the midspread for the variable (overall median and midspread); the third and fourth columns contain the group differences (group median minus overall median; group midspread minus overall midspread).

LONG BOXPLOT: The default format shows, in addition to the numerical summaries, +/- coded group deviations from the overall center. BOXPLOT shows a one-line BOXPLOT instead.

DISPLAY BYGVAR

An alternative to GSUMMARY and GSUMMARY LONG is DISPLAY BYGVAR, a command offering additional statistics as well as a different display format, namely for each variable you will see statistics for each group [GSUMMARY LONG shows each variable for each group].

TRACES

TRACES ---> see basic exploratory commands