Transformations I and WA management

Introduction

The transformation commands are documented in two sections. The first section (this one) contains the transformation commands that share the basic EDA syntax, i.e. commands that modify variables in the WA or produce new variables from existing ones, as well as the commands used for managing the WA (housekeeping tasks).

The second section contains commands with a special syntax, using logical and algebraic expressions.

The distinction is only based on syntactical differences. The following commands are explained in this section:

    ADDVARS     sum of variables
    AGGREGATE   Aggregation
    CHECK       check the contents of the WA
    CODE        code continuous variable
    COMBINE     combine two variables
    COPY        copy variables
    COUNT       counting
    CTRANS      transpose variables
    CUTOUT      variable construction
    DELETE      deletes variables in the WA
    DICHOTOMIZE dichotomize variables
    DCOPY       destructive copy
    ERASE       erase WA
    GENERATE    generate artificial data
    KEEP        keep variables
    MAKE        make variables
    MATCH       match variables
    MAX         compute maximum across variables
    MIN         compute minimum across variables
    MOVE        transfer variables
    NORMALIZE   normalize variables
    PACK        pack WA
    REARRANGE   rearrange categorical variables
    PCTCHECK    Percent checking
    PERCENT     percent computation
    REPVAL      operations based on replacement (missing) values
    RENAME      rename variables
    REVERT      variable protection switch
    SWAP        swap variables
    STANDARDIZE Standardize variables
    TRANSPOSE   transpose work area
    WEIGHT      weight variable (biweight)
Note that a number of management tasks can be found as options of commands described previously, namely
   C1           Manage the C1 configuration
   C2           Manage the C2 configuration
   CASID        Casid oriented tasks
   CENTER       Manage the center/reference value
   GVAR         Manage the GVAR
   MATRIX       Manage the MATRIX area

Alphabetical list of commands



ADDVARS
 ADDVARS  vlist  [VAR={var#}] [NOLABEL] [DELETE]
Adds the variables in vlist and puts the result into the next free location or into var# if VAR=var# is present. This command simplifies the addition of several variables, a task which is often cumbersome with the LET command.

Normally the command will ask you to enter a label and a descriptor for the newly created variable. NOLABEL causes default labels to be generated.

DELETE removes the variables in the vlist after the new variable has been computed. Beware: when using VAR=var#, var# should not appear in the vlist, because DELETE takes place after everything else has been done, i.e. it would also remove the newly created variable.

The resulting variable has a table tie identical to that of the first variable in the vlist.

AGGREGATE


  AGGREGATE [vlist] <on> <stat> [NEW {MODIFYDOC}]
                                [NAGDROP] [FREQMIN=n]
                                [KEEPCASID] [LEAVEDOC]
                                [COUNTS] [NOGVARDROP]
                                ["check"]

<on> | [GVAR] [GROUP=grpno] | VAR=aggvar# [SLIDE=decimals] [GROUP=grpno] | LIMIT=(begincas#,endcas#)

<stat> | [MEDIAN] | MEAN | MIN | MAX | SUM | CENTILE=percentile | BIWEIGHT [CUT=val] [P=epsi]

Aggregates the whole WA (which must be rectangular) or the variables in the vlist, depending upon the setting of the ALLVARS mode. The default is to use the GVAR as criterion variable (defining the groups) and to compute the median for each group. The variables in the vlist (or the whole WA if ALLVARS is ON and no vlist is present) are replaced by the aggregation results, i.e. the groups become the new cases.

Alternative aggregation statistics are MEAN, MIN, MAX, SUM, CENTILE and BIWEIGHT. The options for BIWEIGHT have the same meaning as on DISP BIWEIGHT, except that the E parameter is called P.

The groups are defined by the GVAR if no other <on> specification is present (a GVAR is then required). The VAR=aggvar# option specifies a variable in the WA as the criterion variable. Only the integer part of that variable is used to define the groups. You may use the SLIDE= option to manipulate the variable first: SLIDE=2 shifts the digits two positions to the left (i.e. the value is multiplied by 100). Negative values shift to the right.
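
The effect of SLIDE= on the grouping can be sketched in a few lines of Python. This is only an illustration of the rule stated above (multiply by a power of ten, then keep the integer part); the function name group_key is hypothetical and not part of EDA.

```python
import math

def group_key(value, slide=0):
    """Return the group number derived from a criterion value.

    Assumption: SLIDE=d multiplies the value by 10**d before the
    integer part is taken; negative d divides instead.
    """
    return math.trunc(value * (10 ** slide))

# Without SLIDE only the integer part counts, so 1.25 and 1.75 share a group.
print(group_key(1.25), group_key(1.75))
# With SLIDE=2 the two decimals become part of the group number.
print(group_key(1.25, slide=2), group_key(1.75, slide=2))
```

With SLIDE=2 the values 1.25 and 1.75 end up in different groups, whereas without SLIDE they fall into the same one.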

Normally all groups are aggregated, except when the GROUP= option is used. Then only the group specified by grpno is aggregated; the other cases are left unchanged.

There is a third way of specifying the cases to aggregate: LIMIT=(begincas#,endcas#) specifies a range of cases, where begincas# and endcas# are valid case specifications. For example, LIMIT=(10,20) aggregates cases 10 through 20 to form a single group.

Note: The GVAR is produced by a number of different commands. The GVAR command may be used to define a GVAR from variables or case identifiers.

NEW: The default is to replace the variables by their aggregated form. The NEW option creates a new variable for each variable aggregated and stores it in a free location in the WA. The new variables have the same labels and descriptors as the original variables, but prefixed by A_. The descriptor also contains the aggregate statistic used. (With NEW the LEAVEDOC option is turned on implicitly, i.e. casids and GVARs are not modified, unless you ask for it explicitly by adding the MODIFYDOC option.)

NAGDROP: This option deals with non-aggregated cases, i.e. cases for which no group membership is defined (the GVAR is zero). By default these cases are left untouched. NAGDROP drops them from the aggregated variables.

FREQMIN=n: In some cases it is useful to retain only aggregates based on a sufficient number of cases. FREQMIN=n drops aggregates based on fewer than n cases from the aggregated variable. Note that this option only affects newly aggregated cases; cases left untouched by the aggregation process are not examined.

COUNTS creates a variable in the WA containing the number of cases in each group.

AGGREGATE generates new case identifiers for the aggregated cases. These casids are Gnnn, where nnn is the group number. They can be changed with the editor or the CASID command.

The KEEPCASID option inhibits generation of the new casids and instead keeps the casid of the first (old) case of each group as casid in the aggregated WA. This feature is useful when aggregating groups formed by identical case identifiers (see GVAR casid).

AGGREGATE also alters the column name, prefixing the current name with the string ag_, and adjusts the GVAR. (If the aggregation criterion is the GVAR and all groups are aggregated, the GVAR is turned off.)

LEAVEDOC inhibits the creation of new casids and the alterations of any descriptive information.

NOGVARDROP: When AGGREGATE creates new cases by aggregation, each case ends up in a separate group, unless you do a partial aggregation with some of the command options. Normally, when all cases are in different groups, the (old) GVAR is dropped (but not in the other cases). In some circumstances, however, you still need that GVAR (e.g. for sorting the cases into a new sequence); NOGVARDROP then inhibits the deletion of the GVAR and adjusts it to the new situation.

"check" (useful when designing macros for automatic aggregation etc.) compares the "check" string against the GVAR label or the variable descriptor (not the label!) of the grouping variable. If "check" does not match that name, AGGREGATE will not go on. Only the actual length of "check" is used for the comparison (the GVAR label might be much longer), e.g. "Swiss d" will match a label like "Swiss districts (1970)".

Note: The selection mechanism is turned off by the AGGREGATE command.
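
The default aggregation (median per group) together with NAGDROP and FREQMIN can be illustrated with a small Python sketch. It is not EDA's implementation: the function name aggregate, the order of the output, the unpadded G<number> casids and the case<i> identifiers for untouched cases are all assumptions made for the example.

```python
from statistics import median

def aggregate(values, gvar, nagdrop=False, freqmin=1):
    """Aggregate one variable by group, using the median.

    values  : one number per case
    gvar    : one group number per case; 0 means "no group membership"
    nagdrop : drop the non-aggregated cases (gvar == 0)
    freqmin : drop aggregates based on fewer than this many cases
    Returns a list of (casid, value) pairs.
    """
    groups = {}
    for v, g in zip(values, gvar):
        if g != 0:
            groups.setdefault(g, []).append(v)

    result = []
    # Each retained group becomes one new case with a G<number> casid.
    for g in sorted(groups):
        members = groups[g]
        if len(members) >= freqmin:
            result.append((f"G{g}", median(members)))

    # Cases without group membership stay untouched unless NAGDROP is given.
    if not nagdrop:
        for i, (v, g) in enumerate(zip(values, gvar), start=1):
            if g == 0:
                result.append((f"case{i}", v))
    return result

values = [1, 2, 3, 10, 20, 99]
gvar = [1, 1, 1, 2, 2, 0]
print(aggregate(values, gvar))
print(aggregate(values, gvar, nagdrop=True, freqmin=3))
```

The second call keeps only group 1, because group 2 has fewer than three cases and the ungrouped case is dropped.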

CHECK


CHECK [REPORT] | [INCOMPL_DOC] [CORRECT]
               | LABELS [CORRECT]
               | CASIDS [CORRECT]
               | DATA [CORRECT] [NOTEQUAL=val] [RANGE=(low,high)]

CHECK vlist [REPORT] | CODES{=var#}
                     | VARIANCE [M=min_variance] [ALL | CASES] [CLEANUP]
                     | NMIN [N=nmin] [ALL | CASES] [CLEANUP]

CHECK v1 v2 [REPORT] DIFFERENCES [SMALL=val] [NOLIST | MAXLIST=mdif]

<option> [CORRECT] [REPORT]

CHECK performs a series of checking/validation functions facilitating data preparation and housekeeping. The main purpose of this command is to check the integrity of the WA, i.e. it makes sure that some rules regarding data and names are observed and variables properly documented.

REPORT is an option common to all command forms. By default diagnostics are only shown on the screen, i.e. do not appear in the print file. REPORT (requires an active PF) produces a REPORT to the print file AND to the screen.

A number of command forms share a common option:

CORRECT: produces not only a diagnostic message, but asks for corrections.

INCOMPLETE_DOCUMENTATION
INCOMPLETE_DOC (the default option) checks whether all variables in the WA are properly documented, i.e. whether the descriptor and label reflect the current state of the variables. Documentation is considered incomplete if a variable carries a modification stamp, i.e. the variable has been modified but the user did not change the documentation (labels, descriptors) accordingly. CHECK without options reports these variables. Note that the modification stamp only hints at possible documentation problems. In many cases labels and descriptors are sufficiently explicit and/or the transformation was straightforward (e.g. subtracting 50 from a variable does not really change the nature of the variable, but averaging three variables surely does). If you wish to remove the modification stamp and keep the label and the rest of the descriptor, just clear it.

With CORRECT the user is asked to supply a replacement descriptor for each incompletely documented variable. If the user enters a blank string, the old descriptor is not changed, except that the modification stamp is cleared. If you wish to correct the current descriptor you may enter a :. The ':' invokes the EDA command line editor to correct the current descriptor.

CHECK labels and descriptors
The LABELS and CASIDS options check the uniqueness of labels and casids, i.e. items that in some instances must be unique to assure correct operation. If two or more labels or casids with the same spelling are found, the program reports the duplicates, as well as blank casids and labels. With the CORRECT option you will be asked to enter a replacement for each casid or label found incorrect.
CHECK DATA
The DATA option checks the variables specified by the <vlist> to make sure that the data values are within a specified range or do not contain some specific values. The default operation is to scan the data for occurrences of the system replacement (missing) value (default -1); if one is found, the case is identified by its name and the corresponding variable. Other values may be specified using the NOTEQUAL=value option. Note that NOTEQUAL is sensitive to the setting of the fuzz value.

Alternatively a range may be specified. If RANGE is specified without a value, the current setting of the data range (see SET RANGE) is used (the defaults are a very small and a very large value, respectively). RANGE=(low,high) specifies an arbitrary range.

CORRECT asks for a replacement for each value in error.

CHECK CODES
CHECK CODES is used to determine how many different values (codes) can be found in a series of variables. This command is useful for categorical variables and is therefore intended for integer variables (fractional parts are truncated). The result is stored in a new variable and is suitable as input to the FREQ CODES command, i.e. the command that produces frequency tables on the basis of a predefined set of codes (values). CHECK CODES thus lets you determine all the codes found in a series of variables and build a series of frequency tables with exactly the same structure, even if not all codes are present for all variables in the set.
CHECK VARIANCE
CHECK VARIANCE checks the variance of variables and cases and issues diagnostic messages. Default action is to check only variables. ALL does the same on variables and cases, and CASES on cases only.

Insufficient variance is diagnosed if the difference between the minimum and maximum value is smaller than a small system defined value (value to be considered zero, typically 0.001). The MIN=min_variance option may be used to specify some different value for minimal variance.
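
The range criterion described above can be sketched as follows. This is an illustrative Python fragment, not EDA code: the work area is represented as a plain dictionary, and the names low_variance and min_range are hypothetical.

```python
def low_variance(wa, min_range=0.001):
    """Return the names of variables whose max-min range is below min_range.

    wa maps a variable name to its list of values; min_range plays the
    role of the "value to be considered zero" (or of MIN=min_variance).
    """
    return [name for name, vals in wa.items()
            if vals and max(vals) - min(vals) < min_range]

wa = {
    "const": [1.0, 1.0, 1.0],      # no variation at all
    "ok": [1.0, 2.0, 3.0],         # clearly varying
    "nearly": [5.0, 5.0005],       # varies, but by less than 0.001
}
print(low_variance(wa))
print(low_variance(wa, min_range=0.0001))
```

Lowering the threshold to 0.0001 leaves only the truly constant variable flagged.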

By default only diagnostic messages are issued for each variable/case (the detailed messages are suppressed with SET MESSAGES OFF). If CLEANUP is specified, the corresponding variables and/or cases are removed from the WA.

Note that if a selection is active, the criteria are based only on the cases in the analysis.

CHECK NMIN
This command is similar to CHECK VARIANCE, except that it checks whether there is a minimal number of non-missing observations (missing is the default RepValue, or MISS=val if present) in a variable or across an observation (CASES).

By default CHECK diagnoses variables or observations with more than half of the observations missing; NMIN=n lets you define the minimum n required.

In all other respects, this option is the same as CHECK VARIANCE; see there for an explanation of additional options.

CHECK DIFFERENCES
Checks the differences between two variables (numerical differences only). The following checks are performed and reported: first the length is checked; if both variables have the same number of cases, EDA checks whether the variables are identical, i.e. whether ABS(var1-var2).lt.small for every case.

By default small is the EDA constant atom, a reasonably small value (often 0.01), i.e. not the smallest value the computer system can represent. You may set small to any value using the SMALL= option.

CHECK DIFFERENCES normally lists the first 20 different cases. If you do not want to see the list use the NOLIST option. If you want to see more or less than 20 different cases use the MAXLIST= option.
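
The comparison logic can be summarised in a short Python sketch. It follows the description above (length check first, then a case-by-case test against small, with a limited listing); the function name and the shape of the returned dictionary are assumptions made for the illustration.

```python
def check_differences(v1, v2, small=0.01, maxlist=20):
    """Compare two variables case by case.

    Two values count as identical when abs(a - b) < small. At most
    maxlist differing cases are listed (case numbers start at 1).
    """
    if len(v1) != len(v2):
        return {"same_length": False, "identical": False,
                "n_different": None, "listed": []}
    diffs = [(i, a, b)
             for i, (a, b) in enumerate(zip(v1, v2), start=1)
             if not abs(a - b) < small]
    return {"same_length": True, "identical": not diffs,
            "n_different": len(diffs), "listed": diffs[:maxlist]}

report = check_differences([1.0, 2.0, 3.0], [1.0, 2.5, 3.001])
print(report["n_different"], report["listed"])
```

Here only case 2 is reported; the difference of 0.001 in case 3 is below the default threshold.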

CODE


 CODE v <code> <opt>
 CODE v <code> <opt> [PLAY {PRINT}]

<opt> [GVAR | VAR=var# | REPLACE | NOCOPY] [NOREPORT] [SAVE_CUTPOINTS{=var#}]

<code> | [FRACTILE] [NBINS=nint] | EXACT [NBINS=nint] | INTERVAL [NBINS=nint] | READ | FROM=var# | CUT=(val1,..) | GAPS [NBINS=nint]

The CODE command is used to break a continuous variable v into categories by dividing it up into several bins (intervals) and assigning category numbers 1,2,.. to each bin.

The command can be used in two different modes. A first mode (default) creates a categorical variable as directed by the options. It displays a summary report for each category, containing the category number, the number of cases in that category, a simplified (one-line) boxplot for that group, as well as the MAD (Median absolute deviation). If the number of cases in a group is too small, the boxplot is not shown. A message is issued instead.

A second mode, called PLAY mode, shows the same information but lets you examine alternative coding schemes before creating the categorical variable.

Coding schemes available
The default (FRACTILE) is to divide the input variable into intervals containing approximately the same number of observations; by default three categories are created, i.e. the bins correspond to the thirds of the distribution. NBINS=nint is used to produce a different number of categories, e.g. NBINS=4 defines four categories.

The EXACT option uses exact fractiles; the default only approximates fractiles, i.e. observations with the same value are always assigned to the same category.

The INTERVAL option divides the variable into intervals of equal width. Again the default is three (the variable range is divided into three intervals) and NBINS=nint may be used to produce a different number of intervals.
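
As an illustration of the equal-width scheme, the cutpoints for INTERVAL can be computed as shown in the following Python sketch. The function name interval_cutpoints is hypothetical, and it is assumed that the range is taken from the observed minimum and maximum.

```python
def interval_cutpoints(values, nbins=3):
    """Return the nbins-1 cutpoints that split the range into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    return [lo + k * width for k in range(1, nbins)]

data = [0, 3, 9, 12]
print(interval_cutpoints(data))            # default: three intervals
print(interval_cutpoints(data, nbins=4))   # four intervals
```

Three intervals need two cutpoints, four intervals need three, and so on.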

GAPS uses the gapping method described with the DIAGNOSTIC GAP command. The WEIGHT option uses weighted gaps. Again NBINS= is used to specify a different number of categories.

The READ and CUT= options are used to enter arbitrary cutting points in order to define the intervals. READ will ask you to enter the interval boundaries, whereas CUT=(val1,...) does the same from the command line. (Remember that the number of values in the CUT=(val1,...) specification is limited by an implementation constant, MAXC; see STAT LIMITS. It is typically set to 8.)

FROM=var# gets the cutpoints from variable var#.

Control options
The result of the CODE command is, by default, stored in a free location in the WA; if you specify REPLACE the coded variable replaces, i.e. overwrites, the current variable. A new label is created (C_ is prepended to the original label) and the descriptor contains the coding scheme used. GVAR may be used to store the result into the GVAR instead. VAR{=target} allows for a different target variable. NOCOPY may be used to suppress storing the result, i.e. the command is only used to display the report, without creating a variable.

NOREPORT inhibits the display of the default report, i.e. a table containing for each category its count and the interval boundaries, as well as a simple boxplot for each group defined (except if the count is too small).

Operational considerations: The boundary limits are always included in the upper interval and never in the lower, i.e. CODE 1 CUT=20.5 will create two intervals; any value of 20.5 will be in the second interval. CODE produces additional diagnostics (may be suppressed with SET MESSAGES OFF): all categories with 0 counts and all categories with counts less than 3 (but not zero).
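
The boundary rule (a value equal to a cutpoint belongs to the upper interval) corresponds to a right-bisection search. The Python sketch below illustrates the rule only; the function name code is chosen for the example and nothing here is taken from EDA's source.

```python
from bisect import bisect_right

def code(values, cutpoints):
    """Assign category numbers 1, 2, ... according to sorted cutpoints.

    bisect_right counts the cutpoints that are <= the value, so a value
    equal to a cutpoint lands in the upper interval; adding 1 makes the
    category numbers start at 1.
    """
    cuts = sorted(cutpoints)
    return [bisect_right(cuts, v) + 1 for v in values]

# One cutpoint at 20.5 gives two categories; 20.5 itself is in category 2.
print(code([10, 20.5, 30], [20.5]))
```

The same function works for any number of cutpoints, for instance those obtained from READ, CUT= or FROM=.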

SAVECUTPOINTS
SAVECUTPOINTS stores the cutpoints into a variable in the WA; CP_ will be prepended to the variable name. The variable may then be used, for instance, with the MKG() function.
PLAY mode
When you specify PLAY, the program will perform a first codification and display the result, but without storing the result into a variable or GVAR. EDA then enters play mode, i.e. you can use a number of options to display alternative codifications until you are satisfied; the result is only saved when you leave play mode with S or <return>.

In play mode you may enter a number of commands, dealing with codification. All <code> options are available as commands. In addition you may specify the following commands:

Play mode commands:
<code>              code options as shown above
S or <return>       leave play mode and store the result into the target specified on entry
Q                   quit play mode without storing the result
H or ?              help
Limit               show the current cutpoints
Limit=num           replace limit <num> by a new limit (EDA will ask)
Limit=num DELE      delete limit number <num>
ADD=val             add a new cutpoint
RENUM [START=i]     renumber codes, start with i
GROUP=id            show members of group/interval id
RESULT <o>          change target
                    <o>: NOCOPY | REPLACE | GVAR | VAR{=v#}

Usage notes: When using the LIMIT=num option EDA will ask for a new cutpoint to replace the cutpoint specified as num. EDA will display the current value and ask for a new value. Alternatively you may specify D to delete that cutpoint, i.e. the same as LIMIT=num DELETE.

RENUM: if you do not specify a START=i option, EDA will use the default numbers, i.e. starting at 1 and numbering consecutively.

RESULT: This command is used to change the destination specified on the initial command line. If no options are given on the RESULT command, the current destination is displayed. The options are the same as on the command line, i.e. NOCOPY (do not copy), GVAR (produce a GVAR), VARIABLE or VARIABLE=var#. Note that when using Q to leave PLAY mode nothing will be saved.

Printing and CODE PLAY: As CODE PLAY is intended as an experimental tool to hunt for the codification best suited for your purpose, nothing is copied to the print file unless you enter PLAY mode with the PRINT option; then all codification reports are copied to the print file (if a print file is active at that time). Note that in normal command mode CODE works like all other EDA commands, i.e. if a print file is active the report will be shown in the print file.

COMBINE


  COMBINE v1,v2[,v3]
Constructs one variable (v3, or v2 if v3 is not given) from two others, so that the first n1 cases come from v1 and the following n2 cases from v2. If a selection is active, only selected cases are combined into a new variable containing only the selected values (be aware that the casids are not modified).

NOTE: This command is now obsolete; the CC (concatenate) function used within expressions is much more powerful than the COMBINE command.

COPY conditional copy ->EDIT




COPY
   COPY v1[,v2]
Copies variable v1 to v2 with all its attributes. If v2 is not specified, v1 is copied into the next free location in the WA.
Related commands
The MOVE command performs a destructive copy. The LET command can also be used to copy variables, e.g. LET #1=#2 copies variable 2 into variable 1.

COUNT

    COUNT vlist <crit>  [VAR=targetv#]
                 [NOLABEL] [NOINTEGER] [DELETEVARS]

<crit> | EQUAL=val [FUZZ=f] | NOTEQ=val [FUZZ=f] | GREATER_THAN=val | LESS_THAN=val | IN=(low,high) | DIFF_VALUES [NONZERO | POSITIVE] [NOINTEGER FUZZ=f]

Counts the occurrence of a value corresponding to the criterion specified for each case of the variables in the vlist and puts the result into the target variable.

With the exception of DIFF_VALUES the meaning of the criteria is the same as on the SHOW command. The fuzz value for testing equality or inequality is, by default, the global fuzz value as set by the SET FUZZ command and other fuzz-related commands.

DIFF_VALUES is used to count the number of different values across the variables of a case. By default all different values are counted; NONZERO does not include zero in the count, whereas POSITIVE only counts positive values. This option is especially useful with categorical (integer) variables. (See also the FREQ command for similar operations on variables, or on cases when you transpose the work area.) Please note that in NOINTEGER mode the fuzz value is always used.

VAR=target specifies the target variable where the result will be stored. If no VAR= is present the next free location in the WA will be used.

Only the integer part of the variables is used for comparison, i.e. fractional parts are not considered in counting. The criterion section, however, accepts real values. If you want the comparison to be based on decimal numbers, use the NOINTEGER option.
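
The difference between the default integer comparison and NOINTEGER can be illustrated for the EQUAL= criterion with a small Python sketch. The function name count_equal is hypothetical, and the way the fuzz value enters the comparison (absolute difference not larger than fuzz) is an assumption.

```python
import math

def count_equal(variables, val, nointeger=False, fuzz=0.0):
    """For each case, count the variables whose value equals val.

    variables : list of variables, each a list with one value per case
    nointeger : compare the full value instead of its integer part
    fuzz      : assumed tolerance, abs(x - val) <= fuzz counts as equal
    """
    counts = []
    for case in zip(*variables):
        n = 0
        for x in case:
            y = x if nointeger else math.trunc(x)
            if abs(y - val) <= fuzz:
                n += 1
        counts.append(n)
    return counts

v1 = [1.0, 2.7, 3.0]
v2 = [1.9, 2.0, 1.2]
v3 = [1.5, 5.0, 1.0]
print(count_equal([v1, v2, v3], 1))                  # integer parts only
print(count_equal([v1, v2, v3], 1, nointeger=True))  # exact values
```

In integer mode 1.9 and 1.5 both count as 1, so the first case scores three matches; with NOINTEGER only the exact value 1.0 matches.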

NOLABEL inhibits query for the label and descriptor for the newly created variable; this will cause a default label 'Count' to be generated with a descriptor containing the command line.

DELETEVARS: the variables in the vlist are deleted from the WA after the count has been done. Note that this causes a DELETE command to be executed, i.e. its default settings (security) apply here.

CTRANS


  CTRANS v1 CASE=c# | [START=n1] [END=n2] |
                    | TABLE=#list         |
Case transpose: creates a new variable v1 by taking the case specified by CASE=c# across all variables (or only from variable n1 to n2, as specified by START= and END=). This transposition is useful in analyses where the data matrix is analyzed in both directions (lines and columns). The TABLE option selects only variables which belong to list #list. If no limits are given, all non-empty variables are transposed.

CUTOUT
  CUTOUT | v1[,v2] | [START=c#1] [END=c#2]
         | CASES   |
This command constructs a new variable (v2 if given, otherwise v1) from the c#1-th through the c#2-th case of variable v1. c#1 defaults to 1, c#2 to the number of cases. If a selection is active, the resulting variable has an N corresponding to the selected cases. Without options the command is a simple variable copy. Compare with COPY/DCOPY.

CASES rearranges the casids so that the new casids run from the c#1-th through the c#2-th of the old ones. Note that casids apply to the whole work area, so this feature should be used very carefully.

NOTE: This command is now somewhat obsolete; its functions are easily performed with the SUB() function used within expressions.

DCOPY


   DCOPY v1[,v2]
DCOPY is a synonym for MOVE. See there for details.

DELETE
   DELETE vlist [QUERY]
   DELETE  [vlist] [QUERY] [FORCE]

DELETE CASES[=(caslist)]

DELETE GROUP=(grouplist)

See also the KEEP command.
Delete variables
Deletes the variables specified from the WA. To avoid accidents the command behaves differently from other EDA commands: a vlist must be present on the command line. If no variable list is present, an error message is issued and no variable is deleted. To delete the variables of the current (implied) variable list, use the FORCE option.

The QUERY option asks for confirmation; "OK to delete variable xxxxx [Y]?:": if the answer starts with y or Y the variable is deleted, otherwise the variable remains in the WA.

DELETE reports the number of variables deleted. If variables in the list do not exist, the fact is reported and the other variables deleted.

The ZERO option causes all deleted areas to be zeroed. This is useful if you wish to make some special use of these locations using EDIT or a case reference.

The DELETE command is sensitive to the setting of the security switch. If SECURITY is on the QUERY option is automatically enabled. Then the NOQUERY option may be used to suppress this automatic query.

Delete cases
The second command format is used to remove cases from the WA. This command works on rectangular and non-rectangular WAs.

The case identifiers (CASIDS) are automatically modified. For more sophisticated manipulations see the EDIT command and its DCAS command, where you have options to perform variable editing with and without casid modification.

CASES[=(caslist)] specifies what cases to remove. The CASES=(<caslist>) form of the command is either a single case reference (then the parenthesis may be omitted) or a list of several case references. The maximum number of cases permitted is usually 8 (system implementation dependent). If CASES without a list is used the user is asked to enter the casids for the cases (s)he wishes to remove.

Delete groups
DELETE GROUP=group# removes the groups of cases specified from the work area. This option only works, if a GVAR is stored.

Note that a warning is issued, if the WA is not rectangular.

Many other facilities are available for removing cases from the WA; basically (besides the facilities offered by EDIT) the command sequence is as follows:

       <selection command>
        SELECT
The selection commands select cases temporarily for analysis; the SELECT command applies the selection to the WA, i.e. removes the cases not in the analysis from the WA, whereas turning the selection off by some other means re-establishes the condition before the selection. In fact a case selection does not alter the WA; it only modifies the retrieval of observations for analysis.

Here are some examples, performing common tasks:

         Remove a group of cases

ANALYZE GROUP=1      or      IF GRP(1) INCLUDE
SELECT                       SELECT

         Removing cases according to some value on a variable

IF #VARIAB>55.2 EXCLUDE
SELECT

Besides the commands mentioned, SETLIMIT, INCLUDE and a special form of the SHOW command are also selection commands.

DELMAC --> macros


DICHOTOMIZE
   DICHOTOMIZE vlist [MAXCAT=limit] [DELETE]
Creates as many dummy (binary) variables as a variable contains different values; only the integer part of the variable is used. Note that the source variable is usually a categorical variable.

The newly created variables are copied into free locations in the WA. The suffix _d is added to the variable name; the descriptor contains the name of the source variable and the code (value) for which the dummy variable stands; the remaining part contains the descriptor of the source variable (possibly truncated).

If you wish to create a binary variable from a quantitative variable (specifying a cut point) you will have to use the IF or LET command. Logical expressions with LET are an easy way, e.g. LET #DUMMY=#20>50: for each value in #20 that is not larger than 50 (expression false) a 0 appears in the new variable, and a 1 for all values larger than 50.

MAXCAT=limit lets you limit the number of binary variables generated from a single variable. Without such a limit as many variables as needed (or as WA space allows) would be generated, which in some circumstances is not wanted. MAXCAT defaults to 10, i.e. if no MAXCAT option is present up to 10 dummies are generated from each variable; if more would be needed, a message is given and the current variable is not processed; EDA then skips to the next variable if the vlist is not exhausted. You may also set MAXCAT=0; then for each variable in the vlist EDA will stop and tell you how many dummies would be generated; you are then asked whether you wish to generate that many dummies or skip to the next variable.

The DELETE option destroys the source variable(s). This is equivalent to a DICHOTOMIZE command followed by a DELETE command. In fact DELETE is called, so the options of the DELETE command also apply here.
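
The dummy coding and the MAXCAT limit can be sketched in Python as follows. This is an illustration under simplifying assumptions: the function name dichotomize is hypothetical, the interactive MAXCAT=0 behaviour is not modelled, and a skipped variable is signalled by returning None.

```python
import math

def dichotomize(values, maxcat=10):
    """Create one 0/1 dummy per distinct integer code in values.

    Returns a dict mapping each code to its dummy variable, or None when
    more than maxcat dummies would be needed (the variable is skipped).
    """
    codes = sorted({math.trunc(v) for v in values})
    if len(codes) > maxcat:
        return None
    return {c: [1 if math.trunc(v) == c else 0 for v in values]
            for c in codes}

# Only the integer part is used, so 3.7 is treated as code 3.
print(dichotomize([1, 2, 2, 3.7]))
# Three codes exceed a limit of two, so nothing is generated.
print(dichotomize([1, 2, 3], maxcat=2))
```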



ERASE


   ERASE [ZERO]
Erases (empties) the whole WA. After this command data must be read into the WA to continue analysis.

The ZERO option fills the whole WA with zeros. Without this the variables are only logically deleted, i.e. the data locations are left untouched.

If the SECURITY switch is on, the user is asked to confirm the operation. Note also that you may not erase a WA with the WAPROTECTION switch on.

GENERATE


  GENERATE [vlist]  <option> [Ncas=number-of-cases]
Generates artificial data as guided by <option>. The number of cases (NCAS= option) is only required if the WA is empty or not rectangular. If the WA is rectangular, GENERATE produces a variable with the same length as all other variables in the WA. You may, however, specify a different length with the NCAS= option, thus creating a non-rectangular work area. If the WA is not rectangular and no NCAS= option is present, you will be asked to enter it.

If no <vlist> is present, GENERATE creates a single new variable and stores it in the next free location in the WA. If a <vlist> is present on the command line, EDA will generate as many variables as specified and overwrite existing variables (if such variables are included in the list).

GENERATE turns the current selection off.

<option> can take the following formats:

   1.       RANDOM [NORMAL {SD=val} {MEAN=val}]
   2.       POISSON [PAR=value]
   3.       BOXMULLER
   4.       STROB [EPSI=value] [SIGMA=value]
   5. v1,v2 RCOR [MX=meanx] [LY=meany] [SX=sdx] [TY=sdy]
            [CORR=correl]
   6.       ORDER
   7.       BINOM TRIALS=NT PROB=probability
   8.       SAMPLE [POP=m]

default GENERATE vlist [START=val] [INCREMENT=val]

The first format generates a random variable: with no options (except possibly N=) it generates pseudo-random numbers in the interval 0-1. NORMAL generates normally distributed random numbers with default mean 0 and standard deviation 1; these two defaults can be altered using the SD= and/or MEAN= options.

POISSON generates random numbers following a Poisson distribution with the parameter given by PAR=.

BOXMULLER generates random numbers using an algorithm described by Box and Muller [G.E.P. Box and M.E. Muller: A note on the generation of random normal deviates, Ann. Math. Statist. Vol. 29, 1958, pp. 610-11]. The result is a normal (0,1) random number.

STROB draws random numbers in a perturbed gaussian distribution. Epsi (default value 0.5) and sigma (default value 1.0) are the control options: random numbers are drawn either with probability 1-epsi in a normal distribution (0,1) or with probability epsi in a normal distribution (0,sigma). This function might be very useful for the evaluation of the robustness of statistical measures.
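
The mixture described for STROB can be sketched in Python. This is an illustration of the stated rule only, using Python's own random number generator rather than EDA's; the function name strob and the seed parameter are assumptions.

```python
import random

def strob(n, epsi=0.5, sigma=1.0, seed=None):
    """Draw n numbers from a perturbed (contaminated) Gaussian.

    With probability 1 - epsi a value comes from a normal (0, 1)
    distribution, with probability epsi from a normal (0, sigma).
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma if rng.random() < epsi else 1.0)
            for _ in range(n)]

# 10% of the values come from a much wider distribution (sd = 5).
sample = strob(1000, epsi=0.1, sigma=5.0, seed=42)
print(len(sample), max(abs(x) for x in sample))
```

Samples of this kind contain occasional extreme values, which is what makes them useful for testing how robust a statistic is.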

RCOR generates a pair of random variables with correlation CORR= (default 0.5). M and N (default 0.0) are the means of the first and the second variable; S and T (default 1.0) are the standard deviations of the first and the second variable, respectively.
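
One common way to obtain such a correlated pair is to mix two independent standard normal deviates. The Python sketch below shows this construction; it is not necessarily the algorithm used by EDA, and all function and parameter names are chosen for the example.

```python
import math
import random

def rcor(n, corr=0.5, mean_x=0.0, mean_y=0.0, sd_x=1.0, sd_y=1.0, seed=None):
    """Generate two normal variables with the requested correlation.

    y is built as corr*z1 + sqrt(1 - corr**2)*z2, where z1 and z2 are
    independent standard normal deviates and x is based on z1 alone.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(mean_x + sd_x * z1)
        ys.append(mean_y + sd_y * (corr * z1 + math.sqrt(1.0 - corr * corr) * z2))
    return xs, ys

def pearson(xs, ys):
    """Sample Pearson correlation, used here only to check the result."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs, ys = rcor(5000, corr=0.5, seed=7)
print(round(pearson(xs, ys), 2))
```

The sample correlation only approximates the requested value; the approximation improves with the number of cases.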

GENERATE BINOM generates a binomial (Bernoulli) variable, where TRIALS gives the number of trials and PROB the probability.

GENERATE SAMPLE draws a sample without replacement from a population of M, where M defaults to N (the number of cases); M may be specified with the POP option. Note that if POP=M is not specified, SAMPLE yields a permutation of the indices (as M is equal to N).
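
The relation between the sample and the permutation case can be illustrated with a short Python sketch. The function name generate_sample is hypothetical, and the sketch assumes that the population consists of the index values 1 to M.

```python
import random

def generate_sample(n_cases, pop=None, rng=None):
    """Draw n_cases distinct index values from 1..pop (pop defaults to n_cases)."""
    rng = rng if rng is not None else random.Random()
    m = n_cases if pop is None else pop
    if m < n_cases:
        raise ValueError("the population must hold at least n_cases values")
    # Sampling without replacement; with m == n_cases this is a permutation.
    return rng.sample(range(1, m + 1), n_cases)

permutation = generate_sample(5, rng=random.Random(0))
subset = generate_sample(5, pop=100, rng=random.Random(0))
```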

The last format (default) generates an "incremental" (index) variable: The option START (default 1) specifies the first (starting) value; INCREMENT= (default 1) specifies the increment.

Random numbers
Random numbers are based on a time-based default seed. By default each (random) series will be different. If you need to re-create the same series, you can use the SET SEED command to start with a known seed (see the SET SEED command for more details).
Variable labels and descriptors
GENERATE will create labels and descriptors reflecting the origin of the variable. The second part of the variable descriptor will contain the command line used to generate the variable.

See also the section on expressions for alternative forms of variable generation, e.g. the IDX or the random functions.

Credits
BOXMU, STROB, RCOR, POISSON from Lebart, Fenelon 1979

The random number generator used within EDA is based on algorithm AS 183, Applied Statistics, 1982, vol. 31, no. 2.

The random SEED is time based.

KEEP


     KEEP vlist [QUERY]
     KEEP [vlist] FORCE [QUERY]

KEEP CASES[=(caselist)]

KEEP GROUP=(grouplist)

Compare to the DELETE command.

KEEP and its options operate like the DELETE command, with the fundamental exception that the variables, cases or groups specified are kept in the work area and all others deleted. (See DELETE for a more complete explanation of the options).

KEEP variables
Keeps only the variables in the vlist; all other variables are deleted from the WA. A variable list must be specified, unless the FORCE option is used to accept the current (implied) variable list. QUERY will ask you to confirm the deletion of each variable the command is about to delete.
KEEP cases and groups
KEEP CASES/GROUPS are used to remove all cases or groups not specified from the work area. CASES can be used either with or without a case list; without one you will be asked to type in the cases you want to keep.

MAKE
MAKE vlist CONCATENATE[=newvar#] [TRANSPOSED] [<opt>]
MAKE vlist TABLE[=newvar#] [INTEGER][TRANSPOSED] [<opt>]

<opt> STORE_GVAR FIRST_DES | "lab descript" | LAB_DES_READ

MAKE vlist FROM=var# [TIE=tie#]
MAKE FROM=var# NCASES=n [TIE=tie#]

The MAKE command creates new variables from existing variables.
MAKE CONCATENATE
The variables of the vlist are concatenated into a single variable. If CONCATENATE=var# is specified, the concatenated variable is stored into var#; otherwise it is stored into the next free location in the WA.

Note that the command takes into account a selection, if active. The resulting variable may not exceed MCAS (the maximal number of cases limit).

TRANSPOSED: instead of concatenating each variable in turn, TRANSPOSED starts with the first observation of each variable in sequence and continues with the second observation and so on.

STORE_GVAR: creates a new GVAR pointing to the relative variable number, i.e. all observations from the first variable of the vlist will be in group number 1, all observations from the second in group number 2 and so on.
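
The difference between the default and the TRANSPOSED ordering, together with the group numbers produced by STORE_GVAR, can be sketched in Python. The function make_concatenate is hypothetical; in particular it is an assumption that with TRANSPOSED the group numbers follow the interleaved values.

```python
def make_concatenate(variables, transposed=False):
    """Concatenate equal-length variables; return (values, group_numbers)."""
    if len({len(v) for v in variables}) != 1:
        raise ValueError("variables must have the same length")
    n_cases = len(variables[0])
    if transposed:
        # First observation of each variable, then the second, and so on.
        pairs = [(v[i], g)
                 for i in range(n_cases)
                 for g, v in enumerate(variables, start=1)]
    else:
        # Each variable in turn.
        pairs = [(x, g)
                 for g, v in enumerate(variables, start=1)
                 for x in v]
    values = [x for x, _ in pairs]
    groups = [g for _, g in pairs]
    return values, groups

plain = make_concatenate([[1, 2, 3], [4, 5, 6]])
interleaved = make_concatenate([[1, 2, 3], [4, 5, 6]], transposed=True)
```

Here plain holds the values 1 2 3 4 5 6 with groups 1 1 1 2 2 2, while interleaved holds 1 4 2 5 3 6 with groups 1 2 1 2 1 2.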

Label generation options: By default the newly created variable is labelled "CVAR" (or "Table" with TABLE) and the descriptor will be the command line. If FIRST is used, the label is as above and the descriptor is taken from the first variable of the vlist. With "lab des" the first word of the string will be the label, the remainder the descriptor. Finally LAB_DES_READ will prompt for a label and a descriptor.

Variables to be concatenated must have the same length. If you need to concatenate variables of different lengths, you should use the CC() function available with expressions (see there for more details).

MAKE TABLE
This command creates a table (variable) from the vlist, i.e. a variable suitable as input to table related commands such as ADDFIT, MEDPOLISH, BREAK or XTAB. In addition to the options explained with MAKE CONCATENATE, MAKE TABLE has an INTEGER option, creating an integer table instead of the default real (decimal) table.
MAKE FROM
This command form creates new variables from a single variable; the number of cases and the number of variables created depends upon the remaining specifications.

If the NCASE=n option is used (NCASE specifies the number of cases for the new variables), the number of variables is obtained by dividing the original length of the FROM variable by the number of cases specified by the option.

If no NCASE option is specified, the current vlist is used, i.e. the vlist indicates the number of variables to be created; the number of cases of each of the new variables is obtained by dividing the length of the FROM variable by the number of requested variables.

Note that for both situations an exact match of the initial variable length and the new length times the number of variables is required.
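
The two ways of fixing the split, and the exact-match requirement, can be sketched in Python. The function make_from is hypothetical, and it assumes that the new variables are cut from the FROM variable as consecutive blocks, which the text does not state.

```python
def make_from(source, n_vars=None, n_cases=None):
    """Split source into equal-length variables, given n_vars or n_cases."""
    if (n_vars is None) == (n_cases is None):
        raise ValueError("specify exactly one of n_vars and n_cases")
    given = n_vars if n_vars is not None else n_cases
    if given <= 0 or len(source) % given != 0:
        raise ValueError("the source length must be an exact multiple")
    other = len(source) // given
    if n_cases is None:
        n_cases = other   # the vlist fixed the number of variables
    else:
        n_vars = other    # NCASE fixed the number of cases
    # Assumption: consecutive blocks of n_cases values form the new variables.
    return [source[i * n_cases:(i + 1) * n_cases] for i in range(n_vars)]

by_vars = make_from([10, 20, 30, 40, 50, 60], n_vars=2)
by_cases = make_from([10, 20, 30, 40, 50, 60], n_cases=2)
```

A source of 6 cases can be split into 2 or 3 variables, but a request for 4 variables fails because 6 is not a multiple of 4.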

The labels and descriptors of the newly created variables are created from the label/descriptor of the FROM variable; a two digit sequential number will be appended to both the label and descriptor. E.g. if the original variable is called 'ABC' the new variables will be labelled ABC01, ABC02 and so forth. If the original label uses more than 6 positions the numbers are inserted in positions 7 and 8.

The TIE=tie# option is used to define a common table tie for the newly created variables.

MATCH


MATCH vlist SORT BY=var# <dup> [NOMOD_WA_ATTRIBUTES]

<dup> : NOCHECK | [REMOVE] | AGGREGATE

(More options to come...)

MATCH SORT sorts the variables in the vlist using the sequence given in var#.
      by=var#     old           new

         7        10           60
         2        20           20
         1        30           10
         3        40           30
         5        50           50
         0        60

Note that values smaller than or equal to zero in the match variable remove that case from the new variable(s).

By default the case identifiers and (if defined) the GVAR are put into the same sequence as the new variables; NOMOD_WA_ATTRIBUTES inhibits these changes.

The other options deal with duplicates, i.e. the same id number in the BY variable. If nothing is specified duplicates will be removed, i.e. only the first case with that id will be kept. NOCHECK does not check for duplicates. This could be used to create duplicate cases for the target variables.

Finally the AGGREGATE option aggregates all cases with the same id by computing their sum (compare also with the AGGREGATE command).

MAX


   MAX vlist  [VAR={var#}] [DELETE] [NOLAB]
              [IGNORE=val]
Puts the maximum for each observation on vlist into a variable. (Maximum across variables). The result is copied into the next free location (VAR) or into #target if VAR=#target is used. If you need to look for the maximum of a variable use the MAX function (see the LET command).

Normally MAX will ask for labels and descriptors of the newly created variable. NOLAB suppresses that and provides default labels and descriptors.

DELETE removes the vlist after the target has been computed. (Beware: VAR=var# should not be contained in vlist).

IGNORE{=val}: This option is used to ignore a specific value when determining the minimum or the maximum. Default is 0; <val> may be used to designate a different value. If all values for an observation are to be ignored, its value will be that value, i.e. 0 or the value specified by val.
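
The effect of IGNORE on the maximum across variables can be sketched in Python. The function max_across is hypothetical; passing ignore=None stands for a command line without the IGNORE option.

```python
def max_across(variables, ignore=None):
    """Maximum for each observation across equal-length variables."""
    result = []
    for row in zip(*variables):
        kept = [x for x in row if ignore is None or x != ignore]
        # If every value is ignored, the result is the ignored value itself.
        result.append(max(kept) if kept else ignore)
    return result

v1 = [-1, 0, 3]
v2 = [0, 0, 5]
plain = max_across([v1, v2])
without_zero = max_across([v1, v2], ignore=0)
```

For the first observation the plain maximum is 0, whereas ignoring 0 leaves -1; the second observation contains only ignored values and therefore stays 0.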

MIN



   MIN vlist  [VAR={var#}] [NOLAB] [DELETE]
              [IGNORE=val]
Puts the minimum for each observation on vlist into a variable. (Minimum across variables). The result is copied into the next free location (VAR) or into #target if VAR=#target is used. If you need to look for the minimum of a variable use the MIN function (see the LET command).

The NOLABEL, IGNORE and DELETE options have the same meaning as on the MAX command (see there for details).

MOVE


   MOVE v1[,v2]
Moves v1 to v2, i.e. MOVE performs a destructive copy. If v2 is not specified, v2 will be the next free location in the work area.
Related commands
DCOPY is a synonym for MOVE. COPY performs a non-destructive copy (duplicate) of a variable.

NORMALIZE
NORMALIZE  vlist [NEWVAR][MEDIAN] | MEAN | CENTER | RESCALE
Normalizes the variables in the list. Default is to center the variable around the median; other options are the MEAN and the CENTER values. Finally the RESCALE option subtracts the minimum and makes sure that the smallest value in the variable is never smaller than 1.0.

NEWVAR creates new variables instead of replacing the variables.

Compare to the STANDARDIZE command.

PACK
   PACK
Packs the WA: after packing, all variables are stored in sequence from v1 to vn, i.e. empty variable positions (relative numbers) are eliminated.

PCTCHECK
 PCTCHECK vlist BY=var | CONST=val | TABLE
         [TOL=val] [REPORTCASES {BELOW | ABOVE}]
The purpose of this command is to check percentages before computing them with other commands. When computing percentages it is often useful to check whether the variables expressed as percentages really sum up to 100%. For example, let PRIM, SEC and TERT be the number of persons working in the primary, secondary and tertiary sectors and ACTIV the total working population. When working with large bases of aggregate data, errors are not infrequent and may be disastrous when overlooked, especially after the data have been transformed into percentages. Therefore, before computing PRIM, SEC and TERT as percentages of ACTIV, it is useful to know whether they will sum up to 100%. Of course you could compute percentage versions of PRIM, SEC and TERT, sum them up, check them and remove the original variables ... PCTCHECK simplifies all this by checking whether the values WOULD sum up to 100% (more or less).

Default is to check 100% plus or minus 3%, TOL=val is used to specify a different tolerance.

REPORTCASES lists the cases which fall outside the tolerance limits. If BELOW or ABOVE are specified only cases below or alternatively above the limit are shown.
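
The check with a BY variable can be sketched in Python. The function pctcheck is hypothetical; skipping cases with a zero total is an assumption, as the text does not say how PCTCHECK treats them.

```python
def pctcheck(variables, by, tol=3.0):
    """Return (below, above): 1-based case numbers outside 100 +/- tol percent."""
    below, above = [], []
    for case, (total, *parts) in enumerate(zip(by, *variables), start=1):
        if total == 0:
            continue  # assumption: undefined percentages are skipped
        pct = 100.0 * sum(parts) / total
        if pct < 100.0 - tol:
            below.append(case)
        elif pct > 100.0 + tol:
            above.append(case)
    return below, above

prim = [10, 20, 30]
sec = [30, 30, 30]
tert = [60, 40, 50]
activ = [100, 100, 100]
below, above = pctcheck([prim, sec, tert], by=activ)
```

With the default tolerance the second case (90%) is reported as below and the third case (110%) as above the limit; with tol=10 neither would be reported.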

All other options are identical to those of the PERCENT command (C=, B=, TABLE).

PERCENT


 PERCENT vlist BY=#var | C=constant | TABLE
         [NEW] [PROPORTIONS] [REP=val]
         [NOMODIFY] [NOCENTER]
All variables in the vlist are recomputed as percentages of either (1) the variable specified by BY=#var, (2) a constant value specified with the C= option or (3) TABLE, i.e. the sum of the variables in the vlist is used as total. One of B=, C= or TABLE is required; if none is given on the command line EDA asks for the BY= variable.

The resulting variables have the same label and descriptor as the original variable, but slightly modified. With BY=var#, (%<name>) is appended to the descriptor of the modified variable, where <name> is the label of the variable specified on BY. If PERCENT is done on a constant, the target variable descriptor will contain a *%* modification stamp. The modification of the variable descriptor may be inhibited by specifying NOMOD.

If no other option is used the original variables are replaced with the percentages; if NEW is present the resulting variables are added to the WA as new variables; they are copied in the first free locations of the WA.

Division by zero is checked: if you attempt such a division the result will be the system replacement value (see SET D=) for that case; REP=val is used to set the result to <val> instead of the system replacement value.

PROPORTIONS computes proportions instead of percentages, i.e. does not multiply the result by 100.
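
The computation with a BY variable, including the division-by-zero rule, can be sketched in Python. The function percent is hypothetical, and the default of -1 for the replacement value is an assumption standing in for the system replacement value.

```python
def percent(values, by, proportions=False, rep=-1.0):
    """Express values as percentages (or proportions) of the by variable."""
    scale = 1.0 if proportions else 100.0
    # A zero divisor yields the replacement value instead of an error.
    return [scale * v / b if b != 0 else rep for v, b in zip(values, by)]

as_percent = percent([1, 2, 5], [4, 0, 5])
as_proportion = percent([1, 2, 5], [4, 0, 5], proportions=True)
```

The second case has a zero divisor and therefore receives the replacement value in both results.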

As percent operations are most often used to produce percentages not exceeding 100% (or proportions not exceeding 1), EDA checks the result and reports the number of values not in the "normal" percentage range. If you intend to have larger percentages, ignore those warnings. To avoid spurious warnings caused by rounding errors and the like, the warning is only given if the result is larger than 100.9 (or 1.09 for proportions).

This command in fact duplicates the % operator available within expressions. However, percentage operations are often performed on many variables, always using the same base variable (e.g. computing percentages of many variables in terms of the total population of, say, a state); the PERCENT command handles this case in a single step.

The PERCENT command also computes the global percentage for each variable and stores it into CENTER for further reference. The NOCENTER option inhibits this and stores the default center instead, i.e. the median.

REARRANGE


 REARRANGE vlist | AUTORECODE [START=val]  |
                 | NSMALL=n1 [VAL=v1]      |
Purpose: rearranges categorical variables. For categorical variables EDA considers only the integer part, i.e. if your data has decimal values these will be disregarded.

The various options perform cleanup manipulations which are often tedious. They are especially useful when preparing some more complex analyses with special requirements.

AUTORECODE: rearranges the variable in such a way that the different codes are consecutive starting with 1. For instance a variable having codes 1 4 6 7 8 will have codes 1 2 3 4 5 after the transformation. START=val changes the default starting value 1 to some other desirable code.

NSMALL=n1 recodes all codes having frequencies equal to or smaller than n1. All codes corresponding to that cutoff value are recoded into code 0 (unless the user specifies a different value with the VAL= option). If such a recodification produces a variable with a single category a warning message will be issued.
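
Both recodings can be sketched in Python. The functions autorecode and nsmall are hypothetical; the sketch assumes that AUTORECODE keeps the ascending order of the original codes (as in the example above) and that the integer part is obtained by truncation.

```python
from collections import Counter

def autorecode(values, start=1):
    """Recode the distinct integer codes to consecutive codes from start."""
    codes = [int(v) for v in values]  # only the integer part is considered
    mapping = {code: start + rank
               for rank, code in enumerate(sorted(set(codes)))}
    return [mapping[c] for c in codes]

def nsmall(values, n1, val=0):
    """Recode codes whose frequency is n1 or smaller to val."""
    codes = [int(v) for v in values]
    freq = Counter(codes)
    return [val if freq[c] <= n1 else c for c in codes]

recoded = autorecode([1, 4, 6, 7, 8, 4])
pooled = nsmall([1, 1, 1, 2, 3, 3], n1=1)
```

In the second call only code 2 occurs once, so it is the only code recoded to 0.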

RECODE -> EDIT




RENAME
   RENAME [vlist] [FORCE]
   RENAME [v1] "new_name" [FORCE]
   RENAME DOCUMENT "old_name"
Purpose: change the names of variables and documents.

The first form of the command changes the names of the variables in the specified list. For each variable in the vlist you will be asked to enter a new name. Empty names are not accepted. You may also enter a = sign to tell EDA to keep the current name (this is useful when you change your mind for some variables with a long variable list).

The second command form renames a single variable; the new name is specified on the command line.

As long as you do not use documents with your variables, this is all the information you need. When working with documents please read on.

Unlike other variable label and descriptor editing commands (TED LABELS, LABEL), this command also changes the names of the associated documents. As documents belonging to variables are kept in a file, those other commands do not go through that file and therefore lose the associated documents. (With LABEL you have a security mode, where variables with documents are not altered).

As RENAME goes through the document file, this command might take some time with a WA containing many documents.

DOCUMENT is used to rename a document not related to a variable. You specify the name of the document to rename in double quotes and will be prompted for the new name.

FORCE: In addition to ordinary documents stored with the archive, EDA lets you link additional documents to a file. These documents may not be changed with a RENAME command; therefore EDA will, by default, skip variables with linked documents (and issue a message). If you really want to change the name (and accept losing the link), add the FORCE option.

REPVAL


    REPVAL <opt> [MISS=val | GREATERTHAN=val | LESSTHAN=val
                 | OUTSIDE=(min,max)]
                 [SHORT]

<opt>  vlist [DISPLAY]
       vlist LISTWISE
       vs vlist UPDATE
       vlist VALUE=val
       vlist [MEDIAN | MEAN] [NMIN=n_min]
       vlist GROUPWISE [{MEDIAN} | MEAN] [NMIN=n_min]
       vs,vlist CASEWISE [{MEDIAN} | MEAN] [NMIN=n_min]

The REPVAL command deals with system replacement (missing) values or any value you designate as such; it offers options to count missing values and replace them with some suitable valid (non-missing) value.

All forms of the command share two common options. Normally the command checks and replaces the system replacement value (RepVal) as set by the SET REPVAL command or the default (usually -1). You may specify the MISS=val option to use <val> instead of the default RepVal; GREATERTHAN=val treats all values greater than <val> as missing, whereas LESSTHAN=val treats values lower than <val> as missing, and OUTSIDE=(min,max) specifies a range of valid observations, i.e. all values outside this range are considered missing.

REPVAL is usually rather talkative; SHORT inhibits variable/casewise reports.

Default option (DISPLAY)
This form of the command shows a report containing the overall number of observations and the number of missing observations for each variable in the vlist. Each variable is handled individually (see the LISTWISE option for an alternative view).

With several variables on the vlist, the command also displays the smallest and largest missing value count found in any variable of the vlist.

REPVAL vlist LISTWISE
Reports how many valid observations remain after missing values have been removed listwise, i.e. how many cases have no missing value in the data matrix defined by vlist.
REPVAL vs, vlist UPDATE
Updates a target variable (vs) from the other variables in the list wherever it contains missing values (by default the current replacement value, or any specific value you designate with the MISS=val option). If a missing value is found in <vs>, UPDATE checks the first variable of <vlist>; if the observation is not missing on that variable, it is copied to the <vs> variable, otherwise UPDATE checks the second variable in the <vlist>, and so on.
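
The search order of UPDATE can be sketched in Python. The function repval_update is hypothetical, and the default of -1 stands in for the system replacement value.

```python
def repval_update(target, sources, miss=-1):
    """Fill missing values of target from the first source that has a valid value."""
    result = []
    for case, value in enumerate(target):
        if value == miss:
            for source in sources:
                if source[case] != miss:
                    value = source[case]
                    break  # later sources are not consulted
        result.append(value)
    return result

vs = [1, -1, -1, -1]
a = [9, -1, 7, -1]
b = [8, 5, 6, -1]
updated = repval_update(vs, [a, b])
```

The second case is filled from b (a is missing there), the third from a, and the fourth stays missing because no source has a valid value.
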
REPVAL vlist VALUE=val
Replace all missing values with val.
REPVAL vlist [MEDIAN | MEAN] [NMIN=nmin]
Replaces missing values in each variable with the overall MEDIAN (MEAN) of the valid observations of the variable. Missing values are not replaced if there are fewer than nmin (default 2) valid observations in the variable.
REPVAL vlist GROUPWISE [{MEDIAN} | MEAN] [NMIN=nmin]
Replaces missing values in each variable with the MEDIAN (MEAN) of the valid observations in the same group as the missing observation. Missing values are not replaced if there are fewer than NMIN (default 2) valid observations in the group for the variable.

This option requires a GVAR.
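
The groupwise replacement and the role of NMIN can be sketched in Python. The function repval_groupwise is hypothetical; groups plays the role of the GVAR and -1 stands in for the system replacement value.

```python
import statistics
from collections import defaultdict

def repval_groupwise(values, groups, miss=-1, use_mean=False, nmin=2):
    """Replace missing values by the median (or mean) of their own group."""
    valid = defaultdict(list)
    for v, g in zip(values, groups):
        if v != miss:
            valid[g].append(v)
    center = statistics.fmean if use_mean else statistics.median
    # Only groups with at least nmin valid observations provide a fill value.
    fill = {g: center(vs) for g, vs in valid.items() if len(vs) >= nmin}
    return [fill.get(g, v) if v == miss else v
            for v, g in zip(values, groups)]

values = [1, 3, -1, 10, -1, -1]
groups = [1, 1, 1, 2, 2, 3]
filled = repval_groupwise(values, groups)
```

Group 1 has two valid observations, so its missing value becomes their median 2; group 2 has only one and group 3 none, so their missing values are left unchanged.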

REPVAL vs,vlist CASEWISE [{MEDIAN} | MEAN] [NMIN=min]
Replaces missing values in variable vs with the MEDIAN (MEAN) of all variables in the vlist across the same observation. If the number of valid observations is smaller than NMIN (default 2), the missing value is not replaced.
Related commands
CHECK NMIN (CLEANUP option) may be used to remove variables or observations with insufficient valid observations.

The MISSVAL command (a selection command) combined with the SELECT command eliminates missing observations from the WA.

The expression related commands (LET/IF/INCLUDE/EXCLUDE) can of course be used to specify any kind of operation you care to perform.

STANDARDIZE


   STANDARDIZE vlist [NEWVAR][MEDIAN] | MEAN | RESCALE
Standardizes the variables in the vlist. Default is to center each variable around its median and divide by its midspread.

The MEAN option yields a z-transformation (center around the mean and divide by the standard deviation).

The RESCALE option subtracts the minimum and makes sure that the smallest value in each variable is never below 1.0.

NEWVAR creates new variables instead of replacing the variables of the vlist.
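
The default and the MEAN variant can be sketched in Python. The function standardize is hypothetical. The sketch takes the midspread to be the interquartile range computed by statistics.quantiles and uses the sample standard deviation; EDA may compute both quantities slightly differently.

```python
import statistics

def standardize(values, use_mean=False):
    """Center and scale one variable (median/midspread or mean/SD)."""
    if use_mean:
        center = statistics.fmean(values)
        spread = statistics.stdev(values)  # assumption: sample SD (n - 1)
    else:
        center = statistics.median(values)
        # Assumption: midspread = distance between the quartiles.
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        spread = q3 - q1
    if spread == 0:
        raise ValueError("the variable has no spread")
    return [(x - center) / spread for x in values]

robust = standardize([1, 2, 3, 4, 5])
z_scores = standardize([1, 2, 3, 4, 5], use_mean=True)
```

For the values 1 to 5 the median is 3 and the quartiles are 2 and 4, so the default version yields -1, -0.5, 0, 0.5 and 1.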

Compare to the NORMALIZE command.

The descriptor modification stamp is *s* for STANDARDIZE and *n* for NORMALIZE.

SWAP


   SWAP v1 v2
Variables v1 and v2 change places in the current WA.

TRANSPOSE
    TRANSPOSE [NODESCRIPTION]
Transposes the work area, i.e. the variables become cases and the cases variables. This facility is limited to NVAR cases to be transposed. The case identifiers and variable labels are interchanged, the variable descriptions are replaced by default descriptors and the row and column names are inverted.

NODES transposes only the data matrix, i.e. no descriptive information whatsoever is transposed. This option should be used with extreme caution in situations where a specific transformation you need to perform is easier to perform on the transposed matrix.

WEIGHT


  WEIGHT vlist  BY=var#
  WEIGHT v1[,v2] BIWEIGHT [CUT=val] [E=epsi]
The first form of the command weights the variables in the variable list using the BY=#var. The weight variable should contain only integer values (real values are truncated). For each integer value the values in the variables of the vlist are repeated as many times.

Let the weight variable and two variables of the vlist be

   weight   2     4     3
   v1       2.5   4.6  -1.2
   v2       1.2   1.3   4.5

After weighting:

   v1       2.5 2.5 4.6 4.6 4.6 4.6 -1.2 -1.2 -1.2
   v2       1.2 1.2 1.3 1.3 1.3 1.3 4.5 4.5 4.5
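
The replication rule of the first command form can be sketched in Python, using the numbers of the example. The function weight is hypothetical.

```python
def weight(variables, weights):
    """Repeat each case of every variable as often as its weight says."""
    counts = [int(w) for w in weights]  # real weights are truncated
    return [[x for x, k in zip(v, counts) for _ in range(k)]
            for v in variables]

v1 = [2.5, 4.6, -1.2]
v2 = [1.2, 1.3, 4.5]
w1, w2 = weight([v1, v2], [2, 4, 3])
```

Each weighted variable has 2 + 4 + 3 = 9 cases.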

BIWEIGHT copies the weights for each case of v1 into v2 (or into v1 if v2 is not present). The weights result from computing Tukey's biweight, an iterative estimate of center (location). The options C and E default to 4 and 0.01 respectively. With C larger than 10 the result is equal to the mean.
Reference
McNeil 1977, Chapter 7, Section 1; Credits: Originally from McNeil 1977, adapted; modifications suggested by Bernard Rapacchi