Transform variables

There are three central commands to transform variables or compute new variables from other variables: generate, replace and egen.

generate and replace

General syntax

generate newvar[:lblname] = exp [if] [in]
replace oldvar = exp [if] [in]

Where newvar is the variable to be created, oldvar is an existing variable, and exp is an expression of any complexity.

    generate a1 = urb + 22 - log(infmor)
    generate b1 = 12                      // Constant value of 12
    generate b2 = .                       // Set all of b2 to missing
    replace b = log(urb) if urb >= 60     // Replace b with log(urb) where urb is 60 or more
    replace age2 = age2^2                 // Replace age2 with its square
    generate c = uniform()                // Create a uniform random variable (runiform() in current Stata versions)
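As a minimal sketch of how generate and replace work together, the following do-file fragment builds a small made-up dataset (the values of urb and infmor are purely illustrative, not taken from any real data) and then applies a conditional replace.

```stata
* Hypothetical three-observation dataset, for illustration only
clear
input urb infmor
40 50
60 20
80 10
end

* New variable computed from two existing ones
generate a1 = urb + 22 - log(infmor)

* Start with b missing everywhere, then fill it in conditionally
generate b = .
replace b = log(urb) if urb >= 60

list
```

After running this, b stays missing in the first observation and holds log(urb) in the other two. Note that Stata treats missing values as larger than any number, so a condition such as urb >= 60 is also true where urb is missing; here that is harmless because log(.) is itself missing.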

A large number of arithmetic functions are available; type help functions for a list.

Advanced functions

Several functions are shortcuts that simplify common tasks which could otherwise be achieved with a sequence of generate/replace commands.

egen ("extensions to generate")

This command offers a number of useful functions (some of them are documented below). The general syntax is:

egen newvar = efn(exp) [if] [in] [, options]

Where efn is one of the functions offered by egen (see a partial list below) and exp is an expression, often a simple variable name.

    egen urbm = median(urb)             // Creates a constant containing, for each observation, the median of urb
    egen urbsd = sd(log(urb))           // Constant containing the standard deviation of the logged urb
    egen urb1 = mean(urb) if urb > 70   // Sets urb1 to the mean of urb computed over the observations where urb is larger than 70; urb1 is missing elsewhere (urb1 must not already exist)
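The effect of the if qualifier is easiest to see on a tiny example. The sketch below uses four invented values of urb (an assumption for illustration only).

```stata
* Hypothetical data, for illustration only
clear
input urb
40
60
80
90
end

* Same median stored in every observation
egen urbm = median(urb)

* Mean taken over urb > 70 only; other observations get missing
egen urb1 = mean(urb) if urb > 70

list
```

Here urbm is 70 in all four observations, while urb1 is 85 in the last two observations and missing in the first two.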

Functions returning a constant:

min(exp)               Minimum
max(exp)               Maximum
mean(exp)              Mean
sd(exp)                Standard deviation
mdev(exp)              Mean absolute deviation
median(exp)            Median
mad(exp)               Median absolute deviation
iqr(exp)               Interquartile range
pctile(exp) [, p(#)]   Percentile, p defaults to 50
count(exp)             Count the number of non-missing values
total(exp)             Sum of observations

Functions transforming a variable

std(exp) [, mean(#) std(#)]   Standardize, mean and std default to 0 and 1
pc(exp) [, prop]              Percentages (proportions/fractions with the prop option)
rank(exp) [, unique]          Transforms into ranks (with ties unless the unique option is given)
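A short sketch of the transforming functions, again on invented values (x and the new variable names are hypothetical):

```stata
* Hypothetical data with one tie, for illustration only
clear
input x
10
20
20
50
end

egen zx = std(x)            // standardized to mean 0, sd 1
egen px = pc(x)             // percent of the total of x
egen fx = pc(x), prop       // same as proportions
egen rx = rank(x)           // tied values share a rank
egen ux = rank(x), unique   // ties broken arbitrarily

list
```

Since the values sum to 100, px reproduces x and fx holds .1, .2, .2 and .5. With the default treatment of ties, rx is 1, 2.5, 2.5 and 4; with unique, the two tied observations receive ranks 2 and 3 in arbitrary order.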

Functions operating across variables

These functions create a new variable with statistics obtained for each observation across a variable list.

rowmax(varlist)
rowmin(varlist)
rowmean(varlist)
rowmedian(varlist)
rowsd(varlist)
rowtotal(varlist)
rowpctile(varlist) [, p(#)]
rowmiss(varlist)        Count of missing values
rownonmiss(varlist)     Count of non-missing values

Missing values

Consider the following two commands:

    generate nw = (v1+v2+v3+v4)/4
    egen nw = rowmean(v1 v2 v3 v4)

If there are no missing values, the results of the two commands will be the same. If, however, one or more values are missing from an observation, the results will differ: the first command generates a missing value for nw, whereas the second computes the average of the non-missing values.
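The difference can be checked with a small made-up dataset. In this sketch the two results are stored under different names (nw1 and nw2, chosen here for illustration) so that both can be listed side by side.

```stata
* Hypothetical data; "." marks a missing value
clear
input v1 v2 v3 v4
1 2 3 4
2 4 . 6
. . . 8
. . . .
end

* Arithmetic involving a missing value yields missing
generate nw1 = (v1 + v2 + v3 + v4)/4

* rowmean() averages over the non-missing values only
egen nw2 = rowmean(v1 v2 v3 v4)

list
```

nw1 is 2.5 for the first observation and missing for the others. nw2 is 2.5, 4 and 8 for the first three observations; for the last one, where every value is missing, rowmean() also returns missing.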

Numbers of missing and non-missing observations

count(exp) counts the number of non-missing values in a variable. To obtain the number of missing values you can use the following:

    egen c = count(urb)   // Count the number of non-missing observations in urb
    display _N - c        // Display the difference between the total number of observations in the dataset (_N is a system constant) and "c"

rowmiss(varlist) and rownonmiss(varlist) can be used to inspect missing/non-missing observations across several variables.
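A minimal sketch of both approaches on invented data (the variable names nmis, nobs and c are arbitrary choices for this example):

```stata
* Hypothetical data; "." marks a missing value
clear
input v1 v2 v3 v4
1 2 3 4
2 4 . 6
. . . 8
end

* Per-observation counts across the four variables
egen nmis = rowmiss(v1 v2 v3 v4)
egen nobs = rownonmiss(v1 v2 v3 v4)
list

* Number of missing values in a single variable
egen c = count(v1)
display _N - c[1]
```

For these three observations nmis is 0, 1 and 3 and nobs is 4, 3 and 1. The display command prints 1, the number of observations in which v1 is missing.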
