Groups are defined by categorical variables. It is often useful, for instance, to compare infant mortality in countries with low, average and high urbanisation. Because urbanisation is a continuous variable, we first need to break it into a categorical variable with, for example, three groups.
generate urbcat=autocode(urb,4,0,100) | break urb into four evenly spaced categories from 0 to 100 |
generate cat1=recode(urb,21,38,64,100) | 4 groups (≤ 21, ≤ 38, ≤ 64 and ≤ 100) |
xtile urbcat = urb, nquantiles(3) | Three groups with roughly the same number of observations (default 2 groups) |
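table urbcat, contents(min urb max urb) | Show the min/max of the groups (syntax of Stata 16 and earlier; in Stata 17 and later use table urbcat, statistic(min urb) statistic(max urb)) |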
egen urbcat1 = cut(urb), at(0,34,68,101) | Three groups, based on specified limits |
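To see how autocode(), recode() and xtile differ, the three can be run side by side on a small data set. This is only a sketch: the values of urb below are invented for illustration, and the variable names urb_auto, urb_rec and urb_q3 are hypothetical.

```stata
* Toy data: the values of urb are invented for illustration only
clear
input urb
5
20
30
45
60
90
100
end

* autocode(): four equal-width intervals on 0-100;
* each observation receives the upper bound of its interval (25, 50, 75 or 100)
generate urb_auto = autocode(urb, 4, 0, 100)

* recode(): each observation receives the first listed value that is >= urb
generate urb_rec = recode(urb, 21, 38, 64, 100)

* xtile: three groups with roughly the same number of observations, coded 1, 2, 3
xtile urb_q3 = urb, nquantiles(3)

list, clean noobs
```

Note that autocode() and recode() code each group by an upper limit, whereas xtile numbers the groups consecutively starting at 1.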
The cut function available in egen lets you specify the bin boundaries yourself. With at(0,34,68,101), the groups are:
group | Boundaries (breaks) |
---|---|
1 | From 0 up to (but not including) 34 |
2 | From 34 up to (but not including) 68 |
3 | From 68 up to (but not including) 101 |
Note that observations falling outside the specified boundaries are set to missing; egen displays a message when this happens.
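The boundary rule and the treatment of out-of-range values can be checked on a few hand-picked values. This is a minimal sketch with invented data; the variable name urbcat_demo is hypothetical.

```stata
* Toy data (invented values); 105 lies outside the range covered by at()
clear
input urb
0
33
34
67
68
100
105
end

egen urbcat_demo = cut(urb), at(0,34,68,101)

* Each observation is coded with the lower boundary of its bin:
* 33 still belongs to the bin starting at 0, while 34 opens the next bin.
* 105 is not covered by any bin and is therefore set to missing.
list, clean noobs
count if missing(urbcat_demo)
```

Setting the last boundary to 101 rather than 100 is what keeps an observation with urb equal to 100 inside the third group.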
The urbcat1 generated in the example above takes three values, each corresponding to the lower boundary of its bin, i.e. 0, 34 and 68. cut has three useful options: icodes, label and group().
egen urbcat1 = cut(urb), at(0,34,68,101) icodes | urbcat1 will have values 0, 1, 2 |
egen urbcat2 = cut(urb), at(0,34,68,101) label | Same, but in addition defines the value labels "0-", "34-" and "68-" |
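The effect of icodes and label is easiest to see by listing the result with and without value labels. The sketch below uses three invented observations, one per bin; the variable names urb_codes and urb_lab are hypothetical.

```stata
* Toy data (invented values), one observation per bin
clear
input urb
10
40
80
end

* icodes: integer codes 0, 1, 2 instead of the lower bin boundaries
egen urb_codes = cut(urb), at(0,34,68,101) icodes

* label: the same integer codes, with value labels built from the lower boundaries
egen urb_lab = cut(urb), at(0,34,68,101) label

* First listing shows the value labels, second shows the underlying codes
list, clean noobs
list, clean noobs nolabel
```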
The third option, group(), defines groups that contain roughly the same number of observations; for example, group(3) creates three groups corresponding to the thirds of the distribution:
egen urbcat3 = cut(urb), group(3) label
label variable urbcat3 "Urbanization in 3 categories" | Label the newly created variable |
label define urblab 0 "Low urb." 1 "Average" 2 "High" | Define value label set urblab (group() codes the groups 0, 1, 2) |
label values urbcat3 urblab | Attach labels urblab to variable urbcat3 |
tab urbcat3 | Show a frequency table with the newly defined labels |
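The labelling steps can be tried end to end on generated data. This is a sketch under stated assumptions: the urb values are invented, and the names urb3 and urb3lab are hypothetical. According to the egen documentation, group() implies icodes, so the groups are coded from 0 and the value labels are defined accordingly.

```stata
* Toy data: 30 invented values 3, 6, ..., 90
clear
set obs 30
generate urb = _n * 3

* Three groups with roughly equal numbers of observations, coded 0, 1, 2
egen urb3 = cut(urb), group(3)

* Variable label, value label set, and attachment of the set to the variable
label variable urb3 "Urbanization in 3 categories"
label define urb3lab 0 "Low urb." 1 "Average" 2 "High"
label values urb3 urb3lab

* Frequency table shown with the value labels
tabulate urb3
```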