DATA Step Example
In the `DATA` step below, we read in the file named in the `set` statement, `in.fake_micro`, keeping only the first 100 observations with the data set option `obs=100` (we will discuss these options in more detail later). We then create several new variables, subset the data, and save the result to the output data set, `in.fake_micro_10`. The `length` statement sets the length of the new variable `married` to 3 bytes, the shortest numeric length on most platforms. By default, SAS gives new numeric variables a length of 8, the largest possible numeric length. This is often wasteful: you can save space by setting each variable's length to the minimum needed to store its values (but be careful not to lose information by choosing a length that is too small; more on this later).
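The risk of choosing too short a length can be sketched as follows. This is a minimal illustration, assuming a Windows or UNIX SAS session where a 3-byte numeric stores integers exactly only up to 8,192; the data set `work.len_demo` and its variables are hypothetical. Lengths take effect only when a variable is written to a data set, so the example writes the values and then prints them back.

```sas
/* Hypothetical example: the same value stored with two different lengths */
data work.len_demo;
    length small 3 full 8;
    small = 8193;   /* one above the largest integer a 3-byte numeric holds exactly */
    full  = 8193;   /* stored at full precision */
run;

/* Print the stored values; small may no longer equal full */
proc print data=work.len_demo;
run;
```

A shortened length is a good fit for indicators and small integer codes, where the values are well inside the exact range.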
We create a `married` indicator by evaluating whether `kid_married_2015` is equal to 1; the comparison evaluates to 1 when true and 0 otherwise. We then restrict to rows with a non-missing `mom_pik` or `dad_pik`. An `if` statement without a `then` clause (a subsetting `if`) drops every observation that does not meet the `if` condition (more on this later).
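Both ideas above can be tried on a few inline rows. This is a minimal sketch with hypothetical names (`work.subset_demo`, `id`, `status`, `parent_id`, `flag`), not the tutorial's data; a lone period in the data lines is read as a missing value.

```sas
/* Hypothetical example: comparison-as-indicator and a subsetting if */
data work.subset_demo;
    input id status parent_id $;
    /* The comparison yields 1 when true and 0 otherwise, including when status is missing */
    flag = (status = 1);
    /* Subsetting if: only rows with a non-missing parent_id continue to the output */
    if not missing(parent_id);
    datalines;
1 1 A01
2 0 .
3 . B02
;
run;

proc print data=work.subset_demo;
run;
```

Here the row with `id` 2 is dropped because its `parent_id` is missing, while the row with `id` 3 is kept with `flag` equal to 0.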
Then we create two versions of a parent income variable. `par_inc_2015_miss` will be missing if `mom_inc_2000` is missing, `dad_inc_2000` is missing, or both. `par_inc_2015_nomiss` will be missing only if both components are missing, since the `sum()` function ignores missing arguments and returns a missing value only when every argument is missing. This is similar to the distinction between using `+` and `egen, rowtotal()` in Stata.
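The difference between `+` and `sum()` is easy to check in a throwaway step that writes to the log. The variable names below are hypothetical; the last line shows a common idiom, adding a literal 0 as an argument, for cases where a zero is wanted even when every input is missing.

```sas
/* Hypothetical example: + versus sum() with missing values */
data _null_;
    a = 100;
    b = .;
    plus_total  = a + b;         /* missing, because one operand is missing */
    sum_total   = sum(a, b);     /* 100, because sum() ignores the missing argument */
    all_missing = sum(b, b);     /* missing, because every argument is missing */
    zero_floor  = sum(0, b, b);  /* 0, because the literal 0 is a non-missing argument */
    put plus_total= sum_total= all_missing= zero_floor=;
run;
```

SAS also writes a note to the log when `+` generates a missing value, which can help locate unintended missing results.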
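The `keep` statement specifies which variables are kept and written to the output file. Because `married` is not listed, it is not written to the output. In SAS, `:` acts as a wildcard similar to `*` in Stata, except that it can only appear at the end of a name: `par_inc_2015:` matches every variable whose name begins with `par_inc_2015`. Be mindful of this when naming variables.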
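The prefix matching described above can be sketched with a small hypothetical data set (`work.wide`, `work.narrow`, and the variable names are made up for illustration).

```sas
/* Hypothetical example: the colon matches variable names by prefix */
data work.wide;
    inc_2014 = 1;
    inc_2015 = 2;
    income_total = 3;
    other = 4;
run;

data work.narrow;
    set work.wide;
    /* Keeps inc_2014 and inc_2015; income_total does not begin with inc_ */
    keep inc_: ;
run;

proc contents data=work.narrow;
run;
```

Since the wildcard only matches from the start of a name, putting the shared part first (for example `inc_2014` rather than `y2014_inc`) makes variables easier to select as a group.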
First, let's look at the input data set. Missing values are displayed here as `NaN` because we use Python to display the data. In SAS, missing numeric values are displayed as `.`, similar to Stata.
| | pik | mom_pik | dad_pik | kid_married_2015 | mom_inc_2000 | dad_inc_2000 |
|---|---|---|---|---|---|---|
| 0 | 141 | 173 | NaN | 1.0 | 16474.893 | NaN |
| 1 | 138 | 149 | NaN | 0.0 | 21689.520 | 37348.6450 |
| 2 | 177 | NaN | 003 | 0.0 | 11420.902 | 14666.4970 |
| 3 | 146 | 013 | 004 | 0.0 | NaN | 34485.8130 |
| 4 | 104 | 187 | 005 | 1.0 | 52476.680 | 6003.8027 |
| 5 | 144 | NaN | 006 | 1.0 | 13744.854 | 20464.3630 |
| 6 | 083 | 025 | 007 | 0.0 | 22038.420 | 17941.8010 |
| 7 | 115 | NaN | 008 | 1.0 | 16084.973 | 6875.2935 |
| 8 | 170 | 157 | 009 | 1.0 | 17766.367 | 31663.4510 |
| 9 | 118 | NaN | 010 | 1.0 | 29356.543 | NaN |
* Set the directory where the data is;
libname in "/media/sf_myfolders";
data in.fake_micro_10;
set in.fake_micro(obs=100);
* Set length of the marriage variable;
length married 3;
* Create married indicator;
married = kid_married_2015=1;
* Restrict to kids with a mom_pik or dad_pik;
if ~missing(mom_pik) or ~missing(dad_pik);
* Combine parent income, this will be missing if either component is missing;
par_inc_2015_miss = mom_inc_2000 + dad_inc_2000;
* Combine parent income with sum() which ignores missing values, so this is missing only if both components are missing;
par_inc_2015_nomiss = sum(mom_inc_2000, dad_inc_2000);
keep pik mom_pik dad_pik par_inc_2015: ;
run;
Now we look at the output data set, where we can see the difference between using `+` and `sum`.
| | pik | mom_pik | dad_pik | par_inc_2015_miss | par_inc_2015_nomiss |
|---|---|---|---|---|---|
| 0 | 141 | 173 | NaN | NaN | 16474.8930 |
| 1 | 138 | 149 | NaN | 59038.1650 | 59038.1650 |
| 2 | 177 | NaN | 003 | 26087.3990 | 26087.3990 |
| 3 | 146 | 013 | 004 | NaN | 34485.8130 |
| 4 | 104 | 187 | 005 | 58480.4827 | 58480.4827 |
| 5 | 144 | NaN | 006 | 34209.2170 | 34209.2170 |
| 6 | 083 | 025 | 007 | 39980.2210 | 39980.2210 |
| 7 | 115 | NaN | 008 | 22960.2665 | 22960.2665 |
| 8 | 170 | 157 | 009 | 49429.8180 | 49429.8180 |
| 9 | 118 | NaN | 010 | NaN | 29356.5430 |
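The table above was rendered with Python. To inspect the same rows from within SAS, one option is `proc print`, with `proc contents` to check the variables that were written. This is a sketch that assumes the library path and output data set from the step above.

```sas
/* Point the libref at the folder used in the DATA step above */
libname in "/media/sf_myfolders";

/* Show the first 10 observations of the output data set */
proc print data=in.fake_micro_10(obs=10);
run;

/* List the variables that were kept, with their types and lengths */
proc contents data=in.fake_micro_10;
run;
```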