DATA Step Example

In the DATA step below, we read in the file named in the set statement, in.fake_micro, keeping only the first 100 observations with the data set option obs=100 (we will discuss these options in more detail later). We then create several new variables, subset the data, and save the result to the output data set, in.fake_micro_10. The length statement sets the length of the new variable married to 3 bytes, the shortest numeric length. By default, SAS gives new numeric variables a length of 8 bytes, the largest possible numeric length. This is wasteful for a 0/1 indicator: you can save space by setting each variable's length to the minimum needed to store its values (but be careful not to lose information by setting the length too small; more on this later).
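To make the length trade-off concrete, here is a minimal sketch that is separate from the main example. It assumes a Windows or Unix host, where a 3-byte numeric variable stores integers exactly only up to 8,192; work.len_demo and its variable names are hypothetical. Truncation happens when the observation is written to the data set, so a second step reads the values back.

```sas
* Hypothetical scratch data set, not part of the main example;
data work.len_demo;
    length small 3 big 8;
    small = 8193;
    big   = 8193;
run;

* Read the stored values back and print them to the log;
data _null_;
    set work.len_demo;
    put small= big=;
run;
```

Under that assumption, the log should show that small no longer equals 8193 while big is unchanged. A 0/1 indicator such as married is far below the limit, so a length of 3 is safe for it.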

We create the married indicator by evaluating whether kid_married_2015 is equal to 1; the comparison returns 1 when true and 0 otherwise. We then restrict to rows with a non-missing mom_pik or dad_pik. An if statement with no then clause (a subsetting if) drops every observation that does not meet the condition (more on this later).
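The two ideas in this paragraph can be tried on a toy data set. The sketch below uses made-up values and hypothetical names (work.toy, status, parent_id, is_one), not the fake_micro file.

```sas
* Hypothetical toy data;
data work.toy;
    input id status parent_id;
    datalines;
1 1 10
2 0 .
3 . 30
4 1 .
;
run;

data work.toy_subset;
    set work.toy;

    * The comparison yields 1 when true and 0 when false;
    * A missing status therefore gives 0, not missing;
    is_one = (status = 1);

    * Subsetting if, only rows with a non-missing parent_id are written;
    if not missing(parent_id);
run;

proc print data=work.toy_subset;
run;
```

Only ids 1 and 3 should remain, and id 3 gets is_one = 0 even though its status is missing. If missing inputs should stay missing in the indicator, that case has to be handled explicitly.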

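Then we create two versions of a parent income variable. par_inc_2015_miss will be missing if mom_inc_2000, dad_inc_2000, or both are missing. par_inc_2015_nomiss is non-missing whenever at least one component is non-missing, because the sum() function ignores missing arguments; it returns missing only when every argument is missing. This is similar to the distinction between using + and egen, rowtotal() in Stata (although rowtotal() returns 0 rather than missing when every argument is missing, unless its missing option is specified).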
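A small sketch of how + and sum() treat missing values, using hypothetical variables a, b and c in a data _null_ step that only writes to the log:

```sas
data _null_;
    a = 100;
    b = .;
    c = .;

    plus_ab  = a + b;         * missing because b is missing;
    sum_ab   = sum(a, b);     * 100 because the missing b is ignored;
    sum_bc   = sum(b, c);     * missing because every argument is missing;
    sum_0_bc = sum(0, b, c);  * 0 because the literal 0 is never missing;

    put plus_ab= sum_ab= sum_bc= sum_0_bc=;
run;
```

Adding a literal 0 as an extra argument is a common way to get a total that is never missing, if that is the behavior you want.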

The keep statement specifies which variables are written to the output data set. In SAS, a trailing : acts like a wildcard, similar to * in Stata, except that it can only appear at the end of a name: par_inc_2015: matches every variable whose name begins with par_inc_2015. Be mindful of this when naming variables.
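The following sketch shows why naming matters with the colon wildcard. All data set and variable names here are hypothetical, and it uses the keep= data set option, which accepts the same prefix lists as the keep statement.

```sas
* Hypothetical variable names chosen to show prefix matching;
data work.wide;
    id = 1;
    inc_2000 = 10;
    inc_2001 = 20;
    income_total = 30;
    total_inc = 40;
run;

* narrow keeps names starting with inc_ and broad keeps names starting with inc;
data work.narrow(keep=id inc_:) work.broad(keep=id inc:);
    set work.wide;
run;

proc contents data=work.narrow;
run;

proc contents data=work.broad;
run;
```

work.narrow should contain id, inc_2000 and inc_2001, while work.broad also picks up income_total because its name starts with inc. total_inc is in neither, since the wildcard cannot match the beginning of a name.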

First, let’s look at the input data set. Missing values are displayed here as NaN because we use Python to display the data. In SAS, missing numeric values are displayed as ., similar to Stata.

   pik  mom_pik  dad_pik  kid_married_2015  mom_inc_2000  dad_inc_2000
0  141      173      NaN               1.0     16474.893           NaN
1  138      149      NaN               0.0     21689.520    37348.6450
2  177      NaN      003               0.0     11420.902    14666.4970
3  146      013      004               0.0           NaN    34485.8130
4  104      187      005               1.0     52476.680     6003.8027
5  144      NaN      006               1.0     13744.854    20464.3630
6  083      025      007               0.0     22038.420    17941.8010
7  115      NaN      008               1.0     16084.973     6875.2935
8  170      157      009               1.0     17766.367    31663.4510
9  118      NaN      010               1.0     29356.543           NaN

* Set the directory where the data is; 
libname in "/media/sf_myfolders";

data in.fake_micro_10;
    set in.fake_micro(obs=100);
    
    * Set length of the marriage variable;
    length married 3; 
    
    * Create married indicator;
    married = (kid_married_2015 = 1);
    
    * Restrict to kids with a mom_pik or dad_pik;
    if ~missing(mom_pik) or ~missing(dad_pik);
    
    * Combine parent income, this will be missing if either component is missing;
    par_inc_2015_miss = mom_inc_2000 + dad_inc_2000;
    
    * Combine parent income, ignoring missing components (missing only if both are missing);
    par_inc_2015_nomiss = sum(mom_inc_2000, dad_inc_2000);
    
    keep pik mom_pik dad_pik par_inc_2015: ;
    
run; 

Now we look at the output data set, where we can see the difference between using + and sum(): the two variables differ in rows 0, 3, and 9, where one of the income components is missing.

   pik  mom_pik  dad_pik  par_inc_2015_miss  par_inc_2015_nomiss
0  141      173      NaN                NaN           16474.8930
1  138      149      NaN         59038.1650           59038.1650
2  177      NaN      003         26087.3990           26087.3990
3  146      013      004                NaN           34485.8130
4  104      187      005         58480.4827           58480.4827
5  144      NaN      006         34209.2170           34209.2170
6  083      025      007         39980.2210           39980.2210
7  115      NaN      008         22960.2665           22960.2665
8  170      157      009         49429.8180           49429.8180
9  118      NaN      010                NaN           29356.5430