DATA Step Example

In the DATA step below, we read in the dataset named in the set statement, in.fake_micro, keeping only the first 100 observations with the data set option obs=100 (we will discuss these options in more detail later). We then create several new variables, subset the data, and save the result to the output dataset, in.fake_micro_10.

The length statement sets the length of the new variable married to 3 bytes, the shortest numeric length on most platforms. By default, SAS gives new numeric variables a length of 8 bytes, the largest possible numeric length. This is wasteful for a 0/1 indicator, and you can save space by setting each variable's length to the minimum required to store its values. Be careful not to lose information by choosing a length that is too small (more on this later).
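As a minimal sketch of the risk of choosing too short a length, the step below stores the same integers at length 3 and at length 8. The dataset work.len_demo and the variables value, small and big are made up for illustration and are not part of fake_micro. Lengths are applied when an observation is written to the dataset, so the comparison has to be made on the saved data. The SAS documentation for Windows and Unix gives 8,192 as the largest integer a 3-byte numeric stores exactly, so the larger values here may not survive intact.

```sas
* Hypothetical example, work.len_demo and its variables are illustrative names;
data work.len_demo;
    * small is stored in 3 bytes, big in the default 8 bytes;
    length small 3 big 8;
    do value = 8190 to 8194;
        small = value;
        big = value;
        output;
    end;
run;

* Compare the stored values side by side;
proc print data=work.len_demo;
    var value small big;
run;
```

A 0/1 indicator such as married is well within the safe range for length 3.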

We create a married indicator by evaluating whether kid_married_2015 is equal to 1; the comparison returns 1 when true and 0 otherwise. We then restrict to rows with a non-missing mom_pik or dad_pik. An if statement without a then clause (a subsetting if) drops every observation that does not meet the condition (more on this later).
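The two steps just described can be sketched on a few made-up rows. The dataset work.demo and its values are hypothetical, and the sketch assumes the pik variables are character, as the leading zeros in the data suggest. The last step writes the same restriction with an explicit delete, which should keep the same rows as the subsetting if.

```sas
* Hypothetical input rows, a period marks a missing value in list input;
data work.demo;
    input pik $ mom_pik $ dad_pik $ kid_married_2015;
    datalines;
001 173 . 1
002 . . 0
003 . 004 .
;
run;

* Subsetting if, keeps rows where the condition is true;
data work.subset_if;
    set work.demo;
    * The comparison yields 1 or 0, a missing kid_married_2015 gives 0;
    married = (kid_married_2015 = 1);
    if not missing(mom_pik) or not missing(dad_pik);
run;

* Equivalent form, deletes rows where both parent ids are missing;
data work.subset_delete;
    set work.demo;
    married = (kid_married_2015 = 1);
    if missing(mom_pik) and missing(dad_pik) then delete;
run;
```

In both outputs the row with pik 002 is dropped, because neither parent id is present.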

Then we create two versions of a parent income variable. par_inc_2015_miss will be missing if mom_inc_2000 is missing, dad_inc_2000 is missing, or both. par_inc_2015_nomiss is missing only if both components are missing, since the sum() function ignores missing arguments, which here amounts to treating a single missing component as zero. This is similar to the distinction between using + and egen, rowtotal() in Stata.
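A small sketch makes the difference concrete, including the case where both components are missing, which is worth checking separately. The dataset work.sum_demo and its variables are made up for illustration. The sum() function returns missing when every argument is missing; adding a literal 0 as an extra argument is a common way to force a zero in that case.

```sas
* Hypothetical rows covering no missing, one missing and both missing;
data work.sum_demo;
    input mom_inc dad_inc;
    * Missing if either component is missing;
    plus_total = mom_inc + dad_inc;
    * Ignores missing arguments, missing only if both are missing;
    sum_total = sum(mom_inc, dad_inc);
    * The extra 0 guarantees a non-missing result;
    sum_zero = sum(mom_inc, dad_inc, 0);
    datalines;
100 200
100 .
. .
;
run;

proc print data=work.sum_demo;
run;
```

For the last row, plus_total and sum_total are both missing while sum_zero is 0.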

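The keep statement specifies which variables are written to the output dataset. In SAS, : acts as a wildcard, similar to * in Stata, except that it can only appear at the end of a name: par_inc_2015: matches every variable whose name begins with par_inc_2015. Be mindful of this when naming variables. Note that married is not listed in the keep statement, so it does not appear in the output dataset.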
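The prefix matching can be seen in a minimal sketch. The dataset work.keep_demo and the extra variables par_inc_2014 and other are hypothetical and only serve to show what is and is not matched.

```sas
* Hypothetical one-row dataset with four variables;
data work.keep_demo;
    par_inc_2015_miss = 1;
    par_inc_2015_nomiss = 2;
    par_inc_2014 = 3;
    other = 4;
    * Keeps only the variables whose names begin with par_inc_2015;
    keep par_inc_2015: ;
run;

* List the variables that were written out;
proc contents data=work.keep_demo;
run;
```

Only par_inc_2015_miss and par_inc_2015_nomiss are written. par_inc_2014 shares part of the prefix but not all of it, so it is dropped along with other.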

First, let’s look at the input data set. Missing values are displayed here as NaN because we use Python to display the data. In SAS, missing numeric values are displayed as ., similar to Stata.

	pik	mom_pik	dad_pik	kid_married_2015	mom_inc_2000	dad_inc_2000
0	141	173	NaN	1.0	16474.893	NaN
1	138	149	NaN	0.0	21689.520	37348.6450
2	177	NaN	003	0.0	11420.902	14666.4970
3	146	013	004	0.0	NaN	34485.8130
4	104	187	005	1.0	52476.680	6003.8027
5	144	NaN	006	1.0	13744.854	20464.3630
6	083	025	007	0.0	22038.420	17941.8010
7	115	NaN	008	1.0	16084.973	6875.2935
8	170	157	009	1.0	17766.367	31663.4510
9	118	NaN	010	1.0	29356.543	NaN

* Set the directory where the data is; 
libname in "/media/sf_myfolders";

data in.fake_micro_10;
    set in.fake_micro(obs=100);
    
    * Set length of the marriage variable;
    length married 3; 
    
    * Create married indicator;
    married = kid_married_2015=1;
    
    * Restrict to kids with a mom_pik or dad_pik;
    if ~missing(mom_pik) or ~missing(dad_pik);
    
    * Combine parent income, this will be missing if either component is missing;
    par_inc_2015_miss = mom_inc_2000 + dad_inc_2000;
    
    * Combine parent income, treating a single missing component as zero (missing only if both are missing);
    par_inc_2015_nomiss = sum(mom_inc_2000, dad_inc_2000);
    
    keep pik mom_pik dad_pik par_inc_2015: ;
    
run; 
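To inspect the result within SAS itself rather than through Python, something like the following should work, assuming the in libref from the step above is still assigned. proc contents lists the variables with their types and lengths, and proc print with the obs=10 data set option shows the first ten observations.

```sas
* List variable names, types and lengths in the output dataset;
proc contents data=in.fake_micro_10;
run;

* Print the first 10 observations;
proc print data=in.fake_micro_10(obs=10);
run;
```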

      
    

Now we look at the output data set, where we can see the difference between using + and sum.

	pik	mom_pik	dad_pik	par_inc_2015_miss	par_inc_2015_nomiss
0	141	173	NaN	NaN	16474.8930
1	138	149	NaN	59038.1650	59038.1650
2	177	NaN	003	26087.3990	26087.3990
3	146	013	004	NaN	34485.8130
4	104	187	005	58480.4827	58480.4827
5	144	NaN	006	34209.2170	34209.2170
6	083	025	007	39980.2210	39980.2210
7	115	NaN	008	22960.2665	22960.2665
8	170	157	009	49429.8180	49429.8180
9	118	NaN	010	NaN	29356.5430
