DATA Step Options¶
DATA step options provide a set of tools to interact with input and output datasets. They are declared in parentheses after the filenames in data or set statements:
If a data set option follows the set statement, the options apply to the data being read in. If it follows the data statement, SAS applies the action to the output dataset. The general order of operations is:
SAS applies
datastep options specified in thesetstatement to the input dataSAS evaluates the programming statements in the
DATAstepSAS applies
datastep options specified in thedatastatement to the output data
Last thing to keep in mind is that SAS always evaluates keep/drop options before rename options. This means you need to use the original variable names in the keep/drop options when using a rename option in the same statement.
Important DATA Step Options¶
Below we summarize some of the most important data step options and give examples of their use. You can find a comprehensive list here.
KEEP and DROP Options¶
The keep data step option allows us to only keep certain variables from the input or output dataset. Similarly, the drop data step option allows us to drop variables from the input or output dataset.
To use the keep/drop options, write keep = or drop = followed by the list of variables you would like to keep or drop.
Example¶
Below we drop the kid_married_2015 variable from the in.fake_micro dataset using the drop option. Then we create par_inc_2000 in the data step and only keep pik, mom_pik, dad_pik, and par_inc_2000 in the out.par_income output dataset using the keep option.
Here is the in.fake_micro data:
| pik | mom_pik | dad_pik | kid_married_2015 | mom_inc_2000 | dad_inc_2000 | |
|---|---|---|---|---|---|---|
| 0 | 141 | 173 | NaN | 1.0 | 16474.893 | NaN |
| 1 | 138 | 149 | NaN | 0.0 | 21689.520 | 37348.6450 |
| 2 | 177 | NaN | 003 | 0.0 | 11420.902 | 14666.4970 |
| 3 | 146 | 013 | 004 | 0.0 | NaN | 34485.8130 |
| 4 | 104 | 187 | 005 | 1.0 | 52476.680 | 6003.8027 |
| 5 | 144 | NaN | 006 | 1.0 | 13744.854 | 20464.3630 |
| 6 | 083 | 025 | 007 | 0.0 | 22038.420 | 17941.8010 |
| 7 | 115 | NaN | 008 | 1.0 | 16084.973 | 6875.2935 |
| 8 | 170 | 157 | 009 | 1.0 | 17766.367 | 31663.4510 |
| 9 | 118 | NaN | 010 | 1.0 | 29356.543 | NaN |
And the out.par_income dataset:
| pik | mom_pik | dad_pik | par_inc_2000 | |
|---|---|---|---|---|
| 0 | 141 | 173 | NaN | 16474.8930 |
| 1 | 138 | 149 | NaN | 59038.1650 |
| 2 | 177 | NaN | 003 | 26087.3990 |
| 3 | 146 | 013 | 004 | 34485.8130 |
| 4 | 104 | 187 | 005 | 58480.4827 |
| 5 | 144 | NaN | 006 | 34209.2170 |
| 6 | 083 | 025 | 007 | 39980.2210 |
| 7 | 115 | NaN | 008 | 22960.2665 |
| 8 | 170 | 157 | 009 | 49429.8180 |
| 9 | 118 | NaN | 010 | 29356.5430 |
RENAME Option¶
The rename data step option allows us to change variable names from the input or output datasets. If you use rename in the data statement, the variable name is changed in the output data set. If you use rename in the set statement, the variable name is changed in the input data set and the new name is used for all the subsequent DATA step commands.
To use the rename option to rename old_name to new_name we would write rename = (old_name = new_name).
Don’t forget that the rename option is evaluated after keep/drop options!
Example¶
In the example below, we load in the pik variable from in.fake_micro and immediately rename it to kid_pik — this is the name that should be used in subsequent data step commands. Then we restrict the data set to rows where the last digit of pik is 4 using substr and an if statement (more on these later). Finally, we only keep the kid_pik variable and rename kid_pik to pik. Note that we reference kid_pik in the keep statement, this is because keep is evaluated before we rename kid_pik to pik.
Here is the input data in.fake_micro:
| pik | mom_pik | dad_pik | kid_married_2015 | mom_inc_2000 | dad_inc_2000 | |
|---|---|---|---|---|---|---|
| 0 | 141 | 173 | NaN | 1.0 | 16474.893 | NaN |
| 1 | 138 | 149 | NaN | 0.0 | 21689.520 | 37348.6450 |
| 2 | 177 | NaN | 003 | 0.0 | 11420.902 | 14666.4970 |
| 3 | 146 | 013 | 004 | 0.0 | NaN | 34485.8130 |
| 4 | 104 | 187 | 005 | 1.0 | 52476.680 | 6003.8027 |
| 5 | 144 | NaN | 006 | 1.0 | 13744.854 | 20464.3630 |
| 6 | 083 | 025 | 007 | 0.0 | 22038.420 | 17941.8010 |
| 7 | 115 | NaN | 008 | 1.0 | 16084.973 | 6875.2935 |
| 8 | 170 | 157 | 009 | 1.0 | 17766.367 | 31663.4510 |
| 9 | 118 | NaN | 010 | 1.0 | 29356.543 | NaN |
And the output data out.kid_piks:
| pik | |
|---|---|
| 0 | 104 |
| 1 | 144 |
| 2 | 054 |
| 3 | 154 |
| 4 | 064 |
| 5 | 194 |
| 6 | 074 |
| 7 | 084 |
| 8 | 014 |
| 9 | 124 |
WHERE Option¶
The where option lets us select observations from the input or output data sets by applying conditions to the data. This is similar to using a use if in Stata:
To apply a condition using a where option we would write where = (condition).
Example¶
Recall that in the example above we used an if statement to restrict to rows where kid_pik ended in a 4. We make the same restriction below using a where option in the set statement.
Note
The where option is evaluated after the rename statement, so we use the name kid_pik in the where option.
Here is the input data in.fake_micro:
| pik | mom_pik | dad_pik | kid_married_2015 | mom_inc_2000 | dad_inc_2000 | |
|---|---|---|---|---|---|---|
| 0 | 141 | 173 | NaN | 1.0 | 16474.893 | NaN |
| 1 | 138 | 149 | NaN | 0.0 | 21689.520 | 37348.6450 |
| 2 | 177 | NaN | 003 | 0.0 | 11420.902 | 14666.4970 |
| 3 | 146 | 013 | 004 | 0.0 | NaN | 34485.8130 |
| 4 | 104 | 187 | 005 | 1.0 | 52476.680 | 6003.8027 |
| 5 | 144 | NaN | 006 | 1.0 | 13744.854 | 20464.3630 |
| 6 | 083 | 025 | 007 | 0.0 | 22038.420 | 17941.8010 |
| 7 | 115 | NaN | 008 | 1.0 | 16084.973 | 6875.2935 |
| 8 | 170 | 157 | 009 | 1.0 | 17766.367 | 31663.4510 |
| 9 | 118 | NaN | 010 | 1.0 | 29356.543 | NaN |
Here is the output data out.kid_piks:
| pik | |
|---|---|
| 0 | 104 |
| 1 | 144 |
| 2 | 054 |
| 3 | 154 |
| 4 | 064 |
| 5 | 194 |
| 6 | 074 |
| 7 | 084 |
| 8 | 014 |
| 9 | 124 |