Merging Data with PROC SQL¶
PROC SQL is the most useful and versatile SAS procedure. When you use PROC SQL, SAS can leverage some of the efficiencies of SQL processing, which can improve the processing time of your queries. The flexibility and specificity of SAS SQL statements are often preferable to the SAS provided alternatives.
PROC SQL steps are generally of the form:
The statements contained in parentheses are optional. When you write your PROC SQL statements, do not include the parentheses. If you omit the create table line, the result of you PROC SQL step will be written to the log but no data will be written to the disk. The optional group by and order by lines instruct SAS to collapse the data to a single row by the specified keys and sort the data by the specified keys, respectively.
Merging data in SAS using PROC SQL JOINS¶
Generally for our purposes, when we want to merge together two datasets in SAS we use JOIN commands in a PROC SQL step. All SQL joins are akin to many-to-many joins implemented by the joinby function in Stata — these joins are not on their face equivalent to 1:1, 1:m, or m:1 joins unless you ensure that the tables you are joining are unique on the join variables yourself.
Conceptually, SQL joins involve a “left” table and one or more “right” tables to be joined. All joins are Cartesian products made on the the specified merge key variables. This means if there are two duplicate matching keys in the left table and two duplicate matching keys in the right table, there will be four duplicate keys in the output dataset. Another way to phrase this is that each observation is assigned all of the rows that it matches to.
The most common join we use is the LEFT JOIN but all types are outlined below
LEFT JOIN¶

A LEFT JOIN selects all the obervations in the left table regardless of their key values but only selects observations with matching key values from the right table. Variables from the right table in rows where there is no match are assigned missing values. A LEFT JOIN is not commutative (i.e. reversing the order of the data sets can produce different results).
Example¶
In this example, we merge on state from out.kid_geos using a LEFT JOIN onto in.fake_micro and save the resulting dataset in out.left_join_example.
Here, we use the select statement to only keep certain variables from each dataset, labeled A or B. These labels are assignedbelow in the from and left join statements.
The merge key here is pik and we require that pik is not null for the left table (A).
The output data set out.left_join_example is:
| pik | mom_pik | dad_pik | state | |
|---|---|---|---|---|
| 0 | 001 | NaN | 034 | 24.0 |
| 1 | 002 | NaN | 117 | 48.0 |
| 2 | 003 | 001 | NaN | 81.0 |
| 3 | 004 | NaN | 101 | 24.0 |
| 4 | 005 | NaN | 058 | NaN |
| 5 | 006 | 010 | NaN | NaN |
| 6 | 007 | 172 | NaN | 1.0 |
| 7 | 008 | NaN | 017 | 37.0 |
| 8 | 009 | NaN | 029 | 2.0 |
| 9 | 010 | NaN | 142 | 38.0 |
RIGHT JOIN¶

A RIGHT JOIN selects all the observations in the right table. When observations do not match to the left table the key values are set to missing. A RIGHT JOIN, like a LEFT JOIN, is not commutative and is not the same as reversing the datasets in a LEFT JOIN.
Example¶
In the example below, we have the same setup as in the LEFT JOIN example, except that we use a RIGHT JOIN.
Note: out.kid_geos is a subset of out.fake_micro so pik will never be missing in the output data set since all the keys in out.kid_geos appear in out.fake_micro.
| pik | mom_pik | dad_pik | state | |
|---|---|---|---|---|
| 0 | 001 | NaN | 034 | 24.0 |
| 1 | 002 | NaN | 117 | 48.0 |
| 2 | 003 | 001 | NaN | 81.0 |
| 3 | 004 | NaN | 101 | 24.0 |
| 4 | 007 | 172 | NaN | 1.0 |
| 5 | 008 | NaN | 017 | 37.0 |
| 6 | 009 | NaN | 029 | 2.0 |
| 7 | 010 | NaN | 142 | 38.0 |
| 8 | 012 | NaN | 126 | 61.0 |
| 9 | 013 | NaN | 180 | 3.0 |
INNER JOIN¶

An INNER JOIN selects all the observations that have keys in both the left and the right data set.
An INNER JOIN is a logical AND between the two tables. Unlike a RIGHT JOIN or a LEFT JOIN, an INNER JOIN is commutative
so tables can be joined in either order.
Example¶
In the example below, we have the same setup as in the previous examples, except that we use an INNER JOIN.
| pik | mom_pik | dad_pik | state | |
|---|---|---|---|---|
| 0 | 138 | 149 | NaN | 2.0 |
| 1 | 177 | NaN | 003 | 87.0 |
| 2 | 146 | 013 | 004 | 71.0 |
| 3 | 104 | 187 | 005 | 21.0 |
| 4 | 083 | 025 | 007 | 13.0 |
| 5 | 115 | NaN | 008 | 28.0 |
| 6 | 170 | 157 | 009 | 65.0 |
| 7 | 051 | 058 | NaN | 35.0 |
| 8 | 054 | 034 | NaN | 99.0 |
| 9 | 111 | 004 | NaN | 2.0 |
FULL JOIN¶

A FULL JOIN selects all observations from both datasets and sets missing key values for observations that only appear in the right dataset.
A FULL JOIN is the logical OR between two datasets, but unlike an INNER JOIN it is not commutative because missing key values are
set based on the right dataset. We illustrate this below.
Example¶
In the example below, we have the same setup as in the previous examples, except that we use a FULL JOIN with out.kid_geos as the right dataset.
| pik | mom_pik | dad_pik | state | |
|---|---|---|---|---|
| 0 | 001 | NaN | 034 | 24.0 |
| 1 | 002 | NaN | 117 | 48.0 |
| 2 | 003 | 001 | NaN | 81.0 |
| 3 | 004 | NaN | 101 | 24.0 |
| 4 | 005 | NaN | 058 | NaN |
| 5 | 006 | 010 | NaN | NaN |
| 6 | 007 | 172 | NaN | 1.0 |
| 7 | 008 | NaN | 017 | 37.0 |
| 8 | 009 | NaN | 029 | 2.0 |
| 9 | 010 | NaN | 142 | 38.0 |
Reverse Example¶
Now lets reverse the datasets to illustrate why a FULL JOIN is not commutative.
| pik | state | mom_pik | dad_pik | |
|---|---|---|---|---|
| 0 | 001 | 24.0 | NaN | 034 |
| 1 | 002 | 48.0 | NaN | 117 |
| 2 | 003 | 81.0 | 001 | NaN |
| 3 | 004 | 24.0 | NaN | 101 |
| 4 | NaN | NaN | NaN | 058 |
| 5 | NaN | NaN | 010 | NaN |
| 6 | 007 | 1.0 | 172 | NaN |
| 7 | 008 | 37.0 | NaN | 017 |
| 8 | 009 | 2.0 | NaN | 029 |
| 9 | 010 | 38.0 | NaN | 142 |