Merging Data with PROC SQL

PROC SQL is the most useful and versatile SAS procedure. When you use PROC SQL, SAS can leverage some of the efficiencies of SQL processing, which can improve the processing time of your queries. The flexibility and specificity of SAS SQL statements are often preferable to the SAS provided alternatives.

PROC SQL steps are generally of the form:

proc sql; 
    (create table NEW_TABLE as)
        select VARIABLES
        from SOURCES 
        (group by KEYS)
        (order by KEYS)
        (where);
quit;

The statements contained in parentheses are optional. When you write your PROC SQL statements, do not include the parentheses. If you omit the create table line, the result of you PROC SQL step will be written to the log but no data will be written to the disk. The optional group by and order by lines instruct SAS to collapse the data to a single row by the specified keys and sort the data by the specified keys, respectively.

Merging data in SAS using PROC SQL JOINS

Generally for our purposes, when we want to merge together two datasets in SAS we use JOIN commands in a PROC SQL step. All SQL joins are akin to many-to-many joins implemented by the joinby function in Stata — these joins are not on their face equivalent to 1:1, 1:m, or m:1 joins unless you ensure that the tables you are joining are unique on the join variables yourself.

Conceptually, SQL joins involve a “left” table and one or more “right” tables to be joined. All joins are Cartesian products made on the the specified merge key variables. This means if there are two duplicate matching keys in the left table and two duplicate matching keys in the right table, there will be four duplicate keys in the output dataset. Another way to phrase this is that each observation is assigned all of the rows that it matches to.

The most common join we use is the LEFT JOIN but all types are outlined below

LEFT JOIN

image

A LEFT JOIN selects all the obervations in the left table regardless of their key values but only selects observations with matching key values from the right table. Variables from the right table in rows where there is no match are assigned missing values. A LEFT JOIN is not commutative (i.e. reversing the order of the data sets can produce different results).

Example

In this example, we merge on state from out.kid_geos using a LEFT JOIN onto in.fake_micro and save the resulting dataset in out.left_join_example.

Here, we use the select statement to only keep certain variables from each dataset, labeled A or B. These labels are assignedbelow in the from and left join statements.

The merge key here is pik and we require that pik is not null for the left table (A).

proc sql; 
    create table out.fake_micro_with_geo as
        select A.pik, A.mom_pik, A.dad_pik, B.state
            from out.fake_micro as A
            left join out.kid_geos as B
                on A.pik = B.pik and A.pik is not null;
quit;

The output data set out.left_join_example is:

pik mom_pik dad_pik state
0 001 NaN 034 24.0
1 002 NaN 117 48.0
2 003 001 NaN 81.0
3 004 NaN 101 24.0
4 005 NaN 058 NaN
5 006 010 NaN NaN
6 007 172 NaN 1.0
7 008 NaN 017 37.0
8 009 NaN 029 2.0
9 010 NaN 142 38.0

RIGHT JOIN

image

A RIGHT JOIN selects all the observations in the right table. When observations do not match to the left table the key values are set to missing. A RIGHT JOIN, like a LEFT JOIN, is not commutative and is not the same as reversing the datasets in a LEFT JOIN.

Example

In the example below, we have the same setup as in the LEFT JOIN example, except that we use a RIGHT JOIN.

Note: out.kid_geos is a subset of out.fake_micro so pik will never be missing in the output data set since all the keys in out.kid_geos appear in out.fake_micro.

proc sql; 
    create table out.fake_micro_with_geo as
        select A.pik, A.mom_pik, A.dad_pik, B.state
            from out.fake_micro as A
            right join out.kid_geos as B
                on A.pik = B.pik and B.pik is not null;
quit;
pik mom_pik dad_pik state
0 001 NaN 034 24.0
1 002 NaN 117 48.0
2 003 001 NaN 81.0
3 004 NaN 101 24.0
4 007 172 NaN 1.0
5 008 NaN 017 37.0
6 009 NaN 029 2.0
7 010 NaN 142 38.0
8 012 NaN 126 61.0
9 013 NaN 180 3.0

INNER JOIN

image

An INNER JOIN selects all the observations that have keys in both the left and the right data set. An INNER JOIN is a logical AND between the two tables. Unlike a RIGHT JOIN or a LEFT JOIN, an INNER JOIN is commutative so tables can be joined in either order.

Example

In the example below, we have the same setup as in the previous examples, except that we use an INNER JOIN.

proc sql; 
    create table out.fake_micro_with_geo as
        select A.pik, A.mom_pik, A.dad_pik, B.state
            from out.fake_micro as A
            inner join out.kid_geos as B
                on A.pik = B.pik and B.pik is not null;
quit;
pik mom_pik dad_pik state
0 138 149 NaN 2.0
1 177 NaN 003 87.0
2 146 013 004 71.0
3 104 187 005 21.0
4 083 025 007 13.0
5 115 NaN 008 28.0
6 170 157 009 65.0
7 051 058 NaN 35.0
8 054 034 NaN 99.0
9 111 004 NaN 2.0

FULL JOIN

image

A FULL JOIN selects all observations from both datasets and sets missing key values for observations that only appear in the right dataset. A FULL JOIN is the logical OR between two datasets, but unlike an INNER JOIN it is not commutative because missing key values are set based on the right dataset. We illustrate this below.

Example

In the example below, we have the same setup as in the previous examples, except that we use a FULL JOIN with out.kid_geos as the right dataset.

proc sql; 
    create table out.fake_micro_with_geo as
        select A.pik, A.mom_pik, A.dad_pik, B.state
            from out.fake_micro as A
            inner join out.kid_geos as B
                on A.pik = B.pik and B.pik is not null;
quit;
pik mom_pik dad_pik state
0 001 NaN 034 24.0
1 002 NaN 117 48.0
2 003 001 NaN 81.0
3 004 NaN 101 24.0
4 005 NaN 058 NaN
5 006 010 NaN NaN
6 007 172 NaN 1.0
7 008 NaN 017 37.0
8 009 NaN 029 2.0
9 010 NaN 142 38.0

Reverse Example

Now lets reverse the datasets to illustrate why a FULL JOIN is not commutative.

proc sql; 
    create table out.full_join_example as
        select A.pik, A.state, B.mom_pik, B.dad_pik
            from out.kid_geos as A
            full join out.fake_micro  as B
                on A.pik = B.pik and A.pik is not null;
quit;
pik state mom_pik dad_pik
0 001 24.0 NaN 034
1 002 48.0 NaN 117
2 003 81.0 001 NaN
3 004 24.0 NaN 101
4 NaN NaN NaN 058
5 NaN NaN 010 NaN
6 007 1.0 172 NaN
7 008 37.0 NaN 017
8 009 2.0 NaN 029
9 010 38.0 NaN 142

References

This file was referenced when creating this page.