{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Merging Data with PROC SQL\n",
    "\n",
    "`PROC SQL` is the most useful and versatile SAS procedure. When you use `PROC SQL`, SAS can leverage some of the efficiencies of SQL processing, which can improve the processing time of your queries. The flexibility and specificity of SAS SQL statements are often preferable to the SAS provided alternatives. \n",
    "\n",
    "`PROC SQL` steps are generally of the form:\n",
    "\n",
    "```sas\n",
    "proc sql; \n",
    "    (create table NEW_TABLE as)\n",
    "        select VARIABLES\n",
    "        from SOURCES \n",
    "        (group by KEYS)\n",
    "        (order by KEYS)\n",
    "        (where);\n",
    "quit;\n",
    "```\n",
    "\n",
    "The statements contained in parentheses are optional. When you write your `PROC SQL` statements, do not include the parentheses. If you omit the `create table` line, the result of you `PROC SQL` step will be written to the log but no data will be written to the disk. The optional `group by` and `order by` lines instruct SAS to collapse the data to a single row by the specified keys and sort the data by the specified keys, respectively. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Merging data in SAS using `PROC SQL` `JOINS`\n",
    "\n",
    "Generally for our purposes, when we want to merge together two datasets in SAS we use `JOIN` commands in a `PROC SQL` step. All SQL joins are akin to many-to-many joins implemented by the `joinby` function in Stata &mdash; these joins are not on their face equivalent to `1:1`, `1:m`, or `m:1` joins unless you ensure that the tables you are joining are unique on the join variables yourself.\n",
    "\n",
    "Conceptually, SQL joins involve a \"left\" table and one or more \"right\" tables to be joined. All joins are *Cartesian products* made on the the specified merge key variables. This means if there are two duplicate matching keys in the left table and two duplicate matching keys in the right table, there will be four duplicate keys in the output dataset. Another way to phrase this is that each observation is assigned all of the rows that it matches to. \n",
    "\n",
    "The most common join we use is the `LEFT JOIN` but all types are outlined below\n",
    "\n",
    "### LEFT JOIN\n",
    "\n",
    "![image](left_join.png \"LEFT JOIN\")\n",
    "\n",
    "A `LEFT JOIN` selects all the obervations in the left table regardless of their key values but only selects observations with matching key values from the right table. Variables from the right table in rows where there is no match are assigned missing values. A `LEFT JOIN` is **not** commutative (i.e. reversing the order of the data sets can produce different results).\n",
    "\n",
    "#### Example\n",
    "\n",
    "In this example, we merge on `state` from `out.kid_geos` using a `LEFT JOIN` onto `in.fake_micro` and save the resulting dataset in `out.left_join_example`. \n",
    "\n",
    "Here, we use the `select` statement to only keep certain variables from each dataset, labeled `A` or `B`. These labels are assignedbelow in the `from` and `left join` statements.\n",
    "\n",
    "The merge key here is `pik` and we require that `pik` is not null for the left table (`A`).\n",
    "\n",
    "```sas\n",
    "\n",
    "proc sql; \n",
    "    create table out.fake_micro_with_geo as\n",
    "        select A.pik, A.mom_pik, A.dad_pik, B.state\n",
    "            from out.fake_micro as A\n",
    "            left join out.kid_geos as B\n",
    "                on A.pik = B.pik and A.pik is not null;\n",
    "quit;\n",
    "\n",
    "```\n",
    "\n",
    "The output data set `out.left_join_example` is:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "popout"
    ]
   },
   "source": [
    "```{margin} Note\n",
    "The rows with missing `state` are rows that did not match to the right data set, they remain in the output data set but are assigned missing values of `state`.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pik</th>\n",
       "      <th>mom_pik</th>\n",
       "      <th>dad_pik</th>\n",
       "      <th>state</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>034</td>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>002</td>\n",
       "      <td>NaN</td>\n",
       "      <td>117</td>\n",
       "      <td>48.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>003</td>\n",
       "      <td>001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>81.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>004</td>\n",
       "      <td>NaN</td>\n",
       "      <td>101</td>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>005</td>\n",
       "      <td>NaN</td>\n",
       "      <td>058</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>006</td>\n",
       "      <td>010</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>007</td>\n",
       "      <td>172</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>008</td>\n",
       "      <td>NaN</td>\n",
       "      <td>017</td>\n",
       "      <td>37.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>009</td>\n",
       "      <td>NaN</td>\n",
       "      <td>029</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>010</td>\n",
       "      <td>NaN</td>\n",
       "      <td>142</td>\n",
       "      <td>38.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pik mom_pik dad_pik  state\n",
       "0  001     NaN     034   24.0\n",
       "1  002     NaN     117   48.0\n",
       "2  003     001     NaN   81.0\n",
       "3  004     NaN     101   24.0\n",
       "4  005     NaN     058    NaN\n",
       "5  006     010     NaN    NaN\n",
       "6  007     172     NaN    1.0\n",
       "7  008     NaN     017   37.0\n",
       "8  009     NaN     029    2.0\n",
       "9  010     NaN     142   38.0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_sas(\"../data/left_join_example.sas7bdat\", encoding=\"latin-1\");\n",
    "df.head(n=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### RIGHT JOIN\n",
    "\n",
    "![image](right_join.png \"RIGHT JOIN\")\n",
    "\n",
    "A `RIGHT JOIN` selects all the observations in the right table. When observations do not match to the left table the key values are set to missing. A `RIGHT JOIN`, like a `LEFT JOIN`, is not commutative and is **not** the same as reversing the datasets in a `LEFT JOIN`.\n",
    "\n",
    "#### Example\n",
    "\n",
    "In the example below, we have the same setup as in the `LEFT JOIN` example, except that we use a `RIGHT JOIN`. \n",
    "\n",
    "**Note:** `out.kid_geos` is a subset of `out.fake_micro` so `pik` will never be missing in the output data set since all the keys in `out.kid_geos` appear in `out.fake_micro`.\n",
    "\n",
    "```sas\n",
    "proc sql; \n",
    "    create table out.fake_micro_with_geo as\n",
    "        select A.pik, A.mom_pik, A.dad_pik, B.state\n",
    "            from out.fake_micro as A\n",
    "            right join out.kid_geos as B\n",
    "                on A.pik = B.pik and B.pik is not null;\n",
    "quit;\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{margin} Note \n",
    "Unlike in the example above, `state` is never missing because the `RIGHT JOIN` restricts to observations that appear in `out.kid_geos`, where `state` is never missing. Note that `pik = 005` and `pik = 006` are missing now (see the `LEFT JOIN` output), this is because they do not appear in `out.kid_geos`.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pik</th>\n",
       "      <th>mom_pik</th>\n",
       "      <th>dad_pik</th>\n",
       "      <th>state</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>034</td>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>002</td>\n",
       "      <td>NaN</td>\n",
       "      <td>117</td>\n",
       "      <td>48.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>003</td>\n",
       "      <td>001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>81.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>004</td>\n",
       "      <td>NaN</td>\n",
       "      <td>101</td>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>007</td>\n",
       "      <td>172</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>008</td>\n",
       "      <td>NaN</td>\n",
       "      <td>017</td>\n",
       "      <td>37.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>009</td>\n",
       "      <td>NaN</td>\n",
       "      <td>029</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>010</td>\n",
       "      <td>NaN</td>\n",
       "      <td>142</td>\n",
       "      <td>38.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>012</td>\n",
       "      <td>NaN</td>\n",
       "      <td>126</td>\n",
       "      <td>61.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>013</td>\n",
       "      <td>NaN</td>\n",
       "      <td>180</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pik mom_pik dad_pik  state\n",
       "0  001     NaN     034   24.0\n",
       "1  002     NaN     117   48.0\n",
       "2  003     001     NaN   81.0\n",
       "3  004     NaN     101   24.0\n",
       "4  007     172     NaN    1.0\n",
       "5  008     NaN     017   37.0\n",
       "6  009     NaN     029    2.0\n",
       "7  010     NaN     142   38.0\n",
       "8  012     NaN     126   61.0\n",
       "9  013     NaN     180    3.0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_sas(\"../data/right_join_example.sas7bdat\", encoding=\"latin-1\");\n",
    "df.head(n=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### INNER JOIN\n",
    "\n",
    "![image](inner_join.png \"INNER JOIN\")\n",
    "\n",
    "An `INNER JOIN` selects all the observations that have keys in **both** the left and the right data set. \n",
    "An `INNER JOIN` is a logical AND between the two tables. Unlike a `RIGHT JOIN` or a `LEFT JOIN`, an `INNER JOIN` is commutative \n",
    "so tables can be joined in either order. \n",
    "\n",
    "#### Example\n",
    "\n",
    "In the example below, we have the same setup as in the previous examples, except that we use an `INNER JOIN`. \n",
    "\n",
    "```sas\n",
    "proc sql; \n",
    "    create table out.fake_micro_with_geo as\n",
    "        select A.pik, A.mom_pik, A.dad_pik, B.state\n",
    "            from out.fake_micro as A\n",
    "            inner join out.kid_geos as B\n",
    "                on A.pik = B.pik and B.pik is not null;\n",
    "quit;\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pik</th>\n",
       "      <th>mom_pik</th>\n",
       "      <th>dad_pik</th>\n",
       "      <th>state</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>138</td>\n",
       "      <td>149</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>177</td>\n",
       "      <td>NaN</td>\n",
       "      <td>003</td>\n",
       "      <td>87.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>146</td>\n",
       "      <td>013</td>\n",
       "      <td>004</td>\n",
       "      <td>71.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>104</td>\n",
       "      <td>187</td>\n",
       "      <td>005</td>\n",
       "      <td>21.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>083</td>\n",
       "      <td>025</td>\n",
       "      <td>007</td>\n",
       "      <td>13.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>008</td>\n",
       "      <td>28.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>170</td>\n",
       "      <td>157</td>\n",
       "      <td>009</td>\n",
       "      <td>65.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>051</td>\n",
       "      <td>058</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>054</td>\n",
       "      <td>034</td>\n",
       "      <td>NaN</td>\n",
       "      <td>99.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>111</td>\n",
       "      <td>004</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pik mom_pik dad_pik  state\n",
       "0  138     149     NaN    2.0\n",
       "1  177     NaN     003   87.0\n",
       "2  146     013     004   71.0\n",
       "3  104     187     005   21.0\n",
       "4  083     025     007   13.0\n",
       "5  115     NaN     008   28.0\n",
       "6  170     157     009   65.0\n",
       "7  051     058     NaN   35.0\n",
       "8  054     034     NaN   99.0\n",
       "9  111     004     NaN    2.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_sas(\"../data/inner_join_example.sas7bdat\", encoding=\"latin-1\");\n",
    "df.head(n=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "source": [
    "### FULL JOIN\n",
    "\n",
    "![image](full_join.png \"FULL JOIN\")\n",
    "\n",
    "A `FULL JOIN` selects all observations from both datasets and sets missing key values for observations that only appear in the right dataset.\n",
    "A `FULL JOIN` is the logical OR between two datasets, but unlike an `INNER JOIN` it is not commutative because missing key values are \n",
    "set based on the right dataset. We illustrate this below. \n",
    "\n",
    "#### Example\n",
    "\n",
    "In the example below, we have the same setup as in the previous examples, except that we use a `FULL JOIN` with `out.kid_geos` as the right dataset.\n",
    "\n",
    "```sas\n",
    "proc sql; \n",
    "    create table out.fake_micro_with_geo as\n",
    "        select A.pik, A.mom_pik, A.dad_pik, B.state\n",
    "            from out.fake_micro as A\n",
    "            inner join out.kid_geos as B\n",
    "                on A.pik = B.pik and B.pik is not null;\n",
    "quit;\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "popout"
    ]
   },
   "source": [
    "```{margin} Note \n",
    "`out.kid_geos` is a subset of `out.fake_micro` so there will never be missing keys (`pik`). A `FULL JOIN` where the right dataset is a subset of the left dataset is equivalent to a `LEFT JOIN`. \n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pik</th>\n",
       "      <th>mom_pik</th>\n",
       "      <th>dad_pik</th>\n",
       "      <th>state</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>034</td>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>002</td>\n",
       "      <td>NaN</td>\n",
       "      <td>117</td>\n",
       "      <td>48.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>003</td>\n",
       "      <td>001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>81.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>004</td>\n",
       "      <td>NaN</td>\n",
       "      <td>101</td>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>005</td>\n",
       "      <td>NaN</td>\n",
       "      <td>058</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>006</td>\n",
       "      <td>010</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>007</td>\n",
       "      <td>172</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>008</td>\n",
       "      <td>NaN</td>\n",
       "      <td>017</td>\n",
       "      <td>37.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>009</td>\n",
       "      <td>NaN</td>\n",
       "      <td>029</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>010</td>\n",
       "      <td>NaN</td>\n",
       "      <td>142</td>\n",
       "      <td>38.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pik mom_pik dad_pik  state\n",
       "0  001     NaN     034   24.0\n",
       "1  002     NaN     117   48.0\n",
       "2  003     001     NaN   81.0\n",
       "3  004     NaN     101   24.0\n",
       "4  005     NaN     058    NaN\n",
       "5  006     010     NaN    NaN\n",
       "6  007     172     NaN    1.0\n",
       "7  008     NaN     017   37.0\n",
       "8  009     NaN     029    2.0\n",
       "9  010     NaN     142   38.0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_sas(\"../data/full_join_example.sas7bdat\", encoding=\"latin-1\");\n",
    "df.head(n=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Reverse Example\n",
    "\n",
    "Now lets reverse the datasets to illustrate why a `FULL JOIN` is not commutative. \n",
    "\n",
    "```sas\n",
    "proc sql; \n",
    "    create table out.full_join_example as\n",
    "        select A.pik, A.state, B.mom_pik, B.dad_pik\n",
    "            from out.kid_geos as A\n",
    "            full join out.fake_micro  as B\n",
    "                on A.pik = B.pik and A.pik is not null;\n",
    "quit;\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "popout"
    ]
   },
   "source": [
    "```{margin} Note \n",
    "We can see that observations 4 and 5 are missing keys (`pik`) because these observations appear in `out.fake_micro` but do not appear in `out.kid_geos`. These are the same observations that are missing `state` above.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pik</th>\n",
       "      <th>state</th>\n",
       "      <th>mom_pik</th>\n",
       "      <th>dad_pik</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>001</td>\n",
       "      <td>24.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>034</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>002</td>\n",
       "      <td>48.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>117</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>003</td>\n",
       "      <td>81.0</td>\n",
       "      <td>001</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>004</td>\n",
       "      <td>24.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>058</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>010</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>007</td>\n",
       "      <td>1.0</td>\n",
       "      <td>172</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>008</td>\n",
       "      <td>37.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>009</td>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>029</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>010</td>\n",
       "      <td>38.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>142</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pik  state mom_pik dad_pik\n",
       "0  001   24.0     NaN     034\n",
       "1  002   48.0     NaN     117\n",
       "2  003   81.0     001     NaN\n",
       "3  004   24.0     NaN     101\n",
       "4  NaN    NaN     NaN     058\n",
       "5  NaN    NaN     010     NaN\n",
       "6  007    1.0     172     NaN\n",
       "7  008   37.0     NaN     017\n",
       "8  009    2.0     NaN     029\n",
       "9  010   38.0     NaN     142"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_sas(\"../data/full_join_example_reverse.sas7bdat\", encoding=\"latin-1\");\n",
    "df.head(n=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### References\n",
    "\n",
    "[This](https://support.sas.com/resources/papers/proceedings/proceedings/sugi25/25/cc/25p109.pdf) file was referenced when creating this page. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}