{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DATA Step Example \n",
    "\n",
    "In the below `DATA` step, we read in a file specified by the `set` statement, `in.fake_micro`, retaining only the first 100 observations `(obs = 100)` using a data step option (we will discuss these in more detail later). We then create several new variables, subset the data, and save the result to the output dataset, `in.fake_micro_10`. The `length` statement sets the length of the new variable `married` to 3 - the shortest numeric length. SAS automatically sets new numeric variables to length 8, the largest possible numeric length. This is wasteful and you can save space by carefully setting all the lengths of your variables to the minimum required to store the data you need (although be careful to not lose information here by setting the variable length to be too small, more on this later).\n",
    "\n",
    "We create a `married` indicator by evaluating whether or not another variable is equal to 1. We then restrict to rows with non-missing `mom_pik` or `dad_pik`. An `if` statement without a subsequent `then` statement drops any observations that do not meet that criteria specified by the `if` condition (more on this later).\n",
    "\n",
    "Then we create two versions of a parent income variable. `par_inc_2015_miss` will be missing if `mom_inc_2000` is missing or `dad_inc_2000` is missing or both. `par_inc_2015_nomiss` will never be missing, since the `sum()` function assigns missing arguments to zero. This is similar to the distinction between using `+` and `egen, rowtotal()` in Stata. \n",
    "\n",
    "The `keep` statement specifies which variables are to be kept and written to the output file. In SAS, `:` acts like a wildcard and is similar to the `*` in Stata, except it can only be used for variable name suffixes. Be mindful of this when naming variables. \n",
    "\n",
    "First let's look at the input data set. Note that missing values here are displayed as `NaN` because we use Python to display the data. In SAS missing values are displayed as `.` similar to Stata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pik</th>\n",
       "      <th>mom_pik</th>\n",
       "      <th>dad_pik</th>\n",
       "      <th>kid_married_2015</th>\n",
       "      <th>mom_inc_2000</th>\n",
       "      <th>dad_inc_2000</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>141</td>\n",
       "      <td>173</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.0</td>\n",
       "      <td>16474.893</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>138</td>\n",
       "      <td>149</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>21689.520</td>\n",
       "      <td>37348.6450</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>177</td>\n",
       "      <td>NaN</td>\n",
       "      <td>003</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11420.902</td>\n",
       "      <td>14666.4970</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>146</td>\n",
       "      <td>013</td>\n",
       "      <td>004</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>34485.8130</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>104</td>\n",
       "      <td>187</td>\n",
       "      <td>005</td>\n",
       "      <td>1.0</td>\n",
       "      <td>52476.680</td>\n",
       "      <td>6003.8027</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>144</td>\n",
       "      <td>NaN</td>\n",
       "      <td>006</td>\n",
       "      <td>1.0</td>\n",
       "      <td>13744.854</td>\n",
       "      <td>20464.3630</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>083</td>\n",
       "      <td>025</td>\n",
       "      <td>007</td>\n",
       "      <td>0.0</td>\n",
       "      <td>22038.420</td>\n",
       "      <td>17941.8010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>008</td>\n",
       "      <td>1.0</td>\n",
       "      <td>16084.973</td>\n",
       "      <td>6875.2935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>170</td>\n",
       "      <td>157</td>\n",
       "      <td>009</td>\n",
       "      <td>1.0</td>\n",
       "      <td>17766.367</td>\n",
       "      <td>31663.4510</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>118</td>\n",
       "      <td>NaN</td>\n",
       "      <td>010</td>\n",
       "      <td>1.0</td>\n",
       "      <td>29356.543</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pik mom_pik dad_pik  kid_married_2015  mom_inc_2000  dad_inc_2000\n",
       "0  141     173     NaN               1.0     16474.893           NaN\n",
       "1  138     149     NaN               0.0     21689.520    37348.6450\n",
       "2  177     NaN     003               0.0     11420.902    14666.4970\n",
       "3  146     013     004               0.0           NaN    34485.8130\n",
       "4  104     187     005               1.0     52476.680     6003.8027\n",
       "5  144     NaN     006               1.0     13744.854    20464.3630\n",
       "6  083     025     007               0.0     22038.420    17941.8010\n",
       "7  115     NaN     008               1.0     16084.973     6875.2935\n",
       "8  170     157     009               1.0     17766.367    31663.4510\n",
       "9  118     NaN     010               1.0     29356.543           NaN"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "fake_micro = pd.read_sas(\"../data/fake_micro.sas7bdat\", encoding=\"latin-1\");\n",
    "fake_micro.head(n=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "popout"
    ]
   },
   "source": [
    "````{margin}\n",
    "```{admonition} Tip\n",
    ":class: tip\n",
    "The `~` is the logical not symbol in SAS, you may be familiar with `!` as the logical not (as in Stata, although you can use `~` in Stata as well).\n",
    "```\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove_output"
    ]
   },
   "source": [
    "```sas\n",
    "* Set the directory where the data is; \n",
    "libname in \"/media/sf_myfolders\";\n",
    "\n",
    "data in.fake_micro_10;\n",
    "    set in.fake_micro(obs=100);\n",
    "    \n",
    "    * Set length of the marriage variable;\n",
    "    length married 3; \n",
    "    \n",
    "    * Create married indicator;\n",
    "    married = kid_married_2015=1;\n",
    "    \n",
    "    * Restrict to kids with a mom_pik or dad_pik;\n",
    "    if ~missing(mom_pik) or ~missing(dad_pik);\n",
    "    \n",
    "    * Combine parent income, this will be missing if either component is missing;\n",
    "    par_inc_2015_miss = mom_inc_2000 + dad_inc_2000;\n",
    "    \n",
    "    * Combine parent_income, this will treat missing values as zero and will never be missing;\n",
    "    par_inc_2015_nomiss = sum(mom_inc_2000, dad_inc_2000);\n",
    "    \n",
    "    keep pik mom_pik dad_pik par_inc_2015: ;\n",
    "    \n",
    "run; \n",
    "```\n",
    "\n",
    "No we look at the output data set and we can see the difference between using `+` and `sum`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": [
     "remove_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pik</th>\n",
       "      <th>mom_pik</th>\n",
       "      <th>dad_pik</th>\n",
       "      <th>par_inc_2015_miss</th>\n",
       "      <th>par_inc_2015_nomiss</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>141</td>\n",
       "      <td>173</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>16474.8930</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>138</td>\n",
       "      <td>149</td>\n",
       "      <td>NaN</td>\n",
       "      <td>59038.1650</td>\n",
       "      <td>59038.1650</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>177</td>\n",
       "      <td>NaN</td>\n",
       "      <td>003</td>\n",
       "      <td>26087.3990</td>\n",
       "      <td>26087.3990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>146</td>\n",
       "      <td>013</td>\n",
       "      <td>004</td>\n",
       "      <td>NaN</td>\n",
       "      <td>34485.8130</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>104</td>\n",
       "      <td>187</td>\n",
       "      <td>005</td>\n",
       "      <td>58480.4827</td>\n",
       "      <td>58480.4827</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>144</td>\n",
       "      <td>NaN</td>\n",
       "      <td>006</td>\n",
       "      <td>34209.2170</td>\n",
       "      <td>34209.2170</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>083</td>\n",
       "      <td>025</td>\n",
       "      <td>007</td>\n",
       "      <td>39980.2210</td>\n",
       "      <td>39980.2210</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>008</td>\n",
       "      <td>22960.2665</td>\n",
       "      <td>22960.2665</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>170</td>\n",
       "      <td>157</td>\n",
       "      <td>009</td>\n",
       "      <td>49429.8180</td>\n",
       "      <td>49429.8180</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>118</td>\n",
       "      <td>NaN</td>\n",
       "      <td>010</td>\n",
       "      <td>NaN</td>\n",
       "      <td>29356.5430</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pik mom_pik dad_pik  par_inc_2015_miss  par_inc_2015_nomiss\n",
       "0  141     173     NaN                NaN           16474.8930\n",
       "1  138     149     NaN         59038.1650           59038.1650\n",
       "2  177     NaN     003         26087.3990           26087.3990\n",
       "3  146     013     004                NaN           34485.8130\n",
       "4  104     187     005         58480.4827           58480.4827\n",
       "5  144     NaN     006         34209.2170           34209.2170\n",
       "6  083     025     007         39980.2210           39980.2210\n",
       "7  115     NaN     008         22960.2665           22960.2665\n",
       "8  170     157     009         49429.8180           49429.8180\n",
       "9  118     NaN     010                NaN           29356.5430"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fake_micro = pd.read_sas(\"../data/fake_micro_10.sas7bdat\", encoding=\"latin-1\");\n",
    "fake_micro.head(n=10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}