Code Riffs: Stata and Regression Tables

There are different approaches to getting publication-quality regression tables, and three main camps one can belong to: camp “copy paste,” camp Excel, and camp LaTeX. I chose the last in order to automate the production of tables as much as possible. In the process I wrote a piece of code that I use whenever I want to produce such a table. I find that it gives me plenty of flexibility while producing very good-looking tables (although that is more LaTeX doing a great job than anything else). In this post, I will walk through the main snippet I use. Consider this a long, but simple, guide to taking full control of your tables in Stata and customizing them. You begin with data, produce regression estimates, and end up with a LaTeX table that is fine-tuned to your preferences. And the best part: it is ALL done within Stata, one piece of code to rule your table, all of your tables. No more having to change Stata code, produce half of the table, and then muck around in LaTeX until it works.

I use estout, a user-written command that gives control over many of the post-estimation results and how they are incorporated into the table. Notice that in the following I change the delimiter right before and right after the estout command itself. This is because I tend to stick with Stata’s default carriage return delimiter (each line is a new command), but I find that for this particular case changing the delimiter to “;” is more or less necessary to produce easily readable code. Those of you who are already on the “;” bandwagon can just remove those two lines at the beginning and the end. First, let us look at the code that loads some data provided by Stata (the National Longitudinal Survey of Young Women and Mature Women) and runs a few regressions:

webuse nlsw88, clear 

gen logWage = log(wage)

* South subsample 
* Without industry FE 
reg logWage union collgrad married i.race if south == 1, cluster(industry)
estadd scalar clusterN = e(N_clust), replace 
estadd local industryFE = " ", replace 
estimates store model1 
* With industry FE 
reg logWage union collgrad married i.race i.industry if south == 1, cluster(industry)
estadd scalar clusterN = e(N_clust), replace 
estadd local industryFE = "X", replace 
estimates store model2 

* Full sample  
* Without industry FE 
reg logWage union collgrad married i.race, cluster(industry)  
estadd scalar clusterN = e(N_clust), replace 
estadd local industryFE = " ", replace 
estimates store model3 
* With industry FE 
reg logWage union collgrad married i.race i.industry, cluster(industry)  
estadd scalar clusterN = e(N_clust), replace 
estadd local industryFE = "X", replace 
estimates store model4

Line numbers here count every line of the listing, including blank lines and comments. Line 1 simply loads the data, and in line 3 we take the natural logarithm of the hourly wage. We run two regressions for the subsample of women in the South. The first regression, in line 7, regresses log wage on dummy variables for union membership, college degree, married status, and race. This is a simple and boring regression just meant to provide us with some coefficients. Standard errors are clustered at the industry level. I mention this because in line 8 I add a scalar to the estimation results that Stata stores. That scalar is already stored in the results as e(N_clust), but this is just to illustrate how one can add such results to the estimation data. Next I add a local with an empty space to indicate that I did not include industry fixed effects. On line 10 I store the results in memory as model1.
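To see what estadd actually changes, it can help to look at the stored results before and after. The following is a minimal sketch, assuming the estout package (which also supplies estadd) is available from SSC; the commented install line is an assumption about your setup and is only needed once.

```stata
* One-time setup, assuming estout is not yet installed (it also provides estadd)
* ssc install estout

webuse nlsw88, clear
gen logWage = log(wage)

reg logWage union collgrad married i.race if south == 1, cluster(industry)

* e(N_clust) is already among the stored results
display e(N_clust)

* Add a copy under a new name, plus a string marker for the FE row
estadd scalar clusterN = e(N_clust), replace
estadd local industryFE = " ", replace

* The new entries now show up alongside the built-in ones
ereturn list
```

After the two estadd calls, ereturn list should show clusterN among the scalars and industryFE among the macros, which are the names the stats() option picks up later.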

Best practice here would be to save the results to disk (using estimates save) and then in a different file, load those estimates (using estimates use followed by estimates store). This type of workflow will allow you to have one do-file where you run the analysis and produce the estimates, which could be computationally intensive, and another file that simply uses the results to produce the tables. That way you can tailor and customize the tables without having to rerun the analysis each time.
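A minimal sketch of that two-file workflow could look like the following. The file names analysis.do and tables.do are hypothetical and only mark where each half would live; the estimates are written to model1.ster in the current working directory.

```stata
* --- analysis.do (hypothetical file name): run the regression and save it ---
webuse nlsw88, clear
gen logWage = log(wage)

reg logWage union collgrad married i.race if south == 1, cluster(industry)
estadd scalar clusterN = e(N_clust), replace
estadd local industryFE = " ", replace

* Save the current estimation results (including the estadd entries) to disk
estimates save "model1", replace

* --- tables.do (hypothetical file name): load the results and name them ---
estimates use "model1"
estimates store model1
```

The estadd calls come before estimates save so that the extra scalar and local are part of what gets written to disk. Repeating the save and use/store pair for each model gives the table file everything it needs without rerunning the regressions.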

In line 12 we repeat this regression but include industry fixed effects. Notice that in line 14 we now add “X” as the string attached to the results. Call this model2 and move on to replicate these two regressions without the condition if south == 1. We will call this the full sample. We now have four sets of regression results, which we turn over to estout to produce a nice-looking table:

#delimit ; 
estout model1
       model2
       model3
       model4
       using "table.tex", 
       style(tex) 
       cells(b(star fmt(2)) se(par fmt(2))) 
       label 
       stats(industryFE
             r2 
             N 
             clusterN, fmt(0 3 %9.0gc %9.0gc)
             labels("\hline Industry FE"
                   "\hline \(R^2\)" 
                   "N" 
                   "Clusters")) 
       mlabels(,none)  
       numbers
       collabels(none) 
       varlabels(union "Union Member" 
                 collgrad "College Degree" 
                 2.race "Race, Black" 
                 3.race "Race, Other" 
                 married "Married") 
       starl(* 0.1 ** 0.05 *** 0.01)   
       keep(union collgrad married 2.race 3.race)              
       order(union collgrad 2.race 3.race married) 
       prehead( 
           \begin{table}[h]
           \refstepcounter{table}            
           \label{table:Results}            
           \centering
           \colorlet{tempColor}{RedOrange}
           \colorlet{RedOrange}{black}
           \textbf{Table ??. Wage Regressions} \\
           \textbf{Dependent Variable: Log(Hourly Wage), Measured in \\$1988 }
           \colorlet{RedOrange}{tempColor}
           \begin{tabular}{@{\extracolsep{4pt}}l*{@M}{c}@{}} 
           \hline \hline 
           & \multicolumn{2}{c}{\textbf{South Subsample}} &
           \multicolumn{2}{c}{\textbf{Full Sample}} \\
           \cline{2-3}  
           \cline{4-5}              
       )
       posthead(\hline) 
       prefoot() 
       postfoot(
           \noalign{\smallskip} \hline \hline 
           \end{tabular}
           \medskip
           \begin{minipage}{0.6\textwidth}
           \footnotesize \justify Notes: \( @starlegend \). 
           Standard errors clustered at the industry level. 
           Extra notes go here.
           \end{minipage}        
           \end{table}
       )
       replace;
#delimit cr

Break down what each line is doing:

Again, just setting the delimiter to “;”
Identify the results we want to include in this table by naming the stored estimation results.
Calling the estout command and telling it to save the table as a .tex file by the name of table.
style(tex) tells estout we want this to be a LaTeX table.
The cells option specifies what values from the estimates it should report as well as options on how to report them. Here we are simply reporting the coefficients b, including stars to indicate statistical significance using star, and are formatting it such that we are keeping two digits after the decimal point, fmt(2). Same follows for the standard errors se, only now we surround them by parentheses, par.
label will tell estout to automatically use the variable label instead of its name, unless we explicitly specify a new temporary label under varlabels.
Line 10 is important. This is where we add additional information to the table, and format the table structure. I begin by specifying the names of the locals and scalars that I wish to add to the table: industryFE (a string taking the value “ ” or “X”), r2 (the R-squared of each model), N (the number of observations), and clusterN (the number of clusters in each regression).
Then follow the options, beginning with formatting. Since industryFE is a string, I “keep” 0 digits beyond the decimal point. The R-squared is rounded to three digits, and the number of observations and the number of clusters are formatted with commas where appropriate.
The next step places labels on each newly added element. I want industryFE to read as “Industry FE” in the table, but I also want a line to separate that part from the estimated coefficients. I achieve that by including a piece of LaTeX code in the label: “\hline Industry FE” which will produce the desired result, even though it looks weird right now.
Again, I separate the fixed effects section from the other statistics with a line and use a bit more tex code for the R-Squared.
This stats part can be customized to include whichever information you want, in the order you want it, and in the format you want it. I find this way of specifying fixed effects much preferable to using indicate, especially when using xtreg or reghdfe (which do not produce coefficients for the fixed effects, so there is nothing to indicate).
mlabels lets you specify the column header for each model. This is useful when the dependent variable varies by column, or the estimation changes from OLS, to FE, to IV, or something like that. Here I force no text in the columns by specifying the none option.
I do however want column numbers so I write exactly that, numbers.
collabels is yet another option for customizing the headings in each column. As a default, I leave it as none. I cannot think of a case where I deviated from that.
Line 21 introduces another key component of table sorcery: labeling your coefficients. If all your coefficients have labels, and those labels are exactly how you would like them to appear in the table, then just ignore this part, as the label option is already taking care of that. But if you want to change those labels, or you did not bother labeling to begin with, or you have newly created coefficients because you used factor variables, then varlabels is your friend.
There is no magic here, just the name of the variable followed by the string you want attached to it. Order does not matter. Notice that you need to know the name Stata uses to store the coefficient, which is a simple transformation when using factor variables (2.race and 3.race here). With experience you will know how Stata names these, but if you are ever at a loss, typing matrix list e(b) in the Command window after running a regression will help.
Line 26 sets the significance thresholds and the stars (or other symbols) that denote them.
Line 27 tells the program which variables to keep and display in the table. This is useful when you have many control variables that you do not want to report in the table. Alternatively, you can use drop().
Line 28 sets the order in which the variables appear.
Line 29 is where we start injecting LaTeX code.
1. Line 30 creates a table float.
2. For total control on where and how the caption of the table appears, I use lines 31, 32, and 36.
3. All you need to remember from here is the label you assigned the table so you can use that in your document.
4. If you use colors to indicate links to the references in the text, then you will need to use this small trick of temporarily changing the color so that it does not appear in the header of the table. This is done in lines 34, 35, and 38.
5. Line 39 opens the tabular environment. estout replaces the @M placeholder with the number of models, so you get the right number of columns. The \extracolsep{4pt} setting lets you set the distance between columns separately for each table.
6. From here it is simple table formatting using multicolumn and cline. Notice you can inject more code in the posthead() and prefoot() sections.
7. Line 50 ends the tabular environment, and line 52 opens a minipage for the notes section.
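The coefficient names needed for varlabels(), keep(), and order() can be looked up right after estimation. Here is a minimal sketch using the same data and one of the regressions from above; it only relies on built-in Stata commands.

```stata
webuse nlsw88, clear
gen logWage = log(wage)

reg logWage union collgrad married i.race i.industry if south == 1, cluster(industry)

* The column names of e(b) are the names estout expects in
* varlabels(), keep(), and order()
matrix list e(b)

* Alternative: replay the regression with a legend of coefficient names
reg, coeflegend
```

In this output the non-base race levels should appear as 2.race and 3.race, matching the names used in the estout call, while the omitted base level is listed with a b marker and can simply be left out of keep().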
That is it. Using this type of template lets you control and fine-tune the table. All that is left is to use \input{table.tex} in your document to incorporate the table. The final result looks like this:

[Image: table1]
