Economics Meets Data Science: ML and Economics Together at Last

Hello again, Internet! It's December now, the season of presents, Christmas carols, parties, catching colds and eating fried chicken. This time I'm taking a break from the Structural Estimation Series. Instead, I'll discuss a topic that I have been playing with recently: Double Machine Learning estimators for causal inference. This is a topic that is gonna have a huge impact on empirical Economics, but that most people don't seem to have heard of yet.

The Background

For about six years now I have been hearing about Economics adopting (or being replaced by) Machine Learning (ML). It was around that time that Hal Varian published his paper Big Data: New Tricks for Econometrics, and Google published its paper on causal inference using Bayesian structural time series (BSTS). The future looked bright.

In my dreams, this was me waking up to a new world in Economics:

https://media.giphy.com/media/QsPiApKClOffvxIXUy/giphy.gif

Years passed and I'm still waiting for that promised land. Behind all the hype around ML models there were some issues that needed to be addressed before they could be introduced into the typical economist's toolbox:

ML models are black boxes, and economists really care about understanding the meaning of the coefficients of the model. We are the kind of people who have fun seeing shapes in the clouds.

Also, economists care a lot about bias. Machine Learning algorithms are mostly used for prediction. To be useful for that task, algorithms must be good out-of-sample predictors. To achieve this, regularization must be applied, which limits how closely the model can fit the data in a given sample. In other words, they accept some bias in exchange for lower variance. In comparison, most econometric models, especially in the policy evaluation literature, don't really attempt to be good out-of-sample predictors, and prefer unbiased/consistent estimators that are as precise as possible given a sample size. Thus, the purposes of Economics and ML seem incompatible at first glance.

Finally, when trying to identify a causal effect or a structural parameter, you usually need to know how variable your estimate is, and perform hypothesis testing. ML methods don't offer an easy way of knowing the variance of the millions of parameters they employ. Furthermore, you're probably not interested in what the distribution of the 34th embedding of a word looks like asymptotically. You can try to use the bootstrap to obtain the variance of the million-plus parameters in your neural network that took you two days to estimate, but good luck with that.

So Why Bother?

If the inconveniences are so many, then why did economists want ML so badly? Well, the advantages are there:

In many situations you can sacrifice some interpretability and employ some devices to obtain a consistent estimate, without caring much about their meaning. This way of thinking is becoming more popular thanks to new non-parametric methods.

Some types of data need to be transformed into high-dimensional representations to be useful at approximating a given function, which often means employing even millions of parameters. Typical methods employed in Economics attempt to estimate the variance-covariance matrix involving all the parameters, which can run out of degrees of freedom in the presence of very complex data. Regardless of this, high-dimensional data can lend itself to all sorts of econometric magic. For example, network data can be employed to control for the effect of unobservables thanks to the phenomenon of homophily, which makes network structure a good predictor of certain features of the data. Image data can be employed to measure and/or control for the beauty premium. Once econometric methods adapt to new types of data, the limit is only your creativity.

Functional misspecification is a problem in Economics. Unless you have strong theoretical foundations on why the decision process happens to be the simple linear specification you chose, most of the time you're only guessing. If your guess is wrong, you potentially introduce bias. I've seen researchers employ polynomials to get around that problem, but doing that wastes degrees of freedom, and the type of polynomials that you can put into a regression is inferior as a universal approximator compared to a neural network.

No matter how strong the Force is in your linear model, it may not be enough.

https://media.giphy.com/media/Za97sdmXk48B8dac8h/giphy.gif

It was not until recently that a good way of introducing ML into Economics was finally proposed: Double/Debiased Machine Learning (DML).

What is DML in a nutshell?

Explained in a few words, this technique employs ML models to obtain consistent causal estimates with nice asymptotic properties. In a broader sense, the authors propose a type of score that can be used to obtain root-n consistent estimates of causal effects in a variety of settings.

Consider the following partially linear specification:

$\begin{align} Y = \theta D + g(X) + U \tag{1} \end{align}$

$\begin{align} D = m(X) + V \tag{2} \end{align}$

Here, $Y$ is the variable we want to explain (grades, salaries, you name it), and $D$ is the treatment variable of interest, which can be discrete or continuous. $\theta$ is the parameter we're interested in. The functions $g()$ and $m()$ are arbitrary functions of the characteristics $X$, and as a group they are usually called nuisance parameters. Assume that the errors $U$ and $V$ are unrelated to each other and to $X$; formally, $E[U|X,D]=0$ and $E[V|X]=0$.

Let's see all the possible scenarios we have here:

If we knew the shapes of both functions and could observe $X$, we could easily obtain a consistent estimate of $\theta$.

If we could not observe $X$, the relationship between the omitted variable $X$ and $D$ would bias the estimate of $\theta$.

If we could observe $X$ but didn't know the functional forms of $m()$ and $g()$, then the estimates using our best guess would potentially have a misspecification bias.
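The three scenarios can be checked on simulated data. The sketch below is only an illustration: the shapes chosen for `g` and `m`, the value $\theta = 0.5$, the uniform $X$ and the standard normal errors are all assumptions made up for this example. It uses nothing but the Python standard library.

```python
import math
import random

THETA = 0.5  # assumed true effect, picked arbitrarily for the illustration


def g(x):
    # made-up nuisance function in equation (1)
    return 2.0 * math.sin(2 * math.pi * x) + x ** 2


def m(x):
    # made-up nuisance function in equation (2)
    return math.sin(2 * math.pi * x)


def simulate(n, seed=0):
    """Draw (X, D, Y) from equations (1)-(2) with X ~ U(0, 1) and N(0, 1) errors."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ds = [m(x) + rng.gauss(0, 1) for x in xs]
    ys = [THETA * d + g(x) + rng.gauss(0, 1) for x, d in zip(xs, ds)]
    return xs, ds, ys


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def linear_residuals(x, y):
    """Residuals of y after a straight-line fit on x."""
    b, mx, my = slope(x, y), mean(x), mean(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


xs, ds, ys = simulate(5000)

# Scenario 1: g is known, so subtract it and regress what is left on D.
oracle = slope(ds, [y - g(x) for x, y in zip(xs, ys)])

# Scenario 2: X is not observed, so Y is regressed on D alone.
naive = slope(ds, ys)

# Scenario 3: X is observed but we wrongly guess a linear g.
# By the Frisch-Waugh-Lovell theorem, regressing Y on D and X equals
# regressing the X-residuals of Y on the X-residuals of D.
misspecified = slope(linear_residuals(xs, ds), linear_residuals(xs, ys))

print(f"known g: {oracle:.2f}  no X: {naive:.2f}  linear guess: {misspecified:.2f}")
```

With these particular functions, the first estimate should land close to 0.5, while the other two should sit visibly above it, because $X$ moves both $D$ and $Y$ in ways the regression does not account for.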

In this last case, where we observe $X$ but don't know the functional forms, we can still get rid of the misspecification by employing a very flexible approximator. Then, we can obtain the following estimator of $\theta$:

$\begin{align} \hat{\theta}_{DML} = \left(\dfrac{1}{n}\sum_{i\in I} \hat{v}_i^2\right)^{-1}\dfrac{1}{n}\sum_{i\in I}\hat{v}_i\hat{w}_i \tag{3} \end{align}$

Here, $I$ is the sample, $\hat{v}_i = D_i - \hat{m}(X_i)$ and $\hat{w}_i = Y_i - \hat{l}(X_i)$ are residuals, and $l(X) = E\left[Y|X\right]$. $\hat{l}(X)$ and $\hat{m}(X)$ are estimated employing whatever ML method with good properties.
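To see why regressing one residual on the other recovers $\theta$, here is a short derivation. It assumes $E[U|X,D]=0$ and $E[V|X]=0$, writes $l(X) = E[Y|X]$, and introduces $W = Y - l(X)$ as shorthand just for this step.

$\begin{align} l(X) &= E[Y|X] = \theta m(X) + g(X) \\ W = Y - l(X) &= \theta\left(D - m(X)\right) + U = \theta V + U \\ \theta &= \dfrac{E[VW]}{E[V^2]} \end{align}$

The first line takes the conditional expectation of (1) given $X$ and plugs in (2). The second subtracts that from (1). The third follows because $V$ is a function of $(X, D)$, so $E[UV]=0$. Equation (3) is the sample analogue, with the estimated residuals $\hat{v}_i$ and $\hat{w}_i$ standing in for $V$ and $W$.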

The fact that two ML estimators are necessary is what gives this technique its name. And by the way, if you think that looks suspiciously similar to instrumental variables and GMM, then your intuition is right. The scores that lead to estimators are in fact used as moment conditions.

The authors show that this estimator is consistent, but it's not efficient. That's because the residuals are computed on the same sample that was used to train the models, which introduces an overfitting bias. This bias disappears as the sample gets larger, but not very fast. We can do better:

1- Randomly divide the sample into two subsamples of equal size, let's say $A$ and $B$.
2- Fit $\hat{l}^A(X)$ and $\hat{m}^A(X)$ using sample $A$, then use those models to compute the residuals $\hat{w}_i$ and $\hat{v}_i$ for the observations in sample $B$.
3- Use those residuals to obtain $\hat{\theta}_{DML}^B$, the DML estimate for sample $B$.
4- Repeat the same process with the samples switched: fit $\hat{l}^B(X)$ and $\hat{m}^B(X)$ using sample $B$, compute the residuals for sample $A$, and obtain $\hat{\theta}_{DML}^A$.
5- Calculate the simple average $\hat{\theta}_{DML} = \frac{\hat{\theta}_{DML}^A + \hat{\theta}_{DML}^B}{2}$.

And that's it! That estimator is consistent. Additionally, it can be root-n consistent if the ML models converge to the true model at a fast-enough rate.
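The five steps above can be sketched in a few lines of Python. This is a toy version under loud assumptions: the data are simulated with made-up nuisance functions and $\theta = 0.5$, and the "ML method" is a plain binned-mean regression (a histogram regressor for $X$ in $[0, 1)$) that stands in for whatever learner you would really use. The function names are hypothetical and do not come from any library.

```python
import math
import random

THETA = 0.5  # assumed true effect for the simulation


def simulate(n, seed=0):
    """Draw (X, D, Y) from equations (1)-(2) with made-up g and m."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ds = [math.sin(2 * math.pi * x) + rng.gauss(0, 1) for x in xs]
    ys = [THETA * d + 2 * math.sin(2 * math.pi * x) + x ** 2 + rng.gauss(0, 1)
          for x, d in zip(xs, ds)]
    return xs, ds, ys


def fit_binned_mean(x_train, y_train, n_bins=20):
    """Stand-in learner: predict the training mean of y within each bin of [0, 1)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(x_train, y_train):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(y_train) / len(y_train)  # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def theta_from_residuals(v, w):
    """Equation (3): the 1/n factors cancel, leaving sum(v*w) / sum(v*v)."""
    return sum(a * b for a, b in zip(v, w)) / sum(a * a for a in v)


def fold_estimate(train, test):
    """Fit l-hat and m-hat on `train`, then residualize and estimate on `test`."""
    x_tr, d_tr, y_tr = train
    x_te, d_te, y_te = test
    m_hat = fit_binned_mean(x_tr, d_tr)
    l_hat = fit_binned_mean(x_tr, y_tr)
    v = [d - m_hat(x) for x, d in zip(x_te, d_te)]
    w = [y - l_hat(x) for x, y in zip(x_te, y_te)]
    return theta_from_residuals(v, w)


def cross_fit_dml(xs, ds, ys, seed=0):
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # step 1: random split
    half = len(idx) // 2

    def take(ids):
        return [xs[i] for i in ids], [ds[i] for i in ids], [ys[i] for i in ids]

    a, b = take(idx[:half]), take(idx[half:])
    theta_b = fold_estimate(train=a, test=b)  # steps 2-3
    theta_a = fold_estimate(train=b, test=a)  # step 4
    return (theta_a + theta_b) / 2            # step 5


xs, ds, ys = simulate(4000)
theta_hat = cross_fit_dml(xs, ds, ys)
print(f"cross-fitted DML estimate: {theta_hat:.3f} (true value {THETA})")
```

The key design point is that each fold's residuals come from models that never saw that fold. Swapping `fit_binned_mean` for a random forest, a lasso or a neural network changes nothing else in the recipe.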

This is just a very brief introduction to the technique; the topic is too broad to be covered in a single post. If you wanna learn more about it, this video of Chernozhukov explaining it might be a good place to start.


Finally

You can be sure an econometric technique is here to stay when big names such as Chernozhukov (whose research on quantile decomposition helped me get through my PhD), Duflo (Nobel Prize in Economics 2019) and Newey (a legend in the non-parametric estimation literature) are involved. My bet is that DML is gonna become a more mainstream technique in the future.

The payoff for learning ML for economists is certainly high, so you might want to start learning as soon as possible if you're new to the field. The new edition of Bruce Hansen's Econometrics has a full chapter on ML, so there you go if you still prefer traditional sources of knowledge. Otherwise, go to Kaggle and you'll be on your way real soon.

The only complaint from economists I can expect now is that ML models can take a long time to train. So if you were wondering what to ask Santa for this Christmas, here's my suggestion: a good GPU! And with Half-Life: Alyx coming up soon, you have good reasons to go for a full gaming PC.

Do it for SCIENCE!

References

Brodersen, Kay H., Fabian Gallusser, Jim Koehler, Nicolas Remy and Steven L. Scott (2015) “Inferring causal impact using Bayesian structural time-series models”, Annals of Applied Statistics, 9:247–274.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney K. Newey and James M. Robins (2018) “Double/debiased machine learning for treatment and structural parameters”, The Econometrics Journal, 21(1):C1–C68.