Hello again, Internet! It's December now, the season of presents, Christmas carols, parties, catching colds, and eating fried chicken. This time I'm taking a break from the Structural Estimation series. Instead, I'll discuss a topic I've been playing with recently: Double Machine Learning estimators for causal inference. It's a topic that is gonna have a huge impact on empirical Economics, but one that most people don't seem to have heard of yet.
For about six years now I have been hearing about Economics adopting (or being replaced by) Machine Learning (ML). That was around the time Hal Varian published his paper Big Data: New Tricks for Econometrics, and Google published its paper on causal inference with Bayesian structural time series (BSTS). The future looked bright.
Years passed and I'm still waiting for that promised land. Behind all the hype around ML models there were some issues that needed to be addressed before they could be introduced into the typical economist's toolbox:
ML models are black boxes, and economists really care about understanding the meaning of the coefficients of the model. We are the kind of people who have fun seeing shapes in the clouds.
Also, economists care a lot about bias. Machine Learning algorithms are mostly used for prediction, and to be useful for that task they must predict well out of sample. Achieving this requires regularization, which limits how closely the model can fit the data in a given sample: the algorithm accepts some bias in exchange for lower variance. In contrast, most econometric models, especially in the policy evaluation literature, don't really try to predict well out of sample; they favor unbiased (or consistent) estimators that are as precise as possible for a given sample size. Thus, at first glance, the purposes of Economics and ML seem incompatible.
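The bias-variance trade-off is easy to see in a tiny simulation. Here's a minimal sketch (a toy example of my own, not from any particular paper) comparing OLS with ridge regression, where the penalty `lam` is the regularization knob: ridge shrinks the coefficient estimates toward zero (bias), but they vary less from sample to sample (lower variance):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
true_beta = np.ones(p)  # every true coefficient equals 1

def fit(X, y, lam):
    # Ridge solution (X'X + lam*I)^{-1} X'y; lam = 0 gives plain OLS
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

ols_est, ridge_est = [], []
for _ in range(500):  # draw many samples from the same population
    X = rng.normal(size=(n, p))
    y = X @ true_beta + rng.normal(size=n)
    ols_est.append(fit(X, y, 0.0))
    ridge_est.append(fit(X, y, 25.0))

ols_est, ridge_est = np.array(ols_est), np.array(ridge_est)
# Regularization pulls estimates toward zero (bias)...
print(ols_est.mean(), ridge_est.mean())
# ...but the estimates fluctuate less across samples (lower variance)
print(ols_est.var(), ridge_est.var())
```

Averaged over the simulated samples, the OLS estimates center on the true value of 1 while the ridge estimates sit noticeably below it, yet the ridge estimates have the smaller spread.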
Finally, when trying to identify a causal effect or a structural parameter, you usually need to know how variable your estimate is and perform hypothesis tests. ML methods don't offer an easy way to obtain the variance of the millions of parameters they employ. Besides, you're probably not interested in the asymptotic distribution of the 34th embedding dimension of a word. You could try the bootstrap to get the variance of the million-plus parameters in a neural network that took two days to train, but good luck with that.
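To be fair, the bootstrap itself is simple; the obstacle is purely computational when each refit takes days. For an estimator that is cheap to refit, it's a few lines. Here's a minimal sketch (toy numbers of my own choosing) of the nonparametric pairs bootstrap for an OLS slope, checked against the classical standard error:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)  # true slope 2, error variance 1

def slope(x, y):
    # OLS slope for a single regressor without intercept
    return float(x @ y / (x @ x))

# Pairs bootstrap: resample (x, y) pairs with replacement, refit each time
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    boot.append(slope(x[idx], y[idx]))

se_boot = np.std(boot)
se_analytic = 1.0 / np.sqrt(x @ x)  # classical SE with known error variance 1
print(se_boot, se_analytic)
```

The two standard errors come out close to each other. The 1000 refits are instant here because each fit is a one-line formula; replace `slope` with a model that trains for two days and the same loop becomes a multi-year job, which is exactly the problem.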