Sansan Tech Blog

Sharing the technology, design, and product management insights of the members who support Sansan's product development

Economics Meets Data Science: The Structural Estimation Series Part IV



Hola Again

I’m DSOC’s Juan. It's been a while. Shockingly, it's September already (can you believe it?). I've been a big fan of astronomy since I was a child, and this year I got many chances to watch the night sky. The Perseid meteor shower was visible this year thanks to the good weather, although the Moon was a bit too bright. But I can't complain: most years it gets cloudy during the peak, so I'd say we got lucky this time. It's gonna get cold soon, so if you want to go outdoors and spend the night watching the sky, September is your month.

It's time to continue with The Structural Estimation Series. In the last post I talked about how Dynamic Discrete Choice (DDC) models can be described by the search for a fixed point, which happens both in value space and in probability space. This led us to two important functions that are constantly used by many algorithms:  \Phi and  \Lambda. This time we will discuss how to estimate the utility function parameters by employing three algorithms that are fundamental to understanding most of the literature on DDC models.

When it comes to understanding structural estimation models, I propose a two-step approach (you'll see a lot of this below):

  1. Isolate the fundamental and simple ideas
  2. Go for the details

This post, as well as everything in this series, is about Step 1. For the rest, check the papers in the References section at the end.

The Estimation Problem

There are several ways of estimating DDCs, including Maximum Likelihood, Least Squares, Simulations, Moment Inequalities, etc. I think the easiest starting point is Maximum Likelihood estimation.

We attempt to find

 \theta^* = \arg\max_{\theta}{L(\theta)}

Where  L(\theta) is the likelihood of the sample given the model as described by  \theta. Now define  l(\theta) as the log-likelihood. Under the assumptions introduced in the previous post, we can decompose it into the following parts:

 l(\theta) = l_1(\theta) + l_2(\theta_f)

Where:


l_1(\theta) = \sum_{i=1}^N{ \sum_{t=1}^{T_i}{ \ln{P(d_{it} | x_{it}; \theta_u, \theta_f ) } } }


l_2(\theta_f) = \sum_{i=1}^N{ \sum_{t=1}^{T_i}{ \ln{f(x_{it+1} | x_{it}, d_{it}; \theta_f ) } } }

The first part corresponds to the likelihood of the agent's choices given the state, and the second part to the likelihood of the evolution of the state-space.

Here  \theta_u represents the parameters of the utility function and  \theta_f the parameters of the transition function. When the state-space is discrete,  \theta_f represents the transition probabilities.
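
To fix the notation, here is a minimal sketch of how the two pieces could be computed for a discrete state-space. The arrays and the helper are hypothetical placeholders, not part of Source DDC:

import numpy as np

def log_likelihood_parts(choices, states, next_states, ccp, transition):
    # ccp[d, x] = P(d | x), shape (n_choices, n_states)
    # transition[d, x, x_next] = f(x_next | x, d), shape (n_choices, n_states, n_states)
    l1 = np.log(ccp[choices, states]).sum()
    l2 = np.log(transition[choices, states, next_states]).sum()
    return l1 + l2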

There are three ways of estimating the parameters.

The first one is simply operating on the full log-likelihood function  l(\theta). The disadvantage here is that there can be tons of parameters which need to be updated simultaneously, so this can be quite a challenge even for models that are not that complicated. For example, remember that almost all of the cells in the transition matrices are parameters that require estimation.

A second alternative was proposed in Rust (1987) as a three-step procedure:

  1. Estimate the transition probabilities  \theta_f first by optimizing  l_2(\theta_f) (see the sketch after this list).
  2. Use the estimates from Step 1,  \hat{\theta_f}, to estimate  \theta_u by optimizing the partial likelihood  l_1(\theta).
  3. Employ both estimates from the steps above as starting values for a final run of the optimization algorithm.
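
For a discrete state-space, Step 1 barely needs an optimizer: the maximum likelihood estimate of each transition probability is just the observed frequency. A minimal sketch, using a hypothetical helper rather than anything from Source DDC:

import numpy as np

def estimate_transition_probabilities(choices, states, next_states, n_choices, n_states):
    # Count every observed (choice, state) -> next_state transition
    counts = np.zeros((n_choices, n_states, n_states))
    np.add.at(counts, (choices, states, next_states), 1)
    counts += 1e-9  # tiny smoothing so never-visited cells don't divide by zero
    return counts / counts.sum(axis=2, keepdims=True)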

This approach should yield similar estimates to those of the full likelihood method, with the advantage that the last step begins from an advantageous position. However, some heavy lifting can't be avoided, as the last step still has to deal with a potentially large parameter space.

As it turns out, the parameters obtained at Step 2 happen to be consistent. That's nice, but they are also inefficient. The reason lies in the error in  \hat{\theta_f}, which contaminates the estimates of  \theta_u. Adjustments to the Step 2 parameters have been proposed, but these tend to also be very computationally expensive.

A final alternative is to just stop at Step 2 and call it a day. As explained by Rust & Phelan (1997), estimates from Step 2 are equivalent to full maximum likelihood ones when the information matrix is block-diagonal with respect to the partition  \theta = (\theta_u, \theta_f). In fact, Rust (1987) found that the Step 2 estimates were very close to the full maximum likelihood ones in the case of the Harold Zurcher model. Rust & Phelan (1997) employ this alternative as well. For the purpose of explaining the different algorithms with clarity, I will avoid full maximum likelihood estimation and stick to this simpler alternative. Keep in mind that the estimates obtained this way are only good as long as the assumptions above are close to being satisfied.

Estimating the Utility function parameters

Note that there are two sets of parameters that need to be estimated: the conditional choice probabilities and the utility function parameters. Both sets need to be consistent with the fixed point we talked about in the previous post. Also, any estimation algorithm must find both sets in some order. There are several ways of performing this estimation, but there are three of them that everyone should know, so I will introduce them here.

Writing the whole implementation would require including many details that would distract you from the central point of this post. So to avoid that, I will make use of the implementation in our open source library Source DDC. The algorithms are implemented as classes that extend the abstract MaximumLikelihoodDDCModel class. Each instance of that class contains the following properties:

  • p: the conditional choice probabilities as an array of shape (n_choices, n_states, 1).
  • v: the value at each choice and discrete state level as an array of shape (n_choices, n_states, 1).
  • transition_matrix: the transition matrix of the state-space as an array of shape (n_choices, n_states, n_states).
  • utility_function: a function that takes the parameters, choices and states and returns an array of shape (n_choices, n_states, 1) representing the utility value of each choice and state.
  • discount_factor: a float value in the range [0, 1).
  • state_manager: an instance of the StateManager convenience class that simplifies the use of complex state-spaces.
  • endog: the choice data from the sample.
  • exog: the states data from the sample.

If you need some help understanding these components, you can check the blog post that DSOC researcher Komatsu wrote about our Econ Fiesta event, where we include a Google Colab notebook showcasing the use of Source DDC.
buildersbox.corp-sansan.com

The library's README file may also come in handy.

As with any maximum likelihood algorithm, all of the methods below calculate the log-likelihood of the sample for different values of the target parameters. They differ in the steps that they perform to obtain that value. So now we will look at what happens at each iteration of the optimization algorithm when it is presented with some parameter values. In particular, we will look at how they calculate the negative log-likelihood of the observations through Statsmodels' nloglikeobs method. Remember that self in the functions below refers to the instance of each algorithm class.
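
If you haven't used Statsmodels' generic likelihood machinery before, here is a minimal, hypothetical example (a plain logit, unrelated to DDC models) showing where nloglikeobs fits in:

import numpy as np
from statsmodels.base.model import GenericLikelihoodModel

class ToyLogit(GenericLikelihoodModel):
    # The optimizer calls nloglikeobs repeatedly with new parameter
    # values until the likelihood stops improving.
    def nloglikeobs(self, params):
        z = self.exog @ params
        p = 1.0 / (1.0 + np.exp(-z))
        pr = np.where(self.endog == 1, p, 1.0 - p)
        return -np.log(pr)  # negative log-likelihood of each observation

# model = ToyLogit(endog, exog)
# result = model.fit(start_params=np.zeros(exog.shape[1]))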

Nested Fixed Point algorithm (NFXP)

This is the method proposed by John Rust in his iconic paper about Harold Zurcher:

  1. Use some initial estimate  \hat{\theta_u} to find the fixed point in probability or value space.
  2. Take the conditional choice probabilities at the fixed point to calculate the negative log-likelihood of the sample.
  3. Find a new set of  \hat{\theta_u} by using the derivatives of the log-likelihood function.
  4. Rinse and repeat until convergence in parameters (which should imply convergence to the fixed point in probability and value spaces).

The implementation looks like this:

def nloglikeobs(self, parameters):
    # Find the fixed point in probability space
    self.probability_iteration(parameters)
    
    # Use the conditional choice probabilities to calculate the negative log-likelihood of the sample
    pr = self.p[self.endog.ravel(), self.exog.ravel()]
    ll = -np.log(pr).sum()
    return ll

Here, the probability_iteration method finds the fixed point of the conditional choice probabilities by jumping many times between  \Phi and  \Lambda.
The implementation is as follows:

def probability_iteration(self, parameters):
    converged = False
    while not converged:
        p_0 = self.p
        self.v = phi_map(self.p,
                         self.transition_matrix,
                         parameters,
                         self.utility_function,
                         self.discount_factor,
                         self.state_manager)
        self.p = lambda_map(self.v)  # <- ATTENTION: we are updating the CCPs with the new value at each iteration.
        delta = np.max(np.abs(self.p - p_0))  # largest absolute change in the CCPs
        if delta <= self.p_convergence_criterion:
            converged = True
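
For intuition, here is a minimal sketch of what maps like phi_map and lambda_map compute under the type-1 extreme value assumption from the previous post. The actual Source DDC implementations differ in their details:

import numpy as np

GAMMA = 0.5772156649  # Euler-Mascheroni constant (mean of the T1EV shocks)

def phi_map_sketch(p, F, u, beta):
    # p: CCPs (n_choices, n_states, 1); F: transitions (n_choices, n_states, n_states);
    # u: utilities (n_choices, n_states, 1)
    eu = (p * (u + GAMMA - np.log(p))).sum(axis=0)   # expected flow utility under the CCPs
    Fp = (p * F).sum(axis=0)                         # state transitions induced by the CCPs
    # The costly step: a linear solve that scales as O(n_states^3)
    V = np.linalg.solve(np.eye(Fp.shape[0]) - beta * Fp, eu)
    return u + beta * F @ V                          # choice-specific values

def lambda_map_sketch(v):
    # Logit choice probabilities: a softmax across the choice axis
    ev = np.exp(v - v.max(axis=0, keepdims=True))
    return ev / ev.sum(axis=0, keepdims=True)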

As you can see, this method has to perform the fixed point iteration each time that it is presented with a new value of  \theta_u. This is very costly, especially because of the matrix inversion that happens in  \Phi (the linear solve in the sketch above), which becomes burdensome very quickly as the complexity of the state-space increases. So we have good reasons to avoid that inversion whenever possible. This leads us to the following method.

Hotz & Miller's two-step algorithm

The reason why NFXP has to jump between  \Phi and  \Lambda until convergence is that, for each value of  \theta_u, it tries to find the conditional choice probabilities at the fixed point before evaluating the likelihood of the sample. You could say that the algorithm is obsessed with finding the most accurate value of the likelihood function at each step. That would not be necessary if we had consistent estimates of the CCPs to begin with. The CCP method by Hotz & Miller (1993) does this with a one-two punch approach:

  1. Consistently estimate the conditional choice probabilities. This needs to be done just once (a sketch follows this list).
  2. Take those probabilities as immutable, and calculate the likelihood by  \Phi and  \Lambda once. This generates a new likelihood value which can be used to update  \hat{\theta_u}. Repeat this process until convergence.
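
Step 1 is usually done nonparametrically. A minimal frequency-count sketch, again with a hypothetical helper rather than the library's actual estimator:

import numpy as np

def estimate_ccps(endog, exog, n_choices, n_states):
    # Share of times each choice was observed at each state
    counts = np.zeros((n_choices, n_states))
    np.add.at(counts, (endog.ravel(), exog.ravel()), 1)
    counts += 1e-9  # tiny smoothing keeps log(p) finite at never-visited cells
    p = counts / counts.sum(axis=0, keepdims=True)
    return p[..., None]  # shape (n_choices, n_states, 1), matching self.p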

The log-likelihood calculation at each optimization step looks like this:

def nloglikeobs(self, parameters):

    # Take one jump from P -> V -> P
    self.v = phi_map(self.p,
                     self.transition_matrix,
                     parameters,
                     self.utility_function,
                     self.discount_factor,
                     self.state_manager)
    p = lambda_map(self.v)
    # IMPORTANT: note that this time we're not setting the new value of the conditional choice probabilities!
    # That way, any change in the new value of P is only due to different utility function parameters.
    
    # Use that new value of P to calculate the likelihood of the sample
    pr = p[self.endog.ravel(), self.exog.ravel()]
    ll = -np.log(pr).sum()
    return ll

At the end of this procedure, the parameters converge, and with them the value array.

This is much more computationally efficient, but it assumes that we have consistent estimates of the conditional choice probabilities at hand, which then poses the question of how to get them. Aguirregabiria & Mira proposed a way of performing both steps at the same time, which leads us to the last algorithm.

Nested Pseudo-Likelihood (NPL)

This is the algorithm proposed by Aguirregabiria & Mira (2002), and the concept is actually very simple. It swaps the order of the steps in the NFXP algorithm: Instead of solving for the fixed point at each value of the utility function parameters, it solves for the utility function parameters at each step of the fixed point routine. It goes like this:

  1. Take some initial values of  \hat{\theta_u} and the conditional choice probabilities  P (these don't need to be consistent estimates of the true  P).
  2. Apply Step 2 of the CCP algorithm.
  3. Take the new values of  \hat{\theta_u} and the CCPs and repeat Step 2.
  4. Rinse and repeat until convergence in the probability space.

In Python, it would look like this (I modified the original code for clarity):

def estimate(self):
    #  First, perform the Step 2 in Hotz and Miller's algorithm using the initial values of the CCPs.
    results = hotz_miller_step_2(initial_params, initial_ccps)
    
    # Store the updated CCPs
    self.p = lambda_map(self.v)
    converged = False
    while not converged:
        # Store the previous values of the CCPs.
        p_0 = self.p
        
        # Use the new parameters and CCPs to find the next values of both
        results = hotz_miller_step_2(results.params, self.p)
        self.p = lambda_map(self.v)  # refresh the stored CCPs with the newly implied values
        
        # Measure how much the new CCP values have diverged from the previous iteration values.
        delta = np.max(np.abs(self.p - p_0))
        
        # If the difference is small enough, stop iterating, otherwise keep going.
        if delta <= self.p_convergence_criterion:
            converged = True

    return results

The computation is lighter because the parameter update step is usually cheaper than the application of  \Phi, so it makes sense to perform it more often in exchange for fewer fixed point iterations. As it turns out, if you used the true values of the CCPs as the starting point for NPL and stopped at Step 2, you would get the same result as the two-step algorithm by Hotz & Miller. The only reason NPL needs to update the CCPs and repeat is that we don't know the true CCPs. Therefore, starting the NPL algorithm from CCPs that are close to the true values can help speed it up.

And that's it for today

In this post I talked about how to estimate the utility function parameters, introducing some of the most fundamental algorithms in the literature. If you were to implement these, you'd find yourself needing to verify that your code does what it's supposed to do. In web development it is common practice to include unit tests, which check exactly that for different parts of the code. With structural models, you'll want to produce artificial data and verify that your algorithm returns values that are asymptotically close to the true parameters. So the next stop in this series is about doing exactly that, by simulating a DDC model.

See you then~

References

  • Aguirregabiria, V., Mira, P., (2002) "Swapping the Nested Fixed Point Algorithm: A Class of Estimators for Discrete Markov Decision Models," Econometrica 70(4):1519-1543.
  • Hotz, V., Miller, R. (1993) "Conditional Choice Probabilities and the Estimation of Dynamic Models," The Review of Economic Studies 60(3):497-529. http://www.jstor.org/stable/2298122
  • Rust, J. (1987) "Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher," Econometrica 55(5):999–1033.
  • Rust, J., Phelan, C., (1997) "How Social Security and Medicare Affect Retirement Behavior In a World of Incomplete Markets," Econometrica 65:781-831.



Economics Meets Data Science: The Structural Estimation Series, Part I
Economics Meets Data Science: The Structural Estimation Series, Part II
Economics Meets Data Science: Coding With Style
Economics Meets Data Science: The Structural Estimation Series Part III

© Sansan, Inc.