Sansan Tech Blog


Economics Meets Data Science: The Structural Estimation Series Part III

Today's Agenda

Hello again. I'm DSOC's Juan. It's time for the part III of the Structural Estimation Series. In the last post I described the dynamic optimization problem of Harold Zurcher according to Rust (1987), and presented some important concepts such as the transition function, the value function and the structural parameters. Here I plan to achieve three things:

  1. Discuss how information that is unobservable to the researcher is typically included in the decision process model.
  2. Define two important functions that form the basis for the estimation of structural parameters of most DDC problems. Once you get these down, the rest becomes considerably easier.
  3. Show how to express both functions in code. This blog post is batteries-included. At the end, you can find a gist with some Python code implementation of the functions I will discuss here, showing how they work. You can open the gist in Google Colab and get your hands dirty with the code.

Since this post is a continuation of the Part II, the numbering of the equations also starts from where we left last time (just in case you wonder why it starts in 5).

This is a busy post, let's begin.

A Model With Unobservable Information

So far we have been assuming that we, as researchers, can observe all the information available to the agent. However, there is always some information affecting the choices of the agent that cannot be observed. Therefore, even if we could calculate the utility of all the alternatives, the agent might not choose what we believe is the maximizing option. Economists often assume that these deviations are random, and follow some probability distribution.

The two following assumptions are commonly found in most DDCs:

Additivity Separability

The utility function can be expressed in the following way:

u(x, i, \theta_u) + \epsilon

Note that I'm dropping the time index for the sake of simplicity. Instead, whenever I refer to a value one time step into the future I will use an apostrophe, as in  x'.

Conditional independence

When stochastic components enter the decision process the possibility arises that these random deviations affect the evolution of the state. For the purpose of simplification, it is common to assume that stochastic deviations materialize as a result of the current state, and vary by choice, and that they only affect the evolution of state through their effect on the decision made by the agent. These errors are therefore serially uncorrelated, which simplifies things. The transition of the state and the deviations can therefore be stated as follows:

p(x', \epsilon'| x, i, \epsilon; \theta_p) = g(\epsilon' | x'; \theta_g)f(x' | x, i; \theta_f)

Here,  g() is the distribution of errors, parameterized by  \theta_g. Note also that this distribution is the same for all agents at all points in time.

Discrete State with finite support

In the last post I only talked in very broad terms about the state. In this post, we will assume that  x is a discrete variable. The evolution behaves as a Markov Chain, meaning that the future value  x' is only determined by the value one period before  x. It has a very short memory. Markov chains can be described by a transition matrix (also called a stochastic matrix). The dimensions of the matrix represent all the possible states that the variable can take, and each row represents the probability of the variable evolving from a given state to a different one in one period. Quantecon explains very well Markov Chains, in case you need to review them.

 f(x' | x, i; \theta_f) in Equation (6) represents the transition probability given by the corresponding transition function.

The set of parameters that define the DDC is:

  •  \theta_u: the parameters of the utility function.
  •  \theta_f: the parameters of the transition function. In the case of discrete state, these are just the transition matrix itself.
  •  \theta_g: the parameters of the distribution of random disturbances.

Let's define  \theta as the set of all the parameters above.

The Building Blocks of DDC Algorithms

Using these assumptions, we can express the value function conditioned on choice  i at state  x in the following way:

v(x, i; \theta) = u(x, i; \theta_u) + \epsilon(i) + \beta \sum_{x} f(x'|x, i; \theta_f)\int{V(x', \epsilon'; \theta) g(d\epsilon'|x'; \theta_g)}

Note that the value at a given state is:  V(x; \theta) = \max_{i \in D}{v(x, i; \theta)}, which is the value of choosing the best option for the agent, whatever it is, at all times in the future.

Related to  V(x) is the policy function  \delta(x; \theta) = \arg max_{i \in D}{v(x, i; \theta)}, which represents the agent's decision rule among all the choices in  D.

Now consider the probability of an agent choosing a given option when it finds itself in state  x:

P(i, x; \theta) = \int{I(\delta(x; \theta) = i)  g(d\epsilon | x; \theta_g})

These probabilities are called Conditional Choice Probabilities . These are an important building block, and we will go back to them in a second.

So far we have talked about the "value function", which takes as parameters a choice  i and a state  x, is parameterized by some  \theta and returns a real value. Since the number of choices and the number of states are both discrete and finite, we can (in theory, with enough computer juice of course) calculate the value for each combination of state and choice. We can represent all these values using matrix notation.

Now a confession: I think that the matrix notation used in most papers is confusing. For example, Aguirregabiria and Mira (2002) represent the conditional choice probabilities as an  M \left |D \right | \times 1 vector, with  M being the number of states and  \left | D \right | the size of the choice set... which means that they are a column vector which mixes the conditional choice probability of different choices... and if you need to obtain the conditional probabilities for a given choice you have to somehow remember the corresponding indices and extract them from that vector? Well that's a tall glass of NOPE right there.

I bet this small detail confuses many, even if it is the right notation for an academic paper.

This blog is about implementing algorithms in software, so I'll use a notation that fits better that purpose: the NumPy array notation! It goes like this:

An array is a three-dimensional thing with the shape:  (\left | D \right |, M, k).

Here,  \left | D \right | is the cardinality of D, or the number of choices available to the agent. In the example of Harold Zurcher, he only has the options of replacing or not replacing the engine, so  \left | D \right | = 2.

 M is the number of discrete states. For example, in the 1987 paper, John Rust divided the continuous number of miles into 90 categories. In this case,  M = 90.

Finally,  k is the size of the remaining dimension, which can depend on the vector you're dealing with.

Having said that, let's define the following arrays:

  •  F:  (\left | D \right |, M, M). The transition matrix (an M x M matrix for each possible choice).
  •  P:  (\left | D \right |, M, 1). The Conditional Choice Probability, or the probability of choosing an alternative. The sum of the elements across the first dimension must add up to 1.
  •  \overline{V}:  (\left | D \right |, M, 1). The discounted expected value of choosing action i for each state x.
  •  U:  (\left | D \right |, M, 1). The utility level corresponding to action i for each possible state x.
  •  E:  (\left | D \right |, M, 1). The expected value of the random deviation given that i is the optimal choice when agent is in state x.
  •  V:  (\left | D \right |, M, 1). Represents the expected discounted value of each possible combination of choice and action.
  •  \beta: A scalar in  [0, 1[ .

Note that both  P and  E depend on  g(). We will see later an example of these arrays when we make a parametric assumption about that distribution.

So basically, you're just stacking arrays side by side, and the meaning of each dimension is clear and easy to get.

The expected value can now be expressed as:

V =  U + E + \beta F \overline{V}

Remember that the first dimension of  V is the choices dimension. Summing across this dimension, weighting each choice by its CCP, gives us the expected discounted value:

\overline{V} = \sum_{i \in D}{P \otimes \{ U + E + \beta F \overline{V} \}}

Here,  \otimes is the Hadamard (pairwise) product.

 \overline{V} appears on both sides of the equation. Some refactoring reduces it to:

\overline{V} = {(I_M - \beta \sum_{i \in D}{P \otimes F})}^{-1} \{ \sum_{i \in D}{ P \otimes (U + E) }  \}

This equation is one of the most important concepts in the literature on DDCs. Let's call it  \Phi, or Phi Map. It takes values from the Probability space and maps them into the Value space (the reals which are consistent with the solution to the dynamic optimization problem given the current parameter values, which not necessarily are THE structural parameter values we're searching for).

We can also obtain the inverse map: a function that takes elements from the Value space and maps them to the Probability space. For the kind of simple models we're going to focus on, we need to make assumptions about the distribution of the stochastic utility shocks. A distribution that is commonly used in the literature is the Type I Extreme Value Distribution (surprise!). This distribution yields the following values:

P = \frac{ e^{U + \beta F \overline{V}} }{ \sum_{x \in D} e^{U + \beta F \overline{V}} }

Remember that  U + \beta F \overline{V} is an array with dimensions  (\left | D \right |, M, 1), so when we use  \sum_{i \in D} , we're summing across the first dimension. In other words, this is nothing else but the typical Logit formula everybody knows.

Since we made a parametric assumption about  P we must also use the corresponding value for  E, which for the Type I Extreme Value distribution is  \gamma - \ln(P), with  \gamma = 0.5772156649 being Euler's constant.

Let's call Equation (11)  \Lambda, or Lambda Map.

Note that  \Phi and  \Lambda hold for any value of  V,  P and  \theta, not just for the values that we expect to estimate. Also, note that you can form a self-map in probability space by composing  \Phi and  \Lambda. For example,  P_{k+1} = \Lambda(\Phi(P_k; \theta)). Conversely, you can form a self-map of  V onto itself by composing  V_{k+1} = \Phi(\Lambda(V_k; \theta)).

What is important is that both compositions form a contraction mapping, and therefore have a fixed point. If you keep iterating long enough, you will eventually see that  V_{k+1} gets very close to  V_k and  P_{k+1} converges to  P_k. Also, reaching the fixed point in  P means that you have reached a fixed point in  V. Many algorithms employ this property to estimate the structural parameters of the utility function.

Code Implementation in Python

The rest of the blog post continues in the gist below. It shows the Python code implementation of the components of the DDC problem using the Bus Engine exchange problem. I left enough comments on each cell for you to follow each of the steps. The gist shows how the fixed point in both the Value space and the Probability space is reached by iterating through  \Phi and  \Lambda. Feel free to open the gist in Google Colab and play with it. Save a copy on your own Drive if you wanna modify it on your own.

And that's it for now. In the next post we'll implement some of the most famous algorithms by making use of the mappings we discussed this time.

Until the next one!


  • Aguirregabiria, Victor and Mira, Pedro (2010) "Dynamic discrete choice structural models: A survey," Journal of Econometrics, 156(1): 38-67.
  • Rust, J. (1987) "Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher," Econometrica 55:999–1033.
  • 楠田 康之(2019)経済分析のための構造推定アルゴリズム

© Sansan, Inc.