Sansan Tech Blog


Economics Meets Data Science: Finite Mixture Models - A Christmas Story


Merry Christmas! 🎄

¡Hola! I'm Juan, researcher and economist at DSOC's SocSci Group. I hope you're having a great end of the year. This time I'm contributing to Sansan's Advent Calendar, so this is a Christmas edition of Economics Meets Data Science.

This time I'd like to discuss a class of models that is widely used in Economics but that many people haven't tried yet: Finite Mixture Models (FMMs).

What are Finite Mixture Models (a lazy explanation)

In very general terms FMMs are a way of modeling data that comes from a finite number of different distributions, when we don't know which observation comes from which distribution. This is a very common technique employed in all sorts of areas of Statistics and Machine Learning.

The setting, as used in Economics, is simple. There's a variable y we're interested in. It can be continuous or discrete. There's a set of observed variables x that can be used to explain that variable.

Usually, we would fit a regression model to understand the relationship between y and x. However, there's one complication: the data we observe is generated by a finite number K of different models. In other words, there are K models explaining the relationship between y and x.

If we knew which observations are generated by which model, we could simply divide the dataset and estimate the models separately, or include dummy variables to separate the effects. However, we don't know the group affiliations of the observations.

As with everything unknown in Statistics, we need to assume that the group affiliations of the observations are generated by a probability distribution, which has some parameters. The estimation task is therefore to estimate the parameters of the distribution that generates the affiliations, and the parameters of the models of each group.

Because FMM estimation outputs not only the parameters of the within-group models, but also the most likely group affiliation (label) of each observation, FMMs can also be considered a clustering technique.
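To make the setting concrete, here is a minimal sketch that simulates data from two hidden groups, each with its own linear relationship between y and x. The function name `simulate_mixture`, the class share, and the intercepts and slopes are all made up for illustration; they do not come from any of the papers discussed in this post.

```python
import random

def simulate_mixture(n, share_class1=0.4, seed=0):
    """Simulate y = a_k + b_k * x + noise, where the class k of each row is hidden."""
    rng = random.Random(seed)
    # Hypothetical class-specific (intercept, slope) pairs, chosen for illustration.
    params = {0: (1.0, 2.0), 1: (-1.0, 0.5)}
    xs, ys, labels = [], [], []
    for _ in range(n):
        # Latent class of this observation, unknown to the analyst in practice.
        k = 1 if rng.random() < share_class1 else 0
        a, b = params[k]
        x = rng.uniform(-2.0, 2.0)
        y = a + b * x + rng.gauss(0.0, 0.5)
        xs.append(x)
        ys.append(y)
        labels.append(k)
    return xs, ys, labels

xs, ys, labels = simulate_mixture(1000)
# The analyst only sees (xs, ys); labels are kept here to check results later.
print(sum(labels) / len(labels))  # share of class 1 in this sample, roughly 0.4
```

A single regression of y on x fitted to this data would average over the two groups and recover neither slope, which is the problem FMMs are meant to address.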

Use cases

FMMs are quite a general class of models, so their uses are very heterogeneous. Arcidiacono & Miller (2011) showed that FMMs can be used to estimate dynamic discrete choice models where a part of the utility function is unobserved. For example, some individuals may derive a higher benefit from the consumption of some good, but that fact is unknown to them and to the econometrician. Furthermore, the probability of belonging to that type may change over time and depend on the choices made by the agent.

This class of models has also been employed in the literature on risk preference. For example, differences in the shape of utility functions may result in some individuals being more averse to risk than others. Preferences regarding the consumption of risky assets are at the center of Finance. Recently, Gerhard, Gladstone & Hoffmann (2018) and Araki (2020) have employed FMMs to study the impact of financial literacy on the investment decisions of households, finding that different types of households respond differently to increased financial literacy, thus highlighting the importance of accounting for unobserved heterogeneity in explaining financial behavior.

Here I explain the model employed by these last two papers, which study the impact of financial literacy on investment behavior:

The model considers two classes:  k \in \{1, 2\}

The probability that  i belongs to class  k is given by:

 \pi_{i,k}(z_{i,k} = 1 | x_i^C; \alpha)

Here, z is a set of dummy variables (or class labels), one for each class, where  z_{i,k} = 1 indicates that individual i belongs to class k.

 x_i^C represents the set of characteristics that influence the class affiliation of individual i. In this application these are socio-demographic characteristics of the household.

Within class k, the probability of having investment experience is given by the probability distribution function  \Phi_{i, k}(y_i, x_i^W; \beta), where y_i is the observed outcome and  x_i^W is the set of covariates of the within-class model.

The parameters of interest are  \Psi=\{\alpha, \beta\}
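
Because the labels z are unobserved, each observation contributes to the likelihood through a weighted average over the classes. As a sketch, assuming independent observations i = 1, ..., N, the definitions above imply the following observed-data log-likelihood:

 \log L(\Psi) = \sum_{i=1}^N \log \sum_{k=1}^K \pi_{i,k}(z_{i,k} = 1 | x_i^C; \alpha) \Phi_{i,k}(y_i, x_i^W; \beta)

The sum inside the logarithm is what prevents the problem from splitting into one separate estimation per class.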

How to estimate FMMs

The estimation cannot be performed directly without knowing the affiliation of the observations to the unobserved classes. The most commonly used algorithms for solving this sort of problem are Expectation Maximization (EM) algorithms. These are iterative Maximum Likelihood algorithms that alternate between estimating the within-class model parameters while holding the class affiliation probabilities fixed, and updating the class affiliation probabilities while holding the current parameter estimates fixed.

In this particular case, the steps are as below:

Expectation Step

At iteration J, the algorithm takes the current parameter guesses as fixed at their values  \alpha^J and  \beta^J, and updates the posterior class affiliation probabilities using Bayes' rule:

\tau_{i, k}(\Psi^J) = \frac{ \pi_{i, k}(z_{i, k} = 1 | x_i^C; \alpha^J) \Phi_{i, k}(y_i, x_i^W; \beta^J) }{ \sum_{l=1}^K \pi_{i, l}(z_{i, l} = 1 | x_i^C; \alpha^J) \Phi_{i, l}(y_i, x_i^W; \beta^J) }
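
The update above can be sketched in a few lines of Python. The functional forms here are assumptions made for illustration: a logit for the class probability and a probit for the binary outcome, each with a single covariate. The helper names `class_prob`, `outcome_lik` and `posterior` are hypothetical, the parameter values are made up, and classes are indexed 0 and 1 instead of 1 and 2.

```python
import math

def class_prob(x_c, alpha):
    """Assumed logit form: class probabilities given one characteristic x_c."""
    p1 = 1.0 / (1.0 + math.exp(-(alpha[0] + alpha[1] * x_c)))
    return [1.0 - p1, p1]  # [P(class 0), P(class 1)]

def outcome_lik(y, x_w, beta_k):
    """Assumed probit form: likelihood of binary y under class-specific beta_k."""
    index = beta_k[0] + beta_k[1] * x_w
    p = 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))  # standard normal CDF
    return p if y == 1 else 1.0 - p

def posterior(pi_i, phi_i):
    """Bayes' rule: tau_ik = pi_ik * phi_ik / sum_l pi_il * phi_il."""
    joint = [p * f for p, f in zip(pi_i, phi_i)]
    total = sum(joint)
    return [j / total for j in joint]

# One household with hypothetical data and hypothetical parameter guesses.
alpha = (0.0, 1.0)
betas = [(-1.0, 0.2), (0.5, 0.8)]
pi_i = class_prob(x_c=0.3, alpha=alpha)
phi_i = [outcome_lik(y=1, x_w=1.0, beta_k=b) for b in betas]
tau_i = posterior(pi_i, phi_i)
print(tau_i)  # two posterior probabilities that sum to 1
```

In this example the household has investment experience (y = 1), which is more likely under the second class, so its posterior probability of belonging to that class ends up above its prior probability.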

This step yields the expected complete-data log-likelihood of the model, a function of the model parameters in which the posterior probabilities are held fixed at their iteration-J values:

Q(\Psi; \Psi^J) = \sum_{i=1}^N \sum_{k=1}^K \tau_{i, k}(\Psi^J) [ \log \pi_{i, k}(z_{i, k} = 1 | x_i^C; \alpha) + \log \Phi_{i, k}(y_i, x_i^W; \beta) ]
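
As a rough sketch, Q can be evaluated as follows once the class probabilities and within-class likelihoods have been computed at some candidate parameters. The function name `expected_loglik` and the numbers in the example are made up for illustration.

```python
import math

def expected_loglik(tau, pi, phi):
    """Q = sum_i sum_k tau_ik * (log pi_ik + log phi_ik).

    tau: posteriors from the E-step, held fixed.
    pi, phi: class probabilities and within-class likelihoods evaluated at
    the candidate parameters. Each argument is a list of rows, one row per
    observation and one entry per class.
    """
    q = 0.0
    for tau_i, pi_i, phi_i in zip(tau, pi, phi):
        for t, p, f in zip(tau_i, pi_i, phi_i):
            if t > 0.0:  # skip zero-weight terms to avoid 0 * log(0)
                q += t * (math.log(p) + math.log(f))
    return q

tau = [[0.9, 0.1], [0.2, 0.8]]
pi = [[0.5, 0.5], [0.5, 0.5]]
phi = [[0.7, 0.3], [0.4, 0.6]]
print(expected_loglik(tau, pi, phi))
```

Note that the two logarithms separate: the first depends only on the class-affiliation parameters and the second only on the within-class parameters, so the two sets can be maximized independently.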

Maximization Step

Obtain the parameters  \alpha and  \beta that maximize the expected log-likelihood obtained from the E-Step. These maximizers become the parameter guesses  \Psi^{J+1} for the next iteration.
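
For the model above, this maximization generally has to be done numerically. To show the mechanics, here is a sketch of a much simpler special case, which is not the model of the papers: constant class shares and unit-variance normal outcomes, where the maximizers have a closed form. The function name `m_step` and the data are hypothetical.

```python
def m_step(ys, tau):
    """Closed-form M-step for a Gaussian mixture with constant class shares.

    Assumes unit-variance normal densities within each class, so the
    maximizers of Q are the average posteriors (shares) and the
    posterior-weighted means.
    """
    n, n_classes = len(ys), len(tau[0])
    shares, means = [], []
    for k in range(n_classes):
        weight_k = sum(tau_i[k] for tau_i in tau)
        shares.append(weight_k / n)
        means.append(sum(tau_i[k] * y for tau_i, y in zip(tau, ys)) / weight_k)
    return shares, means

ys = [0.0, 1.0, 9.0, 10.0]
tau = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(m_step(ys, tau))  # ([0.5, 0.5], [0.5, 9.5])
```

With posteriors that are exactly 0 or 1, as in this toy example, the update reduces to splitting the sample and averaging within each group.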

The EM algorithm iterates between these two steps until a stopping criterion is met, such as convergence of the parameter values or of the posterior probabilities, or until a maximum number of iterations is reached. In some cases, good initial values for the class affiliations are found before running the EM algorithm to help it converge faster. When maximizing the likelihood function is very costly, the parameter search is instead performed by maximizing a simpler minorizer function, as in MM algorithms (Hunter & Lange, 2004). This is the case for model-based clustering of networks, where likelihood functions may become unwieldy even for small networks.
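
The full loop, including the stopping rule, can be sketched on a deliberately simple example. This is not the model of the papers: it assumes constant class shares and two unit-variance normal components, so that both steps fit in a few lines. The function names and simulated parameters are made up for illustration.

```python
import math
import random

def normal_pdf(y, mean):
    """Density of a unit-variance normal distribution."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def em_two_gaussians(ys, max_iter=500, tol=1e-8):
    shares, means = [0.5, 0.5], [min(ys), max(ys)]  # crude starting values
    loglik_path = []
    for _ in range(max_iter):
        # E-step: posterior class probabilities at the current parameters.
        tau, loglik = [], 0.0
        for y in ys:
            joint = [s * normal_pdf(y, m) for s, m in zip(shares, means)]
            total = sum(joint)
            tau.append([j / total for j in joint])
            loglik += math.log(total)
        loglik_path.append(loglik)
        # M-step: closed-form update of the shares and the means.
        new_means = []
        for k in range(2):
            weight_k = sum(t[k] for t in tau)
            shares[k] = weight_k / len(ys)
            new_means.append(sum(t[k] * y for t, y in zip(tau, ys)) / weight_k)
        # Stop when the means barely move between iterations.
        converged = max(abs(a - b) for a, b in zip(new_means, means)) < tol
        means = new_means
        if converged:
            break
    return shares, means, loglik_path

rng = random.Random(1)
ys = [rng.gauss(-2.0, 1.0) if rng.random() < 0.4 else rng.gauss(3.0, 1.0)
      for _ in range(400)]
shares, means, loglik_path = em_two_gaussians(ys)
print(shares, means)  # should land near shares (0.4, 0.6) and means (-2, 3)
```

Tracking the observed-data log-likelihood at each iteration is a cheap sanity check: under EM it should never decrease, so a drop usually signals a bug in one of the two steps.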

The two papers mentioned above show that the effect of higher financial literacy differs significantly across two types of households, and that household type is associated with the socio-demographic characteristics of its members. Note that this does not mean that there is some sort of interaction term between socio-demographics and financial literacy affecting investment behavior. Instead, it means that socio-demographics affect the probability that a household is of a type for which the effect of financial literacy is larger. The difference is not only conceptual. First, a full interaction model would be massive, potentially exhausting the available degrees of freedom. Second, the clustering of households is itself a policy-relevant result, since it can help policy makers prioritize some types of households and optimize the content of financial literacy programs.

Simulating and estimating an interesting FMM

Although this type of model is implemented in statistical software such as Stata, I personally find it very hard to understand the output of such algorithms without implementing them myself. This is especially true because FMMs suffer from stability issues, including local maxima and non-identification, when they aren't properly specified. Knowing why they may fail will save you a lot of time. Finally, FMMs are a class of models with many variations, and it is quite possible that the particular flavor you need for your research has not yet been implemented in existing packages. Being able to implement the algorithm on your own should greatly broaden the frontiers of your research.

The following Google Colab notebook includes code for simulating and estimating models similar to the one discussed above. This is a lazy, computationally inefficient implementation, but it should help you see the moving parts of the algorithm more clearly.

The following image shows the distribution of the parameter estimates across Monte Carlo simulations of the model in the notebook. The true parameters are located at the red vertical lines:

Figure: Estimates distribution

Although the distributions are centered at the true values, note that variation is quite large. FMMs need to estimate not just the model parameters, but also the class affiliations, which adds to the parameter variation. When the model fails to predict the class affiliations properly, estimates can vary a lot. Initializing the classes by using some clustering algorithm before the EM iterations can greatly help reduce variation, and is a standard practice in existing implementations.
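
As an illustration of that initialization idea, here is a minimal sketch of a one-dimensional k-means with two clusters whose hard labels are turned into starting posteriors. It is a stand-in for whatever clustering step a given package uses, not a description of any specific implementation, and the function name `two_means_labels` is hypothetical.

```python
def two_means_labels(ys, n_iter=20):
    """Plain 1-D k-means with two clusters; returns a 0/1 label per observation."""
    centers = [min(ys), max(ys)]
    labels = [0] * len(ys)
    for _ in range(n_iter):
        # Assign each point to the nearest center.
        labels = [0 if abs(y - centers[0]) <= abs(y - centers[1]) else 1
                  for y in ys]
        # Recompute each center as the mean of its members (keep it if empty).
        for k in range(2):
            members = [y for y, lab in zip(ys, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels

labels = two_means_labels([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
print(labels)  # [0, 0, 0, 1, 1, 1]
# Hard labels become starting posteriors: tau_i = [1, 0] or [0, 1].
tau0 = [[1.0 - lab, float(lab)] for lab in labels]
```

Starting the EM iterations from `tau0` and running an M-step first gives the algorithm a sensible first set of parameters instead of arbitrary ones.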

Finally, I have assumed throughout this post that the number of unobserved classes is known with certainty. In reality, unless we have some prior knowledge, we don't know the exact model. The best-fitting number of classes can be obtained via Bayesian methods (basically treating the number of classes as a random variable coming from some distribution), or by trial and error, choosing the number of classes that minimizes the Akaike or Bayesian Information Criterion.
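
The trial-and-error approach can be sketched as follows: estimate the model for several values of K, compute an information criterion for each, and keep the smallest. The log-likelihoods and parameter counts below are invented purely to show the mechanics; in practice they would come from the fitted models.

```python
import math

def aic(loglik, n_params):
    """Akaike Information Criterion: 2k - 2 log L."""
    return 2.0 * n_params - 2.0 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: k log(n) - 2 log L."""
    return n_params * math.log(n_obs) - 2.0 * loglik

# Hypothetical maximized log-likelihoods and parameter counts for K = 1, 2, 3,
# made up for illustration only.
candidates = {1: (-850.0, 2), 2: (-790.0, 5), 3: (-788.5, 8)}
n_obs = 400
scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
best_k = min(scores, key=scores.get)
print(best_k)  # 2
```

In this made-up example, going from two to three classes barely improves the fit, so the penalty for the extra parameters dominates and two classes are selected.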

Final words

I hope that this blog post has shed some light on the strengths and weaknesses of FMMs, and has convinced you to include them in your data science toolbox.

Have a great and safe end of the year, and feliz año nuevo 🍻


  • Araki, H., 2020. Financial Literacy, Unobserved Heterogeneity and Investment Behavior in Japan. Association of Behavioral Economics and Finance, 14th Annual Conference.
  • Arcidiacono, P. and Miller, R. A., 2011. Conditional Choice Probability Estimation of Dynamic Discrete Choice Models With Unobserved Heterogeneity. Econometrica, 79, 1823-1867.
  • Gerhard, P., Gladstone, J. J. and Hoffmann, A. P. I., 2018. Psychological characteristics and household savings behavior: The importance of accounting for latent heterogeneity. Journal of Economic Behavior & Organization, 148, 66-82.
  • Hunter, D. R. and Lange, K., 2004. A tutorial on MM algorithms. The American Statistician, 58(1), 30-37.

