Counting clusters with mixture models and EM
I remember that back when I was taking a Bayesian statistics course, we were able to guess the number of subpopulations of fish based on a histogram of fish length measurements. Well, a few months later I had totally forgotten how we did this, so I set out to relearn it. This problem in general is called "clustering": finding "clusters" of similar points in the data. I've talked about clustering a bit before in my post on using k-means clustering to generate color themes from pictures. Here I'll talk about clustering using mixture modeling and the EM algorithm, and how we can use this model to get an idea of how many clusters are in our data set.
Take the (artificial) example below. It looks like there are two populations of points: one broad circular one, and a denser diagonal one in the middle. How do we decide which of the two clusters each point belongs to, and even before that, how do we decide that there are only two clusters?

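For concreteness, here's a minimal R sketch of how data like this might be simulated; the sample sizes, means, and covariances are my own guesses, not the parameters behind the actual figure:

```r
# Simulate a broad circular cluster plus a denser diagonal one
library(MASS)  # for mvrnorm
set.seed(1)
broad   <- mvrnorm(300, mu = c(0, 0), Sigma = diag(2) * 9)     # circular, spread out
diag_cl <- mvrnorm(300, mu = c(0, 0),
                   Sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2))    # elongated, tight
x <- rbind(broad, diag_cl)
plot(x, pch = 19, col = rgb(0, 0, 0, 0.4))
```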
If we already knew there were two clusters, why not try using the k-means clustering algorithm? A potential problem with k-means is that it divides the space up into a Voronoi diagram, meaning that the boundaries between clusters are straight lines. Worse yet, with only two clusters, k-means tries to separate these two clusters using a single line!
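Running k-means here takes only a couple of lines with base R's `kmeans`, using the simulated `x` from the sketch above:

```r
# Fit k-means with two centers and color points by their assigned cluster
km <- kmeans(x, centers = 2, nstart = 20)
plot(x, col = km$cluster, pch = 19)
points(km$centers, pch = 4, cex = 2, lwd = 3)  # mark the fitted centers
```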

Not a very good separation of these clusters, is it?
Let's try a different model. Let's assume that the points are generated from one of several multivariate Gaussian distributions (which they actually are in this case, which is kind of cheating, haha). This is called a multivariate Gaussian mixture model, and under it we can write the probability density of a point $x$ as
\[p(x) = \sum_{i=1}^k \alpha_i \, \text{N}(x \mid \mu_i, \Sigma_i)\]
where $\alpha_i$ is the probability of a point belonging to cluster $i$ (so the $\alpha_i$ sum to one), and $\mu_i$ and $\Sigma_i$ are the mean and covariance matrix describing the location and spread of the $i$th cluster. We're interested in estimating the values of $\mu_i$, $\Sigma_i$, and $\alpha_i$ given a bunch of data. However, there's no closed-form solution for estimating all of these values at once, the way there would be if there were only one Gaussian cluster. Instead, we use the EM algorithm to iteratively estimate the parameters of interest. Here's what the EM algorithm returned for two clusters on the same data set:

Very cool! It recognized the diagonal shape of the inside cluster and has a nice, rounded border. So how does this estimation actually work? You can actually find R code for the EM algorithm on Wikipedia.
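If you just want to fit the model without writing EM yourself, packages already implement it. Here's a sketch using the mixtools package (I'm assuming its `mvnormalmixEM` interface, which fits multivariate Gaussian mixtures by EM):

```r
# Fit a two-component multivariate Gaussian mixture by EM
library(mixtools)
fit <- mvnormalmixEM(x, k = 2)
fit$lambda           # estimated mixing proportions (the alpha_i)
fit$mu               # estimated cluster means (the mu_i)
fit$sigma            # estimated covariance matrices (the Sigma_i)
head(fit$posterior)  # each point's posterior probability of cluster membership
```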
Basically, there are two steps in EM. "E" stands for "expectation": we estimate values for hidden, or latent, variables. Only by estimating values for these latent variables can we perform the "M" step, where we "maximize" the likelihood using maximum likelihood estimation (MLE). In this problem, our hidden variables are the cluster memberships of each point. We don't observe which cluster a ...