Aug 3 2014

Experiments with Latent Dirichlet Allocation

In a couple of my previous posts I talked about clustering colors with k-means and counting clusters with EM. This kind of clustering is fairly straightforward, as you have some notion of distance between points to judge similarity. But what if you wanted to cluster text? How do you judge similarity there? (There are certain measures you could use, like the F-measure, which I’ll talk about in a later post.)

One way is to use Latent Dirichlet Allocation, which I first heard about while talking to a Statistics 133 GSI, and then later learned about while reading probabilistic models of cognition. Latent Dirichlet Allocation is a generative model that describes how text documents could be generated probabilistically from a mixture of topics, where each topic has a distribution over words. For each word in a document, a topic is sampled, from which a word is then sampled. This model gives us probabilities of documents, given topic distribution and words. But what’s more interesting here is learning about topics given the observed documents.
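To make the generative story concrete, here is a minimal sketch of it in plain Python. It is only an illustration under some assumptions: the two toy topics and their word probabilities are made up, the topics are fixed rather than drawn from their own Dirichlet prior as in full LDA, and the names `sample_dirichlet` and `generate_document` are hypothetical helpers, not part of any library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document: topic proportions first, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        # Sample a topic for this word position according to theta.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Sample a word from the chosen topic's distribution over words.
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical toy topics (assumption): each maps words to probabilities.
TOPICS = [
    {"brain": 0.5, "neuron": 0.3, "scan": 0.2},
    {"emacs": 0.6, "buffer": 0.3, "keyboard": 0.1},
]

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], length=8, rng=rng)
print("topic proportions:", theta)
print("document:", " ".join(doc))
```

Learning topics from observed documents means running this process in reverse: given only the words, infer the topic proportions and the topics that most plausibly produced them.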

Here’s the plate notation view of LDA, which describes exactly how documents are generated:

A plate notation explanation of LDA.

Rendering MRI volumes in-browser with XTK

Recently I’ve been playing around with interactive visualizations, using tools like d3.js and GGobi. One of the things I like about interactive visualizations, as opposed to static graphics, is that with interactive visualizations you don’t have to make all the information available at once. You can present a broad overview of your data. And by having the user query specific data points, you can present more data as needed. Take this example of airport flight connectivity in the United States. If you had to display all the airport names and all the connections in one graph, it’d probably look pretty gross and would be very confusing to disentangle.

Similarly, with MRI data, it’s usually hard to see the big picture at once. MRI data is usually just displayed in 2D slices. If you’re showing activations you may show a couple of slices, perhaps one axial and one sagittal, so your audience can get an idea of where your clusters are located. If you wanted to show a whole brain, you could perhaps do an animated GIF, like so.

Animated GIF displaying axial slices of a brain.

Counting clusters with mixture models and EM

I remember that back when I was taking a Bayesian statistics course, we were able to guess the number of subpopulations of fish based on a histogram of fish length measurements. Well, a few months later I totally forgot how we did this so I set out to relearn it. This problem in general is called “clustering”, where one finds “clusters” in the data. I’ve talked about clustering a bit before in my post on using k-means clustering to generate color themes from pictures. Here I’ll talk about clustering using mixture modeling and the EM algorithm and how we can use this model to get an idea of how many clusters are in our data set.

Take the (artificial) example below. It looks like there are two populations of points: one broad circular one, and a denser diagonal one in the middle. How do we decide which of the two clusters points belong to, and even before that, how do we even decide that there are only two clusters?

A set of points. There is a broad spattering of points in a disk. Inside the disk is a denser, slightly diagonal elliptical region of points.
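As a rough sketch of the idea, here is EM for a one-dimensional Gaussian mixture, closer to the fish-length setting than to the 2D picture. Several things here are assumptions made for illustration: the synthetic data, the quantile-based initialization, the function names `em_gmm_1d` and `bic`, and the use of BIC to compare different numbers of clusters (other model-selection criteria would work too).

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, k, iterations=100):
    """Fit a k-component 1D Gaussian mixture with EM; returns parameters and log-likelihood."""
    n = len(data)
    ordered = sorted(data)
    # Assumed initialization: means at evenly spaced quantiles, shared overall variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    loglik = sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )
    return weights, means, variances, loglik

def bic(data, k):
    """Bayesian information criterion for a k-component fit; lower is better."""
    _, _, _, loglik = em_gmm_1d(data, k)
    n_params = 3 * k - 1  # k means, k variances, k - 1 free mixing weights
    return n_params * math.log(len(data)) - 2 * loglik

# Synthetic stand-in data (assumption): two subpopulations centered at 0 and 10.
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(10, 1) for _ in range(100)]
for k in (1, 2, 3):
    print(k, "components, BIC =", bic(data, k))
```

The E-step answers "which cluster does each point belong to?" softly, via responsibilities. Comparing the fits for different values of k is one way to approach "how many clusters are there?".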

Getting IBus working with Emacs

Emacs comes with a lot of Chinese input methods like pinyin, four-corner method, and various forms of Cangjie among others (listed quite handily here). For basic usage, it actually does fairly well. I’ve been able to use the four corner method to look up characters of which I don’t know the pronunciation. However, Emacs’s 4corner and Cangjie methods are limited in that they only use traditional characters and can’t look up simplified characters. So if I tried to look up 龙 (“dragon”), which looks like 4corner “43040” to me, I wouldn’t be able to, since it’s a simplified character. I’d only be able to look up the traditional form of dragon: “龍” (which is “01211”). So I looked for other input methods that might support both traditional and simplified, one of which is Wubi. Wubi isn’t available for Emacs, but can be installed via IBus.

I installed IBus and tried it out. Its input is pretty good, and better than Emacs’s pinyin in that it has phrase matching. So if I wanted to enter “lǎoshī” (“teacher”, “老师”) in Emacs, it would get “lao -> 老” correct, but would guess that “shi” is “是”, since shì (是) is more common than shī (师). IBus’s pinyin is smart enough to recognize “laoshi” as “老师”, among other words and phrases.

IBus worked out of the box for applications like Chromium and even xterm, but for some reason it seemed to have no effect whatsoever in Emacs. I thought this had something to do with not having ibus-el installed, so I installed it via apt. Even with the correct setup I still had problems: nothing was showing up. When I tried ibus-toggle I got the error 'IBusELInputContext' object has no attribute 'enable'. It turns out that IBus 1.5 no longer works with ibus-el, and that ibus-el pretty much doesn’t work anymore (see this discussion). But some people seemed to be able to get IBus working without ibus-el. Since Emacs has XIM support, it should be able to work with IBus automatically. But whenever I entered text, only English characters appeared, without the IBus character selection dialog popup. I tried adding

export GTK_IM_MODULE=ibus
export XMODIFIERS=@im=ibus
export QT_IM_MODULE=ibus

to my ~/.zshrc (it turns out you probably don’t need to, as GTK_IM_MODULE and XMODIFIERS were already set to these values).

I found someone mention the workaround of using LC_CTYPE="zh_CN.UTF-8" emacs to start Emacs. It turns out that this somewhat works. I started to see the IBus character selection dialog popup, but I wasn’t able to enter any characters. I tracked the problem to Gtk-WARNING **: Locale not supported by C library, which suggested that I didn’t actually have “zh_CN.UTF-8” installed. So I installed it via sudo dpkg-reconfigure locales and selected the appropriate option. Now if I start emacs using

LC_CTYPE="zh_CN.UTF-8" emacs

it can accept input through Wubi, Pinyin, Cangjie5, and others. Cool!
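If you want to avoid typing the variable each time, the check and the launch can be wrapped in a small script. This is just a sketch in Python with hypothetical helper names (`locale_is_available`, `emacs_environment`, `launch_emacs`); it assumes `emacs` is on your PATH. The `locale.setlocale` call fails with `locale.Error` when the C library doesn’t support the requested locale.

```python
import locale
import os
import subprocess

def locale_is_available(name):
    """Return True if the C library accepts this locale for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the original setting

def emacs_environment(base_env, ctype="zh_CN.UTF-8"):
    """Copy an environment mapping and set LC_CTYPE in the copy."""
    env = dict(base_env)
    env["LC_CTYPE"] = ctype
    return env

def launch_emacs(ctype="zh_CN.UTF-8"):
    """Start Emacs with LC_CTYPE set, refusing if the locale is not installed."""
    if not locale_is_available(ctype):
        raise RuntimeError(
            f"locale {ctype} is not installed; try: sudo dpkg-reconfigure locales"
        )
    return subprocess.Popen(["emacs"], env=emacs_environment(os.environ, ctype))

# launch_emacs()  # uncomment to start Emacs
```

A shell alias that sets LC_CTYPE before running emacs would do the same job; the script only adds the up-front check for the missing locale.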

There are still some problems with using IBus on Emacs without ibus-el. It’s hard to do commands like C-x k (kill-buffer) without the k being read as something else. Usually you have to switch to temporary English mode using Shift, or switch back to the US keyboard. Maybe someday ibus-el will work with IBus again, but the API conflicts seem to suggest this won’t happen anytime soon.

Anyways, that’s it. Hopefully this helps you with Emacs and IBus if you were tearing your hair out like I was. 再见！