Articles by Joshua Mōller-Mara

  • Counting clusters with mixture models and EM

    I remember back when I was taking a Bayesian statistics course, we were able to guess the number of subpopulations of fish based on a histogram of fish length measurements. Well, a few months later I had totally forgotten how we did this, so I set out to relearn it. This problem in general is called "clustering", where one finds "clusters" in the data. I've talked about clustering a bit before in my post on using k-means clustering to generate color themes from pictures. Here I'll talk about clustering using mixture modeling and the EM algorithm, and how we can use this model to get an idea of how many clusters are in our data set.

    Take the (artificial) example below. It looks like there are two populations of points: one broad circular one, and a denser diagonal one in the middle. How do we decide which of the two clusters each point belongs to, and, even before that, how do we decide that there are only two clusters?

    A set of points. There is a broad spattering of points in a disk. Inside the disk is a denser, slightly diagonal elliptical region of points.
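
    If you want to play along, data with this shape can be simulated along the following lines (a sketch: the sample sizes and covariance matrices are my guesses, not the actual parameters behind the figure):

    ## Sketch: simulate data shaped like the figure. The sample sizes
    ## and covariances are my guesses, not the original parameters.
    library(MASS)  # for mvrnorm

    set.seed(1)
    broad <- mvrnorm(300, mu = c(0, 0), Sigma = 4 * diag(2))
    dense <- mvrnorm(300, mu = c(0, 0),
                     Sigma = 0.5 * matrix(c(1, 0.9, 0.9, 1), 2, 2))
    X <- rbind(broad, dense)
    plot(X, asp = 1, pch = 20)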

    If we already knew there were two clusters, why not try the k-means clustering algorithm? A potential problem with k-means is that it divides the space up into a Voronoi diagram, meaning that the boundaries between clusters are straight lines. Worse yet, with only two clusters, k-means tries to separate these two clusters using a single line!

    k-means does not cluster this set very well.

    Not a very good separation of these clusters, is it?

    Let's try using a different model. Let's assume that each point is generated from one of several multivariate Gaussian distributions (which they actually are in this case, which is kind of cheating, haha). This is called a multivariate Gaussian mixture model. We can then write the probability density of a point $x$ as

    \[p(x) = \sum_{i=1}^k \alpha_i \cdot \text{N}(x \mid \mu_i, \Sigma_i)\]

    where $\alpha_i$ is the probability of a point belonging to cluster $i$ (with $\sum_i \alpha_i = 1$), and $\mu_i$ and $\Sigma_i$ are the mean and covariance matrix that tell us about the location and spread of the $i$th cluster. So we're interested in estimating the values of $\mu_i$, $\Sigma_i$, and $\alpha_i$ given a bunch of data. However, there's no closed-form solution for estimating all these values at once, the way there is when fitting a single Gaussian. Instead, we use the EM algorithm to iteratively estimate the parameters of interest. Here's what the EM algorithm returned for two clusters on the same data set:

    Illustration of clustering using mixture models and EM.

    Very cool! It recognized the diagonal shape of the inside cluster and has a nice, rounded border. So how does this estimation actually work? You can actually find R code for the EM algorithm on Wikipedia.

    Basically, there are two steps in EM. "E" stands for "expectation": we estimate values for the hidden, or latent, variables. Only by estimating these latent variables are we able to perform the "M" step, where we "maximize" the likelihood using maximum likelihood estimation (MLE). In this problem the hidden variables are the cluster memberships of each point. We don't observe which cluster a ...
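
    To make those two steps concrete, here's a minimal sketch of EM for a $k$-component Gaussian mixture in R. This is my own toy version, not the Wikipedia code; it assumes the simulated matrix X from above and uses dmvnorm from the mvtnorm package:

    ## Minimal EM for a k-component multivariate Gaussian mixture.
    library(mvtnorm)

    em_gmm <- function(X, k, iter = 50) {
        n <- nrow(X)
        ## Start from random means, the overall covariance, and
        ## uniform mixing weights.
        mu    <- X[sample(n, k), , drop = FALSE]
        Sigma <- replicate(k, cov(X), simplify = FALSE)
        alpha <- rep(1 / k, k)
        for (step in 1:iter) {
            ## E-step: each cluster's "responsibility" for each point
            dens <- sapply(1:k, function(i)
                alpha[i] * dmvnorm(X, mean = mu[i, ], sigma = Sigma[[i]]))
            resp <- dens / rowSums(dens)
            ## M-step: weighted maximum-likelihood updates
            for (i in 1:k) {
                w          <- resp[, i]
                alpha[i]   <- mean(w)
                mu[i, ]    <- colSums(w * X) / sum(w)
                Xc         <- sweep(X, 2, mu[i, ])
                Sigma[[i]] <- crossprod(sqrt(w) * Xc) / sum(w)
            }
        }
        list(alpha = alpha, mu = mu, Sigma = Sigma, resp = resp)
    }

    fit <- em_gmm(X, k = 2)
    fit$alpha  # estimated cluster weights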


  • Getting IBus working with Emacs

    Emacs comes with a lot of Chinese input methods, like pinyin, the four-corner method, and various forms of Cangjie, among others (listed quite handily here). For basic usage it actually does fairly well. I've been able to use the four-corner method to look up characters whose pronunciation I don't know. However, Emacs's 4corner and Cangjie methods are limited in that they only cover traditional characters and can't look up simplified ones. So if I tried to look up 龙 ("dragon"), which looks like 4corner "43040" to me, I wouldn't be able to, since it's a simplified character. I'd only be able to look up the traditional form of dragon, "龍" (which is "01211"). So I looked for other input methods that might support both traditional and simplified characters, one of which is Wubi. Wubi isn't available in Emacs, but can be installed via IBus.

    I installed IBus and tried it out. Its input is pretty good, and better than Emacs's pinyin in that it has phrase matching. So if I wanted to enter "lǎoshī" ("teacher", "老师") in Emacs, it would get "lao -> 老" correct, but would guess that "shi" is "是", since shì (是) is more common than shī (师). IBus's pinyin is smart enough to recognize "laoshi" as "老师", among other words and phrases.

    IBus worked out of the box for applications like Chromium and even xterm, but for some reason it seemed to have no effect whatsoever in Emacs. I thought this had something to do with not having ibus-el installed, so I installed it via apt. Even with the correct setup I still had problems: nothing was showing up. When I tried ibus-toggle I got the error 'IBusELInputContext' object has no attribute 'enable'. It turns out that IBus 1.5 no longer works with ibus-el, and that ibus-el pretty much doesn't work anymore (see this discussion). But some people seemed to be able to get IBus working without ibus-el: since Emacs has XIM support, it should pick up IBus automatically. But whenever I entered text, only English characters appeared, without the IBus character-selection popup. I tried adding

    export GTK_IM_MODULE=ibus
    export XMODIFIERS=@im=ibus
    export QT_IM_MODULE=ibus
    

    to my ~/.zshrc (it turns out you probably don't need to, as GTK_IM_MODULE and XMODIFIERS were already set to these values).

    I found someone mentioning the workaround of starting Emacs with LC_CTYPE="zh_CN.UTF-8" emacs. It turns out that this somewhat works: I started to see the IBus character-selection popup, but I wasn't able to enter any characters. I tracked the problem down to Gtk-WARNING **: Locale not supported by C library, which suggested that I didn't actually have the "zh_CN.UTF-8" locale installed. So I installed it via sudo dpkg-reconfigure locales and selected the appropriate option. Now if I start Emacs using

    LC_CTYPE="zh_CN.UTF-8" emacs
    

    it can accept input through Wubi, Pinyin, Cangjie5, and others. Cool!

    There's still some ...


  • Color theme generation from images using k-means

    Note, this post more or less follows this post by Charles Leifer, except in less detail, and explained more poorly.

    One of the top posts on the unixporn subreddit (SFW, really.) is this post that shows how a redditor generates color themes for his window manager from images using a script. He gets the code from Charles Leifer, who explains how the script works. Basically, the script detects the dominant colors in the image using k-means clustering.

    As an exercise, I tried recreating the script in R. I didn't exactly look at Charles' code, but I knew the basic premise was that it uses k-means to generate a color palette.

    I liked the idea of using R over Python because (a) as a statistics major I use R all the time and (b) there's no other reason, R's just fairly nice to work with.

    Color spaces

    k-means performs differently depending on how you represent colors. A common color space is RGB, which represents colors by their red, green, and blue components. I found that representing colors this way tended to result in points strung along the diagonal. This happens because images usually have many shades of the same color, so if you have $(r, g, b)$ you also tend to have $(r+10, g+10, b+10)$. This gives the clusters a sort of elongated shape, which isn't great for k-means, since it seems better at picking out "round" clusters. There is often a lot of correlation between the dimensions. Maybe I'm not making a lot of sense here; suffice it to say I wasn't terribly pleased with the clusters I was getting.

    A 3 dimensional representation of the colors used in an image. In RGB space.

    The next color space I tried was HSV, which represents colors in terms of hue, saturation, and value. This actually got me some fairly satisfactory clusters. As you can see in the graphic below, it's much easier to separate different colors. The only problem was that it made me want to put more weight on the "hue" dimension than on "saturation" or "value": many clusters ended up just being gray.

    A 3 dimensional representation of colors in the same image, but in HSV space.

    One cool thing is that R already does HSV fairly easily using the rgb2hsv function.
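
    For instance (a quick sketch; the pixel values are made up):

    ## rgb2hsv takes a 3 x n matrix, one pixel per column (r, g, b).
    px <- matrix(c(255,   0,   0,   # red
                     0, 255,   0,   # green
                    30,  30,  30),  # dark gray
                 nrow = 3)
    rgb2hsv(px, maxColorValue = 255)  # rows: h, s, v, each in [0, 1]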

    I was most satisfied using LAB space. This represents colors with one "lightness" dimension and two color dimensions, "A" and "B". It was designed to approximate human vision, and as you can see from the graphic below, distances between colors seem more meaningful. In fact, using Lab space is a recommended way of measuring color difference. A good package for working with it in R is the colorspace package.

    Colors represented in LAB space.
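
    Getting there in code is pretty painless (a sketch; the three example colors are my own, not from the original script):

    ## RGB coordinates in [0, 1] -> LAB via the colorspace package.
    library(colorspace)

    rgb01 <- RGB(R = c(1, 0, 0.12),
                 G = c(0, 1, 0.12),
                 B = c(0, 0, 0.12))
    lab <- as(rgb01, "LAB")
    coords(lab)  # one row per color: L (lightness), A, B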

    k-means

    Another nice thing about R is that it has its own kmeans function built in. I actually tried writing my own, which looks like this:

    ## Do k-Means
    ## It tends to lose some k values
    kMeans <- function(k, X, iter = 5) {
        ## Assign random membership
        membership <<- sample(1:k, size=nrow(X), replace=TRUE)
    
        for(i in 1:iter) {
            mus <<- tapply(1:nrow(X ...
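
    In the end, though, the built-in kmeans does the job in a couple of lines. A minimal sketch (the random `pixels` stand-in is mine; in the real script it would be every pixel of the image, converted to LAB as above):

    ## Pick out 8 dominant colors with R's built-in kmeans.
    library(colorspace)

    pixels     <- RGB(matrix(runif(3000), ncol = 3))  # fake "image"
    lab_coords <- coords(as(pixels, "LAB"))
    fit        <- kmeans(lab_coords, centers = 8, nstart = 10)
    ## Cluster centers are the palette; fixup clips any center that
    ## falls slightly outside the sRGB gamut.
    pal <- hex(LAB(fit$centers), fixup = TRUE)
    pal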

  • Emacs is great for sysadmins, too

    I work as a Unix Systems Administrator for UC Berkeley's Rescomp, and it occasionally comes up that sysadmins generally prefer vim while programmers prefer Emacs. The reasoning is that vim or vi is more widely available on servers and has a more consistent interface across them. That is, if you use Emacs, you generally have a hefty .emacs file, and using an unconfigured Emacs is painful.

    I think it's rare these days, though, to find a server without Emacs installed. I've only ever had to use vim a handful of times, and the only things I really needed to know were how to

    1. Insert text (i)
    2. Save & Exit (Esc : wq ENTER)

    However, I'm a sysadmin who prefers Emacs, and there are a number of reasons why using Emacs is very helpful for sysadminning.

    Dired

    Dired mode is Emacs's visual "directory editor", and it makes navigating and operating on files much easier than just using the command line.

    Using marks

    One task that's very easy in Dired but really cumbersome to do elsewhere is repeated grepping. Say, for example, that I want to find files with "hello" in them. In Dired I do this by pressing % g and entering the string.

    A number of files displayed in Dired.

    And what I get is a number of marked files (in orange) that I can then easily, among other things:

    • copy (C)
    • move/rename (R) (even to another server with Tramp!)
    • change the mode of (M)
    • run a shell command on (!)

    Highlighting files to perform actions on them.

    Now I can filter out the files that don't match by pressing t k (t toggles the marks so the non-matching files become marked, then k kills those lines from the listing).

    Filtering out a file by "killing" lines.

    Now say I forgot that I also need the files to contain "world" somewhere in them. I just repeat the process by pressing % g again and entering "world" to get a list of marked files that contain both "hello" and "world".

    Searching with dired highlights files.

    And now it's really easy to do any operations on them.

    In bash, however, it feels a little clumsier to me. It's possible to search by doing:

    grep -lr "hello" .
    

    But if I remember later that it also has to contain "world", I have to go edit the last command to be:

    grep -lr hello . | xargs grep -l world
    

    And now I just get a list of files. Say now that I want to copy these files somewhere. I have to again tack on another command, like so:

    grep -lr hello . | xargs grep -l world | xargs -n1 -i cp {} /some/directory
    

    It gets really cumbersome, and it requires you to remember how to use substitute arguments like {} in xargs. And you might also have to hope your file names don't contain whitespace. With Dired, you really don't have to worry about these kinds of things. Dired's marking system makes a bunch of operations super convenient.

    Edit Dired

    "Edit Dired" mode also just makes it so much easier to rename files in bulk. Instead of having to think of a regexp or sed expression to ...


  • Adding template pages to Pelican

    I was having a lot of trouble just getting my site to generate an authors file, even though I'm the only author here. The Pelican documentation says you can add something like

    AUTHORS_URL = 'blog/authors.html'
    AUTHORS_SAVE_AS = 'blog/authors.html'
    

    to generate the authors.html file.

    It wasn't working for me. Well, after going through the source code and finding the relevant section in generators.py, I found that you also have to set DIRECT_TEMPLATES like so:

    DIRECT_TEMPLATES = ('index', 'tags', 'categories', 'archives', 'authors')
    AUTHORS_URL = 'blog/authors.html'
    AUTHORS_SAVE_AS = 'blog/authors.html'
    

    Now it works! And looking back at the documentation, it actually sort of hints at this. D'oh!


