Clustering with the F-measure

In one of my previous posts I talked about how to cluster text into different topics using Latent Dirichlet Allocation. The problem with it for me was that 1) it ran really slowly and 2) I didn't actually get good results (perhaps due to my hyperparameters, or maybe my sampler wasn't converging). Because of its complexity, I wasn't able to run it on the dataset I really wanted to cluster: a system log. One thing about system logs is that you often get the same messages printed over and over again, interleaved at different times. It would be nice to split the log back apart so you could see all the messages related to one thing together.

What I eventually ended up doing was clustering the data by computing a distance matrix with the F-measure. The F-measure is the harmonic mean of precision and recall. Here, precision is $$\text{Precision} = \frac{\text{Sum of words shared in both texts}}{\text{Number of words in new text}}$$ while recall is $$\text{Recall} = \frac{\text{Sum of words shared in both texts}}{\text{Number of words in old text}}$$

All together this is $$ \frac{2 \cdot (\text{Sum of words shared in both texts})}{\text{Number of words in old text}+\text{Number of words in new text}} $$

which gives us a measure between 0 and 1 of how many words the two texts share. It doesn't take word order into account at all, so the sentences "Hello, my name is Josh" and "Josh is my hello name" have an F-measure of 1. Kinda stupid, but a good first pass for clustering. All we really need is some kind of distance matrix to build a dendrogram.
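Here's a sketch of the whole pipeline: compute the bag-of-words F-measure between every pair of texts, turn it into a distance matrix with distance = 1 − F, and hand that to SciPy's hierarchical clustering. The log lines and the 0.5 cut threshold are made up purely for illustration.

```python
import re
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def f_measure(text_a, text_b):
    """Harmonic mean of precision and recall over bags of words."""
    a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    shared = sum((a & b).values())  # words shared in both texts
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0

# Word order is ignored, so these two score a perfect 1.
print(f_measure("Hello, my name is Josh", "Josh is my hello name"))  # 1.0

# Hypothetical log lines, just for illustration.
lines = [
    "disk error on sda",
    "disk error on sdb",
    "user login succeeded",
    "user login failed",
]

# Distance = 1 - F, so identical texts sit at distance 0.
n = len(lines)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = 1.0 - f_measure(lines[i], lines[j])

# Condense the square matrix and build the hierarchy (average linkage).
Z = linkage(squareform(dist), method="average")
clusters = fcluster(Z, t=0.5, criterion="distance")
```

From `Z` you can draw the actual dendrogram with `scipy.cluster.hierarchy.dendrogram(Z)`; the `fcluster` call just flattens the tree into labels at a chosen cut, and with this toy data the 0.5 cut puts the two disk-error lines in one cluster and the two login lines in another.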