To many people, “geek” and “nerd” are synonyms, but in fact they are a little different.
A great little statistic for measuring how much company two words tend to keep is pointwise mutual information (PMI). It’s commonly used in the information retrieval literature to measure the co-occurrence of words and phrases in text, and it also turns out to be a good predictor of how humans evaluate semantic word similarity (Recchia & Jones, 2009) and topic model quality (Newman et al., 2010).
For two words w and v, the PMI is given by:
$$\mathrm{PMI}(w, v) = \log \frac{p(w, v)}{p(w)\,p(v)} = \log p(w \mid v) - \log p(w)$$
where, in this case, p(·) is the probability of the word(s) in question appearing in a random tweet, as estimated from the data. For instance, if we let v = “geek,” we compute the log-probability of a word w in the “geek” search corpus, log p(w | v), and subtract the log-probability of w in the background corpus, log p(w).
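To make the calculation concrete, here is a minimal sketch in Python. It is not the code behind the original analysis. The function name `pmi_scores`, the whitespace tokenization, and the choice to estimate each probability as the fraction of tweets containing the word are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_scores(target_tweets, background_tweets):
    """Return {word: PMI(word, v)} where v is the search term behind target_tweets.

    Assumption: p(w | v) is the fraction of target tweets containing w, and
    p(w) is the fraction of background tweets containing w.
    """
    def doc_freq(tweets):
        # Count each word at most once per tweet (naive whitespace tokenizer).
        return Counter(w for t in tweets for w in set(t.lower().split()))

    target_df = doc_freq(target_tweets)
    background_df = doc_freq(background_tweets)
    n_target = len(target_tweets)
    n_background = len(background_tweets)

    scores = {}
    for word, count in target_df.items():
        if background_df[word] == 0:
            # p(w) would be zero, so the PMI is undefined; skip the word.
            continue
        log_p_w_given_v = math.log(count / n_target)
        log_p_w = math.log(background_df[word] / n_background)
        scores[word] = log_p_w_given_v - log_p_w
    return scores
```

A positive score means the word shows up more often alongside the search term than it does in general, a score near zero means no particular association, and a negative score means the word is rarer than usual in that company. A real analysis would probably also want a minimum count threshold, since rare words give noisy estimates.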
If you ever wondered what the difference between a “Geek” and a “Nerd” was, this should settle it once and for all … or not.
Read the full article and rather nerdy analysis here at slackprop: