Yeah, that was not nearly enough detail! I do research in psycholinguistics, and I'm currently designing an experiment that tests how we use mental representations of phrases during language processing and production. I've hypothesized that words appearing in highly frequent phrases undergo less "lexical priming" -- that is, activating a strong multiword representation reduces the activation of any individual word within that phrase. (This assumes that "representation strength" scales nicely with expression frequency, which we have some evidence for.)

This is based on some provocative data from a previous experiment, but that experiment wasn't explicitly designed to test this hypothesis, so I'm designing another one! As a first step, I'm analyzing naturally occurring speech in a corpus of telephone conversations to see whether my hypothesis holds up in natural data; if it looks promising, I'll run a tightly controlled experiment in the lab.

Unfortunately, doing fancy calculations on hundreds of thousands of rows is a huge pain in the ass in the language I'm most comfortable with, R. R is great for statistical analysis but painfully slow otherwise...
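To give a sense of what I mean by "fancy calculations", here's a rough sketch of the kind of phrase-frequency computation involved, written with data.table since that's the fastest option I know of in R (the file name and column names below are just placeholders, not my actual corpus format):

```r
library(data.table)

# Assume one token per row, in utterance order:
# columns convo_id, utt_id, word (placeholder file/column names).
tokens <- fread("telephone_corpus_tokens.csv")

# Build bigrams within each utterance by pairing each word with the next.
tokens[, next_word := shift(word, type = "lead"), by = .(convo_id, utt_id)]
bigrams <- tokens[!is.na(next_word), .(w1 = word, w2 = next_word)]

# Bigram frequency as a proxy for multiword representation strength.
bigram_freq <- bigrams[, .N, by = .(w1, w2)]
setnames(bigram_freq, "N", "phrase_freq")

# Single-word frequency, so phrase frequency and word frequency can be
# teased apart in the later analysis.
word_freq <- tokens[, .N, by = word]
setnames(word_freq, "N", "w1_freq")

# Attach the component-word frequency to each bigram type.
result <- merge(bigram_freq, word_freq, by.x = "w1", by.y = "word")
```

Keyed grouping and joins like these are where data.table earns its keep; on a few hundred thousand token rows this sort of aggregation should run in seconds, even when the equivalent base-R approach crawls.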