Yeah, that was not nearly enough detail! I do research in psycholinguistics, and I'm currently designing an experiment that tests how we use mental representations of phrases during language processing and production. In this case, I've hypothesized that words which appear in highly frequent phrases undergo less "lexical priming" -- that is, activating a strong multiword representation reduces the amount of activation for any one of the individual words within that phrase. (This assumes that "representation strength" scales in a nice way with expression frequency, which we have some evidence for.) The hypothesis is based on some provocative data from a previous experiment, but that experiment wasn't explicitly designed to test it, so I'm designing another one! As a first step, I'm analyzing naturally occurring speech in a corpus of telephone conversations to see whether the hypothesis is supported in natural data; if it looks promising, I'll run a tightly controlled experiment in the lab.

Unfortunately, doing fancy calculations on hundreds of thousands of rows is a huge pain in the ass in the language I'm most comfortable with, R. R is great for statistical analysis but is painfully slow otherwise...
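To give a concrete sense of the kind of calculation I mean, here's a rough sketch in R using data.table (the file name and column names are just placeholders, not my actual pipeline): count how often each two-word phrase occurs, and how often each individual word occurs, across the whole corpus.

```r
library(data.table)

# One row per token, with (hypothetical) columns utterance_id and word
tokens <- fread("switchboard_tokens.csv")   # placeholder file name

# Build bigrams within each utterance by pairing every word with the next one
bigrams <- tokens[, .(w1 = head(word, -1), w2 = tail(word, -1)),
                  by = utterance_id]

# Frequency of each two-word phrase and of each individual word
bigram_freq <- bigrams[, .N, by = .(w1, w2)]
word_freq   <- tokens[, .N, by = word]
```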
Idioms are difficult to use for studying this, precisely because they have multiple possible meanings. I'm talking about phrases that still appear to be fully compositional but are very frequent, with words that co-occur more often than expected by chance - some examples are "parmesan cheese", "academic achievement", "good job", etc.
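To make "more often than expected by chance" concrete: one standard way to score it is pointwise mutual information, log2 of the observed phrase probability over the product of the individual word probabilities. Continuing the sketch above (names still made up, and the normalization is only approximate -- this is just to show the shape of the computation):

```r
# PMI = log2( p(w1, w2) / (p(w1) * p(w2)) ), using the counts from above
# (unigram counts are normalized against the bigram total here -- close enough for a sketch)
n_total <- nrow(bigrams)   # total number of bigram tokens

pmi <- merge(bigram_freq, word_freq, by.x = "w1", by.y = "word")
setnames(pmi, c("N.x", "N.y"), c("n12", "n1"))
pmi <- merge(pmi, word_freq, by.x = "w2", by.y = "word")
setnames(pmi, "N", "n2")

pmi[, score := log2((n12 / n_total) / ((n1 / n_total) * (n2 / n_total)))]
setorder(pmi, -score)
head(pmi)   # collocations like "parmesan cheese" should float toward the top
```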