OMG! Google has announced that not only have they been collecting n-gram data from a training corpus of a trillion words built from online sources, but they’re going to make the n-gram data available via the UPenn Linguistic Data Consortium in the near future. I don’t even have a need for this data at the moment, but I’m drooling over the idea. I’m sure there’s some way I can make use of this in my current project… [via Language Log]