There's been some significant research on analyzing text to match writing patterns. The general idea is that every person has unique linguistic patterns and turns of phrase. Fin can pick out my writing (or her sisters') almost immediately.
At some point, I read about researchers using compression to identify the authors of text excerpts. Compression algorithms build their encoding schemes by recognizing patterns, so in theory we can recognize an author's style by seeing which author's collected writing compresses an unknown excerpt best. Does it work?
The idea seems too simple to work - at least with any meaningful accuracy. Still, finding it fascinating, I decided to run an experiment. Searching my Google Reader, I found two blogs covering topics similar to mine, and a third, wildly different blog of technical posts. Scouring these blogs, I built a text collection for each author, selecting excerpts that covered similar subject matter.
My 'test' subjects included a blog post I wrote three years ago, a work email sent roughly three weeks ago, and a collection of my Google+ posts from the past few months. In addition, I grabbed two posts from the selected blogs. The work email and the technical blog use extremely similar terminology throughout; in theory, the compression technique should fail in this case, picking up on technology idioms instead of language usage.
To form a baseline, I add an unrelated text excerpt to each author's text collection, compress the collection with zip, and record the final size. After forming the baseline, I replace that additional text with each excerpt to be identified and compress again.
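As a rough sketch, the measurement can be done with Python's zlib, which uses DEFLATE - the same algorithm zip files use. The function names and toy corpora below are my own illustration, not the exact script I ran; the point is only that an excerpt sharing patterns with a corpus adds fewer bytes to its compressed size:

```python
import zlib

def compressed_size(text: str) -> int:
    """Length in bytes of the DEFLATE-compressed text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def attribution_score(corpus: str, excerpt: str) -> int:
    """Extra bytes the excerpt costs when appended to an author's corpus.
    A smaller delta means the excerpt reuses more of the corpus's
    patterns, suggesting the same author."""
    return compressed_size(corpus + excerpt) - compressed_size(corpus)

def identify_author(corpora: dict[str, str], excerpt: str) -> str:
    """Pick the author whose corpus absorbs the excerpt most cheaply."""
    return min(corpora, key=lambda author: attribution_score(corpora[author], excerpt))
```

Comparing the delta rather than the raw compressed size keeps corpora of different lengths on an even footing - each author is scored only on how much new information the excerpt adds on top of their own writing.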
Running the tests, I expected poor results at best. I'd purposely selected difficult scenarios, hoping to prod the technique into failure. In the end, all five tests correctly identified the author. I'd suspected a few might hit by chance, but not a 100% identification rate. For those curious, my work email matched first with my personal blog here, and second with the technical blog.
The strongest match? Identifying the social media posts.
I'd guess that increasing the number of authors would decrease the positive ID rate. Still, we could improve that situation by adding to the baseline and test data sets. Obviously, a short test using a common sentence (e.g., "I'm hungry") won't work well. Conspiracy theory thought: isn't social media providing an ever-growing baseline data set?
The idea that social media builds the strongest matches has interesting implications for this technique and for author identification in general. While we write on social media under our real names, are we working against our interest in remaining anonymous elsewhere? In any security scenario, the weakest element tends to be the humans running the show. While we research technologies such as Tor for privacy and the protection of political dissidents, the published speech itself points right back at the author. Could a child's grammar school paper condemn them as an adult?
In general, the takeaway here is that writing on social media, blogs, or English papers can be used to identify people in other contexts. Could I write this post and publish it truly anonymously?
Not as much as I'd like to think.