BACKGROUND
If you’re interested in linguistics, at some point you’ve probably run into “Corpus Analysis”, which is linguistic work done on a substantial body (“corpus”) of text. There are all kinds of corpora: one might be all the works by Shakespeare, or all of the language used on a particular news site in a given year, or all the emails between members of a corporation. They can be tagged for grammatical information, or simply compiled with minimal metadata like the time and origin of the text. The ones that I’m interested in right now consist of data from social platforms like Reddit, Tumblr, Twitter, and 4chan, as well as apps with narrower use cases like Tinder and Grindr. All of these differ from traditional media in that they feature varied genres of discourse, and some, like Reddit and 4chan, make it fairly easy to pull large swaths of data from the site, as long as you have some scripting knowledge.
This year I started a corpus project with 4chan for a number of reasons: the site is uniform, there’s no need to register to access its data, and it poses a unique set of social circumstances that make it worth studying. I wanted to put together a corpus and then poke at the data until something interesting came out, and many things did. I read a couple of papers to help guide this informal research. I didn’t have any particular agenda, so essentially everything I chronicle here comes out of curiosity. I hope it’s presented in a relatable and useful way if you’ve ever thought about doing something like this yourself.
As I get more into the weeds, keep in mind that what I’m studying includes text data from 4chan, which has earned its highly offensive reputation. Over the course of this writing, I’ll occasionally include uncensored excerpts from the dataset(s), which contain all manner of objectionable material, from homophobia and racism to casual discussion of paraphilias and so on. In posts where this is of concern (most of them), [CW] appears in the title and the article will be prefaced with a more specific content warning, so that readers can steer clear if they prefer.