My First Corpus (3: Poking the Data [CW])

CW: slurs in the data.

Data collection

As I continue collecting data, I’ve settled on a format for now: a large CSV in which each record consists of the following fields:
[board], [thread number], [post number], [poster name], [timestamp (seconds)], [body]

The [thread number] is the [post number] of the thread’s OP, so the OP has identical post and thread numbers. Because of the way the CSV is sorted, subsequent posts to a thread immediately follow its OP.
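A minimal sketch of reading this format in Python and grouping posts by thread. The column names, the absence of a header row, and the sample rows are assumptions made for illustration; they are not taken from the actual data. Threads are keyed by board and thread number together, on the assumption that post numbers are only unique within a board.

```python
import csv
import io
from collections import defaultdict

# Assumed column names for the six fields described above (no header row assumed).
FIELDS = ["board", "thread", "post", "name", "timestamp", "body"]

# Hypothetical sample rows, not real data.
SAMPLE = """\
g,100,100,Anonymous,1672300000,first post in thread
g,100,101,Anonymous,1672300060,a reply
x,200,200,Anonymous,1672300100,"another thread, with a comma"
"""

def load_threads(handle):
    """Group rows by (board, thread number), keeping file order."""
    threads = defaultdict(list)
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        threads[(row["board"], row["thread"])].append(row)
    return threads

def is_op(row):
    """The OP is the post whose post number equals its thread number."""
    return row["post"] == row["thread"]

threads = load_threads(io.StringIO(SAMPLE))
ops = [row["post"] for rows in threads.values() for row in rows if is_op(row)]
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by `open("data.csv", newline="")`.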

I fooled around with csvcut, grep, sort, uniq, jq, sed, and wc to come up with a few reusable queries.

grep -iE "TERM" data.csv returns the entire record containing TERM, so a single search for the expression gives me all of the metadata (especially the thread and post numbers) from the command line. This is super helpful for manually hunting down instances and for gathering general statistics.
grep -iEo "TERM" data.csv | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn gives a list of the most common variants of a regex pattern in the file, ignoring capitalization differences. For instance, "\w+TERM" returns a count-ordered list of words suffixed with -TERM, and "\w+ TERM \w+" shows how often TERM appears between each pair of surrounding words. Likewise, "TERM1 \w+ TERM2" can be used to rank the words most likely to be found between TERM1 and TERM2.
grep -iE "TERM" data.csv | csvcut -c 1 | sort | uniq -c | sort -rn gives a count, for each board, of messages containing a search term (without -o, grep returns a matching line only once even if it contains multiple matches). This may even be more useful than the raw count of matches, because certain terms like the N word are likely to appear copypasted hundreds of times in a single message.
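For comparison, the last two pipelines can be approximated in Python. This is a rough sketch rather than an exact equivalent: it searches only the body column (grep searches the whole line), it assumes the six-column layout described earlier with no header row, and the sample rows and function names are invented for illustration.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample rows: board, thread, post, name, timestamp, body.
SAMPLE = """\
g,100,100,Anonymous,1672300000,hello bro
g,100,101,Anonymous,1672300060,Gokubro and gokubro and BROS
x,200,200,Anonymous,1672300100,no match here
x,200,201,Anonymous,1672300160,dudebro
"""

def variant_counts(rows, pattern):
    """Count every match, lowercased (like grep -iEo | tr | sort | uniq -c)."""
    regex = re.compile(pattern, re.IGNORECASE)
    counts = Counter()
    for row in rows:
        for match in regex.finditer(row[5]):
            counts[match.group(0).lower()] += 1
    return counts

def messages_per_board(rows, pattern):
    """Count each matching message once per board (like grep | csvcut -c 1 | uniq -c)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return Counter(row[0] for row in rows if regex.search(row[5]))

rows = list(csv.reader(io.StringIO(SAMPLE)))
variants = variant_counts(rows, r"\w*bros?\b")
boards = messages_per_board(rows, r"\w*bros?\b")
```

`variants.most_common()` and `boards.most_common()` give the count-ordered lists that `sort -rn` produces in the shell versions. Note how the second sample row contributes three matches to `variants` but only one message to `boards`.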

First peeks

While there are some other queries I’m still trying to figure out (counts of terms by board, subthreading/reply-tracing, stuff that would be easier to do in SQL), the above is enough of a start to write some simple scripts and search for things to look into. This folder contains some results that I got from poking at the CSV. 'fef' refers to fef.sh, a script that simply runs the last two commands against a JSON list of search terms and generates a more legible report, including excerpts from the files it runs and the patterns that were searched. Some of the patterns are as simple as affixes, but the queries below are what I came up with to detect 1st- and 2nd-person identification. They match a selection of expressions like “I am the”, “I’m not a”, “You’re the” etc. followed by another word: for instance, “You are the [one]” or “I am the [best]”.

\bi(?:m|\Wm| am| was| have been) (?:not )?(?:a[n]?|the|from|someone|somebody) \w*
\byou(?:re|\Wre| are| were| have been) (?:not )?(?:a[n]?|the|from|someone|somebody) \w*
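To see what these two patterns pick up, here is a small Python sketch. Python's re module supports the non-capturing groups used above. The only change to the patterns is an assumption made for convenience: the final `\w*` is wrapped in a capturing group so that the identifying word can be tallied on its own. The sample sentences and the function name are made up.

```python
import re
from collections import Counter

# The two patterns from above, with the final \w* captured (an added convenience).
FIRST_PERSON = re.compile(
    r"\bi(?:m|\Wm| am| was| have been) (?:not )?"
    r"(?:a[n]?|the|from|someone|somebody) (\w*)",
    re.IGNORECASE,
)
SECOND_PERSON = re.compile(
    r"\byou(?:re|\Wre| are| were| have been) (?:not )?"
    r"(?:a[n]?|the|from|someone|somebody) (\w*)",
    re.IGNORECASE,
)

def identification_words(bodies, pattern):
    """Tally the word that follows each identification expression."""
    counts = Counter()
    for body in bodies:
        for word in pattern.findall(body):
            if word:  # \w* can match nothing, e.g. before punctuation
                counts[word.lower()] += 1
    return counts

# Hypothetical message bodies, not taken from the corpus.
sample_bodies = [
    "I am the best",
    "I'm not a robot",
    "you're the one",
    "You are a robot, I was a robot too",
]
first = identification_words(sample_bodies, FIRST_PERSON)
second = identification_words(sample_bodies, SECOND_PERSON)
```

Here `first` counts "robot" twice and "best" once, while `second` counts "one" and "robot" once each.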

The results of queries like these prompt further questions about which words are frequently used for self- and other-identification, and those questions can guide further investigation of specific lexical items. For instance, if we find that “-bro” is a common suffix in self-identifications, we can use the metadata in each message to follow up by asking what the most common variants of <?>bro are and which board uses each one the most. These observations can prompt even more questions about why each term is so common, or why it is used at all. Below is a list of variations of <?>bro with notes inserted.

\w*bro[s]?\b
==> ./stats/fef/bro-suff-vars.csv <==
count in corpus,variant
30467,"bro"
23819,"bros"
739,"yejibros" <- why is the plural so common?
705,"hasbro" <- ... the toy company?
413,"gokubro"
157,"retardbro"
154,"bejitabro"
135,"kabukibro" <- literally kabuki?
81,"shieldbro" <- Shieldbro is a specific person...
72,"dudebro" <- ...but dudebro is a generic
68,"bejitabros"
65,"yejibro" <- Yeji is a Korean musician
...
49,"dobro" <- English 'do' + 'bro', or Slavic 'dobro'?
46,"dobros" <- Maybe the style of guitar?
...
44,"bonebros" <- Euphemistic?
43,"turkbro" <- Turks? Fan of The Young Turks??
42,"swordbro" <- Is swordbro generic or a name?
42,"mektonbro"
...
39,"cunnybros" <- Fetishization of trans men?
37,"holobros" <- holo...?

Out of the above sample there are suddenly so many things to interrogate. Why is ‘yejibros’ used SO MUCH more frequently than ‘yejibro’? In this case, Yeji is a Korean rapper; my observation is that ‘yejibros’ is used to rally or address Yeji fans in general, or to generalize about them, while much less discussion seems to happen about an individual yejibro, even as a generic. How can we tell from a given word’s context that it is a reference to a specific person? Bunkerbro, for instance, is a specific user well-known for posting about a bunker they allegedly acquired and have documented renovating in threads. But bunkerbros can also refer to (doomsday) preppers. Shieldbro is also a particular user who posts under that name (which, as has been discussed, is uncommon). If we search “dobro”, we find lots of Slavic text as well as “WAGMI DOBRO”??? What is wagmi¹? Does 4chan really talk about Hasbro toys that much²?
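The singular/plural asymmetry can be put into numbers using only the counts listed above. The sketch below pairs each variant with its "-s" form where both appear in the list and computes the share of plural uses; the function name is invented, and variants whose counterpart was elided from the list are simply skipped.

```python
# Counts copied from the <?>bro listing above.
COUNTS = {
    "bro": 30467,
    "bros": 23819,
    "yejibros": 739,
    "yejibro": 65,
    "bejitabro": 154,
    "bejitabros": 68,
    "dobro": 49,
    "dobros": 46,
}

def plural_share(counts):
    """For each singular with a listed '-s' plural, return plural / (singular + plural)."""
    shares = {}
    for variant, singular_count in counts.items():
        plural = variant + "s"
        if plural in counts:
            total = singular_count + counts[plural]
            shares[variant] = counts[plural] / total
    return shares

shares = plural_share(COUNTS)
for variant, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{variant}: {share:.2f} plural")
```

On these numbers, roughly 92% of yejibro(s) tokens are plural, against about 31% for bejitabro(s) and 44% for plain bro(s), which makes the yejibros skew stand out.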

==> ./stats/fef/cuck-all-vars.csv <==
count in corpus,variant
8959,"cuck"
4329,"cucked"
3255,"cucks"
962,"cuckold"
945,"cucking"
759,"christcucks"
677,"christcuck"
385,"cuckoldry"
246,"cuckshit"
224,"wagecuck"
184,"cuckolds"
182,"cuckolding"
161,"cuckquean"
148,"cuckposting"
123,"cuckservatives"
116,"christcuckery"
115,"cuckery"
108,"cuckservative"

Here is one of my favorite findings: “cuck” used as an affix evokes irrelevance, bystanderness, impotence, emasculation etc., based on its literal origins in the expression “cuckold”. The most common compound variant in the 12/29 dataset is “christcuck(s)”. 4chan is often characterized as a hub for the alt-right, a group identity constructed in distinction from the traditional right, and one such distinction could be made against Christian religious values. This would be in line with both this oppositional construction and 4chan’s seemingly atheist-nihilist disposition. Pulling from the data, the term is used to attack people for perceived blind followership (cf. “sheeple”), and it seems to be especially popular on /x/, the “paranormal/occult” board.

Lazy minded christcuckery.
(/x/, 30670862)
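One way to check an impression like "especially popular on /x/" is to compare the share of each board's messages that contain the term, rather than raw counts, since boards differ a lot in size. A minimal sketch, using placeholder posts and the placeholder pattern TERM; none of the data or names here come from the corpus.

```python
import re
from collections import Counter

# Hypothetical (board, body) pairs standing in for rows of the CSV.
SAMPLE_POSTS = [
    ("x", "a post containing TERMery"),
    ("x", "unrelated post"),
    ("pol", "unrelated post"),
    ("pol", "another unrelated post"),
    ("pol", "terms mentioned here"),
    ("pol", "still unrelated"),
]

def share_by_board(posts, pattern):
    """Fraction of each board's messages that match the pattern at least once."""
    regex = re.compile(pattern, re.IGNORECASE)
    totals = Counter(board for board, _ in posts)
    hits = Counter(board for board, body in posts if regex.search(body))
    return {board: hits[board] / totals[board] for board in totals}

shares = share_by_board(SAMPLE_POSTS, r"term\w*")
```

In this toy sample both boards have one matching message, but the share is 0.5 on x and 0.25 on pol, which is the kind of difference raw counts would hide.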

“Cuckservative” is only a little bit behind; examples pulled from the data point to an impression of impotence:

“cuckservatives (MTG’s simping fanabse [sic]) aren’t ever going to do shit”
/pol/, 354879684

In the next post, I analyze a pattern of interest that I call an identity dispute, which is a speech event that is characteristic of, and ubiquitous on, 4chan.

Footnotes:

¹ (We’re all gonna make it)
² (yes; discussion of anime figurines is quite popular)
