My First Corpus (2: wgetting Data)

WGETTING DATA

Before I began, I knew that eventually I wanted to get as much text data from 4chan as possible. I knew that there were boards, which were subdirectories of the site (e.g. the board /diy/ appears at 4chan.org/diy/), and that each of these corresponded to a topic and had its own moderators and, to some extent, its own rules. Each board consists of threads, which are linear series of posts by users, and each post can carry at most one attachment. Generally this would be an image, sometimes another format. I figured this meant that my dataset would have no issues with, for instance, HTML embedding or dealing with a tree of threads and subthreads like Twitter would. Overall, this is true, but 4chan includes a reply feature using >> that should allow construction of such a tree if I wanted one.
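
To make the >> idea concrete, here is a minimal sketch of pulling reply references out of post text. It assumes the text has already been unescaped to plain text (the raw data may well store > as an HTML entity), and the function name quoted_posts is made up for illustration. Each number it prints is a post that the current post replies to, which is the parent link a reply tree would need.

```shell
#!/bin/sh
# Hypothetical helper: read plain-text post bodies on stdin and print the
# post number of every >>12345-style reference, one per line.
# grep -o is not strictly POSIX, but GNU and BSD grep both support it.
quoted_posts() {
    grep -oE '>>[0-9]+' | tr -d '>'
}
```

Feeding it a single post body, for example with printf '%s\n' "$body" | quoted_posts, lists the posts that body replies to.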

Users are not required to identify themselves, meaning that they appear with the name field Anonymous, the namesake of the formless hacktivist community that is associated with the site. One can choose to use a name if they want, but this is quite unpopular: out of 964,544 posts in my dataset, 934,610 posts (96.9%) were made by Anonymous and only 29,934 (3.1%) were created by someone using a name. As we will see, using a name is something that other users broadly disparage. This anonymity creates a unique situation where each user has the opportunity to assume an entirely new identity for each thread (or even each post in a single thread, if they want). It also imposes time constraints. While it’s straightforward to refer to other threads to identify one’s previous posts, it’s not always convincing (more on that later), and threads are neither infinite nor permanent, so to some extent an identity created on one thread will be destroyed when the thread is bumped off the board. On /b/, 4chan’s notorious “Random” board, most threads only survive for a few minutes, meaning that in at least this space, users have a very small amount of time to present (and negotiate) their identities with each other. Other boards are slower and may show the same threads for several days at a time.

At any rate, I thought the smartest thing to do would just be to hand-assemble a list of boards I was interested in from the front page by copying the HTML, and then finding-and-replacing the markup in Notepad++ to narrow down the links I would use to get the data. It wasn’t even smart at all! But it did the trick. I downloaded the front page using wget, and in my browser I used Inspect Element to find the section of the page I needed for the next step. This method was a quick and scrappy way to get started. I ditched it later on, but if it’s all you’ve got and you know what you’re doing from there, you can apply it to other sites that don’t make an API available.

[Image: close-up of the 4chan board list through the lens of Inspect Element]

I looked through the HTML to find the <div> labeled ...id="boards", and trimmed away the rest. Then I found-and-replaced all the HTML and stray characters until the only thing left was a list of links to boards, along with their names as provided by the front page. Very hacky.
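
The same trimming can be done without an editor. What follows is a rough sketch, not what I actually ran: it assumes the trimmed fragment contains ordinary <a href="...">Name</a> anchors with double-quoted href attributes, and the function name extract_board_links is made up for illustration. The real front-page markup may differ.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print "url,name" for each
# <a href="...">name</a> anchor it finds.
# Assumes double-quoted href attributes and anchor text with no nested
# tags; a board name containing a comma would need proper CSV quoting.
# grep -o is not strictly POSIX, but GNU and BSD grep both support it.
extract_board_links() {
    grep -o '<a [^>]*href="[^"]*"[^>]*>[^<]*</a>' |
        sed 's/^<a [^>]*href="\([^"]*\)"[^>]*>\([^<]*\)<\/a>$/\1,\2/'
}
```

Run as extract_board_links < fragment.html > listfile (both file names are placeholders), this would give one line per board with the link in the first column and the name in the second.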

I then turned my attention to the links I had gotten. Each board contains a catalog page, which has a similarly formatted list of all the threads on the board. I did something like the following to download the catalog page for every board in that file.

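#!/bin/sh
# Download the catalog page for every board listed in listfile.
# The first CSV column of each line is taken to be the board URL.

while IFS= read -r Line
do
    # csvcut (from csvkit) pulls out the first column
    Url=$(printf '%s\n' "$Line" | csvcut -c 1)
    # Assumes each URL ends in a slash, e.g. .../diy/
    wget "${Url}catalog/"
done < listfile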

I think, at first, I trimmed one of the catalog pages down manually and then used some grepping to isolate the URLs for each thread, which I then read with another script. At some point I remembered that some sites make data available as JSON as well as nicely formatted webpages, and started poking around. For the hell of it, I tried boards.4channel.org/diy/catalog.json in my browser, and to my amusement it actually produced a file that looked like this:

[{"page":1,"threads":[{"no":670504,"sticky":1,"closed":1,"now":"07\/20\/14(Sun)16:06:58","name":"Anonymous","sub":"Makers gonna make!","com":"Welcome to \/diy\/, a place to:<br><br>Post and discuss \/diy\/ projects, ask questions regarding \/diy\/ topics and exchange ideas and techniques.<br><br>Please keep in mind:<br>- This is a SFW board. No fleshlights or other sex toys.<br>- No weapons. That goes to <a href=\"https:\/\/boards.4chan.org\/k\/\">\/k\/ - Weapons<\/a>. The workmanship and techniques involved in creating objects which could be used as weapons or the portion of a weapons project that involves them (e.g., forging steel for a blade, machining for gunsmithing, what epoxy can I use to fix my bow) may be discussed in \/diy\/, but discussing weapon-specific techniques\/designs or the actual use of weapons is disallowed. Things such as fixed blade knives or axes are considered tools, things such as swords, guns or explosives are considered weapons.<br>- No drugs or drug paraphernalia (See <a href=\"http:\/\/www.4chan.org\/rules#global\">Global Rule 1<\/a>). If you want to discuss something that could involve such things (e.g., carving a tobacco pipe from wood) that&#039;s fine, but make sure it&#039;s \/diy\/ related and doesn&#039;t involve drugs or it will result in deletion\/ban.<br><br>Helpful links:<br>
...
...
...

Excellent. This meant I could use wget to get the data from the catalog, sort the data by number, all kinds of things. In fact, several endpoints on the website can be suffixed with .json to yield the results I needed. For instance, boards.4channel.org/diy/thread/670504.json yields the data for thread 670504, the board’s sticky. For other threads, the same endpoint also includes the data for their replies.

{"posts":[{"no":670504,"sticky":1,"closed":1,"now":"07\/20\/14(Sun)16:06:58","name":"Anonymous","sub":"Makers gonna make!","com":"Welcome to \/diy\/, a place to:<br><br>Post and discuss \/diy\/ projects, ask questions regarding \/diy\/ topics and exchange ideas and techniques.
....
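
The endpoint pattern described above can be sketched as a couple of small shell functions. This is an illustration under assumptions, not the script I ended up with: the https:// scheme, the function names, and the output file name are all made up here, and only the host and the two path shapes come from the URLs I tried.

```shell
#!/bin/sh
# Assumption: same host as the URLs tried above, fetched over https.
BASE_URL="https://boards.4channel.org"

# Print the catalog JSON URL for a board name such as "diy".
catalog_json_url() {
    printf '%s/%s/catalog.json\n' "$BASE_URL" "$1"
}

# Print the JSON URL for one thread, given a board name and thread number.
thread_json_url() {
    printf '%s/%s/thread/%s.json\n' "$BASE_URL" "$1" "$2"
}

# Hypothetical wrapper: save a board's catalog as <board>_catalog.json.
fetch_catalog() {
    wget -O "${1}_catalog.json" "$(catalog_json_url "$1")"
}
```

Calling fetch_catalog diy would save the /diy/ catalog to diy_catalog.json, and thread_json_url diy 670504 prints the sticky's URL. When looping over many boards or threads, a short sleep between requests seems like the polite thing to do.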
