FSE Meets the FBI! — FSE Blog

FSE Meets the FBI!

I have for you a bizarre tale of scrapers, feds, data poisoning, Torswats, and everyone's favorite fedi instance. It veers technical, because I suspect it will be of interest to other people running servers with UGC (i.e., every fedi instance), and also because the mechanics of how I figured out what I figured out might be useful. It's also got information about how the FBI collects data, which is of interest to everyone, but especially US citizens. I have a few pieces of the puzzle, maybe someone with interlocking pieces can say more; I'm happy to compare notes.

To summarize, the FBI pays some shady companies to scrape data, and the data is scanned for keywords (yep, just like CARNIVORE). Links and content are then fed into Facebook, organized by topic based on the keywords. Some rudimentary analysis is performed (sentiment analysis at least, but as friendly as Microsoft is with the feds, and as LLMs have gotten popular, the influence of machines has probably expanded), and the results are perused by agents using some FBI internal interface.

The TL;DR above probably implies this, but this is the longest post on here to date, by a wide margin. I expect that most people will skip around instead of reading straight through: probably only people running instances are going to be interested in the technical parts, and when building out the chronology I erred on the side of providing too much information rather than simplifying.

A note about links that go to fedi sites: some of them (like the main FSE site) are down, but if you put the links into your own instance's search box, you can generally find the post (if it federated; obviously if your instance is newer or if it's configured to scrub old posts, you might not have it) and view it on your own instance. In either case, it's generally better to try searching for the post's URL on your home instance, whether that instance is currently live or not, because then you can interact with it locally. (FSE is actually just a few commits from resuming federation, though being ready to use is a bit farther out. Within the next few days, probably, it'll be possible to resume fetching objects from it; I have a couple of bugs to fix in the way Revolver stores users, though it's serving objects just fine.)

Root Cause

Pedophiles were showing up on FSE. The rest of this section is background, the TL;DR is I had a problem getting pedophiles to stay away from FSE and I wanted to stop waking up in a cold sweat. You can skip this bit without missing much besides my woes and what I did about them.

As far as I could reason, pedophiles were, besides being a problem themselves, the most likely reason for FSE to attract attention from law enforcement. Nobody wants to host CP (or be in its vicinity) or get their gear seized because they fell asleep at the wheel and let the pedos have free rein, or because they got CP-raided. (If you are not familiar, this is a type of false-flag where a group of people floods a place with CP and then immediately alerts the FBI, usually done to get the site taken down. Of course, you have to have it to upload it and usually if it results in any arrests, it results in the arrest of the attackers.)

I was not happy about their arrival because, like with most fedi instances, it represented the primary existential threat to FSE, but as long as they weren't bothering any kids or uploading anything illegal, FSE has zero viewpoint censorship. Almost invariably, they would immediately do something illegal.

My initial suspicion was that they misunderstood the rules, the meaning of freedom of speech, something like that. It turns out that a lot of them have a habit of just dumping illegal stuff everywhere and coming back in a week to see which accounts have been banned and which places have admins that are asleep (an approach that basically leaves no doubt that they are knowingly parasitic), then telling their friends, so you have to stay on top of it and get rid of them early, or they bring more. It should not be a surprise that if someone's gratification is predicated on getting what they want without regard for who it hurts, that person is happy to engage in parasitic behavior like this: violating a server and panicking an adult is no big deal if you're willing to violate a child and potentially ruin the kid's life.

I'd like to also thank fediblock for never fact-checking anything ever, giving the false impression that things that FSE has never permitted were allowed. FSE being fedi's equivalent of a dive bar, I understand people on "gated community" instances not wanting to deal with it (though it turns out that instance-blocking is ham-fisted and just blocking a handful of accounts solves the problem), but I would prefer if they did not lie about their reasons or about me personally. That sort of thing doesn't help when pedos show up having heard that there are no rules. The blocks don't help them, either: I've sent messages to admins on instances that were (hopefully unknowingly) hosting CP but got no response because they blocked FSE. That's their problem, though; hope the block was worth it.

It does mean that when the FBI seized the kolektiva.social database backups, nothing from FSE was in there. As combative and block-happy as that instance was, very few other instances actually did make it into their database; for all anyone knows, the "accidental seizure" might have been just cover for a CI: if the FBI wanted the database and the CI had it, serving an overly broad warrant lets them collect it without burning that informant. I don't know anything about the people involved, but the FBI has used that tactic before.

So when someone came looking for or attempting to provide CP I started just posting their IP and email and UA and whatever I had or could dig up. (If you've punched a waiter, you can't complain that he refuses to bring you food. Likewise with anyone trying to get me to host CP and then whining about me leaving them at the mercy of the internet. I'm happy to care about your expectations for reasonable terms of service until you intentionally try something that you know is not just against the rules anywhere, but that can get the site eliminated and get me arrested. I will do my best to discourage you from proceeding.) I wanted to convey, completely unambiguously, that this is a hostile place to people doing that kind of thing: the thing that really worries pedos is transparency.

But it turns out that almost none of them were even paying attention: they were just here to dump files or grab files and leave, or they expected to be banned 90% of the time and were looking for places where they didn't get banned. So it didn't work, and I kept digging to figure out where they were coming from.

Doing some digging

If you are running an instance, it's even odds that you're doing it because you're interested in computers in general and the best way to learn is solving the problems that crop up. This is great! I have some helpful information; I post about it a lot but I haven't put it in one place. Here is one place. It's hopefully helpful to people that don't have a lot of experience with the topic and possibly has some bits that are of interest even to people that have been at this a long time.

I've got a bigger piece about this in the larder but here's a survival guide: it should give you enough that you can fill in the gaps by hitting the books, and enough of the technical background to understand the rest of the story. It's a little dense in parts.

Why?

Unfortunately, a lot of the documentation for any given piece of software ignores the problems that crop up: coders like to think their software is painless and they treat information to the contrary as a bug in their code (or sometimes as a bug in the real world). Pleroma and similar software also have two audiences: users and people with their hands in the guts, and people with their hands in the guts need the real info. But workarounds, real troubleshooting information, things like this are a little embarrassing to include. "This is how much it'll tolerate before it breaks" is critical information, but the first thought a coder usually has is "It shouldn't break", and then they try to come up with a solution rather than document the tolerances. (Eventually, everything breaks. You just don't know how much it takes or what happens when it breaks if you don't test it.) Combine that with the fact that most complaints come from people that expect the software to Just Work™ and you can expect that you won't get too many tips for learning how to deal with Weird, and if you run networked software that talks to the open internet, you will encounter Weird. There are bots and scanners and worms, and as fedi grows, all servers become more interesting targets. And if it's a high-traffic server with open registrations, you'll attract at least a little targeted attention. (And of course, more if you call the server something like "freespeechextremist.com": I may as well make the favicon a big, red bullseye. But I like encountering Weird, so it's no trouble for me.)

But the Weird is only weird until you have an answer, so between that and the paucity of documentation of the Weird (because it is nebulous, because coders often don't know all the bugs and if they do, they often don't like to document them, and because high-traffic servers are rare on fedi), your best bet is to get good at diagnostic tooling and analyzing data. Half of running a server of any sort is being able to tell if something undesirable is going on and the other half is figuring out the shape of it. (Actually fixing it tends to be trivial.) That is, you come up with questions and then you figure out how to answer them. A lot of the questions are going to be "Why is this slow?" or "Why did this stop working?" but if you understand the stream of logs, sometimes the questions are going to be "What the hell is that?"

You want to be able to understand the logs directly, but you will also need to be able to understand them in aggregate and correlate them with other logs. Luckily for you, it's all text streams and text files and Unix is full of tools for answering questions about text. Here's the crash course!

Crash course!

This is dense but not difficult: if you can set up Postgres and Pleroma and nginx, this material is all within your grasp. Learn one of the things below and you will get some use out of it, and the use is exponential the more of them that you learn.

If you know any scripting languages, you can learn enough awk for it to be useful in 30 minutes: every awk program is predicate1{action1} predicate2{action2} [...]. If you don't, you can probably pick up awk in a few hours. (I'm serious, no exaggerating. Do yourself a favor.) If you combine this with tail -f and mawk -Winteractive then this is as much as you need in order to do real-time analysis of your log files. I can't speak highly enough of awk's usefulness: it's like the SQL of plain text.
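To make the predicate/action model concrete, here is a minimal sketch of the same idea in Python rather than awk. The log format ("METHOD PATH STATUS") and the sample lines are assumptions for illustration, not FSE's real log layout.

```python
from typing import Callable, Iterable, List, Tuple

# One awk-style rule: a predicate that tests a line, and an action to run on it.
Rule = Tuple[Callable[[str], bool], Callable[[str], None]]


def run_rules(lines: Iterable[str], rules: List[Rule]) -> None:
    """Run every rule against every line, like awk's pattern/action pairs."""
    for line in lines:
        line = line.rstrip("\n")
        for predicate, action in rules:
            if predicate(line):
                action(line)


# Hypothetical, simplified log lines: "METHOD PATH STATUS".
sample = [
    "GET /main/all 200",
    "POST /api/v1/accounts 200",
    "GET /api/v1/timelines/public 429",
]

signups: List[str] = []
throttled: List[str] = []

rules: List[Rule] = [
    (lambda line: line.startswith("POST /api/v1/accounts "), signups.append),
    (lambda line: line.endswith(" 429"), throttled.append),
]

run_rules(sample, rules)
print("signups:", signups)
print("throttled:", throttled)
```

Passing `sys.stdin` as `lines` and piping `tail -f` into the script gives roughly the same real-time behavior, though awk will usually be shorter.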

If you are any good with awk and some basic networking tools (dig, whois, traceroute, tcpdump, iftop), you know how to use datasets (NRO delegated stats, for example; whois on IPs will often give a geofeed URL, etc.) and services (whois again, Shodan is a good start, DDG lists several services that it integrates), it's possible to figure out just from your webserver logs who is who, when someone signed up (grep the logs for the POST to the accounts endpoint), from where, whether or not it was a Tor exit or proxy, what language they have told their browser to say they speak, things like that. There is no shortage of tools for network exploration, and the more you learn about how the Internet works, the better you'll be able to use them. That should suffice: do a few hours of reading and you have just put your competence into the top 10% of fedi admins, and learned some things that you can apply anywhere.

nginx, lighttpd, Apache httpd, and almost all of the other webservers that are popular to set in front of fedi for load-balancing or caching or filtering or rate-limiting, they all have some directives that allow you to control the format of the logfiles, log arbitrary headers, timing information. It's not always necessary, but keep it in your pocket for when you need it. With FSE (and other servers I run, not just fedi stuff), I usually strip most of the quote marks added in the common log format, I add a lot of timing information (especially time required to get a response from the backend). I also use tabs, effectively making the log files a big TSV: the extra space makes the files a little easier to read visually, a little simpler to use with awk, but also opens up the tooling options: R, sed, sort, nearly anything can read TSVs. Read the output of grep into irb or some other REPL that's good at string-mangling; set IFS and pipe it into a while read line loop in bash. Or if you have some favorite spreadsheet software, they can all read TSVs (but you'll probably want to filter it or split it into chunks unless you have the kind of spreadsheet software that doesn't choke on a 20GB file). (Of course, because I deal with 20GB files, the tooling I use can handle pretty big files.) In fact, although awk is like SQL for delimited text, you can even use regular SQL: sqlite3 can operate on text pipelines. And if you're good with SQL, check the man page for psql: you can have it emit TSVs pretty easily! A full SQL tutorial would take longer than awk, but SQL is very useful to pick up.
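As an illustration of why tab-separated logs are convenient, here is a rough Python sketch that reads a TSV log and averages the backend response time per path. The column order in `FIELDS` and the sample rows are assumptions made up for this example; a real log will have whatever columns its format directive specifies.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column layout; adjust to match your own log format.
FIELDS = ["time", "ip", "method", "path", "status", "backend_seconds"]


def mean_backend_time(tsv_text):
    """Return {path: mean backend time in seconds} for a TSV log."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # the log has no quoting, so treat quotes literally
    )
    for row in reader:
        totals[row["path"]] += float(row["backend_seconds"])
        counts[row["path"]] += 1
    return {path: totals[path] / counts[path] for path in totals}


# Made-up rows using documentation-reserved IP addresses.
sample_log = (
    "2023-03-05T10:00:00+00:00\t192.0.2.1\tGET\t/api/v1/timelines/public\t200\t0.25\n"
    "2023-03-05T10:00:01+00:00\t192.0.2.2\tGET\t/api/v1/timelines/public\t200\t0.75\n"
    "2023-03-05T10:00:02+00:00\t192.0.2.1\tGET\t/main/all\t200\t0.5\n"
)

averages = mean_backend_time(sample_log)
print(averages)
```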

Really simple numerical analysis is indispensable when you are looking for aberrances: the Weird sticks out. If you're tailing the logs and piping that somewhere, you can keep a running average and calculate standard deviations and find the outliers. 68, 95, 99.7: calculate the standard deviation of your window, and you can find things that stick out. (In awk, this is really concise, since it keeps track of line numbers, technically "record" numbers, automatically: {a[NR%1000]=$0} keeps the last thousand records in the array a.) Some endpoints start to get a disproportionate amount of traffic out of nowhere, some IPs show up out of nowhere and flood you. Finding things that stick out is the point of data analysis. (This is the sort of tool I was writing/using when I would post things like "90% of the POSTs are coming from the same address".) On fedi, network-based DDoS tends to be more popular than targeting specific endpoints; if you see someone suddenly hammering TWKN, it's probably a scraper. (Probably!)
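The windowed-outlier idea can be sketched in a few lines. This is an illustrative Python version, not the awk tooling described above; the window size, the three-sigma threshold, and the `warmup` parameter are all assumptions you would tune for your own traffic.

```python
from collections import deque
from statistics import mean, pstdev


def find_outliers(values, window=1000, sigmas=3.0, warmup=10):
    """Flag values more than `sigmas` standard deviations from the mean
    of the preceding window. Returns a list of (index, value) pairs."""
    recent = deque(maxlen=window)  # same role as a[NR%1000] in awk
    flagged = []
    for index, value in enumerate(values):
        if len(recent) >= warmup:
            mu = mean(recent)
            sd = pstdev(recent)
            if sd > 0 and abs(value - mu) > sigmas * sd:
                flagged.append((index, value))
        recent.append(value)
    return flagged


# Made-up data: steady requests-per-minute counts, then a sudden flood.
requests_per_minute = [10, 12] * 10 + [500, 11]
print(find_outliers(requests_per_minute))
```

Recomputing the mean and deviation on every step is wasteful for large windows, but it keeps the sketch short; a running sum and sum of squares would do the same job incrementally.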

The last thing to learn is when you're being too paranoid: sometimes it's not a DDoS, someone just linked to a post from a popular site like HN, sometimes an idiot wrote the scraper and doesn't understand rate-limiting. (Always include a mechanism to limit the rate, even if you are making a trivial Markov bot.) Sometimes, it's just someone trying to write a new client and the client has a bug: fedi is exciting like that, there are hackers all over.
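For anyone writing a bot or scraper, a rate limit can be as small as this. It is a minimal sketch of a fixed-interval throttle in Python; the class name and the injectable `clock` and `sleep` parameters are inventions for this example (they make it testable without actually waiting).

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


# Usage: call throttle.wait() before every request the bot makes.
throttle = Throttle(min_interval=2.0)
throttle.wait()  # the first call returns immediately
```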

Scar Tissue

Over time, running a service, you bump into problems, and your service grows solutions to those problems, and you keep the solutions around so that you don't have the same problem twice. The service acquires scar tissue, reminders of problems, marks that distinguish it from a new service. (The scar tissue is one of the reasons that you encounter so many surprises when doing a big rewrite.)

Early in FSE's history, we had some malicious signups: just normal spammers. I view captchas as hostile to legitimate users and ineffective (and of course, Pleroma's captchas have been cracked since then). Likewise with email verification, except that expecting a real email isn't just hostile design, it's also a privacy risk. There's a tradeoff there: I have to manually generate password resets if people forget them, FSE can't send out email notifications, things like that, but on the other hand, someone that wants email addresses isn't going to target FSE, and in the event FSE is compromised, the email addresses aren't leaked. (If someone has an email address and a password, the first thing they'll do is try to use that combination on other services. If they don't have a real email address, though, they can't.) But there are a lot of tradeoffs you have to make and people that don't like the ones I made can make different servers or join different ones.

There's work stuff where, if you can use an off-the-shelf solution to a problem, that's usually what you want; it's generally an explicit policy to spend as little time as possible on things that aren't part of your core business. That's reasonable but I don't think web applications have gotten more reliable in the decade and a half since this became conventional wisdom, so maybe it's worth considering the emergent properties of this kind of policy. So when I'm on my own, a project where I have no boss and no reason to compromise, I generally roll my own solutions. I like coding anyway so it's not a costly experiment, and these solutions generally end up faster, more reliable, and more flexible; aside from that, I'm not at a vendor's mercy for bugfixes. (There are always bugs. The difference is whether I've got to try to find them in someone else's 200kLOC codebase or in my own 4-line script. Additionally, you can guess whether it's faster to bash a four-line script into the text editor or tweak a 400-line config file.)

So I needed a way to tamp down on the spam without making the site suck: my solution was to tail the logs, send them through an awk script, and it would just email me when it saw someone do POST /api/v1/accounts and get a 200 back. Eventually I expanded the script and it would do things like check pidof xlock as a reasonable approximation of whether I was at my desk and if xlock wasn't running, it would pipe a message through espeak -s120 -v other/en-sc. (I call him "scotbot". As you can hear, I did have it correctly pluralize "user" but forgot to change "there are" to "there is" when only one occurred.) At some point, we got a really big hit, and I had nginx rate-limit signups to one per minute. Later on, I did the welcomebot, so new signups would be announced in public.
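The core of a signup watcher like that is small. Here is a hedged Python sketch, not the original awk script: it assumes tab-separated lines whose first three fields are method, path, and status, and it fixes the "there are 1 new user" pluralization that scotbot got wrong.

```python
def is_signup(method, path, status):
    """A successful account creation: POST /api/v1/accounts answered with 200."""
    return method == "POST" and path == "/api/v1/accounts" and status == 200


def count_signups(lines):
    """Count signup requests in an iterable of tab-separated log lines."""
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip malformed lines rather than crash mid-stream
        if is_signup(fields[0], fields[1], int(fields[2])):
            total += 1
    return total


def announcement(count):
    """Build the spoken message, agreeing verb and noun with the count."""
    verb = "is" if count == 1 else "are"
    noun = "user" if count == 1 else "users"
    return f"There {verb} {count} new {noun}."


# Made-up sample lines in the assumed layout.
sample = [
    "POST\t/api/v1/accounts\t200",
    "POST\t/api/v1/accounts\t429",
    "GET\t/main/all\t200",
]
print(announcement(count_signups(sample)))
```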

So I would keep an eye on new users arriving: usually it was just merry shiptoasters, but once in a while I see a pedo show up and this sets off the self-preservation instincts, so I dump the stuff they've done and if they are still there and I haven't had to hit the red button yet (that is, they haven't done POST /api/v1/media yet, compelling me to kill off their IP and account and delete all of the shit they uploaded), I can watch them move around, look at what they are searching for.

The pedos would land on a page, some post or something, usually a local mirror of a post from another instance, then they'd sign up, start mashing search terms into the box (which is usually how I noticed them: some search terms were added to an awk script that would ping me), follow a handful of accounts, and usually just leave. A peculiar thing stuck out: a lot of them were coming from boardreader.com, based on the Referer header, so I tugged on that thread, and that thread turned out to be the weird one.
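Spotting a pattern like that in the Referer header is a counting problem. A minimal Python sketch, assuming you have already pulled the Referer values out of the relevant log lines (the sample values here are made up):

```python
from collections import Counter
from urllib.parse import urlsplit


def referer_hosts(referers):
    """Count requests per referring hostname, ignoring empty or junk values."""
    counts = Counter()
    for referer in referers:
        host = urlsplit(referer).hostname  # None for "-" or an empty string
        if host:
            counts[host] += 1
    return counts


# Hypothetical Referer values; "-" is what many servers log when it is absent.
sample_referers = [
    "https://boardreader.com/thread/example",
    "-",
    "https://boardreader.com/s/example.html",
    "https://example.org/some/page",
    "",
]

hosts = referer_hosts(sample_referers)
print(hosts.most_common(3))
```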

Who is boardreader.com?

I'd never heard of them, so I looked around, and boardreader.com was a strange site indeed: very barebones, didn't work over Tor, no contact information listed anywhere. (Some time in the interim, they added a SocialGist banner at the bottom...that goes to a 404 now that SocialGist has moved.) I bashed in some of the search queries that the pedos had used on FSE and was pretty horrified to find the posts they landed on, all of which originated on other servers, but all of which were also ascribed to FSE.

It turns out that BoardReader was a tool for searching forums. The authors don't appear to understand what fedi is, so they had treated FSE as a forum, and all of the public posts that came to FSE from elsewhere as forum posts made on some forum called "freespeechextremist.com".

Apparently it was a small search index for forums and it got big enough to be bought by a Japanese company, the two founders had issues with the new owners, and the company eventually was sloughed off and acquired by SocialGist (which now redirects you to socialgist.ai). SocialGist purports to sell "accessible social data", they list several data sources, and per their blog, their developers are in Serbia, which lines up with the IP they were using, so I've started thinking I've got the right people.

Most of the search results indicated that long ago, BoardReader identified itself in the User-Agent header and most of the targets viewed it as hostile: it's present on a lot of lists of poorly behaved bots. There are also complaints about it on a lot of forums, and there are threads where people are asking how to stop it; in those threads, some people show up from nowhere and suggest that the person running the board should be grateful for the traffic bump. (If you owned a search engine for discussion boards, wouldn't you use it to search for mentions of your engine? And if you were running a somewhat aggressive crawler that was annoying people, it's a matter of temperament to decide whether to ask what would bother people less versus showing up to argue with them. I give it even odds that those posts were made by owners or employees at BoardReader itself.)

BoardReader and FSE

2023-03-05 (Sunday)

I went over and grepped the logs to see if they'd been to FSE: nothing. But they had to be getting data from FSE: they had posts from other instances and links to FSE. So I kept looking and found a large amount of scraping on /api/v1/timelines/public?local=false from a browser claiming to be Chrome, and coming through way faster than a human could scroll even if they were leaning on the Page Down key.
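"Faster than a human could scroll" can be turned into a simple check: count requests per IP inside a sliding time window. The sketch below is Python with made-up thresholds (30 requests per 60 seconds) and made-up events; the right numbers depend on your own traffic.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def fast_clients(events, max_requests=30, window=timedelta(seconds=60)):
    """Return the set of IPs that made more than `max_requests` requests
    within any `window`. `events` is (timestamp, ip) pairs sorted by time."""
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for when, ip in events:
        queue = recent[ip]
        queue.append(when)
        while when - queue[0] > window:
            queue.popleft()
        if len(queue) > max_requests:
            flagged.add(ip)
    return flagged


# Made-up events: one client every second, another once a minute.
start = datetime(2023, 3, 5, 12, 0, 0)
events = [(start + timedelta(seconds=i), "198.51.100.7") for i in range(40)]
events += [(start + timedelta(minutes=i), "192.0.2.10") for i in range(5)]
events.sort()

flagged = fast_clients(events)
print(flagged)
```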

spider1.boardreader.com through spider43 all had A records, but traffic was coming through 45.15.176.187 (which was, at the time, owned by DediPath). That was odd, right: why would BoardReader go to the trouble of making A records for their spiders and then go through some other service?

So, I tell the server to drop traffic from the IPs that were scraping. Problem solved! Then immediately I start seeing a large number of attempts from different IPs. Residential IPs in the US: they're buying residential proxies. It's one thing to lie in the User-Agent header, but it's a step past that to pay money to evade detection. Someone that has money to burn wants FSE scraped, probably a business. At this point I'm certain enough that it's BoardReader. I dash off a quick email to info@boardreader.com asking for information on their crawler. Since they are going to lengths to hide what they are doing, I don't expect much, but it doesn't hurt.

2023-03-08 (Wednesday)

So I need an automated approach if they're automatically hopping proxies. awk and iptables plus a really quick Ruby script to sit between nginx and Pleroma for that endpoint, and I can start dropping traffic from any IP that tries to hit FSE with that token. If I keep it there, I'll exhaust their proxies before they fill their cache.
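The general shape of that kind of automation can be sketched as follows. This is not the awk/Ruby tooling described above, just an illustrative Python version: it assumes you can extract (IP, token) pairs from the logs, and it only prints `iptables` commands instead of running them. The token value and addresses are placeholders.

```python
import ipaddress


def ips_to_block(requests, flagged_tokens):
    """Return IPv4 addresses, in first-seen order, that presented a flagged token."""
    blocked = []
    for ip, token in requests:
        if token in flagged_tokens and ip not in blocked:
            ipaddress.IPv4Address(ip)  # raises ValueError on anything malformed
            blocked.append(ip)
    return blocked


def drop_rules(ips):
    """Render one iptables DROP rule per address, for review before applying."""
    return [f"iptables -A INPUT -s {ip} -j DROP" for ip in ips]


# Placeholder data: documentation-reserved addresses and a fake token.
requests = [
    ("203.0.113.5", "scraper-token"),
    ("192.0.2.44", "ordinary-user-token"),
    ("203.0.113.9", "scraper-token"),
    ("203.0.113.5", "scraper-token"),
]

for rule in drop_rules(ips_to_block(requests, {"scraper-token"})):
    print(rule)
```

Printing the rules rather than executing them leaves room to check for false positives before anything gets dropped.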

Eventually, the requests dry up and I see a request from an IP owned by a Serbian ISP that leads back to devtools.boardreader.com. It acts like a normal browser: it loads all the resources, grabs a Bearer token, executes JavaScript, and subsequent to that, the scraping resumes using that token. They're trying to play back a browser session: that's clever. Watching the logs confirms it: bots using that token start arriving, playing back the sequence of requests, and then hammering the hell out of the public timeline again. To verify, I wander back through their site and see that they are indeed getting new posts from that batch of requests.

I start severely limiting TWKN by cranking the rate-limiting way up. At this point, it starts throwing 429s even at legitimate users, so I finally talked about the problem in public some after mostly keeping it to myself or in DMs. (And, of course, writing this has made me regret not having taken notes with timestamps. I've had to piece everything together from scattered notes, timestamps on scripts and logs, DMs, etc. The first draft of this post was missing some information and had the chronology wrong.)

BoardReader is sneaky and annoying and using fraudulent means to extract data from FSE over my objections, but now that I can isolate their traffic, I can try a lot of different approaches. They aren't going to get any legitimate data again, but I can see how their crawler behaves. I start just sending back 429s: their scraper responds by sending more requests. Apparently, if it doesn't get the response it wants, it just repeats the request immediately, no delay. Rude, but they have some confidence in their ability to get around restrictions, so they don't have to be polite. But this is worse: they send the reqs back so fast that it actually saturates the pipe, it's basically a worse DoS than before. Unsurprisingly, sending back 401s, 403s, 500s, same result. So I start just sending 402 Payment Required, an idea I got from graf. Unfortunately, this means no one gets anything from TWKN for a while.

2023-03-13 (Monday)

...And that's when they finally get back to me. I had sent an email to info@boardreader.com on the 5th telling them that I was looking for information on their crawler, and on the 13th, after the server has started adamantly refusing to serve them, I finally get an email from dave@socialgist.com asking what I want to know. Noncommittal. I reply to him a few hours later, at 20:45 UTC, explaining the problem and telling him if he wants to index that he'll need to only fetch local posts and use a UA that identifies BoardReader. He tells me at 21:01 that he'll forward it to the engineering team and asks what domains I'd like them to quit crawling. I give the entire IP range that I owned and complain about the pedophiles. While we're corresponding, their developers are still scraping and actively debugging the scraper, so I mention that he could save them some time. I can see another Serbian IP address. Over email, I offer to talk to their devs, I pass a few links to FediList (which was still on the demo. subdomain at the time), I try to explain how fedi works.


[2023-03-13T10:24:39+00:00] https://freespeechextremist.com/main/all [200] 109.92.154.188 https://devtools.boardreader.com/
[2023-03-13T10:53:48+00:00] https://freespeechextremist.com/main/all [200] 109.92.154.188 https://devtools.boardreader.com/
[2023-03-13T13:57:18+00:00] https://freespeechextremist.com/main/all [200] 109.92.154.188 https://devtools.boardreader.com/
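Entries in this shape (timestamp, URL, status, IP, Referer) are easy to pull apart mechanically. A small Python sketch, with a regular expression written to fit the excerpt above rather than any documented log format:

```python
import re
from datetime import datetime

# Matches: [timestamp] url [status] ip referer
LOG_ENTRY = re.compile(
    r"\[(?P<time>[^\]]+)\] (?P<url>\S+) \[(?P<status>\d{3})\] "
    r"(?P<ip>\S+) (?P<referer>\S+)"
)


def parse_entries(text):
    """Extract every log entry found in `text` as a dict."""
    entries = []
    for match in LOG_ENTRY.finditer(text):
        entries.append({
            "time": datetime.fromisoformat(match.group("time")),
            "url": match.group("url"),
            "status": int(match.group("status")),
            "ip": match.group("ip"),
            "referer": match.group("referer"),
        })
    return entries


excerpt = (
    "[2023-03-13T10:24:39+00:00] https://freespeechextremist.com/main/all [200] "
    "109.92.154.188 https://devtools.boardreader.com/ "
    "[2023-03-13T10:53:48+00:00] https://freespeechextremist.com/main/all [200] "
    "109.92.154.188 https://devtools.boardreader.com/ "
    "[2023-03-13T13:57:18+00:00] https://freespeechextremist.com/main/all [200] "
    "109.92.154.188 https://devtools.boardreader.com/"
)

entries = parse_entries(excerpt)
# Time elapsed between consecutive visits.
gaps = [later["time"] - earlier["time"] for earlier, later in zip(entries, entries[1:])]
for gap in gaps:
    print(gap)
```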

So obviously I don't trust them. Dave stops replying to emails, and they're not only still scraping, but they are trying to get around the countermeasures: SocialGist is lying. They're actively putting work into continuing to do something they've promised to stop doing: they can't get anything out of TWKN (the last real post they got from FSE was on the 8th), and they're doing their damnedest to try to rectify this, while telling me that they'll stop. If they felt really good about it, they'd have no reason to lie, so either their motivations are not what they say, or the person I'm dealing with is not the same person that has decided to put FSE on the list and then told the devs to make sure they can get posts from it.

...Then the FBI shows up

2023-03-14 (Tuesday)

The morning after Dave ghosted me, I got an email from an fbi.gov email address, the subject line "Emergency Disclosure Request", and this in the body:

This is Special Agent Peter Christenson, with the FBI. I am requesting subscriber information for the user "WitchKingOfAngmar." This user posted the attached threat. Please let me know if you can assist with this request.

It also includes FSE Screen Shot.PNG. I've never seen someone outside fedi refer to freespeechextremist.com as FSE, so my first thought is it was a prank, but the headers and my mail server's logs and the SPF info for fbi.gov all indicated that this was a real email from the place it claimed to be from.

This was the attached screenshot, which, despite being labeled "FSE Screen Shot", is not a screenshot of FSE:

Not a screenshot of FSE

So after I was reasonably certain that it was a legitimate email, my first reaction was to rack my brain trying to remember who the hell "WitchKingOfAngmar" was. He didn't sound familiar to me. I checked and this was a user at sneed.social (which was dead for a long time, but appears to have come back recently). The screenshot had some interesting bits in it. For one thing, despite being named "FSE Screen Shot", FSE has never looked like that. It also described FSE as a "forum". In fact, the top said "Forum • Blackrock Executiv...". Some text is highlighted, "kill blackrock" "larry fink", as if those were search terms. There was also some rudimentary sentiment analysis. The post itself was from 26 days before the email was sent, but the screenshot read "11 hours ago" and "13 hours". (So much for "Emergency" from the subject line, but it also indicates that it takes two hours for a post to go through whatever that system is.)

Obviously, you don't expect to receive an email from the FBI, so it took me a minute to figure out what to make of it. WitchKingOfAngmar's post was clearly a threat but it was also clearly absurd, an obvious joke, not a credible threat. And obviously, he wrote it to troll the admins of the site.

I know a couple of pretty good lawyers in case anything crazy happens, but the goal is always to make sure that nothing crazy happens. So, best-case scenario, everything from FSE is public: if you can't see it, I don't have it. Ideally, the FBI gets that and I don't have to do any convincing. Worst-case scenario is they kick in the door and grab the server. They'd need a warrant to do that, and they wouldn't ask politely if they had a warrant already...unless they were trying to get me to say no or they were trying to see how I respond.

Dealing with law enforcement is usually an uphill battle to convince a person afflicted with motivated reasoning of the obvious. They are looking for something, their job performance is predicated on finding it, and when that's the case, it's hard to get them to look at something that isn't the thing they're looking for, even if that thing demonstrates very clearly that what they are looking for is not here.

On top of that, I'm paranoid. I go ask the dead spacemen for their thoughts and one of them points out that posting about it might count as obstruction and the FBI had lately been somewhat zealous about obstruction charges. (Good to have solid friends with level heads.)

And the previous day, fresh on my mind because the memes were flying, Mike Chitwood (mostly notable for publicizing mugshots of people arrested for, though not convicted of, cruising, part of his long-running battle to stop gay dudes in Daytona Beach since at least 2006) had just had Richard Golden extradited for saying "Just shoot him in the head" on, uh, apparently it was "a 4CHAN chatroom, a communications platform shared by extremist groups". (Informed citizens may recognize 4chan as a far-left website as well.) It is kind of interesting that Chitwood used the same "eradicating scumbags" language about neo-Nazi propagandists and gay dudes bangin' in the woods. But anyway, I couldn't help noticing that there had just been a case where cops were getting excited about anonymous internet threats.

The timing was a little obvious: they got it from BoardReader. I go and find the post on BoardReader, to make sure it's in their index. It is:

Exactly the same shit the FBI sent me, just a different UI.

The odds that the FBI and BoardReader would screw up the Unicode in exactly the same way are pretty low. (The original Unicode codepoints, 1f9e2, 1f438, and 1f44d, got turned into question mark boxes indicating invalid Unicode. BoardReader's codebase being a mess of PHP, no surprise.) A common glitch is not absolute, but there is the other unlikely mistake, that a post from somewhere else is ascribed to FSE. Guess now I know why Dave ghosted: they're scraping for the FBI, he can't turn it off. Legally, the FBI can't pay a private organization to do something that they can't do, but if the organization is doing something on their own and the FBI "doesn't" (cough) know and doesn't ask, they're just buying access to a stream of data, not paying someone to violate the CFAA.

🧢🐸👍
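A minimal sketch of how this kind of damage can happen: all three codepoints sit above U+FFFF, so any step that pushes text through a legacy charset (or a store limited to three-byte UTF-8) loses them. The Latin-1 round-trip below is an assumption chosen for illustration; what BoardReader's code actually does is not known.

```python
# The three codepoints named above; all are outside the Basic Multilingual Plane.
ORIGINAL = "\U0001f9e2\U0001f438\U0001f44d"


def mangle(text: str, charset: str = "latin-1") -> str:
    """Round-trip text through a charset that cannot represent it.

    Assumption: a lossy legacy-charset step somewhere in the pipeline.
    The real cause of the damage is unknown.
    """
    return text.encode(charset, errors="replace").decode(charset)


def needs_four_byte_utf8(text: str) -> bool:
    """True if any character is above U+FFFF (four bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    print(mangle(ORIGINAL))
    print(needs_four_byte_utf8(ORIGINAL))
```

Two unrelated pipelines can both lose astral-plane characters, but producing the same replacement glyphs in the same places on the same post is what makes the damage useful as a fingerprint.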

I had been saying for a while that the three-letter agencies don't really "get" fedi. Decentralized networks take some explaining to regular people, but individuals can get them. But there's a long way from an individual understanding it to an organization understanding it. (If you haven't ever worked in a stultifying bureaucracy, think about the amount of time that passes between a rumor propagating through your extended family and any sort of concrete change in behavior propagating. Something can be obvious to everyone and still not obvious to the organization as a whole.) Apparently the German feds get it and the feds in the US are starting to get it. But this more or less confirmed it: SocialGist doesn't understand what fedi is, really, and the FBI saw "It came from this website" and they just rolled with it.

So I get my head together and reply, explaining that since the guy was on another server, I don't have the information he is looking for, that BoardReader wrote "Free Speech Extremist" on the post but that it didn't come from FSE. And, miraculously, that works: he asks who to ask, I tell him to check the origin server, and ask him if he'd rather I not discuss the exchange in public, with no response.

2023-03-15 (Wednesday)

BoardReader is still hammering FSE and getting only 402s in response.

Still corresponding with Special Agent Christenson, but no reply until East Coast business hours start. (My last email was 15:50 on the 14th in LA, 18:50 in Quantico.) The last email is him saying thanks, I ask him one more time if it would bother the FBI if I said something, and nothing. In the meantime, I've been alternating between disappearing from fedi and running around TWKN being twitchy and paranoid while the actual endpoint is spewing errors. It's better to be transparent and I usually am but people are asking questions. The FBI guy probably can't say "Sure, write whatever, here's a selfie!" but if they wanted me not to say anything, he could definitely tell me not to say anything, so I figure it's fine. On the other hand, the whole thing still made no sense. (Maybe it made no sense to him either and he was just doing what his boss said.) So I give a limited explanation and a promise to deliver a full explanation, then wait for the flashbang to come through my window, because, although I was certain I wasn't doing anything wrong or illegal, I am really paranoid:

Long-ass post and I always fuck up at least one of the ejimos because no autocomplete in bloatfe and I compose the longer ones in acme anyway and in any case, I consider cosmetic fuckups to be acceptable collateral damage

(That post was also available at https://freespeechextremist.com/objects/19711ab5-5025-4733-8b7d-602c309621ed if you are playing along from your home instance.)

I realize that was a screenshot of a very long post in the middle of a very long article; you'll notice that it ends with an announcement that FSE is going into lockdown until further notice: no viewing TWKN or public timelines without an account, and registrations closed. I recommended everyone else do the same. I hate doing that and I hate when other instances do it, but a lot of instances follow suit.

Violins making suspenseful sounds

2023-03-16 (Thursday) and a while after

I'm on the edge of my seat, watching BoardReader continue to fail to get around the wall of 402s. (As verifying the bearer token means a round-trip to the DB, I'm still mostly kicking them out by using nginx, along with a pile of awk scripts.) They're using residential proxies, they're using Tor, they're rotating the User-Agent strings every request. No word from them or the FBI for a week.

I don't say much beyond the public post, but I ping the admin of sneed.social and ask him if the FBI agent contacted him, I send him the link to my post. (I didn't know him, but everyone said he was reasonable to deal with when it was something like this, and he was.) He goes to check his email, says that he'll reach out, and remarks that the user in question was actively trying to get banned, due to some other issue; I didn't ask.

2023-03-20 (Monday)

Detroit Riot City trolls the neo-Nazi admin of Pieville, Alex Linder. (Neo-Nazis are notorious for having no sense of humor; they take themselves too seriously. They also tend to have difficulty with subtlety.) Purely by coincidence, right after Linder blocks them, someone registers a new account on DRC to post a threat to blow up some Jewish hospitals, and then someone reports this post to the FBI in under a minute, less time than it would have taken to read it.

This dissipates pretty quickly: the fed checks it out because he's got to, but it's pretty obvious, isn't it? Your guess is as good as mine with regards to why no one was arrested for sending the FBI a false report.

Not a True Ending

2023-03-21 (Tuesday)

At this point, I figure enough time has elapsed that if the FBI wanted me to keep their secrets, they would have asked by now, so I just dump everything I had at that point. The above covers it more thoroughly, and there's no screenshot this time; the raw post has the object ID https://freespeechextremist.com/objects/5c7246c1-024b-4e74-b4e2-7e88ef019024 if you want to dump that into your instance's search bar. You can grab the raw JSON representing the post if you want.

There are a couple of bits worth including here, so I'll quote them. I've mentioned this before on the blog, but the tone on fedi is significantly less formal (one is less likely to tell the same joke at work on Monday morning that they told at the bar on Friday night), so please bear with me while I quote myself. If you're unfamiliar with the slang, a "fedpost" is a post that includes threats of violence, and a "glowie" is a federal agent. These terms are usually used humorously.

I can't find any other fedi instances on there, but this is a pretty annoying scraper to get rid of.

This hasn't changed; as far as I know, FSE is the only one they were scraping.

The glowies are (or want to convey that they are) specifically looking at threats against Blackrock executives.

It turns out that "want to convey that they are" was correct, but I didn't know that until much later.

Remember everyone that was freaking out about the various search engines on fedi, most recently as:Public? Remember that I keep saying that there are scrapers getting at fedi *without* identifying themselves? It turns out that I was right and this is because I AM A GODDAMN GENIUS and EVERYONE THAT HAS EVER TOLD ME THAT I AM WRONG IS A RETARDED COPROPHILIAC. There are scrapers getting data out of fedi without identifying themselves and at least one of them is selling data to the FBI.

This has still not sunk in for most of the people that are worried about, e.g., Archive Team, as:Public, FediList, etc. (Especially the text in all-caps.) I have linked to some other cases above.

I recommend that you be careful of fedposters on your instance.

I continue to recommend this.

I think I'm going to reopen the public timeline and registrations, but that's tentative. Since boardreader.com is still attempting to scrape TWKN, if I reopen TWKN to people that aren't logged in, it will be with the terrible hacks I was using before to get boardreader.com to stop scraping.

I had kept TWKN unavailable to the outside still.

The Mechanical Criminal vs. FaceBook

It is important to note here that every sentence in the following paragraph is completely wrong.

It looked like the situation with the FBI was over and they had what they wanted. They were just following up on some idiot making a random threat on the internet. So the balance remaining was just mopping-up and getting BoardReader off my back. That shouldn't be too difficult!

2023-03-23 (Thursday)

Despite promising to stop, BoardReader hasn't just kept scraping, but they are still trying to debug their scraper so that it can resume collecting posts:


[2023-03-22T14:57:57+00:00] 109.92.154.76 https://freespeechextremist.com/main/all [402] "https://devtools.boardreader.com/"
[2023-03-22T14:57:58+00:00] 109.92.154.76 https://freespeechextremist.com/main/all [402] "https://devtools.boardreader.com/"
[2023-03-22T14:58:03+00:00] 109.92.154.76 https://freespeechextremist.com/main/all [402] "https://devtools.boardreader.com/"
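
For anyone keeping a similar log, a small sketch of tallying lines like these by client, status, and referer. The regular expression is inferred only from the three sample lines above (timestamp, IP, URL, status, quoted referer), so treat the field layout as an assumption; the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed layout, inferred from the sample lines:
# [timestamp] ip url [status] "referer"
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] (?P<ip>\S+) (?P<url>\S+) '
    r'\[(?P<status>\d{3})\] "(?P<referer>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def tally(lines):
    """Count requests per (ip, status, referer) triple."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec:
            counts[(rec["ip"], rec["status"], rec["referer"])] += 1
    return counts


SAMPLE = [
    '[2023-03-22T14:57:57+00:00] 109.92.154.76 https://freespeechextremist.com/main/all [402] "https://devtools.boardreader.com/"',
    '[2023-03-22T14:57:58+00:00] 109.92.154.76 https://freespeechextremist.com/main/all [402] "https://devtools.boardreader.com/"',
    '[2023-03-22T14:58:03+00:00] 109.92.154.76 https://freespeechextremist.com/main/all [402] "https://devtools.boardreader.com/"',
]

if __name__ == "__main__":
    for key, n in tally(SAMPLE).items():
        print(n, *key)
```

Grouping on the referer rather than the IP is the useful part here, since the addresses rotate but the devtools referer gives the game away.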

So I send SocialGist another email:

I sure would like to hear back from you confirming when your company plans to either comply with acceptable use or stop scraping my sites. I'd expect ten days would be enough time.

I've been keeping track of traffic referred to FSE by BoardReader. Unsurprisingly, this post, written by a Markov bot that lives on a completely different server (and as always, attributed to FSE by BoardReader), is the most frequent URL that people land on if they come from BoardReader:

a bot rambling about CSAM for some reason

Anyone that is putting search terms into BoardReader and getting that post is someone that I would like to discourage from signing up.

2023-03-24 (Friday) and subsequent weeks

Clearly, sending them emails and making sure that they can't scrape were not working, and they're still trying to fix their scraper, they're still hammering the API endpoint for fetching the posts. Pedophiles are still landing on FSE, coming from boardreader.com. That was the issue from the beginning, and it's been weeks. I want to open the timelines back up, re-open registrations. We've been on lockdown too long! Plus, the site is slow as hell because BoardReader is choking my server (even though they're getting no data, they're still sending multiple requests per second), and on top of that, I'm paying bandwidth overage charges.

Since I've talked to them and got them to agree and I've stopped sending them data, and they're still trying to get around the restrictions and still sending pedophiles, I've exhausted reasonable methods.

So I grab some samples of the timeline, and bash out a small CGI script: it just does string substitutions, mashing together accounts that do not exist and generating posts that do not exist. (Initially the IDs were just random 32-bit numbers. Eventually half the number varied per post and half was derived from the timestamp of the request, so I could trace the posts through BoardReader more easily.) I don't bother making the URLs match actual Pleroma URLs: why would I? They just have to be unique. I also have to start up lighttpd to serve it: FSE uses nginx, but since nginx doesn't support CGI scripts (a travesty), I've got to send the requests for that endpoint to lighttpd. Because I was still all the way in awk mode, of course I just used awk. For fun, I grabbed some lists of words to include in the posts: some based on search terms people had used on BoardReader, and then rounded out with the CARNIVORE list. (First search result; I don't know if it was the real list or not but it didn't really matter.)
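
The ID scheme and the string substitution can be sketched roughly as follows. This is Python rather than the awk actually used, and everything specific is an assumption for illustration: which half of the ID carries the timestamp, the `@WORDS@` placeholder, and the function names.

```python
import random
import time


def make_post_id(now=None, rng=random):
    """32-bit ID: high 16 bits from the request time, low 16 bits random.

    Assumption: the split and bit order are illustrative only.
    """
    if now is None:
        now = time.time()
    time_half = int(now) & 0xFFFF       # wraps every 65536 seconds
    random_half = rng.getrandbits(16)   # varies per post
    return (time_half << 16) | random_half


def time_half_of(post_id):
    """Recover the timestamp-derived half of an ID seen in the wild."""
    return (post_id >> 16) & 0xFFFF


def make_post(template, words, now=None, rng=random):
    """Fill a template by plain string substitution (hypothetical placeholder)."""
    post_id = make_post_id(now, rng)
    chosen = rng.sample(words, k=min(3, len(words)))
    return {
        "id": f"{post_id:08x}",
        "content": template.replace("@WORDS@", " ".join(chosen)),
    }


if __name__ == "__main__":
    print(make_post("today I saw @WORDS@ again", ["alpha", "bravo", "charlie", "delta"]))
```

With only 16 bits of seconds the time half wraps roughly every 18 hours, so an ID narrows a post down to a set of candidate request times rather than a single one; the webserver log supplies the rest.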

You can see an archived sample of the gibberish or a screenshot of different gibberish. Of course, it worked beautifully:

Searching for FSE posts on boardreader.com just gives gibberish

The only problem was that the scraper loved it too much. We were suddenly getting DoS'd by their scraper. So I built in a little delay and then had a fun idea: jam the BoardReader search terms in. And I saw something really weird very suddenly:

timestamps and a large number of blue 4s and hyphens, explained below

This is an awk script that draws a histogram of the requests coming into FSE. The basic idea is really simple: the first version of it just printed a "." and then when the timestamp changed, it printed a newline, too. This draws a histogram for you of the number of requests per second, in real-time if you can convince awk not to buffer excessively (e.g., mawk -Winteractive). By this time, it had gotten somewhat more sophisticated: the hyphens represent requests that received a 2xx status code, the blue hyphens representing POST requests and the others GETs. When the request resulted in some other status code, the first digit was printed. The end of the line contains summary data: number of requests that second in brackets, followed by the number of 5xx errors, followed by the ratio of server errors to total requests (as a fraction and a percentage) and the average number of requests per second. Like many scripts that were unreasonably useful and then grew bit by bit (usually under duress, while trying to fix a problem with the server), it is nearly unreadable but is surprisingly compact and reliable. (Here is the version that I was running when I took the screenshot for the curious; you probably won't get any direct use out of it unless you're using the same logfile format as me, but if you can read it, it should be pretty straightforward: it's messy, but not complex.)
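
The core of the idea fits in a few lines. What follows is a stripped-down Python sketch, not the awk script itself: it assumes a bracketed-timestamp, bracketed-status log layout, and it leaves out the colors, the GET/POST distinction, and the running averages.

```python
import re

# Assumed layout: [timestamp] ip url [status] ...
LINE_RE = re.compile(r'^\[(?P<ts>[^\]]+)\] \S+ \S+ \[(?P<status>\d{3})\]')


def histogram(lines):
    """Yield one row per distinct timestamp: a mark per request, then a summary.

    A 2xx response is drawn as "-"; anything else as its first digit.
    """
    current_ts, marks, errors = None, [], 0
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts, status = m["ts"], m["status"]
        if current_ts is not None and ts != current_ts:
            # The second rolled over: flush the finished row.
            yield f"{current_ts} {''.join(marks)} [{len(marks)}] 5xx={errors}"
            marks, errors = [], 0
        current_ts = ts
        marks.append("-" if status.startswith("2") else status[0])
        if status.startswith("5"):
            errors += 1
    if current_ts is not None:
        yield f"{current_ts} {''.join(marks)} [{len(marks)}] 5xx={errors}"


if __name__ == "__main__":
    import sys
    for row in histogram(sys.stdin):
        print(row, flush=True)
```

Piping a followed logfile into it gives one row per second, with the row width doing the work of a bar chart; a wall of 4s is hard to miss.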

The big field of green-on-blue 4s sticks out: those are requests that resulted in a 402, in this case almost all originating from Facebook, and all of them requesting URLs matching the fake posts. Facebook shouldn't have been crawling FSE's public timeline.

You might have noticed that the random IDs were present in the posts: the script generating the random gibberish didn't keep any history, because I didn't want the problem of storing infinite random gibberish, but I could match posts on BoardReader to URLs in the webserver logs by just pasting these random IDs into the search form on BoardReader. So I dropped IDs from the posts into the BoardReader search box, and that more or less confirmed it: Facebook was fetching these posts shortly after BoardReader indexed them. Apparently BoardReader was giving Facebook a feed of their data, but it wasn't just that: there was a common thread in the gibberish, a pattern in the posts that Facebook was interested in. You can probably guess the hypothesis, the test, and the result: I opened up the CGI script and, where there had been a long, random list of words to cram into the posts, I replaced it with just one phrase: "larry fink".

Almost as soon as I saved the file, Facebook started flooding my server. I wanted my keystrokes to start echoing again so I un-did it, replacing the list with the previous version except without "larry fink", and the flow slowed to a trickle and then stopped. Curiosity got the best of me so I re-did it, and after the wait for BoardReader to index it, the flood resumed.

So the pipeline was my terrible awk script generating JSON that represented gibberish posts, and that went out through lighttpd, then nginx, then it left my machine and went into BoardReader's crawlers, from there into their index (however that was built) and straight out to Facebook, and presumably from there to the FBI, and from there into whatever UI that was that they were using to search. (Zuckerberg had just testified in Congress that Facebook was critical national infrastructure: maybe he wasn't lying.)

Further Shenanigans

How do you get BoardReader to stop? I couldn't get them to respond to emails and filling their database with gibberish wasn't helping.

So I shoved some more delays in: eventually I spaced out the writes until they were a trickle designed to finish exactly one second before the timeout happened. That solved the bandwidth overages: they were just using a trickle. For fun, I tossed in a little more randomness: once in a while, I'd omit some random characters from the end so that it wouldn't parse. Maybe if the error rate spiked, they'd notice. You can get partial data out of something like that if you're using an event-based parser or you've structured it to use coroutines or thunks or whatever; the type of thing that builds up a data structure piecemeal and leaves you with a valid (if incomplete) data structure (and this approach can let you work with JSON structures that are too big to fit in memory, of course), but the overwhelming majority of JSON parsers just take a string and give you a data structure or an error: it's easier to call something like that. It looks like BoardReader is using the common type, so they spend about a minute on a single request and end up with no useful data. The situation stays exactly like that for a while.
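
Both halves of that trick can be sketched in Python. The 60-second timeout, the chunk size, and the function names are assumptions for illustration; the scraper's real timeout is only known to be "about a minute". The schedule is computed rather than slept through, so the sketch runs instantly.

```python
import json


def trickle_schedule(payload, timeout_s, chunk_size=64, margin_s=1.0):
    """Split payload into chunks and give each a send time in seconds.

    The last chunk is scheduled at timeout_s - margin_s, so the response
    finishes just before the client would give up.
    """
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    if not chunks:
        return []
    deadline = max(timeout_s - margin_s, 0.0)
    if len(chunks) == 1:
        return [(deadline, chunks[0])]
    step = deadline / (len(chunks) - 1)
    return [(i * step, chunk) for i, chunk in enumerate(chunks)]


def parses(body):
    """True if an all-or-nothing parser accepts the body."""
    try:
        json.loads(body)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


if __name__ == "__main__":
    body = json.dumps({"posts": [{"id": "deadbeef", "content": "gibberish"}] * 10}).encode()
    schedule = trickle_schedule(body, timeout_s=60.0)
    print(len(schedule), "chunks, last sent at", schedule[-1][0], "seconds")
    print("complete body parses:", parses(body))
    print("truncated body parses:", parses(body[:-3]))
```

A parser like `json.loads` returns nothing usable when the last few bytes are missing, which is why a client built on one spends the whole wait and walks away empty-handed.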

It's annoying, but I don't have to worry about it too often. I pop in and tweak the random timeline script's output or its behavior once in a while. And then I have a pretty evil idea: I just start putting Dave's phone number into the randomly generated posts. The following morning I get a reply from Dave:

hey peter, sorry for the radio silence. i've filed a jira ticket this week and hopefully will have an answer for you shortly. if we need any more information i'll loop in one of our engineers.

Guess it worked! We go back and forth very briefly and they stop scraping pretty quickly, though it takes about a week for them to get FSE out of their index. It shouldn't take a week for that to happen; whether they were stalling or it actually took that long, I probably won't find out.

Epilogue: Torswats

The story wrapped up just short of a year after it started, in a very unexpected way.

2024-01-18

Alan Winston Filion of Lancaster, CA—not too far from my home—is arrested: https://www.wired.com/story/alan-filion-torswats-swatting-arrest/, https://www.wired.com/story/torswats-swatting-arrest/.

Torswats. The guy responsible for creating hundreds of bomb scares, fake hostage situations. This was essentially a griefing tactic, and he had a long enough run that he was able to build up a little business making anonymous calls to the police and the FBI.

It turns out that that's who the FBI was looking for. That was "WitchKingOfAngmar", which apparently is a "Lord of the Rings" reference. That was why they were so interested in threats against Larry Fink: apparently, Torswats had a habit of tirades full of nonsensical threats against Larry Fink. Apparently there's a lot of information about Torswats on KiwiFarms: https://archive.ph/yqwuA.

And that's it: most of the things that didn't make sense about the story fit together after that. There are still some murky bits: what is BoardReader at present? Just a front to give a plausible excuse to SocialGist to scrape? Around 2010, scraping social media to find ISIS recruitment got popular, and of course, PRISM was the logical conclusion of that: is BoardReader even a legitimate site at this point, or just the forum-scraping division of SocialGist? What was Facebook doing in the pipeline? Are they providing the FBI with tools for this kind of thing or do they just act as a convenient repository for this kind of data?

An Aside: Some Advice for Shady Jagoffs

The best advice is the advice that you are almost guaranteed not to take: don't scrape fedi, it's evil.

If you want data from fedi, just make a fake instance and cram it onto a bunch of relays. You're still a shady jagoff, but at least you don't break anyone else's server, it's easier than scraping, and the data gets delivered to you in real-time and dumped in your database rather than you having to make some Rube Goldberg system to extract it from unwilling participants.

I usually make a remark like "I wonder why they don't do this" but I can't be sure they're not: how would you know if anyone actually is doing that? Maybe the only scrapers we know about are the noisy ones doing conventional scraping and the other ones don't make enough noise to cause problems. People only noticed newjack.city because it was full of followbots, but you don't need to use followbots any more. There are several varieties of ActivityPub relay and ActivityPub software and some of them lend themselves to repurposing as a scraper. As demonstrated by gangstalking.services (a signed-fetch workaround and proof of concept) as well as pls.zuck.dad and other instances, a lot of normal fedi software can be repurposed for this kind of thing.

So, if you're not a shady asshole and you're just trying to run a server, keep it in mind. A company like SocialGist can make themselves hard to find: I only knew about them because they screwed up, but once I knew where to look, it wasn't difficult to create a trail of breadcrumbs. How many people or organizations are out there doing the same thing, but without SocialGist's mistakes? How would you know?