I noticed something weird when running FediList: some instances had numbers for users and peers that varied wildly, and it was always GotoSocial instances. I don't use that software because I don't like the approach. (Matter of taste; good luck to them, and if their software gets anywhere, they'll end up having to fix the approach anyway.)
So I started looking through the code (annoyingly, at some point between my last pull and the present, they had rewritten history) and found commits 07d27709957248008c61d6b8d553e3d2eb14d154 and a55bd6d2bd7b11aed653f4614836caed4103bec3. (You'll do yourself a service if you know how to operate gitk, tig, or git log and grep; I'm linking to commits on github.com but their interface is less than ideal.) These commits were associated with issues, so I read the discussion at https://github.com/superseriousbusiness/gotosocial/issues/3723. I think it is as close to a good-faith discussion as I have seen on the topic. I'd be pleased to participate in one and to say why I'm doing what I'm doing and answer some objections. I was going to put this into that GitHub discussion, but it got very long, because I've got a lot to say and haven't said it.
The discussion seems to include only speculation about why people run crawlers. I am running a crawler, so I'm in a position to answer the question without speculating, and I probably should.
TL;DR: I make tools that I want to use, and I make the tools public so that anyone can use them, with the hope that this is helpful to the network. While running an instance, I want to help diagnose and solve problems and answer questions, and I want to help other people do the same, so the tool has been shaped so that it can be used to build more tools. The form it takes, I am not particularly tied to: if there is another solution, one that solves the problems I am actually trying to solve, and that solution bothers fewer people, then that's great. I want to solve the problem in the least evil way possible.
This is from the discussion:
> I think the general problem here is that the said parties are not really cooperative
I am actually happy to cooperate (or I wouldn't be writing this), but discussions on fedi have been exclusively one-way. So far, the only people that have ever reached out to me about this were either directly hostile, or they showed up to make demands, usually from throwaway accounts. People do not tend to attempt to cooperate, or even ask what I am doing: "Do this!" to which I respond "No" (I am a hacker and bristle when a rando shows up and barks orders, which seems to happen a lot in some corners of fedi), and then if they stick around to talk, I get accused of various and sundry. (For example, that I have commercial motives for doing any of this, which I do not.)
So people tell me they have an issue with what I'm doing and they don't want to hear what I have to say. I don't intend to listen to someone that doesn't listen, but I'm happy to cooperate with people that are willing to actually talk. Of course, if we don't communicate, effectively you've eliminated any influence you might have had over what I'm doing.
As I have run plenty of instances, I have seen a lot of traffic from bots on the metadata endpoints, including from bots pretending to be humans. This data is privately collected, and used for private benefit (and a bot trying to conceal itself by impersonating a human can be assumed to be doing something that the bot's operator expects you would not allow). Given that I cannot stop this data from being collected by unknown entities for unknown purposes, I would like to level the playing field some by producing something that is open and that anyone can use.
Aside from that, I find the data useful. Initially, I would just fire off some kind of loop in the shell, which is trivial to do. It was usually during a discussion on fedi, to answer some question that I was curious about or that someone asked me; I'd usually end up with some kind of answer and attach the raw data that I got the answer from, and sometimes I'd paste an awk script or attach a graph made in R. After fediverse.network went away, there wasn't anything to fill the gap, so the ad hoc scripts began to multiply. Eventually that sort of thing turns into a tool.
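For concreteness, here's a minimal sketch of the kind of loop I mean, assuming curl and jq are on hand, a file of hostnames (one per line), and the standard Mastodon-compatible /api/v1/instance endpoint:

    # instances.txt: one hostname per line.
    while read -r host; do
        curl -sL --max-time 10 "https://$host/api/v1/instance" \
            | jq -r --arg h "$host" '[$h, .stats.user_count, .stats.domain_count] | @tsv'
    done < instances.txt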
So, from there the tool grew features for monitoring, like an RSS feed of status changes for instances; I use that myself, to see when instances I care about come up or go down. (There's a Prometheus-compatible endpoint for people that want that kind of thing.) If I get a bunch of spam from an instance, I go look to see who the admin is, whether the instance is large or small, and whether they've just had a big uptick in users (indicative of an attack); if I've blocked them temporarily, seeing the number drop sharply indicates that the admin is awake and has fixed the problem, things like that. Normally you have to just hit the endpoint yourself, but an instance that is getting a large volume of spam accounts is usually going to be overloaded anyway, and unless someone keeps the data around, relevant questions like "What happened before this instance went down?" or "When did this instance go offline?" or "What was this instance that closed?" are unanswerable.

There's also an RSS feed for new instances, used to power a bot that announces them, and this gets some use by admins: for example, a domain registrar killed off an instance that was hosting some illegal photos (I think anyone knows what I mean and I don't wanna sidetrack the discussion) and another instance arrived with a similar name and the same admin; an admin replied to the announcement and tagged other interested admins, and that was useful. When Elon bought Twitter and there was a big uptick in new instances, anyone could look and see something large was happening. (Someone should thank Elon for making fedi bigger. 1,016 new instances in a single day!)
Essentially, everything on FediList is related to a question I had, a question someone asked me, or a problem that cropped up in the process of running an instance.
For one example of a problem: when the activitypub-troll.cf instances showed up (fedi considered them hostile, because they provided poisoned data), it was really easy to use sharp upticks in the peer count to find the affected instances and ping the admins (or it was at least easy to send messages; instance-blocking stopped some of the messages from getting through). So that people can get that information without asking me, it's now on the Hockeysticks page (which is named as a joke about startup jargon: a graph that is shaped like a hockey stick represents a sharp uptick).
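To make that concrete, the Hockeysticks check is roughly this shape (a sketch, not FediList's actual code; assume a TSV of per-instance peer counts from two consecutive crawls):

    # peer_counts.tsv: host old_peers new_peers. Flag anything that jumped 5x.
    awk '$2 > 0 && $3 / $2 >= 5 { print $1, $2 " -> " $3 }' peer_counts.tsv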
A lot of people suggest that the crawling should be opt-in, but this doesn't solve the problem. activitypub-troll.cf or an instance that intends to host illegal content certainly wouldn't have participated in any kind of opt-in system: when fedi starts to attract real attackers, opt-in sources of data will not help us.
I don't think there is a way to solve these problems without someone crawling fedi and providing the information publicly. The problems are all problems I have had personally and the results of the data have been made public so that the data can be used by anyone. I'm flexible on the form the solution takes, but so far, people just tell me not to solve the problem.
FediList isn't my main priority at present. I just tack things on as needed. So it's ugly but it gives you what you need. (And if it doesn't, you know where to find me.)
I wouldn't have read the discussion if I weren't interested in people's concerns. I'm interested in the network and the people running it. Being interested in those concerns is also a prerequisite for cooperation to begin with (which is why it's easy to dismiss the concerns of people that dismiss my concerns).
robots.txt is intended to avoid content being indexed or expensive endpoints being crawled. I do not do anything like that. The robots exclusion standard doesn't apply. Misskey fetches the nodeinfo and makes the information public. When a server on fedi sees a reply to a post, it fetches the post, and doesn't consult robots.txt, because robots.txt doesn't apply. When someone links to a post on your instance (e.g., they paste a link into a chat service like Discord or Telegram or Whatsapp, or they post it to Twitter or Facebook, or they make a post on fedi that contains that link), the link previews are generated by fetching the URL and then using either Facebook's OGP or the Twitter cards conventions (or whatever Twitter cards are called now that Twitter isn't called Twitter).
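You can watch a previewer's view of a page yourself; roughly, it fetches the URL and reads the OGP/Twitter-card meta tags (the URL below is a stand-in, and a real previewer parses the HTML properly; grep is just for illustration):

    curl -sL "https://example.com/notice/12345" \
        | grep -Eo '<meta[^>]+(og:|twitter:)[^>]*>'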
I probably would respect robots.txt if I were trying to fetch content. I am averse to fetching any actual conversations or letting anyone scrape FSE and I do my best to help other instances stop scraping. That is more like trying to engage in mass-collection of IRC logs: it's invasive.
The crawler does respect relevant standards and conventions. FediList respects the discoverable flag on accounts, for example, because this kind of thing is exactly what that flag is for, and it hides the admins' bios behind a note explaining the discoverable flag. (I had thought of making an exception: people should be able to see the information the admin writes if the instance has open registrations, reasoning that if a person is offering to take responsibility for strangers' private data, people should at least be able to see how the admins describe themselves. I didn't do that because the focus changed to providing information for people plumbing the network, as noted above, so for now the only exception is instances that provide poisoned data.)
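The check is roughly this (a sketch, not FediList's actual code; example.com is a stand-in). Mastodon-compatible servers include the admin's account in the instance metadata, and that account carries the discoverable flag alongside the bio:

    curl -sL "https://example.com/api/v1/instance" \
        | jq 'if .contact_account.discoverable == true
              then .contact_account.note
              else "bio hidden (discoverable is not set)" end'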
As far as the overhead of making individual requests goes, it actually is significant. If robots.txt were treated as something you had to use as a basis for decision-making, that would be more overhead than just adding an extra request.
Serving poisoned data is actually the only criterion I use for determining whether an instance is malicious, and flagging is always manual. (I don't want a bug in fedi servers or in my code to suddenly incorrectly flag people.)
I view poisoned data the same way I view a bot filling in a User-Agent intended to make it look like a human. It seems better to not supply data than it does to supply a lie to screw with people because you have some notion about why they want the data. Honk, for example, doesn't supply information at /.well-known/nodeinfo, so Honk servers end up more or less ignored by FediList.
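(Nodeinfo, for anyone unfamiliar, is a two-step fetch: the well-known document lists links, and you follow one of them to get the actual stats document. Roughly, with example.com as a stand-in:)

    href=$(curl -sL "https://example.com/.well-known/nodeinfo" | jq -r '.links[0].href')
    curl -sL "$href" | jq '{software: .software.name, users: .usage.users.total}'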
A lot of people have tried poisoning FediList's data or creating a tarpit or something. I think they usually don't understand what I'm doing and they overestimate the impact of a low-effort tarpit. (The overwhelming bulk of the data in FediList's DB is legitimate and if you want to fill it with nonsense, you've got to send a lot of data for a long time and hope no one notices. You can flood my RSS reader, but that's about the extent of the impact.)
The discussion had some mentions of people that don't understand fedi and care about "number go up". I don't care about that, really.
I've been on fedi for a long time. I get that the statistics are not reliable. They're presented as-is. "This is what this server says about itself." That's all I'm actually interested in; analysis is one step down the chain from this and I want to get as close to an unfiltered view as I can, so there's not even much aggregate data.
Network exploration is fascinating, and I would like to make my results public so as to help other spelunkers and because sometimes I find something really cool and I want to show it to people. I solve some problems and then people can use my results and then find more cool shit and sometimes they show it to me, so I see more cool shit.
Helping people choose an instance is not why I run a site that collects this information, either: other sites like fediverse.observer do a better job of that. FediList has graphs, but the front page just shows a graph of the count of instances. The numbers don't matter much, so they don't have to be precise, and I don't make any effort to correct them if they are wrong.
To my surprise, this appeared in the discussion:
> mapping a space is, after all, the first step towards taking control over it.
If there is a space, it's mapped. Your options are a secret map or a public map. If I stopped running FediList, nothing stops anyone from writing something that enumerates every instance: this is a one-liner in the shell. FediList exists because I wrote that one-liner about a dozen times.
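That one-liner, more or less (example.com standing in for any sizable instance): one peer list is already thousands of domains, and looping over the results and taking the union gets you the rest.

    curl -sL "https://example.com/api/v1/instance/peers" | jq -r '.[]' | sort -u > map.txt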
Pretending that a space cannot be mapped does not prevent someone from possessing a map: it's like a cat trying to hide by shoving its face into a box. A large number of people have that map. Not everyone can write a shell script (although I think most people could learn pretty easily), but enough people can that you should assume there are going to be dozens of maps. Fedi is packed to the brim with hackers; that is what makes it an interesting place. Hackers from outside fedi create maps, too. Given that a large number of private maps exist (and we have evidence that this is the case), the people creating private maps have access to a resource the rest of us do not.
I have seen plenty of accusations of trying to "control" fedi before. When I announced CofeSpace, people called that an attempt to centralize fedi. When I tossed around the idea of selling hosting, people said I was trying to centralize fedi hosting after I said I'd want to offer a discount to anyone using it to host a fedi instance. When the CP floods were happening and admins started getting together to compare notes, I got accused of being part of a sinister cabal trying to form a "Fedi UN". People will accuse anyone of anything, but the number of people that find FediList useful dwarfs the number of people making accusations: when I changed the colors, I got more complaints from people that use FediList and hated the new colors than I ever got from people that disliked FediList.
'''I don't want to control fedi. I don't want anyone else to control fedi.''' I'm an anarchist. I do not want anyone to control anyone. I'm happy to help people prevent this: this is one of the reasons I think instance-blocks are stupid. People with their asses on the line have a common interest. Anyone's free to ping me if they need help with this; it feels stupid to say this, because it is obvious, but I don't care about the ideology of an admin running an instance, I care that fedi doesn't get killed or co-opted. People tend to make incorrect assumptions about my politics: I hate politics and I love freedom, everyone's freedom, especially freedom of speech, and I think decentralized systems are the best way to ensure it.
Revolver is an attempt to ensure that nobody can control fedi, and that is why it is my main focus at present. I think that's my best shot at ensuring that hosting companies and domain registrars and sysadmins and admins of fedi instances can't control fedi. (You don't have to trust me and I don't think you should, but you can watch what I do and see how well it matches my description of my intent.)
There are people actually trying to scrape fedi using evil means for evil or unknown (but presumably evil) purposes.
There have been large-scale attempts by large businesses to scrape fedi for profit. The government has used fedi to conduct surveillance or for more sinister purposes. There are AI companies doing this and completely ignoring robots.txt.
The University of Utah has a (really cool) HPC cluster that someone used to do massive scraping of fedi (including posts, followers/following relationships, and instance metadata); this stopped when I sent an email to the admins (who were very helpful, prompt, and professional), but being able to buy time on a supercomputer implies some kind of large-scale institutional support. (Again, I'd like to stress that this was someone buying compute time on the cluster, not an official project of the admins running the cluster. This kind of thing is usually the result of grant money, which is then used to buy compute units; it is the university equivalent of buying EC2/GCE time.) The scraper got about 50GB of data before their nodes were shut down. I corresponded with the admins over email (very cool people), but I don't know (and didn't ask, because I didn't expect them to compromise anyone's privacy) who was collecting this information or why, or what happened to it (though a copy of it did happen to fall off a truck, so I saw what they saw). Since they were scraping followers, it is hard to tell what kind of inferences they were attempting to draw, but you can reason that it was something people would probably rather they didn't do. They were also scraping the public timeline, so it might be the case that they were trying to map deeper than just following relationships.
NewsMast, despite having an entire instance to federate with, actually uses a scraper that fetches not just posts but also tries to get followers/following lists. They might be doing that for discovery purposes, but they seem to be looking for more. They claim on their blog to be doing open-source development, but what they have released is a fragment of a Rails application; they claim to do no scraping, but grep your logfiles for "backend.newsmast.org" and you'll see that's false. They also own channel.org, and that can't have been a cheap domain, so they have some funding.
Google honors robots.txt (though not the Crawl-Delay directive, and if you want them to slow down so that they stop choking your machine, they really want you to sign up for their "webmaster tools", which I will not do) on the site where it finds robots.txt. If you haven't pieced it together yet, this means that your posts are indexed if you appear on the public timeline of any Mastodon instance (Google really prefers Mastodon to other software) that has a permissive robots.txt. You post, it goes to another server, that server shoves it into the public timeline, Google has it. I can find my own posts mirrored on other servers by searching Google for distinctive phrases; try it yourself on things you have posted in public.
archive.today completely ignores robots.txt and impersonates a browser. It uses Headless Chrome with some tweaks to the UA. People use it all the time to take snapshots of sites that do not wish to be snapshat. When I stood up the bugout zone, people immediately started taking snapshots of that, including my account. Try looking for your own instance and see if you find anything you'd rather not see archived.
But shadowy figures operating outside public view (even if they are occasionally unmasked) are a difficult target. Contemplating them can make you paranoid if you don't check yourself once in a while (and posting about them can help people check you), talking about them can make you sound paranoid, and there's no face and no name to point at. So people on fedi tend to pile onto the people that are visible rather than the ones that are invisible. This means that people that are close to fedi and love fedi and are very visible on fedi tend to bear the brunt of the complaints. as:Public and FediList and fediverse.observer and similar projects are the recipients of the complaints about things that these anonymous businesses and "researchers" are actually doing, or even about what Google is doing in public as Google.
I don't think there's a good way to fix this because of the nature of cognition: it's easier to be upset with someone you can see than with some amorphous entity, so visible people end up as the recipients of the anger people feel about what these amorphous entities are doing. I do think, however, that the net effect of this is that Google keeps doing whatever it wants while there's a chilling effect on the hackers of fedi. I certainly would rather not get shit on, but I don't care too much; people that do care, however, end up just not making cool things, or they make things designed specifically to piss off the people that told them no. The pitchforks stop other people from building cool stuff. I hate that. Everyone hates that. I want more hackers to show up and do more cool stuff on fedi. I want more people to run more instances.
I'd like to figure out a good way to make all of the raw data available. Right now, your best option is to seed the data by grabbing a CSV of all of the instances, then fetch the data for each of them from FediList, periodically grab new copies, check the list of the newest instances in order to get the new ones it finds, etc. I'd like to just dump the entire DB on the world, but hosting costs prohibit that at the moment. I am working on a solution to that at present, and Revolver turns out to be really good at distributing large files and large numbers of small files.
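The shape of that process, with placeholder URLs (check the site for the real endpoints): seed from the CSV, then periodically grab a fresh copy and diff it against what you have to pick up the new instances.

    curl -sL "https://fedilist.example/instances.csv" -o seed.csv
    # ...later:
    curl -sL "https://fedilist.example/instances.csv" -o fresh.csv
    comm -13 <(sort seed.csv) <(sort fresh.csv)    # only in fresh: the new instances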
The FediList source should be open. (If a 0day drops and you manage to hack into the box running it and get a copy, consider it AGPLv3. It'll be AGPLv3 when it is open.) The source is more or less trivial (except the crawler, which has a lot of work put into throughput, deduplication, and scheduling; it's pretty cool if a bit messy), but it would be nice to make the project public so that people can implement things and I can merge them, instead of people asking me for things that I then have to build, or promise to do later, or decline to build, or explain shouldn't be in FediList. It went from "embarrassingly trivial copypasta from other projects" to "discrete entity" a little faster than expected, and as noted, it's not exactly the top priority at present. Maybe if you're reading this some time in the future, the source will already be on git.fse or mirrored elsewhere.
The raw responses are now stored in Revolver. If Revolver ever starts doing what Misskey/Lemmy/etc. do and fetching the nodeinfo, then FediList might actually just start getting the data from there instead of getting it directly. (In fact, FediList might become obsolete if the feature in Revolver is designed and written carefully enough, but that feature is in the cloud of potential ideas, not the list of features that precede the release.)
If it doesn't, you probably know where to find me; I'm easy to find. I've never hidden. (As noted, just showing up to yell at me or accuse me of various evil motivations is not going to enable you to make headway. Look at the domain in the address bar if you think you will come up with some mean words that will alter my opinion.)