activities_visibility_index Slowness

Pete, Nov 22, 2022

I keep getting asked about this. I have gotten like 10 questions about this lately. Apparently a very common problem! I am writing it here once so I can stop writing it.

Essentially, the activities_visibility_index requires a full-table scan of the users table for each activity. I wrote another index to fix this and it has been merged, but the problem when you're doing a dump and restore is that the activities_visibility_index gets built first, so it can't avail itself of the other index. There is a lot more information in that PR. Some informal testing indicated that building that index using FSE's DB took about 180 hours (yes, more than a week) but with the AP ID/follower address index, it took a few hours. If it is taking forever to load a dump, this is probably what's happening.

halp

I'll toss more stuff into here as required.

How to fix that:

If you are actively restoring a DB, you can do this in parallel with the activities_visibility_index getting built.

CREATE INDEX CONCURRENTLY tmp_apid_follower_address ON users USING btree (ap_id, COALESCE(follower_address, ''::character varying));

Once you run that, the activities_visibility_index will just fly through. If you're on an older version of Postgres, you might get an error about trying to use "CONCURRENTLY"; you can omit it safely.

'''This won't work if you're doing the load as a single transaction!''' The relevant tables won't be visible outside that transaction. If you have a regular plain SQL dump (that is, rather than a Postgres binary dump), then you can add the index inline by just piping it through awk. For example, I used this line to load a dump recently:

pbzip2 -d < backup.sql.bz2 | mawk '/CREATE  *INDEX  *activities_visibility_index/{print "CREATE INDEX CONCURRENTLY tmp_apid_follower_address ON users USING btree (ap_id, COALESCE(follower_address, '\'\''::character varying));"}1;' |  psql

The reload had taken more than a day, so I was a bit hesitant to kill it and restart it, but after I gave it another ten or twelve hours and it was still building the activities_visibility_index, it turned out to be worth the time: it only took about four hours total to load the whole thing.

How to prevent the issue:

If your DB is alive and you're doing a dump, either dump the schema and the data separately (pg_dump --schema-only and pg_dump --data-only, drop those in separate files, and then load the schema, then the data) or to use PostgreSQL's binary dump format, which lets you reload the schema and data separately.