[Help] Oc_addressbookchanges has grown to >1GB ; possible cleanup?

hede · March 5, 2019, 2:13pm

Does anyone know how to cleanup the oc_addressbookchanges tables?

For me it has grown in the past, some months to years ago. The current synctoken for principals/system/system is ~16.000.000. There are ~16.000.000 rows in oc_addressbookchanges consuming roundabout 1.2 Gigabytes. There are hundreds to thousands of rows for the same URI (matching: “LDAP:[name].vcf” for uri) with increasing synctoken values. That’s more than half of the whole DBMS size (incl. other databases not related to nextcloud).

I do not think they are all needed. There must be a way to clean all but the last for every single uri+addressbookid+operation value.

I don’t provide NC- or System-Versions here because it doesn’t matter. This is a leftover of the past, it happened anywhere between OC 7 and NC 13. Since some months, namely the NC 13 update last year, excessive growing has stopped. Maybe it’s some error with our ldap, maybe NC, maybe some NC Dav Backend, maybe the ldap server itself or a combination of all like the ldap config within NC. So… if there’s some way to cleanup oc_addressbookchanges we don’t need to dig further into the root cause of this; which probably doesn’t even rely to the NC at all.

The only answer I did find so far: “We let it grow forever. What problem are you running into?”

https://groups.google.com/forum/#!topic/SabreDAV-discuss/pFHpLzbLTe0

… still there’s no real problem with that. Nevertheless its size of more than a Gigabyte (more than all address books all together) seems quite unnecessary (way more metadata than real data).

skjnldsv · March 11, 2019, 8:44am

@blizzz @georgehrke

hede · March 18, 2019, 2:48pm

For a better understanding: The addressbookchanges table only gets used by the CardDav Backends to get all changes between the synctoken of the last sync (i.e. a timestamp) and the current synctoken? The relevant result seems to be a list of unique (disctinct) uri+operation values.

That means, I do only have to remember the highest / biggest synctoken value for every uri/operation. Right?

Then it would be fine (even for device which didn’t do a sync for a long time!) to thin out this table the following way:

select distinct on (uri, operation) * into oc_addressbookchanges_tmp from oc_addressbookchanges order by uri, operation, synctoken desc;
… plus renaming both tables afterwards.

For us it shrinks the oc_addressbookchanges table from 1.2 Gigabytes to 180 Kilobytes. Unfortunately it has to be done “offline” and needs several Minutes to succeed. (Maybe someone finds a solution to thin out the table in place.)

Btw: Both places in the Nextcloud code which I found (function getChangesForAddressBook() in ./3rdparty/sabre/dav/lib/CardDAV/Backend/PDO.php and ./apps/dav/lib/CardDAV/CardDavBackend.php, both seem to be derived from the same source or from each other) do use the same suboptimal(?) algorithm: fetch all values from the database between two timestamps (synctoken) and loop through the resulting uri+operation pairs, successively replacing old values. Why not use the SQL “SELECT DISTINCT” feature and let the DBMS filter out a single (distinct) match for every uri+operation pair? (won’t work with “ORDER BY synctoken” but this one seems also unnecessary)
And the initial sync is even worse: There’s no such duplicate reduction at all. !?

Btw2: Evert Pot asks on Google Group: “We let it grow forever. What problem are you running into?” - Answer: Performance. The NC CardDav Backend has to work hard through Millions of Rows for every single sync here.

hede · April 8, 2019, 12:18pm

There’s a possible in-place cleanup:

## create table with all needed values (just to be sure):
SELECT DISTINCT ON (uri, operation) * INTO oc_addressbookchanges_tmp FROM oc_addressbookchanges ORDER BY uri, operation, synctoken DESC;
ALTER TABLE oc_addressbookchanges_tmp ADD PRIMARY KEY (id);
CREATE INDEX addressbookid_synctoken_tmp ON oc_addressbookchanges_tmp (addressbookid, synctoken);

## create table with all unneeded values:
SELECT * INTO oc_addressbookchanges_trimmer FROM oc_addressbookchanges AS ori WHERE addressbookid = 2 AND synctoken < (SELECT synctoken FROM oc_addressbookchanges AS maxvals WHERE ori.uri = maxvals.uri AND ori.addressbookid = maxvals.addressbookid AND ori.operation = maxvals.operation LIMIT 1);
ALTER TABLE oc_addressbookchanges_trimmer ADD PRIMARY KEY (id);
CREATE INDEX addressbookid_synctoken_trimmer ON oc_addressbookchanges_trimmer (addressbookid, synctoken);

## check if there are same IDs on both tables – that shouldn't happen:
SELECT oc_addressbookchanges_trimmer.id, oc_addressbookchanges_trimmer.uri FROM oc_addressbookchanges_trimmer, oc_addressbookchanges_tmp WHERE oc_addressbookchanges_trimmer.id = oc_addressbookchanges_tmp.id;

# if the result is 0 rows:
## delete - thin out the real table
DELETE FROM oc_addressbookchanges USING oc_addressbookchanges_trimmer WHERE oc_addressbookchanges.id = oc_addressbookchanges_trimmer.id;

## remove temporary tables
DROP TABLE oc_addressbookchanges_tmp;
DROP TABLE oc_addressbookchanges_trimmer;

I’m not a SQL Dev so this code is certainly less-than-ideal. But it seems to work. I’m still waiting for anyone raising objections …

skjnldsv · April 12, 2019, 6:42am

Hey!
Thanks for the help here!
I’ll admit I have no idea why we let it grow like that.

Do you have the link to the google group of the discussion with Evert?

hede · April 15, 2019, 11:17am

I didn’t had a link so I googled it (using duckduckgo ):
https://groups.google.com/forum/#!topic/sabredav-discuss/pFHpLzbLTe0

Btw. there’s some kind of a typo up there in my list of SQL statements. The nested select should match from oc_addressbookchanges_tmp instead of the way bigger oc_addressbookchanges table:

SELECT * INTO oc_addressbookchanges_trimmer FROM oc_addressbookchanges AS ori WHERE addressbookid = 2 AND synctoken < (SELECT synctoken FROM oc_addressbookchanges_tmp AS maxvals WHERE ori.uri = maxvals.uri AND ori.addressbookid = maxvals.addressbookid AND ori.operation = maxvals.operation LIMIT 1);

Second: you can dump this subselect at all, in favour of a inner join:

SELECT ori.id AS id INTO oc_addressbookchanges_trimmer_join FROM oc_addressbookchanges AS ori INNER JOIN oc_addressbookchanges_tmp AS maxvals ON ori.uri = maxvals.uri AND ori.addressbookid = maxvals.addressbookid AND ori.operation = maxvals.operation AND ori.addressbookid = 2 AND ori.synctoken < maxvals.synctoken ;

runtime for us was:

subselect from oc_addressbookchanges: days
subselect from oc_addressbookchanges_tmp: hours
join: seconds

And I’m pretty sure some database junkie will dump that all and present you with a way more efficient solution. Or someone will refactor the whole syncing process to not add those oc_addressbookchanges in the first place. Or leave it well enough alone – we do have our solution.

hede · April 15, 2019, 11:55am

Ah, sorry, that’s indeed a sabredav thread and the link was right in the first post.

I thought it’s related because both getChangesForAddressBook() functions are nearly identical in ./apps/dav/lib/CardDAV/CardDavBackend.php and ./3rdparty/sabre/dav/lib/CardDAV/Backend/PDO.php – like said in the first post, as well.

skjnldsv · July 4, 2019, 5:59am

Thanks for the help!
Closing as solved then