MySQL utf8mb4_unicode_ci collation

Hello everyone!

with my app fulltextsearch_sql I’m trying to implement fulltextsearch using only the Nextcloud database, without additional components like Elasticsearch.

One problem I have is with MySQL. Nextcloud has a repair step that aggressively changes every column in every table to use utf8mb4_bin collation. This causes every search to be case-sensitive, which is quite unfortunate.

My app creates the content column as utf8mb4_unicode_ci, which allows case-insensitive search (hence the _ci suffix), but Nextcloud changes that back after every update.

In my installation, which has quite some indexed content, this conversion takes almost 10 minutes - and then again 10 minutes changing it back.

Other places seem to use ->iLike(), which expands to `column` COLLATE utf8mb4_general_ci LIKE ... , but that syntax doesn’t exist for the fulltext query (and wouldn’t use the index, causing extreme slowness).

So I’m wondering two things:

  • What’s the reason for changing everything to utf8mb4_bin so aggressively, on every update, more than 10 years after changing collation?
  • Is there any way to opt out of it? (Looking at the source, I guess not…)

The only way out I currently see is creating the table without the configured prefix (and hence not using the QueryBuilder at all, because that automatically inserts the prefix). Or writing my own repair step that converts the collation back (and thus accepting the 20 minute delay on every update in my case).

Hello @jplitza and welcome to the forum!

Did you think of just doing the case modifications in-app? So, in fact of using a collation (and relying on it), you could simply use toLower to only work on lower char text.

Chris

1 Like

Thanks for the suggestion! I actually thought of that only after writing the post. But it has its own drawbacks: In order to profit from the DB index, I’d have to lowercase everything before storing it, meaning I’ll only have access to the lowercase variant after retrieving. That would make the excerpt that is presented along with the search result to only be lowercase only as well.

Probably still the best approach given the circumstances, but I’d still rather have Nextcloud not override the developer’s decision of what collation to use.

I have not checked, but is every supported DB capable of collations? I am thinking towards SQLite.

NC tries to be as DB agnostic as possible thus there might be DBs that do not support your use case…