Utilizing MySQL fulltext index

RL_Byrd · January 5, 2017, 11:32pm

I am currently running Nextcloud 11, having used OC from 6 through 8, then moving to NC. I have a reasonably large collection of files, in the 10s of millions of rows; I use NC to access my work files (I’m a database consultant) from wherever I happen to be.

The only issue I’ve ever had with %cloud is that the search function is back-breakingly slow, because it uses the double wildcard LIKE operation (e.g., WHERE name LIKE ‘%string%’) which guarantees a full table scan every time.

As a comparison, this query (which is the standard NC search query)…

SELECT
fileid,
storage,
path,
parent,
name,
mimetype,
mimepart,
size,
mtime,
encrypted,
etag,
permissions,
checksum
FROM
oc_filecache
WHERE storage = ‘1’
AND name COLLATE UTF8_general_ci LIKE ‘%clarinet%’;

…takes almost two minutes, whereas this version…

SELECT
fileid,
storage,
…etc…
AND MATCH(Name) AGAINST (‘clarinet’ IN NATURAL LANGUAGE MODE);

…returns in under a second.

What would be the correct way to edit the search query in the NC code to use this version of the search query?

PS: Yes, I’ve investigated the Nextant plugin/Solr addition, and it’s not what I’m after–overkill, and rather resource-hoggy at that.

Stuart_Naylor · January 10, 2017, 10:44am

? That would as its a database query and that side of solr is fast anyway.
The slow bit is indexing the documents and that isn’t in your query.

I think Solr is definately the way to go as also it fits Nextclouds federated mode.

A single line can use the local shard to make a distributed search across many installs in a simple command line.
curl 'http://localhost:8983/solr/select?shards=localhost:8983/solr,x.x.x.ip:7574/solr&indent=true&q=ipod+solr'

The only dev work by next cloud would be an infrastructure to provide a secure tunnel to each instance.

https://wiki.apache.org/solr/DistributedSearch

Also another problem is access and also that is also part of the framework as current efforts with fulltext return results on the complete index. The problem being that it also shows results on the full index without regards to any security acl, but I could be wrong about that being a noob.

https://manifoldcf.apache.org/release/release-2.6/en_US/concepts.html

The grind has been done by Solr & Apache and its relatively simple to employ, create indexes and trawling is always going to be resource intensive I why I am slightly bemused in an event driven file system, it would seem a asynchronous trawling method is being used whilst file notification is available.

In fact really it must be event notification driven so that Acl and metadata can be pushed at the same time.
I am rusty as hell and don’t know for definate but with mysql you would have to provide the whole security and access system development whilst with solr the framework is off the self.

Its a good discussion though and one I am interested in, I could be completely wrong and don’t mind being told so.

I would be more worried about the current security model as it is a full text index and theoretically it can recreate all the documents and isn’t encrypted at the moment.
Also access in comparison to how the database is locked down it is relatively wide open which could be considered a huge security flaw.

I used to do a bit with a product called Worlddox for legal practises, wow thinking about it its almost 20 years ago now.
The MySQL full text search is not all that different from the MsSQL full text of that time, without doing any research.
Nextant just perked my interest and got me reading about Solr and wow expletive, its ace and probably exactly what Nextcloud needs with vastly less development need than any other solution I can think of apart from elastic.
Solr is sexy indexing stuff indeed and they already have the solutions for ACL & distributed search, which quite frankly are well beyond my capability to create, but I will gladly use.
Great piece of opensource heard about it but didn’t realise how good until a bit of reading of late.

The federated mode of nextcloud is like secure p2p storage and solr already could provide secure p2p indexing via acl control.

Its either elastic or solr and I haven’t enough knowledge to make difference between the two.

Because of previous experience my initial worries where about security and access, also you can not encrypt index fragments as it scatters the byte order and is no longer an index, but all those questions seem to be answered, but I am sure there is someone more knowledgeable who could make a better judgement call.

https://www.datanami.com/2015/01/22/solr-elasticsearch-question/ ?

http://solr-vs-elasticsearch.com/ ?

RL_Byrd · February 4, 2017, 7:22pm

Thanks for all the information, Stuart. My original question remains, however: Can a NC developer advise me on how, if any way exists, to make a simple query change to the search function as described in my initial post? It seems as if what I am asking is pretty simple, and I’m surprised that I haven’t been able to figure it out meself.

All I want to do is to make a relatively straightforward edit to the query that does the actual searching.

aquarat · April 12, 2019, 9:09am

Apologies if this is necromancy, but were you able to find the location of the query (for the purposes of editing it) ?

I’m trialing Nextcloud in an environment where oc_filecache has roughly 2 million rows in it and search queries time out. The query itself doesn’t have any LIMIT argument on it either, which seems silly.

As you noted in your original post, in my situation LIKE can take more than a minute to execute but MATCH takes 22 ms.

Update:
Its in lib/private/Files/Cache/Cache.php, line 646:

// Original
$sql = ’
SELECT fileid, storage, path, parent, name,
mimetype, storage_mtime, mimepart, size, mtime,
encrypted, etag, permissions, checksum
FROM *PREFIX*filecache
WHERE storage = ? AND name ILIKE ?';

// MariaDB/MySQL FULLTEXT Index
$sql = ’
SELECT fileid, storage, path, parent, name,
mimetype, storage_mtime, mimepart, size, mtime,
encrypted, etag, permissions, checksum
FROM *PREFIX*filecache
WHERE storage = ? AND MATCH(name) AGAINST (?)';

and you’ll naturally need to add the index to the table :
USE nextcloud;
ALTER TABLE oc_filecache ADD FULLTEXT INDEX name_idx (name);

Search queries now return results in a reasonable time frame (the network connection is now the bottleneck).