Indexer question to Nextant

Sanook · January 9, 2017, 8:26pm

I’m running NC on a virtual machine with only a 20 GB hard disk, plus external storage.

What will happen if I upload 100 GB of PDF and EPUB files? Will Nextant eat up my hard disk for indexing?

For comparison, the Recoll desktop search engine is eating more than 50 GB of hard disk space for indexing and caching.

In which folder is the Nextant index stored and how big can it grow? Are there any limitations?

Thank you.

Cult · January 9, 2017, 9:52pm

You should expect the index to need disk space equal to roughly 30% of the size of the indexed content.
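As a rough worked example of that rule of thumb (a sketch only, assuming the 30% applies to the total size of the indexed files, and using the 100 GB and 20 GB figures from the question above):

```python
def estimated_index_gb(content_gb: float, ratio: float = 0.30) -> float:
    """Rough index size, assuming the index is `ratio` times the content size."""
    return content_gb * ratio


content_gb = 100.0      # PDF/EPUB files to be uploaded (from the question)
system_disk_gb = 20.0   # size of the VM's hard disk (from the question)

index_gb = estimated_index_gb(content_gb)
fits = index_gb < system_disk_gb

print(f"estimated index: {index_gb:.0f} GB, fits on system disk: {fits}")
```

Under that assumption the index alone would be larger than the whole system disk, so it would have to live somewhere else.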

The index is stored in the install folder of Solr; I think the default is /opt/solr/

Stuart_Naylor · January 10, 2017, 12:35pm

Yeah that is one of the things I have been thinking about.

Solr can use external storage, and the way it is currently deployed is going to cause obvious problems, especially for embedded applications.

I think it’s brilliant that someone is bringing Solr to NC, but wow, there is quite a bit to think about.

Cult · January 10, 2017, 1:04pm

I think that if you set the Resource Level to Lower (in the Admin Interface of Nextant), you will need less disk space. This should be tested. If it works, I can add an option to only index (not store) the content of external storage (but not local content). You won’t have the highlighting context, but at least you can do your searches.

I have no idea how Nextcloud users/admins are using federated shares. For a few files/folders from a few external NCs, the current Nextant is enough.
Of course, if you start sharing 100GB from 100 NCs, it won’t be the same.

In a huge federated cloud of Nextcloud instances, a cloud of Solr might be a fun thing to put in motion; however, you’ll be on the verge of breaking your users’ privacy.

Stuart_Naylor · January 10, 2017, 1:43pm

That doesn’t matter all that much, as each NC has its own shard, and as long as any shard is aware of the other shards it can act as a controller in a distributed search.
The problem is that the shard indexes are being stored on the system partition, when they really should reside on the added storage.
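To illustrate the distributed-search idea, here is a minimal sketch that builds a query URL using Solr's `shards` request parameter, which lists the shards a controller node should fan the query out to. The host names, the `nextant` core name and the `text` field are hypothetical placeholders, and nothing here reflects how Nextant itself issues queries.

```python
from urllib.parse import urlencode


def distributed_query_url(controller: str, shards: list[str], query: str) -> str:
    """Build a Solr select URL that asks `controller` to query all `shards`."""
    params = {
        "q": query,
        # Solr expects a comma-separated list of host:port/path entries.
        "shards": ",".join(shards),
    }
    return f"http://{controller}/select?{urlencode(params)}"


# Hypothetical shard addresses, one per Nextcloud instance.
shards = [
    "nc1.example.org:8983/solr/nextant",
    "nc2.example.org:8983/solr/nextant",
]

# Any shard that knows the others can act as the controller.
url = distributed_query_url(shards[0], shards, "text:invoice")
print(url)
```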

@Cult I am a noob with NC, but so far I don’t see anywhere that security ACLs are being provided to the shard.
So when you do a full-text search, is there no way to limit the result set to what the user is authorised to see, and is the current highlighting context breaking the current security mechanism?

I really like the idea of federated p2p storage and p2p indexing, as hugely important knowledge bases can be created in ad hoc networks. From a distributed WikiLeaks to advocacy or legal work, a secure p2p storage and indexing system could be hugely beneficial and of much use to many.

Storing entities as integers is fubar, though; they should be UUIDs, and looking at the database, the same is true of the ACLs.
UUIDs should be used for all entities that are externally visible, and fast integer indexes should be purely an internal matter.
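A small sketch of why this matters for federation: two independent instances that both number their entities from 1 will clash when merged, while randomly generated UUIDs will not (for all practical purposes). The dictionaries below are made-up illustrations, not the actual Nextcloud schema.

```python
import uuid

# Two independent instances, each using its own local integer IDs.
instance_a = {1: "report.pdf", 2: "notes.epub"}
instance_b = {1: "invoice.pdf", 2: "book.epub"}

# The same integer IDs exist on both sides, so a merged index is ambiguous.
colliding_ids = set(instance_a) & set(instance_b)


def external_ids(local: dict[int, str]) -> dict[uuid.UUID, str]:
    """Give every entity a random UUID to use as its externally visible ID."""
    return {uuid.uuid4(): name for name in local.values()}


# With UUIDs as external IDs, entries from both instances can coexist.
merged = {**external_ids(instance_a), **external_ids(instance_b)}

print(f"colliding integer IDs: {sorted(colliding_ids)}")
print(f"entities after UUID merge: {len(merged)}")
```

The integer keys can still be kept inside each instance for fast lookups; only the IDs that leave the instance need to be globally unique.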

You could have thousands of users on thousands of NCs with thousands of GB, with no break in user privacy apart from what has been granted.
NCs could have public data, trust relationships, and private data, and both the share and the index would reflect this.

With the current database using integers, and given my noob status (I haven’t seen any form of UUID entity reference or ACL scheme), you really cannot federate at all, as you are always taking the risk of a duplicate entity ID.

I think Nextant is extremely important to Nextcloud, but I am currently struggling to see how it is not going to cause problems with system storage and security. The thing is, this comes down to how the core works and operates, and my concerns are a bit chicken and egg, as Nextant (Solr) can work in that manner, or could.

As I say, don’t take my word for it, as I am still struggling to get my head round the individual file encryption methods, where the database holds a hash for each file rather than using an encrypted file system.
I think the answer to a secure index is just to bung it on an encrypted file system and hold a file system hash, but I am probably displaying tremendous noobness in all its glory.

https://wiki.archlinux.org/index.php/Dm-crypt/Encrypting_an_entire_system

You could have an extremely small embedded system running huge volumes and full-text indexes, with storage added via LVM. The files could have individual encryption, and the corresponding index could simply live on an encrypted LVM partition.
But currently I am wondering why NC is encrypting files rather than volumes.

Stuart_Naylor · January 18, 2017, 12:40pm

After some reading on Solr and encryption, my initial concerns about security and index size turn out not to be that big a deal.

We could do with a little tutorial, and maybe make it clear that it is possible to reconstruct documents from an index, and that if the index is not encrypted, it does to a certain extent negate the current encryption.

It isn’t easy to rebuild a document from an index, but it is possible, and if that is a concern, here is a relatively simple method to ensure it cannot be done.

https://cwiki.apache.org/confluence/display/solr/DataDir+and+DirectoryFactory+in+SolrConfig

Specifying a Location for Index Data with the dataDir Parameter
By default, Solr stores its index data in a directory called /data under the Solr home. If you would like to specify a different directory for storing index data, use the <dataDir> parameter in the solrconfig.xml file. You can specify another directory either with a full pathname or a pathname relative to the instance dir of the SolrCore. For example:
<dataDir>/var/data/solr/</dataDir>

Also, if the size and location of the index are a problem, it is up to the installer to provide an index path that solves this.
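Before pointing dataDir at a new location, an installer might want to check that the target volume actually has room for the expected index. A minimal sketch using Python's standard `shutil.disk_usage`; the path and the 30 GB figure are only example values, and the function name is made up for illustration.

```python
import shutil


def has_room_for_index(data_dir: str, expected_index_bytes: int) -> bool:
    """Return True if the volume holding `data_dir` has enough free space."""
    free_bytes = shutil.disk_usage(data_dir).free
    return free_bytes >= expected_index_bytes


if __name__ == "__main__":
    # Example values only: "." stands in for the intended dataDir mount point.
    expected = 30 * 1024**3  # about 30 GB
    print(has_room_for_index(".", expected))
```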

The same goes for security concerns: the index directory should be mounted through ecryptfs, and that mount point set in the dataDir parameter above.
There are stronger methods than ecryptfs directory encryption, but hard-coding sizes or partitions is a lot less flexible.

Or simply use a non-persistent index, though that again has RAM considerations.

The solr.RAMDirectoryFactory is memory based, not persistent, and does not work with replication. Use this DirectoryFactory to store your index in RAM.
<directoryFactory class="org.apache.solr.core.RAMDirectoryFactory"/>

You can use this to estimate your index size: https://github.com/apache/lucene-solr/blob/master/dev-tools/size-estimator-lucene-solr.xls