NC11 + OCR 2.0.0: How to OCR a folder?

Sanook · January 3, 2017, 2:26pm

At the moment I can only OCR one file at a time.

janis91 · January 21, 2017, 9:54am

This could be a feature request, but I will not have the time to implement it in the next month. If you want it implemented, just open a feature request issue on GitHub, as it will be tracked a lot better there.

Stuart_Naylor · January 21, 2017, 1:57pm

Would be rather amazing and I thought that myself, but after some thought it could be a bit of a nightmare to employ.
It could only work with drag-and-drop folders, as otherwise the OCR could already be running while you are still incrementally adding files.

Also it means you have to have a folder. Maybe we could create two text files, .ocrstartbatch and .ocrendbatch, and all files added between the two files’ datetimes would be batch processed?
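The marker-file idea could be sketched roughly as below. This is only an illustration in Python, not anything the OCR app does: the function name `files_in_batch` is made up, and it assumes a file's modification time reflects when it was added to the folder, which may not hold for uploads that preserve the client's timestamp.

```python
from pathlib import Path

# Marker names as proposed in the post above.
START_MARKER = ".ocrstartbatch"
END_MARKER = ".ocrendbatch"


def files_in_batch(folder):
    """Return names of files modified between the start and end markers.

    Hypothetical helper: returns an empty list while the batch is still
    open (no end marker yet) or was never started.
    """
    folder = Path(folder)
    start = folder / START_MARKER
    end = folder / END_MARKER
    if not (start.is_file() and end.is_file()):
        return []
    t_start = start.stat().st_mtime
    t_end = end.stat().st_mtime
    batch = []
    for entry in sorted(folder.iterdir()):
        # Skip sub-folders and the marker files themselves.
        if not entry.is_file() or entry.name in (START_MARKER, END_MARKER):
            continue
        if t_start <= entry.stat().st_mtime <= t_end:
            batch.append(entry.name)
    return batch
```

Waiting for the end marker before doing anything also sidesteps the worry about OCR starting while files are still being added.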

Sanook · January 21, 2017, 7:46pm

janis91 · January 21, 2017, 8:16pm

Ty! Answered already.

Stuart_Naylor · January 22, 2017, 6:25am

Just wondering, but we have a scanner and FineReader OCR, with which we are already creating multipage PDFs in batches: PDF image over text, or searchable PDF as it seems to be called now.
What stops OCR 2.0.0 from rewriting the text on these?
Would it just recognize them as PDF and try to OCR them?
Just asking as it’s why I haven’t enabled the app yet, and I also haven’t had the time yet.

The form handling we do often requires us to OCR; we do that at source and it creates a multipage PDF. We also quickly scan other documents, and they are usually single pages just saved as images.
Generally the folder structure is Year/Client/Case and we wouldn’t want additional folder structure purely to accommodate OCR direction.
Would be cool if you could mark the folder for batch single-document OCR creation, which grabs the pages and replaces the folder with the multipage document, using the deleted folder’s name as the new document name.
That way the marked folders would just be temporary collection holders and no other file types should be in there.

I keep thinking files & folders need a metadata section that allows you to select entities from apps.
It’s just an extra tab where the sharing and tags are now.

Could be OCR yes/no or multiple links to OCR multipage document.
Link contact, task or any app entity.

The files and folders are the basis of everything and it doesn’t make sense for each app to try and create individual linkage mechanisms.
Just needs a core mechanism that all the apps can use, rather than each one supplying its own.

If you could set up templates and make entity selection easy, it would make the whole app / core more extensible and also allow more customization of how we use Nextcloud.

janis91 · January 23, 2017, 1:37pm

Good point. I will test it as soon as possible. But I think it should recognize the text layer inside the pdf.

At this point I have to disappoint you. This is actually a very special requirement. Someone else might require a folder with all files as single-page pdfs after processing. I will only go for the OCR processing itself, not for any pdf compression/combination/merging or splitting tool. If someone is interested in pdf merge or split in Nextcloud, this would be worth a separate app by someone who has knowledge about that topic.

Another topic is the point of marking a folder for ocr processing, which would result in a search for all ocr-processable files on the server in that particular directory (maybe the app could ask for a “recursive” mode, too). This is a valid feature request and has already been proposed by @Sanook, if I understand him right. So far (v2.2.0 for NC11 / v1.1.0 for NC10) it is only possible to add multiple files at once by selecting them inside the folder, and only if they have the correct MIME type (image/png, image/tiff, image/jpg, application/pdf).
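A rough illustration of what such a folder scan with an optional recursive mode could look like, written in Python rather than in the app's own code. The function name `find_ocr_candidates` is hypothetical, the MIME type is guessed from the file extension only, and "image/jpeg" is added to the list as an assumption because Python's `mimetypes` module reports JPEG files under that name.

```python
import mimetypes
from pathlib import Path

# MIME types named in the post, plus "image/jpeg" (an assumption, see above).
OCR_MIMETYPES = {
    "image/png",
    "image/tiff",
    "image/jpg",
    "image/jpeg",
    "application/pdf",
}


def find_ocr_candidates(folder, recursive=False):
    """Return sorted relative paths of files whose guessed MIME type is OCR-able."""
    folder = Path(folder)
    # rglob walks all sub-folders; iterdir stays in the marked folder only.
    entries = folder.rglob("*") if recursive else folder.iterdir()
    found = []
    for entry in entries:
        if not entry.is_file():
            continue
        mime, _encoding = mimetypes.guess_type(entry.name)
        if mime in OCR_MIMETYPES:
            found.append(entry.relative_to(folder).as_posix())
    return sorted(found)
```

For a Year/Client/Case layout, calling this on a Case folder with `recursive=False` would pick up only the scans directly inside it.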

Every additional piece of functionality and every customization option will make Nextcloud AND each app much more complicated, because of test cases and so on. Nextcloud also lives by its simple and clean design and usage. If you add, for example, individual linkage and such things, this would make it more complicated. The ocr app is intended to be and stay a simple tool that uses the framework features and interfaces provided by Nextcloud as much as possible and aims to enable users to start ocr processing for images and pdfs from the web client, so it should stay simple and be isolated from other ideas. The app should be implemented as simply and reliably as possible and offer some advanced features like processing whole folders or enabling/disabling the status updates - nothing more. I don’t want to provide support for such specific requirements, as it will not be that difficult to do the ocr processing on a folder basis (in the future) either, which in your case would mean one click on that folder.
Selecting entities for ocr processing is, in my opinion, as easy as deleting, renaming or downloading from within the files “app”. I don’t know how it could be made easier.
Additional requirements like any specific metadata or linkage feature should be implemented as another app and be isolated from others. We should avoid aiming to create another “jack of all trades device” app.

Stuart_Naylor · January 23, 2017, 7:38pm

“Additional requirements like any specific metadata or linkage feature should be implemented as another app and be isolated from others. We should avoid aiming to create another “jack of all trades device” app.”

That is where we differ and I think I am looking at things in a different way, purely because in the end it will force one app to define the entities of all other apps. I have to say when I was scrolling the database and developer documentation, I was actually surprised there isn’t some sort of simple entity mechanism.

“Additional requirements like any specific metadata or linkage feature should be implemented as another app and be isolated from others. We should avoid aiming to create another “jack of all trades device” app.”

I can understand the reasoning for keeping a simple, pure core, as additional specific workings add complexity and narrow the core scope to specific methods. Maybe they should be implemented as another app, but because it is a linkage between core and apps, it would have to be updated on any change to core and every change to every app.
I did have a passing thought of maybe providing a metadata app, but apart from my dev skills being rusty as hell, trying to interface to a large collection of 3rd party apps is always going to be a kicking to nowhere.

But I must admit I find it slightly paradoxical that metadata is not included in the core whilst an app like ‘Tags’ is, but hey!

I am not saying you should be aiming to create another “jack of all trades device” app, and I wasn’t actually thinking you should create a metadata app, but I was thinking there is a need for apps to register entity types. This would make the whole process of extending the core and apps so much easier, without the need for individual rewrites and singular offerings.
It would mean the core and app are a series of lego bricks that could allow Nextcloud to be open to specific customization and specific purposes and the absolute opposite of a “jack of all trades device”.

In the core (and it’s probably my rusty dev skills that have me blind to this) each app should register itself and its entity type, possibly with one JSON to iterate entity data and one JSON to navigate to that entity in the app.
The apps at first glance seem to be completely isolated from each other. I cannot help thinking this is an extremely easy and non-invasive method to add; also, it would be easier if it was common and in the core.
The actual metadata app could be third-party, but yeah, it could do with a simple bit of glue from Nextcloud.
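To make the registration idea a little more concrete, here is a minimal sketch in Python. Nothing in it is an existing Nextcloud API: the class `EntityRegistry`, its methods, the field lists and the URL template are all hypothetical, and it only shows one shape a core-side registry might take, with one JSON-style description per entity type and a template for navigating to an entity in its app.

```python
import json


class EntityRegistry:
    """Hypothetical core-side registry where apps announce their entity types."""

    def __init__(self):
        self._types = {}

    def register(self, app_id, entity_type, fields, url_template):
        """Record an entity type; url_template must contain an '{id}' placeholder."""
        key = f"{app_id}/{entity_type}"
        if key in self._types:
            raise ValueError(f"{key} is already registered")
        self._types[key] = {
            "app": app_id,
            "type": entity_type,
            "fields": list(fields),
            "url": url_template,
        }

    def describe(self):
        """Return all registered entity types as a JSON document."""
        return json.dumps(self._types, indent=2, sort_keys=True)

    def link_for(self, app_id, entity_type, entity_id):
        """Build the in-app link for one entity from its registered template."""
        entry = self._types[f"{app_id}/{entity_type}"]
        return entry["url"].format(id=entity_id)


# Example registration by an imaginary contacts app (URL is made up).
registry = EntityRegistry()
registry.register("contacts", "contact", ["name", "email"], "/apps/contacts/{id}")
```

A third-party metadata app could then read `describe()` to offer linkable entity types on a file or folder, without knowing anything else about the apps behind them.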

You should never try and create a “jack of all trades device” app, but simple additions to make the base easily extensible, should never be avoided if simple to accomplish, with little overhead.
Just an opinion; it’s my manner of sharing open ideas about providing open, extensible apps.

“At this point I have to disappoint you. Actually this is a very special requirement. Another one might require a folder with all files as single page pdfs after processing. I will only go for the OCR processing thing, not for any pdf compression/combination/merging or splitting tool. If someone is interested in pdf merge or split in nextcloud, this would be worth enough for another app by someone who has knowledge about that topic.”

Probably displaying my ignorance over the methods of OCR, the way Nextant is being used and how we can search in documents. It doesn’t matter if it’s PDF or Libre, but as a user of scanned documents that become individual page documents, it’s actually awful to use. As you are reading through, you end up with a load of single-page documents. Maybe I am getting the wrong end of the stick, but any multipage document that is scanned into a collection of single images is that way purely because of process requirements; it is not actually the desired format.
Guess it’s time to have a look at the OCR module, as I think I must be getting the wrong end of the stick.

I have a preference for PDF because it offers a format that Libre does not: searchable PDF, or by its older title “PDF image over text”, is just a hybrid of the original image sitting on top of transparent text, and having them combined is just easier to work with as you have both.
You get the perfect rendition of the image and the searchable text, and I really wish Libre would do something similar.

From previous document management and OCR experience, folders and individual pages are a bit of a nightmare to work with.

janis91 · January 23, 2017, 8:23pm

Well then I didn’t get your point before. I am completely on your side! But maybe it would be better to take this kind of discussion on to another “level”, at least in the forum. Because I think “ocr” isn’t actually the right target.

Maybe it would be a good idea to get a PDF creator and merger app on the road.

Sorry, I don’t have much time today. Just wanted to get a short answer down on paper.

Stuart_Naylor · January 23, 2017, 8:29pm

Yeah apols Janis, it’s just my manner of sharing thoughts.
Often too open with thoughts and not specific enough.

I am going to install OCR and have a look, as I should have kept my trap shut until I had done so.

But yeah, just a share, put the opinion out, if it grows or sinks…

Cheers for your time.

janis91 · January 25, 2017, 11:35am

It’s always good to spread the word or share thoughts regarding technical and architectural optimizations.

I would be happy if you could review OCR and give some feedback in this thread: https://help.nextcloud.com/t/ocr-optical-character-recognition-for-your-image-and-pdf-files/5071/2.

Feedback and ideas are always welcome.

Cheers.

Stuart_Naylor · January 26, 2017, 7:21am

Yeah apols I meant to, got sidetracked with SBC possibilities.

I am going to have a look and see what happens to searchable PDF files and others.