OCR and (hardware) scanning

Yeah, I’m doing a freebie for a local community / advocacy centre, and because of the dynamic work environment the simplicity of Nextcloud is truly wonderful for them.
That it has also become a secure client portal is a thing of beauty.

When it comes to scanning, a recent purchase is now a top tip as the photocopier/network scanner has been relegated.

http://www.czur.com/product/ET16 was £300, but everyone is saying it was worth every penny.

It takes up hardly any space and does A3, but here is the clever bit: it has a laser ruler to auto-correct orientation, and it uses computer vision (OpenCV) libs to flatten books and documents.
It is really fast at 1.5 seconds per scan, with no lid to move up or down and no document to align. It also comes bundled with FineReader, so we are creating a working, searchable PDF digital system whilst the paperwork is just an archive.
It can also be used as a big-screen presentation device for hard copies via the HDMI connection, which was also trialled today. It was a good day, as it’s perfect for their needs, and great to see them happy, as they do a huge amount of extremely valuable work for the local community.

The OCR module using Tesseract is great, and if you ain’t got FineReader it’s a worthy alternative.

The centre have been totally stunned by Nextcloud. Today was the first day and they were up and running in an hour, after a brief tutorial from me. That is truly amazing.
I did the work on a voluntary basis, and the centre manager is going to get locked away if he continues to giggle and gurgle about the amazing simplicity and functionality of his new toys.

PS I never got Collabora to work. I must be doing something wrong and I cannot figure it out. It doesn’t matter though, as that function isn’t really high on the list; maybe later, after I do some testing at home.
Docker is relatively easy; I cannot make my mind up about Collabora and who or what the problem is.
The app works but it refuses to connect to the document: LibreOffice Writer is there, empty on the screen, with an apologetic message. So I just uninstalled it and delivered today, and they have that to look forward to.


Good to hear. I’m liking that scanner as well. We have documents that need gentle handling and are preserved inside cases etc., so getting a damn fine scan is paramount to keeping the originals locked up safely.

Most volunteer places are only too happy to have qualified techs onboard. I myself have started seriously considering leveraging the luck I had in receiving two of those PowerEdges to manage the resources of a multitude of others as well (we have fibre internet here that actually works). NC’s federation and SSO will certainly begin to help out there.

So far: Let’s Encrypt for a free certificate for online security; OPNsense for a firewall with all the bells’n’whistles for VPNs and intra-site connectivity, to go with everything else you need to save your arse from ‘teh intertubes’; and NC managing files and editing to boot, barring some issues with Collabora. But as was mentioned, LibreOffice works perfectly fine remoting via WebDAV to edit the document’s source.


God, I sound like some sort of salesman, but yeah, it would be perfect for you.

Originally I could only find one by Fujitsu that was over double the price, but that is what it is: a book scanner that can get accurate results with no force on the source document, unlike a flatbed.

There are £100 ones on the tinternet but they are webcams on poles and have none of the preprocessing stuff.

11.1 RC1 just arrived. I did an install test, and later tonight I will be back on with Collabora.

Yes, he just said that guys =)

Yeah mate I get where you’re coming from. Exciting times, so it’s hard not to get excited.

I might finally be able to get rid of the rank stigma associated with Linux as well with this stuff =D

Okay, I know I’m getting annoying and a bit off the original post’s issue, but I have to know about this scanner’s speed.

We have a small library at the moment of several thousand items (war museum, so it’s all kinds of shit like flyers, intelligence briefs, manuals, artworks, etc.), so it’s a wide variety of paraphernalia. But it’s the books from ‘yonks ago’, which in Australia are quite often out of their copyright protection period, that we want to scan in. We have thousands, as I said. I’m interested in:

This device recognises page turns? Do you simply manually turn a page and it will recognise the page-turn action and scan the next page?
1.5 seconds per page?
Single page or double? Configurable for both or either?
Driver/software support for Linux?
Remote protocols like direct to WebDAV, email or FTP?

Apols, just got back. It does A3, and the software automatically recognises pages and splits what is in front of it.
I am at the community centre now and I will ask them for an honest speed result.
Yep, just asked: it’s as quick as he can put docs down, with the button press and laser alignment test taking maybe 2 seconds on each document, which is usually two pages. As you say, you just flip through, with none of that lifting the book up, flipping it over, turning the page and placing it down carefully.
For books it is massively faster, and they process a huge amount of booklet forms, so it’s perfect for them.
We have a big Ricoh photocopier / stapler / document handler here, but it is absolutely useless for the documents we receive, as it just wants A4 & A3 sheets, so they end up lifting the lid, placing the document correctly, pressing the button…
Richard says that because it’s on his desk by his side his scanning speed has increased by 10-20x, and it’s only now that a digital office has become a reality, as all documents can be scanned easily and quickly.
If you go on the site there are some videos, and the feedback I am getting is that they think it’s amazing.
Because of the laser rulers it auto-corrects, so it’s slap it down and go.
I have it linked up to a new Skylake i5-6600, as it was time to get a new computer and get quite a decent one. I did my usual of just updating the motherboard, processor, RAM and M.2 drive for just over £400, as I am doing a John Inman.

The scans are that quick, but the system here is slower, as after you get all the images we send them in batches to create a single PDF book.
That takes about 10 to 20 seconds for 10 to 30 pages on the above machine.
I guess it all depends on the Tesseract app; it would be great if there was an option to treat the images in a folder as pages of a single document, which is a post-scan process here.
You will have to look on the site, but they all have a twin-head angled scanner that does flat pages whilst the book is only open at a 130 degree angle to protect the spine, and you wear these strange chequered thumb mittens, which both scanners can detect and automatically remove from the scan image, if you need to gently hold pages in place.
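
On the wish above for treating the images in a folder as pages of a single document: recent Tesseract builds accept a plain text file listing image paths (one per line) as the input and combine the results into one output file, so a rough sketch could look like this. The folder layout, the .png extension and the function names are assumptions for illustration, not the centre’s actual setup.

```shell
# Sketch: turn a folder of page images into one searchable PDF with Tesseract.
# Assumption: pages are PNG files whose names sort in page order,
# e.g. page-001.png, page-002.png, ...

# Write a sorted list of page images, one path per line.
make_page_list() {
    local scan_dir="$1" list_file="$2"
    find "$scan_dir" -maxdepth 1 -type f -name '*.png' | LC_ALL=C sort > "$list_file"
}

# Feed the list to Tesseract; the "pdf" config asks for a single
# searchable PDF written to <out_base>.pdf.
folder_to_pdf() {
    local scan_dir="$1" out_base="$2"
    local list_file="${out_base}.pages.txt"
    make_page_list "$scan_dir" "$list_file"
    if [ ! -s "$list_file" ]; then
        echo "no page images found in $scan_dir" >&2
        return 1
    fi
    tesseract "$list_file" "$out_base" pdf
}
```

Something like `folder_to_pdf ./scans/book1 ./book1` would then leave a `book1.pdf` next to the scans, assuming Tesseract is installed with PDF output support.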

They are proper library book scanners, but actually they are finding it much easier and faster than a flatbed, by quite an order.
If you could feed the documents through the paper feed and the photocopier was closer, then the photocopier would win, but the reality is the opposite, by a long way.

Docker, to be honest, is really easy, and once you have done the install that is it.

Occasionally an admin might have to run sudo docker ps (list running containers), which shows the container ID and name.

If it’s not running, just run sudo docker rm <container_id> (remove the container); a stopped container only shows up in sudo docker ps -a.

Then it’s back to the initial command, sudo docker run -t -d -p 127.0.0.1:9980:9980 -e 'domain=cloud\\.nextcloud\\.com' --restart always --cap-add MKNOD collabora/code, to get a new container running again.

Occasionally images will have to be updated, so first delete your containers, then run sudo docker images (list downloaded images)

sudo docker rmi <image-id> (deletes the image)

Then it is once more back to the original command, sudo docker pull collabora/code, to grab the best and latest off Collabora.
Once more, to start a running container: sudo docker run -t -d -p 127.0.0.1:9980:9980 -e 'domain=cloud\\.nextcloud\\.com' --restart always --cap-add MKNOD collabora/code

I was like you when I first saw Docker and could not get it going, but once it is, the above commands are all you ever have to know.

We could prob do with a Docker section where we can just repeat those steps to anyone struggling, but that is all there is to it really.
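
For anyone who wants those steps in one place, here is a rough sketch as a small script. It is not an official Collabora or Nextcloud procedure. The container name `collabora` (set with `--name`) is my own addition so the old container can be removed without looking up its ID, and the domain is the placeholder from the commands above. By default it only prints what it would run.

```shell
# Sketch: replace the Collabora container with one from the latest image.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 runs them.
IMAGE="collabora/code"
NAME="collabora"                      # assumed name, not from the thread
DOMAIN_RE='cloud\\.nextcloud\\.com'   # placeholder domain, change for your server

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

refresh_collabora() {
    # Remove the old container; -f also stops it if it is still running.
    # A missing container is not treated as an error.
    run sudo docker rm -f "$NAME" || true
    # Fetch the newest image, then start a fresh container from it.
    run sudo docker pull "$IMAGE"
    run sudo docker run -t -d --name "$NAME" -p 127.0.0.1:9980:9980 \
        -e "domain=$DOMAIN_RE" --restart always --cap-add MKNOD "$IMAGE"
}
```

Calling `refresh_collabora` shows the three commands; `DRY_RUN=0 refresh_collabora` actually runs them. Old images left behind after a pull can still be cleaned up with sudo docker rmi as described above.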

I really like the PDF viewer, as it’s so freaking fast and it’s brilliant in how it looks and the clarity it provides.
I am new to Nextcloud and have had my sysadmin hat on but I am now thinking of donning my dev hat.

I do intend to do a really simple app that uses LibreOffice to create PDF copies of office documents.

Nextcloud has some great hooks that should make this really easy.

Filesystem Root
Injectable from the ServerContainer by calling the method getRootFolder(), getUserFolder() or getAppFolder().

Filesystem hooks available in scope \OC\Files:

preWrite (\OCP\Files\Node $node)
postWrite (\OCP\Files\Node $node)
preCreate (\OCP\Files\Node $node)
postCreate (\OCP\Files\Node $node)
preDelete (\OCP\Files\Node $node)
postDelete (\OCP\Files\Node $node)
preTouch (\OCP\Files\Node $node, int $mtime)
postTouch (\OCP\Files\Node $node)
preCopy (\OCP\Files\Node $source, \OCP\Files\Node $target)
postCopy (\OCP\Files\Node $source, \OCP\Files\Node $target)
preRename (\OCP\Files\Node $source, \OCP\Files\Node $target)
postRename (\OCP\Files\Node $source, \OCP\Files\Node $target)  

So any document can have an accompanying PDF copy for that wonderfully fast PDF viewer, which is like Documents on amphetamine.

In the admin section, a couple of simple options:
Retain source.
Delete source on deletion.
Turn off copies.
List of extensions to convert.

Might even make some folder exceptions, where you can set various root folders with different settings.

Ain’t got a clue how long it will take me, but creating PDFs that are text searchable and very fast to view is as simple as…

libreoffice --headless --convert-to pdf mydocument.odt
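
To give a feel for how the “list of extensions to convert” option might drive that command, here is a minimal shell sketch outside Nextcloud. The extension list, the function names and the choice to write the PDF next to the source (via `--outdir`) are assumptions for illustration; a real app would do this from the filesystem hooks instead.

```shell
# Sketch: make a PDF copy next to each office document in a folder.
# CONVERT_EXTENSIONS stands in for the "list of extensions to convert"
# admin option; the list itself is an assumption.
CONVERT_EXTENSIONS="odt ods odp doc docx xls xlsx"

# Succeeds if the file's extension (case-insensitive) is in the list.
should_convert() {
    local ext
    ext="$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')"
    case " $CONVERT_EXTENSIONS " in
        *" $ext "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Convert every matching file in a folder, writing PDFs into the same folder.
convert_folder() {
    local dir="$1" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        should_convert "$f" || continue
        libreoffice --headless --convert-to pdf --outdir "$dir" "$f"
    done
}
```

Running `convert_folder /path/to/documents` would then leave, say, `mydocument.pdf` beside `mydocument.odt`, provided LibreOffice is installed on the server.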

It also means that for users the viewer is the same, and to be honest the PDF viewer is miles better than the document one.
The document one is hampered with large documents, as it first has to convert, then load, then display.

I am thinking of doing most of the work as the file is created or changed.

Wish Nextcloud had shadow copies that didn’t turn up in plain view and a setting to dictate what viewer would be used.

The Ubuntu Snappy Core images for the Pi look amazingly simple, and hopefully the complete install of Nextcloud and Collabora will be merely as shown here.

https://www.linuxbabe.com/cloud-storage/install-nextcloud-server-ubuntu-16-04-via-snap.

Same with the Odroid C1 images as Nextcloud & Kodi could be pretty ace and it could be quite interesting if those cheap Chinese TV boxes could get the firmware as the newer S912 with the better Mali 820 is just great with 4K HDMI!

I am using old PCs off eBay, and the current E5200 2.5GHz dual-core motherboard combos cost just a bit over £20 (got two with 4GB RAM for £50).
Been buying 320GB SATA drives (x4) at £10 a pop, and each box works out as a 960GB RAID5 Nextcloud box for about £60.
I am like a modern-day tech rag & bone man, as I am always asking for and receiving free PC cases :)
It’s shocking that most just get mashed, after all the embodied energy in them.
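
For anyone checking the 960GB figure: RAID5 gives up one disk’s worth of space to parity, so usable capacity is (number of disks - 1) x disk size. A tiny sketch of the arithmetic, with a helper name I made up for the purpose:

```shell
# Usable RAID5 capacity in GB: one disk's worth of space goes to parity.
# Assumes all disks are the same size.
raid5_usable_gb() {
    local disks="$1" size_gb="$2"
    if [ "$disks" -lt 3 ]; then
        echo "RAID5 needs at least 3 disks" >&2
        return 1
    fi
    echo $(( (disks - 1) * size_gb ))
}

raid5_usable_gb 4 320   # prints 960
```

The same sum for five 640GB drives gives 2560GB.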

Sooo, if it’s so easy, can you not make some automated build of Docker for various platforms anyway? =D Winkwink, noodge noodge, know what oi mean? Or possibly some guides (does Nextcloud have a wiki facility for such things, like Cyanogen or Linux Mint?)

And the PDF task you’re talking about is already an application in the store, I believe. You need to be running at least Ubuntu 16.10 though (I only run deb-based stuff, so I don’t know about other distros), as some of the packages only became available recently. I didn’t find a PPA of any sort when I last checked.

I’ve been playing with an old Dell PowerEdge 2400 from ten years ago. Amazing how much grunt that thing has, even with a simple 4GB of RAM (now at 48). Both CPU sockets filled. It came with seven 10K RPM drives, so there is room for a nice modern drive to do internal backups from the 10K drive array (the controller only addresses 2TB, but eh, might see if it supports a newer one later). And the two extra SATA slots on the mobo itself pretty much became the SSD system drive.

That bad boy was donated, but checked out the price online, and it’s damn cheap.


Yeah, I think they prob will; they are developing the snap app as they are also developing 11.0 & the Collabora app.

In 11.1 it’s highly likely that on the Pi 3 it could be just sudo snap install nextcloud and some options to set credentials and preferences.
Interesting and simple times seem to be very close.

I have only been using Nextcloud for three days but as an ex sys-admin I have a lot of knowledge that others shouldn’t have to give two hoots about.
All I can say with what is being done is that yes it could be that simple and it looks like they are trying to make it so.

[EDIT]

I am finding the £20 motherboard & £40 RAID (software, which I prefer) extremely quick on the local LAN; that PDF viewer is instant with the Nextcloud manual, which is 70-odd pages.

Also, and apols, but install Java 8 and get Solr. Honest, it’s actually really easy, easier than my experience with Collabora by a long way.
The Nextant App is a thing of beauty and you need to see it in operation.

While it was top of my list of addons to install, it just went higher after this bit:

“If needed, will OCR your stuff.”

Somehow completely missed that tidbit. Will be needing that as I deal with a lot of information at a museum, and parsing ALL documents in Nextcloud for general public consumption would make our lives ridiculously easy.

Nextcloud is a sexy piece of work!

@Stuart_Naylor @stiiixy this was going way off topic, enjoy your own thread to continue your discussion so we keep the documents app topic relevant :)

Well, technically we ARE discussing OCR and documentation per the header. Stuart just highlighted, for many of us, a brilliantly designed scanner for getting all those rascally documents into NC, along with some apps like Nextant that can handle them.

Never mind, you split the topic. No wonder I was getting confused. Thanks =D


No bother about the thread moves, cheers for the heads up.

I think this sort of shows why the scanner works so well: it’s right next to him. If it had been an A3 flatbed or document handler it would have been a desk hog.
PS we are very strange, but that 2K monitor in portrait is just fantastic for viewing documents, and Nextcloud with its lists works extremely well that way.


Hey mate, so how’s the process coming along?

I’ve got lots of people converted to that vertical monitor format. Researchers and librarians love it =) Literal walls of text. Some are a bit poopy they can’t rotate theirs. Yet. Lucky we have a workshop department who like to tinker. Vertical works a treat as well with Firefox’s Reader View mode.

Great really, I was looking at loads of CRM software and it was a choice of CiviCRM and SuiteCRM and I would still be onsite giving training if I had chosen one of those two.

Nextcloud I installed, gave a 30 minute tutorial and that was it. I will let them run and go back to see if there are loose ends when they are familiar.

The centre manager put it to me like this: “I have forms, books, paper, newspapers and magazines, and for some reason for the last 30 years I have been looking at my computer sideways.”

The AOC Q2577Pwq/25 2560x1440 monitor is rather nice, but was actually purchased because it has the best stand / rotate action I have seen.
It’s quite easy to rotate, and the Nvidia Quadro has hot keys to change between Landscape & Portrait.
Might have been a waste, as it has remained firmly in portrait mode.

I need to check out the SMS functionality available in Nextcloud, as with a community centre / advocacy service a mobile number is far more common than email.
Delivering shares by SMS is the idea, so the clients can look at their own documents.

Did you give a vertical overhead scanner a go?

PS my latest rave is a little eSATA III / USB 3.0 4-disk multibay that costs less than the SATA III port multipliers I can find.
eSATA / USB 3.0, SATA III / USB 3.0 UASP
JMicron JMS567 USB 3.0 Bridge
JMicron JMB575 SATA III Port Multiplier

Between IT depts and eBay bargains there is a glut of disks, and when I get one of these I will post some info on JBOD and software RAID performance.
There is a glut of 320GB drives on eBay, and you can make cheap, resilient storage units.
I just managed to get 5x 640GB WD Black for £55 + £7 postage and will be posting results on those.

Also playing with the experimental BTRFS RAID5: if they do iron out the last remaining quirks, it will mean RAID5 is dragged back out of deprecation, as a resync with BTRFS only resyncs the lost disk and not the whole volume.
Redundant Array of Inexpensive Disks might actually start to make sense as it has always been RAD.

The vertical scanner is on the to-do list at the moment. That unit is a very tempting purchase. However, we’re still ‘solidifying our infrastructure’. We have some crazy stuff going on at the moment just to get things working and disparate elements interacting. One example: to connect a gallery to the local network and the Internet we have a 100 metre length of CAT5 running between the gallery and the ‘comm centre’ over their rooftops. Being an ex-military installation (this is a military history museum), there is an armoury in the middle. Reinforced concrete with metal plating and other amusing anti-fire material (asbestos) wasn’t my idea of power-drilling fun. There are also two complications that go with a rooftop flying-fox (zipline) approach. First, where we live the UV index regularly tops 15, and we’re currently seeing 17. That’s 5-minute cancer stuff. What it does to cheap Chinese plastics, coupled with the heat, is indescribable. Thankfully we have some sparkies (electricians) who can donate some conduit to fix that up a wee bit, which also helps with the second bit: cyclones. One does what one must =D

Just a friendly warning regarding BTRFS RAID5/6: be VERY careful using it for primary storage. I had to migrate off it because of some recent revelations in the code. Check out the Phoronix articles about the initial warnings, and do a follow-up. The last I checked, four months ago, there was no real estimate on when the 5/6 code would see work, let alone a proper fix. And then it would need years of testing anyway. Personally I couldn’t justify the potential loss of ALL data, which made RAID moot. Things like all data disappearing when the drives filled aren’t my idea of sweet cherry pie.

How are you finding the speed of the NAS box with the high-speed drives? On our little 4-bay Seagate NAS, the (Atom) CPU was constantly maxed. I was wasting my time trying to use RAID6, and obviously encryption was not going to be an option on it.

Honestly, I prefer retooling old SOHO servers with 4+ SATA ports for fileserving. Using FreeNAS for the ‘firmware’ you can take advantage of a tried and tested filesystem in ZFS. You won’t have real CPU or RAM limitations, and the only thing to consider is a 2TB limitation per drive on the older controllers, which obviously isn’t an issue for yourself.

And do you intend on hooking the CRM into NC somehow? I’m still trying to decide on a more elegant solution à la MS Server/AD and SSO for all these you-beaut fun application suites. The two CRMs you mentioned piqued my interest, and I’m also looking for project management integration as well. I’ve seen things like Univention Corporate Server for that simplified, unified log-in experience across the board.

No, actually it’s the staff learning curve: SuiteCRM is not so bad, CiviCRM is high, and as a voluntary organisation we are just too dynamic with personnel.
It all depends on how you use the folder structure: if you don’t take things verbatim and use some lateral thought, Nextcloud can be the basis of many solutions, and the layout can tailor this to an extent.

I am banging ideas into the GSoC 2017 thread that are not about solutions but additions to allow user solutions.

Fingers crossed, but they are just ideas to start a dialogue, or a spark that may lead to something completely different.

You are right about old hardware as it is often available and often much cheaper whilst having more performance.
Wintel sometimes sticks in my throat and I am a fanboy of Linaro style of things and this is where I am playing out of interest.