nuxeo-platform-ocr and image PDFs

I have installed the nuxeo-platform-ocr plugin ( https://github.com/nuxeo/nuxeo-platform-ocr#readme ) and it is working very nicely, but I am not able to run the OCR on image PDFs.

Is there any plugin to do this?

Regards

Ruben Bahntje Ushuaia - Argentina


ANSWER



Great to hear that you managed to install this addon despite its list of non-trivial dependencies that have to be built from source :)

To make it work on PDF files, you would first need to extract the image files (e.g. JPEG files) embedded in them. If you are a Java developer, this should be doable with PDFBox ( http://pdfbox.apache.org/ ); for example, you can take a class from the PDFBox source tree as a starting point.
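To give an idea of what the extraction step involves, here is a minimal, dependency-free sketch. Note that it is not PDFBox and not part of the OCR plugin: it is a crude heuristic that scans the raw PDF bytes for streams whose data starts with a JPEG header, which is how JPEG (DCTDecode) images are normally stored. The class name EmbeddedJpegScanner and the output file names are made up for illustration. It will miss images stored with other encodings (Flate, CCITT, JBIG2, which are common in scanned PDFs) and will not work on encrypted files, so a real implementation should go through PDFBox as suggested above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only: a byte-level heuristic,
// not a PDF parser and not the PDFBox API.
public class EmbeddedJpegScanner {

    private static final byte[] STREAM = "stream".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    /** Returns the raw bytes of every PDF stream whose data starts with a JPEG header. */
    public static List<byte[]> extractJpegs(byte[] pdf) {
        List<byte[]> images = new ArrayList<>();
        int pos = indexOf(pdf, STREAM, 0);
        while (pos >= 0) {
            int start = pos + STREAM.length;
            // The "stream" keyword is followed by CRLF or LF, then the data.
            if (start < pdf.length && pdf[start] == '\r') start++;
            if (start < pdf.length && pdf[start] == '\n') start++;
            int next = start;
            if (startsWithJpegHeader(pdf, start)) {
                int end = indexOf(pdf, ENDSTREAM, start);
                if (end < 0) break; // truncated file, stop scanning
                int dataEnd = end;
                // Drop the optional end-of-line that precedes "endstream".
                if (dataEnd > start && pdf[dataEnd - 1] == '\n') dataEnd--;
                if (dataEnd > start && pdf[dataEnd - 1] == '\r') dataEnd--;
                images.add(Arrays.copyOfRange(pdf, start, dataEnd));
                next = end + ENDSTREAM.length;
            }
            pos = indexOf(pdf, STREAM, next);
        }
        return images;
    }

    // JPEG files begin with the bytes FF D8 FF.
    private static boolean startsWithJpegHeader(byte[] data, int i) {
        return i + 2 < data.length
                && (data[i] & 0xFF) == 0xFF
                && (data[i + 1] & 0xFF) == 0xD8
                && (data[i + 2] & 0xFF) == 0xFF;
    }

    // Naive search for a byte pattern, starting at index "from".
    private static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        List<byte[]> images = extractJpegs(pdf);
        for (int i = 0; i < images.size(); i++) {
            // Each file written here could then be handed to the OCR step.
            Files.write(Paths.get("image-" + i + ".jpg"), images.get(i));
        }
        System.out.println(images.size() + " JPEG stream(s) written");
    }
}
```

Run it with the path of a PDF as the only argument; any JPEG streams it finds are written to the current directory as image-0.jpg, image-1.jpg and so on.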

The source code of the OCR plugin is not too complicated to dive into, and I can probably assist you on the nuxeo-dev mailing list or, better, directly through the inline review system on a pull request on GitHub.
