Prevent PDF text extraction on upload

We use the CMS to store a lot of documents, many of which also have PDF or other attachments. We don’t wish to extract text from our PDFs, since we have equivalent HTML content already and don’t need to duplicate this.

I’d posted previously on the old forums about this:
https://groups.google.com/forum/#!topic/hippo-community/5QkdAhDDg14

At the time of that post, we were importing lots of content using our own importer tool - as such, when we were transferring PDFs and other documents, we had the ability to add the ‘hippo:text’ property as recommended, so that the Tika text extraction didn’t take place. This did exactly what we were looking for.

Since my last post, the CMS is now in use and users are now manually importing their own PDF attachments. As such, we’re investigating how best to replicate this behaviour and prevent Tika performing this text extraction when people upload PDFs manually through the CMS.

I had a go at using the event bus, writing a daemon module that watches for all events and then both applies the ‘hippo:skipindex’ mixin and sets ‘hippo:text’ to an empty string (in a similar fashion to how our importer had been working). However, it seems none of the events happen early enough in the workflow to prevent the extraction. Watching the node in the console, I can see the ‘hippo:text’ property become populated during the Tika text extraction (with a size greater than 0kB). Only when one of the workflow events triggers afterwards do I see this property being set back to the empty string as specified in the module (back to 0kB), and the mixin applied.

I’ve read https://www.onehippo.org/library/deployment/configuring/repository-assets-performance-tuning.html again, and it seems as if there’s perhaps no way to control this ‘hippo:text’ property if you’re uploading through the CMS.

Could you confirm if there’s a way to set this ‘hippo:text’ property through the event bus (or otherwise) before the Tika text extraction has a chance to begin?

Woonsan previously brought up the Tika configuration file but stated that setting ‘hippo:text’ was the easiest option since we were previously using our own importer. Given that we’re now using the CMS for PDF upload, is it a better time to evaluate overriding the Tika configuration? Would you be able to provide any guidance for this, or point me towards any documentation?

Off the top of my head I don’t know the classes involved, but you’d need to stop the parsing of the PDF. Basically you need to override the specific gallery processor used. Not sure if you can configure this, but I suspect you would probably need to override the plugin class that actually calls the gallery processor. I think this is configured on the workflows…

It’s possible, but it will lead to a maintenance burden. It’s also not a documented/supported extension point. So it may break in some future update without warning.

As of https://issues.onehippo.com/browse/CMS-4802, the CMS UI code extracts the PDF text content and stores it in the hippo:text property, in which case the repository doesn’t try to extract the PDF text content again.
The related CMS UI code is org.hippoecm.frontend.editor.plugins.resource.ResourceUploadPlugin and org.hippoecm.frontend.editor.plugins.resource.ResourceHelper#handlePdfAndSetHippoTextProperty(Node, InputStream).
It is possible to override the plugin class which is configured at /hippo:namespaces/hippo/resource/editor:templates/_default_/upload. But considering that the class is not designed for that kind of extensibility, it might not be so easy to maintain in the future.

Another possibility is to customize the default tika configuration at hippo-repository-tika-5.6.0.jar!org/onehippo/repository/tika/tika-config.xml by shadowing it in cms/src/main/resources/org/onehippo/repository/tika/tika-config.xml.
You can try to comment out PDFParser and add application/pdf mimetype to the empty parser for example. Let us know if this tika config works for you. One thing to note is that this tika configuration change will be globally applied.

Right. EventBus is asynchronous. That’s not gonna work.
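To illustrate why an asynchronous listener comes too late, here is a minimal plain-Java sketch. It does not use any Hippo classes: the single-thread executor is only a hypothetical stand-in for the event bus, and the strings stand in for what happens to hippo:text. It assumes, as described above, that the text is extracted as part of the upload itself and that the event is dispatched afterwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java analogy only: no Hippo classes are used here.
final class AsyncListenerOrderDemo {

    static List<String> simulateUpload() {
        final List<String> steps = Collections.synchronizedList(new ArrayList<>());
        // Hypothetical stand-in for an asynchronous event bus.
        final ExecutorService bus = Executors.newSingleThreadExecutor();
        try {
            // 1. The upload code runs first and fills hippo:text itself.
            steps.add("upload: hippo:text populated");
            // 2. The event is only dispatched after that work has happened.
            final Future<?> delivery = bus.submit(() -> {
                steps.add("listener: hippo:text reset to empty");
            });
            delivery.get(); // wait so the recorded order is complete
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            bus.shutdown();
        }
        return steps;
    }

    public static void main(String[] args) {
        simulateUpload().forEach(System.out::println);
    }
}
```

The listener can only ever clean up after the fact, which matches what was observed in the console: the property is populated first and emptied later.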

Regards,

Woonsan

Hi Woonsan,

My colleague tried the tika-config.xml changes you recommended. We can see that the file is now included, but it does not seem to be getting picked up, as the PDF is still being parsed. Are we missing anything to get it to use this over the default settings in the hippo-repository-tika-5.6.0.jar?

We overwrite the indexing_configuration.xml in a similar way locally, but hold the file externally and reference it through a repository.xml entry on our deployed environments. Is it possible to do something similar with the tika-config.xml?

It is possible that my instruction on what to change in the tika-config.xml missed something. I’ll test it out locally and let you know.

I don’t think it’s a good solution, though, because, as mentioned before, the CMS application (not the repository) now extracts the text first, before the repository gets a chance to do so at the lower level.

Woonsan

Unfortunately, it turns out that the CMS module does not use the tika-config.xml. Instead, it directly instantiates and uses an org.apache.tika.parser.pdf.PDFParser in org.hippoecm.frontend.editor.plugins.resource.PdfParser:

    private PdfParser() {
        // SNIP
        tika = TikaFactory.newTika(detector, new PDFParser());
    }

Therefore, even if I shadowed the configuration of org.apache.tika.parser.CompositeParser in cms/src/main/resources/org/onehippo/repository/tika/tika-config.xml, it never uses the configured CompositeParser. It reads the configuration, but never uses it.
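As a rough illustration of why the shadowed file has no effect, here is a small plain-Java sketch. Everything in it is hypothetical: the map stands in for parsers selected through tika-config.xml, and the two methods stand in for a component that consults that configuration and one that constructs its own parser. It is not Tika or Hippo code.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-ins only: this is not Tika or Hippo code.
final class HardWiredParserDemo {

    // Stand-in for parsers chosen through configuration:
    // PDFs are mapped to a parser that yields no text.
    static final Map<String, UnaryOperator<String>> CONFIGURED_PARSERS =
            Map.of("application/pdf", content -> "");

    // A component that looks up its parser in the configuration.
    static String extractUsingConfig(String mimeType, String content) {
        return CONFIGURED_PARSERS
                .getOrDefault(mimeType, UnaryOperator.identity())
                .apply(content);
    }

    // A component that builds its own parser and never consults the configuration.
    static String extractHardWired(String content) {
        final UnaryOperator<String> ownParser = UnaryOperator.identity();
        return ownParser.apply(content);
    }

    public static void main(String[] args) {
        System.out.println("configured: [" + extractUsingConfig("application/pdf", "pdf text") + "]");
        System.out.println("hard-wired: [" + extractHardWired("pdf text") + "]");
    }
}
```

Changing the configuration only affects the first path; the second keeps extracting text regardless, which is the situation described for the CMS-side PdfParser.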

I don’t think there’s a proper way to customize the PDF parsing in CMS at the moment.

Perhaps you can file a JIRA ticket for an improvement (JIRA project name: CMS): “Allow to customize pdf parsing behavior through tika-config.xml” for instance.

Regards,

Woonsan

Thank you for trying this for us and for the feedback, Woonsan.

Another workaround, in the short term, could be to override org.hippoecm.frontend.editor.plugins.resource.ResourceUploadPlugin configured at /hippo:namespaces/hippo/resource/editor:templates/_default_/upload like the following example:

public class MyResourceUploadPlugin extends ResourceUploadPlugin {

    private final IEditor.Mode mode;

    public MyResourceUploadPlugin(IPluginContext context, IPluginConfig config) {
        super(context, config);
        mode = IEditor.Mode.fromString(config.getString("mode"), IEditor.Mode.EDIT);
        addOrReplace(createFileUploadPanel());
    }

    private FileUploadPanel createFileUploadPanel() {
        final FileUploadPanel panel = new FileUploadPanel("fileUpload", getPluginConfig(), getValidationService()) {
            @Override
            public void onFileUpload(final FileUpload fileUpload) throws FileUploadViolationException {
                handleUpload(fileUpload);
            }
        };
        panel.setVisible(mode == IEditor.Mode.EDIT);
        return panel;
    }

    // SNIP
    private void handleUpload(FileUpload upload) throws FileUploadViolationException {
        // SNIP
        try {
            // SNIP
            if (MimeTypeHelper.isPdfMimeType(mimeType)) {
                // no need to parse pdf in my case, so set empty hippo:text...
                setEmptyHippoTextBinary(node);
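            } else if (node.hasProperty(HippoNodeType.HIPPO_TEXT)) {
                // not a PDF: drop any stale hippo:text left over from a previous upload
                node.getProperty(HippoNodeType.HIPPO_TEXT).remove();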
            }
        } catch (RepositoryException | IOException ex) {
            // SNIP
        }
    }

    private static void setEmptyHippoTextBinary(final Node node) {
        String nodePath = null;
        try {
            nodePath = node.getPath();
            final ByteArrayInputStream emptyByteArrayInputStream = new ByteArrayInputStream(ArrayUtils.EMPTY_BYTE_ARRAY);
            // obtain the ValueFactory from the node's session (standard JCR API)
            node.setProperty(HippoNodeType.HIPPO_TEXT, node.getSession().getValueFactory().createBinary(emptyByteArrayInputStream));
        } catch (RepositoryException e) {
            log.error("Unable to store empty hippo:text binary for node at {}.", nodePath, e);
        }
    }
}
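
The branch in handleUpload above can be tried out in isolation with the following plain-Java sketch. It is only an analogy: a Map stands in for the JCR node, and the "application/pdf" comparison is an assumption about what MimeTypeHelper.isPdfMimeType checks. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java analogy: a Map stands in for the JCR resource node.
final class HippoTextDecisionDemo {

    static final String HIPPO_TEXT = "hippo:text";

    // Assumption: a PDF is recognised by the "application/pdf" mime type.
    static void applyHippoText(Map<String, byte[]> node, String mimeType) {
        if ("application/pdf".equalsIgnoreCase(mimeType)) {
            // PDF: store an empty value so that no text is extracted later.
            node.put(HIPPO_TEXT, new byte[0]);
        } else {
            // Anything else: remove a stale value left by an earlier upload.
            node.remove(HIPPO_TEXT);
        }
    }

    public static void main(String[] args) {
        final Map<String, byte[]> node = new HashMap<>();
        applyHippoText(node, "application/pdf");
        System.out.println("after PDF upload, hippo:text length = " + node.get(HIPPO_TEXT).length);
        applyHippoText(node, "text/plain");
        System.out.println("after text upload, hippo:text present = " + node.containsKey(HIPPO_TEXT));
    }
}
```

The point of the empty value is the one made earlier in the thread: when hippo:text is already present, the repository does not try to extract the text again.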