We use the CMS to store a lot of documents, many of which also have PDF or other attachments. We don’t wish to extract text from our PDFs, since we have equivalent HTML content already and don’t need to duplicate this.
I’d posted previously on the old forums about this:
At the time of that post, we were importing lots of content using our own importer tool - as such, when we were transferring PDFs and other documents, we had the ability to add the ‘hippo:text’ property as recommended, so that the Tika text extraction didn’t take place. This did exactly what we were looking for.
Since my last post, the CMS has come into use and users are now manually uploading their own PDF attachments. As such, we’re investigating how best to replicate this behaviour and prevent Tika from performing this text extraction when people upload PDFs manually through the CMS.
I had a go at using the event bus, writing a daemon module that watches for all events and then both applies the ‘hippo:skipindex’ mixin and sets ‘hippo:text’ to an empty string (in a similar fashion to how our importer had been working). However, it seems none of the events happen early enough in the workflow to prevent the extraction from happening. By watching the node in the console, I can see the ‘hippo:text’ property become populated during the Tika text extraction (with a size greater than 0kB). Only when one of the workflow events triggers afterwards do I see the property being reset to the empty string as specified in the module (back to 0kB), and the mixin applied.
I’ve read https://www.onehippo.org/library/deployment/configuring/repository-assets-performance-tuning.html again, and it seems there may be no way to control this ‘hippo:text’ property if you’re uploading through the CMS.
Could you confirm if there’s a way to set this ‘hippo:text’ property through the event bus (or otherwise) before the Tika text extraction has a chance to begin?
Woonsan previously brought up the Tika configuration file but stated that setting ‘hippo:text’ was the easiest option, since we were using our own importer at the time. Given that we’re now using the CMS for PDF upload, is now a better time to evaluate overriding the Tika configuration? Would you be able to provide any guidance for this, or point me towards any documentation?