How to handle a rather large quantity of binaries?

Hi all!

I am currently investigating the possibilities of managing a rather large set of binaries within Hippo CMS. I have some options in mind, but I am not really sure of all the risks, so I would like you to share your thoughts with me. But first, let me explain the customer's wish.

The use case I have is as follows: we are running a rather small Hippo application with currently only 4 different document types. One of those types is a sort of article, nothing fancy, just like any other article. The ‘special’ part about this article is that it’s going to link to approximately 10 to 30 attachments EACH. By attachments I mean any possible asset: images, videos, PDFs or any other office documents. From the start we will have around 100 of those articles, meaning around 1,000 to 3,000 assets. This will grow by approximately 100 to 200 articles a year, so 1,000 to 6,000 assets a year.
Other requirements are that images can be previewed and that all attachments are downloadable for the end users. It would be considered a nice-to-have if the documents in the assets could be indexed for search, although this is not a hard requirement.

Option 1: Changing the JCR DataStore to VFSDataStore
I’ve found this wonderful blog post (http://woonsanko.blogspot.com/2016/08/cant-we-store-huge-amount-of-binary.html) from @Woonsanko. He explains that you can use a different type of DataStore to store binaries in a separate location, for example S3 or an SFTP server (a configuration sketch follows the lists below). This looks like a pretty solid solution, but I’m not really sure of the downsides. I think the downsides of this solution are the following:

  • If the SFTP server goes down or the connection is lost, binaries will not be stored on the SFTP server and will be lost permanently.
  • This solution still uses JCR, meaning the repository will contain all binaries and therefore grow quite big.

Positive side:

  • We can use Hippo CMS as before; nothing changes for the users.
  • Documents are indexed in JCR and therefore searchable.
  • Low effort, just configuration.
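
For reference, the switch happens in the repository configuration (e.g. repository.xml). Below is a minimal sketch of what the DataStore element could look like, based on my reading of the blog post above and assuming the jackrabbit-vfs-ext module is on the classpath; the class name, parameter names and example values (config, asyncWritePoolSize, minRecordLength, the ${catalina.base} path) should all be verified against that post and the jackrabbit-vfs-ext documentation before relying on them:

```xml
<!-- repository.xml: replace the default DataStore with the VFS-backed one -->
<DataStore class="org.apache.jackrabbit.vfs.ext.ds.VFSDataStore">
  <!-- external properties file describing the VFS backend (SFTP location, credentials) -->
  <param name="config" value="${catalina.base}/conf/vfs2-datastore-sftp.properties"/>
  <!-- number of background threads writing binaries to the backend -->
  <param name="asyncWritePoolSize" value="10"/>
  <!-- binaries smaller than this number of bytes stay inline in the database -->
  <param name="minRecordLength" value="1024"/>
</DataStore>
```

The referenced properties file would then point the VFS backend at the SFTP folder, e.g. a baseFolderUri entry such as sftp://user:secret@host/path (hypothetical host and path; check the blog post for the exact property names).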

Option 2: Iframe with external application (https://www.onehippo.org/library/concepts/editor-interface/cms-perspectives.html)
It’s also possible to use an existing Digital Asset Management (DAM) system for this and integrate it via an iframe CMS Perspective.
Downsides of this solution:

  • We cannot make use of the Hippo CMS way of working with assets.
  • We cannot see unused files, or at least not as easily.
  • Yet another system we have to maintain, purchase or develop.

Option 3: Stick with Hippo as is
Downsides:

  • Large repository.
  • Large database, because all binaries are stored in the database.

Positive side:

  • We can use Hippo CMS as before; nothing changes for the users.
  • Documents are indexed in JCR and therefore searchable.
  • No effort; we keep the existing configuration.

Personally, I’m really positive about the first option. The only thing is that I’m not a JCR expert and cannot really judge the disadvantages of this solution.
I’m looking forward to your thoughts and experiences. Especially on the first option.

Have a nice day!
Jesper

“Still uses JCR” seems ambiguous to me. With this option, all content, including documents and nodes, is still stored in the database, but all binary content data, except binaries smaller than minRecordLength, is now stored on the SFTP server, not in the database any more. The database keeps only the binary identifier information that associates a node with a file stored on the SFTP server. So the database size will be hugely reduced.
In addition, as PDF binaries are indexed into the Lucene index, that index data will still live in the same Lucene index directory whether you use option 1 or option 3. So the Lucene index size won’t change.

Regarding the potential downsides: VFSDataStore and S3DataStore are not widely used in our community yet, even though I’d love to recommend them and support the community myself. And we as a company have not used them enough ourselves yet. So it’s not something that can be guaranteed by the company at this point.

Other than that, I don’t personally see a big downside to option 1 (using either SFTP or S3). If, as time goes by, your database usage pattern is such that 70% or more of the data sits in the DATASTORE table, this option will help a lot, as it reduces the database size and helps with backups, migration, management, etc.
Also, because this option uses a local file system cache for the immutable binary items on retrieval, there’s no performance penalty or availability issue in the delivery tier even when the backend SFTP server is down (in which case binary uploading will fail until it comes back).
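
To make that concrete: the VFSDataStore keeps a local file system copy of binaries it has already stored or fetched, so reads in the delivery tier are served from local disk rather than from SFTP. Assuming it follows the same CachingDataStore pattern as Jackrabbit’s S3DataStore (an assumption worth verifying against the jackrabbit-vfs-ext documentation), the size of that local cache could be tuned with an extra parameter on the DataStore element sketched earlier:

```xml
<!-- hypothetical addition: cap the local binary cache at 16 GB (value is in bytes);
     the parameter name comes from Jackrabbit's CachingDataStore, so verify it applies to VFSDataStore -->
<param name="cacheSize" value="17179869184"/>
```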

Just my two cents,

Woonsan