I’m wondering if it’s possible to replace Lucene’s filesystem storage with some other backend, such as a memory cache like Redis. Creating the Lucene index (the storage folder) takes a lot of time for bigger websites during warmup; in some cases I experienced a 40-minute warmup due to the indexing operation. I think that moving this to Redis might speed up the startup operation.
Is it possible at all? I found that there is a DirectoryManager class, and that we can provide our own implementation of getDirectoryManager() to override the default behaviour and return our own manager implementation.
Do you have any experience with this? Do you think it’s possible and worth the effort to try keeping this index in memory, both to speed up startup and to make it possible to share the index between instances in a cluster?
Maybe it’s available out of the box in Lucene and I’m just not able to find it?
Its dockerization is probably deprecated now, since v13 ships its own Docker support at the product level, but the scripts (the *.sh files) might still be useful to download, back up, restore, or re-initialize the Lucene index folder on startup (when invoked by setenv.sh).
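For what it’s worth, such a backup/restore-on-startup step can be sketched in shell. The paths below (REPO_PATH, the default workspace index location, and the backup file) are assumptions and would need adjusting to your repository layout:

```shell
#!/bin/sh
# Hypothetical locations - adjust to your repository layout.
INDEX_DIR="${REPO_PATH:-storage}/workspaces/default/index"
BACKUP_FILE="${BACKUP_FILE:-/tmp/lucene-index-backup.tar.gz}"

# Archive the current index folder so a later startup can reuse it.
backup_index() {
  tar -czf "$BACKUP_FILE" -C "$(dirname "$INDEX_DIR")" "$(basename "$INDEX_DIR")"
}

# Restore the archived index, but only when no index exists yet,
# so a live index is never overwritten.
restore_index() {
  if [ ! -d "$INDEX_DIR" ] && [ -f "$BACKUP_FILE" ]; then
    tar -xzf "$BACKUP_FILE" -C "$(dirname "$INDEX_DIR")"
  fi
}
```

Restoring a reasonably fresh index this way can avoid the full re-index on warmup; Jackrabbit then only has to catch up on changes made since the backup.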
In Lucene 3.6, which Jackrabbit currently depends on, I see only FSDirectory and RAMDirectory as the essential Directory implementations, and each of those is mapped to a Jackrabbit manager implementation: FSDirectoryManager or RAMDirectoryManager.
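For reference, the manager is pluggable through the directoryManagerClass parameter of the SearchIndex element in the workspace configuration. A sketch (the path value is the usual default; adjust as needed):

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Default is FSDirectoryManager; RAMDirectoryManager keeps the index in memory.
       A custom implementation of
       org.apache.jackrabbit.core.query.lucene.directory.DirectoryManager
       could be plugged in here the same way. -->
  <param name="directoryManagerClass"
         value="org.apache.jackrabbit.core.query.lucene.directory.RAMDirectoryManager"/>
</SearchIndex>
```

So a Redis-backed manager would have to implement both the Lucene Directory and the Jackrabbit DirectoryManager side yourself.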
So Redis doesn’t seem to be supported even at the Lucene (3.6) level, which means it would be very difficult.
According to the warning in the JavaDoc of RAMDirectory, it doesn’t sound suitable for normal production environments:
Maybe in the latest version, but not in v3.6 as far as I can see.
As Apache Jackrabbit v2 is in maintenance mode and mostly kept for API compatibility with Apache Jackrabbit Oak, I don’t expect a big new feature or improvement in that area in Jackrabbit v2.
@woonsanko
If we use the Lucene index export/import and we want to use auto-scaling in the cloud, how do we set the cluster id for each instance? Based on the documentation on the Hippo site, the cluster id is set to hostname-whoami, but the hostname, when deployed through ECS or EKS, may not be the same every time there is a restart or autoscaling.
The online documentation is not written specifically for that kind of cloud environment; it just says, in principle:
A repository instance must have a unique cluster id, and the cluster id should be the same as before when restarting; otherwise it will resync again on startup.
As an example, you can use the combination of hostname and whoami.
If you cannot use that combination, you can find a different way. I’m not sure, but for example you could figure out how to generate a unique id in that specific environment, or you could even pass a unique ID in somehow… But those options are cloud-platform specific.
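As a sketch of that principle, assuming a Tomcat setenv.sh and Jackrabbit’s standard org.apache.jackrabbit.core.cluster.node_id system property:

```shell
#!/bin/sh
# Hypothetical setenv.sh fragment: derive a cluster node id that is
# stable across restarts, and pass it to Jackrabbit via the standard
# org.apache.jackrabbit.core.cluster.node_id system property.
CLUSTER_NODE_ID="$(hostname)-$(whoami)"
CATALINA_OPTS="${CATALINA_OPTS} -Dorg.apache.jackrabbit.core.cluster.node_id=${CLUSTER_NODE_ID}"
export CATALINA_OPTS
```

On Kubernetes (EKS), running the repository pods as a StatefulSet gives each pod a stable name, so the hostname part survives restarts; on plain ECS you would need to derive the id from something else that is stable per instance.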