Content storage strategy during development

Hello,

Our team has built a few small demo/proof-of-concept projects on BloomReach over the past couple of months, and we’re still disagreeing about the right strategy for storing our content during development. We also need to understand what, if anything, needs to change in our content strategy as we transition from development to deployment/go-live.

I might start to ramble a bit as I try to reason through these questions, so I apologize in advance. I’ve divided the topics into sections below to help organize them, and I’m hoping that some community members can help validate my thinking.

Context regarding our team

A bit of context: our team’s experience is primarily with AEM/CQ5. In AEM, our content is generally excluded from version control (git) and managed with content packages.

Some members of our team have attempted to follow that same model with packages of YAML exported from the Hippo Console, but this feels incorrect, and in practice it has proven problematic. Because these packages lack include/exclude patterns, dependency management, and subpackages, it is very easy to accidentally overwrite data, import to the wrong repository path, or miss a necessary package.

Content vs Config

In BloomReach, it seems as though content belongs in the VCS repository (YES PLEASE!), or at least the bootstrap content does. We understand the bootstrap concept, but I’m unsure whether we really want content to be limited to bootstrapping during development, or whether it makes more sense to treat the content (documents) as config and update it on every local cargo start.

The reason I believe documents should be treated as config is that, in my mind, every developer should have the complete project content every time they run their local Hippo instance.

For example, if another developer makes changes to a component and includes content changes with that, then we would expect all developers to have the updated content when they pull in those changes and restart cargo.

If documents are marked as content, then anybody running Hippo with an external repository storage directory (repo.path) would have to start the project with the bootstrap flag set to true in order to update the content in their repository. Without an external storage path, developers would still have to remember to run mvn clean. Are there any reasons you would advise against storing documents as config? Is there a better way to solve this?
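
For reference, the restart cycle I’m describing looks roughly like this (assuming the archetype’s standard cargo.run profile and an external storage directory; adjust the property names if your setup differs):

    # re-apply bootstrap content after pulling changes
    mvn clean verify
    mvn -P cargo.run -Drepo.path=storage -Drepo.bootstrap=true

    # normal start, keeping the existing repository as-is
    mvn -P cargo.run -Drepo.path=storage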

Application vs Development

Now, as for the location of content in the VCS repository, I understand that repository-data/application is for data that should go everywhere (including production), and repository-data/development is for data that will be excluded from the distributables.

When we create a project from the archetype, content is stored under repository-data/application. This is somewhat surprising to me.

I suppose it makes sense, since content bootstrapping ensures that only new content is ever added to the JCR. However, this conflicts with my earlier assertion that documents should be treated as config during development (in which case documents should definitely be stored in repository-data/development).
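
To make the layout concrete, a project from the archetype stores things roughly like this (abbreviated from memory, so treat it as a sketch); my suggestion would be to move the sample hcm-content into the development module instead:

    repository-data/
      application/src/main/resources/
        hcm-config/    (configuration, deployed everywhere)
        hcm-content/   (sample documents, currently also deployed everywhere)
      development/src/main/resources/
        hcm-config/    (development-only configuration)
        hcm-content/   (development-only content, excluded from the distribution)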

Merge conflicts from combined folder and document YAML

One of the team’s biggest pain points when developing with documents stored as content has been merge conflicts.

It seems that the content for all documents in a directory is combined into a single YAML file (/content/sample-document and /content/another-document are both stored in their parent folder’s content.yaml). Splitting the documents into their own files would solve 90% of the issues we have had with content storage so far (namely, the extensive merge conflicts).

I have tried to separate that content into three YAML files (content.yaml, content/sample-document.yaml, content/another-document.yaml), but I received errors when I attempted to start cargo with that data. I was, however, successful in splitting these files when I converted the documents from content to config. Unfortunately, new documents would still be exported to the parent folder’s YAML file, but at least we could manually split that out whenever we create a new document.

So did I just do something wrong when I split the YAML files? Is there actually a way to store documents as content, with each document in a YAML file of its own? And better yet, is there a way to force BloomReach to export documents to their own YAML files, instead of including them in the folder’s YAML?

Runtime reloading for config YAML

Finally, I have been wondering about hot-reload/sync for hcm-config files (I would call it auto-reload, but that refers to the auto-reload module for forcing browsers to reload).

This does not appear to exist; it seems the server must be restarted in order to load config changes into the repository. I’m curious why that is, and whether anybody has tried to implement auto-reload for config files. It’s great that we can use auto-export to sync changes made in the CMS onto our filesystems, but I frequently find that I would rather edit YAML on my filesystem than modify configuration in the CMS Console.

I attempted to do this using the webfiles module, with a symlink to bypass SubDirectoriesWatcher#DIRECTORY_FILTER. It worked, but WebFilesWatcher and its related classes explicitly treat the watched files as webfiles and write them to the repository as such. Scanning through those classes, it seems conceivable that a similar ConfigFilesWatcher module could be implemented to watch and sync config files.
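
If someone did build such a ConfigFilesWatcher (a purely hypothetical module name), my understanding is that it could be registered as a repository daemon module with config along these lines (I may have the node type slightly wrong, so treat this as a sketch):

    definitions:
      config:
        /hippo:configuration/hippo:modules/config-files-watcher:
          jcr:primaryType: hipposys:daemonmodule
          hipposys:className: com.example.ConfigFilesWatcherModule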

I suppose the biggest concerns would be:

  • Infinite loop created by auto-export and ConfigFilesWatcher
  • Attempts to reload invalid YAML (if an IDE saves automatically, a YAML file may be in an invalid state when it tries to sync/reload)

Thank you in advance!
Dave


I’m interested in what others have to say about this as well.

I’ve come to the conclusion that I don’t want to count on the development module having all of the content, but when I want to test or build new content, it is nice to work on it in development first.

If there is already documentation on this, then a link would be great.

Thanks also,

John

Code/configuration/content flows are described here:

Normally we keep only a few documents for development purposes.


To expand on what machak linked (see also link [1] below): content does not belong in the project, except for development content and possibly some initial bootstrap content for new features. Content belongs in the production repository only. It may be copied to acceptance/test servers, but generally shouldn’t be used for local development.

Content is created by authors, not devs. Their environment is the CMS. This is controlled by workflow and is versioned in the repository itself (upon publish). Given this, there is little need for a hot deploy scenario; I would even consider it dangerous. Config changes should be made in development, then tested on the test and acceptance environments before reaching production. Only in emergencies should you make config changes on a running system, in which case it will be console work, preferably first on acceptance and then production. While it should be possible to hot reload config, it may still not work, as the config sometimes references classes or services that are initialized only on startup.

As for auto-export [2], you have some limited control over its behavior. Generally speaking, though, it is impossible for it to conform to every possible division of content. It also has to be predictable to the engine, as the engine needs to be able to match a node to a file. You can file improvement requests on the behavior, but it is up to engineering and product management to determine whether any feature is desirable and/or worth the effort. It is not a trivial exercise to get this to work in the first place.

You most likely did do something wrong in creating the content YAML files; the syntax there is slightly different. Please refer to the documentation found under [1].

See also:
[1] https://www.onehippo.org/library/concepts/configuration-management/introduction.html
[2] https://www.onehippo.org/library/development/automatic-export-add-on.html

Thanks for the input. I should have included these links in the original post, because I had already reviewed both in depth. That is largely how I arrived at my questions/uncertainties, especially from [1] and the related documentation.

I feel like this furthers my argument in the section Application vs Development. If content should not be included in deployments, why does the archetype store the sample content in Application instead of Development? It seems to me that the best practice would be for all of /content to reside inside Development.

Production content is created by authors, but my questions are focused on development. During development, developers do need to create content along with their components/features, in order to demonstrate the populated components. It’s preferable that every member of the dev team has all of that content, all the time.

And yes, hot deploy of config does not make sense for production, but I think it would be very valuable during development. The console can be slow and clunky to use in situations where I already know the location of the YAML in my repository and which changes are necessary. It’s in these cases that config sync/hot deploy could improve development efficiency. This becomes even more apparent if you start treating content during development as configuration files (assuming content changes would never be hot deployed, since that conflicts with the bootstrapping mechanism). As a developer, there may be times when I want to make bulk changes to development content, where it would be much quicker to find-and-replace across multiple files than to use the CMS UI. I can still do this, but without hot reload I need to restart cargo to apply the changes.

And yes, some things may not be reloadable if classes/services are initialized on startup and do not reinitialize when their config changes. But that’s not specific to hot deploy; it is already an issue with configuration changes made in the console. For example, updating the configuration of most (maybe all) modules does not reinitialize them. Updating the config for the WebFilesWatcher module logs an error:

HippoServiceException: A service was already registered with name...

[INFO] [talledLocalContainer] 16.11.2018 14:07:45 WARN ObservationManager [ObservationDispatcher.run:163] EventConsumer org.onehippo.repository.modules.AbstractReconfigurableDaemonModule$ModuleConfigurationListener threw exception
[INFO] [talledLocalContainer] org.onehippo.cms7.services.HippoServiceException: A service was already registered with name org.onehippo.cms7.services.webfiles.watch.WebFilesWatcherService
[INFO] [talledLocalContainer] at org.onehippo.cms7.services.HippoServiceRegistry.registerNamedServiceInternal(HippoServiceRegistry.java:274) ~[hippo-services-4.6.0.jar:4.6.0]

I’ve reviewed [2] once again, but it’s not clear whether any of the available configuration allows us to specify how content should be divided between files. The section File Structure is the only place I could find any mention of how nodes are serialized to files, and unfortunately it offers very little detail as to how that is decided, instead just stating that it follows best-practice conventions.
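
For completeness, the auto-export configuration I did find controls things like which repository paths map to which modules and which paths are excluded; it looks roughly like this (values paraphrased from memory, so double-check [2]):

    definitions:
      config:
        /hippo:configuration/hippo:modules/autoexport/hippo:moduleconfig:
          autoexport:enabled: true
          autoexport:modules: ['repository-data/application:/', 'repository-data/development:/hippo:configuration/hippo:queries']
          autoexport:excluded: ['/hippo:log/**']

Nothing in there appears to control how documents are divided over files within a module.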

Nope, I tried this again, and I’m pretty sure the issue lies in the way Hippo keeps track of which content files have already been imported. When we try to split the documents out of their folders’ YAML files, we get an exception during startup unless we start with a clean JCR repository.

For example, I used the archetype’s sample banners content and split it into three YAML files: banners.yaml (the original, for the folder), banners/banner1.yaml, and banners/banner2.yaml. If I start Hippo with a clean JCR repository, that content all gets bootstrapped correctly. If I change one of the documents, the changes are auto-exported to the correct file (banner 1 changes export to banners/banner1.yaml).
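
For concreteness, the three files looked roughly like this (heavily abbreviated; the real files contain the full document structure that Hippo exported, I’m only showing the root of each content definition):

    # banners.yaml (the folder only)
    definitions:
      content:
        /content/documents/xumakcom/banners:
          jcr:primaryType: hippostd:folder

    # banners/banner1.yaml
    definitions:
      content:
        /content/documents/xumakcom/banners/banner1:
          jcr:primaryType: hippo:handle

    # banners/banner2.yaml
    definitions:
      content:
        /content/documents/xumakcom/banners/banner2:
          jcr:primaryType: hippo:handle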

However, if I create a new document in the CMS (call it banner3), auto-export serializes it to the folder’s YAML file (banners.yaml) again. If we move that content to a new file, banners/banner3.yaml, then try to restart Hippo (even without bootstrapping enabled), we get this exception on startup:

ItemExistsException: Node already exists at path /content/documents/xumakcom/banners/banner3

[INFO] [talledLocalContainer] 16.11.2018 15:10:59 ERROR localhost-startStop-1 [ConfigurationContentService.apply:174] Processing ‘APPEND’ action for content node ‘/content/documents/xumakcom/banners/banner3’ failed.
[INFO] [talledLocalContainer] javax.jcr.ItemExistsException: Node already exists at path /content/documents/xumakcom/banners/banner3
[INFO] [talledLocalContainer] at org.onehippo.cm.engine.JcrContentProcessor.validateAppendAction(JcrContentProcessor.java:110) ~[hippo-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at org.onehippo.cm.engine.JcrContentProcessor.apply(JcrContentProcessor.java:135) ~[hippo-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at org.onehippo.cm.engine.ConfigurationContentService.apply(ConfigurationContentService.java:157) [hippo-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at org.onehippo.cm.engine.ConfigurationContentService.apply(ConfigurationContentService.java:87) [hippo-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at org.onehippo.cm.engine.ConfigurationServiceImpl.applyContent(ConfigurationServiceImpl.java:666) [hippo-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at org.onehippo.cm.engine.ConfigurationServiceImpl.init(ConfigurationServiceImpl.java:216) [hippo-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at org.onehippo.cm.engine.ConfigurationServiceImpl.start(ConfigurationServiceImpl.java:122) [hippo-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at com.onehippo.repository.HippoEnterpriseRepository.initializeConfiguration(HippoEnterpriseRepository.java:178) [hippo-enterprise-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at org.hippoecm.repository.LocalHippoRepository.initialize(LocalHippoRepository.java:292) [hippo-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at com.onehippo.repository.HippoEnterpriseRepository.create(HippoEnterpriseRepository.java:63) [hippo-enterprise-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at com.onehippo.repository.HippoEnterpriseRepository.create(HippoEnterpriseRepository.java:53) [hippo-enterprise-repository-engine-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_91]
[INFO] [talledLocalContainer] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_91]
[INFO] [talledLocalContainer] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_91]
[INFO] [talledLocalContainer] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_91]
[INFO] [talledLocalContainer] at org.hippoecm.repository.HippoRepositoryFactory.getHippoRepository(HippoRepositoryFactory.java:147) [hippo-repository-connector-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at org.hippoecm.repository.RepositoryServlet.init(RepositoryServlet.java:184) [hippo-repository-servlets-5.6.0.jar:5.6.0]
[INFO] [talledLocalContainer] at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1144) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1091) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:983) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:4978) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5290) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:754) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:730) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:985) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1857) [catalina.jar:8.5.34]
[INFO] [talledLocalContainer] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_91]
[INFO] [talledLocalContainer] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_91]
[INFO] [talledLocalContainer] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
[INFO] [talledLocalContainer] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]
[INFO] [talledLocalContainer] at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

I have found that adding hcm-actions to reload each node that we have split into a new file resolves that issue. Marking these paths as reload also causes them to be re-imported into the repository on startup (overriding bootstrapping and importing every time), which may not always be desired. Unfortunately, action-lists don’t seem to support wildcards, so every single document we create would have to be added to this list.

hcm-actions.yaml

action-lists:
  - 1:
      /content/documents/xumakcom/banners/banner3: reload

For content that is needed during development, it still feels like the best option is to treat the content as config, since we do want that content reloaded every time the server is started, and because then we can separate nodes into their own YAML files without the added hcm-actions overhead. Am I nuts?
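
To illustrate what I mean by treating it as config: instead of the content definition plus the hcm-actions entry above, banner3 would live in repository-data/development as a config definition, roughly like this (abbreviated sketch, documents obviously carry a lot more structure than shown here):

    # banners/banner3.yaml (as config, in the development module)
    definitions:
      config:
        /content/documents/xumakcom/banners/banner3:
          jcr:primaryType: hippo:handle

Since config is reapplied whenever the model changes, every developer would pick up the change on the next cargo restart without any hcm-actions bookkeeping.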

@marsdev, I am curious how you have approached this on your projects. Does anything in this post stand out to you as either highly desirable or completely wrong?

Thanks again,
Dave


@dhughes-xumak I’m very new to Hippo and am still working on our site. I don’t have a strong opinion yet. But I appreciate you sharing your thoughts.

So I think you make good points, and I agree with you for the most part.

Yes, at this point the nodes themselves are marked as coming from a specific file. But if auto-export has to create files, then it needs to make a decision about how to split them. This behavior could be different, but a choice was made. It is something that needs to be absolutely correct, so perhaps that influenced the design.

No, what you say makes sense. At this point all I can say is that I will bring it to the attention of our PM/Engineering, if they aren’t reading along already.

Our development team is having the same problem with the data and files in the local environment. We are interested in this thread.

It has been almost 13 months since I raised this thread.

I’m just checking in to see if brXM 13 or 14 has included any changes spurred by the questions/concerns raised in this discussion. Does anybody know of any changes in the last two major releases which relate to this discussion?

Thanks!
Dave

Coming from AEM too, this was an interesting read, and I am also still wondering how to migrate content. Believe it or not… in the real world, the business expects full-blown demos of content pages as discussed in the web designs, so all the content must already be there! In AEM we could manage this using the package system, but for brXM I am not so optimistic at the moment.


A disadvantage of content moving strictly from PROD to ACC/TEST/DEV is that you might want some “development” data to run test suites like Cypress or Selenium against. This could, of course, be done locally by individual developers, but I think it generally makes more sense to run those from (and against) a more centralized machine, like a TEST server. At this point in time, “test data” created on TEST will be overwritten whenever we try to make our TEST environment more representative by copying over repository data from PROD. Which is… not ideal. :)


BTW, it’s been a while since there was a reply on this thread, but I wanted to add some notes about my experience with content during development. After watching a team struggle greatly while attempting to follow the standard content/config workflow guidance for Bloomreach Experience Manager (HippoCMS) back then, I noticed that a few developers were much more productive than others at getting their work accomplished in the platform. One characteristic of those developers was that they would regularly pull the full production content database onto their local machine at the beginning of each development cycle. While they did develop with a mix of H2 and the production database, having the production content local and readily available during development seemed to be the key to their success.

After getting more acquainted with brXM, I also came to the conclusion that the majority of changes asked of us required tweaks to templates, front-end JavaScript, and CSS, with some occasional component work. To make these kinds of tweaks to the site, it became pretty much necessary to either (a) replicate a large portion of the production content in development-data to precisely reproduce the pages we needed to work on, or (b) develop changes directly against the production data.

My first inclination was to see if there were any supported workflows for automatically exporting all production content (and perhaps configuration too) to source code on a regular, perhaps automated, basis, so that developers would always be up to date. My thought was that, if this worked out, it would allow the team to work within brXM in a way reminiscent of static site generators and JAMstack-style sites, where all changes, including content, are often part of the source code and the deployment pipeline. After checking with a few Bloomreach representatives and others who had more experience with the platform, I was convinced that this pattern was outside of the best practices for the platform, and doubt was generally cast on whether such a workflow would be viable at all.

After giving up on replicating all production content in a more automated fashion, I turned to what had already been working for some team members and started working off of a production database backup. While it was a hassle to download and restore the database at regular intervals to keep things up to date, and even more painful to manually repeat various content changes by hand in various test environments, I was able to actually get work done for the workloads described above, which required developers to accurately reproduce the target parts of the production environment in order to confidently fix bugs, introduce features, and, most importantly, demo the change effectively to their team and sponsors before deploying to production.

Eventually we realized that, at least for this project, since the development data was so drastically different from the production content and there was no reasonable way to keep it effectively updated, we were unfortunately best off abandoning its use. In a few cases where developers attempted to use the development data to work on a change, they unfortunately had to redo their work after realizing how inapplicable their solution was in the real world once it met the production content in a lower environment, sometimes days or weeks after their development efforts. For these reasons we made it our team’s official workflow to work off of the production data exclusively.

Things were not all roses after moving to this workflow, but we’ve definitely smoothed some of the rough edges, such as automating the database backup retrieval and restoration for local development use. We still have odd cases because the development-data module is still intact in cargo (we still use development-data for a small, select set of config changes specific to local development). Since the database is quite large, retrieving it and, even more significantly, starting up the application take quite a toll each time a change to the Java code is made, or when we need to test bootstrapping or config changes that require a restart. All that being said, it’s the one way we’ve found to make work on our project manageable.

Now, I’ll point out that the content and config workflow that Bloomreach advises in its documentation seems to fit a particular type of team and set of use cases that may not match our particular team. For a single-site, single-tenant system with a small set of components and limited use cases, the investment in a dedicated workflow for component development may not yield much return, since component work is typically limited and done very seldom by the team. For this type of team, creating a development feedback loop in which production-like changes are incorporated as early as possible lets them see the results of their changes more quickly, which leads to faster development and likely fewer issues discovered late in the process. Teams like this may be cross-functional and cover the whole workflow, being directly responsible for outcomes on the site and not just for the output of a new or modified component.

One could argue about whether it’s effective to use a CMS at all in some of these cases, but there are benefits with regard to non-developers being able to contribute changes effectively, such as images, text, and other tweaks, and a CMS is at least one of several possible solutions that makes this possible. On the other hand, as I’m sure happens on occasion, developers may have to get involved in managing content or other changes in the CMS anyway, due to the complexity of the platform or of the solutions they developed. When developers are both the producers and the consumers of the components, putting a GUI or other atypical development tools or workflows in the middle typically makes the whole experience more cumbersome and less effective than it could otherwise be.

By contrast, if you are building a large system with multiple tenants, each with their own independent or in some cases interdependent resources, and with the traditional CMS separation of work, where one team develops components and another team uses those components and is fully responsible for the production runtime aspects of the site, then I can see how this could potentially work. In my opinion it would still require a lot of discipline and trade-offs to ensure components are completely independent with regard to dependencies and potential placement within various layouts. In this case, development content would mainly be concerned with having enough well-organized use cases to house the individual components and demonstrate their usage, options, and appearance, ideally in a way that can also be covered by automated front-end tests. This team would essentially be responsible for delivering high-quality, well-tested, and somewhat more generic components that a separate team would consume.

Once the components are developed, it would likely be a separate team, or at least a different phase of the development workflow, in which those components are used within actual pages. The team responsible for making use of the components would then be more likely to fully own smaller site concerns such as layout changes, visual tweaks, and behavioral changes that are exposed as parameters on the components or on the documents they are bound to. This model assumes a lot of up-front investment in components, because they will get a lot of reuse across potentially multiple tenants spanning multiple channels and sites. I believe it also assumes that the component development team is what the Bloomreach documentation typically refers to as “developers”, since they are the ones served by things like development-data; beyond component development, any work done with those components to actually build a site falls on the shoulders of the content creators and whichever developers help out in that effort.

These are my experiences, but your mileage may vary. I would love to hear more feedback on this topic, though, especially as Bloomreach Experience Manager changes with each subsequent major version release.
