Weird bugs and extremely long reindexing

For the past few weeks we have been seeing weird bugs on our production environment:

  1. In the Channel Editor we are often, but not always, unable to publish or discard changes. The publish and discard buttons are there, but after clicking one of them and then clicking “yes” in the confirmation prompt, nothing happens.
  2. In the cms/repository we are suddenly unable to execute queries on the repository: they return no results.
  3. Old documents that were recently replaced sometimes show up on the site instead of the new ones, and sometimes they do not.

While working on the first bug I was unable to reproduce it on our staging or test environment, and I could not find anything in the logs. I tried throwing away the preview, but that did not help. Redeploying did not help either.
I was also unable to reproduce the second and third bugs. The cms/repository works fine on our other environments, and I did not see old documents showing up there.

My guess now, after connecting these bugs, is that there might be something wrong with the index in one or more of our four Kubernetes pods.
To test this hypothesis I am trying to reindex the pods one by one, but that has led to a new problem: the first reindex has been running for the last 3 days and is still not finished. If we have to do this for every pod it might take weeks, and since this affects production, that is a serious problem.

Any thoughts on what else could cause these bugs, why the reindexing is taking this long, and how to fix it are very welcome. We would like to solve this as soon as possible.