How to resolve warning: "Not all bytes were read from the S3ObjectInputStream"

Hi,

We’re using Hippo with an S3 DataStore, and have followed the instructions to Deploy the Authoring and Delivery Web Applications Separately.

We’re seeing the following warning crop up repeatedly in the logs, mostly on app servers for the public site (but also a very few times on the CMS host):

WARN [com.amazonaws.services.s3.internal.S3AbortableInputStream.close():178] Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.

I guessed that this might have been caused by clients requesting a document (a PDF, say) stored in the datastore and then abandoning the connection before finishing the download. However, that doesn't seem to be the case (testing with curl … | head -c 10).

I then went to see if I could find out what was creating the request but not reading the content, and found it occurs in different threads. The chart below shows how often we’re seeing the warning, grouped by the type of thread that logged the warning.

This roughly matches the shape of traffic we’ve seen on the site, but I’m no closer to having discovered the cause.

EDIT: just to add, we’re using AWS SDK version 1.11.376 and Jackrabbit AWS Extensions version 2.17.1.

Is this warning normal behaviour with Hippo / Jackrabbit / S3?
If so, I’ll suppress the warning from our logs.
If not, can anyone suggest how to debug this further?

Thanks.

Hi Martin,

Would you mind posting the question to the Jackrabbit Users List [1]? There are more people there who have used and are interested in the S3 DataStore module, which is actually quite independent of Hippo Repository, and I think it's better to discuss the issue there, as that will give it more visibility and benefit the community.
I’ll try to answer your question there promptly as well.

Regards,

Woonsan

[1] http://jackrabbit.apache.org/jcr/mailing-lists.html

Content authoring users may also download PDF files, for example, on the CMS host, which would explain the occasional warnings there.

The warning seems mostly harmless:

Basically, if the S3 client closes an InputStream without having consumed it fully after reading N bytes, it logs this warning because it has to abort the HttpRequest and the underlying HTTP connection (removing that connection from the connection pool). That is suboptimal, of course, but there seems to be no functional issue.
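The "drain the input stream after use" advice from the warning message can be sketched with plain java.io, independent of the AWS SDK. This is a minimal sketch: drainQuietly is a hypothetical helper name, and the byte array stands in for an S3 object body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DrainExample {
    // Read and discard any remaining bytes so the underlying HTTP connection
    // could be returned to the pool instead of being aborted on close().
    static long drainQuietly(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long drained = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            drained += n;
        }
        return drained;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000]; // stand-in for an S3 object body
        try (InputStream in = new ByteArrayInputStream(data)) {
            byte[] head = new byte[10];
            int read = in.read(head);     // the consumer stops after 10 bytes
            long rest = drainQuietly(in); // drain the rest before close()
            System.out.println(read + " read, " + rest + " drained");
        }
    }
}
```

Whether this draining belongs in the DataStore or in the SDK is exactly the open question discussed below; the sketch only shows what "draining" means mechanically.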

This could happen in this case:

  • A web visitor wants to download a PDF file.
  • The web application reads the data from S3 for the first time, and the S3 DataStore caches it in the local file system as usual.
  • While this is happening, if the web visitor drops the HTTP connection and Tomcat signals it, the HTTP processing thread is halted and, as a result, the InputStream is closed, which is the best the Jackrabbit S3 DataStore can do, IMO.
  • The AWS SDK logs the warning because it believes the caller could do better, but at the moment the Jackrabbit S3 DataStore simply closes the stream.

Perhaps the Jackrabbit S3 DataStore could 'consume' the whole stream by reading everything before closing, but it's a bit questionable who should do that cleanup: the Jackrabbit S3 DataStore or the AWS Java SDK itself. :wink:

Anyway, I don't see a big functional problem there. Occasionally an HTTP connection will be kicked out of the pool, but that shouldn't be a big performance issue in most cases.

Yes, I think the warning is quite ignorable.

Regards,

Woonsan

This could happen in this case:

… if the web visitor drops the HTTP connection and Tomcat signals it, the HTTP processing thread is halted and, as a result, the InputStream is closed, which is the best the Jackrabbit S3 DataStore can do, IMO

This is what I imagined the cause was, but I wasn't able to reproduce it reliably (in my case, by piping curl into head after clearing out the datastore cache). It's possible that curl was just buffering more than I expected before writing to stdout. I'm seeing the warning more consistently today when I try, so perhaps that's the cause after all.

However, I’m still surprised to be seeing this in other threads, especially the background threads. When I get a chance, I’ll re-post to the Jackrabbit list as you suggest.

Thanks
Martin

The reason you see the warnings in background threads is that S3DataStore does proactiveCaching [1] by default: it tries to download the binary content even when only metadata such as 'lastModified' and 'length' of the binary is being read. With proactiveCaching, a metadata read already spawns a new thread to download the content.
Anyhow, that's not a major issue, functionally speaking, even if it's suboptimal, as mentioned before.

I found something to improve, so I filed an improvement ticket in Jackrabbit:

Regards,

Woonsan

[1] http://jackrabbit.apache.org/api/trunk/org/apache/jackrabbit/core/data/CachingDataStore.html
(If you want, you can turn it off, but that’s not a primary issue here anyway.)
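Turning it off would look roughly like this in repository.xml. This is an illustrative sketch: the proactiveCaching parameter name comes from CachingDataStore, but the surrounding DataStore element and the config path are placeholders for your actual setup.

```xml
<!-- repository.xml (illustrative): S3 DataStore with proactive caching disabled -->
<DataStore class="org.apache.jackrabbit.aws.ext.ds.S3DataStore">
  <!-- path to the properties file with bucket/credentials (placeholder) -->
  <param name="config" value="${rep.home}/aws.properties"/>
  <!-- CachingDataStore parameter: don't download binaries on metadata-only reads -->
  <param name="proactiveCaching" value="false"/>
</DataStore>
```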

Ah, yes. That’s good to know about. Thanks.

I was able to reproduce the warning more reliably when testing with the -N (--no-buffer) option of curl in conjunction with head. I guess that, with buffering, curl was fetching enough of the file to cause Jackrabbit to read the whole object from S3.
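For reference, the reproduction can be sketched like this. The URL is a placeholder, so the curl line is shown as a comment; the runnable line just demonstrates the early-termination mechanism that head relies on.

```shell
# Against the live site, the reproduction would look like this (placeholder URL;
# -N disables curl's output buffering, so head's early exit aborts the transfer
# mid-download and triggers the warning on the server side):
#
#   curl -sN https://example.com/binaries/large-report.pdf | head -c 10
#
# The same early-termination mechanism, demonstrated locally: head stops the
# producer after 10 bytes, so the rest of the stream is never consumed.
head -c 10 < /dev/zero | wc -c
```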

Martin

Hi Martin,

Just FYI, I made a fix branch for the minor improvement issue (https://issues.apache.org/jira/browse/JCR-4369):

I think it should fix the issue, but I'd like to test it more myself against an S3-compatible service locally, with a snapshot version from the feature branch. You can track the status through the JIRA ticket. I'm planning to submit a pull request by the end of this week.

Regards,

Woonsan

Done.

By the way, just in case: please make sure that you never remove the temp directory under the local cache directory (e.g., ${rep.home}/datastore) at runtime, e.g., ${rep.home}/datastore/tmp. The folder name, tmp, is hard-coded. Cleaning up the local cache directory while the repository is stopped is fine. Those directories are created on startup if they don't exist.