While doing some auditing of a database, I found that some attachment content did not match the hashes given in the document's _attachments map.
I tested this by downloading the attachment and calculating its hash. Comparing that to the hash CouchDB reports showed that they did not match. I then noticed that the mismatched attachments were ones that CouchDB was configured to compress. It appears that my CouchDB is configured to use snappy compression:
foobox# grep -E 'file_compression|compressible_types' /etc/couchdb/{default,local}.ini
/etc/couchdb/default.ini:file_compression = snappy
/etc/couchdb/default.ini:compressible_types = text/*, application/javascript, application/json, application/xml
However, when I compress the attachment content using snappy and calculate the hash of the compressed data, it still does not match the CouchDB hash. In my example below, the first attachment, document-25977, is not a compressible type (application/pdf), and the hash of its uncompressed content matches the one provided by CouchDB. The second, document-78608, is a compressible type (text/plain), and the hashes do not match:
foobox$ python hashcompare.py
document-25977
couch len: 142918
couch hash: 028540dd92e1982bcb65c29d32e9617e (md5)
local uncompressed len: 142918
local uncompressed hash: 028540dd92e1982bcb65c29d32e9617e
local compressed len: 132333
local compressed hash: 3157583223dc1a53e1a3386d6abc312d
document-78608
couch len: 2180
couch hash: e613ab6d7f884b835142979489170499 (md5)
local uncompressed len: 2180
local uncompressed hash: 0ab2516c820f5d7afb208e3be7b924dd
local compressed len: 1382
local compressed hash: d9e79232662f57e6af262fc9f867eaf2
This is the script I used to do the comparison:
import couchdb
import snappy
import md5
import base64

server = couchdb.Server('http://localhost:9999')
db = server['program1']

for doc_id in ['document-25977', 'document-78608']:
    print doc_id
    doc = db[doc_id]
    att_stub = doc['_attachments'][doc_id]
    hash_type, tmpdigest = att_stub['digest'].split('-', 1)
    att = db.get_attachment(doc, doc_id)
    data = att.read()
    # CouchDB is using snappy compression
    compressed_data = snappy.compress(data)
    print 'couch len: ', att_stub['length']
    print 'couch hash: ', base64.b64decode(tmpdigest).encode('hex'), '(%s)' % hash_type
    print 'local uncompressed len: ', len(data)
    print 'local uncompressed hash: ', md5.md5(data).digest().encode('hex')
    print 'local compressed len: ', len(compressed_data)
    print 'local compressed hash: ', md5.md5(compressed_data).digest().encode('hex')
    print
I've verified that the documents are uncorrupted when fetched. So what am I missing? I'm not versed enough in Erlang to read the CouchDB source and figure out what is going on. Why would an attachment have a digest that does not match its contents, compressed or otherwise?
Not sure if you got this sorted out, but I started going down the same path. After looking at the source for a bit, it appears that digest calculations take place prior to compression, so I don't believe compression will have a bearing on the digest value.
I was able to reproduce the md5 digest produced by CouchDB for attachments using the following in node:
Hopefully that helps you or someone searching for details in the future.
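For anyone who wants to test the "digest is computed before compression" claim from Python, here is a minimal Python 3 sketch. It is not the node code mentioned above. It only assumes the digest format visible in the question's script ("md5-" followed by base64). The helper names parse_couch_digest and matches_uncompressed are made up for illustration, and the sample data is hypothetical.

```python
import base64
import hashlib


def parse_couch_digest(digest):
    """Split a stub digest such as 'md5-<base64>' into (algorithm, hex string)."""
    algorithm, encoded = digest.split('-', 1)
    return algorithm, base64.b64decode(encoded).hex()


def matches_uncompressed(digest, data):
    """Return True if the digest equals the MD5 of the raw attachment bytes."""
    algorithm, expected_hex = parse_couch_digest(digest)
    if algorithm != 'md5':
        raise ValueError('unsupported digest type: %s' % algorithm)
    return hashlib.md5(data).hexdigest() == expected_hex


# Hypothetical sample data, not taken from a real database.
sample = b'hello attachment'
sample_digest = 'md5-' + base64.b64encode(hashlib.md5(sample).digest()).decode('ascii')
print(matches_uncompressed(sample_digest, sample))
```

In the question's output this check would pass for document-25977 and fail for document-78608, so for that text/plain attachment the digest does not appear to be a plain MD5 of the uncompressed bytes.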
CouchDB does indeed calculate the hash after compression for compressible files.
However, attachments are compressed using zlib rather than snappy, and I have been unable to reproduce its output exactly, so the only solution seems to be to fetch the digest after uploading and store it somewhere.
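To show why the zlib output is hard to match, here is a rough Python 3 sketch that computes two candidate digests for an attachment body: one over the raw bytes and one over a gzip-wrapped zlib stream. The compression level of 8 and the gzip framing are assumptions about CouchDB's attachment settings, not something confirmed in this thread, and the function names are hypothetical.

```python
import hashlib
import zlib


def gzip_like_couch(data, level=8):
    """Gzip-wrap data with zlib. Level 8 is an ASSUMED CouchDB setting."""
    # wbits = 16 + MAX_WBITS asks zlib for a gzip header and trailer.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()


def candidate_digests(data, level=8):
    """Return hex MD5 digests of the raw bytes and of the gzip-wrapped bytes."""
    compressed = gzip_like_couch(data, level)
    return {
        'identity': hashlib.md5(data).hexdigest(),
        'gzip': hashlib.md5(compressed).hexdigest(),
    }


# Hypothetical sample body, not taken from a real database.
print(candidate_digests(b'some text/plain attachment body\n' * 10))
```

Even if the assumptions are right, the 'gzip' candidate may still differ from CouchDB's digest, because the compressed bytes depend on the zlib version, the gzip header fields, and how the server chunks and flushes the stream. That is why storing the digest CouchDB returns is the more dependable route. Requesting the document with att_encoding_info=true should also show whether a given attachment is stored gzip-encoded.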