A couple of months ago, we published a post discussing how the number of pages in each gigabyte can vary widely. To illustrate the concept, we saved one of our blog posts in several different file formats to show that each file had the same content, yet was a different size. That’s not the only concept that example illustrates.
Content is Often Republished
How many of you have ever printed or saved a file to Adobe Acrobat PDF format? Personally, I do it all the time. For example, I regularly “publish” marketing slicks created in Microsoft® Publisher, finalized client proposals created in Microsoft Word and presentations created in Microsoft PowerPoint to PDF format. Microsoft now even includes Adobe PDF as one of the standard file formats to which you can save a file. I also have a free PDF print driver on my laptop, so I can create a PDF file for just about anything that I can print. In each case, I’m duplicating the content of the file, but in a different file format designed for publishing that content.
Another way content is republished is via the ubiquitous “copy and paste” capability that so many people use to duplicate content into another file. “Copy and paste” is available in just about every application, so part or all of the content can be duplicated from one application to the next, or from one file to the next in the same application.
Same Content, Different HASH
When publishing a file to PDF or copying the entire contents of a file to a new file, the contents of the file may be the same, but the HASH value will be different. The HASH value is a digital fingerprint computed from the bytes of the file, so it reflects both the contents and the format of the file. So, a Word file and the PDF file published from the Word file may contain the same content, but the HASH value will be different. Even copying the content from one file to another in the same software program can result in different HASH values, or even different file sizes. For example, I copied the entire contents of yesterday’s blog post, written in Word, into a brand new Word file. Not only did the two files have different HASH values, but they were different sizes: the copied file was 8 KB smaller than the original. So, these files, while identical in content, won’t be considered “duplicates” based on HASH value and won’t be “de-duped” out of the collection. Instead, they are treated as “near-dupes” for analysis purposes, even though the content is essentially identical.
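To make that concrete, here is a minimal Python sketch of the effect. It is only an illustration, not how any particular review platform computes its HASH values: the sample text, the `wrap_in_zip` and `extract_text` helpers and the file name `document.txt` are all made up for the example, and SHA-256 is used simply because it ships with Python. A ZIP container stands in for a "format" because modern Word files (.docx) are themselves ZIP packages.

```python
import hashlib
import io
import zipfile


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 fingerprint of a sequence of bytes."""
    return hashlib.sha256(data).hexdigest()


def wrap_in_zip(text: str, member_name: str, compression: int) -> bytes:
    """Store the text inside an in-memory ZIP container (hypothetical 'format')."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=compression) as archive:
        archive.writestr(member_name, text)
    return buffer.getvalue()


def extract_text(container: bytes, member_name: str) -> str:
    """Read the text back out of the ZIP container."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return archive.read(member_name).decode("utf-8")


# Illustrative content only; any text would do.
text = "Same content, different HASH. " * 50

plain = text.encode("utf-8")                                        # raw text file
stored = wrap_in_zip(text, "document.txt", zipfile.ZIP_STORED)      # uncompressed container
deflated = wrap_in_zip(text, "document.txt", zipfile.ZIP_DEFLATED)  # compressed container

for label, data in [("plain", plain), ("stored", stored), ("deflated", deflated)]:
    print(f"{label:9} {len(data):6} bytes  {sha256_hex(data)[:16]}...")

# The text inside both containers is identical, even though the files are not.
print(extract_text(stored, "document.txt") == extract_text(deflated, "document.txt"))
```

All three "files" carry exactly the same words, yet each has its own size and its own fingerprint, because the hash is calculated over the bytes of the whole file rather than over the text a reader sees.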
What to Do with the Near-Dupes?
Identifying and culling these essentially identical near-dupes isn’t necessary in every case, but if it is, you’ll need to perform a process that groups similar documents together so that those near-dupes can be identified and addressed. We call that “clustering”. For more on the benefits of clustering, check out this blog post.
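For a rough feel of what grouping similar documents can look like, here is a small Python sketch. It is one simple, commonly described approach (overlapping word "shingles" compared with Jaccard similarity), not a description of how any specific clustering product works. The function names, the 0.8 threshold, the file names and the sample text are all assumptions made up for the example, and the text is assumed to have already been extracted from each file.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Break text into overlapping runs of k lowercase words."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs: dict, threshold: float = 0.8) -> list:
    """Greedy grouping: join the first group whose first member is similar enough."""
    groups = []  # each entry: (shingles of the first member, [document names])
    for name, text in docs.items():
        doc_shingles = shingles(text)
        for representative, members in groups:
            if jaccard(representative, doc_shingles) >= threshold:
                members.append(name)
                break
        else:
            groups.append((doc_shingles, [name]))
    return [members for _, members in groups]


# Hypothetical extracted text from three files in a collection.
docs = {
    "proposal.docx": "The quarterly proposal covers pricing, scope and delivery dates for the client.",
    "proposal.pdf": "The quarterly proposal covers pricing, scope and  delivery dates for the client. Draft",
    "memo.txt": "Lunch is moved to noon on Friday in the main conference room.",
}

print(cluster(docs))
```

In this example the Word file and its PDF copy land in the same group despite the stray extra word, while the unrelated memo ends up in a group of its own. Real collections call for more careful text extraction and a more scalable comparison, so treat the threshold and the greedy grouping as placeholders.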
So, what do you think? What do you do with “dupes” that have different HASH values? Please share any comments you might have, or let us know if you’d like to know more about a particular topic.
Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.