Don’t Be “Duped”, Files with Different HASH Values Can Still Be the Same – eDiscovery Best Practices

By: Doug Austin

A couple of months ago, we published a post discussing how the number of pages in each gigabyte can vary widely. To illustrate the concept, we took one of our blog posts and saved it in several different file formats, showing that each file had the same content, yet was a different size.  That’s not the only concept that example illustrates.

Content is Often Republished

How many of you have ever printed or saved a file to Adobe Acrobat PDF format?  Personally, I do it all the time.  For example, I regularly “publish” marketing slicks created in Microsoft® Publisher, finalized client proposals created in Microsoft Word and presentations created in Microsoft PowerPoint to PDF format.  Microsoft now even includes PDF as one of the standard file formats to which you can save a file, and I have a free PDF print driver on my laptop, so I can create a PDF file from just about anything that I can print.  In each case, I’m duplicating the content of the file, but in a different file format designed for publishing that content.

Another way content is republished is via the ubiquitous “copy and paste” capability.  Just about every application offers it, so anyone can duplicate part or all of a file’s content into another application, or into another file in the same application.

Same Content, Different HASH

When you publish a file to PDF or copy the entire contents of a file into a new file, the content may be the same, but the HASH value will be different.  A HASH value is a digital fingerprint computed from every byte of the file, so it reflects the file’s format and embedded metadata as well as its visible content.  So, a Word file and the PDF file published from it may contain the same content, but their HASH values will be different.  Even copying the content from one file to another in the same software program can result in different HASH values, or even different file sizes.  For example, I copied the entire contents of yesterday’s blog post, written in Word, into a brand new Word file.  Not only did the two files have different HASH values, but they were different sizes: the copied file was 8K smaller than the original.  So, these files, while identical in content, won’t be considered “duplicates” based on HASH value and won’t be “de-duped” out of the collection.  Instead, they are treated as “near-dupes” for analysis purposes, even though the content is essentially identical.
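To make the point concrete, here is a minimal Python sketch, not a description of any particular eDiscovery tool. It uses two text encodings as stand-ins for two file formats, and SHA-256 as the hash algorithm; both are assumptions made for illustration, as are the function names. Hashing the raw bytes gives different values for the two "files", while hashing only the normalized text gives the same value.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Hash the raw bytes of a file, the way file-level de-duplication does."""
    return hashlib.sha256(data).hexdigest()


def content_hash(text: str) -> str:
    """Hash only the text, after collapsing whitespace and lowercasing.

    This normalization is an illustrative assumption, not a standard.
    """
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


text = "Files with different HASH values can still be the same."

# Stand-ins for "same content, different file format":
# identical text stored as two different byte sequences.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")

print(file_hash(as_utf8) == file_hash(as_utf16))                 # byte-level comparison
print(content_hash(text) == content_hash(as_utf16.decode("utf-16")))  # text-level comparison
```

The first comparison fails and the second succeeds, which mirrors the Word and PDF case: the files differ at the byte level even though the text extracted from them is the same.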

What to Do with the Near-Dupes?

Identifying and culling these essentially identical near-dupes isn’t necessary in every case, but if it is, you’ll need to perform a process that groups similar documents together so that those near-dupes can be identified and addressed.  We call that “clustering”.  For more on the benefits of clustering, check out this blog post.
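As a rough illustration of what grouping similar documents can look like, the sketch below compares documents by their overlapping three-word sequences ("shingles") and places a document in an existing group when its Jaccard similarity to that group's first member meets a threshold. This is one simple, commonly used technique chosen for illustration; it is not necessarily how any commercial clustering product works, and the function names, file names, sample text and 0.8 threshold are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_near_dupes(docs: Dict[str, str], threshold: float = 0.8) -> List[List[str]]:
    """Greedy grouping: compare each document to the first member of each group."""
    groups: List[Tuple[Set[Tuple[str, ...]], List[str]]] = []
    for name, text in docs.items():
        doc_shingles = shingles(text)
        for representative, members in groups:
            if jaccard(representative, doc_shingles) >= threshold:
                members.append(name)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append((doc_shingles, [name]))
    return [members for _, members in groups]


# Hypothetical extracted text from three files.
docs = {
    "proposal.docx": "The quick brown fox jumps over the lazy dog near the river bank",
    "proposal.pdf": "The quick brown fox jumps over the lazy dog near the river bank today",
    "memo.docx": "Quarterly revenue figures are attached for review by the finance team",
}

for group in group_near_dupes(docs):
    print(group)
```

Here the two proposal files land in one group and the memo in another. A real collection would need text extraction first, and comparing every document against every group this way becomes slow at scale, so treat this only as a picture of the idea.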

So, what do you think?  What do you do with “dupes” that have different HASH values?  Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine Discovery. eDiscoveryDaily is made available by CloudNine Discovery solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscoveryDaily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

