Near Deduplication and Email Threading

Deduplication is just one of the tools that help you save time and money.  As mentioned last week, data collections are growing bigger every day.  Deduplication helps you identify and exclude exact duplicate documents so your reviewers spend less time reading the same document over and over again, but how can we take this to the next level?  That’s where Near Deduplication and Email Threading come in.  The time saved by identifying near duplicates and email threads can considerably streamline your review process and help maintain consistency in how documents are reviewed.

First off, what is a “Near Duplicate”?  Near duplicates are documents whose text is highly similar, though not necessarily identical.  Near deduplication focuses on the content of the document, the actual text found in the body of an email or document.

How does this differ from duplicate documents identified during deduplication?  During deduplication, duplicate documents are identified by comparing hash values.  These values are calculated from the metadata and body of each document.  Near deduplication, by contrast, analyzes the body text of each document alone.  It’s quite possible for two files to contain the exact same body text, but because one file is a PDF and the other an email, their hash values differ and the deduplication process does not catch them.
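To make that contrast concrete, here is a minimal Python sketch.  The field names (file_type, author, body), the choice of SHA-256 and both helper functions are assumptions made for illustration, not how any particular deduplication tool computes its hashes.  It only shows why a hash that includes metadata misses a PDF and an email carrying the same text, while a body-only comparison catches them.

```python
import hashlib

def metadata_and_body_hash(doc):
    # Hash over metadata plus body, in the spirit of exact deduplication.
    # The fields used here are hypothetical.
    payload = "|".join([doc["file_type"], doc["author"], doc["body"]])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def body_only_hash(doc):
    # Hash over the body text alone, after lowercasing and collapsing whitespace.
    normalized = " ".join(doc["body"].lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

pdf = {
    "file_type": "pdf",
    "author": "J. Smith",
    "body": "Please review the attached contract.",
}
email = {
    "file_type": "email",
    "author": "jsmith@example.com",
    "body": "Please review the  attached contract.\n",
}

# Different metadata means different hashes, so exact deduplication misses the pair.
print(metadata_and_body_hash(pdf) == metadata_and_body_hash(email))
# Looking at the body alone shows that the text is the same.
print(body_only_hash(pdf) == body_only_hash(email))
```

A body-only hash still finds only identical text.  Near deduplication goes further by scoring how similar two bodies are, so small edits do not hide the relationship.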

Near duplicates within an e-discovery collection are grouped together by “similarity”.  The similarity is a percentage calculated by comparing the body (or text) of one document to a base document.  This base document is identified by your near deduplication tool, and is typically the document that shares the most text with the other documents found in a near duplicate family.  The near duplicate family consists of documents that fall within a set percentage of similarity to one another.  This can range anywhere from 100 percent down to 1 percent.  Although you can manually set the similarity threshold as low as you like, most near deduplication tools typically keep a lower threshold of 70 percent.
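The grouping described above can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not the algorithm used by any specific product: similarity is measured with the standard library's difflib.SequenceMatcher on words, the base document is chosen as the one with the highest total similarity to the others, and the function names and sample documents are invented for the example.

```python
from difflib import SequenceMatcher

def similarity(text_a, text_b):
    # Word-level similarity expressed as a percentage from 0 to 100.
    return SequenceMatcher(None, text_a.split(), text_b.split()).ratio() * 100

def build_family(docs, threshold=70):
    # Assumed rule for the base: the document that shares the most text
    # with all the others (highest summed similarity).
    base_id = max(
        docs,
        key=lambda d: sum(similarity(docs[d], docs[o]) for o in docs if o != d),
    )
    # Keep every document whose similarity to the base meets the threshold.
    family = {}
    for doc_id, text in docs.items():
        score = similarity(docs[base_id], text)
        if score >= threshold:
            family[doc_id] = score
    return base_id, family

docs = {
    "A": "the quarterly report is attached for your review",
    "B": "the quarterly report is attached for your review today",
    "C": "lunch menu for friday includes soup and salad",
    "D": "the quarterly report is attached for review",
}

base_id, family = build_family(docs, threshold=70)
print(base_id, sorted(family))
```

With the threshold at 70, the three report emails fall into one family and the unrelated lunch menu is left out.  Lowering the threshold widens the family at the cost of grouping documents that share less text.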

Once the analysis is complete, your reviewers are saved from the torture of rereading the same documents over and over again.  Reviewers can quickly identify a document, isolate its near duplicate family and compare the documents to one another using a viewer.  The viewer shows a comparison of the selected document against the base document and highlights the differences between the two.  Instead of reading both documents in their entirety, the reviewer can focus on just the differences.  Reviewers can then compare the other documents within the family in the same way.  This allows them to review documents quickly and maintain consistency.
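As a rough idea of what such a comparison involves, the Python sketch below marks the words that differ between a base document and a selected document.  The function name and the bracket notation are made up for this example, and a real review viewer would render the differences visually rather than as plain text.

```python
from difflib import SequenceMatcher

def highlight_differences(base_text, other_text):
    base_words = base_text.split()
    other_words = other_text.split()
    matcher = SequenceMatcher(None, base_words, other_words)
    pieces = []
    for op, b1, b2, o1, o2 in matcher.get_opcodes():
        if op == "equal":
            # Text shared by both documents is passed through unchanged.
            pieces.extend(base_words[b1:b2])
        if op in ("delete", "replace"):
            # Words found only in the base document.
            pieces.append("[-" + " ".join(base_words[b1:b2]) + "-]")
        if op in ("insert", "replace"):
            # Words found only in the selected document.
            pieces.append("{+" + " ".join(other_words[o1:o2]) + "+}")
    return " ".join(pieces)

base = "Please sign the contract by Friday"
selected = "Please sign the revised contract by Monday"
print(highlight_differences(base, selected))
# prints: Please sign the {+revised+} contract by [-Friday-] {+Monday+}
```

The reviewer only needs to read the marked words to see what changed between the two versions.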

Along with near deduplication, Email Threading allows you to organize and group email conversations together, so your reviewers can logically review emails and their attachments as the conversation progressed.  Coupled with near deduplication, you can quickly move through repetitive emails in the collection and streamline your review.  Email threading also allows you to properly batch out review sets, keeping families together and preventing them from being split across review sets.
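One simple way to picture threading and batching is shown below.  This Python sketch assumes each email record carries an id, an optional in_reply_to pointing at the message it answers, and a sent timestamp.  Those field names, the helper functions and the batching rule are all hypothetical, and commercial threading tools rely on more signals than reply headers alone.

```python
from collections import defaultdict

def thread_root(message_id, parents):
    # Follow reply links upward until a message with no known parent is reached.
    seen = set()
    while parents.get(message_id) and message_id not in seen:
        seen.add(message_id)
        message_id = parents[message_id]
    return message_id

def group_threads(emails):
    parents = {e["id"]: e.get("in_reply_to") for e in emails}
    threads = defaultdict(list)
    for e in emails:
        threads[thread_root(e["id"], parents)].append(e)
    for messages in threads.values():
        # Order each conversation as it progressed.
        messages.sort(key=lambda e: e["sent"])
    return dict(threads)

def batch_threads(threads, batch_size):
    # Fill review sets thread by thread so a conversation is never split.
    batches, current = [], []
    for messages in threads.values():
        if current and len(current) + len(messages) > batch_size:
            batches.append(current)
            current = []
        current.extend(e["id"] for e in messages)
    if current:
        batches.append(current)
    return batches

emails = [
    {"id": "m3", "in_reply_to": "m2", "sent": "2013-05-01T11:00"},
    {"id": "n1", "in_reply_to": None, "sent": "2013-05-02T09:00"},
    {"id": "m1", "in_reply_to": None, "sent": "2013-05-01T09:00"},
    {"id": "n2", "in_reply_to": "n1", "sent": "2013-05-02T10:00"},
    {"id": "m2", "in_reply_to": "m1", "sent": "2013-05-01T10:00"},
]

threads = group_threads(emails)
review_sets = batch_threads(threads, batch_size=3)
print(review_sets)
```

A thread larger than the batch size is kept whole in its own review set rather than split, which reflects the goal of keeping families together.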

These tools can have a major impact on your review process.  Using them to weed through the gigabytes, terabytes and eventually petabytes of data you’re faced with saves the time and money you would otherwise have spent on processing and on the countless hours of reviewing every document.  There are many tools out there that will do what I just described, but if you are a current LexisNexis client, you may already have them.  Near Duplicate and Email Threading are currently available in our Early Data Analyzer, LAW PreDiscovery and Concordance Evolution applications.