Overview and near duplicate concepts

Duplicates vs. near duplicates vs. e-mail threads

Near-duplicate processing and e-mail threading can cut the documents you have to review by two-thirds or more by grouping your documents and e-mails into three types of clusters:

  • Exact duplicates
    These are documents that are exact duplicates of each other. Exact duplicates can be either exact copies of the same document file (and so have the same MD5 hash) or documents that have the same textual content. For example, a document available in both Word and PDF format is treated as a pair of exact duplicates because both versions have the same text content.
  • Near-duplicates
    These are text documents that are similar to, but not exact duplicates of, each other. For example, a contract that has gone through several revisions.
  • E-mail threads
    These are e-mails that belong to the same conversation and share the same history, so you only need to read the last e-mail in the thread (and the history in that e-mail) to review the entire conversation.
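The two ways of being an exact duplicate described above (same file, or same text) can be illustrated with a minimal Python sketch. This is not MasterFile's implementation. The function names are hypothetical, and collapsing whitespace before comparing text is an assumption made only for illustration.

```python
import hashlib


def file_md5(data: bytes) -> str:
    """MD5 hash of the raw file bytes; identical files give identical hashes."""
    return hashlib.md5(data).hexdigest()


def text_fingerprint(text: str) -> str:
    """Hash of the extracted text.

    Assumption: runs of whitespace are collapsed first, so that layout
    differences between formats do not affect the comparison.
    """
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def are_exact_duplicates(a_bytes: bytes, a_text: str,
                         b_bytes: bytes, b_text: str) -> bool:
    """True if the files are byte-identical or their text content matches."""
    same_file = file_md5(a_bytes) == file_md5(b_bytes)
    same_text = text_fingerprint(a_text) == text_fingerprint(b_text)
    return same_file or same_text
```

Under these assumptions, a Word file and a PDF with different bytes but the same extracted text would still be reported as exact duplicates.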

Note that documents or e-mails that have been OCRed, after being converted into an image format, perhaps by scanning, photographing or faxing, will not produce good results when processed for near duplicates or e-mail threads.

Understanding near duplicate clusters

When documents are processed for near duplicates and organized into clusters of similar (i.e. near duplicate) documents, each cluster is assigned a unique cluster number, prefixed with "D", and a reference document. The remaining documents in the cluster are ranked by their similarity, as a percentage, to the reference document (see the near duplicates with 98% and 94% similarity marked as #1 in the screenshot below).

 

A near duplicate cluster may also contain duplicate documents. For example, the cluster may contain several copies of the reference document as well as several copies of a document that has, say, 98% similarity to the reference document. Each cluster of duplicates is assigned its own subcluster number (see #2 in the screenshot above).

If a near duplicate cluster does not contain any near duplicates, but only exact duplicates, its unique cluster number is prefixed with an "S" instead of "D". "S" stands for "Solo", meaning this document (and its exact duplicates) exists by itself without any near duplicates.
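As a rough illustration of how a cluster is ranked against its reference document and how the "D" or "S" prefix follows from its contents, here is a minimal Python sketch. The similarity measure MasterFile uses is not described here, so the sketch substitutes `difflib.SequenceMatcher` from the Python standard library. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_percent(reference: str, other: str) -> int:
    """Similarity of `other` to `reference` as a whole-number percentage.

    Assumption: difflib's ratio stands in for the product's real measure.
    """
    return round(SequenceMatcher(None, reference, other).ratio() * 100)


def rank_cluster(reference: str, others: list[str]) -> list[tuple[int, str]]:
    """Rank the other documents in a cluster, most similar first."""
    scored = [(similarity_percent(reference, doc), doc) for doc in others]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def cluster_prefix(reference: str, others: list[str]) -> str:
    """'S' (solo) if the cluster holds only exact duplicates, else 'D'."""
    has_near_duplicate = any(doc != reference for doc in others)
    return "D" if has_near_duplicate else "S"
```

Exact copies of the reference score 100% and sort to the top, followed by revisions in decreasing order of similarity.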

As mentioned, within each near-duplicate cluster one document is flagged as the 'Reference' against which the others are compared and rated by similarity. Review starts with the reference and, if it is relevant, continues with its near-duplicates. Review time is reduced in two ways:

  1. you don't have to sift through the near-duplicates of documents you already marked as irrelevant, and
  2. you don't have to sift through any exact duplicates of the reference or of any similar documents.

In the screenshot below, the 37 documents have been clustered into 6 near-duplicate clusters, each of which has a reference document, shown in the red box. The reference documents are often all that needs to be reviewed.

Understanding e-mail threads

When people reply to an e-mail, they usually leave the original message intact as history at the end of their email. Sometimes, the reply retains the subject and sometimes, the subject is changed and a new discussion ensues.

E-mail threading groups e-mails with the same history together into a thread. Each thread has a unique thread number, prefixed with "E", and a reference e-mail, which includes the content of all the others in its group.

Review time is drastically reduced because you only have to review the reference in each thread rather than every e-mail in the thread and all their repetitive history.

Sometimes, an e-mail conversation may have several threads if a particular e-mail in the conversation was replied to more than once. In the diagram below, e-mail 5 was replied to twice, once in e-mail 6 and once in e-mail 7. Similarly, e-mail 10 was replied to three times, in e-mails 11, 12, and 13:

Notice that the 20 e-mails in the diagram are broken into 6 threads, with each thread ending with an e-mail in black. MasterFile will create 6 e-mail threads from the above conversation as follows:

  • Thread #1: E-mails 1, 2, 3, 5, 6, 9, 10, 13, 15, 16, 18, 20
  • Thread #2: E-mails 11, 14
  • Thread #3: E-mails 17, 19
  • Thread #4: E-mail 7
  • Thread #5: E-mail 8
  • Thread #6: E-mail 12

The last e-mail in each thread, shown in black, is the reference e-mail and is the only e-mail you need to review, since it contains all of the history of the e-mails in its thread. So instead of reviewing 20 e-mails, you only need to review 6.
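The idea that each branch of a conversation ends in one reference e-mail can be sketched in Python as follows. This is an illustration only, not MasterFile's algorithm. The reply map below is a small hypothetical conversation, not the 20-e-mail diagram above, and the function names are invented for this example. The sketch lists the full history each reference e-mail carries, not the exact thread grouping MasterFile displays.

```python
from typing import Optional

# Hypothetical conversation: each e-mail maps to the e-mail it replies to.
# E-mail 3 was replied to twice (by 4 and by 5), so the conversation branches.
example_parent: dict[int, Optional[int]] = {
    1: None, 2: 1, 3: 2, 4: 3, 5: 3, 6: 5,
}


def reference_emails(parent: dict[int, Optional[int]]) -> list[int]:
    """E-mails that nobody replied to; each one ends a thread."""
    replied_to = {p for p in parent.values() if p is not None}
    return sorted(e for e in parent if e not in replied_to)


def history(parent: dict[int, Optional[int]], email: int) -> list[int]:
    """Chain of e-mails from the start of the conversation up to `email`."""
    chain = []
    current: Optional[int] = email
    while current is not None:
        chain.append(current)
        current = parent[current]
    return chain[::-1]
```

In this hypothetical conversation of 6 e-mails, only e-mails 4 and 6 need to be reviewed, because between them their histories cover every other e-mail.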

The screenshot below shows how e-mail threads and the reference e-mails are displayed in MasterFile. The conversation below has 13 e-mails, which have been grouped into 4 threads. Only the 4 reference e-mails (the first one under each red category) need to be reviewed, as those 4 contain all the history.

Reviewing near duplicate clusters and e-mail threads

After documents and e-mails have been processed for near duplicates and e-mail threads, MasterFile has several views to help you efficiently review the reference documents/e-mails and the other documents or emails in the cluster or thread.

See Reviewing near duplicate document clusters and e-mail threads for details.

Processing documents for near duplicates and e-mail threads

See How to process documents for near duplicates and e-mail threads for step-by-step instructions.

Merging e-mail threads

E-mail threading requires the history in all e-mails of a thread to contain identical content. Any e-mail whose history does not match the history of the current threads is split off into a new thread. Some e-mail systems will inject additional text into the history (such as "image removed") and thereby change the history. This will cause those e-mails to be split off into their own threads.

MasterFile tries to identify and remove as much of this injected text as possible so that e-mails are threaded correctly. However, the process can never be perfect because e-mail systems change. Therefore, when you come across threads that should not have been split, because they actually have the same history, you can manually merge the two threads and save future review time. See What does merging e-mail threads mean for additional information.
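To show why injected text splits threads, and how removing it lets histories match again, here is a minimal Python sketch. It is not MasterFile's cleanup logic. The pattern list contains only the "image removed" example mentioned above, and the function names are hypothetical.

```python
import re

# Assumption: only the "image removed" marker is handled here. A real list
# of injected phrases would be longer and would depend on the e-mail system.
INJECTED_PATTERNS = [re.compile(r"image removed", re.IGNORECASE)]


def clean_history(history_text: str) -> str:
    """Strip known injected phrases, then collapse whitespace."""
    cleaned = history_text
    for pattern in INJECTED_PATTERNS:
        cleaned = pattern.sub("", cleaned)
    return " ".join(cleaned.split())


def same_history(a: str, b: str) -> bool:
    """True if two histories are identical once injected text is removed."""
    return clean_history(a) == clean_history(b)
```

Two histories that differ only by an injected "image removed" marker compare as equal after cleaning. Any phrase missing from the pattern list would still cause a split, which is the case manual merging is meant to cover.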

Checking if a document or e-mail has duplicates or near duplicates, or is part of an e-mail thread

Open the document's or e-mail's profile and examine the "Near duplicates" field shown below. The field may have one of the following values:

  • "Not processed for near duplicates." -- the document or e-mail has not been processed for near duplicates or e-mail threads.
  • "No near duplicates." -- the document or e-mail does not have any exact or near duplicates, and is not part of an e-mail thread.
  • "X near duplicates." -- the document or e-mail has exact or near duplicates, or is part of an e-mail thread, where X is the number of duplicates, near duplicates or other e-mails in the thread.

Clicking on "View near duplicates" takes you to the view above, and opens the near duplicate cluster or e-mail thread the document belongs to so you can review the other documents and e-mails in the cluster or thread.