Reviewing near duplicate document clusters and e-mail threads
After processing your documents and e-mails, review the document clusters and e-mail threads using the near duplicate views found under the [L+ Near Duplicates] section shown below:
Views group documents into three types of clusters:
- Near duplicates
These are documents that have near duplicates
- Solo documents
These are documents that do not have any near duplicates
- E-mail threads with or without their attachments
E-mails that are part of a single conversation and have the same history
Use MasterFile's review mode to efficiently review the reference document of near duplicate clusters and e-mail threads. Click on [R+General > Review Mode].
Reviewing documents
Each near duplicate cluster has a reference document which, in general, is the only one you need to review. All other documents in a cluster display a value indicating their similarity to the reference. These other documents can be reviewed if their differences from the reference document may be relevant.
In addition, some documents may be exact duplicates of each other or of the reference and these are grouped into subclusters. Since subclusters contain duplicate documents, only one document of a subcluster needs to be reviewed.
Note that the reference document is chosen by a complex algorithm and is not necessarily the largest document of the cluster.
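The review rule above can be sketched in Python. This is only an illustration under stated assumptions: the `Doc` record, its field names, and the sample file names are hypothetical and are not part of MasterFile, and the sketch assumes that exact duplicates of the reference share the reference's subcluster number.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    similarity: str  # "Reference", "Duplicate", or a percentage such as "98"
    subcluster: int  # exact duplicates share a subcluster number


def documents_to_review(cluster, include_near_duplicates=False):
    """Return the documents worth opening in one near duplicate cluster."""
    reference = [d for d in cluster if d.similarity == "Reference"]
    if not include_near_duplicates:
        # In general the reference document is the only one to review.
        return reference
    # When differences matter, keep one document per subcluster;
    # the others in that subcluster are exact copies of it.
    others = [d for d in cluster if d.similarity != "Reference"]
    seen = set()
    picked = []
    for doc in reference + others:
        if doc.subcluster not in seen:
            seen.add(doc.subcluster)
            picked.append(doc)
    return picked


# Hypothetical cluster of contract drafts.
cluster = [
    Doc("contract_final.docx", "Reference", 1),
    Doc("contract_final_copy.docx", "Duplicate", 1),
    Doc("contract_draft2.docx", "98", 2),
    Doc("contract_draft2_copy.docx", "98", 2),
    Doc("contract_draft1.docx", "91", 3),
]

print([d.name for d in documents_to_review(cluster)])
print([d.name for d in documents_to_review(cluster, include_near_duplicates=True)])
```

With the sample data, the default call returns only the reference document, while the second call adds one document from each remaining subcluster.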
Reviewing e-mail threads
When reviewing e-mail threads, review the reference e-mail of the largest threads first so that the bulk of the history is quickly reviewed. Then review the reference e-mails of the shorter threads. Note that the shorter threads may contain the same history found in the longer threads, but that history can be ignored since it was reviewed when the longer threads were reviewed. For instance, in the example below, the thread ending with e-mail #14 contains the history from e-mails 1, 2, 3, 5, 6, 9 and 10, which can be ignored as all that history is also contained in the longer thread ending with e-mail #20 and was reviewed when that thread was read.
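The "longest thread first" order can be sketched as follows. The function and data layout are hypothetical. The history of the thread ending with e-mail #14 is taken from the example above, while the extra e-mails listed for the thread ending with #20 are assumed purely for illustration.

```python
def review_plan(threads):
    """Order threads longest first and list the e-mails not yet covered.

    `threads` maps the number of each thread's reference e-mail to the set
    of earlier e-mails contained in its history.
    """
    covered = set()
    plan = []
    longest_first = sorted(threads.items(), key=lambda item: len(item[1]), reverse=True)
    for reference, history in longest_first:
        content = history | {reference}
        plan.append((reference, sorted(content - covered)))
        covered |= content
    return plan


threads = {
    14: {1, 2, 3, 5, 6, 9, 10},
    # Assumed history: the source only says this thread is longer and
    # contains everything found in the thread ending with #14.
    20: {1, 2, 3, 5, 6, 9, 10, 12, 15, 17, 19},
}

for reference, unread in review_plan(threads):
    print(f"thread ending with #{reference}: still to read {unread}")
```

Under these assumptions, once the thread ending with #20 has been read, only e-mail #14 itself remains unread in the shorter thread.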
The near duplicate views present e-mail attachments in two ways:
- Together with other documents they are near duplicates of. These views have titles with "as docs", indicating the attachments are treated as documents, in their own right, disconnected from their parent e-mails.
- Together with their parent e-mail in the thread containing the parent e-mail. These views have titles with "emails + attachments", indicating the attachments are threaded with their parent e-mails.
E-mail thread E00000000
A special thread, designated by the number "E00000000", contains all e-mails that have no content. Even though some or all of these e-mails may be part of e-mail threads, the thread cannot be determined as the e-mails have no content. These e-mails may have just had an attachment, or their message may be the subject line. You should review every e-mail in this special thread.
The views
- everything as docs
Separates attachments from their e-mails and clusters them with the other exact or near duplicate documents they match. E-mails are threaded into their conversations, but attachments are not included in the thread.
- e-mails + attachments
Shows only e-mails grouped with their attachments. Unlike the "everything as docs" view, other documents, not attached to e-mails, are not shown.
If you hold the Ctrl key down while switching views, any documents you've highlighted or selected remain selected in the view you are switching to. This feature is very helpful when using the above two views.
For example, if you are in the "everything as docs" view and notice a document is an attachment and want to see the e-mail the document is attached to along with the other e-mails in the thread, select the attachment and switch to the "e-mails + attachments" view while holding down the Ctrl key.
Similarly, if you are in the "e-mails + attachments" view and want to see if a document attached to an e-mail has any near duplicate copies, select the attachment and switch to the "everything as docs" view while holding down the Ctrl key.
- e-mails by Contact to merge threads
For various reasons, e-mail threading is never a perfect process and a single e-mail thread can sometimes be broken into two or more threads, so MasterFile allows you to manually merge these threads together. This view helps you identify e-mails that should be merged. See Merging e-mail threads for details on how to use the view.
- everything as docs: by Reviewer
e-mails + attachments: by Reviewer
These two views are the same as the above, but group the documents and e-mails by the person who has been assigned to review them. Assign the reviewer using a "near duplicate" reminder task in the "Things to do" section of the document profiles.
- by Processed Date, as docs
by Processed Date, Doc Type
Groups together all documents and e-mails processed in the same batch.
- duplicates: by MD5 hash
Groups together documents whose files are identical. For example, all copies of an attachment that was e-mailed several times will have the same MD5 hash and be grouped together in this view.
Documents with the same MD5 hash are duplicate copies of the same file. Documents with the same content, but not the same MD5 hash (such as a DOCX file and PDF copy of the same DOCX file), are also duplicates but these are clustered together as near duplicates with 100% similarity in the other near duplicate views.
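To illustrate why identical files end up in the same group, the following sketch groups files by their MD5 hash using Python's standard `hashlib` module. It is not MasterFile's implementation; the function names and the sample files written by `demo()` are hypothetical.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(paths):
    """Map each MD5 hash shared by two or more files to those file names."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(Path(path).name)
    return {md5: names for md5, names in groups.items() if len(names) > 1}


def demo():
    """Write three small sample files and group them by hash."""
    with tempfile.TemporaryDirectory() as folder:
        samples = {
            "report.pdf": b"quarterly figures",
            "report_copy.pdf": b"quarterly figures",   # byte-for-byte copy
            "report_v2.pdf": b"quarterly figures v2",  # different bytes
        }
        paths = []
        for name, data in samples.items():
            path = Path(folder) / name
            path.write_bytes(data)
            paths.append(path)
        return group_by_md5(paths)


print(demo())
```

Only the two byte-for-byte copies share a hash. A file with the same text saved in another format would hash differently, which is why such documents appear as near duplicates with 100% similarity rather than in this view.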
To compare two documents or e-mails, select them and click on [R+Review and Revision Tools > Compare 2 documents]. The documents will be compared and the differences will be highlighted in your browser. This may take a minute or two depending on the size of the documents.
The view columns
Each cluster of similar documents or thread of e-mails is grouped in its own red category and assigned a number (see #1 in the screen shot above). There is no correlation or sequence to group numbers; they simply identify clusters. Document cluster numbers are prefixed with a "D" and e-mail thread numbers are prefixed with an "E". If a document does not have any near duplicate documents, it is placed in its own group along with any exact duplicates of the document, and its group number is prefixed with an "S", for "Solo".
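The prefix convention can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and the sample numbers in the usage lines assume that cluster numbers follow the same letter-plus-digits pattern as the special thread "E00000000".

```python
GROUP_KINDS = {
    "D": "near duplicate document cluster",
    "E": "e-mail thread",
    "S": "solo document",
}

# The special thread that holds all e-mails with no content.
NO_CONTENT_THREAD = "E00000000"


def describe_group(group_number):
    """Describe a cluster or thread number based on its prefix letter."""
    if group_number == NO_CONTENT_THREAD:
        return "special thread of e-mails with no content"
    kind = GROUP_KINDS.get(group_number[:1].upper())
    if kind is None:
        raise ValueError(f"unrecognised group number: {group_number!r}")
    return kind


print(describe_group("D00000123"))  # hypothetical cluster number
print(describe_group("E00000045"))  # hypothetical thread number
print(describe_group("E00000000"))
```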
Highlighted in the red box (#2) in the screen shot above are the 4 columns in the near duplicate views that contain the near duplicate and e-mail threading information:
- Processed date:
Documents and e-mails will generally be processed in batches, as they are received. The processed date lets you know when the document was processed to help you decide if it's a document that needs to be reviewed or not.
- Reference date:
If a document or e-mail is designated as a reference document, its "Processed date" is copied to "Reference date". Later, when more documents are processed, the reference document may change, but its original "Reference date" will remain to remind you that the document or e-mail was once a reference document and to alert you that a near duplicate cluster or e-mail thread has a new reference document that needs to be reviewed.
- Similarity:
For documents, this column may have the values "Reference", "Duplicate" or a number. The first document in any near duplicate cluster is the reference document, against which the others in the cluster are compared for similarity. If the near duplicate cluster contains duplicates of the reference document, they will be designated as "Duplicate". All other documents in the near duplicate cluster have a number which represents the percentage similarity that document has with the reference document. For example, "98" means that the document is 98% similar to the reference document. If, for example, the document was a contract and the near duplicate cluster contains various drafts of the document, that 2% difference may be quite significant if it materially altered the terms of the contract.
For e-mails, this column has the values "Reference" or a number. The reference e-mail is the latest e-mail in the thread and contains all the history from the thread, so it is the only e-mail that needs to be reviewed. All other e-mails have a number giving their sequence in the thread; the higher the number, the older the e-mail and the earlier in the thread it was written. The first e-mail in the thread will be listed last and have the highest number.
- Subcluster/Inclusive:
Each document is assigned a sub-cluster that it belongs to. Documents that are duplicates of each other are assigned the same subcluster number so you can avoid reviewing them.
Reference e-mails of threads are also designated as "Inclusive", indicating they are inclusive of all history of all other emails in the thread.
- Cluster or Thread #:
Some near duplicate views, such as "e-mails by Contact to merge threads", don't group documents into near duplicate clusters or e-mail threads. These views contain a "Cluster or Thread #" column which displays the thread or cluster number to help you identify which documents and e-mails belong to the same cluster or thread.
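Because the "Similarity" column means different things for documents and e-mails, it can help to see the rules side by side. The sketch below is illustrative only: the function name and return strings are hypothetical, and it assumes the column holds plain text such as "Reference", "Duplicate" or "98".

```python
def describe_similarity(value, is_email):
    """Explain a "Similarity" column value for a document or an e-mail."""
    text = value.strip()
    if text == "Reference":
        if is_email:
            return "reference e-mail: latest in the thread, inclusive of its history"
        return "reference document of the cluster"
    if text == "Duplicate" and not is_email:
        return "exact duplicate of the reference document"
    if is_email:
        # For e-mails the number is a sequence in the thread, not a percentage.
        return f"e-mail #{int(text)} in the thread (higher numbers are older)"
    return f"{float(text):g}% similar to the reference document"


print(describe_similarity("98", is_email=False))
print(describe_similarity("3", is_email=True))
print(describe_similarity("Duplicate", is_email=False))
```

A value of "98" on a document is a percentage, whereas "3" on an e-mail is only its place in the thread.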
Comparing documents
When you need to review the differences between the reference document, or any other document, and another document, simply select both and click [R+Review and Revision Tools > Compare 2 documents]. The viewer opens the two in your browser and highlights differences.
Passages that are different will appear highlighted in bright yellow. Ignore differences in punctuation and, in e-mails, ignore e-mail headers (the block of text with the date, from, to, etc.) and signature blocks. Sometimes passages are highlighted that are not real differences, so double check any differences carefully.
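The idea of ignoring punctuation while comparing can be illustrated with Python's standard `difflib` module. This is not the comparison engine MasterFile uses; it is a minimal sketch with hypothetical function names and sample sentences that compares two texts word by word after trimming punctuation from the edges of each word.

```python
import difflib
import string


def words(text):
    """Split text into words, trimming punctuation from each word's edges."""
    trimmed = (word.strip(string.punctuation) for word in text.split())
    return [word for word in trimmed if word]


def changed_passages(reference, other):
    """Return (change type, reference words, other words) for each difference."""
    a, b = words(reference), words(other)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


reference = "The Buyer shall pay $10,000 within 30 days."
draft = "The Buyer shall pay $10,000, within 60 days"
print(changed_passages(reference, draft))
```

In this sample the added comma and the missing full stop are ignored, and only the change from 30 to 60 days is reported, which is the kind of difference that can materially alter a contract.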
Buttons in the comparison window let you quickly hide documents, clusters, or find the documents in the original view they were selected from.
- After hiding documents, refresh the view by pressing F9 or by clicking the blue circular arrow in the top left corner of the view.
- "Find in view" has a delay of a couple of seconds before MasterFile displays search results.
- If the buttons do not appear to be working, ensure the Watch Folder Monitor is working.
Reviewing new batches of documents processed
When new batches of documents are processed, existing near duplicate document clusters and e-mail threads are updated with the documents and e-mails associated with those clusters and threads. Any remaining documents and e-mails are used to create new near duplicate clusters and e-mail threads.
In general not all documents and e-mails in a new batch processed need to be reviewed, but just those new documents and e-mails that were identified as "Reference".
For example, if a set of documents were inserted into an existing near duplicate cluster, but none of those documents was flagged as Reference, then the existing Reference document of that cluster has not changed and the new documents are just near duplicates of the existing Reference document. If near duplicates are not critical for that document, then they do not need to be reviewed.
Similarly, if a set of e-mails were inserted into an existing e-mail thread, but none of those e-mails was flagged as Reference, then none of them need to be reviewed because the history of the existing reference e-mail of the thread already has all the content of all the new e-mails.
Use the [L+Near Duplicates > by Processed Date; as docs] view to quickly identify the new Reference documents and emails to review as follows:
- Switch to the view.
- Click on the red row with the date and time of the latest batch processed.
- Press Shift+ (Shift key and Plus key) to expand all document clusters and e-mail threads in the view.
- Examine the first row after each red document cluster and e-mail thread heading and see if it is a Reference document.
If it is not, then all the remaining documents or emails in that cluster or thread are just near duplicates of the existing Reference document or just older e-mails in the thread.
If the first row is a reference document then that document or e-mail should be reviewed.
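The steps above amount to filtering the latest batch for rows marked "Reference". The sketch below shows that logic on exported rows. The row layout, field names, group numbers and dates are all hypothetical, and MasterFile itself is operated through its views rather than through code like this.

```python
# Hypothetical rows; "processed" uses a sortable "YYYY-MM-DD HH:MM" text format.
rows = [
    {"group": "D00000012", "name": "lease_v3.docx",
     "processed": "2024-03-01 09:00", "similarity": "Reference"},
    {"group": "D00000012", "name": "lease_v4.docx",
     "processed": "2024-04-15 14:30", "similarity": "97"},
    {"group": "E00000031", "name": "Lease terms",
     "processed": "2024-03-01 09:00", "similarity": "1"},
    {"group": "E00000031", "name": "RE: Lease terms",
     "processed": "2024-04-15 14:30", "similarity": "Reference"},
    {"group": "D00000058", "name": "memo.docx",
     "processed": "2024-04-15 14:30", "similarity": "Reference"},
]


def new_references(rows):
    """Return the rows from the latest batch that are marked Reference."""
    latest = max(row["processed"] for row in rows)
    return [
        row for row in rows
        if row["processed"] == latest and row["similarity"] == "Reference"
    ]


for row in new_references(rows):
    print(row["group"], row["name"])
```

In this sample, the new draft added to cluster D00000012 is not a reference and can be skipped unless its differences matter, while the new reference e-mail and the new reference document should be reviewed.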
If you wish to quickly see the full near duplicate cluster or e-mail thread that new documents or e-mails were added to, click on any row and then switch to the [L+Near Duplicates > everything as docs] view while holding down the Ctrl key.
You can continue your review of the new batch of documents by again holding down the Ctrl key while you switch to the [L+Near Duplicates > by Processed Date; as docs] view.
Advanced users can use the [L+Near Duplicates > everything as docs] view to quickly review Reference documents in each batch of documents as follows:
- Switch to the [L+Near Duplicates > everything as docs] view
- Click on the "Processed Date" column title.
This will sort the view by "Processed Date" and then by "Similarity". Documents and e-mails will no longer be grouped by near duplicate clusters or e-mail threads, so the display may be confusing for some users. However, with this display, all reference documents and e-mails for a batch will appear first within each batch of documents for fast review.
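The effect of this two-level sort can be sketched as follows. The rows and names are hypothetical, and the sketch assumes only what is stated above: rows are ordered by processed date, and within each batch the "Reference" rows come first. How MasterFile orders the remaining values, and whether dates ascend or descend, is not described here.

```python
# Hypothetical (processed date, similarity, name) rows in arbitrary order.
rows = [
    ("2024-04-15 14:30", "97", "lease_v4.docx"),
    ("2024-03-01 09:00", "1", "Lease terms"),
    ("2024-04-15 14:30", "Reference", "RE: Lease terms"),
    ("2024-03-01 09:00", "Reference", "lease_v3.docx"),
]


def sort_by_batch_then_reference(rows):
    """Sort by processed date, placing Reference rows first in each batch."""
    # False sorts before True, so rows marked "Reference" lead each batch.
    return sorted(rows, key=lambda row: (row[0], row[1] != "Reference"))


for processed, similarity, name in sort_by_batch_then_reference(rows):
    print(processed, similarity, name)
```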