How to process documents for near duplicates and e-mail threads
Near duplicate processing and e-mail threading require accurate, error-free text from the original native format documents or from PDFs created from native documents. Scanned documents that have been OCRed are not recommended because OCR processing is not 100% accurate, and even small errors will produce poor near duplicate and e-mail threading results.
For e-mails, we strongly recommend you obtain native format e-mails (i.e. .EML or .MSG format files) for the best e-mail threading results.
Basic near duplicate and e-mail threading
Before processing documents for near duplicates, attachments for all e-mails must be extracted into their own profiles. To extract the attachments, select all e-mails in the [L+ Documents > by Doc Type] view and choose [R+ Profile Maintenance > Process Attachments].
The first time you process a case's documents for near duplicates, it's best to process them all. As more documents are received, process each new batch; it is added to the documents already processed and the clusters are updated. To select all documents, switch to the [L+ Documents > by Date] view and press <Ctrl-A>.
Once the documents to process have been selected, choose [R+ Near Duplicate Functions > Process selected documents].
Selecting an e-mail will automatically process its attachments, whether or not they were selected for processing. Documents that are attachments to e-mails can only be processed if their parent e-mails are selected for processing.
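The selection rule above can be sketched in a few lines of Python. This is only an illustration under an assumed data model: the Doc record, its parent_id field and the effective_selection function are hypothetical names and are not part of MasterFile.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str
    kind: str                        # "email", "attachment" or "document"
    parent_id: Optional[str] = None  # set only for attachments


def effective_selection(docs, selected_ids):
    """Return the ids that would be processed for a given selection.

    Hypothetical model of the rule described above:
    - a selected e-mail brings all of its attachments with it;
    - an attachment is processed only if its parent e-mail is selected;
    - any other selected document is processed as-is.
    """
    selected = set(selected_ids)
    processed = set()
    for doc in docs:
        if doc.kind == "attachment":
            # The attachment's own selection state is irrelevant;
            # only the parent e-mail's selection counts.
            if doc.parent_id in selected:
                processed.add(doc.doc_id)
        elif doc.doc_id in selected:
            processed.add(doc.doc_id)
    return processed
```

In this model, selecting an attachment on its own has no effect, which is why it is simplest to select the parent e-mails and let their attachments follow.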
Like OCR processing, near duplicate processing and e-mail threading are intensive processes and can take anywhere from several minutes to several hours to complete, depending on the number of documents and e-mails to process.
Before processing documents for near duplicates, make sure that all documents (other than e-mails) and attachments have been OCR/PDF Crunched, so that every document to be processed has a PDF copy with extractable text. If this is not done, near duplicate processing will treat all documents without text as binary files. Even if they are later OCR/PDF Crunched, they cannot be re-processed as files with text without reprocessing the entire database for near duplicates.
To find out which documents are missing PDF copies of their documents and/or require OCR processing:
- Select the documents (other than e-mails) to process for near duplicates.
- Use [R+ Evidence Cruncher > Verify Crunch Status] to analyze the documents.
- Open the [L+ Documents > by Date] view.
- Copy/paste the following search string into the search bar and click "Search":
([DP_Crunch_Status_k] contains "No PDF attached to profile" or
[DP_Crunch_Status_k] contains "Has OCR but document is not PDF" or
[DP_Crunch_Status_k] contains "PDF without OCR text" or
[DP_Crunch_Status_k] contains "PDF has pages without OCR text" or
[DP_Crunch_Status_k] contains "Unknown ... OCR and PDF have 25% - 50% difference in size" or
[DP_Crunch_Status_k] contains "Unknown ... OCR and PDF have 50% - 75% difference in size" or
[DP_Crunch_Status_k] contains "Unknown ... OCR and PDF have > 75% difference in size" or
NOT [DP_Crunch_Status_k] IS PRESENT)
And (Not([DP_Document_Type_k] Contains "Communications\e-mail") Or [DP_Document_Type_k] Contains "Communications\e-mail attachment")
- Select the documents displayed and use the Evidence Cruncher to OCR/PDF Crunch them.
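The search string above has a regular structure: one clause per crunch status, a clause for profiles with no status at all, and a document type filter that excludes e-mails but keeps e-mail attachments. If you would rather generate it than copy it, the following Python sketch assembles the same text. The build_crunch_search helper is a hypothetical name for illustration, not a MasterFile function; the field names and status values are copied from the search string above.

```python
# Status values copied verbatim from the search string above.
CRUNCH_STATUSES = (
    "No PDF attached to profile",
    "Has OCR but document is not PDF",
    "PDF without OCR text",
    "PDF has pages without OCR text",
    "Unknown ... OCR and PDF have 25% - 50% difference in size",
    "Unknown ... OCR and PDF have 50% - 75% difference in size",
    "Unknown ... OCR and PDF have > 75% difference in size",
)


def build_crunch_search(statuses=CRUNCH_STATUSES):
    """Assemble the search string (hypothetical helper, not a MasterFile API)."""
    clauses = [f'[DP_Crunch_Status_k] contains "{s}"' for s in statuses]
    # Also catch profiles that have never been verified.
    clauses.append("NOT [DP_Crunch_Status_k] IS PRESENT")
    status_part = "(" + " or\n".join(clauses) + ")"
    # Exclude e-mails but keep e-mail attachments, as in the original string.
    type_part = (
        'And (Not([DP_Document_Type_k] Contains "Communications\\e-mail") '
        'Or [DP_Document_Type_k] Contains "Communications\\e-mail attachment")'
    )
    return status_part + "\n" + type_part


print(build_crunch_search())
```

The printed text can then be pasted into the search bar in the same way as the string above.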
Processing additional batches of documents or e-mails
Once a MasterFile database has been processed for near duplicates and e-mail threads, further processing of new documents and e-mails added to the database must be performed on the same workstation where the database was first processed for near duplicates.
As your case progresses and new production sets or documents you've received are loaded into the case, process them as described above to update the near duplicate clusters and e-mail threads.
- Use the [L+ Near duplicates > by Processed Date, as docs] and [L+ Near duplicates > by Processed Date, Doc Type] views to quickly identify the newly loaded documents and e-mails that have not been processed.
Select the documents and e-mails under the "(Not processed)" category and process them as explained earlier. You can select the documents from any view, and any already processed documents you select by mistake will be ignored, but it's faster and simpler to start processing from these two views.
- Newly processed documents can also be reviewed in the two "by Processed Date" views; however, if documents were inserted into other clusters, the clusters shown may not be complete because the documents are first grouped by the date they were processed.
All document clusters and e-mail threads are recalculated to ensure each contains the maximum number of documents and e-mails so as to reduce the number of documents and e-mails to review.
As a result of this recalculation, the reference documents of near duplicate clusters and e-mail threads may change. Reference documents and e-mails that were demoted from reference status retain their "Reference Date" value and can thus be identified from the "Reference Date" column in the near duplicate views.
Processing documents and e-mails with Bates numbers
Documents stamped with Bates numbers need to have them removed before processing. If Bates numbers are not removed, documents that have identical content will not be identified as duplicates but as near duplicates with 99% similarity (the difference being the different Bates numbers stamped on each page of the document), creating additional review work.
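The effect is easy to reproduce with a small Python sketch. It uses difflib from the Python standard library purely as a stand-in similarity measure; MasterFile's own comparison method is not described here, so the exact percentage will differ. The sample text is invented for the illustration.

```python
from difflib import SequenceMatcher

# Invented sample text standing in for the content of a document.
body = "The parties agree to the terms set out in Schedule A. " * 20

# Two copies of the same document, stamped in different productions.
copy_a = body + "JAORG-000004"
copy_b = body + "BADL 000433"

# autojunk=False stops difflib from ignoring frequently repeated characters.
stamped = SequenceMatcher(None, copy_a, copy_b, autojunk=False).ratio()
unstamped = SequenceMatcher(None, body, body, autojunk=False).ratio()

print(f"with Bates stamps:    {stamped:.3f}")    # just below 1.000
print(f"without Bates stamps: {unstamped:.3f}")  # exactly 1.000
```

With the stamps left in, the two copies score just under a perfect match, so they would be reported as near duplicates rather than duplicates.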
To remove Bates numbers, commence near duplicate and e-mail threading processing using the Evidence Cruncher as follows:
- Select the documents and e-mails to process.
- Start the Evidence Cruncher by clicking on [R+ Evidence Cruncher > Document Services ...]
- And configure it as follows:
- Choose "Dump documents (...)" for the "Service".
- Choose "Near duplicate processing" for the "Purpose".
- Enter "Bates number masks" as explained below to remove Bates numbers for improved results.
- Click "OK".
Once processing is complete, you can review the results in the near duplicate views.
Near duplicates base directory
The "Target Folder ...", in the above screen shot, will automatically be set to the folder configured in the 'Near duplicates base directory' field of [R+ Administration > User settings] as shown below:
This directory stores near duplicate results for each of your databases and is used when processing additional documents and e-mails added to the database.
Do not delete or modify the "Target Folder", the "Near duplicates base directory" or any subdirectories under it. If you do, near duplicate data will be lost and the databases will need to be re-processed, which can impact the state of your review.
Specifying Bates number masks
The quality of near duplicate and e-mail threading results can be reduced by spurious text that is not part of the original document or e-mail but is injected into its text by e-mail systems (such as "Forwarded on") or by other document processing (such as stamping Bates numbers). MasterFile removes the most common e-mail system injections during processing; however, it needs additional information before it can remove Bates numbers stamped on the documents.
The "Bates number masks" setting, in the above screen shot, allows you to specify a mask (or template) of Bates number formats that have been stamped on the documents so the Bates numbers can be identified and removed.
Bates numbers typically have the format A# where A is a prefix of several letters, spaces and punctuation marks and # is a number with leading zeros to pad the number to a certain size, such as 7 digits.
For example, the following are the Bates masks for the Bates sample numbers shown:
Bates number: JAORG-000004
Template: JAORG-######
Bates number: BADL 000433
Template: BADL ######
Use the [L+ Pleadings : Disclosures > by Production History : Bates #] view and note the Bates number formats used in all the different productions.
For each Bates number format, open the PDF copy of the file to see the actual Bates number stamped on the page. Then base the mask on the Bates number stamped on the page rather than on what is shown in MasterFile.
Create a mask for each Bates number format found and enter each mask on a separate line in the 'Bates number masks' field. For example, if your documents have the two Bates formats above, enter:
JAORG-######
BADL ######
Documents that have Bates numbers stamped on their pages that match either of these two masks will have their Bates numbers removed before being processed for near duplicates and e-mail threading.
All the masks you enter are saved and automatically used when you process new batches of documents. However, be sure to review new documents for additional Bates formats before processing them, and enter their masks as above.
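To make the idea of a mask concrete, here is a minimal Python sketch of how a mask could be turned into a pattern and used to strip Bates numbers from text. It is not MasterFile's implementation. It assumes that each run of # stands for a run of digits and that every other character must match literally; MasterFile's actual matching rules may be stricter, for example about the number of digits. The function names mask_to_regex and strip_bates are hypothetical.

```python
import re


def mask_to_regex(mask):
    """Translate a Bates mask into a compiled regular expression.

    Assumption for this sketch: each run of '#' stands for a run of
    digits, and every other character must match literally.
    """
    parts = []
    for piece in re.split(r"(#+)", mask):
        if not piece:
            continue
        if piece.startswith("#"):
            parts.append(r"\d+")
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts))


def strip_bates(text, masks):
    """Remove every substring of `text` that matches one of the masks."""
    for mask in masks:
        text = mask_to_regex(mask).sub("", text)
    return text


masks = ["BADL ######"]
page = "Minutes of the board meeting BADL 000433"
print(strip_bates(page, masks).strip())  # Minutes of the board meeting
```

Text that matches none of the masks is left untouched, which is why every Bates format found in your productions needs its own mask.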
Re-processing a database for near duplicate and e-mail threading
Near duplicate and e-mail threading results for a batch of documents cannot be rolled back or undone. If, for some reason, you need to remove some near duplicate or e-mail threading results, all results must be erased from the database and the database re-processed.
This means that:
- All information about manually merged e-mail threads will be lost.
- Reference documents you've reviewed may not be identified as reference documents again, which can confound tracking of documents already reviewed.
- All Bates number masks you had identified and specified will be lost and will have to be specified again.
Erase a database's existing near duplicate and e-mail threading results as follows:
- Delete the "Target Folder ..." explained above.
- Open the [L+ Near Duplicates > everything as docs] view.
- Press <Ctrl-A> to select all documents and e-mails in the view.
- Start Global Replace by clicking on [R+ Review and Revision Tools > Global Replace].
- Scroll to the bottom and turn on the check box next to 'Clear near duplicate values' as shown below.
- Click 'OK'.