Splitting PDFs with Express Load

Often you will receive one or more PDFs that each contain many documents in a single file. Common examples are medical records, production disclosures and aggregates of scanned documents. To import correctly into MasterFile, the logical documents need to be split out as one document per PDF so that they can be managed properly as evidence, together with any key facts and extracted information.

You can split a PDF with MasterFile's Express Load using a CSV file created in Excel, which is our recommended method, or using bookmarks (with some caveats).

Using a CSV file created in Excel

MasterFile splits a PDF into logical documents based on their starting and ending page numbers. The process is the same as for any other MasterFile CSV load file, except that two new columns, EL_PDF_Start_Page_n and EL_PDF_End_Page_n, are added to the Excel spreadsheet from which the CSV load file for Express Load is saved. Since you are working with a regular MasterFile CSV load file, you can also add other metadata for the documents and use Excel's cell copy functions to speed up entry when several documents share the same metadata.
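If the page ranges are already available as data, the load file can also be produced by a script instead of by hand. The following is a minimal Python sketch, not part of MasterFile. Only the two page columns are taken from the text above; the "Summary" column, the UTF-8 encoding, the function names, and the rule that documents are listed in page order without overlapping are assumptions made for illustration.

```python
import csv

# These two column names come from the text above.
START_COL = "EL_PDF_Start_Page_n"
END_COL = "EL_PDF_End_Page_n"
# Hypothetical metadata column, used only for illustration.
SUMMARY_COL = "Summary"


def check_page_ranges(rows, total_pages):
    """Return a list of problems found in the start/end page ranges.

    Assumes rows are listed in page order and must not overlap;
    this is an assumption, not a documented MasterFile rule.
    """
    problems = []
    previous_end = 0
    for i, row in enumerate(rows, start=1):
        start, end = int(row[START_COL]), int(row[END_COL])
        if start < 1 or end > total_pages or start > end:
            problems.append(f"row {i}: invalid range {start}-{end}")
        elif start <= previous_end:
            problems.append(f"row {i}: overlaps previous document")
        previous_end = max(previous_end, end)
    return problems


def write_load_file(path, rows):
    """Write the rows as a CSV file with a header line."""
    fieldnames = [START_COL, END_COL, SUMMARY_COL]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Calling check_page_ranges before write_load_file catches ranges that run backwards or past the end of the PDF, which is cheaper than discovering them during a test load.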

The two attached ZIP files contain XLSX and CSV versions of an example load file and the associated PDFs for you to examine and try out. Simply unzip everything to a folder and follow the PDF instructions attached below to split and test load your PDF in a new database using Express Load before finally importing.

Notes

The CSV and XLSX files should have exactly the same filename as the PDF you are splitting; only the file extension differs.
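The naming rule can be checked before loading. This is a small Python sketch; the function name and the example filenames are hypothetical, and it assumes that "same filename" means the name without its extension, compared exactly, including letter case.

```python
from pathlib import Path


def find_load_file_mismatches(pdf_path, load_file_paths):
    """Return the load files whose name (minus extension) differs from the PDF's."""
    pdf_stem = Path(pdf_path).stem
    return [p for p in load_file_paths if Path(p).stem != pdf_stem]
```

For example, find_load_file_mismatches("records.pdf", ["records.csv", "Records.xlsx"]) reports "Records.xlsx" because the comparison is case-sensitive.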

Using bookmarks

Splitting a PDF using bookmarks is useful if the PDF already has descriptive bookmarks and each bookmark marks the start of a logical document in the PDF. You can of course add bookmarks yourself and then load the PDF. When splitting with bookmarks, however, the bookmark is the only metadata loaded (as the document summary), so use a short, meaningful summary as the bookmark for each document within the PDF.

The attached PDF explains how to split a PDF file using its bookmarks.

Notes

Whether bookmarks exist or you are adding them, PDF products can and do introduce odd characters and invisible line breaks into bookmarks. For example:

  • Although we recommend you use Acrobat, Acrobat itself can add a square box character like this, □, to bookmarks in some cases
  • Bookmarks cannot span two or more lines, yet Acrobat and other products will let you add carriage returns or new line characters, or will add these themselves. These characters are invisible, so the bookmark looks like one long line. The only way to spot such line breaks is to copy and paste the entire bookmark into Notepad, correct it there, then replace the bookmark's text by copying and pasting the corrected version from Notepad.

Any of the above will cause an import to fail, and they cannot be detected in advance. We therefore always recommend that you check the characters in each bookmark and test load your PDF in a new database before finally importing.
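As a rough scripted equivalent of the Notepad check, the Python sketch below scans bookmark text that you have already copied out of the PDF. It does not read the PDF itself, the function name is hypothetical, and the set of characters it flags (line breaks, control and other invisible characters, the □ box and the Unicode replacement character) is an assumption rather than a complete list of what makes an import fail. It does not replace a test load.

```python
import unicodedata

LINE_BREAKS = "\r\n\u2028\u2029"
BOX_LIKE = ("\u25A1", "\uFFFD")  # square box and replacement character


def find_bookmark_problems(title):
    """Return (index, description) pairs for suspicious characters in a bookmark."""
    problems = []
    for index, ch in enumerate(title):
        if ch in LINE_BREAKS:
            problems.append((index, "line break"))
        elif unicodedata.category(ch) in ("Cc", "Cf", "Co", "Cn"):
            # Control, format, private-use and unassigned characters.
            problems.append((index, f"invisible or control character U+{ord(ch):04X}"))
        elif ch in BOX_LIKE:
            problems.append((index, f"box or replacement character U+{ord(ch):04X}"))
    return problems
```

An empty result only means that none of the characters listed here were found in the pasted text.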