A practitioner’s guide to PDF, Email and ESI production challenges
Introduction
eDiscovery technicians understand the importance of working with the right data. Electronically Stored Information (ESI) holds valuable digital details that can instantly impact a case. While data management starts long before litigation, we will focus on best practices when litigation becomes likely. Once a legal hold is in place, custodians must not change or delete any documents. This discussion will highlight key aspects of preserving and collecting documents, with a focus on PDFs. We will also cover how to avoid common mistakes and risks that could jeopardize your case.
This paper does not provide legal advice. For legal guidance, consult a licensed attorney. DiliTrust is not a forensic collection company but works with trusted professionals to support client projects. This paper examines the key issues in capturing and producing litigation data in a technically sound and legally defensible way.
“The PDF Pitfalls”
Challenges with Using PDFs for eDiscovery
PDFs and TIFFs are the preferred formats for producing documents during discovery. These image files resist alteration and protect against corruption during transfer. Native files, by contrast, store all of a document’s metadata, which can be crucial for a case.
In most cases, image files come with a load file. The load file contains metadata and extracted text from the native file and links them to the image file.
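As a rough illustration of how a load file ties metadata and extracted text to each image, the sketch below reads a small comma-separated load file. The column names (BegBates, ImagePath, TextPath, ParentBates, and so on) and the sample rows are hypothetical; real load files use their own delimiters and field names, so treat this as a sketch rather than a description of any particular product’s format.

```python
import csv
import io

# Hypothetical load file contents. Real load files use their own
# delimiters and field names; these columns are assumptions for illustration.
LOAD_FILE = """BegBates,EndBates,ImagePath,TextPath,Author,DateSent,ParentBates
ABC0001,ABC0002,IMAGES/ABC0001.pdf,TEXT/ABC0001.txt,J. Smith,2020-01-15,
ABC0003,ABC0003,IMAGES/ABC0003.pdf,TEXT/ABC0003.txt,J. Smith,2020-01-15,ABC0001
"""


def read_load_file(text):
    """Return one dict per produced document, keyed by its first Bates number."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["BegBates"]: row for row in reader}


def attachments_of(records, parent_bates):
    """List the Bates numbers of documents whose parent is parent_bates."""
    return [bates for bates, row in records.items()
            if row["ParentBates"] == parent_bates]


records = read_load_file(LOAD_FILE)
for bates, row in records.items():
    # Each row links an image to its extracted text and native metadata.
    print(bates, "->", row["ImagePath"], "| text:", row["TextPath"],
          "| author:", row["Author"])
print("Attachments of ABC0001:", attachments_of(records, "ABC0001"))
```

Without these rows, the images would arrive with no author, no date, no text, and no record of which attachment belongs to which email.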
Producing native files comes with risks. They can be opened, corrupted, or accidentally altered. For example, if a reviewer opens a native email in Outlook, they could forward or reply to it, creating serious issues. We’ll cover email in more detail later. Using image files instead of native files in such cases protects both the document and the client.
The Problem with Producing Multiple Documents in One PDF
Most productions arrive as PDFs or TIFFs created with proper quality control (QC) procedures. They usually include load files. During litigation, always use native files to retain original metadata. After completing analysis and review, convert the files to images and send them to opposing counsel as a PDF production with a load file.
Some groups, however, produce multiple documents in a single PDF without a load file. This practice prevents coding, separating, or processing the documents properly. It also creates costly problems for the review team. Without metadata or extracted text, running searches or maintaining familial relationships, like email attachments, becomes impossible.
The receiving party then faces a choice. They can either pay to separate, OCR, and manually recover missing metadata or request the producing party to redo the production correctly. This issue often arises during production, but some companies that self-collect also provide data in this flawed format. Not only does this method fail to meet proper standards, but it also invites challenges from opposing counsel and increases the risk of court sanctions.
The Cost of Recovering Metadata and Separating Documents
When problems such as data loss surface, the repair falls to litigation support professionals, who truly are the unsung heroes behind every litigation. They often describe themselves as being constantly “in the weeds,” “running fire drills,” and “underwater,” because they are. They have seen it all and can reliably forecast the common issues with poorly collected and produced data, but advice given after the fact comes at a high cost. This is why data collected as PDFs of native files and sent for processing is typically sent back to the collecting party to be re-collected as native files. In the rare cases where re-collection is not possible, a document specialist must determine how best to process the data and make it reviewable.
In one case we handled, the opposing party produced a single PDF containing all documents. The litigation support technician immediately spotted an issue: the attorney had requested Bates numbers for each document. Applying per-document Bates numbers to a single combined PDF is difficult because each document must first be manually separated. This process increases the risk of errors when determining where one document ends and another begins. A PDF produced this way is effectively one long static image, and without separate images for each document, reviewing it becomes extremely difficult. Our litigation team requested native files instead. While a document specialist could have separated the PDF, the cost would have been too high, and the native file metadata would still have been lost. Some metadata, like document date, sender, and subject, can be manually extracted, but this process is expensive, time-consuming, and incomplete.
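To show why document boundaries matter for Bates numbering, here is a minimal sketch that assigns one Bates number per page and reports the begin and end number for each document. The prefix, starting number, and padding width are illustrative assumptions. The key input is the page count of each document, which is exactly what a combined PDF without a load file fails to provide.

```python
def bates_ranges(page_counts, prefix="ABC", start=1, width=7):
    """Assign one Bates number per page; return (begin, end) per document.

    page_counts lists the number of pages in each document, in production
    order. The prefix, start, and width defaults are illustrative only.
    """
    ranges = []
    next_page = start
    for pages in page_counts:
        if pages < 1:
            raise ValueError("each document needs at least one page")
        begin = f"{prefix}{next_page:0{width}d}"
        end = f"{prefix}{next_page + pages - 1:0{width}d}"
        ranges.append((begin, end))
        next_page += pages
    return ranges


# Three documents of 3, 1, and 12 pages. If these boundaries are unknown,
# someone has to find them by hand before any numbering can happen.
for begin, end in bates_ranges([3, 1, 12]):
    print(begin, "-", end)
```

If a single boundary is guessed wrong, every range after it shifts or absorbs pages from a neighbouring document, which is the error risk described above.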
Properly preserving and collecting native files keeps their metadata intact. Metadata provides crucial details, such as the device used to create the file and date-specific information, which can be vital at trial. Altering or removing metadata severely limits the case team’s efficiency.
Court Ruling on Metadata in FOIA Requests
In Nat’l Day Laborer Org. v. U.S. Immigration and Customs Enforcement Agency, Judge Scheindlin ruled that even if a FOIA request doesn’t specifically ask for metadata, basic metadata is part of the public record and must be produced. This was the first time a federal judge took this stance in a FOIA case. She stated that producing only static images, without allowing electronic search tools, improperly degrades ESI.
Issues with Deduplication and OCR Technology
The absence of metadata creates another issue for review teams: the inability to use deduplication technology. Most eDiscovery tools can remove duplicate files, a process known as deduplication, which normally takes place during the processing stage. Duplicates are identified by matching hash values, which act as digital fingerprints computed from each file’s contents. Once a document is converted to a PDF, the PDF’s hash value differs from that of the original file, and because a PDF by itself carries none of the native file’s metadata, the tool has nothing else to match on. Duplicates therefore become difficult to identify.
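The hash-matching idea can be sketched in a few lines. The example below uses SHA-256 from Python’s standard library as an assumed algorithm (review platforms may use a different one), and the file names and byte contents are stand-ins. Two byte-identical native files share a hash, while a PDF rendering of the same report does not.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(files):
    """Keep the first file seen for each hash.

    files maps a file name to its bytes. Returns (unique, duplicates) where
    unique maps hash -> kept name and duplicates lists (dropped, kept) pairs.
    """
    unique = {}
    duplicates = []
    for name, data in files.items():
        digest = sha256_hex(data)
        if digest in unique:
            duplicates.append((name, unique[digest]))
        else:
            unique[digest] = name
    return unique, duplicates


native = b"Quarterly report, final version."
files = {
    "custodian_a/report.docx": native,
    "custodian_b/report.docx": native,  # byte-identical copy
    # Stand-in bytes for a PDF image of the same report: different content,
    # so a different hash, even though a human would call it a duplicate.
    "production/report.pdf": b"%PDF stand-in bytes for the same report",
}

unique, duplicates = deduplicate(files)
print("Kept:", sorted(unique.values()))
print("Removed as duplicates:", duplicates)
```

The PDF survives deduplication alongside the native file, which is how the same document ends up in a review set more than once.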
Having only an image of a document creates further problems. To make matters worse, PDFs produced this way arrive without separate extracted-text files. This means OCR (Optical Character Recognition) technology must be used to turn the image back into text. OCR reads the image and creates a searchable text file.
OCR has come a long way, but it is not perfect and can return text that differs from what actually appears in the document. Foreign languages, misspellings, slang, and poor image quality can skew the results even further.
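The effect of a single OCR misread on keyword search can be illustrated without running an OCR engine. In the sketch below, the OCR output is simulated by hand (the letter “m” read as “rn”), and the fuzzy comparison uses Python’s standard difflib module purely to show that the term is still nearby; it is an illustration, not a recommended search method.

```python
import difflib


def exact_hit(text, term):
    """Case-insensitive exact keyword search."""
    return term.lower() in text.lower()


def fuzzy_hits(text, term, cutoff=0.8):
    """Return words in text that closely resemble term."""
    words = [word.strip(".,;:()").lower() for word in text.split()]
    return difflib.get_close_matches(term.lower(), words, n=5, cutoff=cutoff)


native_text = "Please review the indemnification clause before signing."
# Simulated OCR output: "m" misread as "rn". This string is hand-made,
# not produced by a real OCR engine.
ocr_text = "Please review the indernnification clause before signing."

term = "indemnification"
print("Exact hit in native text:", exact_hit(native_text, term))
print("Exact hit in OCR text:", exact_hit(ocr_text, term))
print("Near matches in OCR text:", fuzzy_hits(ocr_text, term))
```

A reviewer running the exact keyword against the OCR text would never see this document, even though the term is plainly on the page.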
The Risks of Multiple Versions and Inconsistent Coding
When OCR is applied to PDFs that cannot be deduplicated, the exact same document can end up in the review database several times. Those copies may be read by separate reviewers and coded differently, and disparities in coding could lead to the production of privileged documents. This is where production QC is so important. Do not overlook this vital step in your review. While FRE 502(d) orders and clawback agreements protect the producing party from waiving privilege over inadvertently produced documents, the other side may still discover valuable information within such a document, and this could be harmful to your litigation. As the saying goes, “you cannot unring a bell.”
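One production QC check can be sketched as follows: group documents whose text is the same after normalizing case and whitespace, then flag any group that was coded more than one way. The document IDs, coding labels, and the normalization rule are assumptions made for illustration, and OCR noise can easily defeat this kind of grouping in practice.

```python
import hashlib
from collections import defaultdict


def text_key(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def coding_conflicts(docs):
    """Return groups of same-text documents that carry different coding."""
    groups = defaultdict(dict)
    for doc in docs:
        groups[text_key(doc["text"])][doc["doc_id"]] = doc["coding"]
    return [codes for codes in groups.values()
            if len(set(codes.values())) > 1]


# Hypothetical review export: two copies of one document, coded differently.
docs = [
    {"doc_id": "DOC-001", "text": "Advice of counsel re: merger.",
     "coding": "Privileged"},
    {"doc_id": "DOC-057", "text": "advice of counsel  re: merger.",
     "coding": "Responsive"},
    {"doc_id": "DOC-102", "text": "Lunch on Friday?",
     "coding": "Not Responsive"},
]

conflicts = coding_conflicts(docs)
for group in conflicts:
    print("Conflicting coding:", group)
```

Here DOC-057 would go out the door as responsive while its twin is withheld as privileged, which is the disparity a QC pass is meant to catch before production.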
Who’s qualified to produce PDFs?
Under the amended FRE 902 (see Rule 902(14)), copied data will not be “self-authenticating” unless a qualified person has inspected the data, recorded the process used, and certified that an exact copy of the data was created. The Advisory Committee note to the rule explains that the qualified person can meet the inspection requirement by comparing the hash value of the original ESI with that of the copy. One of the strongest ways to ensure your collection follows accepted best practices is to have a forensics expert handle the task. We often see these experts ward off pitfalls early on because they have seen the issues first-hand and can communicate the risks and rewards of any task before the collection even begins.
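The hash comparison described above can be sketched with Python’s standard library. This example copies a stand-in file, hashes the original and the copy with SHA-256 (an assumed algorithm choice), and records the result. It only illustrates the comparison itself; it is not a forensic collection tool, and the record it builds is not a certification.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file and record whether the copy's hash matches the original."""
    shutil.copy2(source, destination)
    record = {
        "source": str(source),
        "copy": str(destination),
        "source_sha256": file_sha256(source),
        "copy_sha256": file_sha256(destination),
    }
    record["match"] = record["source_sha256"] == record["copy_sha256"]
    return record


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in file; a real collection would point at the custodian's data.
    source = Path(workdir) / "original.msg"
    source.write_bytes(b"stand-in bytes for a collected file")
    record = copy_and_verify(source, Path(workdir) / "copy.msg")
    print("Hashes match:", record["match"])
    print("SHA-256:", record["source_sha256"])
```

Keeping a record like this for every collected file gives the qualified person something concrete to point to when describing the process used.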
What makes someone an expert? The Advisory Committee on Rules of Evidence defines a qualified person as someone who can testify based on their expertise and knowledge of data collection. This means the same criteria used to vet an expert witness (FRE 702) should also apply when selecting a person to collect litigation data. If the files produced cannot withstand opposing counsel’s scrutiny, the collection’s authenticity may be questioned. Beyond authentication, an unqualified collector risks destroying or losing data, leading to higher litigation costs, potential spoliation sanctions, and delays in the case.
ESI, email, and exact copies
Litigation support professionals often encounter issues when custodians collect data by forwarding emails. This may seem harmless, but it alters metadata and file locations, which could be important. Many custodians don’t realize this mistake, making employee education on “best practices” essential. Reviewing emails in an email application also carries risks. Opening an email in Outlook, for example, could lead to accidental forwarding or replying, which compromises the review process. To prevent this, use review applications that open native files in a “near-native format,” like viewing an Outlook email file in Word. This ensures a defensible review without risking misuse of native applications.
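To make the forwarding problem concrete, the sketch below builds a made-up original email and a made-up forwarded version of it using Python’s standard email library, then lists the header fields that no longer match. All addresses, dates, and message IDs are invented for illustration.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, date, message_id, body):
    """Build a simple email with the headers a reviewer usually relies on."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["Date"] = date
    msg["Message-ID"] = message_id
    msg.set_content(body)
    return msg


def changed_headers(first, second,
                    fields=("From", "To", "Subject", "Date", "Message-ID")):
    """Return the header fields whose values differ between two messages."""
    return {field: (str(first[field]), str(second[field]))
            for field in fields
            if str(first[field]) != str(second[field])}


# Invented example: the email as originally sent.
original = build_message(
    "alice@example.com", "bob@example.com", "Q3 pricing",
    "Tue, 03 Mar 2020 09:15:00 -0500", "<orig-001@example.com>",
    "See attached pricing.")

# The same email after the custodian forwards it to the collector.
forwarded = build_message(
    "bob@example.com", "collector@example.com", "FW: Q3 pricing",
    "Mon, 11 Jan 2021 16:40:00 -0500", "<fwd-777@example.com>",
    "Forwarded for collection.\n\n" + original.get_content())

for field, (before, after) in changed_headers(original, forwarded).items():
    print(f"{field}: {before!r} -> {after!r}")
```

The body text survives the forward, but the sender, recipient, sent date, and message identifier now describe the act of forwarding rather than the original communication.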
Every attorney should aim for a “forensically sound” collection. This approach helps litigation support and review teams by allowing eDiscovery software to function efficiently. Thomas Lidbury and Michael Boland from Drinker Biddle explained this in their paper Technology: Forensically Sound Collection of ESI. They stated that a collection is forensically sound when the collected files are exact copies of the source, including metadata. This ensures the collection can withstand judicial scrutiny. In contrast, using flawed methods increases costs, risks inadvertent disclosures, delays review, and causes late productions. The best practice is to capture ESI in its native format by creating a forensically sound copy, preserving metadata. To ensure defensibility, involve attorneys early so they can protect ESI, follow proper procedures, and document everything correctly.
Conclusion
Properly created and handled PDFs can be as defensible as native files. Following best practices for document collection and production minimizes the risk of opposing counsel questioning metadata authenticity. A skilled attorney documents every step and prepares for any challenges. Having an expert lead your litigation data collection and production strengthens your defense and reduces the chances of disputes.
About the author: Adam Bowers is an eDiscovery expert, LLM candidate, former business owner, and legal technology practitioner who helps law firms and attorneys navigate the complex world of discovery.