"So, tell me about this SaaS idea of yours so we can take it for a spin," I said.
In April 2016, I met a junior associate at a global law firm who was deeply involved in the St. Louis software startup scene. We shared an interest in emerging SaaS models, and I was eager to hear his pitch.
"One of the things I've learned as a junior associate," Matt began, his gaze fixed on his salad with a look of determined frustration, "is how much time an opposing legal team can waste with a document dump. It’s a pre-trial tactic intended to overwhelm us, often leading to missed details and extended litigation periods—not just a massive drain on the firm's resources, but on our ability to effectively serve our clients."
His voice carried a blend of exasperation and excitement as he leaned forward. "A buddy of mine from college is pursuing a CS degree and already experimenting with our tech stack—let me CC you into our email thread."
I ate a few bites of my Caesar as Matt tapped at his phone screen; after a few moments he looked up expectantly. "Hey, do you think you could show me exactly what a document dump looks like?" I asked.
As we approached the office, Matt chuckled, half in resignation. "We had to rent extra space just to manage this case," he explained as he opened the door to a modest commercial office. The room was stark—a temporary setup with a cluster of folding tables in the corner covered with towering stacks of paperwork, and a scanner whirring quietly in the background.
"Welcome to the chaos," he said, gesturing towards the bustling room. Three legal assistants sat heads-down typing on laptops among piles of documents, occasionally pausing to scan pages using the networked computers. Their percussive rhythm of mouse clicks, scanner motors, and focused activity was punctuated by the occasional snap of a legal binder closing.
"This," Matt waved his hand across the room, "is what we call a document dump in action. Imagine sifting through all of this manually for nuggets of relevant information."
The practical demonstration of the problem was more impactful than any description could have been. As we wove between the tables, the scale of the task at hand was unmistakably clear. Each box represented hours of potential work—work that Matt was convinced could be streamlined significantly.
By training machine learning algorithms, we believed we could significantly reduce the error rates associated with Optical Character Recognition (OCR) document processing in pretrial discovery. Specifically, we hypothesized that for every point reduction in OCR error rate, there would be a corresponding cost savings of $0.15 per document, leading to greater efficiency and accuracy in legal document handling.
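Put as a formula, this is simply a restatement of our $0.15-per-point assumption, not a measured result:

\[ \text{Projected Savings per Document} = \Delta_{\text{error}} \times \$0.15 \]

where \( \Delta_{\text{error}} \) is the reduction in OCR error rate, in percentage points.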
Our team was distributed and fully remote, relying on various tools to collaborate effectively. During our discussions via Google Hangouts and GitHub, the three of us meticulously planned the OCR training loop. I decided to focus on ensuring we collected accurate data through a training UI, while Ryan, our software engineer, took charge of the backend and machine learning components. Matt was our lead user and subject-matter expert, ultimately responsible for helping the team navigate the problem space.
"Ok, now who is going to be training this?," I asked.
Our primary users were paralegals working in pretrial discovery—they were the ones who could tell us which features an OCR correction tool actually needed. To understand them, I needed to delve into their daily operations.
I relied on Ryan to help make sense of the technical aspects of our work. He provided me with a sample OCR output in JSON format from a scanning result obtained from our test document. The essential artifacts for conducting a manual review were all there: the original document, coordinates and dimensions for low-confidence passage highlights, and transcribed text awaiting correction.
{
  "document_id": "12345",
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "text": "ST. MARY'S OF MICHIGAN: WASHINGTON AVE. EMERGENCY RECORD",
          "confidence": 95,
          "coordinates": { "x": 10, "y": 20, "width": 300, "height": 50 }
        },
        {
          "text": "Patient: John Doe",
          "confidence": 82,
          "coordinates": { "x": 10, "y": 100, "width": 150, "height": 20 }
        },
        {
          "text": "Date of Birth: 01/01/1970",
          "confidence": 78,
          "coordinates": { "x": 10, "y": 130, "width": 200, "height": 20 }
        }
      ]
    }
  ]
}
My challenge was finding an open-source technology that would let me use this data to re-create the low-confidence scenario for a human reviewer. I was inspired by my previous experience with four-color (4/4) press setup, creating negative film for copper-plate printing. That process separates the colors in a design layout (CMYK) into four plates that are over-printed, and registration marks are critical to ensuring each color prints in exactly the right place. Using a similar approach, I could layer the original document with highlighted areas indicating low-confidence OCR results, ensuring the reviewer could focus on those specific sections.
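A minimal sketch of that layering idea, assuming the block structure from the sample JSON above; the function name, threshold constant, and use of Pillow here are illustrative, not our actual implementation:

```python
from PIL import Image, ImageDraw

# Blocks scoring below this get highlighted for review (illustrative value).
CONFIDENCE_THRESHOLD = 86

def render_review_page(page_image_path: str, blocks: list[dict]) -> Image.Image:
    """Overlay translucent highlights on low-confidence OCR blocks."""
    base = Image.open(page_image_path).convert("RGBA")
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)

    for block in blocks:
        if block["confidence"] < CONFIDENCE_THRESHOLD:
            c = block["coordinates"]
            box = (c["x"], c["y"], c["x"] + c["width"], c["y"] + c["height"])
            # A semi-transparent "plate" layered over the original page,
            # registered to the block's coordinates.
            draw.rectangle(box, fill=(255, 230, 0, 90), outline=(255, 160, 0, 255))

    return Image.alpha_composite(base, overlay)
```

The appeal of this approach was that the original scan stays untouched; the highlights are a separate layer, just like an over-printed plate.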
One of our first spirited disagreements occurred over implementation. Matt saw immediate value in providing basic Cloud OCR access and wanted to prioritize document upload features. However, we ultimately decided this was a "pipeline dependent" feature—its success depended on a reason to automate, which presumed the success of a trained model we didn't yet have.
If we wanted accurate training data, we were going to need to change attitudes around the manual review process with the right design. We agreed to focus on what was making our users scream: the endless pile of documents awaiting manual transcription.
The team conducted two phases of manual corrections to train the ML model. To measure the impact of these corrections, we established a baseline with Google Cloud OCR and flagged every transcription with a confidence score lower than 86% for manual correction.
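In practice, flagging amounted to walking the OCR output and queueing anything under the threshold. A rough sketch, again assuming the sample JSON structure (the function and field names are illustrative):

```python
REVIEW_THRESHOLD = 86  # Google Cloud OCR confidence scores below this go to a human

def flag_for_review(ocr_result: dict) -> list[dict]:
    """Collect low-confidence blocks as review tasks for the training UI."""
    tasks = []
    for page in ocr_result["pages"]:
        for block in page["blocks"]:
            if block["confidence"] < REVIEW_THRESHOLD:
                tasks.append({
                    "document_id": ocr_result["document_id"],
                    "page_number": page["page_number"],
                    "original_text": block["text"],
                    "coordinates": block["coordinates"],
                })
    return tasks
```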
With the web interface, we were able to accurately collect and compile every correction made by our human reviewers, categorizing errors and corrections by type (e.g., misread, incomplete, bad format, other). After applying the corrections, the OCR model was retrained with the corrected data. The model's accuracy was then re-evaluated using a separate validation set of documents not included in the initial training set.
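The shape of a single correction record might look something like the sketch below; the dataclass and field names are hypothetical, but the error-type taxonomy matches the categories named above:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

ERROR_TYPES = {"misread", "incomplete", "bad_format", "other"}

@dataclass
class Correction:
    document_id: str
    page_number: int
    original_text: str    # low-confidence OCR output
    corrected_text: str   # reviewer's transcription
    error_type: str       # one of ERROR_TYPES
    confidence: int       # original OCR confidence score
    reviewed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self):
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type}")
```

Keeping the original text, the correction, and the error category together is what made it possible to feed labeled pairs back into retraining rather than just fixing documents one at a time.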
\[ \text{Accuracy Improved} = 7\ \text{points} \]
\[ \text{Cost Savings} = 7 \times \$0.15 = \$1.05\ \text{per document} \]
The expanded training set enabled the model to handle a broader range of document types and quality levels. By categorizing errors, trainers provided nuanced data that significantly enhanced the model's learning process. For example, by categorizing errors by font type, document quality, and specific sections (headers, body text), the model could apply corrections more intelligently.
Our product demonstrated significant improvements in processing efficiency and accuracy through the integration of advanced OCR and ML technologies. By involving users in both the design and training process, the system quickly evolved to handle document dumps more effectively, ultimately reducing the manual effort required and allowing legal teams to focus on more critical tasks.
Many of us will find ourselves training computers to do our work within our lifetimes. The key to success in the design of any automation is the careful consideration of humans—treating everyone involved with respect and dignity.