Recently a New York City law firm approached us with a problem: the other party was dumping thousands of pages of discovery into the firm's Everlaw eDiscovery service and generating thousands of dollars in monthly expenses. By the firm's initial estimates, Everlaw costs would reach $10,000 per month for this single case, and the case was likely to continue past 12 months, meaning at least $120,000 for eDiscovery of mostly useless data. The owner of the firm reached out to us on a Thursday afternoon, and after a short conversation we promised a solution by Monday.
To the attorney this challenge seemed massive, but to us it was a simple problem with an easy solution. We would roll out a hybrid solution that pre-processed the data so that only useful information would be sent to Everlaw. Hybrid, in this case, meant that we would leverage their staff, our staff, and plenty of automation. If the project performed well, we would spend 80% of our time processing each new dump and 20% on improvements: kaizen at its best.
We used the AWS CLI (Amazon Web Services Command Line Interface) to synchronize the data from on-premises systems to Amazon S3, which served as our staging area. From there we decided on a two-pronged approach and an internal race to see who would reach a solution first. We like to make things fun, so we had Team Platypus and Team Aardvark. Team Platypus would work with a proven software stack on an Amazon EC2 Windows instance, while Team Aardvark would chase Amazon Textract and AWS Lambda functions. The goal of each team was to OCR all the files, convert them to PDF, and combine them in some human-usable way. The organized documents would go to the attorneys at the firm, who were fairly skilled at picking out useful information, and from there the small subset of data would be sent to Everlaw.
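The staging step can be sketched as a thin Python wrapper that builds an `aws s3 sync` command. This is only an illustration: the local path and bucket name are hypothetical placeholders, not the ones used on the project, and the `--dryrun` flag is included so the transfer can be previewed before anything is copied.

```python
import shlex
from typing import List


def build_sync_command(source: str, destination: str, dry_run: bool = False) -> List[str]:
    """Return the argument list for `aws s3 sync SOURCE DESTINATION`."""
    cmd = ["aws", "s3", "sync", source, destination]
    if dry_run:
        # --dryrun lists what would be transferred without copying anything
        cmd.append("--dryrun")
    return cmd


# Hypothetical locations: a local discovery folder and a staging bucket prefix.
upload = build_sync_command(
    "/data/discovery/incoming",
    "s3://example-staging-bucket/incoming/",
    dry_run=True,
)
print(shlex.join(upload))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would execute the transfer on a machine with the AWS CLI installed and credentials configured. Swapping the source and destination arguments gives the reverse direction, from S3 back to local storage.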
By Friday morning Team Platypus was in the lead, and we put all our energy into their effort. What we did next sounds far more impressive than it was: we used a substantial Amazon EC2 Windows Server 2019 instance to run Tesseract and PDFtk on the data, saving the customer $100,000 in the process.
Tesseract is a Google-supported, open-source optical character recognition (OCR) engine. Our server was configured to take the data from the staging area in Amazon S3, bring it to high-performance EBS volumes, OCR it, and save each file as a PDF. By Friday afternoon we had OCR'd about 145,000 files and pushed them to the next stage, which was to identify patterns in the data and find ways to combine it. We wrote our own code for this part and sent the batches to PDFtk to combine into human-usable PDFs. The deliverables were PDF files, each containing between 100 and 500 pages, that the attorneys could flip through.
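The OCR-and-combine stage can be sketched roughly as follows. This is not the code we wrote for the project: the grouping here is a simple greedy split by page count, standing in for the pattern-based grouping described above, and the file names and the 500-page cap are illustrative assumptions. The two command builders rely on the standard `tesseract IMAGE OUTPUTBASE pdf` and `pdftk ... cat output ...` invocations.

```python
from typing import Iterable, List, Tuple


def tesseract_command(image_path: str, output_base: str) -> List[str]:
    """Build a Tesseract call that writes OUTPUT_BASE.pdf with a searchable text layer."""
    return ["tesseract", image_path, output_base, "pdf"]


def batch_by_pages(files: Iterable[Tuple[str, int]], max_pages: int = 500) -> List[List[str]]:
    """Greedily group (path, page_count) pairs, in order, into batches of at most max_pages.

    A single file longer than max_pages ends up alone in its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_pages = 0
    for path, pages in files:
        # Close the current batch if adding this file would exceed the cap.
        if current and current_pages + pages > max_pages:
            batches.append(current)
            current, current_pages = [], 0
        current.append(path)
        current_pages += pages
    if current:
        batches.append(current)
    return batches


def pdftk_cat_command(inputs: List[str], output_path: str) -> List[str]:
    """Build a PDFtk call that concatenates the inputs, in order, into one PDF."""
    return ["pdftk", *inputs, "cat", "output", output_path]


# Hypothetical OCR'd files with their page counts.
ocr_outputs = [("a.pdf", 200), ("b.pdf", 250), ("c.pdf", 100), ("d.pdf", 450)]
for number, batch in enumerate(batch_by_pages(ocr_outputs), start=1):
    print(pdftk_cat_command(batch, f"combined_{number:04d}.pdf"))
```

Each command list can be handed to `subprocess.run(cmd, check=True)` on a machine where Tesseract and PDFtk are installed. Building the commands as lists rather than shell strings avoids quoting problems with file names that contain spaces.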
We typically do not work on weekends, but this was a special case, and we performed finalization and QA on the data on Saturday. Our QA consisted of spot-checking and automated testing that verified OCR had been performed on every detectable paragraph and that Bates numbers were in chronological order. They were.
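An ordering check along these lines might look like the sketch below. It assumes Bates labels take the common form of a letter prefix followed by digits (for example `ABC000123`) and treats "in order" as strictly ascending numbers within the same prefix. Both are assumptions made for illustration, not a description of our actual test suite.

```python
import re
from typing import List, Tuple

# Assumed label format: letters, an optional separator, then digits.
BATES_PATTERN = re.compile(r"^([A-Za-z]+)[-_ ]?(\d+)$")


def parse_bates(label: str) -> Tuple[str, int]:
    """Split a Bates label into (prefix, number), e.g. 'ABC000123' -> ('ABC', 123)."""
    match = BATES_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognized Bates label: {label!r}")
    return match.group(1).upper(), int(match.group(2))


def out_of_order(labels: List[str]) -> List[Tuple[str, str]]:
    """Return adjacent pairs whose numbers do not increase within the same prefix."""
    problems = []
    for previous, current in zip(labels, labels[1:]):
        prev_prefix, prev_number = parse_bates(previous)
        cur_prefix, cur_number = parse_bates(current)
        # A change of prefix is treated as the start of a new series, not an error.
        if cur_prefix == prev_prefix and cur_number <= prev_number:
            problems.append((previous, current))
    return problems


# Hypothetical labels read from the pages of one combined PDF.
print(out_of_order(["ABC000001", "ABC000002", "ABC000005"]))
print(out_of_order(["ABC000003", "ABC000002"]))
```

An empty result means every adjacent pair was in ascending order. Gaps in the numbering are allowed here, since a batch may legitimately skip pages that were filtered out.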
We delivered all final products on Monday using a reverse AWS CLI sync. $100,000+ saved at a cost of about 8 man-hours and some AWS machine time.