AI-Powered PDF Extraction

Overview

A financial analyst at an investment firm was manually reading through hundreds of scanned annual reports each quarter, typing Revenue and Operating Profit figures into a tracking spreadsheet. Each report took 5–10 minutes to process by hand — totalling over 8 hours of repetitive work per week. The reports came from dozens of different companies with completely different layouts, fonts, and formatting conventions.

The solution is a Python CLI tool that accepts any PDF, renders each page to a high-resolution image using PyMuPDF, sends the images to OpenAI's GPT-4o Vision model with a structured extraction prompt, parses the JSON response, and upserts the results into a SQLite database. It works across different company formats, handles multi-page PDFs, and maintains a complete audit trail of every extraction by source file.

Key Deliverables

Python CLI script accepting any PDF path as input with optional company name override
PyMuPDF page renderer producing 2× resolution PNG images for accurate OCR via Vision API
Structured GPT-4o Vision prompt returning clean JSON with company name, revenue, and operating profit
SQLite database with four fields: company name, year, metric, and source PDF filename
Upsert logic preventing duplicate entries on re-run of the same PDF
Terminal output summary confirming extracted values and database write confirmation

Challenges & Solutions

Income statements across different companies use wildly different labels for the same line items — "Net Revenue", "Total Net Sales", "Revenues", and "Total Revenue" all mean the same thing. The GPT-4o prompt was engineered to handle this semantic ambiguity by instructing the model to identify the correct line item by meaning rather than by exact label, dramatically reducing missed extractions compared to traditional regex or keyword-matching approaches.

Some PDFs returned markdown-wrapped JSON from the Vision API rather than clean JSON. A pre-processing step strips any code fences before parsing, making the pipeline robust to model output variations without needing to pin to a specific model version.

Results & Impact

8 hrsWeekly manual data entry eliminated

<30 secPer PDF extraction time

AnyCompany format — no template required