ETL · PythonFinancial Services

AI-Powered PDF Extraction

Eliminated 8+ hours of weekly manual data entry by using GPT-4o Vision to read and structure financial metrics from scanned income statements — any company, any layout, with a full audit trail.

PythonOpenAI GPT-4oPyMuPDFSQLiteOCRAutomation

Pipeline Screenshot

Replace this placeholder with your actual tool output or architecture screenshot

Overview

A financial analyst at an investment firm was manually reading through hundreds of scanned annual reports each quarter, typing Revenue and Operating Profit figures into a tracking spreadsheet. Each report took 5–10 minutes to process by hand — totalling over 8 hours of repetitive work per week. The reports came from dozens of different companies with completely different layouts, fonts, and formatting conventions.

The solution is a Python CLI tool that accepts any PDF, renders each page to a high-resolution image using PyMuPDF, sends the images to OpenAI's GPT-4o Vision model with a structured extraction prompt, parses the JSON response, and upserts the results into a SQLite database. It works across different company formats, handles multi-page PDFs, and maintains a complete audit trail of every extraction by source file.

Key Deliverables

  • Python CLI script accepting any PDF path as input with optional company name override
  • PyMuPDF page renderer producing 2× resolution PNG images for accurate OCR via Vision API
  • Structured GPT-4o Vision prompt returning clean JSON with company name, revenue, and operating profit
  • SQLite database with four fields: company name, year, metric, and source PDF filename
  • Upsert logic preventing duplicate entries on re-run of the same PDF
  • Terminal output summary confirming extracted values and database write confirmation

Challenges & Solutions

Income statements across different companies use wildly different labels for the same line items — "Net Revenue", "Total Net Sales", "Revenues", and "Total Revenue" all mean the same thing. The GPT-4o prompt was engineered to handle this semantic ambiguity by instructing the model to identify the correct line item by meaning rather than by exact label, dramatically reducing missed extractions compared to traditional regex or keyword-matching approaches.

Some PDFs returned markdown-wrapped JSON from the Vision API rather than clean JSON. A pre-processing step strips any code fences before parsing, making the pipeline robust to model output variations without needing to pin to a specific model version.

Results & Impact

8 hrsWeekly manual data entry eliminated
<30 secPer PDF extraction time
AnyCompany format — no template required
PreviousMulti-Source Data Pipeline NextRetail Star Schema Design