Multi-Source Data Pipeline

Overview

An e-commerce business was operating with data spread across five systems: a Shopify store, a third-party fulfilment API, a returns management platform, a Google Sheets inventory tracker, and a legacy on-premises order database. Each morning, someone on the ops team manually exported CSVs from each system, reformatted them, and loaded them into a shared SQL Server database. The process took 90 minutes and was prone to missed runs, format changes, and load failures with no alerting.

I designed and built an end-to-end Python ETL pipeline that connected to all five sources, applied a consistent transformation layer, performed data quality checks, and loaded incrementally into a centralised SQL Server warehouse — on a 15-minute schedule via Windows Task Scheduler, with email alerting on failure.

Key Deliverables

Python ETL modules for each source: REST API connectors (Shopify, fulfilment), Google Sheets reader, ODBC connector for legacy DB
Centralised transformation layer normalising date formats, currency, product SKUs, and customer IDs across all five sources
Incremental load logic using watermark timestamps to avoid full reloads and minimise runtime
Data quality checks validating row counts, null rates, and referential integrity before any load commits
Automated email alerting on pipeline failure with error context and affected source
Full load audit log table recording run times, row counts, and status for every execution

Challenges & Solutions

The Shopify and fulfilment APIs had rate limits and occasionally returned paginated responses with inconsistent structures depending on the endpoint version. I implemented exponential backoff retry logic and a response schema validator that caught breaking API changes and raised a descriptive alert rather than silently loading malformed data.

The legacy on-premises database had no timestamp columns on several key tables, making incremental detection impossible using a standard watermark approach. The solution was a hash-based change detection layer that computed a row fingerprint on each extract and compared it to the previous load — loading only rows whose fingerprint had changed.

Results & Impact

90 minDaily manual work eliminated

15 minData freshness (was 24 hours)

5Source systems unified into one warehouse