Overview
An e-commerce business was operating with data spread across five systems: a Shopify store, a third-party fulfilment API, a returns management platform, a Google Sheets inventory tracker, and a legacy on-premises order database. Each morning, someone on the ops team manually exported CSVs from each system, reformatted them, and loaded them into a shared SQL Server database. The process took 90 minutes and was prone to missed runs, format changes, and load failures with no alerting.
I designed and built an end-to-end Python ETL pipeline that connected to all five sources, applied a consistent transformation layer, performed data quality checks, and loaded incrementally into a centralised SQL Server warehouse — on a 15-minute schedule via Windows Task Scheduler, with email alerting on failure.
Key Deliverables
- Python ETL modules for each source: REST API connectors (Shopify, fulfilment), Google Sheets reader, ODBC connector for legacy DB
- Centralised transformation layer normalising date formats, currency, product SKUs, and customer IDs across all five sources
- Incremental load logic using watermark timestamps to avoid full reloads and minimise runtime
- Data quality checks validating row counts, null rates, and referential integrity before any load commits
- Automated email alerting on pipeline failure with error context and affected source
- Full load audit log table recording run times, row counts, and status for every execution
Challenges & Solutions
The Shopify and fulfilment APIs had rate limits and occasionally returned paginated responses with inconsistent structures depending on the endpoint version. I implemented exponential backoff retry logic and a response schema validator that caught breaking API changes and raised a descriptive alert rather than silently loading malformed data.
The legacy on-premises database had no timestamp columns on several key tables, making incremental detection impossible using a standard watermark approach. The solution was a hash-based change detection layer that computed a row fingerprint on each extract and compared it to the previous load — loading only rows whose fingerprint had changed.