
Safe CSV Ingestion into PostgreSQL: A Multi-Tenant ETL Pipeline Pattern
When building a SaaS where users can upload arbitrary CSV files for analysis, the trickiest problem is that you don't know the schema ahead of time. A conventional RDBMS requires you to define column names and types before creating a table, but a user's CSV might have 10 columns or 100, and its headers might be 売上金額 ("revenue amount" in Japanese) or revenue_amount, with spaces or symbols mixed in. Here's the ETL pipeline I recently implemented for a CSV analytics platform, and the patterns I learned along the way.

Overall Flow

S3 (uploaded CSV)
  ↓ Parser: type inference + column name normalization
  ↓ Staging table (dynamically created)
  ↓ DWH table (per-company schema)

Stack: AWS Cognito for auth, S3 for storage, RDS PostgreSQL as the database (accessed asynchronously via SQLAlchemy + asyncpg), and FastAPI for the backend. The ETL is triggered via POST /api/etl/{upload_id}/run.

Pattern 1: Column Name Normalization

User CSV headers can be anything: 売上金額, Revenue (JPY), names with embedded spaces, empty strings, even duplicates. Using these directly as SQL column names invites injection risks and syntax errors.
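A normalization step like the one described might look as follows. This is a minimal sketch, not the article's actual code: the function name, the underscore-based rewriting, the `col_N` fallback for non-ASCII or empty headers, and the numeric de-duplication suffixes are all my assumptions.

```python
import re
import unicodedata

def normalize_columns(headers):
    """Turn arbitrary CSV headers into safe PostgreSQL identifiers.

    Hypothetical sketch: NFKC-normalize and lowercase, replace anything
    outside [0-9a-z_] with underscores, fall back to a positional name
    (col_1, col_2, ...) for empty or fully non-ASCII headers, and
    de-duplicate collisions with numeric suffixes.
    """
    seen = {}
    result = []
    for i, raw in enumerate(headers):
        name = unicodedata.normalize("NFKC", raw or "").strip().lower()
        # Collapse every run of unsafe characters into a single underscore.
        name = re.sub(r"[^0-9a-z_]+", "_", name).strip("_")
        if not name or name[0].isdigit():
            # Empty after cleaning (e.g. a Japanese header) or starts with a digit.
            name = f"col_{i + 1}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        if count:
            name = f"{name}_{count + 1}"  # revenue, revenue_2, revenue_3, ...
        result.append(name[:63])  # PostgreSQL truncates identifiers at 63 bytes
    return result
```

A run over the kinds of headers mentioned above, e.g. `normalize_columns(["売上金額", "Revenue (JPY)", "Revenue (JPY)", ""])`, would yield positional fallbacks for the Japanese and empty headers and a suffixed duplicate for the repeated one.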
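The flow above also creates the staging table dynamically from the inferred schema. Here's a minimal sketch of building that DDL safely; the function name, the type whitelist, and the quoting helper are my assumptions, not the article's code. The key ideas are double-quoting identifiers (with embedded quotes doubled) and whitelisting the inferred types so no raw user input is ever interpolated into SQL.

```python
def build_staging_ddl(table_name, columns):
    """Build a CREATE TABLE statement for a dynamically shaped staging table.

    Hypothetical sketch: `columns` maps already-normalized column names to
    inferred PostgreSQL types. Identifiers are double-quoted with embedded
    quotes doubled, so even a hostile name cannot break out of the identifier.
    """
    def quote_ident(name):
        return '"' + name.replace('"', '""') + '"'

    allowed_types = {"text", "bigint", "double precision", "boolean", "timestamptz"}
    cols = []
    for name, pg_type in columns.items():
        if pg_type not in allowed_types:
            pg_type = "text"  # whitelist types; never interpolate raw user input
        cols.append(f"{quote_ident(name)} {pg_type}")
    return f"CREATE TABLE {quote_ident(table_name)} ({', '.join(cols)})"
```

The resulting string can then be executed through the async connection (e.g. `await conn.execute(ddl)` with asyncpg); identifiers cannot be bound as query parameters in PostgreSQL, which is why the quoting has to happen at string-building time.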



