Transition from a monolithic web app with static CSV files to a fully automated SAP → S3 → Glue → Lambda → DynamoDB pipeline with Microsoft 365 SSO.
The current application is a FastAPI + React monolith that loads static CSV files at startup, processes uploaded SAP exports in memory with Pandas, and outputs color-coded 192-column Excel files for customs brokers. Authentication uses JWT + bcrypt with a JSON file as the user store.
The table lists every dimension that needs to change to reach the target architecture, with the current state next to the final state that replaces it.
| Dimension | Current State | Final State |
|---|---|---|
| Data Source | Static CSVs on disk, manually updated | SAP China sends JSON daily to S3 |
| Data Storage | Flat files on app server | S3 (raw + processed Parquet) + DynamoDB lookups |
| Data Pipeline | None (data loaded at app startup) | AWS Glue ETL, daily scheduled PySpark transforms |
| Fuzzy Matching | None (exact material match only) | PySpark fuzzy matching for customer name discrepancies |
| Record Volume | ~75 products | 6,000–10,000 records per daily batch |
| Authentication | Custom JWT + bcrypt + users.json | Microsoft Entra ID SSO (M365 app launcher) |
| Hosting | Railway / Render / Lightsail | AWS (Lambda, S3, DynamoDB, Glue, CloudFront) |
| Cold Storage | None | S3 Glacier lifecycle (Instant → Flexible → Deep Archive) |
| Monitoring | Console logs only | CloudWatch dashboards, SNS alerts, WAF |
The target architecture has five layers, from SAP ingestion through to the user-facing application and 7-year audit archival.
Phases overlap where dependencies allow. The critical path runs SAP schema → Glue ETL → DynamoDB → Lambda → API Gateway → Frontend.
The full task list and deliverables for each phase follow.

Phase 0

| ID | Task | Deliverable |
|---|---|---|
| 0.1 | Set up AWS Organization, accounts (dev/staging/prod) | AWS account structure |
| 0.2 | Terraform / CloudFormation IaC repository | Infrastructure as Code repo |
| 0.3 | Create S3 buckets (raw, processed, audit, frontend) with policies | Buckets with versioning, encryption |
| 0.4 | Create DynamoDB tables with GSIs | 5 tables: product, HTS, tariff, customer, PO |
| 0.5 | Set up IAM roles and policies (Glue, Lambda, S3) | Least-privilege IAM |
| 0.6 | Set up CloudWatch log groups, SNS topics | Monitoring baseline |
| 0.7 | Set up CI/CD pipeline (GitHub Actions or CodePipeline) | Automated deploy pipeline |

Phase 1

| ID | Task | Deliverable |
|---|---|---|
| 1.1 | Define JSON schema contracts with SAP team (products, HTS, tariffs) | Schema documentation |
| 1.2 | SAP team builds daily JSON export to S3 raw landing zone | SAP integration endpoint |
| 1.3 | Build Lambda ingestion validator (schema check, dedup, trigger Glue) | Lambda function |
| 1.4 | Build AWS Glue ETL job (PySpark): clean, normalize, validate 6K–10K records | Glue job |
| 1.5 | Add fuzzy matching for customer name discrepancies (python-Levenshtein) | Fuzzy match module in Glue |
| 1.6 | Add quality checks (missing HTS, zero weights, invalid COO, cross-ref) | Quality report output |
| 1.7 | Write Parquet output to S3 processed zone (partitioned by date) | Parquet files in S3 |
| 1.8 | Populate DynamoDB lookup tables from processed Parquet | DynamoDB populated |
| 1.9 | Build data quality dashboard / reporting | Quality monitoring |
| 1.10 | Backfill: migrate current CSV master data into DynamoDB as baseline | Initial data load verified |
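
Task 1.5 can be illustrated with a minimal sketch of edit-distance matching for customer names. It is written in plain Python so it runs on its own; inside the Glue job the same comparison could come from the python-Levenshtein package named above or from PySpark's built-in `levenshtein` column function. The function names, the normalization rules, and the 0.2 threshold are assumptions for illustration, not part of the plan.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(cleaned.split())


def best_match(name: str, known: list[str], max_ratio: float = 0.2) -> Optional[str]:
    """Return the closest known name, or None if nothing is close enough.

    max_ratio is a hypothetical threshold: edit distance divided by the
    longer normalized length must not exceed it.
    """
    target = normalize(name)
    scored = []
    for candidate in known:
        cand = normalize(candidate)
        longest = max(len(target), len(cand)) or 1
        scored.append((levenshtein(target, cand) / longest, candidate))
    if not scored:
        return None
    ratio, candidate = min(scored, key=lambda pair: pair[0])
    return candidate if ratio <= max_ratio else None


# Hypothetical master list; real names would come from the customer table.
KNOWN_CUSTOMERS = ["Acme Trading Co., Ltd.", "Globex Corporation"]
match = best_match("ACME Tradng Co Ltd", KNOWN_CUSTOMERS)
```

Returning `None` below the threshold, rather than always taking the nearest name, lets unmatched customers flow into the quality report from task 1.6 instead of being silently assigned.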

Phase 2

| ID | Task | Deliverable |
|---|---|---|
| 2.1 | Refactor data_loader.py: replace CSV reads with DynamoDB queries | DynamoDB data layer |
| 2.2 | Refactor processor.py: keep logic, swap data source to DynamoDB | Processor using DynamoDB |
| 2.3 | Package processing code as Lambda function (API Gateway trigger) | Lambda: process |
| 2.4 | Package Excel download as Lambda (presigned S3 URL for output) | Lambda: download |
| 2.5 | Package audit storage as Lambda layer (reuse existing S3 code) | Lambda: audit |
| 2.6 | Set up API Gateway with routes matching current API surface | API Gateway configured |
| 2.7 | Implement S3 lifecycle policies for audit (Standard → Glacier tiers) | Lifecycle rules active |
| 2.8 | Integration testing: end-to-end with DynamoDB data | Test suite passing |
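
For task 2.7, the sketch below builds a lifecycle configuration that steps audit objects through the three Glacier tiers. The storage class names (`GLACIER_IR`, `GLACIER`, `DEEP_ARCHIVE`) are the S3 identifiers for Glacier Instant Retrieval, Flexible Retrieval, and Deep Archive. The prefix, rule ID, and day thresholds are assumptions chosen for illustration. No expiration rule is included, because the plan does not say whether objects are deleted once the 7-year retention period ends.

```python
def build_audit_lifecycle(
    prefix: str = "audit/",        # hypothetical key prefix
    instant_days: int = 90,        # hypothetical thresholds, in days
    flexible_days: int = 365,
    deep_archive_days: int = 730,
) -> dict:
    """Build an S3 lifecycle configuration for tiered audit archival."""
    days = [instant_days, flexible_days, deep_archive_days]
    # Each tier must start strictly later than the previous one.
    if days != sorted(set(days)):
        raise ValueError("transition thresholds must be strictly increasing")
    return {
        "Rules": [
            {
                "ID": "audit-archive-tiers",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": instant_days, "StorageClass": "GLACIER_IR"},
                    {"Days": flexible_days, "StorageClass": "GLACIER"},
                    {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = build_audit_lifecycle()

# Applying it with boto3 would look roughly like this (bucket name is hypothetical):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-audit-bucket", LifecycleConfiguration=config)
```

In practice the same rule would more likely live in the Terraform or CloudFormation repository from task 0.2; the dictionary form is shown only because it makes the tier order easy to check.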

Phase 3

| ID | Task | Deliverable |
|---|---|---|
| 3.1 | Register app in Microsoft Entra ID (Client ID, Tenant ID, redirect URIs) | App registration |
| 3.2 | Configure OAuth 2.0 scopes, app roles (admin, operator) | RBAC configuration |
| 3.3 | Frontend: Replace AuthContext with MSAL.js (@azure/msal-react) | SSO login flow |
| 3.4 | Lambda: Add JWT validation middleware (Microsoft-issued tokens) | Token validation |
| 3.5 | Map Entra ID roles to existing admin/operator roles | Role mapping tested |
| 3.6 | Update audit trail to capture Entra ID user identity | Audit user fields updated |
| 3.7 | Configure M365 app launcher tile | App in waffle menu |
| 3.8 | Remove old auth/ module (JWT + bcrypt + users.json) | Dead code removed |
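
Tasks 3.4 and 3.5 could be sketched as a claims check followed by a role lookup. This sketch assumes the token signature has already been verified against the tenant's signing keys by a JWT library; it covers only the claim checks that come after that step. `aud`, `tid`, `exp`, and `roles` are standard Entra ID token claims, while the app role values in `ROLE_MAP` and the function name are hypothetical.

```python
import time
from typing import Optional

# Hypothetical Entra ID app role values mapped to the existing application roles.
ROLE_MAP = {"App.Admin": "admin", "App.Operator": "operator"}


def resolve_role(
    claims: dict,
    expected_audience: str,
    expected_tenant: str,
    now: Optional[float] = None,
) -> str:
    """Check already-verified token claims and return 'admin' or 'operator'.

    Must only be called after signature verification has succeeded.
    """
    now = time.time() if now is None else now
    if claims.get("aud") != expected_audience:
        raise PermissionError("token audience does not match this API")
    if claims.get("tid") != expected_tenant:
        raise PermissionError("token was issued by an unexpected tenant")
    if float(claims.get("exp", 0)) <= now:
        raise PermissionError("token has expired")
    granted = {ROLE_MAP[r] for r in claims.get("roles", []) if r in ROLE_MAP}
    if "admin" in granted:
        return "admin"  # admin wins when a user holds both roles
    if "operator" in granted:
        return "operator"
    raise PermissionError("token carries no recognised app role")


# Example with placeholder identifiers:
example_claims = {
    "aud": "api://example-client-id",
    "tid": "example-tenant-id",
    "exp": 2_000_000_000,
    "roles": ["App.Operator"],
}
role = resolve_role(
    example_claims, "api://example-client-id", "example-tenant-id", now=1_900_000_000
)
```

Keeping the mapping in one dictionary means the rest of the application can continue to check for `admin` and `operator` exactly as it does today.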

Phase 4

| ID | Task | Deliverable |
|---|---|---|
| 4.1 | Deploy React build to S3 + CloudFront (HTTPS, custom domain) | Frontend hosted on AWS |
| 4.2 | Update API calls to point to API Gateway endpoint | API integration verified |
| 4.3 | Add data quality alerts in UI (ETL quality report flags) | Quality indicators |
| 4.4 | Add data freshness indicator ("Last SAP sync: 2h ago") | Freshness badge in header |
| 4.5 | E2E testing with real SAP data through full pipeline | UAT complete |
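
The freshness badge in task 4.4 needs a timestamp for the last successful SAP sync and a way to turn it into a label. A minimal sketch of that formatting follows, written as backend code that an API response could carry to the frontend. The function names, the unit cut-offs, and the 26-hour staleness threshold (a daily batch plus some slack) are assumptions.

```python
from datetime import datetime, timedelta, timezone


def freshness_label(last_sync: datetime, now: datetime) -> str:
    """Format the age of the last sync, e.g. 'Last SAP sync: 2h ago'."""
    elapsed = now - last_sync
    if elapsed < timedelta(0):
        raise ValueError("last_sync is in the future")
    minutes = int(elapsed.total_seconds() // 60)
    if minutes < 60:
        age = f"{minutes}m"
    elif minutes < 48 * 60:
        age = f"{minutes // 60}h"
    else:
        age = f"{minutes // (24 * 60)}d"
    return f"Last SAP sync: {age} ago"


def is_stale(last_sync: datetime, now: datetime,
             max_age: timedelta = timedelta(hours=26)) -> bool:
    """True when the daily batch appears to have been missed (hypothetical threshold)."""
    return now - last_sync > max_age


synced = datetime(2025, 1, 1, 6, 0, tzinfo=timezone.utc)
checked = datetime(2025, 1, 1, 8, 10, tzinfo=timezone.utc)
label = freshness_label(synced, checked)
```

Passing `now` explicitly keeps both functions deterministic, which makes them straightforward to cover in the integration tests.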

Phase 5

| ID | Task | Deliverable |
|---|---|---|
| 5.1 | Load testing (6K–10K records through full pipeline) | Performance baseline documented |
| 5.2 | Security review (IAM, WAF, encryption, Entra ID audit) | Security sign-off |
| 5.3 | Disaster recovery testing (S3 cross-region replication) | DR plan validated |
| 5.4 | Runbook documentation (operations, troubleshooting, escalation) | Ops documentation |
| 5.5 | Staged rollout (pilot users → full team) | Production go-live |
| 5.6 | Decommission old Railway/Render/Lightsail deployment | Old infra shut down |
The project calls for 5–7 people across 5 roles. One person can cover more than one role, depending on team size and budget.
Estimated monthly AWS service costs after go-live, based on expected usage patterns.
| Service | Usage | Est. Monthly Cost |
|---|---|---|
| Lambda | ~10K invocations/day, 512MB, 10s avg | $15–50 |
| API Gateway | REST API, ~300K requests/month | $3–10 |
| DynamoDB | On-demand, 5 tables, ~50K reads/day | $10–30 |
| S3 (all buckets) | ~50GB raw + processed + audit | $5–15 |
| S3 Glacier | Growing archive over 7 years | $1–5 |
| Glue | 1 daily job, 2 DPU, ~5 min | $15–25 |
| CloudFront | Frontend CDN, low traffic | $1–5 |
| CloudWatch | Logs, metrics, alarms | $5–15 |
| TOTAL | | $55–155/mo |
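
As a sanity check on the Lambda row, the arithmetic behind an estimate of this kind can be sketched as follows. The two unit prices are assumptions (typical on-demand x86 rates; verify against current AWS pricing for the chosen region), the month is taken as 30 days, and the free tier is ignored.

```python
# Assumed unit prices in USD; confirm against current AWS Lambda pricing.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20


def lambda_monthly_cost(invocations_per_day: int, memory_mb: int,
                        avg_seconds: float, days: int = 30) -> float:
    """Estimate monthly Lambda cost from the usage figures in the table."""
    invocations = invocations_per_day * days
    gb_seconds = invocations * avg_seconds * (memory_mb / 1024)
    compute_cost = gb_seconds * PRICE_PER_GB_SECOND
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return round(compute_cost + request_cost, 2)


# Usage figures from the Lambda row: ~10K invocations/day, 512MB, 10s average.
estimate = lambda_monthly_cost(10_000, 512, 10)
```

Under these assumptions the result is roughly $25 per month, which sits inside the $15–50 range given in the table; nearly all of it is compute time rather than request charges.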