Case studies

Decision stories, delivery evidence, and failure recoveries.

These aren't polished testimonials. They're engineering narratives from real consulting engagements - what the starting state looked like, what constraints made the decision hard, which alternatives got rejected and why, and what actually happened when the solution hit production.

Every case study below comes from production consulting work. The company names are real where we have permission. The numbers are real. The constraints were real - budget ceilings, team skill gaps, compliance deadlines, legacy systems nobody wanted to touch.

What you won't find here: "we used AWS and it was great." What you will find: what didn't work, the options we evaluated and rejected, the trade-offs we accepted, and the things that broke during rollout that nobody predicted. That's where the actual learning lives.

Migration

KFC Thailand: Serverless Migration for 5M+ Orders/Month

Starting state

KFC Thailand's delivery platform ran on a PHP monolith. During peak ordering hours - lunch and dinner rushes - the system buckled. Response times spiked above 2 seconds. Orders dropped. The ops team was manually scaling EC2 instances and praying.

The platform was handling Thailand's largest QSR delivery volume. This wasn't a startup experimenting with cloud - it was a production system serving millions of real customers who expected their food orders to go through.

Constraints

Zero tolerance for downtime during migration. The business couldn't afford a maintenance window - orders flow 18 hours a day. The dev team had PHP experience but limited AWS/serverless knowledge. Budget was fixed, and the existing EC2 bill was already higher than leadership wanted.

What we evaluated and rejected

Rewriting the whole platform in a new framework and doing a big-bang cutover - rejected because of the downtime risk and the 6+ month timeline. Containerizing the PHP monolith on ECS as-is - rejected because it wouldn't solve the scaling problem, just the deployment problem. The monolith's database queries were the bottleneck, not the compute.

What we built

Decomposed the monolith into serverless APIs using Node.js and Go. Order processing went through Kafka on ECS/EKS for reliability. We used a strangler fig pattern - new features went to serverless, old features migrated one at a time. Each migration was a separate deploy with its own rollback path. Blue-green deployments meant every cutover was reversible.
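The strangler fig routing can be sketched as a prefix-based router: migrated paths go to the new serverless services, everything else still hits the monolith, and each migration is just one more prefix added to the set. The route names and origins below are hypothetical illustrations, not KFC Thailand's actual API.

```python
# Illustrative strangler-fig router: paths already migrated to the new
# serverless services are proxied there; everything else still goes to
# the PHP monolith. Routes and origins are hypothetical.

MIGRATED_PREFIXES = {
    "/orders",   # order placement moved to the Node.js service
    "/menu",     # menu reads moved to the Go service
}

MONOLITH_ORIGIN = "https://legacy.example.internal"
SERVERLESS_ORIGIN = "https://api.example.internal"

def route(path: str) -> str:
    """Return the upstream origin for an incoming request path."""
    # Longest-prefix match so "/orders/history" can later be split out
    # from "/orders" without ambiguity.
    for prefix in sorted(MIGRATED_PREFIXES, key=len, reverse=True):
        if path.startswith(prefix):
            return SERVERLESS_ORIGIN
    return MONOLITH_ORIGIN
```

Rolling a migration back is then a one-line change: remove the prefix from the set and traffic flows to the monolith again.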

Outcome

5M+ orders/month processing at sub-200ms latency. 30% reduction in AWS spend compared to the old EC2 fleet. Zero-downtime releases became the default. The team went from dreading Friday deploys to shipping features multiple times per week.

The thing that surprised us: the biggest cost saving wasn't from Lambda vs EC2. It was from eliminating the over-provisioned RDS instances that were running 24/7 to handle peak load that only lasted 4 hours per day.
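The arithmetic behind that surprise is worth spelling out. A database sized for a 4-hour daily peak but running 24/7 pays the peak rate for all 24 hours; right-sizing the baseline and scaling up only for the peak window changes the picture dramatically. The hourly rates below are illustrative, not actual RDS pricing.

```python
# Back-of-the-envelope comparison: an instance sized for a 4-hour daily
# peak but running 24/7 vs. a right-sized baseline that scales up only
# for the peak window. Rates are illustrative, not real RDS pricing.

PEAK_RATE = 4.00   # $/hr for the peak-sized instance
BASE_RATE = 1.00   # $/hr for an instance sized to off-peak load

def monthly_cost_always_peak(days: int = 30) -> float:
    return PEAK_RATE * 24 * days

def monthly_cost_scaled(peak_hours: int = 4, days: int = 30) -> float:
    # Pay the peak rate only during the rush, the base rate otherwise.
    return (PEAK_RATE * peak_hours + BASE_RATE * (24 - peak_hours)) * days
```

With these illustrative numbers, the always-peak fleet costs $2,880/month against $1,080 for the scaled version: the over-provisioned idle hours, not the peak itself, dominate the bill.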

SaaS Platform

ProdigyBuild: Serverless SaaS with 99.9% Availability

Starting state

ProdigyBuild needed a SaaS platform built from scratch. The founding team had a working prototype on Heroku. It worked for demos but couldn't handle the load or reliability requirements their enterprise customers demanded. They needed 99.9% uptime in the contract.

Constraints

Small team - three engineers including the CTO. Limited runway, so infrastructure costs had to stay low. Enterprise customers required SOC2-ready architecture from day one, not as a retrofit later. The team knew Python well but had limited AWS experience.

What we built

Fully serverless stack: Lambda functions in Python, API Gateway for the REST layer, DynamoDB for the primary datastore, SQS for background task queues (replaced Celery workers). CloudWatch alarms with PagerDuty integration for on-call. Infrastructure as code in CDK so the whole stack was reproducible.

The key decision: DynamoDB over Aurora. The data model was mostly key-value lookups with a few GSI-based queries. Aurora would have been easier for the team (they knew SQL), but DynamoDB gave us single-digit millisecond reads without capacity planning headaches. The trade-off was query flexibility - some reports required export-to-S3-and-Athena workflows instead of simple SQL joins.
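The "mostly key-value lookups" shape is easiest to see in the key design itself. A sketch of the kind of composite keys involved, with hypothetical entity names standing in for ProdigyBuild's actual data model:

```python
# Sketch of the access pattern that made DynamoDB the better fit.
# Entity names (tenant, project) are hypothetical stand-ins for the
# real data model.

def item_key(tenant_id: str, project_id: str) -> dict:
    # Composite primary key: partition on tenant, sort on project, so
    # "all projects for a tenant" is a single Query, not a table scan.
    return {"pk": f"TENANT#{tenant_id}", "sk": f"PROJECT#{project_id}"}

def gsi1_key(status: str, updated_at: str) -> dict:
    # A GSI keyed on status serves the handful of non-key lookups,
    # e.g. "all active projects, newest first", as one indexed query.
    return {"gsi1pk": f"STATUS#{status}", "gsi1sk": updated_at}
```

Anything that doesn't fit one of these precomputed key shapes is exactly what fell into the export-to-S3-and-Athena bucket.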

Outcome

99.9% availability over the first 12 months. 40% lower infrastructure costs compared to the original EC2/RDS architecture estimate. Dev velocity improved 60% - the team went from 3-week deploy cycles to shipping daily. The SaaS platform onboarded its first 10 enterprise customers within 6 months of launch.

What we'd do differently: start with a proper observability stack from day one. We added structured logging and distributed tracing 3 months in, and the first 3 months of debugging was painful without it.
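The "proper observability from day one" lesson starts smaller than a tracing stack: one JSON object per log line with a request id attached, so events can be correlated across Lambda invocations. A minimal sketch, with illustrative field names:

```python
# Minimal structured-logging sketch: one JSON object per line, with a
# request id so events can be correlated across Lambda invocations.
# Field names are illustrative.
import json
import time

def log_event(request_id: str, level: str, message: str, **fields) -> str:
    record = {
        "ts": time.time(),
        "level": level,
        "request_id": request_id,
        "message": message,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)  # CloudWatch Logs captures stdout line by line
    return line
```

Because each line is valid JSON, CloudWatch Logs Insights can filter and aggregate on any field without regex gymnastics, which is most of what made the first three months of plain-text debugging painful.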

Event Processing

AppGambit: Real-Time Event Processing at 1M+ Requests/Minute

Starting state

AppGambit had an event processing platform that was hitting its ceiling. The existing architecture used a combination of EC2 instances and cron jobs to process incoming events. It worked at moderate scale but started dropping events above 500K requests/minute. Audit compliance required zero data loss.

What we built

Cloud-native microservices on ECS, EKS, and Fargate. Event ingestion through API Gateway to Kinesis Data Streams. Processing via Lambda consumers for lightweight transforms and ECS tasks for heavy computation. Elasticsearch (now OpenSearch) for real-time querying. S3 for long-term event storage with lifecycle policies.

The architecture had to handle bursty traffic patterns - quiet for hours, then millions of events in minutes. Kinesis sharding with auto-scaling Lambda consumers handled the burst. ECS tasks handled the sustained compute. The separation mattered because Lambda's 15-minute timeout was too short for some processing jobs.

Outcome

1M+ requests/minute throughput. Sub-100ms query latency on OpenSearch. 99.99% uptime across the stack. Audit compliance maintained with zero data loss through Kinesis's built-in retention and S3 archival.

Cost was higher than expected initially - Kinesis shards aren't cheap at this scale. We optimized by batching events before writing to streams and using reserved capacity for the predictable baseline load. That brought the monthly Kinesis bill down by about 40%.
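The batching logic is simple but has to respect Kinesis's PutRecords limits: at most 500 records and 5 MB per call. A sketch of the kind of batcher involved, flushing on whichever limit is hit first (the byte accounting here is a simplification that ignores partition-key overhead):

```python
# Batch events ahead of PutRecords. Kinesis caps each PutRecords call
# at 500 records and 5 MB total, so flush on whichever limit hits
# first. Byte accounting ignores partition-key overhead for brevity.

MAX_RECORDS = 500
MAX_BYTES = 5 * 1024 * 1024

def batch_records(records: list) -> list:
    batches, current, size = [], [], 0
    for rec in records:
        over_count = len(current) >= MAX_RECORDS
        over_bytes = size + len(rec) > MAX_BYTES
        if current and (over_count or over_bytes):
            batches.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        batches.append(current)
    return batches
```

Fewer, fuller calls also reduce per-request overhead on the producers, which mattered during the burst windows.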

Enterprise

ASTM International: Enterprise Cloud Architecture with SOC2 Compliance

Starting state

ASTM International is a global standards organization. Their cloud infrastructure had grown organically - different teams using different AWS accounts with inconsistent configurations. Security controls varied by team. Compliance audits were manual, time-consuming, and stressful. Every new project required weeks of infrastructure setup because nothing was standardized.

Constraints

SOC2 compliance was non-negotiable. Multiple development teams needed to work independently without stepping on each other. The organization had invested in specific tools and workflows that couldn't be replaced overnight. Budget existed but had to be justified against measurable efficiency gains.

What we built

Standardized cloud architectures using Terraform and CloudFormation IaC templates. Secure VPC configurations with consistent subnet layouts, security group patterns, and network ACLs across all accounts. AWS Organizations with Service Control Policies for guardrails. Config rules for continuous compliance monitoring instead of quarterly manual audits.

The key insight: don't try to centralize everything. Give teams their own accounts with guardrails, not a shared account carved up with IAM permissions. The overhead of account management is lower than the overhead of IAM policy conflicts in a shared environment.
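A concrete example of the guardrail style we mean, expressed here as a Python dict for readability: a Service Control Policy that denies activity outside approved regions, attached at the OU level so each team's account stays independent inside the fence. The region list and statement are illustrative, not ASTM's actual policy.

```python
# Illustrative Service Control Policy: deny activity outside approved
# regions, carving out global services that have no region. Attached at
# the OU level in AWS Organizations. Region list is illustrative.
import json

REGION_GUARDRAIL = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyOutsideApprovedRegions",
        "Effect": "Deny",
        # Global services are exempted; everything else is fenced in.
        "NotAction": ["iam:*", "organizations:*", "support:*"],
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {
                "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
            }
        },
    }],
}

policy_document = json.dumps(REGION_GUARDRAIL)
```

Because SCPs are deny-by-exception rather than per-team IAM grants, adding a new team account means attaching it to the OU, not negotiating a new permission set.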

Outcome

70% reduction in manual infrastructure setup time. 20% reduction in overall cloud spend through right-sizing and eliminating orphaned resources. 100% SOC2 compliance achieved and maintained through automated controls. New project environments now spin up in hours, not weeks.

AI/ML

Enterprise GenAI: Production RAG Pipeline on AWS Bedrock

Starting state

An enterprise client (name withheld) had a knowledge base of 2M+ documents - internal policies, technical specifications, compliance records, historical decisions. Employees spent hours searching through documents to find answers. The existing keyword search was useless for conceptual queries like "what's our policy on X when Y happens?"

Constraints

Data couldn't leave the organization's AWS account. No external LLM APIs - everything had to run within their VPC. Cost was a concern because naive RAG implementations can burn through Claude/GPT credits fast at this document volume. The system needed to handle 500+ queries per day from non-technical users.

What we built

Production RAG pipeline on AWS Bedrock with Claude as the generation model. Vector embeddings stored in OpenSearch Serverless (cheaper than running a dedicated OpenSearch cluster for this workload). Document ingestion pipeline: S3 upload triggers Lambda, which chunks documents, generates embeddings, and stores them in OpenSearch. Semantic caching via ElastiCache - if a similar question was asked recently, serve the cached answer instead of calling Bedrock again.

The semantic caching was the key cost optimization. Without it, Bedrock API costs would have been $3K+/month. With caching hitting about 40% of queries, we brought it down to around $1.2K/month. The cache hit rate improved over time as more queries populated the cache.
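Mechanically, the semantic cache is a nearest-neighbour check over cached query embeddings: embed the incoming question, compare by cosine similarity, and serve the cached answer above a threshold instead of calling Bedrock. A minimal sketch, with an illustrative 0.92 cutoff (the real threshold was tuned against actual query pairs):

```python
# Semantic-cache sketch: serve a cached answer when a prior query's
# embedding is close enough, otherwise fall through to the model call.
# The 0.92 threshold is illustrative.
import math

THRESHOLD = 0.92

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(query_vec, cache):
    """cache: list of (embedding, answer). Returns a cached answer or None."""
    best = max(cache, key=lambda e: cosine(query_vec, e[0]), default=None)
    if best and cosine(query_vec, best[0]) >= THRESHOLD:
        return best[1]
    return None  # miss: call Bedrock, then store (embedding, answer)
```

The threshold is the whole game: too low and users get answers to someone else's question, too high and the hit rate (and the savings) collapses.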

Outcome

2M+ documents searchable with natural language queries. Sub-second retrieval for cached queries, 3-5 seconds for uncached. 60% reduction in LLM API costs through semantic caching. Employee satisfaction with internal search went from "we use Google instead" to "this actually finds what I need."

What we learned: the hardest part wasn't the LLM or the vector store. It was document chunking. Bad chunks produce bad retrieval, and bad retrieval produces hallucinated answers. We spent more time tuning chunk size and overlap than on any other part of the system.
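For a sense of what that tuning operates on, a minimal sliding-window chunker: fixed chunk size with overlap, so a sentence split at a boundary still appears whole in the neighbouring chunk. Sizes are illustrative, and the production pipeline split on paragraph and section boundaries rather than raw characters.

```python
# Minimal sliding-window chunker: fixed size with overlap so content at
# a boundary appears whole in the neighbouring chunk. Character-based
# for brevity; real chunking respected paragraph/section boundaries.

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Too-small chunks strip away the context retrieval needs; too-large chunks dilute the embedding and drag irrelevant text into the prompt. Both failure modes surface downstream as hallucinated answers, which is why this knob ate more tuning time than anything else.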

FinOps

Multi-Client FinOps: $100K+ Annual Savings Across 3 Enterprises

Starting state

Three enterprise clients, similar problem: AWS bills growing 30-50% year-over-year while traffic grew 10-15%. The bills weren't wrong - the architectures were just wasteful. Over-provisioned RDS instances running 24/7 for workloads that peaked 4 hours per day. NAT Gateways in every AZ when one would have been fine. S3 buckets with no lifecycle policies accumulating years of old data. Dev environments running production-sized instances.

What we did

Architecture-level cost audits using Cost Explorer, Trusted Advisor, and manual architecture review. The manual review was the most valuable part - automated tools catch unused resources but miss architectural inefficiencies. We mapped every significant cost line item back to the architecture decision that created it, then ranked fixes by savings-to-effort ratio.

Common fixes across all three clients: right-sizing RDS instances (the single biggest win every time), implementing S3 Intelligent-Tiering and lifecycle policies, consolidating NAT Gateways, scheduling dev/staging environments to shut down outside business hours, and buying Savings Plans for the predictable baseline compute.
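The dev/staging scheduling fix is worth quantifying, since it's pure arithmetic: always-on is 168 hours a week, business hours are far fewer. The hours below are illustrative, not a specific client's schedule.

```python
# The scheduling win in numbers: dev/staging running 24/7 vs. only
# business hours on weekdays. Hours are illustrative.

HOURS_PER_WEEK = 24 * 7           # always-on: 168 h
BUSINESS_HOURS_PER_WEEK = 12 * 5  # 12 h/day, weekdays only: 60 h

def weekly_savings_fraction() -> float:
    """Fraction of the environment's compute bill eliminated."""
    return 1 - BUSINESS_HOURS_PER_WEEK / HOURS_PER_WEEK
```

With these assumptions the environments run 60 of 168 hours, roughly a 64% cut on those instances' compute, for the cost of a start/stop schedule.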

The non-obvious fix: in two cases, the biggest cost driver wasn't compute or storage. It was data transfer. Cross-AZ traffic charges from microservices calling each other across availability zones. Colocating chatty services in the same AZ (with cross-AZ replicas for failover) cut data transfer costs by 60%.

Outcome

$100K+ combined annual savings across the three clients. Average cost reduction: 55%. The FinOps dashboards we built in QuickSight gave each team visibility into their own spend - which turned cost optimization from a one-time project into an ongoing practice. Two of the three clients haven't needed another cost audit since.

DevOps

CI/CD Transformation: From Manual Deploys to 10-Minute Cycles

Starting state

A 25-person engineering team deploying through a mix of manual SSH sessions, a shared Jenkins server, and "it works on my machine" as the primary QA strategy. Deploys took half a day. Rollbacks meant restoring an AMI snapshot and hoping the database migration was reversible. It wasn't always reversible. The team had one incident where a bad deploy took 4 hours to roll back because the database migration had already run.

What we built

GitHub Actions with composite workflows for build, test, and deploy. Terraform IaC for all infrastructure - no more clicking around in the console. Blue-green deployments on ECS with automatic rollback on health check failures. Database migrations separated from application deploys so they could be tested and rolled back independently.

The critical decision: separating database migrations from application deployments. Most CI/CD tutorials bundle them together. That's fine until a migration fails and you can't roll back the application because the database schema has already changed. We run migrations as a separate pipeline step with its own approval gate and rollback procedure.
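The ordering constraint can be sketched as two independent pipeline steps, where the migration runs first behind its own gate and must be backward-compatible (the "expand" half of expand-contract), so the currently deployed app keeps working whether or not the app deploy that follows succeeds. Step names and metadata are illustrative, not our actual pipeline config.

```python
# Sketch of the enforced ordering: the migration is its own gated step
# with its own rollback, separate from the app deploy. Names are
# illustrative, not a real pipeline definition.

PIPELINE = [
    ("migrate-db", {"approval_gate": True,
                    "rollback": "run down-migration"}),
    ("deploy-app", {"approval_gate": False,
                    "rollback": "swap target groups"}),
]

def run(pipeline, execute):
    """Run steps in order; return the first failing step's name, or None."""
    for name, meta in pipeline:
        if not execute(name, meta):
            return name  # caller triggers meta["rollback"] for this step
    return None
```

The payoff is that a failed app deploy rolls back with a target-group swap in seconds, while the schema change, already applied and compatible with the old code, stays put.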

Outcome

Deploy cycles under 10 minutes end-to-end. 95%+ test coverage enforced by the pipeline (PRs don't merge without it). Rollback capability under 30 seconds via blue-green target group swap. The team went from dreading deploys to treating them as routine. Friday deploys stopped being a joke - they actually do them now.

Security

ProtectOnce: Automated API Security with Sub-5s Threat Detection

Starting state

ProtectOnce needed to build an automated API security scanning platform. The existing process was manual - security engineers running tools against API endpoints one at a time. Threat detection was measured in hours, not seconds. By the time a vulnerability was found, the window for exploitation had been open for days.

What we built

Python/FastAPI backend services for automated API security scanning. React dashboard for security teams to monitor scan results and manage policies. Event-driven alerting via SNS/SQS - when a scan detects a vulnerability, the security team gets notified in under 5 seconds. Scan results stored in DynamoDB with automatic classification by severity.

The integration pipeline was the hard part. Every new API endpoint needed to be automatically discovered and added to the scan rotation. We built an API gateway integration that detected new routes and scheduled initial scans within minutes of deployment.
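At its core the discovery loop is a set diff: compare the routes the gateway currently exposes against what's already in the scan rotation, and queue initial scans for anything new. A sketch with hypothetical route values:

```python
# Discovery-loop sketch: diff the gateway's current routes against the
# scan rotation and queue initial scans for anything new. Route values
# are hypothetical.

def new_routes(gateway_routes: set, scan_rotation: set) -> set:
    return gateway_routes - scan_rotation

def schedule_initial_scans(gateway_routes, scan_rotation, enqueue):
    for route in sorted(new_routes(gateway_routes, scan_rotation)):
        enqueue(route)            # e.g. publish to the SQS scan queue
        scan_rotation.add(route)  # future diffs skip this route
```

Run on a schedule (or triggered by deploy events), this is what closes the gap between "endpoint shipped" and "endpoint scanned" to minutes.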

Outcome

Threat detection time dropped from hours to under 5 seconds. 100% API coverage - every endpoint is scanned automatically. 25% improvement in CI/CD efficiency because security checks moved from a manual gate to an automated pipeline step. The security team went from being a bottleneck to being a monitoring function.

Facing a similar decision?

If you're working through an architecture decision, a migration, a cost problem, or an AI infrastructure challenge and want someone who's done it before to pressure-test your approach - that's what InfraTales consulting is for. Not implementation capacity. Judgment, trade-off clarity, and production-first architecture support.

Book a 30-minute call or email hello@rahulladumor.com with your system context.

Published case studies

Deep dives from the blog.

Blog-format case studies are in progress. The consulting case studies above cover the same ground with real metrics and outcomes.