Cloud Sayings Architecture

LLM Model Comparison Platform

Cloud Sayings is a serverless web application designed to compare the effectiveness of different Large Language Models (LLMs). While it generates witty analogies for AWS services, the real purpose is to provide a mechanism for side-by-side model comparison, allowing users to evaluate how different models (Claude Haiku and Sonnet, GPT-4.1, etc.) perform the same task. The system is designed for high availability, low latency, and cost efficiency through intelligent caching and provider abstraction.

System Overview

Cloud Sayings leverages a serverless architecture on AWS, combining intelligent caching with multi-provider LLM integration to enable real-time comparison of model outputs. The system presents identical prompts to multiple models simultaneously, allowing users to evaluate differences in tone, accuracy, latency, and behavior across providers.
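
The snippet below is a minimal sketch of such a fan-out, assuming provider adapters that expose a generate(prompt) method (the adapter interface itself is sketched under Design Principles). The function and attribute names are illustrative, not taken from the project's code.

    # Illustrative fan-out: run the same prompt against several providers and
    # record latency so outputs can be compared side by side.
    # Adapter objects and their generate()/model_name attributes are assumptions.
    import time
    from concurrent.futures import ThreadPoolExecutor

    def compare_models(prompt, adapters):
        def run(adapter):
            start = time.monotonic()
            saying = adapter.generate(prompt)        # provider-specific call
            return {
                "model": adapter.model_name,
                "saying": saying,
                "latency_ms": round((time.monotonic() - start) * 1000),
            }
        with ThreadPoolExecutor(max_workers=len(adapters)) as pool:
            return list(pool.map(run, adapters))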

Core Components

  • Frontend: Next.js 15 static export hosted on S3/CloudFront
  • API Layer: AWS API Gateway (REST API)
  • Compute: AWS Lambda (Python 3.9) with Lambda Layers
  • Storage: DynamoDB (two tables: main_humor, saying_feedback)
  • LLM Providers: Anthropic Claude (Haiku 4.5, Sonnet 4.5) and OpenAI (GPT-4.1, GPT-4.1-mini)
  • Monitoring: CloudWatch Logs (structured JSON) and CloudWatch Metrics
  • Secrets: AWS Systems Manager Parameter Store (encrypted SecureString)
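
As a sketch of the secrets component, the snippet below shows how an LLM API key could be read from Parameter Store as a decrypted SecureString at startup; the parameter path is a placeholder, not the project's actual name.

    # Illustrative only: fetch an LLM API key from SSM Parameter Store.
    # The parameter path "/cloud-sayings/anthropic-api-key" is a placeholder.
    import boto3

    ssm = boto3.client("ssm")   # created once per execution environment, reused on warm invocations

    def get_api_key(name="/cloud-sayings/anthropic-api-key"):
        response = ssm.get_parameter(Name=name, WithDecryption=True)
        return response["Parameter"]["Value"]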

Design Principles

  1. Provider Abstraction: An adapter pattern wraps each LLM provider behind a common interface, so new providers can be added easily (see the sketch after this list)
  2. Intelligent Caching: In-memory cache reduces LLM API calls on warm Lambda invocations
  3. Graceful Degradation: Multiple fallback layers (cache → DynamoDB → error message)
  4. Observability: Structured JSON logging enables CloudWatch Logs Insights queries
  5. Cost Optimization: Cache-first approach minimizes expensive LLM API calls
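
The sketch below illustrates the adapter idea behind principle 1: each provider is wrapped in a small class with a common generate() method, so the handler never touches provider-specific SDK details. The class names, model identifier, and prompt parameters are illustrative assumptions, not the project's actual code.

    # Illustrative adapter interface: adding a provider means adding one small class.
    from abc import ABC, abstractmethod

    class LLMAdapter(ABC):
        model_name: str

        @abstractmethod
        def generate(self, prompt: str) -> str:
            """Return a single saying for the given prompt."""

    class AnthropicAdapter(LLMAdapter):
        def __init__(self, client, model_name="claude-haiku-4-5"):
            self.client = client            # an anthropic.Anthropic() client
            self.model_name = model_name    # placeholder model identifier

        def generate(self, prompt: str) -> str:
            response = self.client.messages.create(
                model=self.model_name,
                max_tokens=200,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.content[0].text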

Request Flow Architecture

The system uses a request-number-based flow rather than pure probability-based selection, optimizing for both performance and user experience.

┌─────────────────────────────────────────────────────────┐
│  1. User clicks "Get Saying" (Frontend)                │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│  2. API Gateway: GET /sayings                          │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│  3. getSaying Lambda Handler                           │
│     a. Check request number (tracked in CacheManager)  │
│     b. Select source based on request flow logic       │
└────────────────────┬────────────────────────────────────┘
                     │
        ┌────────────┴────────────┐
        │                         │
        ▼                         ▼
┌───────────────┐         ┌───────────────┐
│ Request 1-3   │         │ Request 4     │
│ DynamoDB      │         │ Live LLM      │
│ (fast)        │         │ (25% each)    │
└───────┬───────┘         └───────┬───────┘
        │                         │
        │                         ▼
        │              ┌──────────────────┐
        │              │ Request 5+       │
        │              │ Cache Lookup     │
        │              │ (from available  │
        │              │  LLMs)           │
        │              └────────┬─────────┘
        │                       │
        └───────────────────────┼───────────┐
                                │           │
                                ▼           ▼
                    ┌───────────────────────────────┐
                    │ Return Response               │
                    │ (saying, source, metrics)     │
                    └───────────┬───────────────────┘
                                │
                                ▼
                    ┌───────────────────────────────┐
                    │ Frontend Display              │
                    │ (saying + source attribution) │
                    └───────────────────────────────┘

Request Flow Logic

  1. Requests 1-3: Always served from DynamoDB (fast, <50ms)
    • Cache building happens in the background (3 sayings per LLM)
    • Purpose: Provide an immediate response while the cache warms up
  2. Request 4: Live LLM call (25% chance per model)
    • Randomly selects from: Haiku, Sonnet, GPT-4.1-mini, GPT-4.1
    • 65-second timeout; falls back to DynamoDB on timeout or failure
    • Purpose: First live LLM response; validates cache building
  3. Request 5+: Served from the in-memory cache (from LLMs with cached sayings available)
    • Randomly selects among the LLMs that have cache available
    • Falls back to DynamoDB if all caches are empty
    • Triggers a cache rebuild when 4 or fewer cached sayings remain in total
    • Purpose: Fast responses from cache with minimal LLM API calls (selection logic sketched below)
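
A condensed sketch of this selection logic, assuming a CacheManager that tracks the warm-invocation request count and holds a per-model cache; the class, its fields, and the helper methods are illustrative, while the thresholds mirror the list above.

    # Illustrative request-number-based source selection.
    # CacheManager and its attributes are assumptions, not the project's code.
    import random

    MODELS = ["haiku", "sonnet", "gpt-4.1-mini", "gpt-4.1"]

    def select_source(cache_manager):
        n = cache_manager.request_count

        if n <= 3:
            # Requests 1-3: immediate DynamoDB response; cache builds in background.
            return ("dynamodb", None)

        if n == 4:
            # Request 4: one live LLM call, 25% chance per model.
            # (65-second timeout; the caller falls back to DynamoDB on failure.)
            return ("live_llm", random.choice(MODELS))

        # Request 5+: serve from whichever models still have cached sayings.
        available = [m for m in MODELS if cache_manager.cache.get(m)]
        if not available:
            return ("dynamodb", None)           # all caches empty
        if cache_manager.total_cached() <= 4:
            cache_manager.schedule_rebuild()    # keep the cache topped up
        return ("cache", random.choice(available))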

Request-Based Flow Design

The request-based architecture balances performance across different Lambda states. During cold starts, the system serves pre-generated sayings from DynamoDB with sub-50ms latency while the in-memory cache is built in the background. Once warm, the Lambda maintains a local cache that serves most requests without LLM API calls during high-traffic periods.
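
A small sketch of the warm-invocation pattern this relies on: state defined at module level in a Lambda persists across invocations of the same execution environment, so it only starts empty on a cold start. The variable names are illustrative.

    # Module-level state survives warm invocations of the same Lambda
    # execution environment; it is re-initialized only on a cold start.
    SAYING_CACHE = {m: [] for m in ["haiku", "sonnet", "gpt-4.1-mini", "gpt-4.1"]}
    REQUEST_COUNT = 0

    def handler(event, context):
        global REQUEST_COUNT
        REQUEST_COUNT += 1     # request number used by the flow logic above
        # ...select a source from REQUEST_COUNT and SAYING_CACHE as described...
        return {"statusCode": 200}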

This design provides immediate responses for initial requests while reducing LLM costs for active sessions: users get fast responses from the cache when it is available and live LLM generations when fresh content is needed, creating a responsive experience that scales cost-effectively.