Cloud Sayings Architecture

LLM Gateway & Caching

The LLM Gateway provides a unified interface to multiple language model providers through the Adapter Pattern, while the caching system reduces cost and latency by maintaining a bounded in-memory store of sayings.

LLM Gateway & Model Selection

Architecture Pattern: Adapter + Factory

The system uses the Adapter Pattern combined with a Factory Pattern to abstract LLM provider differences:

┌─────────────────────────────────────────────────────────┐
│                    LLMFactory                            │
│  ┌──────────────────────────────────────────────────┐   │
│  │  register_provider(name, adapter_class)         │   │
│  │  create_adapter(provider, config, prompt)       │   │
│  └──────────────────────────────────────────────────┘   │
└────────────────────┬────────────────────────────────────┘
                     │
        ┌────────────┴────────────┐
        │                         │
        ▼                         ▼
┌───────────────┐         ┌───────────────┐
│ Anthropic     │         │ OpenAI        │
│ Adapter       │         │ Adapter       │
│               │         │               │
│ • _init_      │         │ • _init_      │
│   client()    │         │   client()    │
│ • generate_   │         │ • generate_   │
│   saying()    │         │   saying()    │
│ • get_        │         │ • get_        │
│   provider()  │         │   provider()  │
└───────┬───────┘         └───────┬───────┘
        │                         │
        ▼                         ▼
┌───────────────┐         ┌───────────────┐
│ Anthropic API │         │ OpenAI API    │
│ (Claude)      │         │ (GPT)         │
└───────────────┘         └───────────────┘

Provider Registration

Providers are registered at module load time:

LLMFactory.register_provider('anthropic', AnthropicAdapter)
LLMFactory.register_provider('openai', OpenAIAdapter)
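
As a rough sketch of the pattern (the adapter constructor signature and error handling here are assumptions, not the exact implementation):

from typing import Dict

class LLMFactory:
    # Registry of provider name -> adapter class, populated at module load time
    _providers: Dict[str, type] = {}

    @classmethod
    def register_provider(cls, name: str, adapter_class: type) -> None:
        cls._providers[name] = adapter_class

    @classmethod
    def create_adapter(cls, provider: str, config: dict, prompt: str):
        # Look up the registered adapter class and instantiate it
        adapter_class = cls._providers.get(provider)
        if adapter_class is None:
            raise ValueError(f"Unknown LLM provider: {provider}")
        return adapter_class(config, prompt)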

Adapter Implementation Details

Anthropic Adapter:

  • Client: anthropic.Anthropic(api_key, timeout=25.0)
  • API Method: client.messages.create()
  • Native system message support
  • Structured response parsing
  • Automatic content extraction from message array
  • Timeout: 25 seconds (slightly less than the Lambda function timeout)
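
A minimal sketch of how these pieces fit together, assuming the adapter exposes the three methods shown in the diagram (the model name, token limit, and prompt wiring are illustrative):

import anthropic

class AnthropicAdapter:
    def _init_client(self, api_key: str) -> None:
        # 25s client timeout stays just under the Lambda budget
        self.client = anthropic.Anthropic(api_key=api_key, timeout=25.0)

    def get_provider(self) -> str:
        return 'anthropic'

    def generate_saying(self, system_prompt: str, user_prompt: str, model: str) -> str:
        response = self.client.messages.create(
            model=model,
            max_tokens=300,
            system=system_prompt,  # native system message support
            messages=[{'role': 'user', 'content': user_prompt}],
        )
        # Content arrives as a list of blocks; extract and join the text blocks
        return ''.join(block.text for block in response.content if hasattr(block, 'text'))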

OpenAI Adapter:

  • Client: openai.OpenAI(api_key, timeout=25.0)
  • API Method: client.chat.completions.create()
  • max_completion_tokens for newer models (gpt-4.1, gpt-4.1-mini)
  • max_tokens for older models
  • Timeout: 25 seconds for single requests, extended to 60 seconds for cache-building requests
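
A comparable sketch for the OpenAI side, showing how the token-limit parameter might be selected per model (the model set, token limit, and method names are illustrative assumptions):

import openai

# Models that expect max_completion_tokens rather than max_tokens (illustrative set)
NEWER_OPENAI_MODELS = {'gpt-4.1', 'gpt-4.1-mini'}

class OpenAIAdapter:
    def _init_client(self, api_key: str, cache_building: bool = False) -> None:
        # Single requests use 25s; cache building gets the extended 60s timeout
        self.client = openai.OpenAI(api_key=api_key, timeout=60.0 if cache_building else 25.0)

    def get_provider(self) -> str:
        return 'openai'

    def generate_saying(self, system_prompt: str, user_prompt: str, model: str) -> str:
        token_param = 'max_completion_tokens' if model in NEWER_OPENAI_MODELS else 'max_tokens'
        response = self.client.chat.completions.create(
            model=model,
            messages=[
                {'role': 'system', 'content': system_prompt},
                {'role': 'user', 'content': user_prompt},
            ],
            **{token_param: 300},  # pass whichever limit the model expects
        )
        return response.choices[0].message.content or ''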

Caching Architecture

Cache Design Philosophy

The caching system is designed to:

  1. Reduce LLM API calls on warm Lambda invocations
  2. Maintain cache bounds (minimum 2, maximum 12 per LLM)
  3. Prevent duplicates through deduplication logic
  4. Handle provider differences (OpenAI retry logic)
  5. Track service usage to avoid repetition

Cache Structure

class CacheManager:
    CACHE_MIN = 2   # Minimum sayings per LLM in cache
    CACHE_MAX = 12  # Maximum sayings per LLM in cache

    caches: Dict[str, CacheEntry] = {
        'dynamodb': CacheEntry([], 0),
        'haiku': CacheEntry([], 0),
        'sonnet': CacheEntry([], 0),
        'gpt4.1': CacheEntry([], 0),
        'gpt5-mini': CacheEntry([], 0)
    }

Each CacheEntry contains:

  • sayings: List[str] - Queue of cached sayings
  • timestamp: float - Last update time
  • is_filling: bool - Prevents concurrent rehydration
  • last_error: Optional[str] - Error tracking
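
A CacheEntry along these lines would satisfy both the field list above and the CacheEntry([], 0) construction shown earlier (the field defaults are assumptions):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CacheEntry:
    sayings: List[str] = field(default_factory=list)  # FIFO queue of cached sayings
    timestamp: float = 0.0                            # last update time (epoch seconds)
    is_filling: bool = False                          # guards against concurrent rehydration
    last_error: Optional[str] = None                  # most recent build/rehydration error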

Cache Lifecycle

1. Initial Cache Building (Request #1)

On the first request:

  • DynamoDB saying returned immediately (<50ms)
  • Background threads start building the cache for all 4 LLMs (see the sketch below)
  • Each LLM gets 3 sayings (one batch from the cache-building prompt)
  • Total: 12 sayings across all LLMs
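
The background fan-out can be done with a small thread pool, roughly as below (LLM_SOURCES and build_cache_for_source are hypothetical names used only for illustration):

from concurrent.futures import ThreadPoolExecutor

LLM_SOURCES = ['haiku', 'sonnet', 'gpt4.1', 'gpt5-mini']

def start_background_cache_build(cache_manager) -> None:
    # Kick off one cache-building task per LLM without blocking the caller
    executor = ThreadPoolExecutor(max_workers=len(LLM_SOURCES))
    for source in LLM_SOURCES:
        # build_cache_for_source would run one cache-building prompt
        # (one batch of 3 sayings) for the given LLM
        executor.submit(cache_manager.build_cache_for_source, source)
    executor.shutdown(wait=False)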

2. Cache Consumption (Request #5+)

  • Randomly selects from LLMs that have cached sayings
  • Removes saying from cache (FIFO queue)
  • Triggers background rehydration if cache drops below CACHE_MIN (2)
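
Put together, consumption inside the CacheManager looks roughly like this (_schedule_rehydration is a hypothetical helper name used for illustration):

import random
from typing import Optional, Tuple

def get_cached_saying(self) -> Optional[Tuple[str, str]]:
    # Pick randomly among LLM caches that currently hold sayings
    populated = [src for src, entry in self.caches.items()
                 if src != 'dynamodb' and entry.sayings]
    if not populated:
        return None  # caller falls back to DynamoDB or a direct LLM call

    source = random.choice(populated)
    saying = self.caches[source].sayings.pop(0)  # FIFO: serve the oldest saying

    # Rehydrate in the background once this LLM's cache runs low
    if len(self.caches[source].sayings) < self.CACHE_MIN:
        self._schedule_rehydration(source)

    return source, saying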

3. Cache Rehydration

Triggered when:

  • An LLM's cache drops below CACHE_MIN (2 sayings)
  • Total cache across all LLMs is ≤ 4

A background thread pool then executes the rehydration.
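
Expressed as a predicate (the exact boolean combination of the two conditions is an assumption, and the method name is illustrative):

def _needs_rehydration(self, source: str) -> bool:
    per_llm_low = len(self.caches[source].sayings) < self.CACHE_MIN
    total_cached = sum(len(entry.sayings) for src, entry in self.caches.items()
                       if src != 'dynamodb')
    # Skip if a rehydration for this source is already in flight
    return (per_llm_low or total_cached <= 4) and not self.caches[source].is_filling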

Rehydration process:

  • Uses cache-building prompt (3 sayings at once)
  • For OpenAI: Retries up to 10 times until ≥2 sayings cached
  • For Anthropic: Uses count-based logic (1 batch = 3 sayings)
  • Adds sayings to cache with deduplication

Deduplication Logic

def _add_to_cache(self, source: str, new_sayings: List[str]):
    cache = self.caches[source]
    existing_set = set(cache.sayings)
    unique_new_sayings = [s for s in new_sayings if s not in existing_set]

    if unique_new_sayings:
        cache.sayings.extend(unique_new_sayings)
        # Enforce max bound - keep only the most recent CACHE_MAX
        if len(cache.sayings) > self.CACHE_MAX:
            cache.sayings = cache.sayings[-self.CACHE_MAX:]

Service Tracking

To avoid repetitive service selection (e.g., "Transit Gateway" appearing repeatedly):

import random

_recently_used_services = []  # Tracks the last 20 selected services
_RECENT_SERVICES_MAX = 20

def select_random_service() -> ServiceSelection:
    # Filter out recently used services
    recent_set = set(_recently_used_services)
    available_services = [svc for svc in all_services if svc not in recent_set]

    # If all services have been used recently, reset and consider all of them
    if not available_services:
        available_services = all_services
        _recently_used_services.clear()

    # Select and track
    selected = random.choice(available_services)
    _recently_used_services.append(selected)
    if len(_recently_used_services) > _RECENT_SERVICES_MAX:
        _recently_used_services.pop(0)  # Remove the oldest entry

    return selected

OpenAI-Specific Retry Logic

OpenAI models sometimes return only 1 saying instead of 3. The system handles this with retry logic:

# For OpenAI cache building: retry until at least 2 sayings are cached
is_openai = provider == 'openai'
min_sayings_required = 2 if is_openai else 0
max_attempts = 10 if is_openai else count  # non-OpenAI providers run `count` batches

attempts = 0
while attempts < max_attempts:
    if is_openai and len(self.caches[source].sayings) >= min_sayings_required:
        break  # Have enough sayings

    sayings_batch, batch_metrics = get_llm_sayings_for_cache(provider, model)
    if sayings_batch:
        self._add_to_cache(source, sayings_batch)
    attempts += 1

Cache Persistence: The in-memory cache persists only across warm Lambda invocations. After a cold start, the first 3 requests are always served from DynamoDB while the cache is built in the background; subsequent warm requests are served from the cache. This design balances cold start performance, warm start efficiency, and cost optimization.

Cache Metrics

The system tracks:

  • cache_hit: Boolean indicating cache hit/miss
  • cache_size: Number of sayings remaining in cache
  • is_cached_response: Whether response came from cache
  • original_llm_time_ms: Original LLM API time (for cached responses)
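
For a cache-served response, the metrics might look like this (the values are illustrative):

metrics = {
    'cache_hit': True,
    'cache_size': 7,               # sayings left in this LLM's cache after serving
    'is_cached_response': True,
    'original_llm_time_ms': 1840,  # LLM API time when the saying was originally generated
}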