The LLM Gateway provides a unified interface to multiple language model providers through the Adapter Pattern, while the caching system optimizes cost and latency by intelligently managing an in-memory store of sayings.
The system uses the Adapter Pattern combined with a Factory Pattern to abstract LLM provider differences:
┌─────────────────────────────────────────────────────────┐
│                        LLMFactory                        │
│  ┌───────────────────────────────────────────────────┐  │
│  │  register_provider(name, adapter_class)           │  │
│  │  create_adapter(provider, config, prompt)         │  │
│  └───────────────────────────────────────────────────┘  │
└────────────────────┬─────────────────────────────────────┘
                     │
          ┌──────────┴────────────┐
          │                       │
          ▼                       ▼
  ┌───────────────┐       ┌───────────────┐
  │   Anthropic   │       │    OpenAI     │
  │    Adapter    │       │    Adapter    │
  │               │       │               │
  │ • _init_      │       │ • _init_      │
  │   client()    │       │   client()    │
  │ • generate_   │       │ • generate_   │
  │   saying()    │       │   saying()    │
  │ • get_        │       │ • get_        │
  │   provider()  │       │   provider()  │
  └───────┬───────┘       └───────┬───────┘
          │                       │
          ▼                       ▼
  ┌───────────────┐       ┌───────────────┐
  │ Anthropic API │       │  OpenAI API   │
  │   (Claude)    │       │     (GPT)     │
  └───────────────┘       └───────────────┘
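Both adapters expose the same three operations shown in the diagram. As a rough sketch only (the base class name and constructor signature here are assumptions, not taken from the actual code), the shared adapter interface could look like:

from abc import ABC, abstractmethod

class BaseLLMAdapter(ABC):  # hypothetical name for the shared adapter interface
    def __init__(self, config: dict, prompt: str):
        self.config = config
        self.prompt = prompt
        self.client = self._init_client()

    @abstractmethod
    def _init_client(self):
        """Create and return the provider-specific SDK client."""

    @abstractmethod
    def generate_saying(self) -> str:
        """Call the provider API and return a generated saying."""

    @abstractmethod
    def get_provider(self) -> str:
        """Return the provider name, e.g. 'anthropic' or 'openai'."""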
Providers are registered at module load time:
LLMFactory.register_provider('anthropic', AnthropicAdapter)
LLMFactory.register_provider('openai', OpenAIAdapter)
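Behind these two calls, the factory is essentially a registry mapping provider names to adapter classes. A minimal sketch, assuming adapters take (config, prompt) in their constructor (only the two method names come from the diagram above):

class LLMFactory:
    _providers = {}  # provider name -> adapter class

    @classmethod
    def register_provider(cls, name, adapter_class):
        cls._providers[name] = adapter_class

    @classmethod
    def create_adapter(cls, provider, config, prompt):
        # Look up the registered adapter class and instantiate it
        if provider not in cls._providers:
            raise ValueError(f"Unknown LLM provider: {provider}")
        return cls._providers[provider](config, prompt)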
Anthropic Adapter:
- Client: anthropic.Anthropic(api_key, timeout=25.0)
- API call: client.messages.create()

OpenAI Adapter:
- Client: openai.OpenAI(api_key, timeout=25.0), raised to 60s when building the cache
- API call: client.chat.completions.create()
- Token limit: max_completion_tokens for newer models (gpt-4.1, gpt-4.1-mini), max_tokens for older models

The caching system is designed to cut cost and latency by serving sayings from in-memory caches instead of calling an LLM on every request:
class CacheManager:
    CACHE_MIN = 2   # Minimum sayings per LLM in cache
    CACHE_MAX = 12  # Maximum sayings per LLM in cache

    caches: Dict[str, CacheEntry] = {
        'dynamodb': CacheEntry([], 0),
        'haiku': CacheEntry([], 0),
        'sonnet': CacheEntry([], 0),
        'gpt4.1': CacheEntry([], 0),
        'gpt5-mini': CacheEntry([], 0)
    }
Each CacheEntry contains:
- sayings: List[str] - Queue of cached sayings
- timestamp: float - Last update time
- is_filling: bool - Prevents concurrent rehydration
- last_error: Optional[str] - Error tracking
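Given the two-argument construction used above (CacheEntry([], 0)) and the fields just listed, CacheEntry can be pictured as a small dataclass; the defaults for is_filling and last_error are assumptions:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CacheEntry:
    sayings: List[str]                # Queue of cached sayings
    timestamp: float                  # Last update time
    is_filling: bool = False          # Prevents concurrent rehydration
    last_error: Optional[str] = None  # Error tracking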
1. Initial Cache Building (Request #1)

On the first request, the cache for the selected LLM is empty, so an initial cache-building pass is triggered to populate it.
2. Cache Consumption (Request #5+)

Once populated, sayings are served straight from the cache, draining it toward CACHE_MIN (2); a sketch of this consumption path follows the rehydration code below.

3. Cache Rehydration
Triggered when:
- the cache for a source drops to CACHE_MIN (2 sayings)

Rehydration process:
def _add_to_cache(self, source: str, new_sayings: List[str]):
    cache = self.caches[source]

    # Deduplicate against sayings already in the cache
    existing_set = set(cache.sayings)
    unique_new_sayings = [s for s in new_sayings if s not in existing_set]

    if unique_new_sayings:
        cache.sayings.extend(unique_new_sayings)

        # Enforce max bound - keep only the most recent CACHE_MAX
        if len(cache.sayings) > self.CACHE_MAX:
            cache.sayings = cache.sayings[-self.CACHE_MAX:]
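For context, the consumption side of the lifecycle can be sketched as follows. This is illustrative only: get_cached_saying and _rehydrate are hypothetical names, and the real method presumably also records the metrics described later.

def get_cached_saying(self, source: str) -> Optional[str]:
    cache = self.caches[source]
    if not cache.sayings:
        return None  # Cache miss - caller falls back to a direct LLM call

    saying = cache.sayings.pop(0)  # Serve the oldest cached saying

    # Rehydrate once the cache drains to CACHE_MIN, unless a fill
    # is already in progress (is_filling guard).
    if len(cache.sayings) <= self.CACHE_MIN and not cache.is_filling:
        cache.is_filling = True
        self._rehydrate(source)  # hypothetical helper that refills via _add_to_cache

    return saying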
To avoid repetitive service selection (e.g., "Transit Gateway" appearing repeatedly):
_recently_used_services = []  # Tracks last 20 services
_RECENT_SERVICES_MAX = 20

def select_random_service() -> ServiceSelection:
    # Filter out recently used services
    recent_set = set(_recently_used_services)
    available_services = [svc for svc in all_services if svc not in recent_set]

    # If all services used, reset and use all
    if not available_services:
        available_services = all_services
        _recently_used_services.clear()

    # Select and track
    selected = random.choice(available_services)
    _recently_used_services.append(selected)
    if len(_recently_used_services) > _RECENT_SERVICES_MAX:
        _recently_used_services.pop(0)  # Remove oldest
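This keeps selection roughly uniform over the services that have not appeared in the last 20 picks, and falls back to sampling from the full list once every service has been used recently.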
OpenAI models sometimes return only 1 saying instead of 3. The system handles this with retry logic:
# For OpenAI cache building
is_openai = provider == 'openai'
min_sayings_required = 2 if is_openai else 0
max_attempts = 10 if is_openai else count

attempts = 0
while attempts < max_attempts:
    if is_openai and len(self.caches[source].sayings) >= min_sayings_required:
        break  # Have enough sayings

    sayings_batch, batch_metrics = get_llm_sayings_for_cache(provider, model)
    if sayings_batch:
        self._add_to_cache(source, sayings_batch)

    attempts += 1
The system tracks:
- cache_hit: Boolean indicating cache hit/miss
- cache_size: Number of sayings remaining in cache
- is_cached_response: Whether the response came from cache
- original_llm_time_ms: Original LLM API time (for cached responses)
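For illustration only, a response served from the cache might carry metrics shaped like this (field names from the list above; the values are made-up examples):

cached_response_metrics = {
    'cache_hit': True,             # Saying was served from the in-memory cache
    'cache_size': 7,               # Sayings remaining in this cache (example value)
    'is_cached_response': True,    # Response came from cache
    'original_llm_time_ms': 1842,  # Example: latency of the LLM call that originally produced this saying
}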