Assay Architecture

Event-Driven Serverless Design for Document Intelligence

3. Document Processing Workflow

When a user uploads a PDF, Assay runs a seven-stage, event-driven pipeline that validates the file, extracts its text, analyzes it, and generates summaries. The stages are described below.

Stage 1: Upload & Validation

Client-Side (Frontend)

  • User selects PDF file (max 5MB)
  • Client validates file type and size
  • Creates Firestore document with status uploading
  • Uploads to Cloud Storage: uploads/{userId}/{documentId}/original.pdf

Storage Trigger (`onFileUpload`)

  • Automatically fires when upload completes
  • Validates PDF signature (reads first bytes, checks for %PDF-)
  • Verifies file size (enforces 5MB limit)
  • Extracts userId and documentId from storage path
  • Updates Firestore document status to processing
  • Creates display/{documentId} entry for UI
  • Publishes to pdf-uploaded Pub/Sub topic
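
The validation steps in the storage trigger can be sketched as pure functions. This is a minimal sketch, not the actual `onFileUpload` code: the helper names are hypothetical, and it assumes "5MB" means 5 × 1024 × 1024 bytes.

```typescript
// Hypothetical helpers; names are illustrative, not taken from the Assay codebase.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // ASCII bytes for "%PDF-"
const MAX_BYTES = 5 * 1024 * 1024; // assumption: "5MB" interpreted as 5 MiB

// True when the first bytes of the file match the PDF signature.
function hasPdfSignature(head: Uint8Array): boolean {
  if (head.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((byte, i) => head[i] === byte);
}

// True when the file is non-empty and within the size limit.
function isWithinSizeLimit(sizeBytes: number): boolean {
  return sizeBytes > 0 && sizeBytes <= MAX_BYTES;
}

interface UploadRef {
  userId: string;
  documentId: string;
}

// Parses uploads/{userId}/{documentId}/original.pdf; returns null for any other path.
function parseUploadPath(path: string): UploadRef | null {
  const match = /^uploads\/([^/]+)\/([^/]+)\/original\.pdf$/.exec(path);
  if (!match) return null;
  const userId = match[1];
  const documentId = match[2];
  if (!userId || !documentId) return null;
  return { userId, documentId };
}
```

Rejecting any path that does not match the expected layout keeps the trigger from acting on files written elsewhere in the bucket.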

Stage 2: Text Extraction

Worker: `extractText`

  • Trigger: pdf-uploaded Pub/Sub topic
  • Process:
    • Downloads PDF from Cloud Storage
    • Extracts text with the pdf-parse library (the text is held only for the duration of processing)
    • Computes statistics: character count, token estimate, page count
    • Uses extracted text to generate summaries (text is not stored, only summaries are retained)
    • Updates Firestore processing/{documentId} with text stats
    • Publishes to pdf-text-extracted Pub/Sub topic
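
The statistics step might look like the sketch below. The source does not say how the token estimate is computed; the four-characters-per-token ratio here is an assumption (a common rule of thumb for English text), and the function and field names are hypothetical.

```typescript
interface TextStats {
  charCount: number;
  tokenEstimate: number;
  pageCount: number;
}

// Assumption: roughly 4 characters per token; the real estimator may differ.
const CHARS_PER_TOKEN = 4;

// pageCount is passed in because it comes from the PDF parser, not from the text.
function computeTextStats(text: string, pageCount: number): TextStats {
  const charCount = text.length; // counts UTF-16 code units
  return {
    charCount,
    tokenEstimate: Math.ceil(charCount / CHARS_PER_TOKEN),
    pageCount,
  };
}
```

Because the estimate later drives strategy selection, an approximate count is sufficient; it only needs to be consistent across documents.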

Stage 3: Parallel Analysis

Once text is extracted, two workers process the document in parallel:

Metadata Extraction (`extractMetadata`)

  • Trigger: pdf-text-extracted Pub/Sub topic
  • Process:
    • Uses Gemini 2.5 Pro (always, regardless of user quality selection) to extract:
      • Title
      • Authors
      • Publication date
      • Keywords and key concepts
    • Falls back to pdf-parse metadata if Gemini fails
    • Uses first-page heuristic for title/author if needed
    • Updates Firestore processing/{documentId} and display/{documentId}
    • Publishes to pdf-metadata-extracted Pub/Sub topic

Signals Extraction (`extractSignals`)

  • Trigger: pdf-text-extracted Pub/Sub topic
  • Process:
    • Uses Gemini 2.5 Flash to extract:
      • Keywords (important terms)
      • Key phrases (significant multi-word expressions)
      • Key concepts (high-level ideas)
      • Discovered themes (free-form themes from document)
    • Updates Firestore processing/{documentId} with signals
    • Publishes to pdf-signals-extracted Pub/Sub topic
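
Parallel Processing Benefit: These two workers run simultaneously, so the analysis stage takes about as long as the slower worker rather than the sum of both. When the two take similar time, this reduces the stage's duration by approximately 50% compared to sequential execution.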
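
The size of the saving depends on how evenly the work is split between the two workers. A small sketch of the arithmetic, using a hypothetical helper and made-up durations:

```typescript
// Fraction of time saved by running tasks in parallel instead of sequentially.
// Sequential time is the sum of durations; parallel time is the longest duration.
function parallelSaving(durationsMs: number[]): number {
  const sequential = durationsMs.reduce((sum, d) => sum + d, 0);
  if (sequential === 0) return 0;
  const parallel = Math.max(...durationsMs);
  return 1 - parallel / sequential;
}

// Two workers of equal duration: half the time is saved.
const evenSplit = parallelSaving([5000, 5000]);
// One worker dominates: the saving shrinks to about a fifth.
const unevenSplit = parallelSaving([8000, 2000]);
```

The 50% figure is therefore an upper bound for two workers, reached only when both take the same time.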

Stage 4: Theme Classification

Worker: `matchCanonicalThemes`

  • Trigger: pdf-themes-match-requested Pub/Sub topic (published after signals extraction)
  • Process:
    • Uses Gemini 2.5 Flash to match discovered themes to canonical taxonomy
    • Canonical Taxonomy:
      • 30 L0 domains (broad categories, e.g., ARTIFICIAL_INTELLIGENCE, CLOUD_SECURITY)
      • 149 L1 specific themes (e.g., ARTIFICIAL_INTELLIGENCE.AI_ARCHITECTURES)
      • UNCATEGORIZED catch-all for unmatched themes
    • Rules:
      • Prefer L1 (specific) themes when there's a clear match
      • Fall back to L0 (domain) if no L1 fits
      • Minimum 1, maximum 7 canonical themes per document
      • Unmatched themes flagged for review
    • Updates Firestore processing/{documentId} with themes_l1[] and themes_l0[] arrays
    • Publishes to pdf-themes-matched Pub/Sub topic
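
The matching rules above can be sketched as a post-processing step applied to the theme IDs the model proposes. This is an illustration under assumptions: the function name is hypothetical, `themes_l0` is assumed to hold only fallback domains, and placing `UNCATEGORIZED` in `themes_l0` to satisfy the one-theme minimum is a guess rather than documented behavior.

```typescript
const MAX_THEMES = 7;
const UNCATEGORIZED = "UNCATEGORIZED";

interface ThemeResult {
  themes_l1: string[];
  themes_l0: string[];
  unmatched: string[]; // flagged for review
}

// candidates: theme IDs proposed by the model, in priority order.
function applyThemeRules(
  candidates: string[],
  l0Domains: Set<string>,
  l1Themes: Set<string>,
): ThemeResult {
  const out: ThemeResult = { themes_l1: [], themes_l0: [], unmatched: [] };
  const seen = new Set<string>();
  for (const id of candidates) {
    const atCap = out.themes_l1.length + out.themes_l0.length >= MAX_THEMES;
    if (l1Themes.has(id)) {
      // Specific L1 theme: preferred when it exists in the taxonomy.
      if (!atCap && !seen.has(id)) out.themes_l1.push(id);
    } else if (l0Domains.has(id)) {
      // No L1 fit: fall back to the broad domain.
      if (!atCap && !seen.has(id)) out.themes_l0.push(id);
    } else {
      out.unmatched.push(id);
    }
    seen.add(id);
  }
  // Enforce the minimum of one canonical theme.
  if (out.themes_l1.length + out.themes_l0.length === 0) {
    out.themes_l0.push(UNCATEGORIZED);
  }
  return out;
}
```

Validating the model's output against the taxonomy sets, rather than trusting it, keeps invented theme IDs out of the stored arrays.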

Stage 5: Strategy Selection

Worker: `selectStrategy`

  • Trigger: pdf-signals-extracted Pub/Sub topic
  • Process:
    • Reads token count from extracted text statistics
    • Decision Logic:
      • < 5,000 tokens → Single-pass strategy
      • ≥ 5,000 tokens → Hierarchical strategy
    • Updates Firestore processing/{documentId} with strategy field
    • Publishes to pdf-summary-requested Pub/Sub topic (with strategy)
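
The decision logic reduces to a single threshold comparison. A minimal sketch, with a hypothetical function name; the strategy strings follow the values used elsewhere in this document:

```typescript
type Strategy = "single-pass" | "hierarchical";

const HIERARCHICAL_THRESHOLD_TOKENS = 5000;

// Below the threshold a single model call suffices; at or above it, use map-reduce.
function chooseStrategy(tokenEstimate: number): Strategy {
  return tokenEstimate < HIERARCHICAL_THRESHOLD_TOKENS ? "single-pass" : "hierarchical";
}
```

Note that the boundary value of exactly 5,000 tokens selects the hierarchical strategy.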

Stage 6: Summary Generation

Based on the selected strategy, summaries are generated differently:

Single-Pass Strategy (`summarizeSingle`)

  • Trigger: pdf-summary-requested Pub/Sub topic (when strategy = single-pass)
  • Process:
    • Single Gemini call generates all three summary types:
      • General Summary (simpleSummary): Clear, accessible summary (2-3 paragraphs)
      • Wired-Style Summary (wiredStyleSummary): Journalistic-style article (3-4 paragraphs)
      • FAQs (faqs): 5-8 question-answer pairs
    • Uses Gemini 2.5 Flash (Fast quality) or Gemini 2.5 Pro (Premium quality) based on user selection
    • Updates Firestore display/{documentId} with all summaries
    • Publishes to pdf-summary-completed Pub/Sub topic

Hierarchical Strategy (`summarizeHierarchical`)

  • Trigger: pdf-summary-requested Pub/Sub topic (when strategy = hierarchical)
  • Process:
    • Map Step: Chunks text (section-aware, 10-15% overlap), generates per-chunk summaries in parallel
    • Reduce Step: Merges chunk summaries into final summaries using Gemini
    • Generates same three summary types as single-pass
    • Updates Firestore display/{documentId} with all summaries
    • Publishes to pdf-summary-completed Pub/Sub topic
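
The chunking in the map step can be illustrated with a simple sliding window. This sketch is deliberately simplified: it splits on character offsets and is not section-aware, and the function name and parameters are hypothetical.

```typescript
// Splits text into fixed-size chunks where each chunk repeats the tail of the previous one.
// overlapRatio is the fraction of a chunk shared with its predecessor (e.g. 0.1 to 0.15).
function chunkWithOverlap(text: string, chunkSize: number, overlapRatio = 0.1): string[] {
  if (chunkSize <= 0 || overlapRatio < 0 || overlapRatio >= 1) {
    throw new RangeError("chunkSize must be > 0 and overlapRatio in [0, 1)");
  }
  const overlap = Math.round(chunkSize * overlapRatio);
  const step = Math.max(1, chunkSize - overlap);
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

The overlap gives each per-chunk summary some context from the preceding chunk, which helps the reduce step merge summaries without losing ideas that straddle a boundary.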

FAQ Generation (`generateFAQs`)

  • Trigger: pdf-faq-summary-requested Pub/Sub topic (published after themes are matched)
  • Process:
    • Uses canonical themes to organize FAQs by theme
    • Generates 5-12 questions and answers
    • Questions are practical and theme-organized
    • Updates Firestore display/{documentId} with FAQs
    • Triggers completion check

Stage 7: Completion & Cleanup

Worker: `generateFAQs` (completion check)

When all summaries (general, wired-style, FAQ) are complete:

  • Updates Firestore processing/{documentId} status to completed
  • Updates display/{documentId} status to ready
  • Deletes original PDF from Cloud Storage (only summaries are retained, no extracted text)
  • Publishes to pdf-summary-completed Pub/Sub topic
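
The completion check amounts to verifying that all three summary fields are present before marking the document ready. A sketch under assumptions: the field names come from the summary descriptions above, while the FAQ item shape and the function name are hypothetical.

```typescript
interface FaqItem {
  question: string;
  answer: string;
}

// Subset of a display/{documentId} record relevant to the completion check.
interface DisplaySummaries {
  simpleSummary?: string;
  wiredStyleSummary?: string;
  faqs?: FaqItem[];
}

// True only when the general summary, wired-style summary, and FAQs are all populated.
function allSummariesComplete(doc: DisplaySummaries): boolean {
  const hasSimple = Boolean(doc.simpleSummary?.trim());
  const hasWired = Boolean(doc.wiredStyleSummary?.trim());
  const hasFaqs = (doc.faqs?.length ?? 0) > 0;
  return hasSimple && hasWired && hasFaqs;
}
```

Gating PDF deletion on this check matters because the hierarchical strategy may still need the source text until every summary has been written.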

Cleanup Job (`cleanupPDFs`)

  • Scheduled Cloud Function (runs daily at 2 AM UTC)
  • Safety net for PDF deletion
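  • Deletes PDFs older than 3 hours that belong to completed documents
  • Because the job runs once a day, a PDF missed by the immediate deletion is removed at the next run, so it can persist for up to roughly 27 hours (the 3-hour threshold plus the 24-hour interval)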
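
The selection logic of the cleanup job can be sketched as a filter over stored files. The record shape and function name are hypothetical, and listing and deleting objects in Cloud Storage is left out.

```typescript
// Hypothetical view of a stored PDF joined with its document's processing status.
interface StoredPdf {
  path: string;
  uploadedAtMs: number; // upload time as epoch milliseconds
  documentStatus: string;
}

const MAX_AGE_MS = 3 * 60 * 60 * 1000; // 3 hours

// Returns the paths of PDFs that are older than the threshold and belong to completed documents.
function selectPdfsToDelete(files: StoredPdf[], nowMs: number): string[] {
  return files
    .filter((f) => f.documentStatus === "completed" && nowMs - f.uploadedAtMs > MAX_AGE_MS)
    .map((f) => f.path);
}
```

Passing the current time in as a parameter keeps the function deterministic and easy to test.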

Processing Timeline (Typical)

These are approximate timelines and vary based on document size, model selection (Flash vs Pro), and system load:

| Time  | Stage                | Description |
|-------|----------------------|-------------|
| T+0s  | Upload               | User uploads PDF, validation occurs |
| T+1s  | Trigger              | Storage trigger fires automatically |
| T+5s  | Text Extraction      | Text extracted for processing (not stored, only summaries retained) |
| T+10s | Parallel Analysis    | Metadata and signals extracted simultaneously |
| T+15s | Theme Classification | Themes matched to canonical taxonomy |
| T+20s | Summary Generation   | All three summary types generated (single-pass) or chunks processed (hierarchical) |
| T+30s | Complete             | Document ready, summaries appear in UI |

Note: Hierarchical processing for large documents (≥5K tokens) may take 2-5 minutes depending on document size and chunk count.

4. Design Decisions

Why Two Data Collections?

Assay maintains two separate Firestore collections, each optimized for different purposes:

`processing/` Collection - Processing State & Raw Data

  • Used by: Worker functions during processing
  • Contains:
    • Processing status (uploading, processing, completed, failed)
    • Raw signals (keywords, key phrases, key concepts, discovered themes)
    • Canonical themes (themes_l1[], themes_l0[])
    • Metadata (title, authors, date)
    • Processing strategy (single-pass or hierarchical)
    • Quality selection (fast or premium)
    • Visibility (public or private)
  • Optimized for: Worker queries and updates during processing
  • Indexes: Optimized for processing queries (status, userId, themes)

`display/` Collection - UI-Optimized Display Data

  • Used by: Frontend for display
  • Contains:
    • Display status (uploading, processing, ready, error)
    • Formatted summaries (simpleSummary, wiredStyleSummary, faqs)
    • Metadata (title, authors, date) - duplicated for fast access
    • Visibility (public or private)
  • Optimized for: Frontend queries and real-time updates
  • Indexes: Optimized for user queries (userId, visibility, themes)

Benefits of Separation:

  • Performance: Each collection can be optimized for its specific use case
  • Scalability: Workers don't compete with frontend queries
  • Maintainability: Clear separation of concerns
  • Cost: Smaller display/ collection reduces read costs for frontend
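
The relationship between the two collections can be sketched as a projection from a processing record to its display counterpart. Apart from `strategy`, `themes_l1`, and `themes_l0`, the field names here are assumptions, and the `failed` to `error` mapping is inferred from the two status lists rather than stated in the source.

```typescript
type ProcessingStatus = "uploading" | "processing" | "completed" | "failed";
type DisplayStatus = "uploading" | "processing" | "ready" | "error";

// Metadata duplicated in both collections for fast frontend access.
interface SharedFields {
  title: string;
  authors: string[];
  date: string;
  visibility: "public" | "private";
}

interface ProcessingDoc extends SharedFields {
  status: ProcessingStatus;
  strategy: "single-pass" | "hierarchical";
  quality: "fast" | "premium";
  themes_l1: string[];
  themes_l0: string[];
}

interface DisplayDoc extends SharedFields {
  status: DisplayStatus;
}

// completed -> ready comes from Stage 7; failed -> error is an assumption.
function toDisplayStatus(processingStatus: ProcessingStatus): DisplayStatus {
  switch (processingStatus) {
    case "completed":
      return "ready";
    case "failed":
      return "error";
    default:
      return processingStatus;
  }
}

// Copies only the fields the frontend needs; worker-only fields are left behind.
function toDisplayDoc(doc: ProcessingDoc): DisplayDoc {
  return {
    title: doc.title,
    authors: doc.authors,
    date: doc.date,
    visibility: doc.visibility,
    status: toDisplayStatus(doc.status),
  };
}
```

Keeping worker-only fields such as `strategy` and `quality` out of the display record is what keeps `display/` small.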

Why Delete PDFs After Processing?

Once processing is complete, the original PDF is deleted from Cloud Storage. No extracted text from PDFs is kept—only AI-generated summaries are retained. This design decision prioritizes:

  • Privacy: Original files and extracted text aren't retained long-term, reducing privacy risk
  • Cost Efficiency: Summary storage is significantly cheaper than PDF or full-text storage (approximately 10-20x reduction)
  • Security: Less data to protect and manage, reducing attack surface
  • Compliance: Summary-only retention simplifies compliance requirements (GDPR, CCPA, etc.)
  • Processing Speed: Summaries are faster to read and process than full documents

Deletion Mechanism:

  • Immediate: PDF deleted when all summaries complete (in generateFAQs completion check)
  • Safety Net: Nightly cleanup job (cleanupPDFs) deletes PDFs older than 3 hours
  • Summary Retention: Only AI-generated summaries are stored in Firestore and retained indefinitely. No extracted text from PDFs is kept.

Why Event-Driven Messaging?

Using Google Cloud Pub/Sub between processing stages provides several critical benefits:

  • Reliability: Messages are persisted, ensuring no processing steps are lost even if workers crash
  • Retry Logic: Failed processing can be automatically retried through Pub/Sub's built-in retry mechanism
  • Backpressure Handling: The queue absorbs spikes in processing demand, preventing system overload
  • Monitoring: Message flow provides clear observability into system health and processing bottlenecks
  • Decoupling: Workers don't need to know about each other, only the topics they publish/subscribe to
  • Scalability: Each topic can scale independently based on message volume

Pub/Sub Topics Used:

  • 11 topics for document processing pipeline
  • Each topic can have multiple subscribers (enabling parallel processing)
  • Message format: JSON with documentId, userId, and stage-specific data
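
The message format could be modeled as below. The source names only `documentId` and `userId`; nesting the stage-specific payload under a `data` key is an assumption, as are the helper names.

```typescript
interface PipelineMessage {
  documentId: string;
  userId: string;
  data: Record<string, unknown>; // stage-specific payload (assumed to be nested here)
}

function encodeMessage(message: PipelineMessage): string {
  return JSON.stringify(message);
}

// Parses and validates an incoming message; throws on malformed input so the
// caller can let the messaging layer retry or route the message elsewhere.
function decodeMessage(raw: string): PipelineMessage {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("message must be a JSON object");
  }
  const { documentId, userId, data } = parsed as Record<string, unknown>;
  if (typeof documentId !== "string" || typeof userId !== "string") {
    throw new Error("message requires string documentId and userId");
  }
  const payload =
    typeof data === "object" && data !== null ? (data as Record<string, unknown>) : {};
  return { documentId, userId, data: payload };
}
```

Validating at the subscriber boundary means each worker can rely on the two required identifiers without knowing which worker published the message.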

Why Canonical Themes?

Assay uses a canonical theme taxonomy rather than free-form tagging. This approach:

  • Enables Discovery: Users can find documents across different authors/time periods that share themes
  • Maintains Consistency: Same theme always means the same thing (e.g., ARTIFICIAL_INTELLIGENCE.AI_SAFETY)
  • Scales Organically: New themes are flagged for review and added to the taxonomy
  • Supports Navigation: Hierarchical structure (L0 domains → L1 specific themes) enables browsing
  • Improves Search: Theme-based search is more reliable than keyword search

Taxonomy Structure:

  • L0 (Root Domains): 30 broad categories (e.g., ARTIFICIAL_INTELLIGENCE, CLOUD_SECURITY)
  • L1 (Specific Themes): 149 specific themes (e.g., ARTIFICIAL_INTELLIGENCE.AI_ARCHITECTURES)
  • UNCATEGORIZED: Catch-all for documents that don't match any canonical theme
  • Synonyms: Each theme can have searchable synonyms for better matching
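
The dotted theme IDs make the hierarchy easy to navigate in code. A small sketch with hypothetical helpers; the synonym entry in the usage lines is invented for illustration and is not part of the actual taxonomy.

```typescript
const UNCATEGORIZED = "UNCATEGORIZED";

// Returns the L0 domain of a theme ID: the part before the first dot,
// or the ID itself when it is already an L0 domain.
function parentDomain(themeId: string): string {
  const dot = themeId.indexOf(".");
  return dot === -1 ? themeId : themeId.slice(0, dot);
}

// Looks up a free-form term in a synonym table keyed by lowercase term.
// Falls back to UNCATEGORIZED when no synonym matches.
function resolveTheme(term: string, synonyms: Map<string, string>): string {
  const key = term.trim().toLowerCase();
  return synonyms.get(key) ?? UNCATEGORIZED;
}

// Hypothetical synonym table with one made-up entry.
const exampleSynonyms = new Map<string, string>([
  ["ai alignment", "ARTIFICIAL_INTELLIGENCE.AI_SAFETY"],
]);

const resolved = resolveTheme("  AI Alignment ", exampleSynonyms); // "ARTIFICIAL_INTELLIGENCE.AI_SAFETY"
const domain = parentDomain(resolved); // "ARTIFICIAL_INTELLIGENCE"
```

Deriving the L0 domain from the ID avoids storing a separate parent pointer for each of the L1 themes.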