Assay Architecture

Event-Driven Serverless Design for Document Intelligence

3. Document Processing Workflow

When a user uploads a PDF, Assay runs a multi-stage pipeline that extracts the text, analyzes it, and generates summaries. The stages are described below.

Stage 1: Upload & Validation

Client-Side (Frontend)

User selects PDF file (max 5MB)
Client validates file type and size
Creates Firestore document with status uploading
Uploads to Cloud Storage: uploads/{userId}/{documentId}/original.pdf

Storage Trigger (`onFileUpload`)

Automatically fires when upload completes
Validates PDF signature (reads first bytes, checks for %PDF-)
Verifies file size (enforces 5MB limit)
Extracts userId and documentId from storage path
Updates Firestore document status to processing
Creates display/{documentId} entry for UI
Publishes to pdf-uploaded Pub/Sub topic
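
The checks performed by the storage trigger can be sketched as a few pure helpers. This is a minimal illustration, not Assay's actual code: the function names are hypothetical, and the 5 MB limit is assumed to mean 5 × 1024 × 1024 bytes.

```typescript
// Sketch only: helper names are hypothetical, not taken from Assay's source.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // ASCII bytes for "%PDF-"
const MAX_BYTES = 5 * 1024 * 1024; // assumption: "5MB" means binary megabytes

/** Returns true when the first bytes of the file are "%PDF-". */
function hasPdfSignature(head: Uint8Array): boolean {
  if (head.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((byte, i) => head[i] === byte);
}

/** Returns true when the file is non-empty and within the size limit. */
function isWithinSizeLimit(sizeBytes: number): boolean {
  return sizeBytes > 0 && sizeBytes <= MAX_BYTES;
}

/** Extracts userId and documentId from uploads/{userId}/{documentId}/original.pdf. */
function parseUploadPath(
  path: string
): { userId: string; documentId: string } | null {
  const match = /^uploads\/([^/]+)\/([^/]+)\/original\.pdf$/.exec(path);
  return match ? { userId: match[1], documentId: match[2] } : null;
}
```

Returning null for a path that does not match the expected layout lets the trigger ignore unrelated objects in the bucket instead of failing on them.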

Stage 2: Text Extraction

Worker: `extractText`

Trigger: pdf-uploaded Pub/Sub topic
Process:
- Downloads PDF from Cloud Storage
- Extracts text using the pdf-parse library (held temporarily, for processing only)
- Computes statistics: character count, token estimate, page count
- Extracted text feeds summary generation; the text itself is not stored, only the summaries are retained
- Updates Firestore processing/{documentId} with text stats
- Publishes to pdf-text-extracted Pub/Sub topic
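
The statistics step might look like the sketch below. The source does not say how the token estimate is computed, so the 4-characters-per-token ratio here is an assumption (a common rule of thumb for English text), and the names are hypothetical.

```typescript
// Sketch only: the estimation method is an assumption, not Assay's documented logic.
interface TextStats {
  charCount: number;
  tokenEstimate: number;
  pageCount: number;
}

// Assumption: roughly 4 characters per token for English prose.
const CHARS_PER_TOKEN = 4;

/** Computes the stats written to processing/{documentId}; pageCount comes from the parser. */
function computeTextStats(text: string, pageCount: number): TextStats {
  const charCount = text.length;
  return {
    charCount,
    tokenEstimate: Math.ceil(charCount / CHARS_PER_TOKEN),
    pageCount,
  };
}
```

Only these numbers are persisted; the text argument is discarded once the stats are computed, which matches the summaries-only retention described above.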

Stage 3: Parallel Analysis

Once text is extracted, two workers process the document in parallel:

Metadata Extraction (`extractMetadata`)

Trigger: pdf-text-extracted Pub/Sub topic
Process:
- Uses Gemini 2.5 Pro (always, regardless of user quality selection) to extract:
  - Title
  - Authors
  - Publication date
  - Keywords and key concepts
- Falls back to pdf-parse metadata if Gemini fails
- Uses first-page heuristic for title/author if needed
- Updates Firestore processing/{documentId} and display/{documentId}
- Publishes to pdf-metadata-extracted Pub/Sub topic

Signals Extraction (`extractSignals`)

Trigger: pdf-text-extracted Pub/Sub topic
Process:
- Uses Gemini 2.5 Flash to extract:
  - Keywords (important terms)
  - Key phrases (significant multi-word expressions)
  - Key concepts (high-level ideas)
  - Discovered themes (free-form themes from document)
- Updates Firestore processing/{documentId} with signals
- Publishes to pdf-signals-extracted Pub/Sub topic

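Parallel Processing Benefit: These two workers run simultaneously, so this stage takes about as long as the slower of the two rather than their sum. When both take similar time, that cuts the stage's duration by roughly 50% compared to sequential execution.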

Stage 4: Theme Classification

Worker: `matchCanonicalThemes`

Trigger: pdf-themes-match-requested Pub/Sub topic (published after signals extraction)
Process:
- Uses Gemini 2.5 Flash to match discovered themes to canonical taxonomy
- Canonical Taxonomy:
  - 30 L0 domains (broad categories, e.g., ARTIFICIAL_INTELLIGENCE, CLOUD_SECURITY)
  - 149 L1 specific themes (e.g., ARTIFICIAL_INTELLIGENCE.AI_ARCHITECTURES)
  - UNCATEGORIZED catch-all for unmatched themes
- Rules:
  - Prefer L1 (specific) themes when there's a clear match
  - Fall back to L0 (domain) if no L1 fits
  - Minimum 1, maximum 7 canonical themes per document
  - Unmatched themes flagged for review
- Updates Firestore processing/{documentId} with themes_l1[] and themes_l0[] arrays
- Publishes to pdf-themes-matched Pub/Sub topic
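
The matching rules can be illustrated with a small post-processing step applied to the theme IDs the model returns. This is a sketch under stated assumptions: the taxonomy excerpt is a tiny stand-in for the real 30 + 149 entries, the function name is hypothetical, and placing UNCATEGORIZED in themes_l0 is a guess rather than documented behavior.

```typescript
// Sketch only: a tiny stand-in taxonomy; the real one has 30 L0 domains and 149 L1 themes.
const L0_DOMAINS = ["ARTIFICIAL_INTELLIGENCE", "CLOUD_SECURITY"];
const L1_THEMES = [
  "ARTIFICIAL_INTELLIGENCE.AI_ARCHITECTURES",
  "ARTIFICIAL_INTELLIGENCE.AI_SAFETY",
];
const MAX_THEMES = 7;

interface ThemeMatch {
  themes_l1: string[];
  themes_l0: string[];
  unmatched: string[]; // flagged for review
}

/** Validates candidate IDs, keeps L1 matches first, and caps the total at 7. */
function resolveThemes(candidates: string[]): ThemeMatch {
  const l1: string[] = [];
  const l0: string[] = [];
  const unmatched: string[] = [];
  for (const id of candidates) {
    if (L1_THEMES.indexOf(id) !== -1) {
      if (l1.indexOf(id) === -1) l1.push(id);
    } else if (L0_DOMAINS.indexOf(id) !== -1) {
      if (l0.indexOf(id) === -1) l0.push(id);
    } else {
      unmatched.push(id);
    }
  }
  // L1 themes get priority; L0 domains fill whatever room is left under the cap.
  const themes_l1 = l1.slice(0, MAX_THEMES);
  const themes_l0 = l0.slice(0, MAX_THEMES - themes_l1.length);
  // Enforce the minimum of one theme (assumption: the catch-all lives in themes_l0).
  if (themes_l1.length + themes_l0.length === 0) themes_l0.push("UNCATEGORIZED");
  return { themes_l1, themes_l0, unmatched };
}
```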

Stage 5: Strategy Selection

Worker: `selectStrategy`

Trigger: pdf-signals-extracted Pub/Sub topic
Process:
- Reads token count from extracted text statistics
- Decision Logic:
  - < 5,000 tokens → Single-pass strategy
  - ≥ 5,000 tokens → Hierarchical strategy
- Updates Firestore processing/{documentId} with strategy field
- Publishes to pdf-summary-requested Pub/Sub topic (with strategy)
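
The decision logic reduces to a single threshold comparison. A minimal sketch, with hypothetical names:

```typescript
// Sketch only: names are hypothetical; the 5,000-token threshold is from the rules above.
type Strategy = "single-pass" | "hierarchical";

const HIERARCHICAL_THRESHOLD_TOKENS = 5000;

/** Picks the summarization strategy from the token estimate computed in Stage 2. */
function selectStrategy(tokenEstimate: number): Strategy {
  return tokenEstimate < HIERARCHICAL_THRESHOLD_TOKENS
    ? "single-pass"
    : "hierarchical";
}
```

A document with exactly 5,000 estimated tokens falls on the hierarchical side of the boundary.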

Stage 6: Summary Generation

Based on the selected strategy, summaries are generated differently:

Single-Pass Strategy (`summarizeSingle`)

Trigger: pdf-summary-requested Pub/Sub topic (when strategy = single-pass)
Process:
- Single Gemini call generates all three summary types:
  - General Summary (simpleSummary): Clear, accessible summary (2-3 paragraphs)
  - Wired-Style Summary (wiredStyleSummary): Journalistic-style article (3-4 paragraphs)
  - FAQs (faqs): 5-8 question-answer pairs
- Uses Gemini 2.5 Flash (Fast quality) or Gemini 2.5 Pro (Premium quality) based on user selection
- Updates Firestore display/{documentId} with all summaries
- Publishes to pdf-summary-completed Pub/Sub topic
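
Model selection across the workers described so far can be summarized in one function. This is a sketch: the function and type names are hypothetical, and the model ID strings are assumed to be the identifiers passed to the Gemini API.

```typescript
// Sketch only: names are hypothetical; model ID strings are assumed API identifiers.
type Quality = "fast" | "premium";
type PipelineWorker =
  | "extractMetadata"
  | "extractSignals"
  | "matchCanonicalThemes"
  | "summarizeSingle";

const FLASH = "gemini-2.5-flash";
const PRO = "gemini-2.5-pro";

/** Returns the model for a worker; only single-pass summarization honors the user's quality choice. */
function chooseModel(worker: PipelineWorker, quality: Quality): string {
  if (worker === "extractMetadata") return PRO; // always Pro, regardless of quality
  if (worker === "summarizeSingle") return quality === "premium" ? PRO : FLASH;
  return FLASH; // signals extraction and theme matching use Flash
}
```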

Hierarchical Strategy (`summarizeHierarchical`)

Trigger: pdf-summary-requested Pub/Sub topic (when strategy = hierarchical)
Process:
- Map Step: Chunks text (section-aware, 10-15% overlap), generates per-chunk summaries in parallel
- Reduce Step: Merges chunk summaries into final summaries using Gemini
- Generates same three summary types as single-pass
- Updates Firestore display/{documentId} with all summaries
- Publishes to pdf-summary-completed Pub/Sub topic
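
The overlap arithmetic in the map step can be sketched as follows. This is a deliberate simplification: it splits on fixed character counts, whereas the worker described above is section-aware, and the function name and parameters are hypothetical.

```typescript
// Sketch only: fixed-size character chunks, not the section-aware chunking Assay uses.
/**
 * Splits text into chunks of at most chunkSize characters, where each chunk
 * repeats the last floor(chunkSize * overlapRatio) characters of the previous one.
 */
function chunkWithOverlap(
  text: string,
  chunkSize: number,
  overlapRatio: number
): string[] {
  if (chunkSize < 1 || overlapRatio < 0 || overlapRatio >= 1) {
    throw new Error("invalid chunk parameters");
  }
  const overlap = Math.floor(chunkSize * overlapRatio);
  const step = chunkSize - overlap; // always > 0 because overlapRatio < 1
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

With a 10-15% overlap, a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks, which is what lets the reduce step merge per-chunk summaries without losing context at the seams.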

FAQ Generation (`generateFAQs`)

Trigger: pdf-faq-summary-requested Pub/Sub topic (published after themes are matched)
Process:
- Uses canonical themes to organize FAQs by theme
- Generates 5-12 questions and answers
- Questions are practical and theme-organized
- Updates Firestore display/{documentId} with FAQs
- Triggers completion check

Stage 7: Completion & Cleanup

Worker: `generateFAQs` (completion check)

When all summaries (general, wired-style, FAQ) are complete:

Updates Firestore processing/{documentId} status to completed
Updates display/{documentId} status to ready
Deletes original PDF from Cloud Storage (only summaries are retained, no extracted text)
Publishes to pdf-summary-completed Pub/Sub topic
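
The completion check can be sketched as a pure function over the display document. The summary field names come from the sections above; the function name and the shape of the returned actions are hypothetical.

```typescript
// Sketch only: the action object is a hypothetical way to express the steps listed above.
interface Faq {
  question: string;
  answer: string;
}

interface DisplaySummaries {
  simpleSummary?: string;
  wiredStyleSummary?: string;
  faqs?: Faq[];
}

interface CompletionActions {
  processingStatus: "completed";
  displayStatus: "ready";
  deleteOriginalPdf: true;
}

/** Returns the completion actions once all three summary types exist, otherwise null. */
function checkCompletion(doc: DisplaySummaries): CompletionActions | null {
  const done =
    !!doc.simpleSummary &&
    !!doc.wiredStyleSummary &&
    doc.faqs !== undefined &&
    doc.faqs.length > 0;
  if (!done) return null;
  return {
    processingStatus: "completed",
    displayStatus: "ready",
    deleteOriginalPdf: true,
  };
}
```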

Cleanup Job (`cleanupPDFs`)

Scheduled Cloud Function (runs daily at 2 AM UTC)
Safety net for PDF deletion
Deletes PDFs older than 3 hours from completed documents
Ensures no PDFs are retained longer than necessary
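
The selection rule the cleanup job applies could look like this. It is a sketch with hypothetical names and record shape; the schedule string is the standard cron expression for "daily at 02:00" and assumes the scheduler's time zone is set to UTC.

```typescript
// Sketch only: StoredPdf is a hypothetical record shape, not Assay's schema.
const CLEANUP_SCHEDULE = "0 2 * * *"; // daily at 02:00 (assumes a UTC time zone setting)
const MAX_AGE_MS = 3 * 60 * 60 * 1000; // 3 hours

interface StoredPdf {
  path: string;
  uploadedAtMs: number;
  documentStatus: string;
}

/** Returns the storage paths of PDFs older than 3 hours whose documents are completed. */
function selectPdfsToDelete(pdfs: StoredPdf[], nowMs: number): string[] {
  return pdfs
    .filter(
      (pdf) =>
        pdf.documentStatus === "completed" &&
        nowMs - pdf.uploadedAtMs > MAX_AGE_MS
    )
    .map((pdf) => pdf.path);
}
```

Passing the current time in as an argument keeps the rule deterministic and easy to test.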

Processing Timeline (Typical)

These are approximate timelines and vary based on document size, model selection (Flash vs Pro), and system load:

Time	Stage	Description
T+0s	Upload	User uploads PDF, validation occurs
T+1s	Trigger	Storage trigger fires automatically
T+5s	Text Extraction	Text extracted for processing (not stored, only summaries retained)
T+10s	Parallel Analysis	Metadata and signals extracted simultaneously
T+15s	Theme Classification	Themes matched to canonical taxonomy
T+20s	Summary Generation	All three summary types generated (single-pass) or chunks processed (hierarchical)
T+30s	Complete	Document ready, summaries appear in UI

Note: Hierarchical processing for large documents (≥ 5,000 tokens) may take 2-5 minutes depending on document size and chunk count.

4. Design Decisions

Why Two Data Collections?

Assay maintains two separate Firestore collections, each optimized for different purposes:

`processing/` Collection - Processing State & Raw Data

Used by: Worker functions during processing
Contains:
- Processing status (uploading → processing → completed, or failed on error)
- Raw signals (keywords, key phrases, key concepts, discovered themes)
- Canonical themes (themes_l1[], themes_l0[])
- Metadata (title, authors, date)
- Processing strategy (single-pass or hierarchical)
- Quality selection (fast or premium)
- Visibility (public or private)
Optimized for: Worker queries and updates during processing
Indexes: Optimized for processing queries (status, userId, themes)

`display/` Collection - UI-Optimized Display Data

Used by: Frontend for display
Contains:
- Display status (uploading → processing → ready, or error on failure)
- Formatted summaries (simpleSummary, wiredStyleSummary, faqs)
- Metadata (title, authors, date) - duplicated for fast access
- Visibility (public or private)
Optimized for: Frontend queries and real-time updates
Indexes: Optimized for user queries (userId, visibility, themes)
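
The two collections use different status vocabularies for the same lifecycle. A minimal sketch of how the status values listed above line up (the type and function names are hypothetical, and the one-to-one mapping is inferred from the two status lists rather than stated by the source):

```typescript
// Sketch only: the mapping is inferred from the status values listed for each collection.
type ProcessingStatus = "uploading" | "processing" | "completed" | "failed";
type DisplayStatus = "uploading" | "processing" | "ready" | "error";

/** Translates a worker-facing status into the status the frontend displays. */
function toDisplayStatus(status: ProcessingStatus): DisplayStatus {
  if (status === "completed") return "ready";
  if (status === "failed") return "error";
  return status; // "uploading" and "processing" are shared by both collections
}
```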

Benefits of Separation:

Performance: Each collection can be optimized for its specific use case
Scalability: Workers don't compete with frontend queries
Maintainability: Clear separation of concerns
Cost: Smaller display/ collection reduces read costs for frontend

Why Delete PDFs After Processing?

Once processing is complete, the original PDF is deleted from Cloud Storage. No extracted text from PDFs is kept—only AI-generated summaries are retained. This design decision prioritizes:

Privacy: Original files and extracted text aren't retained long-term, reducing privacy risk
Cost Efficiency: Summary storage is significantly cheaper than PDF or full-text storage (approximately 10-20x reduction)
Security: Less data to protect and manage, reducing attack surface
Compliance: Summary-only retention simplifies compliance requirements (GDPR, CCPA, etc.)
Processing Speed: Summaries are faster to read and process than full documents

Deletion Mechanism:

Immediate: PDF deleted when all summaries complete (in generateFAQs completion check)
Safety Net: Nightly cleanup job (cleanupPDFs) deletes PDFs older than 3 hours
Summary Retention: Only AI-generated summaries are stored in Firestore and retained indefinitely. No extracted text from PDFs is kept.

Why Event-Driven Messaging?

Using Google Cloud Pub/Sub between processing stages provides several critical benefits:

Reliability: Messages are persisted, ensuring no processing steps are lost even if workers crash
Retry Logic: Failed processing can be automatically retried through Pub/Sub's built-in retry mechanism
Backpressure Handling: The queue absorbs spikes in processing demand, preventing system overload
Monitoring: Message flow provides clear observability into system health and processing bottlenecks
Decoupling: Workers don't need to know about each other, only the topics they publish/subscribe to
Scalability: Each topic can scale independently based on message volume

Pub/Sub Topics Used:

11 topics for document processing pipeline
Each topic can have multiple subscribers (enabling parallel processing)
Message format: JSON with documentId, userId, and stage-specific data
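
The message format and the multi-subscriber fan-out can be illustrated with an in-memory stand-in. This is not the Pub/Sub client API: the class, type, and function names are hypothetical, and delivery here is synchronous, with none of the persistence or retry behavior of the real service.

```typescript
// Sketch only: an in-memory stand-in for illustration, not the Pub/Sub client library.
interface PipelineMessage {
  documentId: string;
  userId: string;
  [stageField: string]: unknown; // stage-specific data
}

function encodeMessage(msg: PipelineMessage): string {
  return JSON.stringify(msg);
}

/** Parses a JSON payload and checks the two fields every stage relies on. */
function decodeMessage(raw: string): PipelineMessage {
  const parsed = JSON.parse(raw);
  if (
    typeof parsed !== "object" ||
    parsed === null ||
    typeof parsed.documentId !== "string" ||
    typeof parsed.userId !== "string"
  ) {
    throw new Error("message must include documentId and userId");
  }
  return parsed as PipelineMessage;
}

type Handler = (msg: PipelineMessage) => void;

class InMemoryTopics {
  private subscribers: { [topic: string]: Handler[] } = {};

  subscribe(topic: string, handler: Handler): void {
    (this.subscribers[topic] = this.subscribers[topic] || []).push(handler);
  }

  /** Delivers a copy of the message to every subscriber of the topic. */
  publish(topic: string, msg: PipelineMessage): void {
    const raw = encodeMessage(msg);
    for (const handler of this.subscribers[topic] || []) {
      handler(decodeMessage(raw));
    }
  }
}
```

Subscribing two handlers to pdf-text-extracted mirrors how extractMetadata and extractSignals both receive the same event in Stage 3 without knowing about each other.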

Why Canonical Themes?

Assay uses a canonical theme taxonomy rather than free-form tagging. This approach:

Enables Discovery: Users can find documents across different authors/time periods that share themes
Maintains Consistency: Same theme always means the same thing (e.g., ARTIFICIAL_INTELLIGENCE.AI_SAFETY)
Scales Organically: New themes are flagged for review and added to the taxonomy
Supports Navigation: Hierarchical structure (L0 domains → L1 specific themes) enables browsing
Improves Search: Theme-based search is more reliable than keyword search

Taxonomy Structure:

L0 (Root Domains): 30 broad categories (e.g., ARTIFICIAL_INTELLIGENCE, CLOUD_SECURITY)
L1 (Specific Themes): 149 specific themes (e.g., ARTIFICIAL_INTELLIGENCE.AI_ARCHITECTURES)
UNCATEGORIZED: Catch-all for documents that don't match any canonical theme
Synonyms: Each theme can have searchable synonyms for better matching
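
The dotted theme IDs and the synonym lookup can be sketched as below. The ID format is taken from the examples above; the function names and the synonym entries are hypothetical placeholders, not part of Assay's taxonomy.

```typescript
// Sketch only: synonym entries are made-up placeholders for illustration.
interface ThemeRef {
  level: "L0" | "L1";
  domain: string; // the L0 domain, e.g. ARTIFICIAL_INTELLIGENCE
  id: string; // the full canonical ID
}

/** Splits a canonical ID such as DOMAIN.THEME into its level and parent domain. */
function parseThemeId(id: string): ThemeRef {
  const dot = id.indexOf(".");
  if (dot === -1) return { level: "L0", domain: id, id };
  return { level: "L1", domain: id.slice(0, dot), id };
}

const SYNONYMS: { [themeId: string]: string[] } = {
  "ARTIFICIAL_INTELLIGENCE.AI_SAFETY": ["ai alignment", "safe ai"],
  CLOUD_SECURITY: ["cloud protection"],
};

/** Returns every canonical theme ID that lists the query as a synonym. */
function findThemesBySynonym(query: string): string[] {
  const normalized = query.trim().toLowerCase();
  const hits: string[] = [];
  for (const themeId in SYNONYMS) {
    if (SYNONYMS[themeId].indexOf(normalized) !== -1) hits.push(themeId);
  }
  return hits;
}
```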
