3. Document Processing Workflow
When a user uploads a PDF, Assay runs a multi-stage, event-driven pipeline that extracts text, analyzes it,
and generates summaries. Here's how it works:
Stage 1: Upload & Validation
Client-Side (Frontend)
- User selects PDF file (max 5MB)
- Client validates file type and size
- Creates Firestore document with status `uploading`
- Uploads to Cloud Storage: `uploads/{userId}/{documentId}/original.pdf`
Storage Trigger (`onFileUpload`)
- Automatically fires when upload completes
- Validates PDF signature (reads first bytes, checks for `%PDF-`)
- Verifies file size (enforces 5MB limit)
- Extracts `userId` and `documentId` from storage path
- Updates Firestore document status to `processing`
- Creates `display/{documentId}` entry for UI
- Publishes to `pdf-uploaded` Pub/Sub topic
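The validation steps in the storage trigger can be sketched as three small pure functions. This is a minimal illustration, not the actual `onFileUpload` code: the function names, the `UploadPathParts` type, and the reading of "5MB" as 5 × 1024 × 1024 bytes are all assumptions.

```typescript
// Hypothetical helpers illustrating the storage-trigger checks described above.

// Assumption: "5MB" means 5 * 1024 * 1024 bytes.
const MAX_PDF_BYTES = 5 * 1024 * 1024;

// ASCII bytes for the "%PDF-" signature.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

/** Returns true when the first bytes of the file are "%PDF-". */
function hasPdfSignature(head: Uint8Array): boolean {
  if (head.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((byte, i) => head[i] === byte);
}

/** Returns true when the file is non-empty and within the size limit. */
function isWithinSizeLimit(sizeBytes: number): boolean {
  return sizeBytes > 0 && sizeBytes <= MAX_PDF_BYTES;
}

interface UploadPathParts {
  userId: string;
  documentId: string;
}

/** Parses uploads/{userId}/{documentId}/original.pdf; returns null on mismatch. */
function parseUploadPath(path: string): UploadPathParts | null {
  const match = /^uploads\/([^\/]+)\/([^\/]+)\/original\.pdf$/.exec(path);
  if (!match) return null;
  return { userId: match[1], documentId: match[2] };
}
```

Rejecting a path that does not match the expected layout (rather than guessing at the IDs) keeps a misplaced file from updating the wrong Firestore document.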
Stage 2: Text Extraction
Worker: `extractText`
- Trigger: `pdf-uploaded` Pub/Sub topic
- Process:
  - Downloads PDF from Cloud Storage
  - Extracts text using the `pdf-parse` library (temporarily, for processing only)
  - Computes statistics: character count, token estimate, page count
  - Uses extracted text to generate summaries (text is not stored, only summaries are retained)
  - Updates Firestore `processing/{documentId}` with text stats
  - Publishes to `pdf-text-extracted` Pub/Sub topic
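The statistics step might look like the sketch below. The document does not say how the token estimate is computed, so the 4-characters-per-token ratio is an assumed rule of thumb, and `TextStats` and `computeTextStats` are hypothetical names.

```typescript
// Hypothetical shape for the text statistics written to processing/{documentId}.
interface TextStats {
  characterCount: number;
  tokenEstimate: number;
  pageCount: number;
}

// Assumption: roughly 4 characters per token. The real estimator is not documented.
const CHARS_PER_TOKEN = 4;

/**
 * Computes statistics from extracted text. The page count is passed in
 * because it comes from the PDF parser, not from the text itself.
 */
function computeTextStats(text: string, pageCount: number): TextStats {
  // String.length counts UTF-16 code units, which is close enough for an estimate.
  const characterCount = text.length;
  return {
    characterCount,
    tokenEstimate: Math.ceil(characterCount / CHARS_PER_TOKEN),
    pageCount,
  };
}
```

Only these numbers are persisted; the text itself stays in memory for the duration of processing.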
Stage 3: Parallel Analysis
Once text is extracted, two workers process the document in parallel:
Metadata Extraction (`extractMetadata`)
- Trigger: `pdf-text-extracted` Pub/Sub topic
- Process:
  - Uses Gemini 2.5 Pro (always, regardless of user quality selection) to extract:
    - Title
    - Authors
    - Publication date
    - Keywords and key concepts
  - Falls back to `pdf-parse` metadata if Gemini fails
  - Uses first-page heuristic for title/author if needed
  - Updates Firestore `processing/{documentId}` and `display/{documentId}`
  - Publishes to `pdf-metadata-extracted` Pub/Sub topic
Signals Extraction (`extractSignals`)
- Trigger: `pdf-text-extracted` Pub/Sub topic
- Process:
  - Uses Gemini 2.5 Flash to extract:
    - Keywords (important terms)
    - Key phrases (significant multi-word expressions)
    - Key concepts (high-level ideas)
    - Discovered themes (free-form themes from document)
  - Updates Firestore `processing/{documentId}` with signals
  - Publishes to `pdf-signals-extracted` Pub/Sub topic
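Parallel Processing Benefit: These two workers run simultaneously, so this stage takes about as long as the slower
worker instead of the sum of both. When the two take similar time, that is roughly a 50% reduction for this stage compared to sequential execution.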
Stage 4: Theme Classification
Worker: `matchCanonicalThemes`
- Trigger: `pdf-themes-match-requested` Pub/Sub topic (published after signals extraction)
- Process:
  - Uses Gemini 2.5 Flash to match discovered themes to canonical taxonomy
  - Canonical Taxonomy:
    - 30 L0 domains (broad categories, e.g., `ARTIFICIAL_INTELLIGENCE`, `CLOUD_SECURITY`)
    - 149 L1 specific themes (e.g., `ARTIFICIAL_INTELLIGENCE.AI_ARCHITECTURES`)
    - `UNCATEGORIZED` catch-all for unmatched themes
  - Rules:
    - Prefer L1 (specific) themes when there's a clear match
    - Fall back to L0 (domain) if no L1 fits
    - Minimum 1, maximum 7 canonical themes per document
    - Unmatched themes flagged for review
  - Updates Firestore `processing/{documentId}` with `themes_l1[]` and `themes_l0[]` arrays
  - Publishes to `pdf-themes-matched` Pub/Sub topic
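The matching rules can be expressed as a small post-processing step applied to the model's output. The sketch below is illustrative only. `ThemeMatch`, `ThemeResult`, and `applyThemeRules` are hypothetical names, and two behaviors are assumptions the document does not confirm: `themes_l0[]` holds only fallback matches, and `UNCATEGORIZED` is written to `themes_l0[]` when nothing matches.

```typescript
// Hypothetical per-theme match as it might come back from the model.
interface ThemeMatch {
  discovered: string; // free-form theme from signals extraction
  l1?: string;        // e.g. "ARTIFICIAL_INTELLIGENCE.AI_ARCHITECTURES"
  l0?: string;        // e.g. "ARTIFICIAL_INTELLIGENCE"
}

interface ThemeResult {
  themes_l1: string[];
  themes_l0: string[];
  flaggedForReview: string[];
}

const MAX_THEMES = 7;
const UNCATEGORIZED = "UNCATEGORIZED";

function applyThemeRules(matches: ThemeMatch[]): ThemeResult {
  const l1 = new Set<string>();
  const l0 = new Set<string>();
  const flaggedForReview: string[] = [];

  for (const match of matches) {
    const atCap = l1.size + l0.size >= MAX_THEMES;
    if (match.l1) {
      // Prefer the specific L1 theme when the model found one.
      if (!atCap) l1.add(match.l1);
    } else if (match.l0) {
      // Fall back to the L0 domain when no L1 theme fits.
      if (!atCap) l0.add(match.l0);
    } else {
      // No canonical match: flag the discovered theme for review.
      flaggedForReview.push(match.discovered);
    }
  }

  // Enforce the minimum of one canonical theme (assumed placement in themes_l0).
  if (l1.size + l0.size === 0) l0.add(UNCATEGORIZED);

  return {
    themes_l1: Array.from(l1),
    themes_l0: Array.from(l0),
    flaggedForReview,
  };
}
```

Using sets removes duplicates when several discovered themes map to the same canonical theme, so the cap of 7 counts distinct themes.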
Stage 5: Strategy Selection
Worker: `selectStrategy`
- Trigger: `pdf-signals-extracted` Pub/Sub topic
- Process:
  - Reads token count from extracted text statistics
  - Decision Logic:
    - < 5,000 tokens → Single-pass strategy
    - ≥ 5,000 tokens → Hierarchical strategy
  - Updates Firestore `processing/{documentId}` with strategy field
  - Publishes to `pdf-summary-requested` Pub/Sub topic (with strategy)
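The decision logic reduces to a single threshold comparison. A minimal sketch follows; the `Strategy` type and the function name are hypothetical, while the 5,000-token boundary and the strategy values come from the rules above.

```typescript
type Strategy = "single-pass" | "hierarchical";

// Documents at or above this token count use the hierarchical strategy.
const HIERARCHICAL_THRESHOLD_TOKENS = 5000;

/** Picks the summarization strategy from the estimated token count. */
function selectStrategyForTokens(tokenCount: number): Strategy {
  return tokenCount < HIERARCHICAL_THRESHOLD_TOKENS ? "single-pass" : "hierarchical";
}
```

The boundary case matters: a document with exactly 5,000 tokens takes the hierarchical path.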
Stage 6: Summary Generation
Based on the selected strategy, summaries are generated differently:
Single-Pass Strategy (`summarizeSingle`)
- Trigger: `pdf-summary-requested` Pub/Sub topic (when strategy = single-pass)
- Process:
  - Single Gemini call generates all three summary types:
    - General Summary (`simpleSummary`): Clear, accessible summary (2-3 paragraphs)
    - Wired-Style Summary (`wiredStyleSummary`): Journalistic-style article (3-4 paragraphs)
    - FAQs (`faqs`): 5-8 question-answer pairs
  - Uses Gemini 2.5 Flash (Fast quality) or Gemini 2.5 Pro (Premium quality) based on user selection
  - Updates Firestore `display/{documentId}` with all summaries
  - Publishes to `pdf-summary-completed` Pub/Sub topic
Hierarchical Strategy (`summarizeHierarchical`)
- Trigger: `pdf-summary-requested` Pub/Sub topic (when strategy = hierarchical)
- Process:
  - Map Step: Chunks text (section-aware, 10-15% overlap), generates per-chunk summaries in parallel
  - Reduce Step: Merges chunk summaries into final summaries using Gemini
  - Generates same three summary types as single-pass
  - Updates Firestore `display/{documentId}` with all summaries
  - Publishes to `pdf-summary-completed` Pub/Sub topic
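The map and reduce steps can be sketched as follows. This is a simplified illustration under stated assumptions: it splits on fixed-size character windows rather than section boundaries, so it is not the section-aware chunker described above, and the `summarize` callback is a hypothetical stand-in for the Gemini call. All names are illustrative.

```typescript
/**
 * Splits text into fixed-size windows where each chunk repeats the tail of
 * the previous one. overlapRatio = 0.1 means a 10% overlap.
 * Note: fixed-size windows only; section-aware splitting is not shown here.
 */
function chunkWithOverlap(text: string, chunkSize: number, overlapRatio = 0.1): string[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new Error("chunkSize must be a positive integer");
  }
  if (overlapRatio < 0 || overlapRatio >= 1) {
    throw new Error("overlapRatio must be in [0, 1)");
  }
  const overlap = Math.floor(chunkSize * overlapRatio);
  const step = chunkSize - overlap; // always >= 1 given the checks above
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

/**
 * Map step: summarize every chunk in parallel.
 * Reduce step: summarize the concatenated partial summaries.
 */
async function summarizeHierarchically(
  chunks: string[],
  summarize: (text: string) => Promise<string>, // hypothetical model call
): Promise<string> {
  const partials = await Promise.all(chunks.map((chunk) => summarize(chunk)));
  return summarize(partials.join("\n\n"));
}
```

The overlap gives each chunk some context from its neighbor, which reduces the chance that a sentence cut at a boundary is lost from both summaries.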
FAQ Generation (`generateFAQs`)
- Trigger: `pdf-faq-summary-requested` Pub/Sub topic (published after themes are matched)
- Process:
  - Uses canonical themes to organize FAQs by theme
  - Generates 5-12 questions and answers
  - Questions are practical and theme-organized
  - Updates Firestore `display/{documentId}` with FAQs
  - Triggers completion check
Stage 7: Completion & Cleanup
Worker: `generateFAQs` (completion check)
When all summaries (general, wired-style, FAQ) are complete:
- Updates Firestore `processing/{documentId}` status to `completed`
- Updates `display/{documentId}` status to `ready`
- Deletes original PDF from Cloud Storage (only summaries are retained, no extracted text)
- Publishes to `pdf-summary-completed` Pub/Sub topic
Cleanup Job (`cleanupPDFs`)
- Scheduled Cloud Function (runs daily at 2 AM UTC)
- Safety net for PDF deletion
- Deletes PDFs older than 3 hours from completed documents
- Bounds retention: a PDF that escapes immediate deletion is removed at the first daily run after it passes the 3-hour mark
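The selection rule for the cleanup job can be sketched as a pure filter over storage entries. The `StoredPdf` shape and the function name are hypothetical, and measuring age from the upload timestamp is an assumption; the 3-hour cutoff and the `completed` status come from the description above.

```typescript
// Hypothetical view of a stored PDF joined with its document status.
interface StoredPdf {
  path: string;
  uploadedAtMs: number;   // assumption: age is measured from upload time
  documentStatus: string; // status from processing/{documentId}
}

const MAX_PDF_AGE_MS = 3 * 60 * 60 * 1000; // 3 hours

/** Returns the storage paths the cleanup job should delete. */
function selectPdfsToDelete(pdfs: StoredPdf[], nowMs: number): string[] {
  return pdfs
    .filter((pdf) => pdf.documentStatus === "completed")
    .filter((pdf) => nowMs - pdf.uploadedAtMs > MAX_PDF_AGE_MS)
    .map((pdf) => pdf.path);
}
```

Passing the current time in as a parameter keeps the rule easy to test without mocking the clock.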
Processing Timeline (Typical)
These are approximate timelines and vary based on document size, model selection (Flash vs Pro), and system load:
| Time | Stage | Description |
| --- | --- | --- |
| T+0s | Upload | User uploads PDF, validation occurs |
| T+1s | Trigger | Storage trigger fires automatically |
| T+5s | Text Extraction | Text extracted for processing (not stored, only summaries retained) |
| T+10s | Parallel Analysis | Metadata and signals extracted simultaneously |
| T+15s | Theme Classification | Themes matched to canonical taxonomy |
| T+20s | Summary Generation | All three summary types generated (single-pass) or chunks processed (hierarchical) |
| T+30s | Complete | Document ready, summaries appear in UI |
Note: Hierarchical processing for large documents (≥5K tokens) may take 2-5 minutes depending on document size and chunk count.
4. Design Decisions
Why Two Data Collections?
Assay maintains two separate Firestore collections, each optimized for different purposes:
`processing/` Collection - Processing State & Raw Data
- Used by: Worker functions during processing
- Contains:
  - Processing status (`uploading` → `processing` → `completed`, or `failed` on error)
  - Raw signals (keywords, key phrases, key concepts, discovered themes)
  - Canonical themes (`themes_l1[]`, `themes_l0[]`)
  - Metadata (title, authors, date)
  - Processing strategy (`single-pass` or `hierarchical`)
  - Quality selection (`fast` or `premium`)
  - Visibility (`public` or `private`)
- Optimized for: Worker queries and updates during processing
- Indexes: Optimized for processing queries (status, userId, themes)
`display/` Collection - UI-Optimized Display Data
- Used by: Frontend for display
- Contains:
  - Display status (`uploading` → `processing` → `ready`, or `error` on failure)
  - Formatted summaries (`simpleSummary`, `wiredStyleSummary`, `faqs`)
  - Metadata (title, authors, date), duplicated for fast access
  - Visibility (`public` or `private`)
- Optimized for: Frontend queries and real-time updates
- Indexes: Optimized for user queries (userId, visibility, themes)
Benefits of Separation:
- Performance: Each collection can be optimized for its specific use case
- Scalability: Workers don't compete with frontend queries
- Maintainability: Clear separation of concerns
- Cost: Smaller `display/` collection reduces read costs for frontend
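The two document shapes can be sketched as TypeScript interfaces. Field names such as `themes_l1`, `simpleSummary`, `wiredStyleSummary`, and `faqs` come from this document; the remaining field names, the FAQ item shape, and the `failed` → `error` status mapping are assumptions made for illustration.

```typescript
type ProcessingStatus = "uploading" | "processing" | "completed" | "failed";
type DisplayStatus = "uploading" | "processing" | "ready" | "error";
type Visibility = "public" | "private";

// Hypothetical shape of a processing/{documentId} document.
interface ProcessingDoc {
  status: ProcessingStatus;
  userId: string;
  keywords: string[];
  keyPhrases: string[];
  keyConcepts: string[];
  discoveredThemes: string[];
  themes_l1: string[];
  themes_l0: string[];
  title?: string;
  authors?: string[];
  strategy?: "single-pass" | "hierarchical";
  quality: "fast" | "premium";
  visibility: Visibility;
}

// Hypothetical shape of a display/{documentId} document.
interface DisplayDoc {
  status: DisplayStatus;
  userId: string;
  simpleSummary?: string;
  wiredStyleSummary?: string;
  faqs?: { question: string; answer: string }[];
  title?: string;     // duplicated from processing/ for fast access
  authors?: string[]; // duplicated from processing/ for fast access
  visibility: Visibility;
}

// Assumed mapping between the two status vocabularies.
const DISPLAY_STATUS: Record<ProcessingStatus, DisplayStatus> = {
  uploading: "uploading",
  processing: "processing",
  completed: "ready",
  failed: "error",
};

/** Translates a worker-side status into the status the UI shows. */
function toDisplayStatus(status: ProcessingStatus): DisplayStatus {
  return DISPLAY_STATUS[status];
}
```

Keeping the raw signals out of `DisplayDoc` is what makes the frontend reads smaller: the UI fetches only what it renders.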
Why Delete PDFs After Processing?
Once processing is complete, the original PDF is deleted from Cloud Storage. No extracted text from PDFs is kept; only AI-generated summaries are retained. This design decision prioritizes:
- Privacy: Original files and extracted text aren't retained long-term, reducing privacy risk
- Cost Efficiency: Summary storage is significantly cheaper than PDF or full-text storage (approximately 10-20x reduction)
- Security: Less data to protect and manage, reducing attack surface
- Compliance: Summary-only retention simplifies compliance requirements (GDPR, CCPA, etc.)
- Processing Speed: Summaries are faster to read and process than full documents
Deletion Mechanism:
- Immediate: PDF deleted when all summaries complete (in `generateFAQs` completion check)
- Safety Net: Nightly cleanup job (`cleanupPDFs`) deletes PDFs older than 3 hours
- Summary Retention: Only AI-generated summaries are stored in Firestore and retained indefinitely. No extracted text from PDFs is kept.
Why Event-Driven Messaging?
Using Google Cloud Pub/Sub between processing stages provides several critical benefits:
- Reliability: Messages are persisted, ensuring no processing steps are lost even if workers crash
- Retry Logic: Failed processing can be automatically retried through Pub/Sub's built-in retry mechanism
- Backpressure Handling: The queue absorbs spikes in processing demand, preventing system overload
- Monitoring: Message flow provides clear observability into system health and processing bottlenecks
- Decoupling: Workers don't need to know about each other, only the topics they publish/subscribe to
- Scalability: Each topic can scale independently based on message volume
Pub/Sub Topics Used:
- 11 topics for document processing pipeline
- Each topic can have multiple subscribers (enabling parallel processing)
- Message format: JSON with documentId, userId, and stage-specific data
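A message envelope along these lines would satisfy the format described. This is a sketch, not the actual schema: `PipelineMessage`, `encodeMessage`, and `decodeMessage` are hypothetical names, and the stage-specific fields are left open because the document does not enumerate them.

```typescript
// Every message carries the two IDs plus arbitrary stage-specific fields.
interface PipelineMessage {
  documentId: string;
  userId: string;
  [stageField: string]: unknown;
}

/** Serializes a message to the JSON string placed in the Pub/Sub payload. */
function encodeMessage(message: PipelineMessage): string {
  return JSON.stringify(message);
}

/** Parses and validates a payload; throws if the required IDs are missing. */
function decodeMessage(raw: string): PipelineMessage {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("message must be a JSON object");
  }
  const { documentId, userId } = parsed as Record<string, unknown>;
  if (typeof documentId !== "string" || typeof userId !== "string") {
    throw new Error("message must include string documentId and userId");
  }
  return parsed as PipelineMessage;
}
```

Validating on decode means a worker rejects a malformed message at the boundary instead of failing partway through processing.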
Why Canonical Themes?
Assay uses a canonical theme taxonomy rather than free-form tagging. This approach:
- Enables Discovery: Users can find documents across different authors/time periods that share themes
- Maintains Consistency: Same theme always means the same thing (e.g., `ARTIFICIAL_INTELLIGENCE.AI_SAFETY`)
- Scales Organically: New themes are flagged for review and added to the taxonomy
- Supports Navigation: Hierarchical structure (L0 domains → L1 specific themes) enables browsing
- Improves Search: Theme-based search is more reliable than keyword search
Taxonomy Structure:
- L0 (Root Domains): 30 broad categories (e.g., `ARTIFICIAL_INTELLIGENCE`, `CLOUD_SECURITY`)
- L1 (Specific Themes): 149 specific themes (e.g., `ARTIFICIAL_INTELLIGENCE.AI_ARCHITECTURES`)
- UNCATEGORIZED: Catch-all for documents that don't match any canonical theme
- Synonyms: Each theme can have searchable synonyms for better matching
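Because L1 identifiers embed their L0 domain before the dot, the hierarchy can be navigated from the identifier alone. The sketch below illustrates that, along with a simple synonym lookup. All function names are hypothetical, and the case-insensitive exact-match lookup is an assumption about how synonyms might be resolved.

```typescript
const UNCATEGORIZED = "UNCATEGORIZED";

/** Returns the L0 domain of a theme ID (an L0 ID is its own domain). */
function parentDomain(themeId: string): string {
  const dot = themeId.indexOf(".");
  return dot === -1 ? themeId : themeId.slice(0, dot);
}

/** Classifies an ID as L1 when it has a DOMAIN.THEME form, otherwise L0. */
function themeLevel(themeId: string): "L0" | "L1" {
  return themeId.indexOf(".") === -1 ? "L0" : "L1";
}

/** Builds a lowercase lookup from each theme ID and its synonyms to the ID. */
function buildSynonymIndex(synonymsByTheme: Record<string, string[]>): Map<string, string> {
  const index = new Map<string, string>();
  for (const themeId of Object.keys(synonymsByTheme)) {
    index.set(themeId.toLowerCase(), themeId);
    for (const synonym of synonymsByTheme[themeId]) {
      index.set(synonym.trim().toLowerCase(), themeId);
    }
  }
  return index;
}

/** Resolves a search term to a canonical theme, or UNCATEGORIZED if unknown. */
function resolveTheme(term: string, index: Map<string, string>): string {
  return index.get(term.trim().toLowerCase()) ?? UNCATEGORIZED;
}
```

Deriving the domain from the identifier avoids storing a separate parent pointer for each L1 theme.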