Uploading Documents
Learn how to upload and manage documents for your chatbot's knowledge base.
Overview
Documents provide your chatbot with knowledge to answer questions. Syllabi supports:
- 📄 PDF files - Manuals, guides, reports
- 📝 Word documents - .docx files
- 📄 Text files - .txt, .md
- 📊 CSV files - Structured data
- 🎥 Videos - YouTube URLs, uploaded videos
- 🎵 Audio files - Podcasts, recordings
- 🌐 Web pages - Any URL
Document Processing Pipeline
- Upload: File uploaded to storage
- Parse: Extract text content
- Chunk: Split into manageable pieces (512-1024 tokens)
- Embed: Generate vector embeddings for similarity search
- Store: Save chunks and embeddings in database
- Ready: Available for chatbot to reference
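As a rough illustration of the Chunk, Embed, and Store stages above, here is a minimal sketch. Every name in it (`chunk_words`, `toy_embed`, `process_document`, `StoredChunk`) is hypothetical, whitespace-separated words stand in for tokens, and the "embedding" is a toy letter-frequency vector rather than whatever model Syllabi actually uses.

```python
from dataclasses import dataclass
import math


@dataclass
class StoredChunk:
    text: str
    embedding: list[float]


def chunk_words(text: str, max_words: int) -> list[str]:
    """Split text into pieces of at most max_words words (words stand in for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> list[float]:
    """Toy embedding: unit-length letter-frequency vector over a-z. Not a real model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def process_document(text: str, max_words: int = 512) -> list[StoredChunk]:
    """Chunk the parsed text, embed each piece, and return the records to store."""
    return [StoredChunk(piece, toy_embed(piece)) for piece in chunk_words(text, max_words)]
```

A real pipeline would swap `toy_embed` for a call to an embedding model and write the records to a database instead of returning them.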
Uploading PDF Documents
Step-by-Step
- Go to your chatbot dashboard
- Click Content tab
- Click Upload Document button
- Select PDF from file type dropdown
- Choose your PDF file (max 50MB)
- Click Upload
- Wait for processing (30 seconds - 5 minutes depending on size)
Best Practices
File preparation:
- ✅ Use searchable PDFs (not scanned images without OCR)
- ✅ Remove password protection
- ✅ Check for proper text extraction:
  - Try to copy and paste text from the PDF
  - If you can't select text, it's an image-based PDF
- ✅ Optimize file size (< 20MB recommended)
Content guidelines:
- ✅ Well-structured with headings
- ✅ Clear, readable font
- ✅ Good formatting (tables, lists)
- ✅ Relevant to chatbot purpose
What to avoid:
- ❌ Scanned documents without text layer
- ❌ Password-protected PDFs
- ❌ Corrupted files
- ❌ PDFs with mostly images
- ❌ Files > 50MB
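The size limits above can be checked locally before uploading. The sketch below is only an illustration: `check_pdf_size` is a hypothetical helper, and it assumes "MB" means 1024 × 1024 bytes, which this guide does not specify.

```python
from pathlib import Path

# Assumption: 1 MB = 1024 * 1024 bytes; the service may count differently.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024    # hard limit stated in this guide
RECOMMENDED_BYTES = 20 * 1024 * 1024   # recommended size stated in this guide


def check_pdf_size(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file passes these checks."""
    problems = []
    if Path(filename).suffix.lower() != ".pdf":
        problems.append("not a .pdf file")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("exceeds the 50MB upload limit")
    elif size_bytes > RECOMMENDED_BYTES:
        problems.append("larger than the recommended 20MB")
    return problems
```

This does not detect password protection or a missing text layer; those still need the manual checks listed above.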
Example Use Cases
Product Manual:
Product-Manual-v2.pdf
├── 234 pages
├── Processing time: ~2 minutes
├── Generates: ~400 chunks
└── Chatbot can answer: Installation, troubleshooting, specifications
Company Handbook:
Employee-Handbook-2024.pdf
├── 87 pages
├── Processing time: ~45 seconds
├── Generates: ~150 chunks
└── Chatbot can answer: Policies, benefits, procedures
Uploading Word Documents
Supported Formats
- .docx - Microsoft Word (2007+)
- .doc - Legacy Word (requires backend)
Upload Process
- Content tab → Upload Document
- Select Word Document
- Choose .docx file
- Upload and process
Advantages
- ✅ Better text extraction than PDF
- ✅ Preserves formatting and structure
- ✅ Smaller file sizes
- ✅ Faster processing
Tips
Convert PDFs to Word:
- Use Microsoft Word: File → Open → Select PDF
- Use online tools: pdf2doc.com
- Better results than direct PDF parsing
Formatting:
- Use heading styles (H1, H2, H3)
- Clear section breaks
- Consistent formatting
Uploading Text Files
Supported Formats
- .txt - Plain text
- .md - Markdown
- .csv - Comma-separated values
Markdown Files
Great for documentation:
Example:
# API Reference
## Authentication
All requests require an API key in the header:
`Authorization: Bearer YOUR_API_KEY`
## Endpoints
### GET /api/users
Returns list of users.
**Parameters**:
- `page`: Page number (default: 1)
- `limit`: Results per page (default: 10)
**Response**:
```json
{
  "users": [...],
  "total": 100
}
```
Upload process:
- Save as api-reference.md
- Upload via Content tab
- Chatbot understands structure and formatting
CSV Files
Upload structured data:
Example - Product Catalog:
```csv
product_id,name,price,category,description
001,Widget Pro,29.99,Hardware,Professional-grade widget
002,Widget Lite,19.99,Hardware,Budget-friendly widget
003,Widget Plus,39.99,Hardware,Premium widget with extras
```
Chatbot can answer:
- "What's the price of Widget Pro?" → "$29.99"
- "Show me hardware products" → Lists all hardware items
- "Compare Widget Pro and Widget Plus" → Provides comparison
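One plausible way structured rows become searchable is to flatten each row into a line of "column: value" text before embedding. The sketch below shows that idea with the product catalog from this section; `rows_to_text` is a hypothetical helper, and this guide does not say how Syllabi actually processes CSV files.

```python
import csv
import io

CATALOG = """product_id,name,price,category,description
001,Widget Pro,29.99,Hardware,Professional-grade widget
002,Widget Lite,19.99,Hardware,Budget-friendly widget
003,Widget Plus,39.99,Hardware,Premium widget with extras
"""


def rows_to_text(csv_text: str) -> list[str]:
    """Turn each CSV row into one 'column: value; ...' line suitable for embedding."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["; ".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


for line in rows_to_text(CATALOG):
    print(line)
```

Keeping the column names in each line lets a question such as "What's the price of Widget Pro?" match the row that contains both "name: Widget Pro" and "price: 29.99".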
Uploading Videos (YouTube)
YouTube URL Import
- Content tab → Add URL
- Select YouTube Video
- Paste URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
- Click Import
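If you collect URLs programmatically before importing them, it can help to confirm that each one is a recognizable YouTube link. The following sketch uses only the Python standard library; `youtube_video_id` is a hypothetical helper that handles just the two most common URL shapes, and it is not part of Syllabi.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, or None if not recognized."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            ids = parse_qs(parsed.query).get("v")
            return ids[0] if ids else None
        return None
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    return None
```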
Processing
- Fetch metadata: Title, description, duration
- Download transcript: Auto-generated or manual captions
- Chunk by timestamps: Each chunk includes time range
- Generate embeddings: Search by content and timestamp
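To make "chunk by timestamps" concrete, the sketch below merges consecutive caption segments into chunks that each keep a start and end time. The `Segment` type, the `chunk_by_time` function, and the window length are illustrative assumptions, not Syllabi's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the video
    end: float
    text: str


def chunk_by_time(segments: list[Segment], window_seconds: float) -> list[Segment]:
    """Merge consecutive segments so each chunk spans at most window_seconds.

    A single segment longer than the window is kept whole as its own chunk.
    """
    chunks: list[Segment] = []
    for seg in segments:
        if chunks and seg.end - chunks[-1].start <= window_seconds:
            last = chunks[-1]
            chunks[-1] = Segment(last.start, seg.end, f"{last.text} {seg.text}")
        else:
            chunks.append(Segment(seg.start, seg.end, seg.text))
    return chunks
```

Because every chunk carries its own time range, an answer to "At what timestamp is Y discussed?" can cite the `start` of the best-matching chunk.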
Chatbot Capabilities
Users can ask:
- "What does the video say about X?"
- "Summarize the video"
- "At what timestamp is Y discussed?"
- "What topics are covered?"
Limitations
- ⚠️ May require a YouTube API key (optional backend feature)
- ⚠️ Only videos with captions/transcripts
- ⚠️ Transcripts must be publicly available
- ⚠️ Subject to YouTube API quotas
Uploading Audio Files
Supported Formats
- .mp3 - MPEG audio
- .wav - Waveform audio
- .m4a - MPEG-4 audio
- .ogg - Ogg Vorbis
Transcription Process
- Upload audio file (max 200MB)
- File sent to transcription service (AssemblyAI)
- Speech-to-text conversion
- Transcript chunked and embedded
- Ready for queries
Requirements
- ✅ Clear audio quality
- ✅ Minimal background noise
- ✅ AssemblyAI API key configured (optional backend feature)
Use Cases
Podcast Episodes:
episode-042.mp3
├── Duration: 45 minutes
├── Transcription time: ~5 minutes
└── Chatbot can answer episode content questions
Meeting Recordings:
team-meeting-2024-01-15.m4a
├── Duration: 30 minutes
└── Chatbot can summarize decisions and action items
Lecture Recordings:
calculus-lecture-3.mp3
├── Duration: 60 minutes
└── Students can ask questions about lecture content
Uploading Web Pages
URL Import
- Content tab → Add URL
- Select Web Page
- Enter URL: https://example.com/docs/getting-started
- Click Import
What Gets Extracted
- ✅ Main content text
- ✅ Headings and structure
- ✅ Lists and tables
- ✅ Article metadata
- ❌ Navigation menus
- ❌ Ads and sidebars
- ❌ Comments
Bulk Import
Import entire documentation sites:
Example - Import all docs pages:
https://example.com/docs/intro
https://example.com/docs/installation
https://example.com/docs/configuration
https://example.com/docs/api
Or use sitemap import (if supported):
https://example.com/sitemap.xml
Tips
- ✅ Import clean, well-structured pages
- ✅ Verify content was extracted correctly
- ✅ Remove duplicate imports
- ❌ Avoid pages with heavy JavaScript rendering
- ❌ Avoid pages behind login walls
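If sitemap import is not available, the URL list for a bulk import can be built from a sitemap by hand. The sketch below parses sitemap XML with the Python standard library and keeps only URLs under a given prefix; `urls_from_sitemap` and the sample sitemap are illustrative, and fetching the file over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/blog/news</loc></url>
  <url><loc>https://example.com/docs/api</loc></url>
</urlset>"""


def urls_from_sitemap(xml_text: str, prefix: str = "") -> list[str]:
    """Return the <loc> URLs from a sitemap, keeping only those that start with prefix."""
    root = ET.fromstring(xml_text)
    urls = [(loc.text or "").strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return [u for u in urls if u and u.startswith(prefix)]


for url in urls_from_sitemap(SAMPLE_SITEMAP, prefix="https://example.com/docs/"):
    print(url)
```

Filtering by prefix keeps blog posts and other unrelated pages out of the knowledge base, which also helps avoid duplicate imports.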
Managing Documents
View Uploaded Documents
Content tab shows all documents:
| Document | Type | Status | Size | Chunks | Actions |
|---|---|---|---|---|---|
| Product Manual | PDF | Completed | 5.2 MB | 387 | View, Delete |
| API Docs | URL | Processing | - | - | Cancel |
| Employee Handbook | DOCX | Failed | 2.1 MB | - | Retry, Delete |
Document Status
Pending:
- ⏳ In upload queue
- Action: Wait
Processing:
- ⚙️ Currently being processed
- Action: Wait or cancel
Completed:
- ✅ Ready to use
- Action: View chunks, delete if needed
Failed:
- ❌ Processing error
- Action: Check error message, retry, or delete
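The four statuses and their actions can be summarized as a small lookup table. This is only a sketch of the rules listed above; the names `DocumentStatus`, `ALLOWED_ACTIONS`, and `can_perform`, and the action strings, are hypothetical and do not correspond to a documented Syllabi API.

```python
from enum import Enum


class DocumentStatus(Enum):
    PENDING = "Pending"
    PROCESSING = "Processing"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Actions this guide lists for each status (action names are illustrative).
ALLOWED_ACTIONS = {
    DocumentStatus.PENDING: {"wait"},
    DocumentStatus.PROCESSING: {"wait", "cancel"},
    DocumentStatus.COMPLETED: {"view_chunks", "delete"},
    DocumentStatus.FAILED: {"view_error", "retry", "delete"},
}


def can_perform(status: DocumentStatus, action: str) -> bool:
    """Return True if the action is listed for the given document status."""
    return action in ALLOWED_ACTIONS[status]
```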
Viewing Chunks
Click View Chunks to see how the document was split:
Chunk 1 (page 5):
"Getting Started with Product X
Installation requires the following prerequisites:
- Node.js 18 or higher
- npm 9 or higher..."
Chunk 2 (page 5-6):
"To install Product X, run the following command:
npm install product-x
This will download and install all required dependencies..."
Deleting Documents
- Click Delete icon
- Confirm deletion
- Document and all chunks removed
- Embeddings deleted
- Chatbot will no longer reference this document
Note: Deletion is permanent and cannot be undone.
Organizing with Folders
Create folders to organize documents:
Product Documentation/
├── User Guide.pdf
├── Installation Guide.pdf
└── Troubleshooting.pdf
Internal/
├── Employee Handbook.docx
├── HR Policies.pdf
└── IT Guidelines.pdf
Training/
├── Onboarding Video (YouTube)
├── Product Demo (YouTube)
└── Training Manual.pdf
Benefits:
- Better organization
- Easier to find documents
- Can enable/disable entire folders
- Separate concerns
Document Processing Options
Chunking Strategy
Configure how documents are split:
Small Chunks (256-512 tokens):
- ✅ Precise retrieval
- ✅ Better for Q&A
- ❌ May lose context
- Use for: FAQs, structured data
Medium Chunks (512-1024 tokens) [Default]:
- ✅ Balanced
- ✅ Good context retention
- ✅ Works for most use cases
Large Chunks (1024-2048 tokens):
- ✅ More context
- ✅ Better for narratives
- ❌ Less precise retrieval
- Use for: Stories, articles, long-form content
Overlap
Add overlap between chunks:
Overlap: 50 tokens
Chunk 1: [Tokens 0-512]
Chunk 2: [Tokens 462-974] (overlaps 50 tokens with Chunk 1)
Chunk 3: [Tokens 924-1436] (overlaps 50 tokens with Chunk 2)
Benefits:
- Prevents losing context at chunk boundaries
- Improves retrieval quality
- Recommended: 10-20% overlap
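The overlap arithmetic in the example above follows from one rule: each chunk starts `chunk_size - overlap` tokens after the previous one. Here is a minimal sketch that reproduces those ranges; `chunk_ranges` is a hypothetical helper, and the ranges are treated as half-open (the end index is exclusive).

```python
def chunk_ranges(total_tokens: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) token ranges; neighbors share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # distance between consecutive chunk starts
    ranges = []
    start = 0
    while start < total_tokens:
        end = min(start + chunk_size, total_tokens)
        ranges.append((start, end))
        if end == total_tokens:
            break  # the last chunk reached the end of the document
        start += stride
    return ranges


print(chunk_ranges(1436, chunk_size=512, overlap=50))
# [(0, 512), (462, 974), (924, 1436)]
```

With 512-token chunks, a 50-token overlap is just under 10%, the low end of the recommended range; 20% would be roughly 100 tokens.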
Metadata
Add custom metadata to documents:
{
  "document_type": "user_guide",
  "version": "2.1",
  "department": "product",
  "last_updated": "2024-01-15",
  "tags": ["installation", "configuration"]
}
Use metadata for:
- Filtering searches
- Version control
- Access control
- Analytics
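As an example of filtering searches by metadata, the sketch below checks a document's metadata against a set of required values. The `matches` function and its matching rules (exact match for scalar fields, membership for list fields such as `tags`) are assumptions for illustration, not Syllabi's documented behavior.

```python
USER_GUIDE = {
    "document_type": "user_guide",
    "version": "2.1",
    "department": "product",
    "last_updated": "2024-01-15",
    "tags": ["installation", "configuration"],
}


def matches(metadata: dict, filters: dict) -> bool:
    """True if every filter matches; list-valued fields match when they contain the value."""
    for key, wanted in filters.items():
        actual = metadata.get(key)
        if isinstance(actual, list):
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


print(matches(USER_GUIDE, {"department": "product", "tags": "installation"}))  # True
print(matches(USER_GUIDE, {"version": "2.0"}))  # False
```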
Troubleshooting
Upload Fails
Error: "File too large"
- Solution: Compress PDF or split into smaller files
- Limit: 50MB per file
Error: "Invalid file format"
- Solution: Verify file extension matches content
- Check: File isn't corrupted
Error: "Processing timeout"
- Solution: File too large or complex, try splitting
- Note: The backend may need more resources to process large files
No Text Extracted from PDF
Cause: Image-based PDF without text layer
Solutions:
- Use an OCR tool to add a text layer:
  - Adobe Acrobat: Tools → Recognize Text
  - Online: ocr.space
- Convert to Word first, then upload
- Re-create PDF from source document
Chunks Look Wrong
Issue: Sentences cut off mid-word
Solutions:
- Adjust chunk size
- Increase overlap
- Improve source document formatting
Issue: Duplicate content in chunks
Solutions:
- Remove duplicate pages from source
- Adjust overlap percentage
- Filter duplicate chunks
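Filtering duplicate chunks can be as simple as hashing a normalized form of each chunk and keeping the first occurrence. The sketch below assumes you have the chunk texts available as strings; `dedupe_chunks` is a hypothetical helper, and it only catches chunks that are identical after lowercasing and collapsing whitespace, not near-duplicates.

```python
import hashlib


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks whose normalized text was already seen, keeping first occurrences."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Normalize case and whitespace so trivial differences do not hide duplicates.
        normalized = " ".join(chunk.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```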
Chatbot Can't Find Information
Issue: Document uploaded but chatbot doesn't use it
Debugging:
- Verify document status is "Completed"
- Check chunks were generated (View Chunks)
- Test direct question about document content
- Verify embeddings were created (check database)
- Confirm the system prompt mentions using documents
Solutions:
- Re-upload document
- Adjust chunking strategy
- Make system prompt explicit about using documents
- Check document actually contains the information
Best Practices
Content Quality
✅ Do:
- Upload authoritative sources
- Keep documents up to date
- Remove outdated information
- Use consistent terminology
- Include version numbers
❌ Don't:
- Upload low-quality content
- Mix old and new versions
- Include confidential data
- Upload irrelevant documents
Organization
✅ Do:
- Use descriptive filenames
- Organize into folders
- Tag with metadata
- Document upload dates
- Track document versions
Performance
✅ Do:
- Start with essential documents (3-10)
- Add more based on gaps
- Remove unused documents
- Monitor retrieval quality
- Optimize chunk sizes
❌ Don't:
- Upload everything at once
- Keep duplicate documents
- Ignore processing errors
- Forget to test after uploads
Security
✅ Do:
- Review documents before upload
- Remove sensitive information
- Use appropriate access controls
- Audit document usage
- Rotate credentials if exposed
❌ Don't:
- Upload confidential data
- Share documents publicly
- Include personal information
- Upload proprietary code without permission
Advanced Features
Batch Upload
Upload multiple files at once:
- Select multiple files (Shift+Click or Ctrl+Click)
- Upload all at once
- Monitor processing queue
- Review results when complete
Scheduled Re-imports
Keep web pages up to date:
- Set re-import schedule (daily, weekly, monthly)
- System automatically fetches latest version
- Updates chunks with new content
- Notifies of significant changes
Use for:
- Documentation websites
- Blog posts
- Product pages
- News feeds
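This guide does not define what counts as a "significant" change. One simple way to approximate it is to compare the old and new page text word by word and flag the page when more than a chosen fraction differs. The sketch below uses `difflib` from the Python standard library; the function names and the 20% threshold are assumptions.

```python
import difflib


def change_ratio(old_text: str, new_text: str) -> float:
    """Fraction of content that changed: 0.0 for identical text, 1.0 for nothing in common."""
    matcher = difflib.SequenceMatcher(None, old_text.split(), new_text.split())
    return 1.0 - matcher.ratio()


def is_significant_change(old_text: str, new_text: str, threshold: float = 0.2) -> bool:
    """Flag a re-imported page when at least `threshold` of its words changed."""
    return change_ratio(old_text, new_text) >= threshold
```

A lower threshold produces more notifications; pages with rotating banners or timestamps may need a higher one.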
Document Versioning
Track document versions:
Product-Manual-v1.0.pdf (2023-06-15)
Product-Manual-v2.0.pdf (2024-01-15) [Current]
- Keep old versions for reference
- Easy rollback if needed
- Track what changed between versions
Access Control
Restrict document access:
- Public: Anyone can query
- Authenticated: Logged-in users only
- Role-based: Specific user roles only
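The three access levels can be read as a simple decision rule. The sketch below is one way to express it; the `User` type, the `can_query` function, and the level strings are hypothetical and do not reflect a documented Syllabi API.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    authenticated: bool = False
    roles: set = field(default_factory=set)


def can_query(level: str, user: User, required_role: str = "") -> bool:
    """Apply the access levels: public, authenticated, or role-based."""
    if level == "public":
        return True
    if level == "authenticated":
        return user.authenticated
    if level == "role":
        # Role-based access assumes the user is logged in and holds the role.
        return user.authenticated and required_role in user.roles
    raise ValueError(f"unknown access level: {level}")
```

For instance, a document restricted to the admin role would be checked with `can_query("role", user, "admin")`.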
Example:
Employee Handbook → Authenticated only
Public FAQ → Public
Executive Reports → Admin role only
Next Steps
- Customizing Theme - Brand your chatbot
- Skills & Actions - Add custom functionality
- Analytics - Monitor document usage
- API Reference - Programmatic document management
Your documents are the foundation of your chatbot's knowledge. Regular maintenance and optimization ensure the best user experience.