UniversalChatBot

Overview

Document sources connect your content to knowledge collections. Each source type is optimized for a different use case, from uploading internal documents to continuously syncing public documentation sites.

All sources automatically:

Extract and clean content
Convert to a uniform format (markdown)
Track processing status
Handle errors gracefully
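
As a rough illustration of the four behaviors listed above, the sketch below runs a document through a conversion step while recording its status. The status names, the `Document` fields, and the `process` function are hypothetical; the product's internal pipeline is not documented here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    # Hypothetical status values; the real UI may use different labels.
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


@dataclass
class Document:
    name: str
    raw: str
    markdown: Optional[str] = None
    status: Status = Status.PENDING
    error: Optional[str] = None


def process(doc: Document, convert: Callable[[str], str]) -> Document:
    """Convert raw content to markdown, recording success or failure."""
    doc.status = Status.PROCESSING
    try:
        doc.markdown = convert(doc.raw).strip()
        doc.status = Status.READY
    except Exception as exc:
        # A failed document keeps its error message instead of raising,
        # so the remaining documents in the source can still be processed.
        doc.status = Status.FAILED
        doc.error = str(exc)
    return doc
```

Catching the error per document is one way to read "handle errors gracefully": one bad file does not block the rest of the source.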

File Uploads

Upload files directly from your computer to add them to a knowledge collection.

Supported File Types

Format	Extensions	Notes
Text	.txt	Plain text files
Markdown	.md, .markdown	Preserves formatting structure
HTML	.html, .htm	Converted to markdown
PDF	.pdf	Text extracted from document
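
If you prepare uploads with a script, a small pre-check against the table above can catch unsupported files early. This is a sketch only: `detect_format` and the format labels are illustrative names, not part of the product.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the supported file types table.
SUPPORTED = {
    ".txt": "text",
    ".md": "markdown",
    ".markdown": "markdown",
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
}


def detect_format(filename: str) -> Optional[str]:
    """Return the format label for a file name, or None if unsupported."""
    return SUPPORTED.get(Path(filename).suffix.lower())
```

For example, `detect_format("Guide.MD")` returns `"markdown"`, while `detect_format("report.docx")` returns `None`.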

How to Upload Files

Open your knowledge collection
Click "Add Source" → "File Upload"
Select one or more files from your computer
Files are uploaded and processing begins automatically
Track processing status in the sources list

Batch Upload

You can upload multiple files at once. Each file becomes a separate document in your collection, but they're all tracked under a single source.

When to Use File Uploads

Internal documents: Company policies, internal wikis, training materials
One-time imports: Historical content that won't change frequently
Confidential content: Documents that can't be publicly scraped
Ready-made files: Documents you already have in a supported format

Website Scraping

Automatically crawl and index an entire website or documentation site.

How Website Scraping Works

Start from a base URL (e.g., https://docs.example.com)
Crawler discovers pages by following internal links
Each page is fetched, converted to markdown, and indexed
Respects same-domain boundaries (won't crawl external sites)
Continues until all linked pages are found or max limit reached
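
The steps above amount to a breadth-first traversal of internal links. The sketch below shows that idea under stated assumptions: `crawl` and `get_links` are hypothetical names, link extraction is passed in as a function so the example runs without network access, and the product's actual crawler may work differently.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(
    base_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
) -> List[str]:
    """Visit pages breadth-first, staying on the base URL's host."""
    host = urlparse(base_url).netloc
    queue = deque([base_url])
    seen = {base_url}
    visited: List[str] = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real crawler would fetch and index here
        for href in get_links(url):
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc != host or link in seen:
                continue  # external site or already queued
            seen.add(link)
            queue.append(link)
    return visited
```

The loop stops when the queue is empty (all linked pages found) or when `max_pages` is reached, matching the last step in the list.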

Configuration Options

Base URL

The starting point for crawling. The crawler only indexes pages on the same domain and, when the base URL includes a path, under that path.

Good: https://docs.example.com (crawls all pages under docs.example.com)

Good: https://example.com/help (crawls /help and its subdirectories)

Avoid: https://example.com/blog/post-1 (too specific; use https://example.com/blog instead)
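
One way to reason about which pages a base URL covers is a same-host, path-prefix check. The function below is a sketch of that interpretation; `in_scope` is a hypothetical helper, and the exact matching rules used by the crawler are an assumption.

```python
from urllib.parse import urlparse


def in_scope(base_url: str, candidate: str) -> bool:
    """True if candidate is on the base host and under the base path."""
    base, cand = urlparse(base_url), urlparse(candidate)
    if cand.netloc != base.netloc:
        return False
    prefix = base.path.rstrip("/")
    # Match the base path itself or anything below it, but not
    # sibling paths that merely share a prefix (/help vs /helpdesk).
    return cand.path == prefix or cand.path.startswith(prefix + "/")
```

Under this reading, a base URL of https://example.com/help covers https://example.com/help/billing but not https://example.com/blog or https://example.com/helpdesk.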

Max Pages

Limit the number of pages to crawl to control processing time and costs. Recommended limits:

Small sites (1-50 pages): No limit needed
Medium sites (51-500 pages): Set limit to 200-300
Large sites (more than 500 pages): Use URL lists for specific sections instead

Be Respectful of Target Sites

The crawler is rate-limited to avoid overloading target servers. Large sites may take 10-30 minutes to fully crawl. Consider using URL lists for specific important pages if you need faster results.
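
How the crawler enforces its rate limit is not described here. A common approach is a minimum interval between requests, sketched below; the `RateLimiter` class and its `min_interval` parameter are illustrative assumptions, and the clock and sleep functions are injectable so the behavior can be checked without real delays.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum number of seconds between consecutive calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` before each request spaces requests out, which is why crawl time grows roughly in proportion to the number of pages.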

When to Use Website Scraping

Public documentation: Product docs, API references, help centers
Blog archives: Company blogs with helpful content
Knowledge bases: Public FAQs and support articles
Living content: Sites that update regularly (use automatic sync)

URL Lists

Index specific pages by providing a list of URLs to scrape, without crawling entire sites.

How URL Lists Work

Provide a list of specific URLs (one per line)
Each URL is fetched independently
No link following or crawling—only the exact URLs you specify
Perfect for curating content from multiple sources

Example URL List

https://docs.example.com/getting-started
https://docs.example.com/api/authentication
https://docs.example.com/api/rate-limits
https://blog.example.com/best-practices
https://help.example.com/troubleshooting
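
Before pasting a long list, it can help to clean it up locally. The sketch below parses a one-URL-per-line list, skips blank lines and duplicates, and rejects entries that are not absolute http(s) URLs. `parse_url_list` is a hypothetical helper; the validation the product itself performs may differ.

```python
from typing import List
from urllib.parse import urlparse


def parse_url_list(text: str) -> List[str]:
    """Return unique URLs in their original order, one per input line."""
    urls: List[str] = []
    seen = set()
    for line in text.splitlines():
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and duplicates
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        seen.add(url)
        urls.append(url)
    return urls
```

Because each URL is fetched independently, the list may mix domains freely, as in the example above.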

Configuration Tips

Pro Tips

Use URL lists for high-value pages from large sites
Combine multiple sources for different sections of a site
URLs can come from different domains
Great for content from sites that are hard to crawl

When to Use URL Lists

Curated content: Specific important pages across multiple sites
Mixed sources: Combine pages from different documentation sites
Targeted indexing: Focus on high-value pages without crawling entire sites
External content: Industry articles, competitor docs (ensure compliance with terms)

Manual Entry

Type or paste content directly into the knowledge base through the UI.

How Manual Entry Works

Click "Add Source" → "Manual Entry"
Create a new document with a title
Type or paste content into the editor
Save to trigger processing immediately

Content Format

Manual entries support both plain text and markdown:

Plain Text

Our refund policy:

Customers can request refunds within 30 days of purchase.
Refunds are processed within 5-7 business days.
Original payment method will be credited.

Markdown

# Refund Policy

## Eligibility
- Within 30 days of purchase
- Product must be unused
- Original packaging required

## Processing Time
Refunds are processed within **5-7 business days**.

## Payment
Original payment method will be credited.

Markdown is Recommended

Using markdown formatting (headings, lists, bold) helps the chunking algorithm preserve document structure and improves retrieval accuracy.
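
To see why headings help, consider a simplified heading-based splitter. This is not the product's chunking algorithm, which is not documented here; `chunk_by_heading` is an illustrative sketch that starts a new chunk at every markdown heading and does not handle special cases such as `#` lines inside code blocks.

```python
import re
from typing import List, Tuple

# A markdown ATX heading: one to six "#" characters, a space, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split markdown into (heading, body) pairs, one per heading."""
    chunks: List[Tuple[str, str]] = []
    title, body = "", []

    def flush() -> None:
        # Keep a chunk if it has a heading or any non-blank body text.
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Applied to the markdown refund policy above, this yields separate chunks titled "Refund Policy", "Eligibility", "Processing Time", and "Payment". The plain text version has no headings, so it would come back as a single untitled chunk.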

When to Use Manual Entry

Quick additions: Add small pieces of information immediately
FAQs: Create curated question-answer pairs
Policies: Type up short company policies or guidelines
Proprietary information: Content that doesn't exist elsewhere
Testing: Quickly test how content is chunked and retrieved

Automatic Syncing

Keep your knowledge base up to date by automatically re-scraping website and URL list sources on a schedule.

Sync Frequencies

Frequency	Best For
Manual	Static content, one-time imports, file uploads
Daily	Frequently updated documentation, news content
Weekly	Product docs, help centers (most common)
Monthly	Policy documents, infrequently updated content
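
The frequencies in the table can be thought of as intervals added to the last sync time. The sketch below makes that concrete; `next_sync` is a hypothetical helper, and treating "monthly" as a fixed 30 days is an assumption, since the exact scheduling rules are not specified.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed intervals; "monthly" is approximated as 30 days.
INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
}


def next_sync(last_sync: datetime, frequency: str) -> Optional[datetime]:
    """Return the next scheduled sync, or None for manual-only sources."""
    key = frequency.lower()
    if key == "manual":
        return None  # manual sources sync only when triggered by hand
    return last_sync + INTERVALS[key]
```

An unknown frequency raises `KeyError`, which keeps typos from silently disabling a schedule.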

How Syncing Works

At the scheduled time, the source is re-scraped
New pages are added to the collection
Updated pages replace old versions
Deleted pages are removed from the collection
All new content is automatically processed and indexed
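
The add, replace, and remove steps above can be modeled as a comparison between the previous snapshot and the fresh scrape. The sketch below compares content fingerprints per URL; `fingerprint`, `plan_sync`, and `SyncPlan` are hypothetical names, and using SHA-256 hashes to detect changes is an assumption rather than the documented mechanism.

```python
import hashlib
from typing import Dict, List, NamedTuple


class SyncPlan(NamedTuple):
    added: List[str]
    updated: List[str]
    removed: List[str]


def fingerprint(content: str) -> str:
    """Stable hash of page content, used to detect changes."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_sync(old: Dict[str, str], new: Dict[str, str]) -> SyncPlan:
    """Compare URL -> fingerprint maps from the last and current scrape."""
    added = sorted(url for url in new if url not in old)
    removed = sorted(url for url in old if url not in new)
    updated = sorted(
        url for url in new if url in old and new[url] != old[url]
    )
    return SyncPlan(added, updated, removed)
```

Pages whose fingerprint is unchanged appear in none of the three lists, so only new and updated pages need to be processed and indexed again.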

Syncing is Non-Disruptive

Your chatbot continues to use existing content while syncing happens in the background. New content becomes available as soon as processing completes.

Manual Sync

You can trigger a sync manually at any time:

Open the source in your collection
Click the "Sync Now" button
Processing begins immediately
Track progress in the source status

Best Practices

Choosing the Right Source Type

Use website scraping for public documentation that updates regularly
Use URL lists when you need specific pages from large sites
Use file uploads for internal documents or content you control
Use manual entry for quick additions or content that doesn't exist elsewhere

Content Quality

Prefer markdown or well-structured HTML: Better chunking and retrieval
Avoid image-heavy PDFs: Text extraction quality varies
Remove navigation boilerplate: If possible, use markdown versions of pages
Check first scrape results: Verify content is being extracted correctly

Organization

Group related sources in the same collection
Use descriptive source names: "Product Docs - Getting Started" not "Source 1"
Set appropriate sync frequencies: Don't sync daily if content changes monthly
Monitor sync status: Check for errors after setting up new sources

Performance

Start small: Add 10-20 documents first, test retrieval, then scale up
Set max page limits for website sources to avoid long processing times
Use URL lists for large sites: More control over what gets indexed
Remove outdated content: Delete old documents that are no longer relevant

Respect Copyright and Terms of Service

Only scrape websites you have permission to use. Most public documentation is fine, but always check the site's terms of service before indexing external content.
