
Document Sources

Learn how to add content from files, websites, URLs, and manual entry to build your knowledge base.

Overview

Document sources are the connectors between your content and knowledge collections. Each source type is optimized for different use cases, from uploading internal documents to continuously syncing public documentation sites.

All sources automatically:

  • Extract and clean content
  • Convert to a uniform format (markdown)
  • Track processing status
  • Handle errors gracefully

File Uploads

Upload files directly from your computer to add them to a knowledge collection.

Supported File Types

Format | Extensions | Notes
Text | .txt | Plain text files
Markdown | .md, .markdown | Preserves formatting structure
HTML | .html, .htm | Converted to markdown
PDF | .pdf | Text extracted from document
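
If you prepare uploads with a script, it can help to filter out unsupported files before you start. The sketch below encodes the table above as a lookup. The function names are hypothetical and are not part of the product; the extension matching is assumed to be case-insensitive.

```python
from pathlib import Path
from typing import List, Optional, Tuple

# Extension-to-format mapping copied from the supported file types table.
SUPPORTED_FORMATS = {
    ".txt": "Text",
    ".md": "Markdown",
    ".markdown": "Markdown",
    ".html": "HTML",
    ".htm": "HTML",
    ".pdf": "PDF",
}


def detect_format(filename: str) -> Optional[str]:
    """Return the format name for a supported file, or None if unsupported."""
    # Assumption: extensions are matched case-insensitively.
    return SUPPORTED_FORMATS.get(Path(filename).suffix.lower())


def split_supported(filenames: List[str]) -> Tuple[List[str], List[str]]:
    """Separate filenames into (supported, unsupported) lists, keeping order."""
    supported = [f for f in filenames if detect_format(f) is not None]
    unsupported = [f for f in filenames if detect_format(f) is None]
    return supported, unsupported


ok, skipped = split_supported(["policy.pdf", "Guide.MD", "slides.pptx"])
print(ok, skipped)
```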

How to Upload Files

  1. Open your knowledge collection
  2. Click "Add Source", then "File Upload"
  3. Select one or more files from your computer
  4. Files are uploaded and processing begins automatically
  5. Track processing status in the sources list

Batch Upload

You can upload multiple files at once. Each file becomes a separate document in your collection, but they're all tracked under a single source.
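
One way to picture the relationship between a batch upload, its source, and its documents is the small data model below. This is only an illustration: the class names and the status values (pending, processing, ready, failed) are assumptions and may not match what the product displays.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Status(Enum):
    # Hypothetical status names, chosen for illustration only.
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


@dataclass
class Document:
    filename: str
    status: Status = Status.PENDING


@dataclass
class Source:
    name: str
    documents: List[Document] = field(default_factory=list)

    def summary(self) -> Dict[str, int]:
        """Count the documents in each processing status."""
        counts = {status.value: 0 for status in Status}
        for doc in self.documents:
            counts[doc.status.value] += 1
        return counts


def batch_upload(name: str, filenames: List[str]) -> Source:
    """One source per batch; one document per uploaded file."""
    return Source(name, [Document(f) for f in filenames])


source = batch_upload("HR policies", ["leave.pdf", "expenses.md", "onboarding.txt"])
source.documents[0].status = Status.READY
print(source.summary())
```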

When to Use File Uploads

  • Internal documents: Company policies, internal wikis, training materials
  • One-time imports: Historical content that won't change frequently
  • Confidential content: Documents that can't be publicly scraped
  • Ready-to-upload files: Documents you already have in a supported format

Website Scraping

Automatically crawl and index an entire website or documentation site.

How Website Scraping Works

  1. The crawler starts from a base URL (e.g., https://docs.example.com)
  2. It discovers pages by following internal links
  3. Each page is fetched, converted to markdown, and indexed
  4. The crawler stays on the same domain and does not follow links to external sites
  5. Crawling continues until all linked pages are found or the max page limit is reached
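
The steps above describe a breadth-first crawl with a same-domain filter and a page cap. The following is a minimal sketch of that idea, not the product's actual crawler. The `get_links` callback and the in-memory `SITE` dictionary are hypothetical stand-ins for fetching and parsing real pages, so the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(base_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_pages: int = 300) -> List[str]:
    """Breadth-first crawl that stays on the base URL's host.

    get_links is a hypothetical callback returning the hrefs found on a page;
    a real crawler would fetch the page, parse its HTML, and rate-limit requests.
    """
    base_host = urlparse(base_url).netloc
    queue = deque([base_url])
    seen = {base_url}
    visited: List[str] = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # stand-in for "fetch, convert to markdown, index"
        for href in get_links(url):
            # Resolve relative links and drop #fragments to avoid duplicate visits.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc != base_host:
                continue  # external site: do not follow
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "site" used in place of real HTTP requests.
SITE = {
    "https://docs.example.com": ["/a", "/b", "https://other.example.org/x"],
    "https://docs.example.com/a": ["/b", "/c#install"],
    "https://docs.example.com/b": [],
    "https://docs.example.com/c": [],
}

pages = crawl("https://docs.example.com", lambda url: SITE.get(url, []))
print(pages)
```

The `seen` set is what keeps the crawl from looping when pages link to each other, and checking `len(visited)` before each fetch is what enforces the max page limit.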

Configuration Options

Base URL

The starting point for crawling. The crawler only indexes pages on the same domain that sit at or below the base URL's path.

Good: https://docs.example.com (crawls all pages under docs.example.com)
Good: https://example.com/help (crawls /help and its subdirectories)
Too specific: https://example.com/blog/post-1 (use https://example.com/blog instead)
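
To reason about which pages a given base URL covers, a scope check along these lines can be useful. It assumes that "in scope" means the same host plus a path at or below the base path, which is how the examples above read; the product's exact matching rules are not documented here, and `in_scope` is a hypothetical helper.

```python
from urllib.parse import urlparse


def in_scope(url: str, base_url: str) -> bool:
    """Return True if url is on the base URL's host and at or below its path."""
    base = urlparse(base_url)
    target = urlparse(url)
    if target.netloc != base.netloc:
        return False
    base_path = base.path.rstrip("/")
    # Require a "/" boundary so /helpdesk does not match a base of /help.
    return target.path == base_path or target.path.startswith(base_path + "/")


print(in_scope("https://example.com/help/billing", "https://example.com/help"))
print(in_scope("https://example.com/helpdesk", "https://example.com/help"))
```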

Max Pages

Limit the number of pages to crawl to control processing time and costs. Recommended limits:

  • Small sites (1-50 pages): No limit needed
  • Medium sites (50-500 pages): Set limit to 200-300
  • Large sites (500+ pages): Use URL lists for specific sections instead

Be Respectful of Target Sites

The crawler is rate-limited to avoid overloading target servers. Large sites may take 10-30 minutes to fully crawl. Consider using URL lists for specific important pages if you need faster results.

When to Use Website Scraping

  • Public documentation: Product docs, API references, help centers
  • Blog archives: Company blogs with helpful content
  • Knowledge bases: Public FAQs and support articles
  • Living content: Sites that update regularly (use automatic sync)

URL Lists

Index specific pages by providing a list of URLs to scrape without crawling entire sites.

How URL Lists Work

  • Provide a list of specific URLs (one per line)
  • Each URL is fetched independently
  • No link following or crawling—only the exact URLs you specify
  • Perfect for curating content from multiple sources

Example URL List

https://docs.example.com/getting-started
https://docs.example.com/api/authentication
https://docs.example.com/api/rate-limits
https://blog.example.com/best-practices
https://help.example.com/troubleshooting
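
Before pasting a long list, you may want to clean it up locally. The sketch below parses a one-URL-per-line list, skips blank lines, removes duplicates, and sets aside entries that are not http or https URLs. `parse_url_list` is a hypothetical helper, and the validation rules are assumptions rather than the product's own checks.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def parse_url_list(text: str) -> Tuple[List[str], List[str]]:
    """Return (valid_urls, rejected_lines) from a one-URL-per-line string."""
    urls: List[str] = []
    rejected: List[str] = []
    seen = set()
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # ignore blank lines
        parts = urlparse(candidate)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            rejected.append(candidate)
            continue
        if candidate not in seen:  # keep the first occurrence only
            seen.add(candidate)
            urls.append(candidate)
    return urls, rejected


raw = """
https://docs.example.com/getting-started
https://docs.example.com/api/authentication

https://docs.example.com/getting-started
docs.example.com/missing-scheme
"""
valid, bad = parse_url_list(raw)
print(valid, bad)
```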

Configuration Tips

Pro Tips

  • Use URL lists for high-value pages from large sites
  • Combine multiple sources for different sections of a site
  • URLs can come from different domains
  • Great for content from sites that are hard to crawl

When to Use URL Lists

  • Curated content: Specific important pages across multiple sites
  • Mixed sources: Combine pages from different documentation sites
  • Targeted indexing: Focus on high-value pages without crawling entire sites
  • External content: Industry articles, competitor docs (make sure you comply with each site's terms of service)

Manual Entry

Type or paste content directly into the knowledge base through the UI.

How Manual Entry Works

  1. Click "Add Source", then "Manual Entry"
  2. Create a new document with a title
  3. Type or paste content into the editor
  4. Save to trigger processing immediately

Content Format

Manual entries support both plain text and markdown:

Plain Text

Our refund policy:

Customers can request refunds within 30 days of purchase.
Refunds are processed within 5-7 business days.
Original payment method will be credited.

Markdown

# Refund Policy

## Eligibility
- Within 30 days of purchase
- Product must be unused
- Original packaging required

## Processing Time
Refunds are processed within **5-7 business days**.

## Payment
Original payment method will be credited.

Markdown is Recommended

Using markdown formatting (headings, lists, bold) helps the chunking algorithm preserve document structure and improves retrieval accuracy.
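
To see why headings help, consider a simple heading-aware splitter. This is a generic illustration of structure-preserving chunking, not the chunking algorithm the product uses (which is not described here). Each chunk keeps the trail of headings above it, so a retrieved passage still carries its context.

```python
import re
from typing import Dict, List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Dict[str, object]]:
    """Split markdown into chunks, each tagged with its heading trail."""
    chunks: List[Dict[str, object]] = []
    trail: List[Tuple[int, str]] = []  # stack of (level, title)
    body: List[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": [title for _, title in trail], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before adding this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Refund Policy

## Eligibility
- Within 30 days of purchase
- Product must be unused

## Processing Time
Refunds are processed within **5-7 business days**.
"""
for chunk in split_by_headings(doc):
    print(chunk["headings"], "->", chunk["text"][:40])
```

With plain text there are no headings to split on, so a splitter like this would return the whole entry as a single chunk with an empty heading trail.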

When to Use Manual Entry

  • Quick additions: Add small pieces of information immediately
  • FAQs: Create curated question-answer pairs
  • Policies: Type up short company policies or guidelines
  • Proprietary information: Content that doesn't exist elsewhere
  • Testing: Quickly test how content is chunked and retrieved

Automatic Syncing

Keep your knowledge base up to date by automatically re-scraping website and URL sources on a schedule.

Sync Frequencies

Frequency | Best For
Manual | Static content, one-time imports, file uploads
Daily | Frequently updated documentation, news content
Weekly | Product docs, help centers (most common)
Monthly | Policy documents, infrequently updated content
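
As a rough mental model, each frequency corresponds to an interval added to the last sync time. The sketch below assumes fixed intervals and treats "Monthly" as 30 days; the product's real scheduler may use calendar months or a different time of day. `next_sync` and `INTERVALS` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Assumed intervals; "manual" sources are never scheduled automatically.
INTERVALS: Dict[str, Optional[timedelta]] = {
    "manual": None,
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),  # assumption: a month is treated as 30 days
}


def next_sync(last_sync: datetime, frequency: str) -> Optional[datetime]:
    """Return the next scheduled sync time, or None for manual sources."""
    interval = INTERVALS[frequency.lower()]
    return None if interval is None else last_sync + interval


last = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
print(next_sync(last, "Weekly"))
print(next_sync(last, "Manual"))
```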

How Syncing Works

  1. At the scheduled time, the source is re-scraped
  2. New pages are added to the collection
  3. Updated pages replace old versions
  4. Deleted pages are removed from the collection
  5. All new content is automatically processed and indexed
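
The sync steps above amount to comparing the pages already in the collection with the pages found by the new scrape. Here is a minimal sketch of that comparison. It assumes changes are detected by hashing page content, which is one common approach and not necessarily how the product does it; `plan_sync` and `content_hash` are hypothetical names.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint of a page's extracted content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(indexed: Dict[str, str], scraped: Dict[str, str]) -> Dict[str, List[str]]:
    """Decide what to add, update, and delete.

    indexed maps URL -> hash of the content currently in the collection.
    scraped maps URL -> content returned by the new scrape.
    """
    added = sorted(url for url in scraped if url not in indexed)
    deleted = sorted(url for url in indexed if url not in scraped)
    updated = sorted(
        url for url in scraped
        if url in indexed and content_hash(scraped[url]) != indexed[url]
    )
    return {"added": added, "updated": updated, "deleted": deleted}


indexed = {
    "https://docs.example.com/a": content_hash("old text"),
    "https://docs.example.com/b": content_hash("unchanged"),
    "https://docs.example.com/gone": content_hash("removed page"),
}
scraped = {
    "https://docs.example.com/a": "new text",
    "https://docs.example.com/b": "unchanged",
    "https://docs.example.com/new": "brand new page",
}
plan = plan_sync(indexed, scraped)
print(plan)
```

Unchanged pages appear in none of the three lists, so only new and updated content needs to be reprocessed.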

Syncing is Non-Disruptive

Your chatbot continues to use existing content while syncing happens in the background. New content becomes available as soon as processing completes.

Manual Sync

You can trigger a sync manually at any time:

  1. Open the source in your collection
  2. Click the "Sync Now" button
  3. Processing begins immediately
  4. Track progress in the source status

Best Practices

Choosing the Right Source Type

  • Use website scraping for public documentation that updates regularly
  • Use URL lists when you need specific pages from large sites
  • Use file uploads for internal documents or content you control
  • Use manual entry for quick additions or content that doesn't exist elsewhere

Content Quality

  • Prefer markdown or well-structured HTML: Better chunking and retrieval
  • Avoid image-heavy PDFs: Text extraction quality varies
  • Remove navigation boilerplate: If possible, use markdown versions of pages
  • Check first scrape results: Verify content is being extracted correctly

Organization

  • Group related sources in the same collection
  • Use descriptive source names: "Product Docs - Getting Started" not "Source 1"
  • Set appropriate sync frequencies: Don't sync daily if content changes monthly
  • Monitor sync status: Check for errors after setting up new sources

Performance

  • Start small: Add 10-20 documents first, test retrieval, then scale up
  • Set max page limits for website sources to avoid long processing times
  • Use URL lists for large sites: More control over what gets indexed
  • Remove outdated content: Delete old documents that are no longer relevant

Respect Copyright and Terms of Service

Only scrape websites you have permission to use. Most public documentation is fine, but always check the site's terms of service before indexing external content.