Website Crawling
Crawl public websites and add their pages to your Content. The crawler visits pages, extracts their content as Markdown, and lets you review and edit each page before processing.
Getting There
Navigate to Content in the sidebar, then click the Website tab.
How to Crawl a Website
Step 1: Enter the URL
Paste the website URL in the input field (e.g., https://docs.example.com). The crawler will start from this page and follow links to discover more pages.
Step 2: Configure Options
| Option | Default | Description |
|---|---|---|
| Target Folder | No folder (root) | Select a destination folder for crawled documents |
| Single page only | Off | Crawl only the entered URL; don’t follow links |
| Respect robots.txt | On | Honor the site’s robots.txt crawling rules |
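To illustrate what honoring robots.txt means in general, the sketch below checks URLs against a set of rules with Python’s standard `urllib.robotparser`. It is not the crawler’s actual implementation: the robots.txt content is made up, and using `CuneiformBot` as the user-agent token is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "CuneiformBot" as the user-agent token is an assumption for illustration.
print(is_allowed(ROBOTS_TXT, "CuneiformBot", "https://docs.example.com/guide"))          # True
print(is_allowed(ROBOTS_TXT, "CuneiformBot", "https://docs.example.com/private/notes"))  # False
```

With Respect robots.txt turned on, a page like the second one would typically be left out of the crawl.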
Step 3: Start Crawling
Click Crawl to begin. The crawler visits pages starting from your URL and extracts their content.
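As a rough mental model of link-following, the sketch below walks a small in-memory link graph from a start page, stopping at a page limit and a depth limit. The breadth-first order, the `SITE` graph, and the parameter names are assumptions for illustration; the actual crawler may discover pages differently.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on that page.
SITE = {
    "/": ["/guide", "/api"],
    "/guide": ["/guide/install", "/"],
    "/api": ["/api/auth"],
    "/guide/install": [],
    "/api/auth": ["/api/auth/tokens"],
    "/api/auth/tokens": [],
}


def crawl(start, links, max_pages, max_depth, single_page=False):
    """Return pages in discovery order, breadth-first from start."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        # "Single page only" and the depth limit both stop link-following.
        if single_page or depth >= max_depth:
            continue
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            if len(visited) >= max_pages:
                return visited  # page limit reached
            seen.add(nxt)
            visited.append(nxt)
            queue.append((nxt, depth + 1))
    return visited


print(crawl("/", SITE, max_pages=10, max_depth=2))
# ['/', '/guide', '/api', '/guide/install', '/api/auth']
print(crawl("/", SITE, max_pages=10, max_depth=2, single_page=True))
# ['/']
```

In this sketch, `/api/auth/tokens` is never reached because it sits three links away from the start page, beyond the assumed depth of 2.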
Step 4: Monitor Progress
- A progress bar shows pages crawled vs. your plan’s maximum
- Pages appear in a live list as they are discovered
- Click any page to preview its content immediately
- Use the Cancel button to stop the crawl at any time
Only one crawl can run at a time per organization. Wait for the current crawl to complete or cancel it before starting another.
Reviewing Crawled Pages
After crawling completes, you can review and edit pages before adding them to your Content.
Edit Page Content
- Click any page in the list to open the editor
- Edit the title and markdown content
- Remove irrelevant sections (navigation menus, footers, sidebars)
- Click Save to update
Only pages with “Pending” status can be edited. Once processing starts, pages cannot be changed.
Select and Process Pages
- Use checkboxes to select pages you want to keep
- Use Select All to select all pending pages at once
- Optionally choose a target folder
- Click Process Selected
After processing starts, you’re automatically switched to the Documents tab where the new documents will appear.
Crawl Status Reference
| Status | Meaning |
|---|---|
| Pending | Crawl is queued to start |
| In Progress | Actively visiting pages |
| Completed | All reachable pages crawled |
| Partial Success | Some pages crawled, some failed |
| Failed | Crawl could not complete |
| Cancelled | Stopped by you |
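One way to picture how the final statuses relate to each other is the sketch below. The enum values mirror the table above, but the decision rules in `final_status` (for example, treating a crawl with zero successful pages as Failed) are assumptions for illustration, not the product’s documented logic.

```python
from enum import Enum


class CrawlStatus(Enum):
    PENDING = "Pending"
    IN_PROGRESS = "In Progress"
    COMPLETED = "Completed"
    PARTIAL_SUCCESS = "Partial Success"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


def final_status(succeeded: int, failed: int, cancelled: bool = False) -> CrawlStatus:
    """Map page counts to a final status (assumed rules, for illustration only)."""
    if cancelled:
        return CrawlStatus.CANCELLED
    if succeeded == 0:
        return CrawlStatus.FAILED  # assumption: nothing was crawled successfully
    if failed > 0:
        return CrawlStatus.PARTIAL_SUCCESS
    return CrawlStatus.COMPLETED


print(final_status(succeeded=12, failed=3).value)  # Partial Success
```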
Session Persistence
Your crawl session survives navigation and page refresh:
- Active crawls automatically restore when you return to the Website tab
- Completed crawls are available for review for up to 1 hour
Tips
- Review and edit crawled content before processing to improve quality
- Use “Single page only” for individual pages you want to add quickly
- Keep “Respect robots.txt” enabled to follow site owner preferences
- Crawled documents appear in the Documents tab as Markdown (.md) files
Common Questions
Q: How many pages can I crawl? A: The maximum number of pages depends on your plan. The crawler follows links from your starting URL up to a fixed depth. If you enable Single page only, it crawls only the entered URL and does not follow links.
Q: Can I crawl pages behind a login? A: No. The crawler can only access publicly available pages. Pages behind authentication will be skipped.
Q: What format are crawled pages? A: Content is extracted as Markdown (.md) documents and processed through the standard pipeline (chunking, embedding, and search indexing).
Q: Why did some pages fail? A: Pages can fail due to bot protection (firewalls like Cloudflare), server errors, or empty content. When some pages fail and others succeed, the crawl shows a “Partial Success” status.
Q: Can I re-crawl the same site? A: Yes. Start a new crawl with the same URL. Previously crawled pages are replaced when a new crawl begins.
Q: Where do processed pages appear? A: In the Documents tab of the Content page. They appear as Markdown (.md) files.
Troubleshooting
Pages Return 403 or 503 Errors
The site may be blocking automated crawlers. Options:
- Ask the site owner to whitelist CuneiformBot/1.0
- Try crawling a specific page with “Single page only” enabled
Empty or Garbled Content
Some sites use heavy JavaScript rendering. The crawler supports most modern frameworks, but:
- Single-page apps that require authentication won’t work
- Content loaded via infinite scroll may be partially captured
Crawl Seems Slow
- Each page takes a few seconds to crawl (includes rendering and rate limiting)
- Larger sites with many pages will take several minutes
- You can cancel and retry if a crawl appears stuck
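For a back-of-the-envelope estimate of crawl time, the arithmetic can be sketched as follows. The 3-second default is an assumed stand-in for “a few seconds” per page, and the sketch assumes pages are crawled one after another; actual timing will vary by site.

```python
def estimate_minutes(pages: int, seconds_per_page: float = 3.0) -> float:
    """Rough crawl duration in minutes, assuming pages are crawled sequentially."""
    return pages * seconds_per_page / 60


print(estimate_minutes(100))  # 5.0
```

Under these assumptions, a 100-page site takes about five minutes, so a progress bar that moves slowly on a large site is expected rather than a sign of a stuck crawl.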