Web Crawl Tool

Last updated: Jan 2026

Overview

The Web Crawl tool retrieves content from specific URLs or crawls multiple pages from a website by following links. Ideal for analyzing competitors, gathering product information, or monitoring website content.

Powered by Tavily

Web crawling is powered by Tavily's crawl API, which is optimized for AI applications. No API key configuration is required; the tool works out of the box.

How It Works

When you provide a URL in your prompt and enable the Web Crawl tool, the AI can visit that page and optionally follow links to gather more content.

1. Enable the Web Crawl Tool

Add the Web Crawl tool to your LLM step from the tool configuration panel.

2. Provide the Target URL

Include the URL you want to crawl in your prompt. Specify if you want to follow links or just analyze the single page.

3. AI Processes the Content

The AI receives the page content and uses it to complete your task.

Example Prompt
Visit https://example.com/products and extract information about
all their product offerings. Include prices, features, and availability.
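
Under the hood, this maps to a single crawl request. The sketch below shows a roughly equivalent direct call using Tavily's Python SDK (tavily-python). The API key and the response fields (`results`, `url`, `raw_content`) are assumptions based on Tavily's public SDK; none of this code is needed inside the workflow itself.

```python
# Minimal sketch of the crawl behind the example prompt.
# Assumes Tavily's Python SDK; the workflow tool needs no key or code.
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR-KEY")  # hypothetical key

response = client.crawl(
    "https://example.com/products",
    instructions="Extract product offerings: prices, features, availability.",
)

# Each result is one crawled page; raw_content is what the AI reads.
for page in response["results"]:
    print(page["url"], len(page["raw_content"]), "chars")
```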

Parameters

Configure these parameters in the tool settings panel to control how the crawler navigates and extracts content from websites.

URL Settings

| Parameter | Description | Default |
|---|---|---|
| url | Override the base URL to crawl. Leave empty to use the LLM-provided URL. | - |
| instructions | Instructions for the crawler (what to look for, focus areas). Helps guide which pages are most relevant. | - |
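
For example, `instructions` acts like a natural-language brief for the crawler. A minimal sketch, assuming these panel settings pass straight through to Tavily's crawl API:

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR-KEY")  # assumed direct SDK use

response = client.crawl(
    "https://example.com",  # `url` override: takes precedence over the LLM-provided URL
    instructions="Focus on pricing and plan-comparison pages.",
)
```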

Crawl Limits

Control how far and wide the crawler goes from the starting URL.

| Parameter | Description | Default |
|---|---|---|
| max_depth | Maximum depth to crawl from the starting URL (1-10). Depth 1 means only the starting page and its direct links. | 2 |
| max_breadth | Maximum number of pages per level (1-50). Controls how many links to follow at each depth level. | 10 |
| limit | Maximum total pages to crawl (1-100). Hard limit across all depth levels. | 10 |
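
A worked example of how the three limits interact: with max_depth=2 and max_breadth=10, the crawler could in principle reach the starting page, up to 10 direct links, and up to 100 second-level pages, but limit=10 caps the whole crawl at 10 pages. A sketch under the same pass-through assumption as above:

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR-KEY")  # assumed direct SDK use

response = client.crawl(
    "https://docs.example.com",
    max_depth=2,     # start page, its links, and their links
    max_breadth=10,  # follow at most 10 links per level
    limit=10,        # hard cap across all levels
)

# Depth 2 x breadth 10 could reach ~100 pages; `limit` stops at 10.
print(len(response["results"]))  # <= 10
```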

Path Filters

Include or exclude specific paths to focus the crawl on relevant content.

| Parameter | Description | Example |
|---|---|---|
| select_paths | Only crawl pages matching these paths | /docs, /api |
| exclude_paths | Skip pages matching these paths | /blog, /archive |

Domain Filters

Control which domains the crawler can access.

| Parameter | Description | Default |
|---|---|---|
| select_domains | Only crawl within these domains | - |
| exclude_domains | Exclude these domains from crawling | - |
| allow_external | Follow links to external domains | false |
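
Path and domain filters combine naturally. The sketch below keeps a crawl inside a docs section on one domain; it uses the plain path prefixes from the examples above and assumes, as before, that the panel settings map directly onto Tavily's crawl parameters:

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR-KEY")  # assumed direct SDK use

response = client.crawl(
    "https://example.com/docs",
    select_paths=["/docs", "/api"],       # only follow links under these paths
    exclude_paths=["/blog", "/archive"],  # skip these sections entirely
    select_domains=["example.com"],       # stay on this domain
    allow_external=False,                 # never follow off-site links
)
```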

Content Options

Control how content is extracted from crawled pages.

| Parameter | Description | Default |
|---|---|---|
| extract_depth | Depth of content extraction per page. "basic" extracts main content; "advanced" includes more detail. | basic |
| format | Output format: "markdown" preserves formatting, "text" returns plain text. | markdown |
| include_images | Include images from crawled pages | false |
| include_favicon | Include favicon URLs for crawled pages | false |
| categories | Filter pages by content categories | - |
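
For pages where "basic" extraction drops useful detail (tables, long lists), switching to "advanced" and keeping markdown output preserves more structure. A hedged sketch with the same pass-through assumption:

```python
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-YOUR-KEY")  # assumed direct SDK use

response = client.crawl(
    "https://example.com/docs",
    extract_depth="advanced",  # richer per-page extraction than "basic"
    format="markdown",         # keep headings, lists, and links
    include_images=True,       # also return image URLs from each page
)
```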

Efficient Crawling

Use path and domain filters to focus your crawl on relevant sections. This improves speed, reduces costs, and produces more relevant results.

Crawl Types

The Web Crawl tool supports two crawling modes depending on your needs.

Single Page Crawl

Retrieve content from a single URL. Fast and efficient for analyzing specific pages like a product page, blog post, or documentation page.

"Analyze the content at https://example.com/pricing"

Multi-Page Crawl

Follow links from a starting URL to gather content from multiple related pages. Useful for comprehensive site analysis or documentation gathering.

"Crawl the documentation at https://docs.example.com and summarize the API reference"

Rate Limiting

Web crawling is rate-limited to prevent abuse. For large-scale crawling needs, break your task into multiple workflow runs or focus on specific sections of a site.

Use Cases

Common scenarios where Web Crawl excels:

| Use Case | Description |
|---|---|
| Competitor Analysis | Analyze competitor websites for pricing, features, and positioning |
| Product Research | Gather product specifications, reviews, and availability from e-commerce sites |
| Documentation Synthesis | Crawl technical documentation to create summaries or answer questions |
| Content Monitoring | Track websites for news, pricing updates, or other content changes |
| Data Collection | Gather structured data from websites for analysis or reporting |

Best Practices

Follow these guidelines for effective web crawling:

  • Provide specific URLs when you know the exact pages you need
  • Limit crawl scope to relevant sections of a site
  • Use single page crawl for focused analysis
  • Specify what information you want to extract from the pages
  • Consider using Content Extract for cleaner results on single pages

When to Use Web Crawl vs. Other Tools

| Tool | Best For |
|---|---|
| Web Search | Finding pages on a topic when you don't have specific URLs |
| Web Crawl | Gathering content from known URLs, following links across a site |
| Content Extract | Getting clean, structured content from a single specific page |

Key Takeaways

  • Web Crawl visits URLs and optionally follows links
  • No configuration required - works automatically
  • Best for site analysis, competitor research, and documentation
  • Rate-limited to prevent abuse - scope crawls appropriately
  • Use Content Extract for cleaner single-page results