Web Scrape Tool¶
The Web Scrape tool fetches web pages and extracts content, links, or metadata using Cheerio. It enforces domain-level access controls and response size limits to prevent abuse.
Quick Reference¶
| Property | Value |
|---|---|
| Node name | tools/web-scrape |
| Version | 0.1.0 |
| Library | cheerio + native fetch |
| Actions | fetch, extract, extractLinks, extractMetadata |
| Tags | scrape, web, html, cheerio, tools, agentic |
Actions¶
fetch¶
Fetch a web page, strip non-content elements (scripts, styles, nav, footer, header), and return the cleaned body text.
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to fetch (must be http or https) |
| headers | Record\<string, string> | No | Custom HTTP headers |
| timeout | integer | No | Request timeout in ms (default: 30000) |
| userAgent | string | No | Custom User-Agent string |
Returns { content } -- the cleaned text content of the page body.
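For reference, a fetch input might look like the following. The field names come from the parameter table above; the URL, header, and User-Agent values are purely illustrative.

```typescript
// Illustrative input for the fetch action. Field names are from the
// parameter table above; the concrete values are made up.
const fetchInput = {
  action: 'fetch',
  url: 'https://example.com/blog/post-1',
  headers: { Accept: 'text/html' },
  timeout: 15_000,            // override the 30000 ms default
  userAgent: 'MyCrawler/1.0', // override the default User-Agent
};
```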
extract¶
Extract specific elements from a page using a CSS selector.
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to fetch |
| selector | string | Yes | CSS selector to match elements |
| headers | Record\<string, string> | No | Custom HTTP headers |
| timeout | integer | No | Request timeout in ms (default: 30000) |
| userAgent | string | No | Custom User-Agent string |
Returns { elements } -- an array of { text, html, attributes } for each matched element.
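In TypeScript terms, the result could be modeled as below. These interface names are hypothetical (the tool may not export such types); they simply mirror the shape described above.

```typescript
// Hypothetical shape of the extract result, following the description
// above; the tool itself may not export these types.
interface ExtractedElement {
  text: string;                        // text content of the element
  html: string;                        // inner HTML of the element
  attributes: Record<string, string>;  // attribute name → value map
}

interface ExtractResult {
  elements: ExtractedElement[];
}

// Example value for a selector like "h1" matching a single heading.
const result: ExtractResult = {
  elements: [
    { text: 'Getting Started', html: 'Getting Started', attributes: { id: 'top' } },
  ],
};
```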
extractLinks¶
Extract all links from a page (excluding navigation, footer, and other non-content areas).
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to fetch |
| headers | Record\<string, string> | No | Custom HTTP headers |
| timeout | integer | No | Request timeout in ms (default: 30000) |
Returns { links } -- an array of { text, url } objects with absolute URLs.
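Relative hrefs have to be resolved against the page URL to produce absolute URLs. A minimal sketch of that resolution, using the standard WHATWG URL constructor (an assumption about the implementation, not its actual code):

```typescript
// Resolve a (possibly relative) href against the page URL, producing
// the kind of absolute URL extractLinks returns.
function toAbsolute(href: string, pageUrl: string): string {
  return new URL(href, pageUrl).href;
}

toAbsolute('/guides/intro', 'https://docs.example.com/index.html');
// → 'https://docs.example.com/guides/intro'
```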
extractMetadata¶
Extract page metadata: title, meta description, Open Graph tags, and canonical URL.
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to fetch |
| headers | Record\<string, string> | No | Custom HTTP headers |
| timeout | integer | No | Request timeout in ms (default: 30000) |
Returns { title, metaDescription, ogTitle, ogDescription, ogImage, canonical }.
Output Schema¶
| Field | Type | Description |
|---|---|---|
| data | object | Action-specific result data |
| url | string | The URL that was fetched |
| success | boolean | true on success |
Configuration Reference¶
| Property | Type | Default | Description |
|---|---|---|---|
| allowedDomains | string[] | (none) | If set, only these domains (and their subdomains) can be fetched. |
| blockedDomains | string[] | [] | Domains that are always blocked, even if they match allowedDomains. |
| maxResponseSize | integer | 5000000 (5 MB) | Maximum response size in bytes. Responses larger than this are rejected. |
Safety¶
Domain restrictions¶
The tool supports both allowlisting and blocklisting of domains:
- allowedDomains: When configured, only URLs whose hostname matches (or is a subdomain of) an allowed domain are fetched; all other domains are rejected.
- blockedDomains: Always checked, even when allowedDomains is not set. A hostname matching a blocked domain is rejected immediately.
Domain matching supports subdomains: if example.com is in the allow list, then api.example.com and docs.example.com are also allowed.
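The subdomain rule above can be sketched as a simple hostname comparison. This is an illustration of the matching semantics, not the tool's actual implementation:

```typescript
// Subdomain-aware matching: an exact hostname match, or any subdomain
// of the configured domain, passes. The leading '.' in the suffix
// check prevents 'notexample.com' from matching 'example.com'.
function matchesDomain(hostname: string, domain: string): boolean {
  const h = hostname.toLowerCase();
  const d = domain.toLowerCase();
  return h === d || h.endsWith('.' + d);
}

matchesDomain('api.example.com', 'example.com');  // → true
matchesDomain('notexample.com', 'example.com');   // → false
```

Checking for the dot-prefixed suffix (rather than a bare `endsWith`) is what makes lookalike domains fail the match.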
Protocol validation¶
Only http: and https: protocols are permitted. Attempts to use file:, ftp:, data:, or other protocols are rejected with an explicit error.
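A minimal sketch of such a check, parsing the URL and allowing only the two permitted schemes (the error wording here is made up):

```typescript
// Allow only http: and https: URLs; reject everything else
// (file:, ftp:, data:, ...) with an explicit error.
function assertHttpProtocol(rawUrl: string): URL {
  const parsed = new URL(rawUrl);
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return parsed;
}
```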
Response size limits¶
Before reading the response body, the tool checks the Content-Length header against maxResponseSize. Responses exceeding the limit are rejected.
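A sketch of that pre-read check, under the assumption that a missing Content-Length header lets the request proceed (the actual internals may differ):

```typescript
// Compare Content-Length (when present) against maxResponseSize before
// reading the body. A missing header cannot be pre-checked, so the
// request proceeds and the body is read in full.
function withinSizeLimit(contentLength: string | null, maxResponseSize: number): boolean {
  if (contentLength === null) return true; // header absent: no pre-check possible
  return Number(contentLength) <= maxResponseSize;
}
```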
User-Agent¶
A default User-Agent (FlowForge-Bot/1.0) is sent with every request. This can be overridden via the userAgent input parameter.
Warning
The maxResponseSize check relies on the Content-Length header. If the server does not send this header, the response body is read in full. For untrusted URLs, combine this with allowedDomains to limit exposure.
Usage Example¶
```typescript
import { webScrapeNode } from '@flowforgejs/nodes';

const workflow = {
  nodes: [
    {
      id: 'scrape-article',
      node: webScrapeNode,
      config: {
        allowedDomains: ['example.com', 'docs.example.com'],
        blockedDomains: ['ads.example.com'],
        maxResponseSize: 2_000_000,
      },
      input: {
        action: 'fetch',
        url: 'https://docs.example.com/guides/getting-started',
        timeout: 15_000,
      },
    },
  ],
};
```
Tip
Use extractMetadata to quickly preview a page's title and description before deciding whether to perform a full fetch. This is useful for filtering search results in a pipeline.
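One way to sketch that filtering step, building on the extractMetadata result fields documented above. Whether missing fields come back as `undefined` or `null` is an assumption; adjust to the actual output:

```typescript
// Hypothetical relevance filter over an extractMetadata result.
// Field names follow the documented return value; optionality is assumed.
interface MetadataResult {
  title?: string;
  metaDescription?: string;
  ogTitle?: string;
  ogDescription?: string;
  ogImage?: string;
  canonical?: string;
}

// Decide whether a page is worth a full fetch based on its preview metadata.
function looksRelevant(meta: MetadataResult, keyword: string): boolean {
  const haystack = `${meta.title ?? ''} ${meta.metaDescription ?? ''}`.toLowerCase();
  return haystack.includes(keyword.toLowerCase());
}

looksRelevant({ title: 'Getting Started with FlowForge' }, 'flowforge'); // → true
```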