# Web Scrape Tool

The Web Scrape tool fetches web pages and extracts content, links, or metadata using Cheerio. It enforces domain-level access controls and response size limits to prevent abuse.

## Quick Reference

| Property | Value |
| --- | --- |
| Node name | `tools/web-scrape` |
| Version | 0.1.0 |
| Library | `cheerio` + native `fetch` |
| Actions | `fetch`, `extract`, `extractLinks`, `extractMetadata` |
| Tags | scrape, web, html, cheerio, tools, agentic |

## Actions

### fetch

Fetch a web page, strip non-content elements (scripts, styles, nav, footer, header), and return the cleaned body text.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | Yes | URL to fetch (must be http or https) |
| `headers` | `Record<string, string>` | No | Custom HTTP headers |
| `timeout` | integer | No | Request timeout in ms (default: 30000) |
| `userAgent` | string | No | Custom User-Agent string |

Returns { content } -- the cleaned text content of the page body.

### extract

Extract specific elements from a page using a CSS selector.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | Yes | URL to fetch |
| `selector` | string | Yes | CSS selector to match elements |
| `headers` | `Record<string, string>` | No | Custom HTTP headers |
| `timeout` | integer | No | Request timeout in ms (default: 30000) |
| `userAgent` | string | No | Custom User-Agent string |

Returns { elements } -- an array of { text, html, attributes } for each matched element.

### extractLinks

Extract all links from a page (excluding navigation, footer, and other non-content areas).

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | Yes | URL to fetch |
| `headers` | `Record<string, string>` | No | Custom HTTP headers |
| `timeout` | integer | No | Request timeout in ms (default: 30000) |

Returns { links } -- an array of { text, url } objects with absolute URLs.
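The conversion to absolute URLs can be sketched with the WHATWG `URL` constructor, which resolves a relative href against the page's URL. `toAbsoluteUrl` is a hypothetical helper for illustration, not part of the tool's API:

```typescript
// Resolve an href the way extractLinks returns it: relative links
// become absolute against the page URL. toAbsoluteUrl is hypothetical.
function toAbsoluteUrl(href: string, pageUrl: string): string | null {
  try {
    return new URL(href, pageUrl).href;
  } catch {
    return null; // unparseable href: skip this link
  }
}

// A root-relative href resolves against the page's origin:
// toAbsoluteUrl('/guides/intro', 'https://docs.example.com/guides/getting-started')
//   => 'https://docs.example.com/guides/intro'
```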

### extractMetadata

Extract page metadata: title, meta description, Open Graph tags, and canonical URL.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | Yes | URL to fetch |
| `headers` | `Record<string, string>` | No | Custom HTTP headers |
| `timeout` | integer | No | Request timeout in ms (default: 30000) |

Returns { title, metaDescription, ogTitle, ogDescription, ogImage, canonical }.

## Output Schema

| Field | Type | Description |
| --- | --- | --- |
| `data` | object | Action-specific result data |
| `url` | string | The URL that was fetched |
| `success` | boolean | `true` on success |

## Configuration Reference

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `allowedDomains` | string[] | (none) | If set, only these domains (and their subdomains) can be fetched. |
| `blockedDomains` | string[] | `[]` | Domains that are always blocked, even if they match `allowedDomains`. |
| `maxResponseSize` | integer | 5000000 (5 MB) | Maximum response size in bytes. Responses larger than this are rejected. |

## Safety

### Domain restrictions

The tool supports both allowlisting and blocklisting of domains:

- `allowedDomains`: When configured, only URLs whose hostname matches (or is a subdomain of) an allowed domain are fetched. All other domains are rejected.
- `blockedDomains`: Always checked, even when `allowedDomains` is not set. A hostname matching a blocked domain is rejected immediately.

Domain matching supports subdomains: if `example.com` is in the allow list, then `api.example.com` and `docs.example.com` are also allowed.
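The allow/block decision described above can be sketched as follows. This is a minimal sketch; `matchesDomain` and `isHostPermitted` are hypothetical names, not the tool's actual internals:

```typescript
// A hostname matches a domain when it is that domain exactly or any
// subdomain of it. The '.' prefix prevents lookalike hosts such as
// "notexample.com" from matching "example.com".
function matchesDomain(hostname: string, domain: string): boolean {
  return hostname === domain || hostname.endsWith('.' + domain);
}

// Hypothetical sketch of the access decision: the block list is checked
// first and always wins; with no allow list, any non-blocked host passes.
function isHostPermitted(
  hostname: string,
  allowedDomains?: string[],
  blockedDomains: string[] = [],
): boolean {
  if (blockedDomains.some((d) => matchesDomain(hostname, d))) return false;
  if (!allowedDomains) return true;
  return allowedDomains.some((d) => matchesDomain(hostname, d));
}
```

Checking `endsWith('.' + domain)` rather than a bare `endsWith(domain)` is what makes subdomain matching safe against suffix lookalikes.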

### Protocol validation

Only `http:` and `https:` protocols are permitted. Attempts to use `file:`, `ftp:`, `data:`, or other protocols are rejected with an explicit error.
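A sketch of this check using the WHATWG `URL` parser (the function name is hypothetical, shown only to illustrate the rule):

```typescript
// Parse the URL and reject anything that is not http(s).
// validateProtocol is a hypothetical name for illustration.
function validateProtocol(rawUrl: string): URL {
  const url = new URL(rawUrl); // throws on malformed input
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${url.protocol}`);
  }
  return url;
}
```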

### Response size limits

Before reading the response body, the tool checks the `Content-Length` header against `maxResponseSize`. Responses exceeding the limit are rejected.
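The pre-read check can be sketched like this. `exceedsSizeLimit` is a hypothetical helper; it accepts anything with a `Headers`-like `get` method so the logic stands on its own:

```typescript
// Check Content-Length against the configured limit before reading the
// body. Returns false when the header is absent: in that case the size
// cannot be verified up front. Hypothetical helper for illustration.
function exceedsSizeLimit(
  headers: { get(name: string): string | null },
  maxResponseSize: number,
): boolean {
  const contentLength = headers.get('content-length');
  if (contentLength === null) return false;
  return Number(contentLength) > maxResponseSize;
}
```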

### User-Agent

A default User-Agent (`FlowForge-Bot/1.0`) is sent with every request. This can be overridden via the `userAgent` input parameter.

> **Warning**
>
> The `maxResponseSize` check relies on the `Content-Length` header. If the server does not send this header, the response body is read in full. For untrusted URLs, combine this with `allowedDomains` to limit exposure.

## Usage Example

```ts
import { webScrapeNode } from '@flowforgejs/nodes';

const workflow = {
  nodes: [
    {
      id: 'scrape-article',
      node: webScrapeNode,
      config: {
        allowedDomains: ['example.com', 'docs.example.com'],
        blockedDomains: ['ads.example.com'],
        maxResponseSize: 2_000_000,
      },
      input: {
        action: 'fetch',
        url: 'https://docs.example.com/guides/getting-started',
        timeout: 15_000,
      },
    },
  ],
};
```

> **Tip**
>
> Use `extractMetadata` to quickly preview a page's title and description before deciding whether to perform a full `fetch`. This is useful for filtering search results in a pipeline.