Crawlers
⚠️ You must always use website content with the explicit permission of the content owner.
Crawlers are automated content extraction tools designed to retrieve, clean, and index web-based data from public or private pages. They allow your AI/ML workflows to dynamically ingest information from external websites and internal knowledge portals without requiring API access.
Unlike standard data connectors, Crawlers are built to scan web pages, structure the raw content, and make it usable in context-aware workflows. They are ideal for knowledge retrieval, competitive intelligence, dynamic indexing, or any use case that requires up-to-date information from the web.
Crawlers can be used with the INTELLITHING Router Engine or within manually created workflows for more control over behavior and flow.
Key Concepts
Crawlers are specialized integration blocks that extract data directly from HTML pages and convert it into usable content within AI-powered workflows. These blocks simulate human-like reading by crawling and parsing content, then indexing it into a format optimized for retrieval and synthesis.
They are best suited for scenarios where API access is not available or when real-time webpage content is needed as part of an AI pipeline.
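As an illustration of the underlying idea only (not the platform's own implementation), the sketch below fetches a page and keeps just the readable text. It uses requests and BeautifulSoup as assumed stand-ins for what a crawler block does automatically; no code is needed to use the block itself.

```python
import requests
from bs4 import BeautifulSoup

def fetch_readable_text(url: str) -> str:
    # Fetch the page and parse the HTML.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Drop non-content markup (scripts, styles, navigation, footers).
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    # Collapse the remaining text into clean, readable lines.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)

print(fetch_readable_text("https://example.com")[:500])
```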
How Crawlers Work
- Targeted Retrieval: Crawlers are configured with one or more URLs. The system visits each URL, reads the page content, and extracts readable text while ignoring irrelevant markup.
- Automatic Structuring: Extracted content is cleaned and optionally synthesized into knowledge chunks (nodes) that can be used in workflows.
- Bridge Support: Crawlers can be linked to downstream blocks (e.g., Slack, SQL, LLMs) using bridges, enabling end-to-end workflows like crawl → analyze → respond.
- Modular Workflow Nodes: Like other integrations, Crawlers follow a standard workflow pattern (see the sketch after this list):
  - Ingest Configuration – Sets up the crawl job with required parameters.
  - Retrieve Nodes – Performs the crawl and extracts data.
  - Rerank Nodes – Sorts or prioritizes crawled content (optional).
  - Synthesize Nodes – Converts crawled data into summaries or usable insights.
- Workflow-Ready Output: The result of a crawler can be passed to LLMs, used in dashboards, or stored for further indexing.
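To show how these nodes hand data to one another, here is a minimal, purely illustrative sketch in plain Python. The function names mirror the node names above, but they are hypothetical stand-ins: the real nodes are configured in the workflow editor, not written as code.

```python
# Illustrative only: plain-Python stand-ins for the node pattern above.

def retrieve(urls: list[str]) -> list[str]:
    """Retrieve Nodes: crawl each URL and return cleaned text chunks."""
    return [f"<cleaned text from {url}>" for url in urls]  # placeholder crawl

def rerank(chunks: list[str], query: str) -> list[str]:
    """Rerank Nodes (optional): put chunks matching the query first."""
    return sorted(chunks, key=lambda chunk: query.lower() in chunk.lower(), reverse=True)

def synthesize(chunks: list[str]) -> str:
    """Synthesize Nodes: condense the top chunks into a usable answer."""
    return "Summary of: " + "; ".join(chunks[:3])

# crawl -> analyze -> respond
chunks = retrieve(["https://example.com/docs", "https://example.com/changelog"])
answer = synthesize(rerank(chunks, query="latest release notes"))
print(answer)
```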
Why Use Crawlers
- Extract data from public websites, internal pages, documentation portals, and more.
- Avoid dependency on APIs for websites that don't provide one.
- Enable continuous or on-demand knowledge updates from evolving content.
- Enhance AI workflows with real-time data pulled directly from web pages.
- Ground answers and decisions in live, visible, traceable information.
⚙️ Configuration
- Open the block editor and drop a module from the Crawler section.
- Click on the block to access the configuration panel.
Typical Fields:
- Crawl URL(s): One or more URLs to be scraped.
- Description: Brief explanation of the crawl purpose (used by the router).
- Output Filters / Filetypes: (Optional) Specify what content or sections to include/exclude.
⚠️ Each crawler block requires its own configuration. Refer to the Web Crawler Documentation.
- Save the block.
- Head to the workflow editor to view or customize the default workflow.
- Connect it to additional blocks (e.g., Slack, SQL, LLMs) using bridges, or run it standalone to index and review data.
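For orientation only, the sketch below gathers the typical fields into a single structure. The keys (crawl_urls, description, output_filters) are hypothetical labels chosen for illustration; the actual values are entered in the block's configuration panel rather than in code.

```python
# Hypothetical example of the values entered in a crawler block's
# configuration panel; key names are illustrative, not a file format.
crawler_config = {
    "crawl_urls": [
        "https://docs.example.com/product",
        "https://docs.example.com/changelog",
    ],
    # Used by the INTELLITHING Router Engine to decide when a request
    # should be routed to this crawler.
    "description": "Product documentation and release notes",
    # Optional: restrict what content or sections are indexed.
    "output_filters": {
        "include_sections": ["main"],
        "exclude_filetypes": [".pdf", ".zip"],
    },
}
```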