Crawler
Key Concepts
The Web Crawler, formally known as the Web Page Indexer block, is used for:
- Scraping data from one or more web pages.
- Cleaning and structuring raw web content.
- Indexing the extracted data for use in custom workflows.
This module is ideal for websites that are not heavily reliant on JavaScript. Each URL to be crawled must be defined manually. Once configured, the crawler integrates with the INTELLITHING workflow system to enable data-driven operations and automation.
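Conceptually, the block performs a fetch, clean, and index pass over each configured URL. The sketch below illustrates that pass using common Python libraries (requests and BeautifulSoup); the function and variable names are illustrative only and are not part of the INTELLITHING block itself.

```python
# Conceptual sketch only: fetch a page, strip it to plain text, and store it
# in a simple in-memory index keyed by URL. The real Web Page Indexer block
# handles this internally; none of these names come from INTELLITHING.
import requests
from bs4 import BeautifulSoup

def crawl_and_index(urls: list[str]) -> dict[str, str]:
    index: dict[str, str] = {}
    for url in urls:
        response = requests.get(url, timeout=10)   # fetch the raw HTML
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Clean: drop markup and collapse whitespace into readable text.
        index[url] = soup.get_text(separator=" ", strip=True)
    return index

index = crawl_and_index(["https://example.com"])
```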
Key Definitions
| Term | Definition |
| --- | --- |
| Web Page Indexer | A block in INTELLITHING that collects, cleans, and indexes content from specified URLs. |
| Crawl URL | The address of the web page to be crawled. |
| Description | A label that helps the router engine understand the page's purpose and serves as documentation. |
| Workflow Node | A discrete step in a data processing pipeline (e.g., synthesize, rerank, retrieve). |
| Bridge | A configuration that connects blocks or nodes, allowing custom flow of data between them. |
Setup Guide: Configuring the Web Crawler Block
1. Crawl URL
- Enter the URL of the website or web page you wish to crawl.
Example: https://example.com
- Click Add to Crawl URLs to add the URL to the crawling list.
2. Description
- Provide a brief description of the purpose of this URL connection.
Example: Crawled data from the product page. This can be used for answering queries related to products.
- This description is helpful for the router engine. It becomes optional when using a fully custom workflow, but it's still recommended for maintainability.
3. Save
- Save your configuration and return to the main workflow editor.
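The information captured in these steps can be thought of as one small record per URL. The snippet below is a hypothetical Python representation of that configuration; the block is actually configured through the editor UI, and the field names shown here are assumptions for illustration only.

```python
# Hypothetical shape of a single Web Crawler block configuration.
# Field names are illustrative; the block is configured through the UI.
crawler_config = {
    "crawl_urls": [
        {
            "url": "https://example.com",
            "description": (
                "Crawled data from the product page. This can be used for "
                "answering queries related to products."
            ),
        },
    ],
}
```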
Workflow Nodes
The Web Crawler module processes data through several predefined workflow nodes:
crawler_synthesize (First Node)
- Synthesizes or transforms the raw content from crawled web pages into structured summaries or insights.
- Useful for generating a coherent view of the crawled data.
crawler_rerank_nodes
- Reranks the crawled content based on relevance to the current workflow’s goal or task.
- Ensures that the most important data is prioritized for downstream processing.
crawler_retrieve_nodes (Last Node)
- Retrieves the raw crawled data from the specified URLs.
- Makes it available for use in subsequent blocks or applications.
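The sketch below mirrors how these three nodes hand data to one another, in the order listed above. The node implementations are internal to INTELLITHING; the placeholder functions and their signatures are assumptions used only to illustrate the data flow.

```python
# Illustrative placeholders that mirror the documented node order and roles.
# The real nodes are internal to INTELLITHING and are not exposed as Python
# functions; these exist only to show how data moves through the pipeline.
def crawler_synthesize(raw_pages: dict[str, str]) -> list[str]:
    """First node: turn raw page content into structured summaries."""
    return [f"Summary of {url}: {text[:200]}" for url, text in raw_pages.items()]

def crawler_rerank_nodes(summaries: list[str], task: str) -> list[str]:
    """Middle node: order the summaries by relevance to the task."""
    return sorted(summaries, key=lambda s: task.lower() in s.lower(), reverse=True)

def crawler_retrieve_nodes(ranked: list[str], top_k: int = 3) -> list[str]:
    """Last node: hand the top-ranked content to the next block."""
    return ranked[:top_k]

# Example of chaining the stages on previously crawled text.
ranked = crawler_rerank_nodes(
    crawler_synthesize({"https://example.com/products": "Product A ..."}),
    task="product",
)
top_results = crawler_retrieve_nodes(ranked)
```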
Best Practices for Effective Crawling
- Validate URLs – Make sure URLs are accessible and return the correct content (see the sketch after this list).
- Use Descriptions – Always add meaningful descriptions to help with routing and future maintenance.
- Avoid Duplicates – Prevent redundant crawling by checking for duplicate URLs.
- Avoid JavaScript-heavy Sites – Pages that rely heavily on JavaScript may not be parsed correctly by the default crawler.
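A quick pre-flight check can catch unreachable or non-HTML URLs before they are added to the block. The snippet below is a standalone convenience script using the requests library; it is not part of the Web Crawler block, and the URLs shown are placeholders.

```python
# Quick pre-flight check for crawl URLs, assuming the standard requests
# library; this is a convenience script, not part of the Web Crawler block.
import requests

def validate_url(url: str) -> bool:
    """Return True if the URL is reachable and serves HTML content."""
    try:
        response = requests.get(url, timeout=10, allow_redirects=True)
    except requests.RequestException:
        return False
    content_type = response.headers.get("Content-Type", "")
    return response.ok and "text/html" in content_type

for url in ["https://example.com", "https://example.com/products"]:
    print(url, "OK" if validate_url(url) else "unreachable or not HTML")
```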
Example Use Case
To index product listings for querying inside a chatbot (e.g., via Slack):
- Crawl https://example.com/products.
- Use crawler_synthesize to convert raw product listings into structured data.
- Rerank the listings with crawler_rerank_nodes based on relevance.
- Retrieve the result using crawler_retrieve_nodes and pass it into a chat interface or report generator.
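As a rough illustration of the final step, the snippet below forwards a retrieved snippet to a Slack channel using the slack_sdk package. The sample data, channel name, and token handling are assumptions; how INTELLITHING actually bridges the retrieved result to the chat interface depends on your workflow configuration.

```python
# Hypothetical wiring for the Slack scenario: assume the workflow has already
# produced ranked product snippets (e.g., from crawler_retrieve_nodes) and we
# forward the top result to a Slack channel. Requires the slack_sdk package
# and a bot token; channel name and sample data are illustrative.
import os
from slack_sdk import WebClient

retrieved_snippets = [
    "Product A - lightweight widget, $19.99",
    "Product B - heavy-duty widget, $49.99",
]

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
client.chat_postMessage(
    channel="#product-questions",          # illustrative channel name
    text="Top match: " + retrieved_snippets[0],
)
```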