
Crawler

Key Concepts

The Web Crawler, formally known as the Web Page Indexer block, is used for:

  • Scraping data from one or more web pages.
  • Cleaning and structuring raw web content.
  • Indexing the extracted data for use in custom workflows.

This module is best suited to websites that do not rely heavily on JavaScript. It works from a manually defined list of URLs: you specify each page to be crawled. Once configured, the crawler integrates with the INTELLITHING workflow system to enable data-driven operations and automation.
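As a rough illustration of what scraping and cleaning a static page involves, the sketch below fetches a page and extracts its visible text using Python's requests and BeautifulSoup. It is an illustrative sketch only, not the INTELLITHING implementation, and the URL is a placeholder; the crawler block itself requires no code.

```python
# Illustrative only: a minimal sketch of static-page scraping, not the
# INTELLITHING crawler. Assumes the target page renders its content as
# plain HTML without JavaScript, which is the case this block handles well.
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url: str) -> str:
    """Download a page and return its cleaned, visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script/style tags so only human-readable content remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Keep non-empty lines of text only.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)

if __name__ == "__main__":
    print(fetch_page_text("https://example.com")[:500])  # placeholder URL
```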

Key Definitions

  • Web Page Indexer – A block in INTELLITHING that collects, cleans, and indexes content from specified URLs.
  • Crawl URL – The address of a web page to be crawled.
  • Description – A label that helps the router engine understand the page's purpose and serves as documentation.
  • Workflow Node – A discrete step in a data processing pipeline (e.g., synthesize, rerank, retrieve).
  • Bridge – A configuration that connects blocks or nodes, allowing custom data flow between them.

Setup Guide: Configuring the Web Crawler Block

1. Crawl URL

  • Enter the URL of the website or web page you wish to crawl.
    Example: https://example.com
  • Click Add to Crawl URLs to add the URL to the crawling list.

2. Description

  • Provide a brief description of the purpose of this crawl URL.
    Example:
    Crawled data from the product page. This can be used for answering queries related to products.

  • This description is helpful for the router engine. It becomes optional when using a fully custom workflow, but it's still recommended for maintainability.

3. Save

  • Save your configuration and return to the main workflow editor.
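Taken together, steps 1–3 capture a list of crawl URLs, each paired with a description. The snippet below shows one hypothetical way to represent that configuration in Python; the field names are illustrative assumptions, not the actual INTELLITHING schema, and you do not need to write any of this to use the block.

```python
# Hypothetical representation of a Web Page Indexer configuration.
# Field names (crawl_urls, url, description) are illustrative assumptions,
# not the actual INTELLITHING schema.
crawler_config = {
    "block": "Web Page Indexer",
    "crawl_urls": [
        {
            "url": "https://example.com",
            "description": (
                "Crawled data from the product page. This can be used for "
                "answering queries related to products."
            ),
        },
    ],
}

# Adding another page mirrors the "Add to Crawl URLs" action in the UI.
crawler_config["crawl_urls"].append(
    {"url": "https://example.com/products", "description": "Product listings."}
)
```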

Workflow Nodes

The Web Crawler module processes data through several predefined workflow nodes (a sketch of how they fit together follows the node descriptions):

crawler_synthesize (First Node)

  • Synthesizes or transforms the raw content from crawled web pages into structured summaries or insights.
  • Useful for generating a coherent view of the crawled data.

crawler_rerank_nodes

  • Reranks the crawled content based on relevance to the current workflow’s goal or task.
  • Ensures that the most important data is prioritized for downstream processing.

crawler_retrieve_nodes (Last Node)

  • Retrieves the raw crawled data from the specified URLs.
  • Makes it available for use in subsequent blocks or applications.
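The three nodes above form a simple pipeline. The sketch below is a hypothetical illustration of how data might flow through them in the documented order; the function bodies are simple stand-ins (truncation and keyword overlap), not the actual node logic, and the names mirror the nodes only for readability.

```python
# Hypothetical sketch of the documented node order. These functions are
# illustrative stand-ins, not the INTELLITHING node implementations.

def crawler_synthesize(raw_pages: dict[str, str]) -> list[dict]:
    """First node: turn raw page text into structured summaries."""
    return [
        {"url": url, "summary": text[:300]}  # stand-in for real summarization
        for url, text in raw_pages.items()
    ]

def crawler_rerank_nodes(summaries: list[dict], query: str) -> list[dict]:
    """Rerank summaries by relevance to the task (keyword overlap here)."""
    terms = set(query.lower().split())

    def score(item: dict) -> int:
        return len(terms & set(item["summary"].lower().split()))

    return sorted(summaries, key=score, reverse=True)

def crawler_retrieve_nodes(
    ranked: list[dict], raw_pages: dict[str, str], top_k: int = 3
) -> list[dict]:
    """Last node: return the raw crawled data for the top-ranked pages."""
    return [
        {"url": item["url"], "content": raw_pages[item["url"]]}
        for item in ranked[:top_k]
    ]

# Chaining the nodes in the documented order:
raw_pages = {"https://example.com": "Example Domain. For use in examples."}
ranked = crawler_rerank_nodes(crawler_synthesize(raw_pages), query="product pricing")
results = crawler_retrieve_nodes(ranked, raw_pages)
```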

Best Practices for Effective Crawling

  • Validate URLs – Make sure each URL is accessible and returns the expected content (see the sketch after this list).
  • Use Descriptions – Always add meaningful descriptions to help with routing and future maintenance.
  • Avoid Duplicates – Prevent redundant crawling by checking for duplicate URLs before adding them (also covered in the sketch).
  • Avoid JavaScript-heavy Sites – Pages that rely heavily on JavaScript may not be parsed correctly by the default crawler.
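As referenced above, a quick pre-check can catch inaccessible or duplicate URLs before they are added to the crawl list. The sketch below is a convenience script run outside INTELLITHING, using Python's requests for a simple reachability check; the example URLs are placeholders.

```python
# Convenience pre-check run outside INTELLITHING: verify URLs respond and
# drop duplicates before adding them to the crawl list.
import requests

def validate_urls(urls: list[str]) -> list[str]:
    """Return unique URLs that respond with a successful status code."""
    seen: set[str] = set()
    valid: list[str] = []
    for url in urls:
        normalized = url.rstrip("/")
        if normalized in seen:
            continue  # skip duplicates
        seen.add(normalized)
        try:
            response = requests.head(url, timeout=10, allow_redirects=True)
            if response.ok:
                valid.append(url)
        except requests.RequestException:
            pass  # unreachable URL; leave it out of the crawl list
    return valid

print(validate_urls([
    "https://example.com",            # placeholder URLs
    "https://example.com/",           # duplicate after normalization
    "https://example.com/products",
]))
```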

Example Use Case

To index product listings for querying inside a chatbot (e.g., via Slack), follow the steps below; a sketch of the final hand-off step appears after the list:

  1. Crawl https://example.com/products.
  2. Use crawler_synthesize to convert raw product listings into structured data.
  3. Rerank the listings with crawler_rerank_nodes based on relevance.
  4. Retrieve the result using crawler_retrieve_nodes and pass it into a chat interface or report generator.
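The final hand-off, passing the retrieved result into a chat interface, could be as simple as posting to a Slack incoming webhook. The sketch below assumes you already have the retrieved product data and a webhook URL (both placeholders here); it is one possible hand-off for illustration, not part of the crawler block itself.

```python
# Hypothetical hand-off of retrieved product data to Slack via an incoming
# webhook. The webhook URL and the `results` structure are placeholders.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def post_results_to_slack(results: list[dict]) -> None:
    """Format retrieved pages into a short message and post it to Slack."""
    lines = [f"- {item['url']}: {item['content'][:120]}" for item in results]
    payload = {"text": "Top product pages:\n" + "\n".join(lines)}
    response = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
    response.raise_for_status()

post_results_to_slack([
    {"url": "https://example.com/products", "content": "Example product listing..."},
])
```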