Web Scraper Agent Tool

πŸ”‘ Key Concepts

The Web Scraper Agent Tool enables INTELLITHING agents to fetch, index, and query content from external web pages. It’s a versatile tool for gathering insights from news sites, documentation portals, FAQs, blog posts, and more β€” all without manual data entry.

Use cases include:

  • Asking questions about external documentation (e.g., API docs)
  • Monitoring and summarizing public-facing announcements
  • Powering LLM agents with fresh web knowledge

πŸ“˜ Key Definitions

| Term | Description |
| --- | --- |
| Web Crawler URLs | List of web pages to scrape content from (must be public and reachable). |
| SimpleWebPageReader | A LlamaIndex reader that fetches HTML pages and extracts raw text for indexing. |
| QueryEngineTool | A callable component that lets agents perform semantic search over scraped content. |
| Agent Tool | A module callable by agents when queries match its intent or description. |

βš™οΈ Setup Guide: Using the Web Scraper Agent

1. Input Your Target URLs

  • Provide a list of full URLs pointing to the pages you want to scrape.
  • Each URL should be publicly accessible and contain meaningful text.
  • You can include multiple URLs.

  • Example: ["https://docs.llamaindex.ai", "https://openai.com/blog/api-updates"]
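Since each URL must be a full, public address, a quick sanity check before saving the list can catch malformed entries early. A minimal sketch (the `looks_valid` helper is illustrative, not part of INTELLITHING):

```python
from urllib.parse import urlparse

def looks_valid(url: str) -> bool:
    """Cheap sanity check: a scrape target should be a full http(s) URL."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

urls = ["https://docs.llamaindex.ai", "https://openai.com/blog/api-updates"]
print(all(looks_valid(u) for u in urls))  # prints: True
```

Note this only checks the URL's shape; whether the page is actually reachable can only be verified by fetching it.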

2. Configure Tool Parameters

| Field | Purpose | Example |
| --- | --- | --- |
| name | Unique name for the tool instance | "Llama Docs Scraper" |
| description | Used by the agent router to match questions to the tool | "Answers questions from LlamaIndex documentation" |
| web_crawler_urls | List of URLs to scrape | ["https://docs.llamaindex.ai"] |
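Expressed in code, the three parameters might be grouped like this (the dict shape is illustrative; INTELLITHING's UI collects the same fields):

```python
# The three tool parameters from the table above, as a plain dict.
tool_config = {
    "name": "Llama Docs Scraper",
    "description": "Answers questions from LlamaIndex documentation",
    "web_crawler_urls": ["https://docs.llamaindex.ai"],
}

# The description is what the agent router matches queries against,
# so it should name the tool's domain explicitly.
print("LlamaIndex" in tool_config["description"])  # prints: True
```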

πŸ”„ How It Works

  1. The tool uses SimpleWebPageReader to fetch the HTML content of the specified pages.
  2. Extracted text is indexed into a VectorStoreIndex.
  3. The resulting QueryEngineTool enables semantic Q&A over the indexed page content.
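The three steps above can be sketched with standard-library stand-ins. In the real tool, SimpleWebPageReader and VectorStoreIndex (LlamaIndex) do this work; here `html.parser` and naive word-overlap scoring merely illustrate the fetch, extract, index, and query flow:

```python
# Toy stand-in for the fetch -> extract -> index -> query pipeline.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Steps 1-2 (simplified): strip tags and collect a page's visible text."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> list[str]:
    parser = TextExtractor()
    parser.feed(html)
    return parser.chunks


def query(chunks: list[str], question: str) -> str:
    """Step 3 (simplified): return the chunk sharing the most words with
    the question -- a crude stand-in for semantic vector search."""
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))


page = "<html><body><h1>Chunking</h1><p>Documents are split into chunks.</p></body></html>"
chunks = extract_text(page)
print(query(chunks, "How are documents split?"))  # prints: Documents are split into chunks.
```

The production pipeline replaces the word-overlap scoring with embedding-based retrieval over a VectorStoreIndex, but the shape of the flow is the same.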

πŸ”€ Agent Integration

To make this tool usable within an LLM-powered agent, define the following in your agent's JSON config:

```json
{
  "agent": "scraper",
  "targets": {
    "web_crawler_urls": [
      "https://docs.llamaindex.ai",
      "https://openai.com/blog/updates"
    ]
  }
}
```
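A config like this can be validated before handing it to the agent; the field names below follow the example above, not a published schema:

```python
# Hypothetical pre-flight check that an agent config carries the fields
# this tool needs.
import json

raw = """
{
  "agent": "scraper",
  "targets": {
    "web_crawler_urls": [
      "https://docs.llamaindex.ai",
      "https://openai.com/blog/updates"
    ]
  }
}
"""

config = json.loads(raw)
print(config["agent"])  # prints: scraper
print(all(u.startswith("https://") for u in config["targets"]["web_crawler_urls"]))  # prints: True
```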

This enables the agent to respond to queries like:

  • β€œWhat’s the latest update from OpenAI?”
  • β€œHow does LlamaIndex handle document chunking?”
  • β€œWhat’s the embedding strategy recommended in the docs?”

βœ… Best Practices

  • Keep URLs focused: Avoid crawling homepages with too many links or irrelevant sections.
  • Use descriptive names: Help the router understand the tool’s domain by writing a clear description.
  • Limit content volume: For best performance, avoid overly large or cluttered web pages.
  • Scrape trusted sources: Ensure the content is reliable and regularly updated.
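For "limit content volume", one simple tactic is to cap how much extracted text per page is indexed. A minimal sketch; the 20,000-character budget is an arbitrary example, not a documented INTELLITHING limit:

```python
def cap_page_text(text: str, budget: int = 20_000) -> str:
    """Truncate extracted page text to a fixed character budget
    before indexing, so one cluttered page can't dominate the index."""
    return text if len(text) <= budget else text[:budget]

print(len(cap_page_text("x" * 50_000)))  # prints: 20000
```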

πŸ“Œ Example Use Case

You want to build an agent that can answer questions about your own product website:

  1. Add the following URL:

     ["https://docs.intellithing.tech"]

  2. Name the tool "INTELLITHING Docs Agent" with a description like "Answers from public docs about INTELLITHING."
  3. The agent can now handle queries like:

  • β€œWhat triggers are available in INTELLITHING?”
  • β€œDoes it support Slack integration?”