Web Scraper Agent Tool

πŸ”‘ Key Concepts

The Web Scraper Agent Tool enables INTELLITHING agents to fetch, index, and query content from external web pages. It’s a versatile tool for gathering insights from news sites, documentation portals, FAQs, blog posts, and more β€” all without manual data entry.

Use cases include:

  • Asking questions about external documentation (e.g., API docs)
  • Monitoring and summarizing public-facing announcements
  • Powering LLM agents with fresh web knowledge

πŸ“˜ Key Definitions

| Term | Description |
| --- | --- |
| Web Crawler URLs | List of web pages to scrape content from (must be public and reachable). |
| SimpleWebPageReader | A LlamaIndex reader that fetches HTML pages and extracts raw text for indexing. |
| QueryEngineTool | A callable component that lets agents perform semantic search over scraped content. |
| Agent Tool | A module callable by agents when queries match its intent or description. |

βš™οΈ Setup Guide: Using the Web Scraper Agent

1. Input Your Target URLs

  • Provide a list of full URLs pointing to the pages you want to scrape.
  • Each URL should be publicly accessible and contain meaningful text.
  • You can include multiple URLs.

  • Example: ["https://docs.llamaindex.ai", "https://openai.com/blog/api-updates"]
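Since each URL must be a full, public address, a quick sanity check before saving the list can catch malformed entries early. A minimal sketch (the `looks_valid` helper is illustrative, not part of INTELLITHING):

```python
from urllib.parse import urlparse

def looks_valid(url: str) -> bool:
    """Cheap sanity check: a scrape target should be a full http(s) URL."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

urls = ["https://docs.llamaindex.ai", "https://openai.com/blog/api-updates"]
print(all(looks_valid(u) for u in urls))  # prints: True
```

Note this only checks the URL's shape; whether the page is actually reachable can only be verified by fetching it.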

2. Configure Tool Parameters

| Field | Purpose | Example |
| --- | --- | --- |
| name | Unique name for the tool instance | "Llama Docs Scraper" |
| description | Used by the agent router to match questions to the tool | "Answers questions from LlamaIndex documentation" |
| web_crawler_urls | List of URLs to scrape | ["https://docs.llamaindex.ai"] |
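Expressed in code, the three parameters might be grouped like this (the dict shape is illustrative; INTELLITHING's UI collects the same fields):

```python
# The three tool parameters from the table above, as a plain dict.
tool_config = {
    "name": "Llama Docs Scraper",
    "description": "Answers questions from LlamaIndex documentation",
    "web_crawler_urls": ["https://docs.llamaindex.ai"],
}

# The description is what the agent router matches queries against,
# so it should name the tool's domain explicitly.
print("LlamaIndex" in tool_config["description"])  # prints: True
```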

πŸ”„ How It Works

  1. The tool uses SimpleWebPageReader to fetch the HTML content of the specified pages.
  2. Extracted text is indexed into a VectorStoreIndex.
  3. The resulting QueryEngineTool enables semantic Q&A over the indexed page content.
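The three steps above can be sketched with standard-library stand-ins. In the real tool, SimpleWebPageReader and VectorStoreIndex (LlamaIndex) do this work; here `html.parser` and naive word-overlap scoring merely illustrate the fetch, extract, index, and query flow:

```python
# Toy stand-in for the fetch -> extract -> index -> query pipeline.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Steps 1-2 (simplified): strip tags and collect a page's visible text."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> list[str]:
    parser = TextExtractor()
    parser.feed(html)
    return parser.chunks


def query(chunks: list[str], question: str) -> str:
    """Step 3 (simplified): return the chunk sharing the most words with
    the question -- a crude stand-in for semantic vector search."""
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))


page = "<html><body><h1>Chunking</h1><p>Documents are split into chunks.</p></body></html>"
chunks = extract_text(page)
print(query(chunks, "How are documents split?"))  # prints: Documents are split into chunks.
```

The production pipeline replaces the word-overlap scoring with embedding-based retrieval over a VectorStoreIndex, but the shape of the flow is the same.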

πŸ”€ Agent Integration

To make this tool usable within an LLM-powered agent, define the following in your agent's JSON config:

```json
{
  "agent": "scraper",
  "targets": {
    "web_crawler_urls": [
      "https://docs.llamaindex.ai",
      "https://openai.com/blog/updates"
    ]
  }
}
```
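A config like this can be validated before handing it to the agent; the field names below follow the example above, not a published schema:

```python
# Hypothetical pre-flight check that an agent config carries the fields
# this tool needs.
import json

raw = """
{
  "agent": "scraper",
  "targets": {
    "web_crawler_urls": [
      "https://docs.llamaindex.ai",
      "https://openai.com/blog/updates"
    ]
  }
}
"""

config = json.loads(raw)
print(config["agent"])  # prints: scraper
print(all(u.startswith("https://") for u in config["targets"]["web_crawler_urls"]))  # prints: True
```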

This enables the agent to respond to queries like:

  • β€œWhat’s the latest update from OpenAI?”
  • β€œHow does LlamaIndex handle document chunking?”
  • β€œWhat’s the embedding strategy recommended in the docs?”

βœ… Best Practices

  • Keep URLs focused: Avoid crawling homepages with too many links or irrelevant sections.
  • Use descriptive names: Help the router understand the tool’s domain by writing a clear description.
  • Limit content volume: For best performance, avoid overly large or cluttered web pages.
  • Scrape trusted sources: Ensure the content is reliable and regularly updated.
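For "limit content volume", one simple tactic is to cap how much extracted text per page is indexed. A minimal sketch; the 20,000-character budget is an arbitrary example, not a documented INTELLITHING limit:

```python
def cap_page_text(text: str, budget: int = 20_000) -> str:
    """Truncate extracted page text to a fixed character budget
    before indexing, so one cluttered page can't dominate the index."""
    return text if len(text) <= budget else text[:budget]

print(len(cap_page_text("x" * 50_000)))  # prints: 20000
```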

πŸ“Œ Example Use Case

You want to build an agent that can answer questions about your own product website:

  1. Add the following URL:

     ["https://docs.intellithing.tech"]

  2. Name the tool "INTELLITHING Docs Agent" with a description like "Answers from public docs about INTELLITHING."
  3. The agent can now handle queries like:

  • β€œWhat triggers are available in INTELLITHING?”
  • β€œDoes it support Slack integration?”