Web Scraper Agent Tool
Key Concepts
The Web Scraper Agent Tool enables INTELLITHING agents to fetch, index, and query content from external web pages. It's a versatile tool for gathering insights from news sites, documentation portals, FAQs, blog posts, and more, all without manual data entry.
Use cases include:
- Asking questions about external documentation (e.g., API docs)
- Monitoring and summarizing public-facing announcements
- Powering LLM agents with fresh web knowledge
Key Definitions
| Term | Description |
|---|---|
| Web Crawler URLs | List of webpages to scrape content from (must be public and reachable). |
| SimpleWebPageReader | A LlamaIndex reader that fetches HTML pages and extracts raw text for indexing. |
| QueryEngineTool | A callable component that lets agents perform semantic search over scraped content. |
| Agent Tool | A module callable by agents when queries match its intent or description. |
Setup Guide: Using the Web Scraper Agent
1. Input Your Target URLs
- Provide a list of full URLs pointing to the pages you want to scrape.
- Each URL should be publicly accessible and contain meaningful text.
- You can include multiple URLs.

Example:

["https://docs.llamaindex.ai", "https://openai.com/blog/api-updates"]
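Malformed entries (a missing scheme, a bare domain pasted from a browser) are a common source of silent scrape failures, so a quick well-formedness check before registering URLs can save debugging time. The helper below is a hypothetical sketch using Python's standard library; it is not part of the tool itself:

```python
from urllib.parse import urlparse

def is_valid_crawler_url(url: str) -> bool:
    """Minimal sanity check before adding a URL to web_crawler_urls:
    the URL must be absolute and use http or https."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

urls = ["https://docs.llamaindex.ai", "https://openai.com/blog/api-updates"]
assert all(is_valid_crawler_url(u) for u in urls)
assert not is_valid_crawler_url("docs.llamaindex.ai")  # missing scheme
```

Note that this only checks the URL's shape; whether the page is actually public and reachable is only known at scrape time.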
2. Configure Tool Parameters

| Field | Purpose | Example |
|---|---|---|
| name | Unique name for the tool instance | "Llama Docs Scraper" |
| description | Used by the agent router to match questions to the tool | "Answers questions from LlamaIndex documentation" |
| web_crawler_urls | List of URLs to scrape | ["https://docs.llamaindex.ai"] |
How It Works
- The tool uses SimpleWebPageReader to fetch the HTML content of the specified pages.
- Extracted text is indexed into a VectorStoreIndex.
- The resulting QueryEngineTool enables semantic Q&A over the indexed page content.
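The real pipeline relies on LlamaIndex's SimpleWebPageReader, VectorStoreIndex, and QueryEngineTool. To keep the idea concrete without network access or external dependencies, the sketch below mimics the same three stages (extract text, index it, query it) using only Python's standard library, with word-overlap scoring standing in for real vector embeddings; it is an illustration of the flow, not the tool's actual implementation:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from HTML, mimicking what a web page reader extracts."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

def build_index(pages: dict) -> dict:
    """Map each URL to a bag of lowercase words (a toy stand-in for a vector index)."""
    return {url: set(extract_text(html).lower().split()) for url, html in pages.items()}

def query(index: dict, question: str) -> str:
    """Return the URL whose text overlaps most with the question's words."""
    q_words = set(question.lower().split())
    return max(index, key=lambda url: len(index[url] & q_words))

# Hypothetical pages, used only to exercise the sketch.
pages = {
    "https://example.com/chunking": "<html><body>Documents are split into chunks before embedding.</body></html>",
    "https://example.com/updates": "<html><body>Latest API updates and release notes.</body></html>",
}
index = build_index(pages)
print(query(index, "how are documents split into chunks"))  # -> https://example.com/chunking
```

In the real tool, the bag-of-words sets are replaced by vector embeddings, so matches work on meaning rather than exact word overlap.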
Agent Integration
To make this tool usable within an LLM-powered agent, define the following in your agent's JSON config:
{
"agent": "scraper",
"targets": {
"web_crawler_urls": [
"https://docs.llamaindex.ai",
"https://openai.com/blog/updates"
]
}
}
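Since the block above is plain JSON, an agent runtime parses it before wiring up the tool. A minimal sketch of reading those fields with Python's json module follows; the field names come from the example config above, but the loading code itself is illustrative, not INTELLITHING's actual loader:

```python
import json

# Same structure as the example agent config above.
config_text = """
{
  "agent": "scraper",
  "targets": {
    "web_crawler_urls": [
      "https://docs.llamaindex.ai",
      "https://openai.com/blog/updates"
    ]
  }
}
"""

config = json.loads(config_text)
urls = config["targets"]["web_crawler_urls"]
assert config["agent"] == "scraper"
assert len(urls) == 2
```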
This enables the agent to respond to queries like:
- "What's the latest update from OpenAI?"
- "How does LlamaIndex handle document chunking?"
- "What's the embedding strategy recommended in the docs?"
Best Practices
- Keep URLs focused: Avoid crawling homepages with too many links or irrelevant sections.
- Use descriptive names: Help the router understand the tool's domain by writing a clear description.
- Limit content volume: For best performance, avoid overly large or cluttered web pages.
- Scrape trusted sources: Ensure the content is reliable and regularly updated.
Example Use Case
You want to build an agent that can answer questions about your own product website:
1. Add your website's URL to the web_crawler_urls list.
2. Name the tool "INTELLITHING Docs Agent" with a description like "Answers from public docs about INTELLITHING."
3. The agent can now handle queries like:
- "What triggers are available in INTELLITHING?"
- "Does it support Slack integration?"