Spider Scraper ↗
noSummary: The 'SpiderTool' is designed to extract and read the content of a specified website using Spider.
Original Documentation
Documentation Index#
Fetch the complete documentation index at: https://docs.crewai.com/llms.txt Use this file to discover all available pages before exploring further.
The
SpiderToolis designed to extract and read the content of a specified website using Spider.
SpiderTool#
Description#
Spider is the fastest open source scraper and crawler that returns LLM-ready data. It converts any website into pure HTML, markdown, metadata or text while enabling you to crawl with custom actions using AI.
Installation#
To use the SpiderTool you need to download the Spider SDK
and the crewai[tools] SDK too:
pip install spider-client 'crewai[tools]'Example#
This example shows you how you can use the SpiderTool to enable your agent to scrape and crawl websites.
The data returned from the Spider API is already LLM-ready, so no need to do any cleaning there.
from crewai_tools import SpiderTool
def main():
spider_tool = SpiderTool()
searcher = Agent(
role="Web Research Expert",
goal="Find related information from specific URL's",
backstory="An expert web researcher that uses the web extremely well",
tools=[spider_tool],
verbose=True,
)
return_metadata = Task(
description="Scrape https://spider.cloud with a limit of 1 and enable metadata",
expected_output="Metadata and 10 word summary of spider.cloud",
agent=searcher
)
crew = Crew(
agents=[searcher],
tasks=[
return_metadata,
],
verbose=2
)
crew.kickoff()
if __name__ == "__main__":
main()Arguments#
| Argument | Type | Description |
|---|---|---|
| api_key | string | Specifies Spider API key. If not specified, it looks for SPIDER_API_KEY in environment variables. |
| params | object | Optional parameters for the request. Defaults to {"return_format": "markdown"} to optimize content for LLMs. |
| request | string | Type of request to perform (http, chrome, smart). smart defaults to HTTP, switching to JavaScript rendering if needed. |
| limit | int | Max pages to crawl per website. Set to 0 or omit for unlimited. |
| depth | int | Max crawl depth. Set to 0 for no limit. |
| cache | bool | Enables HTTP caching to speed up repeated runs. Default is true. |
| budget | object | Sets path-based limits for crawled pages, e.g., {"*":1} for root page only. |
| locale | string | Locale for the request, e.g., en-US. |
| cookies | string | HTTP cookies for the request. |
| stealth | bool | Enables stealth mode for Chrome requests to avoid detection. Default is true. |
| headers | object | HTTP headers as a map of key-value pairs for all requests. |
| metadata | bool | Stores metadata about pages and content, aiding AI interoperability. Defaults to false. |
| viewport | object | Sets Chrome viewport dimensions. Default is 800x600. |
| encoding | string | Specifies encoding type, e.g., UTF-8, SHIFT_JIS. |
| subdomains | bool | Includes subdomains in the crawl. Default is false. |
| user_agent | string | Custom HTTP user agent. Defaults to a random agent. |
| store_data | bool | Enables data storage for the request. Overrides storageless when set. Default is false. |
| gpt_config | object | Allows AI to generate crawl actions, with optional chaining steps via an array for "prompt". |
| fingerprint | bool | Enables advanced fingerprinting for Chrome. |
| storageless | bool | Prevents all data storage, including AI embeddings. Default is false. |
| readability | bool | Pre-processes content for reading via Mozilla’s readability. Improves content for LLMs. |
| return_format | string | Format to return data: markdown, raw, text, html2text. Use raw for default page format. |
| proxy_enabled | bool | Enables high-performance proxies to avoid network-level blocking. |
| query_selector | string | CSS query selector for content extraction from markup. |
| full_resources | bool | Downloads all resources linked to the website. |
| request_timeout | int | Timeout in seconds for requests (5-60). Default is 30. |
| run_in_background | bool | Runs the request in the background, useful for data storage and triggering dashboard crawls. No effect if storageless is set. |
Link last verified
June 7, 2026.
View original ↗
Source: CrewAI Docs
Link last verified: 2026-03-04