Spider Scraper

Summary: The 'SpiderTool' is designed to extract and read the content of a specified website using Spider.

Original Documentation

`SpiderTool`

Description

Spider is the fastest open-source scraper and crawler that returns LLM-ready data. It converts any website into pure HTML, markdown, metadata, or text, and lets you crawl with custom actions using AI.

Installation

To use the SpiderTool, install both the Spider SDK (spider-client) and the crewai[tools] package:

pip install spider-client 'crewai[tools]'
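
The tool also needs a Spider API key. According to the Arguments table, an explicitly passed `api_key` is used when given, and `SPIDER_API_KEY` from the environment is the fallback. A minimal sketch of that lookup order, which can serve as a preflight check before building a crew, might look like the following. Note that `resolve_spider_api_key` is a hypothetical helper written for illustration; it is not part of `crewai_tools` or `spider-client`.

```python
import os
from typing import Mapping, Optional


def resolve_spider_api_key(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the explicit key if given, else SPIDER_API_KEY from the environment.

    Hypothetical helper: it mirrors the documented lookup order, not the
    tool's actual internals.
    """
    env = os.environ if env is None else env
    key = api_key or env.get("SPIDER_API_KEY")
    if not key:
        raise RuntimeError(
            "No Spider API key found: pass api_key or set SPIDER_API_KEY."
        )
    return key
```

Calling `resolve_spider_api_key()` at startup fails fast with a readable message instead of surfacing a missing key in the middle of a crew run.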

Example

This example shows how to use the SpiderTool to let an agent scrape and crawl websites. The data returned by the Spider API is already LLM-ready, so no additional cleaning is needed.

from crewai import Agent, Crew, Task
from crewai_tools import SpiderTool

def main():
    # With no api_key argument, the tool looks for SPIDER_API_KEY in the environment.
    spider_tool = SpiderTool()

    # Agent that is allowed to call the Spider tool.
    searcher = Agent(
        role="Web Research Expert",
        goal="Find related information from specific URLs",
        backstory="An expert web researcher that uses the web extremely well",
        tools=[spider_tool],
        verbose=True,
    )

    # Task describing what to scrape and what the agent should return.
    return_metadata = Task(
        description="Scrape https://spider.cloud with a limit of 1 and enable metadata",
        expected_output="Metadata and 10 word summary of spider.cloud",
        agent=searcher
    )

    crew = Crew(
        agents=[searcher],
        tasks=[
            return_metadata,
        ],
        verbose=True
    )

    crew.kickoff()

if __name__ == "__main__":
    main()

Arguments

Argument	Type	Description
api_key	`string`	Specifies Spider API key. If not specified, it looks for `SPIDER_API_KEY` in environment variables.
params	`object`	Optional parameters for the request. Defaults to `{"return_format": "markdown"}` to optimize content for LLMs.
request	`string`	Type of request to perform (`http`, `chrome`, `smart`). `smart` defaults to HTTP, switching to JavaScript rendering if needed.
limit	`int`	Max pages to crawl per website. Set to `0` or omit for unlimited.
depth	`int`	Max crawl depth. Set to `0` for no limit.
cache	`bool`	Enables HTTP caching to speed up repeated runs. Default is `true`.
budget	`object`	Sets path-based limits for crawled pages, e.g., `{"*":1}` for root page only.
locale	`string`	Locale for the request, e.g., `en-US`.
cookies	`string`	HTTP cookies for the request.
stealth	`bool`	Enables stealth mode for Chrome requests to avoid detection. Default is `true`.
headers	`object`	HTTP headers as a map of key-value pairs for all requests.
metadata	`bool`	Stores metadata about pages and content, aiding AI interoperability. Defaults to `false`.
viewport	`object`	Sets Chrome viewport dimensions. Default is `800x600`.
encoding	`string`	Specifies encoding type, e.g., `UTF-8`, `SHIFT_JIS`.
subdomains	`bool`	Includes subdomains in the crawl. Default is `false`.
user_agent	`string`	Custom HTTP user agent. Defaults to a random agent.
store_data	`bool`	Enables data storage for the request. Overrides `storageless` when set. Default is `false`.
gpt_config	`object`	Allows AI to generate crawl actions, with optional chaining steps via an array for `"prompt"`.
fingerprint	`bool`	Enables advanced fingerprinting for Chrome.
storageless	`bool`	Prevents all data storage, including AI embeddings. Default is `false`.
readability	`bool`	Pre-processes content for reading via Mozilla’s readability. Improves content for LLMs.
return_format	`string`	Format to return data: `markdown`, `raw`, `text`, `html2text`. Use `raw` for default page format.
proxy_enabled	`bool`	Enables high-performance proxies to avoid network-level blocking.
query_selector	`string`	CSS query selector for content extraction from markup.
full_resources	`bool`	Downloads all resources linked to the website.
request_timeout	`int`	Timeout in seconds for requests (5-60). Default is `30`.
run_in_background	`bool`	Runs the request in the background, useful for data storage and triggering dashboard crawls. No effect if `storageless` is set.
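
Several of the arguments above have documented value sets or ranges (`request`, `return_format`, `request_timeout`), and `params` has a documented default. The sketch below collects those constraints into a small validator that builds a parameter dictionary. `build_spider_params` and the two constant names are hypothetical, introduced only for illustration; the checks reflect the table above, not the Spider API's own validation.

```python
from typing import Any, Dict

# Value sets taken from the Arguments table.
ALLOWED_REQUEST_TYPES = {"http", "chrome", "smart"}
ALLOWED_RETURN_FORMATS = {"markdown", "raw", "text", "html2text"}


def build_spider_params(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides onto the documented default and check documented ranges.

    Hypothetical helper: not part of crewai_tools or spider-client.
    """
    params: Dict[str, Any] = {"return_format": "markdown"}  # documented default
    params.update(overrides)

    if params["return_format"] not in ALLOWED_RETURN_FORMATS:
        raise ValueError(f"unsupported return_format: {params['return_format']!r}")
    if "request" in params and params["request"] not in ALLOWED_REQUEST_TYPES:
        raise ValueError(f"unsupported request type: {params['request']!r}")

    timeout = params.get("request_timeout")
    if timeout is not None and not 5 <= timeout <= 60:
        raise ValueError("request_timeout must be between 5 and 60 seconds")

    # limit and depth use 0 to mean "no limit"; negative values make no sense.
    for name in ("limit", "depth"):
        if params.get(name, 0) < 0:
            raise ValueError(f"{name} must be 0 (no limit) or a positive integer")

    return params


# Scrape only the root page and include page metadata.
params = build_spider_params(limit=1, metadata=True, budget={"*": 1})
```

How the resulting dictionary reaches Spider depends on the installed `crewai_tools` version, so check the `SpiderTool` signature before wiring it in.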

Link last verified June 7, 2026.

Source: CrewAI Docs

Link last verified: 2026-03-04