Overview
ScrapeGraphAI is a Python web scraping library that pairs large language models with a graph-based pipeline. Instead of writing CSS selectors or XPath rules, you describe the information you want in plain English and point it at a source, and the library returns the data as structured JSON.
It works on live websites and on local documents such as HTML, XML, JSON, and Markdown. You choose the LLM in a small config dictionary, so it runs against hosted models like OpenAI's GPT series or local models served through Ollama.
It suits developers who want prompt-driven extraction rather than brittle, hand-written parsers. Several ready-made pipelines cover single pages, multiple pages, and search-result scraping, and it integrates with frameworks like LangChain, LlamaIndex, and CrewAI.
What it does
- Prompt-driven extraction: describe the fields you want and get back structured JSON, no selectors required
- Multiple built-in pipelines: SmartScraperGraph (single page), SearchGraph (search results), SmartScraperMultiGraph (multiple pages), plus script and speech generators
- Model-agnostic config: switch between OpenAI, Ollama, and other providers by editing the llm block
- Scrapes both live websites and local documents (HTML, XML, JSON, Markdown)
- Uses Playwright to fetch page content, with headless and verbose options
- Integrates with LangChain, LlamaIndex, and CrewAI, and ships Python and Node SDKs
Getting started
Install the package, set up Playwright for fetching pages, then run a single-page extraction with SmartScraperGraph.
Install ScrapeGraphAI
Install from PyPI, then install Playwright so the library can fetch website content. A virtual environment is recommended.
pip install scrapegraphai
# IMPORTANT (needed for fetching website content)
playwright install
Run a SmartScraperGraph
Define a config with your chosen LLM, give it a prompt and a source URL, then run the pipeline. This example uses a local Ollama model.
import json

from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "model_tokens": 8192,
        "format": "json",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract useful information from the webpage, including a description of what the company does, founders and social media links",
    source="https://scrapegraphai.com/",
    config=graph_config,
)

# Run the pipeline
result = smart_scraper_graph.run()

# Pretty-print the extracted data
print(json.dumps(result, indent=4))
Switch to a hosted model (optional)
To use OpenAI or another provider, change only the llm block in the config and supply your API key.
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
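Because only the llm block changes between providers, both configs shown above can be produced by one small helper. The following is a minimal sketch and not part of ScrapeGraphAI: `build_graph_config` is a hypothetical name, and falling back to an `OPENAI_API_KEY` environment variable is an assumption made here to keep keys out of source files. It only assembles the dictionaries from the examples and never calls the library.

```python
import os


def build_graph_config(model, api_key=None, verbose=True, headless=False):
    """Assemble a config dict shaped like the examples above.

    Hypothetical helper for illustration; it does not call ScrapeGraphAI.
    """
    llm = {"model": model}
    if model.startswith("ollama/"):
        # Local Ollama model: mirror the extra keys from the Ollama example.
        llm["model_tokens"] = 8192
        llm["format"] = "json"
    else:
        # Hosted provider: take the key from the argument, else from the
        # environment (the variable name is an assumption).
        key = api_key or os.environ.get("OPENAI_API_KEY")
        if not key:
            raise ValueError(f"an API key is required for hosted model {model!r}")
        llm["api_key"] = key
    return {"llm": llm, "verbose": verbose, "headless": headless}


local_config = build_graph_config("ollama/llama3.2")
hosted_config = build_graph_config(
    "openai/gpt-4o-mini", api_key="YOUR_OPENAI_API_KEY"
)
```

Either dictionary can then be passed as the `config` argument in the SmartScraperGraph example, so switching providers becomes a one-line change.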
When to use it
- Pull a company's description, founders, and social links from a homepage into clean JSON
- Gather structured facts across the top search-engine results for a query using SearchGraph
- Extract fields from many product or listing pages at once with SmartScraperMultiGraph
- Convert local HTML, XML, or Markdown documents into structured data for downstream analysis
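For the downstream-analysis cases above, nested JSON usually needs flattening before it fits a CSV row or a table column. The sketch below assumes the pipeline returned a nested dictionary. Both the `flatten` helper and the `sample_result` data are hypothetical; they illustrate the shape of the problem and are not actual library output.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, item in value.items():
            flat.update(flatten(item, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            flat.update(flatten(item, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot from the accumulated prefix.
        flat[prefix[:-1]] = value
    return flat


# Hypothetical result, shaped like the company-homepage prompt above.
sample_result = {
    "description": "Example company",
    "founders": ["Ada", "Grace"],
    "social_media": {"github": "https://example.com/gh"},
}

row = flatten(sample_result)
```

Each key in `row` is a path into the original structure (for example `founders.0`), so results from several pages can share one set of column names.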
How ScrapeGraphAI compares
ScrapeGraphAI alongside the other open-source web scraping and crawling tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Firecrawl | ★ 135k | A crawling service and API that converts whole websites into clean Markdown or structured JSON ready for LLMs. |
| Crawl4AI | ★ 68.9k | A local-first Python web crawler that turns pages into clean Markdown for use in RAG and LLM pipelines. |
| Scrapling | ★ 65k | A Python web scraping framework whose parser relocates your elements when pages change, with stealthy fetchers and a Scrapy-like spider engine for full crawls. |
| Scrapy | ★ 62.3k | A mature Python framework for writing fast spiders that crawl websites and extract structured data at scale. |
| ScrapeGraphAI | ★ 27.4k | A Python library that extracts structured web data from a plain-English prompt using LLMs and a graph pipeline. |
| Colly | ★ 25.3k | A Go scraping framework for building fast crawlers with request handling, callbacks, and rate limiting. |
| Crawlee | ★ 23.8k | A Node.js/TypeScript scraping library with proxy rotation and browser fingerprinting for building reliable crawlers. |
| Katana | ★ 17.1k | A fast Go command-line crawler that discovers every URL, endpoint, and JavaScript file on a target site. |