AI/TLDR

ScrapeGraphAI

Extract structured web data with LLMs and a graph pipeline from a plain-English prompt

Overview

ScrapeGraphAI is a Python web scraping library that pairs large language models with a graph-based pipeline. Instead of writing CSS selectors or XPath rules, you describe the information you want in plain English and point it at a source, and the library returns the data as structured JSON.

It works on live websites and on local documents such as HTML, XML, JSON, and Markdown. You choose the LLM in a small config dictionary, so it runs against hosted models like OpenAI's GPT series or local models served through Ollama.

It suits developers who want prompt-driven extraction rather than brittle, hand-written parsers. Ready-made pipelines cover single pages, multiple pages, and search-result scraping, and the library integrates with frameworks such as LangChain, LlamaIndex, and CrewAI.

What it does

  • Prompt-driven extraction: describe the fields you want and get back structured JSON, no selectors required
  • Multiple built-in pipelines: SmartScraperGraph (single page), SearchGraph (search results), SmartScraperMultiGraph (multiple pages), plus script and speech generators
  • Model-agnostic config: switch between OpenAI, Ollama, and other providers by editing the llm block
  • Scrapes both live websites and local documents (HTML, XML, JSON, Markdown)
  • Uses Playwright to fetch page content, with headless and verbose options
  • Integrates with LangChain, LlamaIndex, and CrewAI, and ships Python and Node SDKs
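
For local documents, the project's examples appear to pass the file's text as `source` in place of a URL. A minimal sketch of that pattern follows; the helper `scrape_local_file`, the file name `page.html`, and the prompt are hypothetical, and the config assumes a local Ollama model. The library import is deferred so the helper can be defined before the package is installed.

```python
from pathlib import Path


def scrape_local_file(path, prompt, config):
    """Run SmartScraperGraph over the text of a local file instead of a URL."""
    # Deferred import: requires `pip install scrapegraphai`.
    from scrapegraphai.graphs import SmartScraperGraph

    text = Path(path).read_text(encoding="utf-8")
    graph = SmartScraperGraph(prompt=prompt, source=text, config=config)
    return graph.run()


# Assumed config: a local Ollama model.
local_config = {
    "llm": {"model": "ollama/llama3.2", "model_tokens": 8192, "format": "json"},
    "verbose": False,
}

# Usage (needs a running Ollama server; "page.html" is a hypothetical file):
# result = scrape_local_file("page.html", "List every product name and price", local_config)
```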

Getting started

Install the package, set up Playwright for fetching pages, then run a single-page extraction with SmartScraperGraph.

Install ScrapeGraphAI

Install from PyPI, then install Playwright so the library can fetch website content. A virtual environment is recommended.

bash
pip install scrapegraphai

# Required for fetching website content
playwright install
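
Before running a pipeline, it can help to confirm that both packages are importable in the active environment. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and it checks import names only, not whether `playwright install` has downloaded the browser binaries.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names assumed to match the PyPI packages installed above.
missing = missing_modules(["scrapegraphai", "playwright"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("scrapegraphai and playwright are importable")
```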

Run a SmartScraperGraph

Define a config with your chosen LLM, give it a prompt and a source URL, then run the pipeline. This example uses a local Ollama model.

python
import json

from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "model_tokens": 8192,
        "format": "json",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract useful information from the webpage, including a description of what the company does, founders and social media links",
    source="https://scrapegraphai.com/",
    config=graph_config,
)

# Run the pipeline and pretty-print the extracted data
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
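
The shape of `result` depends on the model and the prompt, so a small check before passing it downstream can be useful. The sketch below assumes the result is a dict; the helper `missing_fields` is hypothetical, and the key names `description`, `founders`, and `social_media_links` are illustrative guesses at what the prompt above might yield, not guaranteed output.

```python
def missing_fields(result, required):
    """Return the required keys that are absent or empty in an extraction result."""
    if not isinstance(result, dict):
        return list(required)
    return [key for key in required if result.get(key) in (None, "", [], {})]


# Hypothetical output, for illustration only (not real scraped data).
sample = {
    "description": "An example company",
    "founders": [],
    "social_media_links": {"github": "https://example.com"},
}

print(missing_fields(sample, ["description", "founders", "social_media_links"]))
# ['founders']
```

A non-empty return value is a reasonable signal to retry with a more specific prompt or a different model.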

Switch to a hosted model (optional)

To use OpenAI or another provider, change only the llm block in the config and supply your API key.

python
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}
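
Hard-coding the key is fine for a quick test, but reading it from the environment keeps it out of source control. The sketch below assumes an `OPENAI_API_KEY` environment variable and a hypothetical helper `make_graph_config`; the model strings are the ones used in the examples above.

```python
import os


def make_graph_config(provider="ollama", verbose=True, headless=False):
    """Build a config dict; only the "llm" block differs between providers."""
    if provider == "ollama":
        llm = {"model": "ollama/llama3.2", "model_tokens": 8192, "format": "json"}
    elif provider == "openai":
        api_key = os.environ.get("OPENAI_API_KEY")
        if not api_key:
            raise RuntimeError("Set OPENAI_API_KEY before using the openai provider")
        llm = {"api_key": api_key, "model": "openai/gpt-4o-mini"}
    else:
        raise ValueError(f"Unknown provider: {provider!r}")
    return {"llm": llm, "verbose": verbose, "headless": headless}


# Local model by default; pass "openai" once the environment variable is set.
graph_config = make_graph_config("ollama")
```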

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Pull a company's description, founders, and social links from a homepage into clean JSON
  • Gather structured facts across the top search-engine results for a query using SearchGraph
  • Extract fields from many product or listing pages at once with SmartScraperMultiGraph
  • Convert local HTML, XML, or Markdown documents into structured data for downstream analysis
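
The multi-page and search use cases above can be sketched as follows. This assumes `SmartScraperMultiGraph` takes a list of URLs as `source` and that `SearchGraph` takes no `source` and reads a `max_results` key from its config, as the project's examples suggest; the helper names `dedupe_urls`, `scrape_many`, and `search_and_scrape` are hypothetical. Library imports are deferred so the helpers can be defined without the package installed.

```python
def dedupe_urls(urls):
    """Drop blank entries and repeated URLs while keeping the original order."""
    seen = set()
    unique = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def scrape_many(urls, prompt, config):
    """Apply one prompt to several pages with SmartScraperMultiGraph."""
    from scrapegraphai.graphs import SmartScraperMultiGraph  # deferred import

    graph = SmartScraperMultiGraph(prompt=prompt, source=dedupe_urls(urls), config=config)
    return graph.run()


def search_and_scrape(prompt, config, max_results=3):
    """Let SearchGraph find pages for the prompt, then extract from them."""
    from scrapegraphai.graphs import SearchGraph  # deferred import

    # "max_results" caps how many search hits are scraped (assumed config key).
    graph = SearchGraph(prompt=prompt, config={**config, "max_results": max_results})
    return graph.run()
```

Deduplicating the URL list first avoids paying for the same page twice, since each page costs at least one LLM call.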

How ScrapeGraphAI compares

ScrapeGraphAI alongside the other open-source web scraping and crawling tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Firecrawl | ★ 135k | A crawling service and API that converts whole websites into clean Markdown or structured JSON ready for LLMs.
Crawl4AI | ★ 68.9k | A local-first Python web crawler that turns pages into clean Markdown for use in RAG and LLM pipelines.
Scrapling | ★ 65k | A Python web scraping framework whose parser relocates your elements when pages change, with stealthy fetchers and a Scrapy-like spider engine for full crawls.
Scrapy | ★ 62.3k | A mature Python framework for writing fast spiders that crawl websites and extract structured data at scale.
ScrapeGraphAI | ★ 27.4k | Extract structured web data with LLMs and a graph pipeline from a plain-English prompt.
Colly | ★ 25.3k | A Go scraping framework for building fast crawlers with request handling, callbacks, and rate limiting.
Crawlee | ★ 23.8k | A Node.js/TypeScript scraping library with proxy rotation and browser fingerprinting for building reliable crawlers.
Katana | ★ 17.1k | A fast Go command-line crawler that discovers every URL, endpoint, and JavaScript file on a target site.