ScrapeGraphAI

Extract structured web data with LLMs and a graph pipeline from a plain-English prompt

github.com/ScrapeGraphAI/Scrapegraph-ai | ★ 27.4k | scrapegraphai.com

Overview

ScrapeGraphAI is a Python web scraping library that pairs large language models with a graph-based pipeline. Instead of writing CSS selectors or XPath rules, you describe the information you want in plain English and point it at a source, and the library returns the data as structured JSON.

It works on live websites and on local documents such as HTML, XML, JSON, and Markdown. You choose the LLM in a small config dictionary, so it runs against hosted models like OpenAI's GPT series or local models served through Ollama.

It suits developers who want prompt-driven extraction rather than brittle, hand-written parsers. Ready-made pipelines cover single pages, multiple pages, and search-result scraping, and the library integrates with frameworks like LangChain, LlamaIndex, and CrewAI.

What it does

Prompt-driven extraction: describe the fields you want and get back structured JSON, no selectors required
Multiple built-in pipelines: SmartScraperGraph (single page), SearchGraph (search results), SmartScraperMultiGraph (multiple pages), plus script and speech generators
Model-agnostic config: switch between OpenAI, Ollama, and other providers by editing the llm block
Scrapes both live websites and local documents (HTML, XML, JSON, Markdown)
Uses Playwright to fetch page content, with headless and verbose options
Integrates with LangChain, LlamaIndex, and CrewAI, and ships Python and Node SDKs
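
The choice between the three scraping pipelines listed above can be sketched as a small dispatcher. This is only an illustration: `pipeline_name_for` and `build_pipeline` are hypothetical helpers, not part of ScrapeGraphAI, and the constructor keywords (`prompt`, `source`, `config`) are assumed to match the project's README examples. The library is imported lazily, so the selection logic runs even where it is not installed.

```python
def pipeline_name_for(source):
    """Pick a pipeline class name from the kind of source given.

    None       -> SearchGraph (no URL, the prompt drives a web search)
    list/tuple -> SmartScraperMultiGraph (several pages)
    str        -> SmartScraperGraph (one page or one document)
    """
    if source is None:
        return "SearchGraph"
    if isinstance(source, (list, tuple)):
        return "SmartScraperMultiGraph"
    return "SmartScraperGraph"


def build_pipeline(prompt, config, source=None):
    """Construct the matching pipeline (hypothetical helper).

    Assumes scrapegraphai.graphs exposes the three classes named above
    and that they accept these keyword arguments.
    """
    # Imported here so pipeline_name_for works without the library installed.
    from scrapegraphai import graphs

    graph_cls = getattr(graphs, pipeline_name_for(source))
    if source is None:
        return graph_cls(prompt=prompt, config=config)
    if not isinstance(source, str):
        source = list(source)
    return graph_cls(prompt=prompt, source=source, config=config)


print(pipeline_name_for("https://scrapegraphai.com/"))
```

Whichever class is chosen, calling `.run()` on the returned object would start the extraction, as in the SmartScraperGraph example under Getting started.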

Getting started

Install the package, set up Playwright for fetching pages, then run a single-page extraction with SmartScraperGraph.

Install ScrapeGraphAI

Install from PyPI, then install Playwright so the library can fetch website content. A virtual environment is recommended.

bash

pip install scrapegraphai

# Important: installs the browsers Playwright needs to fetch website content
playwright install
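
After running the two commands above, a quick check like the following can confirm that both packages are importable in the active virtual environment. It is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the check does not verify that `playwright install` actually downloaded the browser binaries.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Prints an empty list when both packages are installed in this environment.
print(missing_packages(["scrapegraphai", "playwright"]))
```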

Run a SmartScraperGraph

Define a config with your chosen LLM, give it a prompt and a source URL, then run the pipeline. This example uses a local Ollama model.

python

from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "model_tokens": 8192,
        "format": "json",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract useful information from the webpage, including a description of what the company does, founders and social media links",
    source="https://scrapegraphai.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()

import json
print(json.dumps(result, indent=4))
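
Because the output is produced by an LLM, the keys in `result` may not always match what the prompt asked for. A small check before passing the data downstream can catch that. The sketch below assumes `run()` returns a dictionary; `missing_fields` and the sample data are hypothetical and only illustrate the idea.

```python
def missing_fields(result, required):
    """Return the required keys that are absent or empty in the result."""
    if not isinstance(result, dict):
        # Anything other than a dict has none of the fields we need.
        return list(required)
    empty_values = (None, "", [], {})
    return [key for key in required if result.get(key) in empty_values]


# Hypothetical result, shaped like what the prompt above might produce.
sample_result = {
    "description": "placeholder description",
    "founders": [],
    "social_media_links": {"github": "https://github.com/ScrapeGraphAI"},
}

print(missing_fields(sample_result, ["description", "founders", "social_media_links"]))
```

If the returned list is not empty, one option is to rerun the pipeline with a more specific prompt that names the missing fields.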

Switch to a hosted model (optional)

To use OpenAI or another provider, change only the llm block in the config and supply your API key.

python

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
}
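
To keep the API key out of source code, the config can be built from an environment variable instead. This is a minimal sketch: `build_graph_config` is a hypothetical helper, and the variable name `OPENAI_API_KEY` is an assumption about how you store the key, not something the library requires.

```python
import os


def build_graph_config(model, api_key_env=None, verbose=True, headless=False, **llm_options):
    """Build a config dict shaped like the examples above.

    api_key_env names an environment variable holding the API key;
    leave it as None for local models that need no key.
    """
    llm = {"model": model, **llm_options}
    if api_key_env is not None:
        api_key = os.environ.get(api_key_env)
        if not api_key:
            raise RuntimeError(f"Set the {api_key_env} environment variable first")
        llm["api_key"] = api_key
    return {"llm": llm, "verbose": verbose, "headless": headless}


# Local model: no key needed, extra llm options passed through unchanged.
local_config = build_graph_config("ollama/llama3.2", model_tokens=8192, format="json")
print(local_config["llm"]["model"])
```

For a hosted model, the call would look like `build_graph_config("openai/gpt-4o-mini", api_key_env="OPENAI_API_KEY")`, which raises an error early if the variable is unset rather than failing mid-scrape.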

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Pull a company's description, founders, and social links from a homepage into clean JSON
Gather structured facts across the top search-engine results for a query using SearchGraph
Extract fields from many product or listing pages at once with SmartScraperMultiGraph
Convert local HTML, XML, or Markdown documents into structured data for downstream analysis
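
The local-document case in the last item might look like the sketch below. It rests on an assumption worth verifying against the official docs: that SmartScraperGraph treats a `source` string that is not a URL as the document content itself. `read_local_document` and `extract_from_file` are hypothetical helpers, and the library is imported lazily so the file-reading part runs on its own.

```python
from pathlib import Path


def read_local_document(path, encoding="utf-8"):
    """Read a local HTML, XML, or Markdown file into a string."""
    return Path(path).read_text(encoding=encoding)


def extract_from_file(path, prompt, config):
    """Run a single-document extraction on a local file (hypothetical helper).

    Assumes SmartScraperGraph accepts raw document text as `source`.
    """
    # Imported here so read_local_document works without the library installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(
        prompt=prompt,
        source=read_local_document(path),
        config=config,
    )
    return graph.run()
```

A call such as `extract_from_file("page.html", "List every product name and price", graph_config)` would then reuse the same config dictionary shown under Getting started; the file name and prompt here are placeholders.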

How ScrapeGraphAI compares

ScrapeGraphAI alongside the other open-source web scraping and crawling tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Firecrawl	★ 135k	A crawling service and API that converts whole websites into clean Markdown or structured JSON ready for LLMs.
Crawl4AI	★ 68.9k	A local-first Python web crawler that turns pages into clean Markdown for use in RAG and LLM pipelines.
Scrapling	★ 65k	A Python web scraping framework whose parser relocates your elements when pages change, with stealthy fetchers and a Scrapy-like spider engine for full crawls.
Scrapy	★ 62.3k	A mature Python framework for writing fast spiders that crawl websites and extract structured data at scale.
ScrapeGraphAI	★ 27.4k	Extract structured web data with LLMs and a graph pipeline from a plain-English prompt.
Colly	★ 25.3k	A Go scraping framework for building fast crawlers with request handling, callbacks, and rate limiting.
Crawlee	★ 23.8k	A Node.js/TypeScript scraping library with proxy rotation and browser fingerprinting for building reliable crawlers.
Katana	★ 17.1k	A fast Go command-line crawler that discovers every URL, endpoint, and JavaScript file on a target site.
