AI/TLDR

Scrapy

A Python framework for crawling websites and extracting structured data at scale

Overview

Scrapy is a Python framework for building web crawlers, called spiders, that visit pages, follow links, and pull out structured data such as text, prices, or listings. It runs on Linux, macOS, and Windows and requires Python 3.10 or newer.

It is aimed at developers who need to collect data from many pages reliably rather than write one-off scripts. You define a spider class that says which URLs to start from and how to parse each response, and Scrapy handles requests, scheduling, retries, and concurrency for you.

Within the web scraping and crawling space, Scrapy sits at the full-framework end: instead of stitching together an HTTP client and an HTML parser yourself, you get a project structure, CSS and XPath selectors, item pipelines, and export to JSON, CSV, or other formats. It is maintained by Zyte and many community contributors.

What it does

  • Define spiders as Python classes with start URLs and a parse method that yields data or follows links
  • Extract data with built-in CSS and XPath selectors on each response
  • Handle requests, scheduling, retries, and concurrent crawling so you don't manage them by hand
  • Clean, validate, and store scraped data with item pipelines
  • Export results to JSON, CSV, and other formats out of the box
  • Run on Linux, macOS, and Windows with Python 3.10+

Getting started

Install Scrapy from PyPI, scaffold a project, write a small spider, then run it.

Install Scrapy

Install the package from PyPI. Scrapy requires Python 3.10 or newer.

bash
pip install scrapy

Create a project

Scaffold a new Scrapy project to get the standard folder layout for spiders and settings.

bash
scrapy startproject tutorial

Write a spider

Add a spider class that lists start URLs and parses each response. Save it under the project's spiders folder.

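python
import scrapy

class QuotesSpider(scrapy.Spider):
    # The name is what you pass to `scrapy crawl`.
    name = "quotes"
    # Scrapy requests these URLs first and sends each response to parse().
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css("div.quote"):
            # Yielding a dict hands one scraped item back to Scrapy.
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }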

Run the spider

Run the spider by its name from inside the project directory. Scrapy fetches the pages and prints the extracted items in the log.

bash
scrapy crawl quotes

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Collect product listings, prices, or reviews across many pages of an e-commerce site
  • Build a dataset from public web pages for analysis or machine learning
  • Crawl a site by following links to gather structured records at scale
  • Run scheduled scraping jobs that export results to JSON or CSV for a data pipeline

How Scrapy compares

Scrapy alongside the other open-source web scraping and crawling tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Firecrawl | ★ 135k | A crawling service and API that converts whole websites into clean Markdown or structured JSON ready for LLMs. |
| Crawl4AI | ★ 68.9k | A local-first Python web crawler that turns pages into clean Markdown for use in RAG and LLM pipelines. |
| Scrapling | ★ 65k | A Python web scraping framework whose parser relocates your elements when pages change, with stealthy fetchers and a Scrapy-like spider engine for full crawls. |
| Scrapy | ★ 62.3k | A Python framework for crawling websites and extracting structured data at scale. |
| ScrapeGraphAI | ★ 27.4k | A Python library that uses LLMs and a graph pipeline to extract data from pages based on natural-language prompts. |
| Colly | ★ 25.3k | A Go scraping framework for building fast crawlers with request handling, callbacks, and rate limiting. |
| Crawlee | ★ 23.8k | A Node.js/TypeScript scraping library with proxy rotation and browser fingerprinting for building reliable crawlers. |
| Katana | ★ 17.1k | A fast Go command-line crawler that discovers every URL, endpoint, and JavaScript file on a target site. |