Crawl4AI

Open-source web crawler that turns pages into clean, LLM-ready Markdown

github.com/unclecode/crawl4ai · ★ 68.9k · crawl4ai.com

Overview

Crawl4AI is an open-source Python library that crawls web pages and turns them into clean, structured Markdown ready for use with language models. It runs locally with no API keys required, using an async browser pool under the hood to fetch and process pages.

It is built for developers who feed web content into RAG systems, agents, and other LLM data pipelines. Beyond plain Markdown, it can extract structured data, filter out page noise, and convert links into numbered citations so the output stays focused on the content that matters.

As a web scraping and crawling tool, it offers three ways to work: a Python API for embedding crawls in your code, a command-line interface for quick one-off runs, and a Docker server for self-hosted deployments.

What it does

Generates clean, structured Markdown with headings, tables, and code blocks
Filters out navigation and other page noise with Fit Markdown for AI-friendly output
Converts page links into a numbered reference list with citations
Extracts structured data with LLMs, using either open-source or proprietary models
Deep crawls sites with a BFS (breadth-first search) strategy and a configurable page limit
Runs with no API keys via a Python API, a CLI, and a Docker server
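
To make the link-to-citation feature in the list above concrete, here is a minimal standalone sketch of the idea. It is not Crawl4AI's implementation: the function names (links_to_citations, render), the [n] marker format, and the "References:" heading are all assumptions chosen for illustration, and it uses only the Python standard library.

```python
import re

# Matches inline Markdown links such as [text](https://example.com).
# The lookbehind skips images, which start with "!".
# Limitation of this sketch: URLs that contain ")" or spaces are not handled.
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def links_to_citations(markdown: str) -> tuple[str, list[str]]:
    """Replace inline links with numbered markers and collect the URLs.

    Hypothetical helper for illustration; not part of Crawl4AI.
    """
    references: list[str] = []          # URLs in first-seen order
    index_by_url: dict[str, int] = {}   # URL -> citation number

    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        if url not in index_by_url:
            references.append(url)
            index_by_url[url] = len(references)
        # Repeated links reuse the number assigned on first sight.
        return f"{text} [{index_by_url[url]}]"

    body = LINK_PATTERN.sub(replace, markdown)
    return body, references


def render(markdown: str) -> str:
    """Return the rewritten body followed by a numbered reference list."""
    body, references = links_to_citations(markdown)
    lines = [body, "", "References:"]
    lines += [f"[{i}] {url}" for i, url in enumerate(references, start=1)]
    return "\n".join(lines)


sample = (
    "See the [docs](https://docs.crawl4ai.com) and the "
    "[repo](https://github.com/unclecode/crawl4ai). "
    "The [docs](https://docs.crawl4ai.com) cover setup."
)
```

Calling render(sample) keeps the prose readable while moving every URL into one list at the end, which is the property that makes this style of output convenient for LLM prompts.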

Getting started

Install the package, run the one-time setup, then crawl a page from Python or the CLI.

Install Crawl4AI

Install from PyPI, run the post-install setup, and verify the install with the doctor command.

bash

pip install -U crawl4ai
crawl4ai-setup
crawl4ai-doctor

Crawl a page in Python

Use AsyncWebCrawler to fetch a URL and print its Markdown.

python

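import asyncio

from crawl4ai import AsyncWebCrawler


async def main():
    # The context manager starts the browser and closes it on exit.
    async with AsyncWebCrawler() as crawler:
        # arun() fetches a single URL and returns a result object.
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        # result.markdown holds the page content converted to Markdown.
        print(result.markdown)


if __name__ == "__main__":
    asyncio.run(main())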

Or use the command line

The crwl command crawls a URL and writes Markdown output, with options for deep crawling and LLM extraction.

bash

# Basic crawl with markdown output
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl with BFS strategy, max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Convert documentation or news sites into Markdown to feed a RAG knowledge base
Give an LLM agent a tool to fetch and read live web pages as clean text
Extract structured data such as product details or prices from listing pages
Deep crawl a site to a set page limit and collect its content for analysis
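
The deep-crawl use case above combines breadth-first traversal with a page cap. The sketch below shows that general technique on a small in-memory link graph. It makes no claim about how Crawl4AI implements its BFS strategy: SITE, get_links, and bfs_crawl are hypothetical names, and no network requests are made.

```python
from collections import deque

# Hypothetical in-memory site: each path maps to the links found on that page.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install", "/docs/api", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/install": [],
    "/docs/api": ["/docs/install"],
    "/blog/post-1": [],
}


def get_links(url: str) -> list[str]:
    """Stand-in for fetching a page and extracting its links."""
    return SITE.get(url, [])


def bfs_crawl(start: str, link_source, max_pages: int) -> list[str]:
    """Visit pages breadth-first and stop once max_pages have been visited."""
    visited: list[str] = []   # pages in the order they were crawled
    seen = {start}            # every URL ever queued, to avoid revisits
    queue = deque([start])

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)   # a real crawler would fetch and store the page here
        for link in link_source(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = bfs_crawl("/", get_links, max_pages=4)
```

Because the queue is first-in, first-out, pages closest to the start URL are visited first, so a low page limit tends to capture a site's top-level sections before any deeply nested pages.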

How Crawl4AI compares

Crawl4AI alongside other open-source web scraping & crawling tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Firecrawl	★ 135k	A crawling service and API that converts whole websites into clean Markdown or structured JSON ready for LLMs.
Crawl4AI	★ 68.9k	Open-source web crawler that turns pages into clean, LLM-ready Markdown.
Scrapling	★ 65k	A Python web scraping framework whose parser relocates your elements when pages change, with stealthy fetchers and a Scrapy-like spider engine for full crawls.
Scrapy	★ 62.3k	A mature Python framework for writing fast spiders that crawl websites and extract structured data at scale.
ScrapeGraphAI	★ 27.4k	A Python library that uses LLMs and a graph pipeline to extract data from pages based on natural-language prompts.
Colly	★ 25.3k	A Go scraping framework for building fast crawlers with request handling, callbacks, and rate limiting.
Crawlee	★ 23.8k	A Node.js/TypeScript scraping library with proxy rotation and browser fingerprinting for building reliable crawlers.
Katana	★ 17.1k	A fast Go command-line crawler that discovers every URL, endpoint, and JavaScript file on a target site.
