AI/TLDR

Crawl4AI

Open-source web crawler that turns pages into clean, LLM-ready Markdown

Overview

Crawl4AI is an open-source Python library that crawls web pages and turns them into clean, structured Markdown ready for use with language models. It runs locally with no API keys required, using an async browser pool under the hood to fetch and process pages.

It is built for developers who feed web content into RAG systems, agents, and other LLM data pipelines. Beyond plain Markdown, it can extract structured data, filter out page noise, and convert links into numbered citations so the output stays focused on the content that matters.

As a web scraping and crawling tool, it offers three ways to work: a Python API for embedding crawls in your code, a command-line interface for quick one-off runs, and a Docker server for self-hosted deployments.

What it does

  • Generates clean, structured Markdown with headings, tables, and code blocks
  • Filters out navigation and noise with Fit Markdown for AI-friendly output
  • Converts page links into a numbered reference list with citations
  • Extracts structured data with LLMs, working with both open-source and proprietary models
  • Deep crawls with a BFS strategy and a configurable page limit
  • Runs with no API keys via a Python API, a CLI, and a Docker server
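
To make the link-to-citation idea concrete, here is a minimal sketch using only the Python standard library. It is not Crawl4AI's implementation: the function names `links_to_citations` and `render`, the regular expression, and the sample text are all assumptions chosen for illustration.

```python
import re

# Matches inline Markdown links like [text](https://example.com), but not images.
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")

def links_to_citations(markdown: str) -> tuple[str, list[str]]:
    """Replace inline links with numbered markers and collect the URLs in order."""
    urls: list[str] = []

    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        if url not in urls:
            urls.append(url)
        number = urls.index(url) + 1  # a repeated URL reuses its first number
        return f"{text} [{number}]"

    body = LINK_RE.sub(replace, markdown)
    return body, urls

def render(markdown: str) -> str:
    """Append a numbered reference list to the rewritten body (hypothetical layout)."""
    body, urls = links_to_citations(markdown)
    if not urls:
        return body
    references = "\n".join(f"[{i}] {url}" for i, url in enumerate(urls, start=1))
    return f"{body}\n\nReferences\n{references}"

sample = (
    "See the [docs](https://docs.crawl4ai.com) and the "
    "[repo](https://example.com/repo), then the "
    "[docs](https://docs.crawl4ai.com) again."
)
print(render(sample))
```

Moving URLs out of the running text keeps the body short for a language model while the reference list preserves where each claim came from.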

Getting started

Install the package, run the one-time setup, then crawl a page from Python or the CLI.

Install Crawl4AI

Install from PyPI, run the post-install setup, and verify the install with the doctor command.

```bash
pip install -U crawl4ai
crawl4ai-setup
crawl4ai-doctor
```

Crawl a page in Python

Use AsyncWebCrawler to fetch a URL and print its Markdown.

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    # The context manager starts the browser and closes it on exit.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        # Print the page content as Markdown.
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

Or use the command line

The crwl command crawls a URL and writes Markdown output, with options for deep crawling and LLM extraction.

```bash
# Basic crawl with markdown output
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl with BFS strategy, max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
```

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.
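
For readers unfamiliar with the terms, the following sketch illustrates what a breadth-first (BFS) crawl with a page cap means in general. It is a conceptual illustration, not Crawl4AI's code: the `SITE` link graph, the `bfs_crawl` function, and its `get_links` callback are hypothetical, and no network requests are made.

```python
from collections import deque

# Hypothetical in-memory site: each URL maps to the links found on that page.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install", "/docs/api", "/"],
    "/blog": ["/blog/post-1", "/docs"],
    "/docs/install": [],
    "/docs/api": ["/docs/install"],
    "/blog/post-1": [],
}

def bfs_crawl(start, get_links, max_pages):
    """Visit pages level by level, stopping once max_pages have been visited."""
    visited = []           # pages in the order they were crawled
    seen = {start}         # URLs already queued, so nothing is fetched twice
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

order = bfs_crawl("/", lambda url: SITE.get(url, []), max_pages=4)
print(order)
```

Because the queue is first-in, first-out, pages closest to the start URL are crawled before deeper ones, so a small page limit tends to capture a site's top-level sections first.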

When to use it

  • Convert documentation or news sites into Markdown to feed a RAG knowledge base
  • Give an LLM agent a tool to fetch and read live web pages as clean text
  • Extract structured data such as product details or prices from listing pages
  • Deep crawl a site to a set page limit and collect its content for analysis
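
As one possible follow-up step for the RAG use case above, crawled Markdown can be split into heading-based chunks before indexing. This is a minimal sketch that is independent of Crawl4AI; the `chunk_by_heading` function and the sample text are assumptions, and the approach would misread `#` comment lines inside fenced code as headings.

```python
import re

# ATX-style Markdown headings: one to six '#' characters followed by a title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading, keeping the heading as a title."""
    chunks = []
    title, lines = "", []

    def flush():
        # Skip the empty chunk that would otherwise precede the first heading.
        if title or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks

sample = "Intro line.\n\n# Install\nRun pip.\n\n## Verify\nRun the doctor command.\n"
for chunk in chunk_by_heading(sample):
    print(chunk)
```

Keeping the heading with each chunk gives a retriever some context about where the text came from on the page.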

How Crawl4AI compares

How Crawl4AI ranks by GitHub stars among the web scraping and crawling tools that AI/TLDR tracks.

| Tool | Stars | What it does |
| --- | --- | --- |
| Firecrawl | ★ 135k | A crawling service and API that converts whole websites into clean Markdown or structured JSON ready for LLMs. |
| Crawl4AI | ★ 68.9k | Open-source web crawler that turns pages into clean, LLM-ready Markdown |
| Scrapling | ★ 65k | A Python web scraping framework whose parser relocates your elements when pages change, with stealthy fetchers and a Scrapy-like spider engine for full crawls. |
| Scrapy | ★ 62.3k | A mature Python framework for writing fast spiders that crawl websites and extract structured data at scale. |
| ScrapeGraphAI | ★ 27.4k | A Python library that uses LLMs and a graph pipeline to extract data from pages based on natural-language prompts. |
| Colly | ★ 25.3k | A Go scraping framework for building fast crawlers with request handling, callbacks, and rate limiting. |
| Crawlee | ★ 23.8k | A Node.js/TypeScript scraping library with proxy rotation and browser fingerprinting for building reliable crawlers. |
| Katana | ★ 17.1k | A fast Go command-line crawler that discovers every URL, endpoint, and JavaScript file on a target site. |