Overview
Crawl4AI is an open-source Python library that crawls web pages and turns them into clean, structured Markdown ready for use with language models. It runs locally with no API keys required, using an async browser pool under the hood to fetch and process pages.
It is built for developers who feed web content into RAG systems, agents, and other LLM data pipelines. Beyond plain Markdown, it can extract structured data, filter out page noise, and convert links into numbered citations so the output stays focused on the content that matters.
As a web scraping and crawling tool, it offers three ways to work: a Python API for embedding crawls in your code, a command-line interface for quick one-off runs, and a Docker server for self-hosted deployments.
What it does
- Generates clean, structured Markdown with headings, tables, and code blocks
- Filters out navigation and other page noise with Fit Markdown for AI-friendly output
- Converts page links into a numbered reference list with citations
- Extracts structured data with LLMs, working with both open-source and proprietary models
- Deep crawls sites with a BFS strategy and a configurable page limit
- Runs with no API keys via a Python API, a CLI, and a Docker server
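To make the link-to-citation feature listed above concrete, here is a minimal standard-library sketch of the idea. It is not Crawl4AI's implementation: the `links_to_citations` function and its regular expression are hypothetical, written only to show how inline Markdown links could be swapped for numbered markers with a matching reference list.

```python
import re

# Matches inline Markdown links of the form [text](url).
# The negative lookbehind skips images, which look like ![alt](src).
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def links_to_citations(markdown: str) -> tuple[str, list[str]]:
    """Replace inline links with numbered markers and collect the URLs.

    Hypothetical helper for illustration only; not part of Crawl4AI.
    """
    numbers: dict[str, int] = {}  # url -> citation number, in first-seen order

    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        # Reuse the existing number when the same URL appears again.
        number = numbers.setdefault(url, len(numbers) + 1)
        return f"{text} [{number}]"

    body = LINK_RE.sub(replace, markdown)
    references = [f"[{number}] {url}" for url, number in numbers.items()]
    return body, references


sample = (
    "See [docs](https://a.example/docs) and [repo](https://a.example/repo), "
    "then [docs again](https://a.example/docs)."
)
body, references = links_to_citations(sample)
print(body)
print("\n".join(references))
```

Repeated URLs share one number, so the reference list stays short even on link-heavy pages.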
Getting started
Install the package, run the one-time setup, then crawl a page from Python or the CLI.
Install Crawl4AI
Install from PyPI, run the post-install setup, and verify the install with the doctor command.
    pip install -U crawl4ai
    crawl4ai-setup
    crawl4ai-doctor
Crawl a page in Python
Use AsyncWebCrawler to fetch a URL and print its Markdown.
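    import asyncio

    from crawl4ai import AsyncWebCrawler

    async def main():
        # The context manager starts the crawler and cleans it up on exit.
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url="https://www.nbcnews.com/business",
            )
            # Print the page content converted to Markdown.
            print(result.markdown)

    if __name__ == "__main__":
        asyncio.run(main())
Or use the command line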
The crwl command crawls a URL and writes Markdown output, with options for deep crawling and LLM extraction.
    # Basic crawl with markdown output
    crwl https://www.nbcnews.com/business -o markdown
    # Deep crawl with BFS strategy, max 10 pages
    crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
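For intuition about what a BFS deep crawl with a page limit does, the sketch below walks a tiny in-memory link graph breadth-first and stops once the limit is reached. It illustrates the general technique only and makes no claim about Crawl4AI's internals: `bfs_crawl`, `get_links`, and the `SITE` dictionary are hypothetical names, and no network requests are made.

```python
from collections import deque


def bfs_crawl(start, get_links, max_pages=10):
    """Visit pages breadth-first from `start`, stopping after `max_pages` pages.

    Hypothetical helper for illustration only; not part of Crawl4AI.
    """
    visited = []            # pages in the order they were crawled
    seen = {start}          # URLs already queued, so nothing is visited twice
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory site standing in for real HTTP fetches.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install", "/docs/api", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/install": [],
    "/docs/api": [],
    "/blog/post-1": [],
}

pages = bfs_crawl("/", lambda url: SITE.get(url, []), max_pages=4)
print(pages)  # ['/', '/docs', '/blog', '/docs/install']
```

Because the queue is first-in, first-out, every page one link away from the start is visited before any page two links away, which is why a small page limit tends to capture a site's top-level sections first.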
When to use it
- Convert documentation or news sites into Markdown to feed a RAG knowledge base
- Give an LLM agent a tool to fetch and read live web pages as clean text
- Extract structured data such as product details or prices from listing pages
- Deep crawl a site to a set page limit and collect its content for analysis
How Crawl4AI compares
Crawl4AI alongside the other open-source web scraping and crawling tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Firecrawl | ★ 135k | A crawling service and API that converts whole websites into clean Markdown or structured JSON ready for LLMs. |
| Crawl4AI | ★ 68.9k | An open-source web crawler that turns pages into clean, LLM-ready Markdown. |
| Scrapling | ★ 65k | A Python web scraping framework whose parser relocates your elements when pages change, with stealthy fetchers and a Scrapy-like spider engine for full crawls. |
| Scrapy | ★ 62.3k | A mature Python framework for writing fast spiders that crawl websites and extract structured data at scale. |
| ScrapeGraphAI | ★ 27.4k | A Python library that uses LLMs and a graph pipeline to extract data from pages based on natural-language prompts. |
| Colly | ★ 25.3k | A Go scraping framework for building fast crawlers with request handling, callbacks, and rate limiting. |
| Crawlee | ★ 23.8k | A Node.js/TypeScript scraping library with proxy rotation and browser fingerprinting for building reliable crawlers. |
| Katana | ★ 17.1k | A fast Go command-line crawler that discovers every URL, endpoint, and JavaScript file on a target site. |