Crawlee

A Node.js library for web scraping and browser automation, built for writing reliable crawlers

github.com/apify/crawlee (★ 23.8k) · crawlee.dev

Overview

Crawlee is an open-source Node.js and TypeScript library for web scraping and browser automation. It gives you one interface for both plain HTTP crawling and headless browser crawling, so you can pull links, scrape data, and store results to disk or the cloud from a single codebase.

It is aimed at developers who build crawlers and need them to keep working against modern bot protection. Out of the box your crawlers generate browser-like headers and human-like fingerprints, rotate proxies, and manage sessions, which helps them blend in without a lot of manual tuning.

Within the web scraping and crawling category, Crawlee covers the full job end to end: a persistent request queue, pluggable storage for tabular data and files, automatic scaling to system resources, plus configurable routing, error handling, and retries. It is built and maintained by Apify.

What it does

Single interface for both HTTP and headless browser crawling
Integrated proxy rotation and session management
Browser-like headers and human-like fingerprints, including replicated TLS fingerprints
Persistent URL queue (breadth- and depth-first) plus pluggable storage for data and files
Use Playwright or Puppeteer with the same API across Chrome, Firefox, and WebKit
Automatic scaling, configurable routing, error handling, and retries; written in TypeScript

Getting started

Crawlee requires Node.js 16 or higher. The fastest way to start is the Crawlee CLI, which scaffolds a project; you can also add Crawlee to an existing project manually.

Scaffold a project with the CLI

Run the Crawlee CLI and pick the getting-started example. It installs the dependencies and adds boilerplate code for you.

bash

npx crawlee create my-crawler
cd my-crawler
npm start

Or install into your own project

Install Crawlee together with Playwright. The browser crawler needs Playwright, but Crawlee does not bundle it, which keeps the install size down.

bash

npm install crawlee playwright

Write a minimal crawler

Create a PlaywrightCrawler that reads each page's title, saves it to a dataset, and follows links found on the page.

js

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links and add them to the crawling queue.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Scrape JavaScript-heavy sites that need a real browser to render content before extraction
Crawl and extract data from static HTML pages or JSON APIs using fast HTTP crawling
Build crawlers that need proxy rotation and human-like fingerprints to avoid bot blocks
Collect data into datasets and deploy the crawler with the provided Dockerfiles

How Crawlee compares

Crawlee alongside other open-source web scraping & crawling tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Firecrawl	★ 135k	A crawling service and API that converts whole websites into clean Markdown or structured JSON ready for LLMs.
Crawl4AI	★ 68.9k	A local-first Python web crawler that turns pages into clean Markdown for use in RAG and LLM pipelines.
Scrapling	★ 65k	A Python web scraping framework whose parser relocates your elements when pages change, with stealthy fetchers and a Scrapy-like spider engine for full crawls.
Scrapy	★ 62.3k	A mature Python framework for writing fast spiders that crawl websites and extract structured data at scale.
ScrapeGraphAI	★ 27.4k	A Python library that uses LLMs and a graph pipeline to extract data from pages based on natural-language prompts.
Colly	★ 25.3k	A Go scraping framework for building fast crawlers with request handling, callbacks, and rate limiting.
Crawlee	★ 23.8k	A Node.js library for web scraping and browser automation, built for writing reliable crawlers.
Katana	★ 17.1k	A fast Go command-line crawler that discovers every URL, endpoint, and JavaScript file on a target site.
