WebReaper 11.3.1

.NET 10.0

dotnet add package WebReaper --version 11.3.1

NuGet\Install-Package WebReaper -Version 11.3.1

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="WebReaper" Version="11.3.1" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

<PackageVersion Include="WebReaper" Version="11.3.1" />
                    

                            Directory.Packages.props

<PackageReference Include="WebReaper" />
                    

                            Project file

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.

paket add WebReaper --version 11.3.1

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: WebReaper, 11.3.1"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package WebReaper@11.3.1

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

#addin nuget:?package=WebReaper&version=11.3.1
                    

                            Install as a Cake Addin

#tool nuget:?package=WebReaper&version=11.3.1
                    

                            Install as a Cake Tool

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

logo

WebReaper

AI-native web scraper. Single binary with a bundled Claude Code skill.

Built by HighCraft.io.

Install

macOS / Linux (Homebrew):

brew install alex-on-ai/webreaper/webreaper

Any POSIX shell (install.sh):

curl -fsSL https://raw.githubusercontent.com/alex-on-ai/WebReaper/master/scripts/install.sh | sh

.NET library:

dotnet add package WebReaper

Windows binaries are on the GitHub Releases page; winget and Scoop are on the v10.1 roadmap.

Updating: brew upgrade webreaper (Homebrew), or re-run the install.sh line with --upgrade appended (| sh -s -- --upgrade), or dotnet add package WebReaper (library). When a newer release exists, scrape / crawl / map print a one-line upgrade hint on stderr (interactive terminals only, never in a pipe or CI); disable that check with WEBREAPER_NO_UPDATE_CHECK=1.

30-second demo

$ webreaper scrape https://news.ycombinator.com
# Hacker News

- [Show HN: ...](https://news.ycombinator.com/item?id=...)
- [Ask HN: ...](https://news.ycombinator.com/item?id=...)
...

$ webreaper init
Wrote WebReaper Agent Skill to .claude/skills/webreaper/SKILL.md

Try it out:
  webreaper scrape https://example.com
  webreaper map https://example.com

After webreaper init, the next Claude Code session picks up the skill and routes scraping intents ("give me the markdown of X", "what blog posts are on Y", "scrape the top 5 articles") to webreaper automatically.

Why WebReaper
Quick start
Bot protection that just works
Power your agent
AI features
Use cases
Packages
Compared to Firecrawl, Crawl4AI, and Crawlee
API overview
Architecture and interfaces
Repository structure
License

Why WebReaper

<table> <tr> <td width="33%" valign="top">🪶 Drop on PATH, run. No Docker, no Postgres, no signup. ~12 MB binary.</td> <td width="33%" valign="top">🤖 AI-native by composition. Markdown by default. Schema extraction, LLM fallback, self-healing selectors, autonomous agent. Stack with <code>.With…()</code>.</td> <td width="33%" valign="top">🔌 Bring any LLM. OpenAI, Anthropic, Ollama, Azure OpenAI, llamafile, via <code>Microsoft.Extensions.AI</code>.</td> </tr> <tr> <td width="33%" valign="top">🛡 Bot-checks handled automatically. A blocked page climbs HTTP → browser → stealth on its own, per page and host-sticky. Challenge pages are dropped, never returned as data. No flag needed for the browser fallback.</td> <td width="33%" valign="top">📡 Distributed when needed. Swap scheduler, tracker, sink to Redis, MongoDB, SQLite, Azure Service Bus, Cosmos. Same code.</td> <td width="33%" valign="top">📜 MIT, not AGPL. Embed in commercial software, fork, modify, redistribute. Firecrawl's AGPL requires open-sourcing your service or paying for a commercial license.</td> </tr> </table>

Quick start

CLI

# One page as Markdown
webreaper scrape https://example.com

# Save Markdown to a file
webreaper scrape https://example.com --output page.md

# Discover URLs on a site
webreaper map https://example.com --search /blog/ --max-urls 50

# Crawl a whole site recursively (every on-domain page) to JSON Lines
webreaper crawl https://example.com > pages.jsonl

# Structured fields with a JSON schema (output: JSON; multi-page: JSON Lines)
webreaper scrape https://example.com --schema schema.json

# Schema-free extraction with an LLM (bring your own OpenAI-compatible endpoint)
webreaper scrape https://example.com --prompt "title and author" \
  --model gpt-4o-mini --llm-url https://api.openai.com/v1

# Whole site, fields, cheaply: infer a schema once, then extract the rest
webreaper crawl https://example.com --infer "product name and price" \
  --model gpt-4o-mini --llm-url https://api.openai.com/v1 --output-dir ./out

# JS-rendered single-page app
webreaper scrape https://example.com --browser

# Bot-protected site: a plain scrape already auto-climbs HTTP -> browser on a
# block; --stealth starts at a stealth backend (--auto-stealth = no prompt, for CI)
webreaper scrape https://example.com --stealth

# Install the Claude Code skill
webreaper init

The CLI is built Native-AOT (ADR-0043), ships as a single binary on every tagged GitHub release across six RIDs (linux-x64, linux-arm64, osx-x64, osx-arm64, win-x64, win-arm64), and is block-aware with automatic browser/stealth escalation (ADR-0083). The macOS binaries are Apple codesigned and notarized (ADR-0071); Homebrew installs run without Gatekeeper warnings on a clean machine.

Library

using WebReaper.Builders;

var engine = await ScraperEngineBuilder
    .Crawl("https://news.ycombinator.com")
    .AsMarkdown()
    .WriteToConsole()
    .BuildAsync();

await engine.RunAsync();

That is HTTP-only, no extra packages, no schema. For structured fields, swap AsMarkdown() for Extract(schema); for JS-rendered pages, Crawl for CrawlWithBrowser plus a transport satellite (WebReaper.Playwright or WebReaper.Cdp). The full surface is in the API overview.

Bot protection that just works

Most scrapers make you opt into a browser up front and guess when a site is blocking you. WebReaper detects the block and escalates on its own, one page at a time:

HTTP  ->  browser (Chromium)  ->  stealth (CloakBrowser)
          (climbs only when a page actually looks blocked)

Automatic. A plain scrape or crawl starts on a fast HTTP fetch and climbs to a real browser only when a page looks blocked (a challenge status, response header, or body marker). No flag needed for the browser fallback.
Per page, host-sticky. The first confirmed block on a host lifts that host's floor, so the rest of a whole-site crawl starts at the working tier instead of re-paying the failed one. You pay for the climb once, not per page.
No garbage in your data. A page still blocked at the top tier is dropped, never written to your output, and the run exits non-zero so an unattended job knows. Clean data or a clear signal, never a challenge page masquerading as content.
You decide on stealth. --stealth starts at the stealth backend; --auto-stealth (or WEBREAPER_AUTO_STEALTH=1) enables it unattended; --no-auto-stealth caps the climb at a vanilla browser. The ~220 MB stealth backend downloads only if you opt in.

Self-hosted, single binary, no cloud round-trip (ADR-0083).

Power your agent

Onboarding an agent from zero? Hand it docs/AI-ONBOARDING.md — install → verify → choose-your-path on one page, no credentials anywhere.

Claude Code skill

webreaper init

Writes a polished SKILL.md to .claude/skills/webreaper/. Claude Code loads it on the next session and routes scraping intents to the CLI: "scrape the top 5 stories on HN and summarize each", "give me the markdown of this article", "this Cloudflare-protected site is blocking me". The skill describes when to prefer webreaper over the built-in WebFetch (artifacts and structured data go through webreaper; conversational answers stay on WebFetch).

MCP server

Two MCP servers expose the same tools (scrape, map, extract, extract_with_prompt, extract_inferred, crawl) for clients that speak the Model Context Protocol:

WebReaper.Mcp (ADR-0049) speaks stdio, for local clients that spawn a process: Cursor, Claude Desktop, Copilot Studio.
WebReaper.Mcp.AspNetCore (ADR-0086) speaks Streamable HTTP, for clients that connect to a URL. The headline case is n8n, whose MCP Client node is URL-only and cannot reach a stdio server. Run the Chromium-baked image, point n8n's MCP Client node at the URL with a bearer token, and call the tools from a workflow:
```
docker run -p 8080:8080 -e WEBREAPER_MCP_TOKEN=your-secret \
  ghcr.io/alex-on-ai/webreaper-mcp-http:latest
```
See the n8n quickstart.

Both are thin facades over the library API; the primary agent surface remains the CLI.

CLI inside any agent harness

The single binary works inside any shell-spawning agent: LangChain ShellTool, OpenAI Assistants code-interpreter, GitHub Actions, internal scripts. Zero runtime to install; one syscall to invoke.

AI features

1. LLM-ready Markdown, no schema

await ScraperEngineBuilder
    .Crawl("https://example.com")
    .AsMarkdown()                   // ADR-0040 + ADR-0063
    .WriteToConsole()
    .BuildAsync()
    .Result.RunAsync();

The Markdown extractor (HtmlToMarkdown primitive, ADR-0063; MarkdownContentExtractor adapter, ADR-0040) emits {title, markdown} per page. Pass the output straight into a follow-up LLM prompt.

2. Source-gen schemas with compile-time guards

using WebReaper.Extraction.Attributes;

[ScrapeSchema]
public partial class Article
{
    [ScrapeField("h1")]                                              public string? Title { get; set; }
    [ScrapeField(".views", Type = SchemaFieldType.Integer)]          public int Views { get; set; }
    [ScrapeField(".tag", IsList = true)]                             public List<string> Tags { get; set; } = new();
}

// Emitted at compile time, reflection-free, AOT-clean:
//   public static Schema Schema { get; }
//   public static Article Materialize(JsonObject json)

var engine = await ScraperEngineBuilder
    .Crawl("https://example.com/post")
    .Extract(Article.Schema)
    .Subscribe(p => HandleArticle(Article.Materialize(p.Data)))
    .BuildAsync();

The WebReaper.Extraction.Generators Roslyn analyzer (ADR-0045) emits the schema and a Materialize function. Schema typos are compile errors; the generated path uses no reflection so it's AOT-clean.

3. LLM safety net for deterministic extraction

Three composable patterns where the LLM only fires when the deterministic path can't deliver. Same proposer-validator shape across all three.

using WebReaper.AI;

// (a) Fire LLM only when a field returns empty (ADR-0046)
.WithLlmFallback(chatClient)

// (b) Repair a broken selector once per (Schema, field), cache forever (ADR-0047)
.WithLlmSelfHealing(chatClient)

// (c) No schema at all: infer it from the URL, re-infer on validator failure (ADR-0067 + ADR-0069)
.UseAi(chatClient, AiPolicyMode.Inferred)

Stable pages cost zero LLM calls; broken pages cost one call per (page, field) and cache. Schema inference is one call per site (cached for the run). The WebReaper.AI satellite is built on Microsoft.Extensions.AI so any IChatClient works.

4. Autonomous agent: page selection by goal

using WebReaper.AI;

var result = await LlmAgent.RunAsync(
    "https://example.com",
    goal: "Find the contact email and phone number for the support team.",
    chatClient);

AgentEngine (ADR-0051) runs a sequential decide → persist → execute loop over a closed-sum AgentDecision (Extract | Follow | Act | Stop). The brain picks each step from the bounded AgentState view; the engine validates (visited-link enforcement, MaxSteps cap), persists (IAgentRunStore), and dispatches. Durable resume across process restarts.

5. Semantic page actions

ScraperEngineBuilder
    .CrawlWithBrowser(url, actions => actions
        .Do(PageAction.SemanticAct("click 'sign in'"))   // ADR-0050
        .Do(PageAction.WaitForNetworkIdle())
        .Build())
    .WithLlmActionResolver(chatClient)
    // ...

A natural-language PageAction.SemanticAct(intent) is one of the ten closed-sum arms (ADR-0050 added it; ADR-0074 added Fill / Press / ScrollIntoView for form interactions). The transport resolves it once via the registered IActionResolver (LLM-backed by default), dispatches the concrete arm, and caches the resolution per crawl by intent string. First page pays the LLM, subsequent same-intent pages dispatch the cached arm with no LLM call.

Runnable end-to-end demo

Examples/WebReaper.AiNativeShowcase wires every feature in this section:

dotnet run --project Examples/WebReaper.AiNativeShowcase -- markdown
dotnet run --project Examples/WebReaper.AiNativeShowcase -- sourcegen
dotnet run --project Examples/WebReaper.AiNativeShowcase -- llm
dotnet run --project Examples/WebReaper.AiNativeShowcase -- router
dotnet run --project Examples/WebReaper.AiNativeShowcase -- changetrack

Use cases

Build LLM context from blog or docs sites. webreaper map plus webreaper scrape per URL, piped into a prompt or a vector DB.
Monitor competitor pricing or status pages for changes. Schedule the CLI with cron or a worker, store records in MongoDB or SQLite, plug in .WithChangeTracking() (ADR-0048) so the sink fires only on diff. Hash-based dedup; cron-friendly.
Run an autonomous research agent. LlmAgent.RunAsync(url, goal, chatClient) decides which links to follow until the goal is met. Durable resume across restarts.
Scrape Cloudflare-protected catalogs. A plain scrape auto-climbs HTTP to a browser on a block; add --stealth (or --auto-stealth for unattended runs) to escalate to a stealth backend. Blocked pages are dropped, never emitted as challenge-page garbage.
Generate clean datasets from semi-structured pages. [ScrapeSchema] POCO plus the source generator; reflection-free, AOT-compiles into a native binary.
Embed a scraping primitive in your own app. dotnet add package WebReaper; the public registration seam lets you plug Redis, Cosmos DB, your own sink.

Packages

The release ships fifteen packages (one core, fourteen satellites), all versioned in lockstep. The core stays dependency-light and Native-AOT-publishable with zero warnings; satellites bring their own SDK dependencies and quarantine them off the core graph (ADR-0009).

Package	Add it for	Key builder calls
WebReaper	Core. HTTP crawl and parse, in-memory and file scheduler / visited-link tracker / cookie and config storage, Console / CSV / JSON-Lines sinks, Markdown extractor, schema fold. Dependency-light, Native-AOT-ready, Newtonsoft-free.	`Crawl` `Extract` `AsMarkdown` `Follow` `Paginate` `Sweep` `WriteToJsonFile` `WriteToCsvFile` `WriteToConsole` `Subscribe`
WebReaper.Cdp	Raw CDP `IPageLoadTransport` (ADR-0052). AOT-clean (no PuppeteerSharp / Playwright dependency); System.Net.WebSockets plus System.Text.Json source-gen. Bedrock for the stealth pattern.	`.WithCdpPageLoader(cdpUrl)` (BYO) or `.WithCdpPageLoader(CdpLaunchOptions)` (launch managed Chromium)
WebReaper.Playwright	Microsoft.Playwright-backed transport (ADR-0053). Multi-browser (Chromium default; Firefox / WebKit opt-in). All ten `PageAction` arms supported. Use for modern multi-browser needs; pair with `WebReaper.Cdp` for AOT or stealth.	`.WithPlaywrightPageLoader()`
WebReaper.Stealth.CloakBrowser	First stealth-backend satellite (ADR-0054). Auto-downloads CloakBrowser on first use; composes on `WebReaper.Cdp`. Disposable via the ADR-0058 engine teardown chain.	`.WithCloakBrowser()`
WebReaper.AI	LLM extraction, LLM action resolver, LLM brain, LLM self-healing, LLM schema inferrer (ADR-0044 / 0050 / 0051 / 0067). Built on `Microsoft.Extensions.AI`; bring your own `IChatClient`.	`.WithLlmFallback` `.WithLlmSelfHealing` `.WithLlmExtractor` `.WithLlmAgentBrain` `.WithLlmActionResolver` `.WithLlmSchemaInferrer` `.UseAi(client)`
WebReaper.AI.Http	AOT-safe OpenAI-compatible `IChatClient` (ADR-0084): raw `HttpClient` plus System.Text.Json source-gen, no provider SDK. The bring-your-own client the AOT CLI and the MCP servers use for prompt extraction.	`new OpenAiCompatibleChatClient(baseUrl, model, apiKey)`
WebReaper.Extraction.Attributes	The `[ScrapeSchema]` / `[ScrapeField]` marker types. Standalone, no runtime cost.	`[ScrapeSchema]` `[ScrapeField("selector")]`
WebReaper.Extraction.Generators	Roslyn source generator that emits `static Schema` plus reflection-free `static Materialize(JsonObject)` (ADR-0045). `DevelopmentDependency=true`; does not propagate at runtime.	compile-time only
WebReaper.Mcp	MCP server `Exe` exposing scrape / map / extract / extract_with_prompt / extract_inferred / crawl as MCP tools over stdio (ADR-0049). Interop adapter for local MCP clients (Cursor, Claude Desktop).	the package is the executable
WebReaper.Mcp.AspNetCore	MCP server over Streamable HTTP (ADR-0086) for URL-based clients like n8n; same tools as `WebReaper.Mcp`, plus bearer-token auth and a `WEBREAPER_CDP_URL` browser sidecar. Also ships as a Chromium-baked container image.	the package is the executable; or `docker run ghcr.io/alex-on-ai/webreaper-mcp-http`
WebReaper.Mongo	MongoDB result sink and MongoDB-backed config / cookie storage.	`.WriteToMongoDb(...)` `.WithMongoDbConfigStorage(...)` `.WithMongoDbCookieStorage(...)`
WebReaper.Redis	Redis scheduler, visited-link tracker, result sink, config / cookie storage.	`.WithRedisScheduler(...)` `.TrackVisitedLinksInRedis(...)` `.WriteToRedis(...)` `.WithRedisConfigStorage(...)` `.WithRedisCookieStorage(...)`
WebReaper.AzureServiceBus	Distributed scheduler over an Azure Service Bus queue.	`.WithAzureServiceBusScheduler(...)`
WebReaper.Cosmos	Azure Cosmos DB result sink.	`.WriteToCosmosDb(...)`
WebReaper.Sqlite	Local durable scheduler and visited-link tracker on an embedded SQLite store; resume is a query, no position file. Opt-in robust-local tier (no server, unlike Redis).	`.WithSqliteScheduler(...)` `.TrackVisitedLinksInSqlite(...)`

WebReaper.Cli (the AOT single-binary; ADR-0043) is not a NuGet package; it ships as platform binaries on every GitHub release (Native-AOT plus dotnet tool install are mutually incompatible on one target). Install via Homebrew or install.sh, or build from source.

Compared to Firecrawl, Crawl4AI, and Crawlee

	WebReaper	Firecrawl	Crawl4AI	WebFetch (Claude)
License	MIT	AGPL-3.0 (plus commercial)	Apache 2.0	bundled with Claude
Install	one binary, ~12 MB	Docker + Postgres + Redis (self-host) or hosted	Docker + Python + Playwright	nothing to install
Cost	free	metered API plus free tier	free	included with Claude
BYO LLM	any `IChatClient`	no (their model)	yes (LiteLLM)	Claude only
Autonomous agent	`Agent.RunAsync()` durable, in-process	`/agent` endpoint (cloud only)	code it yourself	not available
Whole-site crawl	`webreaper crawl` / `.Sweep()`: recursive, on-domain, sitemap-seeded, streams JSON Lines	`crawl` (cloud or self-host)	deep-crawl strategies (code it yourself)	no (single fetch)
Page actions	10 declarative arms: `Click`, `Wait`, `Fill`, `Press`, `ScrollToEnd`, `ScrollIntoView`, `WaitForSelector`, `WaitForNetworkIdle`, `EvaluateExpression`, `SemanticAct` (natural-language)	9 actions: `wait`, `click`, `write`, `press`, `scroll`, `executeJavascript`, plus 3 observation (`screenshot`, `pdf`, `scrape`)	JS hooks; no closed-sum vocabulary	none (single-fetch only)
Bot-protected	automatic HTTP → browser → stealth climb, per page, host-sticky, self-hosted	cloud yes; self-host degraded (no Fire-engine)	BYO	no
Claude Code skill	`webreaper init` bundled	community `firecrawl-claude-code-skill` wraps the cloud API	none official	not applicable

Crawlee (Apify's Node/Python library) is also worth knowing; it covers similar ground to the WebReaper library API but doesn't ship a binary, a Claude Code skill, or a built-in LLM safety net. Use it if you're already in the Apify ecosystem.

The closest reference is Firecrawl: same AI-native positioning, opposite distribution shape. Firecrawl optimises for the hosted-API flow; WebReaper optimises for the local-binary flow. If you want a managed cloud with someone else's proxies and infra, Firecrawl is the buy. If you want a binary that runs locally with your own LLM key and no metering, WebReaper is the build.

API overview

The library is a fluent builder over a small set of seams. For the deep seam-by-seam reference (interfaces, main entities, custom sinks), see docs/architecture.md.

Schema extraction

using WebReaper.Builders;

var engine = await ScraperEngineBuilder
    .Crawl("https://www.alexpavlov.dev/blog")
    .Extract(new()
    {
        new("title", ".text-3xl.font-bold"),
        new("text", ".max-w-max.prose.prose-dark")
    })
    .Follow("a.text-gray-900.transition")
    .WriteToJsonFile("output.json")
    .PageCrawlLimit(10)
    .WithParallelismDegree(30)
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();

Each new("field", "css-selector") is a leaf; nest schemas for objects, set IsList = true for arrays, set Attr = "href" to read an HTML attribute instead of inner text.

Collect records in-process

Get the scraped records back in your own process, no file and no custom sink. Pass Subscribe (ADR-0038) the Add of a thread-safe collection, then read them after the run:

using System.Collections.Concurrent;
using WebReaper.Builders;
using WebReaper.Sinks.Models;

var records = new ConcurrentBag<ParsedData>();

var engine = await ScraperEngineBuilder
    .Crawl("https://news.ycombinator.com")
    .AsMarkdown()
    .Subscribe(records.Add)            // called concurrently; collect into a thread-safe type
    .BuildAsync();

await engine.RunAsync();

foreach (var record in records)
    Console.WriteLine($"{record.Url}: {record.Data["markdown"]?.GetValue<string>()?.Length ?? 0} chars");

The Crawl driver fans out to sinks concurrently, so a ConcurrentBag is the right collector, not a bare List. For structured fields, read your schema keys out of record.Data instead of "markdown".

Parsing dynamic pages (SPA)

For JS-rendered pages, swap Crawl for CrawlWithBrowser and register a browser transport. Two satellites are available.

WebReaper.Playwright is the modern default (ADR-0053): multi-browser, all ten PageAction arms.

using WebReaper.Builders;
using WebReaper.Playwright;

await ScraperEngineBuilder
    .CrawlWithBrowser("https://example.com")
    .Extract(new() { new("title", "h1") })
    .WithPlaywrightPageLoader()
    .BuildAsync();

WebReaper.Cdp is the AOT-clean alternative (ADR-0052): raw CDP over System.Net.WebSockets, AOT-publishable. Use for AOT consumers or as the base for stealth backends.

using WebReaper.Cdp;

.WithCdpPageLoader(new CdpLaunchOptions { Headless = true })
// or .WithCdpPageLoader(cdpUrl: "http://localhost:9222") for BYO browser

For bot-protected sites, layer WebReaper.Stealth.CloakBrowser (ADR-0054) on top of WebReaper.Cdp:

using WebReaper.Stealth.CloakBrowser;

.WithCloakBrowser()    // auto-downloads CloakBrowser on first use, ~220 MB

For visible-browser debugging, add .HeadlessMode(false).

Running JavaScript and page actions

Drive the page as it loads. Pass an actions lambda.

using WebReaper.Builders;
using WebReaper.Playwright;
using WebReaper.Domain.PageActions;

await ScraperEngineBuilder
    .CrawlWithBrowser("https://www.reddit.com/r/dotnet/", actions => actions
        .ScrollToEnd()
        .Build())
    .Extract(new() { new("title", "h1") })
    .WithPlaywrightPageLoader()
    .BuildAsync();

PageActionBuilder exposes Click, Wait, ScrollToEnd, ScrollIntoView, WaitForSelector, WaitForNetworkIdle, EvaluateExpression, Fill, Press, SemanticAct, Repeat / RepeatWithDelay, and Build(). Fill(selector, value) / Press(key) / ScrollIntoView(selector) (ADR-0074) carry an implicit 30 s auto-wait and use the React-friendly native-setter trick on the CDP transport, so controlled components in React / Vue / Svelte observe the change. SemanticAct (ADR-0050) accepts a natural-language intent and resolves it via the registered IActionResolver (see AI features §5).

Persist progress locally

Survive kill -9 and resume across restarts. Two adapters: file-backed (zero deps, in core) and SQLite-backed (durable, opt-in via WebReaper.Sqlite).

using WebReaper.Builders;
using WebReaper.Sqlite;

await ScraperEngineBuilder
    .Crawl("https://example.com")
    .Extract(new() { new("name", "h1") })
    .Follow(".forumlink>a")
    .Paginate("a.torTopic", ".pg")
    .WriteToJsonFile("result.json")
    .WithSqliteScheduler("crawl/state.db")        // resume is a query, not a position file
    .TrackVisitedLinksInSqlite("crawl/state.db")  // the table is the set
    .BuildAsync();

Pass dataCleanupOnStart: true to any sink, tracker, or scheduler method to wipe its store at start (note: WriteToJsonFile defaults this to true; the others default to false).

Authorization

If the site needs cookies, call SetCookies and fill the container. You perform the login yourself.

using System.Net;
using WebReaper.Builders;

await ScraperEngineBuilder
    .Crawl("https://example.com/protected")
    .Extract(new() { new("name", "h1") })
    .SetCookies(cookies =>
    {
        cookies.Add(new Cookie("AuthToken", "123"));
    })
    .BuildAsync();

Distributed and serverless

Swap the scheduler, config storage, and link tracker to Redis or Azure Service Bus; multiple workers or serverless functions share one crawl. Examples/WebReaper.AzureFuncs shows the serverless shape (two functions: StartScraping seeds the work, WebReaperSpider is the distributed Crawl driver). Examples/WebReaper.DistributedScraperWorkerService shows the worker-service shape.

DistributedSpiderBuilder.BuildSpider() returns a bare ISpider without a Crawl seed (ADR-0009 / ADR-0025: "two seams, not one bug" split). The worker's config is persisted separately by the start endpoint.

Storage and scheduler backends

Every backend is a swappable seam. In-memory is the default; file-backed lives in core; the rest come from satellites.

Seam	Core (in-memory default + file)	Satellite options
Scheduler	in-memory, `WithTextFileScheduler`	`WithSqliteScheduler`, `WithRedisScheduler`, `WithAzureServiceBusScheduler`
Visited-link tracker	in-memory, `TrackVisitedLinksInFile`	`TrackVisitedLinksInSqlite`, `TrackVisitedLinksInRedis`
Config storage	in-memory, `WithFileConfigStorage`	`WithMongoDbConfigStorage`, `WithRedisConfigStorage`
Cookie storage	in-memory, `WithFileCookieStorage`	`WithMongoDbCookieStorage`, `WithRedisCookieStorage`
Agent run store	in-memory, file (ADR-0051)	`WithSqliteAgentRunStore`, `WithRedisAgentRunStore`, `WithMongoAgentRunStore`, `WithCosmosAgentRunStore`
Result sink	`WriteToConsole`, `WriteToCsvFile`, `WriteToJsonFile`	`WriteToMongoDb`, `WriteToRedis`, `WriteToCosmosDb`
Page loader transport	HTTP (default)	`WithPlaywrightPageLoader`, `WithCdpPageLoader`, `WithCloakBrowser`

Custom sinks, the full interface index, and the main domain entities live in docs/architecture.md.

Repository structure

Project	Description
`WebReaper`	The core library (the `WebReaper` NuGet package).
`WebReaper.Cdp`	Raw CDP transport satellite (ADR-0052).
`WebReaper.Playwright`	Microsoft.Playwright transport satellite (ADR-0053).
`WebReaper.Stealth.CloakBrowser`	First stealth-backend satellite (ADR-0054).
`WebReaper.AI`	LLM extraction, action resolver, agent brain, self-healing, schema inferrer.
`WebReaper.Extraction.Attributes`	`[ScrapeSchema]` / `[ScrapeField]` marker types (ADR-0045).
`WebReaper.Extraction.Generators`	Roslyn source generator (ADR-0045).
`WebReaper.Mcp`	MCP server satellite (ADR-0049).
`WebReaper.Mongo`	MongoDB sink plus config / cookie storage.
`WebReaper.Redis`	Redis scheduler, tracker, sink, config / cookie storage.
`WebReaper.AzureServiceBus`	Azure Service Bus distributed scheduler.
`WebReaper.Cosmos`	Azure Cosmos DB sink.
`WebReaper.Sqlite`	Local durable scheduler and visited-link tracker over embedded SQLite.
`WebReaper.Cli`	AOT single-binary CLI (ADR-0043).
`Examples/WebReaper.ConsoleApplication`	Using WebReaper in a console application.
`Examples/WebReaper.AiNativeShowcase`	Runnable demos for every AI feature in this README.
`Examples/WebReaper.SchemaInferenceShowcase`	Demos for ADR-0067 / 0068 / 0069 schema inference.
`Examples/WebReaper.ScraperWorkerService`	Using WebReaper in a .NET Worker Service.
`Examples/WebReaper.DistributedScraperWorkerService`	Distributed crawl across workers sharing crawl state.
`Examples/WebReaper.AzureFuncs`	Serverless crawl with Azure Functions plus Azure Service Bus.
`Examples/BrownsfashionScraper`	A real-world e-commerce scraper example.
`Misc/WebReaper.ProxyProviders`	Example proxy-provider implementations.

License

WebReaper is MIT-licensed (ADR-0017). All NuGet packages plus the WebReaper.Cli binary ship under the same terms. Use it commercially, embed it in proprietary software, fork it, modify it, redistribute it; the only ask is that you keep the copyright notice.

Prior to the 10.0.0 wave, WebReaper was GPL-3.0-or-later. The relicense is strictly more permissive: every existing user is unaffected; new users who couldn't embed under GPL now can. Historical contributors are credited in CONTRIBUTORS.md. See docs/adr/0017-relicense-gpl-mit.md for the analysis and contributor consent path.

Contributions are welcome under the same MIT terms; sign-off via DCO (CONTRIBUTING.md).

Built by HighCraft.io — a software and AI engineering team.

Product	Compatible and additional computed target framework versions.
.NET	net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

Compatible target framework(s)

Included target framework(s) (in package)

Learn more about Target Frameworks and .NET Standard.

net10.0
- AngleSharp (>= 1.4.0)
- AngleSharp.XPath (>= 2.0.6)
- Microsoft.Extensions.Http (>= 10.0.8)
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.8)

NuGet packages (11)

Showing the top 5 NuGet packages that depend on WebReaper:

Package	Downloads
WebReaper.Sqlite SQLite embedded-store adapters for WebReaper: a local durable scheduler (and, from the next slice, a visited-link tracker) backed by SQLite via Microsoft.Data.Sqlite; resume is a query, not a hand-rolled position file. The opt-in robust-local durability tier between the zero-dependency core file adapters and the distributed Redis / Azure Service Bus satellites. Adds ScraperEngineBuilder.WithSqliteScheduler. Satellite package (ADR-0009 / ADR-0012) so the WebReaper core stays dependency-light and Native-AOT-clean; the native e_sqlite3 (SQLitePCLRaw) graph is quarantined here.	1.4K
WebReaper.Mongo MongoDB adapters for WebReaper: the MongoDbSink, plus MongoDB-backed scraper-config and cookie storage. Adds ScraperEngineBuilder.WriteToMongoDb / WithMongoDbConfigStorage / WithMongoDbCookieStorage. Satellite package (ADR-0009) so the WebReaper core stays dependency-light and AOT-clean.	1.4K
WebReaper.AzureServiceBus Azure Service Bus scheduler for WebReaper: a distributed job queue backed by an Azure Service Bus queue, for sharing crawl state across workers and serverless functions. Adds ScraperEngineBuilder.WithAzureServiceBusScheduler. Satellite package (ADR-0009) so the WebReaper core stays dependency-light and AOT-clean.	1.3K
WebReaper.Cosmos Azure Cosmos DB sink for WebReaper. Adds ScraperEngineBuilder.WriteToCosmosDb(...). Satellite package (ADR-0009) so the WebReaper core stays dependency-light and AOT-clean.	1.3K
WebReaper.Redis Redis adapters for WebReaper: the Redis scheduler, visited-link tracker, sink, and Redis-backed scraper-config and cookie storage, all sharing one connection pool (ADR-0005). Adds ScraperEngineBuilder.WithRedisScheduler / TrackVisitedLinksInRedis / WriteToRedis / WithRedisConfigStorage / WithRedisCookieStorage. Satellite package (ADR-0009) so the WebReaper core stays dependency-light and AOT-clean.	1.3K

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
11.3.1	225	6/27/2026
11.3.0	321	6/6/2026
11.2.0	222	6/4/2026
11.1.2	217	6/3/2026
11.1.1	207	5/30/2026
11.1.0	214	5/30/2026
11.0.0	207	5/30/2026
10.2.0	209	5/29/2026
10.1.0	211	5/29/2026
10.0.0	212	5/26/2026
9.0.0	174	5/19/2026
8.0.0	177	5/19/2026
7.0.0	182	5/17/2026
6.0.0	109	5/17/2026
5.1.0	102	5/16/2026
5.0.0	99	5/16/2026
4.1.0	98	5/16/2026
4.0.0	118	5/16/2026
3.5.2	3,123	10/19/2024
3.5.1	3,006	8/15/2023

10.1.0 (minor): three new PageAction arms (Fill, Press, ScrollIntoView; ADR-0074) widen the closed sum from seven to ten, with builder methods and byte-identical wire tags. PostExtractionPipeline (ADR-0076) unifies the page-processor pipeline and sink fan-out for both the crawl driver and the agent engine; AgentEngine is now IAsyncDisposable, fixing silently-dropped durable agent-run records. Internal: a ClosedSumCodec mechanism (ADR-0077) backs the three flat closed-sum JSON converters with the wire format unchanged. 10.0.0 (breaking, major): the AI-native funnel ships on a deepened architecture, in three arcs spanning 24 ADRs. (1) Crawl seed (ADR-0025) closes the last builder-construction trap: a scrape now begins with ScraperEngineBuilder.Crawl(urls)/.CrawlWithBrowser(urls), returning ICrawlSeed; the seed's .Extract(schema) or .AsMarkdown() picks the extraction strategy before the builder is reachable; "build with no start URLs or no extraction" has no representation. The old new ScraperEngineBuilder().Get(...)...Parse(...) shape and ConfigBuilder's runtime InvalidOperationException are gone; DistributedSpiderBuilder is the seedless worker shell (ADR-0009). (2) Architecture deepening (ADR-0026..0039): IRetryPolicy seam with Polly leaving the core graph; AngleSharpRawExtractor shared between the CSS/XPath backends; Schema.Add construction guards; ParsedData owns URL merge; the Crawl driver's StopRule module + one-op IOutstandingWorkLatch credit (Tier-1 break; RedisOutstandingWorkLatch resigned); IAsyncInitializable replaces Initialization Task (Tier-1 break); Spider takes config at construction (DistributedSpiderBuilder.BuildSpider now requires a ScraperConfig); PageAction is a closed sum of six typed arms (PageActionType enum + object[] removed, wire-format break, EvaluateExpression/WaitForSelector actually run now); link extraction collapses to LinkExtractor (ILinkParser removed); IScheduler.Complete() removed; stop is consumer-cancel of GetAllAsync (durable schedulers no longer hang on StopWhenAllLinksProcessed); IPageProcessor pipeline + Sink (PostProcess + Metadata removed; migrate to .Process(new DelegatePageProcessor(...))); IJsonContentParser → IContentExtractor, ParseToJsonAsync → ExtractAsync, WithContentParser → WithContentExtractor, SchemaContentParser → SchemaFold. (3) AI-native (ADR-0040..0049): .AsMarkdown() seed terminal; IPageCache + WithMaxAge(TimeSpan); ISiteMapper + ScraperEngineBuilder.MapAsync; WebReaper.Cli AOT single-binary; WebReaper.AI satellite (LLM extractor on IChatClient); WebReaper.Extraction.Attributes + .Generators source generator emitting Schema + Materialize from attributed POCOs (AOT-clean Pydantic-parity); ExtractionRouter (deterministic-first → fallback) + WithFallbackExtractor; SelfHealingContentExtractor + ISelectorRepairer + per-crawl selector cache; ChangeTrackingProcessor + IChangeStore on the page-processor pipeline; WebReaper.Mcp satellite. License: GPL-3.0-or-later → MIT (ADR-0017). Migration short-form: new ScraperEngineBuilder().Get(url)....Parse(schema)....BuildAsync() → ScraperEngineBuilder.Crawl(url).Extract(schema)....BuildAsync(); .GetWithBrowser → .CrawlWithBrowser; PostProcess(cb) → .Process(new DelegatePageProcessor(cb)) with the Metadata replaced by PageContext; WithJsonContentParser/WithXPathContentParser unchanged. Full per-arc detail and considered alternatives in CHANGELOG.md and docs/adr/0025..0049.