WebReaper 10.0.0

There is a newer version of this package available.
See the version list below for details.

dotnet add package WebReaper --version 10.0.0

NuGet\Install-Package WebReaper -Version 10.0.0

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="WebReaper" Version="10.0.0" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

Directory.Packages.props:

<PackageVersion Include="WebReaper" Version="10.0.0" />

Project file:

<PackageReference Include="WebReaper" />

For projects that support Central Package Management (CPM), copy the PackageVersion node into the solution Directory.Packages.props file to version the package, and the version-less PackageReference node into the project file.

paket add WebReaper --version 10.0.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: WebReaper, 10.0.0"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package WebReaper@10.0.0

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Install as a Cake Addin:

#addin nuget:?package=WebReaper&version=10.0.0

Install as a Cake Tool:

#tool nuget:?package=WebReaper&version=10.0.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.


WebReaper

Overview

WebReaper is a declarative, high-performance web scraper, crawler and parser in C#. Crawl any web site, parse the data, and save the structured result to a file, a database, or pretty much anywhere you want — with a simple, extensible fluent API.

As of 10.0.0 the core WebReaper package is dependency-light, Native-AOT-ready and Newtonsoft-free: a plain HTTP → file crawl pulls only AngleSharp and Microsoft.Extensions.* (Polly left core in ADR-0026 in favour of the IRetryPolicy seam — the default FixedAttemptsRetryPolicy is hand-rolled). Heavier capabilities (headless browser, MongoDB, Redis, Azure Cosmos DB, Azure Service Bus, SQLite-backed local durable scheduler/tracker, LLM extraction, MCP server, [ScrapeSchema] source generator) ship as optional satellite packages you add only when you need them — see Packages.

Quick start

dotnet add package WebReaper

using WebReaper.Builders;

var engine = await ScraperEngineBuilder
    .Crawl("https://www.alexpavlov.dev/blog")
    .Extract(new()
    {
        new("title", ".text-3xl.font-bold"),
        new("text", ".max-w-max.prose.prose-dark")
    })
    .Follow("a.text-gray-900.transition")
    .WriteToJsonFile("output.json")
    .PageCrawlLimit(10)
    .WithParallelismDegree(30)
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();

That example is pure HTTP — no browser, no extra packages. For JavaScript-rendered pages, add WebReaper.Puppeteer (see Parsing dynamic pages).

AI-native — the smallest possible call

Since 10.0.0 (the AI-native wave + deepening, ADR-0040..0064), the smallest call needs no schema: crawl one page and get LLM-ready Markdown, with no boilerplate:

var engine = await ScraperEngineBuilder
    .Crawl("https://example.com")
    .AsMarkdown()
    .WriteToConsole()
    .BuildAsync();

await engine.RunAsync();

The CLI mirrors it. The Native-AOT single-binary webreaper (ADR-0043) ships as a prebuilt asset on every tagged GitHub release across six RIDs (linux-x64, linux-arm64, osx-x64, osx-arm64, win-x64, win-arm64). PackAsTool=true is incompatible with PublishAot=true on the same target, so a non-AOT dotnet tool install target is deferred to a 10.0.x release; file an issue if you need it.

# Latest release, RID = same name dotnet uses (osx-arm64, linux-x64, win-x64, …)
RID=osx-arm64
gh release download --repo pavlovtech/WebReaper --pattern "webreaper-*-$RID.*"
# Linux is .tar.gz; macOS + Windows are .zip
case "$RID" in linux-*) tar -xzf webreaper-*-"$RID".tar.gz ;;
               *)       unzip   webreaper-*-"$RID".zip    ;; esac
cd webreaper-*-"$RID"
./webreaper scrape https://example.com
./webreaper map https://example.com --search /blog/
./webreaper init    # installs the Agent Skill to .claude/skills/webreaper/

The release pages serve direct URLs too — no gh required: https://github.com/pavlovtech/WebReaper/releases/download/<TAG>/webreaper-<TAG>-<RID>.<ext>.

Or build from source — useful between tags or on an unlisted RID:

git clone https://github.com/pavlovtech/WebReaper.git && cd WebReaper
dotnet publish WebReaper.Cli -c Release -r <rid> --self-contained true
./WebReaper.Cli/bin/Release/net10.0/<rid>/publish/WebReaper.Cli scrape https://example.com

Drop in an LLM extractor when the deterministic path can't reach a field (WebReaper.AI satellite — Microsoft.Extensions.AI binding):

using Microsoft.Extensions.AI;
using WebReaper.AI;

var chatClient = /* your IChatClient — OpenAI, Anthropic, Ollama, … */;

var engine = await ScraperEngineBuilder
    .Crawl("https://example.com")
    .Extract(schema)
    .WithLlmFallback(chatClient)        // ADR-0046: det-first → LLM-fallback router
    // or .WithLlmSelfHealing(chatClient)  // ADR-0047: LLM proposes selectors, fold validates, cache demotes back to deterministic
    .WriteToJsonFile("out.jsonl")
    .BuildAsync();

await engine.RunAsync();

Generate the Schema from a typed POCO with [ScrapeSchema] (WebReaper.Extraction.Generators). The result is Pydantic parity, but produced at compile time by a source generator, which Python's runtime reflection structurally cannot match:

using WebReaper.Extraction.Attributes;

[ScrapeSchema]
public partial class Article
{
    [ScrapeField("h1")]            public string? Title { get; set; }
    [ScrapeField(".views", Type = SchemaFieldType.Integer)]  public int Views { get; set; }
    [ScrapeField(".tag", IsList = true)]  public List<string> Tags { get; set; } = new();
}

// Generated at compile time, reflection-free, AOT-clean:
//   public static Schema Schema { get; }
//   public static Article Materialize(JsonObject json)

var engine = await ScraperEngineBuilder
    .Crawl("https://example.com/post")
    .Extract(Article.Schema)
    .Subscribe(p => HandleArticle(Article.Materialize(p.Data)))
    .BuildAsync();

Runnable end-to-end demo of every AI-native API in this section — Examples/WebReaper.AiNativeShowcase:

dotnet run --project Examples/WebReaper.AiNativeShowcase -- markdown
dotnet run --project Examples/WebReaper.AiNativeShowcase -- map
dotnet run --project Examples/WebReaper.AiNativeShowcase -- sourcegen
dotnet run --project Examples/WebReaper.AiNativeShowcase -- llm
dotnet run --project Examples/WebReaper.AiNativeShowcase -- router
dotnet run --project Examples/WebReaper.AiNativeShowcase -- changetrack

Install
Packages
Requirements
Features
Usage examples
API overview
Repository structure
License

Install

Core package — HTTP crawling/parsing, in-memory and file-backed state, Console/CSV/JSON-Lines sinks:

dotnet add package WebReaper

Add a satellite only for the capability you need (each brings its own SDK so the core stays light):

dotnet add package WebReaper.Puppeteer        # headless-browser (SPA / JS) pages
dotnet add package WebReaper.Mongo            # MongoDB sink + config/cookie storage
dotnet add package WebReaper.Redis            # Redis scheduler, tracker, sink, storage
dotnet add package WebReaper.AzureServiceBus  # Azure Service Bus distributed scheduler
dotnet add package WebReaper.Cosmos           # Azure Cosmos DB sink
dotnet add package WebReaper.Sqlite           # SQLite local durable scheduler + visited-link tracker

Packages

All eleven NuGet packages are versioned in lockstep at 10.0.0 (1 core + 10 satellites). Core and the satellites move together in release waves (ADR-0022 → 8.0.0, ADR-0023 → 9.0.0, ADR-0025 + ADR-0040..0064 → 10.0.0); WebReaper.Sqlite, added at 7.1.0, joined the lockstep from 8.0.0; WebReaper.AI, WebReaper.Mcp, WebReaper.Extraction.Attributes, and WebReaper.Extraction.Generators joined at 10.0.0 (the AI-native wave). All packages are MIT-licensed (ADR-0017; relicensed from GPL-3.0-or-later in the 10.0.0 wave), and every satellite wires itself in through the builder's public registration seam. WebReaper.Cli (ADR-0043, the AOT single-binary agent surface) is not a NuGet package in 10.0.0; download a prebuilt binary from a tagged GitHub release or build it from source, as described under "AI-native" above.

Package	Add it for	Key builder calls
WebReaper	Core. HTTP crawl/parse, in-memory + file scheduler / visited-link tracker / cookie & config storage, Console / CSV / JSON-Lines sinks. Dependency-light, Native-AOT-ready, Newtonsoft-free.	`Crawl` `Extract` `Follow` `Paginate` `WriteToJsonFile` `WriteToCsvFile` `WriteToConsole`
WebReaper.Puppeteer	Headless-browser loading of SPA / JavaScript pages	`.WithPuppeteerPageLoader()` + `CrawlWithBrowser` / `FollowWithBrowser` / `PaginateWithBrowser`
WebReaper.Mongo	MongoDB result sink and MongoDB-backed config / cookie storage	`.WriteToMongoDb(...)` `.WithMongoDbConfigStorage(...)` `.WithMongoDbCookieStorage(...)`
WebReaper.Redis	Redis scheduler, visited-link tracker, result sink, config / cookie storage	`.WithRedisScheduler(...)` `.TrackVisitedLinksInRedis(...)` `.WriteToRedis(...)` `.WithRedisConfigStorage(...)` `.WithRedisCookieStorage(...)`
WebReaper.AzureServiceBus	Distributed scheduler over an Azure Service Bus queue	`.WithAzureServiceBusScheduler(...)`
WebReaper.Cosmos	Azure Cosmos DB result sink	`.WriteToCosmosDb(...)`
WebReaper.Sqlite	Local durable scheduler & visited-link tracker on an embedded SQLite store — resume is a query, no position file. Opt-in robust-local tier (no server, unlike Redis).	`.WithSqliteScheduler(...)` `.TrackVisitedLinksInSqlite(...)`
WebReaper.AI	LLM content extraction over `Microsoft.Extensions.AI` (ADR-0044) — fallback after the deterministic fold (ADR-0046 router) or self-healing selectors (ADR-0047). Bring your own `IChatClient` (OpenAI, Anthropic, Ollama, …).	`.WithLlmFallback(chatClient)` `.WithLlmSelfHealing(chatClient)` `.WithLlmExtractor(chatClient)`
WebReaper.Extraction.Attributes	The `[ScrapeSchema]` / `[ScrapeField]` marker types — depend on these from POCOs that the source generator should pick up. Standalone, no runtime cost.	`[ScrapeSchema]` `[ScrapeField("selector")]`
WebReaper.Extraction.Generators	Roslyn source generator that emits `static Schema` + reflection-free `static Materialize(JsonObject)` for a `[ScrapeSchema] partial class` (ADR-0045). `DevelopmentDependency=true` — does not propagate at runtime.	— (compile-time only)
WebReaper.Mcp	MCP server `Exe` that exposes scrape / map / extract as MCP tools over stdio (ADR-0049) — interop adapter for MCP-only clients (Cursor, Claude Desktop, Copilot Studio).	— (the package is the executable)

The core default page loader is HTTP-only. Crawling a dynamic page (CrawlWithBrowser / FollowWithBrowser / PaginateWithBrowser) without WebReaper.Puppeteer registered throws an InvalidOperationException telling you to add the package and call .WithPuppeteerPageLoader().

Requirements

.NET 10. The core package is IsAotCompatible — it Native-AOT-publishes with zero trim/AOT warnings (proven by the AOT smoke test in CI). Satellites carry their own SDK dependencies and are not AOT-clean by design; reference one only when you use it.

Features

⚡ High crawling speed through parallelism and asynchrony
🗒 Declarative and easy to use
🪶 Dependency-light, Native-AOT-ready, Newtonsoft-free core
💾 Console, CSV and JSON-Lines sinks out of the box; MongoDB, Redis and Azure Cosmos DB via satellites
🌎 Scalable: run on cloud VMs, serverless functions or on-prem; go distributed with Redis or Azure Service Bus
🐙 Crawl and parse Single Page Applications with Puppeteer (WebReaper.Puppeteer)
🖥 Proxy support
🌀 Extensible: replace any out-of-the-box seam with your own implementation

Usage examples

Data mining
Gathering data for machine learning
Online price-change monitoring and price comparison
News aggregation
Product-review scraping (to watch the competition)
Tracking online presence and reputation

API overview

Parsing dynamic pages (SPA)

Parsing Single Page Applications is simple: use CrawlWithBrowser and/or FollowWithBrowser, add the WebReaper.Puppeteer package, and register it with .WithPuppeteerPageLoader(). Puppeteer then loads those pages in a headless browser.

dotnet add package WebReaper.Puppeteer

using WebReaper.Builders;
using WebReaper.Puppeteer;

var engine = await ScraperEngineBuilder
    .CrawlWithBrowser("https://www.alexpavlov.dev/blog")
    .Extract(new()
    {
        new("title", ".text-3xl.font-bold"),
        new("text", ".max-w-max.prose.prose-dark")
    })
    .WithPuppeteerPageLoader()
    .FollowWithBrowser("a.text-gray-900.transition")
    .WriteToJsonFile("output.json")
    .PageCrawlLimit(10)
    .WithParallelismDegree(30)
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();

.WithPuppeteerPageLoader() is parameterless and reproduces the pre-7.0 behaviour exactly: one shared cookie container, with the optional proxy applied through the browser's own mechanism. The first dynamic-page run downloads Chromium via Puppeteer.

Running JavaScript / page actions

You can run JavaScript and drive the page as it loads in the headless browser. Pass an actions lambda (e.g. .ScrollToEnd()) — useful when the content you need appears only after clicks, scrolls, etc.

using WebReaper.Builders;
using WebReaper.Puppeteer;

var engine = await ScraperEngineBuilder
    .CrawlWithBrowser("https://www.reddit.com/r/dotnet/", actions => actions
        .ScrollToEnd()
        .Build())
    .Extract(new()
    {
        new("title", "._eYtD2XCVieq6emjKBH3m"),
        new("text", "._3xX726aBn29LDbsDtzr_6E._1Ap4F5maDtT1E1YuCiaO0r.D3IL3FD0RFy_mkKLPwL4")
    })
    .Follow("a.SQnoC3ObvgnGjWt90zD9Z._2INHSNB8V5eaWp4P0rY_mE")
    .WithPuppeteerPageLoader()
    .WriteToJsonFile("output.json")
    .LogToConsole()
    .BuildAsync();

await engine.RunAsync();

Console.ReadLine();

PageActionBuilder exposes Click, Wait, ScrollToEnd, WaitForSelector, WaitForNetworkIdle, EvaluateExpression, Repeat/RepeatWithDelay, and Build().

Persist the progress locally

To persist the job queue and visited links locally — so you can resume where you left off — use WithTextFileScheduler and TrackVisitedLinksInFile:

using WebReaper.Builders;

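// Assumes `logger` (your logger instance) and `blackList` (the URLs to skip) are defined elsewhere.
var engine = await ScraperEngineBuilder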
    .Crawl("https://rutracker.org/forum/index.php?c=33")
    .Extract(new()
    {
        new("name", "#topic-title"),
        new("category", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(3)"),
        new("subcategory", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(5)"),
        new("torrentSize", "div.attach_link.guest>ul>li:nth-child(2)"),
        new("torrentLink", ".magnet-link", "href"),
        new("coverImageUrl", ".postImg", "src")
    })
    .WithLogger(logger)
    .Follow("#cf-33 .forumlink>a")
    .Follow(".forumlink>a")
    .Paginate("a.torTopic", ".pg")
    .WriteToJsonFile("result.json")
    .IgnoreUrls(blackList)
    .WithTextFileScheduler("jobs.txt", "currentJob.txt")
    .TrackVisitedLinksInFile("links.txt")
    .BuildAsync();

The file scheduler is the zero-dependency default: an append-only job file, a 300 ms poll and a sidecar position file. For a long single-machine crawl that must survive kill -9 and resume by query — without standing up a Redis server — add the WebReaper.Sqlite satellite and swap the two local backends. "Resume" becomes a SELECT over an indexed table; there is no position file to keep in sync (the visited-link table is the set — no in-memory mirror). The core file adapters are unchanged and stay the default; this is opt-in:

using WebReaper.Builders;
using WebReaper.Sqlite;   // dotnet add package WebReaper.Sqlite

var engine = await ScraperEngineBuilder
    .Crawl("https://rutracker.org/forum/index.php?c=33")
    .Extract(new() { new("name", "#topic-title") })
    .Follow(".forumlink>a")
    .Paginate("a.torTopic", ".pg")
    .WriteToJsonFile("result.json")
    .WithSqliteScheduler("crawl/state.db")        // resume is a query, not a position file
    .TrackVisitedLinksInSqlite("crawl/state.db")  // the table is the set
    .BuildAsync();

Pass dataCleanupOnStart: true to either call to start a fresh crawl (clears that table at start).

Authorization

If the site needs authorization, call SetCookies and fill the CookieContainer with the cookies required. You perform the login yourself; WebReaper only uses the cookies you provide.

using System.Net;
using WebReaper.Builders;

var engine = await ScraperEngineBuilder
    .Crawl("https://rutracker.org/forum/index.php?c=33")
    .Extract(new() { new("name", "#topic-title") })
    .WithLogger(logger)
    .SetCookies(cookies =>
    {
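        // CookieContainer.Add(Cookie) throws if the cookie has no Domain, so set path and domain explicitly.
        cookies.Add(new Cookie("AuthToken", "123", "/", "rutracker.org"));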
    })
    // ...
    .BuildAsync();
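
One detail worth checking before wiring cookies in: System.Net.CookieContainer.Add(Cookie) rejects a cookie whose Domain is empty, so each cookie needs a domain (and usually a path). The sketch below uses only System.Net and is independent of WebReaper; example.com and the token value are placeholders, not anything WebReaper requires.

```csharp
using System;
using System.Net;

var cookies = new CookieContainer();

// Arguments: name, value, path, domain.
// Without a domain, Add(Cookie) throws ArgumentException.
cookies.Add(new Cookie("AuthToken", "123", "/", "example.com"));

// Ask the container which cookies it would send to a given URL.
CookieCollection sent = cookies.GetCookies(new Uri("https://example.com/forum/index.php"));
foreach (Cookie cookie in sent)
    Console.WriteLine($"{cookie.Name}={cookie.Value}");
```

Running a check like this against the real start URL is a quick way to confirm that the domain and path you chose actually match the pages being crawled.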

How to disable headless mode

When scraping with a browser (CrawlWithBrowser / FollowWithBrowser, via WebReaper.Puppeteer) the default is headless — you don't see the browser. Seeing it can help with debugging; disable headless mode with .HeadlessMode(false):

using WebReaper.Builders;
using WebReaper.Puppeteer;

var engine = await ScraperEngineBuilder
    .CrawlWithBrowser("https://www.reddit.com/r/dotnet/", actions => actions
        .ScrollToEnd()
        .Build())
    .Extract(new() { new("title", "._eYtD2XCVieq6emjKBH3m") })
    .HeadlessMode(false)
    .WithPuppeteerPageLoader()
    // ...
    .BuildAsync();

Cleaning data from a previous run

To start fresh, pass dataCleanupOnStart: true to the relevant builder method.

// Result file — note: WriteToJsonFile already defaults dataCleanupOnStart to TRUE
.WriteToJsonFile("output.json", dataCleanupOnStart: true)

// Visited-link tracker
.TrackVisitedLinksInFile("visited.txt", dataCleanupOnStart: true)

// Job queue / scheduler
.WithTextFileScheduler("jobs.txt", "currentJob.txt", dataCleanupOnStart: true)

The dataCleanupOnStart parameter exists on the satellite sinks too (e.g. WriteToMongoDb, WriteToRedis, WriteToCosmosDb). Note WriteToJsonFile defaults it to true (it wipes the file on start) — the opposite of the other sinks, which default to false. The "JSON" file sink writes JSON Lines (one compact JSON object per line), not a JSON array.
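
Because the file holds JSON Lines rather than a JSON array, parsing the whole file as one document fails; read it line by line instead. A minimal sketch using only System.Text.Json follows. The helper name ReadJsonLines and the sample field names are illustrative assumptions, and the sample string stands in for the contents of output.json.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json.Nodes;

// Hypothetical helper: parse one JSON object per non-empty line.
static List<JsonObject> ReadJsonLines(TextReader reader)
{
    var parsed = new List<JsonObject>();
    string? line;
    while ((line = reader.ReadLine()) is not null)
    {
        if (string.IsNullOrWhiteSpace(line)) continue;
        // AsObject() throws if a line holds something other than a JSON object.
        parsed.Add(JsonNode.Parse(line)!.AsObject());
    }
    return parsed;
}

// Stand-in for the file contents; the field names are made up.
const string sample = """
{"title":"First post","url":"https://example.com/1"}
{"title":"Second post","url":"https://example.com/2"}
""";

List<JsonObject> records = ReadJsonLines(new StringReader(sample));
foreach (JsonObject record in records)
    Console.WriteLine(record["title"]?.GetValue<string>());
```

For a real file, pass File.OpenText("output.json") in place of the StringReader.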

Distributed and serverless scraping

Swap the scheduler, config storage and link tracker to Redis or Azure Service Bus and multiple workers / serverless functions can share one crawl. Examples/WebReaper.AzureFuncs shows the serverless shape with two functions:

StartScraping builds the scraper configuration, seeds the distributed Outstanding-work latch, and enqueues the first job (the start URL) onto the queue (e.g. Azure Service Bus).
WebReaperSpider is the distributed Crawl driver, triggered by each queued job. It gets a bare ISpider from new DistributedSpiderBuilder()...BuildSpider() (load → Crawl step → JobReport), then interprets the report: an atomic visited-link test-and-set gates duplicates/redeliveries, a parsed page is fanned out to the sink, discovered child jobs are enqueued back onto the queue, and a distributed Outstanding-work latch detects when all work has drained. It never throws to signal the crawl limit, so the queue is never poisoned (ADR-0022).

DistributedSpiderBuilder.BuildSpider() (the ADR-0009 distributed-worker seam) returns an ISpider without building or persisting a ScraperConfig; it has no Crawl seed and no BuildAsync — the worker's config is persisted separately by the start endpoint (ScraperEngineBuilder.Crawl(...).Extract(...).Build()) and read from storage at crawl time. This is the "two seams, not one bug" split (ADR-0025). See also Examples/WebReaper.DistributedScraperWorkerService.
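
The "atomic visited-link test-and-set" that gates duplicates and redeliveries can be pictured with an in-process analogue. This is a sketch of the idea only, not WebReaper's tracker: TryMarkVisited is a hypothetical name, and a distributed tracker would need the same check-and-insert to be atomic in the shared store.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

var visited = new ConcurrentDictionary<string, byte>();

// TryAdd checks and inserts in one atomic step, so exactly one caller
// "wins" each URL even when the same job is delivered many times.
bool TryMarkVisited(string url) => visited.TryAdd(url, 0);

int processed = 0;
Parallel.For(0, 100, _ =>
{
    // Simulate 100 redeliveries of the same job racing each other.
    if (TryMarkVisited("https://example.com/page"))
        Interlocked.Increment(ref processed);
});

Console.WriteLine(processed); // prints 1
```

A separate "contains, then add" pair would leave a window in which two workers both see the link as unvisited; folding both into one operation closes it.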

Storage and scheduler backends

Every backend is a swappable seam. In-memory is the default; file-backed lives in core; the rest come from satellites.

Seam	Core (in-memory default + file)	Satellite options
Scheduler	in-memory, `WithTextFileScheduler`	`WithSqliteScheduler` (SQLite, local durable), `WithRedisScheduler` (Redis), `WithAzureServiceBusScheduler` (Azure Service Bus)
Visited-link tracker	in-memory, `TrackVisitedLinksInFile`	`TrackVisitedLinksInSqlite` (SQLite, local durable), `TrackVisitedLinksInRedis` (Redis)
Config storage	in-memory, `WithFileConfigStorage`	`WithMongoDbConfigStorage`, `WithRedisConfigStorage`
Cookie storage	in-memory, `WithFileCookieStorage`	`WithMongoDbCookieStorage`, `WithRedisCookieStorage`
Result sink	`WriteToConsole`, `WriteToCsvFile`, `WriteToJsonFile`	`WriteToMongoDb`, `WriteToRedis`, `WriteToCosmosDb`
Page loader	HTTP (default)	`WithPuppeteerPageLoader()` (headless browser)

Extensibility: adding a sink

Out of the box the core package sends parsed data to the Console, CSV and JSON-Lines sinks; MongoDB, Redis and Cosmos DB sinks come from satellites. Add your own by implementing IScraperSink:

using WebReaper.Sinks.Abstract;
using WebReaper.Sinks.Models;

public interface IScraperSink
{
    bool DataCleanupOnStart { get; set; }
    Task EmitAsync(ParsedData entity, CancellationToken cancellationToken = default);
}

ParsedData is record ParsedData(string Url, JsonObject Data) — Data is a System.Text.Json.Nodes.JsonObject (no Newtonsoft). A minimal console sink:

using System.Text.Json.Nodes;
using WebReaper.Sinks.Abstract;
using WebReaper.Sinks.Models;

public class ConsoleSink : IScraperSink
{
    public bool DataCleanupOnStart { get; set; }

    public Task EmitAsync(ParsedData entity, CancellationToken cancellationToken = default)
    {
        Console.WriteLine(entity.Data.ToJsonString());
        return Task.CompletedTask;
    }
}

using WebReaper.Builders;

var engine = await ScraperEngineBuilder
    .Crawl("https://rutracker.org/forum/index.php?c=33")
    .Extract(new()
    {
        new("name", "#topic-title"),
    })
    .AddSink(new ConsoleSink())
    .Follow("#cf-33 .forumlink>a")
    .Follow(".forumlink>a")
    .Paginate("a.torTopic", ".pg")
    .BuildAsync();

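For result callbacks without a custom sink, use .Subscribe(Action<ParsedData>) or a page processor registered with .Process(new DelegatePageProcessor(...)). The older .PostProcess(Func<Metadata, JsonObject, Task>) callback was removed in 10.0.0, with Metadata replaced by PageContext (see the release notes at the end of this page).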

Interfaces

Interface	Description
`IScheduler`	Reads and writes the job queue. Default is in-memory; file (core), SQLite, Redis and Azure Service Bus implementations are available.
`IVisitedLinkTracker`	Tracks visited links. Default is in-memory; file (core), SQLite and Redis implementations are available.
`IPageLoader`	Turns a `PageRequest` into a page's HTML, dispatching on `PageType` to one load transport. The Spider holds one and is loader-blind.
`IPageLoadTransport`	The per-mechanism adapter behind `IPageLoader`: HTTP (core) or headless browser (`WebReaper.Puppeteer`). The only home for that mechanism's client/launch quirks and proxy application.
`IContentExtractor`	The content-extraction seam: takes a loaded document + `Schema`, returns its `System.Text.Json.Nodes.JsonObject` representation. The core adapter is the deterministic `SchemaFold<TNode>` over an `ISchemaBackend<TNode>` (`WithXPathContentParser()` / `WithJsonContentParser()` select the XPath / JSON backend). Implement it directly for an alternative extraction strategy, e.g. an LLM-backed extractor.
`ISchemaBackend<TNode>`	The per-document-shape seam the shared fold calls: parse a root, select many / one by selector, extract a leaf's raw value. The shipped CSS, XPath and JSON backends implement this.
`IScraperSink`	A destination for scraping results. Receives `ParsedData` (`Url` + `JsonObject`).
`ICrawlStep`	The crawl-step decision: maps a `Job` + loaded page + `Schema` to a `CrawlOutcome` (parse the page, follow links, or paginate). Swap it to customize crawl-vs-parse behavior.
`ISpider`	The per-Job I/O shell around `ICrawlStep`: loads one page, runs the Crawl step, and returns a `JobReport` — nothing else. The Crawl driver (in-process `ScraperEngine` or the distributed worker) owns the visited-link tracker, the crawl-limit stop, sink fan-out and the callbacks. Obtained from `DistributedSpiderBuilder.BuildSpider()` (the ADR-0009 reduced shell).
`IOutstandingWorkLatch`	The Crawl driver's termination detector (ADR-0022): a unit-credit counter that trips exactly once when all work is drained. In-memory `Interlocked` adapter (in-process) and a distributed-atomic Redis adapter (`WebReaper.Redis`).
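
To make the "unit-credit counter that trips exactly once" in the IOutstandingWorkLatch row concrete, here is a minimal in-process sketch built on Interlocked. It illustrates the idea under stated assumptions and is not the library's implementation: the Settle function and its signature are hypothetical, and the real interface may differ.

```csharp
using System;
using System.Threading;

// One credit per job that is queued or in flight; the seed job holds the first.
int outstanding = 1;
int tripped = 0;

// Called once per finished job with the number of child jobs it produced.
// A single atomic add applies "+children, -1 for this job".
// Assumption: call it before the children become visible to other workers,
// otherwise a fast child could drive the counter to zero early.
bool Settle(int childrenEnqueued)
{
    int remaining = Interlocked.Add(ref outstanding, childrenEnqueued - 1);
    // The exchange guarantees that only one caller ever observes "drained".
    return remaining == 0 && Interlocked.Exchange(ref tripped, 1) == 0;
}

Console.WriteLine(Settle(2)); // seed finished and found two links: False
Console.WriteLine(Settle(0)); // first child finished:              False
Console.WriteLine(Settle(0)); // second child finished:             True
```

The driver stops when a call returns true; no worker has to throw or poll an empty queue to discover that the crawl is done.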

Main entities

Job — a record representing one unit of work for the spider.
LinkPathSelector — a selector for links to be crawled.
CrawlOutcome — the closed result of a crawl step: a parsed target page, followed links, or paginated pages.
Schema fold — the single recursive Schema interpreter (SchemaFold<TNode>); every backend reuses it instead of re-implementing the walk.
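
The split between one shared fold and per-shape backends can be sketched in a few lines. Everything below is a hypothetical miniature for illustration (Field, IBackend<TNode>, SchemaWalk and DictionaryBackend are invented names, and list fields are left out); WebReaper's SchemaFold<TNode> and ISchemaBackend<TNode> have their own, richer shapes.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;

// A toy document: a selector maps to a leaf string or to a nested document.
var page = new Dictionary<string, object>
{
    ["h1"] = "Hello",
    [".author"] = new Dictionary<string, object> { [".name"] = "Ann" }
};

var schema = new List<Field>
{
    new("title", "h1"),
    new("author", ".author", new List<Field> { new("name", ".name") })
};

JsonObject result = SchemaWalk.Fold(schema, page, new DictionaryBackend());
Console.WriteLine(result.ToJsonString()); // {"title":"Hello","author":{"name":"Ann"}}

// Hypothetical schema node: a leaf when Children is null, else a nested object.
record Field(string Name, string Selector, List<Field>? Children = null);

// Hypothetical backend seam: all the fold needs to know about a document shape.
interface IBackend<TNode>
{
    TNode? SelectOne(TNode node, string selector);
    string? LeafValue(TNode node, string selector);
}

static class SchemaWalk
{
    // The single recursive walk; it never inspects the document directly.
    public static JsonObject Fold<TNode>(IEnumerable<Field> fields, TNode node, IBackend<TNode> backend)
    {
        var output = new JsonObject();
        foreach (var field in fields)
        {
            if (field.Children is null)
            {
                output[field.Name] = backend.LeafValue(node, field.Selector);
            }
            else
            {
                TNode? child = backend.SelectOne(node, field.Selector);
                output[field.Name] = child is null ? null : Fold(field.Children, child, backend);
            }
        }
        return output;
    }
}

sealed class DictionaryBackend : IBackend<Dictionary<string, object>>
{
    public Dictionary<string, object>? SelectOne(Dictionary<string, object> node, string selector) =>
        node.TryGetValue(selector, out var value) ? value as Dictionary<string, object> : null;

    public string? LeafValue(Dictionary<string, object> node, string selector) =>
        node.TryGetValue(selector, out var value) ? value as string : null;
}
```

Supporting a new document shape then means writing another small backend; the recursion itself is not repeated.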

Repository structure

Project	Description
`WebReaper`	The core library (the `WebReaper` NuGet package).
`WebReaper.Puppeteer`	Satellite: headless-browser page loader.
`WebReaper.Mongo`	Satellite: MongoDB sink + config/cookie storage.
`WebReaper.Redis`	Satellite: Redis scheduler, tracker, sink, config/cookie storage.
`WebReaper.AzureServiceBus`	Satellite: Azure Service Bus distributed scheduler.
`WebReaper.Cosmos`	Satellite: Azure Cosmos DB sink.
`Examples/WebReaper.ConsoleApplication`	Using WebReaper in a console application.
`Examples/WebReaper.ScraperWorkerService`	Using WebReaper in a .NET Worker Service.
`Examples/WebReaper.DistributedScraperWorkerService`	Distributed crawl across workers sharing crawl state.
`Examples/WebReaper.AzureFuncs`	Serverless crawl with Azure Functions + Azure Service Bus.
`Examples/BrownsfashionScraper`	A real-world e-commerce scraper example.
`Misc/WebReaper.ProxyProviders`	Example proxy-provider implementations.

License

WebReaper is licensed under the MIT License (ADR-0017). All eleven NuGet packages (core + ten satellites) plus the WebReaper.Cli project ship under the same terms. Use it commercially, embed it in proprietary software, fork it, modify it, redistribute it; the only ask is that you keep the copyright notice.

Prior to the 10.0.0 wave, WebReaper was GPL-3.0-or-later. The relicense is strictly more permissive: every existing user is unaffected; new users who couldn't embed under GPL now can. Historical contributors are credited in CONTRIBUTORS.md. See docs/adr/0017-relicense-gpl-mit.md for the analysis and contributor consent path.

Contributions are welcome under the same MIT terms; sign-off via DCO (CONTRIBUTING.md).

Product	Compatible and additional computed target framework versions.
.NET	net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.


net10.0
- AngleSharp (>= 1.4.0)
- AngleSharp.XPath (>= 2.0.6)
- Microsoft.Extensions.Http (>= 10.0.8)
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.8)

NuGet packages (11)

Showing the top 5 NuGet packages that depend on WebReaper:

Package	Downloads
WebReaper.Cosmos Azure Cosmos DB sink for WebReaper. Adds ScraperEngineBuilder.WriteToCosmosDb(...). Satellite package (ADR-0009) so the WebReaper core stays dependency-light and AOT-clean.	563
WebReaper.Mongo MongoDB adapters for WebReaper: the MongoDbSink, plus MongoDB-backed scraper-config and cookie storage. Adds ScraperEngineBuilder.WriteToMongoDb / WithMongoDbConfigStorage / WithMongoDbCookieStorage. Satellite package (ADR-0009) so the WebReaper core stays dependency-light and AOT-clean.	559
WebReaper.Sqlite SQLite embedded-store adapters for WebReaper: a local durable scheduler (and, from the next slice, a visited-link tracker) backed by SQLite via Microsoft.Data.Sqlite; resume is a query, not a hand-rolled position file. The opt-in robust-local durability tier between the zero-dependency core file adapters and the distributed Redis / Azure Service Bus satellites. Adds ScraperEngineBuilder.WithSqliteScheduler. Satellite package (ADR-0009 / ADR-0012) so the WebReaper core stays dependency-light and Native-AOT-clean; the native e_sqlite3 (SQLitePCLRaw) graph is quarantined here.	548
WebReaper.Redis Redis adapters for WebReaper: the Redis scheduler, visited-link tracker, sink, and Redis-backed scraper-config and cookie storage, all sharing one connection pool (ADR-0005). Adds ScraperEngineBuilder.WithRedisScheduler / TrackVisitedLinksInRedis / WriteToRedis / WithRedisConfigStorage / WithRedisCookieStorage. Satellite package (ADR-0009) so the WebReaper core stays dependency-light and AOT-clean.	524
WebReaper.AzureServiceBus Azure Service Bus scheduler for WebReaper: a distributed job queue backed by an Azure Service Bus queue, for sharing crawl state across workers and serverless functions. Adds ScraperEngineBuilder.WithAzureServiceBusScheduler. Satellite package (ADR-0009) so the WebReaper core stays dependency-light and AOT-clean.	517

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
11.1.1	64	5/30/2026
11.1.0	72	5/30/2026
11.0.0	73	5/30/2026
10.2.0	74	5/29/2026
10.1.0	80	5/29/2026
10.0.0	161	5/26/2026
9.0.0	160	5/19/2026
8.0.0	160	5/19/2026
7.0.0	167	5/17/2026
6.0.0	100	5/17/2026
5.1.0	88	5/16/2026
5.0.0	87	5/16/2026
4.1.0	89	5/16/2026
4.0.0	106	5/16/2026
3.5.2	3,105	10/19/2024
3.5.1	2,994	8/15/2023
3.5.0	291	8/9/2023
3.4.0	433	4/17/2023
3.3.0	402	4/3/2023
3.2.0	386	4/2/2023

10.0.0 (breaking, major): the AI-native funnel ships on a deepened architecture, in three arcs spanning 24 ADRs.

(1) Crawl seed (ADR-0025) closes the last builder-construction trap: a scrape now begins with ScraperEngineBuilder.Crawl(urls)/.CrawlWithBrowser(urls), returning ICrawlSeed; the seed's .Extract(schema) or .AsMarkdown() picks the extraction strategy before the builder is reachable, so "build with no start URLs or no extraction" has no representation. The old new ScraperEngineBuilder().Get(...)...Parse(...) shape and ConfigBuilder's runtime InvalidOperationException are gone; DistributedSpiderBuilder is the seedless worker shell (ADR-0009).

(2) Architecture deepening (ADR-0026..0039):
- IRetryPolicy seam with Polly leaving the core graph
- AngleSharpRawExtractor shared between the CSS/XPath backends
- Schema.Add construction guards
- ParsedData owns URL merge
- the Crawl driver's StopRule module + one-op IOutstandingWorkLatch credit (Tier-1 break; RedisOutstandingWorkLatch resigned)
- IAsyncInitializable replaces Initialization Task (Tier-1 break)
- Spider takes config at construction (DistributedSpiderBuilder.BuildSpider now requires a ScraperConfig)
- PageAction is a closed sum of six typed arms (PageActionType enum + object[] removed, wire-format break, EvaluateExpression/WaitForSelector actually run now)
- link extraction collapses to LinkExtractor (ILinkParser removed)
- IScheduler.Complete() removed: stop is consumer-cancel of GetAllAsync (durable schedulers no longer hang on StopWhenAllLinksProcessed)
- IPageProcessor pipeline + Sink (PostProcess + Metadata removed; migrate to .Process(new DelegatePageProcessor(...)))
- IJsonContentParser → IContentExtractor, ParseToJsonAsync → ExtractAsync, WithContentParser → WithContentExtractor, SchemaContentParser → SchemaFold

(3) AI-native (ADR-0040..0049):
- .AsMarkdown() seed terminal
- IPageCache + WithMaxAge(TimeSpan)
- ISiteMapper + ScraperEngineBuilder.MapAsync
- WebReaper.Cli AOT single-binary
- WebReaper.AI satellite (LLM extractor on IChatClient)
- WebReaper.Extraction.Attributes + .Generators source generator emitting Schema + Materialize from attributed POCOs (AOT-clean Pydantic-parity)
- ExtractionRouter (deterministic-first → fallback) + WithFallbackExtractor
- SelfHealingContentExtractor + ISelectorRepairer + per-crawl selector cache
- ChangeTrackingProcessor + IChangeStore on the page-processor pipeline
- WebReaper.Mcp satellite

License: GPL-3.0-or-later → MIT (ADR-0017).

Migration short-form:
- new ScraperEngineBuilder().Get(url)....Parse(schema)....BuildAsync() → ScraperEngineBuilder.Crawl(url).Extract(schema)....BuildAsync()
- .GetWithBrowser → .CrawlWithBrowser
- PostProcess(cb) → .Process(new DelegatePageProcessor(cb)) with the Metadata replaced by PageContext
- WithJsonContentParser/WithXPathContentParser unchanged

Full per-arc detail and considered alternatives in CHANGELOG.md and docs/adr/0025..0049.