WebSpark.HttpClientUtility.Crawler
Web crawling extension for WebSpark.HttpClientUtility.
Overview
This package provides enterprise-grade web crawling capabilities with robots.txt compliance, HTML parsing, sitemap generation, and real-time progress tracking via SignalR.
Important: This is an extension package. You must also install the base package WebSpark.HttpClientUtility. The two packages are versioned in lockstep, so install the matching version (2.1.2 for this release).
Features
- SiteCrawler: Full-featured web crawler with configurable depth and URL filtering
- SimpleSiteCrawler: Lightweight crawler for basic crawling needs
- Robots.txt Compliance: Automatic parsing and enforcement of robots.txt rules
- HTML Parsing: Extract links, images, and metadata using HtmlAgilityPack
- Sitemap Generation: Generate sitemaps in XML and Markdown formats
- CSV Export: Export crawl results to CSV for analysis
- SignalR Progress: Real-time crawl progress updates via SignalR hub
- Performance Tracking: Built-in metrics for crawl operations
Installation
Install both packages:
dotnet add package WebSpark.HttpClientUtility
dotnet add package WebSpark.HttpClientUtility.Crawler
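If you prefer to manage versions in the project file, the equivalent PackageReference entries (using this release's version numbers) are:

```xml
<ItemGroup>
  <PackageReference Include="WebSpark.HttpClientUtility" Version="2.1.2" />
  <PackageReference Include="WebSpark.HttpClientUtility.Crawler" Version="2.1.2" />
</ItemGroup>
```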
Quick Start
1. Register Services
using WebSpark.HttpClientUtility;
using WebSpark.HttpClientUtility.Crawler;
var builder = WebApplication.CreateBuilder(args);
// Register base package (required)
builder.Services.AddHttpClientUtility();
// Register crawler package
builder.Services.AddHttpClientCrawler();
var app = builder.Build();
// Optional: Register SignalR hub for progress updates
app.MapHub<CrawlHub>("/crawlHub");
app.Run();
2. Basic Crawling
public class CrawlerService
{
private readonly ISiteCrawler _crawler;
public CrawlerService(ISiteCrawler crawler)
{
_crawler = crawler;
}
public async Task<CrawlResult> CrawlWebsiteAsync(string url)
{
var options = new CrawlerOptions
{
StartUrl = url,
MaxDepth = 3,
MaxPages = 100,
RespectRobotsTxt = true
};
var result = await _crawler.CrawlAsync(options);
Console.WriteLine($"Crawled {result.TotalPages} pages in {result.Duration}");
return result;
}
}
3. SimpleSiteCrawler
For lightweight crawling without full recursion:
public class SimpleCrawlerService
{
private readonly SimpleSiteCrawler _simpleCrawler;
public SimpleCrawlerService(SimpleSiteCrawler simpleCrawler)
{
_simpleCrawler = simpleCrawler;
}
public async Task<List<string>> GetAllLinksAsync(string url)
{
var result = await _simpleCrawler.CrawlAsync(url);
return result.DiscoveredUrls;
}
}
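Because DiscoveredUrls is an ordinary list, it can be filtered like any other collection. A minimal sketch that keeps only links on the starting host; the GetInternalLinksAsync method is hypothetical (not part of the package) and is meant to be added to the SimpleCrawlerService class above:

```csharp
// Requires using System.Linq.
public async Task<List<string>> GetInternalLinksAsync(string url)
{
    var result = await _simpleCrawler.CrawlAsync(url);
    var host = new Uri(url).Host;

    // Keep only absolute URLs that stay on the starting host.
    return result.DiscoveredUrls
        .Where(u => Uri.TryCreate(u, UriKind.Absolute, out var parsed)
                    && parsed.Host.Equals(host, StringComparison.OrdinalIgnoreCase))
        .ToList();
}
```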
4. Advanced Configuration
builder.Services.AddHttpClientCrawler(options =>
{
options.DefaultMaxDepth = 5;
options.DefaultMaxPages = 500;
options.DefaultTimeout = TimeSpan.FromSeconds(30);
options.UserAgent = "MyBot/1.0";
});
5. CSV Export
var result = await _crawler.CrawlAsync(options);
await result.ExportToCsvAsync("crawl-results.csv");
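The exported file is plain CSV, so it can be read back with any CSV tooling, for example CsvHelper (already a dependency of this package). Since the exact column layout produced by ExportToCsvAsync is not documented here, the sketch below reads each row generically:

```csharp
using System.Globalization;
using CsvHelper;

using var reader = new StreamReader("crawl-results.csv");
using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);

// Read each row as a dictionary of column name -> value,
// making no assumptions about the columns ExportToCsvAsync writes.
foreach (IDictionary<string, object> row in csv.GetRecords<dynamic>())
{
    Console.WriteLine(string.Join(", ", row.Select(kv => $"{kv.Key}={kv.Value}")));
}
```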
6. SignalR Progress Updates
// Client-side JavaScript
const connection = new signalR.HubConnectionBuilder()
.withUrl("/crawlHub")
.build();
connection.on("CrawlProgress", (progress) => {
console.log(`Progress: ${progress.pagesProcessed}/${progress.totalPages}`);
});
await connection.start();
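On the server, you can also push your own messages through the same hub. A minimal sketch, assuming CrawlHub is the hub type mapped above; the CrawlProgressNotifier class is hypothetical (not a type provided by the package) and the payload shape is illustrative only:

```csharp
using Microsoft.AspNetCore.SignalR;

public class CrawlProgressNotifier
{
    private readonly IHubContext<CrawlHub> _hubContext;

    public CrawlProgressNotifier(IHubContext<CrawlHub> hubContext)
    {
        _hubContext = hubContext;
    }

    // Broadcast an illustrative progress payload to all connected clients,
    // using the same "CrawlProgress" event name the JavaScript client listens for.
    public Task ReportAsync(int pagesProcessed, int totalPages) =>
        _hubContext.Clients.All.SendAsync("CrawlProgress", new { pagesProcessed, totalPages });
}
```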
CrawlerOptions
| Property | Type | Default | Description |
|---|---|---|---|
| StartUrl | string | required | The URL to start crawling from |
| MaxDepth | int | 3 | Maximum depth to crawl (0 = no limit) |
| MaxPages | int | 100 | Maximum number of pages to crawl |
| RespectRobotsTxt | bool | true | Honor robots.txt directives |
| Timeout | TimeSpan | 30s | Request timeout per page |
| UserAgent | string | "WebSpark.Crawler" | User agent string |
| AllowedDomains | List<string> | null | Restrict crawling to specific domains |
| ExcludedPaths | List<string> | null | Paths to exclude from crawling |
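For example, a crawl restricted to a single domain that skips administrative paths could be configured like this (all values are illustrative):

```csharp
var options = new CrawlerOptions
{
    StartUrl = "https://example.com",
    MaxDepth = 2,
    MaxPages = 50,
    RespectRobotsTxt = true,
    Timeout = TimeSpan.FromSeconds(15),
    UserAgent = "MyBot/1.0",
    AllowedDomains = new List<string> { "example.com" },    // stay on this domain
    ExcludedPaths = new List<string> { "/admin", "/login" } // never crawl these paths
};
```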
CrawlResult Properties
- TotalPages: Number of pages successfully crawled
- FailedPages: Number of pages that failed to crawl
- Duration: Total time taken for the crawl
- DiscoveredUrls: List of all URLs discovered
- SitemapXml: Generated sitemap in XML format
- SitemapMarkdown: Generated sitemap in Markdown format
- Errors: List of errors encountered during crawl
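Continuing the CrawlerService example above, a short sketch that persists the generated sitemaps and logs failures (file names are illustrative):

```csharp
var result = await _crawler.CrawlAsync(options);

// Persist the generated sitemaps.
await File.WriteAllTextAsync("sitemap.xml", result.SitemapXml);
await File.WriteAllTextAsync("sitemap.md", result.SitemapMarkdown);

Console.WriteLine($"Crawled {result.TotalPages} pages ({result.FailedPages} failed) in {result.Duration}.");

// Log any errors encountered during the crawl.
foreach (var error in result.Errors)
{
    Console.WriteLine($"Error: {error}");
}
```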
Robots.txt Support
The crawler automatically:
- Downloads and parses robots.txt from target sites
- Respects Disallow directives
- Honors Crawl-delay settings
- Supports wildcard patterns
Disable with:
options.RespectRobotsTxt = false; // Not recommended
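For reference, a robots.txt that exercises all three of these features might look like the following (example only; the crawler reads whatever the target site actually serves):

```text
User-agent: *
Disallow: /admin/
Disallow: /*.pdf$
Crawl-delay: 10
```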
Performance Tips
- Limit Scope: Use MaxDepth and MaxPages to avoid overwhelming servers
- Respect Crawl-Delay: The crawler automatically honors robots.txt delays
- Use AllowedDomains: Restrict crawling to prevent following external links
- Monitor Progress: Use SignalR hub to track crawl status in real-time
Requirements
- Base Package: WebSpark.HttpClientUtility 2.1.2 (versions are kept in lockstep with this package)
- .NET Version: .NET 8 LTS, .NET 9, or .NET 10
- ASP.NET Core: Required for SignalR features
Migration from v1.x
If you're upgrading from WebSpark.HttpClientUtility v1.x:
- Install the crawler package: dotnet add package WebSpark.HttpClientUtility.Crawler
- Add the using directive: using WebSpark.HttpClientUtility.Crawler;
- Add the DI registration: services.AddHttpClientCrawler();
All crawler APIs remain unchanged - only the registration is different.
License
MIT License - see LICENSE
| Product | Compatible target frameworks |
|---|---|
| .NET | net8.0, net9.0, and net10.0 are compatible; platform-specific variants (android, browser, ios, maccatalyst, macos, tvos, windows) are computed for each. |
Dependencies (identical for net8.0, net9.0, and net10.0):
- CsvHelper (>= 33.1.0)
- HtmlAgilityPack (>= 1.12.4)
- Markdig (>= 0.44.0)
- WebSpark.HttpClientUtility (>= 2.1.2)
Release Notes
- 2.1.2 - Security patch: Updated to match the base package version for lockstep versioning. Fixed js-yaml and glob vulnerabilities in documentation dependencies. Requires base package [2.1.2]. Zero breaking changes.
- 2.1.1 - Version bump to maintain lockstep with the base package. Requires base package [2.1.1]. Zero breaking changes.
- 2.1.0 - Added .NET 10 (Preview) multi-targeting support. All projects now target net8.0, net9.0, and net10.0. Requires base package [2.1.0]. All 81 crawler tests pass on all three frameworks (243 test runs, 0 failures). Zero breaking changes.