WebSpark.HttpClientUtility.Crawler 2.1.2

.NET CLI

dotnet add package WebSpark.HttpClientUtility.Crawler --version 2.1.2

Package Manager

NuGet\Install-Package WebSpark.HttpClientUtility.Crawler -Version 2.1.2

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

PackageReference

<PackageReference Include="WebSpark.HttpClientUtility.Crawler" Version="2.1.2" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

Central Package Management (CPM)

For projects that use Central Package Management, add the version to the solution's Directory.Packages.props file:

<PackageVersion Include="WebSpark.HttpClientUtility.Crawler" Version="2.1.2" />

and reference the package without a version in the project file:

<PackageReference Include="WebSpark.HttpClientUtility.Crawler" />

Paket CLI

paket add WebSpark.HttpClientUtility.Crawler --version 2.1.2

Script & Interactive

#r "nuget: WebSpark.HttpClientUtility.Crawler, 2.1.2"

The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package WebSpark.HttpClientUtility.Crawler@2.1.2

The #:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Cake

#addin nuget:?package=WebSpark.HttpClientUtility.Crawler&version=2.1.2

Install as a Cake Addin.

#tool nuget:?package=WebSpark.HttpClientUtility.Crawler&version=2.1.2

Install as a Cake Tool.

WebSpark.HttpClientUtility.Crawler

Web crawling extension for WebSpark.HttpClientUtility.

Overview

This package provides enterprise-grade web crawling capabilities with robots.txt compliance, HTML parsing, sitemap generation, and real-time progress tracking via SignalR.

Important: This is an extension package. You must also install the base package WebSpark.HttpClientUtility at the matching version (2.1.2); the two packages are versioned in lockstep.

Features

  • SiteCrawler: Full-featured web crawler with configurable depth and URL filtering
  • SimpleSiteCrawler: Lightweight crawler for basic crawling needs
  • Robots.txt Compliance: Automatic parsing and enforcement of robots.txt rules
  • HTML Parsing: Extract links, images, and metadata using HtmlAgilityPack
  • Sitemap Generation: Generate sitemaps in XML and Markdown formats
  • CSV Export: Export crawl results to CSV for analysis
  • SignalR Progress: Real-time crawl progress updates via SignalR hub
  • Performance Tracking: Built-in metrics for crawl operations

Installation

Install both packages:

dotnet add package WebSpark.HttpClientUtility
dotnet add package WebSpark.HttpClientUtility.Crawler

Quick Start

1. Register Services

using WebSpark.HttpClientUtility;
using WebSpark.HttpClientUtility.Crawler;

var builder = WebApplication.CreateBuilder(args);

// Register base package (required)
builder.Services.AddHttpClientUtility();

// Register crawler package
builder.Services.AddHttpClientCrawler();

var app = builder.Build();

// Optional: Register SignalR hub for progress updates
app.MapHub<CrawlHub>("/crawlHub");

app.Run();

2. Basic Crawling

public class CrawlerService
{
    private readonly ISiteCrawler _crawler;

    public CrawlerService(ISiteCrawler crawler)
    {
        _crawler = crawler;
    }

    public async Task<CrawlResult> CrawlWebsiteAsync(string url)
    {
        var options = new CrawlerOptions
        {
            StartUrl = url,
            MaxDepth = 3,
            MaxPages = 100,
            RespectRobotsTxt = true
        };

        var result = await _crawler.CrawlAsync(options);
        
        Console.WriteLine($"Crawled {result.TotalPages} pages in {result.Duration}");
        return result;
    }
}

3. SimpleSiteCrawler

For lightweight crawling without full recursion:

public class SimpleCrawlerService
{
    private readonly SimpleSiteCrawler _simpleCrawler;

    public SimpleCrawlerService(SimpleSiteCrawler simpleCrawler)
    {
        _simpleCrawler = simpleCrawler;
    }

    public async Task<List<string>> GetAllLinksAsync(string url)
    {
        var result = await _simpleCrawler.CrawlAsync(url);
        return result.DiscoveredUrls;
    }
}

4. Advanced Configuration

builder.Services.AddHttpClientCrawler(options =>
{
    options.DefaultMaxDepth = 5;
    options.DefaultMaxPages = 500;
    options.DefaultTimeout = TimeSpan.FromSeconds(30);
    options.UserAgent = "MyBot/1.0";
});

5. CSV Export

var result = await _crawler.CrawlAsync(options);

await result.ExportToCsvAsync("crawl-results.csv");

6. SignalR Progress Updates

// Client-side JavaScript
const connection = new signalR.HubConnectionBuilder()
    .withUrl("/crawlHub")
    .build();

connection.on("CrawlProgress", (progress) => {
    console.log(`Progress: ${progress.pagesProcessed}/${progress.totalPages}`);
});

await connection.start();
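
The same progress events can be consumed from a .NET client using the Microsoft.AspNetCore.SignalR.Client package. A minimal sketch follows; the CrawlProgress payload shape is inferred from the JavaScript example above and may differ from the actual hub message type, and the host URL is illustrative.

// Requires the Microsoft.AspNetCore.SignalR.Client NuGet package
using Microsoft.AspNetCore.SignalR.Client;

var connection = new HubConnectionBuilder()
    .WithUrl("https://localhost:5001/crawlHub") // illustrative host; path matches MapHub above
    .WithAutomaticReconnect()
    .Build();

// Handler parameter mirrors the fields used in the JavaScript example
connection.On<CrawlProgress>("CrawlProgress", progress =>
    Console.WriteLine($"Progress: {progress.PagesProcessed}/{progress.TotalPages}"));

await connection.StartAsync();

// Hypothetical payload shape, inferred from the JavaScript example above
public record CrawlProgress(int PagesProcessed, int TotalPages);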

CrawlerOptions

  • StartUrl (string, required): The URL to start crawling from
  • MaxDepth (int, default 3): Maximum depth to crawl (0 = no limit)
  • MaxPages (int, default 100): Maximum number of pages to crawl
  • RespectRobotsTxt (bool, default true): Honor robots.txt directives
  • Timeout (TimeSpan, default 30 seconds): Request timeout per page
  • UserAgent (string, default "WebSpark.Crawler"): User agent string
  • AllowedDomains (List<string>, default null): Restrict crawling to specific domains
  • ExcludedPaths (List<string>, default null): Paths to exclude from crawling
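
For example, a scoped crawl might combine these options as follows (a sketch using only the properties listed above; the values are illustrative):

var options = new CrawlerOptions
{
    StartUrl = "https://example.com",
    MaxDepth = 2,                        // follow links at most two levels deep
    MaxPages = 50,                       // stop after 50 pages
    RespectRobotsTxt = true,             // default; honor robots.txt
    Timeout = TimeSpan.FromSeconds(15),  // per-page request timeout
    UserAgent = "MyBot/1.0",
    AllowedDomains = new List<string> { "example.com" },      // stay within this domain
    ExcludedPaths = new List<string> { "/admin", "/search" }  // skip these paths
};

var result = await _crawler.CrawlAsync(options);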

CrawlResult Properties

  • TotalPages: Number of pages successfully crawled
  • FailedPages: Number of pages that failed to crawl
  • Duration: Total time taken for the crawl
  • DiscoveredUrls: List of all URLs discovered
  • SitemapXml: Generated sitemap in XML format
  • SitemapMarkdown: Generated sitemap in Markdown format
  • Errors: List of errors encountered during crawl
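
As a sketch of how a result might be consumed (using only the properties listed above; the element type of Errors is assumed to be string-convertible):

// 'result' is the CrawlResult returned by CrawlAsync
Console.WriteLine($"Crawled {result.TotalPages} pages ({result.FailedPages} failed) in {result.Duration}");

// Persist the generated sitemaps
await File.WriteAllTextAsync("sitemap.xml", result.SitemapXml);
await File.WriteAllTextAsync("sitemap.md", result.SitemapMarkdown);

// Review anything that went wrong during the crawl
foreach (var error in result.Errors)
{
    Console.WriteLine($"Crawl error: {error}");
}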

Robots.txt Support

The crawler automatically:

  • Downloads and parses robots.txt from target sites
  • Respects Disallow directives
  • Honors Crawl-delay settings
  • Supports wildcard patterns

Disable with:

options.RespectRobotsTxt = false; // Not recommended
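
If you leave robots.txt enforcement on (the default), consider setting a descriptive user agent so site owners can identify your crawler and any user-agent-specific robots.txt rules can apply to it. A sketch, using the documented options; whether rules are matched against this value is an assumption about the crawler's robots.txt handling, and the bot-info URL is hypothetical:

var options = new CrawlerOptions
{
    StartUrl = "https://example.com",
    RespectRobotsTxt = true,                            // default: honor Disallow and Crawl-delay
    UserAgent = "MyBot/1.0 (+https://example.com/bot)"  // hypothetical bot-info URL
};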

Performance Tips

  1. Limit Scope: Use MaxDepth and MaxPages to avoid overwhelming servers
  2. Respect Crawl-Delay: The crawler automatically honors robots.txt delays
  3. Use AllowedDomains: Restrict crawling to prevent following external links
  4. Monitor Progress: Use SignalR hub to track crawl status in real-time

Requirements

  • Base Package: WebSpark.HttpClientUtility 2.1.2 (versions are kept in lockstep; an exact match is required)
  • .NET Version: .NET 8 (LTS), .NET 9, or .NET 10
  • ASP.NET Core: Required for SignalR features

Migration from v1.x

If you're upgrading from WebSpark.HttpClientUtility v1.x:

  1. Install the crawler package: dotnet add package WebSpark.HttpClientUtility.Crawler
  2. Add using directive: using WebSpark.HttpClientUtility.Crawler;
  3. Add DI registration: services.AddHttpClientCrawler();

All crawler APIs remain unchanged - only the registration is different.

Documentation

License

MIT License - see LICENSE

Support

Compatible and additional computed target framework versions
.NET: net8.0, net9.0, and net10.0 are compatible. The android, browser, ios, maccatalyst, macos, tvos, and windows variants of each were computed.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version   Downloads   Last Updated
2.1.2     194         12/4/2025
2.1.1     284         11/12/2025
2.0.0     188         11/5/2025

2.1.2 - Security patch: Updated to match base package version for lockstep versioning.
Fixed js-yaml and glob vulnerabilities in documentation dependencies. Requires base
package [2.1.2]. Zero breaking changes.
2.1.1 - Version bump to maintain lockstep with base package. Requires base package
[2.1.1]. Zero breaking changes.
2.1.0 - Added .NET 10 (Preview) multi-targeting support. All projects now target net8.0,
net9.0, and net10.0. Requires base package [2.1.0]. All 81 crawler tests passing on all
three frameworks (243 test runs, 0 failures). Zero breaking changes.