WebSpark.HttpClientUtility.Crawler
Web crawling extension for WebSpark.HttpClientUtility.
Overview
This package provides enterprise-grade web crawling capabilities with robots.txt compliance, HTML parsing, sitemap generation, and real-time progress tracking via SignalR.
Important: This is an extension package. You must also install the base package WebSpark.HttpClientUtility. The two packages are versioned in lockstep, so install the matching version (2.1.2 for this release).
Features
- SiteCrawler: Full-featured web crawler with configurable depth and URL filtering
- SimpleSiteCrawler: Lightweight crawler for basic crawling needs
- Robots.txt Compliance: Automatic parsing and enforcement of robots.txt rules
- HTML Parsing: Extract links, images, and metadata using HtmlAgilityPack
- Sitemap Generation: Generate sitemaps in XML and Markdown formats
- CSV Export: Export crawl results to CSV for analysis
- SignalR Progress: Real-time crawl progress updates via SignalR hub
- Performance Tracking: Built-in metrics for crawl operations
Installation
Install both packages:
dotnet add package WebSpark.HttpClientUtility
dotnet add package WebSpark.HttpClientUtility.Crawler
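If you prefer to manage versions in the project file, the equivalent PackageReference entries (using this release's version numbers) are:

```xml
<ItemGroup>
  <PackageReference Include="WebSpark.HttpClientUtility" Version="2.1.2" />
  <PackageReference Include="WebSpark.HttpClientUtility.Crawler" Version="2.1.2" />
</ItemGroup>
```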
Quick Start
1. Register Services
using WebSpark.HttpClientUtility;
using WebSpark.HttpClientUtility.Crawler;
var builder = WebApplication.CreateBuilder(args);
// Register base package (required)
builder.Services.AddHttpClientUtility();
// Register crawler package
builder.Services.AddHttpClientCrawler();
var app = builder.Build();
// Optional: Register SignalR hub for progress updates
app.MapHub<CrawlHub>("/crawlHub");
app.Run();
2. Basic Crawling
public class CrawlerService
{
private readonly ISiteCrawler _crawler;
public CrawlerService(ISiteCrawler crawler)
{
_crawler = crawler;
}
public async Task<CrawlResult> CrawlWebsiteAsync(string url)
{
var options = new CrawlerOptions
{
StartUrl = url,
MaxDepth = 3,
MaxPages = 100,
RespectRobotsTxt = true
};
var result = await _crawler.CrawlAsync(options);
Console.WriteLine($"Crawled {result.TotalPages} pages in {result.Duration}");
return result;
}
}
3. SimpleSiteCrawler
For lightweight crawling without full recursion:
public class SimpleCrawlerService
{
private readonly SimpleSiteCrawler _simpleCrawler;
public SimpleCrawlerService(SimpleSiteCrawler simpleCrawler)
{
_simpleCrawler = simpleCrawler;
}
public async Task<List<string>> GetAllLinksAsync(string url)
{
var result = await _simpleCrawler.CrawlAsync(url);
return result.DiscoveredUrls;
}
}
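Because DiscoveredUrls is an ordinary list, it can be filtered like any other collection. A minimal sketch that keeps only links on the starting host; the GetInternalLinksAsync method is hypothetical (not part of the package) and is meant to be added to the SimpleCrawlerService class above:

```csharp
// Requires using System.Linq.
public async Task<List<string>> GetInternalLinksAsync(string url)
{
    var result = await _simpleCrawler.CrawlAsync(url);
    var host = new Uri(url).Host;

    // Keep only absolute URLs that stay on the starting host.
    return result.DiscoveredUrls
        .Where(u => Uri.TryCreate(u, UriKind.Absolute, out var parsed)
                    && parsed.Host.Equals(host, StringComparison.OrdinalIgnoreCase))
        .ToList();
}
```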
4. Advanced Configuration
builder.Services.AddHttpClientCrawler(options =>
{
options.DefaultMaxDepth = 5;
options.DefaultMaxPages = 500;
options.DefaultTimeout = TimeSpan.FromSeconds(30);
options.UserAgent = "MyBot/1.0";
});
5. CSV Export
var result = await _crawler.CrawlAsync(options);
await result.ExportToCsvAsync("crawl-results.csv");
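The exported file is plain CSV, so it can be read back with any CSV tooling, for example CsvHelper (already a dependency of this package). Since the exact column layout produced by ExportToCsvAsync is not documented here, the sketch below reads each row generically:

```csharp
using System.Globalization;
using CsvHelper;

using var reader = new StreamReader("crawl-results.csv");
using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);

// Read each row as a dictionary of column name -> value,
// making no assumptions about the columns ExportToCsvAsync writes.
foreach (IDictionary<string, object> row in csv.GetRecords<dynamic>())
{
    Console.WriteLine(string.Join(", ", row.Select(kv => $"{kv.Key}={kv.Value}")));
}
```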
6. SignalR Progress Updates
// Client-side JavaScript
const connection = new signalR.HubConnectionBuilder()
.withUrl("/crawlHub")
.build();
connection.on("CrawlProgress", (progress) => {
console.log(`Progress: ${progress.pagesProcessed}/${progress.totalPages}`);
});
await connection.start();
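On the server, you can also push your own messages through the same hub. A minimal sketch, assuming CrawlHub is the hub type mapped above; the CrawlProgressNotifier class is hypothetical (not a type provided by the package) and the payload shape is illustrative only:

```csharp
using Microsoft.AspNetCore.SignalR;

public class CrawlProgressNotifier
{
    private readonly IHubContext<CrawlHub> _hubContext;

    public CrawlProgressNotifier(IHubContext<CrawlHub> hubContext)
    {
        _hubContext = hubContext;
    }

    // Broadcast an illustrative progress payload to all connected clients,
    // using the same "CrawlProgress" event name the JavaScript client listens for.
    public Task ReportAsync(int pagesProcessed, int totalPages) =>
        _hubContext.Clients.All.SendAsync("CrawlProgress", new { pagesProcessed, totalPages });
}
```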
CrawlerOptions
| Property | Type | Default | Description |
|---|---|---|---|
| StartUrl | string | required | The URL to start crawling from |
| MaxDepth | int | 3 | Maximum depth to crawl (0 = no limit) |
| MaxPages | int | 100 | Maximum number of pages to crawl |
| RespectRobotsTxt | bool | true | Honor robots.txt directives |
| Timeout | TimeSpan | 30s | Request timeout per page |
| UserAgent | string | "WebSpark.Crawler" | User agent string |
| AllowedDomains | List<string> | null | Restrict crawling to specific domains |
| ExcludedPaths | List<string> | null | Paths to exclude from crawling |
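For example, a crawl restricted to a single domain that skips administrative paths could be configured like this (all values are illustrative):

```csharp
var options = new CrawlerOptions
{
    StartUrl = "https://example.com",
    MaxDepth = 2,
    MaxPages = 50,
    RespectRobotsTxt = true,
    Timeout = TimeSpan.FromSeconds(15),
    UserAgent = "MyBot/1.0",
    AllowedDomains = new List<string> { "example.com" },    // stay on this domain
    ExcludedPaths = new List<string> { "/admin", "/login" } // never crawl these paths
};
```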
CrawlResult Properties
- TotalPages: Number of pages successfully crawled
- FailedPages: Number of pages that failed to crawl
- Duration: Total time taken for the crawl
- DiscoveredUrls: List of all URLs discovered
- SitemapXml: Generated sitemap in XML format
- SitemapMarkdown: Generated sitemap in Markdown format
- Errors: List of errors encountered during crawl
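Continuing the CrawlerService example above, a short sketch that persists the generated sitemaps and logs failures (file names are illustrative):

```csharp
var result = await _crawler.CrawlAsync(options);

// Persist the generated sitemaps.
await File.WriteAllTextAsync("sitemap.xml", result.SitemapXml);
await File.WriteAllTextAsync("sitemap.md", result.SitemapMarkdown);

Console.WriteLine($"Crawled {result.TotalPages} pages ({result.FailedPages} failed) in {result.Duration}.");

// Log any errors encountered during the crawl.
foreach (var error in result.Errors)
{
    Console.WriteLine($"Error: {error}");
}
```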
Robots.txt Support
The crawler automatically:
- Downloads and parses robots.txt from target sites
- Respects Disallow directives
- Honors Crawl-delay settings
- Supports wildcard patterns
Disable with:
options.RespectRobotsTxt = false; // Not recommended
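For reference, a robots.txt that exercises all three of these features might look like the following (example only; the crawler reads whatever the target site actually serves):

```text
User-agent: *
Disallow: /admin/
Disallow: /*.pdf$
Crawl-delay: 10
```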
Performance Tips
- Limit Scope: Use MaxDepth and MaxPages to avoid overwhelming servers
- Respect Crawl-Delay: The crawler automatically honors robots.txt delays
- Use AllowedDomains: Restrict crawling to prevent following external links
- Monitor Progress: Use SignalR hub to track crawl status in real-time
Requirements
- Base Package: WebSpark.HttpClientUtility 2.1.2 (versions are kept in lockstep with this package)
- .NET Version: .NET 8 LTS, .NET 9, or .NET 10
- ASP.NET Core: Required for SignalR features
Migration from v1.x
If you're upgrading from WebSpark.HttpClientUtility v1.x:
- Install the crawler package: dotnet add package WebSpark.HttpClientUtility.Crawler
- Add the using directive: using WebSpark.HttpClientUtility.Crawler;
- Add the DI registration: services.AddHttpClientCrawler();
All crawler APIs remain unchanged - only the registration is different.
License
MIT License - see LICENSE
| Product | Compatible target frameworks |
|---|---|
| .NET | net8.0, net9.0, and net10.0 are compatible; platform-specific variants (android, browser, ios, maccatalyst, macos, tvos, windows) are computed for each. |
Dependencies (identical for net8.0, net9.0, and net10.0):
- CsvHelper (>= 33.1.0)
- HtmlAgilityPack (>= 1.12.4)
- Markdig (>= 0.44.0)
- WebSpark.HttpClientUtility (>= 2.1.2)
Release Notes
- 2.1.2 - Security patch: Updated to match the base package version for lockstep versioning. Fixed js-yaml and glob vulnerabilities in documentation dependencies. Requires base package [2.1.2]. Zero breaking changes.
- 2.1.1 - Version bump to maintain lockstep with the base package. Requires base package [2.1.1]. Zero breaking changes.
- 2.1.0 - Added .NET 10 (Preview) multi-targeting support. All projects now target net8.0, net9.0, and net10.0. Requires base package [2.1.0]. All 81 crawler tests pass on all three frameworks (243 test runs, 0 failures). Zero breaking changes.