NajsAi.WebExtractor
A powerful .NET 9 library for extracting web content using Playwright automation. NajsAi.WebExtractor provides both raw HTML extraction and intelligent markdown conversion with comprehensive content filtering capabilities.
Features
- Web Content Extraction: Extract content from any URL using Playwright automation
- Raw HTML Support: Get the complete HTML source of web pages
- Markdown Conversion: Convert HTML to clean, readable markdown
- Smart HTML Cleaning: Automatically removes navigation, ads, headers, footers, and other non-content elements
- Highly Configurable: Customizable timeouts, user agents, JavaScript execution, and more
- Async/Await: Fully asynchronous operations for optimal performance
- Comprehensive Error Handling: Specific exceptions for different failure scenarios
- Detailed Results: Rich metadata including titles, final URLs, and extraction timestamps
- Thread-Safe: Safe for concurrent usage
Installation
Prerequisites
- .NET 9: Make sure you have .NET 9 installed
- System Dependencies: Playwright requires certain system libraries. On Ubuntu/Debian:
sudo apt-get install libevent-2.1-7t64 libgstreamer-plugins-bad1.0-0 libavif16
- Playwright Browsers: Install browsers after adding the package:
pwsh bin/Debug/net9.0/playwright.ps1 install
Package Manager Console
Install-Package NajsAi.WebExtractor
.NET CLI
dotnet add package NajsAi.WebExtractor
PackageReference
<PackageReference Include="NajsAi.WebExtractor" Version="1.0.0" />
Quick Start
Basic Usage
using NajsAi.WebExtractor;
using NajsAi.WebExtractor.Models;
// The using declaration disposes the extractor (and its browser resources) at the end of the scope
using var extractor = new WebContentExtractor();
// Extract as markdown (cleaned)
var markdownResult = await extractor.GetMarkdownAsync("https://example.com");
Console.WriteLine($"Title: {markdownResult.Title}");
Console.WriteLine($"Content: {markdownResult.Content}");
// Extract raw HTML
var htmlResult = await extractor.GetRawHtmlAsync("https://example.com");
Console.WriteLine($"HTML Length: {htmlResult.ContentLength}");
Console.WriteLine($"Status Code: {htmlResult.StatusCode}");
Advanced Configuration
var options = new ExtractionOptions
{
    PageLoadTimeoutMs = 30000,
    NetworkTimeoutMs = 30000,
    WaitForFullLoad = true,
    EnableJavaScript = true,
    LoadImages = false,
    UserAgent = "MyBot/1.0",
    AdditionalExcludeSelectors = new[] { ".custom-ad", "#sidebar" }
};
var result = await extractor.GetMarkdownAsync(
    "https://example.com",
    excludeSelectors: new[] { ".comments", ".related-articles" },
    options: options
);
API Reference
WebContentExtractor
The main class for web content extraction.
Methods
GetRawHtmlAsync(string url, ExtractionOptions? options = null)
- Extracts raw HTML content from the specified URL
- Returns: ExtractionResult with raw HTML content
GetMarkdownAsync(string url, string[]? excludeSelectors = null, ExtractionOptions? options = null)
- Extracts and converts web content to cleaned markdown
- Returns: ExtractionResult with markdown content
GetDefaultExcludeSelectors()
- Static method that returns the default CSS selectors excluded during cleaning
- Returns: string[] of selectors
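Before adding custom selectors, it can help to check which ones the defaults already cover. The sketch below uses only the static method listed above; the `wanted` selectors are hypothetical examples, and the case-insensitive comparison is a choice made for this illustration, not something the library is documented to do.

```csharp
using System;
using System.Linq;
using NajsAi.WebExtractor;

// Static call, so no extractor instance (and no browser) is needed.
string[] defaults = WebContentExtractor.GetDefaultExcludeSelectors();
Console.WriteLine($"{defaults.Length} default selectors");

// Hypothetical extra selectors for one particular site.
var wanted = new[] { "nav", ".comments", "#cookie-banner" };

// Keep only the selectors that the defaults do not already contain.
var extras = wanted.Except(defaults, StringComparer.OrdinalIgnoreCase).ToArray();
Console.WriteLine("Not covered by defaults: " + string.Join(", ", extras));
```

The resulting `extras` array could then be passed as the `excludeSelectors` argument of GetMarkdownAsync.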
ExtractionOptions
Configuration options for content extraction.
public class ExtractionOptions
{
    public int PageLoadTimeoutMs { get; set; } = 30000;
    public int NetworkTimeoutMs { get; set; } = 30000;
    public bool WaitForFullLoad { get; set; } = true;
    public string[]? AdditionalExcludeSelectors { get; set; }
    public bool RemoveScriptsAndStyles { get; set; } = true;
    public bool RemoveNavigationElements { get; set; } = true;
    public bool RemoveAdvertisements { get; set; } = true;
    public string? UserAgent { get; set; }
    public bool EnableJavaScript { get; set; } = true;
    public bool LoadImages { get; set; } = false;
}
ExtractionResult
The result of a web content extraction operation.
public class ExtractionResult
{
    public string Url { get; }
    public string Content { get; }
    public ContentType ContentType { get; }
    public DateTime ExtractedAt { get; }
    public string? Title { get; set; }
    public string? FinalUrl { get; set; }
    public int? StatusCode { get; set; }
    public bool IsSuccess { get; }
    public int ContentLength { get; }
}
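One use of this metadata is detecting redirects by comparing Url with FinalUrl. The sketch below is a hypothetical helper, not part of the library. It assumes both values are absolute URLs and treats a null FinalUrl as "no redirect information", which the source does not confirm.

```csharp
using System;

// Hypothetical helper (not part of NajsAi.WebExtractor): reports whether the
// final URL differs from the requested one after normalisation.
static bool WasRedirected(string requestedUrl, string? finalUrl)
{
    // Assumption: a missing FinalUrl means there is nothing to compare.
    if (string.IsNullOrEmpty(finalUrl)) return false;

    // Uri lower-cases the host and adds the root path; GetLeftPart(Query)
    // drops any #fragment, which is never sent to the server.
    var requested = new Uri(requestedUrl).GetLeftPart(UriPartial.Query);
    var final = new Uri(finalUrl).GetLeftPart(UriPartial.Query);
    return !string.Equals(requested, final, StringComparison.Ordinal);
}

Console.WriteLine(WasRedirected("https://example.com", "https://example.com/")); // False
Console.WriteLine(WasRedirected("http://example.com", "https://example.com/"));  // True
```

With an actual result, the call would be `WasRedirected(result.Url, result.FinalUrl)`.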
Exception Types
The library provides specific exceptions for different failure scenarios:
- InvalidUrlException: Thrown when the provided URL is invalid
- NetworkException: Thrown when network-related errors occur
- ContentExtractionException: Thrown when content extraction or processing fails
HTML Cleaning
The library automatically removes common non-content elements:
Default Excluded Elements
- Navigation: nav, header, footer, aside
- Scripts: script, style, noscript
- Common classes: .nav, .navbar, .menu, .advertisement, .ads, etc.
- Common IDs: #nav, #header, #sidebar, #advertisement, etc.
- ARIA roles: [role='banner'], [role='navigation'], etc.
Content Preservation
The cleaner intelligently preserves main content by looking for:
- <main> elements
- <article> elements
- Common content containers: .main-content, .content, .post-content, etc.
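The two-step idea above (remove excluded elements, then prefer a content container) can be sketched with AngleSharp, which the package lists as a dependency. This is an illustration of the general technique only, not the library's internal HtmlCleaner. `CleanHtmlAsync` and the sample selector lists are hypothetical names chosen for this example.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

// Illustrative only: NOT the library's HtmlCleaner.
static async Task<string> CleanHtmlAsync(
    string html, string[] excludeSelectors, string[] contentSelectors)
{
    var context = BrowsingContext.New(Configuration.Default);
    using var document = await context.OpenAsync(req => req.Content(html));

    // Step 1: drop every element that matches an exclude selector.
    foreach (var selector in excludeSelectors)
    {
        foreach (var element in document.QuerySelectorAll(selector).ToList())
        {
            element.Remove();
        }
    }

    // Step 2: return the first content container found, in priority order.
    foreach (var selector in contentSelectors)
    {
        var container = document.QuerySelector(selector);
        if (container is not null) return container.InnerHtml;
    }

    // Fallback: whatever is left of the body.
    return document.Body?.InnerHtml ?? string.Empty;
}

var cleaned = await CleanHtmlAsync(
    "<body><nav>Menu</nav><main><p>Hello</p><div class=\"ads\">Buy</div></main><footer>(c)</footer></body>",
    new[] { "nav", "header", "footer", "aside", "script", "style", ".ads" },
    new[] { "main", "article", ".main-content", ".content", ".post-content" });
Console.WriteLine(cleaned);
```

Removing excluded elements first means a header or ad nested inside the main container is also stripped. The trade-off is that a legitimate header element inside an article is lost as well.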
Building and Testing
Build the Library
cd NajsAi.WebExtractor
dotnet build
Run Tests
cd TestApp
dotnet run
Project Structure
NajsAi.WebExtractor/
├── NajsAi.WebExtractor.csproj        # Project file
├── WebContentExtractor.cs            # Main service class
├── Models/
│   ├── ExtractionOptions.cs          # Configuration options
│   └── ExtractionResult.cs           # Result wrapper with metadata
├── Exceptions/
│   ├── InvalidUrlException.cs        # URL validation errors
│   ├── NetworkException.cs           # Network-related errors
│   └── ContentExtractionException.cs # Extraction/processing errors
└── Internal/
    ├── HtmlCleaner.cs                # HTML sanitization logic
    └── PlaywrightService.cs          # Browser automation service
Dependencies
- Microsoft.Playwright (1.54.0): Web automation and browser control
- Html2Markdown (7.0.7.17): HTML to markdown conversion
- AngleSharp (1.3.0): HTML parsing and DOM manipulation
- Microsoft.Extensions.Logging.Abstractions (9.0.8): Logging support
Performance Considerations
- Resource Management: Always dispose of the WebContentExtractor to free browser resources
- Concurrent Usage: The library is thread-safe for concurrent operations
- Memory Efficient: Automatic cleanup of browser contexts and pages
- Configurable Timeouts: Adjust timeouts based on your use case
- Image Loading: Disabled by default for faster extraction
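Thread safety allows one extractor to be shared, but each page load still costs browser resources, so capping the number of simultaneous extractions is usually sensible. The sketch below shows a generic bounded-concurrency helper built on SemaphoreSlim. `MapBoundedAsync` is a hypothetical name, and the demo uses a stand-in workload so it runs without a browser.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs `work` for every item with at most
// `maxConcurrency` calls in flight, preserving input order in the result.
static async Task<TOut[]> MapBoundedAsync<TIn, TOut>(
    IEnumerable<TIn> items, int maxConcurrency, Func<TIn, Task<TOut>> work)
{
    using var gate = new SemaphoreSlim(maxConcurrency);

    async Task<TOut> RunOne(TIn item)
    {
        await gate.WaitAsync();
        try { return await work(item); }
        finally { gate.Release(); }
    }

    // ToList starts every task eagerly; the semaphore does the throttling.
    var tasks = items.Select(RunOne).ToList();
    return await Task.WhenAll(tasks);
}

// Stand-in workload (URL length) instead of a real page load.
var lengths = await MapBoundedAsync(
    new[] { "https://example.com", "https://example.org" },
    maxConcurrency: 2,
    async url => { await Task.Delay(10); return url.Length; });
Console.WriteLine(string.Join(", ", lengths));
```

With a shared extractor, the work delegate would be something like `url => extractor.GetMarkdownAsync(url)`, assuming that method returns a Task of ExtractionResult as the API reference suggests.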
Error Handling
The library provides comprehensive error handling with specific exception types:
try
{
    var result = await extractor.GetMarkdownAsync("https://invalid-url");
}
catch (InvalidUrlException ex)
{
    // Handle invalid URL format
    Console.WriteLine($"Invalid URL: {ex.Url}");
}
catch (NetworkException ex)
{
    // Handle network connectivity issues
    Console.WriteLine($"Network error: {ex.Message}");
}
catch (ContentExtractionException ex)
{
    // Handle extraction/processing failures
    Console.WriteLine($"Extraction failed: {ex.Message}");
}
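Network failures are often transient, so a caller may want to retry them while letting invalid-URL errors fail immediately. The sketch below is a generic retry helper with exponential backoff. `RetryAsync` is a hypothetical name, not part of the library, and the demo simulates failures with TimeoutException so it runs without the package.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: retries `operation` with exponential backoff while
// `isTransient` says the failure is worth another attempt.
static async Task<T> RetryAsync<T>(
    Func<Task<T>> operation,
    Func<Exception, bool> isTransient,
    int maxAttempts = 3,
    TimeSpan? baseDelay = null)
{
    var delay = baseDelay ?? TimeSpan.FromSeconds(1);
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await operation();
        }
        // On the last attempt the filter is false, so the exception propagates.
        catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
        {
            await Task.Delay(delay);
            delay *= 2; // double the wait before the next attempt
        }
    }
}

// Simulated operation: fails twice, then succeeds.
var calls = 0;
var value = await RetryAsync(
    () => ++calls < 3
        ? Task.FromException<string>(new TimeoutException("simulated"))
        : Task.FromResult("ok"),
    ex => ex is TimeoutException,
    maxAttempts: 5,
    baseDelay: TimeSpan.FromMilliseconds(10));
Console.WriteLine($"{value} after {calls} calls"); // ok after 3 calls
```

For this library, the transient check would plausibly be `ex => ex is NetworkException`, with the operation wrapping a GetMarkdownAsync or GetRawHtmlAsync call.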
Contributing
We welcome contributions! Please see our Contributing Guide for details on how to:
- Report bugs and request features
- Submit pull requests
- Follow our coding standards
- Set up the development environment
Roadmap
- Support for additional output formats (PDF, plain text)
- Custom content extraction rules
- Performance optimizations for batch processing
- Docker support
- Azure Functions integration examples
Support & Community
- Documentation
- Report Issues
- Discussions
- Release Notes
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Microsoft Playwright for reliable browser automation
- AngleSharp for efficient HTML parsing
- Html2Markdown for markdown conversion
- The open-source community for inspiration and support
Product | Versions Compatible and additional computed target framework versions. |
---|---|
.NET | net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net9.0
  - AngleSharp (>= 1.3.0)
  - Html2Markdown (>= 7.0.7.17)
  - Microsoft.Extensions.Logging.Abstractions (>= 9.0.8)
  - Microsoft.Playwright (>= 1.54.0)
Version | Downloads | Last Updated |
---|---|---|
1.0.0 | 60 | 8/23/2025 |
Initial release with HTML and Markdown extraction capabilities.