NajsAi.WebExtractor 1.0.0

NajsAi.WebExtractor

A .NET 9 library that extracts web content through Playwright browser automation. NajsAi.WebExtractor returns either the raw HTML of a page or a cleaned markdown conversion, with configurable filtering of non-content elements.

Features

🌐 Web Content Extraction: Extract content from any URL using Playwright automation
📝 Raw HTML Support: Get the complete HTML source of web pages
📋 Markdown Conversion: Convert HTML to clean, readable markdown
🧹 Smart HTML Cleaning: Automatically removes navigation, ads, headers, footers, and other non-content elements
⚙️ Highly Configurable: Customizable timeouts, user agents, JavaScript execution, and more
🚀 Async/Await: Fully asynchronous operations for optimal performance
🛡️ Comprehensive Error Handling: Specific exceptions for different failure scenarios
📊 Detailed Results: Rich metadata including titles, final URLs, and extraction timestamps
🔧 Thread-Safe: Safe for concurrent usage

Installation

Prerequisites

.NET 9: Make sure you have .NET 9 installed
System Dependencies: Playwright requires certain system libraries. On Ubuntu/Debian:
```
sudo apt-get install libevent-2.1-7t64 libgstreamer-plugins-bad1.0-0 libavif16
```
Playwright Browsers: Install browsers after adding the package:
```
pwsh bin/Debug/net9.0/playwright.ps1 install
```
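Where PowerShell is not available (for example on a minimal CI image), Playwright for .NET exposes the same CLI in-process through `Microsoft.Playwright.Program.Main`. The following is a minimal sketch of that route; the `BrowserInstaller` class and its method names are hypothetical helpers, not part of NajsAi.WebExtractor. The README does not say which browser the library launches, so pass a browser name only if you have confirmed it.

```csharp
using System;

public static class BrowserInstaller
{
    // Builds the CLI arguments. "install" alone downloads every default browser;
    // adding a name such as "chromium" limits the download to that browser.
    public static string[] BuildInstallArgs(string? browser = null) =>
        string.IsNullOrWhiteSpace(browser)
            ? new[] { "install" }
            : new[] { "install", browser.Trim() };

    // Runs the Playwright CLI in-process, like `playwright.ps1 install`.
    public static void EnsureInstalled(string? browser = null)
    {
        int exitCode = Microsoft.Playwright.Program.Main(BuildInstallArgs(browser));
        if (exitCode != 0)
        {
            throw new InvalidOperationException(
                $"Playwright browser install failed with exit code {exitCode}.");
        }
    }
}
```

Calling `BrowserInstaller.EnsureInstalled()` once at application start-up, before creating a `WebContentExtractor`, should have the same effect as the script above.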

Package Manager Console

Install-Package NajsAi.WebExtractor

.NET CLI

dotnet add package NajsAi.WebExtractor

PackageReference

<PackageReference Include="NajsAi.WebExtractor" Version="1.0.0" />

Quick Start

Basic Usage

using NajsAi.WebExtractor;
using NajsAi.WebExtractor.Models;

// The using declaration disposes the extractor, and its browser resources, when it goes out of scope
using var extractor = new WebContentExtractor();

// Extract as markdown (cleaned)
var markdownResult = await extractor.GetMarkdownAsync("https://example.com");
Console.WriteLine($"Title: {markdownResult.Title}");
Console.WriteLine($"Content: {markdownResult.Content}");

// Extract raw HTML
var htmlResult = await extractor.GetRawHtmlAsync("https://example.com");
Console.WriteLine($"HTML Length: {htmlResult.ContentLength}");
Console.WriteLine($"Status Code: {htmlResult.StatusCode}");
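A common next step is writing the extracted markdown to disk. The sketch below assumes that `IsSuccess` is `false` when an extraction did not produce usable content, which the README does not spell out. `MarkdownExport`, `ToFileName`, and `SaveAsync` are hypothetical names introduced for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using NajsAi.WebExtractor;
using NajsAi.WebExtractor.Models;

public static class MarkdownExport
{
    // Replaces characters that are invalid in file names with '_' and
    // falls back to "untitled.md" when the page has no title.
    public static string ToFileName(string? title)
    {
        if (string.IsNullOrWhiteSpace(title)) return "untitled.md";
        char[] invalid = Path.GetInvalidFileNameChars();
        char[] safe = title.Trim()
                           .Select(c => invalid.Contains(c) ? '_' : c)
                           .ToArray();
        return new string(safe) + ".md";
    }

    // Extracts the page as markdown and writes it under the given directory.
    // Returns the written path, or null when the extraction was not successful.
    public static async Task<string?> SaveAsync(
        WebContentExtractor extractor, string url, string directory)
    {
        var result = await extractor.GetMarkdownAsync(url);
        if (!result.IsSuccess) return null;

        Directory.CreateDirectory(directory);
        string path = Path.Combine(directory, ToFileName(result.Title));
        await File.WriteAllTextAsync(path, result.Content);
        return path;
    }
}
```

Two pages with the same title would overwrite each other here; appending a hash of the URL to the file name is one way around that.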

Advanced Configuration

var options = new ExtractionOptions
{
    PageLoadTimeoutMs = 30000,
    NetworkTimeoutMs = 30000,
    WaitForFullLoad = true,
    EnableJavaScript = true,
    LoadImages = false,
    UserAgent = "MyBot/1.0",
    AdditionalExcludeSelectors = new[] { ".custom-ad", "#sidebar" }
};

var result = await extractor.GetMarkdownAsync(
    "https://example.com", 
    excludeSelectors: new[] { ".comments", ".related-articles" },
    options: options
);

API Reference

WebContentExtractor

The main class for web content extraction.

Methods

GetRawHtmlAsync(string url, ExtractionOptions? options = null)
- Extracts raw HTML content from the specified URL
- Returns: ExtractionResult with raw HTML content
GetMarkdownAsync(string url, string[]? excludeSelectors = null, ExtractionOptions? options = null)
- Extracts and converts web content to cleaned markdown
- Returns: ExtractionResult with markdown content
GetDefaultExcludeSelectors()
- Static method that returns the default CSS selectors excluded during cleaning
- Returns: string[] of selectors
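The README does not state whether the `excludeSelectors` argument replaces or extends the defaults. One way to stay safe in either case is to merge the default list with your own before passing it in. The helper below is a self-contained sketch; `SelectorLists.Merge` is a hypothetical name, not a library method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SelectorLists
{
    // Combines two selector lists, trimming whitespace and dropping
    // blank entries and exact (case-sensitive) duplicates.
    public static string[] Merge(IEnumerable<string> defaults, IEnumerable<string> extras) =>
        defaults.Concat(extras)
                .Select(s => s.Trim())
                .Where(s => s.Length > 0)
                .Distinct(StringComparer.Ordinal)
                .ToArray();
}
```

With the library referenced, the call would look like `SelectorLists.Merge(WebContentExtractor.GetDefaultExcludeSelectors(), new[] { ".comments" })`, and the result can be passed as `excludeSelectors` or logged to see what will be stripped.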

ExtractionOptions

Configuration options for content extraction.

public class ExtractionOptions
{
    public int PageLoadTimeoutMs { get; set; } = 30000;
    public int NetworkTimeoutMs { get; set; } = 30000;
    public bool WaitForFullLoad { get; set; } = true;
    public string[]? AdditionalExcludeSelectors { get; set; }
    public bool RemoveScriptsAndStyles { get; set; } = true;
    public bool RemoveNavigationElements { get; set; } = true;
    public bool RemoveAdvertisements { get; set; } = true;
    public string? UserAgent { get; set; }
    public bool EnableJavaScript { get; set; } = true;
    public bool LoadImages { get; set; } = false;
}
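Because every option has a default, it can help to keep a few named presets rather than repeating object initializers. The sketch below uses only the properties listed above; the preset names and the specific timeout values are assumptions chosen for illustration, not recommendations from the library.

```csharp
using NajsAi.WebExtractor.Models;

public static class ExtractionPresets
{
    // For server-rendered pages: skip JavaScript and do not wait for a
    // full load, with shorter timeouts (10 s is an arbitrary choice).
    public static ExtractionOptions StaticPage() => new ExtractionOptions
    {
        EnableJavaScript = false,
        WaitForFullLoad = false,
        PageLoadTimeoutMs = 10_000,
        NetworkTimeoutMs = 10_000
    };

    // For slow, script-heavy pages: keep JavaScript on and allow more time
    // (60 s is an arbitrary choice).
    public static ExtractionOptions SlowDynamicPage() => new ExtractionOptions
    {
        EnableJavaScript = true,
        WaitForFullLoad = true,
        PageLoadTimeoutMs = 60_000,
        NetworkTimeoutMs = 60_000
    };
}
```

A preset is then passed as `options: ExtractionPresets.StaticPage()` to either extraction method. Whether disabling JavaScript is appropriate depends on how the target page renders its content.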

ExtractionResult

The result of a web content extraction operation.

public class ExtractionResult
{
    public string Url { get; }
    public string Content { get; }
    public ContentType ContentType { get; }
    public DateTime ExtractedAt { get; }
    public string? Title { get; set; }
    public string? FinalUrl { get; set; }
    public int? StatusCode { get; set; }
    public bool IsSuccess { get; }
    public int ContentLength { get; }
}
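`FinalUrl` and `StatusCode` are nullable, so callers need to handle missing values. The following self-contained sketch shows one way to detect a redirect and to classify the status code; `ResultChecks` and its methods are hypothetical helpers, and the README does not define exactly when the library leaves these properties unset.

```csharp
using System;

public static class ResultChecks
{
    // True when the final URL differs from the requested one, ignoring
    // fragments and trivial differences such as a missing trailing "/"
    // on the host. A missing final URL is treated as "no redirect".
    public static bool WasRedirected(string requestedUrl, string? finalUrl)
    {
        if (string.IsNullOrEmpty(finalUrl)) return false;

        if (Uri.TryCreate(requestedUrl, UriKind.Absolute, out Uri? requested) &&
            Uri.TryCreate(finalUrl, UriKind.Absolute, out Uri? landed))
        {
            // GetLeftPart(Query) keeps scheme, authority, path and query.
            return !string.Equals(
                requested.GetLeftPart(UriPartial.Query),
                landed.GetLeftPart(UriPartial.Query),
                StringComparison.Ordinal);
        }

        // Fall back to a plain comparison when either value is not absolute.
        return !string.Equals(requestedUrl, finalUrl, StringComparison.Ordinal);
    }

    // True only for a known 2xx status code; null counts as not successful.
    public static bool IsHttpSuccess(int? statusCode) => statusCode is >= 200 and < 300;
}
```

With a result in hand, the checks read as `ResultChecks.WasRedirected(result.Url, result.FinalUrl)` and `ResultChecks.IsHttpSuccess(result.StatusCode)`.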

Exception Types

The library provides specific exceptions for different failure scenarios:

InvalidUrlException: Thrown when the provided URL is invalid
NetworkException: Thrown when network-related errors occur
ContentExtractionException: Thrown when content extraction or processing fails

HTML Cleaning

The library automatically removes common non-content elements:

Default Excluded Elements

Navigation: nav, header, footer, aside
Scripts: script, style, noscript
Common classes: .nav, .navbar, .menu, .advertisement, .ads, etc.
Common IDs: #nav, #header, #sidebar, #advertisement, etc.
ARIA roles: [role='banner'], [role='navigation'], etc.

Content Preservation

The cleaner intelligently preserves main content by looking for:

<main> elements
<article> elements
Common content containers: .main-content, .content, .post-content, etc.
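To make the two steps above concrete, here is a rough sketch of selector-based cleaning written directly against AngleSharp, one of the library's listed dependencies. It is not the library's internal `HtmlCleaner`, whose implementation the README does not show; `CleaningSketch.Clean` and its parameters are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

public static class CleaningSketch
{
    // Removes every element matching an exclude selector, then returns the
    // inner HTML of the first content container found, or of <body> if
    // none of the content selectors match.
    public static string Clean(
        string html,
        IEnumerable<string> excludeSelectors,
        IEnumerable<string> contentSelectors)
    {
        var document = new HtmlParser().ParseDocument(html);

        foreach (string selector in excludeSelectors)
        {
            // Copy the matches first so removal does not disturb iteration.
            foreach (IElement element in document.QuerySelectorAll(selector).ToList())
            {
                element.Remove();
            }
        }

        foreach (string selector in contentSelectors)
        {
            IElement? container = document.QuerySelector(selector);
            if (container is not null) return container.InnerHtml;
        }

        return document.Body?.InnerHtml ?? string.Empty;
    }
}
```

For example, calling `Clean` with exclude selectors `nav` and `.ads` and content selectors `main` and `article` keeps only what remains inside the first `<main>` element.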

Building and Testing

Build the Library

cd NajsAi.WebExtractor
dotnet build

Run Tests

cd TestApp
dotnet run

Project Structure

NajsAi.WebExtractor/
├── NajsAi.WebExtractor.csproj        # Project file
├── WebContentExtractor.cs            # Main service class
├── Models/
│   ├── ExtractionOptions.cs          # Configuration options
│   └── ExtractionResult.cs           # Result wrapper with metadata
├── Exceptions/
│   ├── InvalidUrlException.cs        # URL validation errors
│   ├── NetworkException.cs           # Network-related errors
│   └── ContentExtractionException.cs # Extraction/processing errors
└── Internal/
    ├── HtmlCleaner.cs                # HTML sanitization logic
    └── PlaywrightService.cs          # Browser automation service

Dependencies

Microsoft.Playwright (1.54.0): Web automation and browser control
Html2Markdown (7.0.7.17): HTML to markdown conversion
AngleSharp (1.3.0): HTML parsing and DOM manipulation
Microsoft.Extensions.Logging.Abstractions (9.0.8): Logging support

Performance Considerations

Resource Management: Always dispose of the WebContentExtractor to free browser resources
Concurrent Usage: The library is thread-safe for concurrent operations
Memory Efficient: Automatic cleanup of browser contexts and pages
Configurable Timeouts: Adjust timeouts based on your use case
Image Loading: Disabled by default for faster extraction
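Thread safety does not by itself limit how many pages are open at once, so for batches it may be worth capping concurrency. The helper below is a generic, self-contained sketch built on `SemaphoreSlim`; `Throttled.MapAsync` is a hypothetical name and nothing in it is specific to this library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttled
{
    // Runs `work` for every item with at most `maxConcurrency` operations
    // in flight. Results are returned in the same order as the input.
    public static async Task<TResult[]> MapAsync<TSource, TResult>(
        IEnumerable<TSource> items,
        int maxConcurrency,
        Func<TSource, Task<TResult>> work)
    {
        if (maxConcurrency < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrency));

        using var gate = new SemaphoreSlim(maxConcurrency);

        async Task<TResult> RunAsync(TSource item)
        {
            await gate.WaitAsync();
            try { return await work(item); }
            finally { gate.Release(); }
        }

        return await Task.WhenAll(items.Select(item => RunAsync(item)));
    }
}
```

Sharing one extractor across a batch would then look like `await Throttled.MapAsync(urls, 4, url => extractor.GetMarkdownAsync(url))`. The limit of 4 is an arbitrary starting point; a suitable value depends on available memory and on the target sites.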

Error Handling

The library provides comprehensive error handling with specific exception types:

try
{
    var result = await extractor.GetMarkdownAsync("not a valid url");
}
catch (InvalidUrlException ex)
{
    // Handle invalid URL format
    Console.WriteLine($"Invalid URL: {ex.Url}");
}
catch (NetworkException ex)
{
    // Handle network connectivity issues
    Console.WriteLine($"Network error: {ex.Message}");
}
catch (ContentExtractionException ex)
{
    // Handle extraction/processing failures
    Console.WriteLine($"Extraction failed: {ex.Message}");
}
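Specific exception types also make selective retries possible. The sketch below is a generic retry helper with exponential backoff; `Retry.WithBackoffAsync` is a hypothetical name, and treating `NetworkException` as transient is an assumption, since the README does not say which failures are worth retrying.

```csharp
using System;
using System.Threading.Tasks;

public static class Retry
{
    // Runs `operation`, retrying while `isTransient` returns true for the
    // thrown exception, up to `maxAttempts` attempts in total. The delay
    // doubles after each failed attempt. The last failure is rethrown.
    public static async Task<T> WithBackoffAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 3,
        TimeSpan? initialDelay = null)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        TimeSpan delay = initialDelay ?? TimeSpan.FromSeconds(1);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                await Task.Delay(delay);
                delay *= 2;
            }
        }
    }
}
```

A call such as `await Retry.WithBackoffAsync(() => extractor.GetMarkdownAsync(url), ex => ex is NetworkException)` retries only network failures, while `InvalidUrlException` and `ContentExtractionException` propagate immediately.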

Contributing

We welcome contributions! Please see our Contributing Guide for details on how to:

Report bugs and request features
Submit pull requests
Follow our coding standards
Set up the development environment

Roadmap

Support for additional output formats (PDF, plain text)
Custom content extraction rules
Performance optimizations for batch processing
Docker support
Azure Functions integration examples

Support & Community

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Microsoft Playwright for reliable browser automation
AngleSharp for efficient HTML parsing
Html2Markdown for markdown conversion
The open-source community for inspiration and support

Product	Compatible and additional computed target framework versions.
.NET	net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

net9.0
- AngleSharp (>= 1.3.0)
- Html2Markdown (>= 7.0.7.17)
- Microsoft.Extensions.Logging.Abstractions (>= 9.0.8)
- Microsoft.Playwright (>= 1.54.0)

Version	Downloads	Last Updated
1.0.0	60	8/23/2025

Initial release with HTML and Markdown extraction capabilities.