NajsAi.WebExtractor 1.0.0

NajsAi.WebExtractor

A .NET 9 library for extracting web content through Playwright browser automation. NajsAi.WebExtractor returns either the raw HTML of a page or a cleaned markdown conversion, with configurable filtering of non-content elements.

Features

  • 🌐 Web Content Extraction: Extract content from any URL using Playwright automation
  • 📝 Raw HTML Support: Get the complete HTML source of web pages
  • 📋 Markdown Conversion: Convert HTML to clean, readable markdown
  • 🧹 Smart HTML Cleaning: Automatically removes navigation, ads, headers, footers, and other non-content elements
  • ⚙️ Highly Configurable: Customizable timeouts, user agents, JavaScript execution, and more
  • 🚀 Async/Await: Fully asynchronous operations for optimal performance
  • 🛡️ Comprehensive Error Handling: Specific exceptions for different failure scenarios
  • 📊 Detailed Results: Rich metadata including titles, final URLs, and extraction timestamps
  • 🔧 Thread-Safe: Safe for concurrent usage

Installation

Prerequisites

  1. .NET 9: Make sure you have .NET 9 installed
  2. System Dependencies: Playwright requires certain system libraries. On Ubuntu/Debian:
    sudo apt-get install libevent-2.1-7t64 libgstreamer-plugins-bad1.0-0 libavif16
    
  3. Playwright Browsers: After adding the package, build the project so that the Playwright script is generated in the output folder, then install the browsers:
    pwsh bin/Debug/net9.0/playwright.ps1 install
    

Package Manager Console

Install-Package NajsAi.WebExtractor

.NET CLI

dotnet add package NajsAi.WebExtractor

PackageReference

<PackageReference Include="NajsAi.WebExtractor" Version="1.0.0" />

Quick Start

Basic Usage

using NajsAi.WebExtractor;
using NajsAi.WebExtractor.Models;

// Using statement ensures proper disposal of browser resources
using var extractor = new WebContentExtractor();

// Extract as markdown (cleaned)
var markdownResult = await extractor.GetMarkdownAsync("https://example.com");
Console.WriteLine($"Title: {markdownResult.Title}");
Console.WriteLine($"Content: {markdownResult.Content}");

// Extract raw HTML
var htmlResult = await extractor.GetRawHtmlAsync("https://example.com");
Console.WriteLine($"HTML Length: {htmlResult.ContentLength}");
Console.WriteLine($"Status Code: {htmlResult.StatusCode}");

Advanced Configuration

var options = new ExtractionOptions
{
    PageLoadTimeoutMs = 30000,
    NetworkTimeoutMs = 30000,
    WaitForFullLoad = true,
    EnableJavaScript = true,
    LoadImages = false,
    UserAgent = "MyBot/1.0",
    AdditionalExcludeSelectors = new[] { ".custom-ad", "#sidebar" }
};

var result = await extractor.GetMarkdownAsync(
    "https://example.com", 
    excludeSelectors: new[] { ".comments", ".related-articles" },
    options: options
);

API Reference

WebContentExtractor

The main class for web content extraction.

Methods
  • GetRawHtmlAsync(string url, ExtractionOptions? options = null)

    • Extracts raw HTML content from the specified URL
    • Returns: ExtractionResult with raw HTML content
  • GetMarkdownAsync(string url, string[]? excludeSelectors = null, ExtractionOptions? options = null)

    • Extracts and converts web content to cleaned markdown
    • Returns: ExtractionResult with markdown content
  • GetDefaultExcludeSelectors()

    • Static method that returns the default CSS selectors excluded during cleaning
    • Returns: string[] of selectors
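
As a minimal sketch of how the static helper might be combined with per-call exclusions: the snippet below lists some of the defaults and passes only those candidate selectors that are not already covered. The candidate selectors are hypothetical, and whether `excludeSelectors` is applied in addition to the defaults (rather than replacing them) is an assumption based on the parameter descriptions above.

```csharp
using System;
using System.Linq;
using NajsAi.WebExtractor;

// Inspect the selectors the cleaner removes by default.
string[] defaults = WebContentExtractor.GetDefaultExcludeSelectors();
Console.WriteLine($"{defaults.Length} default selectors, for example:");
foreach (var selector in defaults.Take(10))
{
    Console.WriteLine($"  {selector}");
}

// Hypothetical site-specific selectors; keep only those not already in the defaults.
string[] candidates = { "#sidebar", ".newsletter-signup" };
string[] extra = candidates.Except(defaults).ToArray();

using var extractor = new WebContentExtractor();
var result = await extractor.GetMarkdownAsync("https://example.com", excludeSelectors: extra);
Console.WriteLine(result.Content);
```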

ExtractionOptions

Configuration options for content extraction.

public class ExtractionOptions
{
    public int PageLoadTimeoutMs { get; set; } = 30000;
    public int NetworkTimeoutMs { get; set; } = 30000;
    public bool WaitForFullLoad { get; set; } = true;
    public string[]? AdditionalExcludeSelectors { get; set; }
    public bool RemoveScriptsAndStyles { get; set; } = true;
    public bool RemoveNavigationElements { get; set; } = true;
    public bool RemoveAdvertisements { get; set; } = true;
    public string? UserAgent { get; set; }
    public bool EnableJavaScript { get; set; } = true;
    public bool LoadImages { get; set; } = false;
}
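
The cleaning toggles and loading flags can be combined into presets. The following is a sketch only: the property names come from the class above, but the effect described in each comment is inferred from the property name rather than from documented behavior, and the timeout values are arbitrary.

```csharp
using System;
using NajsAi.WebExtractor;
using NajsAi.WebExtractor.Models;

// Preset for static pages: presumably skips script execution and waiting for full load.
var staticPageOptions = new ExtractionOptions
{
    EnableJavaScript = false,
    WaitForFullLoad = false,
    PageLoadTimeoutMs = 10000,
    NetworkTimeoutMs = 10000
};

// Preset that presumably keeps navigation elements, e.g. to inspect a site's menu links.
var keepNavigationOptions = new ExtractionOptions
{
    RemoveNavigationElements = false
};

using var extractor = new WebContentExtractor();
var fast = await extractor.GetMarkdownAsync("https://example.com", options: staticPageOptions);
var withNav = await extractor.GetMarkdownAsync("https://example.com", options: keepNavigationOptions);
Console.WriteLine($"Static preset: {fast.ContentLength}, with navigation: {withNav.ContentLength}");
```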

ExtractionResult

The result of a web content extraction operation.

public class ExtractionResult
{
    public string Url { get; }
    public string Content { get; }
    public ContentType ContentType { get; }
    public DateTime ExtractedAt { get; }
    public string? Title { get; set; }
    public string? FinalUrl { get; set; }
    public int? StatusCode { get; set; }
    public bool IsSuccess { get; }
    public int ContentLength { get; }
}
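
A short sketch of how the metadata on a result might be inspected. It uses only the properties listed above; how `IsSuccess` is computed is not documented here, so the check is illustrative.

```csharp
using System;
using NajsAi.WebExtractor;
using NajsAi.WebExtractor.Models;

using var extractor = new WebContentExtractor();
var result = await extractor.GetRawHtmlAsync("https://example.com");

if (!result.IsSuccess)
{
    // StatusCode is nullable, so fall back to a placeholder when it is missing.
    var status = result.StatusCode?.ToString() ?? "unknown";
    Console.WriteLine($"Extraction of {result.Url} did not succeed (status: {status})");
    return;
}

// FinalUrl is nullable; treat a missing value as "no redirect".
var finalUrl = result.FinalUrl ?? result.Url;
if (!string.Equals(finalUrl, result.Url, StringComparison.OrdinalIgnoreCase))
{
    Console.WriteLine($"Redirected from {result.Url} to {finalUrl}");
}

Console.WriteLine($"Title:        {result.Title ?? "(none)"}");
Console.WriteLine($"Content type: {result.ContentType}");
Console.WriteLine($"Length:       {result.ContentLength}");
Console.WriteLine($"Extracted at: {result.ExtractedAt:O}");
```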

Exception Types

The library provides specific exceptions for different failure scenarios:

  • InvalidUrlException: Thrown when the provided URL is invalid
  • NetworkException: Thrown when network-related errors occur
  • ContentExtractionException: Thrown when content extraction or processing fails

HTML Cleaning

The library automatically removes common non-content elements:

Default Excluded Elements

  • Navigation: nav, header, footer, aside
  • Scripts: script, style, noscript
  • Common classes: .nav, .navbar, .menu, .advertisement, .ads, etc.
  • Common IDs: #nav, #header, #sidebar, #advertisement, etc.
  • ARIA roles: [role='banner'], [role='navigation'], etc.

Content Preservation

The cleaner intelligently preserves main content by looking for:

  • <main> elements
  • <article> elements
  • Common content containers: .main-content, .content, .post-content, etc.
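
To illustrate the general technique (prefer a content container, then strip excluded elements), here is a standalone sketch using AngleSharp directly. It is not the library's internal `HtmlCleaner`; the selector lists are shortened samples and the HTML is made up for the example.

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser;

string html = """
<html><body>
  <nav>Home | About</nav>
  <main><h1>Title</h1><aside>Related links</aside><p>Body text.</p></main>
  <div class="ads">Buy now</div>
  <footer>Copyright</footer>
</body></html>
""";

// Shortened sample lists; the library's actual defaults are longer.
string[] excludeSelectors = { "nav", "header", "footer", "aside", ".ads" };
string[] contentSelectors = { "main", "article", ".main-content", ".content" };

var document = new HtmlParser().ParseDocument(html);

// Prefer the first matching content container; fall back to <body>.
var root = contentSelectors
    .Select(selector => document.QuerySelector(selector))
    .FirstOrDefault(element => element is not null) ?? document.Body;

// Remove every excluded element inside the chosen root.
foreach (var selector in excludeSelectors)
{
    foreach (var element in root!.QuerySelectorAll(selector).ToList())
    {
        element.Remove();
    }
}

Console.WriteLine(root!.InnerHtml.Trim());
```

Selecting the container first means that elements outside it (the `nav`, `.ads`, and `footer` in this sample) never need to be matched at all.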

Building and Testing

Build the Library

cd NajsAi.WebExtractor
dotnet build

Run Tests

cd TestApp
dotnet run

Project Structure

NajsAi.WebExtractor/
├── NajsAi.WebExtractor.csproj        # Project file
├── WebContentExtractor.cs            # Main service class
├── Models/
│   ├── ExtractionOptions.cs          # Configuration options
│   └── ExtractionResult.cs           # Result wrapper with metadata
├── Exceptions/
│   ├── InvalidUrlException.cs        # URL validation errors
│   ├── NetworkException.cs           # Network-related errors
│   └── ContentExtractionException.cs # Extraction/processing errors
└── Internal/
    ├── HtmlCleaner.cs                # HTML sanitization logic
    └── PlaywrightService.cs          # Browser automation service

Dependencies

  • Microsoft.Playwright (1.54.0): Web automation and browser control
  • Html2Markdown (7.0.7.17): HTML to markdown conversion
  • AngleSharp (1.3.0): HTML parsing and DOM manipulation
  • Microsoft.Extensions.Logging.Abstractions (9.0.8): Logging support

Performance Considerations

  • Resource Management: Always dispose of the WebContentExtractor to free browser resources
  • Concurrent Usage: The library is thread-safe for concurrent operations
  • Memory Efficient: Automatic cleanup of browser contexts and pages
  • Configurable Timeouts: Adjust timeouts based on your use case
  • Image Loading: Disabled by default for faster extraction
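
Since the extractor is described as thread-safe, one instance can be shared across concurrent requests. The sketch below caps the number of pages in flight with a `SemaphoreSlim`; the limit of 2 and the URLs are arbitrary choices for illustration, not recommendations from the library.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using NajsAi.WebExtractor;

string[] urls = { "https://example.com", "https://example.org", "https://example.net" };

// One shared extractor, disposed once all work has finished.
using var extractor = new WebContentExtractor();
using var gate = new SemaphoreSlim(2); // at most 2 extractions at a time

var tasks = urls.Select(async url =>
{
    await gate.WaitAsync();
    try
    {
        var result = await extractor.GetMarkdownAsync(url);
        return (url, length: result.ContentLength);
    }
    finally
    {
        gate.Release();
    }
});

// Task.WhenAll rethrows the first failure; wrap in try/catch if partial results are acceptable.
var results = await Task.WhenAll(tasks);
foreach (var item in results)
{
    Console.WriteLine($"{item.url}: {item.length}");
}
```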

Error Handling

The library provides comprehensive error handling with specific exception types:

try
{
    var result = await extractor.GetMarkdownAsync("https://invalid-url");
}
catch (InvalidUrlException ex)
{
    // Handle invalid URL format
    Console.WriteLine($"Invalid URL: {ex.Url}");
}
catch (NetworkException ex)
{
    // Handle network connectivity issues
    Console.WriteLine($"Network error: {ex.Message}");
}
catch (ContentExtractionException ex)
{
    // Handle extraction/processing failures
    Console.WriteLine($"Extraction failed: {ex.Message}");
}
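
Transient network failures are often worth retrying. The helper below is a generic sketch with exponential backoff and is not part of the library; `RetryAsync`, its parameters, and the delay values are all hypothetical. The commented usage assumes that `GetMarkdownAsync` returns an awaitable `ExtractionResult` and that `NetworkException` is in scope.

```csharp
using System;
using System.Threading.Tasks;

// Retries the operation when it throws TException, doubling the delay each time.
// The final attempt's exception is not caught and propagates to the caller.
static async Task<T> RetryAsync<T, TException>(
    Func<Task<T>> operation, int maxAttempts = 3, int baseDelayMs = 500)
    where TException : Exception
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (TException) when (attempt < maxAttempts)
        {
            // Backoff: baseDelayMs, then 2x, then 4x, ...
            await Task.Delay(baseDelayMs * (1 << (attempt - 1)));
        }
    }
}

// Usage with the extractor (requires the package and its namespaces):
// var result = await RetryAsync<ExtractionResult, NetworkException>(
//     async () => await extractor.GetMarkdownAsync("https://example.com"));
```

Retrying only on `NetworkException` leaves `InvalidUrlException` and `ContentExtractionException` to fail fast, since repeating those calls is unlikely to change the outcome.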

Contributing

We welcome contributions! Please see our Contributing Guide for details on how to:

  • Report bugs and request features
  • Submit pull requests
  • Follow our coding standards
  • Set up the development environment

Roadmap

  • Support for additional output formats (PDF, plain text)
  • Custom content extraction rules
  • Performance optimizations for batch processing
  • Docker support
  • Azure Functions integration examples

License

This project is licensed under the MIT License - see the LICENSE file for details.

Target Frameworks

net9.0 is compatible. The following target frameworks were computed as compatible: net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows.

Version Downloads Last Updated
1.0.0 60 8/23/2025

Initial release with HTML and Markdown extraction capabilities.