CocoCrawler 0.1.2
dotnet add package CocoCrawler --version 0.1.2
NuGet\Install-Package CocoCrawler -Version 0.1.2
<PackageReference Include="CocoCrawler" Version="0.1.2" />
paket add CocoCrawler --version 0.1.2
#r "nuget: CocoCrawler, 0.1.2"
// Install CocoCrawler as a Cake Addin
#addin nuget:?package=CocoCrawler&version=0.1.2
// Install CocoCrawler as a Cake Tool
#tool nuget:?package=CocoCrawler&version=0.1.2
CocoCrawler
Overview
CocoCrawler is an easy-to-use web crawler, scraper, and parser in C#. By combining PuppeteerSharp and AngleSharp, it brings together the best of both libraries into a single, simple API.
It provides a simple API to get started:
var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://old.reddit.com/r/csharp", pageOptions => pageOptions
        .ExtractList(containersSelector: "div.thing.link.self", [
            new("Title", "a.title"),
            new("Upvotes", "div.score.unvoted"),
            new("Datetime", "time", "datetime"),
            new("Total Comments", "a.comments"),
            new("Url", "a.title", "href")
        ])
        .AddPagination("span.next-button > a", newPage => newPage.ScrollToEnd())
        .AddOutputToConsole()
        .AddOutputToCsvFile("results.csv"))
    .ConfigureEngine(options =>
    {
        options.UseHeadlessMode(false);
        options.WithLoggerFactory(loggerFactory);
    })
    .BuildAsync(cancellationToken);

await crawlerEngine.RunAsync(cancellationToken);
This example starts at https://old.reddit.com/r/csharp, scrapes all the posts on the page, then follows the pagination link to the next page and repeats.
With this library it's easy to
- Scrape Single Page Apps
- Scrape Listings
- Add pagination
- As an alternative to list extraction, open each post, scrape its page, and continue with pagination
- Scrape multiple pages in parallel
- Add custom outputs
- Customize Everything
Scraping pages
Each Page added (a Page is a single-URL job) can have Tasks attached. For each Page it's possible to call:
.ExtractObject(...)
.ExtractList(...)
.OpenLinks(...)
.AddPagination(...)
It's possible to add multiple pages to scrape with the same Tasks.
var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPages(["https://old.reddit.com/r/csharp", "https://old.reddit.com/r/dotnet"], pageOptions => pageOptions
        .OpenLinks("div.thing.link.self a.bylink.comments", subPageOptions =>
        {
            subPageOptions.ExtractObject([
                new("Title", "div.sitetable.linklisting a.title"),
                new("Url", "div.sitetable.linklisting a.title", "href"),
                new("Upvotes", "div.sitetable.linklisting div.score.unvoted"),
                new("Top comment", "div.commentarea div.entry.unvoted div.md"),
            ]);
            subPageOptions.ConfigurePageActions(ops =>
            {
                ops.ScrollToEnd();
                ops.Wait(4000);
            });
        })
        .AddPagination("span.next-button > a")
        .AddOutputToConsole()
        .AddOutputToCsvFile("results.csv"))
    .BuildAsync(cancellationToken);

await crawlerEngine.RunAsync(cancellationToken);
This example starts at https://old.reddit.com/r/csharp
and https://old.reddit.com/r/dotnet
and opens each post, scraping the title, URL, upvotes, and top comment. On each post it scrolls to the end of the page and waits 4 seconds before scraping. It then continues to the next listing page.
Configuring the Engine
The engine can be configured with the following options:
- UseHeadlessMode(bool headless): whether the browser runs headless
- WithLoggerFactory(ILoggerFactory loggerFactory): the logger factory to use
- WithUserAgent(string userAgent): the user agent to use
- WithCookies(params Cookie[] cookies): the cookies to use
- TotalPagesToCrawl(int total): the total number of pages to crawl
- WithParallelismDegree(int parallelismDegree): the number of pages to crawl in parallel
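Putting the options above together, a configuration might look like the following sketch. The user-agent string, page limit, and parallelism degree are illustrative values, not library defaults:

```csharp
.ConfigureEngine(options =>
{
    options.UseHeadlessMode(true);             // run the browser without a visible UI
    options.WithUserAgent("my-crawler/1.0");   // illustrative user-agent string
    options.TotalPagesToCrawl(100);            // stop after 100 pages total
    options.WithParallelismDegree(4);          // crawl up to 4 pages concurrently
})
```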
Stopping the engine
The engine stops when either:
- The total number of pages to crawl is reached, or
- 2 minutes have passed since the last job was added.
Extensibility
The library is designed to be extensible. It's possible to add custom IParser
, IScheduler
and ICrawler
implementations.
Using the engine builder, it's possible to plug in custom implementations:
.ConfigureEngine(options =>
{
    options.WithCrawler(new MyCustomCrawler());
    options.WithScheduler(new MyCustomScheduler());
    options.WithParser(new MyCustomParser());
})
Custom Outputs
It's possible to add custom outputs by implementing the ICrawlOutput interface. ICrawlOutput.WriteAsync(JObject jObject, CancellationToken cancellationToken) is called once for each scraped object.
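As a sketch, a custom output that appends each scraped object as one JSON line to a file could look like this. It assumes ICrawlOutput exposes only the WriteAsync method shown above (the interface may define additional members, e.g. for initialization, not documented here), and the JsonLinesOutput class name is illustrative:

```csharp
using Newtonsoft.Json.Linq;

// Sketch: write each scraped object as a single JSON line (JSONL format).
public class JsonLinesOutput : ICrawlOutput
{
    private readonly string _path;

    public JsonLinesOutput(string path) => _path = path;

    public async Task WriteAsync(JObject jObject, CancellationToken cancellationToken)
    {
        // Formatting.None keeps each object on one line, one record per line.
        await File.AppendAllTextAsync(
            _path,
            jObject.ToString(Newtonsoft.Json.Formatting.None) + Environment.NewLine,
            cancellationToken);
    }
}
```

The output would then be registered alongside the built-in ones, in place of AddOutputToCsvFile, wherever the builder accepts custom outputs.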
Product | Compatible and computed target frameworks
---|---
.NET | net8.0 is compatible. net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, and net8.0-windows were computed.
Dependencies (net8.0)
- AngleSharp (>= 1.1.2)
- PuppeteerSharp (>= 18.0.2)