CocoCrawler 0.1.2
dotnet add package CocoCrawler --version 0.1.2
NuGet\Install-Package CocoCrawler -Version 0.1.2
<PackageReference Include="CocoCrawler" Version="0.1.2" />
paket add CocoCrawler --version 0.1.2
#r "nuget: CocoCrawler, 0.1.2"
// Install CocoCrawler as a Cake Addin
#addin nuget:?package=CocoCrawler&version=0.1.2
// Install CocoCrawler as a Cake Tool
#tool nuget:?package=CocoCrawler&version=0.1.2
CocoCrawler
Overview
CocoCrawler is an easy-to-use web crawler, scraper, and parser in C#. By combining PuppeteerSharp and AngleSharp, it brings together the best of both libraries into a single, simple API.
It provides a simple API to get started:
var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://old.reddit.com/r/csharp", pageOptions => pageOptions
        .ExtractList(containersSelector: "div.thing.link.self", [
            new("Title", "a.title"),
            new("Upvotes", "div.score.unvoted"),
            new("Datetime", "time", "datetime"),
            new("Total Comments", "a.comments"),
            new("Url", "a.title", "href")
        ])
        .AddPagination("span.next-button > a", newPage => newPage.ScrollToEnd())
        .AddOutputToConsole()
        .AddOutputToCsvFile("results.csv"))
    .ConfigureEngine(options =>
    {
        options.UseHeadlessMode(false);
        options.WithLoggerFactory(loggerFactory);
    })
    .BuildAsync(cancellationToken);

await crawlerEngine.RunAsync(cancellationToken);
This example starts at https://old.reddit.com/r/csharp, scrapes all the posts on the page, then follows the pagination link to the next page and repeats.
With this library it's easy to
- Scrape Single Page Apps
- Scrape Listings
- Add pagination
- As an alternative to list extraction, open each post, scrape its page, and continue with pagination
- Scrape multiple pages in parallel
- Add custom outputs
- Customize Everything
Scraping pages
Each Page added (a Page is a single-URL job) can have Tasks attached. For each Page it's possible to call:
.ExtractObject(...)
.ExtractList(...)
.OpenLinks(...)
.AddPagination(...)
It's possible to add multiple pages to scrape with the same Tasks.
var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPages(["https://old.reddit.com/r/csharp", "https://old.reddit.com/r/dotnet"], pageOptions => pageOptions
        .OpenLinks("div.thing.link.self a.bylink.comments", subPageOptions =>
        {
            subPageOptions.ExtractObject([
                new("Title", "div.sitetable.linklisting a.title"),
                new("Url", "div.sitetable.linklisting a.title", "href"),
                new("Upvotes", "div.sitetable.linklisting div.score.unvoted"),
                new("Top comment", "div.commentarea div.entry.unvoted div.md"),
            ]);
            subPageOptions.ConfigurePageActions(ops =>
            {
                ops.ScrollToEnd();
                ops.Wait(4000);
            });
        })
        .AddPagination("span.next-button > a")
        .AddOutputToConsole()
        .AddOutputToCsvFile("results.csv"))
    .BuildAsync(cancellationToken);

await crawlerEngine.RunAsync(cancellationToken);
This example starts at https://old.reddit.com/r/csharp
and https://old.reddit.com/r/dotnet
and opens each post, scraping the title, URL, upvotes, and top comment. On each post it scrolls to the end of the page and waits 4 seconds before scraping. It then continues to the next listing page.
Configuring the Engine
The engine can be configured with the following options:
- UseHeadlessMode(bool headless): whether the browser runs headless
- WithLoggerFactory(ILoggerFactory loggerFactory): the logger factory to use
- WithUserAgent(string userAgent): the user agent to use
- WithCookies(params Cookie[] cookies): the cookies to use
- TotalPagesToCrawl(int total): the total number of pages to crawl
- WithParallelismDegree(int parallelismDegree): the number of pages to crawl in parallel
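Putting the options above together, a configuration might look like the following sketch. The user-agent string, page limit, and parallelism degree are illustrative values, not library defaults:

```csharp
.ConfigureEngine(options =>
{
    options.UseHeadlessMode(true);             // run the browser without a visible UI
    options.WithUserAgent("my-crawler/1.0");   // illustrative user-agent string
    options.TotalPagesToCrawl(100);            // stop after 100 pages total
    options.WithParallelismDegree(4);          // crawl up to 4 pages concurrently
})
```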
Stopping the engine
The engine stops when either:
- The total number of pages to crawl is reached, or
- 2 minutes have passed since the last job was added.
Extensibility
The library is designed to be extensible. It's possible to add custom IParser
, IScheduler
and ICrawler
implementations.
Using the engine builder, it's possible to plug in custom implementations:
.ConfigureEngine(options =>
{
    options.WithCrawler(new MyCustomCrawler());
    options.WithScheduler(new MyCustomScheduler());
    options.WithParser(new MyCustomParser());
})
Custom Outputs
It's possible to add custom outputs by implementing the ICrawlOutput interface. ICrawlOutput.WriteAsync(JObject jObject, CancellationToken cancellationToken) is called once for each scraped object.
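As a sketch, a custom output that appends each scraped object as one JSON line to a file could look like this. It assumes ICrawlOutput exposes only the WriteAsync method shown above (the interface may define additional members, e.g. for initialization, not documented here), and the JsonLinesOutput class name is illustrative:

```csharp
using Newtonsoft.Json.Linq;

// Sketch: write each scraped object as a single JSON line (JSONL format).
public class JsonLinesOutput : ICrawlOutput
{
    private readonly string _path;

    public JsonLinesOutput(string path) => _path = path;

    public async Task WriteAsync(JObject jObject, CancellationToken cancellationToken)
    {
        // Formatting.None keeps each object on one line, one record per line.
        await File.AppendAllTextAsync(
            _path,
            jObject.ToString(Newtonsoft.Json.Formatting.None) + Environment.NewLine,
            cancellationToken);
    }
}
```

The output would then be registered alongside the built-in ones, in place of AddOutputToCsvFile, wherever the builder accepts custom outputs.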
Product | Compatible and computed target frameworks
---|---
.NET | net8.0 is compatible. net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, and net8.0-windows were computed.
Dependencies (net8.0)
- AngleSharp (>= 1.1.2)
- PuppeteerSharp (>= 18.0.2)