SoftCircuits.WebScraper 2.1.0

WebScraper

Install-Package SoftCircuits.WebScraper

Introduction

.NET library to scrape content from the Internet. Use it to extract information from Web pages in your own application. The library writes the extracted data to a CSV file.

Using the Library

Some example code that uses the library is shown below.

// Create Scraper object and set properties
Scraper scraper = new Scraper();

// Set URL template
scraper.Url = "https://www.example.com/{location}/{category}?page={page}";
// Add URL placeholders data
scraper.Placeholders.Add(new Placeholder("location", new[] { "salt-lake-city-ut", "ogden-ut", }));
scraper.Placeholders.Add(new Placeholder("category", new[] { "lawn-mower-repair", "plumbers" }));
// Set next-page selectors
scraper.NextPageSelector = @"div.pagination a[class=""next ajax-page""]";
// Set container selectors
scraper.ContainerSelector = @"div[id=""top-center-ads""][class=""search-results center-ads""],div[class=""search-results organic""],div[id=""bottom-center-ads""][class=""search-results center-ads""]";
// Set item selectors
scraper.ItemSelector = @"div[id:=""lid-\d+""][class=""result""] div[class=""v-card""]";
// Add data fields
scraper.Fields.Add(new TextField("Name", "a.business-name span"));
scraper.Fields.Add(new TextField("Address", "p.adr"));
scraper.Fields.Add(new TextField("Phone", "div.phones.phone.primary"));
scraper.Fields.Add(new TextField("Category", "div.categories > a"));
scraper.Fields.Add(new AttributeField("Website", "a.track-visit-website", "href"));

// Add handler for UpdateProgress events
scraper.UpdateProgress += Scraper_UpdateProgress;

// Run scraper
await scraper.RunAsync(@"ScraperData.csv");

As you can see, there are a number of steps to get the class working. We'll go through those steps here.

Url

After creating an instance of the Scraper class, you set the Url property to the URL you want to scrape. This property can be set to a regular URL. The URL can also contain special placeholder tags that will be replaced with replacement values. The code example above sets the URL to a value that contains three placeholder tags: {location}, {category}, and {page}. We'll cover these next.

Placeholders

The Placeholders property is a collection that defines the values that will replace the user tags in your URL. A Placeholder contains a name, which is the tag without the curly braces ({ and }), and a list of items that will replace the tag.

So if your URL is "http://www.example.com/{category}", and you define a Placeholder with the name "category" (not case-sensitive) and the list of values "electrical", "plumbing", and "furniture", the Scraper class will examine the following URLs:

http://www.example.com/electrical
http://www.example.com/plumbing
http://www.example.com/furniture

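Moreover, if you changed the URL to "http://www.example.com/{location}/{category}" and added a second Placeholder with the name "location" and the list of values "Los-Angeles", "Denver", and "New-York", it will examine the following URLs:

http://www.example.com/Los-Angeles/electrical
http://www.example.com/Los-Angeles/plumbing
http://www.example.com/Los-Angeles/furniture
http://www.example.com/Denver/electrical
http://www.example.com/Denver/plumbing
http://www.example.com/Denver/furniture
http://www.example.com/New-York/electrical
http://www.example.com/New-York/plumbing
http://www.example.com/New-York/furniture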

As you can see, the library will generate a URL for every combination of placeholder values you provide, regardless of the number of placeholders you define. In addition, if the URL includes the {page} tag, multiple pages will be generated for every combination of your user tags.
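
The expansion described above behaves like a Cartesian product over the placeholder value lists. The following is a minimal sketch of that idea, not the library's actual implementation: the ExpandUrls helper is hypothetical, and the order in which it produces combinations is an assumption. It uses the string.Replace overload that takes a StringComparison, which requires .NET Core 2.0 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of WebScraper): builds one URL for every
// combination of placeholder values. Tag names are matched without regard to case.
static List<string> ExpandUrls(string template, IEnumerable<KeyValuePair<string, string[]>> placeholders)
{
    var results = new List<string> { template };
    foreach (var placeholder in placeholders)
    {
        string tag = "{" + placeholder.Key + "}";
        // Replace this tag in every partially expanded URL with each of its values.
        results = results
            .SelectMany(current => placeholder.Value.Select(replacement =>
                current.Replace(tag, replacement, StringComparison.OrdinalIgnoreCase)))
            .ToList();
    }
    return results;
}

var samplePlaceholders = new[]
{
    new KeyValuePair<string, string[]>("location", new[] { "Los-Angeles", "Denver", "New-York" }),
    new KeyValuePair<string, string[]>("category", new[] { "electrical", "plumbing", "furniture" }),
};

List<string> allUrls = ExpandUrls("http://www.example.com/{location}/{category}", samplePlaceholders);
Console.WriteLine(allUrls.Count); // 9 (3 locations x 3 categories)
```

Adding a third placeholder with two values would double the result to 18 URLs, so long value lists multiply quickly.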

NextPageSelector

In addition to user tags, a URL can also contain the {page} tag. For targets that involve multiple pages, this tag will be replaced with the current page number.

The NextPageSelector property is a string selector that identifies the element or elements that, if present, indicate there are more pages. Selectors used by WebScraper are similar to CSS or jQuery selectors. There is a section on WebScraper selectors below. For now, just know that selectors describe one or more elements on a page.

For example, paged results generally have some sort of Next button used to access the next page. By adding a selector that describes this element, the library uses it to determine if there are additional pages. And if the {page} tag is included in the URL, it will be incremented for each page.
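
One way to picture how the {page} tag and NextPageSelector work together is a loop that substitutes an increasing page number and stops once no next-page element is found. This is only a sketch under stated assumptions: EnumeratePages is a hypothetical helper, the hasNextPage delegate stands in for "download the page and test NextPageSelector", and starting the numbering at 1 is a guess rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of WebScraper): yields page URLs until the
// caller-supplied check reports that there is no next page.
static IEnumerable<string> EnumeratePages(string urlTemplate, Func<string, bool> hasNextPage)
{
    int page = 1; // Assumption: page numbering starts at 1.
    while (true)
    {
        string pageUrl = urlTemplate.Replace("{page}", page.ToString());
        yield return pageUrl;
        if (!hasNextPage(pageUrl))
            yield break;
        page++;
    }
}

// Pretend the site has exactly three pages of results.
var visited = new List<string>();
foreach (string visitedUrl in EnumeratePages(
    "https://www.example.com/plumbers?page={page}",
    candidate => !candidate.EndsWith("page=3", StringComparison.Ordinal)))
{
    visited.Add(visitedUrl);
    Console.WriteLine(visitedUrl);
}
```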

ContainerSelector

The ContainerSelector property is a string selector that identifies the elements on the page that contain all the items to be scraped.

The container narrows down the area to be searched when looking for data to scrape, which makes the code a little more efficient. ContainerSelector is the only selector that is optional. If it is not provided, the entire page is the container.

ItemSelector

The ItemSelector property is a string selector that identifies the elements within the container that contain data for one item. For example, if you are scanning a website that lists employee details, you would have an element that contains all the employees on the current page (the container). And within the container you would have any number of elements that contain the information for a specific employee (the item).

The library will look for the specific data items you are requesting within the item element or elements, and the library will know all data found within this location is for one employee (item).

Note that the ItemSelector is relative to the ContainerSelector.
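
The nesting of container, item and field can be sketched with a small well-formed document. The example below uses System.Xml.Linq as a stand-in for an HTML parser and plain LINQ filters in place of WebScraper selectors, so it illustrates the scoping only and makes no claim about how the library searches. The markup and the selectors mentioned in the comments are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

XElement document = XElement.Parse(
    @"<body>
        <div id=""sidebar""><div class=""employee""><span class=""name"">Ignored</span></div></div>
        <div id=""employees"">
          <div class=""employee""><span class=""name"">Ann</span><span class=""phone"">555-0100</span></div>
          <div class=""employee""><span class=""name"">Bob</span><span class=""phone"">555-0101</span></div>
        </div>
      </body>");

// Container: the element that holds every item (compare a selector like "div#employees").
XElement container = document.Elements("div")
    .Single(div => div.Attribute("id")?.Value == "employees");

var rows = new List<string>();

// Items: searched only inside the container (compare "div.employee").
foreach (XElement item in container.Elements("div")
    .Where(div => div.Attribute("class")?.Value == "employee"))
{
    // Fields: searched only inside the current item.
    string name = item.Elements("span").Single(span => span.Attribute("class")?.Value == "name").Value;
    string phone = item.Elements("span").Single(span => span.Attribute("class")?.Value == "phone").Value;
    rows.Add(name + "," + phone);
}

foreach (string row in rows)
    Console.WriteLine(row);
```

The employee element in the sidebar is never visited because it lies outside the container, which is the narrowing effect described above.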

Fields

Finally, the Fields property is a collection that defines the specific data items you want to extract. There are several different field classes, such as TextField; each is described below.

A Field has a name and selector. The name represents the data item. It is not used by the library except that, if you have configured the library to write headers to the resulting CSV file, this name will be the header of the corresponding column.

The selector is a string selector that identifies the element or elements (relative to ItemSelector) that contain the data to be extracted for this field.

There are four types of field classes:

TextField

This field type extracts the data from the text of the matching element or elements.

AttributeField

This field type extracts the data from the value of an attribute of the matching element or elements. This class has one additional property, AttributeName, which specifies the name of the attribute.

InnerHtmlField

This field type extracts the inner HTML of the matching element or elements.

OuterHtmlField

This field type extracts the outer HTML of the matching element or elements.
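
To make the difference between the four field types concrete, the sketch below pulls the corresponding four values out of a single sample element. It uses System.Xml.Linq as a stand-in for an HTML parser and does not call WebScraper. The sample markup is invented, and details such as whitespace trimming in the real library are not covered.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XElement link = XElement.Parse(
    @"<a class=""track-visit-website"" href=""https://www.example.com""><span>Visit Website</span></a>");

// What a TextField-style extraction would see: the element's text content.
string text = link.Value;

// What an AttributeField-style extraction would see: the value of one attribute.
string href = link.Attribute("href")?.Value ?? "";

// Inner HTML: the markup of the element's children only.
string innerHtml = string.Concat(
    link.Nodes().Select(node => node.ToString(SaveOptions.DisableFormatting)));

// Outer HTML: the element's own tag plus its children.
string outerHtml = link.ToString(SaveOptions.DisableFormatting);

Console.WriteLine(text);      // Visit Website
Console.WriteLine(href);      // https://www.example.com
Console.WriteLine(innerHtml); // <span>Visit Website</span>
Console.WriteLine(outerHtml);
```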

DataSeparator

Since it is possible to have more than one element match the target selectors, multiple values will be concatenated together. Use the DataSeparator property to insert a delimiter between multiple values. This property is a comma by default.
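
The effect is the same as joining the matched values with string.Join. A minimal sketch, in which the sample values are invented and the separator variable simply mirrors the default comma described above:

```csharp
using System;

// Values from three elements that all matched the same field selector.
string[] matches = { "Plumbers", "Water Heaters", "Drain Cleaning" };

// Mirrors the default separator described above.
string separator = ",";

string fieldValue = string.Join(separator, matches);
Console.WriteLine(fieldValue); // Plumbers,Water Heaters,Drain Cleaning
```

If the scraped values may themselves contain commas, a different separator such as "|" or "; " makes the combined value easier to split later.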

Selectors

Selectors are used to identify elements in an HTML document. WebScraper selectors are very similar to CSS and jQuery selectors, with a couple of minor differences.

Wildcard

The wildcard character matches all HTML elements in the range being searched.

Selector  Matches
*         All HTML elements in the range being searched

Tag Names

You can also specify the tag name to return all the tags with the given name. Tag names are not case-sensitive.

Selector  Matches
p         All the <p> tags in the range being searched

"#", "." and ":"

These characters are shortcuts for ID, class and type attributes.

Selector       Matches
p#center-ad    <p id="center-ad">
a.align-right  <a href="#" class="align-right">
input:button   <input type="button">

Square Brackets ([])

For greater control over attributes, you can use square brackets. This is similar to specifying attributes in jQuery, but there are some differences. The first difference is that WebScraper does not support the jQuery variations for matching at the start, middle or end of a value (^=, *= and $=). Instead, you can use the := operator to specify that the value is a regular expression, and the selector will match if the attribute value matches that regular expression.

Selector                                Matches
p[id="center-ad"]                       All <p> tags with the attribute id="center-ad"
p[id='center-ad'][class='align-right']  All <p> tags that have both attributes id="center-ad" and class="align-right"
p[id=center-ad][class=align-right]      Same as above. Quotes within the square brackets are optional if the value contains no whitespace or most punctuation
a[href]                                 All <a> tags that have an href attribute. The attribute value does not matter
p[data-id:="abc-\d+"]                   All <p> tags that have the attribute data-id with a value that matches the regular expression "abc-\d+". This example is not case-sensitive

Note that there is one key difference when using square brackets. When using a pound (#), period (.) or colon (:) to specify an attribute value, it is considered a match if it matches any value within that attribute. For example, the selector "div.right-align" would match the attribute class="main-content right-align". When using square brackets, it must match the entire value (although there are exceptions to this when using regular expressions).
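
The three matching rules can be contrasted with a few lines of plain C#. This sketch does not call WebScraper: the helper names are hypothetical, and treating an attribute as whitespace-separated values with case-sensitive comparison is an assumption. The case-insensitive regular expression follows the note in the table above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Shortcut form (#, . or :): matches if any one value within the attribute matches.
static bool MatchesAnyValue(string attributeValue, string wanted) =>
    attributeValue.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries).Contains(wanted);

// Square brackets with =: the entire attribute value must match.
static bool MatchesWholeValue(string attributeValue, string wanted) =>
    string.Equals(attributeValue, wanted, StringComparison.Ordinal);

// Square brackets with :=: the value is tested against a regular expression.
static bool MatchesPattern(string attributeValue, string pattern) =>
    Regex.IsMatch(attributeValue, pattern, RegexOptions.IgnoreCase);

string classValue = "main-content right-align";

Console.WriteLine(MatchesAnyValue(classValue, "right-align"));   // True
Console.WriteLine(MatchesWholeValue(classValue, "right-align")); // False
Console.WriteLine(MatchesPattern("lid-12345", @"lid-\d+"));      // True
Console.WriteLine(MatchesPattern("LID-7", @"lid-\d+"));          // True
```

Because an unanchored regular expression can match part of a value, a pattern such as "right-align" would also match the two-class attribute above. Add ^ and $ to the pattern when the whole value must match.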

Multiple Selectors

There are several cases where you can specify multiple selectors.

Selector    Matches
a, div, p   All <a>, <div> and <p> tags
div span    All <span> tags that are descendants of a <div> tag
div > span  All <span> tags that are direct children of a <div> tag

Target Frameworks

The package includes builds for net6.0, net7.0, net8.0 and netstandard2.0. NuGet computes compatibility with further targets from these, including net5.0, netcoreapp2.0 through netcoreapp3.1, netstandard2.1, net461 through net481, and the Mono, Tizen and Xamarin platforms.

Version  Downloads  Last updated
2.1.0    209        4/1/2024
2.0.0    2,007      2/23/2021
1.0.3    531        5/19/2020
1.0.2    478        5/19/2020
1.0.1    472        5/18/2020
1.0.0    530        5/18/2020

Added missing XML documentation file; Added direct support for .NET 7.0 and .NET 8.0; Removed direct support for now deprecated .NET 5.0; Minor tweaks and enhancements.