DocNetExtended 0.5.0.1
dotnet add package DocNetExtended --version 0.5.0.1
NuGet\Install-Package DocNetExtended -Version 0.5.0.1
<PackageReference Include="DocNetExtended" Version="0.5.0.1" />
paket add DocNetExtended --version 0.5.0.1
#r "nuget: DocNetExtended, 0.5.0.1"
// Install DocNetExtended as a Cake Addin
#addin nuget:?package=DocNetExtended&version=0.5.0.1
// Install DocNetExtended as a Cake Tool
#tool nuget:?package=DocNetExtended&version=0.5.0.1
DocNetExtended
DocNetExtended is a small extension library built upon the DocNet library, designed to extract text in a readable order from PDFs.
Features
- Get text
- Get lines of text
- Get words
- Split lines of text into blocks
Usage
Extracting all text
using (var docReader = DocLib.Instance.GetDocReader(pdfFileName, new PageDimensions(2480, 3508)))
{
    using (var pageReader = new OrderedPageTextReader(docReader, 0))
    {
        Console.WriteLine(pageReader.GetTextInReadableOrder());
    }
}
Extracting lines of text
using (var docReader = DocLib.Instance.GetDocReader(pdfFileName, new PageDimensions(2480, 3508)))
{
    using (var pageReader = new OrderedPageTextReader(docReader, 0))
    {
        var textLines = pageReader.GetTextLines();
        foreach (var textLine in textLines)
        {
            Console.WriteLine(textLine.Text);
        }
    }
}
Extracting all words
using (var docReader = DocLib.Instance.GetDocReader(pdfFileName, new PageDimensions(2480, 3508)))
{
    using (var pageReader = new OrderedPageTextReader(docReader, 0))
    {
        var words = pageReader.GetWords();
        foreach (var word in words)
        {
            Console.WriteLine(word.Value);
        }
    }
}
Extracting blocks of text
When extracting text from a PDF, you may only be interested in a certain section of the page.
The GetTextBlocks method splits each line of text into blocks: it divides the page width by the block size, then uses the position of each word to decide which block that word belongs to.
Note: Blocks are currently calculated per TextLine.
using (var docReader = DocLib.Instance.GetDocReader(pdfFileName, new PageDimensions(2480, 3508)))
{
    using (var pageReader = new OrderedPageTextReader(docReader, 0))
    {
        var textBlocks = pageReader.GetTextBlocks(300);
        foreach (var textBlock in textBlocks)
        {
            Console.WriteLine(textBlock.Text);
        }
    }
}
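To illustrate the idea behind block splitting, here is a minimal, self-contained sketch of how the words of one line could be assigned to blocks by horizontal position. It is not the library's implementation. `PositionedWord`, `BlockSketch` and `SplitLineIntoBlocks` are hypothetical names, and the sketch assumes the block size is a column width in the same units as the page width, which the documentation above does not confirm. It needs C# 9 or later for the record type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a word with a horizontal position;
// this is NOT a DocNetExtended type.
public record PositionedWord(string Text, double Left);

public static class BlockSketch
{
    // Splits the words of a single line into blocks of text.
    // Assumption: blockSize is a column width in the same units as pageWidth.
    public static List<string> SplitLineIntoBlocks(
        IEnumerable<PositionedWord> lineWords, double pageWidth, double blockSize)
    {
        if (blockSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(blockSize));

        // Number of columns the page is divided into (at least one).
        int blockCount = Math.Max(1, (int)Math.Ceiling(pageWidth / blockSize));

        return lineWords
            // Column index from the word's left edge, clamped to the page.
            .GroupBy(w => Math.Clamp((int)(w.Left / blockSize), 0, blockCount - 1))
            .OrderBy(g => g.Key)
            // Join the words of each column in left-to-right order.
            .Select(g => string.Join(" ", g.OrderBy(w => w.Left).Select(w => w.Text)))
            .ToList();
    }
}
```

With a page width of 2480 and a block size of 300, words at left positions 100 and 260 share the first column, while words at 1900 and 2100 fall into separate columns, so the line is split into three blocks.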
Disclaimer
Whilst every attempt is made to extract data in the order it appears in the PDF, this is very much a work in progress and may not support the structure of all PDFs.
Credit
This project wouldn't be possible without the work done by the DocNet team.
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.0 is compatible. netstandard2.1 was computed. |
.NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen40 was computed. tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
Dependencies
- .NETStandard 2.0
  - Docnet.Core (>= 2.3.1)