Mostlylucid.StyloExtract.AspNetCore
1.8.0-alpha.3
dotnet add package Mostlylucid.StyloExtract.AspNetCore --version 1.8.0-alpha.3
NuGet\Install-Package Mostlylucid.StyloExtract.AspNetCore -Version 1.8.0-alpha.3
<PackageReference Include="Mostlylucid.StyloExtract.AspNetCore" Version="1.8.0-alpha.3" />
<PackageVersion Include="Mostlylucid.StyloExtract.AspNetCore" Version="1.8.0-alpha.3" />
<PackageReference Include="Mostlylucid.StyloExtract.AspNetCore" />
paket add Mostlylucid.StyloExtract.AspNetCore --version 1.8.0-alpha.3
#r "nuget: Mostlylucid.StyloExtract.AspNetCore, 1.8.0-alpha.3"
#:package Mostlylucid.StyloExtract.AspNetCore@1.8.0-alpha.3
#addin nuget:?package=Mostlylucid.StyloExtract.AspNetCore&version=1.8.0-alpha.3&prerelease
#tool nuget:?package=Mostlylucid.StyloExtract.AspNetCore&version=1.8.0-alpha.3&prerelease
Mostlylucid.StyloExtract.AspNetCore
AddStyloExtract() DI extensions for ASP.NET Core and any Microsoft.Extensions.DependencyInjection host, plus opt-in Markdown content negotiation middleware.
What this package is
The canonical way to register StyloExtract in any .NET application that uses IServiceCollection. It depends on Core, Html, Fingerprint, Templates, Heuristics, and Markdown and wires them all up correctly.
Since v1.1.0 it also ships the Markdown content negotiation suite: a global middleware, a per-action MVC attribute, and a Minimal API extension that all transparently return Markdown instead of HTML when a client sends Accept: text/markdown.
When to depend on this directly
This is the package most application code should reference directly. Add it to your web API, worker service, or console application, call AddStyloExtract(), and inject ILayoutExtractor wherever you need it.
dotnet add package Mostlylucid.StyloExtract.AspNetCore
Usage
// Program.cs (ASP.NET Core)
builder.Services.AddStyloExtract(o =>
{
    o.StorePath = "styloextract-templates.db";
    o.HostHashKey = Environment.GetEnvironmentVariable("STYLOEXTRACT_HMAC_KEY");
    o.DefaultProfile = ExtractionProfile.RagFull;
    o.Match.FastPathJaccardThreshold = 0.85;
    o.Match.SlowPathCosineThreshold = 0.75;
    o.Centroid.DriftRefitThreshold = 0.35;
});

// Inject ILayoutExtractor in a controller, service, or background worker
public class ContentService(ILayoutExtractor extractor)
{
    public async Task<string> GetMarkdownAsync(string html, Uri uri)
    {
        var result = await extractor.ExtractAsync(html, uri);
        return result.Markdown;
    }
}
Version event sink
To receive template version change events, register an ITemplateVersionEventSink before calling AddStyloExtract:
services.AddSingleton<ITemplateVersionEventSink, MyVersionEventSink>();
services.AddStyloExtract(o => { ... });
If no sink is registered, DefaultNoopVersionEventSink is used (events discarded).
Response policy framework (v1.2)
IResponsePolicy is the canonical response-transformation primitive in StyloExtract.AspNetCore.
Markdown content negotiation is the first built-in policy instance; cache-hint emission is the second.
The framework is modelled on IOutputCachePolicy's three-phase lifecycle.
Three phases
public interface IResponsePolicy
{
    // Pre-pipeline: parse request, configure vary semantics, store per-request state.
    ValueTask OnRequestAsync(ResponsePolicyContext context);

    // Pre-serve: short-circuit the response (e.g. serve from cache) without calling downstream.
    ValueTask OnServeAsync(ResponsePolicyContext context);

    // Post-produce: transform the buffered body, set headers, store in cache.
    ValueTask OnProducedAsync(ResponsePolicyContext context);
}
Setup
Recommended path (new in v1.2): use the fluent AddStyloExtract(Action<ResponsePolicyBuilder>) overload.
// 1. Register the core stack and Markdown negotiation.
builder.Services.AddStyloExtract(o => o.StorePath = "styloextract.db");
builder.Services.AddStyloExtractMarkdownNegotiation(o => { ... });

// 2. Register named policies via the fluent builder (recommended).
builder.Services.AddStyloExtract(b =>
{
    b.AddPolicy("md", p => p.NegotiateMarkdown());
    b.AddPolicy("cache", p => p.CacheHints(o =>
    {
        o.MaxAge = TimeSpan.FromMinutes(5);
        o.EmitETag = true;
        o.HonorIfNoneMatch = true;
    }));
});

// 3. Wire the middleware (after UseRouting, UseAuthentication, UseAuthorization).
app.UseRouting();
app.UseStyloExtract();
If you need access to the service provider to construct policies manually, use the factory overload instead:
builder.Services.AddSingleton<ResponsePolicyOptions>(sp =>
{
    var opts = new ResponsePolicyOptions();
    opts.AddPolicy("md", sp.GetRequiredService<MarkdownNegotiationPolicy>());
    opts.AddPolicy("cache", new CacheHintPolicy(new CacheHintOptions { MaxAge = TimeSpan.FromMinutes(5) }));
    return opts;
});
Attaching policies to endpoints
// Minimal API: chain WithResponsePolicy calls in declaration order.
app.MapGet("/article", handler)
    .WithResponsePolicy("md")
    .WithResponsePolicy("cache");

// MVC controller action: use [ResponsePolicy] attribute.
[HttpGet("article")]
[ResponsePolicy("md")]
public IActionResult GetArticle() => Content(html, "text/html");
Composition
Policies run in declaration order. Each policy's OnProducedAsync sees the body as it was left by the preceding policy. When MarkdownNegotiationPolicy runs before CacheHintPolicy, the ETag is computed from the Markdown body, not the original HTML.
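The ordering effect described above can be illustrated with a minimal, self-contained sketch. It does not use the package's types: `SketchResponse`, `CompositionSketch`, and the toy `ToMarkdown` transform are hypothetical stand-ins, and only the "each step sees the body left by the previous step" behaviour is taken from the text.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical stand-in for the buffered response that a policy sees.
public sealed class SketchResponse
{
    public string Body { get; set; } = "";
    public Dictionary<string, string> Headers { get; } = new();
}

public static class CompositionSketch
{
    // Toy "Markdown" transform: a placeholder for the real extraction step.
    public static void ToMarkdown(SketchResponse r)
    {
        r.Body = r.Body.Replace("<h1>", "# ").Replace("</h1>", "");
        r.Headers["Content-Type"] = "text/markdown; charset=utf-8";
    }

    // Hashes whatever body the preceding step left behind.
    public static void AddETag(SketchResponse r) =>
        r.Headers["ETag"] = "\"" + Hash(r.Body) + "\"";

    public static string Hash(string s) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(s)));

    // Runs the steps in declaration order over one shared response.
    public static SketchResponse Run(string html, params Action<SketchResponse>[] steps)
    {
        var response = new SketchResponse { Body = html };
        foreach (var step in steps) step(response);
        return response;
    }
}
```

Calling `CompositionSketch.Run(html, CompositionSketch.ToMarkdown, CompositionSketch.AddETag)` yields an ETag over the Markdown body; swapping the two steps yields an ETag over the original HTML.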
Backward compat (v1.1 paths still work)
All v1.1 entry points (UseStyloExtractMarkdownNegotiation, [NegotiateMarkdown], WithMarkdownNegotiation, StyloExtractResults.HtmlOrMarkdown) remain unchanged and continue to work bit-compatibly. The new MarkdownNegotiationPolicy provides equivalent functionality on the IResponsePolicy pipeline; new code should prefer it via services.AddStyloExtract(b => b.AddPolicy("md", p => p.NegotiateMarkdown(...))) and endpoint.WithResponsePolicy("md").
The framework is purely additive:
- All v1.1.0 public API signatures are unchanged.
- The existing AddStyloExtract(Action<StyloExtractOptions>?) signature is unchanged.
Markdown content negotiation
StyloExtract can transparently serve Markdown instead of HTML when a client sends Accept: text/markdown. Three opt-in paths are provided; choose the one that fits your app.
1. Global middleware
Call AddStyloExtractMarkdownNegotiation() in your services and UseStyloExtractMarkdownNegotiation() in your pipeline. Every HTML response on every route is subject to negotiation.
// Program.cs
builder.Services.AddStyloExtract(o => o.StorePath = "styloextract.db");
builder.Services.AddStyloExtractMarkdownNegotiation(o =>
{
    o.DefaultProfile = ExtractionProfile.RagFull;
    o.EmitVaryHeader = true;          // adds Vary: Accept to negotiated responses
    o.MaxBodyBytes = 4 * 1024 * 1024; // skip bodies larger than 4 MB
});
// ...
app.UseRouting();
app.UseStyloExtractMarkdownNegotiation(); // after UseRouting
app.MapControllers();
A client that sends Accept: text/markdown receives Content-Type: text/markdown; charset=utf-8. All other clients receive the original HTML. The Vary: Accept header is added automatically so HTTP caches store a separate variant per Accept value and do not serve cached Markdown to an HTML client.
2. Per-action MVC attribute
Use [NegotiateMarkdown] on a controller action or controller class when you want per-endpoint control without a global middleware.
[HttpGet("article/{id}")]
[NegotiateMarkdown(ExtractionProfile.AgentNavigation)]
public IActionResult GetArticle(int id)
{
    var html = BuildArticleHtml(id);
    return Content(html, "text/html");
}
The attribute runs as an IAsyncResultFilter. It does not require the global middleware to be registered.
3. Minimal API
Use .WithMarkdownNegotiation() on a route builder to add an endpoint filter, or use StyloExtractResults.HtmlOrMarkdown(...) to produce the right result type in the handler itself.
// Endpoint filter approach
app.MapGet("/article", () => Results.Content(BuildHtml(), "text/html"))
    .WithMarkdownNegotiation(ExtractionProfile.RagFull);

// Inline IResult approach
app.MapGet("/article", (IHttpContextAccessor acc) =>
    StyloExtractResults.HtmlOrMarkdown(BuildHtml()));
StyloExtractResults.HtmlOrMarkdown inspects Accept and calls ILayoutExtractor before the response is written, making it the simplest approach for Minimal API when you control the handler body.
Profile selection
The profile used for extraction is resolved in this order:
1. X-Stylo-Profile request header (e.g. AgentNavigation)
2. stylo_profile query string parameter (e.g. ?stylo_profile=RagFull)
3. MarkdownNegotiationOptions.DefaultProfile (default: RagFull)
The header and query names are configurable via MarkdownNegotiationOptions.ProfileHeaderName and ProfileQueryName.
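The resolution order can be sketched as a small pure function. This is an illustration only, not the package's code: `SketchProfile` is a hypothetical stand-in for ExtractionProfile (listing only member names that appear on this page), and ignoring unrecognised values rather than rejecting them is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the package's ExtractionProfile enum.
public enum SketchProfile { RagFull, AgentNavigation, MainContentOnly, Wcxb }

public static class ProfileResolutionSketch
{
    // Header first, then query string, then the configured default.
    // Pass a case-insensitive dictionary for headers, since HTTP header
    // names are case-insensitive.
    public static SketchProfile Resolve(
        IReadOnlyDictionary<string, string> headers,
        IReadOnlyDictionary<string, string> query,
        SketchProfile defaultProfile = SketchProfile.RagFull,
        string headerName = "X-Stylo-Profile",
        string queryName = "stylo_profile")
    {
        if (headers.TryGetValue(headerName, out var fromHeader) && TryParse(fromHeader, out var h))
            return h;
        if (query.TryGetValue(queryName, out var fromQuery) && TryParse(fromQuery, out var q))
            return q;
        return defaultProfile;
    }

    // Assumption: unknown names and raw numeric values fall through to the next source.
    private static bool TryParse(string raw, out SketchProfile profile) =>
        Enum.TryParse(raw, ignoreCase: true, out profile) && Enum.IsDefined(profile);
}
```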
Query-string Accept override (v1.1.0+)
Browser clients cannot easily set custom Accept headers. The AcceptOverrideQueryName option (default: "format") maps a query-string value to a virtual Accept header, so ?format=markdown behaves identically to Accept: text/markdown for any browser.
builder.Services.AddStyloExtractMarkdownNegotiation(o =>
{
    o.AcceptOverrideQueryName = "format"; // null to disable
    // Default mappings: markdown/md => text/markdown, html => text/html,
    // json => application/json, text => text/plain
});
When the override fires, the response carries X-Stylo-Accept-Override: text/markdown so consumers can see it was applied.
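A minimal sketch of the override lookup, using the default mappings listed above. `AcceptOverrideSketch` is a hypothetical name, and two behaviours are assumptions rather than documented facts: that the override takes precedence over a real Accept header, and that unmapped values are ignored.

```csharp
using System;
using System.Collections.Generic;

public static class AcceptOverrideSketch
{
    // Default mappings as listed in the options comment above.
    private static readonly Dictionary<string, string> Mappings =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["markdown"] = "text/markdown",
            ["md"] = "text/markdown",
            ["html"] = "text/html",
            ["json"] = "application/json",
            ["text"] = "text/plain",
        };

    // Returns the effective Accept value and whether the override fired.
    // A null overrideQueryName disables the feature.
    public static (string? Accept, bool Overridden) Resolve(
        string? acceptHeader,
        IReadOnlyDictionary<string, string> query,
        string? overrideQueryName = "format")
    {
        if (overrideQueryName is not null
            && query.TryGetValue(overrideQueryName, out var value)
            && Mappings.TryGetValue(value, out var mediaType))
        {
            return (mediaType, true);
        }
        return (acceptHeader, false);
    }
}
```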
Caching (v1.1.0+)
Set Cache.Enabled = true to avoid re-extracting the same URL + profile combination on repeated requests. The implementation uses IDistributedCache (in-memory by default; to upgrade, register a real distributed cache before calling AddStyloExtractMarkdownNegotiation).
builder.Services.AddStyloExtractMarkdownNegotiation(o =>
{
    o.Cache.Enabled = true;
    o.Cache.AbsoluteExpiration = TimeSpan.FromMinutes(5);
    o.Cache.SlidingExpiration = TimeSpan.FromMinutes(2);
    o.Cache.EnableEtag = true;              // honors If-None-Match; returns 304
    o.Cache.EmitCacheControlHeader = false; // set true for CDN-friendly Cache-Control: public
});
Cache key shape: sha256(method + "|" + scheme + "|" + host + "|" + path + "|" + sortedQuery(minus override key) + "|" + profile). The Accept override query parameter is excluded from the key so ?format=markdown and a bare Accept: text/markdown request share the same cache slot.
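The key shape can be sketched with BCL calls only. `CacheKeySketch` is a hypothetical name; the `key=value` pairs joined by `&`, the ordinal sort, and the uppercase hex encoding are assumptions, since the text above fixes only the component order and the exclusion of the override parameter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class CacheKeySketch
{
    public static string Build(
        string method, string scheme, string host, string path,
        IEnumerable<KeyValuePair<string, string>> query,
        string profile, string overrideQueryName = "format")
    {
        // Drop the Accept-override parameter, then sort so parameter order
        // in the URL does not change the key.
        var sortedQuery = string.Join("&", query
            .Where(kv => !string.Equals(kv.Key, overrideQueryName, StringComparison.OrdinalIgnoreCase))
            .OrderBy(kv => kv.Key, StringComparer.Ordinal)
            .ThenBy(kv => kv.Value, StringComparer.Ordinal)
            .Select(kv => $"{kv.Key}={kv.Value}"));

        var material = string.Join("|", method, scheme, host, path, sortedQuery, profile);
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(material)));
    }
}
```

With this shape, a request for `/article?format=markdown&page=2` and a request for `/article?page=2` sent with Accept: text/markdown produce the same key, while a different profile produces a different one.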
Response headers on Markdown responses:
| Header | Value |
|---|---|
| X-Stylo-Cache | miss or hit |
| ETag | SHA-256 digest of the Markdown bytes (when EnableEtag = true) |
| Cache-Control | public, max-age=N (when EmitCacheControlHeader = true) |
AOT
This package sets IsAotCompatible=true. The negotiation middleware and attribute use no reflection-based JSON serialization; Markdown output is plain text. IDistributedCache and MemoryDistributedCache are both AOT-safe.
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net10.0
  - Mostlylucid.StyloExtract.Core (>= 1.8.0-alpha.3)
  - Mostlylucid.StyloExtract.Fingerprint (>= 1.8.0-alpha.3)
  - Mostlylucid.StyloExtract.Heuristics (>= 1.8.0-alpha.3)
  - Mostlylucid.StyloExtract.Html (>= 1.8.0-alpha.3)
  - Mostlylucid.StyloExtract.Markdown (>= 1.8.0-alpha.3)
  - Mostlylucid.StyloExtract.Templates (>= 1.8.0-alpha.3)
NuGet packages (1)
Showing the top 1 NuGet packages that depend on Mostlylucid.StyloExtract.AspNetCore:
| Package | Downloads |
|---|---|
| Mostlylucid.StyloExtract.StyloBot: Bridge between StyloExtract and StyloBot's IActionPolicy registry. Provides extract-markdown / extract-headers / extract-sidecar / extract-passthrough action policies that operators reference by name from EndpointPolicy rules or [BotAction] attributes. | |
GitHub repositories
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 2.0.1 | 86 | 6/30/2026 |
| 2.0.0 | 96 | 6/28/2026 |
| 1.8.0 | 100 | 6/27/2026 |
| 1.8.0-alpha.23 | 52 | 6/27/2026 |
| 1.8.0-alpha.22 | 50 | 6/27/2026 |
| 1.8.0-alpha.21 | 51 | 6/27/2026 |
| 1.8.0-alpha.20 | 58 | 6/27/2026 |
| 1.8.0-alpha.19 | 56 | 6/26/2026 |
| 1.8.0-alpha.18 | 65 | 6/26/2026 |
| 1.8.0-alpha.17 | 73 | 6/26/2026 |
| 1.8.0-alpha.16 | 76 | 6/26/2026 |
| 1.8.0-alpha.15 | 58 | 6/26/2026 |
| 1.8.0-alpha.14 | 49 | 6/26/2026 |
| 1.8.0-alpha.13 | 46 | 6/26/2026 |
| 1.8.0-alpha.12 | 48 | 6/26/2026 |
| 1.8.0-alpha.11 | 50 | 6/26/2026 |
| 1.8.0-alpha.10 | 53 | 6/26/2026 |
| 1.8.0-alpha.9 | 55 | 6/25/2026 |
| 1.8.0-alpha.8 | 55 | 6/25/2026 |
| 1.8.0-alpha.3 | 55 | 6/25/2026 |
StyloExtract 1.8.0-alpha.3 - 2026-06-25
========================================
What's new since 1.8.0-alpha.2
------------------------------
Next.js __NEXT_DATA__ rehydration extractor
Next.js apps embed their page state in a JSON blob inside
<script id="__NEXT_DATA__" type="application/json">. Schemas vary
per site (Shopify Hydrogen uses pageProps.shopifyProductsPreloadedState,
news sites use pageProps.initialState.article.body) so the
extractor walks props.pageProps recursively and collects every
string value that looks like prose (>= 80 chars, contains a space,
isn't a URL / data URI / CSS variable / serialised JSON). Conservative
key-exclusion list keeps URLs and build metadata out of the result.
Chains next to the JSON-LD and Discourse rehydration fallbacks.
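A minimal sketch of the recursive walk described above, using System.Text.Json. It is not the shipped extractor: `NextDataSketch` is a hypothetical name, the excluded-key list is illustrative (the real list is not given in these notes), and the URL / CSS-variable / serialised-JSON checks are simple prefix tests chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class NextDataSketch
{
    // Assumption: illustrative exclusion list only.
    private static readonly HashSet<string> ExcludedKeys =
        new(StringComparer.OrdinalIgnoreCase) { "url", "href", "src", "buildId" };

    // Takes the JSON text of the __NEXT_DATA__ script element.
    public static List<string> CollectProse(string nextDataJson)
    {
        var results = new List<string>();
        using var doc = JsonDocument.Parse(nextDataJson);
        if (doc.RootElement.ValueKind == JsonValueKind.Object
            && doc.RootElement.TryGetProperty("props", out var props)
            && props.ValueKind == JsonValueKind.Object
            && props.TryGetProperty("pageProps", out var pageProps))
        {
            Walk(pageProps, results);
        }
        return results;
    }

    private static void Walk(JsonElement element, List<string> results)
    {
        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (var property in element.EnumerateObject())
                    if (!ExcludedKeys.Contains(property.Name))
                        Walk(property.Value, results);
                break;
            case JsonValueKind.Array:
                foreach (var item in element.EnumerateArray())
                    Walk(item, results);
                break;
            case JsonValueKind.String:
                var text = element.GetString()!;
                if (LooksLikeProse(text)) results.Add(text);
                break;
        }
    }

    // >= 80 chars, contains a space, and is not a URL, data URI,
    // CSS variable, or serialised JSON.
    public static bool LooksLikeProse(string s)
    {
        var t = s.Trim();
        if (t.Length < 80 || !t.Contains(' ')) return false;
        if (t.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
            || t.StartsWith("https://", StringComparison.OrdinalIgnoreCase)
            || t.StartsWith("data:", StringComparison.OrdinalIgnoreCase)) return false;
        if (t.StartsWith("var(--", StringComparison.Ordinal)
            || t.StartsWith("--", StringComparison.Ordinal)) return false;
        if (t.StartsWith('{') || t.StartsWith('[')) return false;
        return true;
    }
}
```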
Content-role fallback gate
The chained fallback (JSON-LD -> Next.js -> Discourse -> body-text)
previously gated on the all-blocks text sum. That sum looked
healthy for pages where the heuristic emitted 3 KB of nav + footer +
boilerplate while finding zero MainContent — the renderer's
MainContentOnly / Wcxb profiles drop those roles anyway, so the
actual markdown is 0 chars. Switch the gate to content-role text
mass only. 18 catastrophic pages recovered without any new code,
just the gate change.
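The difference between the two gates can be sketched as follows. `SketchRole`, `SketchBlock`, and `FallbackGateSketch` are hypothetical stand-ins, the threshold is a caller-supplied assumption, and treating MainContent as the only content role is a simplification for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the classified blocks of a page.
public enum SketchRole { MainContent, Navigation, Footer, Boilerplate }
public sealed record SketchBlock(SketchRole Role, string Text);

public static class FallbackGateSketch
{
    // Old gate: sums text across all blocks, so chrome-heavy pages look healthy.
    public static bool OldGateNeedsFallback(IEnumerable<SketchBlock> blocks, int minChars) =>
        blocks.Sum(b => b.Text.Length) < minChars;

    // New gate: counts only content-role text, which is what the
    // MainContentOnly / Wcxb profiles actually render.
    public static bool NewGateNeedsFallback(IEnumerable<SketchBlock> blocks, int minChars) =>
        blocks.Where(b => b.Role == SketchRole.MainContent)
              .Sum(b => b.Text.Length) < minChars;
}
```

For a page with 3 KB of navigation and footer text and no MainContent block, the old gate skips the fallback chain while the new gate triggers it.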
Playwright auto-fallback decorator
AddStyloExtractPlaywright() wires PlaywrightHtmlFetcher AND
decorates the existing ILayoutExtractor with a RenderingLayoutExtractor
that runs static extraction first, then re-fetches via Playwright
only when:
* the caller passed a non-null sourceUri
* the static result has < 200 chars of content-role text
* an IRenderedHtmlFetcher is wired in DI
File-only callers never trigger a render. Operators who don't want
the Chromium dependency simply don't add the package. Three guards
against wasted work: Playwright throws -> return static; rendered
HTML same length as static -> skip the re-extract; re-extract
yields no improvement -> return static.
Usage:
services.AddStyloExtract(...);
services.AddStyloExtractPlaywright();
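The decision policy of the decorator can be sketched independently of Playwright. This is a hedged illustration, not RenderingLayoutExtractor itself: `RenderFallbackSketch` and both delegates are hypothetical, and "no improvement" is interpreted here as "the re-extract did not yield more content-role text", which is an assumption.

```csharp
using System;
using System.Threading.Tasks;

public static class RenderFallbackSketch
{
    public const int MinContentChars = 200;

    public static async Task<string> ExtractAsync(
        string html,
        Uri? sourceUri,
        Func<string, string> extractContentText,   // static extraction: HTML in, content-role text out
        Func<Uri, Task<string>>? fetchRendered)    // null when no rendered-HTML fetcher is registered
    {
        var staticText = extractContentText(html);

        // Render only with a source URI, a registered fetcher, and a thin static result.
        if (sourceUri is null || fetchRendered is null || staticText.Length >= MinContentChars)
            return staticText;

        string rendered;
        try { rendered = await fetchRendered(sourceUri); }
        catch (Exception) { return staticText; }                 // guard 1: fetcher threw

        if (rendered.Length == html.Length) return staticText;   // guard 2: same length, skip re-extract

        var renderedText = extractContentText(rendered);
        return renderedText.Length > staticText.Length           // guard 3: keep static unless improved
            ? renderedText
            : staticText;
    }
}
```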
492 tests across 10 projects, 6 new unit tests for the decorator
policy.
Aggregate WCXB (1495 dev pages, Wcxb profile):
| Stage | F1 | Catastrophic |
|----------------------------------------|-------:|-------------:|
| 1.8.0-alpha.2 | 0.760 | 25 |
| + Next.js extractor | same | |
| + content-role fallback gate | 0.760 | 17 |
| + 14 LLM-trained YAMLs | 0.760 | 17 |
| (Playwright auto-fallback) | -- | |
Playwright auto-fallback is wired but not exercised in the WCXB
benchmark by default — needs `playwright install chromium`. Real-
world consumers with the package added see automatic recovery for
JS-rendered SPAs whose content is hydrated client-side.
StyloExtract 1.8.0-alpha.2 - 2026-06-25
========================================
LLM template-training loop, Discourse rehydration, plus a stack of
heuristic + selection fixes that move the WCXB dev split from F1 0.673
(post-1.7.1, MainContentOnly profile) to F1 0.760 (Wcxb plain-text
profile, with operator-trained templates + Discourse rehydration
active). Catastrophic extraction failures (pred_chars ≤ 5) drop from
92 of 1495 pages to 25.
Beats Readability on every page type. Closes the gap to Trafilatura by
~40% on Article + Documentation. Above v1.5.4 baseline (0.718) by
+0.042 — and that's keeping all the GFM markdown structure (sidebar
TOCs, blockquotes, GFM tables) in the runtime output, not stripping
to plain text for benchmark flattery.
What's new since 1.8.0-alpha.1
------------------------------
LLM template training loop (`stylo-extract template train`)
Operator-driven synchronous LLM template specialisation, the
counterpart to the existing async enrichment coordinator. Smart-
routes between induce (no template yet) and repair (template
exists but underperforms).
Closed-selector prompt: every selector the model can choose from
is enumerated from the actual page DOM via DocumentSelectorCatalog
and handed to the LLM in the prompt. Inventing selectors fails.
Post-parse AngleSharp validation: every selector the model returns
is run through doc.QuerySelectorAll. Selectors that match zero
elements are dropped; templates whose MainContent rule has no
surviving selector are rejected.
Repair prompt re-angled as a diagnostic: "why is this failing AND
how should it work for this page" instead of just "produce a
corrected template."
Hash-prefixed selectors (`#my-id`) are now properly quoted in
emitted YAML so they round-trip; the inducer also pre-repairs
unquoted hash selectors in the LLM response before parse.
OllamaTextProvider bumps NumPredict default 1024 → 4096
(reasoning-tagged models burn tokens on chain-of-thought before
the answer) and falls back to message.thinking when message.content
is empty.
`template repair` command + `LlmTemplateInducer.RepairFromSkeletonAsync`
+ production coordinator dispatch (TemplateEnrichmentJob.Kind +
LayoutExtractor enqueue on low-output existing-template hits).
Discourse data-preloaded rehydration
Discourse renders every page as an Ember.js SPA. Static HTML ships
near-zero post content; the actual topic + posts live in a JSON
blob in <div id="data-preloaded" data-preloaded="...JSON...">.
DiscourseRehydrationExtractor parses the JSON, walks
topic_NNN.post_stream.posts[*].cooked, strips tags, and emits the
result as a synthetic MainContent fallback block — same shape as
the existing JSON-LD fallback. Discourse powers 5 000+ public
forums; one upstream extractor covers them all.
WCXB lift: 6 of 13 catastrophic forum pages go from F1=0 to
F1=0.83–0.99. Forum category F1 0.477 → 0.535.
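The walk over the preloaded blob can be sketched with System.Text.Json. This is an illustration under stated assumptions, not DiscourseRehydrationExtractor: `DiscourseSketch` is a hypothetical name, the input is the already HTML-decoded value of the data-preloaded attribute, each topic_NNN value is accepted either as a nested object or as a JSON-encoded string, and tag stripping is a simple regex rather than a real HTML parse.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.Json;
using System.Text.RegularExpressions;

public static class DiscourseSketch
{
    public static List<string> ExtractPosts(string preloadedJson)
    {
        var posts = new List<string>();
        using var outer = JsonDocument.Parse(preloadedJson);
        if (outer.RootElement.ValueKind != JsonValueKind.Object) return posts;

        foreach (var entry in outer.RootElement.EnumerateObject())
        {
            if (!entry.Name.StartsWith("topic_", StringComparison.Ordinal)) continue;

            if (entry.Value.ValueKind == JsonValueKind.String)
            {
                // Assumption: the topic payload may itself be a JSON-encoded string.
                using var inner = JsonDocument.Parse(entry.Value.GetString()!);
                Collect(inner.RootElement, posts);
            }
            else
            {
                Collect(entry.Value, posts);
            }
        }
        return posts;
    }

    // Walks post_stream.posts[*].cooked and keeps the tag-stripped text.
    private static void Collect(JsonElement topic, List<string> posts)
    {
        if (topic.ValueKind != JsonValueKind.Object
            || !topic.TryGetProperty("post_stream", out var stream)
            || stream.ValueKind != JsonValueKind.Object
            || !stream.TryGetProperty("posts", out var list)
            || list.ValueKind != JsonValueKind.Array) return;

        foreach (var post in list.EnumerateArray())
        {
            if (post.ValueKind == JsonValueKind.Object
                && post.TryGetProperty("cooked", out var cooked)
                && cooked.ValueKind == JsonValueKind.String)
            {
                var text = StripTags(cooked.GetString()!);
                if (text.Length > 0) posts.Add(text);
            }
        }
    }

    // Replaces tags with spaces, decodes entities, collapses whitespace.
    public static string StripTags(string html)
    {
        var noTags = Regex.Replace(html, "<[^>]+>", " ");
        var decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```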
Wcxb plain-text profile
WCXB-style word-overlap benchmarks score against plain-text gold.
The default MainContentOnly / RagFull output emits GFM Markdown —
headings, lists, sidebar TOCs, multi-paragraph blockquotes — that
improves AI / human readability but registers as precision noise
against plain-text comparison.
New ExtractionProfile.Wcxb uses MainContentOnly's role-set but
emits each block's plain Text instead of its Markdown. Strictly
a benchmark / comparison profile — runtime callers keep their
existing profile and continue getting structured GFM.
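To see why Markdown structure registers as precision noise, consider a generic bag-of-words F1. This sketch is not the WCXB scorer: `WordOverlapSketch` is a hypothetical name, and whitespace tokenisation with case-insensitive matching is an assumption made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordOverlapSketch
{
    public static (double Precision, double Recall, double F1) Score(string predicted, string gold)
    {
        var pred = Count(predicted);
        var reference = Count(gold);

        // Multiset intersection: each token counts at most as often as it appears in both.
        double overlap = pred.Sum(kv => Math.Min(kv.Value, reference.GetValueOrDefault(kv.Key)));
        if (overlap == 0) return (0, 0, 0);

        var precision = overlap / pred.Values.Sum();
        var recall = overlap / reference.Values.Sum();
        return (precision, recall, 2 * precision * recall / (precision + recall));
    }

    private static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (var token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            counts[token] = counts.GetValueOrDefault(token) + 1;
        return counts;
    }
}
```

Scoring the predicted text `# Title some text` against the gold text `Title some text` gives recall 1.0 but precision 0.75, because the heading marker counts as an extra predicted token.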
Heuristic + selection fixes
DomCleaner: strip <select> globally so <option> text stops
leaking on category dropdowns. mostlylucid.net opened with 290+
category names dumped into the output; now opens with the actual
blog list.
IntraBlockCleaner: content-guard the contamination-hint substring
match. "sidebar" substring was eating WordPress / SNOFlex article
bodies whose class contained "sidebar-mode-single". 28 catastrophic
article pages recovered.
LayoutExtractor: body-text fallback for old-school flat HTML
without <main>/<article>/section wrappers. erikdemaine.org/foldcut
and similar plain H1/H2/P-under-body pages now extract.
LayoutExtractor: detect chrome-heavy applicator output as bug-out.
Stale templates applied to wrong-shape pages produced 1 char of
MainContent while combinedText looked fine (header + footer
selectors found chrome). esprit-barbecue, nike, rei collections
recovered.
HeuristicBlockClassifier: empty-semantic-wrapper handling and
body-spanning <form> fall-through. ASP.NET WebForms pages
(drainblasterbill, etc.) recovered.
Framework-content-class-hints: 20 new patterns — Discourse, phpBB,
vBulletin, PrestaShop, WooCommerce, Shopify, BigCommerce,
Squarespace, Webflow, Wix, Joomla, GitHub Pages, plus some misc.
Benchmark harness
WCXB harness gains --operator-templates <root> for loading
YAML files produced by `template train`, --page-ids for fast
repro of individual failures.
Aggregate WCXB (1495 dev pages, Wcxb profile):
| System | F1 | Precision | Recall |
|-------------------|-------:|----------:|-------:|
| StyloExtract v1.8.0-alpha.2 | 0.760 | 0.756 | 0.849 |
| rs-trafilatura | 0.859 | 0.863 | 0.890 |
| Trafilatura | 0.791 | 0.852 | 0.793 |
| Readability | 0.675 | 0.685 | 0.713 |
Compatibility
Backwards-compatible with 1.8.0-alpha.1. All changes are either new
code paths (Discourse extractor, Wcxb profile, train CLI), strictly
better selection (the heuristic fixes), or schema-additive
(TemplateEnrichmentJob gains optional Kind / BadMarkdownSample with
default Induce). Existing operator templates and trained YAMLs from
alpha.1 continue to work unchanged.
Suite: 486 tests across 10 projects, all green.