Mostlylucid.StyloExtract.AspNetCore 2.0.1

.NET 10.0

Mostlylucid.StyloExtract.AspNetCore

AddStyloExtract() DI extensions for ASP.NET Core and any Microsoft.Extensions.DependencyInjection host, plus opt-in Markdown content negotiation middleware.

What this package is

The canonical way to register StyloExtract in any .NET application that uses IServiceCollection. It depends on Core, Html, Fingerprint, Templates, Heuristics, and Markdown and wires them all up correctly.

Since v1.1.0 it also ships the Markdown content negotiation suite: a global middleware, a per-action MVC attribute, and a Minimal API extension that all transparently return Markdown instead of HTML when a client sends Accept: text/markdown.

When to depend on this directly

This is the package most application code should reference directly. Add it to your web API, worker service, or console application, call AddStyloExtract(), and inject ILayoutExtractor wherever you need it.

dotnet add package Mostlylucid.StyloExtract.AspNetCore

Usage

// Program.cs (ASP.NET Core)
builder.Services.AddStyloExtract(o =>
{
    o.StorePath    = "styloextract-templates.db";
    o.HostHashKey  = Environment.GetEnvironmentVariable("STYLOEXTRACT_HMAC_KEY");
    o.DefaultProfile = ExtractionProfile.RagFull;

    o.Match.FastPathJaccardThreshold = 0.85;
    o.Match.SlowPathCosineThreshold  = 0.75;
    o.Centroid.DriftRefitThreshold   = 0.35;
});

// Inject ILayoutExtractor in a controller, service, or background worker
public class ContentService(ILayoutExtractor extractor)
{
    public async Task<string> GetMarkdownAsync(string html, Uri uri)
    {
        var result = await extractor.ExtractAsync(html, uri);
        return result.Markdown;
    }
}

Version event sink

To receive template version change events, register an ITemplateVersionEventSink before calling AddStyloExtract:

services.AddSingleton<ITemplateVersionEventSink, MyVersionEventSink>();
services.AddStyloExtract(o => { ... });

If no sink is registered, DefaultNoopVersionEventSink is used and events are discarded.

Response policy framework (v1.2)

IResponsePolicy is the canonical response-transformation primitive in StyloExtract.AspNetCore. Markdown content negotiation is the first built-in policy instance; cache-hint emission is the second. The framework is modelled on IOutputCachePolicy's three-phase lifecycle.

Three phases

public interface IResponsePolicy
{
    // Pre-pipeline: parse request, configure vary semantics, store per-request state.
    ValueTask OnRequestAsync(ResponsePolicyContext context);

    // Pre-serve: short-circuit the response (e.g. serve from cache) without calling downstream.
    ValueTask OnServeAsync(ResponsePolicyContext context);

    // Post-produce: transform the buffered body, set headers, store in cache.
    ValueTask OnProducedAsync(ResponsePolicyContext context);
}

Setup

Recommended path (new in v1.2): use the fluent AddStyloExtract(Action<ResponsePolicyBuilder>) overload.

// 1. Register the core stack and Markdown negotiation.
builder.Services.AddStyloExtract(o => o.StorePath = "styloextract.db");
builder.Services.AddStyloExtractMarkdownNegotiation(o => { ... });

// 2. Register named policies via the fluent builder (recommended).
builder.Services.AddStyloExtract(b =>
{
    b.AddPolicy("md",    p => p.NegotiateMarkdown());
    b.AddPolicy("cache", p => p.CacheHints(o =>
    {
        o.MaxAge = TimeSpan.FromMinutes(5);
        o.EmitETag = true;
        o.HonorIfNoneMatch = true;
    }));
});

// 3. Wire the middleware (after UseRouting, UseAuthentication, UseAuthorization).
app.UseRouting();
app.UseStyloExtract();

If you need access to the service provider to construct policies manually, use the factory overload instead:

builder.Services.AddSingleton<ResponsePolicyOptions>(sp =>
{
    var opts = new ResponsePolicyOptions();
    opts.AddPolicy("md", sp.GetRequiredService<MarkdownNegotiationPolicy>());
    opts.AddPolicy("cache", new CacheHintPolicy(new CacheHintOptions { MaxAge = TimeSpan.FromMinutes(5) }));
    return opts;
});

Attaching policies to endpoints

// Minimal API: chain WithResponsePolicy calls in declaration order.
app.MapGet("/article", handler)
    .WithResponsePolicy("md")
    .WithResponsePolicy("cache");

// MVC controller action: use [ResponsePolicy] attribute.
[HttpGet("article")]
[ResponsePolicy("md")]
public IActionResult GetArticle() => Content(html, "text/html");

Composition

Policies run in declaration order. Each policy's OnProducedAsync sees the body as it was left by the preceding policy. When MarkdownNegotiationPolicy runs before CacheHintPolicy, the ETag is computed from the Markdown body, not the original HTML.
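
The ordering effect can be illustrated with a toy model. This is a sketch only: ToyResponse and the two static steps are hypothetical stand-ins, not the package's IResponsePolicy types, and the "Markdown conversion" is a fixed string rather than a real extraction.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical stand-in for a buffered response; not a StyloExtract type.
public sealed class ToyResponse
{
    public string Body { get; set; } = "";
    public Dictionary<string, string> Headers { get; } = new();
}

public static class CompositionSketch
{
    // Stand-in for a Markdown policy: rewrites the body and content type.
    public static void ToMarkdown(ToyResponse r)
    {
        r.Body = "# Title";
        r.Headers["Content-Type"] = "text/markdown; charset=utf-8";
    }

    // Stand-in for a cache-hint policy: hashes whatever body it is handed.
    public static void AddETag(ToyResponse r)
    {
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(r.Body));
        r.Headers["ETag"] = "\"" + Convert.ToHexString(hash) + "\"";
    }

    // Runs the steps in declaration order over one shared response.
    public static ToyResponse Run(string html, params Action<ToyResponse>[] policies)
    {
        var response = new ToyResponse { Body = html };
        foreach (var policy in policies) policy(response);
        return response;
    }
}
```

Calling CompositionSketch.Run(html, CompositionSketch.ToMarkdown, CompositionSketch.AddETag) hashes the Markdown body, while running AddETag alone hashes the HTML, so the two ETag values differ.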

Backward compat (v1.1 paths still work)

All v1.1 entry points (UseStyloExtractMarkdownNegotiation, [NegotiateMarkdown], WithMarkdownNegotiation, StyloExtractResults.HtmlOrMarkdown) remain unchanged and keep their existing behavior. The new MarkdownNegotiationPolicy provides equivalent functionality on the IResponsePolicy pipeline; new code should prefer it via services.AddStyloExtract(b => b.AddPolicy("md", p => p.NegotiateMarkdown(...))) and endpoint.WithResponsePolicy("md").

The framework is purely additive:

- All v1.1.0 public API signatures are unchanged.
- The existing AddStyloExtract(Action<StyloExtractOptions>?) signature is unchanged.

Markdown content negotiation

StyloExtract can transparently serve Markdown instead of HTML when a client sends Accept: text/markdown. Three opt-in paths are provided; choose the one that fits your app.

1. Global middleware

Call AddStyloExtractMarkdownNegotiation() in your services and UseStyloExtractMarkdownNegotiation() in your pipeline. Every HTML response on every route is subject to negotiation.

// Program.cs
builder.Services.AddStyloExtract(o => o.StorePath = "styloextract.db");
builder.Services.AddStyloExtractMarkdownNegotiation(o =>
{
    o.DefaultProfile = ExtractionProfile.RagFull;
    o.EmitVaryHeader = true;       // adds Vary: Accept to negotiated responses
    o.MaxBodyBytes   = 4 * 1024 * 1024; // skip bodies larger than 4 MB
});

// ...
app.UseRouting();
app.UseStyloExtractMarkdownNegotiation(); // after UseRouting
app.MapControllers();

A client that sends Accept: text/markdown receives Content-Type: text/markdown; charset=utf-8. All other clients receive the original HTML. When EmitVaryHeader is true, Vary: Accept is added to negotiated responses so that HTTP caches store the HTML and Markdown variants separately, keyed on the request's Accept header.
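
The Accept check itself can be sketched in a few lines. This is an illustration under assumptions, not the middleware's actual parser: it splits the header on commas (so quoted parameters containing commas are not handled), ignores q-value ranking, and treats any non-zero quality for text/markdown as a request for Markdown.

```csharp
using System;
using System.Net.Http.Headers;

public static class AcceptSketch
{
    // True when the Accept header lists text/markdown with a non-zero q-value.
    public static bool WantsMarkdown(string? acceptHeader)
    {
        if (string.IsNullOrWhiteSpace(acceptHeader)) return false;

        var options = StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries;
        foreach (var part in acceptHeader.Split(',', options))
        {
            // Skip entries that are not valid media ranges.
            if (!MediaTypeWithQualityHeaderValue.TryParse(part, out var value)) continue;
            if (!string.Equals(value.MediaType, "text/markdown", StringComparison.OrdinalIgnoreCase)) continue;

            // A missing q-value means q=1; q=0 means "not acceptable".
            if (value.Quality is null or > 0) return true;
        }
        return false;
    }
}
```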

2. Per-action MVC attribute

Use [NegotiateMarkdown] on a controller action or controller class when you want per-endpoint control without a global middleware.

[HttpGet("article/{id}")]
[NegotiateMarkdown(ExtractionProfile.AgentNavigation)]
public IActionResult GetArticle(int id)
{
    var html = BuildArticleHtml(id);
    return Content(html, "text/html");
}

The attribute runs as an IAsyncResultFilter. It does not require the global middleware to be registered.

3. Minimal API

Use .WithMarkdownNegotiation() on a route builder to add an endpoint filter, or use StyloExtractResults.HtmlOrMarkdown(...) to produce the right result type in the handler itself.

// Endpoint filter approach
app.MapGet("/article", () => Results.Content(BuildHtml(), "text/html"))
   .WithMarkdownNegotiation(ExtractionProfile.RagFull);

// Inline IResult approach
app.MapGet("/article", (IHttpContextAccessor acc) =>
    StyloExtractResults.HtmlOrMarkdown(BuildHtml()));

StyloExtractResults.HtmlOrMarkdown inspects Accept and calls ILayoutExtractor before the response is written, making it the simplest approach for Minimal API when you control the handler body.

Profile selection

The profile used for extraction is resolved in this order:

1. X-Stylo-Profile request header (e.g. AgentNavigation)
2. stylo_profile query string parameter (e.g. ?stylo_profile=RagFull)
3. MarkdownNegotiationOptions.DefaultProfile (default: RagFull)

The header and query names are configurable via MarkdownNegotiationOptions.ProfileHeaderName and ProfileQueryName.
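
The resolution order can be sketched as a plain fallback chain. The enum below is a stand-in that lists only the two profile names mentioned in this README, and two behaviours are assumptions rather than documented facts: values are matched case-insensitively, and an unrecognised value falls through to the next source.

```csharp
using System;
using System.Collections.Generic;

// Stand-in enum; only the member names that appear in this README are listed.
public enum ExtractionProfile { RagFull, AgentNavigation }

public static class ProfileSketch
{
    public static ExtractionProfile Resolve(
        IReadOnlyDictionary<string, string> headers,
        IReadOnlyDictionary<string, string> query,
        ExtractionProfile defaultProfile = ExtractionProfile.RagFull)
    {
        // 1. The request header wins when it names a known profile.
        if (headers.TryGetValue("X-Stylo-Profile", out var h) && TryParse(h, out var fromHeader))
            return fromHeader;

        // 2. The query-string parameter is consulted next.
        if (query.TryGetValue("stylo_profile", out var q) && TryParse(q, out var fromQuery))
            return fromQuery;

        // 3. Otherwise fall back to the configured default.
        return defaultProfile;
    }

    // Accepts only names that map to a declared enum member.
    private static bool TryParse(string raw, out ExtractionProfile profile) =>
        Enum.TryParse(raw, ignoreCase: true, out profile) && Enum.IsDefined(profile);
}
```

Callers would normally pass dictionaries built with StringComparer.OrdinalIgnoreCase, since HTTP header names are case-insensitive.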

Query-string Accept override (v1.1.0+)

Browser clients cannot easily set custom Accept headers. The AcceptOverrideQueryName option (default: "format") maps a query-string value to a virtual Accept header, so ?format=markdown behaves identically to Accept: text/markdown for any browser.

builder.Services.AddStyloExtractMarkdownNegotiation(o =>
{
    o.AcceptOverrideQueryName = "format"; // null to disable
    // Default mappings: markdown/md => text/markdown, html => text/html,
    //                   json => application/json, text => text/plain
});

When the override fires, the response carries X-Stylo-Accept-Override: text/markdown so consumers can see it was applied.
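
A minimal sketch of the override lookup, using the default mappings listed in the comment above. The helper name and the rule that a recognised override takes precedence over the real Accept header are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class AcceptOverrideSketch
{
    // The default mappings listed in the options comment.
    private static readonly Dictionary<string, string> Mappings =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["markdown"] = "text/markdown",
            ["md"] = "text/markdown",
            ["html"] = "text/html",
            ["json"] = "application/json",
            ["text"] = "text/plain",
        };

    // Returns the mapped media type when the query value is recognised,
    // otherwise the Accept header the client actually sent.
    public static string? EffectiveAccept(string? acceptHeader, string? formatValue) =>
        formatValue is not null && Mappings.TryGetValue(formatValue, out var mapped)
            ? mapped
            : acceptHeader;
}
```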

Caching (v1.1.0+)

Set Cache.Enabled = true to avoid re-extracting the same URL + profile combination on repeated requests. The implementation uses IDistributedCache (in-memory by default). To use a real distributed cache, register it before calling AddStyloExtractMarkdownNegotiation.

builder.Services.AddStyloExtractMarkdownNegotiation(o =>
{
    o.Cache.Enabled = true;
    o.Cache.AbsoluteExpiration = TimeSpan.FromMinutes(5);
    o.Cache.SlidingExpiration = TimeSpan.FromMinutes(2);
    o.Cache.EnableEtag = true;               // honors If-None-Match; returns 304
    o.Cache.EmitCacheControlHeader = false;  // set true for CDN-friendly Cache-Control: public
});

Cache key shape: sha256(method + "|" + scheme + "|" + host + "|" + path + "|" + sortedQuery(minus override key) + "|" + profile). The Accept override query parameter is excluded from the key so ?format=markdown and a bare Accept: text/markdown request share the same cache slot.
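
The key shape can be sketched as follows. The query serialisation (key=value pairs joined with "&", ordinal sort), the UTF-8 encoding, and the hex output are assumptions; the README only specifies the fields and the "|" separator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class CacheKeySketch
{
    public static string Build(
        string method, string scheme, string host, string path,
        IEnumerable<KeyValuePair<string, string>> query,
        string profile, string overrideKey = "format")
    {
        // Drop the Accept-override parameter, then sort so that
        // parameter order does not change the key.
        var sortedQuery = string.Join("&", query
            .Where(kv => !string.Equals(kv.Key, overrideKey, StringComparison.OrdinalIgnoreCase))
            .OrderBy(kv => kv.Key, StringComparer.Ordinal)
            .ThenBy(kv => kv.Value, StringComparer.Ordinal)
            .Select(kv => kv.Key + "=" + kv.Value));

        // Join the fields with "|" and hash the result.
        var material = string.Join("|", method, scheme, host, path, sortedQuery, profile);
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(material)));
    }
}
```

With this sketch, a request for /article?format=markdown&page=2 and one for /article?page=2 produce the same key, because the override parameter is filtered out before hashing.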

Response headers on Markdown responses:

Header	Value
`X-Stylo-Cache`	`miss` or `hit`
`ETag`	SHA-256 digest of the Markdown bytes (when `EnableEtag = true`)
`Cache-Control`	`public, max-age=N` (when `EmitCacheControlHeader = true`)
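
A sketch of the ETag round trip behind EnableEtag. The quoting and hex encoding of the digest are assumptions, and the comparison handles only a single validator or "*", not a comma-separated If-None-Match list.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ETagSketch
{
    // Strong ETag: quoted hex SHA-256 of the UTF-8 Markdown bytes.
    public static string Compute(string markdown) =>
        "\"" + Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(markdown))) + "\"";

    // 304 when the client's validator matches the current body, 200 otherwise.
    public static int StatusFor(string markdown, string? ifNoneMatch)
    {
        if (ifNoneMatch is null) return 200;
        var candidate = ifNoneMatch.Trim();
        return candidate == "*" || candidate == Compute(markdown) ? 304 : 200;
    }
}
```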

AOT

This package sets IsAotCompatible=true. The negotiation middleware and attribute use no reflection-based JSON serialization; Markdown output is plain text. IDistributedCache and MemoryDistributedCache are both AOT-safe.

Full documentation and package family

Product	Compatible and additional computed target framework versions.
.NET	net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

net10.0
- Mostlylucid.StyloExtract.Core (>= 2.0.1)
- Mostlylucid.StyloExtract.Fingerprint (>= 2.0.1)
- Mostlylucid.StyloExtract.Heuristics (>= 2.0.1)
- Mostlylucid.StyloExtract.Html (>= 2.0.1)
- Mostlylucid.StyloExtract.Markdown (>= 2.0.1)
- Mostlylucid.StyloExtract.Templates (>= 2.0.1)

NuGet packages (1)

Showing the top 1 NuGet packages that depend on Mostlylucid.StyloExtract.AspNetCore:

Package	Downloads
Mostlylucid.StyloExtract.StyloBot: Bridge between StyloExtract and StyloBot's IActionPolicy registry. Provides extract-markdown / extract-headers / extract-sidecar / extract-passthrough action policies that operators reference by name from EndpointPolicy rules or [BotAction] attributes.	106

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
2.0.1	105	6/30/2026
2.0.0	103	6/28/2026
1.8.0	106	6/27/2026
1.8.0-alpha.23	60	6/27/2026
1.8.0-alpha.22	53	6/27/2026
1.8.0-alpha.21	55	6/27/2026
1.8.0-alpha.20	61	6/27/2026
1.8.0-alpha.19	61	6/26/2026
1.8.0-alpha.18	77	6/26/2026
1.8.0-alpha.17	90	6/26/2026
1.8.0-alpha.16	86	6/26/2026
1.8.0-alpha.15	63	6/26/2026
1.8.0-alpha.14	53	6/26/2026
1.8.0-alpha.13	53	6/26/2026
1.8.0-alpha.12	54	6/26/2026
1.8.0-alpha.11	56	6/26/2026
1.8.0-alpha.10	58	6/26/2026
1.8.0-alpha.9	60	6/25/2026
1.8.0-alpha.8	59	6/25/2026
1.8.0-alpha.4	59	6/25/2026

StyloExtract 2.0.0 - 2026-06-28
================================

First stable release. Closes Phase 1 + Phase 2 of the identity-claim
rework that ran across alpha.22, alpha.23, and the in-flight code that
was never tagged. Stable means the v2 API contracts (IdentityClaim, the
streaming options, the operator-template shape with Claims, the
apply-time quality gate) are now things consumers can build on.

What's new since 1.8.0-alpha.21
-------------------------------

Identity-claim primitive (Phase 1)

- New `IdentityClaim` type — outermost-first ancestor chain of
(tag, id, classes, data-* / aria-* / role) entries, anchoring every
selector by stable identity rather than by CSS string.
- `DefaultClassStabilityFilter` rejects hash-shaped class tokens
(Tailwind JIT names, CSS-module hashes, build-time churn) so that
emitted claims survive across visits.
- Inducer is identity-aware end-to-end: cardinality-aware uniqueness
for repeated roles, narrow tripwires for the streaming side, no
CSS-string emission anywhere on the apply path.
- Layout extractor's apply path runs on `IdentityClaimApplicator`;
the old CSS-string applicator is gone.
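
To make "hash-shaped class tokens" concrete, here is one possible heuristic. It is an illustration only and is not the rule set used by `DefaultClassStabilityFilter`; the thresholds (trailing segment of at least 5 characters with at least 2 digits) are invented for this sketch and will misclassify some real-world names.

```csharp
using System;
using System.Linq;

public static class ClassStabilitySketch
{
    public static bool LooksStable(string token)
    {
        if (string.IsNullOrWhiteSpace(token)) return false;

        // Arbitrary-value utility classes such as w-[37px] are generated per use.
        if (token.Contains('[') || token.Contains(']')) return false;

        var segments = token.Split(new[] { '-', '_' }, StringSplitOptions.RemoveEmptyEntries);
        if (segments.Length == 0) return false;

        // A long trailing segment mixing letters with several digits reads like a build hash.
        var tail = segments[^1];
        var digits = tail.Count(char.IsDigit);
        var letters = tail.Count(char.IsLetter);
        var hashShaped = tail.Length >= 5 && digits >= 2 && letters >= 1;

        return !hashShaped;
    }
}
```

Under these invented thresholds, article-body and col-md-6 count as stable, while css-1a2b3c and w-[37px] are rejected.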

Streaming gateway: exact tripwire matching + bounded memory

- The streaming scanner shifted from MinHash + LSH bands to exact
`IdentityClaim` matching against the per-event hash data the
tokenizer carries on each `TagEvent` (tag-name + id + per-class
+ per-data-attr + per-aria + role hashes). The matcher walks the
claim's required hashes linearly against the event's hash arrays;
no per-tick MinHash recompute, no sliding window.
- `StreamingTokenizerOptions` replaces the hard `MaxBufferSize`
consts on `IncrementalHtmlTokenizer` and
`IncrementalBytePatternScanner`. Both buffers are now rented from
`ArrayPool<byte>.Shared` and grow on demand up to the configurable
ceiling (default 1 MiB per buffer). Both classes are `IDisposable`.
- `TagAttrLimits` replaces the per-event `TagEvent.MaxClassesPerEvent`
(was 8) and `TagEvent.MaxAttrPairsPerEvent` (was 3). Defaults
bumped to 32 / 16, validated up to 256 / 128 ceilings. Real pages
no longer silently lose the tail.
- Streaming-template inducer rewrite (Task 4 of Phase 1) — emits
`IdentityClaim`-based tripwires shared with the layout side.
- Incremental byte-pattern scanner (Task 13) replaces the alpha.21
tripwire scanner with a faster exact-match path; tag-hash prefilter
cuts per-scan allocs ~25-30x.
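
The "rented from `ArrayPool<byte>.Shared`, grow on demand up to a ceiling" pattern can be sketched like this. PooledTailBuffer is a hypothetical helper written for this note, not a StyloExtract type; the 256-byte starting size and doubling growth policy are assumptions, and only the 1 MiB default ceiling comes from the text above.

```csharp
using System;
using System.Buffers;

// Hypothetical helper, not a StyloExtract type.
public sealed class PooledTailBuffer : IDisposable
{
    private readonly int _maxBytes;
    private byte[] _buffer;
    private int _length;

    public PooledTailBuffer(int initialBytes = 256, int maxBytes = 1024 * 1024)
    {
        _maxBytes = maxBytes;
        _buffer = ArrayPool<byte>.Shared.Rent(Math.Min(initialBytes, maxBytes));
    }

    public int Length => _length;

    public void Append(ReadOnlySpan<byte> bytes)
    {
        var required = _length + bytes.Length;
        if (required > _maxBytes)
            throw new InvalidOperationException($"Buffered tail would exceed {_maxBytes} bytes.");

        if (required > _buffer.Length)
        {
            // Double on demand, clamped to the ceiling; copy, then return the old array.
            var newSize = Math.Min(Math.Max(required, _buffer.Length * 2), _maxBytes);
            var bigger = ArrayPool<byte>.Shared.Rent(newSize);
            _buffer.AsSpan(0, _length).CopyTo(bigger);
            ArrayPool<byte>.Shared.Return(_buffer);
            _buffer = bigger;
        }

        bytes.CopyTo(_buffer.AsSpan(_length));
        _length = required;
    }

    // Forget the buffered bytes but keep the rented array for reuse.
    public void Clear() => _length = 0;

    public void Dispose()
    {
        if (_buffer.Length == 0) return;
        ArrayPool<byte>.Shared.Return(_buffer);
        _buffer = Array.Empty<byte>();
    }
}
```

Because the array is rented, forgetting to dispose does not leak memory in the strict sense, but it does stop the pool from reusing the array, which is why the real classes being `IDisposable` matters for long-lived consumers.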

Apply-time quality gate + auto-repair loop

- New `ApplicatorBrokenCheck` lifts the apply-time bug-out signal
out of LayoutExtractor's local function into a unit-testable gate.
Three new failure modes: noisy-MainContent (link-density >= 0.5
inside a content block, catches the Wikipedia / mostlylucid
language-picker leak), image-anchor picker (many short-text
anchors, catches the route-variant strip), metadata-shape
rejection (key:value-dominated blocks, catches the MS Learn YAML
frontmatter leak).
- LayoutExtractor Move 3 widens the repair-enqueue gate: drops the
"hand-authored template must exist" requirement, triggers on
applicatorBugOut OR thin-markdown, adds Refit to the qualifying
match-status set.
- `IsDeterministic` flag on `OperatorTemplate` distinguishes the
heuristic inducer's deterministic YAML audit snapshots from
hand-authored / LLM-induced templates. Deterministic snapshots no
longer block LLM induction.
- `OperatorTemplateRule.Claims` carries the identity-claim ancestor
chain on operator templates so the operator-template path runs on
the identity-claim applicator instead of the CSS-string fallback.

Heuristic block classifier improvements

- Tighten-on-anchor (Move 1) — after a `<main>`/`<article>` qualifies
as MainContent, look down one level for a div/section descendant
with a stable identity anchor (stable id OR >= 1 stable class) that
carries >= 80% of the wrapper's prose text and has link density
< 0.5. When exactly one descendant qualifies, prefer it. Catches
Wikipedia + mostlylucid leaks where the picker rides inside the
outer semantic element.
- `<article>` semantic-tag exception in repeated-item link-density
gate — news-listing pattern where each card is a single clickable
`<a>` (density ~1.0) now survives the gate. The Register, Verge,
Ars, BBC News listings render again.
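
The tighten-on-anchor rule can be sketched as a filter over pre-computed child summaries. The Candidate record is hypothetical, and measuring link density as link characters divided by prose characters is an assumption; the 80% and 0.5 thresholds and the "exactly one" requirement come from the description above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical summary of a div/section descendant; not a StyloExtract type.
public sealed record Candidate(string Name, bool HasStableAnchor, int ProseChars, int LinkChars);

public static class TightenSketch
{
    // Returns the single qualifying descendant, or null to keep the wrapper.
    public static Candidate? Pick(int wrapperProseChars, IEnumerable<Candidate> children)
    {
        var qualifying = children.Where(c =>
                c.HasStableAnchor &&
                c.ProseChars > 0 &&
                c.ProseChars >= 0.8 * wrapperProseChars &&
                (double)c.LinkChars / c.ProseChars < 0.5)
            .Take(2)   // two matches are enough to know the choice is ambiguous
            .ToList();

        return qualifying.Count == 1 ? qualifying[0] : null;
    }
}
```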

Template enrichment coordinator

- `InMemoryTemplateEnrichmentQueue` cooldown key changed from string
Host to (Host, EnrichmentJobKind) tuple. A first-visit Induce no
longer blocks a follow-up Repair on the same host.
- New `ILlmActivityObserver` interface brackets each LLM call with
LlmCallStarted / LlmCallEnded(success). Wired through the DI
builder so consumers can show "llm <host>..." while CPU inference
is running (the lucidVIEW FULL status bar uses this).
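
Why the tuple key matters can be shown with a small sketch. The class and the JobKind enum are hypothetical stand-ins (the real queue and `EnrichmentJobKind` are not reproduced here), and the sketch is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

public enum JobKind { Induce, Repair }   // stand-in for EnrichmentJobKind

public sealed class CooldownSketch
{
    private readonly Dictionary<(string Host, JobKind Kind), DateTimeOffset> _lastEnqueued = new();
    private readonly TimeSpan _cooldown;

    public CooldownSketch(TimeSpan cooldown) => _cooldown = cooldown;

    // Accepts the job unless the same (host, kind) pair was enqueued within the cooldown.
    public bool TryEnqueue(string host, JobKind kind, DateTimeOffset now)
    {
        var key = (host, kind);
        if (_lastEnqueued.TryGetValue(key, out var last) && now - last < _cooldown)
            return false;

        _lastEnqueued[key] = now;
        return true;
    }
}
```

With a host-only key, the Repair call in the sequence Induce, Repair on the same host would be rejected; with the tuple key each kind has its own cooldown slot.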

Corpus mining (Phase 2)

- `SelectorDistance` metric quantifies how similar two emitted
selectors are for evolved-candidate ranking (Task 6).
- `CorpusMiner` query primitives (Task 7) and evolved-selector
emission (Task 8) — proposes alternate selectors from the
template_observations table.
- Passive evaluation of evolved candidates at apply time (Task 9):
evolved selectors run alongside the chosen one and contribute
observations for the next mining cycle.
- Background `CorpusMiningCoordinator` (Task 10) drains the
template_observations table on a cadence and writes evolved
candidates.
- `template_observations` SQLite table (Task 5 of Phase 1) feeds
the mining and evolved-candidate paths.

Cold-path arbitrary caps (now configurable or bumped)

- `NextDataRehydrationExtractor` walker bumped from 500 strings /
depth 12 to 5000 / depth 32 — real Next.js __NEXT_DATA__ blobs
exceeded the old guards.
- LayoutExtractor's LLM-repair sample bumped 400 -> 2000 chars
(more context for the LLM to see what's wrong).
- Skeleton renderer's attr-value truncation bumped 40 -> 160 chars
(covers accessibility-conscious aria-label values).
- Streaming `IncrementalHtmlTokenizer.MaxBufferSize` (was 16 KiB
const that threw on JSON-LD blobs) replaced by
`StreamingTokenizerOptions.MaxPartialTagBytes` (1 MiB default).

Breaking changes you need to know about
---------------------------------------

- `IncrementalHtmlTokenizer.MaxBufferSize` and
`IncrementalBytePatternScanner.MaxBufferSize` public consts removed.
Replaced by per-instance configuration via
`StreamingTokenizerOptions`. The instances are now `IDisposable`;
long-lived consumers should wrap in `using`.
- `TagEvent.MaxClassesPerEvent` and `TagEvent.MaxAttrPairsPerEvent`
internal consts removed. Caps thread through `TagAttrLimits`,
configured from `StreamingTokenizerOptions`. Defaults bumped
(8 -> 32 / 3 -> 16) so existing code that didn't override the cap
sees the same or wider coverage.
- `TagAttributeParser.ExtractIdentityHashes` now takes a
`TagAttrLimits` parameter before the `out` arguments. Update
callers; pass `TagAttrLimits.Default` to keep the new defaults.
- `MinimalHtmlTokenizer` has a new `(input, filter, attrLimits)`
constructor; the existing two-arg constructor delegates to
`TagAttrLimits.Default`.
- `OperatorTemplate` gained `IsDeterministic` (bool) and
`OperatorTemplateRule` gained `Claims`
(`IReadOnlyList<IdentityClaim>?`). Both are init-only; existing
call sites compile, but the YAML round-trip writes the new fields
and the loader sets `IsDeterministic` from the file name.
- Layout extractor's CSS-string applicator path is gone. Templates
emitted before alpha.22 that depend on string-based selectors
rebuild through the identity-claim path on first visit.
- `StreamingTemplate` lost its MinHash signature shape — templates
persisted from alpha.16-alpha.20 re-induce on first visit (the
store's PRAGMA user_version gate drops stale rows).
- Streaming `RollingSketch` / `TagAllowlistBloom` types removed
(alpha.21 deprecated the latter; alpha.24 dropped both with the
byte-pattern matcher).
- `InMemoryTemplateEnrichmentQueue._lastEnqueuedByHost` (private)
changed shape; only matters if you reflected against it.

Tests: 850 across 12 projects, all green.

Migration: most consumers don't need to change anything. The two
patterns that DO need a change are (a) anyone who passed
`IncrementalHtmlTokenizer.MaxBufferSize` to size their own buffer
(use `tok.MaxPartialTagBytes` instead) and (b) anyone who called
`TagAttributeParser.ExtractIdentityHashes` directly (add
`TagAttrLimits.Default` as the second argument).

StyloExtract 1.8.0-alpha.21 - 2026-06-27
=========================================

Streaming: scope fixes (no algorithm replacement)
--------------------------------------------------

Tightens the alpha.19 streaming scanner without replacing the MinHash
matcher. The algorithm shape (MinHash + LSH bands + three fences per
template) is unchanged; what changes is its scope:

1. IncrementalHtmlTokenizer.Feed no longer copies the whole chunk into
  _buffer. Chunks are parsed inline; only the partial-tag tail (if a
  tag straddles a chunk boundary) is retained for stitching with the
  next chunk. PeakBufferedBytes is now bounded by O(longest tag), not
  O(chunk size). Measured: peak = 0 B for a 200 KB body in 16 KB
  chunks, 19 B in 1 KB chunks. MaxBufferSize lowered from 64 KiB to
  4 KiB.

2. RollingSketch shingles upgraded to Markov bigrams: each shingle is
  (prevTagHash, currentTagHash, currentClassHash). Order-sensitive:
  [A, B] and [B, A] now produce different signatures. The leftmost
  shingle in any window uses prevTag = 0 so sliding-window scanners
  match fences built from contiguous event sequences regardless of
  what came before the window.

3. Static StructuralTagAllowlist replaces per-fence TagAllowlistBloom.
  Only structural tags (html/body/header/nav/main/article/section/
  div/p/h1-h6/ul/ol/li/table/...) push into the sketch. meta/link/
  script-chrome/img/span/a bypass the recompute entirely. The
  TagAllowlistBloom JSON property is retained as a back-compat sink
  (read-and-discarded) so persisted templates from alpha.16-alpha.20
  round-trip cleanly.

4. Depth-aware capture-end: while in Capturing, ContentEnd only matches
  when DOM depth has returned to (or below) the depth at ContentStart.
  Nested matches mid-content can no longer terminate capture early.

5. Dead StreamingTemplate.MinContentDepth field removed (never read by
  any scanner).

6. FenceScanner and IncrementalFenceScanner now share a single static
  StreamingTick.Step. Both scanners build a StreamingTickState from
  their respective storage (span-backed vs heap-backed) and execute
  literally the same code. Cross-validation tests retained as insurance.

7. IStreamingTemplateStore gains version-chain APIs:
  - GetByHostAtVersionAsync(host, version) — retrieve a specific version.
  - ListVersionsByHostAsync(host) — enumerate all known versions.
  UpsertAsync now APPENDS per (host, version) rather than replacing.
  SQLite store schema migrated to PK (host, version); existing rows
  auto-migrate to version 1 on first open.
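
The order sensitivity described in item 2 can be sketched as follows. The mixing function is an arbitrary FNV-1a style choice made for illustration and is not the hash RollingSketch uses; only the (prevTagHash, currentTagHash, currentClassHash) shape and the prevTag = 0 rule for the leftmost shingle come from the notes above.

```csharp
using System;
using System.Collections.Generic;

public static class ShingleSketch
{
    // Mixes (prevTag, currentTag, currentClass) into one 64-bit shingle hash.
    public static ulong Shingle(uint prevTag, uint tag, uint cls)
    {
        const ulong offset = 14695981039346656037UL, prime = 1099511628211UL;
        ulong h = offset;
        foreach (var part in new[] { prevTag, tag, cls })
        {
            h ^= part;
            h *= prime;   // wraps modulo 2^64 in the default unchecked context
        }
        return h;
    }

    // Builds the shingle set for a window; the leftmost event uses prevTag = 0.
    public static HashSet<ulong> Shingles(IReadOnlyList<(uint Tag, uint Class)> window)
    {
        var set = new HashSet<ulong>();
        uint prev = 0;
        foreach (var (tag, cls) in window)
        {
            set.Add(Shingle(prev, tag, cls));
            prev = tag;
        }
        return set;
    }
}
```

Feeding the windows [A, B] and [B, A] yields different shingle sets, whereas shingles built from (tag, class) alone would produce the same set for both orders.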

Migration notes:
- Persisted SQLite templates from alpha.16-alpha.20 auto-migrate to
version 1 on first open; existing rows are preserved.
- TemplateFence(uint[], ulong[], ulong, int) constructor removed; the
new shape is TemplateFence(uint[], ulong[], int). TagAllowlistBloom
is still readable as a property (returns 0).
- StreamingTemplate.MinContentDepth removed — drop from any code that
set it in `with` expressions.
- RollingSketch.Push signature changed to Push(prevTagHash, tagHash,
classHash) — direct users must track prev tag.

StyloExtract 1.8.0-alpha.19 - 2026-06-26
=========================================

Streaming: sliding-window design (no full-buffer retention)
------------------------------------------------------------

Refactors alpha.18's IncrementalHtmlTokenizer + IncrementalFenceScanner
to a TRUE sliding-window streaming design:

1. Bytes: only partial-tag bytes are retained. Once a tag is emitted,
  the bytes are dropped immediately (compact-on-emit, not compact-on-
  next-Feed). New PeakBufferedBytes property exposes the high-watermark
  for telemetry. Worst-case in-flight buffer is O(longest tag), not
  O(megabytes). MaxBufferSize lowered from 1 MiB to 64 KiB and
  repositioned as a hard safety stop that should never be hit under
  correct input — exceeding it now means a single tag (or unclosed
  script/style body) genuinely exceeds 64 KiB and the scan must bail.

2. Events: fixed-size sliding window of the last WindowSize tag events
  (unchanged from alpha.18). Push new, pop oldest. The window is the
  only event-level state.

3. RollingSketch: documented (in IncrementalFenceScanner XML doc) that
  MinHash with min-pooling is NOT reversibly rollable — once an element
  leaves the window, its contribution to min(...) can't be subtracted.
  The sketch therefore rebuilds the signature from the current event
  window after each accepted tag (O(WindowSize × SignatureSize) per
  tick, gated by the Bloom allowlist filter to skip the vast majority
  of inbound tags). The bounded-buffer property — the user's headline
  concern — is satisfied by the tokenizer; the sketch's per-tick recompute
  is the price MinHash charges for the LSH-band locality the matcher
  relies on. The event-level memory remains O(WindowSize) regardless.

4. IncrementalFenceScanner now exposes PeakBufferedBytes and BytesConsumed
  passthroughs from the tokenizer so callers can prove the bounded-memory
  property to telemetry without reaching into the tokenizer directly.
  The duplicated tick logic (mirroring FenceScanner.Tick over heap-backed
  sketch state) is retained — it's hard-pinned to the ref-struct path by
  the existing cross-validation tests, which give us higher confidence
  than refactoring to delegate would.
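
The point in item 3 (a minimum cannot be subtracted once an element leaves the window) can be sketched with a rebuild-per-tick signature. This is an illustration, not the library's RollingSketch: the seeded multiply-xorshift mixer is an arbitrary choice, and unlike the real hot path this version allocates a new array on every push.

```csharp
using System;
using System.Collections.Generic;

public sealed class WindowedMinHashSketch
{
    private readonly Queue<ulong> _window = new();
    private readonly ulong[] _seeds;
    private readonly int _windowSize;

    public WindowedMinHashSketch(int windowSize, int signatureSize)
    {
        _windowSize = windowSize;
        _seeds = new ulong[signatureSize];
        var rng = new Random(42);   // fixed seed so signatures are repeatable
        for (var i = 0; i < signatureSize; i++) _seeds[i] = (ulong)rng.NextInt64() | 1UL;
    }

    // Push newest, pop oldest, then rebuild the signature from the current window.
    public ulong[] Push(ulong shingle)
    {
        _window.Enqueue(shingle);
        if (_window.Count > _windowSize) _window.Dequeue();

        var signature = new ulong[_seeds.Length];
        Array.Fill(signature, ulong.MaxValue);
        foreach (var item in _window)   // O(WindowSize x SignatureSize) per tick
            for (var i = 0; i < _seeds.Length; i++)
                signature[i] = Math.Min(signature[i], Mix(item, _seeds[i]));
        return signature;
    }

    // Illustrative mixer only.
    private static ulong Mix(ulong value, ulong seed)
    {
        var x = (value ^ seed) * 0x9E3779B97F4A7C15UL;
        return x ^ (x >> 32);
    }
}
```

Because the signature is rebuilt from the window contents, a sketch that has seen 1, 2, 3 with a window of two ends up with the same signature as a fresh sketch that has only seen 2, 3.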

Memory-cap proof: tests/StreamingMemoryBoundTests.cs feeds 5 MiB of
synthetic HTML in 4 KiB chunks and asserts PeakBufferedBytes stays
under 16 KiB. The streaming gateway can now scan multi-megabyte
responses while holding bounded memory.

Migration: API is unchanged from alpha.18 — refactor is internal. The
new PeakBufferedBytes and BytesConsumed diagnostic properties on
IncrementalFenceScanner are additive. MaxBufferSize is still public but
the new value is 64 KiB (was 1 MiB); only relevant if you were catching
the InvalidOperationException for pathological input.

StyloExtract 1.8.0-alpha.18 - 2026-06-26
=========================================

Streaming: true chunked tokenization + refit/versioning + bench update
-----------------------------------------------------------------------

1. IncrementalHtmlTokenizer + IncrementalFenceScanner
  Stateful tokenizer that survives chunk boundaries. A partial tag at
  the end of one chunk is held in an internal buffer and completed when
  the next Feed call arrives. Pairs with IncrementalFenceScanner —
  callers Feed chunks as they arrive from the network, get a verdict
  per chunk, bail early on Captured / Bailout.

  Trade-off vs MinimalHtmlTokenizer's span path: one buffer allocation
  per request (not per chunk). Use the span path for whole-buffer
  scans, the incremental path for streaming gateways where bytes
  arrive in chunks. Hard cap of 1 MiB on the internal buffer — feed
  throws InvalidOperationException on pathological input that never
  closes a tag, surfacing the failure rather than silently dropping
  bytes.

  Architectural note: FenceScanner stays a ref struct (zero-alloc hot
  path); IncrementalFenceScanner is a heap-backed class that ports the
  same tick logic. The two are kept in lockstep — any drift between
  them is a correctness bug surface and is covered by cross-validation
  tests that feed the same bytes both ways.

2. Streaming-template refit + versioning
  StreamingTemplate gains a Version field (defaults to 1; persists
  across alpha.17 templates without migration). New
  StreamingRefitOrchestrator observes captured-scan output per host
  and kicks off refits, away from the hot path, when either:
    - capture-region EWMA drift exceeds 30% on N consecutive scans, OR
    - every 10th captured scan re-induces and finds different fences
  On refit: version bumps, store is upserted, the new
  IStreamingTemplateVersionSink fires a StreamingTemplateRefitEvent
  (Host, Old/New TemplateId, Old/New Version, Reason, DetectedAt).
  Default sink is a no-op; consumers wire UI telemetry to it.

3. Bench update
  ExtractionComparisonBench gains a New_StreamingScanByHost variant so
  the host-keyed hot-path is benchmarked alongside the original
  GUID-keyed scan. Pre-populates the in-memory store with the
  host="www.mostlylucid.net" template that lucidVIEW FULL hits in
  production.
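
The first refit trigger in item 2 (EWMA drift above 30% on N consecutive scans) can be sketched as below. Everything beyond the 30% threshold is an assumption: the notes do not say which capture-region quantity is tracked (this sketch uses its length), what smoothing factor is used, or what N is.

```csharp
using System;

public sealed class DriftSketch
{
    private readonly double _alpha, _threshold;
    private readonly int _requiredStreak;
    private double? _ewma;
    private int _streak;

    public DriftSketch(double alpha = 0.2, double threshold = 0.30, int requiredStreak = 3)
    {
        _alpha = alpha;
        _threshold = threshold;
        _requiredStreak = requiredStreak;
    }

    // Feeds one captured-region length; returns true when a refit looks due.
    public bool Observe(double capturedLength)
    {
        if (_ewma is not double baseline)
        {
            _ewma = capturedLength;   // the first sample seeds the average
            return false;
        }

        // Relative distance of this sample from the running average.
        var drift = baseline == 0 ? 0 : Math.Abs(capturedLength - baseline) / baseline;
        _streak = drift > _threshold ? _streak + 1 : 0;
        _ewma = _alpha * capturedLength + (1 - _alpha) * baseline;

        if (_streak < _requiredStreak) return false;
        _streak = 0;                  // reset after signalling
        return true;
    }
}
```

Requiring a streak rather than a single outlier keeps one unusually long or short page from triggering a refit.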

Migration: additive APIs. Alpha.17 consumers using ScanByHost continue
to work; the incremental tokenizer and the refit orchestrator are
opt-in (use them when feeding chunks / when wiring drift telemetry).

StyloExtract 1.8.0-alpha.17 - 2026-06-26
=========================================

Streaming: host-keyed templates + naive auto-induction
-------------------------------------------------------

Three changes to close the alpha.16 streaming integration loop:

1. Host-keyed lookup
  IStreamingTemplateStore gains GetByHostAsync / TryGetHotByHost /
  UpsertAsync. StreamingTemplate gains a Host field (required). One
  template per host (latest wins). The existing GUID-keyed methods
  remain — Host is the lookup key for consumers; TemplateId stays for
  stable identity / versioning.

2. StreamingPathSelector.ScanByHost(host, bytes)
  Synchronous hot-cache-only host scan. Returns NoTemplate on miss
  so the caller can WarmByHostAsync + retry or induce.
  WarmByHostAsync brings a host's template into the hot cache via
  the durable tier.

3. StreamingTemplateInducer
  Naive first-pass inducer: walks HTML via MinimalHtmlTokenizer,
  finds semantic-marker tag-sequence-pairs (<header>...</header>,
  <p>...</p>...<p>...</p>, <footer>/</main>/</body>) and produces a
  StreamingTemplate ready to upsert. Returns null on pages with no
  identifiable structural fences (plain text, image-only, etc.).
  Describe() returns a human-readable summary of the chosen markers
  for logging.

Storage migrations:
- InMemoryStreamingTemplateStore: adds an in-memory host index.
- SqliteStreamingTemplateStore: adds a 'host' TEXT column + index;
   on-open ALTER TABLE migration handles pre-alpha.17 schemas
   (existing rows get Host="", so they are reachable only by GUID).

Migration: additive APIs; alpha.16 consumers using only the
GUID-keyed surface continue to work, with one source-level exception.
The new Host field on StreamingTemplate is required, so existing
construction sites must set Host="" if they have no host context.

StyloExtract 1.8.0-alpha.16 - 2026-06-26
=========================================

Mostlylucid.StyloExtract.Streaming — zero-allocation byte-stream scanner
------------------------------------------------------------------------

New package on NuGet. Hot-path streaming fence scanner: skips page chrome
and captures the content region as response bytes flow past, using
MinHash-derived structural fences. Zero per-request GC-tracked
allocations in steady state.

Designed for the gateway position — drop into a response pipeline
(HttpClient, Stylobot's edge, ASP.NET output filters) alongside the byte
stream and emit a verdict without buffering the full page.

Public hot-path API:
StreamingPathSelector.Scan(Guid templateId, ReadOnlySpan<byte> html)
   → ScanVerdict { Continue | Captured | Bailout | NoTemplate }

// Warm a template into the hot cache:
await selector.WarmAsync(templateId);

Storage:
- InMemoryStreamingTemplateStore — single-process LRU.
- SqliteStreamingTemplateStore — durable; same SQLite file pattern as
   the existing ITemplateIndex but a separate table.

Pairs with the existing StyloExtract.Fingerprint learn path and
ITemplateIndex template store. The streaming template format is its own
shape (TemplateFence with MinHash bloom, content-start/content-end
fences) — not an LLM template or operator template.

Bench results vs LayoutExtractor on mostlylucid fixtures: see
bench/StyloExtract.Streaming.Benchmarks/ (zero-alloc scan competitive
with the full extractor's path-match cost while never building a DOM).

Migration: additive package; consumers add a PackageReference to
Mostlylucid.StyloExtract.Streaming if they want gateway-position
scanning.

StyloExtract 1.8.0-alpha.15 - 2026-06-26
=========================================

RenderOptions.WaitUntil — opt out of NetworkIdle for SPA routing
-----------------------------------------------------------------

PlaywrightHtmlFetcher previously hardcoded WaitUntilState.NetworkIdle
for the primary GotoAsync. On sites with aggressive client-side
routing (BBC News auto-navigates /news → /articles/<id> in the
post-load JS phase), this meant the fetcher returned the post-routing
DOM, not the page the user requested.

RenderOptions now exposes a WaitUntil property (PlaywrightWaitUntil
enum: Load / DOMContentLoaded / NetworkIdle / Commit). Default stays
NetworkIdle for backwards compatibility. Consumers fetching SPA-heavy
sites should set WaitUntil to Load to capture the initial DOM before
the router fires.

The secondary WaitForLoadStateAsync(NetworkIdle, ...) drain remains —
it's independently bounded by WaitForNetworkIdleTimeout and serves as
a best-effort late-XHR catch-up; safe even with the primary returning
on Load.

PlaywrightWaitUntil is a small enum (rather than
Microsoft.Playwright.WaitUntilState directly) so consumers don't take a
transitive dependency on Microsoft.Playwright just to pick a strategy.
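
Opting out of NetworkIdle might look like the sketch below. The
RenderOptions class here is a local stand-in that models only the
WaitUntil property named above; the real type may have more members or
a different construction pattern, and ForSpa is a hypothetical helper.

```csharp
using System;

// Stand-ins mirroring the names in the notes above; not the package's types.
public enum PlaywrightWaitUntil { Load, DOMContentLoaded, NetworkIdle, Commit }

public sealed class RenderOptions
{
    // Default stays NetworkIdle, matching the backwards-compatible default.
    public PlaywrightWaitUntil WaitUntil { get; set; } = PlaywrightWaitUntil.NetworkIdle;
}

public static class RenderOptionsSketch
{
    // Hypothetical helper: options for SPA-heavy sites, returning on Load
    // so the initial DOM is captured before the client-side router fires.
    public static RenderOptions ForSpa() =>
        new RenderOptions { WaitUntil = PlaywrightWaitUntil.Load };
}
```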

StyloExtract 1.8.0-alpha.14 - 2026-06-26
=========================================

Sitemap CLI end-to-end regression + LLM nav few-shot
-----------------------------------------------------

1. Sitemap CLI test suite

The alpha.11 stylo-extract sitemap verb has been working on real sites
since alpha.13 (heuristic nav-classification tightening), but no tests
guarded it against regressions. Added 5 end-to-end tests in
StyloExtract.Core.Tests/SitemapCommandTests.cs that invoke the
SitemapCommand.CrawlAsync handler against the mostlylucid-home.html.gz
fixture (real captured homepage, shared with the heuristics suite) plus
a stub HttpMessageHandler. No network access required. They assert:
- real nav links are emitted under # www.mostlylucid.net
- --max-depth 0 emits only the seed Title row
- off-host links are not followed
- the --max-pages cap is honoured exactly
- --delay-ms is enforced with a stopwatch floor

2. LLM induction prompt — nav-classification few-shot

LlmInducerPrompts.System and SystemRepair now include a second worked
example: a blog homepage with header <nav>, breadcrumb,
MainContent + RepeatedItem post cards, and footer <nav>. Mirrors the
patterns the alpha.13 NavPreDetector heuristic correctly classifies.
Rule 6 (RepeatedItem usage) tightened with explicit guidance that
header/footer nav lists are PrimaryNavigation / SecondaryNavigation at
the parent <ul>/<nav> level, NOT RepeatedItem at the <li> level —
closes a known LLM confusion mode.

Tests: snapshot tests in StyloExtract.Core.Tests/LlmInducerPromptsTests.cs
verify the prompt extensions land verbatim so future prompt edits don't
accidentally regress them.

StyloExtract 1.8.0-alpha.13 - 2026-06-26
=========================================

Heuristic nav-classification tightening
----------------------------------------

HeuristicBlockClassifier was under-classifying real-world nav patterns
on server-rendered sites — header <nav> strips, header <ul>-of-links,
breadcrumb lists, role="navigation" attributes, footer nav — all landed
as Boilerplate (or weren't extracted at all). Result: the alpha.11
Sitemap profile and stylo-extract sitemap CLI verb produced a one-line
tree even on sites with rich nav, because the classifier didn't surface
PrimaryNavigation / SecondaryNavigation / Breadcrumb roles for them.

Tightened patterns now produce definite role classifications:
1. <header> <nav> -> PrimaryNavigation (0.9)
2. Top-of-document <nav> -> PrimaryNavigation (0.85)
3. <footer> <nav> -> SecondaryNavigation (0.9)
4. <nav aria-label="breadcrumb"> / class~="breadcrumb" -> Breadcrumb (0.95)
5. <* role="navigation"> -> PrimaryNavigation (0.95)
6. Header <ul> of mostly-link <li>s -> PrimaryNavigation (0.85) at
    the <ul> level, suppress descent (was emitting deep Boilerplate)
7. Footer <ul> of mostly-link <li>s -> SecondaryNavigation (0.85)
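
The seven rules above can be read as a small decision table. The sketch
below is an illustration only: NavCandidate and NavRuleSketch are
stand-ins invented for this example, the rule precedence (highest
confidence first) is an assumption, and MostlyLinks is supplied by the
caller because the notes do not state the link-ratio threshold.

```csharp
using System;
using System.Linq;

// Stand-in types for illustration; not the HeuristicBlockClassifier API.
public enum NavRole { PrimaryNavigation, SecondaryNavigation, Breadcrumb }

// Simplified view of a candidate container. Landmark is the enclosing
// "header" or "footer" element name, or empty if neither applies.
public sealed record NavCandidate(
    string Tag,
    string Landmark = "",
    bool TopOfDocument = false,
    string AriaLabel = "",
    string CssClass = "",
    string RoleAttr = "",
    bool MostlyLinks = false);

public static class NavRuleSketch
{
    // Returns null when no rule matches. The order of the checks is an
    // assumption of this sketch, not documented behaviour.
    public static (NavRole Role, double Confidence)? Classify(NavCandidate c)
    {
        bool isNav = c.Tag == "nav";
        bool isLinkList = c.Tag == "ul" && c.MostlyLinks;
        bool breadcrumbClass = c.CssClass
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Contains("breadcrumb");

        if ((isNav && c.AriaLabel == "breadcrumb") || breadcrumbClass)
            return (NavRole.Breadcrumb, 0.95);                                               // rule 4
        if (c.RoleAttr == "navigation") return (NavRole.PrimaryNavigation, 0.95);            // rule 5
        if (isNav && c.Landmark == "header") return (NavRole.PrimaryNavigation, 0.9);        // rule 1
        if (isNav && c.Landmark == "footer") return (NavRole.SecondaryNavigation, 0.9);      // rule 3
        if (isNav && c.TopOfDocument) return (NavRole.PrimaryNavigation, 0.85);              // rule 2
        if (isLinkList && c.Landmark == "header") return (NavRole.PrimaryNavigation, 0.85);  // rule 6
        if (isLinkList && c.Landmark == "footer") return (NavRole.SecondaryNavigation, 0.85); // rule 7
        return null;
    }
}
```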

Implementation: a new NavPreDetector runs after per-element classification
and injects each detected nav container as a high-score (50000) candidate
at the parent level, then demotes any descendant candidates so greedy
selection picks the nav parent and stops descending into its noise.
Containers nested inside <main>/<article> are skipped — IntraBlockCleaner
already strips them as intra-block contaminants; hoisting would steal
the article's selection win.

Regression fixtures captured from mostlylucid.net + wikipedia.org under
tests/StyloExtract.Heuristics.Tests/Fixtures so the next time a
classifier change regresses real-world nav detection, the bench catches
it before it ships.

Downstream impact: the Sitemap ExtractionProfile and stylo-extract
sitemap CLI verb now produce real nav trees on these sites - see the
lucidview FULL dogfood smoke for evidence.

StyloExtract 1.8.0-alpha.12 - 2026-06-26
=========================================

DI wire-up fix for deterministic-template YAML persistence
-----------------------------------------------------------

alpha.11 introduced DeterministicTemplateYamlSink + the
AddStyloExtractOperatorTemplates registration, but AddStyloExtract's
LayoutExtractor construction did not pass the sink through to the
extractor — so even when the sink was registered in DI, LayoutExtractor's
optional ctor parameter defaulted to null and no `<host>-deterministic.yaml`
file was ever written.

Fixed by threading `sp.GetService<DeterministicTemplateYamlSink>()` to the
LayoutExtractor constructor in AddStyloExtract. No API change; consumers
who already called AddStyloExtractOperatorTemplates start seeing
deterministic YAML files immediately after upgrading.

StyloExtract 1.8.0-alpha.11 - 2026-06-26
=========================================

Sequenced architecture extension: deterministic templates with
extended classification — Title role, Sitemap profile, deterministic
YAML persistence, and a sitemap CLI verb.

Title BlockRole
---------------

New BlockRole.Title value distinguishes the page-level <h1> (the single
H1 the rest of the page is "about") from intra-content Heading
(H2/H3/H4 inside the body). HeuristicBlockClassifier surfaces the Title
via a shared PageTitleDetector helper, picking the H1 inside or closest
to <main>/<article> and falling back to the earliest H1 in the document
when there are multiple H1s. ExtractorApplicator surfaces Title on the
fast-path / applicator
branch too, so output stays consistent across novel and cached requests
(matters for the response-cache ETag). LlmInducerPrompts list Title in
the allowed-roles set with a one-line distinction from Heading.

MainContentOnly, RagFull, Wcxb, and AgentNavigation profiles all
include Title in their role-set. The renderer quality gate (drop short
text) is bypassed for Title and Heading so intentionally terse page
titles ("Home", "About") still surface.

Sitemap ExtractionProfile
-------------------------

New ExtractionProfile.Sitemap value emits only Title + Heading +
PrimaryNavigation + SecondaryNavigation + Breadcrumb. For sitemap /
outline / crawler use cases that want page titles and the site's nav
structure without pulling body content. The CLI's --profile flag
recognises `sitemap` automatically (enum binding).

Deterministic YAML persistence
------------------------------

New DeterministicTemplateYamlSink, wired automatically when
AddStyloExtractOperatorTemplates(root) is called, writes
<host>-deterministic.yaml alongside each heuristic-induced template's
SQLite row. The file carries every role the heuristic detected (Title,
MainContent, Navigation, Footer, …) — auditable, hand-editable, and
diffable, mirroring how LLM-induced templates have always been written
by TemplateEnrichmentCoordinator. The SQLite store remains the
authoritative source at match time; YAML is best-effort and
non-blocking.

stylo-extract sitemap CLI verb
------------------------------

New `sitemap` subcommand: takes one or more starting URLs, extracts
each with ExtractionProfile.Sitemap, follows internal nav links to
--max-depth (default 3), and emits a markdown tree of titles + URLs to
stdout or --out <file>. Safety caps: 50 pages by default
(--max-pages), 1s between requests (--delay-ms), no off-host follow.
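
The safety caps above can be illustrated with a bounded crawl loop.
This is a sketch, not the SitemapCommand implementation: the
breadth-first order, the SitemapCrawlSketch name, and the navLinks
callback (standing in for "fetch the page and extract its nav links")
are all assumptions, and the per-request delay is omitted for brevity.

```csharp
using System;
using System.Collections.Generic;

public static class SitemapCrawlSketch
{
    // Visits pages breadth-first from the seed, honouring a depth limit,
    // a page cap, and a same-host restriction.
    public static List<(Uri Url, int Depth)> Crawl(
        Uri seed,
        Func<Uri, IEnumerable<Uri>> navLinks,
        int maxDepth = 3,
        int maxPages = 50)
    {
        var visited = new HashSet<Uri> { seed };
        var queue = new Queue<(Uri Url, int Depth)>();
        queue.Enqueue((seed, 0));
        var pages = new List<(Uri Url, int Depth)>();

        while (queue.Count > 0 && pages.Count < maxPages)
        {
            var (url, depth) = queue.Dequeue();
            pages.Add((url, depth));

            // At the depth limit the page is emitted but its links are not followed.
            if (depth >= maxDepth) continue;

            foreach (var link in navLinks(url))
            {
                // No off-host follow.
                if (!string.Equals(link.Host, seed.Host, StringComparison.OrdinalIgnoreCase))
                    continue;
                if (visited.Add(link)) queue.Enqueue((link, depth + 1));
            }
        }
        return pages;
    }
}
```

With maxDepth set to 0 the loop emits only the seed, which matches the
--max-depth 0 behaviour described for the CLI.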

Migration
---------

No source change required for consumers. The new Title role is
additive: existing switches over BlockRole that have a default arm
continue to compile and behave identically, and switches that
exhaustively listed roles were updated. Deterministic YAML
writing only activates when AddStyloExtractOperatorTemplates is
called, so consumers that don't use operator templates see no new
filesystem activity.

StyloExtract 1.8.0-alpha.10 - 2026-06-26
=========================================

LLM classification accuracy for chrome patterns
------------------------------------------------

Symptom: induced templates were labelling language pickers, filter UI,
locale switchers, and pagination strips as MainContent on
server-rendered blogs (mostlylucid.net being the canonical reproducer).
The downstream RagFull renderer's role-filter — which already drops
PrimaryNavigation / SecondaryNavigation / Form / Boilerplate — never
saw them as nav and so left them in the extracted markdown, producing
output WORSE than the deterministic heuristic.

Fix: expanded the induction and repair system prompts with explicit
"chrome pattern → role" examples (language picker → PrimaryNavigation;
filter / fac

[truncated — see RELEASE_NOTES.txt packaged at root for full history]