Mostlylucid.StyloExtract.Streaming 1.8.0-alpha.23

This is a prerelease version of Mostlylucid.StyloExtract.Streaming.

There is a newer version of this package available.
See the version list below for details.

dotnet add package Mostlylucid.StyloExtract.Streaming --version 1.8.0-alpha.23

NuGet\Install-Package Mostlylucid.StyloExtract.Streaming -Version 1.8.0-alpha.23

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="Mostlylucid.StyloExtract.Streaming" Version="1.8.0-alpha.23" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

Directory.Packages.props:

<PackageVersion Include="Mostlylucid.StyloExtract.Streaming" Version="1.8.0-alpha.23" />

Project file:

<PackageReference Include="Mostlylucid.StyloExtract.Streaming" />

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.

paket add Mostlylucid.StyloExtract.Streaming --version 1.8.0-alpha.23

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: Mostlylucid.StyloExtract.Streaming, 1.8.0-alpha.23"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package Mostlylucid.StyloExtract.Streaming@1.8.0-alpha.23

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Install as a Cake Addin:

#addin nuget:?package=Mostlylucid.StyloExtract.Streaming&version=1.8.0-alpha.23&prerelease

Install as a Cake Tool:

#tool nuget:?package=Mostlylucid.StyloExtract.Streaming&version=1.8.0-alpha.23&prerelease

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

Mostlylucid.StyloExtract.Streaming

Zero-allocation, bounded-memory gateway fence scanner for the StyloExtract family. Rides alongside the byte stream of an HTTP response and emits a verdict — Captured / Bailout / NoTemplate / Continue — while the body is still in flight, without ever building a DOM or buffering the full page.

Designed for the gateway position: HTTP reverse proxies, CDN edges, ASP.NET output filters, and Stylobot's response pipeline. Use it to decide whether a response is worth feeding to the full LayoutExtractor extraction pipeline, before you commit to buffering it.

Memory contract

A sliding-window tokenizer holds ONLY the partial tag bytes that straddle a chunk boundary — typically <500 B, often zero. Each chunk is parsed inline; complete-tag bytes are dropped immediately, never copied into a holding buffer. The hard cap is 4 KiB (IncrementalHtmlTokenizer.MaxBufferSize); a single tag larger than that throws InvalidOperationException rather than silently dropping bytes. Measured peak: 0 B for 16 KB chunks over a 200 KB body; 19 B for 1 KB chunks. Pinned by the StreamingMemoryBoundTests regression suite.
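
The contract above can be illustrated with a minimal sketch. TailOnlyTagCounter is a hypothetical stand-in written for this note, not the package's IncrementalHtmlTokenizer: it only counts `<...>` runs and ignores comments, script/style bodies, and quoted `>` characters. What it does show is the retention rule: complete tags are handled inline from the caller's chunk, and only an unfinished tag tail is copied into a fixed 4 KiB buffer.

```csharp
using System;

// Hypothetical illustration only; not the package's IncrementalHtmlTokenizer.
public sealed class TailOnlyTagCounter
{
    public const int MaxBufferSize = 4 * 1024; // mirrors the documented 4 KiB cap

    private readonly byte[] _tail = new byte[MaxBufferSize];
    private int _tailLength;

    public int TagsSeen { get; private set; }
    public int PeakBufferedBytes { get; private set; }

    public void Feed(ReadOnlySpan<byte> chunk)
    {
        int i = 0;

        // 1. Finish a tag left open by the previous chunk, if there is one.
        if (_tailLength > 0)
        {
            int close = chunk.IndexOf((byte)'>');
            if (close < 0) { Retain(chunk); return; }   // still open: keep accumulating
            if (_tailLength + close + 1 > MaxBufferSize)
                throw new InvalidOperationException("Single tag exceeds MaxBufferSize.");
            TagsSeen++;          // stitched tag = retained tail + chunk[..(close + 1)]
            _tailLength = 0;     // tail bytes are dropped as soon as the tag completes
            i = close + 1;
        }

        // 2. Parse the rest of the chunk inline; nothing here is copied.
        while (i < chunk.Length)
        {
            int open = chunk.Slice(i).IndexOf((byte)'<');
            if (open < 0) return;                        // only text left
            open += i;
            int close = chunk.Slice(open).IndexOf((byte)'>');
            if (close < 0) { Retain(chunk.Slice(open)); return; } // tag straddles the boundary
            if (close + 1 > MaxBufferSize)
                throw new InvalidOperationException("Single tag exceeds MaxBufferSize.");
            TagsSeen++;
            i = open + close + 1;
        }
    }

    private void Retain(ReadOnlySpan<byte> bytes)
    {
        if (_tailLength + bytes.Length > MaxBufferSize)
            throw new InvalidOperationException("Single tag exceeds MaxBufferSize.");
        bytes.CopyTo(_tail.AsSpan(_tailLength));
        _tailLength += bytes.Length;
        PeakBufferedBytes = Math.Max(PeakBufferedBytes, _tailLength);
    }
}
```

In this sketch the high-water mark depends only on how much of a tag is cut off at a chunk boundary, which is why larger chunks tend to report a smaller peak.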

Wire-up

// Singletons — the scanner and store are thread-safe.
services.AddSingleton<IStreamingTemplateStore, InMemoryStreamingTemplateStore>();
// Or durable: new SqliteStreamingTemplateStore("streaming-templates.db")
services.AddSingleton<StreamingPathSelector>();
services.AddSingleton<StreamingTemplateInducer>();
services.AddSingleton<StreamingRefitOrchestrator>();

Hot path

var selector = sp.GetRequiredService<StreamingPathSelector>();
var inducer  = sp.GetRequiredService<StreamingTemplateInducer>();
var store    = sp.GetRequiredService<IStreamingTemplateStore>();

await selector.WarmByHostAsync(host);
var verdict = selector.ScanByHost(host, bodyBytes);

if (verdict == ScanVerdict.NoTemplate)
{
    // First visit to this host — induce a template heuristically.
    var induced = inducer.Induce(host, bodyBytes);
    if (induced is not null)
        await store.UpsertAsync(induced);
}

For chunked (streaming) inputs, use IncrementalFenceScanner.Create(template) and call Feed(chunk) per chunk. The verdict latches on the first terminal result.
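
The per-chunk loop can be sketched as follows. To stay self-contained this does not reference the package: Verdict is a local stand-in for ScanVerdict, and the `feed` delegate stands in for a call such as scanner.Feed(chunk). Treating every result other than Continue as terminal is an assumption made for the sketch.

```csharp
using System;
using System.Collections.Generic;

// Local stand-in for the package's ScanVerdict (assumption: same four values).
public enum Verdict { Continue, Captured, Bailout, NoTemplate }

public static class ChunkLoop
{
    // Feeds chunks until the first non-Continue verdict, then stops reading.
    // `feed` is a hypothetical stand-in for a per-chunk scanner call.
    public static (Verdict Verdict, int ChunksRead) Run(
        IEnumerable<byte[]> chunks, Func<byte[], Verdict> feed)
    {
        int read = 0;
        foreach (byte[] chunk in chunks)
        {
            read++;
            Verdict verdict = feed(chunk);
            if (verdict != Verdict.Continue)
                return (verdict, read);   // terminal: the caller can stop here
        }
        return (Verdict.Continue, read);  // body ended without a terminal verdict
    }
}
```

Stopping at the first terminal verdict is what lets a gateway decide before the rest of the body has arrived.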

Product	Compatible and additional computed target framework versions.
.NET	net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
1.8.0	0	6/27/2026
1.8.0-alpha.23	0	6/27/2026
1.8.0-alpha.22	0	6/27/2026
1.8.0-alpha.21	0	6/27/2026
1.8.0-alpha.20	38	6/27/2026
1.8.0-alpha.19	39	6/26/2026
1.8.0-alpha.18	53	6/26/2026
1.8.0-alpha.17	63	6/26/2026
1.8.0-alpha.16	61	6/26/2026

StyloExtract 1.8.0-alpha.21 - 2026-06-27
=========================================

Streaming: scope fixes (no algorithm replacement)
--------------------------------------------------

Tightens the alpha.19 streaming scanner without replacing the MinHash
matcher. The algorithm shape (MinHash + LSH bands + three fences per
template) is unchanged; what changes is its scope:

1. IncrementalHtmlTokenizer.Feed no longer copies the whole chunk into
  _buffer. Chunks are parsed inline; only the partial-tag tail (if a
  tag straddles a chunk boundary) is retained for stitching with the
  next chunk. PeakBufferedBytes is now bounded by O(longest tag), not
  O(chunk size). Measured: peak = 0 B for a 200 KB body in 16 KB
  chunks, 19 B in 1 KB chunks. MaxBufferSize lowered from 64 KiB to
  4 KiB.

2. RollingSketch shingles upgraded to Markov bigrams: each shingle is
  (prevTagHash, currentTagHash, currentClassHash). Order-sensitive:
  [A, B] and [B, A] now produce different signatures. The leftmost
  shingle in any window uses prevTag = 0 so sliding-window scanners
  match fences built from contiguous event sequences regardless of
  what came before the window.

3. Static StructuralTagAllowlist replaces per-fence TagAllowlistBloom.
  Only structural tags (html/body/header/nav/main/article/section/
  div/p/h1-h6/ul/ol/li/table/...) push into the sketch. meta/link/
  script-chrome/img/span/a bypass the recompute entirely. The
  TagAllowlistBloom JSON property is retained as a back-compat sink
  (read-and-discarded) so persisted templates from alpha.16-alpha.20
  round-trip cleanly.

4. Depth-aware capture-end: while in Capturing, ContentEnd only matches
  when DOM depth has returned to (or below) the depth at ContentStart.
  Nested matches mid-content can no longer terminate capture early.

5. Dead StreamingTemplate.MinContentDepth field removed (never read by
  any scanner).

6. FenceScanner and IncrementalFenceScanner now share a single static
  StreamingTick.Step. Both scanners build a StreamingTickState from
  their respective storage (span-backed vs heap-backed) and execute
  literally the same code. Cross-validation tests retained as insurance.

7. IStreamingTemplateStore gains version-chain APIs:
  - GetByHostAtVersionAsync(host, version) — retrieve a specific version.
  - ListVersionsByHostAsync(host) — enumerate all known versions.
  UpsertAsync now APPENDS per (host, version) rather than replacing.
  SQLite store schema migrated to PK (host, version); existing rows
  auto-migrate to version 1 on first open.
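
The bigram shingle shape from item 2 can be sketched as below. TagEvent, BigramShingles, and the FNV-1a-style mixing function are hypothetical choices for illustration; the notes do not say how RollingSketch combines the three hashes.

```csharp
using System;

// Hypothetical event shape: one structural tag with its class hash.
public readonly record struct TagEvent(uint TagHash, uint ClassHash);

public static class BigramShingles
{
    // One shingle per event: (prevTagHash, currentTagHash, currentClassHash).
    // The leftmost event in the window always uses prevTag = 0, so the result
    // depends only on the window contents, not on what preceded it.
    public static ulong[] ForWindow(ReadOnlySpan<TagEvent> window)
    {
        var shingles = new ulong[window.Length];
        uint prevTag = 0;
        for (int i = 0; i < window.Length; i++)
        {
            shingles[i] = Mix(prevTag, window[i].TagHash, window[i].ClassHash);
            prevTag = window[i].TagHash;
        }
        return shingles;
    }

    // FNV-1a-style combine (an assumption; any order-sensitive mix would do).
    private static ulong Mix(uint prevTag, uint tag, uint cls)
    {
        unchecked
        {
            const ulong Prime = 1099511628211UL;
            ulong h = 14695981039346656037UL;
            h = (h ^ prevTag) * Prime;
            h = (h ^ tag) * Prime;
            h = (h ^ cls) * Prime;
            return h;
        }
    }
}
```

Because the previous tag is folded into each shingle, the windows [A, B] and [B, A] yield different shingle sets, while a window sliced out of a longer stream yields the same shingles as the same events seen on their own.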

Migration notes:
- Persisted SQLite templates from alpha.16-alpha.20 auto-migrate to
version 1 on first open; existing rows are preserved.
- TemplateFence(uint[], ulong[], ulong, int) constructor removed; the
new shape is TemplateFence(uint[], ulong[], int). TagAllowlistBloom
is still readable as a property (returns 0).
- StreamingTemplate.MinContentDepth removed — drop from any code that
set it in `with` expressions.
- RollingSketch.Push signature changed to Push(prevTagHash, tagHash,
classHash) — direct users must track prev tag.
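
The append-per-(host, version) behaviour described in item 7 can be pictured with a small in-memory sketch. VersionChainStore and its synchronous methods are hypothetical and are not the IStreamingTemplateStore API; case-insensitive host matching and "latest means highest version" are assumptions.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory illustration; not the package's IStreamingTemplateStore.
public sealed class VersionChainStore<TTemplate>
{
    // host -> (version -> template), versions kept in ascending order.
    private readonly Dictionary<string, SortedDictionary<int, TTemplate>> _byHost =
        new(StringComparer.OrdinalIgnoreCase);

    // Appends a new version; only an identical (host, version) key is overwritten.
    public void Upsert(string host, int version, TTemplate template)
    {
        if (!_byHost.TryGetValue(host, out var chain))
            _byHost[host] = chain = new SortedDictionary<int, TTemplate>();
        chain[version] = template;
    }

    public bool TryGetAtVersion(string host, int version, out TTemplate? template)
    {
        template = default;
        return _byHost.TryGetValue(host, out var chain)
            && chain.TryGetValue(version, out template);
    }

    public bool TryGetLatest(string host, out TTemplate? template)
    {
        template = default;
        if (!_byHost.TryGetValue(host, out var chain) || chain.Count == 0)
            return false;
        template = chain.Last().Value;   // highest version number
        return true;
    }

    public IReadOnlyList<int> ListVersions(string host) =>
        _byHost.TryGetValue(host, out var chain) ? chain.Keys.ToList() : new List<int>();
}
```

Keying on the pair rather than on host alone is what keeps earlier versions retrievable after a refit.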

StyloExtract 1.8.0-alpha.19 - 2026-06-26
=========================================

Streaming: sliding-window design (no full-buffer retention)
------------------------------------------------------------

Refactors alpha.18's IncrementalHtmlTokenizer + IncrementalFenceScanner
to a TRUE sliding-window streaming design:

1. Bytes: only partial-tag bytes are retained. Once a tag is emitted,
  the bytes are dropped immediately (compact-on-emit, not compact-on-
  next-Feed). New PeakBufferedBytes property exposes the high-watermark
  for telemetry. Worst-case in-flight buffer is O(longest tag), not
  O(megabytes). MaxBufferSize lowered from 1 MiB to 64 KiB and
  repositioned as a hard safety stop that should never be hit under
  correct input — exceeding it now means a single tag (or unclosed
  script/style body) genuinely exceeds 64 KiB and the scan must bail.

2. Events: fixed-size sliding window of the last WindowSize tag events
  (unchanged from alpha.18). Push new, pop oldest. The window is the
  only event-level state.

3. RollingSketch: documented (in IncrementalFenceScanner XML doc) that
  MinHash with min-pooling is NOT reversibly rollable — once an element
  leaves the window, its contribution to min(...) can't be subtracted.
  The sketch therefore rebuilds the signature from the current event
  window after each accepted tag (O(WindowSize × SignatureSize) per
  tick, gated by the Bloom allowlist filter to skip the vast majority
  of inbound tags). The bounded-buffer property — the user's headline
  concern — is satisfied by the tokenizer; the sketch's per-tick recompute
  is the price MinHash charges for the LSH-band locality the matcher
  relies on. The event-level memory remains O(WindowSize) regardless.

4. IncrementalFenceScanner now exposes PeakBufferedBytes and BytesConsumed
  passthroughs from the tokenizer so callers can prove the bounded-memory
  property to telemetry without reaching into the tokenizer directly.
  The duplicated tick logic (mirroring FenceScanner.Tick over heap-backed
  sketch state) is retained — it's hard-pinned to the ref-struct path by
  the existing cross-validation tests, which give us higher confidence
  than refactoring to delegate would.
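
The rebuild-per-tick behaviour in item 3 can be sketched as below. WindowedMinHash, its seed scheme, and the SplitMix64-style scrambler are hypothetical choices for illustration, not the package's RollingSketch, and the allowlist gating is left out. The point is that a per-slot minimum cannot be "un-minned" when an element leaves, so the signature is recomputed from whatever is currently in the window.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; not the package's RollingSketch.
public sealed class WindowedMinHash
{
    private readonly Queue<ulong> _window = new();
    private readonly int _windowSize;
    private readonly ulong[] _seeds;

    public ulong[] Signature { get; }

    public WindowedMinHash(int windowSize, int signatureSize)
    {
        _windowSize = windowSize;
        _seeds = new ulong[signatureSize];
        Signature = new ulong[signatureSize];
        for (int k = 0; k < signatureSize; k++)
            _seeds[k] = Scramble((ulong)(k + 1));   // deterministic per-slot seeds
        Array.Fill(Signature, ulong.MaxValue);
    }

    // Push new, pop oldest, then rebuild: O(WindowSize x SignatureSize) per tick.
    public void Push(ulong shingle)
    {
        _window.Enqueue(shingle);
        if (_window.Count > _windowSize) _window.Dequeue();
        Rebuild();
    }

    private void Rebuild()
    {
        Array.Fill(Signature, ulong.MaxValue);
        foreach (ulong s in _window)
            for (int k = 0; k < Signature.Length; k++)
                Signature[k] = Math.Min(Signature[k], Scramble(s ^ _seeds[k]));
    }

    // SplitMix64-style finalizer, used here as a cheap 64-bit hash.
    private static ulong Scramble(ulong x)
    {
        unchecked
        {
            x ^= x >> 30; x *= 0xBF58476D1CE4E5B9UL;
            x ^= x >> 27; x *= 0x94D049BB133111EBUL;
            x ^= x >> 31;
            return x;
        }
    }
}
```

Event-level state stays at WindowSize shingles plus one signature, which matches the O(WindowSize) memory claim; the cost of the rebuild is paid in CPU, not memory.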

Memory-cap proof: tests/StreamingMemoryBoundTests.cs feeds 5 MiB of
synthetic HTML in 4 KiB chunks and asserts PeakBufferedBytes stays
under 16 KiB. The streaming gateway can now scan multi-megabyte
responses while holding bounded memory.

Migration: API is unchanged from alpha.18 — refactor is internal. The
new PeakBufferedBytes and BytesConsumed diagnostic properties on
IncrementalFenceScanner are additive. MaxBufferSize is still public but
the new value is 64 KiB (was 1 MiB); only relevant if you were catching
the InvalidOperationException for pathological input.

StyloExtract 1.8.0-alpha.18 - 2026-06-26
=========================================

Streaming: true chunked tokenization + refit/versioning + bench update
-----------------------------------------------------------------------

1. IncrementalHtmlTokenizer + IncrementalFenceScanner
  Stateful tokenizer that survives chunk boundaries. A partial tag at
  the end of one chunk is held in an internal buffer and completed when
  the next Feed call arrives. Pairs with IncrementalFenceScanner —
  callers Feed chunks as they arrive from the network, get a verdict
  per chunk, bail early on Captured / Bailout.

  Trade-off vs MinimalHtmlTokenizer's span path: one buffer allocation
  per request (not per chunk). Use the span path for whole-buffer
  scans, the incremental path for streaming gateways where bytes
  arrive in chunks. The internal buffer has a hard cap of 1 MiB: Feed
  throws InvalidOperationException on pathological input that never
  closes a tag, surfacing the failure rather than silently dropping
  bytes.

  Architectural note: FenceScanner stays a ref struct (zero-alloc hot
  path); IncrementalFenceScanner is a heap-backed class that ports the
  same tick logic. The two are kept in lockstep — any drift between
  them is a correctness bug surface and is covered by cross-validation
  tests that feed the same bytes both ways.

2. Streaming-template refit + versioning
  StreamingTemplate gains a Version field (defaults to 1; persists
  across alpha.17 templates without migration). New
  StreamingRefitOrchestrator observes captured-scan output per host
  and schedules a refit, off the hot path, when either:
    - capture-region EWMA drift exceeds 30% on N consecutive scans, OR
    - every 10th captured scan re-induces and finds different fences
  On refit: version bumps, store is upserted, the new
  IStreamingTemplateVersionSink fires a StreamingTemplateRefitEvent
  (Host, Old/New TemplateId, Old/New Version, Reason, DetectedAt).
  Default sink is a no-op; consumers wire UI telemetry to it.

3. Bench update
  ExtractionComparisonBench gains a New_StreamingScanByHost variant so
  the host-keyed hot-path is benchmarked alongside the original
  GUID-keyed scan. Pre-populates the in-memory store with the
  host="www.mostlylucid.net" template that lucidview FULL hits in
  production.
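
The first refit trigger in item 2 (EWMA drift above 30% on N consecutive scans) can be sketched as follows. DriftDetector is hypothetical and not the StreamingRefitOrchestrator implementation. The notes do not say what is measured, how the EWMA is smoothed, or what N is, so capture length as the metric, alpha = 0.2, and N = 3 are all assumptions.

```csharp
using System;

// Hypothetical illustration; not the package's StreamingRefitOrchestrator.
public sealed class DriftDetector
{
    private readonly double _alpha;
    private readonly double _threshold;
    private readonly int _requiredConsecutive;
    private double _ewma;
    private bool _seeded;
    private int _streak;

    // alpha and requiredConsecutive are assumed defaults; 0.30 comes from the notes.
    public DriftDetector(double alpha = 0.2, double threshold = 0.30, int requiredConsecutive = 3)
    {
        _alpha = alpha;
        _threshold = threshold;
        _requiredConsecutive = requiredConsecutive;
    }

    // Returns true when a refit should be scheduled for this host.
    public bool Observe(double captureLength)
    {
        if (!_seeded)
        {
            _ewma = captureLength;   // first scan seeds the average
            _seeded = true;
            return false;
        }

        double drift = _ewma == 0
            ? (captureLength == 0 ? 0 : double.PositiveInfinity)
            : Math.Abs(captureLength - _ewma) / _ewma;

        _streak = drift > _threshold ? _streak + 1 : 0;   // any in-range scan resets the run
        _ewma = _alpha * captureLength + (1 - _alpha) * _ewma;

        if (_streak < _requiredConsecutive) return false;
        _streak = 0;
        return true;
    }
}
```

Requiring a run of consecutive out-of-range scans keeps a single unusual page from triggering a refit.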

Migration: additive APIs. Alpha.17 consumers using ScanByHost continue
to work; the incremental tokenizer and the refit orchestrator are
opt-in (use them when feeding chunks / when wiring drift telemetry).

StyloExtract 1.8.0-alpha.17 - 2026-06-26
=========================================

Streaming: host-keyed templates + naive auto-induction
-------------------------------------------------------

Three changes to close the alpha.16 streaming integration loop:

1. Host-keyed lookup
  IStreamingTemplateStore gains GetByHostAsync / TryGetHotByHost /
  UpsertAsync. StreamingTemplate gains a Host field (required). One
  template per host (latest wins). The existing GUID-keyed methods
  remain — Host is the lookup key for consumers; TemplateId stays for
  stable identity / versioning.

2. StreamingPathSelector.ScanByHost(host, bytes)
  Synchronous hot-cache-only host scan. Returns NoTemplate on miss
  so the caller can WarmByHostAsync + retry or induce.
  WarmByHostAsync brings a host's template into the hot cache via
  the durable tier.

3. StreamingTemplateInducer
  Naive first-pass inducer: walks HTML via MinimalHtmlTokenizer,
  finds semantic-marker tag-sequence-pairs (<header>...</header>,
  <p>...</p>...<p>...</p>, <footer>/</main>/</body>) and produces a
  StreamingTemplate ready to upsert. Returns null on pages with no
  identifiable structural fences (plain text, image-only, etc.).
  Describe() returns a human-readable summary of the chosen markers
  for logging.

Storage migrations:
- InMemoryStreamingTemplateStore: adds an in-memory host index.
- SqliteStreamingTemplateStore: adds a 'host' TEXT column + index;
on-open ALTER TABLE migration handles pre-alpha.17 schemas
(existing rows get Host="" — reachable only by GUID).

Migration: additive APIs; alpha.16 consumers using only the
GUID-keyed surface continue to work unchanged. The new Host field
on StreamingTemplate IS required — existing construction sites must
set Host="" if they have no host context.

StyloExtract 1.8.0-alpha.16 - 2026-06-26
=========================================

Mostlylucid.StyloExtract.Streaming — zero-allocation byte-stream scanner
------------------------------------------------------------------------

New package on NuGet. Hot-path streaming fence scanner: skips page chrome
and captures the content region as response bytes flow past, using
MinHash-derived structural fences. Zero per-request GC-tracked
allocations in steady state.

Designed for the gateway position — drop into a response pipeline
(HttpClient, Stylobot's edge, ASP.NET output filters) alongside the byte
stream and emit a verdict without buffering the full page.

Public hot-path API:
StreamingPathSelector.Scan(Guid templateId, ReadOnlySpan<byte> html)
   → ScanVerdict { Continue | Captured | Bailout | NoTemplate }

// Warm a template into the hot cache:
await selector.WarmAsync(templateId);

Storage:
- InMemoryStreamingTemplateStore — single-process LRU.
- SqliteStreamingTemplateStore — durable; same SQLite file pattern as
   the existing ITemplateIndex but a separate table.

Pairs with the existing StyloExtract.Fingerprint learn path and
ITemplateIndex template store. The streaming template format is its own
shape (TemplateFence with MinHash bloom, content-start/content-end
fences) — not an LLM template or operator template.

Bench results vs LayoutExtractor on mostlylucid fixtures: see
bench/StyloExtract.Streaming.Benchmarks/ (zero-alloc scan competitive
with the full extractor's path-match cost while never building a DOM).

Migration: additive package; consumers add a PackageReference to
Mostlylucid.StyloExtract.Streaming if they want gateway-position
scanning.

StyloExtract 1.8.0-alpha.15 - 2026-06-26
=========================================

RenderOptions.WaitUntil — opt out of NetworkIdle for SPA routing
-----------------------------------------------------------------

PlaywrightHtmlFetcher previously hardcoded WaitUntilState.NetworkIdle
for the primary GotoAsync. On sites with aggressive client-side
routing (BBC News auto-navigates /news → /articles/<id> in the
post-load JS phase), this means the fetcher returns the post-routing
DOM, not the page the user requested.

RenderOptions now exposes a WaitUntil property (PlaywrightWaitUntil
enum: Load / DOMContentLoaded / NetworkIdle / Commit). Default stays
NetworkIdle for backwards compatibility. Consumers fetching SPA-heavy
sites should set Load to capture the initial DOM before the router
fires.

The secondary WaitForLoadStateAsync(NetworkIdle, ...) drain remains —
it's independently bounded by WaitForNetworkIdleTimeout and serves as
a best-effort late-XHR catch-up; safe even with the primary returning
on Load.

PlaywrightWaitUntil is a small standalone enum (rather than
Microsoft.Playwright.WaitUntilState used directly), so consumers don't
take a transitive dependency on Microsoft.Playwright just to pick a
strategy.

StyloExtract 1.8.0-alpha.14 - 2026-06-26
=========================================

Sitemap CLI end-to-end regression + LLM nav few-shot
-----------------------------------------------------

1. Sitemap CLI test suite

The alpha.11 stylo-extract sitemap verb has been working on real sites
since alpha.13 (heuristic nav-classification tightening), but nothing
caught regressions. Added 5 end-to-end tests in
StyloExtract.Core.Tests/SitemapCommandTests.cs that invoke the
SitemapCommand.CrawlAsync handler against the mostlylucid-home.html.gz
fixture (real captured homepage, shared with the heuristics suite) plus
a stub HttpMessageHandler and assert: real nav links emitted under
# www.mostlylucid.net, --max-depth 0 emits only the seed Title row,
off-host links are not followed, --max-pages cap honoured exactly, and
--delay-ms enforced with a stopwatch floor. No network access required.

2. LLM induction prompt — nav-classification few-shot

LlmInducerPrompts.System and SystemRepair now include a second worked
example: a blog homepage with header <nav>, breadcrumb,
MainContent + RepeatedItem post cards, and footer <nav>. Mirrors the
patterns the alpha.13 NavPreDetector heuristic correctly classifies.
Rule 6 (RepeatedItem usage) tightened with explicit guidance that
header/footer nav lists are PrimaryNavigation / SecondaryNavigation at
the parent <ul>/<nav> level, NOT RepeatedItem at the <li> level —
closes a known LLM confusion mode.

Tests: snapshot tests in StyloExtract.Core.Tests/LlmInducerPromptsTests.cs
verify the prompt extensions land verbatim so future prompt edits don't
accidentally regress.

StyloExtract 1.8.0-alpha.13 - 2026-06-26
=========================================

Heuristic nav-classification tightening
----------------------------------------

HeuristicBlockClassifier was under-classifying real-world nav patterns
on server-rendered sites — header <nav> strips, header <ul>-of-links,
breadcrumb lists, role="navigation" attributes, footer nav — all landed
as Boilerplate (or weren't extracted at all). Result: the alpha.11
Sitemap profile and stylo-extract sitemap CLI verb produced a one-line
tree even on sites with rich nav, because the classifier didn't surface
PrimaryNavigation / SecondaryNavigation / Breadcrumb roles for them.

Tightened patterns now produce definite role classifications:
1. <header> <nav> -> PrimaryNavigation (0.9)
2. Top-of-document <nav> -> PrimaryNavigation (0.85)
3. <footer> <nav> -> SecondaryNavigation (0.9)
4. <nav aria-label="breadcrumb"> / class~="breadcrumb" -> Breadcrumb (0.95)
5. <* role="navigation"> -> PrimaryNavigation (0.95)
6. Header <ul> of mostly-link <li>s -> PrimaryNavigation (0.85) at
    the <ul> level, suppress descent (was emitting deep Boilerplate)
7. Footer <ul> of mostly-link <li>s -> SecondaryNavigation (0.85)
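
Read as a rule table, the seven patterns above might look like the sketch below. NavCandidate and NavRules are hypothetical and much simpler than NavPreDetector. The notes do not state which rule wins when several match, so taking the highest score (first listed on ties) is an assumption, as is applying the class~="breadcrumb" check to any element.

```csharp
#nullable enable
using System;

public enum NavRole { None, PrimaryNavigation, SecondaryNavigation, Breadcrumb }

// Hypothetical, flattened view of an element; Landmark is "header", "footer", or null.
public readonly record struct NavCandidate(
    string Tag, string? Landmark, string? RoleAttr, string? AriaLabel, string? ClassAttr,
    bool TopOfDocument = false, bool MostlyLinkItems = false);

public static class NavRules
{
    public static (NavRole Role, double Score) Classify(NavCandidate c)
    {
        var best = (Role: NavRole.None, Score: 0.0);

        // Assumed precedence: keep the highest-scoring matching rule.
        void Consider(bool matches, NavRole role, double score)
        {
            if (matches && score > best.Score) best = (role, score);
        }

        bool isNav = c.Tag == "nav";
        bool isUl = c.Tag == "ul";
        bool breadcrumb = (isNav && c.AriaLabel == "breadcrumb")
                          || HasClassWord(c.ClassAttr, "breadcrumb");

        Consider(isNav && c.Landmark == "header", NavRole.PrimaryNavigation, 0.9);    // rule 1
        Consider(isNav && c.TopOfDocument, NavRole.PrimaryNavigation, 0.85);          // rule 2
        Consider(isNav && c.Landmark == "footer", NavRole.SecondaryNavigation, 0.9);  // rule 3
        Consider(breadcrumb, NavRole.Breadcrumb, 0.95);                               // rule 4
        Consider(c.RoleAttr == "navigation", NavRole.PrimaryNavigation, 0.95);        // rule 5
        Consider(isUl && c.Landmark == "header" && c.MostlyLinkItems,
                 NavRole.PrimaryNavigation, 0.85);                                    // rule 6
        Consider(isUl && c.Landmark == "footer" && c.MostlyLinkItems,
                 NavRole.SecondaryNavigation, 0.85);                                  // rule 7
        return best;
    }

    // CSS ~= semantics: match one whitespace-separated word of the class attribute.
    private static bool HasClassWord(string? classAttr, string word) =>
        classAttr is not null &&
        Array.IndexOf(classAttr.Split(' ', StringSplitOptions.RemoveEmptyEntries), word) >= 0;
}
```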

Implementation: a new NavPreDetector runs after per-element classification
and injects each detected nav container as a high-score (50000) candidate
at the parent level, then demotes any descendant candidates so greedy
selection picks the nav parent and stops descending into its noise.
Containers nested inside <main>/<article> are skipped — IntraBlockCleaner
already strips them as intra-block contaminants; hoisting would steal
the article's selection win.

Regression fixtures captured from mostlylucid.net + wikipedia.org under
tests/StyloExtract.Heuristics.Tests/Fixtures so the next time a
classifier change regresses real-world nav detection, the bench catches
it before it ships.

Downstream impact: the Sitemap ExtractionProfile and stylo-extract
sitemap CLI verb now produce real nav trees on these sites - see the
lucidview FULL dogfood smoke for evidence.

StyloExtract 1.8.0-alpha.12 - 2026-06-26
=========================================

DI wire-up fix for deterministic-template YAML persistence
-----------------------------------------------------------

alpha.11 introduced DeterministicTemplateYamlSink + the
AddStyloExtractOperatorTemplates registration, but AddStyloExtract's
LayoutExtractor construction did not pass the sink through to the
extractor — so even when the sink was registered in DI, LayoutExtractor's
optional ctor parameter defaulted to null and no `<host>-deterministic.yaml`
file was ever written.

Fixed by threading `sp.GetService<DeterministicTemplateYamlSink>()` to the
LayoutExtractor constructor in AddStyloExtract. No API change; consumers
who already called AddStyloExtractOperatorTemplates start seeing
deterministic YAML files immediately after upgrading.

StyloExtract 1.8.0-alpha.11 - 2026-06-26
=========================================

Sequenced architecture extension: deterministic templates with
extended classification — Title role, Sitemap profile, deterministic
YAML persistence, and a sitemap CLI verb.

Title BlockRole
---------------

New BlockRole.Title value distinguishes the page-level <h1> (the single
H1 the rest of the page is "about") from intra-content Heading
(H2/H3/H4 inside the body). HeuristicBlockClassifier surfaces the Title
via a shared PageTitleDetector helper, which picks the H1 inside or
closest to <main>/<article> and, when multiple H1s are present, falls
back to the earliest one in the document. ExtractorApplicator surfaces
Title on the fast-path / applicator
branch too, so output stays consistent across novel and cached requests
(matters for the response-cache ETag). LlmInducerPrompts list Title in
the allowed-roles set with a one-line distinction from Heading.

MainContentOnly, RagFull, Wcxb, and AgentNavigation profiles all
include Title in their role-set. The renderer quality gate (drop short
text) bypasses for Title and Heading so intentionally-terse page
titles ("Home", "About") still surface.

Sitemap ExtractionProfile
-------------------------

New ExtractionProfile.Sitemap value emits only Title + Heading +
PrimaryNavigation + SecondaryNavigation + Breadcrumb. For sitemap /
outline / crawler use cases that want page titles and the site's nav
structure without pulling body content. The CLI's --profile flag
recognises `sitemap` automatically (enum binding).

Deterministic YAML persistence
------------------------------

New DeterministicTemplateYamlSink, wired automatically when
AddStyloExtractOperatorTemplates(root) is called, writes
<host>-deterministic.yaml alongside each heuristic-induced template's
SQLite row. The file carries every role the heuristic detected (Title,
MainContent, Navigation, Footer, …) — auditable, hand-editable, and
diffable, mirroring how LLM-induced templates have always been written
by TemplateEnrichmentCoordinator. The SQLite store remains the
authoritative source at match time; YAML is best-effort and
non-blocking.

stylo-extract sitemap CLI verb
------------------------------

New `sitemap` subcommand: takes one or more starting URLs, extracts
each with ExtractionProfile.Sitemap, follows internal nav links to
--max-depth (default 3), and emits a markdown tree of titles + URLs to
stdout or --out <file>. Safety caps: 50 pages by default
(--max-pages), 1s between requests (--delay-ms), no off-host follow.

Migration
---------

No source change required for consumers. The new Title role is
additive: existing switches over BlockRole that rely on a default arm
continue to compile and behave identically, and switches that
exhaustively listed roles were updated. Deterministic YAML
writing only activates when AddStyloExtractOperatorTemplates is
called, so consumers that don't use operator templates see no new
filesystem activity.

StyloExtract 1.8.0-alpha.10 - 2026-06-26
=========================================

LLM classification accuracy for chrome patterns
------------------------------------------------

Symptom: induced templates were labelling language pickers, filter UI,
locale switchers, and pagination strips as MainContent on
server-rendered blogs (mostlylucid.net being the canonical reproducer).
The downstream RagFull renderer's role-filter — which already drops
PrimaryNavigation / SecondaryNavigation / Form / Boilerplate — never
saw them as nav and so left them in the extracted markdown, producing
output WORSE than the deterministic heuristic.

Fix: expanded the induction and repair system prompts with explicit
"chrome pattern → role" examples (language picker → PrimaryNavigation;
filter / faceted-search → Form; pagination → SecondaryNavigation;
cookie banner → CookieBanner; newsletter signup → Form; social-share
→ Boilerplate). Also nudged the model to prefer narrower MainContent
selectors that don't include chrome as nested children.

DomSkeletonRenderer now surfaces structural ARIA attributes (`role`,
`aria-label`, `aria-labelledby`) alongside each element's tag / class /
id, giving the LLM more signal for distinguishing landmark regions
(nav / form / banner) from content. The hash-class-name filter is also
slightly less aggressive: pure PascalCase ids (e.g. `LanguageDropDown`)
now survive into the skeleton so the LLM can use them as selectors,
while real CSS-module hashes (mixed-case + digits, or 4+ case
transitions) are still dropped.

The renderer side (TypedMarkdownRenderer.ShouldEmit) is unchanged —
it was already correctly filtering by role. The fix is entirely about
label accuracy and the signal the LLM sees.

Migration: no source change required for consumers; templates induced
post-1.8.0-alpha.10 will produce cleaner output under
ExtractionProfile.RagFull and MainContentOnly. Cached templates
induced under earlier alphas will keep producing the old output until
they're refit (centroid drift triggers refit automatically; or
operators can manually clear the template store).

Regression tests: tests/StyloExtract.Core.Tests adds
LlmInducerPromptAntiPatternTests (prompt snapshot) and
MostlylucidLlmInductionRegressionTests (applies a synthetic bad-wide-
wrapper template against a captured mostlylucid.net fixture, proves
the language-picker / filter chrome leaks; then shows a properly
authored RepeatedItem template excludes per-card chrome cleanly).

StyloExtract 1.8.0-alpha.9 - 2026-06-25
========================================

App-safe AddStyloExtract + LlmInductionFired flag
--------------------------------------------------

Two changes that downstream desktop / CLI consumers (e.g. lucidVIEW-FULL)
need:

1. The basic `AddStyloExtract(IServiceCollection, Action<StyloExtractOptions>?)`
  DI extension and its companion `StyloExtractOptions` type now live in
  `Mostlylucid.StyloExtract.Core` instead of `Mostlylucid.StyloExtract.AspNetCore`.
  Non-AspNetCore hosts (desktop apps, CLI tools, console workers) can call:

      services.AddStyloExtract(o => o.StorePath = "templates.db");

  without pulling `Microsoft.AspNetCore.App` (~70 MB of framework runtime).

  `Mostlylucid.StyloExtract.AspNetCore` keeps its `Action<ResponsePolicyBuilder>`
  overloads (response-policy framework, markdown content negotiation
  middleware, operator-template minimal-API endpoints) — those legitimately
  need AspNetCore. They now delegate to the Core overload internally.

  Migration: no source change for AspNetCore consumers. Desktop / CLI
  consumers can reference `Mostlylucid.StyloExtract.Core` alone.

2. `ExtractionResult.LlmInductionFired` (new bool property) signals
  whether the LLM template inducer ran during this extraction. Downstream
  telemetry surfaces (e.g. status bars, NDJSON exports) can now show LLM
  utilisation per call without reflection or polling internal state.
  Defaults to false for non-LLM hosts and heuristic-only extractions;
  set true only when the LlmTemplateInducer (or any future ILlmTextProvider-
  backed inducer) actually invoked the LLM.

StyloExtract 1.8.0-alpha.6 - 2026-06-25
========================================

App-safe AddStyloExtract — moved to StyloExtract.Core
------------------------------------------------------

The basic `AddStyloExtract(IServiceCollection, Action<StyloExtractOptions>?)`
DI extension and its companion `StyloExtractOptions` type have moved from
`Mostlylucid.StyloExtract.AspNetCore` to `Mostlylucid.StyloExtract.Core`.
Desktop, CLI, and any non-AspNetCore host can now call:

   services.AddStyloExtract(o => o.StorePath = "templates.db");

without pulling `Microsoft.AspNetCore.App` (~70 MB of framework runtime).

`Mostlylucid.StyloExtract.AspNetCore` keeps its `Action<ResponsePolicyBuilder>`
overloads (response-policy framework, markdown content negotiation
middleware, operator-template minimal-API endpoints) — those legitimately
need AspNetCore. They now delegate to the Core overload internally.

Migration: no source change required for AspNetCore consumers. Desktop /
CLI consumers can drop direct dependencies on `Mostlylucid.StyloExtract.AspNetCore`
and reference `Mostlylucid.StyloExtract.Core` alone.

StyloExtract 1.8.0-alpha.5 - 2026-06-25
========================================

In-process CPU LLM backend (LLamaSharp) + 13-model bench harness.
Operators can now embed a single ~2-3 GB GGUF model in the host
process — no Ollama server, no separate LLM daemon. Same
ILlmTextProvider contract as the Ollama backend, so the
LlmTemplateInducer + production enrichment coordinator + CLI
`template train` all work unchanged.

What's new since 1.8.0-alpha.4
------------------------------

Mostlylucid.StyloExtract.Llm.LlamaSharp

   New package. ILlmTextProvider implementation backed by LLamaSharp
   0.27 (the .NET binding for llama.cpp). Loads a GGUF model from
   disk; the executor reads the model's chat template from GGUF
   metadata so prompts written for Ollama work unchanged.

   Wire-up:

       services.AddStyloExtract(...);
       services.AddStyloExtractLlamaSharp(o =>
       {
           o.ModelPath = "/var/models/Phi-4-mini-instruct-Q4_K_M.gguf";
           o.ContextSize = 8192;
           o.GpuLayerCount = 0;        // pure CPU target
       });
       services.AddStyloExtractLlmInducer("config/templates");

   Anti-prompt set covers Qwen, Phi, Llama 3+, and Gemma 4 stop
   tokens so the generator halts at the model's natural turn boundary
   instead of echoing the chat template structure.

   Known LLamaSharp 0.27 issue documented in the package README:
   Gemma 4 E2B / E4B's chat template metadata isn't applied cleanly
   by StatelessExecutor — the model emits Jinja2 template source
   instead of YAML. Phi-4-mini, Qwen 2.5 Coder, Llama 3.2 work fine.

Model benchmark harness

   New tests/StyloExtract.Llm.Benchmark project — runs the
   cross-product of (models × pages) for template induction and
   reports F1 / train-time / markdown-size matrices. Reuses WCXB
   ground-truth shape (one HTML.gz per page id, one ground-truth
   JSON) and the operator-template store path.

   Model spec routing: `llamasharp:/path/to/file.gguf` resolves via
   the in-process backend; anything else hits Ollama. Lets one
   bench compare server (Ollama) and embedded (LlamaSharp) backends
   side-by-side with identical fixtures.

Recommended models (empirically validated)

   For Ollama backend:
     * qwen3.5:4b           — 3 GB, ~26 s, F1 0.805 (default, best)
     * qwen2.5-coder:3b     — 2 GB, ~21 s, F1 0.767 (smaller-and-faster pick;
                                                      code-trained matters for
                                                      CSS selectors)
     * qwen3.5:0.8b         — 1 GB, ~5 s, F1 0.528 (tiny floor)

   For LLamaSharp backend (use bartowski quants):
     * Phi-4-mini-instruct Q4_K_M    — 2.5 GB, verified working
     * Qwen 3.5 4B Q4_K_M            — 3 GB, verified working
     * Qwen 2.5 Coder 3B Q4_K_M      — 2 GB, verified working

OllamaTextProviderOptions default model bumped

   Default tag was gemma4:e4b-it-qat; switched to qwen3.5:4b per the
   bench. The doc-comment now lists the smaller-and-faster pick and
   the model families to avoid (thinking-mode budget burn).

Tests

   494 across 11 projects. New StyloExtract.Llm.LlamaSharp.Tests
   project covers ctor validation, missing-file behaviour, and
   SkippableFact live-GGUF integration (skipped without
   STYLOEXTRACT_LLAMASHARP_MODEL env var pointing at a GGUF file).

StyloExtract 1.8.0-alpha.4 - 2026-06-25
========================================

Tiny patch alpha to fix two consumer-facing bugs found while smoke-
installing alpha.3 against NuGet.

What's new since 1.8.0-alpha.3
------------------------------

SQLite chain CVE patched (GHSA-2m69-gcr7-jv3q)

   Microsoft.Data.Sqlite bumped 10.0.1 -> 10.0.9; StyloExtract.Templates
   gains a direct PackageReference to SQLitePCLRaw.bundle_e_sqlite3 so
   the existing 3.0.3 central pin lifts the resolved bundle off the
   vulnerable 2.1.11 line and onto SourceGear.sqlite3 3.50.4.5.
   `dotnet list package --vulnerable` on consumer projects now
   returns clean.

Playw

[truncated — see RELEASE_NOTES.txt packaged at root for full history]