Mostlylucid.StyloExtract.Streaming
1.8.0-alpha.23
dotnet add package Mostlylucid.StyloExtract.Streaming --version 1.8.0-alpha.23
NuGet\Install-Package Mostlylucid.StyloExtract.Streaming -Version 1.8.0-alpha.23
<PackageReference Include="Mostlylucid.StyloExtract.Streaming" Version="1.8.0-alpha.23" />
<PackageVersion Include="Mostlylucid.StyloExtract.Streaming" Version="1.8.0-alpha.23" />
<PackageReference Include="Mostlylucid.StyloExtract.Streaming" />
paket add Mostlylucid.StyloExtract.Streaming --version 1.8.0-alpha.23
#r "nuget: Mostlylucid.StyloExtract.Streaming, 1.8.0-alpha.23"
#:package Mostlylucid.StyloExtract.Streaming@1.8.0-alpha.23
#addin nuget:?package=Mostlylucid.StyloExtract.Streaming&version=1.8.0-alpha.23&prerelease
#tool nuget:?package=Mostlylucid.StyloExtract.Streaming&version=1.8.0-alpha.23&prerelease
Mostlylucid.StyloExtract.Streaming
Zero-allocation, bounded-memory gateway fence scanner for the
StyloExtract family. Rides alongside the byte stream of an HTTP response
and emits a verdict — Captured / Bailout / NoTemplate / Continue
— while the body is still in flight, without ever building a DOM or
buffering the full page.
Designed for the gateway position: HTTP reverse proxies, CDN edges,
ASP.NET output filters, and Stylobot's response pipeline. Use it to
decide whether a response is worth feeding to the full
LayoutExtractor extraction pipeline, before you commit to buffering
it.
Memory contract
A sliding-window tokenizer holds ONLY the partial tag bytes that
straddle a chunk boundary — typically <500 B, often zero. Each chunk
is parsed inline; complete-tag bytes are dropped immediately, never
copied into a holding buffer. The hard cap is 4 KiB
(IncrementalHtmlTokenizer.MaxBufferSize); a single tag larger than
that throws InvalidOperationException rather than silently dropping
bytes. Measured peak: 0 B for 16 KB chunks over a 200 KB body; 19 B for
1 KB chunks. Pinned by the StreamingMemoryBoundTests regression suite.
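The memory contract above can be illustrated with a minimal sketch. This is not the package's IncrementalHtmlTokenizer: the class name `TailOnlyTagScanner`, its members, and the naive `<`...`>` tag detection are assumptions made for illustration (comments, script bodies, and quoted `>` characters are ignored). It shows only the retention rule: complete tags are handled inline, and only the bytes of a tag that straddles a chunk boundary are kept.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for illustration; NOT the package's tokenizer.
public sealed class TailOnlyTagScanner
{
    // Mirrors the 4 KiB hard cap described above.
    public const int MaxTailBytes = 4096;

    private readonly byte[] _tail = new byte[MaxTailBytes];
    private int _tailLength;

    public int PeakTailBytes { get; private set; }
    public List<string> Tags { get; } = new();

    public void Feed(ReadOnlySpan<byte> chunk)
    {
        int i = 0;

        // 1. Finish a tag that the previous chunk left open.
        if (_tailLength > 0)
        {
            int close = chunk.IndexOf((byte)'>');
            int take = close < 0 ? chunk.Length : close + 1;
            Append(chunk.Slice(0, take));
            if (close < 0) return;              // still open; wait for more bytes
            Tags.Add(Encoding.ASCII.GetString(_tail, 0, _tailLength));
            _tailLength = 0;                    // drop the stitched bytes immediately
            i = take;
        }

        // 2. Parse complete tags inline, without copying them anywhere.
        while (i < chunk.Length)
        {
            int open = chunk.Slice(i).IndexOf((byte)'<');
            if (open < 0) return;               // only text remains; nothing is retained
            int start = i + open;
            int close = chunk.Slice(start).IndexOf((byte)'>');
            if (close < 0)
            {
                Append(chunk.Slice(start));     // partial tag: keep just this tail
                return;
            }
            Tags.Add(Encoding.ASCII.GetString(chunk.Slice(start, close + 1)));
            i = start + close + 1;
        }
    }

    private void Append(ReadOnlySpan<byte> bytes)
    {
        if (_tailLength + bytes.Length > MaxTailBytes)
            throw new InvalidOperationException("A single tag exceeds the tail buffer cap.");
        bytes.CopyTo(_tail.AsSpan(_tailLength));
        _tailLength += bytes.Length;
        PeakTailBytes = Math.Max(PeakTailBytes, _tailLength);
    }
}
```

Feeding `<div cla` and then `ss="a">hi</div>` yields two tags, and the retained tail never grows past the length of the one tag that straddled the boundary.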
Wire-up
// Singletons — the scanner and store are thread-safe.
services.AddSingleton<IStreamingTemplateStore, InMemoryStreamingTemplateStore>();
// Or durable: new SqliteStreamingTemplateStore("streaming-templates.db")
services.AddSingleton<StreamingPathSelector>();
services.AddSingleton<StreamingTemplateInducer>();
services.AddSingleton<StreamingRefitOrchestrator>();
Hot path
var selector = sp.GetRequiredService<StreamingPathSelector>();
var inducer = sp.GetRequiredService<StreamingTemplateInducer>();
var store = sp.GetRequiredService<IStreamingTemplateStore>();
await selector.WarmByHostAsync(host);
var verdict = selector.ScanByHost(host, bodyBytes);
if (verdict == ScanVerdict.NoTemplate)
{
    // First visit to this host: induce a template heuristically.
    var induced = inducer.Induce(host, bodyBytes);
    if (induced is not null)
        await store.UpsertAsync(induced);
}
For chunked (streaming) inputs, use IncrementalFenceScanner.Create(template)
and call Feed(chunk) per chunk. The verdict latches on the first
terminal result.
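The latching behaviour can be sketched independently of the package. The code below is an assumption-laden illustration, not the IncrementalFenceScanner API: `Verdict` is a local enum that mirrors the verdict names used in this README, and `LatchingFeed` wraps any per-chunk scan function supplied by the caller. It shows the caller-side pattern: feed chunks until the first non-Continue verdict, then stop scanning.

```csharp
using System;
using System.Collections.Generic;

// Local stand-in mirroring the verdict names in this README; not the package's ScanVerdict.
public enum Verdict { Continue, Captured, Bailout, NoTemplate }

// Hypothetical wrapper: latches the first terminal verdict returned by a per-chunk scan.
public sealed class LatchingFeed
{
    private readonly Func<ReadOnlyMemory<byte>, Verdict> _scanChunk;

    public LatchingFeed(Func<ReadOnlyMemory<byte>, Verdict> scanChunk) => _scanChunk = scanChunk;

    public Verdict Current { get; private set; } = Verdict.Continue;
    public int ChunksScanned { get; private set; }

    public Verdict Feed(ReadOnlyMemory<byte> chunk)
    {
        // Once a terminal verdict is reached, later chunks are not scanned.
        if (Current != Verdict.Continue) return Current;
        ChunksScanned++;
        Current = _scanChunk(chunk);
        return Current;
    }

    // Feeds chunks in order and stops at the first terminal verdict.
    public static Verdict Drain(IEnumerable<ReadOnlyMemory<byte>> chunks, LatchingFeed scanner)
    {
        foreach (var chunk in chunks)
        {
            if (scanner.Feed(chunk) != Verdict.Continue) break;
        }
        return scanner.Current;
    }
}
```

In a gateway, the early exit is where the saving comes from: once the verdict is terminal, the remaining body can be passed through without further scanning.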
See also
- Full guide: docs/streaming.md covers the auto-induction lifecycle, bounded-memory proof, refit / versioning, and a comparison table for streaming vs LayoutExtractor.
- Top-level README: README.md
- Pairs with Mostlylucid.StyloExtract.Fingerprint (layout learning) and the existing ITemplateIndex template store; the streaming template format is its own shape (StreamingTemplate with TemplateFence MinHash sketches), not an LLM template or operator template.
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net10.0
  - Microsoft.Data.Sqlite (>= 10.0.9)
  - Mostlylucid.Ephemeral.Sqlite.SingleWriter (>= 2.6.4)
  - Mostlylucid.StyloExtract.Abstractions (>= 1.8.0-alpha.23)
  - Mostlylucid.StyloExtract.Fingerprint (>= 1.8.0-alpha.23)
  - SQLitePCLRaw.bundle_e_sqlite3 (>= 3.0.3)
  - System.IO.Hashing (>= 10.0.1)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 1.8.0 | 0 | 6/27/2026 |
| 1.8.0-alpha.23 | 0 | 6/27/2026 |
| 1.8.0-alpha.22 | 0 | 6/27/2026 |
| 1.8.0-alpha.21 | 0 | 6/27/2026 |
| 1.8.0-alpha.20 | 38 | 6/27/2026 |
| 1.8.0-alpha.19 | 39 | 6/26/2026 |
| 1.8.0-alpha.18 | 53 | 6/26/2026 |
| 1.8.0-alpha.17 | 63 | 6/26/2026 |
| 1.8.0-alpha.16 | 61 | 6/26/2026 |
StyloExtract 1.8.0-alpha.21 - 2026-06-27
=========================================
Streaming: scope fixes (no algorithm replacement)
--------------------------------------------------
Tightens the alpha.19 streaming scanner without replacing the MinHash
matcher. The algorithm shape (MinHash + LSH bands + three fences per
template) is unchanged; what changes is its scope:
1. IncrementalHtmlTokenizer.Feed no longer copies the whole chunk into
_buffer. Chunks are parsed inline; only the partial-tag tail (if a
tag straddles a chunk boundary) is retained for stitching with the
next chunk. PeakBufferedBytes is now bounded by O(longest tag), not
O(chunk size). Measured: peak = 0 B for a 200 KB body in 16 KB
chunks, 19 B in 1 KB chunks. MaxBufferSize lowered from 64 KiB to
4 KiB.
2. RollingSketch shingles upgraded to Markov bigrams: each shingle is
(prevTagHash, currentTagHash, currentClassHash). Order-sensitive:
[A, B] and [B, A] now produce different signatures. The leftmost
shingle in any window uses prevTag = 0 so sliding-window scanners
match fences built from contiguous event sequences regardless of
what came before the window.
3. Static StructuralTagAllowlist replaces per-fence TagAllowlistBloom.
Only structural tags (html/body/header/nav/main/article/section/
div/p/h1-h6/ul/ol/li/table/...) push into the sketch. meta/link/
script-chrome/img/span/a bypass the recompute entirely. The
TagAllowlistBloom JSON property is retained as a back-compat sink
(read-and-discarded) so persisted templates from alpha.16-alpha.20
round-trip cleanly.
4. Depth-aware capture-end: while in Capturing, ContentEnd only matches
when DOM depth has returned to (or below) the depth at ContentStart.
Nested matches mid-content can no longer terminate capture early.
5. Dead StreamingTemplate.MinContentDepth field removed (never read by
any scanner).
6. FenceScanner and IncrementalFenceScanner now share a single static
StreamingTick.Step. Both scanners build a StreamingTickState from
their respective storage (span-backed vs heap-backed) and execute
literally the same code. Cross-validation tests retained as insurance.
7. IStreamingTemplateStore gains version-chain APIs:
- GetByHostAtVersionAsync(host, version) — retrieve a specific version.
- ListVersionsByHostAsync(host) — enumerate all known versions.
UpsertAsync now APPENDS per (host, version) rather than replacing.
SQLite store schema migrated to PK (host, version); existing rows
auto-migrate to version 1 on first open.
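Item 2 (Markov bigram shingles) can be sketched as follows. This is an illustration under stated assumptions, not the package's RollingSketch: the FNV-1a tag hash, the splitmix64-style mixer, the 8-slot signature size, and the name `BigramSketch` are all chosen here for the example. It shows why a (prevTagHash, currentTagHash, currentClassHash) shingle makes the signature order-sensitive, and how the leftmost shingle in a window uses prevTag = 0.

```csharp
using System;

// Hypothetical illustration of bigram shingles feeding a MinHash signature.
public static class BigramSketch
{
    public const int SignatureSize = 8; // assumed size for the example

    // FNV-1a over the characters; a stand-in for whatever tag hashing the package uses.
    public static uint Hash(string s)
    {
        uint h = 2166136261;
        foreach (char c in s) { h ^= c; h *= 16777619; }
        return h;
    }

    // Mixes one (prev, tag, class) shingle with a per-slot constant (splitmix64 finalizer).
    private static ulong Mix(uint prev, uint tag, uint cls, int slot)
    {
        ulong x = (((ulong)prev << 32) | tag)
                  ^ ((ulong)cls * 0x9E3779B97F4A7C15UL)
                  ^ ((ulong)(slot + 1) * 0xD6E8FEB86659FD93UL);
        x ^= x >> 30; x *= 0xBF58476D1CE4E5B9UL;
        x ^= x >> 27; x *= 0x94D049BB133111EBUL;
        x ^= x >> 31;
        return x;
    }

    // Builds the min-pooled signature for one window of tag events.
    public static ulong[] Signature(ReadOnlySpan<(string Tag, string Class)> window)
    {
        var signature = new ulong[SignatureSize];
        Array.Fill(signature, ulong.MaxValue);

        uint prev = 0; // the leftmost shingle has no predecessor inside the window
        foreach (var (tag, cls) in window)
        {
            uint tagHash = Hash(tag), classHash = Hash(cls);
            for (int slot = 0; slot < SignatureSize; slot++)
                signature[slot] = Math.Min(signature[slot], Mix(prev, tagHash, classHash, slot));
            prev = tagHash;
        }
        return signature;
    }
}
```

With unigram shingles, [header, nav] and [nav, header] would hash the same set of elements and produce identical signatures. With bigrams, the two windows produce different shingle sets, so their signatures differ.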
Migration notes:
- Persisted SQLite templates from alpha.16-alpha.20 auto-migrate to
version 1 on first open; existing rows are preserved.
- TemplateFence(uint[], ulong[], ulong, int) constructor removed; the
new shape is TemplateFence(uint[], ulong[], int). TagAllowlistBloom
is still readable as a property (returns 0).
- StreamingTemplate.MinContentDepth removed — drop from any code that
set it in `with` expressions.
- RollingSketch.Push signature changed to Push(prevTagHash, tagHash,
classHash) — direct users must track prev tag.
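The depth-aware capture-end rule from item 4 above can be sketched as a small state machine. The type `DepthAwareCapture` and its `OnTag` signature are assumptions for illustration; the real scanner derives fence matches from MinHash sketches, which are reduced here to two booleans supplied by the caller.

```csharp
using System;

// Hypothetical state machine: an end-fence match counts only at or above the start depth.
public sealed class DepthAwareCapture
{
    private int _depth;
    private int _startDepth = -1;

    public bool Capturing => _startDepth >= 0 && !Captured;
    public bool Captured { get; private set; }

    // depthDelta: +1 for an opening tag, -1 for a closing tag, 0 for a void tag.
    public void OnTag(int depthDelta, bool matchesStartFence, bool matchesEndFence)
    {
        if (Captured) return;
        _depth += depthDelta;

        if (_startDepth < 0)
        {
            // Not capturing yet: remember the depth at which content started.
            if (matchesStartFence) _startDepth = _depth;
            return;
        }

        // While capturing, ignore end-fence matches nested deeper than the start.
        if (matchesEndFence && _depth <= _startDepth)
            Captured = true;
    }
}
```

A nested block that happens to resemble the end fence sits at a greater depth than the content start, so it is skipped and capture continues until the enclosing element closes.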
StyloExtract 1.8.0-alpha.19 - 2026-06-26
=========================================
Streaming: sliding-window design (no full-buffer retention)
------------------------------------------------------------
Refactors alpha.18's IncrementalHtmlTokenizer + IncrementalFenceScanner
to a TRUE sliding-window streaming design:
1. Bytes: only partial-tag bytes are retained. Once a tag is emitted,
the bytes are dropped immediately (compact-on-emit, not compact-on-
next-Feed). New PeakBufferedBytes property exposes the high-watermark
for telemetry. Worst-case in-flight buffer is O(longest tag), not
O(megabytes). MaxBufferSize lowered from 1 MiB to 64 KiB and
repositioned as a hard safety stop that should never be hit under
correct input — exceeding it now means a single tag (or unclosed
script/style body) genuinely exceeds 64 KiB and the scan must bail.
2. Events: fixed-size sliding window of the last WindowSize tag events
(unchanged from alpha.18). Push new, pop oldest. The window is the
only event-level state.
3. RollingSketch: documented (in IncrementalFenceScanner XML doc) that
MinHash with min-pooling is NOT reversibly rollable — once an element
leaves the window, its contribution to min(...) can't be subtracted.
The sketch therefore rebuilds the signature from the current event
window after each accepted tag (O(WindowSize × SignatureSize) per
tick, gated by the Bloom allowlist filter to skip the vast majority
of inbound tags). The bounded-buffer property — the user's headline
concern — is satisfied by the tokenizer; the sketch's per-tick recompute
is the price MinHash charges for the LSH-band locality the matcher
relies on. The event-level memory remains O(WindowSize) regardless.
4. IncrementalFenceScanner now exposes PeakBufferedBytes and BytesConsumed
passthroughs from the tokenizer so callers can prove the bounded-memory
property to telemetry without reaching into the tokenizer directly.
The duplicated tick logic (mirroring FenceScanner.Tick over heap-backed
sketch state) is retained — it's hard-pinned to the ref-struct path by
the existing cross-validation tests, which give us higher confidence
than refactoring to delegate would.
Memory-cap proof: tests/StreamingMemoryBoundTests.cs feeds 5 MiB of
synthetic HTML in 4 KiB chunks and asserts PeakBufferedBytes stays
under 16 KiB. The streaming gateway can now scan multi-megabyte
responses while holding bounded memory.
Migration: API is unchanged from alpha.18 — refactor is internal. The
new PeakBufferedBytes and BytesConsumed diagnostic properties on
IncrementalFenceScanner are additive. MaxBufferSize is still public but
the new value is 64 KiB (was 1 MiB); only relevant if you were catching
the InvalidOperationException for pathological input.
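Point 3 above (min-pooling is not reversibly rollable) can be shown with a single signature slot. The sketch below is illustrative only: `WindowedMin` is a hypothetical name, and it tracks one slot where the real sketch tracks SignatureSize of them. When the element holding the minimum leaves the window, the only way to recover the new minimum is to rescan the remaining events.

```csharp
using System;

// Hypothetical single-slot model of a windowed MinHash: rebuilds min(...) on every push.
public sealed class WindowedMin
{
    private readonly ulong[] _ring;
    private int _count;
    private int _next;

    public WindowedMin(int windowSize) => _ring = new ulong[windowSize];

    public ulong Push(ulong hash)
    {
        // Overwrite the oldest entry; its contribution cannot be "subtracted" from a min.
        _ring[_next] = hash;
        _next = (_next + 1) % _ring.Length;
        if (_count < _ring.Length) _count++;

        // Recompute from the current window: O(WindowSize) per slot per tick.
        ulong min = ulong.MaxValue;
        for (int i = 0; i < _count; i++)
            min = Math.Min(min, _ring[i]);
        return min;
    }
}
```

With a window of 2, pushing 5, 9, 7 returns 5, 5, 7: the third push evicts 5, and a running minimum that never rescanned would still report 5.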
StyloExtract 1.8.0-alpha.18 - 2026-06-26
=========================================
Streaming: true chunked tokenization + refit/versioning + bench update
-----------------------------------------------------------------------
1. IncrementalHtmlTokenizer + IncrementalFenceScanner
Stateful tokenizer that survives chunk boundaries. A partial tag at
the end of one chunk is held in an internal buffer and completed when
the next Feed call arrives. Pairs with IncrementalFenceScanner —
callers Feed chunks as they arrive from the network, get a verdict
per chunk, bail early on Captured / Bailout.
Trade-off vs MinimalHtmlTokenizer's span path: one buffer allocation
per request (not per chunk). Use the span path for whole-buffer
scans, the incremental path for streaming gateways where bytes
arrive in chunks. Hard cap of 1 MiB on the internal buffer — feed
throws InvalidOperationException on pathological input that never
closes a tag, surfacing the failure rather than silently dropping
bytes.
Architectural note: FenceScanner stays a ref struct (zero-alloc hot
path); IncrementalFenceScanner is a heap-backed class that ports the
same tick logic. The two are kept in lockstep — any drift between
them is a correctness bug surface and is covered by cross-validation
tests that feed the same bytes both ways.
2. Streaming-template refit + versioning
StreamingTemplate gains a Version field (defaults to 1, so alpha.17
templates load without migration). New
StreamingRefitOrchestrator observes captured-scan output per host
and schedules refits off the hot path when either:
- capture-region EWMA drift exceeds 30% on N consecutive scans, OR
- every 10th captured scan re-induces and finds different fences
On refit: version bumps, store is upserted, the new
IStreamingTemplateVersionSink fires a StreamingTemplateRefitEvent
(Host, Old/New TemplateId, Old/New Version, Reason, DetectedAt).
Default sink is a no-op; consumers wire UI telemetry to it.
3. Bench update
ExtractionComparisonBench gains a New_StreamingScanByHost variant so
the host-keyed hot-path is benchmarked alongside the original
GUID-keyed scan. Pre-populates the in-memory store with the
host="www.mostlylucid.net" template that lucidview FULL hits in
production.
Migration: additive APIs. Alpha.17 consumers using ScanByHost continue
to work; the incremental tokenizer and the refit orchestrator are
opt-in (use them when feeding chunks / when wiring drift telemetry).
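The EWMA drift trigger from item 2 can be sketched as below. The notes do not state what the capture-region metric is, the smoothing factor, or the value of N, so this example assumes captured byte count as the metric, alpha = 0.2, and N = 3. `DriftMonitor` is a hypothetical name, not a type from the package.

```csharp
using System;

// Hypothetical drift monitor: signals a refit after N consecutive scans over the threshold.
public sealed class DriftMonitor
{
    private readonly double _alpha;
    private readonly double _threshold;
    private readonly int _requiredConsecutive;

    private double _ewma;
    private bool _seeded;
    private int _streak;

    public DriftMonitor(double alpha = 0.2, double threshold = 0.30, int requiredConsecutive = 3)
    {
        _alpha = alpha;
        _threshold = threshold;
        _requiredConsecutive = requiredConsecutive;
    }

    // Returns true when a refit should be scheduled off the hot path.
    public bool Observe(double capturedBytes)
    {
        if (!_seeded)
        {
            _ewma = capturedBytes;
            _seeded = true;
            return false;
        }

        // Relative deviation of this scan from the running average.
        double drift = _ewma == 0
            ? (capturedBytes == 0 ? 0 : 1)
            : Math.Abs(capturedBytes - _ewma) / _ewma;

        _streak = drift > _threshold ? _streak + 1 : 0;
        _ewma = _alpha * capturedBytes + (1 - _alpha) * _ewma;

        if (_streak < _requiredConsecutive) return false;
        _streak = 0; // reset so one drift episode triggers one refit
        return true;
    }
}
```

Requiring consecutive breaches keeps a single unusual page (an error page, an empty listing) from triggering a refit on its own.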
StyloExtract 1.8.0-alpha.17 - 2026-06-26
=========================================
Streaming: host-keyed templates + naive auto-induction
-------------------------------------------------------
Three changes to close the alpha.16 streaming integration loop:
1. Host-keyed lookup
IStreamingTemplateStore gains GetByHostAsync / TryGetHotByHost /
UpsertAsync. StreamingTemplate gains a Host field (required). One
template per host (latest wins). The existing GUID-keyed methods
remain — Host is the lookup key for consumers; TemplateId stays for
stable identity / versioning.
2. StreamingPathSelector.ScanByHost(host, bytes)
Synchronous hot-cache-only host scan. Returns NoTemplate on miss
so the caller can WarmByHostAsync + retry or induce.
WarmByHostAsync brings a host's template into the hot cache via
the durable tier.
3. StreamingTemplateInducer
Naive first-pass inducer: walks HTML via MinimalHtmlTokenizer,
finds semantic-marker tag-sequence-pairs (<header>...</header>,
<p>...</p>...<p>...</p>, <footer>/</main>/</body>) and produces a
StreamingTemplate ready to upsert. Returns null on pages with no
identifiable structural fences (plain text, image-only, etc.).
Describe() returns a human-readable summary of the chosen markers
for logging.
Storage migrations:
- InMemoryStreamingTemplateStore: adds an in-memory host index.
- SqliteStreamingTemplateStore: adds a 'host' TEXT column + index;
on-open ALTER TABLE migration handles pre-alpha.17 schemas
(existing rows get Host="" — reachable only by GUID).
Migration: additive APIs; alpha.16 consumers using only the
GUID-keyed surface continue to work unchanged. The new Host field
on StreamingTemplate IS required — existing construction sites must
set Host="" if they have no host context.
StyloExtract 1.8.0-alpha.16 - 2026-06-26
=========================================
Mostlylucid.StyloExtract.Streaming — zero-allocation byte-stream scanner
------------------------------------------------------------------------
New package on NuGet. Hot-path streaming fence scanner: skips page chrome
and captures the content region as response bytes flow past, using
MinHash-derived structural fences. Zero per-request GC-tracked
allocations in steady state.
Designed for the gateway position — drop into a response pipeline
(HttpClient, Stylobot's edge, ASP.NET output filters) alongside the byte
stream and emit a verdict without buffering the full page.
Public hot-path API:
StreamingPathSelector.Scan(Guid templateId, ReadOnlySpan<byte> html)
→ ScanVerdict { Continue | Captured | Bailout | NoTemplate }
// Warm a template into the hot cache:
await selector.WarmAsync(templateId);
Storage:
- InMemoryStreamingTemplateStore — single-process LRU.
- SqliteStreamingTemplateStore — durable; same SQLite file pattern as
the existing ITemplateIndex but a separate table.
Pairs with the existing StyloExtract.Fingerprint learn path and
ITemplateIndex template store. The streaming template format is its own
shape (TemplateFence with MinHash bloom, content-start/content-end
fences) — not an LLM template or operator template.
Bench results vs LayoutExtractor on mostlylucid fixtures: see
bench/StyloExtract.Streaming.Benchmarks/ (zero-alloc scan competitive
with the full extractor's path-match cost while never building a DOM).
Migration: additive package; consumers add a PackageReference to
Mostlylucid.StyloExtract.Streaming if they want gateway-position
scanning.
StyloExtract 1.8.0-alpha.15 - 2026-06-26
=========================================
RenderOptions.WaitUntil — opt out of NetworkIdle for SPA routing
-----------------------------------------------------------------
PlaywrightHtmlFetcher previously hardcoded WaitUntilState.NetworkIdle
for the primary GotoAsync. On sites with aggressive client-side
routing (BBC News auto-navigates /news → /articles/<id> in the
post-load JS phase), this meant the fetcher returned the post-routing
DOM, not the page the user requested.
RenderOptions now exposes a WaitUntil property (PlaywrightWaitUntil
enum: Load / DOMContentLoaded / NetworkIdle / Commit). Default stays
NetworkIdle for backwards compatibility. Consumers fetching SPA-heavy
sites should set Load to capture the initial DOM before the router
fires.
The secondary WaitForLoadStateAsync(NetworkIdle, ...) drain remains —
it's independently bounded by WaitForNetworkIdleTimeout and serves as
a best-effort late-XHR catch-up; safe even with the primary returning
on Load.
PlaywrightWaitUntil is a small enum of its own (rather than
Microsoft.Playwright.WaitUntilState used directly) so consumers don't
take a transitive dependency on Microsoft.Playwright just to pick a strategy.
StyloExtract 1.8.0-alpha.14 - 2026-06-26
=========================================
Sitemap CLI end-to-end regression + LLM nav few-shot
-----------------------------------------------------
1. Sitemap CLI test suite
The alpha.11 stylo-extract sitemap verb has been working on real sites
since alpha.13 (heuristic nav-classification tightening), but nothing
caught regressions. Added 5 end-to-end tests in
StyloExtract.Core.Tests/SitemapCommandTests.cs that invoke the
SitemapCommand.CrawlAsync handler against the mostlylucid-home.html.gz
fixture (real captured homepage, shared with the heuristics suite) plus
a stub HttpMessageHandler and assert: real nav links emitted under
# www.mostlylucid.net, --max-depth 0 emits only the seed Title row,
off-host links are not followed, --max-pages cap honoured exactly, and
--delay-ms enforced with a stopwatch floor. No network access required.
2. LLM induction prompt — nav-classification few-shot
LlmInducerPrompts.System and SystemRepair now include a second worked
example: a blog homepage with header <nav>, breadcrumb,
MainContent + RepeatedItem post cards, and footer <nav>. Mirrors the
patterns the alpha.13 NavPreDetector heuristic correctly classifies.
Rule 6 (RepeatedItem usage) tightened with explicit guidance that
header/footer nav lists are PrimaryNavigation / SecondaryNavigation at
the parent <ul>/<nav> level, NOT RepeatedItem at the <li> level —
closes a known LLM confusion mode.
Tests: snapshot tests in StyloExtract.Core.Tests/LlmInducerPromptsTests.cs
verify the prompt extensions land verbatim so future prompt edits don't
accidentally regress.
StyloExtract 1.8.0-alpha.13 - 2026-06-26
=========================================
Heuristic nav-classification tightening
----------------------------------------
HeuristicBlockClassifier was under-classifying real-world nav patterns
on server-rendered sites — header <nav> strips, header <ul>-of-links,
breadcrumb lists, role="navigation" attributes, footer nav — all landed
as Boilerplate (or weren't extracted at all). Result: the alpha.11
Sitemap profile and stylo-extract sitemap CLI verb produced a one-line
tree even on sites with rich nav, because the classifier didn't surface
PrimaryNavigation / SecondaryNavigation / Breadcrumb roles for them.
Tightened patterns now produce definite role classifications:
1. <header> <nav> -> PrimaryNavigation (0.9)
2. Top-of-document <nav> -> PrimaryNavigation (0.85)
3. <footer> <nav> -> SecondaryNavigation (0.9)
4. <nav aria-label="breadcrumb"> / class~="breadcrumb" -> Breadcrumb (0.95)
5. <* role="navigation"> -> PrimaryNavigation (0.95)
6. Header <ul> of mostly-link <li>s -> PrimaryNavigation (0.85) at
the <ul> level, suppress descent (was emitting deep Boilerplate)
7. Footer <ul> of mostly-link <li>s -> SecondaryNavigation (0.85)
Implementation: a new NavPreDetector runs after per-element classification
and injects each detected nav container as a high-score (50000) candidate
at the parent level, then demotes any descendant candidates so greedy
selection picks the nav parent and stops descending into its noise.
Containers nested inside <main>/<article> are skipped — IntraBlockCleaner
already strips them as intra-block contaminants; hoisting would steal
the article's selection win.
Regression fixtures captured from mostlylucid.net + wikipedia.org under
tests/StyloExtract.Heuristics.Tests/Fixtures so the next time a
classifier change regresses real-world nav detection, the bench catches
it before it ships.
Downstream impact: the Sitemap ExtractionProfile and stylo-extract
sitemap CLI verb now produce real nav trees on these sites - see the
lucidview FULL dogfood smoke for evidence.
StyloExtract 1.8.0-alpha.12 - 2026-06-26
=========================================
DI wire-up fix for deterministic-template YAML persistence
-----------------------------------------------------------
alpha.11 introduced DeterministicTemplateYamlSink + the
AddStyloExtractOperatorTemplates registration, but AddStyloExtract's
LayoutExtractor construction did not pass the sink through to the
extractor — so even when the sink was registered in DI, LayoutExtractor's
optional ctor parameter defaulted to null and no `<host>-deterministic.yaml`
file was ever written.
Fixed by threading `sp.GetService<DeterministicTemplateYamlSink>()` to the
LayoutExtractor constructor in AddStyloExtract. No API change; consumers
who already called AddStyloExtractOperatorTemplates start seeing
deterministic YAML files immediately after upgrading.
StyloExtract 1.8.0-alpha.11 - 2026-06-26
=========================================
Sequenced architecture extension: deterministic templates with
extended classification — Title role, Sitemap profile, deterministic
YAML persistence, and a sitemap CLI verb.
Title BlockRole
---------------
New BlockRole.Title value distinguishes the page-level <h1> (the single
H1 the rest of the page is "about") from intra-content Heading
(H2/H3/H4 inside the body). HeuristicBlockClassifier surfaces the Title
via a shared PageTitleDetector helper, picking the H1 inside or closest to
<main>/<article> and falling back to the earliest H1 in the document when
there are several. ExtractorApplicator surfaces Title on the fast-path / applicator
branch too, so output stays consistent across novel and cached requests
(matters for the response-cache ETag). LlmInducerPrompts list Title in
the allowed-roles set with a one-line distinction from Heading.
MainContentOnly, RagFull, Wcxb, and AgentNavigation profiles all
include Title in their role-set. The renderer quality gate (drop short
text) is bypassed for Title and Heading so intentionally terse page
titles ("Home", "About") still surface.
Sitemap ExtractionProfile
-------------------------
New ExtractionProfile.Sitemap value emits only Title + Heading +
PrimaryNavigation + SecondaryNavigation + Breadcrumb. For sitemap /
outline / crawler use cases that want page titles and the site's nav
structure without pulling body content. The CLI's --profile flag
recognises `sitemap` automatically (enum binding).
Deterministic YAML persistence
------------------------------
New DeterministicTemplateYamlSink, wired automatically when
AddStyloExtractOperatorTemplates(root) is called, writes
<host>-deterministic.yaml alongside each heuristic-induced template's
SQLite row. The file carries every role the heuristic detected (Title,
MainContent, Navigation, Footer, …) — auditable, hand-editable, and
diffable, mirroring how LLM-induced templates have always been written
by TemplateEnrichmentCoordinator. The SQLite store remains the
authoritative source at match time; YAML is best-effort and
non-blocking.
stylo-extract sitemap CLI verb
------------------------------
New `sitemap` subcommand: takes one or more starting URLs, extracts
each with ExtractionProfile.Sitemap, follows internal nav links to
--max-depth (default 3), and emits a markdown tree of titles + URLs to
stdout or --out <file>. Safety caps: 50 pages by default
(--max-pages), 1s between requests (--delay-ms), no off-host follow.
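The crawl bounds described above (depth limit, page cap, no off-host follow) amount to a bounded breadth-first traversal. The sketch below is not the SitemapCommand implementation: `BoundedCrawl` and the `getNavLinks` delegate are hypothetical, link extraction is left to the caller, and the per-request delay is omitted to keep the example synchronous.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical bounded BFS mirroring the --max-depth / --max-pages / same-host rules.
public static class BoundedCrawl
{
    public static List<(Uri Url, int Depth)> Crawl(
        Uri seed,
        Func<Uri, IEnumerable<Uri>> getNavLinks,
        int maxDepth = 3,
        int maxPages = 50)
    {
        var visited = new HashSet<Uri> { seed };
        var queue = new Queue<(Uri Url, int Depth)>();
        var pages = new List<(Uri Url, int Depth)>();
        queue.Enqueue((seed, 0));

        while (queue.Count > 0 && pages.Count < maxPages)
        {
            var (url, depth) = queue.Dequeue();
            pages.Add((url, depth));

            // At the depth limit the page is still emitted, but its links are not followed.
            if (depth >= maxDepth) continue;

            foreach (var link in getNavLinks(url))
            {
                bool sameHost = string.Equals(link.Host, seed.Host, StringComparison.OrdinalIgnoreCase);
                if (sameHost && visited.Add(link))
                    queue.Enqueue((link, depth + 1));
            }
        }
        return pages;
    }
}
```

With maxDepth set to 0 only the seed page is returned, which matches the behaviour the alpha.14 tests assert for --max-depth 0.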
Migration
---------
No source change required for consumers. The new Title role is
additive (existing switches that handled BlockRole pattern-match
defaults will continue to compile and behave identically; switches
that exhaustively listed roles were updated). Deterministic YAML
writing only activates when AddStyloExtractOperatorTemplates is
called, so consumers that don't use operator templates see no new
filesystem activity.
StyloExtract 1.8.0-alpha.10 - 2026-06-26
=========================================
LLM classification accuracy for chrome patterns
------------------------------------------------
Symptom: induced templates were labelling language pickers, filter UI,
locale switchers, and pagination strips as MainContent on
server-rendered blogs (mostlylucid.net being the canonical reproducer).
The downstream RagFull renderer's role-filter — which already drops
PrimaryNavigation / SecondaryNavigation / Form / Boilerplate — never
saw them as nav and so left them in the extracted markdown, producing
output WORSE than the deterministic heuristic.
Fix: expanded the induction and repair system prompts with explicit
"chrome pattern → role" examples (language picker → PrimaryNavigation;
filter / faceted-search → Form; pagination → SecondaryNavigation;
cookie banner → CookieBanner; newsletter signup → Form; social-share
→ Boilerplate). Also nudged the model to prefer narrower MainContent
selectors that don't include chrome as nested children.
DomSkeletonRenderer now surfaces structural ARIA attributes (`role`,
`aria-label`, `aria-labelledby`) alongside each element's tag / class /
id, giving the LLM more signal for distinguishing landmark regions
(nav / form / banner) from content. The hash-class-name filter is also
slightly less aggressive: pure PascalCase ids (e.g. `LanguageDropDown`)
now survive into the skeleton so the LLM can use them as selectors,
while real CSS-module hashes (mixed-case + digits, or 4+ case
transitions) are still dropped.
The renderer side (TypedMarkdownRenderer.ShouldEmit) is unchanged —
it was already correctly filtering by role. The fix is entirely about
label accuracy and the signal the LLM sees.
Migration: no source change required for consumers; templates induced
post-1.8.0-alpha.10 will produce cleaner output under
ExtractionProfile.RagFull and MainContentOnly. Cached templates
induced under earlier alphas will keep producing the old output until
they're refit (centroid drift triggers refit automatically; or
operators can manually clear the template store).
Regression tests: tests/StyloExtract.Core.Tests adds
LlmInducerPromptAntiPatternTests (prompt snapshot) and
MostlylucidLlmInductionRegressionTests (applies a synthetic bad-wide-
wrapper template against a captured mostlylucid.net fixture, proves
the language-picker / filter chrome leaks; then shows a properly
authored RepeatedItem template excludes per-card chrome cleanly).
StyloExtract 1.8.0-alpha.9 - 2026-06-25
========================================
App-safe AddStyloExtract + LlmInductionFired flag
--------------------------------------------------
Two changes that downstream desktop / CLI consumers (e.g. lucidVIEW-FULL)
need:
1. The basic `AddStyloExtract(IServiceCollection, Action<StyloExtractOptions>?)`
DI extension and its companion `StyloExtractOptions` type now live in
`Mostlylucid.StyloExtract.Core` instead of `Mostlylucid.StyloExtract.AspNetCore`.
Non-AspNetCore hosts (desktop apps, CLI tools, console workers) can call:
services.AddStyloExtract(o => o.StorePath = "templates.db");
without pulling `Microsoft.AspNetCore.App` (~70 MB of framework runtime).
`Mostlylucid.StyloExtract.AspNetCore` keeps its `Action<ResponsePolicyBuilder>`
overloads (response-policy framework, markdown content negotiation
middleware, operator-template minimal-API endpoints) — those legitimately
need AspNetCore. They now delegate to the Core overload internally.
Migration: no source change for AspNetCore consumers. Desktop / CLI
consumers can reference `Mostlylucid.StyloExtract.Core` alone.
2. `ExtractionResult.LlmInductionFired` (new bool property) signals
whether the LLM template inducer ran during this extraction. Downstream
telemetry surfaces (e.g. status bars, NDJSON exports) can now show LLM
utilisation per call without reflection or polling internal state.
Defaults to false for non-LLM hosts and heuristic-only extractions;
set true only when the LlmTemplateInducer (or any future ILlmTextProvider-
backed inducer) actually invoked the LLM.
StyloExtract 1.8.0-alpha.6 - 2026-06-25
========================================
App-safe AddStyloExtract — moved to StyloExtract.Core
------------------------------------------------------
The basic `AddStyloExtract(IServiceCollection, Action<StyloExtractOptions>?)`
DI extension and its companion `StyloExtractOptions` type have moved from
`Mostlylucid.StyloExtract.AspNetCore` to `Mostlylucid.StyloExtract.Core`.
Desktop, CLI, and any non-AspNetCore host can now call:
services.AddStyloExtract(o => o.StorePath = "templates.db");
without pulling `Microsoft.AspNetCore.App` (~70 MB of framework runtime).
`Mostlylucid.StyloExtract.AspNetCore` keeps its `Action<ResponsePolicyBuilder>`
overloads (response-policy framework, markdown content negotiation
middleware, operator-template minimal-API endpoints) — those legitimately
need AspNetCore. They now delegate to the Core overload internally.
Migration: no source change required for AspNetCore consumers. Desktop /
CLI consumers can drop direct dependencies on `Mostlylucid.StyloExtract.AspNetCore`
and reference `Mostlylucid.StyloExtract.Core` alone.
StyloExtract 1.8.0-alpha.5 - 2026-06-25
========================================
In-process CPU LLM backend (LLamaSharp) + 13-model bench harness.
Operators can now embed a single ~2-3 GB GGUF model in the host
process — no Ollama server, no separate LLM daemon. Same
ILlmTextProvider contract as the Ollama backend, so the
LlmTemplateInducer + production enrichment coordinator + CLI
`template train` all work unchanged.
What's new since 1.8.0-alpha.4
------------------------------
Mostlylucid.StyloExtract.Llm.LlamaSharp
New package. ILlmTextProvider implementation backed by LLamaSharp
0.27 (the .NET binding for llama.cpp). Loads a GGUF model from
disk; the executor reads the model's chat template from GGUF
metadata so prompts written for Ollama work unchanged.
Wire-up:
services.AddStyloExtract(...);
services.AddStyloExtractLlamaSharp(o =>
{
o.ModelPath = "/var/models/Phi-4-mini-instruct-Q4_K_M.gguf";
o.ContextSize = 8192;
o.GpuLayerCount = 0; // pure CPU target
});
services.AddStyloExtractLlmInducer("config/templates");
Anti-prompt set covers Qwen, Phi, Llama 3+, and Gemma 4 stop
tokens so the generator halts at the model's natural turn boundary
instead of echoing the chat template structure.
Known LLamaSharp 0.27 issue documented in the package README:
Gemma 4 E2B / E4B's chat template metadata isn't applied cleanly
by StatelessExecutor — the model emits Jinja2 template source
instead of YAML. Phi-4-mini, Qwen 2.5 Coder, Llama 3.2 work fine.
Model benchmark harness
New tests/StyloExtract.Llm.Benchmark project — runs the
cross-product of (models × pages) for template induction and
reports F1 / train-time / markdown-size matrices. Reuses WCXB
ground-truth shape (one HTML.gz per page id, one ground-truth
JSON) and the operator-template store path.
Model spec routing: `llamasharp:/path/to/file.gguf` resolves via
the in-process backend; anything else hits Ollama. Lets one
bench compare server (Ollama) and embedded (LlamaSharp) backends
side-by-side with identical fixtures.
Recommended models (empirically validated)
For Ollama backend:
* qwen3.5:4b — 3 GB, ~26 s, F1 0.805 (default, best)
* qwen2.5-coder:3b — 2 GB, ~21 s, F1 0.767 (smaller-and-faster pick;
code-trained matters for
CSS selectors)
* qwen3.5:0.8b — 1 GB, ~5 s, F1 0.528 (tiny floor)
For LLamaSharp backend (use bartowski quants):
* Phi-4-mini-instruct Q4_K_M — 2.5 GB, verified working
* Qwen 3.5 4B Q4_K_M — 3 GB, verified working
* Qwen 2.5 Coder 3B Q4_K_M — 2 GB, verified working
OllamaTextProviderOptions default model bumped
Default tag was gemma4:e4b-it-qat; switched to qwen3.5:4b per the
bench. The doc-comment now lists the smaller-and-faster pick and
the model families to avoid (thinking-mode budget burn).
Tests
494 tests across 11 projects. New StyloExtract.Llm.LlamaSharp.Tests
project covers ctor validation, missing-file behaviour, and
SkippableFact live-GGUF integration (skipped without
STYLOEXTRACT_LLAMASHARP_MODEL env var pointing at a GGUF file).
StyloExtract 1.8.0-alpha.4 - 2026-06-25
========================================
Tiny patch alpha to fix two consumer-facing bugs found while smoke-
installing alpha.3 against NuGet.
What's new since 1.8.0-alpha.3
------------------------------
SQLite chain CVE patched (GHSA-2m69-gcr7-jv3q)
Microsoft.Data.Sqlite bumped 10.0.1 -> 10.0.9; StyloExtract.Templates
gains a direct PackageReference to SQLitePCLRaw.bundle_e_sqlite3 so
the existing 3.0.3 central pin lifts the resolved bundle off the
vulnerable 2.1.11 line and onto SourceGear.sqlite3 3.50.4.5.
`dotnet list package --vulnerable` on consumer projects now
returns clean.
Playw
[truncated — see RELEASE_NOTES.txt packaged at root for full history]