Mostlylucid.StyloExtract.Templates.Postgres 2.0.1

Mostlylucid.StyloExtract.Templates.Postgres

PostgreSQL-backed template index for StyloExtract. Implements the same ITemplateIndex contract as Mostlylucid.StyloExtract.Templates (the SQLite provider); swap providers via DI with no change to calling code.

When to use this instead of SQLite

Choose the Postgres provider when:

Your deployment already runs PostgreSQL as its operational database (StyloBot commercial, multi-tenant SaaS)
You need multiple extraction nodes sharing one template store (Npgsql pools connections; Postgres serialises concurrent writes natively)
You plan to add pgvector cosine-similarity search in a future upgrade (the schema is forward-compatible)

The SQLite provider (Mostlylucid.StyloExtract.Templates) is the right choice for single-host or air-gapped deployments, CLI tools, and anywhere you want zero external dependencies.

Installation

dotnet add package Mostlylucid.StyloExtract.Templates.Postgres

Usage

// Register the Postgres provider. Call this instead of (or after) AddStyloExtract()
// to replace the SQLite ITemplateIndex with the Postgres one.
services.AddStyloExtractPostgres(o =>
    o.ConnectionString = "Host=localhost;Port=5432;Database=styloextract;Username=se;Password=secret");

// Optional: register drift-triggered refit support (mirrors RefitOrchestrator for SQLite).
services.AddStyloExtractPostgresRefit(
    driftRefitThreshold: 0.35,
    observationsBeforeStable: 5,
    versionHistoryDepth: 3);

Schema is applied idempotently on the first operation (CREATE TABLE IF NOT EXISTS). No migration tool required.

Storage model

Table	Contents
`templates`	Template id (bytea), host hash, fingerprint, extractor JSON blob, version, observation count
`template_lsh_band_index`	LSH bucket rows for fast-path lookup
`template_observations`	Per-request observation vectors (bounded to last 100 per template)
`template_version_history`	Past extractor versions retained for diff generation

Columns that are BLOB in SQLite are bytea in Postgres. Timestamps are bigint Unix milliseconds. No pgvector dependency in v1; vector similarity uses the same CPU-side cosine math as the SQLite provider.
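
As a rough illustration of the two conventions above, here is a minimal sketch of CPU-side cosine similarity and of round-tripping a timestamp through bigint Unix milliseconds. The `Cosine` helper and the sample vectors are hypothetical and are not taken from the package; only `DateTimeOffset.ToUnixTimeMilliseconds` and `FromUnixTimeMilliseconds` are standard .NET calls.

```csharp
using System;

// Hypothetical helper: cosine similarity over two equal-length vectors, computed on the CPU.
double Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have equal length.");
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    // Treat a zero vector as "no similarity" rather than dividing by zero.
    return normA == 0 || normB == 0 ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

float[] stored = { 1f, 0f, 1f };
float[] probe = { 1f, 0f, 1f };
float[] other = { 0f, 1f, 0f };

double same = Cosine(stored, probe);        // identical direction
double orthogonal = Cosine(stored, other);  // no shared components

// Timestamps stored as bigint Unix milliseconds round-trip through DateTimeOffset.
long unixMs = new DateTimeOffset(2026, 6, 28, 0, 0, 0, TimeSpan.Zero).ToUnixTimeMilliseconds();
DateTimeOffset roundTrip = DateTimeOffset.FromUnixTimeMilliseconds(unixMs);

Console.WriteLine($"{same:F3} {orthogonal:F3} {unixMs} {roundTrip:O}");
```

Because the similarity is computed in process, the stored vectors only need to come back from Postgres as plain bytes; no database extension is involved.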

AOT

This package sets IsAotCompatible=false because Npgsql requires runtime reflection for connection-string parsing. It will not break AOT builds in packages that do not reference it (sibling packages such as StyloExtract.Playwright remain AOT-safe).

Full documentation and package family

Product	Compatible and additional computed target framework versions.
.NET	net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.


net10.0
- Mostlylucid.StyloExtract.Abstractions (>= 2.0.1)
- Mostlylucid.StyloExtract.Fingerprint (>= 2.0.1)
- Mostlylucid.StyloExtract.Templates (>= 2.0.1)
- Npgsql (>= 10.0.3)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
2.0.1	106	6/30/2026
2.0.0	105	6/28/2026
1.8.0	104	6/27/2026
1.8.0-alpha.23	55	6/27/2026
1.8.0-alpha.22	59	6/27/2026
1.8.0-alpha.21	57	6/27/2026
1.8.0-alpha.20	65	6/27/2026
1.8.0-alpha.19	59	6/26/2026
1.8.0-alpha.18	85	6/26/2026
1.8.0-alpha.17	94	6/26/2026
1.8.0-alpha.16	97	6/26/2026
1.8.0-alpha.15	66	6/26/2026
1.8.0-alpha.14	55	6/26/2026
1.8.0-alpha.13	55	6/26/2026
1.8.0-alpha.12	53	6/26/2026
1.8.0-alpha.11	54	6/26/2026
1.8.0-alpha.10	60	6/26/2026
1.8.0-alpha.9	60	6/25/2026
1.8.0-alpha.8	64	6/25/2026
1.8.0-alpha.4	66	6/25/2026

StyloExtract 2.0.0 - 2026-06-28
================================

First stable release. Closes Phase 1 + Phase 2 of the identity-claim
rework that ran across alpha.22, alpha.23, and the in-flight code that
was never tagged. Stable means the v2 API contracts (IdentityClaim, the
streaming options, the operator-template shape with Claims, the
apply-time quality gate) are now things consumers can build on.

What's new since 1.8.0-alpha.21
-------------------------------

Identity-claim primitive (Phase 1)

- New `IdentityClaim` type — outermost-first ancestor chain of
(tag, id, classes, data-* / aria-* / role) entries, anchoring every
selector by stable identity rather than by CSS string.
- `DefaultClassStabilityFilter` rejects hash-shaped class tokens
(Tailwind JIT names, CSS-module hashes, build-time churn) so that
emitted claims survive across visits.
- Inducer is identity-aware end-to-end: cardinality-aware uniqueness
for repeated roles, narrow tripwires for the streaming side, no
CSS-string emission anywhere on the apply path.
- Layout extractor's apply path runs on `IdentityClaimApplicator`;
the old CSS-string applicator is gone.

Streaming gateway: exact tripwire matching + bounded memory

- The streaming scanner shifted from MinHash + LSH bands to exact
`IdentityClaim` matching against the per-event hash data the
tokenizer carries on each `TagEvent` (tag-name + id + per-class
+ per-data-attr + per-aria + role hashes). The matcher walks the
claim's required hashes linearly against the event's hash arrays;
no per-tick MinHash recompute, no sliding window.
- `StreamingTokenizerOptions` replaces the hard `MaxBufferSize`
consts on `IncrementalHtmlTokenizer` and
`IncrementalBytePatternScanner`. Both buffers are now rented from
`ArrayPool<byte>.Shared` and grow on demand up to the configurable
ceiling (default 1 MiB per buffer). Both classes are `IDisposable`.
- `TagAttrLimits` replaces the per-event `TagEvent.MaxClassesPerEvent`
(was 8) and `TagEvent.MaxAttrPairsPerEvent` (was 3). Defaults
bumped to 32 / 16, validated up to 256 / 128 ceilings. Real pages
no longer silently lose the tail.
- Streaming-template inducer rewrite (Task 4 of Phase 1) — emits
`IdentityClaim`-based tripwires shared with the layout side.
- Incremental byte-pattern scanner (Task 13) replaces the alpha.21
tripwire scanner with a faster exact-match path; tag-hash prefilter
cuts per-scan allocs ~25-30x.
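
The rent-and-grow behaviour described for the streaming buffers can be sketched roughly as follows. This is not the library's implementation: `InitialBytes`, `MaxBytes`, and `Append` are hypothetical names, and the doubling policy is an assumption. Only `ArrayPool<byte>.Shared.Rent` and `Return` are real .NET APIs.

```csharp
using System;
using System.Buffers;

const int InitialBytes = 4 * 1024;   // hypothetical starting size
const int MaxBytes = 1024 * 1024;    // 1 MiB ceiling, matching the default described above

byte[] buffer = ArrayPool<byte>.Shared.Rent(InitialBytes);
int used = 0;

// Append bytes, growing the rented buffer on demand up to MaxBytes.
void Append(ReadOnlySpan<byte> chunk)
{
    int required = used + chunk.Length;
    if (required > MaxBytes)
        throw new InvalidOperationException($"Buffered data would exceed {MaxBytes} bytes.");

    if (required > buffer.Length)
    {
        // Assumed policy: double the capacity, but never past the ceiling.
        int newSize = Math.Min(MaxBytes, Math.Max(required, buffer.Length * 2));
        byte[] bigger = ArrayPool<byte>.Shared.Rent(newSize);
        buffer.AsSpan(0, used).CopyTo(bigger);
        ArrayPool<byte>.Shared.Return(buffer);   // hand the old array back to the pool
        buffer = bigger;
    }

    chunk.CopyTo(buffer.AsSpan(used));
    used = required;
}

try
{
    Append(new byte[3000]);
    Append(new byte[3000]);   // exceeds the initial rent, so one growth step happens here
    Console.WriteLine($"used={used}, capacity={buffer.Length}");
}
finally
{
    // Rented arrays must be returned, which is why the real types are IDisposable.
    ArrayPool<byte>.Shared.Return(buffer);
}
```

The `finally` block plays the role that `Dispose` plays on the real tokenizer and scanner: whoever rents the array is responsible for returning it.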

Apply-time quality gate + auto-repair loop

- New `ApplicatorBrokenCheck` lifts the apply-time bug-out signal
out of LayoutExtractor's local function into a unit-testable gate.
Three new failure modes: noisy-MainContent (link-density >= 0.5
inside a content block, catches the Wikipedia / mostlylucid
language-picker leak), image-anchor picker (many short-text
anchors, catches the route-variant strip), metadata-shape
rejection (key:value-dominated blocks, catches the MS Learn YAML
frontmatter leak).
- LayoutExtractor Move 3 widens the repair-enqueue gate: drops the
"hand-authored template must exist" requirement, triggers on
applicatorBugOut OR thin-markdown, adds Refit to the qualifying
match-status set.
- `IsDeterministic` flag on `OperatorTemplate` distinguishes the
heuristic inducer's deterministic YAML audit snapshots from
hand-authored / LLM-induced templates. Deterministic snapshots no
longer block LLM induction.
- `OperatorTemplateRule.Claims` carries the identity-claim ancestor
chain on operator templates so the operator-template path runs on
the identity-claim applicator instead of the CSS-string fallback.

Heuristic block classifier improvements

- Tighten-on-anchor (Move 1) — after a `<main>`/`<article>` qualifies
as MainContent, look down one level for a div/section descendant
with a stable identity anchor (stable id OR >= 1 stable class) that
carries >= 80% of the wrapper's prose text and has link density
< 0.5. When exactly one descendant qualifies, prefer it. Catches
Wikipedia + mostlylucid leaks where the picker rides inside the
outer semantic element.
- `<article>` semantic-tag exception in repeated-item link-density
gate — news-listing pattern where each card is a single clickable
`<a>` (density ~1.0) now survives the gate. The Register, Verge,
Ars, BBC News listings render again.
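
The tighten-on-anchor rule can be read as a small selection function. The sketch below is illustrative only: `PickTightened`, the tuple shape, and the definition of link density as link characters divided by text characters are all assumptions, not the classifier's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical: returns the single qualifying descendant, or null when zero or several qualify.
string? PickTightened(
    int wrapperProseChars,
    IReadOnlyList<(string Name, bool HasStableAnchor, int ProseChars, int LinkChars)> children)
{
    var qualifying = children.Where(c =>
    {
        // Share of the wrapper's prose carried by this descendant.
        double share = wrapperProseChars == 0 ? 0 : (double)c.ProseChars / wrapperProseChars;
        // Assumed definition of link density: link text divided by all text.
        double linkDensity = c.ProseChars == 0 ? 1 : (double)c.LinkChars / c.ProseChars;
        return c.HasStableAnchor && share >= 0.8 && linkDensity < 0.5;
    }).ToList();

    // Only tighten when the choice is unambiguous.
    return qualifying.Count == 1 ? qualifying[0].Name : null;
}

var candidates = new List<(string Name, bool HasStableAnchor, int ProseChars, int LinkChars)>
{
    ("div#language-picker", true, 300, 290),   // almost entirely links
    ("div.post-body", true, 4500, 200),        // carries most of the prose
};

string? chosen = PickTightened(5000, candidates);
Console.WriteLine(chosen ?? "(keep the wrapper)");
```

Returning null when more than one descendant qualifies mirrors the "exactly one" condition: an ambiguous split keeps the outer semantic element rather than guessing.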

Template enrichment coordinator

- `InMemoryTemplateEnrichmentQueue` cooldown key changed from string
Host to (Host, EnrichmentJobKind) tuple. A first-visit Induce no
longer blocks a follow-up Repair on the same host.
- New `ILlmActivityObserver` interface brackets each LLM call with
LlmCallStarted / LlmCallEnded(success). Wired through the DI
builder so consumers can show "llm <host>..." while CPU inference
is running (the lucidVIEW FULL status bar uses this).
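
The effect of keying the cooldown on a (Host, kind) tuple instead of Host alone can be shown with a plain dictionary. This is a sketch under assumptions: `TryEnqueue`, the five-minute window, and the use of strings in place of the `EnrichmentJobKind` enum are hypothetical simplifications.

```csharp
using System;
using System.Collections.Generic;

var cooldown = TimeSpan.FromMinutes(5);   // hypothetical cooldown window
var lastEnqueued = new Dictionary<(string Host, string Kind), DateTimeOffset>();

// Returns true when the job is accepted, false when the same (host, kind) is still cooling down.
bool TryEnqueue(string host, string kind, DateTimeOffset now)
{
    var key = (host, kind);
    if (lastEnqueued.TryGetValue(key, out var last) && now - last < cooldown)
        return false;
    lastEnqueued[key] = now;
    return true;
}

var t0 = DateTimeOffset.UtcNow;
bool induce = TryEnqueue("example.com", "Induce", t0);
bool repair = TryEnqueue("example.com", "Repair", t0.AddSeconds(1));        // different kind, same host
bool induceAgain = TryEnqueue("example.com", "Induce", t0.AddSeconds(2));   // same kind inside the window
Console.WriteLine($"{induce} {repair} {induceAgain}");
```

With a host-only key the second call would have been rejected; with the tuple key only the repeated Induce is.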

Corpus mining (Phase 2)

- `SelectorDistance` metric quantifies how similar two emitted
selectors are for evolved-candidate ranking (Task 6).
- `CorpusMiner` query primitives (Task 7) and evolved-selector
emission (Task 8) — proposes alternate selectors from the
template_observations table.
- Passive evaluation of evolved candidates at apply time (Task 9):
evolved selectors run alongside the chosen one and contribute
observations for the next mining cycle.
- Background `CorpusMiningCoordinator` (Task 10) drains the
template_observations table on a cadence and writes evolved
candidates.
- `template_observations` SQLite table (Task 5 of Phase 1) feeds
the mining and evolved-candidate paths.

Cold-path arbitrary caps (now configurable or bumped)

- `NextDataRehydrationExtractor` walker bumped from 500 strings /
depth 12 to 5000 / depth 32 — real Next.js __NEXT_DATA__ blobs
exceeded the old guards.
- LayoutExtractor's LLM-repair sample bumped 400 -> 2000 chars
(more context for the LLM to see what's wrong).
- Skeleton renderer's attr-value truncation bumped 40 -> 160 chars
(covers accessibility-conscious aria-label values).
- Streaming `IncrementalHtmlTokenizer.MaxBufferSize` (was 16 KiB
const that threw on JSON-LD blobs) replaced by
`StreamingTokenizerOptions.MaxPartialTagBytes` (1 MiB default).

Breaking changes you need to know about
---------------------------------------

- `IncrementalHtmlTokenizer.MaxBufferSize` and
`IncrementalBytePatternScanner.MaxBufferSize` public consts removed.
Replaced by per-instance configuration via
`StreamingTokenizerOptions`. The instances are now `IDisposable`;
long-lived consumers should wrap in `using`.
- `TagEvent.MaxClassesPerEvent` and `TagEvent.MaxAttrPairsPerEvent`
internal consts removed. Caps thread through `TagAttrLimits`,
configured from `StreamingTokenizerOptions`. Defaults bumped
(8 -> 32 / 3 -> 16) so existing code that didn't override the cap
sees the same or wider coverage.
- `TagAttributeParser.ExtractIdentityHashes` now takes a
`TagAttrLimits` parameter before the `out` arguments. Update
callers; pass `TagAttrLimits.Default` to keep the new defaults.
- `MinimalHtmlTokenizer` has a new `(input, filter, attrLimits)`
constructor; the existing two-arg constructor delegates to
`TagAttrLimits.Default`.
- `OperatorTemplate` gained `IsDeterministic` (bool) and
`OperatorTemplateRule` gained `Claims`
(`IReadOnlyList<IdentityClaim>?`). Both are init-only; existing
call sites compile, but the YAML round-trip writes the new fields
and the loader sets `IsDeterministic` from the file name.
- Layout extractor's CSS-string applicator path is gone. Templates
emitted before alpha.22 that depend on string-based selectors
rebuild through the identity-claim path on first visit.
- `StreamingTemplate` lost its MinHash signature shape — templates
persisted from alpha.16-alpha.20 re-induce on first visit (the
store's PRAGMA user_version gate drops stale rows).
- Streaming `RollingSketch` / `TagAllowlistBloom` types removed
(alpha.21 deprecated the latter; alpha.24 dropped both with the
byte-pattern matcher).
- `InMemoryTemplateEnrichmentQueue._lastEnqueuedByHost` (private)
changed shape; only matters if you reflected against it.

Tests: 850 across 12 projects, all green.

Migration: most consumers don't need to change anything. The two
patterns that DO need a change are (a) anyone who passed
`IncrementalHtmlTokenizer.MaxBufferSize` to size their own buffer
(use `tok.MaxPartialTagBytes` instead) and (b) anyone who called
`TagAttributeParser.ExtractIdentityHashes` directly (add
`TagAttrLimits.Default` as the second argument).

StyloExtract 1.8.0-alpha.21 - 2026-06-27
=========================================

Streaming: scope fixes (no algorithm replacement)
--------------------------------------------------

Tightens the alpha.19 streaming scanner without replacing the MinHash
matcher. The algorithm shape (MinHash + LSH bands + three fences per
template) is unchanged; what changes is its scope:

1. IncrementalHtmlTokenizer.Feed no longer copies the whole chunk into
  _buffer. Chunks are parsed inline; only the partial-tag tail (if a
  tag straddles a chunk boundary) is retained for stitching with the
  next chunk. PeakBufferedBytes is now bounded by O(longest tag), not
  O(chunk size). Measured: peak = 0 B for a 200 KB body in 16 KB
  chunks, 19 B in 1 KB chunks. MaxBufferSize lowered from 64 KiB to
  4 KiB.

2. RollingSketch shingles upgraded to Markov bigrams: each shingle is
  (prevTagHash, currentTagHash, currentClassHash). Order-sensitive:
  [A, B] and [B, A] now produce different signatures. The leftmost
  shingle in any window uses prevTag = 0 so sliding-window scanners
  match fences built from contiguous event sequences regardless of
  what came before the window.

3. Static StructuralTagAllowlist replaces per-fence TagAllowlistBloom.
  Only structural tags (html/body/header/nav/main/article/section/
  div/p/h1-h6/ul/ol/li/table/...) push into the sketch. meta/link/
  script-chrome/img/span/a bypass the recompute entirely. The
  TagAllowlistBloom JSON property is retained as a back-compat sink
  (read-and-discarded) so persisted templates from alpha.16-alpha.20
  round-trip cleanly.

4. Depth-aware capture-end: while in Capturing, ContentEnd only matches
  when DOM depth has returned to (or below) the depth at ContentStart.
  Nested matches mid-content can no longer terminate capture early.

5. Dead StreamingTemplate.MinContentDepth field removed (never read by
  any scanner).

6. FenceScanner and IncrementalFenceScanner now share a single static
  StreamingTick.Step. Both scanners build a StreamingTickState from
  their respective storage (span-backed vs heap-backed) and execute
  literally the same code. Cross-validation tests retained as insurance.

7. IStreamingTemplateStore gains version-chain APIs:
  - GetByHostAtVersionAsync(host, version) — retrieve a specific version.
  - ListVersionsByHostAsync(host) — enumerate all known versions.
  UpsertAsync now APPENDS per (host, version) rather than replacing.
  SQLite store schema migrated to PK (host, version); existing rows
  auto-migrate to version 1 on first open.
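
To make the Markov-bigram shingle in item 2 concrete, here is a minimal sketch of why it is order-sensitive. The `Shingle` mixer (FNV-1a style) and the sample hash values are hypothetical; the real RollingSketch hashing is not shown in these notes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mixer: fold three 32-bit hashes into one 64-bit shingle value (FNV-1a style).
ulong Shingle(uint prevTag, uint tag, uint cls)
{
    unchecked
    {
        ulong h = 14695981039346656037UL;
        foreach (uint part in new[] { prevTag, tag, cls })
        {
            h ^= part;
            h *= 1099511628211UL;
        }
        return h;
    }
}

// Build shingles for a window; the leftmost shingle uses prevTag = 0.
List<ulong> Shingles(IReadOnlyList<(uint Tag, uint Class)> window)
{
    var result = new List<ulong>(window.Count);
    uint prev = 0;
    foreach (var (tag, cls) in window)
    {
        result.Add(Shingle(prev, tag, cls));
        prev = tag;
    }
    return result;
}

// Two events A = (1, 10) and B = (2, 20), presented in both orders.
var ab = Shingles(new (uint, uint)[] { (1, 10), (2, 20) });
var ba = Shingles(new (uint, uint)[] { (2, 20), (1, 10) });

bool sameSet = new HashSet<ulong>(ab).SetEquals(ba);
Console.WriteLine(sameSet ? "order-insensitive" : "order-sensitive");
```

Because each shingle includes the previous tag hash, swapping two events changes the shingle set, which a unigram (tag, class) shingle would not do.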

Migration notes:
- Persisted SQLite templates from alpha.16-alpha.20 auto-migrate to
version 1 on first open; existing rows are preserved.
- TemplateFence(uint[], ulong[], ulong, int) constructor removed; the
new shape is TemplateFence(uint[], ulong[], int). TagAllowlistBloom
is still readable as a property (returns 0).
- StreamingTemplate.MinContentDepth removed — drop from any code that
set it in `with` expressions.
- RollingSketch.Push signature changed to Push(prevTagHash, tagHash,
classHash) — direct users must track prev tag.

StyloExtract 1.8.0-alpha.19 - 2026-06-26
=========================================

Streaming: sliding-window design (no full-buffer retention)
------------------------------------------------------------

Refactors alpha.18's IncrementalHtmlTokenizer + IncrementalFenceScanner
to a TRUE sliding-window streaming design:

1. Bytes: only partial-tag bytes are retained. Once a tag is emitted,
  the bytes are dropped immediately (compact-on-emit, not compact-on-
  next-Feed). New PeakBufferedBytes property exposes the high-watermark
  for telemetry. Worst-case in-flight buffer is O(longest tag), not
  O(megabytes). MaxBufferSize lowered from 1 MiB to 64 KiB and
  repositioned as a hard safety stop that should never be hit under
  correct input — exceeding it now means a single tag (or unclosed
  script/style body) genuinely exceeds 64 KiB and the scan must bail.

2. Events: fixed-size sliding window of the last WindowSize tag events
  (unchanged from alpha.18). Push new, pop oldest. The window is the
  only event-level state.

3. RollingSketch: documented (in IncrementalFenceScanner XML doc) that
  MinHash with min-pooling is NOT reversibly rollable — once an element
  leaves the window, its contribution to min(...) can't be subtracted.
  The sketch therefore rebuilds the signature from the current event
  window after each accepted tag (O(WindowSize × SignatureSize) per
  tick, gated by the Bloom allowlist filter to skip the vast majority
  of inbound tags). The bounded-buffer property — the user's headline
  concern — is satisfied by the tokenizer; the sketch's per-tick recompute
  is the price MinHash charges for the LSH-band locality the matcher
  relies on. The event-level memory remains O(WindowSize) regardless.

4. IncrementalFenceScanner now exposes PeakBufferedBytes and BytesConsumed
  passthroughs from the tokenizer so callers can prove the bounded-memory
  property to telemetry without reaching into the tokenizer directly.
  The duplicated tick logic (mirroring FenceScanner.Tick over heap-backed
  sketch state) is retained — it's hard-pinned to the ref-struct path by
  the existing cross-validation tests, which give us higher confidence
  than refactoring to delegate would.
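
The per-tick rebuild described in item 3 can be sketched as below. It is an illustration, not the scanner's code: `SlotHash` (a splitmix64-style mixer), `SignatureSize`, `WindowSize`, and the sample shingle values are hypothetical.

```csharp
using System;
using System.Collections.Generic;

const int SignatureSize = 8;   // hypothetical; the real sizes are not stated in these notes
const int WindowSize = 4;

// One cheap hash per signature slot: mix the shingle with a per-slot seed (splitmix64 style).
ulong SlotHash(ulong shingle, int slot)
{
    unchecked
    {
        ulong x = shingle + 0x9E3779B97F4A7C15UL * (ulong)(slot + 1);
        x ^= x >> 30; x *= 0xBF58476D1CE4E5B9UL;
        x ^= x >> 27; x *= 0x94D049BB133111EBUL;
        return x ^ (x >> 31);
    }
}

// Rebuild the whole signature from the current window: O(WindowSize x SignatureSize).
ulong[] Rebuild(IEnumerable<ulong> window)
{
    var sig = new ulong[SignatureSize];
    Array.Fill(sig, ulong.MaxValue);
    foreach (ulong shingle in window)
        for (int slot = 0; slot < SignatureSize; slot++)
            sig[slot] = Math.Min(sig[slot], SlotHash(shingle, slot));
    return sig;
}

var events = new Queue<ulong>();
ulong[] signature = Rebuild(events);
foreach (ulong incoming in new ulong[] { 11, 22, 33, 44, 55, 66 })
{
    events.Enqueue(incoming);
    if (events.Count > WindowSize) events.Dequeue();   // pop the oldest event
    signature = Rebuild(events);                        // min() cannot be un-applied, so recompute
}
Console.WriteLine(string.Join(",", signature));
```

The recompute is needed because a slot's minimum does not record which element produced it; once that element leaves the window, the only safe way to find the new minimum is to look at every remaining element.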

Memory-cap proof: tests/StreamingMemoryBoundTests.cs feeds 5 MiB of
synthetic HTML in 4 KiB chunks and asserts PeakBufferedBytes stays
under 16 KiB. The streaming gateway can now scan multi-megabyte
responses while holding bounded memory.

Migration: API is unchanged from alpha.18 — refactor is internal. The
new PeakBufferedBytes and BytesConsumed diagnostic properties on
IncrementalFenceScanner are additive. MaxBufferSize is still public but
the new value is 64 KiB (was 1 MiB); only relevant if you were catching
the InvalidOperationException for pathological input.

StyloExtract 1.8.0-alpha.18 - 2026-06-26
=========================================

Streaming: true chunked tokenization + refit/versioning + bench update
-----------------------------------------------------------------------

1. IncrementalHtmlTokenizer + IncrementalFenceScanner
  Stateful tokenizer that survives chunk boundaries. A partial tag at
  the end of one chunk is held in an internal buffer and completed when
  the next Feed call arrives. Pairs with IncrementalFenceScanner —
  callers Feed chunks as they arrive from the network, get a verdict
  per chunk, bail early on Captured / Bailout.

  Trade-off vs MinimalHtmlTokenizer's span path: one buffer allocation
  per request (not per chunk). Use the span path for whole-buffer
  scans, the incremental path for streaming gateways where bytes
  arrive in chunks. Hard cap of 1 MiB on the internal buffer: Feed
  throws InvalidOperationException on pathological input that never
  closes a tag, surfacing the failure rather than silently dropping
  bytes.

  Architectural note: FenceScanner stays a ref struct (zero-alloc hot
  path); IncrementalFenceScanner is a heap-backed class that ports the
  same tick logic. The two are kept in lockstep — any drift between
  them is a correctness bug surface and is covered by cross-validation
  tests that feed the same bytes both ways.

2. Streaming-template refit + versioning
  StreamingTemplate gains a Version field (defaults to 1; persists
  across alpha.17 templates without migration). New
  StreamingRefitOrchestrator observes captured-scan output per host
  and kicks off refits off the hot path when either:
    - capture-region EWMA drift exceeds 30% on N consecutive scans, OR
    - every 10th captured scan re-induces and finds different fences
  On refit: version bumps, store is upserted, the new
  IStreamingTemplateVersionSink fires a StreamingTemplateRefitEvent
  (Host, Old/New TemplateId, Old/New Version, Reason, DetectedAt).
  Default sink is a no-op; consumers wire UI telemetry to it.

3. Bench update
  ExtractionComparisonBench gains a New_StreamingScanByHost variant so
  the host-keyed hot-path is benchmarked alongside the original
  GUID-keyed scan. Pre-populates the in-memory store with the
  host="www.mostlylucid.net" template that lucidview FULL hits in
  production.
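
The first refit trigger in item 2 (EWMA drift above 30% on N consecutive scans) could look roughly like the sketch below. Everything here is an assumption made for illustration: the drift metric is taken to be the relative change in captured-region size, and `Alpha`, `ConsecutiveNeeded`, and `Observe` are hypothetical names and values rather than the orchestrator's API.

```csharp
using System;

const double Alpha = 0.2;            // hypothetical smoothing factor
const double DriftThreshold = 0.30;  // 30%, as described above
const int ConsecutiveNeeded = 3;     // hypothetical value for N

double? ewma = null;
int consecutive = 0;

// Feed one captured-region size; returns true when a refit should be scheduled.
bool Observe(double capturedBytes)
{
    if (ewma is null) { ewma = capturedBytes; return false; }

    // Relative distance between this observation and the running average.
    double drift = Math.Abs(capturedBytes - ewma.Value) / Math.Max(ewma.Value, 1.0);
    consecutive = drift > DriftThreshold ? consecutive + 1 : 0;
    ewma = Alpha * capturedBytes + (1 - Alpha) * ewma.Value;

    if (consecutive < ConsecutiveNeeded) return false;
    consecutive = 0;   // reset after triggering
    return true;
}

bool refit = false;
foreach (double size in new double[] { 1000, 1020, 990, 2000, 2100, 2050 })
    refit |= Observe(size);

Console.WriteLine(refit ? "refit scheduled" : "stable");
```

Requiring several consecutive drifting scans keeps a single unusually long or short page from bumping the template version.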

Migration: additive APIs. Alpha.17 consumers using ScanByHost continue
to work; the incremental tokenizer and the refit orchestrator are
opt-in (use them when feeding chunks / when wiring drift telemetry).

StyloExtract 1.8.0-alpha.17 - 2026-06-26
=========================================

Streaming: host-keyed templates + naive auto-induction
-------------------------------------------------------

Three changes to close the alpha.16 streaming integration loop:

1. Host-keyed lookup
  IStreamingTemplateStore gains GetByHostAsync / TryGetHotByHost /
  UpsertAsync. StreamingTemplate gains a Host field (required). One
  template per host (latest wins). The existing GUID-keyed methods
  remain — Host is the lookup key for consumers; TemplateId stays for
  stable identity / versioning.

2. StreamingPathSelector.ScanByHost(host, bytes)
  Synchronous hot-cache-only host scan. Returns NoTemplate on miss
  so the caller can WarmByHostAsync + retry or induce.
  WarmByHostAsync brings a host's template into the hot cache via
  the durable tier.

3. StreamingTemplateInducer
  Naive first-pass inducer: walks HTML via MinimalHtmlTokenizer,
  finds semantic-marker tag-sequence-pairs (<header>...</header>,
  <p>...</p>...<p>...</p>, <footer>/</main>/</body>) and produces a
  StreamingTemplate ready to upsert. Returns null on pages with no
  identifiable structural fences (plain text, image-only, etc.).
  Describe() returns a human-readable summary of the chosen markers
  for logging.

Storage migrations:
- InMemoryStreamingTemplateStore: adds an in-memory host index.
- SqliteStreamingTemplateStore: adds a 'host' TEXT column + index;
on-open ALTER TABLE migration handles pre-alpha.17 schemas
(existing rows get Host="" — reachable only by GUID).

Migration: additive APIs; alpha.16 consumers using only the
GUID-keyed surface continue to work unchanged. The new Host field
on StreamingTemplate IS required — existing construction sites must
set Host="" if they have no host context.

StyloExtract 1.8.0-alpha.16 - 2026-06-26
=========================================

Mostlylucid.StyloExtract.Streaming — zero-allocation byte-stream scanner
------------------------------------------------------------------------

New package on NuGet. Hot-path streaming fence scanner: skips page chrome
and captures the content region as response bytes flow past, using
MinHash-derived structural fences. Zero per-request GC-tracked
allocations in steady state.

Designed for the gateway position — drop into a response pipeline
(HttpClient, Stylobot's edge, ASP.NET output filters) alongside the byte
stream and emit a verdict without buffering the full page.

Public hot-path API:
StreamingPathSelector.Scan(Guid templateId, ReadOnlySpan<byte> html)
   → ScanVerdict { Continue | Captured | Bailout | NoTemplate }

// Warm a template into the hot cache:
await selector.WarmAsync(templateId);

Storage:
- InMemoryStreamingTemplateStore — single-process LRU.
- SqliteStreamingTemplateStore — durable; same SQLite file pattern as
   the existing ITemplateIndex but a separate table.

Pairs with the existing StyloExtract.Fingerprint learn path and
ITemplateIndex template store. The streaming template format is its own
shape (TemplateFence with MinHash bloom, content-start/content-end
fences) — not an LLM template or operator template.

Bench results vs LayoutExtractor on mostlylucid fixtures: see
bench/StyloExtract.Streaming.Benchmarks/ (zero-alloc scan competitive
with the full extractor's path-match cost while never building a DOM).

Migration: additive package; consumers add a PackageReference to
Mostlylucid.StyloExtract.Streaming if they want gateway-position
scanning.

StyloExtract 1.8.0-alpha.15 - 2026-06-26
=========================================

RenderOptions.WaitUntil — opt out of NetworkIdle for SPA routing
-----------------------------------------------------------------

PlaywrightHtmlFetcher previously hardcoded WaitUntilState.NetworkIdle
for the primary GotoAsync. On sites with aggressive client-side
routing (BBC News auto-navigates /news → /articles/<id> in the
post-load JS phase), this meant the fetcher returned the post-routing
DOM, not the page the user requested.

RenderOptions now exposes a WaitUntil property (PlaywrightWaitUntil
enum: Load / DOMContentLoaded / NetworkIdle / Commit). Default stays
NetworkIdle for backwards compatibility. Consumers fetching SPA-heavy
sites should set Load to capture the initial DOM before the router
fires.

The secondary WaitForLoadStateAsync(NetworkIdle, ...) drain remains —
it's independently bounded by WaitForNetworkIdleTimeout and serves as
a best-effort late-XHR catch-up; safe even with the primary returning
on Load.

PlaywrightWaitUntil is a small enum (rather than a direct use of
Microsoft.Playwright.WaitUntilState) so consumers don't take a
transitive dependency on Microsoft.Playwright just to pick a strategy.

StyloExtract 1.8.0-alpha.14 - 2026-06-26
=========================================

Sitemap CLI end-to-end regression + LLM nav few-shot
-----------------------------------------------------

1. Sitemap CLI test suite

The alpha.11 stylo-extract sitemap verb has been working on real sites
since alpha.13 (heuristic nav-classification tightening), but nothing
caught regressions. Added 5 end-to-end tests in
StyloExtract.Core.Tests/SitemapCommandTests.cs that invoke the
SitemapCommand.CrawlAsync handler against the mostlylucid-home.html.gz
fixture (real captured homepage, shared with the heuristics suite) plus
a stub HttpMessageHandler and assert: real nav links emitted under
# www.mostlylucid.net, --max-depth 0 emits only the seed Title row,
off-host links are not followed, --max-pages cap honoured exactly, and
--delay-ms enforced with a stopwatch floor. No network access required.

2. LLM induction prompt — nav-classification few-shot

LlmInducerPrompts.System and SystemRepair now include a second worked
example: a blog homepage with header <nav>, breadcrumb,
MainContent + RepeatedItem post cards, and footer <nav>. Mirrors the
patterns the alpha.13 NavPreDetector heuristic correctly classifies.
Rule 6 (RepeatedItem usage) tightened with explicit guidance that
header/footer nav lists are PrimaryNavigation / SecondaryNavigation at
the parent <ul>/<nav> level, NOT RepeatedItem at the <li> level —
closes a known LLM confusion mode.

Tests: snapshot tests in StyloExtract.Core.Tests/LlmInducerPromptsTests.cs
verify the prompt extensions land verbatim so future prompt edits don't
accidentally regress.

StyloExtract 1.8.0-alpha.13 - 2026-06-26
=========================================

Heuristic nav-classification tightening
----------------------------------------

HeuristicBlockClassifier was under-classifying real-world nav patterns
on server-rendered sites — header <nav> strips, header <ul>-of-links,
breadcrumb lists, role="navigation" attributes, footer nav — all landed
as Boilerplate (or weren't extracted at all). Result: the alpha.11
Sitemap profile and stylo-extract sitemap CLI verb produced a one-line
tree even on sites with rich nav, because the classifier didn't surface
PrimaryNavigation / SecondaryNavigation / Breadcrumb roles for them.

Tightened patterns now produce definite role classifications:
1. <header> <nav> -> PrimaryNavigation (0.9)
2. Top-of-document <nav> -> PrimaryNavigation (0.85)
3. <footer> <nav> -> SecondaryNavigation (0.9)
4. <nav aria-label="breadcrumb"> / class~="breadcrumb" -> Breadcrumb (0.95)
5. <* role="navigation"> -> PrimaryNavigation (0.95)
6. Header <ul> of mostly-link <li>s -> PrimaryNavigation (0.85) at
    the <ul> level, suppress descent (was emitting deep Boilerplate)
7. Footer <ul> of mostly-link <li>s -> SecondaryNavigation (0.85)

Implementation: a new NavPreDetector runs after per-element classification
and injects each detected nav container as a high-score (50000) candidate
at the parent level, then demotes any descendant candidates so greedy
selection picks the nav parent and stops descending into its noise.
Containers nested inside <main>/<article> are skipped — IntraBlockCleaner
already strips them as intra-block contaminants; hoisting would steal
the article's selection win.

Regression fixtures captured from mostlylucid.net + wikipedia.org under
tests/StyloExtract.Heuristics.Tests/Fixtures so the next time a
classifier change regresses real-world nav detection, the bench catches
it before it ships.

Downstream impact: the Sitemap ExtractionProfile and stylo-extract
sitemap CLI verb now produce real nav trees on these sites - see the
lucidview FULL dogfood smoke for evidence.

StyloExtract 1.8.0-alpha.12 - 2026-06-26
=========================================

DI wire-up fix for deterministic-template YAML persistence
-----------------------------------------------------------

alpha.11 introduced DeterministicTemplateYamlSink + the
AddStyloExtractOperatorTemplates registration, but AddStyloExtract's
LayoutExtractor construction did not pass the sink through to the
extractor — so even when the sink was registered in DI, LayoutExtractor's
optional ctor parameter defaulted to null and no `<host>-deterministic.yaml`
file was ever written.

Fixed by threading `sp.GetService<DeterministicTemplateYamlSink>()` to the
LayoutExtractor constructor in AddStyloExtract. No API change; consumers
who already called AddStyloExtractOperatorTemplates start seeing
deterministic YAML files immediately after upgrading.

StyloExtract 1.8.0-alpha.11 - 2026-06-26
=========================================

Sequenced architecture extension: deterministic templates with
extended classification — Title role, Sitemap profile, deterministic
YAML persistence, and a sitemap CLI verb.

Title BlockRole
---------------

New BlockRole.Title value distinguishes the page-level <h1> (the single
H1 the rest of the page is "about") from intra-content Heading
(H2/H3/H4 inside the body). HeuristicBlockClassifier surfaces the Title
via a shared PageTitleDetector helper, picking the H1 in/closest-to
<main>/<article> and falling back to earliest-in-document with multiple
H1s. ExtractorApplicator surfaces Title on the fast-path / applicator
branch too, so output stays consistent across novel and cached requests
(matters for the response-cache ETag). LlmInducerPrompts lists Title in
the allowed-roles set with a one-line distinction from Heading.

MainContentOnly, RagFull, Wcxb, and AgentNavigation profiles all
include Title in their role-set. The renderer quality gate (drop short
text) is bypassed for Title and Heading so intentionally-terse page
titles ("Home", "About") still surface.

Sitemap ExtractionProfile
-------------------------

New ExtractionProfile.Sitemap value emits only Title + Heading +
PrimaryNavigation + SecondaryNavigation + Breadcrumb. For sitemap /
outline / crawler use cases that want page titles and the site's nav
structure without pulling body content. The CLI's --profile flag
recognises `sitemap` automatically (enum binding).

Deterministic YAML persistence
------------------------------

New DeterministicTemplateYamlSink, wired automatically when
AddStyloExtractOperatorTemplates(root) is called, writes
<host>-deterministic.yaml alongside each heuristic-induced template's
SQLite row. The file carries every role the heuristic detected (Title,
MainContent, Navigation, Footer, …) — auditable, hand-editable, and
diffable, mirroring how LLM-induced templates have always been written
by TemplateEnrichmentCoordinator. The SQLite store remains the
authoritative source at match time; YAML is best-effort and
non-blocking.

stylo-extract sitemap CLI verb
------------------------------

New `sitemap` subcommand: takes one or more starting URLs, extracts
each with ExtractionProfile.Sitemap, follows internal nav links to
--max-depth (default 3), and emits a markdown tree of titles + URLs to
stdout or --out <file>. Safety caps: 50 pages by default
(--max-pages), 1s between requests (--delay-ms), no off-host follow.

Migration
---------

No source change required for consumers. The new Title role is
additive: existing switches over BlockRole that rely on a default arm
continue to compile and behave identically, and switches that
exhaustively listed roles were updated. Deterministic YAML
writing only activates when AddStyloExtractOperatorTemplates is
called, so consumers that don't use operator templates see no new
filesystem activity.

StyloExtract 1.8.0-alpha.10 - 2026-06-26
=========================================

LLM classification accuracy for chrome patterns
------------------------------------------------

Symptom: induced templates were labelling language pickers, filter UI,
locale switchers, and pagination strips as MainContent on
server-rendered blogs (mostlylucid.net being the canonical reproducer).
The downstream RagFull renderer's role-filter — which already drops
PrimaryNavigation / SecondaryNavigation / Form / Boilerplate — never
saw them as nav and so left them in the extracted markdown, producing
output WORSE than the deterministic heuristic.

Fix: expanded the induction and repair system prompts with explicit
"chrome pattern → role" examples (language picker → PrimaryNavigation;
filter / fac

[truncated — see RELEASE_NOTES.txt packaged at root for full history]