Mostlylucid.StyloExtract.Templates.Postgres
2.0.0
PostgreSQL-backed template index for StyloExtract. Implements the same ITemplateIndex contract as Mostlylucid.StyloExtract.Templates (the SQLite provider); swap providers via DI with no change to calling code.
When to use this instead of SQLite
Choose the Postgres provider when:
- Your deployment already runs PostgreSQL as its operational database (StyloBot commercial, multi-tenant SaaS)
- You need multiple extraction nodes sharing one template store (Npgsql pools connections; Postgres serialises concurrent writes natively)
- You plan to add pgvector cosine-similarity search in a future upgrade (the schema is forward-compatible)
The SQLite provider (Mostlylucid.StyloExtract.Templates) is the right choice for single-host or air-gapped deployments, CLI tools, and anywhere you want zero external dependencies.
Installation
dotnet add package Mostlylucid.StyloExtract.Templates.Postgres
Usage
// Register the Postgres provider. Call this instead of (or after) AddStyloExtract()
// to replace the SQLite ITemplateIndex with the Postgres one.
services.AddStyloExtractPostgres(o =>
    o.ConnectionString = "Host=localhost;Port=5432;Database=styloextract;Username=se;Password=secret");

// Optional: register drift-triggered refit support (mirrors RefitOrchestrator for SQLite).
services.AddStyloExtractPostgresRefit(
    driftRefitThreshold: 0.35,
    observationsBeforeStable: 5,
    versionHistoryDepth: 3);
Schema is applied idempotently on the first operation (CREATE TABLE IF NOT EXISTS). No migration tool required.
Storage model
| Table | Contents |
|---|---|
| templates | Template id (bytea), host hash, fingerprint, extractor JSON blob, version, observation count |
| template_lsh_band_index | LSH bucket rows for fast-path lookup |
| template_observations | Per-request observation vectors (bounded to last 100 per template) |
| template_version_history | Past extractor versions retained for diff generation |
Columns that are BLOB in SQLite are bytea in Postgres. Timestamps are bigint Unix milliseconds. No pgvector dependency in v1; vector similarity uses the same CPU-side cosine math as the SQLite provider.
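As a rough illustration of what "CPU-side cosine math" means here, the sketch below computes cosine similarity over two in-memory vectors. The function name, the `float` element type, and the zero-vector handling are assumptions made for the example; the package's actual implementation is not shown in this README.

```csharp
using System;

// Cosine similarity between two equal-length vectors, computed on the CPU.
// Returns 0 when either vector has zero magnitude so callers never see NaN.
// (Hypothetical helper: the name and signature are not from the package.)
static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    if (normA == 0 || normB == 0) return 0;
    return dot / Math.Sqrt(normA * normB);
}

float[] stored = { 1f, 0f, 1f };
float[] query = { 1f, 0f, 1f };
Console.WriteLine(CosineSimilarity(stored, query)); // 1
```

Because the math runs in process, the same comparison works whether the vectors were read from a SQLite BLOB or a Postgres bytea column.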
AOT
This package sets IsAotCompatible=false because Npgsql requires runtime reflection for connection-string parsing. It will not break AOT builds in packages that do not reference it (sibling packages such as StyloExtract.Playwright remain AOT-safe).
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net10.0
  - Mostlylucid.StyloExtract.Abstractions (>= 2.0.0)
  - Mostlylucid.StyloExtract.Fingerprint (>= 2.0.0)
  - Mostlylucid.StyloExtract.Templates (>= 2.0.0)
  - Npgsql (>= 10.0.3)
| Version | Downloads | Last Updated |
|---|---|---|
| 2.0.0 | 0 | 6/28/2026 |
| 1.8.0 | 47 | 6/27/2026 |
| 1.8.0-alpha.23 | 35 | 6/27/2026 |
| 1.8.0-alpha.22 | 40 | 6/27/2026 |
| 1.8.0-alpha.21 | 39 | 6/27/2026 |
| 1.8.0-alpha.20 | 50 | 6/27/2026 |
| 1.8.0-alpha.19 | 44 | 6/26/2026 |
| 1.8.0-alpha.18 | 59 | 6/26/2026 |
| 1.8.0-alpha.17 | 68 | 6/26/2026 |
| 1.8.0-alpha.16 | 73 | 6/26/2026 |
| 1.8.0-alpha.15 | 49 | 6/26/2026 |
| 1.8.0-alpha.14 | 39 | 6/26/2026 |
| 1.8.0-alpha.13 | 37 | 6/26/2026 |
| 1.8.0-alpha.12 | 36 | 6/26/2026 |
| 1.8.0-alpha.11 | 38 | 6/26/2026 |
| 1.8.0-alpha.10 | 45 | 6/26/2026 |
| 1.8.0-alpha.9 | 44 | 6/25/2026 |
| 1.8.0-alpha.8 | 48 | 6/25/2026 |
| 1.8.0-alpha.4 | 46 | 6/25/2026 |
| 1.8.0-alpha.3 | 51 | 6/25/2026 |
StyloExtract 2.0.0 - 2026-06-28
================================
First stable release. Closes Phase 1 + Phase 2 of the identity-claim
rework that ran across alpha.22, alpha.23, and the in-flight code that
was never tagged. Stable means the v2 API contracts (IdentityClaim, the
streaming options, the operator-template shape with Claims, the
apply-time quality gate) are now things consumers can build on.
What's new since 1.8.0-alpha.21
-------------------------------
Identity-claim primitive (Phase 1)
- New `IdentityClaim` type — outermost-first ancestor chain of
(tag, id, classes, data-* / aria-* / role) entries, anchoring every
selector by stable identity rather than by CSS string.
- `DefaultClassStabilityFilter` rejects hash-shaped class tokens
(Tailwind JIT names, CSS-module hashes, build-time churn) so that
emitted claims survive across visits.
- Inducer is identity-aware end-to-end: cardinality-aware uniqueness
for repeated roles, narrow tripwires for the streaming side, no
CSS-string emission anywhere on the apply path.
- Layout extractor's apply path runs on `IdentityClaimApplicator`;
the old CSS-string applicator is gone.
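
To make the "hash-shaped class token" idea concrete, here is a deliberately simplified sketch of a stability check. The rule used (reject tokens containing `[`, or whose last `-`/`_` segment is at least 5 characters and mixes letters with digits) is an assumption invented for this example; it is not the rule set that `DefaultClassStabilityFilter` ships with.

```csharp
using System;
using System.Linq;

// Hypothetical heuristic, NOT the library's actual filter:
// a class token is treated as unstable when it looks machine-generated.
static bool LooksStable(string classToken)
{
    if (string.IsNullOrWhiteSpace(classToken)) return false;

    // Tailwind arbitrary-value utilities such as w-[137px] churn per design tweak.
    if (classToken.Contains('[')) return false;

    // CSS-module / CSS-in-JS names usually end in a mixed letter+digit suffix.
    var last = classToken.Split('-', '_').Last();
    var hashShaped = last.Length >= 5 && last.Any(char.IsDigit) && last.Any(char.IsLetter);
    return !hashShaped;
}

string[] tokens = { "nav-primary", "css-1a2b3c", "w-[137px]", "col-md-6" };
Console.WriteLine(string.Join(", ", tokens.Where(LooksStable))); // nav-primary, col-md-6
```

A real filter would need to be more careful about false positives (a human-written class such as `promo2024` would be rejected by this sketch), which is presumably why the filter is a named, replaceable type.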
Streaming gateway: exact tripwire matching + bounded memory
- The streaming scanner shifted from MinHash + LSH bands to exact
`IdentityClaim` matching against the per-event hash data the
tokenizer carries on each `TagEvent` (tag-name + id + per-class
+ per-data-attr + per-aria + role hashes). The matcher walks the
claim's required hashes linearly against the event's hash arrays;
no per-tick MinHash recompute, no sliding window.
- `StreamingTokenizerOptions` replaces the hard `MaxBufferSize`
consts on `IncrementalHtmlTokenizer` and
`IncrementalBytePatternScanner`. Both buffers are now rented from
`ArrayPool<byte>.Shared` and grow on demand up to the configurable
ceiling (default 1 MiB per buffer). Both classes are `IDisposable`.
- `TagAttrLimits` replaces the per-event `TagEvent.MaxClassesPerEvent`
(was 8) and `TagEvent.MaxAttrPairsPerEvent` (was 3). Defaults
bumped to 32 / 16, validated up to 256 / 128 ceilings. Real pages
no longer silently lose the tail.
- Streaming-template inducer rewrite (Task 4 of Phase 1) — emits
`IdentityClaim`-based tripwires shared with the layout side.
- Incremental byte-pattern scanner (Task 13) replaces the alpha.21
tripwire scanner with a faster exact-match path; tag-hash prefilter
cuts per-scan allocs ~25-30x.
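
The linear walk described in the first bullet can be sketched as a nested scan over small hash arrays. The `uint` hash width, the function names, and the restriction to tag plus class hashes are assumptions for illustration; the real matcher also covers id, data-attr, aria, and role hashes.

```csharp
using System;

// Returns true when every hash the claim requires appears in the event's array.
// A plain nested linear scan is enough because the arrays are small
// (bounded by the per-event caps).
static bool ContainsAll(ReadOnlySpan<uint> required, ReadOnlySpan<uint> present)
{
    foreach (var want in required)
    {
        var found = false;
        foreach (var have in present)
        {
            if (have == want) { found = true; break; }
        }
        if (!found) return false;
    }
    return true;
}

// Hypothetical shape: a claim matches when the tag hash is equal and all of
// its required class hashes are present on the event.
static bool ClaimMatches(uint claimTag, ReadOnlySpan<uint> claimClassHashes,
                         uint eventTag, ReadOnlySpan<uint> eventClassHashes) =>
    claimTag == eventTag && ContainsAll(claimClassHashes, eventClassHashes);

uint[] wanted = { 0xA1u, 0xB2u };
uint[] seen = { 0xC3u, 0xB2u, 0xA1u };
Console.WriteLine(ClaimMatches(0x10u, wanted, 0x10u, seen)); // True
```

Nothing here depends on earlier events, which is the property that removes the per-tick MinHash recompute and the sliding window.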
Apply-time quality gate + auto-repair loop
- New `ApplicatorBrokenCheck` lifts the apply-time bug-out signal
out of LayoutExtractor's local function into a unit-testable gate.
Three new failure modes: noisy-MainContent (link-density >= 0.5
inside a content block, catches the Wikipedia / mostlylucid
language-picker leak), image-anchor picker (many short-text
anchors, catches the route-variant strip), metadata-shape
rejection (key:value-dominated blocks, catches the MS Learn YAML
frontmatter leak).
- LayoutExtractor Move 3 widens the repair-enqueue gate: drops the
"hand-authored template must exist" requirement, triggers on
applicatorBugOut OR thin-markdown, adds Refit to the qualifying
match-status set.
- `IsDeterministic` flag on `OperatorTemplate` distinguishes the
heuristic inducer's deterministic YAML audit snapshots from
hand-authored / LLM-induced templates. Deterministic snapshots no
longer block LLM induction.
- `OperatorTemplateRule.Claims` carries the identity-claim ancestor
chain on operator templates so the operator-template path runs on
the identity-claim applicator instead of the CSS-string fallback.
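
As a worked example of the noisy-MainContent threshold, the sketch below treats link density as anchor-text characters divided by total text characters. That definition and both function names are assumptions for illustration; the notes only state the 0.5 cutoff.

```csharp
using System;

// Assumed definition: share of a block's text that sits inside <a> elements.
static double LinkDensity(int anchorTextLength, int totalTextLength) =>
    totalTextLength <= 0 ? 0 : (double)anchorTextLength / totalTextLength;

// The gate from the notes: a content block with link density >= 0.5 is noisy.
static bool IsNoisyMainContent(int anchorTextLength, int totalTextLength) =>
    LinkDensity(anchorTextLength, totalTextLength) >= 0.5;

// A language-picker strip: 120 of 200 characters are link text (density 0.6).
Console.WriteLine(IsNoisyMainContent(120, 200)); // True

// An article body: 40 of 800 characters are link text (density 0.05).
Console.WriteLine(IsNoisyMainContent(40, 800));  // False
```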
Heuristic block classifier improvements
- Tighten-on-anchor (Move 1) — after a `<main>`/`<article>` qualifies
as MainContent, look down one level for a div/section descendant
with a stable identity anchor (stable id OR >= 1 stable class) that
carries >= 80% of the wrapper's prose text and has link density
< 0.5. When exactly one descendant qualifies, prefer it. Catches
Wikipedia + mostlylucid leaks where the picker rides inside the
outer semantic element.
- `<article>` semantic-tag exception in repeated-item link-density
gate — news-listing pattern where each card is a single clickable
`<a>` (density ~1.0) now survives the gate. The Register, Verge,
Ars, BBC News listings render again.
Template enrichment coordinator
- `InMemoryTemplateEnrichmentQueue` cooldown key changed from string
Host to (Host, EnrichmentJobKind) tuple. A first-visit Induce no
longer blocks a follow-up Repair on the same host.
- New `ILlmActivityObserver` interface brackets each LLM call with
LlmCallStarted / LlmCallEnded(success). Wired through the DI
builder so consumers can show "llm <host>..." while CPU inference
is running (the lucidVIEW FULL status bar uses this).
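
The effect of keying the cooldown by (Host, EnrichmentJobKind) rather than by host alone can be sketched with a tuple-keyed dictionary. The 5-minute cooldown, the string stand-in for `EnrichmentJobKind`, and the `TryEnqueue` name are all assumptions for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cooldown length; the notes do not state the real value.
var cooldown = TimeSpan.FromMinutes(5);

// Key is (host, job kind), so each kind of job cools down independently.
var lastEnqueued = new Dictionary<(string Host, string Kind), DateTimeOffset>();

bool TryEnqueue(string host, string kind, DateTimeOffset now)
{
    var key = (host, kind);
    if (lastEnqueued.TryGetValue(key, out var last) && now - last < cooldown)
        return false; // same host AND same kind is still cooling down

    lastEnqueued[key] = now;
    return true;
}

var t0 = DateTimeOffset.UtcNow;
Console.WriteLine(TryEnqueue("example.com", "Induce", t0));               // True
Console.WriteLine(TryEnqueue("example.com", "Repair", t0));               // True
Console.WriteLine(TryEnqueue("example.com", "Induce", t0.AddMinutes(1))); // False
```

With a host-only key, the second call would have been rejected, which is the behaviour the change removes.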
Corpus mining (Phase 2)
- `SelectorDistance` metric quantifies how similar two emitted
selectors are for evolved-candidate ranking (Task 6).
- `CorpusMiner` query primitives (Task 7) and evolved-selector
emission (Task 8) — proposes alternate selectors from the
template_observations table.
- Passive evaluation of evolved candidates at apply time (Task 9):
evolved selectors run alongside the chosen one and contribute
observations for the next mining cycle.
- Background `CorpusMiningCoordinator` (Task 10) drains the
template_observations table on a cadence and writes evolved
candidates.
- `template_observations` SQLite table (Task 5 of Phase 1) feeds
the mining and evolved-candidate paths.
Cold-path arbitrary caps (now configurable or bumped)
- `NextDataRehydrationExtractor` walker bumped from 500 strings /
depth 12 to 5000 / depth 32 — real Next.js __NEXT_DATA__ blobs
exceeded the old guards.
- LayoutExtractor's LLM-repair sample bumped 400 -> 2000 chars
(more context for the LLM to see what's wrong).
- Skeleton renderer's attr-value truncation bumped 40 -> 160 chars
(covers accessibility-conscious aria-label values).
- Streaming `IncrementalHtmlTokenizer.MaxBufferSize` (was 16 KiB
const that threw on JSON-LD blobs) replaced by
`StreamingTokenizerOptions.MaxPartialTagBytes` (1 MiB default).
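
The move from a fixed-size const to a pooled buffer with a configurable ceiling can be pictured roughly as follows. The `Grow` helper, the doubling policy, and throwing `InvalidOperationException` at the ceiling are assumptions for this sketch; only `ArrayPool<byte>.Shared` and the 1 MiB default come from the notes.

```csharp
using System;
using System.Buffers;

// Hypothetical growth helper: rents a larger buffer, copies the bytes in use,
// and returns the old buffer to the pool. Refuses to grow past the ceiling.
static byte[] Grow(byte[] current, int used, int needed, int maxBytes)
{
    if (needed > maxBytes)
        throw new InvalidOperationException($"Partial tag exceeds the {maxBytes}-byte ceiling.");

    var newSize = Math.Min(Math.Max(needed, current.Length * 2), maxBytes);
    var bigger = ArrayPool<byte>.Shared.Rent(newSize);
    current.AsSpan(0, used).CopyTo(bigger);
    ArrayPool<byte>.Shared.Return(current);
    return bigger;
}

const int maxPartialTagBytes = 1024 * 1024; // 1 MiB default named in the notes
var buffer = ArrayPool<byte>.Shared.Rent(4096);
var used = 0;
try
{
    var chunk = new byte[10_000]; // e.g. a large JSON-LD blob inside one tag
    if (used + chunk.Length > buffer.Length)
        buffer = Grow(buffer, used, used + chunk.Length, maxPartialTagBytes);

    chunk.AsSpan().CopyTo(buffer.AsSpan(used));
    used += chunk.Length;
}
finally
{
    // Rented buffers must go back to the pool, which is why the owning
    // classes implement IDisposable.
    ArrayPool<byte>.Shared.Return(buffer);
}
```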
Breaking changes you need to know about
---------------------------------------
- `IncrementalHtmlTokenizer.MaxBufferSize` and
`IncrementalBytePatternScanner.MaxBufferSize` public consts removed.
Replaced by per-instance configuration via
`StreamingTokenizerOptions`. The instances are now `IDisposable`;
long-lived consumers should wrap in `using`.
- `TagEvent.MaxClassesPerEvent` and `TagEvent.MaxAttrPairsPerEvent`
internal consts removed. Caps thread through `TagAttrLimits`,
configured from `StreamingTokenizerOptions`. Defaults bumped
(8 -> 32 / 3 -> 16) so existing code that didn't override the cap
sees the same or wider coverage.
- `TagAttributeParser.ExtractIdentityHashes` now takes a
`TagAttrLimits` parameter before the `out` arguments. Update
callers; pass `TagAttrLimits.Default` to keep the new defaults.
- `MinimalHtmlTokenizer` has a new `(input, filter, attrLimits)`
constructor; the existing two-arg constructor delegates to
`TagAttrLimits.Default`.
- `OperatorTemplate` gained `IsDeterministic` (bool) and
`OperatorTemplateRule` gained `Claims`
(`IReadOnlyList<IdentityClaim>?`). Both are init-only; existing
call sites compile, but the YAML round-trip writes the new fields
and the loader sets `IsDeterministic` from the file name.
- Layout extractor's CSS-string applicator path is gone. Templates
emitted before alpha.22 that depend on string-based selectors
rebuild through the identity-claim path on first visit.
- `StreamingTemplate` lost its MinHash signature shape — templates
persisted from alpha.16-alpha.20 re-induce on first visit (the
store's PRAGMA user_version gate drops stale rows).
- Streaming `RollingSketch` / `TagAllowlistBloom` types removed
(alpha.21 deprecated the latter; alpha.24 dropped both with the
byte-pattern matcher).
- `InMemoryTemplateEnrichmentQueue._lastEnqueuedByHost` (private)
changed shape; only matters if you reflected against it.
Tests: 850 across 12 projects, all green.
Migration: most consumers don't need to change anything. The two
patterns that DO need a change are (a) anyone who passed
`IncrementalHtmlTokenizer.MaxBufferSize` to size their own buffer
(use `tok.MaxPartialTagBytes` instead) and (b) anyone who called
`TagAttributeParser.ExtractIdentityHashes` directly (add
`TagAttrLimits.Default` as the second argument).
StyloExtract 1.8.0-alpha.21 - 2026-06-27
=========================================
Streaming: scope fixes (no algorithm replacement)
--------------------------------------------------
Tightens the alpha.19 streaming scanner without replacing the MinHash
matcher. The algorithm shape (MinHash + LSH bands + three fences per
template) is unchanged; what changes is its scope:
1. IncrementalHtmlTokenizer.Feed no longer copies the whole chunk into
_buffer. Chunks are parsed inline; only the partial-tag tail (if a
tag straddles a chunk boundary) is retained for stitching with the
next chunk. PeakBufferedBytes is now bounded by O(longest tag), not
O(chunk size). Measured: peak = 0 B for a 200 KB body in 16 KB
chunks, 19 B in 1 KB chunks. MaxBufferSize lowered from 64 KiB to
4 KiB.
2. RollingSketch shingles upgraded to Markov bigrams: each shingle is
(prevTagHash, currentTagHash, currentClassHash). Order-sensitive:
[A, B] and [B, A] now produce different signatures. The leftmost
shingle in any window uses prevTag = 0 so sliding-window scanners
match fences built from contiguous event sequences regardless of
what came before the window.
3. Static StructuralTagAllowlist replaces per-fence TagAllowlistBloom.
Only structural tags (html/body/header/nav/main/article/section/
div/p/h1-h6/ul/ol/li/table/...) push into the sketch. meta/link/
script-chrome/img/span/a bypass the recompute entirely. The
TagAllowlistBloom JSON property is retained as a back-compat sink
(read-and-discarded) so persisted templates from alpha.16-alpha.20
round-trip cleanly.
4. Depth-aware capture-end: while in Capturing, ContentEnd only matches
when DOM depth has returned to (or below) the depth at ContentStart.
Nested matches mid-content can no longer terminate capture early.
5. Dead StreamingTemplate.MinContentDepth field removed (never read by
any scanner).
6. FenceScanner and IncrementalFenceScanner now share a single static
StreamingTick.Step. Both scanners build a StreamingTickState from
their respective storage (span-backed vs heap-backed) and execute
literally the same code. Cross-validation tests retained as insurance.
7. IStreamingTemplateStore gains version-chain APIs:
- GetByHostAtVersionAsync(host, version) — retrieve a specific version.
- ListVersionsByHostAsync(host) — enumerate all known versions.
UpsertAsync now APPENDS per (host, version) rather than replacing.
SQLite store schema migrated to PK (host, version); existing rows
auto-migrate to version 1 on first open.
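
The order sensitivity described in item 2 can be demonstrated with a small sketch. The FNV-1a-style mixing function and the helper names are illustrative choices, not the hash RollingSketch actually uses; the (prevTagHash, currentTagHash, currentClassHash) shape and the prevTag = 0 rule for the leftmost shingle follow the notes.

```csharp
using System;
using System.Collections.Generic;

// One Markov-bigram shingle: mixes (previous tag, current tag, current class).
// The mixing function is an illustrative FNV-1a-style hash.
static ulong Shingle(uint prevTagHash, uint tagHash, uint classHash)
{
    unchecked
    {
        const ulong offset = 14695981039346656037UL;
        const ulong prime = 1099511628211UL;
        ulong h = offset;
        h = (h ^ prevTagHash) * prime;
        h = (h ^ tagHash) * prime;
        h = (h ^ classHash) * prime;
        return h;
    }
}

// Shingles for one window. The leftmost event uses prevTag = 0, so the result
// does not depend on whatever preceded the window.
static HashSet<ulong> WindowShingles(IReadOnlyList<(uint Tag, uint Class)> window)
{
    var set = new HashSet<ulong>();
    uint prev = 0;
    foreach (var (tag, cls) in window)
    {
        set.Add(Shingle(prev, tag, cls));
        prev = tag;
    }
    return set;
}

const uint A = 1, B = 2;
var ab = WindowShingles(new[] { (A, 0u), (B, 0u) });
var ba = WindowShingles(new[] { (B, 0u), (A, 0u) });
Console.WriteLine(ab.SetEquals(ba)); // False: [A, B] and [B, A] differ
```

With unigram shingles (tag and class only) both windows would produce the same set, which is the ambiguity the bigram upgrade removes.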
Migration notes:
- Persisted SQLite templates from alpha.16-alpha.20 auto-migrate to
version 1 on first open; existing rows are preserved.
- TemplateFence(uint[], ulong[], ulong, int) constructor removed; the
new shape is TemplateFence(uint[], ulong[], int). TagAllowlistBloom
is still readable as a property (returns 0).
- StreamingTemplate.MinContentDepth removed — drop from any code that
set it in `with` expressions.
- RollingSketch.Push signature changed to Push(prevTagHash, tagHash,
classHash) — direct users must track prev tag.
StyloExtract 1.8.0-alpha.19 - 2026-06-26
=========================================
Streaming: sliding-window design (no full-buffer retention)
------------------------------------------------------------
Refactors alpha.18's IncrementalHtmlTokenizer + IncrementalFenceScanner
to a TRUE sliding-window streaming design:
1. Bytes: only partial-tag bytes are retained. Once a tag is emitted,
the bytes are dropped immediately (compact-on-emit, not compact-on-
next-Feed). New PeakBufferedBytes property exposes the high-watermark
for telemetry. Worst-case in-flight buffer is O(longest tag), not
O(megabytes). MaxBufferSize lowered from 1 MiB to 64 KiB and
repositioned as a hard safety stop that should never be hit under
correct input — exceeding it now means a single tag (or unclosed
script/style body) genuinely exceeds 64 KiB and the scan must bail.
2. Events: fixed-size sliding window of the last WindowSize tag events
(unchanged from alpha.18). Push new, pop oldest. The window is the
only event-level state.
3. RollingSketch: documented (in IncrementalFenceScanner XML doc) that
MinHash with min-pooling is NOT reversibly rollable — once an element
leaves the window, its contribution to min(...) can't be subtracted.
The sketch therefore rebuilds the signature from the current event
window after each accepted tag (O(WindowSize × SignatureSize) per
tick, gated by the Bloom allowlist filter to skip the vast majority
of inbound tags). The bounded-buffer property — the user's headline
concern — is satisfied by the tokenizer; the sketch's per-tick recompute
is the price MinHash charges for the LSH-band locality the matcher
relies on. The event-level memory remains O(WindowSize) regardless.
4. IncrementalFenceScanner now exposes PeakBufferedBytes and BytesConsumed
passthroughs from the tokenizer so callers can prove the bounded-memory
property to telemetry without reaching into the tokenizer directly.
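
Item 3's point, that a min-pooled signature cannot be updated by subtracting the evicted element and so has to be rebuilt from the window, can be sketched as below. The window size, signature size, and the splitmix64-style `Permute` function are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;

const int WindowSize = 4;     // hypothetical; the real WindowSize is not stated
const int SignatureSize = 3;  // hypothetical number of MinHash slots

var window = new Queue<ulong>();
var signature = new ulong[SignatureSize];

// Illustrative per-slot hash (splitmix64 finalizer with a slot-dependent offset).
static ulong Permute(ulong value, int slot)
{
    unchecked
    {
        var z = value + 0x9E3779B97F4A7C15UL * (ulong)(slot + 1);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9UL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBUL;
        return z ^ (z >> 31);
    }
}

// Push new, pop oldest, then rebuild every slot from the surviving events.
// A min cannot "forget" the evicted element, hence the full rebuild:
// O(WindowSize x SignatureSize) work per accepted tag.
void Push(ulong shingle)
{
    window.Enqueue(shingle);
    if (window.Count > WindowSize) window.Dequeue();

    Array.Fill(signature, ulong.MaxValue);
    foreach (var s in window)
        for (var i = 0; i < SignatureSize; i++)
            signature[i] = Math.Min(signature[i], Permute(s, i));
}

Push(10UL);
Console.WriteLine(signature[0] == Permute(10UL, 0)); // True
```

Event-level memory stays at WindowSize entries no matter how many tags flow through.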
The duplicated tick logic (mirroring FenceScanner.Tick over heap-backed
sketch state) is retained — it's hard-pinned to the ref-struct path by
the existing cross-validation tests, which give us higher confidence
than refactoring to delegate would.
Memory-cap proof: tests/StreamingMemoryBoundTests.cs feeds 5 MiB of
synthetic HTML in 4 KiB chunks and asserts PeakBufferedBytes stays
under 16 KiB. The streaming gateway can now scan multi-megabyte
responses while holding bounded memory.
Migration: API is unchanged from alpha.18 — refactor is internal. The
new PeakBufferedBytes and BytesConsumed diagnostic properties on
IncrementalFenceScanner are additive. MaxBufferSize is still public but
the new value is 64 KiB (was 1 MiB); only relevant if you were catching
the InvalidOperationException for pathological input.
StyloExtract 1.8.0-alpha.18 - 2026-06-26
=========================================
Streaming: true chunked tokenization + refit/versioning + bench update
-----------------------------------------------------------------------
1. IncrementalHtmlTokenizer + IncrementalFenceScanner
Stateful tokenizer that survives chunk boundaries. A partial tag at
the end of one chunk is held in an internal buffer and completed when
the next Feed call arrives. Pairs with IncrementalFenceScanner —
callers Feed chunks as they arrive from the network, get a verdict
per chunk, bail early on Captured / Bailout.
Trade-off vs MinimalHtmlTokenizer's span path: one buffer allocation
per request (not per chunk). Use the span path for whole-buffer
scans, the incremental path for streaming gateways where bytes
arrive in chunks. Hard cap of 1 MiB on the internal buffer — feed
throws InvalidOperationException on pathological input that never
closes a tag, surfacing the failure rather than silently dropping
bytes.
Architectural note: FenceScanner stays a ref struct (zero-alloc hot
path); IncrementalFenceScanner is a heap-backed class that ports the
same tick logic. The two are kept in lockstep — any drift between
them is a correctness bug surface and is covered by cross-validation
tests that feed the same bytes both ways.
2. Streaming-template refit + versioning
StreamingTemplate gains a Version field (defaults to 1; persists
across alpha.17 templates without migration). New
StreamingRefitOrchestrator observes captured-scan output per host
and kicks off refits, off the hot path, when either:
- capture-region EWMA drift exceeds 30% on N consecutive scans, OR
- every 10th captured scan re-induces and finds different fences
On refit: version bumps, store is upserted, the new
IStreamingTemplateVersionSink fires a StreamingTemplateRefitEvent
(Host, Old/New TemplateId, Old/New Version, Reason, DetectedAt).
Default sink is a no-op; consumers wire UI telemetry to it.
3. Bench update
ExtractionComparisonBench gains a New_StreamingScanByHost variant so
the host-keyed hot-path is benchmarked alongside the original
GUID-keyed scan. Pre-populates the in-memory store with the
host="www.mostlylucid.net" template that lucidview FULL hits in
production.
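
The first refit trigger in item 2 (EWMA drift above 30% on N consecutive scans) can be sketched as a small stateful check. The smoothing factor, N = 3, and the choice of captured length as the tracked quantity are assumptions; the notes only give the 30% threshold and the "N consecutive" rule.

```csharp
using System;

const double Alpha = 0.2;            // hypothetical EWMA smoothing factor
const double DriftThreshold = 0.30;  // 30% from the notes
const int ConsecutiveRequired = 3;   // hypothetical value of N

double? ewma = null;
var breaches = 0;

// Feed one captured-region length; returns true when a refit should start.
bool Observe(double capturedLength)
{
    if (ewma is null)
    {
        ewma = capturedLength; // the first scan seeds the baseline
        return false;
    }

    var baseline = ewma.Value;
    var drift = baseline == 0 ? 0 : Math.Abs(capturedLength - baseline) / baseline;

    // Only an unbroken run of breaches counts; one in-range scan resets it.
    breaches = drift > DriftThreshold ? breaches + 1 : 0;
    ewma = Alpha * capturedLength + (1 - Alpha) * baseline;

    if (breaches < ConsecutiveRequired) return false;
    breaches = 0;
    return true;
}
```

Requiring consecutive breaches keeps a single unusual page (an error page, say) from triggering a refit, while a lasting layout change trips the check after N scans.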
Migration: additive APIs. Alpha.17 consumers using ScanByHost continue
to work; the incremental tokenizer and the refit orchestrator are
opt-in (use them when feeding chunks / when wiring drift telemetry).
StyloExtract 1.8.0-alpha.17 - 2026-06-26
=========================================
Streaming: host-keyed templates + naive auto-induction
-------------------------------------------------------
Three changes to close the alpha.16 streaming integration loop:
1. Host-keyed lookup
IStreamingTemplateStore gains GetByHostAsync / TryGetHotByHost /
UpsertAsync. StreamingTemplate gains a Host field (required). One
template per host (latest wins). The existing GUID-keyed methods
remain — Host is the lookup key for consumers; TemplateId stays for
stable identity / versioning.
2. StreamingPathSelector.ScanByHost(host, bytes)
Synchronous hot-cache-only host scan. Returns NoTemplate on miss
so the caller can WarmByHostAsync + retry or induce.
WarmByHostAsync brings a host's template into the hot cache via
the durable tier.
3. StreamingTemplateInducer
Naive first-pass inducer: walks HTML via MinimalHtmlTokenizer,
finds semantic-marker tag-sequence-pairs (<header>...</header>,
<p>...</p>...<p>...</p>, <footer>/</main>/</body>) and produces a
StreamingTemplate ready to upsert. Returns null on pages with no
identifiable structural fences (plain text, image-only, etc.).
Describe() returns a human-readable summary of the chosen markers
for logging.
Storage migrations:
- InMemoryStreamingTemplateStore: adds an in-memory host index.
- SqliteStreamingTemplateStore: adds a 'host' TEXT column + index;
on-open ALTER TABLE migration handles pre-alpha.17 schemas
(existing rows get Host="" — reachable only by GUID).
Migration: additive APIs; alpha.16 consumers using only the
GUID-keyed surface continue to work unchanged. The new Host field
on StreamingTemplate IS required — existing construction sites must
set Host="" if they have no host context.
StyloExtract 1.8.0-alpha.16 - 2026-06-26
=========================================
Mostlylucid.StyloExtract.Streaming — zero-allocation byte-stream scanner
------------------------------------------------------------------------
New package on NuGet. Hot-path streaming fence scanner: skips page chrome
and captures the content region as response bytes flow past, using
MinHash-derived structural fences. Zero per-request GC-tracked
allocations in steady state.
Designed for the gateway position — drop into a response pipeline
(HttpClient, Stylobot's edge, ASP.NET output filters) alongside the byte
stream and emit a verdict without buffering the full page.
Public hot-path API:
StreamingPathSelector.Scan(Guid templateId, ReadOnlySpan<byte> html)
→ ScanVerdict { Continue | Captured | Bailout | NoTemplate }
// Warm a template into the hot cache:
await selector.WarmAsync(templateId);
Storage:
- InMemoryStreamingTemplateStore — single-process LRU.
- SqliteStreamingTemplateStore — durable; same SQLite file pattern as
the existing ITemplateIndex but a separate table.
Pairs with the existing StyloExtract.Fingerprint learn path and
ITemplateIndex template store. The streaming template format is its own
shape (TemplateFence with MinHash bloom, content-start/content-end
fences) — not an LLM template or operator template.
Bench results vs LayoutExtractor on mostlylucid fixtures: see
bench/StyloExtract.Streaming.Benchmarks/ (zero-alloc scan competitive
with the full extractor's path-match cost while never building a DOM).
Migration: additive package; consumers add a PackageReference to
Mostlylucid.StyloExtract.Streaming if they want gateway-position
scanning.
StyloExtract 1.8.0-alpha.15 - 2026-06-26
=========================================
RenderOptions.WaitUntil — opt out of NetworkIdle for SPA routing
-----------------------------------------------------------------
PlaywrightHtmlFetcher previously hardcoded WaitUntilState.NetworkIdle
for the primary GotoAsync. On sites with aggressive client-side
routing (BBC News auto-navigates /news → /articles/<id> in the
post-load JS phase), this means the fetcher returns the post-routing
DOM, not the page the user requested.
RenderOptions now exposes a WaitUntil property (PlaywrightWaitUntil
enum: Load / DOMContentLoaded / NetworkIdle / Commit). Default stays
NetworkIdle for backwards compatibility. Consumers fetching SPA-heavy
sites should set Load to capture the initial DOM before the router
fires.
The secondary WaitForLoadStateAsync(NetworkIdle, ...) drain remains —
it's independently bounded by WaitForNetworkIdleTimeout and serves as
a best-effort late-XHR catch-up; safe even with the primary returning
on Load.
PlaywrightWaitUntil is a small enum (not Microsoft.Playwright.WaitUntilState
directly) so consumers don't take a transitive dependency on
Microsoft.Playwright just to pick a strategy.
StyloExtract 1.8.0-alpha.14 - 2026-06-26
=========================================
Sitemap CLI end-to-end regression + LLM nav few-shot
-----------------------------------------------------
1. Sitemap CLI test suite
The alpha.11 stylo-extract sitemap verb has been working on real sites
since alpha.13 (heuristic nav-classification tightening), but nothing
caught regressions. Added 5 end-to-end tests in
StyloExtract.Core.Tests/SitemapCommandTests.cs that invoke the
SitemapCommand.CrawlAsync handler against the mostlylucid-home.html.gz
fixture (real captured homepage, shared with the heuristics suite) plus
a stub HttpMessageHandler and assert: real nav links emitted under
# www.mostlylucid.net, --max-depth 0 emits only the seed Title row,
off-host links are not followed, --max-pages cap honoured exactly, and
--delay-ms enforced with a stopwatch floor. No network access required.
2. LLM induction prompt — nav-classification few-shot
LlmInducerPrompts.System and SystemRepair now include a second worked
example: a blog homepage with header <nav>, breadcrumb,
MainContent + RepeatedItem post cards, and footer <nav>. Mirrors the
patterns the alpha.13 NavPreDetector heuristic correctly classifies.
Rule 6 (RepeatedItem usage) tightened with explicit guidance that
header/footer nav lists are PrimaryNavigation / SecondaryNavigation at
the parent <ul>/<nav> level, NOT RepeatedItem at the <li> level —
closes a known LLM confusion mode.
Tests: snapshot tests in StyloExtract.Core.Tests/LlmInducerPromptsTests.cs
verify the prompt extensions land verbatim so future prompt edits don't
accidentally regress.
StyloExtract 1.8.0-alpha.13 - 2026-06-26
=========================================
Heuristic nav-classification tightening
----------------------------------------
HeuristicBlockClassifier was under-classifying real-world nav patterns
on server-rendered sites — header <nav> strips, header <ul>-of-links,
breadcrumb lists, role="navigation" attributes, footer nav — all landed
as Boilerplate (or weren't extracted at all). Result: the alpha.11
Sitemap profile and stylo-extract sitemap CLI verb produced a one-line
tree even on sites with rich nav, because the classifier didn't surface
PrimaryNavigation / SecondaryNavigation / Breadcrumb roles for them.
Tightened patterns now produce definite role classifications:
1. <header> <nav> -> PrimaryNavigation (0.9)
2. Top-of-document <nav> -> PrimaryNavigation (0.85)
3. <footer> <nav> -> SecondaryNavigation (0.9)
4. <nav aria-label="breadcrumb"> / class~="breadcrumb" -> Breadcrumb (0.95)
5. <* role="navigation"> -> PrimaryNavigation (0.95)
6. Header <ul> of mostly-link <li>s -> PrimaryNavigation (0.85) at
the <ul> level, suppress descent (was emitting deep Boilerplate)
7. Footer <ul> of mostly-link <li>s -> SecondaryNavigation (0.85)
Implementation: a new NavPreDetector runs after per-element classification
and injects each detected nav container as a high-score (50000) candidate
at the parent level, then demotes any descendant candidates so greedy
selection picks the nav parent and stops descending into its noise.
Containers nested inside <main>/<article> are skipped — IntraBlockCleaner
already strips them as intra-block contaminants; hoisting would steal
the article's selection win.
Regression fixtures captured from mostlylucid.net + wikipedia.org under
tests/StyloExtract.Heuristics.Tests/Fixtures so the next time a
classifier change regresses real-world nav detection, the bench catches
it before it ships.
Downstream impact: the Sitemap ExtractionProfile and stylo-extract
sitemap CLI verb now produce real nav trees on these sites - see the
lucidview FULL dogfood smoke for evidence.
StyloExtract 1.8.0-alpha.12 - 2026-06-26
=========================================
DI wire-up fix for deterministic-template YAML persistence
-----------------------------------------------------------
alpha.11 introduced DeterministicTemplateYamlSink + the
AddStyloExtractOperatorTemplates registration, but AddStyloExtract's
LayoutExtractor construction did not pass the sink through to the
extractor — so even when the sink was registered in DI, LayoutExtractor's
optional ctor parameter defaulted to null and no `<host>-deterministic.yaml`
file was ever written.
Fixed by threading `sp.GetService<DeterministicTemplateYamlSink>()` to the
LayoutExtractor constructor in AddStyloExtract. No API change; consumers
who already called AddStyloExtractOperatorTemplates start seeing
deterministic YAML files immediately after upgrading.
StyloExtract 1.8.0-alpha.11 - 2026-06-26
=========================================
Sequenced architecture extension: deterministic templates with
extended classification — Title role, Sitemap profile, deterministic
YAML persistence, and a sitemap CLI verb.
Title BlockRole
---------------
New BlockRole.Title value distinguishes the page-level <h1> (the single
H1 the rest of the page is "about") from intra-content Heading
(H2/H3/H4 inside the body). HeuristicBlockClassifier surfaces the Title
via a shared PageTitleDetector helper, picking the H1 in/closest-to
<main>/<article> and falling back to earliest-in-document with multiple
H1s. ExtractorApplicator surfaces Title on the fast-path / applicator
branch too, so output stays consistent across novel and cached requests
(matters for the response-cache ETag). LlmInducerPrompts list Title in
the allowed-roles set with a one-line distinction from Heading.
MainContentOnly, RagFull, Wcxb, and AgentNavigation profiles all
include Title in their role-set. The renderer quality gate (drop short
text) bypasses for Title and Heading so intentionally-terse page
titles ("Home", "About") still surface.
Sitemap ExtractionProfile
-------------------------
New ExtractionProfile.Sitemap value emits only Title + Heading +
PrimaryNavigation + SecondaryNavigation + Breadcrumb. For sitemap /
outline / crawler use cases that want page titles and the site's nav
structure without pulling body content. The CLI's --profile flag
recognises `sitemap` automatically (enum binding).
Deterministic YAML persistence
------------------------------
New DeterministicTemplateYamlSink, wired automatically when
AddStyloExtractOperatorTemplates(root) is called, writes
<host>-deterministic.yaml alongside each heuristic-induced template's
SQLite row. The file carries every role the heuristic detected (Title,
MainContent, Navigation, Footer, …) — auditable, hand-editable, and
diffable, mirroring how LLM-induced templates have always been written
by TemplateEnrichmentCoordinator. The SQLite store remains the
authoritative source at match time; YAML is best-effort and
non-blocking.
stylo-extract sitemap CLI verb
------------------------------
New `sitemap` subcommand: takes one or more starting URLs, extracts
each with ExtractionProfile.Sitemap, follows internal nav links to
--max-depth (default 3), and emits a markdown tree of titles + URLs to
stdout or --out <file>. Safety caps: 50 pages by default
(--max-pages), 1s between requests (--delay-ms), no off-host follow.
Migration
---------
No source change required for consumers. The new Title role is
additive: existing switches over BlockRole that fall through to a
default arm continue to compile and behave identically, and switches
that exhaustively listed roles were updated. Deterministic YAML
writing only activates when AddStyloExtractOperatorTemplates is
called, so consumers that don't use operator templates see no new
filesystem activity.
StyloExtract 1.8.0-alpha.10 - 2026-06-26
=========================================
LLM classification accuracy for chrome patterns
------------------------------------------------
Symptom: induced templates were labelling language pickers, filter UI,
locale switchers, and pagination strips as MainContent on
server-rendered blogs (mostlylucid.net being the canonical reproducer).
The downstream RagFull renderer's role-filter — which already drops
PrimaryNavigation / SecondaryNavigation / Form / Boilerplate — never
saw them as nav and so left them in the extracted markdown, producing
output WORSE than the deterministic heuristic.
Fix: expanded the induction and repair system prompts with explicit
"chrome pattern → role" examples (language picker → PrimaryNavigation;
filter / fac
[truncated — see RELEASE_NOTES.txt packaged at root for full history]