Mostlylucid.StyloExtract.Templates.Postgres
1.8.0-alpha.9
PostgreSQL-backed template index for StyloExtract. Implements the same ITemplateIndex contract as Mostlylucid.StyloExtract.Templates (the SQLite provider); swap providers via DI with no change to calling code.
When to use this instead of SQLite
Choose the Postgres provider when:
- Your deployment already runs PostgreSQL as its operational database (StyloBot commercial, multi-tenant SaaS)
- You need multiple extraction nodes sharing one template store (Npgsql pools connections; Postgres serialises concurrent writes natively)
- You plan to add pgvector cosine-similarity search in a future upgrade (the schema is forward-compatible)
The SQLite provider (Mostlylucid.StyloExtract.Templates) is the right choice for single-host or air-gapped deployments, CLI tools, and anywhere you want zero external dependencies.
Installation
dotnet add package Mostlylucid.StyloExtract.Templates.Postgres
Usage
// Register the Postgres provider. Call this instead of (or after) AddStyloExtract()
// to replace the SQLite ITemplateIndex with the Postgres one.
services.AddStyloExtractPostgres(o =>
    o.ConnectionString = "Host=localhost;Port=5432;Database=styloextract;Username=se;Password=secret");

// Optional: register drift-triggered refit support (mirrors RefitOrchestrator for SQLite).
services.AddStyloExtractPostgresRefit(
    driftRefitThreshold: 0.35,
    observationsBeforeStable: 5,
    versionHistoryDepth: 3);
Schema is applied idempotently on the first operation (CREATE TABLE IF NOT EXISTS). No migration tool required.
Storage model
| Table | Contents |
|---|---|
| templates | Template id (bytea), host hash, fingerprint, extractor JSON blob, version, observation count |
| template_lsh_band_index | LSH bucket rows for fast-path lookup |
| template_observations | Per-request observation vectors (bounded to last 100 per template) |
| template_version_history | Past extractor versions retained for diff generation |
Columns that are BLOB in SQLite are bytea in Postgres. Timestamps are bigint Unix milliseconds. No pgvector dependency in v1; vector similarity uses the same CPU-side cosine math as the SQLite provider.
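
To make the CPU-side cosine math and the Unix-millisecond timestamps concrete, here is a minimal sketch. The helper name `Cosine` and the sample vectors are hypothetical and are not the package's API; the provider's actual vector layout is not described here.

```csharp
using System;

// Hypothetical helper (not the package's API): cosine similarity of two equal-length vectors.
static double Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }

    // A zero vector has no direction; report 0 rather than dividing by zero.
    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

float[] stored = { 0.2f, 0.5f, 0.1f };
float[] observed = { 0.2f, 0.4f, 0.3f };
Console.WriteLine($"similarity = {Cosine(stored, observed):F3}");

// Timestamps stored as bigint Unix milliseconds: convert at the storage boundary.
long unixMs = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
DateTimeOffset restored = DateTimeOffset.FromUnixTimeMilliseconds(unixMs);
Console.WriteLine($"{unixMs} -> {restored:O}");
```

Because the similarity is computed in process, the same code path can serve both providers; only the byte storage (BLOB or bytea) differs.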
AOT
This package sets IsAotCompatible=false because Npgsql requires runtime reflection for connection-string parsing. It will not break AOT builds in packages that do not reference it (sibling packages such as StyloExtract.Playwright remain AOT-safe).
| Product | Compatible and computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net10.0
  - Mostlylucid.StyloExtract.Abstractions (>= 1.8.0-alpha.9)
  - Mostlylucid.StyloExtract.Fingerprint (>= 1.8.0-alpha.9)
  - Mostlylucid.StyloExtract.Templates (>= 1.8.0-alpha.9)
  - Npgsql (>= 10.0.3)
| Version | Downloads | Last Updated |
|---|---|---|
| 2.0.1 | 45 | 6/30/2026 |
| 2.0.0 | 94 | 6/28/2026 |
| 1.8.0 | 100 | 6/27/2026 |
| 1.8.0-alpha.23 | 49 | 6/27/2026 |
| 1.8.0-alpha.22 | 50 | 6/27/2026 |
| 1.8.0-alpha.21 | 52 | 6/27/2026 |
| 1.8.0-alpha.20 | 60 | 6/27/2026 |
| 1.8.0-alpha.19 | 56 | 6/26/2026 |
| 1.8.0-alpha.18 | 70 | 6/26/2026 |
| 1.8.0-alpha.17 | 79 | 6/26/2026 |
| 1.8.0-alpha.16 | 86 | 6/26/2026 |
| 1.8.0-alpha.15 | 59 | 6/26/2026 |
| 1.8.0-alpha.14 | 50 | 6/26/2026 |
| 1.8.0-alpha.13 | 49 | 6/26/2026 |
| 1.8.0-alpha.12 | 47 | 6/26/2026 |
| 1.8.0-alpha.11 | 48 | 6/26/2026 |
| 1.8.0-alpha.10 | 55 | 6/26/2026 |
| 1.8.0-alpha.9 | 53 | 6/25/2026 |
| 1.8.0-alpha.8 | 58 | 6/25/2026 |
| 1.8.0-alpha.4 | 56 | 6/25/2026 |
StyloExtract 1.8.0-alpha.9 - 2026-06-25
========================================
App-safe AddStyloExtract + LlmInductionFired flag
--------------------------------------------------
Two changes that downstream desktop / CLI consumers (e.g. lucidVIEW-FULL)
need:
1. The basic `AddStyloExtract(IServiceCollection, Action<StyloExtractOptions>?)`
DI extension and its companion `StyloExtractOptions` type now live in
`Mostlylucid.StyloExtract.Core` instead of `Mostlylucid.StyloExtract.AspNetCore`.
Non-AspNetCore hosts (desktop apps, CLI tools, console workers) can call:
    services.AddStyloExtract(o => o.StorePath = "templates.db");
without pulling `Microsoft.AspNetCore.App` (~70 MB of framework runtime).
`Mostlylucid.StyloExtract.AspNetCore` keeps its `Action<ResponsePolicyBuilder>`
overloads (response-policy framework, markdown content negotiation
middleware, operator-template minimal-API endpoints) — those legitimately
need AspNetCore. They now delegate to the Core overload internally.
Migration: no source change for AspNetCore consumers. Desktop / CLI
consumers can reference `Mostlylucid.StyloExtract.Core` alone.
2. `ExtractionResult.LlmInductionFired` (new bool property) signals
whether the LLM template inducer ran during this extraction. Downstream
telemetry surfaces (e.g. status bars, NDJSON exports) can now show LLM
utilisation per call without reflection or polling internal state.
Defaults to false for non-LLM hosts and heuristic-only extractions;
set true only when the LlmTemplateInducer (or any future ILlmTextProvider-
backed inducer) actually invoked the LLM.
StyloExtract 1.8.0-alpha.6 - 2026-06-25
========================================
App-safe AddStyloExtract — moved to StyloExtract.Core
------------------------------------------------------
The basic `AddStyloExtract(IServiceCollection, Action<StyloExtractOptions>?)`
DI extension and its companion `StyloExtractOptions` type have moved from
`Mostlylucid.StyloExtract.AspNetCore` to `Mostlylucid.StyloExtract.Core`.
Desktop, CLI, and any non-AspNetCore host can now call:
    services.AddStyloExtract(o => o.StorePath = "templates.db");
without pulling `Microsoft.AspNetCore.App` (~70 MB of framework runtime).
`Mostlylucid.StyloExtract.AspNetCore` keeps its `Action<ResponsePolicyBuilder>`
overloads (response-policy framework, markdown content negotiation
middleware, operator-template minimal-API endpoints) — those legitimately
need AspNetCore. They now delegate to the Core overload internally.
Migration: no source change required for AspNetCore consumers. Desktop /
CLI consumers can drop direct dependencies on `Mostlylucid.StyloExtract.AspNetCore`
and reference `Mostlylucid.StyloExtract.Core` alone.
StyloExtract 1.8.0-alpha.5 - 2026-06-25
========================================
In-process CPU LLM backend (LLamaSharp) + 13-model bench harness.
Operators can now embed a single ~2-3 GB GGUF model in the host
process — no Ollama server, no separate LLM daemon. Same
ILlmTextProvider contract as the Ollama backend, so the
LlmTemplateInducer + production enrichment coordinator + CLI
`template train` all work unchanged.
What's new since 1.8.0-alpha.4
------------------------------
Mostlylucid.StyloExtract.Llm.LlamaSharp
New package. ILlmTextProvider implementation backed by LLamaSharp
0.27 (the .NET binding for llama.cpp). Loads a GGUF model from
disk; the executor reads the model's chat template from GGUF
metadata so prompts written for Ollama work unchanged.
Wire-up:
    services.AddStyloExtract(...);
    services.AddStyloExtractLlamaSharp(o =>
    {
        o.ModelPath = "/var/models/Phi-4-mini-instruct-Q4_K_M.gguf";
        o.ContextSize = 8192;
        o.GpuLayerCount = 0; // pure CPU target
    });
    services.AddStyloExtractLlmInducer("config/templates");
Anti-prompt set covers Qwen, Phi, Llama 3+, and Gemma 4 stop
tokens so the generator halts at the model's natural turn boundary
instead of echoing the chat template structure.
Known LLamaSharp 0.27 issue documented in the package README:
Gemma 4 E2B / E4B's chat template metadata isn't applied cleanly
by StatelessExecutor — the model emits Jinja2 template source
instead of YAML. Phi-4-mini, Qwen 2.5 Coder, Llama 3.2 work fine.
Model benchmark harness
New tests/StyloExtract.Llm.Benchmark project — runs the
cross-product of (models × pages) for template induction and
reports F1 / train-time / markdown-size matrices. Reuses WCXB
ground-truth shape (one HTML.gz per page id, one ground-truth
JSON) and the operator-template store path.
Model spec routing: `llamasharp:/path/to/file.gguf` resolves via
the in-process backend; anything else hits Ollama. Lets one
bench compare server (Ollama) and embedded (LlamaSharp) backends
side-by-side with identical fixtures.
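
The model spec routing rule can be sketched as a small parser. The function name `RouteModelSpec`, the backend labels, and the case-sensitive prefix match are assumptions made for illustration; the harness's real implementation is not shown in these notes.

```csharp
using System;

// Hypothetical parser: decides which backend a model spec selects and what value to hand it.
static (string Backend, string Model) RouteModelSpec(string spec)
{
    const string prefix = "llamasharp:";

    // Assumption: the prefix is matched case-sensitively.
    return spec.StartsWith(prefix, StringComparison.Ordinal)
        ? ("llamasharp", spec.Substring(prefix.Length)) // remainder is a GGUF path
        : ("ollama", spec);                             // anything else is an Ollama tag
}

Console.WriteLine(RouteModelSpec("llamasharp:/path/to/file.gguf")); // (llamasharp, /path/to/file.gguf)
Console.WriteLine(RouteModelSpec("qwen3.5:4b"));                    // (ollama, qwen3.5:4b)
```

Note that an Ollama tag such as `qwen3.5:4b` also contains a colon, so the sketch matches the full `llamasharp:` prefix rather than splitting on the first colon.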
Recommended models (empirically validated)
For Ollama backend:
* qwen3.5:4b — 3 GB, ~26 s, F1 0.805 (default, best)
* qwen2.5-coder:3b — 2 GB, ~21 s, F1 0.767 (smaller-and-faster pick;
code-trained matters for
CSS selectors)
* qwen3.5:0.8b — 1 GB, ~5 s, F1 0.528 (tiny floor)
For LLamaSharp backend (use bartowski quants):
* Phi-4-mini-instruct Q4_K_M — 2.5 GB, verified working
* Qwen 3.5 4B Q4_K_M — 3 GB, verified working
* Qwen 2.5 Coder 3B Q4_K_M — 2 GB, verified working
OllamaTextProviderOptions default model bumped
Default tag was gemma4:e4b-it-qat; switched to qwen3.5:4b per the
bench. The doc-comment now lists the smaller-and-faster pick and
the model families to avoid (thinking-mode budget burn).
Tests
494 tests across 11 projects. New StyloExtract.Llm.LlamaSharp.Tests
project covers ctor validation, missing-file behaviour, and
SkippableFact live-GGUF integration (skipped without
STYLOEXTRACT_LLAMASHARP_MODEL env var pointing at a GGUF file).
StyloExtract 1.8.0-alpha.4 - 2026-06-25
========================================
Tiny patch alpha to fix two consumer-facing bugs found while smoke-
installing alpha.3 against NuGet.
What's new since 1.8.0-alpha.3
------------------------------
SQLite chain CVE patched (GHSA-2m69-gcr7-jv3q)
Microsoft.Data.Sqlite bumped 10.0.1 -> 10.0.9; StyloExtract.Templates
gains a direct PackageReference to SQLitePCLRaw.bundle_e_sqlite3 so
the existing 3.0.3 central pin lifts the resolved bundle off the
vulnerable 2.1.11 line and onto SourceGear.sqlite3 3.50.4.5.
`dotnet list package --vulnerable` on consumer projects now
returns clean.
PlaywrightHtmlFetcher.Dispose() (sync path)
The fetcher previously only implemented IAsyncDisposable. When
registered as a DI singleton (which AddStyloExtractPlaywright()
does), `using var sp = services.BuildServiceProvider()` — the
canonical sync pattern — threw at container shutdown:
    InvalidOperationException: 'PlaywrightHtmlFetcher' type only
    implements IAsyncDisposable. Use DisposeAsync to dispose the
    container.
The fix adds a sync Dispose() that blocks on the async path. Container
disposal happens off the request hot path, so the sync wait is safe.
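
The pattern of a sync Dispose() that blocks on the async path looks roughly like the sketch below. `SketchFetcher` is a hypothetical stand-in, not PlaywrightHtmlFetcher itself, and the `Task.Delay` only simulates closing a browser.

```csharp
using System;
using System.Threading.Tasks;

// A sync using-block now works because the type implements IDisposable as well.
using (var fetcher = new SketchFetcher())
{
    Console.WriteLine("fetcher in use");
}

// Hypothetical stand-in for a fetcher that owns resources with async-only cleanup.
sealed class SketchFetcher : IAsyncDisposable, IDisposable
{
    private bool _disposed;

    public bool IsDisposed => _disposed;

    public async ValueTask DisposeAsync()
    {
        if (_disposed) return;   // make repeated disposal a no-op
        _disposed = true;
        await Task.Delay(10);    // stands in for closing the browser asynchronously
        Console.WriteLine("resources released");
    }

    // Block on the async path. This assumes no synchronization context that the
    // continuation needs to re-enter, which holds for container shutdown in a
    // console or generic host.
    public void Dispose() => DisposeAsync().AsTask().GetAwaiter().GetResult();
}
```

Hosts that can dispose asynchronously (`await using var sp = ...`) still take the non-blocking route; the sync method only exists so the sync pattern does not throw.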
Both fixes are backwards-compatible drop-in patches. No code changes
needed in consumer projects beyond bumping the package version.
492 tests across 10 projects, all green.
StyloExtract 1.8.0-alpha.3 - 2026-06-25
========================================
What's new since 1.8.0-alpha.2
------------------------------
Next.js __NEXT_DATA__ rehydration extractor
Next.js apps embed their page state in a JSON blob inside
<script id="__NEXT_DATA__" type="application/json">. Schemas vary
per site (Shopify Hydrogen uses pageProps.shopifyProductsPreloadedState,
news sites use pageProps.initialState.article.body) so the
extractor walks props.pageProps recursively and collects every
string value that looks like prose (>= 80 chars, contains a space,
isn't a URL / data URI / CSS variable / serialised JSON). Conservative
key-exclusion list keeps URLs and build metadata out of the result.
Chains next to the JSON-LD and Discourse rehydration fallbacks.
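
A minimal sketch of the recursive walk, using System.Text.Json. The exact checks inside `LooksLikeProse` and the contents of the key-exclusion set are assumptions derived from the description above, not the extractor's real code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Assumed reading of "looks like prose": long enough, has a space, and is not a
// URL, data URI, CSS variable, or serialised JSON.
static bool LooksLikeProse(string s)
{
    s = s.Trim();
    if (s.Length < 80 || !s.Contains(' ')) return false;
    if (s.StartsWith("http://", StringComparison.OrdinalIgnoreCase)) return false;
    if (s.StartsWith("https://", StringComparison.OrdinalIgnoreCase)) return false;
    if (s.StartsWith("data:", StringComparison.OrdinalIgnoreCase)) return false;
    if (s.StartsWith("--", StringComparison.Ordinal)) return false;
    if (s.StartsWith("var(--", StringComparison.Ordinal)) return false;
    if (s[0] == '{' || s[0] == '[') return false;
    return true;
}

// Depth-first walk that skips excluded keys and collects prose-like strings.
static void Collect(JsonElement node, HashSet<string> excludedKeys, List<string> found)
{
    switch (node.ValueKind)
    {
        case JsonValueKind.Object:
            foreach (JsonProperty property in node.EnumerateObject())
                if (!excludedKeys.Contains(property.Name))
                    Collect(property.Value, excludedKeys, found);
            break;
        case JsonValueKind.Array:
            foreach (JsonElement item in node.EnumerateArray())
                Collect(item, excludedKeys, found);
            break;
        case JsonValueKind.String:
            string text = node.GetString() ?? "";
            if (LooksLikeProse(text)) found.Add(text);
            break;
    }
}

string body = "This paragraph is long enough to pass the eighty character floor used by the sketch above.";
string nextData = "{\"props\":{\"pageProps\":{\"article\":{\"body\":\"" + body +
                  "\",\"canonicalUrl\":\"https://example.com/some/long/path\"}},\"buildId\":\"abc123\"}}";

using JsonDocument doc = JsonDocument.Parse(nextData);
var prose = new List<string>();
var excluded = new HashSet<string> { "canonicalUrl", "buildId" }; // hypothetical exclusion list
Collect(doc.RootElement.GetProperty("props").GetProperty("pageProps"), excluded, prose);
Console.WriteLine(prose.Count);
```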
Content-role fallback gate
The chained fallback (JSON-LD -> Next.js -> Discourse -> body-text)
previously gated on the all-blocks text sum. That sum looked
healthy for pages where the heuristic emitted 3 KB of nav + footer +
boilerplate while finding zero MainContent — the renderer's
MainContentOnly / Wcxb profiles drop those roles anyway, so the
actual markdown is 0 chars. Switch the gate to content-role text
mass only. 18 catastrophic pages recovered without any new code,
just the gate change.
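
The difference between the two gate inputs can be shown in a few lines. The role names and the choice of `MainContent` as the only content role are assumptions for illustration; the real role set is defined by the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sum text length over content roles only (the new gate input).
static int ContentRoleTextMass(IEnumerable<(string Role, string Text)> blocks, ISet<string> contentRoles) =>
    blocks.Where(block => contentRoles.Contains(block.Role)).Sum(block => block.Text.Length);

var contentRoles = new HashSet<string> { "MainContent" }; // assumed content-role set

// A page where the heuristic found plenty of chrome and no main content.
var blocks = new[]
{
    (Role: "Navigation", Text: new string('n', 1500)),
    (Role: "Footer", Text: new string('f', 1500)),
};

int allBlocksText = blocks.Sum(block => block.Text.Length);      // old gate input
int contentText = ContentRoleTextMass(blocks, contentRoles);     // new gate input
Console.WriteLine($"all={allBlocksText} content={contentText}"); // all=3000 content=0
```

With the old input the page looks healthy and the fallback chain never runs; with the new input it reads as empty and the chain is tried.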
Playwright auto-fallback decorator
AddStyloExtractPlaywright() wires PlaywrightHtmlFetcher AND
decorates the existing ILayoutExtractor with a RenderingLayoutExtractor
that runs static extraction first, then re-fetches via Playwright
only when:
* the caller passed a non-null sourceUri
* the static result has < 200 chars of content-role text
* an IRenderedHtmlFetcher is wired in DI
File-only callers never trigger a render. Operators who don't want
the Chromium dependency simply don't add the package. Three guards
against wasted work: Playwright throws -> return static; rendered
HTML same length as static -> skip the re-extract; re-extract
yields no improvement -> return static.
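
The trigger conditions and the three guards can be sketched as one decision function. Everything here is a hypothetical stand-in: `extract` returns content-role text directly, `render` plays the role of the rendered-HTML fetcher, and "improvement" is assumed to mean more content-role characters.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical decision flow; pass null for sourceUri (file-only caller) or render (no fetcher wired).
static async Task<string> ExtractWithFallbackAsync(
    string staticHtml, Uri sourceUri, Func<Uri, Task<string>> render, Func<string, string> extract)
{
    const int minContentChars = 200; // threshold quoted in the notes above

    string staticResult = extract(staticHtml);
    if (sourceUri is null || render is null || staticResult.Length >= minContentChars)
        return staticResult;

    string renderedHtml;
    try { renderedHtml = await render(sourceUri); }
    catch (Exception) { return staticResult; }                          // guard 1: render failed

    if (renderedHtml.Length == staticHtml.Length) return staticResult;  // guard 2: nothing changed

    string renderedResult = extract(renderedHtml);
    return renderedResult.Length > staticResult.Length                  // guard 3: no improvement
        ? renderedResult
        : staticResult;
}

// Stand-ins: the "extractor" only finds content when an <article> exists.
Func<string, string> extract = html => html.Contains("<article>") ? new string('x', 500) : "";
Func<Uri, Task<string>> render = _ => Task.FromResult("<html><article>hydrated</article></html>");
string shell = "<html><div id=\"root\"></div></html>";

string recovered = await ExtractWithFallbackAsync(shell, new Uri("https://example.com/"), render, extract);
string fileOnly = await ExtractWithFallbackAsync(shell, null, render, extract);
Console.WriteLine($"{recovered.Length} {fileOnly.Length}"); // 500 0
```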
Usage:
    services.AddStyloExtract(...);
    services.AddStyloExtractPlaywright();
492 tests across 10 projects, 6 new unit tests for the decorator
policy.
Aggregate WCXB (1495 dev pages, Wcxb profile):
| Stage | F1 | Catastrophic |
|----------------------------------------|-------:|-------------:|
| 1.8.0-alpha.2 | 0.760 | 25 |
| + Next.js extractor | same | |
| + content-role fallback gate | 0.760 | 17 |
| + 14 LLM-trained YAMLs | 0.760 | 17 |
| (Playwright auto-fallback) | -- | |
Playwright auto-fallback is wired but not exercised in the WCXB
benchmark by default — needs `playwright install chromium`. Real-
world consumers with the package added see automatic recovery for
JS-rendered SPAs whose content is hydrated client-side.
StyloExtract 1.8.0-alpha.2 - 2026-06-25
========================================
LLM template-training loop, Discourse rehydration, plus a stack of
heuristic + selection fixes that move the WCXB dev split from F1 0.673
(post-1.7.1, MainContentOnly profile) to F1 0.760 (Wcxb plain-text
profile, with operator-trained templates + Discourse rehydration
active). Catastrophic extraction failures (pred_chars ≤ 5) drop from
92 of 1495 pages to 25.
Beats Readability on every page type. Closes the gap to Trafilatura by
~40% on Article + Documentation. Above v1.5.4 baseline (0.718) by
+0.042 — and that's keeping all the GFM markdown structure (sidebar
TOCs, blockquotes, GFM tables) in the runtime output, not stripping
to plain text for benchmark flattery.
What's new since 1.8.0-alpha.1
------------------------------
LLM template training loop (`stylo-extract template train`)
Operator-driven synchronous LLM template specialisation, the
counterpart to the existing async enrichment coordinator. Smart-
routes between induce (no template yet) and repair (template
exists but underperforms).
Closed-selector prompt: every selector the model can choose from
is enumerated from the actual page DOM via DocumentSelectorCatalog
and handed to the LLM in the prompt. Inventing selectors fails.
Post-parse AngleSharp validation: every selector the model returns
is run through doc.QuerySelectorAll. Selectors that match zero
elements are dropped; templates whose MainContent rule has no
surviving selector are rejected.
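
A sketch of the post-parse validation rule. The `matchCount` delegate is a stand-in for running the selector against the parsed document (for example the number of elements returned by `QuerySelectorAll`), and the dictionary-of-roles template shape is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Drops selectors that match nothing; reports failure when MainContent has none left.
static bool TryValidate(
    IReadOnlyDictionary<string, List<string>> template,
    Func<string, int> matchCount,
    out Dictionary<string, List<string>> cleaned)
{
    cleaned = template.ToDictionary(
        rule => rule.Key,
        rule => rule.Value.Where(selector => matchCount(selector) > 0).ToList());

    return cleaned.TryGetValue("MainContent", out List<string> main) && main.Count > 0;
}

// Stand-in for the page DOM: the selectors that actually match something.
var selectorsOnPage = new HashSet<string> { "article.post", "nav.top" };
Func<string, int> matchCount = selector => selectorsOnPage.Contains(selector) ? 1 : 0;

var fromLlm = new Dictionary<string, List<string>>
{
    ["MainContent"] = new List<string> { "div.invented", "article.post" },
    ["Navigation"] = new List<string> { "nav.top" },
};

bool accepted = TryValidate(fromLlm, matchCount, out var validated);
Console.WriteLine($"{accepted}: {string.Join(", ", validated["MainContent"])}"); // True: article.post
```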
Repair prompt re-angled as a diagnostic: "why is this failing AND
how should it work for this page" instead of just "produce a
corrected template."
Hash-prefixed selectors (`#my-id`) are now properly quoted in
emitted YAML so they round-trip; the inducer also pre-repairs
unquoted hash selectors in the LLM response before parse.
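
The quoting matters because YAML reads an unquoted `#` at the start of a value (or after a space) as the start of a comment, so `- #my-id` parses as an empty item. A minimal sketch of both halves, with hypothetical helper names and a deliberately narrow repair regex that only handles list items with `\n` line endings:

```csharp
using System;
using System.Text.RegularExpressions;

// Emit side: single-quote any selector YAML would otherwise read as a comment.
static string QuoteIfNeeded(string selector)
{
    bool needsQuotes = selector.StartsWith("#", StringComparison.Ordinal) || selector.Contains(" #");
    return needsQuotes ? "'" + selector.Replace("'", "''") + "'" : selector;
}

// Parse side: quote list items of the form "- #something" before handing the text to a YAML parser.
static string RepairListItems(string yaml) =>
    Regex.Replace(
        yaml,
        @"^([ \t]*-[ \t]+)(#[^\r\n]*?)[ \t]*$",
        match => match.Groups[1].Value + QuoteIfNeeded(match.Groups[2].Value),
        RegexOptions.Multiline);

Console.WriteLine(QuoteIfNeeded("#my-id"));       // '#my-id'
Console.WriteLine(QuoteIfNeeded("article.post")); // article.post
Console.WriteLine(RepairListItems("selectors:\n  - #my-id\n  - article.post\n"));
```

The blanket repair would also quote a line that was meant to be a real comment inside a list, which is acceptable here only under the assumption that the LLM response contains selectors and no comments.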
OllamaTextProvider bumps NumPredict default 1024 → 4096
(reasoning-tagged models burn tokens on chain-of-thought before
the answer) and falls back to message.thinking when message.content
is empty.
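
The content-then-thinking fallback can be sketched against the shape of an Ollama chat response. The function name `ReadAnswer` is hypothetical, and the sample JSON is trimmed to the `message` object with only the fields this note mentions.

```csharp
using System;
using System.Text.Json;

// Prefer message.content; fall back to message.thinking when content is empty.
static string ReadAnswer(string responseJson)
{
    using JsonDocument doc = JsonDocument.Parse(responseJson);
    JsonElement message = doc.RootElement.GetProperty("message");

    string content = message.TryGetProperty("content", out JsonElement c) ? c.GetString() ?? "" : "";
    if (!string.IsNullOrWhiteSpace(content)) return content;

    return message.TryGetProperty("thinking", out JsonElement t) ? t.GetString() ?? "" : "";
}

string normal = "{\"message\":{\"role\":\"assistant\",\"content\":\"rules: []\"}}";
string budgetBurned = "{\"message\":{\"role\":\"assistant\",\"content\":\"\",\"thinking\":\"rules: [] # drafted while reasoning\"}}";

Console.WriteLine(ReadAnswer(normal));
Console.WriteLine(ReadAnswer(budgetBurned));
```

The fallback only rescues cases where the model ran out of tokens after drafting a usable answer inside its reasoning; the caller still has to validate whatever comes back.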
`template repair` command + `LlmTemplateInducer.RepairFromSkeletonAsync`
+ production coordinator dispatch (TemplateEnrichmentJob.Kind +
LayoutExtractor enqueue on low-output existing-template hits).
Discourse data-preloaded rehydration
Discourse renders every page as an Ember.js SPA. Static HTML ships
near-zero post content; the actual topic + posts live in a JSON
blob in <div id="data-preloaded" data-preloaded="...JSON...">.
DiscourseRehydrationExtractor parses the JSON, walks
topic_NNN.post_stream.posts[*].cooked, strips tags, and emits the
result as a synthetic MainContent fallback block — same shape as
the existing JSON-LD fallback. Discourse powers 5 000+ public
forums; one upstream extractor covers them all.
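
A rough sketch of the walk over the preloaded JSON. It assumes the attribute value has already been read and HTML-unescaped, and it accepts each `topic_NNN` entry either as a nested object or as a JSON-encoded string, since the exact encoding is not stated in these notes. The regex tag stripping is a simplification, not the extractor's real HTML handling.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.Json;
using System.Text.RegularExpressions;

// Collects plain-text post bodies from topic_NNN.post_stream.posts[*].cooked.
static List<string> ExtractPosts(string preloadedJson)
{
    var posts = new List<string>();
    using JsonDocument outer = JsonDocument.Parse(preloadedJson);

    foreach (JsonProperty entry in outer.RootElement.EnumerateObject())
    {
        if (!entry.Name.StartsWith("topic_", StringComparison.Ordinal)) continue;

        // Assumption: the value may be a JSON-encoded string rather than a nested object.
        string topicJson = entry.Value.ValueKind == JsonValueKind.String
            ? entry.Value.GetString() ?? "{}"
            : entry.Value.GetRawText();

        using JsonDocument topic = JsonDocument.Parse(topicJson);
        if (!topic.RootElement.TryGetProperty("post_stream", out JsonElement stream) ||
            !stream.TryGetProperty("posts", out JsonElement postArray))
            continue;

        foreach (JsonElement post in postArray.EnumerateArray())
        {
            if (!post.TryGetProperty("cooked", out JsonElement cooked)) continue;
            string noTags = Regex.Replace(cooked.GetString() ?? "", "<[^>]+>", " ");
            string text = Regex.Replace(WebUtility.HtmlDecode(noTags), @"\s+", " ").Trim();
            if (text.Length > 0) posts.Add(text);
        }
    }
    return posts;
}

string topicPayload = "{\"post_stream\":{\"posts\":[{\"cooked\":\"<p>Hello <b>world</b></p>\"},{\"cooked\":\"<p>Second &amp; last</p>\"}]}}";
string preloaded = JsonSerializer.Serialize(new Dictionary<string, string> { ["topic_42"] = topicPayload });

foreach (string post in ExtractPosts(preloaded))
    Console.WriteLine(post);
```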
WCXB lift: 6 of 13 catastrophic forum pages go from F1=0 to
F1=0.83–0.99. Forum category F1 0.477 → 0.535.
Wcxb plain-text profile
WCXB-style word-overlap benchmarks score against plain-text gold.
The default MainContentOnly / RagFull output emits GFM Markdown —
headings, lists, sidebar TOCs, multi-paragraph blockquotes — that
improves AI / human readability but registers as precision noise
against plain-text comparison.
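
To see why Markdown syntax registers as precision noise, consider a simple word-overlap score. This sketch uses lowercase whitespace tokens and multiset overlap; the WCXB benchmark's exact tokenisation is not specified here, so treat the numbers as illustrative only.

```csharp
using System;
using System.Collections.Generic;

static string[] Tokens(string text) =>
    text.ToLowerInvariant().Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

// Multiset word overlap between a prediction and the plain-text gold.
static (double Precision, double Recall, double F1) WordOverlap(string predicted, string gold)
{
    string[] predictedTokens = Tokens(predicted);
    string[] goldTokens = Tokens(gold);

    var remaining = new Dictionary<string, int>();
    foreach (string word in goldTokens)
    {
        remaining.TryGetValue(word, out int count);
        remaining[word] = count + 1;
    }

    int overlap = 0;
    foreach (string word in predictedTokens)
    {
        if (remaining.TryGetValue(word, out int left) && left > 0)
        {
            remaining[word] = left - 1; // each gold word can be matched once
            overlap++;
        }
    }

    if (overlap == 0) return (0, 0, 0);
    double precision = (double)overlap / predictedTokens.Length;
    double recall = (double)overlap / goldTokens.Length;
    return (precision, recall, 2 * precision * recall / (precision + recall));
}

// Same words, but the Markdown version carries "##" and ">" as extra tokens.
var markdown = WordOverlap("## Title\n> quoted body text", "Title quoted body text");
var plain = WordOverlap("Title quoted body text", "Title quoted body text");
Console.WriteLine($"markdown F1={markdown.F1:F3} plain F1={plain.F1:F3}");
```

Recall is identical in both cases; only precision drops, which matches the description of structure markers counting against the Markdown output.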
New ExtractionProfile.Wcxb uses MainContentOnly's role-set but
emits each block's plain Text instead of its Markdown. Strictly
a benchmark / comparison profile — runtime callers keep their
existing profile and continue getting structured GFM.
Heuristic + selection fixes
DomCleaner: strip <select> globally so <option> text stops
leaking on category dropdowns. mostlylucid.net opened with 290+
category names dumped into the output; now opens with the actual
blog list.
IntraBlockCleaner: content-guard the contamination-hint substring
match. "sidebar" substring was eating WordPress / SNOFlex article
bodies whose class contained "sidebar-mode-single". 28 catastrophic
article pages recovered.
LayoutExtractor: body-text fallback for old-school flat HTML
without <main>/<article>/section wrappers. erikdemaine.org/foldcut
and similar plain H1/H2/P-under-body pages now extract.
LayoutExtractor: detect chrome-heavy applicator output as bug-out.
Stale templates applied to wrong-shape pages produced 1 char of
MainContent while combinedText looked fine (header + footer
selectors found chrome). esprit-barbecue, nike, rei collections
recovered.
HeuristicBlockClassifier: empty-semantic-wrapper handling and
body-spanning <form> fall-through. ASP.NET WebForms pages
(drainblasterbill, etc.) recovered.
Framework-content-class-hints: 20 new patterns — Discourse, phpBB,
vBulletin, PrestaShop, WooCommerce, Shopify, BigCommerce,
Squarespace, Webflow, Wix, Joomla, GitHub Pages, plus some misc.
Benchmark harness
WCXB harness gains --operator-templates <root> for loading
YAML files produced by `template train`, --page-ids for fast
repro of individual failures.
Aggregate WCXB (1495 dev pages, Wcxb profile):
| System | F1 | Precision | Recall |
|-------------------|-------:|----------:|-------:|
| StyloExtract v1.8.0-alpha.2 | 0.760 | 0.756 | 0.849 |
| rs-trafilatura | 0.859 | 0.863 | 0.890 |
| Trafilatura | 0.791 | 0.852 | 0.793 |
| Readability | 0.675 | 0.685 | 0.713 |
Compatibility
Backwards-compatible with 1.8.0-alpha.1. All changes are either new
code paths (Discourse extractor, Wcxb profile, train CLI), strictly
better selection (the heuristic fixes), or schema-additive
(TemplateEnrichmentJob gains optional Kind / BadMarkdownSample with
default Induce). Existing operator templates and trained YAMLs from
alpha.1 continue to work unchanged.
Suite: 486 tests across 10 projects, all green.