Web Sieve

The web today feels full of noise: advertising, SEO filler, bloat. I wanted a cleaner browsing experience that goes further than what adblockers offer: not just hiding elements after the page loads, but never fetching unneeded images, CSS, and JavaScript in the first place. I chose Starlark scripts to express filtering rules so that the community can contribute their own. I also want a uniform interface for similar types of websites (news, weather forecasts, forums, download pages, manga, streaming, forges), where a shared template handles the layout and only the content changes.

Browser extensions work from inside the browser: they can hide elements, intercept fetch calls, and inject scripts. But much of their work happens after the network: for anything they do not block outright, the bytes still travel the wire, the JavaScript still executes, and the trackers still phone home before the extension gets a chance to react. And on most mobile browsers, or in apps that embed a WebView, extensions are not available at all.

The premise of web-sieve is different. If you put a proxy between your machine and the internet, you can modify content before the browser ever sees it: at the byte-stream level, before layout, before script evaluation, before the tracker has a chance to run. This is the gap that MITM proxies like mitmproxy fill, but they require manual scripting in Python and run every intercepted request through the Python interpreter. web-sieve is built around a streaming HTML rewriter and a Starlark scripting engine that compiles rules once and hot-reloads them without restarting anything.

The secondary problem was trust. HTTPS means the browser encrypts traffic that the proxy needs to read. The standard solution (generate a local Root CA, install it in the OS trust store, issue leaf certificates on the fly) works for the browser side, but the proxy then opens the upstream connection itself, and most anti-bot systems (Cloudflare, Akamai) fingerprint that TLS handshake. A non-browser TLS signature triggers bot challenges on a large share of the web. web-sieve spoofs the outgoing TLS fingerprint (JA3/JA4) to match the browser named in the user’s User-Agent, so upstream servers see a handshake they recognise.

How it works

Browser → web-sieve (local proxy)
              │
              ├─ CONNECT intercept → on-the-fly leaf cert (rcgen Root CA)
              │                      TLS termination (tokio-rustls)
              │
              ├─ Request pipeline → Starlark rules → upstream fetch
              │                     (TLS fingerprint spoofing)
              │
              └─ Response pipeline → lol-html byte-stream rewriter
                                     aho-corasick content scanner
                                     → clean response to browser

web-sieve generates a local Root CA on first run. Install its certificate once in your OS or browser trust store and all HTTPS traffic is transparently intercepted and cleaned. Every response passes through two sequential stages: the Starlark rules decide which transformations to apply, and the lol-html streaming rewriter applies them at the byte level with bounded memory and sub-millisecond overhead.

Scripting engine

Rules are written in Starlark, Google’s deterministic, sandboxed configuration language — syntactically close to Python but with no side effects and a predictable evaluation model. A rule can target a URL pattern, inspect request and response headers, and call a set of structural operations on the HTML: remove_element(), keep_only(), inject_style(), and others.
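To make the shape of a rule concrete, here is a minimal sketch written as plain Python, which Starlark closely resembles. Only remove_element(), keep_only(), and inject_style() come from the description above; the Page class, the on_response hook name, the PATTERN constant, and the example URL and selectors are assumptions made for illustration, not the project's actual API.

```python
import fnmatch


class Page:
    """Hypothetical host object: records the operations a rule requests."""

    def __init__(self, url, headers):
        self.url = url
        self.headers = headers
        self.ops = []  # a rewriter would later apply these to the stream

    def remove_element(self, selector):
        self.ops.append(("remove_element", selector))

    def keep_only(self, selector):
        self.ops.append(("keep_only", selector))

    def inject_style(self, css):
        self.ops.append(("inject_style", css))


# --- the part a rule author would write (intended to read as Starlark too) ---
PATTERN = "https://news.example.com/*"  # hypothetical URL pattern


def on_response(page):
    content_type = page.headers.get("content-type", "")
    if not content_type.startswith("text/html"):
        return  # leave images, JSON, etc. untouched
    page.remove_element("aside.newsletter")
    page.keep_only("article")
    page.inject_style("body { max-width: 40em; margin: auto; }")


# --- host side: match the URL, run the rule, return the requested operations ---
def apply_rule(page):
    if fnmatch.fnmatch(page.url, PATTERN):
        on_response(page)
    return page.ops
```

The rule itself only declares what should happen; it never touches bytes. That separation is what would let a host compile the rule once and hand the recorded operations to a streaming rewriter.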

Rules hot-reload on save using a file watcher. The proxy keeps running; only the changed rule is recompiled and swapped in atomically. There is no restart, no dropped connections, no state loss.
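The swap described above can be sketched as follows. This is an illustration in Python under stated assumptions, not the proxy's implementation: RuleStore, snapshot(), and reload() are hypothetical names, file modification times stand in for a real file watcher, and compile_fn is whatever turns rule source into a compiled rule.

```python
import os
import threading


class RuleStore:
    """Hypothetical holder for compiled rules, keyed by file path."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._rules = {}   # replaced wholesale on reload, never mutated
        self._mtimes = {}
        self._reload_lock = threading.Lock()  # serialises reloads, not reads

    def snapshot(self):
        # A request grabs the current dict once and keeps using it, so an
        # in-flight request never observes a half-updated rule set.
        return self._rules

    def reload(self, paths):
        """Recompile only files whose mtime changed; return their paths."""
        with self._reload_lock:
            rules = dict(self._rules)
            mtimes = dict(self._mtimes)
            changed = []
            for path in paths:
                mtime = os.stat(path).st_mtime_ns
                if mtimes.get(path) == mtime:
                    continue  # unchanged file: keep the compiled rule
                with open(path, encoding="utf-8") as handle:
                    rules[path] = self._compile(handle.read())
                mtimes[path] = mtime
                changed.append(path)
            if changed:
                # Publishing is one reference assignment. If compile_fn
                # raised above, nothing was published and the old rules
                # stay active.
                self._rules, self._mtimes = rules, mtimes
            return changed
```

Building a new dictionary and publishing it with a single assignment means a broken rule file cannot take down the rules that were already running.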

Features

Ad-blocking and cosmetic filtering — network-level blocklists (EasyList, uBlock Origin format) parsed into a prefix-tree engine. Cosmetic rules inject <style> blocks via lol-html to hide ad containers without breaking single-page app hydration.
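As an illustration of the prefix-tree idea, the sketch below stores blocked domains label by label in reverse order, so a rule for a domain also covers its subdomains. It is written in Python under simplifying assumptions: only the plain ||domain^ rule form is handled, and DomainTrie and load_blocklist are hypothetical names, not the project's engine.

```python
class _Node:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}
        self.terminal = False


class DomainTrie:
    """Trie over reversed DNS labels: com -> example -> ads."""

    def __init__(self):
        self._root = _Node()

    def add(self, domain):
        node = self._root
        for label in reversed(domain.lower().split(".")):
            node = node.children.setdefault(label, _Node())
        node.terminal = True

    def matches(self, host):
        """True if host equals a blocked domain or is a subdomain of one."""
        node = self._root
        for label in reversed(host.lower().split(".")):
            node = node.children.get(label)
            if node is None:
                return False
            if node.terminal:
                return True
        return False


def load_blocklist(lines):
    trie = DomainTrie()
    for line in lines:
        line = line.strip()
        # Comments ("!"), cosmetic rules ("##"), options ("$...") and
        # path or wildcard rules are skipped in this sketch.
        if line.startswith("||") and line.endswith("^"):
            domain = line[2:-1]
            if domain and "/" not in domain and "*" not in domain and "$" not in domain:
                trie.add(domain)
    return trie
```

Lookup cost depends on the number of labels in the host name, not on the size of the list, which is why this kind of structure suits lists with tens of thousands of entries.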

Reader mode — a Rust port of Mozilla Readability extracts the main article body, strips navigation, sidebars, and comment sections, and wraps the result in a minimal uniform template. A Starlark hook can forward saved URLs to read-later services (Shiori, Pocket) via webhook.

Manga and manhwa pipeline — a Starlark schema targets continuous-scroll image containers on scanlation sites and extracts the image strip. An internal image proxy endpoint spoofs Referer and Origin headers to bypass hotlinking protections on image hosts.

Download page simplification — Starlark recipes scrape torrent and DDL pages and re-render them as clean searchable HTML tables. Webhook buttons send magnet links directly to qBittorrent on click, skipping the ad-ridden source page entirely.

Content safety and parental controls — aho-corasick multi-pattern scanning on raw text payloads. Each flagged term carries a weight, and a page is blocked only when the weighted density of flagged terms exceeds a configurable threshold, which avoids the false positives that plague simple substring matching.
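A weighted density threshold of this kind can be sketched as below. This is a Python illustration with assumptions: str.count stands in for the single-pass Aho-Corasick scan, and the terms, weights, per-100-words normalisation, and function names are all hypothetical.

```python
def flagged_density(text, weights):
    """Weighted flagged-term hits per 100 words of text."""
    normalized = " ".join(text.lower().split())
    word_count = len(normalized.split())
    if word_count == 0:
        return 0.0
    # A real scanner would find all terms in one pass; counting each term
    # separately gives the same totals for non-overlapping matches.
    hits = sum(weight * normalized.count(term) for term, weight in weights.items())
    return 100.0 * hits / word_count


def should_block(text, weights, threshold):
    return flagged_density(text, weights) >= threshold
```

With weights such as {"casino": 3.0, "bonus": 1.0} and a threshold of 5.0, a nine-word spam line that mentions "casino" twice scores far above the threshold, while a 200-word news article that mentions it once stays well below.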

Heuristic anti-phishing — the rewriter intercepts <form> elements containing credential or PII inputs and correlates DOM structure with semantic panic triggers (“verify your account”, “unauthorized access detected”) on untrusted domains. A form that matches both signals is flagged before the user can submit anything.
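The two-signal correlation can be sketched as follows. This Python illustration uses the standard-library HTML parser rather than a streaming rewriter, and the field-name hints, the trusted-host set, and the function names are assumptions; only the two panic phrases come from the text above.

```python
from html.parser import HTMLParser

PANIC_PHRASES = ("verify your account", "unauthorized access detected")
CREDENTIAL_TYPES = {"password"}
CREDENTIAL_NAME_HINTS = ("password", "passwd", "ssn", "card", "cvv")  # hypothetical


class FormScanner(HTMLParser):
    """Collects page text and notes whether a form asks for credentials."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.credential_form = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            attributes = dict(attrs)
            input_type = (attributes.get("type") or "").lower()
            name = (attributes.get("name") or "").lower()
            if input_type in CREDENTIAL_TYPES or any(h in name for h in CREDENTIAL_NAME_HINTS):
                self.credential_form = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        self.text.append(data)


def is_suspicious(html, host, trusted_hosts):
    """Flag only when both signals fire on a host that is not trusted."""
    if host in trusted_hosts:
        return False
    scanner = FormScanner()
    scanner.feed(html)
    scanner.close()
    page_text = " ".join(" ".join(scanner.text).lower().split())
    has_panic = any(phrase in page_text for phrase in PANIC_PHRASES)
    return scanner.credential_form and has_panic
```

Requiring both signals is the point of the design: a login form alone is ordinary, and alarming wording alone is common in legitimate security notices.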

TLS fingerprint spoofing — outgoing TLS handshakes are crafted to match real browser signatures, so that Cloudflare and Akamai bot detection does not flag the proxy as a non-browser client. This is transparent: you browse with your normal browser and the proxy imitates the handshake of the browser named in its User-Agent.
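To show which handshake fields a fingerprint depends on, the sketch below derives a JA3 hash from ClientHello values, following the published JA3 recipe: five fields joined by commas, values within a field joined by hyphens, GREASE placeholders dropped, then MD5. It is written in Python for illustration only; it computes a fingerprint and does not show how web-sieve shapes its handshakes, and the function names are hypothetical.

```python
import hashlib

# GREASE values (RFC 8701) are random placeholders browsers insert into the
# ClientHello; JA3 ignores them: 0x0A0A, 0x1A1A, ..., 0xFAFA.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string from decimal ClientHello values."""

    def field(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [str(version), field(ciphers), field(extensions), field(curves), field(point_formats)]
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()
```

Because the order of cipher suites and extensions is part of the string, a client has to reproduce a browser's ordering, not just its set of supported features, to produce the same fingerprint.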

Alternatives

Browser extensions

uBlock Origin is the gold standard for ad and tracker blocking. It intercepts network requests at the browser level using declarative filter lists and cosmetic rules, and blocks asset loading for matched URLs. It is fast, well-maintained, and has a vast community of list curators. The fundamental limitation is structural: it runs inside the browser, so it only works in that browser. Mobile browsers with extension support are few; native apps, WebViews, and other HTTP clients are invisible to it. It also cannot rewrite page structure beyond hiding elements — it cannot re-render a page with a different layout.

LibRedirect and Privacy Redirect take a different approach: instead of rewriting a page, they redirect you away from it entirely — YouTube becomes Invidious, Twitter becomes Nitter, Reddit becomes Redlib. This is elegant when it works, but it depends on community-hosted instances that you do not control. Instances go down, get overloaded, or disappear. When the upstream platform changes its API or HTML structure, all instances break simultaneously until someone patches the alternative frontend. And the coverage is inherently limited to the sites that have an established alternative frontend — niche sites, forums, and anything outside the major platforms are out of scope.

Alternative frontends

Invidious, Nitter, Redlib, Rimgo, and similar projects are server-side rewrites of specific platforms. They solve the clean-interface problem well for their target site, and self-hosting gives you full control. The cost is operational: each frontend is a separate service to deploy, update, and monitor. When the upstream platform changes its internal API — which happens regularly and deliberately — the frontend breaks and you wait for a patch. Coverage is fixed to whatever sites have an active frontend project; there is no general mechanism for writing a recipe for an arbitrary site.

Proxies

Privoxy is a filtering HTTP proxy that has been around since 2001. It supports action files (block, redirect, rewrite) and filter files (regex substitutions on response bodies). It is the closest historical precedent to web-sieve: a local proxy that modifies content before the browser sees it. Its limitations come from its age. It has no streaming rewriter: it buffers the full response before applying filters. HTTPS inspection was added only in recent releases, as an optional feature that must be enabled explicitly; older setups chain it with an external tool. Its filter language is regex, not a scriptable engine. Configuration is spread across multiple file formats with a steep learning curve. It remains useful for simple blocking rules on a network gateway, but it was not designed for structural HTML rewriting.

mitmproxy is a Python-based MITM proxy with a full HTTPS interception stack and a scripting API. It is powerful and well-documented, and widely used for security testing and debugging. For production content filtering it has a significant throughput cost: every intercepted request runs through Python hooks, which adds latency and CPU overhead that scales poorly under concurrent load. Its addon API exposes the full request/response cycle but provides no streaming HTML rewriter: you manipulate raw bytes or parse the HTML yourself. Scripts loaded with -s are reloaded when the file changes, but the whole script is re-executed rather than a single rule being swapped in.

Comparison

	uBlock Origin	LibRedirect	Alt. frontends	Privoxy	mitmproxy	web-sieve
Works outside the browser	No	No	N/A	Yes	Yes	Yes
HTTPS interception	No	No	N/A	Optional (recent releases)	Yes	Yes
Blocks asset loading	Yes	Yes (redirect)	N/A	Yes	Yes	Yes
Structural HTML rewriting	Limited (CSS hide)	No	Yes	Regex only	Manual	Yes (streaming)
Uniform templates across sites	No	No	Per-site only	No	No	Planned
Community scripts	Yes (filter lists)	Yes (instances)	Yes (per-project)	Yes (filter files)	Yes (addons)	Yes (Starlark)
Self-contained, no external servers	Yes	No	No	Yes	Yes	Yes
TLS fingerprint spoofing	No	No	N/A	No	No	Yes
Hot-reload rules without restart	N/A	N/A	N/A	Partial	Yes (whole script)	Yes
Mobile / non-browser apps	No	No	No	Yes	Yes	Yes

Status

The core proxy infrastructure, TLS MITM, Starlark scripting engine, streaming HTML pipeline, ad-blocking, reader mode, content scanning, and anti-phishing engine are implemented. The manga pipeline and download-page simplification are in progress. HTTP/2 on the browser↔proxy leg and a MiniJinja template engine for user-customisable views are planned.

The repository will be published when the implementation reaches a stable baseline.