Web Scraping Municipal Permit Portals with Python

Web scraping is the acquisition stage of the broader automated permit ingestion and parsing workflows: the code that turns a fragmented public portal — one with no API, a 2009-era search form, and a session cookie that expires mid-crawl — into a clean, deduplicated stream of permit records. Municipalities are digitizing permitting, inspection, and compliance tracking faster than they are exposing machine-readable endpoints, so for the foreseeable future the only way many jurisdictions will hand you their data is through the same HTML a clerk sees in a browser. This guide builds that acquisition layer the way the rest of the pipeline expects it: stateful, polite, auditable, and producing output that downstream routing, parsing, and reporting can trust.

Problem Statement and Scope

Without a disciplined scraper, permit acquisition fails in ways that are invisible until an audit or a missed deadline exposes them. A time.sleep() guess races a slow municipal server and silently drops half a result page. A session cookie expires on page 40 of 200 and the crawler returns a partial dataset that looks complete. A portal ships a cosmetic UI update and a brittle selector starts pulling the wrong column into the permit_number field. None of these throw an exception; they just corrupt the record set quietly.

Three groups feel that directly. Python automation builders own the fragile selectors and the retry logic that has to survive nightly portal maintenance windows. Municipal clerks lose trust the moment a scraped record disagrees with what they see on screen. Compliance officers cannot certify a dataset whose provenance — which URL, at what time, with which selector version — was never recorded.

The scope of this component is deliberately bounded. Its input is a portal base URL plus the search parameters that define a jurisdiction’s permit universe (a date range, a set of permit-type codes, a status filter). Its output is a stream of validated record objects, each carrying a stable identity key and an extraction-provenance stamp, handed off to the next pipeline stage. Everything between those two points — transport selection, session handling, pagination, extraction, and normalization — is what this guide covers. Parsing the contents of attached documents is out of scope here and belongs to parsing PDF permit applications with OCR and layout analysis.
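One way to picture the extraction-provenance stamp mentioned above is a small immutable record attached to each extracted row. This is a sketch only: the `Provenance` class, its field names, and the `selector_version` label are assumptions made for illustration, not the pipeline's actual contract.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Hypothetical stamp recording where and how a row was extracted."""
    source_url: str
    scraped_at: datetime
    selector_version: str  # assumed label for the selector set in use


def stamp(source_url: str, selector_version: str) -> Provenance:
    # Always record UTC so stamps from different crawls compare cleanly.
    return Provenance(source_url, datetime.now(timezone.utc), selector_version)


p = stamp("https://portal.example/search?page=1", "selectors-v3")
print(asdict(p)["selector_version"])
```

Freezing the dataclass means a stamp cannot be altered after extraction, which is the property an auditor would likely care about.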

Prerequisites

This component targets Python 3.10+ (the examples use X | Y type unions in annotations, which require 3.10). Install the runtime dependencies:

pip install "httpx>=0.27" "selectolax>=0.3" "pydantic>=2.6" "tenacity>=8.2" "playwright>=1.43"
python -m playwright install chromium   # only needed for SPA portals

You will also need:

A reconnaissance pass on the target portal. Open it in a browser with the network panel recording and note whether the permit grid arrives in the initial HTML or via a later XHR/Fetch call — that single fact decides your transport.
Documented authorization to scrape. Confirm the jurisdiction’s terms of service permit automated access, check robots.txt, and where a clerk login is involved, use credentials issued to your integration, not a person’s. Where the portal exposes a sanctioned API behind keys, prefer it — that path is covered in securing municipal API endpoints for third-party integrations.
A canonical record schema to normalize into. The scraper should not invent field names; it maps the portal’s columns onto the shared contract defined in designing JSON schemas for building permits.

Fingerprint the Portal and Choose a Transport

The first real step is reconnaissance, not coding. Municipal portals fall into two families, and picking the wrong transport wastes the entire build. Server-rendered portals — legacy ASP.NET, PHP, ColdFusion, or JavaServer Faces stacks — return the permit grid fully formed in the initial HTML response. These are ideal for a lightweight HTTP client: httpx with selectolax parses them at thousands of rows per second with a tiny memory footprint. Client-rendered portals — React, Angular, Vue, or legacy ExtJS dashboards — ship an empty shell and populate the grid through asynchronous API calls, so a raw HTTP fetch sees no data at all.

Detect the family programmatically before committing, rather than eyeballing it once and hard-coding an assumption:

import httpx
from selectolax.parser import HTMLParser

def fingerprint_portal(url: str, grid_selector: str) -> str:
    """Return 'static' if the permit grid is server-rendered, else 'spa'."""
    resp = httpx.get(url, timeout=20.0, follow_redirects=True)
    resp.raise_for_status()
    tree = HTMLParser(resp.text)
    # If the grid container exists AND already holds data rows, the server
    # rendered it. An empty shell (container present, no rows) signals a SPA
    # that hydrates client-side.
    rows = tree.css(f"{grid_selector} tr")
    return "static" if len(rows) > 1 else "spa"

If fingerprint_portal returns "static", stay on the HTTP path described below. If it returns "spa", drive a real browser — explicit waits tied to DOM mutations and network-idle states, never arbitrary sleeps — as detailed in using Playwright to scrape dynamic municipal permit dashboards. The two transports share everything downstream — the same session contract, the same normalizer, the same router — so this branch is the only place the choice leaks into your code.

Establish a Stateful Session

Permit portals almost never expose stateless endpoints. They issue session cookies, embed cross-site request forgery (CSRF) tokens in hidden form fields, and gate searches behind a multi-step filter flow. A scraper has to replicate a legitimate user’s sequence — load the search page, capture the token, submit the query with that token attached — or the server rejects it and, worse, may flag the source IP.

Centralize this in one session object so every request shares cookies and the current token. Reusing a single httpx.Client also keeps the TCP connection pool warm, which is both faster and gentler on the municipal server:

import httpx
from selectolax.parser import HTMLParser

class PortalSession:
    """Holds cookies + the current CSRF token across a crawl."""

    def __init__(self, base_url: str, *, user_agent: str) -> None:
        # A transparent UA that identifies the integration reduces the chance
        # of being mistaken for a hostile bot by the portal's WAF.
        self._client = httpx.Client(
            base_url=base_url,
            headers={"User-Agent": user_agent},
            timeout=30.0,
            follow_redirects=True,
        )
        self._csrf: str | None = None
        self._token_field: str | None = None

    def prime(self, search_path: str, token_field: str = "__RequestVerificationToken") -> None:
        """GET the search page once to seat cookies and read the CSRF token."""
        resp = self._client.get(search_path)
        resp.raise_for_status()
        node = HTMLParser(resp.text).css_first(f"input[name='{token_field}']")
        if node is None:
            raise RuntimeError(f"CSRF field {token_field!r} not found; portal layout changed")
        self._csrf = node.attributes.get("value")
        # Remember the field name so search() posts the token under the same key.
        self._token_field = token_field

    def search(self, search_path: str, payload: dict[str, str]) -> httpx.Response:
        body = dict(payload)
        if self._csrf and self._token_field:
            body[self._token_field] = self._csrf
        resp = self._client.post(search_path, data=body)
        # Leave 429 to the caller's rate-limit policy; raise on every other error status.
        if resp.status_code != 429:
            resp.raise_for_status()
        return resp

For portals that require a clerk login, inject credentials from environment variables or a secrets vault — never commit them — and intercept 401/403 responses to re-prime() the session automatically when a token ages out mid-crawl. Treating expiry as a recoverable, transient fault rather than a fatal error is what keeps a long paginated crawl from dying at page 40; that classification logic is shared with error handling and retry logic for ingestion pipelines.
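The re-prime-and-retry behaviour described above can be sketched independently of any HTTP library. The names here (`SessionExpired`, `with_reprime`, `max_reprimes`) are hypothetical: the caller's fetch function is assumed to raise `SessionExpired` when it sees a 401/403 or a login page served as 200 OK.

```python
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


class SessionExpired(Exception):
    """Raised by the caller's fetch function when the portal rejects the session."""


def with_reprime(fetch: Callable[[], T], reprime: Callable[[], None],
                 *, max_reprimes: int = 2) -> T:
    """Call fetch(); on SessionExpired, reprime() and retry the same request."""
    for attempt in range(max_reprimes + 1):
        try:
            return fetch()
        except SessionExpired:
            if attempt == max_reprimes:
                raise  # budget exhausted: surface the failure instead of looping
            reprime()
    raise AssertionError("unreachable")
```

With the session class above, `reprime` would plausibly be `lambda: session.prime(search_path)`. Capping the number of re-primes keeps a genuinely revoked credential from turning into an endless login loop.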

Drive Pagination and Polite Crawling

Most permit search interfaces page their results behind date ranges, permit classifications, and a sequential cursor. The crawler’s job is to walk every page exactly once, never advance past a boundary it has not confirmed, and never hammer a public server. Aggressive concurrency against a single municipal host is both rude and self-defeating — it triggers 429 Too Many Requests, WAF rules, or an IP ban that ends the crawl.

Bound the request rate explicitly and back off when the server signals overload. Honoring a Retry-After header is non-negotiable; it is the server telling you precisely how long to wait:

import time
from collections.abc import Iterator

from selectolax.parser import HTMLParser
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential_jitter

class RateLimited(Exception):
    """Raised on HTTP 429 so the retry policy can pause the crawl."""

@retry(
    retry=retry_if_exception_type(RateLimited),
    wait=wait_exponential_jitter(initial=2, max=60),  # jitter avoids thundering-herd retries
    stop=stop_after_attempt(5),
    reraise=True,
)
def fetch_page(session: PortalSession, search_path: str, page: int, *, params: dict[str, str]) -> str:
    resp = session.search(search_path, params | {"page": str(page)})
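    if resp.status_code == 429:
        # Respect the server's own pacing instruction before retrying.
        # Retry-After may also be an HTTP-date; fall back to 30 s if it is not an integer.
        retry_after = resp.headers.get("Retry-After", "30").strip()
        delay = int(retry_after) if retry_after.isdigit() else 30
        time.sleep(delay)
        raise RateLimited(f"throttled on page {page}")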
    return resp.text

def crawl(session: PortalSession, search_path: str, params: dict[str, str],
          *, min_interval: float = 1.0) -> Iterator[str]:
    """Yield each result page's HTML, one at a time, at a polite fixed cadence."""
    page = 1
    while True:
        html = fetch_page(session, search_path, page, params=params)
        tree = HTMLParser(html)
        if not tree.css("table.permit-results tr.data-row"):
            break  # an empty result body is the true end-of-pagination signal
        yield html
        page += 1
        time.sleep(min_interval)  # a deliberate floor between requests, not a race guess

Detecting the end of pagination from the content — an empty result body — is far more robust than trusting a “Next” button’s disabled state or a total-count label, both of which legacy portals frequently render incorrectly.

Extract and Normalize Records

Raw HTML is only half the job. Portals bury the data you need in irregular table layouts, dynamically generated JavaScript variables, and occasionally embedded JSON-LD blocks. Pin every field to an explicit selector with a fallback, then immediately map the messy result onto a strict, validated model so malformed rows never reach the pipeline. Validating at the door — with pydantic rejecting bad records loudly — is what keeps reconciliation cost out of every downstream stage.

from datetime import date, datetime, timezone
from pydantic import BaseModel, field_validator
from selectolax.parser import HTMLParser, Node

class PermitRecord(BaseModel):
    permit_number: str
    permit_type: str
    status: str
    applied_on: date
    parcel_id: str | None = None
    source_url: str
    scraped_at: datetime

    @field_validator("permit_number", "permit_type", "status")
    @classmethod
    def _strip(cls, v: str) -> str:
        cleaned = " ".join(v.split())          # collapse stray whitespace/newlines from HTML
        if not cleaned:
            raise ValueError("required text field was empty after cleaning")
        return cleaned

def _cell(row: Node, selector: str) -> str:
    node = row.css_first(selector)
    return node.text(strip=True) if node else ""

def extract_rows(html: str, source_url: str) -> list[PermitRecord]:
    rows = HTMLParser(html).css("table.permit-results tr.data-row")
    records: list[PermitRecord] = []
    for row in rows:
        records.append(PermitRecord(
            permit_number=_cell(row, "td.col-permit"),
            permit_type=_cell(row, "td.col-type"),
            status=_cell(row, "td.col-status"),
            # Normalize the portal's MM/DD/YYYY into a real date object up front.
            applied_on=datetime.strptime(_cell(row, "td.col-date"), "%m/%d/%Y").date(),
            parcel_id=_cell(row, "td.col-parcel") or None,
            source_url=source_url,
            scraped_at=datetime.now(timezone.utc),
        ))
    return records
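Because a single malformed row raises out of the loop above, one bad cell can discard an entire page. A hedged alternative is to build rows one at a time and set failures aside for review. In this sketch, `build_or_quarantine` and `parse_applied_on` are hypothetical names, and the demo builder parses only the date column.

```python
from collections.abc import Callable, Iterable
from datetime import date, datetime
from typing import TypeVar

R = TypeVar("R")
RawRow = dict[str, str]


def build_or_quarantine(
    raw_rows: Iterable[RawRow], build: Callable[[RawRow], R]
) -> tuple[list[R], list[tuple[RawRow, str]]]:
    """Build one record per row; set aside rows whose build raises ValueError."""
    good: list[R] = []
    quarantined: list[tuple[RawRow, str]] = []
    for raw in raw_rows:
        try:
            good.append(build(raw))
        except ValueError as exc:
            # Keep the raw cells and the reason so a reviewer can triage later.
            quarantined.append((raw, str(exc)))
    return good, quarantined


def parse_applied_on(raw: RawRow) -> date:
    # Stand-in builder for the demo: only parses the date column.
    return datetime.strptime(raw["applied_on"], "%m/%d/%Y").date()


rows = [{"applied_on": "03/14/2026"}, {"applied_on": ""}]
good, bad = build_or_quarantine(rows, parse_applied_on)
print(len(good), len(bad))
```

In the real extractor the builder would construct a `PermitRecord`. Both `strptime` failures and pydantic's `ValidationError` are `ValueError` subclasses, so the same `except` clause should cover them.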

Normalization is more than parsing dates. Map each jurisdiction’s permit-type strings onto your internal codes so "BLDG-RES" from one portal and "Residential Building" from another converge — a crosswalk that lives in the shared permit code taxonomy. Where a row exposes a parcel identifier, it is worth validating against authoritative GIS records, the subject of linking zoning codes to parcel IDs.
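The permit-type crosswalk can be pictured as a lookup keyed by source system and cleaned label. The table contents, the `city_a`/`city_b` system names, and the `RES_BUILDING` code below are invented for illustration; the real mapping lives in the shared taxonomy.

```python
# Hypothetical crosswalk: (source system, cleaned label) -> internal code.
PERMIT_TYPE_CROSSWALK: dict[tuple[str, str], str] = {
    ("city_a", "bldg-res"): "RES_BUILDING",
    ("city_b", "residential building"): "RES_BUILDING",
}


def canonical_permit_type(source_system: str, raw: str) -> str:
    """Map a portal's permit-type string onto an internal code, or fail loudly."""
    # Collapse whitespace and ignore case so cosmetic differences do not matter.
    key = (source_system, " ".join(raw.split()).casefold())
    try:
        return PERMIT_TYPE_CROSSWALK[key]
    except KeyError:
        raise ValueError(
            f"unmapped permit type {raw!r} for {source_system!r}"
        ) from None


print(canonical_permit_type("city_a", "BLDG-RES"))
print(canonical_permit_type("city_b", "  Residential   Building "))
```

Raising on an unmapped label, rather than passing it through, keeps a newly introduced portal category from entering the dataset unclassified.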

Route into the Ingestion Pipeline

The final stage hands validated records to the pipeline without ever creating a duplicate. Portals routinely re-display the same permit across overlapping date queries, and any crawl can be replayed after a failure, so the router must be idempotent. Build a stable identity key — the application number scoped to its source system is ideal — and let the database enforce uniqueness with a conditional upsert:

import asyncpg

def identity_key(record: PermitRecord, source_system: str) -> str:
    # Application number + source is stable across re-crawls; never hash volatile
    # fields like scraped_at or status, which change between runs.
    return f"{source_system}:{record.permit_number}"

async def route(pool: asyncpg.Pool, record: PermitRecord, source_system: str) -> bool:
    """Insert a record; return True if new, False if the replay was a no-op."""
    row = await pool.fetchrow(
        """
        INSERT INTO permits (id, permit_number, permit_type, status,
                             applied_on, parcel_id, source_url, scraped_at)
        VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
        ON CONFLICT (id) DO NOTHING
        RETURNING id
        """,
        identity_key(record, source_system), record.permit_number, record.permit_type,
        record.status, record.applied_on, record.parcel_id, record.source_url, record.scraped_at,
    )
    return row is not None

Configuration Reference

Parameter	Type	Default	Municipal-context notes
`user_agent`	`str`	—	Identify the integration explicitly (name + contact URL). Anonymous or spoofed browser strings are what trip a municipal WAF.
`min_interval`	`float` (s)	`1.0`	Floor between requests to one host. Raise to 2–3 s for small jurisdictions on shared hosting; their public site must stay responsive to residents.
`timeout`	`float` (s)	`30.0`	Legacy portals stall under load. A generous read timeout plus retries beats a tight timeout that drops valid slow responses.
`max_attempts`	`int`	`5`	Retry budget for transient faults (429/5xx/expired token). Cap it hard so an outage cannot grow an unbounded backlog.
`token_field`	`str`	`__RequestVerificationToken`	Hidden CSRF field name. Varies by stack (ASP.NET vs. PHP); confirm during reconnaissance.
`transport`	`'static' | 'spa'`	auto	Output of `fingerprint_portal`. Pin it per jurisdiction once confirmed, but re-check on extraction failures, since a portal can be rebuilt as a SPA overnight.

Error Handling and Edge Cases

Municipal portals fail in characteristic ways, and each needs a specific response rather than a blanket retry:

Mid-crawl session expiry. A long paginated walk outlives its session cookie and starts returning the login page as 200 OK. Detect it by asserting an expected element exists in every result page; on absence, re-prime() the session and retry the current page rather than advancing.
Encoding mismatches. Older portals declare utf-8 but emit Windows-1252, turning a property owner’s name into mojibake. Trust the bytes, not the header: resp.text in httpx follows the declared charset, so decode resp.content yourself (strict UTF-8 first, then a Windows-1252 fallback or a content-based detector), and normalize to NFC before persisting.
Cosmetic layout drift. A portal restyles its grid and a selector silently matches the wrong column. The empty-field validators in PermitRecord turn that into a loud ValidationError instead of a corrupted row — quarantine those records for review instead of writing them.
Hostile responses to concurrency. A burst trips a WAF that returns CAPTCHA pages or 403s. Back off, lower concurrency, and treat a sustained block as a signal to fall back to a sanctioned export rather than escalating — see syncing legacy CSV exports to modern databases.
Duplicate rows across queries. Overlapping date windows re-surface the same permit. The idempotent route upsert absorbs this, but log the no-op rate; a sudden spike usually means your pagination is looping.
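For the encoding-mismatch case in the list above, one simple approach is to attempt a strict UTF-8 decode and fall back to Windows-1252 only when that fails. This is a sketch under that assumption. The function name `decode_portal_bytes` is hypothetical, and a portal emitting some other legacy encoding would need a different fallback.

```python
import unicodedata


def decode_portal_bytes(raw: bytes) -> str:
    """Decode a response body by content, then normalize to NFC."""
    try:
        # Strict decode: genuine UTF-8 passes, Windows-1252 accents usually fail.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback for legacy portals; undefined bytes become U+FFFD.
        text = raw.decode("cp1252", errors="replace")
    return unicodedata.normalize("NFC", text)


print(decode_portal_bytes("José".encode("cp1252")))
```

With httpx this would be fed `resp.content` (the raw bytes) rather than `resp.text`.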

Testing and Verification

Scrapers rot because portals change underneath them, so the test suite must pin behavior against saved fixtures, not the live site. Capture a representative result page once, commit it as a fixture, and assert your extractor produces the exact records you expect — including the awkward rows (missing parcel ID, a status with trailing whitespace):

from pathlib import Path

def test_extract_handles_missing_parcel_and_whitespace() -> None:
    html = Path("tests/fixtures/results_page_1.html").read_text(encoding="utf-8")
    records = extract_rows(html, source_url="https://portal.example/search?page=1")

    assert len(records) == 25
    first = records[0]
    assert first.permit_number == "BLD-2026-0001"   # whitespace collapsed by the validator
    assert first.parcel_id is None                   # blank cell becomes None, not ""
    assert first.applied_on.isoformat() == "2026-03-14"

Beyond unit tests, run an end-to-end dry run against a mock server that serves your fixtures across several “pages” and then an empty body. Confirm the crawler stops on the empty page, every record validates, and a second identical run inserts zero new rows — proof the idempotency key holds. A passing dry run is the evidence a compliance officer needs that the dataset is complete and reproducible.
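The "second identical run inserts zero new rows" check can be rehearsed without a Postgres instance. The sketch below uses an in-memory SQLite database as a stand-in, which is an assumption made for convenience: SQLite 3.24+ accepts the same `ON CONFLICT ... DO NOTHING` clause, but it is not the production database, and `route_sqlite` with its three-column table is a simplified, hypothetical version of the router.

```python
import sqlite3


def route_sqlite(conn: sqlite3.Connection, record_id: str,
                 permit_number: str, status: str) -> bool:
    """Insert a row; return True if new, False if the id already existed."""
    cur = conn.execute(
        "INSERT INTO permits (id, permit_number, status) VALUES (?, ?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        (record_id, permit_number, status),
    )
    return cur.rowcount == 1  # 0 rows changed means the replay was a no-op


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE permits (id TEXT PRIMARY KEY, permit_number TEXT, status TEXT)"
)

batch = [
    ("city_a:BLD-2026-0001", "BLD-2026-0001", "Issued"),
    ("city_a:BLD-2026-0002", "BLD-2026-0002", "In Review"),
]

first_run = sum(route_sqlite(conn, *row) for row in batch)
second_run = sum(route_sqlite(conn, *row) for row in batch)
print(first_run, second_run)
```

A dry run against the real schema would assert the same two numbers: the batch size on the first pass and zero on the replay.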

Integration Notes

This component is the front door of the ingestion pipeline, and its output feeds several adjacent stages directly. Validated records flow into the resilience layer, where transient scrape faults are classified and retried by error handling and retry logic for ingestion pipelines. When a crawl covers many jurisdictions or back-fills years of history, the per-host rate limits here must be coordinated with worker pools and bounded concurrency, which is the job of implementing async batch processing for high-volume submissions.

Reads of the data you land are accelerated separately: once permits are in the database, the cache-warming strategies for permit lookup APIs keep public lookups fast without re-hitting the source portal. And where a row references an attached application document, the scraper records the document URL and hands it off to the OCR and layout-analysis stage rather than parsing it inline.

Frequently Asked Questions

Should I use requests/httpx or a headless browser like Playwright?

Let the portal decide. If the permit grid arrives fully formed in the initial HTML response, an HTTP client with an HTML parser is faster, lighter, and far less fragile — use it. If the grid is hydrated by a later XHR/Fetch call (React, Angular, Vue, ExtJS), a raw HTTP fetch sees no data and you need a real browser. The fingerprint_portal check decides this programmatically so you do not guess.

How do I scrape politely without getting the source IP blocked?

Send a transparent User-Agent that names your integration, keep a deliberate floor between requests (start at one second per host), honor Retry-After on 429 responses, back off exponentially with jitter, and avoid aggressive concurrency against a single municipal host. The goal is that residents using the public site never notice your crawl.
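The per-host floor between requests can be sketched as a small throttle that remembers when each host was last contacted. `HostThrottle` and its injectable `clock` and `sleep` parameters are hypothetical names; the injection exists only so the behaviour can be tested without real waiting.

```python
import time
from collections.abc import Callable


class HostThrottle:
    """Enforce a minimum interval between requests to each host."""

    def __init__(self, min_interval: float = 1.0, *,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: dict[str, float] = {}

    def wait(self, host: str) -> float:
        """Block until host may be contacted again; return the seconds slept."""
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0 if last is None else max(0.0, self._min_interval - (now - last))
        if delay:
            self._sleep(delay)
        # Record the moment the request is actually allowed to go out.
        self._last[host] = now + delay
        return delay
```

Calling `throttle.wait(host)` immediately before each request keeps the floor per host, so a crawl spanning several jurisdictions does not slow one portal because of another.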

How do I keep a session alive through a long paginated crawl?

Reuse one session object so cookies and the CSRF token persist, and treat a 401/403 mid-crawl as a recoverable event: re-prime the search page to seat a fresh token and cookie, then retry the current page instead of restarting. Asserting an expected element on every result page catches a silent expiry that returns the login screen as 200 OK.

What stops a re-run from creating duplicate permit records?

A stable identity key — the application number scoped to its source system — combined with a conditional ON CONFLICT DO NOTHING upsert. Because the key is derived only from stable fields (never from scraped_at or status), any replay or overlapping date query converges on the same row and the insert becomes a no-op.

Is scraping a municipal portal legally safe?

Treat it as a governed activity: confirm the terms of service permit automated access, respect robots.txt, use credentials issued to your integration rather than a person’s, and prefer a sanctioned API or bulk export where one exists. Keep an audit log of every request so you can demonstrate exactly what you accessed and when.
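An audit log of every request can be as simple as one JSON object per line. This is a minimal sketch: the `log_request` function and its field names (`ts`, `method`, `url`, `status`) are assumptions, and a production log would likely need rotation and tamper protection beyond what is shown.

```python
import io
import json
from datetime import datetime, timezone
from typing import TextIO


def log_request(sink: TextIO, *, method: str, url: str, status: int) -> None:
    """Append one JSON line describing a single request to the audit sink."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),  # when it was made (UTC)
        "method": method,
        "url": url,
        "status": status,
    }
    sink.write(json.dumps(entry, sort_keys=True) + "\n")


buffer = io.StringIO()  # stands in for an append-only log file
log_request(buffer, method="POST", url="https://portal.example/search", status=200)
print(buffer.getvalue().strip())
```

JSON Lines keeps each request independently parseable, so a partially written file after a crash still yields every complete entry.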

Related

Automated permit ingestion and parsing workflows: the parent guide this acquisition layer feeds.
Using Playwright to scrape dynamic municipal permit dashboards: the headless-browser path for SPA portals.
Error handling and retry logic for ingestion pipelines: classifying and recovering the transient faults a crawl throws.
Parsing PDF permit applications with OCR and layout analysis: handling the documents a scraped row links to.
Designing JSON schemas for building permits: the canonical contract the normalizer maps onto.
