Syncing Legacy CSV Exports to Modern

This guide is one subsystem of the broader automated permit ingestion and parsing workflows: it owns the flat-file transport layer that moves permit records out of aging case-management platforms and into a modern relational store without losing fidelity, auditability, or idempotency.

Municipal permitting departments frequently operate on case-management platforms that predate REST and GraphQL. Instead of an API, these systems emit scheduled flat-file exports — nightly CSV dumps dropped to an SFTP folder or a network share. Bridging those exports to a modern PostgreSQL or document database is the connective tissue beneath inspection scheduling, fee reconciliation, and compliance routing. The patterns below are written for the people who actually own that pipe: government IT teams, municipal clerks who triage exceptions, Python automation builders, and compliance officers who answer for data lineage during a public-records request.

Permalink to this section Problem Statement and Scope

Without a disciplined synchronization layer, a single malformed export can corrupt production permit tables, double-bill applicants, or silently drop inspection records that statute requires the jurisdiction to retain. The failure surface is wide because legacy CSV is not a contract — it is whatever the vendor’s report writer produced that night. Column order drifts when someone edits a saved report. Encodings flip between Windows-1252 and UTF-8 mid-year. A “permit fee” column arrives as 1,250.00 in one county and 1250.00 in the next. Treat every export as hostile input.

The inputs to this component are raw CSV files of unknown trustworthiness plus the metadata around them (filename, drop timestamp, byte size, checksum). The output is a set of validated, deduplicated rows committed atomically to a target schema, accompanied by an immutable audit record and a quarantine of every row that failed validation. Who is affected when this breaks: clerks see phantom or missing permits in the portal, Python builders chase non-reproducible data drift, and compliance officers cannot reconstruct what was loaded, when, or from which source file.

The synchronization layer is the authoritative path for historical records, financial reconciliation, and regulatory reporting. Where a jurisdiction also exposes a public portal, controlled extraction via web scraping municipal permit portals with Python provides a parallel real-time channel, but CSV remains the system of record. The records this component produces should validate cleanly against the platform’s JSON schemas for building permits so that every downstream consumer sees one consistent internal contract.

Permalink to this section Prerequisites Checklist

This pipeline targets Python 3.10+ for structural pattern matching and modern type hints. The reference implementation uses Polars for columnar coercion and psycopg (v3) for the database layer, but the patterns translate directly to pandas and psycopg2.

# Python 3.10 or newer
python --version

# Core libraries
pip install "polars>=0.20" "psycopg[binary]>=3.1" "pydantic>=2.6"

# Optional: charset detection for encoding-uncertain exports
pip install charset-normalizer

Environment assumptions before the first run:

A read-only landing path (SFTP mount, S3/MinIO bucket, or a secured network drop zone) where the legacy system writes exports and the worker never deletes them.
A PostgreSQL database where you can create a dedicated staging schema separate from production permit tables, plus write access to an audit_log and an ingestion_errors table.
Connection credentials supplied through environment variables or a secrets manager — never hard-coded in the worker.
Agreement with the data owner on the immutable business keys for a permit record (typically permit_id, issue_date, parcel_apn). Idempotency depends entirely on getting these keys right.

Permalink to this section Stage 1 — Land and Detect Exports

A resilient synchronization pipeline operates asynchronously so a slow load never blocks inspection scheduling or exhausts the database connection pool. The first stage watches the drop zone, verifies file integrity, and decides whether an export is new work. Raw files land in an immutable store first; nothing is parsed until its checksum is recorded, which makes partial transfers and silent vendor re-writes detectable rather than catastrophic.

import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Export:
    path: Path
    sha256: str
    size_bytes: int


def detect_new_exports(drop_zone: Path, seen: set[str]) -> list[Export]:
    """Return checksummed exports not yet processed (idempotent re-scan)."""
    new: list[Export] = []
    for csv_path in sorted(drop_zone.glob("*.csv")):
        digest = _sha256(csv_path)
        if digest in seen:
            continue  # already ingested this exact file content
        new.append(Export(csv_path, digest, csv_path.stat().st_size))
    return new


def _sha256(path: Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

Keying the “seen” set on content hash rather than filename matters: a vendor that re-exports the same nightly file under a new timestamped name will not be reprocessed, and a file re-written in place with new content will be picked up even though the name is unchanged.

Permalink to this section Stage 2 — Coerce and Validate

Legacy CSV exports rarely conform to a strict relational schema. Expect mixed types in date columns, trailing whitespace in permit identifiers, inconsistent null markers ("", NULL, N/A, 0), and locale-specific decimal separators. Apply a deterministic coercion layer before any row reaches the transactional store, and route anything that fails to a quarantine rather than aborting the whole batch — one bad row out of 40,000 should not cost a clerk the night’s load.

Declaring an explicit schema up front prevents silent type inference from masking corruption. Strict casting paired with a row-wise validity check cleanly separates loadable rows from those a human must review.

import polars as pl

PERMIT_SCHEMA = pl.Schema({
    "permit_id": pl.String,
    "issue_date": pl.Date,
    "inspection_status": pl.Categorical(),
    "parcel_apn": pl.String,
    "contractor_license": pl.String,
    "permit_fee": pl.Float64,
    "row_hash": pl.String,
})


def coerce_and_validate(df: pl.DataFrame) -> tuple[pl.DataFrame, pl.DataFrame]:
    """Split a raw export into (valid_rows, quarantined_rows)."""
    df = df.with_columns([
        # Normalize identifiers: strip whitespace, fold case for stable keys
        pl.col("permit_id").str.strip_chars().str.to_uppercase(),
        pl.col("parcel_apn").str.strip_chars(),
        # Collapse empty-string license to a true NULL
        pl.when(pl.col("contractor_license").str.strip_chars() == "")
          .then(None)
          .otherwise(pl.col("contractor_license"))
          .alias("contractor_license"),
        # Strip thousands separators before numeric cast (e.g. "1,250.00")
        pl.col("permit_fee").str.replace_all(",", "").cast(pl.Float64, strict=False),
    ])

    # Strict date parse; unparseable dates become null and get quarantined
    df = df.with_columns(
        pl.col("issue_date").str.to_date(format="%m/%d/%Y", strict=False)
    )

    # A row is valid only if every required field survived coercion
    valid_mask = pl.all_horizontal(pl.col("*").is_not_null())
    return df.filter(valid_mask), df.filter(~valid_mask)

This keeps type coercion transparent and reproducible across deployments. For expression syntax and lazy-evaluation patterns, consult the official Polars user guide. Jurisdictions migrating off even older proprietary formats often hit CSV only after an intermediate conversion step — that pathway is covered in converting legacy dBase files to PostgreSQL for permit tracking.

Permalink to this section Stage 3 — Idempotent Upsert and Conflict Resolution

Validated rows move to the transactional layer. Municipal databases must absorb repeated exports without creating duplicates or clobbering legitimate manual corrections a clerk made in the portal. The mechanism is a deterministic business key plus a cryptographic row hash.

Compute row_hash by concatenating the immutable business keys and the mutable payload, then hashing with SHA-256. During upsert, compare the incoming hash to the stored one: if they match, skip the write; if they differ, update; if the key is absent, insert. PostgreSQL’s ON CONFLICT clause performs this atomically without application-level locking — see the official PostgreSQL INSERT documentation for conflict-target and update-action syntax.

import psycopg

UPSERT_SQL = """
INSERT INTO permits (
    permit_id, issue_date, inspection_status,
    parcel_apn, contractor_license, permit_fee, row_hash
)
VALUES (%(permit_id)s, %(issue_date)s, %(inspection_status)s,
        %(parcel_apn)s, %(contractor_license)s, %(permit_fee)s, %(row_hash)s)
ON CONFLICT (permit_id) DO UPDATE
   SET inspection_status  = EXCLUDED.inspection_status,
       contractor_license = EXCLUDED.contractor_license,
       permit_fee         = EXCLUDED.permit_fee,
       row_hash           = EXCLUDED.row_hash,
       updated_at         = now()
   -- Only write when the payload actually changed; preserves manual edits
   WHERE permits.row_hash IS DISTINCT FROM EXCLUDED.row_hash
"""


def upsert_permits(conn: psycopg.Connection, rows: list[dict]) -> None:
    """Atomically upsert a validated batch in a single transaction."""
    with conn.transaction():
        with conn.cursor() as cur:
            cur.executemany(UPSERT_SQL, rows)

The WHERE permits.row_hash IS DISTINCT FROM EXCLUDED.row_hash guard is what makes the load both idempotent and respectful of human corrections: a nightly re-export of unchanged data touches zero rows, and the updated_at timestamp only advances when something genuinely changed. Wrapping the batch in a single conn.transaction() guarantees the whole load either commits or rolls back — there is no half-applied export.

Permalink to this section Configuration Reference

Surface every operational knob as configuration rather than burying it in code. The defaults below suit a mid-size jurisdiction loading 10,000–50,000 permit rows nightly on a modest server.

Parameter	Type	Default	Municipal-context notes
`drop_zone`	`Path`	`/srv/permit-drop`	Read-only mount the legacy system writes to; the worker never deletes from it.
`target_schema`	`str`	`public`	Production permit tables; keep `staging` separate for raw landings.
`business_keys`	`list[str]`	`["permit_id"]`	Columns forming the conflict target. Must be immutable and agreed with the data owner.
`date_format`	`str`	`"%m/%d/%Y"`	Vendor export date pattern; the single most common source of quarantines.
`encoding`	`str`	`"utf-8-sig"`	Use `utf-8-sig` to absorb the BOM Excel-origin exports emit; fall back to detection.
`batch_size`	`int`	`5000`	Rows per `executemany` chunk; bound memory on constrained municipal servers.
`quarantine_threshold`	`float`	`0.05`	If more than 5% of rows fail, halt and alert — likely a schema drift, not bad rows.
`delimiter`	`str`	`","`	Some county report writers emit `;` or tab; detect rather than assume.

A quarantine_threshold breach is a deliberate circuit-breaker: a handful of bad rows is normal, but a large fraction failing usually means the vendor changed column order or the export was truncated, and loading it anyway would corrupt the table.

Permalink to this section Error Handling and Edge Cases

The failure modes below recur across municipal CSV feeds. Handle each explicitly; do not let any of them silently drop a record that statute requires the jurisdiction to keep.

Encoding mismatches. Exports flip between Windows-1252 and UTF-8, often mid-fiscal-year when a vendor upgrades. Detect rather than assume, and decode with replacement only as a logged last resort.

from charset_normalizer import from_path


def read_csv_bytes(path: Path) -> tuple[str, str]:
    """Return (decoded_text, detected_encoding) for an uncertain export."""
    best = from_path(path).best()
    if best is None:
        # Last resort: decode lossily but record that we did so
        return path.read_bytes().decode("utf-8", errors="replace"), "utf-8/replace"
    return str(best), best.encoding

Schema drift. A column appears, disappears, or is reordered. Validate the header against an expected set before parsing a single row, and fail loudly with the diff so a clerk knows exactly what the vendor changed.

Locale-formatted numbers and dates. 1.250,00 (European) versus 1,250.00, or DD/MM/YYYY versus MM/DD/YYYY. These coerce without error to the wrong value, which is worse than a hard failure — pin the expected format in configuration and quarantine anything that does not match.

Partial transfers. A file still being written by SFTP looks complete to a naive watcher. Require a stable size across two polls, or wait for a .done sentinel file, before checksumming.

Quarantine, never drop. Every failed row lands in ingestion_errors with its original payload, the error, and a stack trace. Failed rows are reviewable work items, not lost data. For the routing taxonomy behind transient-versus-permanent failures and retry policy, see error handling and retry logic for ingestion pipelines.

Permalink to this section Auditability and Compliance

Government environments require a defensible trail for every data mutation. Maintain a dedicated audit_log that records the source file’s SHA-256, processing timestamps, row counts (inserted, updated, skipped, quarantined), and the operator or service identity. This metadata lets a compliance officer reconstruct exactly what was loaded from which file during an audit or a public-records request, without ever querying production tables.

def write_audit(conn: psycopg.Connection, export: Export, counts: dict[str, int]) -> None:
    """Record an immutable lineage row for one processed export."""
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO audit_log
                 (source_file, sha256, size_bytes,
                  inserted, updated, skipped, quarantined, processed_at)
               VALUES (%s, %s, %s, %s, %s, %s, %s, now())""",
            (export.path.name, export.sha256, export.size_bytes,
             counts["inserted"], counts["updated"],
             counts["skipped"], counts["quarantined"]),
        )

The staging schema should mirror the target structure while preserving original column names, the row hash, and the ingestion timestamp, so lineage is traceable end to end without touching live data.

Permalink to this section Testing and Verification

Confirm the component with fixtures that encode the real anomalies, not just a clean happy path. The two properties worth proving are deterministic coercion and idempotency — running the same export twice must converge to the same state with zero second-pass writes.

import polars as pl
from pipeline import coerce_and_validate


def test_quarantines_unparseable_dates() -> None:
    raw = pl.DataFrame({
        "permit_id": [" bld-2026-01 ", "bld-2026-02"],
        "issue_date": ["03/14/2026", "not-a-date"],
        "inspection_status": ["passed", "pending"],
        "parcel_apn": ["009-122-04", "009-122-05"],
        "contractor_license": ["", "C-10 884421"],
        "permit_fee": ["1,250.00", "980.00"],
        "row_hash": ["a", "b"],
    })
    valid, quarantined = coerce_and_validate(raw)

    # Whitespace stripped, case folded, comma removed
    assert valid["permit_id"].to_list() == ["BLD-2026-01"]
    assert valid["permit_fee"].to_list() == [1250.0]
    # Empty license became a true null, not the string ""
    assert valid["contractor_license"].to_list() == [None]
    # The unparseable date row is isolated, not dropped
    assert quarantined.height == 1


def test_idempotent_second_load(seeded_conn) -> None:
    rows = sample_valid_rows()
    upsert_permits(seeded_conn, rows)
    before = row_count(seeded_conn, "permits")
    upsert_permits(seeded_conn, rows)  # identical second load
    assert row_count(seeded_conn, "permits") == before  # no duplicates

Assert on counts from the audit_log in integration tests too: a re-run of an unchanged file should report inserted=0, updated=0, skipped=N. If the second pass reports updates, the row hash is non-deterministic — usually an unsorted key concatenation or an un-normalized field leaking into the hash.

Permalink to this section Integration Notes

This component sits between acquisition and the durable store, and it hands off in three directions. Upstream, when a jurisdiction has no usable flat-file export at all, the real-time channel comes from web scraping municipal permit portals with Python. When exports reference supporting documents, cross-validate CSV metadata against fields pulled by parsing PDF permit applications with OCR and layout analysis before committing. For very large nightly dumps, hand the validated batches to async batch processing for high-volume submissions so coercion and upsert run concurrently across workers. Downstream, freshly upserted rows should trigger cache warming for permit lookup APIs so the public portal reflects the night’s load immediately.

Permalink to this section Frequently Asked Questions

Permalink to this section How do I choose the business keys for idempotent upserts?

Pick the smallest set of columns the source guarantees to be immutable and unique for a permit’s lifetime — usually permit_id alone, or permit_id plus issue_date when a jurisdiction recycles identifiers across years. Confirm uniqueness against a real historical export before committing the choice; a key that turns out non-unique will silently overwrite legitimate records.

Permalink to this section Should validated rows go straight to production or through a staging schema?

Land raw rows in a staging schema first, then upsert into production. Staging gives you a queryable record of exactly what arrived, preserves original column names and row hashes for lineage, and lets you re-run coercion after a logic fix without re-reading the source file.

Permalink to this section What belongs in the row hash versus the business key?

The business key identifies the record; the row hash detects change. Hash the immutable keys plus every mutable payload column you want to track, in a fixed sorted order. Excluding a column from the hash means edits to it will never trigger an update — occasionally desirable (ignore a volatile vendor timestamp), usually a bug.

Permalink to this section How large a quarantine fraction is acceptable before halting a load?

Treat a few isolated bad rows as routine and load the rest, but stop and alert above roughly 5%. A large failure fraction almost always signals schema drift or a truncated transfer rather than genuinely bad data, and loading it anyway risks corrupting the target table.

Automated permit ingestion and parsing workflows — the end-to-end pipeline this synchronization layer feeds.
Converting legacy dBase files to PostgreSQL for permit tracking — the upstream conversion step for pre-CSV archives.
Error handling and retry logic for ingestion pipelines — failure routing and retry policy for the quarantine path.
Designing JSON schemas for building permits — the internal contract validated rows must satisfy.
Cache warming strategies for permit lookup APIs — the downstream consumer that reflects each load to the public portal.

Syncing Legacy CSV Exports to Modern Databases

#Permalink to this section Problem Statement and Scope

#Permalink to this section Prerequisites Checklist

#Permalink to this section Stage 1 — Land and Detect Exports

#Permalink to this section Stage 2 — Coerce and Validate

#Permalink to this section Stage 3 — Idempotent Upsert and Conflict Resolution

#Permalink to this section Configuration Reference

#Permalink to this section Error Handling and Edge Cases

#Permalink to this section Auditability and Compliance

#Permalink to this section Testing and Verification

#Permalink to this section Integration Notes

#Permalink to this section Frequently Asked Questions

#Permalink to this section How do I choose the business keys for idempotent upserts?

#Permalink to this section Should validated rows go straight to production or through a staging schema?

#Permalink to this section What belongs in the row hash versus the business key?

#Permalink to this section How large a quarantine fraction is acceptable before halting a load?

#Permalink to this section Related

Explore deeper