Converting Legacy dBase Files to PostgreSQL

This guide sits under Syncing Legacy CSV Exports to Modern Databases, part of the broader practice of automated permit ingestion and parsing workflows. It walks through one focused task: migrating decades-old dBase III/IV/V .dbf archives — historical inspection logs, zoning variances, fee schedules, and contractor licensing records — into a modern PostgreSQL instance without silently corrupting the data a public records request depends on.

Permalink to this section Why a Naive `.dbf` Import Fails

A straight file dump treats a .dbf as if its header still describes its contents. In practice, DOS-era permitting systems drifted: a FEE field declared N may be padded with spaces, an INSP_DATE field stores 00000000 for un-inspected permits, and a STATUS logical field carries ? for records a clerk never finished. If the loader guesses, an undersized NUMERIC truncates fee cents, an invalid date raises mid-batch and rolls back thousands of good rows, and a wrong code page turns the degree sign in a setback note into mojibake. Because these archives back FOIA disclosure and statutory retention windows, the migration must be deterministic and auditable, not best-effort. The non-negotiable requirements are explicit type mapping, NULL handling for sentinel values, idempotent deduplication, and an immutable record of exactly what was transformed.

Permalink to this section Step 1 — Profile the Header Before Trusting It

Stream the header and sample real values per field so the target schema is shaped by the data, not by a 1994 declaration. dbfread with load=False reads row by row and never holds the whole file in memory — important on constrained municipal servers.

from collections.abc import Iterable
from pathlib import Path

from dbfread import DBF


def profile_dbf(path: Path, encoding: str = "cp437") -> dict[str, dict]:
    """Inspect a legacy .dbf header and measure the values actually stored.

    DOS-era permit systems are almost always code page 437; later Windows
    front-ends used cp1252. Profiling with the right encoding up front avoids
    re-reading the archive after a mojibake surprise.
    """
    table = DBF(path, load=False, encoding=encoding, char_decode_errors="strict")
    profile: dict[str, dict] = {
        f.name: {"dbf_type": f.type, "length": f.length,
                 "decimals": f.decimal_count, "max_len": 0, "nulls": 0}
        for f in table.fields
    }
    for record in table:  # streams rows; the full archive never lands in RAM
        for name, value in record.items():
            cell = profile[name]
            if value in (None, "", b""):
                cell["nulls"] += 1
            else:
                cell["max_len"] = max(cell["max_len"], len(str(value)))
    return profile

Permalink to this section Step 2 — Generate Deterministic DDL

Map dBase types to PostgreSQL explicitly and size every column from the profile. The rule that matters most for permit tracking: preserve the declared scale on N fields so fee and valuation cents survive intact.

# Never let the driver infer types. An inferred NUMERIC silently drops cents.
DBF_TO_PG: dict[str, str] = {
    "M": "TEXT",              # memo: inspector narratives / violation notes
    "L": "BOOLEAN",
    "D": "DATE",
    "F": "DOUBLE PRECISION",
}


def build_ddl(table: str, profile: dict[str, dict]) -> str:
    cols: list[str] = ['"id" BIGSERIAL PRIMARY KEY']
    for name, cell in profile.items():
        t = cell["dbf_type"]
        if t == "N":
            # scale 0 -> integer counter/key; otherwise fixed NUMERIC(p, s)
            pg = "BIGINT" if cell["decimals"] == 0 \
                else f"NUMERIC({cell['length']}, {cell['decimals']})"
        elif t == "C":
            width = max(cell["max_len"], 1)
            pg = "TEXT" if width > 255 else f"VARCHAR({width})"
        else:
            pg = DBF_TO_PG.get(t, "TEXT")
        cols.append(f'    "{name.lower()}" {pg}')
    return f'CREATE TABLE IF NOT EXISTS "{table}" (\n' + ",\n".join(cols) + "\n);"

Permalink to this section Step 3 — Coerce and Validate Each Record

Raw dBase values are not PostgreSQL values. The two failure points specific to permit archives are date sentinels and ambiguous logicals; both must resolve to NULL rather than a fabricated default.

import re
from datetime import date

_YYYYMMDD = re.compile(r"^(\d{4})(\d{2})(\d{2})$")
_TRUE = {"T", "Y", "1", "TRUE"}
_FALSE = {"F", "N", "0", "FALSE"}


def coerce_date(raw: object) -> date | None:
    """`00000000`, blanks, and impossible dates are placeholders for permits
    that were never inspected — they must become NULL, not 1900-01-01, or
    they will pollute every retention-window query."""
    if isinstance(raw, date):
        return raw if date(1900, 1, 2) <= raw <= date.today() else None
    m = _YYYYMMDD.match(str(raw or "").strip())
    if not m:
        return None
    try:
        parsed = date(int(m[1]), int(m[2]), int(m[3]))
    except ValueError:
        return None
    return parsed if date(1900, 1, 2) <= parsed <= date.today() else None


def coerce_bool(raw: object) -> bool | None:
    token = str(raw).strip().upper()
    if token in _TRUE:
        return True
    if token in _FALSE:
        return False
    return None  # '?' or space -> unknown, never a silent False

Permalink to this section Step 4 — Bulk Load with `COPY`, Not Per-Row INSERT

Row-by-row INSERT pays a network round trip and trigger pass per record; an archive of half a million inspection rows can take hours. Serialize cleaned rows into an in-memory buffer and stream them through PostgreSQL’s COPY protocol into an UNLOGGED staging table, which skips Write-Ahead Log generation during the load. See the PostgreSQL COPY reference for the format options used here.

import csv
import io

import psycopg2


def to_pg_cell(value: object) -> str:
    """Render a coerced value for CSV COPY. None -> empty unquoted field,
    which PostgreSQL reads as NULL under the default CSV null marker."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "t" if value else "f"
    if isinstance(value, date):
        return value.isoformat()
    return str(value).strip()


def copy_into_staging(conn, table: str, columns: list[str],
                      rows: Iterable[dict], chunk: int = 25_000) -> int:
    """Stream coerced rows into an UNLOGGED staging table via COPY."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
    total = 0
    with conn.cursor() as cur:
        for row in rows:
            writer.writerow([to_pg_cell(row.get(c)) for c in columns])
            total += 1
            if total % chunk == 0:  # flush in bounded chunks to cap memory
                buf.seek(0)
                cur.copy_expert(
                    f'COPY "{table}" ({", ".join(columns)}) FROM STDIN WITH (FORMAT csv)',
                    buf,
                )
                buf.seek(0)
                buf.truncate(0)
        if buf.tell():
            buf.seek(0)
            cur.copy_expert(
                f'COPY "{table}" ({", ".join(columns)}) FROM STDIN WITH (FORMAT csv)',
                buf,
            )
    return total

Permalink to this section Step 5 — Deduplicate with an Idempotent Upsert

Concurrent multi-user edits and crash recovery in DOS-era systems left duplicate permit numbers behind. COPY cannot upsert, so promote staged rows into the live table with ON CONFLICT, breaking ties deterministically on the newest revision. Running the migration twice then yields the same result — essential when an error handling and retry logic for ingestion pipelines layer reruns a failed batch.

INSERT INTO permits AS p (permit_no, status, issued_on, last_modified)
SELECT DISTINCT ON (permit_no)
       permit_no, status, issued_on, last_modified
FROM   permits_staging
ORDER  BY permit_no, last_modified DESC NULLS LAST   -- keep the newest revision
ON CONFLICT (permit_no) DO UPDATE
   SET status        = EXCLUDED.status,
       issued_on     = EXCLUDED.issued_on,
       last_modified = EXCLUDED.last_modified
 WHERE EXCLUDED.last_modified > p.last_modified;      -- never overwrite newer data

Defer foreign-key enforcement until after this step: validating parcel references against the best practices for linking zoning codes to parcel IDs is far cheaper as one post-load ALTER TABLE ... VALIDATE CONSTRAINT than as a cascade of per-row checks during the bulk insert.

Permalink to this section Parameter and Flag Reference

Setting	Recommended value	Rationale for permit archives
`DBF(encoding=...)`	`cp437` (fallback `cp1252`)	DOS permit systems are code page 437; Windows front-ends drifted to cp1252
`DBF(load=...)`	`False`	Streams rows; keeps memory flat on constrained municipal servers
`char_decode_errors`	`strict` while profiling	Surface encoding mismatches up front instead of after a bad load
`COPY` chunk size	10,000–50,000 rows	Bounded buffers; predictable memory under overnight batch volume
Staging table	`UNLOGGED`	Skips WAL during load, then promoted with the dedup upsert
`NUMERIC` scale	`field.decimal_count`	Preserves fee/valuation cents instead of truncating to integer
`maintenance_work_mem`	512MB–1GB during load	Faster post-load index builds; revert to baseline afterward
Conflict key	natural permit number	Deterministic dedup across crash-duplicated revisions

Permalink to this section Common Failure Patterns and Fixes

Permalink to this section Code-page mojibake

A file decoded as latin-1 when it is really cp437 mangles degree signs, fractions, and box-drawing characters embedded in setback and elevation notes. Detect it by profiling with char_decode_errors="strict" and catching UnicodeDecodeError, then retry the encoding before any load reaches the database.

for enc in ("cp437", "cp1252", "latin-1"):
    try:
        DBF(path, encoding=enc, char_decode_errors="strict").field_names
        chosen = enc
        break
    except UnicodeDecodeError:
        continue

Permalink to this section Sentinel and impossible dates

00000000, 19000101, and future-dated placeholders are not real inspection dates. The coerce_date guard in Step 3 maps every one to NULL; never substitute a default, or retention math will treat un-inspected permits as decades overdue.

Permalink to this section Duplicate permit numbers

Multi-user DOS edits produced repeated keys. The DISTINCT ON ... ORDER BY last_modified DESC clause in Step 5 keeps the newest revision; log the discarded duplicate count so a compliance officer can see what was collapsed.

Permalink to this section NUMERIC overflow and fee truncation

A N(12,2) fee field loaded into an inferred INTEGER drops the cents and overflows on large valuations. Always build the column as NUMERIC(length, decimals) straight from the profile (Step 2) rather than letting the driver choose.

Permalink to this section Missing or corrupt memo files

Memo (M) fields live in a sibling .dbt/.fpt file. If it is absent, dbfread returns None for every memo and the inspector narratives are silently lost. Assert the companion file exists before profiling and fail loudly:

if not path.with_suffix(".dbt").exists() and not path.with_suffix(".fpt").exists():
    raise FileNotFoundError(f"memo sidecar missing for {path.name}; narratives would be dropped")

Permalink to this section Audit and Logging Guidance

Municipal records retention demands defensible lineage. Write one row to a migration_audit table per source file capturing the SHA-256 checksum of the .dbf (and its memo sidecar), the chosen encoding, total records read, records inserted, duplicates collapsed, rows rejected with their reason codes, and a version tag for the coercion rules applied. Reject manifests — every NULL-ed date and dropped duplicate — give a compliance officer a transparent answer to “what changed during migration.” Route those manifests through the same channel described in logging and alerting strategies for failed CSV parsing jobs so a partial failure is visible before the next nightly window. When archives are large enough to run unattended, schedule them through managing Celery task queues for overnight batch imports and persist correlation IDs that trace each record from extraction to final commit.

Syncing Legacy CSV Exports to Modern Databases — the parent topic covering delimited-export reconciliation alongside dBase.
Error Handling and Retry Logic for Ingestion Pipelines — making the idempotent upsert safe to rerun.
Managing Celery Task Queues for Overnight Batch Imports — running large migrations unattended.
Best Practices for Linking Zoning Codes to Parcel IDs — validating parcel references after the load.

Converting Legacy dBase Files to PostgreSQL for Municipal Permit Tracking

#Permalink to this section Why a Naive .dbf Import Fails

#Permalink to this section Step 1 — Profile the Header Before Trusting It

#Permalink to this section Step 2 — Generate Deterministic DDL

#Permalink to this section Step 3 — Coerce and Validate Each Record

#Permalink to this section Step 4 — Bulk Load with COPY, Not Per-Row INSERT

#Permalink to this section Step 5 — Deduplicate with an Idempotent Upsert

#Permalink to this section Parameter and Flag Reference

#Permalink to this section Common Failure Patterns and Fixes

#Permalink to this section Code-page mojibake

#Permalink to this section Sentinel and impossible dates

#Permalink to this section Duplicate permit numbers

#Permalink to this section NUMERIC overflow and fee truncation

#Permalink to this section Missing or corrupt memo files

#Permalink to this section Audit and Logging Guidance

#Permalink to this section Related