Handling Malformed CSVs from Set Accountants: A Self-Healing Cost Ingestion Parser

Set accountants compile daily cost reports on field laptops, offline spreadsheets, and legacy terminal exports, so the files that reach the central ledger almost never arrive clean. A single delimiter swap, an unescaped quote inside a vendor memo, or a CP1252-encoded em-dash can push a whole daily cost report into a type-coercion failure — and if the parser reacts by aborting the batch, one production assistant’s stray comma stalls petty-cash reconciliation for an entire unit. The exact task this page covers is narrow: how do you ingest a structurally malformed CSV so that the recoverable rows post cleanly to the ledger while the unrecoverable ones are quarantined with enough provenance for a completion-guarantor’s auditor to reconstruct exactly what was rejected and why? The answer is a self-healing parser that treats every malformed artifact as a recoverable signal rather than a fatal exception.

Prerequisites and Context

This page extends the parent guide, Schema Validation & Error Handling, applying its boundary-contract and quarantine discipline to the single messiest input class in Cost Ingestion & Data Parsing Workflows: free-form CSVs produced by hand in the field. It targets Python 3.11+ for standard-library zoneinfo, and leans on a deliberately small dependency set: the standard-library csv module (whose reader natively handles quoted fields), hashlib for the audit fingerprint, decimal for money, zoneinfo for timezone-aware audit timestamps, and Pydantic v2 (model_validate, field_validator) for the row schema. No third-party CSV library is required — the resilience comes from ordering the fallback steps correctly, not from a heavier parser.

The relevant contract obligation is the bond agreement’s reconciliation covenant: every figure that lands in the general ledger must be traceable to the raw payload that produced it, and every rejected figure must be recoverable rather than silently dropped. That covenant is why quarantine here is a first-class output, not an error log. The cleaned rows this parser emits flow onward into the same downstream steps as every other source — the concurrency model in Async Batch Processing, the transport described in CSV & API Sync Pipelines, and the accounting-system round-trip covered in EP/Showbiz Sync Parsing.

The Taxonomy of Field-Generated Corruption

The corruption patterns in production accounting files follow a predictable taxonomy, and each layer of the parser answers one of them. Delimiter drift is the most frequent offender: regional Excel defaults or localized keyboard layouts silently swap commas for semicolons or tabs, breaking any parser that assumes strict comma separation per RFC 4180. Quote-escape paradoxes follow closely — unescaped double quotes embedded in vendor memos, location descriptors, or union grievance notes that split a row across arbitrary boundaries. Encoding collisions are the most destructive: a UTF-8 pipeline choking on a CP1252 file full of em-dashes, non-breaking spaces, or legacy currency glyphs will truncate rows or inject Unicode replacement characters. Whitespace and BOM noise — a leading byte-order mark, trailing \xa0 non-breaking spaces, mixed \r\n and \n line endings — quietly defeats exact-match validation on cost codes and currency fields.

The critical design rule is that embedded delimiters and newlines inside correctly quoted fields must be left to the CSV reader itself, never rewritten by ad hoc regular expressions. Naive quote rewriting corrupts the very rows it aims to repair, so a line producer’s daily cost report is never rejected just because someone typed a comma inside a vendor description.

Step-by-Step: A Self-Healing Parser

Read the implementation as a strict, non-destructive fallback chain. sanitize_encoding decodes raw bytes through an ordered list of candidate encodings before any string logic runs. detect_delimiter derives the separator from a frequency heuristic rather than assuming a comma. CostRowSchema is the Pydantic v2 boundary contract that coerces money to Decimal and rejects anything it cannot normalize. parse_malformed_csv ties them together: each row is validated independently, so a bad row quarantines itself — carrying a SHA-256 hash of its own raw payload — without touching its neighbors.

from __future__ import annotations

import asyncio
import csv
import hashlib
import io
import json
import logging
from datetime import datetime
from decimal import Decimal, InvalidOperation
from pathlib import Path
from typing import Any
from zoneinfo import ZoneInfo

from pydantic import BaseModel, Field, ValidationError, field_validator

# Stamp audit records in the unit's own operating zone — an IANA id, never a bare UTC offset.
UNIT_TZ = ZoneInfo("America/Los_Angeles")

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger(__name__)


class CostRowSchema(BaseModel):
    """Strict boundary contract aligned with production accounting standards."""

    date: datetime
    department: str
    cost_code: str
    vendor: str
    amount: Decimal = Field(..., ge=0)
    currency: str = Field(default="USD")
    memo: str = Field(default="")
    union_category: str = Field(default="")

    @field_validator("amount", mode="before")
    @classmethod
    def normalize_amount(cls, v: Any) -> Decimal:
        """Strip currency glyphs and thousands separators before Decimal coercion."""
        if isinstance(v, Decimal):
            return v
        # Decimal is constructed from the cleaned string form so no float ever touches money.
        cleaned = str(v).replace("$", "").replace(",", "").strip()
        try:
            return Decimal(cleaned)
        except InvalidOperation as exc:
            raise ValueError(f"Invalid monetary amount: {v!r}") from exc


def _fingerprint(payload: dict) -> str:
    """Canonical, sorted serialization yields a deterministic, auditable payload hash."""
    canonical = json.dumps(payload, sort_keys=True, default=str, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sanitize_encoding(raw_bytes: bytes) -> str:
    """Decode legacy encodings common in field spreadsheets, BOM-aware first."""
    for encoding in ("utf-8-sig", "utf-8", "cp1252", "iso-8859-1"):
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("Unrecoverable encoding collision in cost report.")


def detect_delimiter(text_sample: str) -> str:
    """Heuristic delimiter detection for regional Excel exports; comma is the safe fallback."""
    try:
        return csv.Sniffer().sniff(text_sample, delimiters=",;\t|").delimiter
    except csv.Error:
        return ","


async def parse_malformed_csv(file_path: Path) -> list[dict[str, Any]]:
    """Async ingestion routine with a non-destructive, self-healing fallback chain."""
    raw_bytes = await asyncio.to_thread(file_path.read_bytes)
    decoded_text = sanitize_encoding(raw_bytes)
    delimiter = detect_delimiter(decoded_text)

    valid_rows: list[dict[str, Any]] = []
    quarantined: list[dict[str, Any]] = []

    try:
        # csv.DictReader natively handles quoted fields containing the delimiter or
        # embedded newlines, so no lossy pre-sanitization of the raw text is needed.
        reader = csv.DictReader(io.StringIO(decoded_text), delimiter=delimiter)
        for row_num, row in enumerate(reader, start=2):  # header is line 1
            try:
                clean_row = {
                    k.strip(): (v.strip().replace("\xa0", "") if isinstance(v, str) else v)
                    for k, v in row.items()
                    if k is not None
                }
                validated = CostRowSchema.model_validate(clean_row)
                valid_rows.append(validated.model_dump(mode="json"))
            except ValidationError as exc:
                quarantined.append(
                    {
                        "row": row_num,
                        "data": row,
                        "error": exc.errors(),
                        "payload_sha256": _fingerprint(dict(row)),
                        "quarantined_at": datetime.now(UNIT_TZ).isoformat(),
                    }
                )
                logger.warning("Row %s quarantined (sha256 %s)", row_num, _fingerprint(dict(row))[:12])
    except csv.Error as exc:
        logger.error("Structural parse failure in %s: %s", file_path.name, exc)
        raise

    if quarantined:
        stamp = datetime.now(UNIT_TZ).strftime("%Y%m%dT%H%M%S%z")
        quarantine_path = file_path.parent / f"quarantine_{file_path.stem}_{stamp}.json"
        await asyncio.to_thread(
            quarantine_path.write_text,
            json.dumps(quarantined, default=str, indent=2),
        )
        logger.info("%s rows quarantined to %s", len(quarantined), quarantine_path)

    return valid_rows

Because each row is validated in isolation, a batch of two thousand line items with three malformed rows posts 1,997 clean records and writes a single quarantine artifact for the remaining three — the entire file is never forfeited to a handful of bad cells. The CostRowSchema enforces strict type boundaries so union category codes and cost-center mappings are normalized before they reach guild-rate math, and the async file I/O keeps the event loop free during high-concurrency daily-cost-report uploads.

The diagram below follows a malformed file through the encoding fallback chain, heuristic delimiter detection, the spec-compliant CSV reader, and the per-row split between validated records and the JSON quarantine artifact.

Audit Trail Requirements

A recovered batch is only defensible if the rejects carry their own proof. Every quarantine record must persist, at minimum: the source row number; the raw data exactly as read, before any cleaning; the structured Pydantic error list explaining precisely which field failed and why; the payload_sha256 fingerprint of the raw row; and a timezone-aware quarantined_at stamp anchored to the unit’s IANA zone through zoneinfo. The fingerprint is the linchpin of the reconciliation story — it lets a guarantor’s auditor tie any rejected line back to the exact bytes that arrived, and it makes re-ingestion idempotent, because a corrected file that reproduces the same canonical payload reproduces the same hash.

Two storage rules are non-negotiable. First, quarantine artifacts go to write-once storage, so the record of what was rejected cannot later be edited to hide a drop. Second, a resubmission is a new artifact, never an in-place overwrite — when a set accountant re-sends a corrected file, the parser posts the newly valid rows and leaves the original quarantine record intact, preserving the full chain of custody the bond covenant demands. The same SHA-256 quarantine discipline defined by the parent Schema Validation & Error Handling reference governs the controlled-degradation path in Compliance Fallback Chains, so a missing cost code and a missing guild rate are logged with identical provenance.

Gotchas and Production Edge Cases

Encoding order matters. utf-8-sig must be tried before utf-8, or a BOM-prefixed file decodes with a stray glued to the first header cell, silently breaking the date column match. Attempting iso-8859-1 first is worse: it decodes almost any byte sequence without error, masking genuine corruption. Order the fallback chain strictest-to-loosest.

The Sniffer can guess wrong. csv.Sniffer occasionally mis-detects the delimiter on a file whose first rows contain many commas inside quoted memos. Restrict the candidate set to ,;\t|, fall back to comma on csv.Error, and — for a high-value source — pin the delimiter explicitly rather than sniffing at all. A wrong guess does not raise; it silently collapses every row into one column, which the schema then rejects wholesale.

Header drift across units. Multi-location shoots often submit the same logical report with slightly different column names (Cost Code vs cost_code vs CostCode). Normalize headers with k.strip() at minimum, and maintain an alias map that resolves known variants to canonical fields before validation, so a Vancouver unit and a Atlanta unit do not each generate false quarantine rows. Aligning those canonical fields is the job of Cost Code Standardization.

Float contamination at the boundary. Never let a monetary value arrive as a float. Constructing Decimal(0.1) imports the float’s binary error; the schema instead cleans the string form and calls Decimal(cleaned). A single fractional cent, compounded across tens of thousands of daily line items, becomes exactly the variance a guarantor will ask you to explain.

Idempotency on resubmission. Because the payload hash is deterministic, a downstream consumer that deduplicates on (cost_code, payload_sha256) can safely absorb a re-sent file without double-posting the rows that were already valid the first time. Without that guard, a corrected resubmission re-posts every clean row it contains.

Once the clean rows carry normalized cost codes and Decimal amounts, the downstream guild-rate math — overtime multipliers, meal penalties, and the fringe percentages computed in Pension & Health Fund Calculations — runs against records that are structurally sound, which is the entire reason for hardening ingestion at the front of the pipeline. Guilds referenced in that math, including the International Alliance of Theatrical Stage Employees (IATSE), the Directors Guild of America (DGA), and SAG-AFTRA, each key off cost-center codes that must survive ingestion intact.

Schema Validation & Error Handling — the parent guide whose Pydantic boundary contracts and SHA-256 quarantine semantics this parser applies to hand-made field CSVs.
Parsing EP/Showbiz Sync Exports Without Manual Cleanup — the sibling guide for structured accounting-system exports, where the corruption class differs from free-form set-accountant files.
Async Batch Processing for Multi-Currency Shoots — how the clean rows this parser emits are normalized to a base currency with pinned daily FX rates.
Compliance Fallback Chains — the controlled-degradation path that reuses the same quarantine provenance for missing guild-rate tables.
Cost Code Standardization — the canonical cost-code taxonomy that header normalization must resolve to before validation.

Up one level: Schema Validation & Error Handling, part of Cost Ingestion & Data Parsing Workflows.

# Handling Malformed CSVs from Set Accountants: A Self-Healing Cost Ingestion Parser

# Prerequisites and Context

# The Taxonomy of Field-Generated Corruption

# Step-by-Step: A Self-Healing Parser

# Audit Trail Requirements

# Gotchas and Production Edge Cases

# Related Guides