Why not just load the EP/Showbiz export into a pandas DataFrame?

A multi-season export loaded whole can exhaust memory and stall the close, and a DataFrame read hides the row index you need to quarantine a defect. Streaming with the standard csv module parses row by row, correctly handles quoted fields containing embedded delimiters and newlines, and keeps a single malformed row from halting the pipeline.

How does the parser handle European versus US decimal separators?

The amount validator strips non-breaking spaces, then treats the right-most of a comma or period as the decimal mark: 1.234,56 becomes 1234.56 and 1,234.56 becomes 1234.56. A lone comma is resolved to a decimal mark, though a file known to use comma thousands separators should drive that from locale metadata to avoid a thousand-fold error.
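One way to drive the lone-separator decision from locale metadata rather than the heuristic is sketched below. The parse_amount helper and its decimal_mark parameter are hypothetical names, not part of the validator described here, and the sketch assumes the file's locale has already been resolved upstream to a single decimal mark.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(raw: str, decimal_mark: str) -> Decimal:
    """Parse a monetary string when the file's decimal mark is already known.

    `decimal_mark` is "," or "." and is assumed to come from locale
    metadata delivered with the export (a hypothetical input).
    """
    if decimal_mark not in (",", "."):
        raise ValueError(f"Unsupported decimal mark: {decimal_mark!r}")
    thousands_mark = "." if decimal_mark == "," else ","
    # Drop non-breaking and ordinary spaces, then the known thousands mark.
    cleaned = raw.replace("\xa0", "").replace(" ", "").replace(thousands_mark, "")
    # Whatever separator remains must be the decimal mark.
    cleaned = cleaned.replace(decimal_mark, ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"Invalid monetary amount: {raw!r}") from exc


print(parse_amount("1,250", decimal_mark="."))     # US file: 1250
print(parse_amount("1,250", decimal_mark=","))     # European file: 1.250
print(parse_amount("1.234,56", decimal_mark=","))  # 1234.56
```

With the mark supplied explicitly, the same text 1,250 resolves to two different amounts depending on the file, which is exactly the ambiguity the heuristic cannot settle on its own.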

What happens to a row that fails validation?

It is never dropped or silently coerced. The untouched row is fingerprinted with a SHA-256 hash of its canonical bytes, stamped with a batch id, source row index, machine-readable reason, and a timezone-aware quarantined_at, then yielded to an append-only audit queue while the stream continues. An accountant corrects and re-ingests only the affected rows.

How does the SHA-256 fingerprint keep re-ingestion idempotent?

The hash is computed over a canonical, key-sorted serialization of the row, so a corrected row that reproduces the same bytes reproduces the same hash. A reconciliation tool can therefore prove whether a fix was genuinely applied or a duplicate slipped in, and tie any ledger or quarantine line back to the exact export byte for a bond lender.
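A small illustration of that property, using the same key-sorted JSON serialization as the parser's payload_fingerprint function; the sample rows are invented for the example.

```python
import hashlib
import json
from typing import Any


def payload_fingerprint(row: dict[str, Any]) -> str:
    # Canonical form: sorted keys and compact separators, encoded as UTF-8.
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative rows only; the values are not taken from a real export.
original = {"cost_code": "2050.03.01.07", "amount": "1.234,56"}
reordered = {"amount": "1.234,56", "cost_code": "2050.03.01.07"}
edited = {"cost_code": "2050.03.01.07", "amount": "1234.56"}

# Key order does not affect the hash; any change to a value does.
print(payload_fingerprint(original) == payload_fingerprint(reordered))  # True
print(payload_fingerprint(original) == payload_fingerprint(edited))     # False
```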

Why must monetary amounts use Decimal instead of float?

Binary floating point cannot represent most cents exactly. A single fractional-cent error, compounded across the tens of thousands of transactions a shoot generates, becomes exactly the variance a completion guarantor's auditor asks you to explain, and it makes union fringe math disagree with the guild's own published figure.

Parsing EP/Showbiz Sync Exports Without Manual Cleanup

Parsing EP/Showbiz Sync exports without manual cleanup turns a defect-ridden accounting file into a validated ledger feed no accountant has to hand-patch. Entertainment Partners (EP) and Showbiz Budgeting export the day’s transactions in shapes built for a human to eyeball, not for a ledger to consume unattended: hidden carriage returns split one payroll row into two, non-breaking spaces poison numeric fields, decimal separators flip between US and European locales, merged header artifacts drift the column order, and union fringe overrides land in the wrong field. When an accountant scrubs those defects by hand every close, the weekly cost report slips, budget variance surfaces days late, and a completion guarantor’s cost-to-complete report is stale before it is even generated. This page specifies the parsing layer that removes the hand-patching entirely — a deterministic path from a raw export to a validated feed where every field is coerced explicitly, every rejection is fingerprinted and quarantined rather than guessed at, and every dollar that reaches the general ledger (GL) can be traced back to the exact byte it came from.

Prerequisites and Context

This page is the end-to-end implementation walkthrough for its parent topic, EP/Showbiz Sync Parsing, which defines the deterministic ingestion contract this code satisfies. It targets Python 3.11+, which covers both the X | None union-type syntax (available from 3.10) and the standard-library zoneinfo module (available from 3.9), with a deliberately small dependency set: Pydantic v2 for boundary schema validation via model_validate and field_validator, and the standard-library csv, decimal, hashlib, and zoneinfo modules for streaming reads, currency-safe arithmetic, tamper-evident audit hashing, and timezone-aware timestamps. On platforms without a system time zone database, such as Windows, zoneinfo also needs the tzdata package installed. Never coerce money through float: one fractional cent, compounded across the tens of thousands of transactions a shoot generates, becomes exactly the variance a guarantor’s auditor asks you to explain.
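The float warning can be checked in a few lines. This is a minimal demonstration using only the standard library; the variable names are illustrative.

```python
from decimal import Decimal

# Float: 10 cents plus 20 cents is not exactly 30 cents.
float_sum_matches = (0.10 + 0.20 == 0.30)

# Decimal built from strings keeps every cent exact.
decimal_sum_matches = (Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))

# Building a Decimal from a float imports the binary error, so parse from text.
from_float_matches = (Decimal(0.10) == Decimal("0.10"))

print(float_sum_matches)    # False
print(decimal_sum_matches)  # True
print(from_float_matches)   # False
```

The third check is the practical one for a parser: the amount has to go from the export's text straight into Decimal, never through a float on the way.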

The parsing layer sits directly downstream of transport and directly upstream of the ledger. The idempotent fetch, retry, and checksum semantics that deliver these files are specified in CSV & API Sync Pipelines; the concurrency model that runs per-row validation across a worker pool without starving the event loop is Async Batch Processing; and the boundary contracts and quarantine discipline this code applies are drawn in Schema Validation & Error Handling. All three are subsystems of the broader Cost Ingestion & Data Parsing Workflows architecture that every parsed record must ultimately satisfy. Two contract details drive the validation rules below. First, EP/Showbiz cost codes follow the four-segment dotted numeric pattern XXXX.YY.ZZ.WW (for example 2050.03.01.07 for Art Department). Second, a well-formed code is not a correct one until it is resolved against an approved account matrix, which is the subject of Cost Code Standardization and its migration walkthrough, How to Map EP/Showbiz Sync Cost Codes to Custom Databases.

Why RFC 4180 Assumptions Fail on These Exports

The first failure point in automated parsing is assuming the export strictly follows RFC 4180, the informational memo that documents the common CSV format. EP and Showbiz files routinely embed Excel-style formatting artifacts: a leading byte-order mark (BOM), trailing whitespace in column headers, non-breaking spaces (\xa0) inside numeric fields, and inconsistent line endings across international co-productions. Loading a multi-season export directly into an in-memory DataFrame can exhaust available memory and stall the close. The reliable pattern is a streaming reader built on Python’s standard csv module, which parses row by row and correctly handles quoted fields that contain embedded delimiters and newlines (the exact source of split payroll rows) without loading the whole payload at once. Decoupling the stream from transformation ensures a single malformed row never halts the pipeline.
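The behavior described above can be seen on a two-line stand-in for an export. The bytes below are invented for illustration; the sketch shows the utf-8-sig codec dropping the BOM and the csv module keeping a quoted field with an embedded line break as a single record.

```python
import csv
import io

# Illustrative bytes: BOM, a padded header, and a quoted description that
# contains a comma and an embedded CR/LF pair.
raw_bytes = (
    b"\xef\xbb\xbfcost_code ,description,amount\r\n"
    b'2050.03.01.07,"Set dressing,\r\nstage 4","1,234.56"\r\n'
)

# utf-8-sig drops the BOM; newline="" leaves line endings for csv to interpret.
text = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8-sig", newline="")
rows = list(csv.reader(text))

print(len(rows))         # 2: the header and one record, not three lines
print(repr(rows[0][0]))  # 'cost_code ' (padding survives until normalized)
print(repr(rows[1][1]))  # 'Set dressing,\r\nstage 4'
```

A naive split on line endings would have produced three fragments here, which is how one payroll row turns into two.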

Step-by-Step: A Streaming, Fingerprinted Parser

The routine below opens the file with utf-8-sig (which transparently strips a BOM) and newline="" (which defers line-ending handling to the csv module), normalizes the header row, then streams each transaction through a strict Pydantic v2 model. Valid records accumulate into batches for the worker pool; invalid rows are fingerprinted with a SHA-256 hash of their canonical bytes, stamped with a timezone-aware quarantined_at, and yielded to the audit queue instead of halting the run. The locale-aware sanitize_amount validator resolves both European (1.234,56) and US (1,234.56) decimal formats to a canonical Decimal, and validate_cost_code enforces the four-segment chart-of-accounts pattern.

import csv
import hashlib
import json
import logging
from datetime import datetime
from decimal import Decimal, InvalidOperation
from pathlib import Path
from typing import Any, Iterator
from zoneinfo import ZoneInfo

from pydantic import BaseModel, ValidationError, field_validator

logger = logging.getLogger("showbiz_ingestion")

# The production's home hub anchors every audit timestamp. Use an IANA
# identifier (never a fixed UTC offset) so DST transitions resolve correctly.
HUB_TZ = ZoneInfo("America/Los_Angeles")


def payload_fingerprint(row: dict[str, Any]) -> str:
    """Deterministic SHA-256 over the canonical row so a re-ingested fix
    reproduces the same hash — the basis of idempotent reconciliation."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class CostRecord(BaseModel):
    cost_code: str
    vendor_id: str
    description: str
    amount: Decimal
    currency: str
    union_category: str | None = None

    @field_validator("amount", mode="before")
    @classmethod
    def sanitize_amount(cls, v: Any) -> Decimal:
        cleaned = str(v).replace("\xa0", "").replace(" ", "").strip()
        if "," in cleaned and "." in cleaned:
            # The right-most separator is the decimal mark.
            if cleaned.rfind(",") > cleaned.rfind("."):
                cleaned = cleaned.replace(".", "").replace(",", ".")  # 1.234,56 -> 1234.56
            else:
                cleaned = cleaned.replace(",", "")                    # 1,234.56 -> 1234.56
        elif "," in cleaned:
            cleaned = cleaned.replace(",", ".")                       # 1234,56 -> 1234.56
        try:
            return Decimal(cleaned)
        except InvalidOperation as exc:
            raise ValueError(f"Invalid monetary amount: {v!r}") from exc

    @field_validator("cost_code")
    @classmethod
    def validate_cost_code(cls, v: str) -> str:
        import re
        if not re.fullmatch(r"\d{4}\.\d{2}\.\d{2}\.\d{2}", v.strip()):
            raise ValueError(
                f"EP/Showbiz cost code must match XXXX.YY.ZZ.WW, got: {v!r}"
            )
        return v.strip()


def stream_parse_export(
    file_path: Path, batch_id: str, chunk_size: int = 1000
) -> Iterator[list[dict[str, Any]]]:
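    with open(file_path, "r", encoding="utf-8-sig", newline="") as f:
        # Guard the header read: a bare next() on an empty file would leak
        # StopIteration, which a generator turns into a RuntimeError (PEP 479).
        raw_headers = next(csv.reader(f), None)
        if raw_headers is None:
            raise ValueError(f"Export {file_path} is empty: no header row")
        headers = [h.strip().replace("\xa0", "") for h in raw_headers]
        reader = csv.DictReader(f, fieldnames=headers)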

        batch: list[dict[str, Any]] = []
        for index, row in enumerate(reader, start=2):  # row 1 was the header
            cleaned = {
                k: (v.strip().replace("\xa0", "") if isinstance(v, str) else v)
                for k, v in row.items()
                if k is not None
            }
            try:
                record = CostRecord.model_validate(cleaned)
                batch.append(record.model_dump(mode="json"))
                if len(batch) >= chunk_size:
                    yield batch
                    batch = []
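            except ValidationError as exc:
                logger.error("Schema drift at row %s: %s", index, exc.json())
                # Quarantine the row as read, not the cleaned copy, so the
                # fingerprint covers exactly what the export contained.
                # DictReader files surplus fields under the None key; "_extra"
                # is an illustrative name that keeps the payload serializable.
                raw = {("_extra" if k is None else k): v for k, v in row.items()}
                yield [{
                    "error": True,
                    "batch_id": batch_id,
                    "source_row": index,
                    "payload": raw,
                    "payload_sha256": payload_fingerprint(raw),
                    "reason": str(exc),
                    "quarantined_at": datetime.now(HUB_TZ).isoformat(),
                }]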
        if batch:
            yield batch

By catching ValidationError at the row level, the pipeline holds throughput while producing a structured exception queue for accountants, and the yield pattern drops straight into an async task runner. Routing a payroll row that references a union category or contract year the loaded rate table does not contain follows the same principle — it becomes an auditable exception through Compliance Fallback Chains rather than a halted close.
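A consumer has to tell validated batches apart from quarantine entries, since the generator yields both. The sketch below is one possible way to do that; route_chunks and the Chunk alias are hypothetical names, and it relies on the parser's convention that a quarantine entry arrives as a single-item list carrying "error": True.

```python
from typing import Any, Iterable

# A chunk is whatever the parser yields: a list of row dictionaries.
Chunk = list[dict[str, Any]]


def route_chunks(chunks: Iterable[Chunk]) -> tuple[list[Chunk], list[dict[str, Any]]]:
    """Split the parser's output into validated batches and quarantine entries."""
    batches: list[Chunk] = []
    quarantine: list[dict[str, Any]] = []
    for chunk in chunks:
        # Quarantine entries are single-item lists flagged with "error": True.
        if len(chunk) == 1 and chunk[0].get("error") is True:
            quarantine.append(chunk[0])
        else:
            batches.append(chunk)
    return batches, quarantine
```

Called as route_chunks(stream_parse_export(path, batch_id)), this collects everything in memory, which suits a test or a small file. A production consumer would more likely hand each chunk to a queue as it arrives instead of accumulating lists.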

Taken together, each row streams through a defensive open, header and locale normalization, and a row-level schema gate. Valid records batch to the worker pool: a full batch flushes downstream while a partial one keeps accumulating rows. Any ValidationError is fingerprinted into the immutable audit queue without halting the run.

Union Fringe Overrides and Multi-Currency Fields

Entertainment payroll operates under strict union realities that a raw export encodes but does not enforce. The International Alliance of Theatrical Stage Employees (IATSE) mandates fringe rates — pension, health, and vacation/holiday — that often override the base pay field, and the Screen Actors Guild–American Federation of Television and Radio Artists (SAG-AFTRA) governs performer scale and the pension-and-health contribution basis. A payroll-derived code that maps to a union account must be validated against a rate table keyed by union category, contract year, and jurisdiction in the same pass that runs schema validation — never coerced or guessed. Because every fringe multiplier applies to money, every one is a Decimal operation; a fringe computed in floating point disagrees with the guild’s own figure by cents that compound into an audit finding.
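A rate-table lookup of that shape might look like the sketch below. Everything specific in it is an assumption made for illustration: the FRINGE_RATES table, the "EXAMPLE_LOCAL" category, the 18.50% rate, and the half-up rounding to the cent are placeholders, not figures from any agreement.

```python
from decimal import Decimal, ROUND_HALF_UP

# HYPOTHETICAL rates for illustration only. Real values come from the
# agreement that applies to the category, contract year, and jurisdiction.
FRINGE_RATES: dict[tuple[str, int, str], Decimal] = {
    ("EXAMPLE_LOCAL", 2024, "US-CA"): Decimal("0.1850"),
}

CENT = Decimal("0.01")


def fringe_amount(
    base_pay: Decimal, union_category: str, contract_year: int, jurisdiction: str
) -> Decimal:
    key = (union_category, contract_year, jurisdiction)
    try:
        rate = FRINGE_RATES[key]
    except KeyError:
        # No guessed default: a missing rate is an exception to route,
        # not a value to invent.
        raise LookupError(f"No fringe rate loaded for {key!r}") from None
    # Rounding mode is an assumption; use whatever the agreement specifies.
    return (base_pay * rate).quantize(CENT, rounding=ROUND_HALF_UP)


print(fringe_amount(Decimal("1234.56"), "EXAMPLE_LOCAL", 2024, "US-CA"))  # 228.39
```

Raising on a missing key rather than falling back to a default rate keeps the lookup consistent with the rule that a payroll row is never coerced or guessed.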

Multi-currency reconciliation demands deterministic FX anchoring. Rather than making a live API call during ingestion, pin the daily reference mid-market rate at each transaction’s own timestamp and apply it as a Decimal operation in the transformation stage, recording both the original and the converted amount. Pinning at the transaction timestamp — not at ingestion time — is what makes a re-run reproducible; the per-transaction rate-pinning and dual-write mechanics are detailed in Async Batch Processing for Multi-Currency Shoots.
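The pinning idea can be sketched as a lookup keyed by currency and rate date. The names here (PINNED_RATES, ConvertedAmount, convert_to_reporting) and the sample rate are hypothetical, and the sketch assumes the rate date is the calendar date of the transaction's own timezone-aware timestamp.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from decimal import Decimal, ROUND_HALF_UP

# HYPOTHETICAL pinned rates keyed by (currency, rate date); the value is
# an invented example, not a published reference rate.
PINNED_RATES: dict[tuple[str, date], Decimal] = {
    ("EUR", date(2024, 3, 4)): Decimal("1.0850"),
}


@dataclass(frozen=True)
class ConvertedAmount:
    original: Decimal
    original_currency: str
    converted: Decimal
    rate: Decimal
    rate_date: date


def convert_to_reporting(amount: Decimal, currency: str, posted_at: datetime) -> ConvertedAmount:
    if posted_at.tzinfo is None:
        raise ValueError("posted_at must be timezone-aware")
    # Pin to the transaction's own date, never to the ingestion time.
    rate_date = posted_at.date()
    rate = PINNED_RATES[(currency, rate_date)]  # KeyError if no rate was pinned
    converted = (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return ConvertedAmount(amount, currency, converted, rate, rate_date)


result = convert_to_reporting(
    Decimal("1000.00"), "EUR", datetime(2024, 3, 4, 9, 30, tzinfo=timezone.utc)
)
print(result.original, result.converted)  # 1000.00 1085.00
```

Because the rate is looked up from a table rather than fetched, running the same file again a week later yields the same converted amount, and the record keeps both figures side by side.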

Audit Trail Requirements

The quarantine queue is what lets the parser guarantee zero data loss while still refusing a bad row. Its contract is strict: a rejected row is never dropped, never silently repaired inline, and never mutated. It is preserved exactly as read and annotated with five audit fields written to append-only, write-once storage before any corresponding ledger transaction commits, so a crash mid-batch leaves a replayable record of intent rather than a gap:

batch_id: ties every admitted and quarantined row back to a single ingestion run.
source_row: the 1-based record index that locates the row in the original export, counting the header as row 1. Because a quoted field can span several physical lines, it counts parsed records rather than raw lines.
payload_sha256: a SHA-256 fingerprint of the row’s canonical serialization; carried alongside a whole-file SHA-256 stamped at intake, it lets even a malformed row be matched to the exact EP or Showbiz submission byte-for-byte.
reason: the machine-readable validation error that explains why the row failed, so an accountant can triage without diffing CSVs.
quarantined_at: a timezone-aware ISO-8601 timestamp anchored to the production hub’s IANA zone.
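The whole-file hash that payload_sha256 is carried alongside can be computed without loading the export into memory. The file_sha256 helper below is a hypothetical name for a minimal sketch; it hashes the raw bytes before any decoding, so a BOM or a stray line ending still counts toward the digest.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the export's raw bytes in fixed-size blocks, before any decoding."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```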

Because completion-bond lenders require explicit reconciliation of every dollar, those two hashes are the linchpin of the audit story. They also make re-ingestion idempotent: a corrected row that reproduces the same canonical bytes reproduces the same hash, so a reconciliation tool can prove whether a fix was genuinely applied or a duplicate slipped in. A reconciliation over any run should show that admitted records plus quarantine entries equal the rows read — no row is ever both, and none vanishes.
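The row-count invariant lends itself to a short check. The reconcile function below is a hypothetical sketch that assumes the caller already knows how many data rows were read and how many records were admitted; it also reports any fingerprint that appears more than once in the quarantine entries, which may indicate a duplicate submission.

```python
from collections import Counter
from typing import Any


def reconcile(
    rows_read: int, admitted_count: int, quarantine_entries: list[dict[str, Any]]
) -> list[str]:
    """Verify that every row is accounted for and return repeated fingerprints."""
    accounted = admitted_count + len(quarantine_entries)
    if accounted != rows_read:
        raise ValueError(
            f"Row count mismatch: read {rows_read}, accounted for {accounted}"
        )
    counts = Counter(entry["payload_sha256"] for entry in quarantine_entries)
    # A fingerprint seen more than once means the same bytes were quarantined twice.
    return sorted(h for h, n in counts.items() if n > 1)
```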

Every quarantined row carries five audit fields: its payload_sha256 ties back to the whole-file SHA-256 stamped at intake, and its source_row locates the exact line — so any rejected record is traceable byte-for-byte to the original EP/Showbiz submission.

Gotchas and Production Edge Cases

Locale ambiguity on lone separators. A field like 1,250 is genuinely ambiguous — it is 1250 in the US and 1.25 in Europe. The validator resolves a lone comma to a decimal mark; if an export is known to use comma thousands separators, drive that decision from the file’s locale metadata rather than the heuristic, or the amount will be off by three orders of magnitude.
DST boundaries on multi-location shoots. A unit that crosses a Daylight Saving Time transition will disagree about when a close window elapsed if timestamps are naive or use fixed UTC offsets. Anchoring quarantined_at and every post date to an IANA zone through zoneinfo is the only way the audit timeline matches what an auditor reconstructs later.
Header drift between studio templates. Showbiz column order tracks the operator’s on-screen layout and drifts between productions. Bind fields by normalized header name, not positional index, and fail loudly on a missing required header rather than silently reading the wrong column.
Partial-file transport failures. A truncated download or torn SFTP transfer is a file-level failure, not a row-level one; it must be caught upstream at the boundary in CSV & API Sync Pipelines and the whole file rejected before a single row is parsed, so a partial file can never post half a day’s costs.
Resolved does not mean correct. Pattern-matching XXXX.YY.ZZ.WW proves a code is well-formed; whether it books above or below the line — which drives how it rolls up in the guarantor’s report — is governed by Above/Below-the-Line Mapping.
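For the header-drift case above, binding by normalized name and failing on a missing column could be sketched as follows. The required set mirrors the non-optional fields of CostRecord; the lowercase-and-underscore normalization and the function names are assumptions for illustration, not a rule the exports define.

```python
# Required columns mirror the non-optional fields of the CostRecord model.
REQUIRED_HEADERS = {"cost_code", "vendor_id", "description", "amount", "currency"}


def normalize_header(name: str) -> str:
    # Treat non-breaking spaces as spaces, trim, lowercase, join words with "_".
    return "_".join(name.replace("\xa0", " ").strip().lower().split())


def bind_headers(raw_headers: list[str]) -> list[str]:
    headers = [normalize_header(h) for h in raw_headers]
    missing = REQUIRED_HEADERS - set(headers)
    if missing:
        # Fail loudly instead of silently reading the wrong column.
        raise ValueError(f"Export is missing required columns: {sorted(missing)}")
    duplicates = sorted({h for h in headers if headers.count(h) > 1})
    if duplicates:
        raise ValueError(f"Export has duplicate columns: {duplicates}")
    return headers


print(bind_headers(["Cost Code ", "Vendor\xa0ID", "Description", "Amount", "Currency"]))
# ['cost_code', 'vendor_id', 'description', 'amount', 'currency']
```

Since this is a file-level check, it would run once on the header row before any data row is validated, so a drifted template is rejected as a whole rather than quarantined row by row.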

Eliminating manual cleanup is not about removing human oversight; it shifts accountant effort from data scrubbing to exception management. When this parsing becomes a standardized engineering practice, production teams gain real-time visibility into below-the-line costs, union compliance drift, and multi-currency exposure, and deliver bond-ready financials without sacrificing pipeline velocity — the whole objective of Guild Compliance & Rule Validation Automation meeting the ingestion architecture that feeds it.

Related Guides

EP/Showbiz Sync Parsing — the parent topic defining the deterministic ingestion, cost-code validation, and bond-grade audit contract this walkthrough implements.
Handling Malformed CSVs from Set Accountants — encoding fallbacks and delimiter detection for field-generated corruption upstream of these exports.
Async Batch Processing for Multi-Currency Shoots — the semaphore-bounded worker pool and FX rate-pinning that consume the batches this parser yields.
Automating Daily Cost Report Ingestion with Python — the idempotent transport and checksum layer that delivers the files parsed here.
Compliance Fallback Chains — deterministic secondary and cached routing when a guild rate table is missing at parse time.
