Automating Daily Cost Report Ingestion with Python

A Daily Cost Report (DCR) has a hard deadline: the production office expects yesterday’s actuals reconciled before the morning cost meeting, every day, without a human babysitting the parser. The exact automation task this page covers is turning that recurring obligation into an unattended scheduled job — one that pulls each vendor’s DCR for a specific production day, validates every row before a single dollar reaches the ledger, and produces a report the completion guarantor can audit months later. The failure mode is rarely a crash; it is a silent success where a shifted delimiter, a localized decimal comma, or a stale re-uploaded file corrupts actuals and nobody notices until variance reporting nets a duplicated code against itself. This page treats daily ingestion as a deterministic, replayable pipeline rather than a pandas.read_csv() cron one-liner, extending the deduplication and normalization contracts defined by CSV & API Sync Pipelines into a runnable, debugging-first Python job.

Prerequisites and Context

This page builds directly on the CSV & API Sync Pipelines reference and assumes the broader Cost Ingestion & Data Parsing Workflows architecture for how heterogeneous feeds converge into one normalized cost record. Target Python 3.11+ (zoneinfo has been in the standard library since Python 3.9) and lean on a deliberate stack:

Pydantic v2 — strict boundary validation via model_validate and field_validator, so type coercion is owned by the schema, not by pandas’ type inference.
pandas or polars — chunked CSV reading with every column held as a string until it reaches Decimal.
Standard-library decimal, hashlib, json, and zoneinfo — currency-safe arithmetic, audit hashing, canonical serialization, and timezone-aware timestamps respectively.
A scheduler (cron, systemd timer, or APScheduler) that hands the job a single parameter: the production date to ingest.

Never use float for monetary values — a fractional-cent drift compounded across tens of thousands of daily line items becomes exactly the variance a bond lender asks you to explain. Two upstream contracts matter: malformed rows are quarantined rather than repaired inline, per Schema Validation & Error Handling, and legacy Entertainment Partners / Showbiz exports arrive pre-normalized by EP/Showbiz Sync Parsing. Validation rules must respect the fringe multipliers, per-diem caps, and overtime thresholds set by the International Alliance of Theatrical Stage Employees (IATSE), the Directors Guild of America (DGA), and the Screen Actors Guild–American Federation of Television and Radio Artists (SAG-AFTRA).
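To make the float warning concrete, here is a minimal sketch that accumulates the same amount ten times, once as float and once as Decimal. The helper names and the 0.10 amount are illustrative assumptions, not values from any real DCR.

```python
from decimal import Decimal


def total_as_float(amount: str, count: int) -> float:
    """Accumulate the amount with binary floating point (what to avoid)."""
    total = 0.0
    for _ in range(count):
        total += float(amount)
    return total


def total_as_decimal(amount: str, count: int) -> Decimal:
    """Accumulate the amount with exact decimal arithmetic."""
    total = Decimal("0")
    for _ in range(count):
        total += Decimal(amount)
    return total


# Hypothetical line item: ten postings of 0.10 should total exactly 1.00.
print(total_as_float("0.10", 10) == 1.0)                 # False: binary drift
print(total_as_decimal("0.10", 10) == Decimal("1.00"))   # True: exact
```

The drift here is a fraction of a cent, but it is the same mechanism that compounds across tens of thousands of daily line items.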

Step 1 — Anchor the Ingestion Window to a Production Day

“Yesterday’s costs” is not a UTC concept. A production shooting in America/Los_Angeles closes its books on wall-clock local time, and a splinter unit in Europe/London closes on a different one. If you derive the ingestion window from a naive datetime.now() or a fixed UTC offset, a job that fires at 03:00 will silently grab the wrong calendar day twice a year across a daylight-saving boundary. Resolve the window from an IANA identifier, and derive a stable run identifier from the date and vendor so the same day can be replayed deterministically.

import hashlib
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def production_day_window(production_date: str, tz_name: str) -> tuple[datetime, datetime]:
    """Return the timezone-aware [start, end) bounds for one production day.

    production_date: ISO date string, e.g. "2026-07-02".
    tz_name: IANA identifier, e.g. "America/Los_Angeles" — never a fixed offset.
    """
    tz = ZoneInfo(tz_name)
    start = datetime.fromisoformat(production_date).replace(tzinfo=tz)
    # Arithmetic on an aware datetime is wall-clock arithmetic: adding one day lands on
    # the next local midnight even when DST makes the day 23 or 25 elapsed hours long.
    end = start + timedelta(days=1)
    return start, end


def run_id(production_date: str, vendor_id: str) -> str:
    """Deterministic run key so a re-fired job for the same day is idempotent."""
    seed = f"{production_date}|{vendor_id}".encode()
    return hashlib.sha256(seed).hexdigest()[:16]

Passing the production date in as a parameter — rather than reading the clock inside the job — is what makes the pipeline replayable. When a disputed Tuesday needs re-ingesting after a corrected vendor upload, you invoke the exact same code path with the exact same window, and the deduplication key from CSV & API Sync Pipelines turns a re-run into a no-op instead of a double accrual.
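As a hedged illustration of why the window must come from an IANA zone, the sketch below measures the real elapsed length of three production days in America/Los_Angeles. The function names are hypothetical and mirror the Step 1 helper. The dates assume current US daylight-saving rules, under which 2026-03-08 and 2026-11-01 are the transition days, and the example needs time zone data available on the host.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def local_day_window(production_date: str, tz_name: str) -> tuple[datetime, datetime]:
    """Wall-clock [start, end) bounds for one local calendar day."""
    tz = ZoneInfo(tz_name)
    start = datetime.fromisoformat(production_date).replace(tzinfo=tz)
    return start, start + timedelta(days=1)


def elapsed_hours(start: datetime, end: datetime) -> float:
    """Real elapsed hours between two aware datetimes.

    Convert to UTC first: subtracting two datetimes that share the same
    tzinfo object compares wall-clock fields only and would always give 24.
    """
    delta = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return delta.total_seconds() / 3600


for day in ("2026-03-08", "2026-07-02", "2026-11-01"):
    start, end = local_day_window(day, "America/Los_Angeles")
    print(day, elapsed_hours(start, end))  # 23.0, 24.0, 25.0 respectively
```

A fixed offset of -08:00 or -07:00 would treat all three days as 24 hours long, which is where the mis-bucketed hour comes from.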

Step 2 — Validate Every Row as an Untrusted Payload

The parsing layer must be completely decoupled from ledger synchronization: a malformed header, an unexpected date format, or a vendor-specific encoding quirk must never cascade into the accounting database. Read every column as a string so Pydantic — not pandas’ inference — owns type coercion, then validate each row against a canonical schema aligned to the production’s chart of accounts. Rows that pass are collected; rows that fail are quarantined with the exact context an accountant needs to fix and resubmit them.

import hashlib
import json
import logging
from datetime import date, datetime
from decimal import Decimal
from pathlib import Path
from zoneinfo import ZoneInfo

import pandas as pd
from pydantic import BaseModel, Field, ValidationError, field_validator

logger = logging.getLogger("dcr_ingestion")
logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
LEDGER_TZ = ZoneInfo("America/Los_Angeles")


class DCRLineItem(BaseModel):
    cost_code: str = Field(..., min_length=4, max_length=8)
    department: str
    actual_amount: Decimal = Field(..., gt=0)  # Decimal preserves exact monetary precision
    transaction_date: date
    vendor_id: str

    @field_validator("actual_amount")
    @classmethod
    def flag_high_value_transaction(cls, v: Decimal) -> Decimal:
        # Surface large postings for manual review; the threshold is policy-driven.
        if v > Decimal("1000000"):
            logger.warning("High-value transaction flagged for manual review: %s", v)
        return v


def _payload_hash(payload: dict) -> str:
    """SHA-256 over a canonical serialization — same input always yields the same digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str).encode()
    return hashlib.sha256(canonical).hexdigest()


def process_dcr_chunk(chunk_path: Path, quarantine_path: Path, ingestion_run_id: str) -> None:
    # Read every column as a string so Pydantic, not pandas, owns type coercion.
    df = pd.read_csv(chunk_path, dtype=str)
    valid_rows: list[dict] = []
    quarantined_rows: list[dict] = []

    for idx, row in df.iterrows():
        raw = row.to_dict()
        try:
            validated = DCRLineItem.model_validate(raw)
            # mode="json" emits serializable primitives (Decimal -> str, date -> ISO 8601).
            valid_rows.append(validated.model_dump(mode="json"))
        except ValidationError as exc:
            error_payload = {
                "run_id": ingestion_run_id,
                "row_index": int(idx),
                "raw_data": raw,
                "payload_hash": _payload_hash(raw),
                "validation_errors": exc.errors(),
                "source_file": chunk_path.name,
                "quarantined_at": datetime.now(LEDGER_TZ).isoformat(),
            }
            quarantined_rows.append(error_payload)
            logger.error("Row quarantined due to schema violation: %s", error_payload["payload_hash"])

    if valid_rows:
        pd.DataFrame(valid_rows).to_parquet(chunk_path.with_suffix(".valid.parquet"))
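    if quarantined_rows:
        # mode="a" (pandas 2.0+) appends, so quarantine records from earlier chunks are not overwritten.
        pd.DataFrame(quarantined_rows).to_json(quarantine_path, orient="records", lines=True, mode="a")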

This boundary is what prevents garbage-in, garbage-out. The quarantine file becomes the single source of truth for accountants to resolve discrepancies, attach supporting invoices, and resubmit corrected payloads. For union compliance, add validators that cross-reference department codes against active collective bargaining agreements, rejecting allocations that violate jurisdictional boundaries — camera-department labor billed as grip, for example — using the same discipline as Handling Malformed CSVs from Set Accountants.
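One possible shape for the department cross-check described above is sketched below. Everything in it is an assumption for illustration: the ALLOWED_DEPARTMENTS mapping, its cost-code prefixes, and the check_department helper are hypothetical placeholders, and real rules would come from the production's chart of accounts and the applicable agreements.

```python
# Hypothetical mapping from a four-character cost-code prefix to the departments
# allowed to post against it. Placeholder values only, not real guild rules.
ALLOWED_DEPARTMENTS: dict[str, frozenset[str]] = {
    "2100": frozenset({"camera"}),
    "2200": frozenset({"grip"}),
}


def check_department(cost_code: str, department: str) -> str:
    """Return the normalized department, or raise ValueError on a mismatch."""
    normalized = department.strip().lower()
    allowed = ALLOWED_DEPARTMENTS.get(cost_code[:4])
    if allowed is None:
        raise ValueError(f"no department rule for cost code {cost_code!r}")
    if normalized not in allowed:
        raise ValueError(
            f"department {normalized!r} may not post to cost code {cost_code!r}"
        )
    return normalized


print(check_department("2100", " Camera "))  # camera
```

Because the check needs two fields at once, it would most naturally be called from a Pydantic model_validator(mode="after") on DCRLineItem; a ValueError raised there surfaces as a ValidationError, so the row follows the same quarantine path as any other schema failure.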

Step 3 — Route, Then Reconcile

The flow below captures the debugging-first path: read each row as a string, validate it, branch valid rows to Parquet, and quarantine offending rows with full audit context before anything is safe for downstream ledger synchronization.

Figure: Debugging-first DCR ingestion. Each row is validated against the schema before any dollar moves; valid rows stream to Parquet and become safe for ledger sync, while offending rows are quarantined with full audit context as the reconciliation source of truth.

Memory pressure is the other daily-scale failure mode: multi-gigabyte Excel exports from legacy enterprise resource planning (ERP) systems can raise MemoryError during peak reporting cycles if loaded whole. Read in chunks, stream validated output straight to disk instead of accumulating full DataFrames, and let Async Batch Processing absorb vendor API rate limits with exponential backoff and idempotent writes. When a production spans jurisdictions, attach a date-effective foreign-exchange rate to every transaction, compute its variance against the treasury rate, and flag any discrepancy that exceeds lender tolerance bands; this is the deterministic rate-pinning pattern detailed in Async Batch Processing for Multi-Currency Shoots.
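The chunked-read advice can be sketched with the standard-library csv module, used here in place of pandas so the example has no dependencies. The iter_row_chunks name and the default chunk size are assumptions of this sketch.

```python
import csv
from itertools import islice
from pathlib import Path
from typing import Iterator


def iter_row_chunks(csv_path: Path, chunk_size: int = 5_000) -> Iterator[list[dict[str, str]]]:
    """Yield lists of at most chunk_size rows; every value stays a string."""
    # utf-8-sig drops a leading byte order mark if the vendor export has one.
    with csv_path.open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        # islice pulls only chunk_size rows at a time, so memory use is bounded
        # by the chunk, not by the size of the file.
        while chunk := list(islice(reader, chunk_size)):
            yield chunk
```

Each yielded chunk can be validated and written out before the next one is read. With pandas, passing chunksize to pd.read_csv alongside dtype=str gives the same bounded-memory iteration over DataFrames.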

Audit Trail Requirements

A daily ingestion job is only trustworthy if every run is reconstructable. Bond lenders, studio executives, and tax-incentive auditors require immutable, timestamped records of every financial transaction and every rejection. For each run, log:

run_id — the deterministic key from Step 1, so all rows from one ingestion share a correlation handle.
payload_hash — a SHA-256 digest over the canonical serialization of each raw row. Because the hash is computed over sorted, separator-normalized JSON, re-hashing the same payload later yields an identical digest — the idempotency check that lets you replay a disputed day deterministically.
validation_status, row_index, source_file, and the timezone-aware quarantined_at timestamp for every rejected row.
schema_version and, on any exception, the full traceback plus the offending row’s raw bytes.

Write these to append-only, write-once storage (object storage with versioning locked on, or a WORM-configured bucket) — the quarantine JSON and the audit log must never be mutated in place. This is the same append-only provenance principle that governs Production Schema Design, and it is what converts a nightly cron job into evidence a completion guarantor will accept.
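The replay property behind run_id and payload_hash can be sketched as follows. The hashing mirrors the canonical serialization used in Step 2; InMemoryLedger is a hypothetical stand-in for the real ledger client and only tracks write keys.

```python
import hashlib
import json


def payload_hash(payload: dict) -> str:
    """SHA-256 over sorted-key, compact-separator JSON."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()


class InMemoryLedger:
    """Hypothetical ledger stand-in that remembers (run_id, payload_hash) keys."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def write(self, run_id: str, row: dict) -> bool:
        """Return True if the row was posted, False if it was a replayed duplicate."""
        key = (run_id, payload_hash(row))
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


row = {"cost_code": "2100", "actual_amount": "10.00"}
same_row_reordered = {"actual_amount": "10.00", "cost_code": "2100"}

ledger = InMemoryLedger()
print(ledger.write("run-1", row))                 # True: first posting
print(ledger.write("run-1", same_row_reordered))  # False: same digest, no-op
```

Key order does not change the digest because the serialization sorts keys, so a re-sent file whose columns were merely rearranged still deduplicates.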

Gotchas and Production Edge Cases

Daylight-saving boundaries. A window derived from a fixed UTC offset silently double-counts or skips an hour twice a year. Always resolve the production day through zoneinfo, as in Step 1, and pin the identifier to the shooting entity, never the server’s locale.
Multi-location shoots. Second-unit and international splinter shoots close their books on different local calendars. Run one job per shooting entity with its own tz_name, rather than one global job that assumes a single day boundary.
Idempotent re-runs. Set accountants routinely re-send a corrected DCR for the same day. Key each ledger write on (run_id, payload_hash) so a re-ingested identical row is a no-op, not a duplicate accrual — the exact continuity guarantee CSV & API Sync Pipelines is built around.
Localized numerics and encodings. Strip any UTF-8 byte order mark (BOM), remove thousands separators, and coerce comma decimal separators to periods before validation; otherwise a legitimate 1.234,56 EUR amount fails schema checks and lands in quarantine for no real reason.
Missing rate tables, not just missing codes. When a guild rate table is absent at run time, a bounded fallback — not a crash — keeps the pipeline continuous; resolve it through the pattern in Building Fallback Chains for Missing Guild Rate Tables.
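For the localized-numerics item above, a minimal normalization sketch follows. It assumes the decimal separator is configured per vendor rather than guessed from the data, since a string like "1,234" is ambiguous on its own; normalize_amount and its decimal_sep parameter are hypothetical names.

```python
from decimal import Decimal, InvalidOperation


def normalize_amount(raw: str, decimal_sep: str = ".") -> Decimal:
    """Convert a vendor amount string to Decimal using an explicit decimal separator."""
    if decimal_sep not in (".", ","):
        raise ValueError(f"unsupported decimal separator: {decimal_sep!r}")
    # Whichever of "." and "," is not the decimal separator is treated as grouping.
    thousands_sep = "," if decimal_sep == "." else "."
    # Drop a stray byte order mark and no-break spaces, then outer whitespace.
    text = raw.replace("\ufeff", "").replace("\u00a0", "").strip()
    text = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    try:
        return Decimal(text)
    except InvalidOperation as exc:
        raise ValueError(f"not a monetary amount: {raw!r}") from exc


print(normalize_amount("1.234,56", decimal_sep=","))  # 1234.56
print(normalize_amount("\ufeff1,234.56"))             # 1234.56
```

Running this before model_validate keeps the schema strict while letting a correctly formatted European amount through; anything that still fails to parse raises and can be quarantined with its raw value intact.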

Treated as a compliance-critical engineering discipline rather than a formatting chore, automated DCR ingestion turns brittle manual reconciliation into a resilient, replayable pipeline: peak shooting weeks scale predictably, accountants receive actionable exception reports instead of corrupted spreadsheets, and the production’s financial posture stays audit-ready from day one.

Related

CSV & API Sync Pipelines: the parent guide whose idempotency and normalization contracts this daily job enforces.
Async Batch Processing for Multi-Currency Shoots: deterministic FX rate pinning for cross-jurisdiction DCRs.
Handling Malformed CSVs from Set Accountants: the quarantine and reconciliation discipline this pipeline reuses.
Parsing EP/Showbiz Sync Exports Without Manual Cleanup: the upstream normalization that feeds this job clean legacy rows.
Cost Ingestion & Data Parsing Workflows: the reference architecture this page sits within.
