Every analyst who has ever opened a dataset has been trained to clean it. Strip the duplicates. Standardize the formats. Drop the outliers. Normalize the entries. The cleaned dataset becomes the analytical surface. The original is treated as raw, suspect, contaminated.
In corridor intelligence, that instinct is wrong. The dirt is the signal.
What clean data hides.
Conventional analytical training treats messiness as friction to be removed before the real work begins. The premise: the underlying pattern is clean, and the noise is a layer to subtract. That premise holds in clinical trial data. It holds in well-defined manufacturing telemetry. It holds in any domain where the generating process is stable and the recording instrument is calibrated.
It does not hold in public-record corridor data. The records are produced by hundreds of independent jurisdictional systems with different schemas, different definitions, different reporting cadences, and different incentives. The messiness is not friction on top of a clean signal. The messiness is the signal.
An entity that registers in three states under three slightly different names is doing so for a reason. A property that appears in one assessor record but not another is appearing inconsistently for a reason. A contractor that shows up in federal contract data with one address and in state corporate registry data with another is producing that mismatch for a reason.
Cleaning normalizes those discrepancies away. The pattern that the discrepancies were trying to surface disappears with them.
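The difference between cleaning a name variation away and reading it can be made concrete. A minimal sketch, where the records and the normalization rule are illustrative assumptions, not drawn from any real registry:

```python
import re

# Three registrations for what is plausibly one entity.
records = [
    {"state": "DE", "name": "Harbor Point Logistics LLC"},
    {"state": "NJ", "name": "Harbor Point Logistics, L.L.C."},
    {"state": "PA", "name": "Harborpoint Logistics LLC"},
]

def normalize(name: str) -> str:
    """Typical cleanup: lowercase, strip punctuation and spacing."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Conventional cleaning: collapse to unique normalized names.
# One "clean" entity remains; the fact of three distinct registered
# spellings across three states is gone.
cleaned = {normalize(r["name"]) for r in records}

# Anomaly-preserving read: group the raw variants under the normalized
# key and flag any key that carries more than one raw spelling.
variants: dict[str, set[str]] = {}
for r in records:
    variants.setdefault(normalize(r["name"]), set()).add(r["name"])

flags = {key: names for key, names in variants.items() if len(names) > 1}
# `flags` now holds the analytical question: why does one entity
# register under multiple spellings?
```

Both passes use the same normalization; the difference is whether the raw variants are discarded or retained as the object of study.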
Anomaly is the read.
The Loxley methodology treats anomaly as the entry point, not the disposal pile. When an entity surfaces under multiple names, the analytical question is what economic activity that name variation is concealing. When a property appears inconsistently across registries, the question is what jurisdictional or ownership transaction generated the inconsistency. When a federal contract address does not match the corporate registry address, the question is which one represents the actual operational footprint and why the operator is comfortable with the gap.
Every one of those questions is a leading indicator. The cleaned dataset cannot answer any of them because the cleaning destroyed the evidence.
What replaces cleaning.
The discipline is not to skip data quality work. The discipline is to do it differently. Loxley preserves the raw records, tags the discrepancies, asks what each discrepancy implies, and cross-references against adjacent sources to determine which version of the inconsistency reflects ground truth.
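That workflow, preserve the raw, tag the discrepancy, carry the question forward, can be sketched as a small routine. The field names, record values, and tag shape here are illustrative assumptions, not the Loxley implementation:

```python
# Two views of the same contractor: a federal contract record and a
# state corporate registry record, with mismatched addresses.
federal = {"entity_id": "C-1042", "address": "450 Commerce Way, Wilmington, DE"}
registry = {"entity_id": "C-1042", "address": "88 Pine St, Suite 12, Newark, NJ"}

def tag_discrepancies(a: dict, b: dict) -> list[dict]:
    """Preserve both raw values; emit a tag per mismatched shared field
    instead of overwriting one side with the other."""
    tags = []
    for field in a.keys() & b.keys():
        if a[field] != b[field]:
            tags.append({
                "field": field,
                "federal_value": a[field],
                "registry_value": b[field],
                "question": f"which {field} reflects the operational footprint?",
            })
    return tags

tags = tag_discrepancies(federal, registry)
# Both raw values survive. The mismatch becomes a tracked analytical
# question to resolve against adjacent sources, not a row to "fix".
```

The design choice is that the output of quality work is a list of questions attached to the raw data, rather than a single reconciled value that hides which source disagreed.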
The output looks clean. The reader sees a coherent corridor read with clear pattern calls. What the reader does not see is the apparatus underneath that treated every anomaly as an analytical question rather than a data quality problem to suppress.
This is the difference between a dataset and an intelligence product. The dataset is what survives cleaning. The intelligence is what the discrepancies were trying to tell you before someone removed them.
Clean data is for confirming what you already know. Dirty data is for finding what you do not.