Signals and Artifacts in Health Data

May 18, 2026 · Benjamin Kramer

research

One of the hardest parts of working with large-scale health data is that the dataset is big enough to make almost anything look real if you ask the question badly. This is not a small problem. In a cohort with millions of people, the usual comfort of sample size can become misleading. The standard errors shrink, the p-values become impressive, and suddenly a coding artifact can look like biology.
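
A rough way to see how sample size does this is a minimal sketch using a standard two-proportion z-test. The prevalences (5.0% versus 5.1%) and group sizes are hypothetical numbers chosen for illustration, not values from any real cohort, and the function name is my own.

```python
import math

def two_proportion_z_test(p1, p2, n_per_group):
    """Two-sided z-test for a difference between two proportions (equal group sizes)."""
    pooled = (p1 + p2) / 2  # pooled proportion when both groups have the same size
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical prevalences: 5.0% vs 5.1%, a gap small enough to come from coding habits.
for n in (2_000, 2_000_000):
    z, p = two_proportion_z_test(0.050, 0.051, n)
    print(f"n per group = {n:>9,}  z = {z:5.2f}  p = {p:.2g}")
```

The difference between the groups is identical in both runs. Only the standard error changes, shrinking with the square root of the sample size, so the same 0.1 percentage point gap goes from unremarkable to highly significant without becoming any more biological.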

Claims data is especially vulnerable to this because it was not collected to answer scientific questions. It was collected because someone needed to bill for a visit, procedure, prescription, or diagnosis. That does not make the data bad. It makes the data conditional on a system of care. Diagnosis codes reflect disease, but they also reflect access, provider behavior, insurance structure, follow-up intensity, and how often a person interacts with the medical system.

This is why I have become more skeptical of clean-looking results, not less. A messy result sometimes announces its limitations. A clean result can be more dangerous because it invites interpretation before enough pressure has been applied. If an association appears across a huge sample and has a beautiful confidence interval, the next question should be: what else could have produced this exact pattern?

The answer is often administrative. A disease may appear more common in one group because that group is seen more often. A medication may look protective because the people receiving it are healthier in ways the model does not capture. A diagnosis may cluster with another diagnosis because one specialist tends to code both, not because the conditions share biology. These are not edge cases. They are part of the texture of the data.
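
The first of these mechanisms can be sketched with a toy calculation. Everything here is an assumption for illustration: both groups share the same true prevalence, each visit independently has a fixed chance of producing the diagnosis code for an affected person, and nobody is coded falsely. The function and constant names are hypothetical.

```python
def observed_prevalence(true_prevalence, visits_per_year, p_coded_per_visit):
    """Share of the group with at least one coded diagnosis in a year."""
    p_never_coded = (1 - p_coded_per_visit) ** visits_per_year
    return true_prevalence * (1 - p_never_coded)

TRUE_PREVALENCE = 0.10   # identical in both groups (assumed)
P_CODED = 0.30           # chance an affected person's visit produces the code (assumed)

low_contact = observed_prevalence(TRUE_PREVALENCE, visits_per_year=2, p_coded_per_visit=P_CODED)
high_contact = observed_prevalence(TRUE_PREVALENCE, visits_per_year=6, p_coded_per_visit=P_CODED)

print(f"low contact:  {low_contact:.3f}")
print(f"high contact: {high_contact:.3f}")
print(f"apparent prevalence ratio: {high_contact / low_contact:.2f}")
```

Under these assumptions the group seen six times a year shows a coded prevalence of about 8.8%, against 5.1% for the group seen twice, an apparent ratio of roughly 1.7 even though the underlying disease burden is the same.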

I do not think this means claims data should be treated as weak evidence. I think the opposite. It is powerful precisely because it captures medicine at a scale and messiness that smaller curated cohorts often cannot. But the strength of the data depends on being honest about what it is. A claims database is not a microscope. It is a record of interactions between patients, clinicians, institutions, and payment systems.

The scientific task is to separate signal from everything that can imitate signal. That usually means triangulation: within-family comparisons, sensitivity analyses, negative controls, replication across coding definitions, and literature checks that happen before the story is written. The boring parts of the workflow are often the parts that keep the result from becoming fiction.
