The Spreadsheet That Three Institutions Edit at Once
Field notes on QA/QC for biological datasets that were never meant to be one dataset.
There's a specific kind of file I've come to recognize. It lives on a shared drive, it has a name like monitoring_data_FINAL_v3_revised.xlsx, and it has been opened, edited, and re-saved by people across two or three institutions over the course of several years. Some of those people are no longer with the program. Some of the columns no longer mean what they used to mean. Somewhere around row 4,217 the units quietly switch from millimeters to centimeters, and nobody is sure when that happened.
If you work in restoration, monitoring, or any kind of long-running ecological program, you know this file. You probably have one open in another tab right now.
I want to write about how I actually deal with these, because most of the QA/QC writing I see online is either too theoretical to be useful or written as if you had started from scratch with a clean schema and a CI pipeline. Real biological data does not look like that. Real biological data looks like a coral monitoring database where someone in 2019 started entering "bleached" in the notes column and someone in 2022 added a bleach_severity column without going back to recode the old entries.
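To make that concrete, here is the kind of backfill that situation forces. This is a hypothetical sketch, not anyone's real pipeline; the filename and column names are invented, and the honest fix is a sentinel category, because the old notes never recorded a severity:

```python
import pandas as pd

# Hypothetical reconstruction of the coral example: pre-2022 rows recorded
# bleaching only as free text in `notes`; newer rows use `bleach_severity`.
df = pd.read_csv("monitoring_data.csv", dtype={"bleach_severity": "string"})

legacy = df["bleach_severity"].isna() & df["notes"].str.contains(
    "bleach", case=False, na=False
)
# The old notes carry no severity score, so backfill a sentinel category
# rather than guessing a number.
df.loc[legacy, "bleach_severity"] = "present_unscored"
print(f"Backfilled {legacy.sum()} legacy rows")
```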
Before I write a single line of validation code, I open the file and I look at it. Page through it. Sort by date. Sort by site. Look at the tail. Look at any column whose name I do not immediately understand. I write down questions in an actual notebook, not a Jupyter cell, because the questions almost always come back later, when I am trying to interpret a model and need to remember what flag_2 was supposed to mean. This sounds obvious, and I am telling you anyway because I keep meeting analysts who skip it. They load the CSV, run df.describe(), see that nothing is on fire, and start modeling. Then three weeks in they discover that one site's data was entered in a different timezone and every tidal correlation they computed is off by six hours. Read the data first. With your eyes. It is not a waste of time.
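For what it is worth, my first pass at "look at it" usually ends up as a few throwaway lines like these (the filename and column names are placeholders):

```python
import pandas as pd

df = pd.read_csv("monitoring_data.csv", parse_dates=["survey_date"])

# Page through it: the head, the tail, and each site's most recent rows.
# The tail is where data-entry habits drift.
print(df.sort_values("survey_date").head(30))
print(df.sort_values("survey_date").tail(30))
print(df.sort_values(["site", "survey_date"]).groupby("site").tail(5))

# Any column you do not immediately understand: every distinct value,
# including the NaNs.
for col in ["flag_2", "qc_code"]:  # assumed mystery columns
    print(df[col].value_counts(dropna=False))
```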
When I do start writing validation code, I sort issues into three buckets, and I treat them very differently. Structural problems are the easiest: wrong dtype, missing columns, duplicated primary keys, dates parsed as strings. These are mechanical, and they get fixed mechanically. A handful of assertions in the pipeline, run on every load, and you will catch them when they reappear, which they will.

Semantic problems are harder. The column exists, the type is right, the values are even in the expected range, but the meaning has drifted. Someone changed the protocol in 2023 and the same column now records a slightly different measurement. The only way to catch these is to talk to the people who collected the data. There is no statistical test that will tell you a definition changed. I have learned to ask, every single time I get a new dataset, whether the protocol changed at any point during the period it covers. The answer is yes more often than you would think.

Biological problems are the hardest, because the data is correct but reality is weird. A growth rate that looks impossible turns out to be a fragment that broke off and got re-attached. A site that goes silent for six months turns out to have been hit by a storm. These are not errors to fix, they are context to preserve. I keep a separate annotations table linked by row ID, and I never let my pipeline silently drop or "correct" these rows. They are usually the most interesting parts of the dataset.
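The annotations table itself is not complicated. A minimal sketch, assuming the data has a stable row_id column (if it does not, adding one is the first fix):

```python
import pandas as pd

# Anomalies live in their own table, keyed by row ID. The data itself is
# never edited or dropped.
annotations = pd.DataFrame(
    [
        {"row_id": 4217, "note": "units switch from mm to cm near here"},
        {"row_id": 5102, "note": "fragment broke off and re-attached; growth rate is real"},
    ]
)

df = pd.read_csv("monitoring_data.csv")
df = df.merge(annotations, on="row_id", how="left")

# Downstream code can filter explicitly instead of the pipeline silently
# "correcting" the weird rows.
annotated = df[df["note"].notna()]
```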
For the structural layer, I lean on a simple stack: a typed schema definition (Pandera or pydantic, depending on the team's preference), a small set of column-level validators, and a tiny script that runs on every ingest and writes a validation report to the same folder as the data. Nothing fancy. The point is that it runs every time, and the report is human-readable. For the semantic layer, the best tool I have found is documentation. Specifically, a definitions file that lives next to the dataset, gets edited when the protocol changes, and includes a date for every change. When I onboard a new dataset I make people sit down and write this with me. It feels slow. It saves weeks later. For the biological layer, I write notes. I keep an anomalies log, indexed by sample or transect or genotype, that captures the things I cannot explain but want to remember. When I get to modeling and something looks strange, I check the anomalies log first.
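As a sketch of what that structural layer can look like with Pandera (the columns, ranges, and key here are placeholders, not a real protocol):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "site": pa.Column(str),
        "survey_date": pa.Column("datetime64[ns]"),
        "colony_id": pa.Column(str),
        "length_mm": pa.Column(float, pa.Check.in_range(0, 2000), nullable=True),
    },
    unique=["site", "survey_date", "colony_id"],  # duplicated keys fail here
)

def ingest(path: str) -> pd.DataFrame:
    df = pd.read_csv(path, parse_dates=["survey_date"], dtype={"colony_id": str})
    try:
        # lazy=True collects every failure instead of stopping at the first
        return schema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as err:
        # The human-readable report, written to the same folder as the data
        err.failure_cases.to_csv(path + ".validation_report.csv", index=False)
        raise
```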
A typical first month with a new client, when the data is in this kind of state, looks something like this. Week one, I read everything and ask questions. Week two, I write the structural validators and we fix the obvious mechanical issues. Week three is the hard one, where we sit with the people who collected the data and untangle the semantic drift. Week four, we have a clean ingest pipeline, a documented schema, and an anomalies log. The data is now ready to model. Clients are sometimes impatient with this. They want to see results, and the first month produces no charts. I have learned to be honest about why this matters. If you skip it, the model you build in month two will be wrong in ways you cannot easily detect, and you will eventually have to redo it. The sequence is not optional.
If you are the person collecting the data, the single most useful thing you can do for whoever ends up analyzing it is to write down what changed and when. Not the values. The protocol. New instrument? New person? New definition of what counts as a colony? Write it down with a date. Put it in the same folder as the data. That one habit is worth more than any specific validation tool. It is the difference between a dataset someone can work with and a dataset that has to be re-derived from scratch every time a new analyst joins the program. The best monitoring programs I have seen all do this. They are not the ones with the fanciest stack. They are the ones where the principal investigator keeps a running document of every protocol change, and where the data, the protocol, and the change log all live in the same place. That is the whole secret. Everything else is just code.
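For concreteness, the log I am describing can be as small as a text file like this (the dates are illustrative, and the entries reuse the examples from earlier):

```
# protocol_changes.txt, kept in the same folder as the data
2019-06-14  Began noting bleaching as free text in the `notes` column.
2022-03-02  Added `bleach_severity` column. Older rows were not recoded.
2023-01-10  Measurement protocol revised; see definitions file for details.
```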