What ISO 42001 Asks For, and What It Cannot Give You
The standard tells you to evaluate the performance of your AI system and to keep records of how it behaves. It is silent on a harder question: when someone asks what the model actually did on one specific decision, what do you hand them?
If you implement an AI management system to ISO/IEC 42001, you spend a lot of time in the back half of the standard. Clause 9 asks you to monitor, measure, analyze, and evaluate the performance of your AI management system, and to decide in advance what you will measure and how. Clause 8 asks you to keep the operational records that show the system ran the way your controls say it should. The annexes push you toward documented evidence of performance throughout the lifecycle.
This is good discipline, and the firms building real ISO 42001 practices are doing serious work. But there is a quiet assumption sitting underneath the performance-evaluation clauses, and it is worth naming directly, because almost everyone in the governance conversation glides over it.
The assumption is that a record of what your AI did is the same thing as proof of what your AI did. It is not.
What the clauses actually produce
Walk through what monitoring and measurement give you in practice. You log inputs and outputs. You track quality metrics over a population of decisions: accuracy, drift, error rates, distribution shifts. You retain artifacts that show the pipeline was configured correctly and the controls were in place. When an auditor arrives, you show them the logs, the dashboards, and the retained records.
Every one of those artifacts is produced by the same system that did the work, or by tooling that trusts that system. The log says the model returned a given output because the model told the logger so. The dashboard says quality held steady because the metrics pipeline read the system's own outputs. The record says the right model version was loaded because the deployment config asserts it was.
This is a self-attested account. It is internally consistent, often genuinely useful, and exactly what the standard asks for. But it answers the question "what does our system say it did" rather than "what did it actually do, in a form a third party can check without trusting us."
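A small illustration of that distinction, offered as a sketch rather than a description of any real logging stack. The record fields and the function names (record_digest, operator_writes, self_check) are hypothetical. The point it tries to show is narrow: an integrity check built only from data the operator controls will pass for whatever the operator last wrote.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def operator_writes(record: dict) -> dict:
    """The operator stores a record together with a digest it computed itself."""
    return {"record": record, "digest": record_digest(record)}


def self_check(entry: dict) -> bool:
    """Integrity check that relies only on data the operator controls."""
    return entry["digest"] == record_digest(entry["record"])


# Hypothetical decision record, as the system first logged it.
original = operator_writes(
    {"model_version": "v1", "input": "claim 1001", "output": "denied"}
)

# The same operator later rewrites the output and recomputes the digest.
rewritten = operator_writes(
    {"model_version": "v1", "input": "claim 1001", "output": "approved"}
)

# Both entries pass the check, so it demonstrates consistency, not history.
both_pass = self_check(original) and self_check(rewritten)
entries_differ = original["digest"] != rewritten["digest"]
```

Here both_pass and entries_differ are both true: two contradictory accounts of the same decision each verify cleanly. Nothing in the stored entry lets an outside reader tell which one was written at decision time.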
Clause 9.1 puts it roughly this way: the organization shall determine what needs to be monitored and measured, the methods to be used, and when results shall be analyzed and evaluated. The standard specifies that performance be evaluated. It does not specify that the evaluation be independently verifiable.
That is not a flaw in the standard. A management-system standard sets out what you must govern, not the cryptographic form your evidence has to take. ISO/IEC 27001 does the same thing for information security: it tells you to control access, not which cipher to deploy. The standard defines the obligation. It leaves the strength of the evidence to you.
Where the gap starts to bite
For most of the history of enterprise software, self-attested records were enough, because the questions were about process. Did you have a control? Did you follow it? Was it reviewed? You can answer those from your own records, and an auditor checking process is satisfied by an honest account of process.
AI changes the question being asked. When an automated system denies a claim, declines a loan, flags a candidate, or summarizes a record that informs a clinical decision, the dispute is not about whether you had a monitoring control in place. It is about what the model did on that particular input. A regulator, an opposing counsel, or an enterprise customer running their own diligence is not asking to see your dashboard. They are asking you to demonstrate that a specific model produced a specific output from a specific input, and increasingly they would prefer not to take your word for it.
This is the seam I find most interesting, and the reason I think the people doing ISO 42001 work and the people building verification infrastructure are looking at two halves of the same problem. The governance practitioner has correctly identified that AI behavior must be measured and evidenced. The verification primitive supplies the part the management system was never designed to carry: evidence that holds up when the person examining it has no reason to trust the operator.
What proof adds to a record
Consider what it would take to move one rung up the evidence ladder, from "we recorded it" to "we can prove it." Three properties have to hold.
First, the output has to be bound to its inputs: a tamper-evident link tying a specific model identity to the exact input it received and the exact output it produced, so that none of the three can be quietly changed after the fact. Second, that binding has to be checkable by someone other than the operator, which means the party that ran the computation cannot also be the sole source of the assurance that it ran correctly. Third, the check has to be independent of trust in the runtime: the verifier confirms the result from the evidence itself, without having to assume the executor was honest.
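The first of those properties, the binding, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a verification scheme: make_receipt and check_receipt are hypothetical names, the model identity is assumed to be a hash of the model's bytes, and plain SHA-256 stands in for whatever commitment a real system would use. It deliberately leaves out the second and third properties, which need a signature, an external anchor, or a proof that comes from somewhere other than the operator.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def make_receipt(model_bytes: bytes, input_text: str, output_text: str) -> dict:
    """Bind a model identity, an input, and an output into one commitment."""
    fields = {
        "model_id": digest(model_bytes),  # assumption: identity = hash of weights
        "input_hash": digest(input_text.encode("utf-8")),
        "output_hash": digest(output_text.encode("utf-8")),
    }
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return {**fields, "binding": digest(canonical.encode("utf-8"))}


def check_receipt(
    receipt: dict, model_bytes: bytes, input_text: str, output_text: str
) -> bool:
    """Recompute the receipt from the claimed artifacts and compare every field."""
    return make_receipt(model_bytes, input_text, output_text) == receipt


# Hypothetical decision: placeholder bytes stand in for real model weights.
receipt = make_receipt(b"weights-v1", "applicant 123", "declined")

matches_claim = check_receipt(receipt, b"weights-v1", "applicant 123", "declined")
altered_output = check_receipt(receipt, b"weights-v1", "applicant 123", "approved")
swapped_model = check_receipt(receipt, b"weights-v2", "applicant 123", "declined")
```

With a receipt fixed at decision time, matches_claim is true while altered_output and swapped_model are false: changing any one of the three bound items breaks the match. What the sketch cannot show is who holds that receipt. If the operator is the only keeper, it can simply issue a new one, which is why the binding alone is the first rung and not the whole ladder.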
None of that replaces ISO 42001. It sits inside it. The performance-evaluation clauses tell you that you owe evidence of how your AI behaves. A cryptographic receipt, meaning a record that has those three properties, is simply the strongest available form of that evidence for the one question records handle worst: what happened on this specific decision. The management system defines the obligation. The proof discharges the hardest part of it.
Why this is a partnership, not a competition
I want to be precise about the relationship, because it is easy to misread. Verification does not compete with governance, and it does not compete with the firms that implement it. A management system without an evidentiary primitive is a well-run set of promises. An evidentiary primitive without a management system is a sharp tool with no program around it. Neither is sufficient alone, and they are strongest when they are layered: governance defines what should be true, and proof shows that it was.
The honest version of the ISO 42001 pitch, the one I think serious practitioners already half-say to their clients, is that you can build an excellent management system and still be unable to prove, to a hostile reader, what your model did last Tuesday. Closing that last gap is not a governance failure. It is the natural next layer, and it is a cryptographic problem more than a procedural one.
If you spend your days helping companies stand up AI management systems, you have felt the edge of this. The clause asks for evidence of performance. Your client hands you logs. You accept them, because that is what the standard contemplates and what the tooling produces. But somewhere in the back of your mind is the question of what happens when a record is challenged by someone who does not trust the system that wrote it. That question has an answer now. It just lives one layer below the standard, in the part nobody wrote a clause for, because when the standard was drafted, the proof did not yet exist.