Anyone who has ever written a proper marking guide for a high-stakes assessment knows the experience. You start with the question, the rubric, and the model solution. You try to articulate every criterion the grader should check, every edge case they might encounter, every way the candidate might partially meet the standard, and every distinction between acceptable and unacceptable reasoning. Two hours later you have half of what you need. Two days later you have something usable but still not comprehensive. By the time you have a real marking guide for a real assessment, you have spent a week on a single question, and you still have fifty more to write.
This is the bottleneck that has quietly shaped assessment programmes for decades. The detailed grading rules that make rigorous marking possible are too expensive to produce manually at the standard the assessment actually deserves. So most programmes settle for a fraction of the rigour they know they need.
The evaluation copilot inside Assess for Learning closes that gap by generating the detailed grading rules for you, then handing them back for human review. The bottleneck goes from days per question to minutes per question, and the rigour goes up, not down.
“The evaluation copilot is not a shortcut around human judgement. It is a productivity multiplier for human judgement.”
Why the rules are so detailed in the first place
Before explaining how the copilot works, it is worth being explicit about what the grading rules actually contain inside Assess for Learning, because the depth is not obvious from outside.
For a serious assessment, the evaluation criteria for a single question often run to hundreds of lines of structured logic. They define what the AI should look for in the candidate’s reasoning. They specify how to weight different aspects of the answer. They encode the expected logical steps, the acceptable variations, the common misconceptions to flag, and the conditions under which the answer qualifies for each level of the rubric. They reference the rubric type (analytic, holistic, checklist, or scored), the rubric criteria, and the mark allocation.
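To make that depth concrete, here is a small, hypothetical fragment of what one criterion's grading logic could look like if written out as structured data. The field names and values are illustrative assumptions, not the actual Assess for Learning rule format.

```python
# Hypothetical fragment of one criterion's grading logic expressed as structured
# data. Field names and values are illustrative assumptions only; they are not
# the actual Assess for Learning rule format.
criterion_rule = {
    "criterion": "Justifies the choice of discount rate",
    "rubric_type": "analytic",            # analytic | holistic | checklist | scored
    "marks_available": 4,
    "expected_steps": [
        "identifies the duration of the liabilities",
        "links the discount rate to the relevant yield curve",
        "comments on the sensitivity of the result to the rate chosen",
    ],
    "acceptable_variations": [
        "uses a flat rate with an explicit justification",
    ],
    "common_misconceptions": [
        "treats the discount rate as a risk margin",
    ],
    "level_conditions": {
        4: "all expected steps present and correctly reasoned",
        2: "rate chosen sensibly but sensitivity not discussed",
        0: "no link made between the rate and the liabilities",
    },
}
```

Multiply a fragment like this across every criterion, every acceptable variation, and every misconception worth flagging, and the "hundreds of lines" figure stops sounding like an exaggeration.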
This level of detail is what makes AI-assisted grading and copilot-supported human grading work at a professional standard. Without it, you get a generic pattern-matching exercise that cannot be trusted on high-stakes work. With it, you get a grading engine that can be defended to an auditor and trusted by a board.
The problem is that writing this level of detail by hand is prohibitively slow and expensive. The evaluation copilot is the answer.
What the evaluation copilot actually does
When you configure an assessment, you provide the building blocks the copilot needs (a sketch of how these might be bundled into a single structure follows the list below):
What the evaluation copilot consumes
- The question text and the task type
- The rubric you have chosen (analytic, holistic, checklist, or scored)
- The rubric criteria and their descriptions
- The marks available and their distribution across criteria
- The model solution and the reasoning behind it
- Any exhibits, supporting materials, or context the candidate will receive
- The domain model (general, actuarial science, and other available specialisations)
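To picture these inputs as a whole, the sketch below bundles them into one structure. The class and field names are assumptions made for illustration; the real configuration lives inside Assess for Learning and may look quite different.

```python
from dataclasses import dataclass, field

# Hypothetical bundle of the inputs listed above. The class and field names are
# illustrative assumptions, not the product's actual configuration schema.
@dataclass
class CopilotInputs:
    question_text: str
    task_type: str                       # e.g. "constructed response"
    rubric_type: str                     # analytic | holistic | checklist | scored
    rubric_criteria: dict[str, str]      # criterion name -> description
    mark_allocation: dict[str, int]      # criterion name -> marks available
    model_solution: str
    solution_reasoning: str
    exhibits: list[str] = field(default_factory=list)
    domain_model: str = "general"        # e.g. "actuarial science"
```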
From these inputs, the evaluation copilot generates the full set of detailed grading rules. It produces the layered logic the rules engine needs. It encodes the rubric criteria into checkable conditions. It anticipates common variations in candidate reasoning. It produces rules that the subsequent grading process can execute consistently across every submission.
The output is comprehensive, structured, and immediately usable. What would have taken hours to write by hand takes minutes to generate.
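As a rough illustration of what "checkable conditions" could mean in practice, the sketch below encodes one rubric criterion as a single atomic check. The dictionary shape and function name are assumptions for illustration; the copilot's actual output format is internal to the product.

```python
# A minimal sketch of one rubric criterion encoded as a checkable condition.
# The representation is an assumption for illustration; the copilot's actual
# output format is internal to Assess for Learning.

def check_discount_rate_justified(answer_analysis: dict) -> dict:
    """Atomic check: did the candidate justify the chosen discount rate?"""
    steps_found = set(answer_analysis.get("steps_found", []))
    required = {"identifies_liability_duration", "links_rate_to_yield_curve"}
    satisfied = required.issubset(steps_found)
    return {
        "condition": "discount_rate_justified",
        "satisfied": satisfied,
        "evidence": sorted(required & steps_found),
        "diagnostic_tag": None if satisfied else "misses_rate_justification",
    }
```

In reality the generated rules cover many such conditions per criterion, plus the weighting and level logic that sits above them, but the shape is the point: each criterion becomes something a machine can check and a human can read.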
Why human review is not optional
The evaluation copilot is not a shortcut around human judgement. It is a productivity multiplier for human judgement. The generated rules are presented to the assessment designer in a text-based, editable format. The designer reviews them, refines them, adjusts thresholds, softens criteria that feel too harsh, tightens criteria that feel too lenient, and approves the final version before any grading takes place.
This is the AI-proposes-human-disposes pattern, and it is the only pattern that works for credentialing. The copilot does the heavy lifting of generating the detailed rules. The human keeps control of what counts as a correct answer. Nothing is locked. Nothing is hidden. The rules are text, the text is editable, and the editing history is tracked.
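A minimal sketch of that review loop, assuming rules held as plain text with a recorded edit history and an explicit approval gate, might look like this; all names are hypothetical and stand in for whatever the product actually records.

```python
from datetime import datetime, timezone

# A minimal sketch of the review-and-approve loop, assuming rules are held as
# plain text with a recorded edit history. All names here are hypothetical.
class RuleReview:
    def __init__(self, generated_rules: str):
        self.current = generated_rules
        self.history: list[tuple] = []   # (timestamp, editor, previous_text)
        self.approved_by: str | None = None

    def edit(self, editor: str, new_text: str) -> None:
        self.history.append((datetime.now(timezone.utc), editor, self.current))
        self.current = new_text

    def approve(self, editor: str) -> None:
        self.approved_by = editor        # grading only proceeds once approved
```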
In practice, assessment designers typically accept most of the generated rules and adjust a small number based on their subject expertise or their knowledge of the candidate population. The productivity gain is enormous, and the quality of the final rules is higher than what most teams would have produced manually, because the copilot is comprehensive in a way that tired human writers often are not.
How the copilot fits into the rules engine
The evaluation copilot does not generate rules in a vacuum. It produces rules that plug directly into the Assess for Learning rules engine, which is the layered execution environment that runs the grading process. The rules flow through the engine layer by layer, from atomic evaluations up through task-level and assessment-level aggregations, with short-term memory slots and diagnostic tags attached where needed.
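The sketch below is a toy illustration of that layering, assuming the atomic check format from the earlier sketch: atomic results roll up to a task score, and task scores roll up to an assessment total. It omits memory slots for brevity and is not the engine's real implementation.

```python
# Toy illustration of the layered flow described above, assuming the atomic
# check format from the earlier sketch. Memory slots are omitted for brevity;
# this is not the engine's real implementation.

def aggregate_task(atomic_results: list[dict], marks_per_condition: int = 1) -> dict:
    """Task layer: award marks for satisfied conditions, collect diagnostic tags."""
    marks = sum(marks_per_condition for r in atomic_results if r["satisfied"])
    tags = [r["diagnostic_tag"] for r in atomic_results if r.get("diagnostic_tag")]
    return {"marks": marks, "diagnostic_tags": tags}

def aggregate_assessment(task_results: list[dict]) -> dict:
    """Assessment layer: total the task marks and carry every tag upward."""
    return {
        "total_marks": sum(t["marks"] for t in task_results),
        "diagnostic_tags": [tag for t in task_results for tag in t["diagnostic_tags"]],
    }
```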
This tight integration matters because it means the rules the copilot generates are immediately executable. They are not a draft to be re-entered into another system. They are the actual grading logic that will evaluate every submission, produce the data for the examiner’s report, feed the precision report, and support the grading copilot when a human grader calls on it. One generation step. One review step. One approval. From there, the rules are live.
Why this is the productivity shift that changes assessment design
For assessment designers, the evaluation copilot changes what is possible. Assessments that were previously too expensive to build with proper rigour are now practical. Programme teams can refresh their question banks more often, because rebuilding the rules for a new question is no longer a week of effort. Pilots and experimental assessment formats become feasible because the cost of trying something new has collapsed.
For credentialing leadership, this translates into more assessments, better assessments, and a faster improvement cycle. The assessment programme stops being a slow-moving artefact produced once a year and becomes a living instrument that can be tuned, refined, and extended continuously. That is the shape a modern credentialing programme needs to have, and the evaluation copilot is one of the main enablers.
From a bottleneck to a capability
Every mature credentialing programme has the same hidden constraint: the bandwidth of the people who can write serious marking guides. The evaluation copilot multiplies that bandwidth by an order of magnitude without compromising the quality of the output. It does this by taking on the generative heavy lifting while leaving the judgement, the approval, and the editorial control firmly with the humans who know the subject.
If your programme’s ambition is constrained by the time it takes to write marking guides, that constraint is now optional.
Ready to stop writing marking guides by hand?
Talk to us about how the evaluation copilot in Assess for Learning can transform your assessment design productivity.