From ASOP 56 to NIST AI RMF: A Practitioner's Bridge for Actuarial Model Governance

Article 2 of 4 in the series NIST AI RMF for Actuaries

The first article in this series argued that, in our view, NIST AI RMF has become the de facto operating spine of AI governance for the insurance industry, and that the chief actuary is a strong candidate to lead the cross-functional committee that operationalises it. This article is for the people on that committee who actually have to produce the artefacts.

If you have read the NIST AI RMF Playbook cover to cover and felt the urge to highlight every second paragraph because it described something you already do, this article is for you. If you have looked at it and felt slightly lost because the language is unfamiliar, this article is also for you. The intention is the same in both cases: to give you a working bridge between the actuarial professional standards you already operate under and the NIST framework that regulators, auditors and your own board are increasingly going to ask you about.

You will not finish this article qualified to certify a high-risk AI system under the EU AI Act. You should finish it able to do four things you may not currently do: produce a documented bias and fairness test for a production model, produce a global and local explainability evidence pack, produce a socio-technical impact statement that satisfies both the RMF Map function and ASOP 56’s documentation requirements, and chair a model governance meeting in which you can speak both languages without a translator.

A note on the worked examples before we begin. The two examples in this article use synthetic but realistic numbers. They are illustrative only and must not be taken as guidance on parameter settings, thresholds or specific model architectures for any production purpose. They are deliberately constructed so that the first pass through the model fails the four-fifths rule. This is honest: in our experience, most first-pass models built on traditional features plus external consumer data sources will fail at least one fairness test. The teaching value is in showing how to detect that, how to diagnose it, how to remediate it, and how to document the remediation in a form that satisfies the framework. A worked example in which everything passes on the first attempt would teach the reader nothing useful.

A note on EU AI Act language used throughout this article. Obligations under the Act split between provider and deployer roles. Where an insurer builds a system for its own use, it is both. Where it uses a vendor system, the stack splits. The worked examples below flag which role is in scope for each specific artefact. Act timelines in this article reflect the current timetable; the European Commission has publicly noted a proposal under consideration to adjust parts of the high-risk timeline, so the 2 August 2026 date should be read as the current timetable rather than a fixed certainty.

A note on registers. Part 1 of this series introduced four categories of authority used throughout the series: law (binding regulation or statute), supervisory expectation (regulator guidance, often “should” framing, used in examinations), professional standard (binding within a profession on its members), and author recommendation (our own practitioner judgement). Where this article uses confident language about what an organisation should do, the underlying register is usually professional standard or author recommendation rather than law, except where explicitly noted. The mapping table below makes the legal source explicit cell by cell.

How this article is organised

Section 1 is the full mapping table between actuarial professional standards, the NAIC AI Systems Program, the EU AI Act articles, and the NIST RMF subcategories. Bookmark it. Section 2 introduces the three artefact categories that NIST adds to traditional actuarial practice. Sections 3 and 4 are the two worked examples, each carried through Govern, Map, Measure and Manage end to end. Section 5 is a reusable MEASURE evidence pack template. Section 6 contains short call-outs for practice areas other than life and health, with the full treatment reserved for Part 3 of the series. Section 7 is a first-week, first-month, first-quarter action plan for technical teams.

Section 1: The full mapping

The table below is the second of the two reference artefacts in this series. Article 1 gave you the convergence table at framework level. This is the same convergence at subcategory level. Where a row has no entry in a column, it usually means the obligation is implicit rather than absent. ASOP 56 section numbers in this table have been verified against the published standard (Doc. No. 195, adopted December 2019, effective 1 October 2020). APS X2 references are to version 1.1, effective 30 January 2026.

A word on what this table is, and is not. The rows below are functional bridges between frameworks, not one-to-one legal equivalences. Each cell shows the closest analogue in the relevant vocabulary to the obligation in the NIST column. In several places the analogue is weaker or broader than the NIST subcategory it sits alongside. Where a framework is genuinely silent on a topic, the cell reads “no explicit equivalent” and a short note explains what the nearest adjacent provision is. EU AI Act entries are annotated where the obligation sits primarily with the provider versus the deployer. Read the table as a cross-reference aid for cross-functional audit response, not as a legal concordance.

The full mapping is presented below in four blocks, one per NIST RMF function, so readers can jump to the function they need.

Govern subcategories

| NIST RMF subcategory | Plain meaning | Actuarial standards | NAIC AIS Program | EU AI Act |
| --- | --- | --- | --- | --- |
| GOVERN 1.1 | Legal and regulatory requirements understood and documented | Actuaries’ Code (Compliance); Code of Professional Conduct Precept 4 | AIS Program §3.1 | Art. 9 (provider risk management); Art. 26 (deployer duties) |
| GOVERN 1.2 | Trustworthy AI characteristics in policies | TAS 100 Principle 5 (Models) | AIS Program §3.2 | Art. 9 (provider); Art. 17 (provider quality management) |
| GOVERN 1.4 | Risk management process transparent and documented | ASOP 56 §4.1 (required disclosures); §3.7 (documentation) | AIS Program §4 | Art. 11, Annex IV (provider technical documentation) |
| GOVERN 1.6 | Inventory of AI systems | No explicit equivalent in actuarial standards (inventory is a new artefact category) | AIS Program §4.2 | Art. 49 (EU database registration for providers and certain public deployers of high-risk systems; not a general internal-inventory obligation for all deployers) |
| GOVERN 1.7 | Decommissioning and phase-out processes | No explicit equivalent; ASOP 56 §3.6.4 (governance and controls) is the nearest adjacent provision | AIS Program §4.3 | Art. 17 (provider quality management, adjacent) |
| GOVERN 2.1 | Roles, responsibilities, lines of communication | Actuaries’ Code (Communication); ASOP 41 | AIS Program §3.3 | Art. 17 (provider); Art. 26 (deployer) |
| GOVERN 2.3 | Executive leadership accountability | TAS 100 Principle 4 (Communications) | AIS Program §2 | Art. 26 (deployer obligations) |
| GOVERN 4.1 | Critical thinking and safety-first culture | Actuaries’ Code (Speaking up) | AIS Program §3 | Art. 14 (human oversight by design; provider, supported by deployer) |
| GOVERN 6.1 | Third-party AI risks managed | ASOP 56 §3.4 (reliance on models developed by others) | AIS Program §4.4 (vendor management) | Art. 25 (provider obligations across the value chain) |

Map subcategories

| NIST RMF subcategory | Plain meaning | Actuarial standards | NAIC AIS Program | EU AI Act |
| --- | --- | --- | --- | --- |
| MAP 1.1 | Intended purpose, context, laws, norms documented | ASOP 56 §3.1 (model meeting the intended purpose); §2.6 (definition of intended purpose) | AIS Program §4.1 | Art. 13 (provider instructions for use, consumed by deployer) |
| MAP 1.4 | Business value or context defined | ASOP 56 §3.1.2 (selecting, reviewing, or evaluating the model) | AIS Program §4.1 | Art. 13 |
| MAP 1.5 | Organisational risk tolerances determined | TAS 100 Principle 1 (Judgement) | AIS Program §2.1 | Art. 9 (provider); Art. 26 (deployer risk assessment duties) |
| MAP 1.6 | System requirements elicited from stakeholders | ASOP 56 §3.1.2 | AIS Program §4.1 | Art. 14 (human oversight design) |
| MAP 2.3 | TEVV considerations identified | ASOP 56 §3.1.1 (designing, developing, or modifying the model) | AIS Program §4.2.1 | Art. 15 (provider accuracy, robustness, cybersecurity) |
| MAP 3.5 | Human oversight processes defined | (Implicit in actuarial sign-off) | AIS Program §3.4 | Art. 14 (provider designs for oversight; deployer operates it under Art. 26) |
| MAP 4.1 | Approaches for legal and AI risks of components | ASOP 56 §3.4 (reliance on models developed by others) | AIS Program §4.4 | Art. 25 (provider value-chain obligations) |
| MAP 5.1 | Likelihood and magnitude of impacts identified | TAS 100 Principle 1 (ASOP 56 is silent on broader impact analysis) | AIS Program §4.2 | Art. 27 (deployer FRIA; expressly reaches Annex III point 5(c) for life and health insurance) |

Measure subcategories

| NIST RMF subcategory | Plain meaning | Actuarial standards | NAIC AIS Program | EU AI Act |
| --- | --- | --- | --- | --- |
| MEASURE 1.1 | Approaches and metrics for AI risk measurement | ASOP 56 §3.6.1 (model testing) | AIS Program §4.2.1 | Art. 15 (provider) |
| MEASURE 1.3 | Independent assessment | Closest analogue (not a direct equivalent): ASOP 56 §3.6.3 (permissive: actuary “may consider” review by another qualified professional); APS X2 v1.1 (peer review obligations, judgement-based) | AIS Program §4.2.3 | Art. 17 (provider quality management) |
| MEASURE 2.1 | Test sets, metrics, TEVV details documented | ASOP 56 §3.6.1; §3.6.2 | AIS Program §4.2.1 | Art. 10 (provider data governance) |
| MEASURE 2.3 | Performance measured qualitatively and quantitatively | ASOP 56 §3.6.1; §3.6.2 | AIS Program §4.2.1 | Art. 15 (provider) |
| MEASURE 2.5 | System validity and reliability demonstrated | ASOP 56 §3.6.2 (model output validation) | AIS Program §4.2.1 | Art. 15 (provider) |
| MEASURE 2.7 | Security and resilience evaluated | (Implicit; new artefact category for most actuarial teams) | AIS Program §4.2.4 | Art. 15 (provider) |
| MEASURE 2.8 | Transparency and accountability risks evaluated | ASOP 41 (communications) | AIS Program §4.2.2 | Art. 13 (provider transparency to deployer) |
| MEASURE 2.9 | Model explained and outputs interpreted within scope | (Implicit; new artefact category) | AIS Program §4.2.2 | Art. 13 (provider); Art. 86 (deployer-supported explanation rights to affected persons) |
| MEASURE 2.10 | Privacy risk examined | (Privacy is governed externally: GDPR, GLBA, HIPAA where the actor is a covered entity or business associate, FCRA) | AIS Program §4.2.5 | Art. 10 (data and data governance; Art. 10(5) provides a narrow special-category-data mechanism for bias detection/correction, not a general privacy-risk regime) |
| MEASURE 2.11 | Fairness and bias evaluated | (New artefact category) | AIS Program §4.2.2 | Art. 10(2)(f) (provider bias examination); Art. 27 (deployer FRIA) |
| MEASURE 2.12 | Environmental impact and sustainability | (No equivalent) | (No equivalent) | (No equivalent) |
| MEASURE 3.3 | Feedback from end users and impacted communities; appeal processes | No explicit equivalent in actuarial standards; ASOP 41 applies generally to any communication arising from such processes | AIS Program §4.2.6 | Art. 14; Art. 86 (deployer-supported explanation rights) |
| MEASURE 4.2 | Trustworthiness in deployment context evaluated | ASOP 56 §3.1.3 (using the model) | AIS Program §4.2.6 | Art. 72 (provider post-market monitoring; deployer Art. 26 duties to monitor in use) |

Manage subcategories

| NIST RMF subcategory | Plain meaning | Actuarial standards | NAIC AIS Program | EU AI Act |
| --- | --- | --- | --- | --- |
| MANAGE 1.2 | Risk treatment prioritised | TAS 100 Principle 1 (Judgement) | AIS Program §4.3.1 | Art. 9 (provider) |
| MANAGE 1.4 | Residual risks documented | ASOP 56 §4.1 (required disclosures) | AIS Program §4.3.2 | Art. 9(5) (provider) |
| MANAGE 2.3 | Response to unknown risks | (Implicit) | AIS Program §4.3.3 | Art. 73 (provider serious incident reporting; deployers have related duties under Art. 26 to inform the provider and authorities) |
| MANAGE 2.4 | Mechanisms to supersede or deactivate | No explicit equivalent; ASOP 56 §3.6.4 (governance and controls) is the nearest adjacent provision; GOVERN 1.7 link | AIS Program §4.3 | Art. 17 (provider) |
| MANAGE 4.1 | Post-deployment monitoring implemented | ASOP 56 §3.1.3 (using the model); §3.6.4 | AIS Program §4.3.4 | Art. 72 (provider post-market monitoring; deployer Art. 26 use monitoring) |
| MANAGE 4.3 | Incidents communicated to relevant AI actors; documented response and recovery processes | No explicit equivalent in actuarial standards; communication obligations arise through actuarial reporting and governance processes, with ASOP 41 applying to any formal written output | AIS Program §4.3.5 | Art. 73 (provider serious incident reporting; deployer Art. 26 duties to inform provider and authorities) |

Read down the NIST column in any block and you have the playbook subcategory list for that function. Read across any row and you have the closest functional analogue across five vocabularies. The empty cells and “no explicit equivalent” entries are interesting: they mark places where one framework is silent on something another framework addresses explicitly. The most important gaps on the actuarial side are at GOVERN 1.7 (decommissioning), MANAGE 2.4 (supersession), MEASURE 3.3 (user feedback and appeals) and MANAGE 4.3 (incident communication), together with the three new artefact categories at MEASURE 2.7, 2.9 and 2.11 addressed in the next section.

Section 2: The three new artefact categories

For most actuarial teams operating to professional standard, the Govern function is largely a paperwork exercise. Most of the substance is already there in your model risk policy, your peer review process, your vendor management programme, and your existing model documentation. The Map function similarly maps onto ASOP 56 §3.1 (intended purpose, business context, model selection rationale) with very little new substantive work. Manage maps cleanly onto ASOP 56 §3.1.3 (using the model) and §3.6.4 (reasonable governance and controls). It is the Measure function that introduces meaningful new work, in three categories.

Category one: documented fairness and bias evidence (MEASURE 2.11)

Traditional actuarial validation looks at predictive accuracy, calibration and stability across the development period and out-of-time. It does not, by default, ask whether the model performs equivalently for different demographic groups, whether protected-class membership is being inferred from facially neutral features, or whether the decisions the model drives produce disparate impact.

The regulatory picture has shifted materially in the past eighteen months. Colorado Regulation 10-1-1, as amended effective 15 October 2025, applies to insurers offering individual life insurance, private passenger automobile insurance and health benefit plans that use external consumer data and information sources (ECDIS) or algorithms and predictive models that use ECDIS. The amended regulation requires a documented governance and risk management framework and quantitative testing to detect unfair discrimination, though the Division’s specific quantitative testing standards remain under active development and stakeholders continue to provide input on methodology. The EU AI Act, under the current timetable from 2 August 2026, imposes obligations on high-risk AI systems split between provider obligations (Article 10 data governance and bias examination, Article 15 accuracy and robustness, Article 43 conformity assessment, Article 72 post-market monitoring) and deployer obligations (Article 26 use monitoring, Article 27 FRIA, Article 86 explanation rights). Annex III 5(c) specifically covers AI systems used for risk assessment and pricing in life and health insurance. Whether any particular system is in scope is a fact-specific determination under Article 6 and Annex III. The NAIC Model Bulletin asks for testing for “potential biases” and related unfair discrimination risks through its AIS Program expectations.

Producing this evidence requires four things you probably don’t currently do as part of validation:

  • Inferring or otherwise estimating group membership for the protected classes you cannot collect directly
  • Computing fairness metrics across those groups (disparate impact ratio, demographic parity difference, equalised odds, calibration parity)
  • Investigating root causes through subgroup attribution analysis
  • Documenting trade-offs between predictive performance and fairness in a form that survives external audit

The two worked examples that follow walk through all four steps with numbers.

Category two: explainability evidence (MEASURE 2.9)

A generalised linear model is explainable by construction: every coefficient has a sign, a magnitude and an interpretation. A gradient boosting model with five hundred trees is not. A neural network is even less so. The RMF asks for evidence that a model is explained, validated and documented, and that its outputs are interpreted within the scope of intended use.

For an actuarial audience, the practical content of MEASURE 2.9 boils down to two artefacts. Global explainability means you can describe, on average, which features drive the model’s predictions and in which direction. Local explainability means that for any individual decision, you can explain why the model produced that decision in language a regulator, an actuary or in some cases an applicant can understand. The two standard techniques are SHAP (Shapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), with SHAP being the dominant choice in practice because its theoretical grounding in cooperative game theory gives it stability properties that LIME lacks.

A third technique, counterfactual explanations, deserves more attention than it currently gets. A counterfactual explanation answers the question “what would need to change for this decision to be different”. This is a powerful explanatory artefact that supports EU AI Act Article 13 and Article 86 transparency obligations and provides substantive content that can feed into US adverse action and principal-reasons notices. We use one in worked Example 1 below and return to the specific legal treatment there.

Category three: socio-technical context and human oversight design (MAP 1.1, MAP 3.5)

The RMF treats every AI system as a socio-technical system. That means the people who build it, the people who use it, the people affected by it, the operating environment and the plausible misuse paths are all in scope of the documentation. For most actuarial teams this is the most uncomfortable category, because it asks for things that are usually held in heads rather than written down: who the affected populations are, what could go wrong if a non-actuary uses the output, where the boundary of acceptable use sits, how a human reviewer is meant to override the model, and what evidence exists that the human reviewer is actually empowered to do so.

This is also the category where actuarial professionalism gives the team a head start. Every actuary signing off model output is already implicitly making a socio-technical judgement. The new requirement is to write it down, in a form a third party can audit.

Section 3: Worked Example 1, Accelerated Underwriting for 20-Year Term Life

The scenario

A mid-size US life insurer with a small EU subsidiary has built a gradient boosting model (XGBoost, 500 trees, max depth 6) to triage incoming applications for its 20-year level term product (face amounts $100,000 to $2,000,000, issue ages 25 to 65). The model routes each application into one of three pathways:

  • STP (straight-through processing): no further underwriting, immediate offer at the predicted rate class
  • STANDARD: standard underwriting (paramedical exam, tele-interview)
  • FULL: full underwriting (fluids, attending physician statements)

Training data is 500,000 historical applications from 2018 to 2023, with five-year mortality observation. The target is the rate class assigned by historical underwriters as a proxy for mortality risk. Features fall into two groups:

  • Traditional: age, sex, BMI, smoker status, build, MIB code count, application-form questions
  • ECDIS: prescription drug history (cardiovascular, diabetes, mental health composites), credit-based insurance score, motor vehicle record (3-year violation count), public records (judgments, liens), educational attainment

This kind of model is now widely used across the US life market; the Society of Actuaries has published research on the adoption of accelerated underwriting in the US. It sits squarely in the heartland of the concerns Colorado Regulation 10-1-1 (as amended) addresses. For the EU subsidiary, the system is likely to be in scope of Annex III 5(c) of the EU AI Act because it forms part of the risk assessment process for life insurance, though whether any specific implementation is caught depends on how it is characterised under Article 6 and Annex III. That determination is fact-specific and ultimately a legal question for the deployer. In this example the insurer has built the model for its own use: it is acting as both provider and deployer and therefore carries both sides of the Act’s obligation stack where the system is determined to be in scope. The numbers below are illustrative.

Govern artefacts

Before any technical work, the Govern function requires a small set of documents to be in place. For this model, the actuarial team produces:

  • AI inventory entry (GOVERN 1.6): a model card recording owner, version, training date, regulatory classification (Colorado Reg 10-1-1 in scope as a life insurer using ECDIS; NAIC AIS Program in scope in every adopting state of operation; EU AI Act status pending a fact-specific Article 6 / Annex III 5(c) assessment for the EU subsidiary, with provider-role assessment under Articles 16, 17, 43 and 72 and deployer-role assessment under Articles 26 and 27), business owner, model risk classification, and links to all downstream artefacts.
  • Governance committee charter: chair (Chief Actuary), members (Chief Underwriting Officer, Head of Data Science, Head of Compliance, Privacy Officer, Legal). Quorum, meeting cadence (monthly), decision rights.
  • Risk tolerance statement (MAP 1.5): the minimum acceptable disparate impact ratio (set at 0.85, above the 4/5 rule’s 0.80 floor); the maximum acceptable AUC degradation between training and the most recent quarterly re-validation (3 percentage points absolute); the maximum acceptable Population Stability Index for any feature (0.15 monthly investigation threshold, 0.25 absolute halt threshold).
  • Third-party attestations (GOVERN 6.1): signed statements from each ECDIS provider confirming data lineage, FCRA compliance, GDPR Article 28 processor obligations for the EU subsidiary, and audit rights.
  • Decommissioning trigger conditions (GOVERN 1.7): explicit criteria under which the model is taken out of service, including any sustained breach of the risk tolerance thresholds.
  • Decision authority matrix: who approves a feature change, who approves a threshold change, who can halt the model in production, who signs off the annual recertification.

Each of these artefacts maps to a row in Section 1’s mapping table and to a section in the company’s existing model risk policy. The work is mostly about gathering and formalising what exists, not creating new content.

Map artefacts

The Map function produces the documentation that grounds every later decision. For this model:

  • Intended purpose statement (MAP 1.1, ASOP 56 §3.1): “To triage incoming individual term life applications into underwriting pathways consistent with the company’s mortality assumptions, in order to reduce time-to-issue for low-risk applications without altering the company’s pricing or its acceptance criteria. The model does not set prices, does not decline applicants, and does not produce final underwriting decisions. All decline decisions are made by qualified human underwriters.”
  • Out-of-scope statement: explicit list of uses the model must not be put to. Reinsurance treaty pricing. Post-issue claims decisions. Cross-sell recommendations. Decisions outside the issue ages 25 to 65 or face amounts $100,000 to $2,000,000.
  • Stakeholder map: applicants, agents, underwriters, claims handlers, reinsurers, state regulators, the EU subsidiary’s national supervisory authority.
  • Affected populations analysis: 50 US states plus three EU markets via the subsidiary. Demographic profile of the applicant population by age, sex, geography, and (using BIFSG-inferred) race and ethnicity.
  • Legal and regulatory mapping (MAP 4.1): NAIC Model Bulletin (in scope as a supervisory expectation in every state of operation that has adopted it); Colorado Reg 10-1-1 (individual life insurer using ECDIS, in scope as amended October 2025; binding law); NYDFS Insurance Circular Letter on AI (2024); EU AI Act Annex III 5(c) scoping for the subsidiary, pending fact-specific legal determination, with provider-role and deployer-role obligations separately assessed; FCRA for credit-based features; GLBA for data handling; HIPAA where the insurer or its business associates are acting in a covered-entity capacity and protected health information is being processed (the more usual position in life underwriting is that prescription history is sourced from a consumer-reporting context and HIPAA does not directly attach, but the determination is fact-specific and should be made with privacy counsel).
  • Plausible misuse paths: a short structured list of ways the model could be used outside its intended purpose, with mitigations for each. Used as input to MAP 5.1.
  • Fundamental Rights Impact Assessment (MAP 5.1; required by EU AI Act Article 27 of deployers of high-risk AI systems in Annex III point 5(c) for insurance pricing and risk assessment, where the subsidiary’s system is determined to be in scope): a structured document listing the categories of persons affected, the nature of the impact, the likelihood and severity, and the mitigations in place.

The Map artefacts are dry, but the Map function is where most regulator audits begin. A team that cannot produce these documents in twenty minutes when asked is not Map-ready.

Measure artefacts

This is where the work happens.

Phase 1: Traditional validation (MEASURE 2.5, MEASURE 2.3)

Standard ASOP 56 §3.6 model testing and output validation work, which most actuarial teams will already do. Illustrative numbers for this model:

| Metric | Value | Acceptance criterion | Status |
| --- | --- | --- | --- |
| AUC (discrimination) | 0.823 | ≥ 0.78 | Pass |
| Brier score (calibration) | 0.045 | under 0.060 | Pass |
| Hosmer-Lemeshow p-value | 0.12 | > 0.05 | Pass |
| KS statistic | 0.51 | ≥ 0.40 | Pass |
| Top-decile lift | 4.2× | ≥ 3.5× | Pass |
| Population Stability Index (vs holdout) | 0.08 | under 0.10 | Pass |

This is a competent model. By traditional actuarial standards, it would be approved for production.

Phase 2: Fairness testing (MEASURE 2.11)

This is the new work. The first question is methodological: how do you test for fairness when you cannot collect the protected class directly? In US insurance, you generally cannot ask an applicant for their race. In the EU, GDPR Article 9 makes processing “special categories” of personal data including racial or ethnic origin lawful only under narrow exceptions, with Article 10(5) of the EU AI Act providing one of those exceptions specifically for bias detection and correction in high-risk AI systems, subject to strict conditions.

A common and defensible approach in US fairness auditing where direct collection is unavailable or constrained is Bayesian Improved First Name Surname Geocoding (BIFSG). BIFSG estimates the probability that an individual belongs to each of several racial and ethnic categories, given their first name, surname and ZIP code, using publicly available US Census data and name-list data. The methodology was developed by RAND researchers (building on the earlier BISG approach pioneered by Elliott and colleagues), is well-documented in the academic literature, and is in the public domain. It is imperfect, producing probabilistic estimates with uncertainty that should be quantified and disclosed, but it is defensible, transparent and reproducible. Other proxy-imputation methodologies exist and may be appropriate depending on context; Colorado Regulation 10-1-1 does not name BIFSG or any specific methodology, leaving the choice to the insurer to justify.

Several open-source Python packages implement BISG and BIFSG. Surgeo is a widely used open-source Python package for BISG/BIFSG and was used in the NAIC Special Committee on Race and Insurance’s 2025 presentation on statistical methods for imputing race and ethnicity. Alternatives include the pyethnicity package; for teams working from the underlying methodology, the RAND BISG/BIFSG literature is the authoritative source. IBM AI Fairness 360 is a general fairness toolkit but does not include a BIFSG implementation out of the box; teams wanting to use AIF360 for the downstream fairness metrics need to bring their own group-inference step.
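
To make the workflow concrete, here is a minimal sketch, in Python, of the step that follows the group-inference call: turning per-applicant race and ethnicity probabilities (however produced, for example by Surgeo) into an inferred group plus a retained uncertainty measure for aggregate reporting. The column names and example values below are illustrative assumptions, not any package’s actual output schema.

```python
import pandas as pd

# Illustrative BIFSG output: one row per applicant, probabilities sum to ~1.0.
group_probs = pd.DataFrame(
    {
        "white": [0.82, 0.10, 0.55],
        "black": [0.05, 0.78, 0.20],
        "hispanic": [0.08, 0.07, 0.15],
        "asian": [0.05, 0.05, 0.10],
    },
    index=["app_001", "app_002", "app_003"],
)

inferred = pd.DataFrame(index=group_probs.index)
inferred["group"] = group_probs.idxmax(axis=1)               # most probable group
inferred["confidence"] = group_probs.max(axis=1)             # retain the uncertainty
inferred["low_confidence"] = inferred["confidence"] < 0.60   # flag for disclosure

# Store these outputs separately from individual records and use them only for
# aggregate fairness reporting, never for individual decisioning.
print(inferred)
```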

Applied to the holdout sample for this model, BIFSG produces inferred group membership for each applicant. We can then compute observed outcomes by group:

| Inferred group | Apps | STP rate | Decline rate | Premium per $1k face (ages 35 to 45) |
| --- | --- | --- | --- | --- |
| White | 100,000 | 45.1% | 8.2% | $0.92 |
| Black | 12,500 | 34.2% | 11.4% | $0.96 |
| Hispanic | 18,300 | 39.8% | 9.8% | $0.94 |
| Asian | 8,400 | 47.6% | 7.1% | $0.91 |

Now compute the standard fairness metrics. The four-fifths (80%) rule is the established US fair lending and employment benchmark for disparate impact: the selection rate for any group should be at least 80% of the rate for the most-favoured group. Applied to the STP rate (the favourable outcome):

| Comparison | Ratio | Threshold | Status |
| --- | --- | --- | --- |
| Black / White STP | 34.2 / 45.1 = 0.758 | ≥ 0.80 | Fail |
| Hispanic / White STP | 39.8 / 45.1 = 0.882 | ≥ 0.80 | Pass |
| Asian / White STP | 47.6 / 45.1 = 1.055 | ≥ 0.80 | Pass |

The model fails the four-fifths rule on the Black/White STP comparison. Demographic parity difference (Black minus White) is minus 10.9 percentage points.
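
A minimal sketch of the four-fifths computation behind the table above, assuming the group-level STP rates are already available. White is used as the reference group to match the table; the choice of reference group (largest group versus most-favoured group) is itself a methodological decision that should be documented.

```python
# Group-level STP rates from the table above (illustrative).
stp_rate = {"White": 0.451, "Black": 0.342, "Hispanic": 0.398, "Asian": 0.476}
reference = "White"

for group, rate in stp_rate.items():
    if group == reference:
        continue
    di_ratio = rate / stp_rate[reference]     # disparate impact ratio
    dp_diff = rate - stp_rate[reference]      # demographic parity difference
    status = "Pass" if di_ratio >= 0.80 else "Fail"
    print(f"{group} vs {reference}: DI = {di_ratio:.3f}, "
          f"DP diff = {dp_diff:+.1%} -> {status}")
```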

For equalised odds, restrict to the subset of the holdout sample for whom we have observed mortality outcomes (the five-year observation window). True positive rate (correctly identifying low-mortality applicants as STP candidates):

| Metric | White | Black | Gap |
| --- | --- | --- | --- |
| TPR (sensitivity) | 0.81 | 0.74 | 0.07 |
| FPR (1 − specificity) | 0.14 | 0.18 | 0.04 |

The TPR gap of 7 percentage points and the FPR gap of 4 percentage points together indicate that the model is materially less accurate for Black applicants than for White applicants in both directions. This is a separate problem from the disparate impact ratio: even if the rates were equal, the underlying error structure would be different.
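
A minimal sketch of the equalised-odds computation, assuming arrays of observed outcomes, model routings and inferred group labels for the subset with the five-year mortality observation. The variable names are illustrative.

```python
import numpy as np

def rates(y_true, y_pred):
    """TPR and FPR for binary arrays (1 = low-mortality / routed STP)."""
    tpr = y_pred[y_true == 1].mean()   # P(routed STP | genuinely low-mortality)
    fpr = y_pred[y_true == 0].mean()   # P(routed STP | not low-mortality)
    return tpr, fpr

def equalised_odds_gaps(y_true, y_pred, group, reference="White"):
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    ref_tpr, ref_fpr = rates(y_true[group == reference], y_pred[group == reference])
    gaps = {}
    for g in np.unique(group):
        if g == reference:
            continue
        tpr, fpr = rates(y_true[group == g], y_pred[group == g])
        gaps[g] = {"tpr_gap": ref_tpr - tpr, "fpr_gap": fpr - ref_fpr}
    return gaps
```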

By traditional actuarial standards the model is good. Under the insurer’s chosen fairness criteria, and under the likely scrutiny of the regulators this system is in scope of, the model would not yet be in a defensible production position and the team proceeds to remediation.

Phase 3: Explainability and root cause investigation (MEASURE 2.9)

To understand why the model is producing disparate outcomes, we run SHAP on the trained model. SHAP decomposes each individual prediction into additive contributions from each feature, in a way that satisfies the Shapley axioms of efficiency, symmetry, dummy and additivity. The mean absolute SHAP value across the holdout sample gives us a global feature importance ranking; the per-applicant SHAP vector gives us local explanations.
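
A minimal sketch of the global SHAP computation, assuming a trained XGBoost model object and a holdout feature frame already exist under the illustrative names model and X_holdout. The output is the mean absolute SHAP ranking reported in the table below.

```python
import numpy as np
import pandas as pd
import shap

explainer = shap.TreeExplainer(model)            # model: the trained XGBoost classifier
shap_values = explainer.shap_values(X_holdout)   # one row per applicant, one column per feature

# For a multi-class model shap_values is a list with one array per class;
# select the class of interest (for example the STP routing) before aggregating.
if isinstance(shap_values, list):
    shap_values = shap_values[0]

global_importance = (
    pd.Series(np.abs(shap_values).mean(axis=0), index=X_holdout.columns)
      .sort_values(ascending=False)
)
print(global_importance.head(8))                 # compare with the table below
```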

Global SHAP importance, top eight features (illustrative):

| Rank | Feature | Mean abs. SHAP value |
| --- | --- | --- |
| 1 | Rx history (cardiovascular composite) | 0.31 |
| 2 | Credit-based insurance score | 0.24 |
| 3 | Smoker status | 0.21 |
| 4 | BMI | 0.18 |
| 5 | Age | 0.16 |
| 6 | MVR violations (3 years) | 0.12 |
| 7 | Rx history (mental health composite) | 0.09 |
| 8 | Public records (judgments, liens) | 0.07 |

Now do the subgroup attribution. For Black applicants who were declined or routed to FULL underwriting, compute the average SHAP contribution of each feature, and compare it to the equivalent average for White applicants in the same outcome category:

| Feature | Avg SHAP (Black declined/full) | Avg SHAP (White declined/full) | Ratio |
| --- | --- | --- | --- |
| Credit-based insurance score | +0.42 | +0.20 | 2.10× |
| Public records | +0.19 | +0.11 | 1.73× |
| Rx history (cardiovascular) | +0.34 | +0.31 | 1.10× |
| MVR violations | +0.09 | +0.08 | 1.13× |
| BMI | +0.16 | +0.15 | 1.07× |

The credit-based insurance score is contributing more than twice as much to adverse decisions for Black applicants as for White applicants. Public records show a similar but weaker pattern. The clinical features (Rx history, BMI) show no meaningful disparity in contribution.

This is the actionable finding. The model is leaning on a credit-derived feature whose contribution to adverse outcomes for Black applicants is out of proportion to the incremental mortality signal it provides. The clinical features are doing the work the company actually intends them to do.
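
A minimal sketch of the subgroup attribution computation behind the table above, assuming the shap_values array and X_holdout frame from the earlier sketch, plus a routing-outcome Series and the BIFSG inferred-group Series. Outcome labels and variable names are illustrative.

```python
import pandas as pd

shap_df = pd.DataFrame(shap_values, columns=X_holdout.columns, index=X_holdout.index)
adverse = outcome.isin(["DECLINE", "FULL"])      # illustrative routing labels

def mean_adverse_shap(group_label):
    mask = adverse & (inferred_group == group_label)
    return shap_df.loc[mask].mean()

comparison = pd.DataFrame({
    "black_adverse": mean_adverse_shap("Black"),
    "white_adverse": mean_adverse_shap("White"),
})
comparison["ratio"] = comparison["black_adverse"] / comparison["white_adverse"]
print(comparison.sort_values("ratio", ascending=False).head(10))
```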

A counterfactual explanation for an individual applicant illustrates the same point at the local level. Take the case of a 38-year-old Black male, BMI 27, non-smoker, no Rx flags, average MVR record, who was routed to FULL underwriting by the model. The counterfactual question is: “what minimal change to this applicant’s features would have produced a STANDARD or STP routing, all other features equal?” The answer for this individual is: “a credit-based insurance score one tier higher would have routed the applicant to STANDARD; two tiers higher would have routed to STP; no change to any clinical, behavioural or demographic feature would have changed the routing.”
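
A minimal sketch of a single-feature counterfactual search of the kind described above. The predict_routing callable is a hypothetical wrapper around the production model and its preprocessing pipeline; it is assumed, not part of any library.

```python
def single_feature_counterfactuals(predict_routing, applicant, candidate_changes,
                                   target=("STANDARD", "STP")):
    """Search single-feature edits that change the routing, all else held equal.

    predict_routing:   hypothetical callable wrapping the production model and its
                       preprocessing; takes a dict of features, returns a routing label.
    applicant:         dict of the applicant's current feature values.
    candidate_changes: {feature_name: [alternative values to try]}.
    """
    found = []
    for feature, alternatives in candidate_changes.items():
        for value in alternatives:
            variant = dict(applicant)            # copy; all other features unchanged
            variant[feature] = value
            routing = predict_routing(variant)
            if routing in target:
                found.append({"feature": feature, "from": applicant[feature],
                              "to": value, "new_routing": routing})
    return found

# Illustrative use: try the credit-based insurance score one and two tiers higher.
# single_feature_counterfactuals(predict_routing, applicant,
#     {"credit_score_tier": [applicant["credit_score_tier"] + 1,
#                            applicant["credit_score_tier"] + 2]})
```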

A word on what this counterfactual does and does not do legally. It is a powerful explanatory artefact. It supports the EU AI Act Article 13 and Article 86 transparency obligations (Article 13 being a provider obligation to supply instructions to the deployer, Article 86 giving affected persons a right to explanation from the deployer) by giving affected individuals a meaningful account of the role the system played in a decision and the main elements of that decision. It provides the substantive content that an ECOA principal-reasons notice (for covered credit decisions) or a follow-up explanation accompanying an FCRA adverse action notice would draw on. What it does not do, strictly speaking, is satisfy the formal content requirements of an FCRA adverse action notice, which focus on the fact of the adverse action, the source of the consumer report, and the consumer’s rights to dispute and obtain a free copy. Counterfactuals and formal notices live in complementary layers: the notice satisfies the legal formality; the counterfactual gives the individual an actionable understanding. Counterfactual explanations are not yet routine in actuarial practice and they should be.

Phase 4: Remediation cycle

There are three classes of remediation available, and the choice between them is a governance decision, not a technical one.

Option A: feature removal. Drop the credit-based insurance score from the feature set. Retrain. This is the simplest, most transparent and most defensible option. It has the largest absolute cost in predictive performance but the smallest implementation risk and the cleanest audit trail.

Option B: in-processing fairness constraints. Use an adversarial debiasing method (e.g. the algorithms in IBM AI Fairness 360 or Microsoft FairLearn) to retrain the model with an explicit fairness constraint. This preserves more predictive signal but introduces an additional model component, which itself needs validation, documentation and explanation. It is harder to defend to a non-technical regulator.
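
For teams evaluating Option B, the sketch below shows the general shape of an in-processing mitigation using Fairlearn’s reductions approach (ExponentiatedGradient with a demographic parity constraint) rather than adversarial debiasing; it illustrates the workflow, not a recommended configuration. X_train, y_train (framed here as a binary STP / not-STP target), X_holdout and the inferred-group Series sensitive are assumed to exist.

```python
from fairlearn.reductions import DemographicParity, ExponentiatedGradient
from xgboost import XGBClassifier

base_estimator = XGBClassifier(n_estimators=500, max_depth=6)
mitigator = ExponentiatedGradient(
    estimator=base_estimator,
    constraints=DemographicParity(),   # constrain selection-rate differences across groups
)
# y_train is framed as a binary STP / not-STP target for this sketch.
mitigator.fit(X_train, y_train, sensitive_features=sensitive)

y_pred_mitigated = mitigator.predict(X_holdout)
# The mitigated model is an additional component: it needs its own validation,
# documentation and explanation, which is the cost flagged above.
```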

Option C: post-processing threshold adjustment. Apply different decision thresholds for different inferred groups so that the rates equalise. In US insurance this is legally hazardous (it can be characterised as intentional disparate treatment) and in EU regulation it sits very close to the prohibition on protected-class processing. We do not recommend it.

For this example, the governance committee chooses Option A. The team retrains with the credit-based insurance score removed and re-runs the full validation pack:

| Metric | Pre-remediation | Post-remediation | Change |
| --- | --- | --- | --- |
| AUC | 0.823 | 0.804 | −0.019 |
| Brier score | 0.045 | 0.048 | +0.003 |
| KS statistic | 0.51 | 0.48 | −0.03 |
| Top-decile lift | 4.2× | 3.9× | −0.3 |
| Overall STP rate | 42% | 39% | −3 pp |
| Black/White DI ratio | 0.758 | 0.847 | +0.089 |
| Hispanic/White DI ratio | 0.882 | 0.901 | +0.019 |
| TPR gap (Black vs White) | 0.07 | 0.04 | −0.03 |
| FPR gap (Black vs White) | 0.04 | 0.03 | −0.01 |

The model now passes the four-fifths rule with margin; at 0.847 it sits just below the company’s own internal threshold of 0.85, and that shortfall is recorded as part of the residual fairness gap accepted under MANAGE 1.4. The TPR and FPR gaps are materially narrower. The cost is a 1.9 percentage point AUC reduction and a 3 percentage point reduction in overall STP rate. The actuarial committee accepts this trade-off, documents it explicitly in the MEASURE evidence pack, and approves the model for production.

This is the kind of trade-off documentation that NIST RMF MEASURE 2.5 and MEASURE 2.11 require, that ASOP 56 §4.1 disclosures expect, and that an EU AI Act Article 9 provider risk management file would absorb directly.

Phase 5: Privacy and security (MEASURE 2.10, MEASURE 2.7)

Two further measures rarely covered in traditional actuarial validation but explicitly required by the RMF:

  • Privacy: ECDIS data is processed under FCRA in the US and under GDPR (with the EU AI Act Article 10(5) exception for bias detection and correction in high-risk systems, where applicable) in the EU. Data minimisation principles applied. Fields not used by the production model are not retained beyond the validation period. BIFSG inference outputs are stored separately from individual records and used only for aggregate fairness reporting, never for individual decisioning.
  • Security and resilience: the model is served behind an authenticated, rate-limited API. Adversarial input testing has been carried out using synthetic edge cases (rare ZIP codes, names with low BIFSG confidence, applications at the extremes of the age and face amount ranges). The model’s response under deliberate adversarial perturbation is documented. A separate monitoring stream tracks prediction confidence distributions for anomalies that could indicate input manipulation.

Manage artefacts

Once the model is in production, the Manage function takes over. For this model:

  • Drift monitoring: monthly Population Stability Index calculation on every input feature (a minimal PSI sketch follows this list); monthly tracking of prediction distribution; monthly tracking of STP, STANDARD and FULL routing rates by inferred group.
  • Fairness re-test: full quarterly re-computation of disparate impact ratio, equalised odds, and the SHAP subgroup attribution on the most recent quarter of applications.
  • Trigger thresholds: PSI above 0.15 on any feature triggers investigation; PSI above 0.25 triggers automatic halt of the affected pathway; DI ratio below 0.80 triggers automatic halt; AUC drop above 0.03 against the last validation triggers retraining.
  • Incident response runbook (MANAGE 2.3, MANAGE 4.3): named incident commander (Chief Actuary); escalation chain; communications to the AI Governance Committee, the board’s risk committee, the relevant state insurance departments, and (where the system is in scope of the EU AI Act and a serious incident as defined in Article 73 has occurred) the relevant competent authority for the EU subsidiary’s jurisdiction under the provider’s Article 73 reporting obligation, with the deployer’s Article 26 duty to inform the provider and authorities running in parallel.
  • Residual risk acceptance (MANAGE 1.4): formal sign-off by the Chief Actuary recording the residual risks the company has chosen to accept (in this case, the residual fairness gap, the inherent uncertainty in BIFSG inference, and the model’s reduced applicability outside the trained age/face-amount range), reviewed annually.
  • Decommissioning planning (MANAGE 2.4, GOVERN 1.7): scheduled 24-month review at which the model is either recertified, retrained, replaced or retired.
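
A minimal sketch of the PSI calculation referenced in the drift-monitoring bullet above, binning each feature on the validation baseline and comparing the production month’s distribution against it. The thresholds are the ones set in the risk tolerance statement.

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI between a validation baseline and a production month for one feature."""
    baseline, current = np.asarray(baseline, float), np.asarray(current, float)
    edges = np.histogram_bin_edges(baseline, bins=bins)
    current = np.clip(current, edges[0], edges[-1])   # keep out-of-range values in the end bins
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)          # avoid log(0) in sparse bins
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# Threshold-and-action table from the risk tolerance statement:
# PSI > 0.15 on any feature -> investigate; PSI > 0.25 -> halt the affected pathway.
```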

Practice-area call-outs

A few brief notes on where this example would diverge for other practice areas:

  • P&C personal auto with telematics: the analogous data is a continuous behavioural stream rather than discrete ECDIS pulls. Fairness analysis is more complicated because driving behaviour is partly a free choice and partly a function of economic and geographic constraint. The MEASURE function looks similar, but MAP 5.1 (impact analysis) is heavier and the question “what is the protected-class proxy in this feature set” has many more candidate answers.
  • Reinsurance treaty pricing using catastrophe models: the model is almost always a third-party model, so GOVERN 6.1 (third-party AI risk) and MAP 4.1 (legal risks of components) carry most of the weight. The MEASURE function devolves substantially into vendor evidence review rather than first-party testing.
  • Pensions longevity modelling: lower fairness sensitivity in many populations because schemes are closed and demographic profiles are well understood. Explainability matters, particularly to trustees, but the disparate-impact lens is less applicable. MEASURE 2.9 (explainability) is the priority subcategory.

These are previewed here and treated in full in Part 3 of this series.

Section 4: Worked Example 2, Health Claims Prior Authorisation

The scenario

A US health insurer (or a TPA) operates a model that triages incoming prior authorisation requests. The model is a binary classifier (XGBoost again, for like-for-like comparison) trained on two million historical PA requests with their final outcomes. The output is a single score: above the threshold, the request is auto-approved and processed without clinical review; below the threshold, the request is routed to a clinician for medical necessity review. Features are CPT and HCPCS procedure codes, ICD-10 diagnosis codes, provider NPI and specialty, prior treatment history for the same member, claim amount, patient age and sex.

This is a higher-stakes setting than the underwriting example because the system can directly affect a member’s access to care. By design, the model can only auto-approve. It cannot deny. Every adverse determination is made by a qualified human reviewer.

This design choice does most of the work of satisfying MAP 3.5 (human oversight) and provides the practical human-in-the-loop architecture that EU AI Act Article 14 contemplates for high-risk systems where Article 14 applies. It substantially reduces the practical risk profile of the system and the likely severity of harms. It does not, however, change the system’s formal risk classification under the RMF (which is not a legal classification scheme in the first place), under the NAIC Model Bulletin (which applies a proportional, risk-based treatment rather than a binary taxonomy), or under the EU AI Act (where high-risk classification is determined by Article 6 and Annex III rather than by oversight design). Whether a US prior authorisation system is caught by the EU AI Act at all depends on whether the deployer also operates in the EU and whether the system’s purpose maps to any Annex III category. For a domestic US health insurer or TPA, that scoping is not settled and should not be assumed. Teams operating in both markets should seek a specific legal determination, including a role-specific assessment of provider versus deployer obligations.

Govern and Map artefacts (compressed)

The Govern and Map work for this model differs from the underwriting example in three respects worth flagging:

  • The cross-functional committee includes the Medical Director as a voting member alongside the Chief Actuary. This is not optional.
  • The intended-purpose statement is constructed around what the model cannot do, not what it can: “this system is permitted to auto-approve prior authorisation requests; it is not permitted under any circumstance to produce an adverse determination, to influence one, or to inform the standard of clinical review applied to a routed request.”
  • The MAP 5.1 impact analysis explicitly enumerates the stakeholder groups who could be harmed by an over-conservative routing decision (members denied timely access to care), an over-liberal routing decision (inappropriate care approved), or a systematic bias in either direction.

Measure artefacts

Phase 1: Traditional performance

Illustrative numbers:

| Metric | Value |
| --- | --- |
| AUC | 0.91 |
| Precision at auto-approve threshold | 0.94 |
| Recall at auto-approve threshold | 0.71 |
| Auto-approval rate | 58% |
| Average time saved per auto-approved request | 4.2 days |

These are good numbers. The model is conservative by design: 29 percent of requests that would have been approved by a clinician are still routed to clinical review, which is the safer error to make.

Phase 2: Fairness testing, with the appeal-rate canary

Run the same BIFSG-based subgroup analysis as in Example 1:

| Inferred group | Members | Auto-approve % | Appeal rate | Appeal success rate |
| --- | --- | --- | --- | --- |
| White | 1,200,000 | 61.2% | 1.2% | 38% |
| Black | 280,000 | 51.8% | 2.3% | 52% |
| Hispanic | 320,000 | 55.7% | 1.6% | 41% |
| Asian | 95,000 | 63.4% | 1.0% | 36% |

The disparate impact ratio for auto-approval (Black/White) is 51.8 / 61.2 = 0.846, which marginally passes the four-fifths rule. By the underwriting example’s standards, this would be a “watch but proceed” outcome.

The more telling signal is the appeal success rate. Black members appeal more often and they succeed more often when they appeal. A 52% appeal success rate compared to 38% for the reference group means that, conditional on a request being routed to clinical review and then appealed, the underlying claim was substantially more likely to be valid for Black members than for White members. The model’s initial routing decisions are systematically more wrong, in the safer direction, for Black members.

This is the appeal-rate disparity canary, and it deserves special attention because it has a property that the BIFSG-based DI ratio does not: it does not require any protected-class data at all. You can compute appeal rate and appeal success rate by any segment available to you, including segments you would never use as protected-class proxies. Geography, plan type, employer group, age band, network status. If any of those segments shows a meaningfully higher appeal success rate than others, the model is making systematically worse decisions for that segment, regardless of whether the segment correlates with a protected class. This is a fairness signal that costs nothing to compute and is available to every health insurer running a routing model. In the engagements we see, it is rarely produced as routine MEASURE evidence, even though it is cheap to compute and powerful as a fairness signal. It should be.
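
A minimal sketch of the appeal-rate canary, assuming a DataFrame of prior authorisation requests with the illustrative column names used below. No protected-class data is required; run it across whatever segments are available.

```python
import pandas as pd

def appeal_canary(pa: pd.DataFrame, segment: str) -> pd.DataFrame:
    """Appeal rate and appeal success rate by segment.

    Assumed columns: the segment column, 'appealed' (bool), 'appeal_upheld' (bool).
    """
    out = pa.groupby(segment).agg(
        requests=("appealed", "size"),
        appeal_rate=("appealed", "mean"),
    )
    appealed = pa[pa["appealed"]]
    out["appeal_success_rate"] = appealed.groupby(segment)["appeal_upheld"].mean()
    return out.sort_values("appeal_success_rate", ascending=False)

# Run across any available segment: geography, plan type, employer group,
# age band, network status, e.g. appeal_canary(pa, "plan_type")
```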

Phase 3: Root cause investigation

Run SHAP on the routed-then-approved subset (the cases where the model said “review this” but the clinician said “this was fine”). The pattern that emerges:

  • Provider type (a derived feature combining specialty, hospital affiliation and tax ID) is contributing 1.8× more to adverse routing decisions for safety-net hospital providers than for non-safety-net providers
  • Certain CPT code combinations common in those settings are flagged as outliers because they are underrepresented in the training data

The model was trained on a dataset that systematically underrepresented care delivered in safety-net settings. This is a data sufficiency problem (ASOP 23 §3.2 territory) showing up as a fairness problem at MEASURE 2.11.

Phase 4: Remediation

Re-stratify the training set so that safety-net provider cases are appropriately represented. Add an explicit hospital-type indicator as a feature, with calibration applied so that the indicator does not itself become a proxy for member demographics. Retrain.
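
A minimal sketch of the re-stratification step, resampling the training set so that the mix of hospital types matches the live request mix. Column names and target shares are illustrative assumptions.

```python
import pandas as pd

def restratify(train: pd.DataFrame, target_mix: dict, key: str = "hospital_type",
               random_state: int = 0) -> pd.DataFrame:
    """Resample `train` so the share of each `key` value matches `target_mix`."""
    n_total = len(train)
    parts = []
    for value, share in target_mix.items():
        subset = train[train[key] == value]
        n_target = int(round(share * n_total))
        parts.append(subset.sample(n=n_target,
                                   replace=n_target > len(subset),   # oversample if under-represented
                                   random_state=random_state))
    return pd.concat(parts).sample(frac=1.0, random_state=random_state)  # shuffle

# e.g. restratify(train, {"safety_net": 0.18, "non_safety_net": 0.82})
```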

Post-remediation:

| Group | Auto-approve % | Appeal rate | Appeal success rate |
| --- | --- | --- | --- |
| White | 61.0% | 1.2% | 38% |
| Black | 56.4% | 1.6% | 41% |
| Hispanic | 57.2% | 1.4% | 39% |
| Asian | 63.1% | 1.0% | 36% |

DI ratio for auto-approval (Black/White) improves from 0.846 to 0.925. Appeal success rate disparity narrows from 14 percentage points to 3 percentage points. Overall auto-approval rate is broadly unchanged. AUC moves from 0.91 to 0.90 (negligible). The model is now in a defensible position.

Manage artefacts and a post-deployment incident

Three months after deployment, the drift monitor fires. The PSI on the CPT code distribution feature has crossed 0.18. The team investigates and finds that a CMS fee schedule update has introduced new HCPCS codes for several services. The model has never seen these codes in training and is treating them as low-frequency outliers, with the predictable consequence that requests using the new codes are being routed to clinical review at materially higher rates.

The runbook is invoked. The Medical Director and the AI Governance Committee chair are paged. The decision is made to halt the auto-approval pathway for procedure categories affected by the new codes, falling back to full clinical routing for those categories only. Communications go out to providers explaining the temporary change. The model is retrained on a refreshed dataset including the new codes within five business days. The MANAGE 4.3 incident report is filed internally; external reporting obligations depend on jurisdictional scope and on the insurer’s role under any applicable AI regulation. For a domestic US health insurer, state insurance department notification and any contractual incident reporting to plan sponsors would apply. If the deployer also operates in the EU and the system is in scope of an Annex III category under the EU AI Act, the Article 73 serious-incident reporting obligation sits primarily with the provider, with the deployer’s Article 26 duty to inform the provider and competent authorities running in parallel where the incident meets the Article 3 definition of a serious incident.

This incident is teaching material. Traditional actuarial monitoring would have caught the broad PSI movement, but it would not have led to the diagnosis as quickly because the failure was distributed across many code-level features rather than concentrated in a single headline metric. The reason it was caught and resolved within five days is that the team had a runbook, a named incident commander, and a pre-established threshold-and-action table. That is what MANAGE 4.1 looks like in practice.

Section 5: A reusable MEASURE evidence pack template

Below is the structure of the document that each material AI system should have, kept in version control, regenerated at every recertification. It is opinionated; adapt to your environment. The names of the sections deliberately mirror NIST RMF subcategories so that an external audit can map the document directly to the framework.

MEASURE Evidence Pack: [Model name and version]

1.  Identification
    1.1  Model name, version, training date, owner
    1.2  Risk classification (NAIC/Colorado/EU AI Act/internal)
    1.3  EU AI Act role (provider, deployer, or both) where relevant
    1.4  Cross-references to inventory entry, governance committee
         minutes, and ASOP 56 model documentation

2.  Test data (MEASURE 2.1)
    2.1  Sources, dates, sampling method
    2.2  Pre-processing pipeline
    2.3  Train/validation/holdout split rationale
    2.4  Population coverage and representation analysis

3.  Performance (MEASURE 2.3, 2.5)
    3.1  Discrimination (AUC, KS, lift)
    3.2  Calibration (Brier, Hosmer-Lemeshow, calibration plots)
    3.3  Stability (PSI vs holdout, vs prior validation)
    3.4  Sensitivity to feature perturbation
    3.5  Acceptance criteria and pass/fail status

4.  Security and resilience (MEASURE 2.7)
    4.1  Threat model
    4.2  Adversarial input testing results
    4.3  Authentication, rate limiting, monitoring
    4.4  Access controls

5.  Transparency and explainability (MEASURE 2.8, 2.9)
    5.1  Model architecture description (audience: technical reviewer)
    5.2  Global SHAP / feature importance with interpretation
    5.3  Sample local SHAP explanations for representative cases
    5.4  Sample counterfactual explanations
    5.5  Mapping of output language to user-facing communications

6.  Privacy (MEASURE 2.10)
    6.1  Data inventory and lawful basis
    6.2  Data minimisation evidence
    6.3  Retention schedule
    6.4  Special-category data handling (Art. 9 / Art. 10(5) AI Act
         where applicable; HIPAA where in covered-entity capacity;
         GLBA / FCRA where applicable)

7.  Fairness and bias (MEASURE 2.11)
    7.1  Group inference methodology and uncertainty (BIFSG or equivalent)
    7.2  Disparate impact ratio across all relevant outcomes
    7.3  Demographic parity metrics
    7.4  Equalised odds metrics where ground truth is available
    7.5  Subgroup SHAP attribution analysis
    7.6  Appeal-rate / outcome-disparity canary metrics where applicable
    7.7  Trade-off documentation if remediation has been applied
    7.8  Residual fairness risk and acceptance (links to MANAGE 1.4)

8.  Feedback and engagement (MEASURE 3.3)
    8.1  Channels available to affected populations
    8.2  Internal user feedback mechanism
    8.3  Summary of feedback received in the reporting period

9.  Trustworthiness in deployment (MEASURE 4.2)
    9.1  Live monitoring metrics and dashboards
    9.2  Incidents in the reporting period
    9.3  Drift status

10. Sign-off
    10.1  Independent reviewer (MEASURE 1.3 / APS X2 v1.1 peer review where applicable)
    10.2  Chief Actuary or Appointed Actuary
    10.3  AI Governance Committee resolution number

A team that produces this document for every material AI system, refreshes it on a quarterly cadence, and stores it under version control will be able to respond to any RMF, NAIC, Colorado, NYDFS or EU AI Act audit request within hours rather than weeks. That is the operational benefit of doing the work, separate from any specific regulatory obligation.

Section 6: Cross-practice call-outs

Part 3 of this series treats the practice areas in full. The brief notes below are the most important ways in which the underwriting and prior-auth examples above would differ for other actuarial work.

Property and casualty pricing. Telematics pricing for personal auto is the closest analogue to the life accelerated underwriting example. The fairness analysis is harder because driving behaviour is partly chosen and partly forced by economic geography. The same MEASURE 2.11 toolkit applies, but the “find the protected-class proxy” exercise is more delicate. Image-AI claims handling for first notice of loss introduces a new category of MEASURE 2.7 (security) work, because the inputs are images that can be adversarially perturbed.

Catastrophe and reinsurance modelling. Almost always built on third-party vendor models. The Govern function (specifically GOVERN 6.1) and the Map function (MAP 4.1) carry the weight. The MEASURE function devolves substantially into vendor evidence review and benchmarking against alternative models, rather than first-party testing.

Pensions longevity modelling. Lower fairness sensitivity in closed populations, but explainability remains important to trustees, sponsors and the Pensions Regulator. MEASURE 2.9 is the priority subcategory. MAP 5.1 (impact analysis) is dominated by the question of intergenerational equity rather than disparate impact.

Capital modelling and ESG. Solvency II and US RBC capital models that incorporate machine learning components fall squarely under model risk management as the actuarial profession has always understood it. The bridge to the RMF is mostly a relabelling exercise. The novel work is at MEASURE 2.7 (security) and at MEASURE 2.9 (explainability) for any non-linear component.

Climate and catastrophe modelling. A specific case where MAP 1.5 (risk tolerance) and MAP 5.1 (impact analysis) require explicit treatment of deep uncertainty rather than estimable risk. The framework accommodates this, but the language is uncomfortable for practitioners trained to quantify everything.

Enterprise risk management. ERM is the natural home for the inventory, the risk register and the residual risk acceptance documentation. The CRO and the Chief Actuary should agree which function owns the AI risk register early, because both have a legitimate claim and parallel ownership produces gaps.

Section 7: The first week, the first month, the first quarter

The point of this article is not just to teach a framework but to enable a team to act. If you read this article on Monday morning and want to start producing artefacts by Friday, here is the roughly correct sequence.

Week 1

  • Make a list of every machine learning, statistical, or algorithmic system in your function that produces outputs used in business decisions. Include vendor-supplied models. This is your AI inventory candidate list.
  • For each item on the list, classify it against the four risk lenses: NAIC AIS Program in scope yes/no, Colorado Reg 10-1-1 in scope yes/no (if you offer individual life insurance, private passenger automobile insurance or health benefit plans in Colorado and use ECDIS), EU AI Act Annex III in scope yes/no (subject to fact-specific Article 6 determination, with provider/deployer role separately assessed), and internal materiality high/medium/low.
  • Identify the single most consequential system on the list and book a working session for the team to walk it through Section 1’s mapping table.
  • Identify whether your organisation already has a cross-functional AI governance committee. If yes, get yourself on it. If no, draft a one-page proposal for one and put it in front of your CRO or COO.

Month 1

  • Produce the AI inventory entry, intended-purpose statement, and stakeholder map for your most consequential system. These are the Govern and Map artefacts. They are mostly writing, not technical work.
  • Run the traditional validation pack on the system if you have not done so recently. This is your baseline.
  • Identify whether you have access to a BIFSG implementation. If not, procure or build one, or evaluate alternative proxy-imputation methods appropriate to your context. Surgeo, discussed in Section 3, is a widely used open-source option, and the underlying RAND methodology is well documented in the academic literature. Document the inference uncertainty.
  • Compute the disparate impact ratio for the system’s key outcomes by inferred group. Document the result, whatever it is.
  • Run SHAP global feature importance on the model. Document the top features and the mean absolute SHAP values. This is the start of your MEASURE 2.9 evidence.

Quarter 1

  • Produce the full MEASURE evidence pack using Section 5’s template for the most consequential system.
  • If the fairness testing surfaces issues, work through the remediation cycle as in Worked Example 1. Document the trade-off in writing. Get it signed off.
  • Repeat the inventory exercise for your second and third most consequential systems and book them into the next two quarters of work.
  • Establish the drift monitoring and incident response runbook for any system in production.
  • Schedule the AI governance committee’s first quarterly review of the MEASURE evidence pack. Invite the CRO, the Chief Underwriting Officer where relevant, the Medical Director where relevant, Legal and Compliance.

A team that completes that sequence on its top three systems within a quarter is, in practical terms, RMF-aligned. The ongoing work is the maintenance of the artefacts, the quarterly fairness re-tests, and the response to incidents and drift. None of it is exotic. It is the existing actuarial discipline of model risk management, extended into three new artefact categories and documented in a vocabulary the regulator understands.

Closing

The NIST AI Risk Management Framework is not a new burden for actuaries. It is a relabelling and a modest extension of the model governance work the profession has been doing for a generation, expressed in the vocabulary that regulators, lawyers, auditors and engineers now share. Three new artefact categories matter most:

  • documented fairness and bias testing
  • global and local explainability evidence
  • explicit socio-technical and human oversight documentation

Each of them is learnable. Each of them is producible by a competent actuarial team in a quarter. None of them is beyond what professional standards already implicitly ask for.

“The framework is the language. The artefacts are the work. The next move is yours.”

Part 3 of this series applies these methods across life, health, property and casualty, and pensions, with shorter call-outs for reinsurance, capital and climate modelling. Part 4 turns to the NIST Generative AI Profile and the specific questions it raises for actuaries using large language models in documentation, code, research and member communications.

Ready to operationalise NIST AI RMF and the Generative AI Profile in your actuarial function?

Talk to our team about how Globebyte can help you build the governance structures, the eval suites, the RAG systems and the MEASURE evidence packs. From strategic alignment to working code.

Explore our services
