Article 3 of 4 in the series NIST AI RMF for Actuaries
The first article in this series made the strategic case for NIST AI RMF as the operating spine of actuarial AI governance. The second was a deep practitioner’s guide built around two worked examples in life accelerated underwriting and health prior authorisation, both carrying numerical artefacts through Govern, Map, Measure and Manage end to end.
This article does something different. It picks four representative use cases across the major actuarial practice areas, each chosen because it teaches a distinctive lesson about how the framework lands in different contexts. The same four functions apply everywhere. The relative weight of each function changes dramatically depending on what the model is being asked to do, who it affects, and what the failure modes look like.
The guiding observation is that NIST AI RMF is not a checklist that gets applied uniformly. It is a structured set of questions, and which questions matter most depends on the practice area. A lapse model is not an underwriting model. A care management model is not a prior authorisation model. A telematics pricing model is not a credit-based pricing model. A longevity model is barely the same kind of object at all. The framework accommodates this. The actuary’s job is to know which subcategories carry the weight in their context.
A note on registers. As in Parts 1 and 2 of this series, this article distinguishes four categories of authority: law (binding regulation or statute), supervisory expectation (regulator guidance, often “should” framing, used in examinations), professional standard (binding within a profession on its members), and author recommendation (our own practitioner judgement). Each vignette below flags the relevant register where the distinction matters. Specifically, the discussion of which AI consumer-protection regimes a given model falls inside or outside is a legal-scope question; the discussion of which RMF subcategories deserve the heaviest investment is a governance-scope question. Models can sit outside the legal scope of one regime while sitting squarely inside the governance scope of internal model risk management. The two are not in tension; they are simply different scopes.
The numbers in this article are illustrative. As in Part 2, they are synthetic but realistic and must not be taken as guidance for any production parameter setting. The Obermeyer study referenced in Section 2 is the one exception. It is a real published finding and we cite it as such. ASOP 56 section references throughout this article have been verified against the published standard (Doc. No. 195, December 2019, effective 1 October 2020). ASOP 27 references are to the revised standard (Selection of Assumptions for Measuring Pension Obligations) effective 1 January 2025, which absorbed the previously separate ASOP 35.
Section 1: Life, Lapse and Policyholder Behaviour Modelling
The lesson: AI risk that does not look like AI risk
A large life insurer operates a gradient boosting model that predicts the probability of lapse, surrender or partial withdrawal for each in-force policy in its term, whole life and universal life books. The model is used in three places: ALM and hedging (to project liability cash flows), reserving (to inform best-estimate assumption setting under IFRS 17 and US statutory frameworks), and in-force management (to identify retention candidates for proactive customer outreach). Features include policy attributes (face amount, duration in force, surrender charge schedule, premium mode, recent premium history), policyholder attributes (issue age, current age, sex, originating distribution channel), economic context (interest rates, equity market level, credit spreads), and behavioural history (recent customer service interactions, login frequency for digital channels).
This is a textbook actuarial use case. It does not feel like an “AI” application. There is no consumer-facing decision. No applicant is declined. No claim is denied. No premium changes as a result of any individual prediction.
It is, however, unambiguously a material model under the RMF and under ASOP 56. The harder scoping question is the regulatory one, and the answer there depends on what the model is actually used for.
A note on legal scope versus governance scope. A lapse model used purely for hedging, reserving and assumption setting is less likely to be the direct legal focus of consumer-facing AI regimes (the NAIC AI Systems Program, Colorado Regulation 10-1-1, EU AI Act Annex III 5(c)) than a system that drives consumer-facing decisions or pricing. That is a legal-scope observation. It does not mean the model sits outside governance scope. The same model is squarely within: ASOP 56 model risk management (US professional standard); the firm’s own model risk policy and AI governance committee remit (internal governance); Solvency II Article 124 model validation expectations where the model sits inside an approved internal model used to calculate the SCR (EU law); and the equivalent expectations in other capital regimes. Many real-world deployments do double duty across reserving and retention, in which case the consumer-facing AI regimes also engage. The scoping question is a live one and worth answering explicitly in the AI inventory entry, with the legal-scope and governance-scope answers given separately.
Where the weight sits
The fairness work that dominated Part 2 is light here. There is no individual decision and no protected-class proxy concern of the ECDIS variety, at least for the reserving and hedging uses. MEASURE 2.11 becomes a watching brief unless the model is also driving retention outreach, in which case it moves back to the foreground.
The work that dominates is at three other places.
MAP 1.5, Risk tolerance. Lapse model error does not produce an individual harm; it produces a systemic one. A persistently optimistic lapse assumption (predicting more lapse than will occur) understates reserves and overstates capital adequacy. A persistently pessimistic assumption (predicting less lapse) overstates reserves and undermines hedging effectiveness. The risk tolerance statement for this model has to be expressed in terms of the cumulative impact on technical provisions and on hedge effectiveness, not in terms of accuracy on individual policies. This is a different kind of conversation than the one a P&C pricing committee has, and the actuarial profession has been having it for a generation. The work is to write it down in RMF-compatible form.
MEASURE 2.5, Validity and reliability. ASOP 56 §3.6 territory (model testing and model output validation), but with one twist. Conventional out-of-sample validation is necessary but not sufficient for a model whose predictions are used at long durations and whose ground truth only emerges over years. The MEASURE 2.5 evidence pack for this model needs to include a back-testing methodology that explicitly addresses delayed-feedback bias: how do you demonstrate that the model as it stood in 2018 would have correctly predicted 2024 behaviour, when the outcomes needed to check it have only just finished emerging? The standard answer is rolling-origin retrospective testing supplemented with stress scenario analysis and sensitivity to economic assumptions. The new requirement is to document this as MEASURE evidence, in RMF vocabulary, alongside the existing ASOP 56 documentation.
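What rolling-origin testing can look like in practice: the sketch below fits the model only on data observed up to each origin year and scores it on the experience that emerged afterwards, reporting both a discrimination metric and the actual-to-expected ratio that the risk tolerance statement is expressed against. The column names (exposure_year, lapsed), the origin years and the fit_fn wrapper are illustrative assumptions, not a prescribed interface.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def rolling_origin_backtest(df, fit_fn, feature_cols, label_col="lapsed",
                            origins=(2016, 2018, 2020, 2022), horizon=2):
    """For each origin year, fit on data observed up to the origin and
    score on the outcomes that emerged over the following `horizon` years.
    This is the delayed-feedback check: would the model we *would have had*
    at each origin have predicted the behaviour that followed?"""
    results = []
    for origin in origins:
        train = df[df["exposure_year"] <= origin]
        test = df[(df["exposure_year"] > origin) &
                  (df["exposure_year"] <= origin + horizon)]
        if train.empty or test.empty:
            continue
        # fit_fn is assumed to wrap e.g. a gradient boosting classifier
        # and return a fitted model exposing predict_proba.
        model = fit_fn(train[feature_cols], train[label_col])
        preds = model.predict_proba(test[feature_cols])[:, 1]
        results.append({
            "origin": origin,
            "auc": roc_auc_score(test[label_col], preds),
            # Actual-to-expected ratio: the aggregate metric the MAP 1.5
            # risk tolerance statement cares about, not per-policy accuracy.
            "a_over_e": test[label_col].sum() / preds.sum(),
        })
    return pd.DataFrame(results)
```

Running the harness across origins that straddle at least two economic regimes is what turns a one-off validation into MEASURE 2.5 evidence.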
MEASURE 4.2, Trustworthiness in deployment context. This is where the lapse model needs the most attention, and where most actuarial teams already do something but do not currently call it MEASURE 4.2 evidence. The model’s outputs feed downstream financial reporting, hedging decisions, and capital calculations. A small drift in the model’s predictions can produce a large effect on hedge slippage or on the IFRS 17 contractual service margin. The evidence pack needs to show that the model’s predictions are being monitored against actual experience on a quarterly or monthly basis, that there is a documented process for refreshing the model when actual experience deviates, and that the downstream consumers of the model’s outputs are aware of the model’s confidence intervals and limitations.
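A minimal sketch of the quarterly reconciliation artefact, assuming one row per policy-quarter and illustrative trigger thresholds; production thresholds are a risk-appetite decision for the governance committee, not a statistical default.

```python
import pandas as pd

# Illustrative trigger thresholds on the deviation of A/E from 1.0;
# real values come from the MAP 1.5 risk tolerance statement.
AMBER, RED = 0.05, 0.10

def ae_reconciliation(df: pd.DataFrame) -> pd.DataFrame:
    """Quarterly actual-vs-expected lapse reconciliation.
    Expects columns: quarter, predicted_lapse_prob, lapsed (0/1)."""
    out = (df.groupby("quarter")
             .agg(expected=("predicted_lapse_prob", "sum"),
                  actual=("lapsed", "sum"))
             .assign(a_over_e=lambda g: g["actual"] / g["expected"]))

    def status(a_over_e):
        dev = abs(a_over_e - 1.0)
        if dev > RED:
            return "RED: trigger re-fit"
        return "AMBER: investigate" if dev > AMBER else "GREEN"

    out["status"] = out["a_over_e"].map(status)
    return out
```

The table this produces, archived quarter after quarter with the status column and the committee's responses, is precisely the MEASURE 4.2 evidence pack described above.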
Specific artefacts to produce
| RMF subcategory | Artefact |
|---|---|
| MAP 1.5 | Risk tolerance statement expressed in technical-provision and hedge-effectiveness terms |
| MAP 4.1 | Inventory of which downstream processes consume the model output, with named owners |
| MEASURE 2.5 | Rolling-origin back-test pack across at least two economic regimes (rising and falling rates) |
| MEASURE 2.9 | SHAP attribution showing the relative contribution of economic, policy and behavioural feature families (the AI Governance Committee will want to see that the model is not leaning excessively on any one; a sketch follows the table) |
| MEASURE 4.2 | Quarterly experience-vs-prediction reconciliation, with formal trigger thresholds for re-fit |
| MANAGE 1.4 | Residual risk acceptance covering the parameter uncertainty and the regime-change risk |
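On the MEASURE 2.9 row: a minimal sketch of feature-family attribution, assuming a fitted binary gradient-boosted model for which the shap package returns a single attribution array, and an illustrative family mapping taken from the feature list in the vignette. The governance question it answers is whether any one family dominates the model's behaviour.

```python
import numpy as np
import pandas as pd
import shap

# Illustrative feature-family mapping; the real one comes from the
# model's data dictionary.
FAMILIES = {
    "economic":    ["rate_10y", "equity_level", "credit_spread"],
    "policy":      ["face_amount", "duration", "surrender_charge", "premium_mode"],
    "behavioural": ["service_contacts_12m", "login_freq_90d"],
}

def family_attribution(model, X: pd.DataFrame) -> pd.Series:
    """Share of total mean |SHAP| attributable to each feature family."""
    explainer = shap.TreeExplainer(model)
    shap_vals = explainer.shap_values(X)                  # (n_rows, n_features)
    mean_abs = pd.Series(np.abs(shap_vals).mean(axis=0), index=X.columns)
    shares = pd.Series({fam: mean_abs.reindex(cols).sum()
                        for fam, cols in FAMILIES.items()})
    return shares / shares.sum()
```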
The practice-area moral
For models that affect financial reporting and capital, the RMF Govern and Manage functions look almost identical to existing actuarial governance for models. The work is mostly relabelling. The genuinely new artefact is the explicit risk tolerance statement framed in business-impact terms, which actuaries usually carry implicitly in their professional judgement but rarely write down in a form that an external auditor or supervisor could verify.
Section 2: Health, Risk Stratification for Care Management
The Obermeyer anchor
In 2019, Obermeyer, Powers, Vogeli and Mullainathan published “Dissecting racial bias in an algorithm used to manage the health of populations” in Science (366:447–453). It is the most important real-world AI bias finding in healthcare and it is the cleanest teaching anchor for any actuary working on health risk stratification.
The set-up is simple. A widely deployed commercial algorithm was used by US health systems to identify patients for high-risk care management programmes. The algorithm assigned each patient a risk score; patients above a threshold (the 97th percentile in the study population) were auto-identified for additional care coordination, nursing outreach and resources. Obermeyer and colleagues showed that the algorithm was technically well-calibrated for what it was asked to predict, and yet it was producing a profound and consistent racial disparity in who received care.
The algorithm was trained to predict future healthcare costs. This is a defensible choice. Cost is observable, available, well-defined and broadly correlates with healthcare need. The trouble is that “broadly correlates” is doing a lot of hidden work. Black patients in the study population received less care for the same level of underlying illness, for reasons rooted in access, trust and structural inequality. Less care meant lower observed costs. Lower observed costs meant lower predicted costs. Lower predicted costs meant lower risk scores. Lower risk scores meant fewer Black patients identified for the very programme that would have closed the gap.
The numerical finding is striking and worth quoting precisely. Black patients made up 17.7% of those auto-identified at the 97th percentile. Targeting need directly rather than cost would have raised that share to 46.5%. At the same threshold, Black patients had 26% more chronic conditions than White patients with the same score.
“The algorithm was perfectly calibrated for cost. The bias lived entirely in the choice of label.”
The algorithm was not biased in any sense the manufacturer would have recognised before the study. It was perfectly calibrated for cost. The bias lived entirely in the choice of label.
When Obermeyer and colleagues took their findings to the manufacturer, the manufacturer did not push back. They confirmed the finding using a national dataset of 3.5 million patients and worked with the researchers to construct a hybrid label combining cost prediction with a measure of chronic condition burden. The reformulated algorithm reduced the measured racial bias by approximately 84%.
The lessons for the RMF
This single study illustrates almost every important MEASURE-function lesson in the framework. Reading it through the RMF lens:
MAP 1.1, Intended purpose. The system’s intended purpose was to identify patients who would benefit most from care management. The system’s actual training target was who would generate the highest future costs. The gap between purpose and target is the entire story. Any RMF Map artefact for this kind of system must explicitly state the intended purpose in clinical or welfare terms, then explicitly state what the model actually predicts, then explicitly justify that the latter is a defensible proxy for the former. If the proxy cannot be defended, the model needs a different target.
MEASURE 2.5, Validity. The algorithm was internally valid for cost prediction. Conventional validation, by AUC or by Brier score on cost, would have approved it. Internal validity is not the same as fitness for the intended purpose. The MEASURE 2.5 evidence pack for any care management or risk stratification model needs to test the validity of the model against the intended-purpose outcome, not just the training-target outcome.
MEASURE 2.9, Explainability. A subgroup attribution analysis of the kind described in Part 2 would not have guaranteed catching this. The central failure was the choice of label, not anything explainability tooling can directly diagnose. It is, however, the kind of routine check that increases the chance of catching a problem of this shape. SHAP applied across protected groups would at minimum have surfaced that the cost-related features were doing very different work for Black patients than for White patients with the same underlying condition burden, which is the prompt to ask the harder question.
MEASURE 2.11, Fairness and bias. The compositional gap between the algorithm’s actual high-risk group (17.7% Black) and a need-targeted alternative (46.5% Black) is severe by any reasonable standard. This is a different metric from the standard four-fifths rule, which compares favourable-outcome selection rates across protected groups under the same rule, but it is the more telling number here because it directly measures the gap between the system’s output and the system’s stated purpose. Detection required nothing more sophisticated than computing the racial composition of the high-risk group with and without the bias correction. Any first-pass MEASURE 2.11 work would have surfaced it in an afternoon.
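The afternoon's work is genuinely this small. A minimal sketch, assuming a patient-level DataFrame with a cost-based score, a direct need measure and a race column; all names are hypothetical and this is not the study's data.

```python
import pandas as pd

def high_risk_composition(df: pd.DataFrame, score_col: str,
                          group_col: str = "race",
                          pct: float = 0.97) -> pd.Series:
    """Racial composition of the auto-identified group when patients are
    ranked by `score_col` and cut at the `pct` percentile."""
    cut = df[score_col].quantile(pct)
    return df.loc[df[score_col] >= cut, group_col].value_counts(normalize=True)

# The Obermeyer-style check: rank once by the cost-based score the model
# actually produces, once by a direct need measure, and compare.
# composition_cost = high_risk_composition(df, "predicted_cost")
# composition_need = high_risk_composition(df, "active_chronic_conditions")
```

The gap between the two compositions is the MEASURE 2.11 finding; everything after that is remediation.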
MEASURE 1.3, Independent assessment. The bias was found by external researchers, not by internal validation. This is a generic lesson for the RMF. Independent assessment by people not involved in building the model is the most reliable way to find systematic errors that the build team has internalised as features of the problem rather than features of the solution. ASOP 56 §3.6.3 is the closest analogue in US actuarial standards, though framed permissively (the actuary “may consider” review by another qualified professional, depending on the model’s nature and complexity) rather than as a direct obligation.
Beyond Obermeyer: contemporary care management work
A 2026 actuarial team building or validating a risk stratification model has a clearer playbook than the 2018 team that built the system Obermeyer studied. The MAP function should produce an explicit intended-purpose statement framed in clinical-need terms. The training target should be chosen to match the intended purpose where possible. Where the available label is necessarily a proxy (avoidable hospitalisations, number of active chronic conditions, an explicit composite of cost and condition counts), the trade-off should be documented. The MEASURE 2.11 evidence pack should compute disparate impact across racial, ethnic, socioeconomic and geographic subgroups using inferred-membership techniques where direct collection is not lawful. The MEASURE 2.9 work should include subgroup attribution analysis. And the MEASURE 1.3 work should include a genuinely independent reviewer who did not build the model.
The practice-area moral
Obermeyer is not a story about a bad algorithm. It is a story about a competent algorithm trained on the wrong target, validated with the wrong metric, and assessed by people too close to the build. Every one of those failures is preventable by routine application of the RMF Map and Measure functions. This is why the framework exists, and it is the single most useful case study in the actuarial AI literature for explaining what the framework is for.
Section 3: Property and Casualty, Personal Auto Telematics Pricing
The lesson: continuous behavioural data and the geography of fairness
A US personal auto insurer offers a usage-based pricing programme. Drivers opt in to a smartphone or in-vehicle telematics device that streams continuous data on speed, acceleration, braking, cornering, time of day, road type and total miles driven. A gradient boosting model converts this stream into a per-policy risk score, which feeds the rating algorithm. Discounts and surcharges are applied at renewal based on the score.
The system is in scope for the NAIC Model Bulletin in every adopting state (supervisory expectation). It is in scope for California’s foundational rate-regulation regime under Proposition 103 and the California Department of Insurance’s bulletins on the use of artificial intelligence in insurance practices. In New York it is in scope for the use of external consumer data and information sources in underwriting and pricing under NYDFS Insurance Circular Letter No. 7 (2024), which sits alongside Regulation 187 governing suitability in life insurance. And, as amended effective 15 October 2025, it is in scope for Colorado Regulation 10-1-1 for any private passenger automobile insurer using ECDIS or algorithms that use ECDIS; Colorado’s amended regulation expressly covers consumer-generated Internet of Things data, including telematics. Personal auto pricing is not listed in Annex III of the EU AI Act, so this is generally not a high-risk Annex III insurance use case under the Act; other AI Act obligations could still arise in some specific deployments, but that is the unusual case rather than the default.
Where the weight sits
This is the practice area where the fairness machinery from Part 2 needs the most adaptation. The reason is that telematics features are continuous, behavioural and partly chosen, and the question “where is the protected-class proxy in this feature set” has many more candidate answers than it does for ECDIS-based underwriting.
Consider a single feature: late-night driving frequency. It is a defensible actuarial signal because late-night driving genuinely correlates with collision frequency. It is also strongly correlated with shift work, which is strongly correlated with income, which correlates with race. It is correlated with urban residence, which correlates with race and ethnicity in different patterns in different cities. It is correlated with age. It is correlated with employment in certain sectors. Removing the feature would degrade the model. Keeping the feature without analysis would risk a disparate impact finding. The RMF answer is neither remove nor ignore. It is to do the work.
MAP 5.1, Likelihood and magnitude of impacts is the function that does most of the heavy lifting for telematics pricing. The Map function needs to enumerate, for each material feature, which protected-class proxies it could plausibly act as, what mitigations are in place, and what residual concentration of effect by geography or demographics would trigger a review. This is not a one-time exercise. It is repeated for every model refresh.
MEASURE 2.11, Fairness and bias uses a different toolkit than the discrete-features case in Part 2. Two techniques are particularly useful here, sketched in code after this list:
- Geographic disparate impact testing. Compute the per-policy effective rate by ZIP code, overlay against US Census demographic data, and look for systematic over-rating concentration in tracts above a threshold for any protected characteristic. This is not the same as testing individual disparate impact, but in the engagements we work on it is the test most state regulators engage with substantively.
- Single-feature sensitivity analysis. For each material feature, ask: if we held this feature at its mean across the population, how would the rate distribution change by inferred group? This isolates the contribution of any single feature to the overall fairness picture and supports targeted remediation of the form “we keep this feature because removing it costs us X% in lift, but we accept the residual fairness impact of Y% because we have documented its actuarial signal”. (Note that this is a sensitivity technique, not a formal causal method; extracting causal claims would require additional identification work that practitioner fairness analysis rarely undertakes.)
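Minimal sketches of both techniques. Column names, the Census merge and the inferred-membership series (for example from a BISG-style method) are assumptions for illustration, and the model is assumed to expose a regression-style predict method returning rate relativities.

```python
import pandas as pd

def geographic_disparate_impact(policies: pd.DataFrame, census: pd.DataFrame,
                                share_col: str,
                                threshold: float = 0.5) -> pd.DataFrame:
    """Mean effective rate per ZIP, compared across areas above and below
    a protected-group population share taken from public Census data.
    Expects policies[zip, effective_rate] and census[zip, share_col]."""
    by_zip = policies.groupby("zip", as_index=False)["effective_rate"].mean()
    merged = by_zip.merge(census[["zip", share_col]], on="zip")
    merged["high_share"] = merged[share_col] > threshold
    return merged.groupby("high_share")["effective_rate"].agg(["mean", "count"])

def single_feature_sensitivity(model, X: pd.DataFrame, feature: str,
                               group: pd.Series) -> pd.DataFrame:
    """Hold one feature at its population mean, re-score, and summarise the
    rate change by inferred group, isolating that feature's contribution
    to the overall fairness picture (a sensitivity, not a causal claim)."""
    base = model.predict(X)
    X_neutral = X.copy()
    X_neutral[feature] = X[feature].mean()
    delta = base - model.predict(X_neutral)
    return (pd.DataFrame({"group": group.values, "rate_change": delta})
              .groupby("group")["rate_change"].describe()[["mean", "std"]])
```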
MEASURE 2.7, Security and resilience is materially harder for telematics than for any of the other vignettes in this article, because the inputs are continuous streams from devices the policyholder controls. Device-controlled streaming inputs create realistic manipulation and integrity risks: anomalous mileage profiles, suspicious acceleration patterns, sudden discontinuities in driving style, all of which can be deliberate or accidental. The MEASURE 2.7 evidence pack needs to show what the model does in the presence of such patterns, what the detection thresholds are, and what the response protocol is.
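A crude first-line version of the detection layer, for one policy's daily summary series. The robust z-score construction and thresholds are illustrative; anything the check flags goes to a human review queue, not an automatic pricing response.

```python
import numpy as np
import pandas as pd

def flag_stream_discontinuities(daily: pd.DataFrame, window: int = 28,
                                z_thresh: float = 4.0) -> pd.DataFrame:
    """Flag sudden discontinuities in one policy's daily telematics
    summaries. Each day is scored against the trailing window only
    (median / MAD robust z-score), so an anomalous day cannot mask itself."""
    flagged = daily.copy()
    for col in ("miles", "harsh_brakes", "night_share"):
        trailing = daily[col].shift(1)                        # exclude today
        med = trailing.rolling(window, min_periods=7).median()
        mad = (trailing - med).abs().rolling(window, min_periods=7).median()
        z = 0.6745 * (daily[col] - med) / mad.replace(0.0, np.nan)
        flagged[f"{col}_flag"] = z.abs() > z_thresh
    return flagged
```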
Specific artefacts to produce
| RMF subcategory | Artefact |
|---|---|
| MAP 5.1 | Per-feature proxy analysis with documented residual concentration thresholds |
| MEASURE 2.7 | Anomaly detection thresholds and adversarial input testing results |
| MEASURE 2.9 | Global SHAP attribution; per-feature partial dependence plots; counterfactual explanations |
| MEASURE 2.11 | Geographic disparate impact analysis; single-feature sensitivity by feature |
| MEASURE 3.3 | Channel for policyholder challenge of individual scores (good practice and consistent with ASOP 41 communications expectations) |
| MANAGE 4.1 | Drift monitoring on the input feature distributions, not just the prediction distribution |
Image-AI claims handling: a related sub-case
Most P&C carriers running telematics also run an adjacent AI system: image-based first-notice-of-loss triage and damage estimation. This sits under the same RMF umbrella but with a different MEASURE 2.7 profile. Image inputs can be adversarially perturbed at much lower cost than tabular inputs. The MEASURE 2.7 work for an image system needs to address robustness to lighting conditions, weather, partial occlusion, image quality and deliberate adversarial perturbation. This is genuinely novel territory for most actuarial teams and is one of the places where collaboration with the vendor or with an internal AI security function is essential.
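For the benign end of that robustness spectrum, a sketch of the kind of perturbation harness an actuarial validation team can run without the vendor. Here predict_fn is a hypothetical wrapper around the damage-estimation model, and the perturbations and tolerance are illustrative; gradient-based adversarial testing proper needs model internals and belongs with the vendor or the AI security function.

```python
import numpy as np

def robustness_check(predict_fn, image: np.ndarray, tol: float = 0.10) -> dict:
    """Score a damage-estimation model's stability under benign perturbations:
    brightness shifts and a synthetic occlusion patch. `predict_fn` maps an
    HxWx3 float array in [0, 1] to a scalar damage estimate."""
    base = predict_fn(image)
    results = {}
    for name, factor in [("darker", 0.6), ("brighter", 1.4)]:
        perturbed = np.clip(image * factor, 0.0, 1.0)
        results[name] = abs(predict_fn(perturbed) - base) / max(abs(base), 1e-9)
    h, w = image.shape[:2]
    occluded = image.copy()
    occluded[h // 3: h // 2, w // 3: w // 2, :] = 0.0     # opaque patch
    results["occlusion"] = abs(predict_fn(occluded) - base) / max(abs(base), 1e-9)
    results["pass"] = all(v <= tol for k, v in results.items() if k != "pass")
    return results
```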
The practice-area moral
P&C with telematics is where the fairness work in MEASURE 2.11 is most demanding, because the protected-class proxies are everywhere and the trade-offs cannot be resolved by simple feature removal. The actuarial discipline is to do the analysis explicitly, document the trade-offs explicitly, and accept residual risk explicitly. The framework supports all three.
Section 4: Pensions, Mortality Improvement and Longevity Modelling
The lesson: low-fairness, high-explainability, fiduciary stakeholders
A consulting actuarial firm produces longevity assumptions for trustees of UK occupational pension schemes. The work involves projecting mortality improvement rates for closed scheme populations using a combination of base tables (CMI series), scheme-specific experience analysis, and a stochastic mortality projection model that incorporates trend uncertainty and parameter uncertainty. The model outputs feed scheme valuations under TPR’s funding regime and the buy-out pricing assumptions used for derisking transactions with insurers.
This is the practice area where the RMF lands most differently from any of the consumer-facing examples in this series. Lower fairness intensity than consumer-facing models. Lower adversarial-input intensity than systems with externally controlled inputs. Privacy concerns are real but proportionate to standard scheme data handling under UK GDPR rather than novel to AI. And yet there is a substantial RMF workload, all of it concentrated in two subcategories.
Where the weight sits
MEASURE 2.9, Explainability is the dominant subcategory. The audience for the model’s outputs is a board of trustees, most of whom are not statisticians, who carry fiduciary responsibility for the scheme’s funding adequacy and the security of members’ benefits. They cannot discharge that responsibility if they do not understand the model’s outputs to a sufficient depth.
For a UK scheme example, the primary governance lens is the UK actuarial framework. TAS 100’s Communications principle (FRC) requires that technical actuarial work be communicated in a way that allows users (here, the trustees) to understand the implications of the work, the materiality of judgements made, and the limitations of the conclusions. APS X2 v1.1 (IFoA, effective 30 January 2026) governs work review and independent peer review. The Pensions Regulator’s expectations on actuarial advice to trustees provide further context. The RMF MEASURE 2.9 obligation maps onto exactly this trustee-facing communications discipline. ASOP 41 (Actuarial Communications) and ASOP 27 (Selection of Assumptions for Measuring Pension Obligations, revised effective 1 January 2025, which absorbed the previously separate ASOP 35) are the parallel US analogues that a US-licensed actuary working on a similar scheme would also need to consider.
The practical content of explainability for a longevity model is different from the SHAP-based work in Part 2, because the model is a structured stochastic projection rather than a tree ensemble. The artefacts that matter are:
- Scenario decomposition, showing how trustees should interpret the central projection alongside high and low improvement scenarios
- Sensitivity tables showing how the present value of liabilities changes for plausible parameter shifts
- Documentation of which historical data drove the central projection and why
- A clear written explanation, in plain language, of what the model can and cannot predict
These are artefacts that most pensions actuaries already produce. The RMF work is to label them explicitly as MEASURE 2.9 evidence and to ensure that a trustee or external reviewer can map the trustee-facing communication back to the model’s mathematical structure.
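To make the sensitivity-table artefact concrete: a deliberately toy sketch, with an invented base table and a flat improvement rate standing in for a full CMI-style projection. Nothing here is a production assumption; the point is the shape of the artefact, a table a trustee can read.

```python
import numpy as np

def annuity_pv(qx: np.ndarray, improvement: float, discount: float = 0.045,
               payment: float = 1.0) -> float:
    """PV of a whole-life annuity-due for one life, with base mortality
    rates qx[t] reduced by a flat annual improvement rate. A simple
    stand-in for a stochastic CMI-style projection, for illustration."""
    years = np.arange(len(qx))
    q_proj = qx * (1.0 - improvement) ** years            # improved mortality
    survival = np.concatenate([[1.0], np.cumprod(1.0 - q_proj[:-1])])
    return float(np.sum(payment * survival / (1.0 + discount) ** years))

# Sensitivity table: liability change for plausible shifts in the
# long-term improvement assumption (all values illustrative).
qx = 0.01 * 1.1 ** np.arange(40)          # toy Gompertz-style base table
base = annuity_pv(qx, improvement=0.015)
for shift in (-0.005, 0.0, +0.005):
    pv = annuity_pv(qx, improvement=0.015 + shift)
    print(f"improvement {0.015 + shift:.1%}: PV change {pv / base - 1:+.2%}")
```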
MAP 5.1, Likelihood and magnitude of impacts is dominated here by intergenerational equity rather than disparate impact. A longevity assumption that is too aggressive (projecting too much improvement) may produce inadequate funding for current pensioners. An assumption that is too conservative may impose excessive contributions on current sponsors and active members. The MAP 5.1 artefact needs to articulate this trade-off explicitly and connect it to the trustees’ funding decision authority.
A note on the GenAI tail
A growing number of pensions consultancies are using LLMs to help draft trustee communications, summarise actuarial reports, or generate plain-language explanations of complex outputs. This is GenAI Profile territory and is the subject of Part 4 of this series. The brief preview is that even when the underlying actuarial model is structured and explainable, the LLM layer that translates it for the trustee audience is itself a model under the framework, and it carries its own MEASURE 2.9 evidence requirements (explainability of the LLM’s choices) and its own MEASURE 2.11 governance considerations (does it produce systematically different communications for different audiences). Both of these sit alongside the underlying actuarial work, not in place of it.
Specific artefacts to produce
| RMF subcategory | Artefact |
|---|---|
| MAP 1.1 | Intended-purpose statement framed in trustee fiduciary terms |
| MAP 5.1 | Intergenerational equity impact analysis |
| MEASURE 2.5 | Back-testing of past projections against subsequent experience |
| MEASURE 2.9 | Trustee-facing explainability pack with scenario decomposition and sensitivity tables (TAS 100 Communications / APS X2 v1.1; ASOP 41 / ASOP 27 (revised) where US-licensed) |
| MEASURE 1.3 | Independent peer review by an actuary not involved in the projection (APS X2 v1.1 in the UK; ASOP 56 §3.6.3 as the closest US analogue, framed permissively) |
The practice-area moral
Pensions longevity work is the cleanest example of the RMF being mostly a relabelling exercise. The substance is already there in TAS 100 and APS X2 (and in the revised ASOP 27 and ASOP 41 for US-licensed work). The new artefact is the explicit MEASURE 2.9 evidence pack, presented in trustee-facing terms, that satisfies both the actuarial standards and the framework.
Section 5: Cross-cutting call-outs
The four vignettes above are not the whole story. The brief notes below cover the practice areas that did not earn a full vignette in this article but where the framework still applies.
Reinsurance treaty pricing and catastrophe modelling
Almost always built on third-party vendor models (Moody’s RMS, Verisk, KCC, Impact Forecasting). The Govern and Map functions carry the weight, and within them the dominant subcategories are GOVERN 6.1 (third-party AI risk) and MAP 4.1 (legal risks of components). The MEASURE function devolves substantially into vendor evidence review, multi-vendor benchmarking, and documentation of model uncertainty. The interesting RMF lesson here is that “we use a vendor model” does not transfer the RMF obligation. The cedent still needs to demonstrate that it has reviewed the vendor’s evidence, understood the model’s limitations, and accepted the residual risk in writing. ASOP 56 §3.4 (reliance on models developed by others) carries the same obligation in the US actuarial standards.
Capital modelling under Solvency II and US RBC
ML components are increasingly embedded inside Economic Scenario Generators, internal model components, and the calibration of correlation matrices. Each ML component is an AI system under the framework. Where these components sit inside an approved internal model used to calculate the SCR, the Solvency II model validation regime under Article 124 of the Directive applies in full, and that regime already requires much of the substance of the MEASURE function expressed in different vocabulary. ML components that sit outside an approved internal model (in standard formula firms, in pricing-only tooling, or in supplementary risk analytics) sit under general internal model risk management rather than Article 124 specifically, but the RMF mapping is still direct. The documentation burden is real either way, particularly for any non-linear component where MEASURE 2.9 explainability becomes a regulator question rather than an internal one.
Climate and catastrophe risk modelling
The unusual case where MAP 1.5 (risk tolerance) and MAP 5.1 (impact analysis) require explicit treatment of deep uncertainty rather than estimable risk. Most actuarial training equips the practitioner to quantify probabilities and produce confidence intervals. Climate models routinely produce outputs whose uncertainty is structural, irreducible, and dependent on policy and behavioural pathways no model can predict. The framework accommodates this, but the language is uncomfortable. The honest answer is that the MEASURE 2.5 evidence pack for a climate-influenced model needs to include scenario analysis, narrative scenarios as well as probabilistic ones, and explicit acknowledgement of irreducible uncertainty. This is a place where the RMF and the IFoA’s climate-related guidance line up well and reinforce each other.
Enterprise risk management
ERM is the natural home for the AI inventory, the risk register and the residual risk acceptance documentation. The CRO’s office and the Chief Actuary’s office should agree which function owns the AI risk register early in any RMF implementation, because both have a legitimate claim and parallel ownership produces gaps and overlaps. Our recommendation in most insurers is that the ERM function owns the inventory and the register, and that the actuarial function owns the per-model MEASURE evidence packs that feed into them.
Section 6: Practice-area triage table
The single most useful artefact in this article is below. It maps the four primary practice areas to the RMF subcategories that carry the most weight in each context. Use it as a starting point for prioritising your own RMF work.
A note on jurisdiction. Some entries reference jurisdiction-specific obligations. FRIA (EU AI Act Article 27) and Article 86 explanation rights apply only where the system is in scope of the EU AI Act, which is fact-specific under Article 6 and Annex III. Treat those entries as priorities only where the EU is in scope of your deployment.
| Practice area | Heaviest Govern | Heaviest Map | Heaviest Measure | Heaviest Manage |
|---|---|---|---|---|
| Life, accelerated UW | 1.6 (inventory), 6.1 (third-party) | 1.1 (intended purpose), 5.1 (FRIA, where EU in scope) | 2.11 (fairness), 2.9 (explainability) | 4.1 (drift), 4.3 (incidents) |
| Life, lapse / behavioural | 1.4 (transparency) | 1.5 (risk tolerance), 4.1 (downstream consumers) | 2.5 (validity), 4.2 (deployment context) | 1.4 (residual risk), 4.1 (drift) |
| Health, prior authorisation | 2.1 (roles), 4.1 (safety culture) | 1.1 (purpose), 3.5 (human oversight) | 2.11 (fairness), 2.9 (explainability) | 2.3 (unknown risks), 4.3 (incidents) |
| Health, risk stratification | 2.3 (executive accountability) | 1.1 (label vs purpose), 5.1 (impact) | 2.5 (validity to purpose), 2.11, 1.3 (independent assessment) | 4.2 (deployment context) |
| P&C, telematics pricing | 1.6 (inventory), 6.1 | 5.1 (proxy analysis) | 2.7 (security), 2.11, 2.9 | 4.1 (input drift), 4.3 |
| P&C, image claims AI | 6.1 (vendor) | 1.1, 4.1 | 2.7 (adversarial inputs), 2.5 | 4.1, 2.3 |
| Pensions, longevity | 1.4 (transparency) | 1.1 (trustee purpose), 5.1 (intergenerational) | 2.9 (explainability), 2.5, 1.3 | 1.4 (residual risk acceptance) |
| Reinsurance, cat models | 6.1 (vendor) | 4.1 (third-party legal) | 2.5 (vendor evidence review) | 1.4, 2.4 |
| Capital, Solvency II internal model | 1.4 | 1.5, 4.1 | 2.5, 2.9 | 4.1 |
| Climate / cat modelling | 1.4 | 1.5 (deep uncertainty), 5.1 | 2.5 (scenario analysis) | 1.4 |
Each cell marks the subcategories where, in our experience, the heaviest substantive work lies for that practice area. Read down a column and you see which functions matter most across the profession. Read across a row and you see which subcategories deserve the first investment of effort for any specific use case.
Closing
The RMF treats AI governance as a single problem with a single vocabulary. In practice, the governance work looks quite different across life, health, P&C and pensions. The vocabulary is the same. The subcategories that carry the weight are not. The artefacts you produce in each context look different, even when the column heading on the table is identical.
Three observations to carry forward.
The first is that the framework rewards specificity. A team that says “we comply with NIST AI RMF” is saying nothing useful. A team that says “for our accelerated underwriting model, the highest-effort subcategories are MAP 5.1, MEASURE 2.9 and MEASURE 2.11, and here are the artefacts we produce for each” is saying something a regulator can verify and an internal stakeholder can audit. Specificity is the deliverable.
The second is that in our experience, the most consequential failures are usually in the Map function rather than the Measure function.
“In our experience, the most consequential failures are usually in the Map function rather than the Measure function.”
The Obermeyer case is the clearest example in the literature: the system passed every reasonable Measure test for the target it was actually trained on. The failure was that the target was the wrong target, and that failure was a Map failure. Teams that invest heavily in Measure work without first doing the Map work risk producing very thorough validation evidence for systems that should not exist in their current form. NIST itself treats Map and Measure as complementary and many material failures are also measurement failures; the practitioner observation is that the ones that surprise people most are usually the Map ones.
The third is that the artefacts compound. A team that produces a MEASURE evidence pack for its accelerated underwriting model in Q1, a different one for its lapse model in Q2, and a different one for its longevity model in Q3 will find that the second and third packs are easier than the first, that the team’s vocabulary has matured, and that the AI Governance Committee meetings have become substantive rather than performative. This compounding is the operational benefit of doing the work, separate from any specific regulatory obligation.
Part 4 of this series, the final article, addresses the NIST Generative AI Profile and the questions it raises for actuaries using large language models in documentation, code generation, research synthesis and member communications. The same framework. A different category of model. A new set of subcategory weights.
Ready to operationalise NIST AI RMF and the Generative AI Profile in your actuarial function?
Talk to our team about how Globebyte can help you build the governance structures, the eval suites, the RAG systems and the MEASURE evidence packs. From strategic alignment to working code.