NIST AI RMF for Actuaries

The NIST Generative AI Profile for Actuaries: Confabulation, Controls and Concrete Practice

Article 4 of 4 in the series NIST AI RMF for Actuaries

If you are an actuary in 2026 and you have not yet used a large language model in your professional work, you are unusual. If you have used one and have not yet thought through the governance implications, you are typical. This article is for both groups.

The first three articles in this series treated the NIST AI Risk Management Framework as it applies to traditional machine learning systems: gradient boosting models for accelerated underwriting, classifiers for prior authorisation, lapse models, telematics scoring, longevity projections. The framework absorbs all of these comfortably because they are the kinds of models the framework was originally written for. Generative AI is different. The risks are different, the failure modes are different, the validation techniques are different, and in 2024 NIST published a dedicated companion document to address the difference. This article is about that document and what it means for actuarial practice.

NIST AI 600-1, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, was published on 26 July 2024. It is a cross-sectoral profile that sits alongside the core RMF and identifies twelve risk categories that are unique to or amplified by generative AI. The profile is structured the same way the core RMF is, around Govern, Map, Measure and Manage, with suggested actions for each subcategory recast in GenAI-specific terms. For actuaries, the profile is essential reading because almost every part of the actuarial workflow is now within reach of an LLM, and very few of the controls actuaries have built up over a generation of model risk management apply directly to generative systems.

The headline message of this article is simple. Where an LLM is doing actuarial work, supporting actuarial conclusions, generating analysis, drafting communications a client or regulator will read, producing code that goes into a validation pipeline, ASOP 56 applies in full. TAS 100 applies. The Actuaries’ Code applies. APS X2 v1.1 (effective 30 January 2026) applies to the review of that work. Incidental use of an autocomplete, a writing aid or a search-engine plug-in that does not touch actuarial conclusions sits outside this scope, as the American Academy of Actuaries’ 2024 professionalism discussion paper Actuarial Professionalism Considerations for Generative AI explicitly notes. The discussion paper is an interpretive professionalism resource rather than a binding promulgation, but it sets out clearly how ASOP 23, ASOP 41 and ASOP 56 apply to GenAI work that does fall within scope. Professional responsibility for in-scope work is undiminished by the fact that an LLM produced part of it. The technical content of this article is about how to do that responsibly: how to use generative AI to do actuarial work better, faster and at higher quality, while producing the artefacts that satisfy the framework, the standards and the audit.

A note on registers. As in Parts 1, 2 and 3 of this series, this article distinguishes four categories of authority: law (binding regulation or statute), supervisory expectation (regulator guidance, often “should” framing, used in examinations), professional standard (binding within a profession on its members), and author recommendation (our own practitioner judgement, signalled by phrases like “in our view” or “in our experience”). The point is sharpest in this article because GenAI is governed by a thinner stack of binding law than traditional ML pricing and underwriting models, and a denser stack of professional standards and author recommendation. Where this article uses confident language about controls and architecture, the underlying register is usually professional standard plus author recommendation rather than law, except where explicitly noted (the EU AI Act’s general-purpose AI model provisions are the main law-register example).

A note before we start. The numbers and configurations in this article are illustrative. Specific prompts, tool names, evaluation thresholds and architectural choices are reference points, not prescriptions. The intention is to teach the techniques in enough depth that a competent actuarial team can build something useful from them. ASOP 56 section references in this article have been verified against the published standard (Doc. No. 195, December 2019, effective 1 October 2020).

Section 1: What the Generative AI Profile actually adds

The GenAI Profile does not replace the core RMF. It extends it. The four functions still apply. The subcategories still apply. The Profile adds a new vocabulary on top: twelve risk categories that the profile maps against the existing subcategories with GenAI-specific suggested actions.

The twelve risk categories are:

  1. CBRN information or capabilities (chemical, biological, radiological, nuclear)
  2. Confabulation (the production of confidently stated but erroneous content)
  3. Dangerous, violent, or hateful content
  4. Data privacy
  5. Environmental impacts
  6. Harmful bias and homogenisation
  7. Human-AI configuration
  8. Information integrity
  9. Information security
  10. Intellectual property
  11. Obscene, degrading, or abusive content
  12. Value chain and component integration

Some of these are usually low-priority for actuarial GenAI use cases. CBRN, obscene content, and the most acute dangerous-content risks will, in most actuarial deployments, be rated low and handled at the model-provider layer through the provider’s own filtering and use policy. They should still be screened on the way to that low-relevance call rather than dismissed at the outset. For member-facing chatbots, multi-step agents, or any system with a free-text user interface, these categories can surface in unexpected ways and need explicit treatment in the Map function.

Eight of the twelve carry meaningful weight for most actuarial work, and we will treat each in turn. The one that matters most is confabulation, and Section 3 of this article is dedicated to it.

The eight risks that matter for actuarial work

Confabulation is the production of confidently stated but factually wrong outputs. Sometimes called hallucination, fabrication, or simply being wrong. It is the dominant risk in actuarial GenAI use because the cost of a confidently wrong number in a regulatory filing, a board memo, a reserving calculation, or a member communication is enormous, and LLMs are particularly prone to producing exactly this kind of error in exactly this kind of work. Section 3 covers it in detail.

Data privacy. When you paste a policyholder record, a claims file, scheme member data, or anything containing personal or confidential information into a public LLM interface, you are transferring that information to a third party. Depending on the data and the use case, that transfer can engage data protection law (GDPR in the EU and UK, state privacy regimes in the US), sector-specific regimes (HIPAA where protected health information is involved, GLBA for financial information held by financial institutions, FCRA where consumer-report data is in play), confidentiality and contractual obligations to the client or scheme, and professional secrecy duties. The exact legal regime depends on the data, the entity, and the jurisdiction. The question of which model can be sent which data is a foundational governance question and should be answered by written policy before the first prompt is sent.

Information integrity. When an LLM is used to draft regulatory communications, member letters, board reports or internal memoranda, the output enters the information ecosystem of the organisation and its stakeholders. ASOP 41 (Actuarial Communications) requires that actuarial communications be clear, complete, and supported. An LLM-drafted communication that misstates a fact, even subtly, fails this test, and the actuary who signed off on the communication is professionally responsible for the failure regardless of how the draft was produced.

Harmful bias and homogenisation. LLMs are trained on internet-scale text and inherit the biases of that training corpus. They also tend toward homogenisation: prompted similarly, they produce similar output. For actuarial work this matters in two specific ways. First, when an LLM is used in member-facing communications, it may produce systematically different language for different audiences in ways that could constitute disparate impact. Second, when used in research synthesis, it may systematically over-represent the most-cited views and under-represent legitimate minority positions, narrowing the apparent state of knowledge.

Intellectual property. Two distinct problems. First, the LLM may produce output that inadvertently reproduces copyrighted training material. Second, the act of submitting your own copyrighted material (your firm’s actuarial methodology, a vendor’s proprietary documentation, a client’s confidential data) to an LLM may itself raise IP and contractual concerns. Both need to be addressed in written policy.

Value chain and component integration. Most LLMs in actuarial use are accessed through APIs from a small number of providers (OpenAI, Anthropic, Google, Microsoft, Meta, and a handful of open-source model hosts). The actuary using such a model is at the end of a long supply chain that includes the model developer, the data sources used in training, the cloud provider hosting the model, and any RAG corpus or fine-tuning data added by the deployer. Vendor management under GOVERN 6.1 needs to address all of this, and ASOP 56 §3.4 (reliance on models developed by others) carries the parallel obligation in the US actuarial standards. Where an LLM-based system is in scope of the EU AI Act, the Act’s role-based obligations split across this supply chain in specific ways: the foundation model provider, any fine-tuner who materially modifies the model, and the deployer who puts the system to use each carry distinct obligations. The Act’s general-purpose AI model provisions (Articles 51-56) govern the upstream end of this chain.

Human-AI configuration. This is the question of how the human user relates to the AI output. Is the LLM producing a draft that a human reviews? Is it producing a final output that goes straight to a stakeholder? Is it part of an agentic chain in which one LLM call feeds another? The configuration determines the level of oversight that is technically and professionally adequate. MAP 3.5 territory.

Information security. A new threat model that traditional ML did not have. LLMs can be subject to prompt injection attacks, in which malicious instructions are hidden inside content the LLM is asked to process. They can leak training data or in-context information. They can be coerced by adversarial inputs into producing output that violates their system prompt. The MEASURE 2.7 evidence pack for an LLM-based system needs to address all of this, and most actuarial teams have not yet done so.

How the Profile structures suggested actions

For each of the twelve risks, the Profile provides suggested actions mapped against the four core functions and many of the existing subcategories from the playbook. The actions are not new subcategories; they are GenAI-specific guidance attached to existing ones. So, for example, MEASURE 2.5 (validity) gets a set of suggested actions specific to confabulation and information integrity. GOVERN 6.1 (third-party AI risk) gets a set of actions specific to value chain and component integration. MAP 5.1 (impact analysis) gets actions specific to harmful bias and homogenisation in generated content.

The practical consequence is that the mapping table from Part 2 of this series still applies, but each cell may now contain additional GenAI-specific evidence expectations. A team running the framework against an LLM-based system uses the same skeleton and adds the GenAI-specific actions where they are relevant.

Section 2: Four categories of actuarial LLM use

LLMs are not all used the same way. The risks change dramatically with the use case, and so do the controls that proportionately match the risk. The four categories below cover most of the actuarial uses we see in practice, with the dominant risks for each. The recommended controls in Section 3 are calibrated to these categories, not applied uniformly.

2a: Research and synthesis

The actuary uses the LLM to summarise a long document, scan a body of literature, identify patterns across regulatory filings, or generate a first-pass research note on a topic the actuary will then verify. Examples: summarising a 200-page CDC mortality monograph, scanning the past year of NAIC bulletins for items relevant to a specific line of business, producing a one-page comparison of three vendor model architectures from their public documentation.

Dominant risks:

  • confabulation (the LLM may confidently state things the source documents do not say)
  • information integrity (the summary may distort the source material in ways the actuary does not catch)
  • intellectual property (the source material may be copyrighted)

Stakes: low to medium. The actuary is the only consumer of the output, and the actuary has the original sources to verify against. This is the lowest-risk LLM use case in actuarial work.

Minimum viable controls:

  • the LLM must be configured with retrieval-augmented generation against an authoritative corpus (more on this in Section 3)
  • citations to source spans must be required in every output
  • the actuary must verify the citations before using the output
  • no PII goes to the LLM

2b: Drafting and communications

The actuary uses the LLM to produce a first draft of a written deliverable: a board paper, a trustee report, a regulatory filing, a member communication, an internal methodology memo, an email response to a complex query. Examples: drafting the narrative section of an IFRS 17 disclosure note, producing a plain-language explanation of a Solvency II SCR change, writing a member letter about a longevity assumption update.

Dominant risks:

  • confabulation (the draft may contain numbers or facts the LLM has invented)
  • information integrity (the draft may misstate the actuarial position)
  • harmful bias and homogenisation (member-facing communications may differ systematically across audiences)
  • intellectual property (the firm’s methodology may be exposed by submission to the model)

Stakes: medium to high. The output is consumed by stakeholders other than the actuary, and the actuary’s professional name is attached to it. ASOP 41 governs.

Minimum viable controls:

  • human review of every output before release
  • structured prompting that constrains the LLM to use only specified inputs
  • explicit prohibition on the LLM producing numerical claims that are not in the input materials
  • full audit log of every prompt and response
  • model and version pinning so that the same input produces the same output as far as possible

2c: Code generation

The actuary uses the LLM to write Python, R, SQL or Excel-formula code for analytics, validation, data transformation, or modelling tasks. Examples: writing a SHAP analysis script for a gradient boosting model; building a data pipeline for monthly experience monitoring; generating a SQL query against a claims data warehouse; writing a Python function to compute the four-fifths rule across BIFSG-inferred groups.

Dominant risks: confabulation in two specific forms. First, the LLM may invent functions, libraries or syntax that do not exist. Second, the LLM may produce code that runs but is silently wrong. The second is much more dangerous than the first because incorrect-but-running code looks like correct code to a casual reviewer.

Stakes: high. Code generated by an LLM and used in production validation work has the property that it can quietly approve a broken model. An LLM-written validation script that contains a subtle off-by-one error in the disparate-impact calculation can produce a clean fairness report for a model that is not actually fair. The actuarial review process must catch this.

Minimum viable controls:

  • every LLM-generated code artefact is reviewed line-by-line by a competent actuary before use
  • every code artefact is run against test data with known expected outputs
  • LLM-generated code is never used in production without independent peer review
  • the code is treated as if a junior analyst had written it (because, in effect, that is exactly what has happened)
  • the LLM-generated code goes into version control with the prompt that produced it recorded in the commit message

2d: Embedded and agentic systems

The LLM is part of a production system that interacts with customers, processes claims, or takes multi-step actions on behalf of users. Examples: a member-facing chatbot that answers benefits questions; a first-notice-of-loss intake agent that takes a free-text claim description and produces a structured triage record; a multi-step agent that reads a regulatory filing, identifies relevant ASOPs and TASs, looks up the latest versions, and produces a compliance assessment; a customer service agent that has access to the policy administration system and can take action on behalf of the user.

Dominant risks: every risk in the Profile applies. Confabulation, data privacy, information integrity, harmful bias, human-AI configuration, information security, value chain and IP all matter, and the categories that were low-priority in 2a to 2c (CBRN-style content, dangerous content, obscene content) need to be screened explicitly because the system has a free-text input surface. This is the highest-stakes category by some distance.

Stakes: very high. The system is consumer-facing, takes real actions, processes real personal data, and operates without per-interaction human review.

A note on EU AI Act scoping for embedded and agentic systems. It is worth being precise here, because the language around “high-risk” can drift. Not every chatbot, agent or embedded LLM in an insurance setting is automatically an Annex III high-risk AI system under the EU AI Act. Whether any specific system falls inside Annex III is a fact-specific determination under Article 6 and the Annex III categories, and depends on the system’s intended purpose rather than on its underlying technology. For insurance-adjacent deployments, the most likely Annex III hook is point 5(c), which covers AI systems used for risk assessment and pricing in relation to natural persons in the case of life and health insurance. A member-facing chatbot that answers benefits questions but does not perform risk assessment or pricing is unlikely to be in scope under point 5(c) alone, though it may engage other Annex III categories (for example, employment-related uses if the same architecture were redeployed) and other parts of the Act (the general-purpose AI model provisions in Articles 51-56 govern the upstream foundation model regardless). A multi-step agent that influences an underwriting or pricing decision is more likely to be caught. A first-notice-of-loss intake agent is usually not a pricing or risk-assessment system and so is usually outside Annex III 5(c), though it may still be subject to transparency obligations under Article 50 if it interacts directly with natural persons. The point is that the in-scope question deserves a written, fact-specific answer for each system, with the provider-role and deployer-role obligations separately assessed where the system is in scope. Where the system is in scope, both the provider’s and the deployer’s obligations under the Act are engaged. The RMF’s GenAI Profile and the Act’s provider/deployer stack operate on different planes but ask for heavily overlapping evidence.

Minimum viable controls:

  • the full RMF treatment from Part 2 of this series, applied as if this were a high-risk traditional ML system, plus the GenAI-specific extensions covered in this article
  • a documented MEASURE evidence pack including LLM evaluation suite results
  • prompt injection testing as a routine MEASURE 2.7 activity
  • full conversation logging with retention in line with regulatory requirements
  • human escalation paths from every interaction
  • model and version pinning with documented change control
  • regular re-evaluation against the eval suite when the model is updated by the provider

Section 3: Confabulation, the dominant actuarial risk

Confabulation is the technical name for what most people call hallucination. NIST defines it precisely in the GenAI Profile: erroneous or false content, including outputs that diverge from the prompt or contradict previously generated statements in the same context. It happens because LLMs are trained to produce plausible next tokens, not true ones. Plausibility and truth are correlated but not identical. Where they diverge, the LLM produces confabulation.

“An LLM that confabulates a citation looks exactly like an LLM that does not. The text is grammatically correct, professionally toned, internally consistent, and false.”

For actuarial work this is the dominant risk because the cost of being confidently wrong is high and the failure mode is hard to detect. An LLM that confabulates an ASOP section number, a regulation paragraph, a numerical assumption, or a citation looks exactly like an LLM that does not. The text is grammatically correct, professionally toned, internally consistent, and false.

There are five mitigations that work, and a competent actuarial GenAI deployment uses several of them in combination. None of them eliminates confabulation completely. Together they reduce it to a level that, combined with appropriate human review, is consistent with professional standards.

Mitigation one: Retrieval-Augmented Generation (RAG)

RAG is the architecture in which the LLM is given access to a defined corpus of authoritative documents and is required to ground its answers in retrieved passages from that corpus. Practically, the corpus is processed into a vector database (Pinecone, Weaviate, Chroma, Qdrant, or a managed equivalent like Azure AI Search or AWS Kendra). When the user asks a question, the system retrieves the most relevant passages from the corpus and supplies them to the LLM as context, along with the question. The LLM is instructed to answer using only the supplied passages.

For actuarial use, the corpus is the artefact that determines the quality of the output. A good corpus for an actuarial assistant might contain:

  • The full text of every relevant ASOP, TAS, APS (including APS X2 v1.1 effective 30 January 2026) and Actuaries’ Code section
  • Current NAIC bulletins, model laws and the AIS Program documentation
  • Relevant state insurance department circulars and guidance
  • The full EU AI Act with its annexes, including the Article 13 provider transparency provisions and the Article 86 deployer-facing explanation rights
  • The current NIST AI RMF Playbook and the GenAI Profile
  • The firm’s own internal methodology documents

The corpus must be version-controlled. When ASOP 56 is updated, the corpus must be updated. When a new state adopts the NAIC Model Bulletin, the corpus must be updated. When the EU AI Act delegated acts are published, the corpus must be updated. The MEASURE 4.2 (trustworthiness in deployment context) evidence pack for any RAG-based actuarial assistant must include the corpus version, the date of the most recent update, and a process for catching missed updates.

RAG does not eliminate confabulation. The LLM can still ignore the retrieved passages, conflate multiple passages, or interpolate between them. But it dramatically reduces confabulation rates and gives every output a defined provenance that the actuary can verify.

Mitigation two: structured output

Most modern LLM APIs (OpenAI’s structured outputs, Anthropic’s tool use, Google’s function calling) allow the developer to constrain the model to produce output that conforms to a JSON schema. This is more powerful than it sounds. A properly designed schema can force the model to produce, for every claim it makes, a citation to a source span; for every numerical assertion, the source field name and the numerical value; for every recommendation, a confidence level and a list of caveats.

Structured output also makes downstream validation possible. Free-text LLM output is hard to check programmatically. Schema-conformant output can be validated by code: every required field present, every numerical value within a plausible range, every citation pointing to a real document, every recommendation accompanied by the required metadata. This gives the actuary an automated first line of defence against confabulated output.

Mitigation three: citation requirements

Closely related to structured output but worth calling out separately. Every factual claim the LLM produces should be accompanied by a pointer to its source. In a RAG system, this means the document, the section, and where possible the sentence or character span within the section. The system prompt should make this requirement explicit: “Every factual claim must be supported by a citation in the form [document_id, section, span]. If you cannot find a supporting citation in the retrieved passages, say so explicitly rather than producing the claim.”

This is more than a stylistic preference. It transforms the actuary’s review task from “is this true?” to “does the citation support this?” The latter is a much faster and more reliable check, and it surfaces confabulations directly: a confabulated claim either has no citation, or has a citation that does not actually support it, or has a citation to a document or span that does not exist.

Mitigation four: evaluation suites against ground truth

This is the artefact category that most actuarial teams have never built and that is most worth investing in. An LLM evaluation suite is a structured collection of test cases for which the correct answer is known, run against the system at every change and at regular intervals, with results tracked over time.

For an actuarial RAG assistant on regulatory citations, an eval suite might include several hundred test cases of the form: “given this draft paragraph, produce the relevant ASOP and TAS citations.” The expected output for each case is created by a human actuary in advance and stored in version control. Each test case records the input, the expected output, and the actual output of the system. Pass/fail is determined by comparing the actual citations to the expected ones. Aggregate pass rate, false positive rate (citations the system produced that should not be there), and false negative rate (citations the system should have produced but did not) are tracked over time and across model versions.

Eval suites do three things. They give the team a quantitative measure of system quality that can be reported to the AI Governance Committee. They catch regressions when the underlying model is updated by the provider (a real risk on managed endpoints, since providers can update underlying models with limited notice, and any system that does not pin to a specific model version can produce materially different output from the same prompt one day to the next). And they provide MEASURE evidence in a form that maps directly onto the existing MEASURE 1.1, 2.1, 2.3 and 2.5 subcategories.

A useful starting point is twenty to fifty manually constructed test cases. A mature eval suite for an actuarial assistant has several hundred cases and is updated as new edge cases are discovered in production.

Mitigation five: human review gates with sign-off semantics

The most fundamental control. For any LLM output that enters a work product or supports an actuarial conclusion, explicit human review and sign-off, recorded, is the discipline that satisfies the standards. The sign-off is not a tick-box. It is the actuary asserting professional responsibility for the content, with the same weight that asserting responsibility for any other actuarial work carries.

The specific sign-off semantics depend on the use case category. For research synthesis (2a) used as background reading by the actuary themselves, lighter-touch confirmation that citations have been verified is reasonable. For drafting work (2b), the actuary who would normally sign off the deliverable signs off the LLM-assisted version with the same standard of review. For code generation (2c), peer review and test execution. For embedded production systems (2d), the system itself is the work product and the sign-off is on the system as a whole rather than individual outputs. The audit log is calibrated to match: verbose for high-stakes categories, lightweight for low-stakes exploration.

“‘The model told me’ is not a defence. The actuary is responsible.”

The Code of Professional Conduct and the Actuaries’ Code are clear. “The model told me” is not a defence. The actuary is responsible.

Section 4: Worked example, a regulatory citation assistant

To make all of this concrete, here is how an actuarial team might build, govern and validate a useful LLM application. The example is realistic, the techniques are real, and the resulting system is something most actuarial teams could productively use.

The use case

An actuarial team produces regulatory filings, board papers and technical memoranda that need to cite the relevant actuarial standards (ASOPs in the US, TAS and APS in the UK), regulatory guidance (NAIC bulletins, state insurance department circulars), and where applicable EU AI Act articles. Producing accurate citations is time-consuming and error-prone because the corpus is large and the relevance of any given standard depends on the specific claim being made. The team builds an LLM-powered assistant that takes a draft paragraph and returns a list of relevant citations with section-level references and an explanation of why each is relevant.

Govern artefacts

Before any technical work:

  • AI inventory entry classifying the assistant as a GenAI system, internal use only, no consumer-facing deployment, no PII processing, low to medium risk.
  • Usage policy specifying that the assistant is permitted for internal drafting support only, that all output must be reviewed by the responsible actuary before use, that the corpus is curated by the team and version-controlled, and that no client or policyholder data may be sent to the system.
  • Vendor selection documented under GOVERN 6.1, specifying the model provider, the data residency, the contractual protection of submitted content (the provider must not use submitted content for training), and the audit rights.
  • Decommissioning trigger conditions: specified eval suite pass-rate floor below which the system is taken offline pending investigation.

Map artefacts

  • Intended-purpose statement: “to assist actuaries in identifying relevant professional and regulatory citations for use in actuarial communications, by retrieving authoritative source material and proposing citations that the actuary will then verify and accept or reject. The assistant does not produce final actuarial work product. It produces draft citations only.”
  • Out-of-scope statement: the assistant is not used for substantive legal interpretation, client advice, regulatory submission without review, or any work product that is communicated outside the firm without actuarial sign-off.
  • Stakeholder map: actuarial team members as primary users; responsible actuaries as reviewers; the AI Governance Committee as oversight; the firm’s IT and information security functions as operators of the underlying infrastructure.
  • Confabulation impact analysis (MAP 5.1, GenAI-specific): the worst plausible failure mode is a confabulated citation that the reviewing actuary fails to catch and that ends up in a regulatory filing. Mitigation is the human review gate plus the structured citation format that makes verification cheap.

The technical architecture

The system has five components:

  1. Corpus: full text of all relevant ASOPs, TASs, APSs (including APS X2 v1.1), the Actuaries’ Code, the NAIC AI Model Bulletin and AIS Program guidance, current state insurance department circulars on AI, the NIST AI RMF and GenAI Profile, and the EU AI Act with annexes. Stored in version control with a documented update process. Each document has a stable identifier, a publication date and a version number.
  2. Vector database: the corpus is processed into chunks of approximately 500 tokens each, with each chunk annotated with its document, section, paragraph, and version. The chunks are embedded using a stable embedding model (the choice of embedding model is itself version-pinned) and stored in a vector database.
  3. Retrieval: the user submits a draft paragraph. The system embeds the paragraph and retrieves the top 20 most relevant chunks from the corpus using approximate nearest neighbour search.
  4. Generation: the retrieved chunks are passed to the LLM along with the user’s draft paragraph and a structured prompt that requires the LLM to identify which chunks are relevant, propose citations using a defined JSON schema, and explain in one sentence why each is relevant. The LLM is explicitly instructed not to produce citations that are not supported by the retrieved chunks. The LLM endpoint is pinned to a specific model version; provider model updates are not picked up automatically (see provider change management under Manage artefacts below).
  5. Output: the LLM’s structured output is validated against the schema, every citation is verified to point to a real chunk in the corpus, and the result is presented to the actuary in a review interface that shows the proposed citation, the supporting chunk, and the actuary’s accept/reject control.

Measure artefacts

The most important MEASURE artefact is the eval suite. The team constructs an initial set of 50 test cases by selecting 50 paragraphs from prior actuarial work products and having an experienced actuary identify the correct citations for each paragraph. Each test case stores the input paragraph, the expected list of citations, and a brief rationale for each.

The eval suite is run at three points: every time the corpus is updated, every time the system prompt is changed, and every Monday morning to detect any drift in the underlying model. Aggregate metrics are tracked over time:

  • Recall: of the citations that should have been produced, what fraction were produced?
  • Precision: of the citations that were produced, what fraction should have been produced?
  • F1: harmonic mean of precision and recall.
  • Hallucination rate: what fraction of produced citations point to documents, sections, or spans that do not exist in the corpus? For a citation assistant, the working target is zero. Any non-zero hallucination rate triggers investigation; the floor below is the rate at which the system is taken offline pending root-cause analysis, not the steady-state aspiration.
  • Source faithfulness: for citations that point to real spans, do the spans actually support the claim being made? Sampled for manual review.

Acceptance thresholds for production use are set at the start of the project and recorded in the Govern artefacts. For the example, the team sets recall ≥ 0.85, precision ≥ 0.90, hallucination-rate offline-floor below 0.5% (with a working target of zero), and source faithfulness ≥ 0.95 in manual review of a 10% sample. A citation assistant that confabulates one citation in every two hundred is not in production-quality territory; it is in investigate-and-fix territory. The rest of the example is conservative; this threshold should be too.

Other MEASURE artefacts:

  • MEASURE 2.7 (security): prompt injection testing. Test cases include adversarial paragraphs that attempt to instruct the model to ignore the corpus, produce malicious code, or expose its system prompt. The system’s behaviour is documented.
  • MEASURE 2.10 (privacy): confirmation that no PII is sent to the model and that the audit log of all prompts and responses is retained and reviewable.
  • MEASURE 2.8 and 2.9 (transparency and explainability): every LLM output presented to the user shows the retrieved chunks alongside the proposed citations. The user can see exactly what the LLM was given and what it produced. This is the GenAI version of the explainability artefacts in Part 2 of this series.

Manage artefacts

  • Audit log: every prompt, every retrieval, every LLM response, every actuary accept/reject decision is logged with timestamp, user, model version and corpus version. Retained for the period required by the firm’s record retention policy.
  • Drift monitoring: the weekly eval suite run is the drift monitor. A material change in any of the eval metrics triggers investigation.
  • Incident response runbook: named incident commander, escalation chain, communication templates for the AI Governance Committee.
  • Provider change management: when the model provider releases a new version of the underlying model, the system is not automatically updated. The new version is run against the eval suite, the results are compared to the previous version, and the change is approved or rejected by the AI Governance Committee before any production use. Version pinning is the technical control that makes this governance step possible.

What the team gets

Once built, the system saves the actuarial team substantial time on routine citation work. More importantly, it improves citation quality (because the corpus is comprehensive and the system is consistent), it produces a complete audit trail that supports professional standards expectations, and it serves as a reference implementation for future GenAI projects in the firm. The eval suite, the prompt management approach, the corpus governance and the audit log architecture are all directly reusable for the next project.

This is the operational benefit of doing GenAI well from the start. Each project compounds the team’s capability. A team that starts with a low-stakes citation assistant in Q1 has the technical foundations to build a higher-stakes system in Q3 with substantially less new work.

Section 5: A reusable LLM usage policy template

Below is the structure of a usage policy for generative AI in an actuarial function. As with the MEASURE evidence pack template in Part 2, it is opinionated; adapt to your environment. The numbered sections are a starting point for the team’s own policy, not a prescription.

LLM Usage Policy for [Function Name]

1.  Scope and definitions
    1.1  What this policy covers (which models, which use cases)
    1.2  Definitions: model, prompt, completion, system prompt,
         RAG, agentic system, sensitive data
    1.3  Out-of-scope uses (e.g., personal use, experimentation
         outside defined sandboxes; incidental autocomplete or
         writing-aid use that does not touch actuarial conclusions)

2.  Approved models and providers
    2.1  List of approved LLM providers and models, with version
         pinning where required
    2.2  Procurement and vendor due diligence requirements
    2.3  Contractual requirements (no training on submitted data,
         data residency, audit rights)
    2.4  Process for adding a new approved model

3.  Data classification and use
    3.1  Data classification scheme: public, internal, confidential,
         restricted, regulated PII
    3.2  Which classifications may be sent to which models
    3.3  Explicit prohibition on sending regulated PII to public
         consumer LLM interfaces
    3.4  RAG corpus eligibility and curation responsibilities

4.  Use case approval process
    4.1  Risk classification for new GenAI use cases, including any
         EU AI Act role assessment (provider, deployer, both) where
         the system is determined to be in scope of Annex III
    4.2  Approval authority by risk level
    4.3  Required artefacts before production deployment
    4.4  Required reviews and sign-offs

5.  Required controls (calibrated to use case category)
    5.1  Confabulation controls (RAG, structured output, citation
         requirements, eval suites, human review)
    5.2  Prompt injection and security controls
    5.3  Audit logging requirements (verbose for high-stakes
         categories, lightweight for low-stakes exploration)
    5.4  Model and version pinning requirements
    5.5  Reproducibility expectations

6.  Human responsibility and sign-off
    6.1  Statement of professional responsibility (the user is
         responsible for the work, regardless of LLM involvement)
    6.2  Specific sign-off requirements by use case category
    6.3  Documentation of LLM-assisted work in the audit trail,
         consistent with APS X2 v1.1 work review expectations
         where applicable

7.  Incident reporting
    7.1  What constitutes an LLM incident
    7.2  Reporting channel and timeline
    7.3  Escalation triggers
    7.4  Post-incident review process

8.  Training and awareness
    8.1  Mandatory training for all users
    8.2  Specialised training for system builders
    8.3  Refresh cadence

9.  Policy review
    9.1  Owner and review cadence
    9.2  Triggers for ad-hoc review (new model release, regulatory
         change, incident)

10. References
    10.1  NIST AI RMF and GenAI Profile
    10.2  Applicable actuarial professional standards (ASOPs, TASs,
          APS X2 v1.1, Actuaries' Code, Code of Professional Conduct)
    10.3  Applicable regulatory frameworks (NAIC, Colorado, EU AI Act
          with provider/deployer split noted)
    10.4  Firm's broader AI governance documentation

A team that adopts a policy of this shape, enforces it through the AI Governance Committee, and reviews it quarterly will be in a materially stronger position under professional scrutiny, internal audit, and most regulatory inquiry. Sector-specific law and contract terms remain the binding constraints in their own right, and a usage policy is not a substitute for legal advice on either.

Section 6: Where the professional standards land

The American Academy of Actuaries published Actuarial Professionalism Considerations for Generative AI in 2024 as a discussion paper. The IFoA has issued parallel non-mandatory guidance through its AI, Data Science and Emerging Technologies Practice Board. Both materials say the existing professionalism framework already governs GenAI use; both narrow their focus to GenAI used in the course of actuarial services or to support actuarial conclusions, rather than to incidental autocomplete or writing-aid use. The position is consistent across the profession and worth stating directly:

Generative AI used in actuarial work is a model. Where the LLM is being used to design, develop, select, modify, use, review or evaluate a model, or to produce output that supports an actuarial conclusion, ASOP 56 (Modeling) applies. ASOP 23 (Data Quality) applies to the inputs. ASOP 41 (Actuarial Communications) applies to the outputs. The Code of Professional Conduct applies to the actuary using the model. The Actuaries’ Code applies in the UK, and APS X2 v1.1 (effective 30 January 2026) governs the review of the work whether or not an LLM was involved in producing it.

The actuary is responsible. Precepts 1 and 2 of the Code of Professional Conduct require the actuary to be competent and qualified to provide actuarial services. An actuary who uses a tool they do not understand is in tension with these precepts. An actuary who uses a tool that produces output they cannot verify is in tension with ASOP 56 §3.4’s requirement to “make a reasonable attempt to have a basic understanding of the model”.

Validation is mandatory where in scope. ASOP 56 §3.6 (Evaluation and Mitigation of Model Risk), including §3.6.1 (Model Testing) and §3.6.2 (Model Output Validation), requires sufficient testing and output validation appropriate to the model’s intended purpose. This applies to LLM output that supports actuarial conclusions. Using an LLM result without validation, on the basis that “that is what the model said”, is not consistent with the standard.

Independent review. ASOP 56 §3.6.3 permits, and in many cases recommends, review by another qualified professional (framed permissively: the actuary “may consider” such review). APS X2 v1.1 is more prescriptive in the UK, requiring judgement-based work review that may extend to independent peer review. For any material LLM application in actuarial work, peer review is the single most effective control against the build team’s own blind spots, as the Obermeyer case in Part 3 illustrated.

Documentation is mandatory. ASOP 41 requires that actuarial communications be sufficient to allow another qualified actuary to assess the work. ASOP 56 §3.7 (documentation) reinforces the same expectation for the model itself. An actuarial work product that includes LLM-assisted content must be documented to the same standard. The audit log of LLM interactions is part of the documentation.

Reliance on tools developed by others. ASOP 56 §3.4 explicitly addresses reliance on models developed by others. The actuary should make a reasonable attempt to understand the model’s intended purpose, its general operation, its major sensitivities and dependencies, and its key strengths and limitations. For an LLM, this means understanding what the model is, what its known failure modes are, how it produces outputs, and where confabulation is most likely. It does not require the actuary to understand transformer architectures in detail. It does require the actuary to understand the model well enough to know when its output should be doubted, and to disclose the extent of such reliance in line with ASOP 56 §4.1.

The position is, in short, that generative AI is a powerful new tool which actuaries should use, and that the use must be governed by the same professional standards that have governed actuarial work for a generation. There is no exception in the standards for AI-assisted work. There is no separate regime. There is the existing regime (Code, ASOPs, TASs, APS X2) applied to a new category of model, with new control techniques and new evidence artefacts that the standards anticipate without prescribing in detail.

Section 7: The first week, the first month, the first quarter for GenAI

The action plan from Part 2 of this series covered traditional ML systems. The plan below is the GenAI-specific version. A team that has already done the Part 2 work has a head start; a team starting with GenAI from scratch should expect the first quarter to be busy.

Week 1

  • Inventory existing LLM use. This is the single most informative action and most actuarial functions are surprised by the result. Survey the team. Ask, in writing, who has used ChatGPT, Claude, Copilot, Gemini, Cursor or any other LLM tool for any work-related purpose in the past month. Record what they used it for, what data they sent, and what they did with the output. Most functions discover that LLM use is much more widespread than the leadership thinks, and that some of it is happening on consumer accounts with no governance and no audit trail.
  • Identify the highest-risk current use. From the inventory, identify the use case that carries the most risk: usually the one that involves the most sensitive data or the most consequential output.
  • Issue an interim guidance note. A short written communication to the team specifying what they may continue to do, what they must stop doing immediately, and what is under review. Even a one-page interim note materially reduces uncorrelated risk while the full policy is being developed.

Month 1

  • Draft the LLM usage policy using Section 5’s template. Get it reviewed by the AI Governance Committee, Legal and Compliance.
  • Identify one or two high-value use cases to build out properly. The regulatory citation assistant from Section 4 is a good candidate because it is universally applicable and exercises every important control technique.
  • Stand up a sandbox environment for the team to experiment in safely. This is usually an enterprise account with one of the major providers (OpenAI, Anthropic, Azure OpenAI, Google Vertex AI, AWS Bedrock) configured with the contractual protections required by Section 5.2 of the policy.
  • Build the first eval suite. Even ten to twenty test cases is enough to start. The discipline of writing test cases against expected outputs is itself a learning exercise and the team’s understanding of what good output looks like will mature rapidly.

Quarter 1

  • Build and deploy the first internal use case end-to-end. RAG corpus, vector database, structured prompting, eval suite, human review interface, audit log, governance sign-off, MEASURE evidence pack.
  • Train the team. At least one hour of structured training on the policy, the controls, and the responsibilities. Recurring annually.
  • Establish the eval and review cadence. The eval suite runs weekly. The AI Governance Committee reviews LLM use cases monthly. The policy is reviewed quarterly. Production use cases are recertified annually.
  • Document the residual risks. What risks remain after all controls are in place? Who has accepted them? Reviewed annually.

A team that completes this sequence in a quarter has a defensible foundation for ongoing GenAI use in actuarial work. Subsequent use cases are easier because the policy, the eval methodology, the governance committee, the audit infrastructure and the team’s vocabulary are all already in place.

Section 8: Series closing, the toolkit you now have

This is the final article in the series. Across the four pieces, we have produced five reusable reference artefacts that together form an operating toolkit for actuarial AI governance. They are listed below for the reader who wants to assemble them all.

The framework convergence table (Part 1). One page mapping the four NIST RMF functions to NAIC Model Bulletin obligations, Colorado Regulation 10-1-1 requirements, EU AI Act articles (with provider and deployer obligations distinguished), and UK and US actuarial professional standards. The artefact for board-level conversations.

The full subcategory mapping (Part 2, Section 1). A long-form table mapping the relevant NIST RMF subcategories (Govern, Map, Measure, Manage) to ASOP 56, ASOP 23, ASOP 41, TAS 100, APS X2 v1.1, the NAIC AIS Program, and the EU AI Act articles with provider/deployer annotations. The artefact for cross-framework audit response.

The MEASURE evidence pack template (Part 2, Section 5). A reusable structure for the document that should exist for every material AI system in the function. The artefact for RMF, NAIC, Colorado, NYDFS or EU AI Act audit response.

The practice-area triage table (Part 3, Section 6). A heatmap-style mapping of common actuarial use cases to the RMF subcategories that carry the most weight in each context, with jurisdiction-specific entries flagged. The artefact for prioritising RMF work across a heterogeneous portfolio.

The LLM usage policy template (Part 4, Section 5 above). A reusable structure for the policy document that should govern any GenAI use in the actuarial function. The artefact for establishing GenAI governance from a standing start.

A team that has these five artefacts on its shared drive, has filled them in for its own context, has stood up the cross-functional AI Governance Committee that the first article described, and has begun the first-week-first-month-first-quarter sequence from Part 2 (for traditional ML) and Part 4 (for GenAI), is in practical terms operating to NIST AI RMF. The work is ongoing, the artefacts compound, and the team’s capability matures with every use case.

A final word

The framework is not the point. The framework is the vocabulary in which actuaries, regulators, auditors, lawyers, engineers and their own boards now talk about AI governance. The point is the work. The artefacts. The careful attention to what the model is for, what it actually does, how it can fail, and what the team will do when it does. The actuarial profession has been doing this kind of careful attention for a very long time, on a different category of model, expressed in a different vocabulary. The work the framework asks for is mostly the work the profession already does, plus a few new artefact categories where AI introduces genuinely new failure modes.

The chief actuary is, in our view, a strong candidate to lead the cross-functional AI governance committee. Few other members arrive with professional standards that already require them to sign off on model suitability, document assumptions and limitations, and take personal professional responsibility for the work. The technical actuary is the natural producer of the MEASURE evidence pack. The actuarial profession’s existing standards already require almost everything the framework asks for. The opportunity for the profession is to lead this work confidently rather than be led into it reluctantly.

The framework is the language. The artefacts are the work. The next move is yours.

Ready to operationalise NIST AI RMF and the Generative AI Profile in your actuarial function?

Talk to our team about how Globebyte can help you build the governance structures, the eval suites, the RAG systems and the MEASURE evidence packs. From strategic alignment to working code.

Explore our services

Ready to explore AI for your organisation?

Talk to our team about how Globebyte can help.

More insights