Long-term safety practitioners will recognise the rhythm we hope to see when a new technology enters the workplace. The technology arrives, researchers evaluate it, regulators publish guidance, and standards bodies eventually update. Unfortunately, the pattern does not always hold. Asbestos, silica and engineered stone remind us how badly things can go when research is inadequate or its lessons are not implemented. But when the rhythm holds, it gives us time for considered governance of work health and safety. Artificial intelligence has begun to outpace that rhythm. The software updates are moving faster than the research, faster than the regulators, faster than the standards. For organisations thinking seriously about AI in the work safety function, that mismatch is likely the central governance problem, and it is not going to resolve itself.
Our earlier paper, Artificial intelligence at work (May 2026), addressed AI as a workplace hazard: algorithmic management, intrusive surveillance, the psychosocial risks of AI-mediated work, and the duty position under the Work Health and Safety Act and the Commonwealth Code of Practice. This article turns the lens. It examines AI as a tool used inside the work safety function itself: safety practitioners drafting risk assessments, reviewing incidents and writing procedures, and the officers and directors who may sign off on the work. The governance problem is largely different. The lag problem is the same.
A capability curve no evaluator can keep up with
A recent scoping review by Dokas (2026) in Safety Science 194 maps the current landscape of benchmarks used to evaluate large language models in safety-critical hazard analysis. It identifies inadequate coverage of safety-specific knowledge, limited evaluation of causal reasoning in technical contexts, the absence of regulatory compliance assessment, and minimal evaluation of how models handle uncertainty.
A companion paper in the same volume by Lee, Park, Oh and Ma (2026) tested four leading models (GPT-4o, GPT-4o-mini, LLAMA 3.2 and Gemini 2.0) for whether they could autonomously generate Hazard and Operability Study (HAZOP) worksheets without human intervention. The outputs clearly looked credible. Textual similarity to expert-prepared reference worksheets exceeded 86 per cent. The substance was thinner. Only 19 to 37 per cent of generated accident scenarios were semantically valid, and the safeguards proposed were heavily biased toward procedural measures rather than design-level controls. The authors concluded that LLMs are usefully positioned as supportive tools, but not as replacements for expert-led HAZOP.
These are valuable findings that contribute to the wider discussion of AI in work safety. They are also, by the time they appear in print, likely dated in detail. The models tested in any given study were the models available when the work was designed, typically 12 to 18 months before publication. Frontier model capability is moving rapidly, so by the time a paper is published the models have changed substantially. Some specific weaknesses may have narrowed. Others may have changed shape. The structural concerns, particularly around causal reasoning, output consistency and the gap between surface plausibility and semantic validity, cannot be presumed to have disappeared, but neither can the empirical detail be presumed to still hold.
Regulators and standards bodies are likely in the same position. The Australian Government’s National AI Plan (December 2025) and the National AI Centre’s Guidance for AI Adoption (October 2025), with its AI6 framework of six essential governance practices, are recent, considered, and already chasing a moving target. ISO/IEC 42001:2023, the international standard for AI management systems, was released after generative AI was widely deployed. ISO 45001:2018, the WHS management system standard most Australian businesses certify against, predates the generative AI shift entirely. None of this is a criticism. The practical reality is simply that anyone using AI in safety work has to govern under uncertainty about both capability and constraint.
Where AI is genuinely useful in WHS work today
There is a defensible case for several uses of AI in WHS practice, and the author believes that it is worth being specific.
AI tools are useful for first-draft generation. Drafting a Safe Work Method Statement (SWMS), a Job Safety Analysis (JSA) or a standard operating procedure is a strongly patterned generative task. A competent practitioner can produce a better draft faster with AI assistance than without it, provided they then verify the output against the actual job scope, the actual plant used and the actual jurisdiction in which the document will apply.
AI tools are becoming useful for document review. Surfacing themes across hundreds of incident reports, contractor safety files or monitoring records is the kind of work where human reviewers run out of attention long before the dataset does.
AI tools are useful for hazard brainstorming. Used as a prompt for human imagination rather than a substitute for it, this is a low-risk and frequently high-value application.
AI tools also appear somewhat useful for translating regulatory and technical language for frontline workers. A worker who can read a SWMS in plain English is potentially safer than one who cannot read the SWMS at all.
Each of these is an assistive use. None of them is a replacement use.
Where the failure modes still cluster
The failure modes worth taking seriously are not the ones that vanish with the next model release. They are structural.
Causal reasoning in incident investigation is one. Mainstream investigation methodologies used in mining, construction, ports and process industries, including ICAM, TapRooT, Tripod and AcciMap, treat the proximate event as the starting point of inquiry rather than the end of it. The major published investigations of the past three decades have consistently identified systemic and organisational factors that the proximate description does not capture. Language models pattern-match against surface features of incident narratives. They do not reason causally about latent conditions, layers of defence or the distinction between active and latent failures. This matters in any forum where the question is not “what happened” but “what should have been controlled”.
Jurisdictional interpretation is another emerging issue. The model Work Health and Safety framework is adopted with variation across jurisdictions. Victoria operates under a different Act. Western Australia has its own duties and machinery. The Commonwealth has additional reporting overlays. Language models trained on a global corpus tend to default to majority interpretations and miss the local provisions that actually bind a particular duty holder.
Control hierarchy bias is the failure mode WHS practitioners should find most disquieting. The Lee et al. (2026) study found that AI-generated HAZOP safeguards clustered heavily at the procedural end of the hierarchy of controls: training, signage, supervision, work instructions. Higher-order controls, including elimination, substitution and engineering interventions, were under-represented. Left to itself, the technology tends to recommend the weakest level of risk control. For any practitioner trained on the hierarchy, the implication is direct: an AI-drafted control plan needs human challenge specifically on whether higher-order controls have been considered.
Output consistency is the failure mode least appreciated by general users. The Dokas review introduces “performance consistency” as a critical evaluation metric for a reason. The same prompt, asked twice, can produce materially different outputs. In a regulated practice, inconsistency is not a quirk. It is a defect.
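One way a practitioner might make this visible is to run the same prompt several times and compare the results before relying on any one of them. The following is a minimal sketch only, assuming the repeated outputs have already been collected as plain strings; the function names and the 0.8 threshold are illustrative assumptions, not values drawn from the Dokas review. It measures textual similarity only, so a high score indicates stable wording, not valid content.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(outputs: list[str]) -> list[float]:
    """Return the textual similarity ratio (0.0 to 1.0) for every pair of outputs."""
    return [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(outputs, 2)
    ]


def is_consistent(outputs: list[str], threshold: float = 0.8) -> bool:
    """True if every pair of repeated outputs meets the threshold.

    The threshold is a hypothetical value; an organisation would set its own.
    """
    scores = pairwise_similarity(outputs)
    if not scores:
        raise ValueError("need at least two outputs to compare")
    return min(scores) >= threshold
```

Outputs that fail a check like this would go to a competent reviewer rather than being averaged or silently re-run. Outputs that pass still require verification, because consistent wording can be consistently wrong.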
Hallucinated authority is among the most dangerous failure modes in regulated practice. Confident citations to regulations that do not exist, or to case law that has been mis-summarised, carry real legal exposure for anyone who relies on them without verification. The Lee et al. finding that AI HAZOP outputs scored above 86 per cent on surface similarity to expert worksheets while delivering only 19 to 37 per cent valid scenarios is the same pattern in another form. Outputs look right. The diligence required to confirm whether they are right has not gone away.
A governance framework that holds under uncertainty
If the evidence base and the regulatory base both run behind the technology, the governance framework inside an organisation has to be robust to capability change. A workable foundation rests on three principles.
The first is that AI may assist, humans govern, and competent experts sign off. AI tools should accelerate the work of competent practitioners, not displace them. Final outputs that carry regulatory, contractual or duty-holder weight need expert review and a named human responsible for the determination. Under section 27 of the model Work Health and Safety Act, officer due diligence requires taking reasonable steps to acquire and keep up-to-date knowledge of work health and safety matters and to ensure appropriate resources and processes are used. Deploying AI tools in safety-critical decisions without competent oversight is unlikely to satisfy that standard.
The second is that AI outputs are inputs to expert judgement, not authority in themselves. Every finding that informs a duty-relevant decision must be traceable to a verified primary source. Plausibility is not verification, regardless of how confidently the output is written.
The third is treating AI as a managed system rather than a collection of tools. ISO/IEC 42001:2023 sits alongside, not in place of, ISO 45001:2018. The AI6 framework maps cleanly onto the same logic, with its emphasis on accountability, risk management, transparency, testing, human oversight and incident response. An organisation cannot credibly assure board-level oversight of AI in safety work without a defined management system around it.
What boards and senior safety leaders should do soon
The author believes there is an emerging need to map current AI usage in the work safety function. Shadow AI, where individual practitioners adopt tools without sanction, currently appears widespread and is rarely visible at executive level. You cannot govern what you cannot see.
- Establish a risk-graded use policy.
Low-stakes uses such as brainstorming, drafting and summarising need light governance and routine quality assurance. High-stakes uses, including compliance findings, incident causation, regulatory advice and board reporting, need expert sign-off as a non-negotiable condition.
- Require traceability.
Every AI-assisted output that informs a decision should be linked to the human who verified it, the sources that ground it, and the version of the tool that produced it. Without traceability there is no due diligence record.
- Build in periodic re-evaluation.
The technology will not stay still, and neither should the policy. A twelve-month review cycle is probably too slow. Six months is closer to the right rhythm.
It is critically important to document the governance, not just the use. Boards likely need to demonstrate that they have considered AI risk, not merely assert it.
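The risk grading, traceability and review steps above could be captured in a simple record structure. The following is a sketch under stated assumptions, not a prescribed implementation: the use-type names, the `AIOutputRecord` fields, the choice to treat unlisted uses as high-stakes, and the 182-day approximation of a six-month cycle are all hypothetical and would need to be set by the organisation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class RiskGrade(Enum):
    LOW = "low"    # light governance and routine quality assurance
    HIGH = "high"  # expert sign-off is a non-negotiable condition


# Hypothetical mapping of use types to grades, following the examples in the text.
USE_GRADES = {
    "brainstorming": RiskGrade.LOW,
    "drafting": RiskGrade.LOW,
    "summarising": RiskGrade.LOW,
    "compliance_finding": RiskGrade.HIGH,
    "incident_causation": RiskGrade.HIGH,
    "regulatory_advice": RiskGrade.HIGH,
    "board_reporting": RiskGrade.HIGH,
}


@dataclass
class AIOutputRecord:
    use_type: str
    verified_by: str        # the human who verified the output
    sources: list[str]      # the sources that ground it
    tool_version: str       # the version of the tool that produced it
    expert_sign_off: bool = False


def grade_for(use_type: str) -> RiskGrade:
    """Look up the grade; unlisted uses default to HIGH (an assumption)."""
    return USE_GRADES.get(use_type, RiskGrade.HIGH)


def may_release(record: AIOutputRecord) -> bool:
    """An output is releasable only if traceable and, when high-stakes, signed off."""
    traceable = bool(record.verified_by and record.sources and record.tool_version)
    if not traceable:
        return False
    if grade_for(record.use_type) is RiskGrade.HIGH:
        return record.expert_sign_off
    return True


def review_overdue(last_review: date, today: date, interval_days: int = 182) -> bool:
    """True if the policy review is older than roughly six months (182 days)."""
    return today - last_review > timedelta(days=interval_days)
```

Under these assumptions, a drafting record with a named verifier, at least one source and a tool version can be released, while an incident causation record with the same details is held until `expert_sign_off` is set. Retaining the records themselves is what turns the policy into a due diligence trail.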
The harder conversation
The harder conversation for boards and work safety leaders is not whether to adopt AI. That decision is already being made, daily, by individuals across every organisation. The harder conversation is whether governance is keeping pace with capability, and whether the people relying on these tools have the competence, time and authority to verify what the tools produce.
The evidence base on AI in work health and safety will not settle in any useful sense, because the technology will keep moving. Researchers will continue to produce valuable work on cycles that lag the software by years. Regulators will continue to update guidance on cycles longer still. Standards bodies will continue to revise frameworks on cycles longer again. None of that is going to change.
The organisations that handle this period well will be those that govern for capability change rather than for any particular model. The duty holders who handle it well will be those who treat AI as a tool that accelerates competent practice, not as a substitute for it.
The distinction is not subtle, and it matters more with each model released.
References
Dokas, I. M. (2026). From hallucinations to hazards: benchmarking LLMs for hazard analysis in safety-critical systems. Safety Science, 194, 107056. https://doi.org/10.1016/j.ssci.2025.107056
International Organization for Standardization. (2018). ISO 45001:2018 Occupational health and safety management systems: Requirements with guidance for use. ISO.
International Organization for Standardization. (2023). ISO/IEC 42001:2023 Information technology: Artificial intelligence management system. ISO.
Lee, J., Park, S., Oh, S., & Ma, B. (2026). Can large language models automate the HAZOP process without human intervention? Safety Science, 194, 107039. https://doi.org/10.1016/j.ssci.2025.107039
National AI Centre. (2025). Guidance for AI Adoption. Department of Industry, Science and Resources, Australian Government.
Department of Industry, Science and Resources. (2025). National AI Plan. Australian Government.
Ninness, J. (2026). Artificial intelligence at work. Safetysure Research, 20 May 2026. https://www.safetysure.com.au/research/artificial-intelligence-at-work/
Work Health and Safety Act 2011 (Cth).
