
Large language models are not yet reliable enough for laboratory safety tasks

Researchers have developed a new test suite to measure how well large language models, and multimodal models that combine images and text, handle tasks related to safety risks in scientific laboratories. Artificial intelligence is already used in research, for example to support experiment planning and to guide stages of laboratory work, but there is a growing risk that users place too much trust in systems that appear to understand a situation without any real understanding of it.

The work, published in the journal Nature Machine Intelligence, introduces a benchmark called LabSafety Bench. It evaluates models in three key areas: hazard identification, risk assessment, and consequence prediction. The dataset is extensive: 765 multiple-choice questions and 404 realistic laboratory scenarios, which together yield 3,128 open-ended tasks.
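To make the setup concrete, below is a minimal sketch of how the multiple-choice portion of a benchmark like this is typically scored. The item fields (question, options, answer), the evaluate_multiple_choice helper, and the dummy_model stand-in are illustrative assumptions for this sketch, not the actual LabSafety Bench data schema or evaluation code.

    from typing import Callable

    def evaluate_multiple_choice(
        items: list[dict],
        ask_model: Callable[[str], str],
    ) -> float:
        """Score a model on multiple-choice safety items.

        Each item is assumed (hypothetically) to look like:
            {"question": "...", "options": ["A. ...", "B. ...", ...], "answer": "B"}
        The real LabSafety Bench format may differ.
        """
        correct = 0
        for item in items:
            prompt = (
                item["question"]
                + "\n"
                + "\n".join(item["options"])
                + "\nAnswer with the letter of the correct option only."
            )
            reply = ask_model(prompt).strip().upper()
            # Take the first character of the reply as the model's chosen letter.
            choice = reply[:1] if reply else ""
            if choice == item["answer"].upper():
                correct += 1
        return correct / len(items) if items else 0.0

    if __name__ == "__main__":
        # Toy example with one made-up item and a stand-in "model" that always answers A.
        sample = [{
            "question": "Which action is safest when a solvent bottle is leaking?",
            "options": ["A. Ignore it", "B. Ventilate and contain the spill", "C. Heat the bottle"],
            "answer": "B",
        }]
        dummy_model = lambda prompt: "A"  # replace with a real LLM call in practice
        print(f"Accuracy: {evaluate_multiple_choice(sample, dummy_model):.2f}")

In practice, a harness like this would be run over the full question set and broken down by category (hazard identification, risk assessment, consequence prediction) to produce per-area scores; the open-ended scenario tasks require separate, typically rubric-based, grading.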

The benchmark was used to evaluate 19 advanced models, including both text-only large language models and models that can also process images. The overall picture was stark: current models are still far from the reliability required for safe laboratory use.

The results emphasize that although artificial intelligence can be a useful tool in research, the instructions and assessments it generates cannot, as such, be treated as safe in laboratory settings. The new benchmark provides a way to measure and track whether models' ability to identify hazardous situations and assess their risks improves before AI is integrated more closely into the daily operations of laboratories.

Source: Benchmarking large language models on safety risks in scientific laboratories, Nature Machine Intelligence.

This text was generated with AI assistance and may contain errors. Please verify details from the original source.

Original research: Benchmarking large language models on safety risks in scientific laboratories
Publisher: Nature Machine Intelligence
January 15, 2026