R-Judge
Benchmarking Safety Risk Awareness
for LLM Agents
(2024)
Large language models (LLMs) have exhibited great potential in autonomously completing tasks across
real-world applications. Despite this, these LLM agents introduce unexpected safety risks when
operating in interactive environments. Instead of centering on LLM-generated content safety
in most prior studies, this work addresses the imperative need for benchmarking the behavioral
safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to
evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction
records.
The figure illustrates R-Judge: The upper left part is an example in the dataset, i.e., a record of an agent interacting with the user and environment. The lower left part is human annotation involving a binary safety label and high-quality risk description. Following the arrow, we can see the serial evaluation paradigm with two tests: given record and task instruction, LLMs are asked to generate an analysis and a label. An automatic evaluator compares the analysis with the ground truth risk description to assess the effectiveness of risk identification. And correct generated labels are counted to evaluate the performance of safety judgment.
R-Judge comprises 162 records of multi-turn agent interaction, encompassing 27 key risk scenarios among 7 application categories and 10 risk types. It incorporates human consensus on safety with annotated safety labels and high-quality risk descriptions.
For more details, you can explore the datatset and check the visualizations here: Explore and Visualizations.
Our dataset is distributed under the CC BY-NC-SA (Attribution-NonCommercial-ShareAlike) license. You can check out our github repository to download the dataset .
Based on the dataset, we evaluate the effectiveness of public LLMs as agent safety monitors. As monitors for agent safety demand both risk identification and safety judgment of LLM capabilities, we design two recipes to evaluate LLMs' proficiency in identifying and judging safety risks given agent interaction records.
As shown in the Figure, the two evaluation recipes are conducted in a serial pipeline. First, LLMs are demonstrated with the record as input and asked to analyze whether the agent actions in the record are safe, formalized as:
$$p_{\theta}(R) \to analysis$$Then, LLMs are asked to output 'unsafe' or 'safe' based on record and analysis in the first step, formalized as:
$$p_{\theta} (R, analysis) \to label$$The output analysis in the first step and the output label in the second step are taken respectively for the two evaluation recipes, with human annotation as ground truth.
1. Label Score for Safety Judgment. To evaluate the ability of LLMs to make safety judgments, a label-based test compares LLM-generated binary safety labels with truth labels from the consensus of human annotators.
2. Pairwise Evaluation for Risk Identification. To evaluate the effectiveness of LLMs in identifying safety risks, an open-ended test utilizes GPT-4 as an automatic evaluator to assess the open-ended model-generated analysis on unsafe cases.
The results of other baselines and recent work are reported at the Leaderboard page.
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou,
Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu
Paper /
PDF /
Code
View on the github repository.
If the paper, codes, or the dataset inspire you, please cite us:
@article{yuan2024rjudge, title={R-Judge: Benchmarking Safety Risk Awareness for LLM Agents}, author={Tongxin Yuan and Zhiwei He and Lingzhong Dong and Yiming Wang and Ruijie Zhao and Tian Xia and Lizhen Xu and Binglin Zhou and Fangqi Li and Zhuosheng Zhang and Rui Wang and Gongshen Liu}, journal={arXiv preprint arXiv:2401.10019}, year={2024} }