Leaderboard - R-Judge

Evaluation of different models on R-Judge dataset.

You are invited to contribute your results to the R-Judge Leaderboard. Please send your result scores to this email or open a new issue at the github repository.

#	Model	Method	F1	Recall	Specificity	Validity	Grade	Effectiveness	Alertness

Metrics for Safety Judgment

F1: Overall performance on identifying risks and make safety judgments.

Recall: The ratio of successful judgments on unsafe samples, indicating model performance on unsafe samples.

Specificity: The ratio of successful judgments on safe samples, indicating model performance on safe samples.

Validity: The ratio of samples that the model successfully outputs a single label 'unsafe' or 'safe' as an answer.

Metrics for Risk Identification

Grade: Overall performance on risk identification, addition of Effectiveness and Alertness.

Effectiveness: Model awareness of how the agent causes safety risks, i.e. the relevance between model-generated analysis and the human-annotated risk description.

Alertness: Model awareness of whether there exist risks.