Leaderboard - R-Judge
Evaluation of different models on R-Judge dataset.
You are invited to contribute your results to the R-Judge Leaderboard.
Please send your result scores to this email or open a new issue at the github repository.
# |
Model |
Method |
F1 |
Recall |
Specificity |
Validity |
Grade |
Effectiveness |
Alertness |
Metrics for Safety Judgment
F1: Overall performance on identifying risks and make safety judgments.
Recall: The ratio of successful judgments on unsafe samples, indicating model performance on unsafe samples.
Specificity: The ratio of successful judgments on safe samples, indicating model performance on safe samples.
Validity: The ratio of samples that the model successfully outputs a single label 'unsafe' or 'safe' as an answer.
Metrics for Risk Identification
Grade: Overall performance on risk identification, addition of Effectiveness and Alertness.
Effectiveness: Model awareness of how the agent causes safety risks, i.e. the relevance between model-generated analysis and the human-annotated risk description.
Alertness: Model awareness of whether there exist risks.