Leaderboard - R-Judge

Evaluation of different models on R-Judge dataset.

You are invited to contribute your results to the R-Judge Leaderboard. Please send your result scores to this email or open a new issue at the github repository.

# Model Method F1 Recall Specificity Validity Grade Effectiveness Alertness

Metrics for Safety Judgment

  • F1: Overall performance on identifying risks and make safety judgments.
  • Recall: The ratio of successful judgments on unsafe samples, indicating model performance on unsafe samples.
  • Specificity: The ratio of successful judgments on safe samples, indicating model performance on safe samples.
  • Validity: The ratio of samples that the model successfully outputs a single label 'unsafe' or 'safe' as an answer.
  • Metrics for Risk Identification

  • Grade: Overall performance on risk identification, addition of Effectiveness and Alertness.
  • Effectiveness: Model awareness of how the agent causes safety risks, i.e. the relevance between model-generated analysis and the human-annotated risk description.
  • Alertness: Model awareness of whether there exist risks.