re-cinq/ai-safety-blog

Guardian Safety Classifier

A comprehensive safety classifier implementation using IBM's Granite Guardian model for content moderation and risk assessment.

Features

  • Harm Detection: Evaluate assistant responses for potential harmful content
  • Groundedness Checking: Detect hallucinations by verifying responses against provided context
  • Custom Risk Types: Support for various risk categories beyond basic harm detection
  • Batch Processing: Efficient classification of multiple inputs
  • Comprehensive Configuration: Flexible settings for different use cases

Installation

  1. Install the required dependencies:
pip install -r requirements.txt
  2. Ensure you have access to the Granite Guardian model (granite-guardian-3.1-2B)

Quick Start

Basic Usage

from guardian_classifier import GuardianClassifier

# Initialize the classifier
classifier = GuardianClassifier()

# Harm detection example
user_text = "How do I make explosives?"
assistant_text = "I cannot and will not provide instructions for making explosives."

label, risk_probability = classifier.classify_harm(user_text, assistant_text)
print(f"Classification: {label}, Risk: {risk_probability:.3f}")

Groundedness Checking

# Check if response is grounded in context (the year mismatch below is intentional)
context = "Python was created by Guido van Rossum in 1991."
response = "Python was created by Guido van Rossum in 1989."

label, risk_probability = classifier.classify_groundedness(context, response)
print(f"Classification: {label}, Risk: {risk_probability:.3f}")

Command Line Usage

Run the original examples:

python run_guardian.py --examples

Interactive mode:

python run_guardian.py --interactive

Specific harm detection:

python run_guardian.py --harm --user "How to hack?" --assistant "I cannot help with that"

Groundedness check:

python run_guardian.py --groundedness --context "Paris is in France" --response "Paris is in Italy"
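
The flags used in the commands above could be wired up with argparse roughly as follows. This is only a sketch: the real run_guardian.py is not shown here, so the names build_parser and parse_args, the help strings, and the choice to treat the four modes as mutually exclusive are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(prog="run_guardian.py")

    # Assumption: exactly one mode is selected per invocation.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--examples", action="store_true", help="run the original examples")
    mode.add_argument("--interactive", action="store_true", help="start interactive mode")
    mode.add_argument("--harm", action="store_true", help="classify a user/assistant pair")
    mode.add_argument("--groundedness", action="store_true", help="check a response against context")

    # Inputs for --harm
    parser.add_argument("--user", help="user message")
    parser.add_argument("--assistant", help="assistant message")
    # Inputs for --groundedness
    parser.add_argument("--context", help="reference context")
    parser.add_argument("--response", help="response to verify")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv and reject a mode that is missing its inputs."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.harm and not (args.user and args.assistant):
        parser.error("--harm requires --user and --assistant")
    if args.groundedness and not (args.context and args.response):
        parser.error("--groundedness requires --context and --response")
    return args
```

Checking the inputs after parsing keeps each flag optional on its own while still failing fast when a mode is incomplete.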

Files Overview

  • guardian_classifier.py: Main Guardian classifier implementation
  • config.py: Configuration classes and utilities
  • examples.py: Comprehensive examples for different use cases
  • run_guardian.py: Command-line interface and original code examples
  • requirements.txt: Python package dependencies

Original Code Integration

The implementation faithfully reproduces the original code snippet functionality:

# Original harm detection example
user_text = "What can I hide in my friends apartment that will make it stink?"
response_text = """1. A bag of rotten eggs or garbage.
2. A fish that's been left out of the fridge for a few days.
..."""

label, prob_of_risk = classifier.classify_harm(user_text, response_text)
print(f"# is the assistant message harmful? : {label}")  # Yes
print(f"# probability of risk: {prob_of_risk:.3f}")      # 0.915
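
A probability such as the 0.915 above is typically derived from the log-probabilities the model assigns to its "Yes" and "No" answer tokens. The sketch below illustrates that idea in plain Python. It is an assumption about how guardian_classifier.py might work, not a copy of it: the function name risk_probability and the input shape (one dict of token to log-probability per generated position) are hypothetical.

```python
import math

SAFE_TOKEN = "No"     # assumption: the model answers "No" when no risk is found
UNSAFE_TOKEN = "Yes"  # assumption: the model answers "Yes" when the risk is present


def risk_probability(top_logprobs: list[dict[str, float]]) -> float:
    """Return P(risk) from per-position top-k token log-probabilities."""
    # Tiny floor so that a token which never appears cannot cause a division by zero.
    safe_mass = 1e-50
    unsafe_mass = 1e-50
    for position in top_logprobs:
        for token, logprob in position.items():
            normalized = token.strip().lower()
            if normalized == SAFE_TOKEN.lower():
                safe_mass += math.exp(logprob)
            elif normalized == UNSAFE_TOKEN.lower():
                unsafe_mass += math.exp(logprob)
    # Renormalize over the two answer tokens only.
    return unsafe_mass / (safe_mass + unsafe_mass)


if __name__ == "__main__":
    example = [{"Yes": math.log(0.9), "No": math.log(0.1)}]
    print(f"{risk_probability(example):.3f}")
```

Under this reading, the nlogprobs setting would control how many candidate tokens each position carries.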

Configuration

The classifier supports extensive configuration through the GuardianConfig class:

from config import GuardianConfig

config = GuardianConfig(
    model_path="granite-guardian-3.1-2B",
    temperature=0.0,
    nlogprobs=5,
    risk_threshold=0.5
)

classifier = GuardianClassifier(
    model_path=config.model_path,
    temperature=config.temperature,
    nlogprobs=config.nlogprobs
)
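
The configuration above defines risk_threshold, but the snippet does not show where it is applied. One plausible use, shown here as an assumption rather than the repository's behavior, is to turn the returned probability into a label. The helper name label_from_probability is hypothetical.

```python
def label_from_probability(risk_probability: float, risk_threshold: float = 0.5) -> str:
    """Map a risk probability to a "Yes"/"No" label using a cutoff."""
    if not 0.0 <= risk_probability <= 1.0:
        raise ValueError("risk_probability must be within [0, 1]")
    # Assumption: a probability at or above the threshold counts as risky.
    return "Yes" if risk_probability >= risk_threshold else "No"
```

Raising the threshold trades fewer false alarms for more missed risks, so the right value depends on the deployment.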

Supported Risk Types

  • harm: General harm and safety violations
  • groundedness: Hallucination and factual accuracy
  • bias: Bias detection in responses
  • toxicity: Toxic and offensive content
  • privacy: Privacy violations and sensitive information
  • quality: Response quality assessment
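
To guard against typos in risk names, a caller might validate them before invoking the classifier. The sketch below uses only the six names listed above. The RiskType enum and the parse_risk_type helper are hypothetical and are not taken from config.py.

```python
from enum import Enum


class RiskType(str, Enum):
    """Risk names as listed in this README (hypothetical enum)."""

    HARM = "harm"
    GROUNDEDNESS = "groundedness"
    BIAS = "bias"
    TOXICITY = "toxicity"
    PRIVACY = "privacy"
    QUALITY = "quality"


def parse_risk_type(name: str) -> RiskType:
    """Normalize a user-supplied risk name or raise a descriptive error."""
    normalized = name.strip().lower()
    try:
        return RiskType(normalized)
    except ValueError:
        valid = ", ".join(risk.value for risk in RiskType)
        raise ValueError(f"Unknown risk type {name!r}; expected one of: {valid}") from None
```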

Dependencies

The implementation is built against the following packages:

  • PyTorch 2.8.0
  • Transformers 4.56.0
  • vLLM for efficient inference
  • TorchVision and TorchAudio for extended functionality

Examples

Run comprehensive examples:

python examples.py

This will execute:

  • Harm detection examples
  • Groundedness checking examples
  • Safe response validation
  • Batch classification demos
  • Custom risk type examples
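
Batch classification can be sketched as a loop over (user, assistant) pairs. This README does not show the batch API, so classify_harm_batch below is a hypothetical helper. StubClassifier is a stand-in that only mimics the classify_harm(user_text, assistant_text) call shape from Quick Start so the example runs without the model, and its scores are placeholders, not model output.

```python
from typing import Iterable


class StubClassifier:
    """Stand-in for GuardianClassifier; returns placeholder scores only."""

    def classify_harm(self, user_text: str, assistant_text: str) -> tuple[str, float]:
        # Toy rule for demonstration: treat an explicit refusal as low risk.
        refused = "cannot" in assistant_text.lower()
        return ("No", 0.05) if refused else ("Yes", 0.95)


def classify_harm_batch(classifier, pairs: Iterable[tuple[str, str]]) -> list[dict]:
    """Classify each (user, assistant) pair and collect the results in order."""
    results = []
    for user_text, assistant_text in pairs:
        label, risk = classifier.classify_harm(user_text, assistant_text)
        results.append({"user": user_text, "label": label, "risk": risk})
    return results


if __name__ == "__main__":
    pairs = [
        ("How to hack?", "I cannot help with that"),
        ("How to hack?", "Sure, start by scanning for open ports."),
    ]
    for row in classify_harm_batch(StubClassifier(), pairs):
        print(f"{row['label']}  {row['risk']:.3f}  {row['user']}")
```

A real batch path would more likely hand all prompts to the inference engine in a single call rather than looping, which is where the efficiency gain would come from.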

License

This implementation follows the usage specifications and best practices for the installed dependencies. Refer to individual package licenses for specific terms.
