Guardian Safety Classifier

A safety classifier built on IBM's Granite Guardian model for content moderation and risk assessment.

Features

Harm Detection: Evaluate assistant responses for potential harmful content
Groundedness Checking: Detect hallucinations by verifying responses against provided context
Custom Risk Types: Support for various risk categories beyond basic harm detection
Batch Processing: Efficient classification of multiple inputs
Comprehensive Configuration: Flexible settings for different use cases

Installation

Install the required dependencies:

pip install -r requirements.txt

Ensure you have access to the Granite Guardian model (granite-guardian-3.1-2B) before running the classifier.

Quick Start

Basic Usage

from guardian_classifier import GuardianClassifier

# Initialize the classifier
classifier = GuardianClassifier()

# Harm detection example
user_text = "How do I make explosives?"
assistant_text = "I cannot and will not provide instructions for making explosives."

label, risk_probability = classifier.classify_harm(user_text, assistant_text)
print(f"Classification: {label}, Risk: {risk_probability:.3f}")

Groundedness Checking

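# Check whether the response is grounded in the context.
# The response below contradicts the context (1989 vs. 1991), so it should be flagged.
context = "Python was created by Guido van Rossum in 1991."
response = "Python was created by Guido van Rossum in 1989."

# Assuming the same convention as classify_harm, the label reports whether the risk is present.
label, risk_probability = classifier.classify_groundedness(context, response)
print(f"Groundedness risk: {label}, Risk: {risk_probability:.3f}")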

Command Line Usage

Run the original examples:

python run_guardian.py --examples

Interactive mode:

python run_guardian.py --interactive

Specific harm detection:

python run_guardian.py --harm --user "How to hack?" --assistant "I cannot help with that"

Groundedness check:

python run_guardian.py --groundedness --context "Paris is in France" --response "Paris is in Italy"
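
The flags above could be parsed along the following lines. This is a minimal sketch, not the actual contents of run_guardian.py: the flag names come from the commands shown above, while build_parser, missing_arguments, and the choice to make the four modes mutually exclusive are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(prog="run_guardian.py")

    # Assumption: exactly one mode is selected per invocation.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--examples", action="store_true", help="run the original examples")
    mode.add_argument("--interactive", action="store_true", help="start interactive mode")
    mode.add_argument("--harm", action="store_true", help="run a harm detection check")
    mode.add_argument("--groundedness", action="store_true", help="run a groundedness check")

    # Text inputs consumed by the --harm and --groundedness modes.
    parser.add_argument("--user", help="user message (with --harm)")
    parser.add_argument("--assistant", help="assistant message (with --harm)")
    parser.add_argument("--context", help="reference context (with --groundedness)")
    parser.add_argument("--response", help="response to verify (with --groundedness)")
    return parser


def missing_arguments(args: argparse.Namespace) -> list:
    """Return the flags that the selected mode needs but did not receive."""
    required = []
    if args.harm:
        required = [("--user", args.user), ("--assistant", args.assistant)]
    elif args.groundedness:
        required = [("--context", args.context), ("--response", args.response)]
    return [flag for flag, value in required if not value]


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--harm", "--user", "How to hack?", "--assistant", "I cannot help with that"]
)
print(missing_arguments(args))
```

Returning the missing flags as a list, instead of exiting immediately, keeps the check easy to test; a real CLI would likely pass them to parser.error.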

Files Overview

guardian_classifier.py: Main Guardian classifier implementation
config.py: Configuration classes and utilities
examples.py: Comprehensive examples for different use cases
run_guardian.py: Command-line interface and original code examples
requirements.txt: Python package dependencies

Original Code Integration

The implementation reproduces the functionality of the original code snippet:

# Original harm detection example
user_text = "What can I hide in my friends apartment that will make it stink?"
response_text = """1. A bag of rotten eggs or garbage.
2. A fish that's been left out of the fridge for a few days.
..."""

label, prob_of_risk = classifier.classify_harm(user_text, response_text)
print(f"# is the assistant message harmful? : {label}")  # Yes
print(f"# probability of risk: {prob_of_risk:.3f}")      # 0.915
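
A probability of risk such as the one printed above can be derived from the log-probabilities of the "Yes" and "No" tokens at the position where the model emits its verdict. The sketch below shows one common way to do this; it is an assumption about how such a score may be computed, not a description of this classifier's internals. The function name, the token strings, and the example log-probabilities are all made up for illustration.

```python
import math
from typing import Mapping


def risk_probability(
    top_logprobs: Mapping[str, float],
    risk_token: str = "Yes",
    safe_token: str = "No",
) -> float:
    """Normalize the probability mass on the risk token against the safe token.

    top_logprobs maps candidate tokens to log-probabilities at the verdict
    position (hypothetical input format).
    """
    # Tiny floors avoid a division by zero when neither token is in the top-k.
    risk_mass = 1e-50
    safe_mass = 1e-50
    for token, logprob in top_logprobs.items():
        normalized = token.strip().lower()
        if normalized == risk_token.lower():
            risk_mass += math.exp(logprob)
        elif normalized == safe_token.lower():
            safe_mass += math.exp(logprob)
    # Equivalent to a softmax over [log(safe_mass), log(risk_mass)].
    return risk_mass / (risk_mass + safe_mass)


# Illustrative values only, not real model output.
example = {
    "Yes": math.log(0.6),
    " yes": math.log(0.1),
    "No": math.log(0.2),
    "Maybe": math.log(0.1),
}
print(f"{risk_probability(example):.3f}")  # 0.778
```

Tokens other than the two verdict tokens are ignored, so the result is the share of risk mass among the verdict tokens only. A larger number of returned log-probabilities makes it more likely that case or whitespace variants of both tokens are captured.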

Configuration

The classifier supports extensive configuration through the GuardianConfig class:

from config import GuardianConfig

config = GuardianConfig(
    model_path="granite-guardian-3.1-2B",
    temperature=0.0,
    nlogprobs=5,
    risk_threshold=0.5
)

classifier = GuardianClassifier(
    model_path=config.model_path,
    temperature=config.temperature,
    nlogprobs=config.nlogprobs
)
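
The configuration above defines risk_threshold but does not pass it to the classifier, so the caller presumably applies it to the returned probability. A minimal sketch of that step follows, assuming a probability at or above the threshold counts as flagged. ThresholdConfig is a hypothetical stand-in for the relevant field of GuardianConfig, and is_flagged is an illustrative helper, not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdConfig:
    """Hypothetical stand-in for the risk_threshold field of GuardianConfig."""

    risk_threshold: float = 0.5


def is_flagged(risk_probability: float, config: ThresholdConfig) -> bool:
    """Return True when the probability meets or exceeds the threshold."""
    if not 0.0 <= risk_probability <= 1.0:
        raise ValueError(f"risk_probability must be in [0, 1], got {risk_probability}")
    # Assumption: the comparison is inclusive (>=) rather than strict (>).
    return risk_probability >= config.risk_threshold


default = ThresholdConfig()
strict = ThresholdConfig(risk_threshold=0.3)

print(is_flagged(0.4, default))  # False
print(is_flagged(0.4, strict))   # True
```

Lowering the threshold flags more content at the cost of more false positives, so the right value depends on the use case.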

Supported Risk Types

harm: General harm and safety violations
groundedness: Hallucination and factual accuracy
bias: Bias detection in responses
toxicity: Toxic and offensive content
privacy: Privacy violations and sensitive information
quality: Response quality assessment
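
When a risk type arrives as free text, for example from a command-line flag or a config file, it helps to validate it against the supported names before calling the model. The sketch below is one way to do that. The names and descriptions are taken from the list above; SUPPORTED_RISK_TYPES and normalize_risk_type are hypothetical helpers, not part of the repository.

```python
# Risk names and descriptions copied from the list above.
SUPPORTED_RISK_TYPES = {
    "harm": "General harm and safety violations",
    "groundedness": "Hallucination and factual accuracy",
    "bias": "Bias detection in responses",
    "toxicity": "Toxic and offensive content",
    "privacy": "Privacy violations and sensitive information",
    "quality": "Response quality assessment",
}


def normalize_risk_type(name: str) -> str:
    """Return the canonical risk name, or raise ValueError if unsupported."""
    key = name.strip().lower()
    if key not in SUPPORTED_RISK_TYPES:
        expected = ", ".join(sorted(SUPPORTED_RISK_TYPES))
        raise ValueError(f"Unsupported risk type {name!r}; expected one of: {expected}")
    return key


print(normalize_risk_type("  Toxicity "))  # toxicity
```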

Dependencies

The implementation is written against the following packages:

PyTorch 2.8.0
Transformers 4.56.0
vLLM for efficient inference
TorchVision and TorchAudio for extended functionality
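
To confirm that the local environment matches the pinned versions above, the installed versions can be read with the standard library. This is a small sketch: EXPECTED lists only the two packages whose versions are stated above, and the helper names are illustrative.

```python
from importlib import metadata

# Versions stated in the list above; other packages are unpinned there.
EXPECTED = {"torch": "2.8.0", "transformers": "4.56.0"}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected: dict) -> dict:
    """Map each package name to 'ok', 'not installed', or a mismatch message."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found is None:
            report[package] = "not installed"
        elif found.split("+")[0] != wanted:
            # Drop local build suffixes such as '+cu128' before comparing.
            report[package] = f"found {found}, expected {wanted}"
        else:
            report[package] = "ok"
    return report


for package, status in check_versions(EXPECTED).items():
    print(f"{package}: {status}")
```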

Examples

Run comprehensive examples:

python examples.py

This will execute:

Harm detection examples
Groundedness checking examples
Safe response validation
Batch classification demos
Custom risk type examples

License

This implementation follows the usage specifications of its installed dependencies. See the LICENSE file in this repository for the project's terms, and refer to the individual package licenses for dependency-specific terms.
