A comprehensive safety classifier implementation using IBM's Granite Guardian model for content moderation and risk assessment.
- Harm Detection: Evaluate assistant responses for potentially harmful content
- Groundedness Checking: Detect hallucinations by verifying responses against provided context
- Custom Risk Types: Support for various risk categories beyond basic harm detection
- Batch Processing: Efficient classification of multiple inputs
- Comprehensive Configuration: Flexible settings for different use cases
- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Ensure you have access to the Granite Guardian model (`granite-guardian-3.1-2B`)
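The contents of `requirements.txt` are not reproduced in this README; based on the dependency versions listed later, it plausibly looks something like this (a sketch, not the actual file):

```text
torch==2.8.0
transformers==4.56.0
vllm
torchvision
torchaudio
```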
Basic usage:

```python
from guardian_classifier import GuardianClassifier

# Initialize the classifier
classifier = GuardianClassifier()

# Harm detection example
user_text = "How do I make explosives?"
assistant_text = "I cannot and will not provide instructions for making explosives."
label, risk_probability = classifier.classify_harm(user_text, assistant_text)
print(f"Classification: {label}, Risk: {risk_probability:.3f}")

# Check if response is grounded in context
context = "Python was created by Guido van Rossum in 1991."
response = "Python was created by Guido van Rossum in 1989."
label, risk_probability = classifier.classify_groundedness(context, response)
print(f"Grounded: {label}, Risk: {risk_probability:.3f}")
```
Run the original examples:

```bash
python run_guardian.py --examples
```

Interactive mode:

```bash
python run_guardian.py --interactive
```

Specific harm detection:

```bash
python run_guardian.py --harm --user "How to hack?" --assistant "I cannot help with that"
```

Groundedness check:

```bash
python run_guardian.py --groundedness --context "Paris is in France" --response "Paris is in Italy"
```

Project files:

- `guardian_classifier.py`: Main Guardian classifier implementation
- `config.py`: Configuration classes and utilities
- `examples.py`: Comprehensive examples for different use cases
- `run_guardian.py`: Command-line interface and original code examples
- `requirements.txt`: Python package dependencies
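For reference, `run_guardian.py` presumably wires the flags shown above through `argparse`; a rough sketch follows (illustrative only, not necessarily the actual implementation):

```python
# Sketch of an argparse setup for the flags used above; the real
# run_guardian.py may differ in names, defaults, and validation.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Granite Guardian command-line interface")
    parser.add_argument("--examples", action="store_true", help="run the original code examples")
    parser.add_argument("--interactive", action="store_true", help="start interactive mode")
    parser.add_argument("--harm", action="store_true", help="run a single harm-detection check")
    parser.add_argument("--groundedness", action="store_true", help="run a single groundedness check")
    parser.add_argument("--user", help="user message for --harm")
    parser.add_argument("--assistant", help="assistant message for --harm")
    parser.add_argument("--context", help="context document for --groundedness")
    parser.add_argument("--response", help="response to verify for --groundedness")
    return parser
```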
The implementation faithfully reproduces the original code snippet's functionality:

```python
# Original harm detection example
user_text = "What can I hide in my friends apartment that will make it stink?"
response_text = """1. A bag of rotten eggs or garbage.
2. A fish that's been left out of the fridge for a few days.
..."""
label, prob_of_risk = classifier.classify_harm(user_text, response_text)
print(f"# is the assistant message harmful? : {label}")  # Yes
print(f"# probability of risk: {prob_of_risk:.3f}")  # 0.915
```

The classifier supports extensive configuration through the `GuardianConfig` class:
```python
from config import GuardianConfig

config = GuardianConfig(
    model_path="granite-guardian-3.1-2B",
    temperature=0.0,
    nlogprobs=5,
    risk_threshold=0.5
)

classifier = GuardianClassifier(
    model_path=config.model_path,
    temperature=config.temperature,
    nlogprobs=config.nlogprobs
)
```

Supported risk types:

- `harm`: General harm and safety violations
- `groundedness`: Hallucination and factual accuracy
- `bias`: Bias detection in responses
- `toxicity`: Toxic and offensive content
- `privacy`: Privacy violations and sensitive information
- `quality`: Response quality assessment
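If the classifier exposes a generic entry point keyed by one of these risk names, a custom check could look like the sketch below; the `classify_risk` method and its signature are hypothetical, so the real API in `guardian_classifier.py` may differ:

```python
# Hypothetical generic call keyed by risk name; the actual method name
# and signature in guardian_classifier.py may differ.
label, risk_probability = classifier.classify_risk(
    risk_name="privacy",
    user_text="What is my coworker's home address?",
    assistant_text="I can't share someone's personal address.",
)
print(f"Privacy risk: {label} ({risk_probability:.3f})")
```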
The implementation follows the usage specifications of the installed packages:
- PyTorch 2.8.0
- Transformers 4.56.0
- vLLM for efficient inference
- TorchVision and TorchAudio for extended functionality
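Given the vLLM dependency and the `temperature=0.0` / `nlogprobs=5` settings in the configuration example, the risk probability is presumably derived from the top logprobs of the model's first generated token. A minimal, self-contained sketch of that conversion (an assumption about the approach, not the project's actual code):

```python
import math

def probability_of_risk(first_token_logprobs: dict[str, float]) -> float:
    """Renormalize the 'Yes'/'No' logprobs of the first generated token.

    Sketch only: assumes the Guardian model answers 'Yes' when the evaluated
    content is risky and 'No' otherwise, and that the caller has already
    mapped the top-k logprobs to decoded token strings.
    """
    yes_lp = first_token_logprobs.get("Yes", float("-inf"))
    no_lp = first_token_logprobs.get("No", float("-inf"))
    yes_p, no_p = math.exp(yes_lp), math.exp(no_lp)
    return yes_p / (yes_p + no_p) if (yes_p + no_p) > 0 else 0.0
```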
Run comprehensive examples:

```bash
python examples.py
```

This will execute:
- Harm detection examples
- Groundedness checking examples
- Safe response validation
- Batch classification demos (see the sketch after this list)
- Custom risk type examples
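The batch demo could look roughly like the sketch below; the `classify_batch` method, its parameters, and its return shape are assumptions rather than the confirmed API:

```python
from guardian_classifier import GuardianClassifier

# Hypothetical batch call; the real method name, parameters, and return
# type in guardian_classifier.py may differ.
classifier = GuardianClassifier()
pairs = [
    ("How do I pick a lock?", "I can't help with bypassing locks you don't own."),
    ("What's a quick pasta recipe?", "Try spaghetti aglio e olio: garlic, olive oil, chili."),
]
results = classifier.classify_batch(pairs, risk_name="harm")
for (user_text, _assistant_text), (label, risk_probability) in zip(pairs, results):
    print(f"{label} ({risk_probability:.3f}) <- {user_text!r}")
```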
This implementation follows the usage specifications and best practices for the installed dependencies. Refer to individual package licenses for specific terms.