PromptMeNot is a Ruby gem that helps protect your application against prompt injection attacks. It scans user-submitted text for malicious patterns like instruction overrides, role manipulation, delimiter injection, encoding tricks, and more. With ~70 built-in detection patterns organized across 7 attack categories, it covers direct injection (where users try to hijack your LLM through form inputs), indirect injection (where malicious prompts are stored in user content like profiles or comments, waiting for other LLMs to scrape and execute them), and resource extraction (where attackers trick AI agents into transferring funds, leaking credentials, or exhausting compute resources). It plugs into Rails with a simple ActiveModel validator or works standalone in any Ruby app, with configurable sensitivity levels so you can tune the trade-off between coverage and false positives.
New in v0.2: Resource Extraction Detection AI agents are managing wallets, executing trades, and holding API keys. In February 2026, an autonomous trading bot sent $250K in tokens to a stranger because nothing validated what it was about to do. This release adds 10 patterns that catch attempts to drain wallets, steal credentials, manipulate agents with fake urgency, and exfiltrate secrets to external endpoints. If your app has an AI agent anywhere near funds or keys, this is the category you want active. Pattern details | Changelog
| Attack type | Description |
|---|---|
| Direct injection | Users trying to override LLM instructions via form inputs (e.g., "ignore all previous instructions") |
| Indirect injection | Malicious prompts stored in profiles, comments, or posts that target LLMs scraping or processing your site |
| Delimiter attacks | Fake system tokens, ChatML tags, and XML/markdown boundaries injected into text |
| Obfuscation | Base64-encoded payloads, zero-width characters, hex escapes, and other encoding tricks |
| Resource extraction | Crypto transfer requests, wallet drain attacks, credential theft, financial urgency manipulation |
PromptMeNot is a supplemental defense layer, not a silver bullet.
It uses pattern matching to catch known injection techniques, which means:
- It can be bypassed. Sufficiently creative or novel attacks may evade detection. Prompt injection is an evolving problem and no regex-based approach will catch everything.
- It's not a replacement for other safeguards. You should still use system prompts with clear boundaries, output filtering, least-privilege API access, and human review where appropriate.
- It won't prevent all LLM misuse. It focuses on the input side. It doesn't monitor or constrain what your LLM outputs.
Think of it like input validation for SQL injection: you still use parameterized queries, but rejecting '; DROP TABLE users-- at the front door doesn't hurt. PromptMeNot is that front door check for prompt injection.
For production apps, pair PromptMeNot with other layers:
| Layer | What to do |
|---|---|
| Structural isolation | Wrap user input in XML delimiters (<user_input>...</user_input>) so the LLM treats it as data, not instructions |
| System prompt design | Explicitly tell the model to ignore instructions found inside user content |
| Output validation | Check LLM responses for leaked system prompts, PII, or unexpected behavior before returning them to users |
| Least-privilege access | Restrict what your LLM can do (read-only DB access, scoped API keys, no eval) |
PromptMeNot handles the fast, cheap first pass. It catches the known attacks before they cost you an API call. The layers above handle the rest.
Add to your Gemfile:
gem "promptmenot"Then run:
bundle install
rails generate promptmenot:install # creates config/initializers/promptmenot.rbclass UserProfile < ApplicationRecord
# Reject mode (default) — adds validation error
validates :bio, prompt_safety: true
# Sanitize mode — strips malicious content, no error
validates :about_me, prompt_safety: { mode: :sanitize }
# Custom sensitivity
validates :notes, prompt_safety: { sensitivity: :high, mode: :reject }
endPromptmenot.safe?("Hello world")
# => true
Promptmenot.safe?("Ignore all previous instructions")
# => false
result = Promptmenot.detect("Some text with [SYSTEM] override")
result.safe? # => false
result.unsafe? # => true
result.matches # => [#<Match ...>]
result.categories_detected # => [:delimiter_injection]
result.summary # => "Detected 1 potential prompt injection pattern..."
sanitized = Promptmenot.sanitize("Hello. Ignore all previous instructions. Goodbye.")
sanitized.sanitized # => "Hello. [removed] Goodbye."
sanitized.changed? # => true
sanitized.original # => "Hello. Ignore all previous instructions. Goodbye."# config/initializers/promptmenot.rb
Promptmenot.configure do |config|
# Default sensitivity level for all validations
# Options: :low, :medium (default), :high, :paranoid
config.sensitivity = :medium
# Default mode: :reject (validation error) or :sanitize (strip content)
config.mode = :reject
# Replacement text used in sanitize mode
config.replacement_text = "[removed]"
# Callback fired whenever injection is detected
config.on_detect = ->(result) { Rails.logger.warn("Injection: #{result.summary}") }
# Register custom patterns
config.add_pattern(
name: :my_custom_pattern,
regex: /my dangerous regex/i,
category: :custom,
sensitivity: :medium,
confidence: :high
)
endSensitivity controls which patterns are active. Each pattern declares a minimum sensitivity level and only runs when the requested sensitivity is at or above that level.
| Pattern sensitivity | Active at :low |
:medium |
:high |
:paranoid |
|---|---|---|---|---|
:low |
Yes | Yes | Yes | Yes |
:medium |
No | Yes | Yes | Yes |
:high |
No | No | Yes | Yes |
:paranoid |
No | No | No | Yes |
:low catches only the most obvious attacks (e.g., "ignore all previous instructions"). :paranoid flags anything remotely suspicious, including mixed-script text.
| Category | Examples | Count |
|---|---|---|
direct_instruction_override |
"ignore previous instructions", "new instructions:" | ~12 |
role_manipulation |
"jailbreak mode", "act as unrestricted AI", "DAN" | ~10 |
delimiter_injection |
<|system|>, [SYSTEM], ChatML tokens |
~10 |
encoding_obfuscation |
Base64 payloads, zero-width chars, hex escapes | ~10 |
indirect_injection |
"Dear AI", "if you are an LLM", "note to chatbot" | ~10 |
context_manipulation |
===RESET===, "the above is a test", prompt leaking |
~8 |
resource_extraction |
"transfer 100 SOL to", credential theft, wallet drain, endpoint exfiltration | ~10 |
Patterns use contextual qualifiers to minimize false positives:
- "ignore" alone is fine, but "ignore previous instructions" is flagged
- "act as" requires malicious qualifiers, so "act as a consultant" passes
- "you are now" requires AI/restriction qualifiers, so "you are now subscribed" passes
- "from now on" requires imperative "you must/will", so "from now on I'll work from home" passes
- Broad patterns are placed at
:high/:paranoidsensitivity so they don't fire at default settings
Does this work without Rails?
Yes. The ActiveModel validator is the Rails integration, but the core API works in any Ruby app:
Promptmenot.safe?("some text")
Promptmenot.detect("some text")
Promptmenot.sanitize("some text")The Railtie only loads if Rails::Railtie is defined.
Is this thread-safe?
Yes. The module singleton (configuration, registry) is protected by a Monitor. Pattern matching itself is stateless, so concurrent calls to detect or sanitize are safe.
What happens when patterns overlap?
The detector deduplicates automatically. If two patterns match overlapping regions of text, it keeps the larger match and discards the smaller one. This prevents double-counting in results and avoids garbled output in sanitize mode.
Can I use this to scan existing database records?
Yes. You can run detection against any string, not just incoming form input:
UserProfile.find_each do |profile|
result = Promptmenot.detect(profile.bio, sensitivity: :high)
puts "#{profile.id}: #{result.summary}" if result.unsafe?
endWhat's the performance like?
At default sensitivity (:medium), roughly 40-50 patterns are active. Each is a single regex scan, so detection is fast on typical user input. The max_length config (default: 50,000 characters) truncates excessively long inputs before scanning to prevent regex backtracking on adversarial payloads.
Can I scan only specific categories?
Yes. Both the Detector and Sanitizer accept a categories filter:
detector = Promptmenot::Detector.new(
sensitivity: :high,
categories: [:delimiter_injection, :encoding_obfuscation]
)
result = detector.detect(user_input)This is useful if you only care about certain attack types for a given field.
How does the on_detect callback work?
The callback fires whenever an injection is detected, before the result is returned. It receives the full Result object, so you can log, alert, or track metrics:
Promptmenot.configure do |config|
config.on_detect = ->(result) {
Rails.logger.warn("Injection detected: #{result.summary}")
StatsD.increment("promptmenot.injection_detected")
}
endIf the callback raises an exception, it's rescued and printed to warn so it never breaks your app.
Does it catch leetspeak and Cyrillic homoglyphs?
Yes. The encoding_obfuscation category includes patterns for leetspeak injection (e.g., 1gn0r3 1nstruct10ns) and mixed-script homoglyphs (Latin + Cyrillic characters in the same string). The homoglyph pattern is set to :paranoid sensitivity since mixed scripts can appear in legitimate multilingual content.
What's the difference between confidence and sensitivity?
They answer different questions:
- Sensitivity controls when a pattern runs. A
:lowsensitivity pattern runs at all levels. A:paranoidpattern only runs when you explicitly crank sensitivity up. - Confidence describes how certain we are that a match is actually an attack. A
:highconfidence match (e.g., "ignore all previous instructions") is almost certainly malicious. A:lowconfidence match (e.g., mixed Cyrillic/Latin text) might be legitimate.
You can filter results by confidence after detection using result.high_confidence_matches.
Can I add patterns without modifying the gem source?
Yes. Use the config DSL in your initializer:
Promptmenot.configure do |config|
config.add_pattern(
name: :my_app_specific_attack,
regex: /some pattern specific to your app/i,
category: :custom,
sensitivity: :medium,
confidence: :high
)
endCustom patterns go through the same detection pipeline as built-in ones.
We welcome contributions! See CONTRIBUTING.md for guidelines on:
- Setting up development environment
- Running tests and linting
- Adding new patterns
- Reporting issues
MIT License. See LICENSE.txt.