Maintain separate databases for different goodware sources:
- good-strings-part1.db through part11.db - Downloaded databases
- good-strings-local.db - Your custom goodware collection
Pros:
- Granular updates (update only changed sources)
- Easier debugging (know which source contains a string)
- Smaller individual files for version control
Cons:
- Slightly slower startup (more file reads)
- More files to manage
Combine all databases into a single file:
yargen-util merge -o good-strings-all.db dbs/good-strings-*.db
Pros:
- Single file to deploy
- Faster startup (one file read)
- Simpler distribution
Cons:
- Lose source attribution
- Must regenerate entire database for updates
- Larger file for version control
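To make the trade-off concrete, here is a minimal Python sketch of what a merge conceptually involves. It assumes each database is a gzipped JSON object mapping a string to an occurrence count. That layout and the helper names `load_db` and `merge_dbs` are assumptions for illustration, not yargen-util's actual implementation.

```python
import gzip
import json
from collections import Counter


def load_db(path):
    """Read one gzipped JSON database (assumed layout: {string: count})."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def merge_dbs(paths, out_path):
    """Sum counts across databases and write one merged gzipped JSON file."""
    merged = Counter()
    for path in paths:
        # Counter.update adds counts for keys that already exist
        merged.update(load_db(path))
    with gzip.open(out_path, "wt", encoding="utf-8") as fh:
        json.dump(merged, fh)
    return merged
```

The merged mapping keeps only the string and its combined count, which is why source attribution is lost once the per-source files are discarded.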
All databases are loaded into memory as hash maps:
- ~8.6M strings ≈ 200-300 MB RAM
- ~200K opcodes ≈ 50-100 MB RAM
With 24 database files:
- SSD: ~2-5 seconds
- HDD: ~10-30 seconds
Merged single file:
- ~30% faster load time
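These figures depend on hardware, so it may be worth measuring on your own system. The following is a rough sketch, assuming the databases are gzipped JSON objects. `time_load` is a hypothetical helper, not part of yarGen, and it times only decompression and JSON parsing, not whatever else yarGen does at startup.

```python
import glob
import gzip
import json
import time


def time_load(pattern):
    """Load every database matching `pattern`.

    Returns (entries, seconds). Entries are summed per file and are not
    deduplicated across files.
    """
    start = time.perf_counter()
    total = 0
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            total += len(json.load(fh))
    return total, time.perf_counter() - start
```

Calling `time_load("dbs/good-*.db")` before and after a merge gives a like-for-like comparison of the two layouts.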
# Create new database
yargen-util create -g /opt/goodware -i local -opcodes
# This creates:
# - dbs/good-strings-local.db
# - dbs/good-opcodes-local.db
# Add more samples to existing database
yargen-util append -g /new/goodware -i local -opcodes
- Quality over quantity: Better to have 10K clean files than 100K questionable ones
- Diversity: Include various Windows versions, software types, architectures
- No malware: Strictly clean files - any malware strings pollute the database
- Version control: Track what's in your collection
# List all databases and sizes
yargen-util list
# Inspect specific database
yargen-util inspect dbs/good-strings-local.db -top 20
# Compare databases
yargen-util inspect db1.db -top 0 > /tmp/db1.txt
yargen-util inspect db2.db -top 0 > /tmp/db2.txt
diff /tmp/db1.txt /tmp/db2.txt
# Download latest pre-built databases
yargen-util update
- Databases are gzipped JSON - highly compressible
- Version control works, but the gzipped files look binary to diff tools; decompress them first to get a text-based, diffable view
- Keep backups before major updates
If yarGen fails to load databases:
- Check file permissions
- Verify gzip integrity:
gunzip -t file.db
- Check JSON validity:
zcat file.db | python -m json.tool > /dev/null
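The three checks above can also be combined into one script, which is handy when many database files need verifying. This is a minimal sketch that relies only on the gzipped-JSON format described earlier. `check_db` and its messages are hypothetical and do not mirror yarGen's own error handling.

```python
import gzip
import json
import zlib


def check_db(path):
    """Return 'ok', or a short description of why the file fails to load."""
    try:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            json.load(fh)
    except PermissionError:
        return "permission denied"
    except FileNotFoundError:
        return "file not found"
    except (gzip.BadGzipFile, EOFError, zlib.error):
        # Bad magic bytes, CRC mismatch, or a stream cut off early
        return "gzip stream is corrupt or truncated"
    except (json.JSONDecodeError, UnicodeDecodeError):
        return "decompressed data is not valid JSON"
    return "ok"
```

Running it over every file in `dbs/` quickly narrows a load failure down to a single database.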
If rules have false positives:
- Check if strings exist in database:
yargen-util inspect
- Consider creating custom scoring rules
- Add more goodware samples to cover the strings
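When chasing a specific false positive, it can help to see which database files, if any, already contain the offending string. The sketch below assumes each database is a gzipped JSON object keyed by string with a count as the value. That layout and the helper `find_string` are assumptions, not a yargen-util feature.

```python
import glob
import gzip
import json


def find_string(needle, pattern):
    """Return {db_path: count} for each database that contains `needle`."""
    hits = {}
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            db = json.load(fh)
        if needle in db:
            hits[path] = db[needle]
    return hits
```

An empty result for something like `find_string("kernel32.dll", "dbs/good-strings-*.db")` suggests the string is not covered yet, so adding goodware samples that contain it is the likely fix. Per-file results are only meaningful with the separate-database layout.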
If yarGen is slow:
- Use SSD for database storage
- Consider merging databases
- Reduce database size (remove old/unused entries)
- Add more RAM (databases are memory-resident)