DLP Systems: Models, Architecture and Algorithms
Liwei Ren, Ph.D, Sr. Architect
Data Security Research, Trend Micro™
May, 2013, UCSC, Santa Cruz, CA
Classification 8/2/2013 Copyright 2011 Trend Micro Inc. 1
Background:
• Liwei Ren, Data Security Research, Trend Micro™
– Research interests:
• DLP, differential compression, data de-duplication, file transfer protocols, database
security, and practical algorithms.
– Education:
• MS/BS in mathematics, Tsinghua University, Beijing
• Ph.D in mathematics, MS in information science, University of Pittsburgh
– Relevant works for this talk:
• Provilla, Inc : a startup focusing on endpoint based DLP products and solutions. It was
co-founded by Liwei and acquired by Trend Micro a few years ago.
• Patents: Liwei holds 10+ patents for DLP, mostly for DLP content inspection
techniques.
• Trend Micro™
– Global security software company with headquarters in Tokyo, and R&D centers in
Nanjing, Taipei and Silicon Valley.
– One of the top 3 anti-malware vendors
– Pioneer in cloud security
– DLP vendor via Provilla™ acquisition
Agenda
• What is Data Loss Prevention (DLP) ?
• Concepts, Models, Architecture
• Content Inspection Problems
• Practical Algorithms for DLP
• Summary
• References
• Q&A
What Is Data Loss Prevention?
• What is Data Loss Prevention?
– Data loss prevention (DLP) is a data security technology that detects
data breach incidents in a timely manner and prevents them by monitoring
data in-use (endpoints), in-motion (network traffic), and at-rest (data
storage) in an organization’s network.
– Also known as Data Leak Prevention (DLP), Information Leak Prevention (ILP), or
Information Leak Detection and Prevention (ILDP).
What Is Data Loss Prevention?
• A Few Elements of a DLP system:
– WHAT data to protect?
– WHO leaks data?
– HOW is the data leaked?
– WHERE to protect data?
– WHAT actions to take?
Concepts, Models and Architecture
• WHAT data to protect?
• WHO causes data leaks?
External Hackers
Concepts, Models and Architecture
Three Data States:
Concepts, Models and Architecture
• Data-in-use:
• Data-in-motion:
Concepts, Models and Architecture
• Data-at-rest at risk:
Concepts, Models and Architecture
• DLP for data-in-use and data-in-motion:
• A conceptual view!
Concepts, Models and Architecture
• DLP for data-in-use and data-in-motion:
• A technical view!
Concepts, Models and Architecture
• DLP Model for data-in-use and data-in-motion:
– If DATA flows from SOURCE to DESTINATION via CHANNEL, the
system takes ACTIONs
– DATA specifies what confidential data is
– SOURCE can be a user, an endpoint, an email
address, or a group of them
– DESTINATION can be an endpoint, an email address,
or a group of them, or simply the external world
– CHANNEL indicates the data leak channel such as
USB, email, network protocols, etc.
– ACTION is the action that needs to be taken by the
DLP system when an incident occurs
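The model above can be read as a rule table. The following is a minimal sketch, assuming a hypothetical `Rule`/`Event` representation and a `"*"` wildcard for "the external world"; none of these names come from the slides or from any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    data: str                # name of a sensitive-data definition (hypothetical)
    sources: frozenset       # users, endpoints, or email addresses
    destinations: frozenset  # endpoints, addresses, or "*" for the external world
    channels: frozenset      # e.g. {"usb", "email"}
    actions: tuple           # e.g. ("block", "log")


@dataclass(frozen=True)
class Event:
    data: str
    source: str
    destination: str
    channel: str


def _covers(allowed, value):
    # "*" is an assumed wildcard meaning "anything".
    return "*" in allowed or value in allowed


def actions_for(event, rules):
    """Return the actions of every rule whose DATA, SOURCE, DESTINATION and CHANNEL match."""
    out = []
    for rule in rules:
        if (rule.data == event.data
                and _covers(rule.sources, event.source)
                and _covers(rule.destinations, event.destination)
                and _covers(rule.channels, event.channel)):
            out.extend(rule.actions)
    return out
```

A rule such as `Rule("source-code", frozenset({"alice"}), frozenset({"*"}), frozenset({"usb", "email"}), ("block", "log"))` then reads: if source code flows from alice to anywhere via USB or email, block and log.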
Concepts, Models and Architecture
• DLP for data-at-rest:
Concepts, Models and Architecture
• DLP Model for data-at-rest:
– If DATA resides at SOURCE , the system takes ACTIONs
– DATA specifies what the sensitive data (which has
potential for leakage) is
– SOURCE can be an endpoint, a storage server or a
group of them
– ACTION is the action that needs to be taken by the
DLP system when confidential data is identified at
rest.
Concepts, Models and Architecture
• Typical DLP systems:
– DLP Management Console
– DLP Endpoint Agent
– DLP Network Gateway
– Data Discovery Agent (or Appliance)
Concepts, Models and Architecture
• Typical DLP system architecture:
Content Inspection Problems
• Two fundamental problems for a DLP system:
• It is a pair of problems that always come together:
• One determines data sensitivity based on what has been
defined.
Content Inspection Problems
• Four typical approaches for <defining, determining>
sensitive data in a DLP system:
1. Document fingerprinting
2. Database record fingerprinting
3. Multiple Keyword matching
4. Regular expression matching
Content Inspection Problems
• Document fingerprinting:
• A technique for identifying modified versions of known documents
• Problem Definition (Model 1):
– Let S= { T1, T2, …,Tn} be a set of known texts
– Given a query text T, one needs to determine if there exists at least one
document t ∈ S such that T and t share significant common textual content,
where multiple returned documents are ranked by how
much common content is shared.
Content Inspection Problems
• An alternative model (Model 2):
– Let S= { T1, T2, …,Tn} be a set of known texts
– Given a query text T and X%, one needs to determine if there exists at
least one text t ∈ S such that SIM(T,t) ≥ X%, where SIM is a function to
measure the similarity between two texts.
• Multiple documents are ranked by their similarity percentages.
Content Inspection Problems
• Database record fingerprinting:
– A technique for identifying sensitive data records within a text.
– A.k.a., Exact Match in DLP field
• Use Case:
– Several personal data records of <SSN, Phone#, address>
are included in a text. We want to extract all records from the
text to determine the sensitivity of the file.
Content Inspection Problems
SSN         | Phone #      | Address
178-76-6754 | 412-876-6789 | 43 Atword Street, Pittsburgh, PA 15260
159-87-8965 | 408-780-8876 | 76 Parkview Ave, Sunnyvale, CA 94086
……          | ……           | ……
An example: a text contains a few data records:
Hhhhhdds ghghg 178-76-6754 ggkjkfddfdkkkk879-45-6785kjkjjk 43
Atword Street, Pittsburgh, PA 15260 kllkll 412-876-6789 kjkjjkj 76
Parkview Ave, Sunnyvale, CA 94086 hhsjskkdhjhjhj 408-780-8876
hjhjkjkjjj 159-87-8965 hjhjhjhjmnnmnxcbls w243 54y45 wefddew
dddw3n nn xxxxxxxxxx
Content Inspection Problems
• Problem Definition (Model 3) :
– Let S= { R1, R2, …,Rn} be a set of known data records from the same table.
– Given any text T, one needs to extract all records or sub-records from T
while the record cells may appear randomly within the text.
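To make Model 3 concrete, here is a deliberately naive sketch. It assumes records are tuples of cell strings and finds cells by plain substring search after whitespace normalization; it is a stand-in for illustration, not the fingerprint-based method the talk refers to, and the names `find_records` and `min_cells` are hypothetical. A scalable version would hash the cells and index them instead of scanning every record.

```python
import re


def normalize(s):
    """Collapse runs of whitespace so a cell split across lines still matches."""
    return re.sub(r"\s+", " ", s).strip()


def find_records(text, records, min_cells=2):
    """Return (record_index, matched_cells) for each record with >= min_cells cells in text.

    Cells may appear anywhere in the text and in any order; a result with fewer
    cells than the record has is a sub-record.
    """
    flat = normalize(text)
    hits = []
    for idx, record in enumerate(records):
        matched = tuple(cell for cell in record if normalize(cell) in flat)
        if len(matched) >= min_cells:
            hits.append((idx, matched))
    return hits
```

Applied to the two sample records and the sample text from the previous slide, both records come back with all three cells, while the unknown number 879-45-6785 is ignored.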
Content Inspection Problems
• Problem Definition for Keyword Match:
– Let S= {K1,K2,…,Kn} be a dictionary of keywords.
– Given any text T, one needs to identify all keyword occurrences in T.
• Problem Definition for RegEx Match:
– Let S= {P1,P2,…,Pm} be a set of RegEx patterns.
– Given any text T, one needs to identify all pattern instances from T.
Easy problems?
– Not at all! For large n and m, performance
becomes an issue.
– That’s the problem of scalability.
– Scalable algorithms must be provided.
Practical Algorithms for DLP
• We investigate some algorithms for 2 problems:
1. Document fingerprinting
2. Multiple keyword matching
Assumption: without loss of generality, a text T is a sequence of
UTF-8 characters.
Document Fingerprinting Algorithms
• Let's investigate algorithmic solutions for Model 2 (document
fingerprinting).
• Analysis for Solution:
1. We need to construct the function SIM(T,t). For example:
– SIM(T,t) = |T ∩ t| / Min(|T|,|t|) based on common sub-strings.
2. An Obvious Challenge:
– If n is large, say, on the scale of millions, we cannot compute SIM(T, Tk) one by one
to find the t that satisfies SIM(T,t) ≥ X%.
– We need to figure out an approach that can identify a possible candidate quickly.
3. General search engines like Google use keywords to index/identify
the documents. Should we? No: there are too many keywords, and
keywords are language dependent.
4. So, which features can we use for indexing/searching?
– One answer is document fingerprints.
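As one possible reading of the SIM formula in item 1, the sketch below represents each text by its set of length-k substrings (k-grams) and takes |T ∩ t| as the size of the set intersection. The choice of k-grams and the default k=8 are assumptions made for illustration; the slides only say that SIM is based on common sub-strings.

```python
def kgrams(text, k=8):
    """Set of all length-k substrings of text (empty if text is shorter than k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def sim(T, t, k=8):
    """|T ∩ t| / min(|T|, |t|), with each text represented by its k-gram set."""
    a, b = kgrams(T, k), kgrams(t, k)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

Dividing by the smaller set means a short text that is fully contained in a longer one scores 100%, which suits leak detection, where a small excerpt of a protected document is still sensitive.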
Document Fingerprinting Algorithms
• What are document fingerprints?
– A fingerprint is a hash value
– One text has multiple fingerprints
– Unique to the text: two unrelated texts do not share any fingerprints.
– Robustness: it can survive moderate textual changes.
Document Fingerprinting Algorithms
• How to extract fingerprints from a text?
– Anchoring point:
• A point in the text that can endure the moderate changes.
• Its neighborhood (of fixed size) is unique to the text
– We select a few anchoring points to generate fingerprints:
• Generate hash values around their neighborhoods.
• These hash values are the fingerprints
• Samples of anchoring points and their neighborhoods:
Thereareabundantliteraturesonhowtogeneratedifferencebetween
twofilesBasicallytherearetwofundamentalapproachestoattackthisgenericp
roblemLCSmodelwhereLCSstandsforlargestcommonsubsequenceCalculate
thelargestcommonsubsequenceoftwostringFindasequenceofeditoperation
sbasedontheLCSsothatonecanapplytheeditoperationstothereferencefiletoc
onstructthetargetfileBlock movemodel
Document Fingerprinting Algorithms
• Conclusion : we have a solution that consists of two
algorithms and one search technology:
– An algorithm for computing SIM(T,t)
– An algorithm for fingerprint generator FPGEN(T)
– Fingerprint search engine
Document Fingerprinting Algorithms
• Fingerprint generation algorithm 1:
– INPUT: String T
• Select top M candidate characters based on a score function
– Character frequency n
– Character positions in the text T: P(1), …, P(n)
– SCORE(c) = SQRT(n) * [P(n)-P(1)] / SQRT(D)
» Where D = [P(2)-P(1)]^2 + [P(3)-P(2)]^2 + … + [P(n)-P(n-1)]^2
• For each selected character c
– Create a hash around the neighborhood at each occurrence
– Sort these hashes
– Select the top N hashes
– These N hashes are fingerprints
– OUTPUT: M*N fingerprints
Note 1: M and N are pre-defined.
Note 2: Two keys of this algorithm are (a) the score function; (b) sorting the hashes.
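The steps of algorithm 1 can be sketched as follows. Several details are assumptions that the slide leaves open: the neighborhood is a fixed `radius` of characters on each side, the hash is a truncated SHA-1, and "top N" is taken to mean the N smallest hash values. The names `score`, `fpgen`, `top_m` and `top_n` are illustrative.

```python
import hashlib
import math
from collections import defaultdict


def score(positions):
    """SQRT(n) * [P(n) - P(1)] / SQRT(D) for one character's sorted positions."""
    n = len(positions)
    if n < 2:
        return 0.0
    d = sum((b - a) ** 2 for a, b in zip(positions, positions[1:]))
    return math.sqrt(n) * (positions[-1] - positions[0]) / math.sqrt(d)


def fpgen(text, top_m=3, top_n=4, radius=4):
    """Return up to top_m * top_n fingerprints (integers) for text."""
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    # Top M characters by score; ties are broken by the character for determinism.
    chosen = sorted(positions, key=lambda c: (-score(positions[c]), c))[:top_m]
    fingerprints = []
    for ch in chosen:
        hashes = set()
        for p in positions[ch]:
            window = text[max(0, p - radius): p + radius + 1]  # neighborhood (assumed shape)
            digest = hashlib.sha1(window.encode("utf-8")).digest()
            hashes.add(int.from_bytes(digest[:8], "big"))
        fingerprints.extend(sorted(hashes)[:top_n])  # "top N" read as N smallest (assumption)
    return fingerprints
```

Sorting the hashes before truncating is what makes the selection stable: an edit elsewhere in the text changes only the hashes of the windows it touches, so most of the N selected values survive.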
Document Fingerprinting Algorithms
• About the score function:
– Why SQRT(n) ?
• Measurement of frequency for the given character
• The larger the value, the more stable the character is
– Why [ P(n)-P(1)] / SQRT(D) ?
• Measurement of distribution for the given character
• The larger the value, the more evenly distributed the character is, and the more
stable the character is;
• WHY? Think about a constrained optimization problem:
– min f(X1, X2, …, Xm) = X1^2 + X2^2 + … + Xm^2
» subject to
» X1 + X2 + … + Xm = c AND
» Xk ≥ 0, k = 1, 2, …, m
Note: The solution of the optimization problem is Xk = c/m, k = 1, 2, …, m
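A short derivation may help connect this optimization problem to the score function. It is a sketch under the identification X_k = P(k+1) - P(k) and m = n - 1 (so that c = P(n) - P(1) and f = D), which is an interpretation of the slide rather than something it states explicitly.

```latex
\begin{align*}
c^2 = \Big(\sum_{k=1}^{m} X_k\Big)^2
  &\le m \sum_{k=1}^{m} X_k^2 = m\,D
  && \text{(Cauchy--Schwarz)} \\
\Rightarrow\quad D &\ge \frac{c^2}{m},
  && \text{with equality iff } X_1 = \dots = X_m = \frac{c}{m} \\
\Rightarrow\quad \frac{P(n)-P(1)}{\sqrt{D}} = \frac{c}{\sqrt{D}}
  &\le \sqrt{m} = \sqrt{n-1}
\end{align*}
```

So for a fixed span c, the ratio [P(n)-P(1)] / SQRT(D) reaches its maximum exactly when the occurrences are evenly spaced, which is why a larger value indicates a more even distribution.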
Document Fingerprinting Algorithms
There are alternative algorithms to construct a
fingerprint generation function.
We recently constructed algorithm 2:
– A novel approach based on rolling hash function
H(x);
– It selects anchoring points with first filter H(x) = 0
mod p;
– It further selects anchoring points with a heuristic
second filter.
– It also employs the asymmetric architecture of
fingerprint match;
Note 1: The anchoring points have better distribution across text.
Note 2: Two keys of this algorithm are (a) Rolling hash;
(b) Asymmetric use of two filters.
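The first filter can be sketched as below. Only the condition H(x) = 0 mod p comes from the slide; the polynomial rolling hash, the window size, and the default values of `p`, `base` and `modulus` are assumptions, and the heuristic second filter and the asymmetric matching are not shown because the slides do not describe them.

```python
def anchor_points(text, window=8, p=16, base=257, modulus=(1 << 61) - 1):
    """Return byte offsets of windows whose rolling hash H satisfies H % p == 0."""
    data = text.encode("utf-8")
    if len(data) < window:
        return []
    top = pow(base, window - 1, modulus)  # base^(window-1), used to drop the oldest byte
    h = 0
    for byte in data[:window]:
        h = (h * base + byte) % modulus
    anchors = [0] if h % p == 0 else []
    for i in range(1, len(data) - window + 1):
        # Remove data[i-1], shift, and append data[i+window-1] in O(1).
        h = ((h - data[i - 1] * top) * base + data[i + window - 1]) % modulus
        if h % p == 0:
            anchors.append(i)
    return anchors
```

Because the hash depends only on the bytes inside the window, inserting or deleting text elsewhere shifts an anchor's offset but does not remove it, and roughly one window in p is selected.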
Multiple Keyword Match
Essentially, it is a multi-pattern
string match problem.
Problem Definition:
– Let S={P1,P2,…,Pk} be multiple short strings as
patterns;
– Given any string T, one needs to identify all pattern
occurrences in T.
Multiple Keyword Match
Existing string match algorithms:
Algorithm                  | Type
Naïve string match         | One pattern
Knuth–Morris–Pratt         | One pattern
Boyer-Moore                | One pattern
Boyer-Moore-Horspool       | One pattern
Boyer-Moore-Horspool-Raita | One pattern
Rabin-Karp                 | Multi-patterns
Aho-Corasick               | Multi-patterns
Wu-Manber                  | Multi-patterns
Multiple Keyword Match
Boyer-Moore-Horspool (BMH) Algorithm
Key elements of the algorithm:
– Character comparison can be made from right to left, starting from the end of
the pattern.
– Ending Character Heuristics
• Consider that we are pointing to character R[i] and try to compare it with the
ending character of P
• Bad character
– If R[i] ≠ P[m] and R[i] is not included in P’s alphabet, then it is safe for the pointer to skip
m positions arriving at R[i+m].
– If R[i] ≠ P[m], R[i] is included in P’s alphabet, and R[i]’s last occurrence within P has
distance q from the end of P, then it is safe for the pointer to skip q positions arriving at
R[i+q].
• Good character
– If R[i] = P[m], P is not matched, and R[i] has no other occurrences within P, then it is safe
for the pointer to skip m positions arriving at R[i+m].
– If R[i] = P[m], P is not matched, and R[i]’s last occurrence other than P[m] has distance q
from the end of P, then it is safe for the pointer to skip q positions arriving at R[i+q].
• Matched instance
– If R[i] =P[m] and P is matched, then save the instance.
– It is almost safe to move the pointer to skip m positions arriving at R[i+m].
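The heuristics above collapse into a single shift table indexed by the text character aligned with the end of the pattern. A minimal sketch follows, using 0-based indices instead of the slide's 1-based R[i] and P[m]; the function name is illustrative. One deliberate difference: after a match this version shifts by the table value rather than by m, so overlapping occurrences are also reported.

```python
def horspool_search(text, pattern):
    """Return the start offsets of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from each character's last occurrence (excluding the final
    # position) to the end of the pattern; characters not in the table skip m.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches = []
    i = m - 1  # text index aligned with the pattern's last character
    while i < n:
        k = 0
        while k < m and text[i - k] == pattern[m - 1 - k]:  # compare right to left
            k += 1
        if k == m:
            matches.append(i - m + 1)
        i += shift.get(text[i], m)
    return matches
```

For the pattern "abra" the table is {a: 3, b: 2, r: 1}, and any other character skips the full 4 positions.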
Multiple Keyword Match
• Rabin-Karp Algorithm
– Hash based string match
• Rabin-Karp hash function H(S):
– For a given string S = x_1 x_2 … x_m with length m, a hash function can be
constructed as:
• H(S) = (x_1 b^(m-1) + x_2 b^(m-2) + … + x_(m-1) b + x_m) mod q
• Where b is a base number, usually we take b=256, and q is a big prime
number.
– For pattern P, H(P) = (p_1 b^(m-1) + p_2 b^(m-2) + … + p_(m-1) b + p_m) mod q
– If we denote R_k = R[k,k+m-1], we can derive H(R_(k+1)) from H(R_k) with
relatively small cost
– H(R_(k+1)) = ([H(R_k) – r_k b^(m-1)] b + r_(k+m)) mod q
– This is an iterative formula which is a common practice for algorithm
optimization
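The iterative formula can be checked with a few lines of code. This sketch assumes the input is a byte string, uses 0-based window indices, and picks q = 1,000,000,007 as an illustrative large prime; b = 256 follows the slide.

```python
B = 256            # base, as suggested on the slide
Q = 1_000_000_007  # a large prime (illustrative choice)


def rk_hash(s):
    """H(S) = (x_1*b^(m-1) + ... + x_m) mod q for a byte string s."""
    h = 0
    for x in s:
        h = (h * B + x) % Q
    return h


def rolling_hashes(r, m):
    """Yield (k, H(R_k)) for every length-m window of the byte string r."""
    if m <= 0 or len(r) < m:
        return
    top = pow(B, m - 1, Q)  # b^(m-1) mod q, computed once
    h = rk_hash(r[:m])
    yield 0, h
    for k in range(len(r) - m):
        # H(R_(k+1)) = ([H(R_k) - r_k * b^(m-1)] * b + r_(k+m)) mod q
        h = ((h - r[k] * top) * B + r[k + m]) % Q
        yield k + 1, h
```

Each step costs a constant number of arithmetic operations regardless of m, whereas recomputing the hash of every window from scratch would cost O(m) per position.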
Multiple Keyword Match
• Rabin-Karp hash function:
– The quantity b^(m-1) mod q can be pre-calculated to save CPU time.
– For each iteration, we only need 5 arithmetic operations.
• It can be further reduced to 4
• One considers the number r_k b^(m-1)
– Horner’s rule
• H(S) = [(…((x_1 b + x_2) b + x_3) b + … + x_(m-1)) b + x_m] mod q
• Yet another formula for performance tuning
Multiple Keyword Match
• Rabin-Karp algorithm for multiple patterns:
– Input:
• String R, multiple patterns {P_1,…,P_k},
• n = Length(R), m_j = Length(P_j), q, b,
– Procedure:
• Step 0:
– Let m = Min(m_j), j = 1,…,k
– Calculate the number b^(m-1) mod q
– Calculate all H(P_j[1,…,m]), j = 1,…,k, and H(R_1) by Horner’s rule
• Step 1: Let i=1
• Step 2:
If there exists j in {1,2,…,k} such that
H(P_j[1,…,m]) = H(R_i) and P_j = R[i,…,m_j+i-1],
it is a match and output the instance
• Step 3: i = i + 1
• Step 4: If i > n-m+1, stop
• Step 5: Calculate H(R_i) from H(R_(i-1)) using the iterative formula.
• Step 6: Go to step 2
– Output: All matched instances
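The procedure above can be sketched as follows. Two choices here are assumptions rather than part of the slide: the prefix hashes are stored in a dictionary so that Step 2 becomes one lookup instead of a scan over all k patterns, and the text is processed as UTF-8 bytes, so reported offsets are 0-based byte offsets.

```python
def rabin_karp_multi(text, patterns, b=256, q=1_000_000_007):
    """Return sorted (byte_offset, pattern) pairs for every pattern occurrence."""
    data = text.encode("utf-8")
    pats = [p.encode("utf-8") for p in patterns if p]
    if not pats:
        return []
    m = min(len(p) for p in pats)  # Step 0: shortest pattern length
    if len(data) < m:
        return []

    def h(s):  # Horner's rule
        v = 0
        for x in s:
            v = (v * b + x) % q
        return v

    table = {}  # hash of each pattern's first m bytes -> patterns with that hash
    for p in pats:
        table.setdefault(h(p[:m]), []).append(p)
    top = pow(b, m - 1, q)  # b^(m-1) mod q

    out = []
    cur = h(data[:m])
    for i in range(len(data) - m + 1):
        for p in table.get(cur, ()):
            if data.startswith(p, i):  # verify to rule out hash collisions
                out.append((i, p.decode("utf-8")))
        if i + m < len(data):
            cur = ((cur - data[i] * top) * b + data[i + m]) % q
    return sorted(out)
```

The explicit comparison after a hash hit is essential: equal hashes only make a match likely, and patterns longer than m share a window hash with any text that merely starts the same way.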
Multiple Keyword Match
A practical hybrid method:
– BMH or Rabin-Karp
– If k < Magic-number,
• Use BMH k times,
• Otherwise, use Rabin-Karp
– Magic-number = 100 is the value from my experience with DLP products.
Rabin-Karp has its weakness:
• when Min({Length(Pi) | i = 1,2,…,k}) is
small, say, less than 4, we have trouble.
• We need to introduce efficient multiple
pattern match for short patterns.
Multiple Keyword Match
We have a complementary solution to the RK algorithm when
handling multiple short patterns
– This is Reverse-trie matching algorithm.
A reverse-trie represents a set of keywords;
it is especially good for CJK languages in
UTF-8 encoding:
root
├─ c ─ b ─ a (abc)
└─ d ─ c ─┬─ b ─ a (abcd)
          └─ a (acd)
The keyword set: {abc,abcd,acd}
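A reverse trie for this keyword set can be sketched with nested dictionaries. The matching loop shown here (treat each text position as a possible keyword end and walk backwards) is an assumption about how the reverse-trie matcher works, since the slide only shows the data structure. The sketch operates on Python characters rather than raw UTF-8 bytes.

```python
def build_reverse_trie(keywords):
    """Insert each keyword back to front; the key None marks a complete keyword."""
    root = {}
    for word in keywords:
        node = root
        for ch in reversed(word):
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def reverse_trie_search(text, root):
    """Return sorted (start_offset, keyword) pairs for every keyword occurrence."""
    found = []
    for end in range(len(text)):
        node = root
        i = end
        # Walk backwards from `end` for as long as the trie has a matching edge.
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            if None in node:
                found.append((i, node[None]))
            i -= 1
    return sorted(found)
```

The walk at each position is bounded by the longest keyword, so short keywords, where Rabin-Karp struggles, are handled in a few steps per character.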
Summary
• What DLP is.
• DLP Security Model
• Architecture of a DLP System
• Four Content Inspection Problems
• Two Algorithms for DLP Content Inspection
– Document Fingerprinting
– Multi-keyword matching
References
• Liwei Ren et al., Document fingerprinting with asymmetric selection of anchor
points, US patent 8359472
• Liwei Ren et al., Two tiered architecture of named entity recognition engine, US
patent 8321434.
• Yingqiang Lin et al., Scalable document signature search engine, US patent
8266150
• Liwei Ren et al., Fingerprint based entity extraction, US patent 7950062
• Liwei Ren et al., Document match engine using asymmetric signature generation,
US patent 7860853
• Liwei Ren et al., Match engine for querying relevant documents, US patent
7747642
• Liwei Ren et al., Matching engine with signature generation, US patent 7516130
Q&A
Any questions?
Thank You!
Innovation is not a part
time job, and it is not even
a full-time job. It’s a life
style.