An Elasticsearch
Crash Course
Elasticsearch is Everywhere
Why?
Elasticsearch
Jared and Corin @ flickr http://bit.ly/1qHHMPu
Some Use Cases
Searching pieces of pure text (books, legal documents, blog posts)
Searching text + structured data (products, user profiles, application logs)
Pure aggregated data (statistics, metrics, etc.)
Geo Search
Distributed JSON Document DB (Anything)
At a High Level
Is a database, like any other
Document oriented
Clustered
Built on Lucene
Built on an IR foundation
Can perform fancy tricks with inverted indexes and automata
The Basics of the ES API
Getting Data Into ES
Storing a Document
Verb
Index
Type
DocID
curl -XPUT http://localhost:9200/literature/quote/one -d'
{
  "person": "Jack Handy",
  "said": "The face of a child can say it all, especially the mouth part of the face"
}'
Document
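The same request can be sketched in Python. This is a minimal illustration that only builds the request and does not send it; the helper name `build_index_request` is hypothetical, and the URL layout (index / type / document ID) is taken from the curl example above.

```python
import json
from urllib.request import Request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that stores one document."""
    # URL layout follows the slide: /<index>/<type>/<doc id>
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    "http://localhost:9200",
    "literature",
    "quote",
    "one",
    {
        "person": "Jack Handy",
        "said": "The face of a child can say it all, "
                "especially the mouth part of the face",
    },
)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it, assuming a node is actually listening on that address.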
Where does the
document go?
Indexes live in the cluster
Documents live in indexes
[Diagram: a cluster containing three indexes, each holding many documents]
Key Nouns
Documents
A single arbitrary JSON object
Stored as a text blob plus indexes on its fields
Every field gets one or more inverted indexes
{
  "person": "Sam",
  "foods": ["Green eggs", "ham"],
  "likeswith": {
    "place": "house",
    "companion": "mouse",
    "age": 10
  }
}
Types
Defines the schema for documents
Defines indexing rules as well
{
  "human": {
    "properties": {
      "person": {"type": "string"},
      "age": {"type": "integer"}
    }
  }
}
Indexes
Largest building block in ES
Container for documents / types
Composable
Document Storage
Docs
{ _id: 1, person: Jack Handy, said: The face of ... }
{ _id: 2, person: George Eliot, said: Wear a ... }
{ _id: 3, person: Ben Franklin, said: Any fool can ... }
Routing
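Hash-based routing: a hash of the document's routing key (its ID by default) selects the shard
[Diagram: documents entering an index are distributed across Shard 1, Shard 2, Shard 3, and Shard 4]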
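Routing a document to a shard can be sketched as a hash followed by a modulo. This is an illustration only: `pick_shard` is a hypothetical helper, `zlib.crc32` stands in for whatever hash function Elasticsearch uses internally, and shards are numbered from 0 here rather than from 1 as in the diagram.

```python
import zlib


def pick_shard(routing_key, num_primary_shards):
    """Map a routing key (the document ID by default) to a shard number."""
    # crc32 is a stand-in hash for this sketch, not Elasticsearch's own.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


for doc_id in ["one", "two", "three"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 4))
```

Because the shard number depends on the shard count, the same key would land elsewhere if the number of primary shards changed, which is one reason to pick that number carefully up front.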
Inside an Elasticsearch Index
[Diagram: an Elasticsearch index made up of Lucene indexes: Shards 1 through N, each with a primary and Replicas 1 through N]
Each primary or replica shard is a Lucene index
Querying
A Simple Query
Verb
Index
Type
Action
curl -XPOST http://localhost:9200/literature/quote/_search -d'
{
"query": {
"match": {
"person": "jack"}}}'
Search Body
The Search API in Action
Query
{"query": {
"match": {
"person": "jack"}}}
Response (truncated)
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 5,
...
[Diagram: API -> Any Node -> Index -> Shard 1, Shard 2, Shard 3, Shard 4]
Natural Language Search
Everything should run in sublinear time, usually O(log n)
Martin Fisch @ flickr http://bit.ly/1l4sII3
Think of Your Indexes
as Trees
Working with Data in SQL
phrases table
id | phrase
1 | The quick brown fox jumped over the lazy dog
2 | The fat brown dog
3 | Raining cats and dogs
Index on phrase
The fat brown...
The quick brown...
Raining cats and...
SQL Index as a B-Tree
The fat brown.
Raining cats and
The quick brown
Fast Prefix Search
SELECT * FROM phrases
WHERE phrase LIKE 'The%';
Standard BTree-based indexes
are fast at:
Exact matches
Prefix matches
How well does the previous example work given a search for "dog"?
Slow Scan Search
SELECT * FROM phrases
WHERE phrase LIKE '%dog%';
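The contrast between a fast prefix match and a slow substring scan can be sketched with a sorted list standing in for the B-tree. This is an analogy under stated assumptions, not how a database implements its index: `prefix_search` and `substring_scan` are hypothetical helpers, and binary search over a sorted list plays the role of walking the tree.

```python
from bisect import bisect_left

# A sorted list stands in for the B-tree index on the phrase column.
phrases = sorted([
    "The quick brown fox jumped over the lazy dog",
    "The fat brown dog",
    "Raining cats and dogs",
])


def prefix_search(sorted_values, prefix):
    """Jump to the first candidate in O(log n), then read matches in order."""
    start = bisect_left(sorted_values, prefix)
    matches = []
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break  # sorted order means no later value can match
        matches.append(value)
    return matches


def substring_scan(values, needle):
    """No ordering helps here: every value has to be inspected."""
    return [value for value in values if needle in value]


print(prefix_search(phrases, "The"))
print(substring_scan(phrases, "dog"))
```

The prefix search stops as soon as the sorted order rules out further matches, while the substring scan always touches every row.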
An Inverted Index
Terms -> Documents
brown -> 1, 2
dog -> 1, 2
fat -> 2
fox -> 1
jump -> 1
lazi -> 1
over -> 1
quick -> 1
Documents
{"_id": 1, "phrase": "the quick brown fox jumps over the lazy dog"}
{"_id": 2, "phrase": "The fat brown dog"}
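Building an inverted index from the two documents can be sketched in a few lines. This is a simplified illustration: the `tokenize` and `build_inverted_index` helpers are hypothetical, and the tokenizer only lowercases and splits on letters, so terms stay unstemmed ("jumps", "lazy") and stopwords such as "the" are kept.

```python
import re
from collections import defaultdict

docs = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "The fat brown dog",
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


index = build_inverted_index(docs)
print(sorted(index["dog"]))
print(sorted(index["fat"]))
```

A search for "dog" is now a single dictionary lookup on the term rather than a scan over every phrase.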
An Inverted Index as a Tree
Terms
jump
dog
brown
over
fox
fat
lazi
quick
Sequential Scan City
SELECT * FROM phrases
WHERE phrase ILIKE 'dog';
Uses an index!
SELECT * FROM phrases
WHERE LOWER(phrase) = LOWER('dog');
Making the index
CREATE INDEX
lcase_phrase_idx ON
phrases (LOWER(phrase));
Text In, Terms Out
Some kind of Text
ANALYZER
[text, of, kind, some]
Analysis
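The quick brown fox jumps over the lazy dog
Snowball Analyzer
[quick(2), brown(3), fox(4), jump(5), over(6), lazi(8), dog(9)]
(numbers are token positions; both occurrences of the stopword "the" are dropped but still consume a position)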
Stemming and Stopwords
I jump while she jumps and laughs
Snowball Analyzer
[i(1), jump(2), while(3), she(4), jump(5), laugh(7)]
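The combination of stopword removal, stemming, and position tracking can be sketched with a toy analyzer. This is not the Snowball algorithm: `toy_stem` is a hypothetical one-rule stemmer (drop a trailing "s") and `STOPWORDS` is a tiny illustrative list, chosen only so the example sentence above comes out the same way.

```python
STOPWORDS = {"and", "the", "a", "an", "of"}  # tiny illustrative list


def toy_stem(word):
    """Naive stemmer: drop a trailing 's' from longer words (not Snowball)."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def analyze(text):
    """Return (term, position) pairs; positions are 1-based word offsets."""
    terms = []
    for position, word in enumerate(text.lower().split(), start=1):
        if word in STOPWORDS:
            continue  # the stopword is dropped but its position is consumed
        terms.append((toy_stem(word), position))
    return terms


print(analyze("I jump while she jumps and laughs"))
```

Both "jump" and "jumps" end up as the same term, and the gap at position 6 records where "and" used to be, which matters for phrase queries.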
NGrams
news
NGram Analyzer
["n", "e", "w", "s", "ne", "ew", "ws"]
An NGram Search
Query
["n", "e", "w", "ne", "ew"]
Good Match
["n", "e", "w", "s", "ne", "ew", "ws"]
Poor Match
["s", "t", "e", "w", "s", "st", "te", "ew", "ws"]
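How n-gram overlap separates a good match from a poor one can be sketched as follows. The helpers `ngrams` and `overlap` are hypothetical, the gram order (all 1-grams, then all 2-grams) mirrors the slide rather than any particular tokenizer, and the words "new" and "stews" are assumptions inferred from the gram lists above. Real scoring would weigh the shared terms rather than just count them.

```python
def ngrams(text, min_n=1, max_n=2):
    """Return all character n-grams of length min_n..max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


def overlap(query, candidate):
    """Count how many of the query's grams also occur in the candidate."""
    candidate_grams = set(ngrams(candidate))
    return sum(1 for gram in ngrams(query) if gram in candidate_grams)


print(ngrams("news"))
print(overlap("new", "news"), overlap("new", "stews"))
```

Every gram of "new" appears in "news", while "stews" shares only "e", "w", and "ew", so it ranks lower.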
Path Hierarchy
"/var/lib/racoons"
Path Hierarchy Analyzer
["/var", "/var/lib", "/var/lib/racoons"]
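The path hierarchy idea can be sketched as a small function that emits a path and each of its ancestors. `path_hierarchy` is a hypothetical helper that assumes an absolute path and a single-character delimiter; it only illustrates the output shape shown above.

```python
def path_hierarchy(path, delimiter="/"):
    """Return every ancestor prefix of an absolute path, shortest first."""
    parts = [part for part in path.split(delimiter) if part]
    return [
        delimiter + delimiter.join(parts[:depth])
        for depth in range(1, len(parts) + 1)
    ]


print(path_hierarchy("/var/lib/racoons"))
```

Indexing all of these terms means a search for "/var/lib" finds every document stored anywhere beneath that directory.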
Inverted Index Highlights
M Terms map to N documents
Still uses trees, but by breaking up text,
performance is gained!
String broken up into linguistic terms (usually
words)
Postgres users can do this (in a simple form)
List of ES Analysis Tools
Analyzers
standard analyzer
simple analyzer
whitespace analyzer
stop analyzer
keyword analyzer
pattern analyzer
language analyzers
snowball analyzer
custom analyzer
Tokenizers
standard tokenizer
edge ngram tokenizer
keyword tokenizer
letter tokenizer
lowercase tokenizer
ngram tokenizer
whitespace tokenizer
pattern tokenizer
uax email url tokenizer
path hierarchy tokenizer
classic tokenizer
thai tokenizer
+ Plugins
Token Filters
standard token filter
ascii folding token filter
length token filter
lowercase token filter
uppercase token filter
ngram token filter
edge ngram token filter
porter stem token filter
shingle token filter
stop token filter
word delimiter token filter
stemmer token filter
stemmer override token filter
keyword marker token filter
keyword repeat token filter
kstem token filter
snowball token filter
phonetic token filter
synonym token filter
compound word token filter
reverse token filter
elision token filter
truncate token filter
unique token filter
pattern capture token filter
pattern replace token filter
trim token filter
limit token count token filter
hunspell token filter
common grams token filter
normalization token filter
cjk width token filter
cjk bigram token filter
delimited payload token filter
keep words token filter
classic token filter
apostrophe token filter
Scoring = Relevance
Search Methodology
Find all the docs using a boolean query
Score all the docs using a similarity algorithm (TF/IDF)
TF/IDF Boosts When
The matched term is rare in the corpus
The term appears frequently in the document
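The two boosts can be seen in a textbook TF/IDF calculation. This is a simplified variant for illustration, not Lucene's actual scoring formula; the function `tf_idf` and the third document are assumptions made for the example.

```python
import math

docs = {
    1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    2: ["the", "fat", "brown", "dog"],
    3: ["raining", "cats", "and", "dogs"],
}


def tf_idf(term, doc_id, documents):
    """Textbook TF/IDF: frequency in the document times rarity in the corpus."""
    tokens = documents[doc_id]
    tf = tokens.count(term) / len(tokens)
    docs_with_term = sum(1 for t in documents.values() if term in t)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(documents) / docs_with_term) + 1.0
    return tf * idf


# "fat" occurs in one document, "brown" in two, so "fat" scores higher in doc 2.
print(tf_idf("fat", 2, docs), tf_idf("brown", 2, docs))
```

Both terms appear once in document 2, so the difference in score comes entirely from "fat" being rarer across the corpus.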
Document Scoring
Results are ordered based on score
(relevance)
Score based on either TF/IDF or other
algorithm
Custom scoring functions can be sent with
query or registered on the server
Query Types
Phrase Queries
Geo Queries
Numeric Range Queries
More Like This Queries
Autocomplete Queries
Query Types
1. match query
2. multi match query
3. bool query
4. boosting query
5. common terms query
6. custom filters score query
7. custom score query
8. custom boost factor query
9. constant score query
10. dis max query
11. field query
12. filtered query
13. fuzzy like this query
14. fuzzy like this field query
15. function score query
16. fuzzy query
17. geoshape query
18. has child query
19. has parent query
20. ids query
21. indices query
22. match all query
23. more like this query
24. more like this field query
25. nested query
26. prefix query
27. query string query
28. simple query string query
29. range query
30. regexp query
31. span first query
32. span multi term query
33. span near query
34. span not query
35. span or query
36. span term query
37. term query
38. terms query
39. top children query
40. wildcard query
41. text query
42. minimum should match
43. multi term query rewrite
Compose Queries with
Boolean / DisMax Queries
Efficient Aggregate Queries:
An RDBMS vs Elasticsearch
Elasticsearch is an Information Retrieval (IR) System
An RDBMS is oriented around organizing data
An IR system is oriented around efficient searches
In an RDBMS you create data, then index it
In an IR system you create indexes linked to data
Inverted indexes are
fantastically efficient for
denormalization!
Inverted Indexes for HTTP Logs
Proto Terms -> Documents
http -> 1, 2
https -> 3
Path Terms -> Documents
/foo -> 1, 2
/foo/bar -> 3
Documents
{"_id": 1, "proto": "http", "path": "/foo"}
{"_id": 2, "proto": "http", "path": "/foo"}
{"_id": 3, "proto": "https", "path": "/foo/bar"}
Question:
How many requests did we get for each path?
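What this question asks for can be sketched as a count of documents per distinct term in a field. This is only an in-memory illustration of the result shape, using the three log documents from the previous slide; `terms_aggregation` is a hypothetical helper, not an Elasticsearch API.

```python
from collections import Counter

logs = [
    {"_id": 1, "proto": "http", "path": "/foo"},
    {"_id": 2, "proto": "http", "path": "/foo"},
    {"_id": 3, "proto": "https", "path": "/foo/bar"},
]


def terms_aggregation(documents, field):
    """Count how many documents carry each distinct value of a field."""
    return Counter(doc[field] for doc in documents)


print(terms_aggregation(logs, "path"))
print(terms_aggregation(logs, "proto"))
```

With an inverted index these counts are nearly free, since each term already points at the list of documents that contain it.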
How We Answer It
SQL
SELECT path, COUNT(*) FROM logs GROUP BY path;
SELECT proto, COUNT(*) FROM logs GROUP BY proto;
ES
{"aggs": {
  "path": {"terms": {"field": "path"}},
  "proto": {"terms": {"field": "proto"}}}}
Question:
How many requests did we get under each different path AND its parents?
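Counting a path together with its parents can be sketched by expanding each path into its ancestor prefixes before counting. This is an in-memory illustration under stated assumptions: `path_and_parents` and `hierarchical_counts` are hypothetical helpers, and the three log documents are the ones used in the earlier HTTP logs slide.

```python
from collections import Counter

logs = [
    {"_id": 1, "proto": "http", "path": "/foo"},
    {"_id": 2, "proto": "http", "path": "/foo"},
    {"_id": 3, "proto": "https", "path": "/foo/bar"},
]


def path_and_parents(path):
    """Expand '/foo/bar' into ['/foo', '/foo/bar']."""
    parts = [part for part in path.split("/") if part]
    return ["/" + "/".join(parts[:depth]) for depth in range(1, len(parts) + 1)]


def hierarchical_counts(documents):
    """Count each request once for its own path and once per parent path."""
    counts = Counter()
    for doc in documents:
        counts.update(path_and_parents(doc["path"]))
    return counts


print(hierarchical_counts(logs))
```

The request for "/foo/bar" now also counts toward "/foo", giving 3 requests under "/foo" and 1 under "/foo/bar".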
How We Answer It
SQL
X (a GROUP BY on path counts only exact paths, not their parents)
SELECT path, COUNT(*) FROM logs GROUP BY path;
ES
{"aggs": {
  "path": {"terms": {"field": "path"}},
  "proto": {"terms": {"field": "proto"}}}}
Let's Save Some Space
Space Now Saved!
Proto Terms: http, https
Path Terms: /foo, /foo/bar
Documents: {"_id": 1}, {"_id": 2}, {"_id": 3}
Reasons to Consider ES
1. Speed
Traditional databases are often slower for full-text search
2. Relevance
Search is all about relevance. ES/Lucene provide a huge array of tools to ensure results are relevant.
3. Aggregate Statistics
Elasticsearch can be faster than your RDBMS when it comes to aggregate stats!
4. Search Goodies
Users nowadays expect features like ultrafast type-ahead search, "Did you mean?", and "More Like This"
Logstash, an ES
Success Story
Indexes
Multi Index Query
logs-2013-01
logs-2013-02
logs-2013-03
logs-2013-04
logs-2013-05
logs-2013-06
curl http://es.srv/logs-2013-05,logs-2013-06/_search -d '
{"query": { ... }}'
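Querying several monthly indexes at once can be sketched by joining their names with commas in the URL. This only builds the request without sending it; `monthly_indexes` and `build_multi_index_search` are hypothetical helpers, and the `match_all` query is a placeholder assumption because the slide leaves the query body empty.

```python
import json
from urllib.request import Request


def monthly_indexes(prefix, year, months):
    """Name one index per month, e.g. logs-2013-05."""
    return [f"{prefix}-{year}-{month:02d}" for month in months]


def build_multi_index_search(base_url, indexes, query):
    """Build (but do not send) a search across a comma-separated index list."""
    url = f"{base_url}/{','.join(indexes)}/_search"
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_multi_index_search(
    "http://es.srv",
    monthly_indexes("logs", 2013, [5, 6]),
    {"match_all": {}},  # placeholder query for the sketch
)
print(request.full_url)
```

Keeping one index per time period means old data can be dropped by deleting a whole index, and a search only touches the months it needs.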
Kibana + Logstash
Generic Document Store
Document Store Properties
Distributed
Excellent read performance / scalability
Mediocre delete/update performance
Rich queries on top of document properties
Things ES is bad at
Extremely high write environments: Lucene is not write-optimized. You probably won't hit limits here, however!
Large amounts of document churn: deleting and re-merging segments can get expensive
Transactional operations: Lucene is no RDBMS. It is meant for fast, denormalized operations.
Primary store: still too new
Thank You!
Check out our hosted ES solution @
http://found.no