Design Elasticsearch
Elasticsearch is a search engine based on the Lucene library. It
provides a distributed, multitenant-capable full-text search engine
with an HTTP web interface and schema-free JSON documents.
Functional Requirements:
1. Search documents with given queries. (Full Text Search)
2. Find top documents (Ranking).
Non-Functional Requirements:
1. We have billions of documents.
2. Eventual Consistency. (It’s okay if new documents are
included in search queries later.)
3. Highly Available.
Examples:
1. E-commerce Website: Review Search, Product Search
2. Social Media: Search bar (Post, comment, tweets)
3. Search Engine
Jamboard Link
Elastic search - Google Jamboard
How to store Documents:
We will store documents in the form of an inverted index.
Let’s see an example of how it is stored:
Inverted Index Example
Inverted Index (HashMap: word -> list of (documentId, position)):
1. Product -> [(1,4), (2,0), (3,0)]
2. Using -> [(1,2)]
3. Working -> [(1,8)]
4. Nice -> [(1,9)]
5. Great -> [(2,2)]
6. Feature -> [(2,4)]
7. OverPriced -> [(3,2)]
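The structure above can be sketched in Python. The sample texts here are illustrative, not the exact documents behind the postings listed above:

```python
# Minimal sketch of an inverted index: word -> list of
# (documentId, position) postings, as in the example above.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of documentId -> raw text."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

# Hypothetical documents, just to exercise the function.
docs = {
    1: "this product using is working nice",
    2: "great feature product",
    3: "overpriced product",
}
index = build_inverted_index(docs)
```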
Documents often contain numerous unnecessary words. Let’s
explore ways to remove them.
1. Remove Stop Words: common words that add little meaning:
of, the, are, a, etc.
2. Stemming/Lemmatization
3. Other Cleaning Methods: Removing
negative/abusive/sensitive words.
Stemming: It’s fast and simple.
1. Escaped -> Escap
2. Escaping -> Escap
3. Escape -> Escap
4. Caring -> Car
Lemmatization: It takes context into account.
1. Caring -> Care
2. Escaping -> Escape
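A toy suffix-stripping stemmer shows why stemming is fast but crude; the rules below are a deliberately naive sketch, not the Porter algorithm that real engines use:

```python
# Naive stemmer: chop common suffixes without any dictionary or
# context, which is why "caring" collapses to "car" rather than "care".
def naive_stem(word):
    for suffix in ("ing", "ed", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

Lemmatization, by contrast, needs a dictionary and part-of-speech context, which is why it is slower but maps "caring" back to "care".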
Example of Search Query:
1. Search Query: The Product Quality is great.
2. Cleaning: product quality great.
3. Intersection: Find all the docs that contain all the words.
4. Union: Find all the docs that contain any of the words.
5. Maintain Order of Words: [product at i, quality at i+1,
great at i+2]
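The steps above can be sketched against a tiny hand-built index; the stop-word list and postings are illustrative assumptions:

```python
# Query evaluation over an inverted index of (documentId, position)
# postings. STOP_WORDS and the sample index are made up for the demo.
STOP_WORDS = {"the", "is", "a", "of", "are"}

index = {
    "product": [(1, 0), (2, 2)],
    "quality": [(1, 1)],
    "great":   [(1, 2), (3, 0)],
}

def clean(query):
    words = [w.strip(".,!?").lower() for w in query.split()]
    return [w for w in words if w and w not in STOP_WORDS]

def docs_for(word):
    return {doc_id for doc_id, _ in index.get(word, [])}

def intersection(words):        # docs containing ALL the words
    sets = [docs_for(w) for w in words]
    return set.intersection(*sets) if sets else set()

def union(words):               # docs containing ANY of the words
    return set().union(*(docs_for(w) for w in words))

def phrase_match(words):        # words at consecutive positions i, i+1, ...
    matches = set()
    for doc_id, pos in index.get(words[0], []):
        if all((doc_id, pos + i) in set(index.get(w, []))
               for i, w in enumerate(words[1:], start=1)):
            matches.add(doc_id)
    return matches

terms = clean("The Product Quality is great.")
```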
Design DeepDive:
Single Node:
1. It is a single point of failure.
2. It can run out of space.
3. It must handle all the load.
4. Solution: Sharding.
CAP Theorem:
1. Eventual Consistency: It’s okay if new documents are
included in search queries later.
2. High Availability: The system should be available whenever
we search something.
3. Partition Tolerant: System should handle any network
partitions, it should continue to work despite any number
of communication breakdowns.
Sharding:
Shard by Document Id [Recommended]:
DocumentId can be:
1. ReviewId
2. LogId
3. ProductId
4. PostId/TweetId
Shard by DocumentId
Pros:
1. Latency is highly predictable: ResponseTime doesn’t
depend on search query.
2. No need to perform intersection.
3. Can skip unresponsive/slow shards.
Skip Unresponsive/Slow Shards
Cons:
1. Read Heavy: complex (every query must run on all shards).
2. Write Heavy: simple (each write goes to exactly 1 shard).
Shard by Words:
1. Insert/Update/Delete: Go to all shards corresponding to
the words in the document. (Multiple Shards)
2. Query: Go to the shards corresponding to the query words
and intersect their posting lists.
Word Sharding
Let’s See How Processing will Work:
Map Reduce:
1. Search: “product quality good” -> Go to all shards and run
the query on every shard.
2. Each shard returns a list of docs that match the query.
3. Collect them all.
4. Map: run the query on a single shard. Reduce: combine the
results from two shards at a time, so each step easily fits in
memory.
Map Reduce
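The map/reduce steps above can be sketched like this; the shard contents, toy scoring, and merge policy are all illustrative assumptions:

```python
# Scatter-gather: "map" runs the query on one shard, "reduce" merges
# two shards' results at a time. Doc ids never collide across shards
# because each document lives on exactly one shard.
from functools import reduce

def map_shard(shard_index, terms):
    """Return {doc_id: score} for this shard; toy score = matched terms."""
    scores = {}
    for term in terms:
        for doc_id, _pos in shard_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores

def reduce_results(a, b):
    """Combine the result maps from two shards into one."""
    merged = dict(a)
    merged.update(b)
    return merged

shards = [
    {"product": [(1, 0)], "quality": [(1, 1)]},   # shard 0
    {"product": [(2, 2)], "good": [(2, 4)]},      # shard 1
]
terms = ["product", "quality", "good"]
partials = [map_shard(s, terms) for s in shards]  # map phase
combined = reduce(reduce_results, partials)       # reduce phase
```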
Replication:
1. Replication keeps the system running when a node goes
down.
2. Parameters: number of shards, replication factor.
3. n -> number of nodes, M -> number of shards, R ->
replication factor
4. n ≤ M*R, otherwise some nodes would hold no shard copy.
5. Master Slave Replication
Replication
Master Slave Replication
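A quick numeric check of the sizing rule n ≤ M*R; the values and the round-robin placement scheme are illustrative:

```python
# With M shards and replication factor R there are M * R shard copies;
# if n > M * R some nodes would hold nothing, hence the rule n <= M*R.
M, R = 6, 3           # shards, replication factor (example values)
n = 9                 # nodes
copies = M * R        # 18 shard copies to place
assert n <= copies

# Simple round-robin placement: (shard, replica) -> node.
placement = {(s, r): (s * R + r) % n for s in range(M) for r in range(R)}
```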
Ranking:
1. Rank all the documents that match a given query.
2. TF-IDF is a popular ranking method.
TF-IDF (Below Diagram):
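As a rough sketch of the idea, TF-IDF scores a term higher when it is frequent in a document but rare across the corpus; exact log bases and smoothing vary between implementations, and the sample corpus is made up:

```python
# Minimal TF-IDF: term frequency in the doc times inverse document
# frequency across the corpus.
import math

def tf_idf(term, doc, corpus):
    """doc: list of words; corpus: list of such docs."""
    tf = doc.count(term) / len(doc)                # term frequency
    df = sum(1 for d in corpus if term in d)       # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    "great product great price".split(),
    "product overpriced".split(),
    "nice product".split(),
]
# "great" is rare in the corpus while "product" appears everywhere,
# so "great" ranks doc 0 higher than "product" does.
score_great = tf_idf("great", corpus[0], corpus)
score_product = tf_idf("product", corpus[0], corpus)
```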