
Design Elastic Search

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

Functional Requirements:

1. Search documents with given queries (full-text search).
2. Find top documents (ranking).

Non-Functional Requirements:

1. We have billions of documents.
2. Eventual consistency (it's okay if new documents are included in search results later).
3. Highly available.

Examples:

1. E-commerce website: review search, product search
2. Social media: search bar (posts, comments, tweets)
3. Search engine

Jamboard Link: Elastic search - Google Jamboard (jamboard.google.com)

How to Store Documents:

We will store documents in the form of an inverted index, which maps each word to a list of (documentId, position) pairs. Let's see an example of how this is stored (a small sketch follows the list below):

Inverted Index Example


Inverted Index (HashMap):

1. Product -> [(1,4), (2,0), (3,0)]
2. Using -> [(1,2)]
3. Working -> [(1,8)]
4. Nice -> [(1,9)]
5. Great -> [(2,2)]
6. Feature -> [(2,4)]
7. OverPriced -> [(3,2)]
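To make the structure concrete, here is a minimal Python sketch of building such an inverted index. It assumes documents are plain strings and uses word offsets as positions (the positions in the table above may equally well be character offsets).

```python
from collections import defaultdict

def build_inverted_index(documents):
    # Map each word to a list of (documentId, position) postings.
    # `documents` is a dict of documentId -> text; positions here are
    # word offsets within the document.
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

docs = {
    1: "product using it working nice",
    2: "product great feature",
    3: "product overpriced",
}
index = build_inverted_index(docs)
print(index["product"])  # [(1, 0), (2, 0), (3, 0)] for these toy documents
```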

Documents often contain many unnecessary words. Let's explore ways to remove them (a small cleaning sketch follows the examples below).

1. Remove stop words: common words that appear often but carry little meaning, e.g. of, the, are, a.
2. Stemming/Lemmatization.
3. Other cleaning methods: removing negative/abusive/sensitive words.

Stemming: It’s fast and simple.


1. Escaped -> Escap
2. Escaping -> Escap
3. Escape -> Escap
4. Caring -> Car

Lemmatization: It takes context into account.

1. Caring -> Care
2. Escaping -> Escape
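As a rough illustration, here is a minimal query-cleaning sketch, assuming a hand-rolled stop-word list and a crude suffix-stripping stemmer. A real deployment would use a proper analyzer (for example a Porter stemmer or one of Elasticsearch's built-in analyzers).

```python
STOP_WORDS = {"of", "the", "are", "a", "is", "it", "and"}

def crude_stem(word):
    # Very rough suffix stripping, just to mirror the examples above
    # (escaping -> escap, caring -> car). Real stemmers handle far more cases.
    for suffix in ("ing", "ed", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean(text):
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [crude_stem(w) for w in words if w and w not in STOP_WORDS]

print(clean("The Product Quality is great."))  # ['product', 'quality', 'great']
```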

Example of a Search Query:

1. Search query: "The Product Quality is great."
2. Cleaning: product quality great
3. Intersection: find all docs that contain all of the words.
4. Union: find all docs that contain any of the words.
5. Maintain order of words: [product at position i, quality at i+1, great at i+2].
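Here is a minimal sketch of steps 3-5, building on the inverted index from the earlier sketch (word -> list of (documentId, position) postings). Function names like `phrase_match` are illustrative, not Elasticsearch APIs.

```python
def postings(index, word):
    # documentIds in which the word appears
    return {doc_id for doc_id, _ in index.get(word, [])}

def intersection(index, words):
    # docs containing ALL of the query words
    sets = [postings(index, w) for w in words]
    return set.intersection(*sets) if sets else set()

def union(index, words):
    # docs containing ANY of the query words
    return set().union(*(postings(index, w) for w in words))

def phrase_match(index, words):
    # docs where the words appear in order at positions i, i+1, i+2, ...
    matches = set()
    for doc_id in intersection(index, words):
        pos = [{p for d, p in index[w] if d == doc_id} for w in words]
        if any(all(i + k in pos[k] for k in range(len(words))) for i in pos[0]):
            matches.add(doc_id)
    return matches

# Using `index` from the earlier inverted-index sketch:
print(intersection(index, ["product", "great"]))  # {2}
print(union(index, ["nice", "feature"]))          # {1, 2}
```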

Design Deep Dive:
Single Node:

1. It is a single point of failure.
2. It can run out of disk space.
3. It cannot handle high load.
4. Solution: sharding.

CAP Theorem:

1. Eventual consistency: it's okay if new documents are included in search results later.
2. High availability: the system should be available whenever we search something.
3. Partition tolerance: the system should keep working despite network partitions and any number of communication breakdowns between nodes.

Sharding:

Shard by Document Id [Recommended]:

The DocumentId can be:

1. ReviewId
2. LogId
3. ProductId
4. PostId/TweetId

Shard by DocumentId

Pros:

1. Latency is highly predictable: response time doesn't depend on the search query.
2. No need for a cross-shard intersection: all postings for a document live on one shard, so each shard can intersect query words locally.
3. Unresponsive/slow shards can be skipped.


Skip Unresponsive/Slow Shards

Cons:

1. Reads are heavy: every query has to run on all shards.
2. Writes are simple: each document goes to exactly one shard (see the routing sketch below).
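A minimal sketch of routing by DocumentId, assuming a simple hash-modulo placement (Elasticsearch uses a similar formula, hashing a routing key modulo the number of primary shards). The shard count here is a made-up value.

```python
import hashlib

NUM_SHARDS = 4  # assumed value; fixed when the index is created

def shard_for(doc_id):
    # Hash the documentId and take it modulo the shard count, so every
    # posting for a given document lands on exactly one shard.
    digest = hashlib.md5(str(doc_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for doc_id in ("review-101", "review-102", "tweet-9001"):
    print(doc_id, "-> shard", shard_for(doc_id))
```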

Shard by Words:

1. Insert/update/delete: go to all shards corresponding to the words in the document (multiple shards per write).
2. Query: go to all shards that own the query words, then combine (intersect) the per-shard results.


Word Sharding

Let’s See How Processing will Work:

Map Reduce:

1. Search "product quality good": go to all shards and run the query on every shard.
2. Each shard returns a list of docs that match the query.
3. Collect them all.
4. Map = run the query on a single shard; Reduce = combine the results from two shards. Each partial result is small enough to fit in memory (a sketch follows the diagram below).


Map Reduce
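A minimal scatter-gather sketch of the map and reduce steps above, assuming each shard holds an inverted index shaped like the earlier example (word -> list of (documentId, position)).

```python
from functools import reduce

def map_on_shard(shard_index, query_words):
    # Map step: run the query on a single shard, return its matching doc ids.
    sets = [{d for d, _ in shard_index.get(w, [])} for w in query_words]
    return set.intersection(*sets) if sets else set()

def reduce_results(left, right):
    # Reduce step: combine the results from two shards; each merge only
    # holds two partial results in memory at a time.
    return left | right

def search(shards, query_words):
    partial_results = [map_on_shard(s, query_words) for s in shards]  # scatter
    return reduce(reduce_results, partial_results, set())             # gather

shard_0 = {"product": [(1, 0)], "quality": [(1, 1)]}
shard_1 = {"product": [(2, 0)], "good": [(2, 2)]}
print(search([shard_0, shard_1], ["product", "quality"]))  # {1}
```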

Replication:

1. It is good to have replication in case a node goes down.
2. Parameters: number of shards and replication factor.
3. n = number of nodes, M = number of shards, R = replication factor.
4. n ≤ M*R, otherwise some nodes would have no shard copy to host (a placement sketch follows the diagrams below).
5. Master-slave replication.


Replication

Master Slave Replication
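As a rough illustration of the n ≤ M*R constraint, here is a toy round-robin placement sketch (names and values are made up). With M = 5 shards and R = 3 there are 15 shard copies, so at most n = 15 nodes can each hold at least one copy.

```python
def place_replicas(num_nodes, num_shards, replication_factor):
    # Assign every copy of every shard to a node, round-robin. Copies of
    # the same shard land on distinct nodes as long as R <= n. A real
    # system would also rebalance when nodes join or leave.
    placement = {node: [] for node in range(num_nodes)}
    copy_id = 0
    for shard in range(num_shards):
        for replica in range(replication_factor):
            placement[copy_id % num_nodes].append((shard, replica))
            copy_id += 1
    return placement

# n = 6 nodes, M = 5 shards, R = 3 -> 15 shard copies spread over 6 nodes.
for node, copies in place_replicas(6, 5, 3).items():
    print("node", node, "holds", copies)
```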

Ranking:

1. Rank all the documents that match a given query.
2. Method: TF-IDF is a popular ranking method.

TF-IDF:
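Below is a minimal sketch of TF-IDF scoring, assuming the classic tf * log(N/df) form. Note that Lucene/Elasticsearch later switched its default similarity to BM25, a refinement of the same idea.

```python
import math
from collections import Counter

def tf_idf_score(query_words, doc_words, corpus):
    # tf  = how often the term appears in this document
    # idf = log(N / df), where df = number of documents containing the term,
    #       so rare terms contribute more to the score than common ones.
    n_docs = len(corpus)
    counts = Counter(doc_words)
    score = 0.0
    for word in query_words:
        tf = counts[word]
        df = sum(1 for d in corpus if word in d)
        if tf and df:
            score += tf * math.log(n_docs / df)
    return score

corpus = [
    ["product", "working", "nice"],
    ["product", "great", "feature"],
    ["product", "overpriced"],
]
query = ["product", "great"]
ranked = sorted(corpus, key=lambda d: tf_idf_score(query, d, corpus), reverse=True)
print(ranked[0])  # ['product', 'great', 'feature'] scores highest
```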
