Design Elasticsearch
Elasticsearch is a search engine based on the Lucene library. It
provides a distributed, multitenant-capable full-text search engine
with an HTTP web interface and schema-free JSON documents.
Functional Requirements:
1. Search documents with given queries. (Full Text Search)
2. Find top documents (Ranking).
Non-Functional Requirements:
1. We have billions of documents.
2. Eventual Consistency. (It’s okay if new documents are
included in search queries later.)
3. Highly Available.
Examples:
1. E-commerce Website: Review Search, Product Search
2. Social Media: Search bar (Post, comment, tweets)
3. Search Engine
Jamboard Link
Elastic search - Google Jamboard
How to store Documents:
We will store documents in the form of an inverted index.
Let’s see an example of how it is stored:
Inverted Index Example
Inverted Index (HashMap: word -> list of (documentId, position)):
1. Product -> [(1,4), (2,0), (3,0)]
2. Using -> [(1,2)]
3. Working -> [(1,8)]
4. Nice -> [(1,9)]
5. Great -> [(2,2)]
6. Feature -> [(2,4)]
7. OverPriced -> [(3,2)]
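The structure above can be sketched in Python. The sample texts here are illustrative, not the exact documents behind the postings listed above:

```python
# Minimal sketch of an inverted index: word -> list of
# (documentId, position) postings, as in the example above.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of documentId -> raw text."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

# Hypothetical documents, just to exercise the function.
docs = {
    1: "this product using is working nice",
    2: "great feature product",
    3: "overpriced product",
}
index = build_inverted_index(docs)
```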
Documents often contain numerous unnecessary words. Let’s
explore ways to remove them.
1. Remove Stop Words: common words that add little meaning:
of, the, are, a, etc.
2. Stemming/Lemmatization
3. Other Cleaning Methods: Removing
negative/abusive/sensitive words.
Stemming: It’s fast and simple.
1. Escaped -> Escap
2. Escaping -> Escap
3. Escape -> Escap
4. Caring -> Car
Lemmatization: It takes context into account.
1. Caring -> Care
2. Escaping -> Escape
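A toy suffix-stripping stemmer shows why stemming is fast but crude; the rules below are a deliberately naive sketch, not the Porter algorithm that real engines use:

```python
# Naive stemmer: chop common suffixes without any dictionary or
# context, which is why "caring" collapses to "car" rather than "care".
def naive_stem(word):
    for suffix in ("ing", "ed", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

Lemmatization, by contrast, needs a dictionary and part-of-speech context, which is why it is slower but maps "caring" back to "care".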
Example of Search Query:
1. Search Query: The Product Quality is great.
2. Cleaning: product quality great.
3. Intersection: Find all the docs that contain all the words.
4. Union: Find all the docs that contain any of the words.
5. Maintain Order of Words: [product at i, quality at i+1,
great at i+2]
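The steps above can be sketched against a tiny hand-built index; the stop-word list and postings are illustrative assumptions:

```python
# Query evaluation over an inverted index of (documentId, position)
# postings. STOP_WORDS and the sample index are made up for the demo.
STOP_WORDS = {"the", "is", "a", "of", "are"}

index = {
    "product": [(1, 0), (2, 2)],
    "quality": [(1, 1)],
    "great":   [(1, 2), (3, 0)],
}

def clean(query):
    words = [w.strip(".,!?").lower() for w in query.split()]
    return [w for w in words if w and w not in STOP_WORDS]

def docs_for(word):
    return {doc_id for doc_id, _ in index.get(word, [])}

def intersection(words):        # docs containing ALL the words
    sets = [docs_for(w) for w in words]
    return set.intersection(*sets) if sets else set()

def union(words):               # docs containing ANY of the words
    return set().union(*(docs_for(w) for w in words))

def phrase_match(words):        # words at consecutive positions i, i+1, ...
    matches = set()
    for doc_id, pos in index.get(words[0], []):
        if all((doc_id, pos + i) in set(index.get(w, []))
               for i, w in enumerate(words[1:], start=1)):
            matches.add(doc_id)
    return matches

terms = clean("The Product Quality is great.")
```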
Design DeepDive:
Single Node:
1. It is a single point of failure.
2. It can run out of space.
3. It must handle all the load.
4. Solution: Sharding.
CAP Theorem:
1. Eventual Consistency: It’s okay if new documents are
included in search queries later.
2. High Availability: The system should be available whenever
we search something.
3. Partition Tolerant: System should handle any network
partitions, it should continue to work despite any number
of communication breakdowns.
Sharding:
Shard by Document Id [Recommended]:
DocumentId can be:
1. ReviewId
2. LogId
3. ProductId
4. PostId/TweetId
Shard by DocumentId
Pros:
1. Latency is highly predictable: ResponseTime doesn’t
depend on search query.
2. No need to perform intersection.
3. Can skip unresponsive/slow shards.
Skip Unresponsive/Slow Shards
Cons:
1. Read Heavy: complex (every query must run on all shards).
2. Write Heavy: simple (each write goes to exactly 1 shard).
Shard by Words:
1. Insert/Update/Delete: Go to all shards corresponding to
the words in the document. (Multiple Shards)
2. Query: Go to the shards corresponding to the query words
and intersect their posting lists.
Word Sharding
Let’s See How Processing will Work:
Map Reduce:
1. Search: “product quality good” -> Go to all shards and run
the query on every shard.
2. Each shard returns a list of docs that match the query.
3. Collect them all.
4. Map: run the query on a single shard. Reduce: combine the
results from two shards at a time, so each step easily fits in
memory.
Map Reduce
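The map/reduce steps above can be sketched like this; the shard contents, toy scoring, and merge policy are all illustrative assumptions:

```python
# Scatter-gather: "map" runs the query on one shard, "reduce" merges
# two shards' results at a time. Doc ids never collide across shards
# because each document lives on exactly one shard.
from functools import reduce

def map_shard(shard_index, terms):
    """Return {doc_id: score} for this shard; toy score = matched terms."""
    scores = {}
    for term in terms:
        for doc_id, _pos in shard_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores

def reduce_results(a, b):
    """Combine the result maps from two shards into one."""
    merged = dict(a)
    merged.update(b)
    return merged

shards = [
    {"product": [(1, 0)], "quality": [(1, 1)]},   # shard 0
    {"product": [(2, 2)], "good": [(2, 4)]},      # shard 1
]
terms = ["product", "quality", "good"]
partials = [map_shard(s, terms) for s in shards]  # map phase
combined = reduce(reduce_results, partials)       # reduce phase
```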
Replication:
1. Replication keeps the system running when a node goes
down.
2. Parameters: number of shards, replication factor.
3. n -> number of nodes, M -> number of shards, R ->
replication factor
4. n ≤ M*R, otherwise some nodes would hold no shard copy.
5. Master Slave Replication
Replication
Master Slave Replication
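A quick numeric check of the sizing rule n ≤ M*R; the values and the round-robin placement scheme are illustrative:

```python
# With M shards and replication factor R there are M * R shard copies;
# if n > M * R some nodes would hold nothing, hence the rule n <= M*R.
M, R = 6, 3           # shards, replication factor (example values)
n = 9                 # nodes
copies = M * R        # 18 shard copies to place
assert n <= copies

# Simple round-robin placement: (shard, replica) -> node.
placement = {(s, r): (s * R + r) % n for s in range(M) for r in range(R)}
```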
Ranking:
1. Rank all the documents that match a given query.
2. TF-IDF is a popular ranking method.
TF-IDF (Below Diagram):
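As a rough sketch of the idea, TF-IDF scores a term higher when it is frequent in a document but rare across the corpus; exact log bases and smoothing vary between implementations, and the sample corpus is made up:

```python
# Minimal TF-IDF: term frequency in the doc times inverse document
# frequency across the corpus.
import math

def tf_idf(term, doc, corpus):
    """doc: list of words; corpus: list of such docs."""
    tf = doc.count(term) / len(doc)                # term frequency
    df = sum(1 for d in corpus if term in d)       # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    "great product great price".split(),
    "product overpriced".split(),
    "nice product".split(),
]
# "great" is rare in the corpus while "product" appears everywhere,
# so "great" ranks doc 0 higher than "product" does.
score_great = tf_idf("great", corpus[0], corpus)
score_product = tf_idf("product", corpus[0], corpus)
```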