NLP 05

WEEK 5

Natural Language Processing
CSC 4106

MUHAMMAD ATIF SAEED

LECTURER (Artificial Intelligence & Robotics)
Lucene and Elasticsearch in NLP

Lucene and Elasticsearch are two widely used search technologies that
provide indexing and searching capabilities for text data: Lucene is a
search library, and Elasticsearch is a search engine built on it. They
are extensively utilized in information retrieval systems like web
search engines, enterprise search platforms, and document
management systems.

SLIDE 02
Natural Language Processing (NLP) | MUHAMMAD ATIF SAEED (Lecturer)
Apache Lucene

Apache Lucene is a high-performance, full-featured open-source
search library written in Java. It allows developers to add indexing
and search functionality to their applications. Lucene provides the
core algorithms for text indexing and search but does not come
with a server or REST API (as Elasticsearch does), so it is typically
used as a library within larger systems.

Core Concepts in Lucene (Cont.)

• Document: The fundamental unit in Lucene. A document consists
of fields, where each field contains a specific piece of
information, such as text, keywords, or metadata.
• Index: A collection of documents is indexed in Lucene for fast
retrieval. The index is built using inverted indexing, where for
each term, Lucene keeps track of which documents contain that
term.
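
The inverted index described above can be illustrated with a minimal Python sketch. This is not Lucene code: the function name build_inverted_index and the three sample documents are assumptions made for the example, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis for the sketch: lowercase, split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample documents, keyed by document ID.
docs = {
    0: "Lucene is a search library",
    1: "Elasticsearch is built on Lucene",
    2: "A library of search examples",
}
index = build_inverted_index(docs)
print(index["lucene"])  # [0, 1]
print(index["search"])  # [0, 2]
```

Looking up a term is a single dictionary access, which is why retrieval stays fast as the collection grows.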

Core Concepts in Lucene

• Analyzer: Lucene uses analyzers to process text into terms.
Analyzers break down text into tokens, apply filters (like
removing stop words), and normalize the data (e.g., stemming or
lowercasing).
• Query: Lucene provides a variety of query types, such as term
queries, phrase queries, and wildcard queries, which allow users
to search for information efficiently.
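
As a rough illustration of the analyzer steps (tokenize, filter stop words, normalize), here is a small Python sketch. It is not a Lucene analyzer: the stop-word list is illustrative, and the stem function is a deliberately crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Illustrative stop-word list (an assumption, not Lucene's).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "are"}


def stem(token):
    """Crude suffix stripping; real stemmers are far more careful."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word filter
    return [stem(t) for t in tokens]                     # normalize by stemming


print(analyze("The analyzers are indexing documents"))
# ['analyzer', 'index', 'document']
```

The same analysis has to be applied to both documents and queries, otherwise the query terms would not match the indexed terms.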

How Lucene Works
• Indexing: Text is processed using an analyzer, which tokenizes the
input into terms, applies filters (such as stemming or removing stop
words), and stores the terms in an inverted index. This index contains
mappings from terms to documents.
• Searching: When a query is made, Lucene consults the inverted index to
retrieve the documents that match the query terms. It ranks them with a
scoring function: classic TF-IDF (Term Frequency - Inverse Document
Frequency) or BM25, a refinement of TF-IDF that is the default
similarity in recent Lucene versions.
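
To make the ranking step concrete, the following Python sketch scores documents with the textbook BM25 formula. It is an illustration under stated assumptions rather than Lucene's implementation: the function name bm25_scores and the sample documents are made up, documents are already tokenized, and Lucene's own BM25Similarity may differ in constant factors. The values k1 = 1.2 and b = 0.75 are the commonly used defaults.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical, pre-tokenized documents.
docs = [
    "lucene is a search library".split(),
    "elasticsearch is built on lucene".split(),
    "cooking with fresh herbs".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2]
```

Document 0 matches both query terms and ranks first, document 1 matches one term, and document 2 matches none and scores zero.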

Lucene in NLP Tasks
• Text Search and Retrieval: Used for building custom search engines
that need high-performance indexing and retrieval capabilities.
• Document Classification: Indexes can be used to group or categorize
large sets of documents based on the terms they contain.
• Sentiment Analysis: Lucene can index documents so that queries built
from sentiment-related terms (positive or negative words) retrieve the
matching documents, giving a simple lexicon-based signal of sentiment.

Apache Lucene framework (Cont.)
org.apache.lucene.analysis.standard.StandardAnalyzer
• StandardAnalyzer is used for text analysis in Lucene. It processes and
tokenizes the text into individual words or terms.
• This class tokenizes on word boundaries and lowercases the terms. It can
also remove stop words (common words like "and", "the") when it is given
a stop-word set; recent Lucene versions use an empty set by default.
• It's one of the most commonly used analyzers in Lucene for standard
English text.
• When you are indexing or searching text in Lucene, the StandardAnalyzer
will help break down the text into searchable terms.

Apache Lucene framework (Cont.)
org.apache.lucene.document.Document
• A Document in Lucene is a collection of fields. Each Document
represents a record or an entity that you want to index and later
search.
• Think of it as a row in a database, where each field represents a
column with specific data, such as a title, description, or ID.
• You create a Document to represent each item you want to store in
the Lucene index. Each document can have multiple fields (such as
text, keywords, etc.).

Apache Lucene framework (Cont.)
org.apache.lucene.document.Field
• Field represents a piece of data (or a column) within a Document. It
could be text, a number, or any other type of information.
• Fields allow you to store values in a document in such a way that they
can be searched later.
• A field can be indexed, stored, and tokenized based on your needs.
• You use Field objects to add various pieces of information to a
document, such as a title or content.
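
The effect of the indexed, stored, and tokenized options can be sketched in Python. This is a toy model, not the Lucene API: the Field dataclass, the index_document function, and the sample values (including the ID "DOC-0001") are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    indexed: bool = True    # searchable through the inverted index
    stored: bool = True     # original value can be returned with a hit
    tokenized: bool = True  # split into terms, or kept as one exact term


def index_document(doc_id, fields, inverted, stored_values):
    for f in fields:
        if f.stored:
            stored_values.setdefault(doc_id, {})[f.name] = f.value
        if f.indexed:
            terms = f.value.lower().split() if f.tokenized else [f.value]
            for term in terms:
                inverted.setdefault((f.name, term), set()).add(doc_id)


inverted, stored_values = {}, {}
index_document(
    1,
    [
        Field("title", "Lucene in Action"),
        Field("id", "DOC-0001", tokenized=False),             # exact-match key
        Field("body", "full text of the book", stored=False),  # search only
    ],
    inverted,
    stored_values,
)
print(sorted(stored_values[1]))        # ['id', 'title']
print(("body", "book") in inverted)    # True
```

Here the body is searchable but not retrievable, while the ID is kept as a single term so that it only matches exactly.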

Apache Lucene framework (Cont.)
org.apache.lucene.document.TextField
• TextField is a subclass of Field. It is used specifically for text fields
that need to be tokenized (broken down into individual
searchable terms) and indexed.
• If you want the text to be searchable, TextField is typically the
class to use; pass Field.Store.YES to also store the original value
in the index.
• Text fields are often used for things like the content of
documents, titles, or descriptions.

Apache Lucene framework (Cont.)
org.apache.lucene.index.DirectoryReader
• DirectoryReader allows you to access an index in Lucene for
searching. It opens the index and exposes its documents and terms to
the classes that run queries, such as IndexSearcher.
• It reads from a Directory, which is where the index data is stored
(whether in memory or on disk).
• It’s used when you need to read and search an existing Lucene
index.

Apache Lucene framework (Cont.)

org.apache.lucene.index.IndexWriter
• IndexWriter is responsible for writing and updating the Lucene
index. It adds, removes, or updates documents in the index.
• When you create new documents and want to add them to the
search index, IndexWriter handles that process.
• You use IndexWriter to create or update the search index.

Apache Lucene framework (Cont.)
org.apache.lucene.index.IndexWriterConfig
• IndexWriterConfig is used to configure how the IndexWriter
should behave. It defines the analyzer to use and other
parameters like open mode (e.g., create a new index or append
to an existing one).
• It sets up the configuration necessary for indexing documents.
• You configure IndexWriter through this class, including what
Analyzer to use (e.g., StandardAnalyzer).

Apache Lucene framework (Cont.)
org.apache.lucene.queryparser.classic.QueryParser
• QueryParser allows you to parse text queries into Lucene Query
objects. It converts user input (such as keywords) into a query
that Lucene can process to retrieve relevant documents.
• It also uses an Analyzer (e.g., StandardAnalyzer) to tokenize and
process the input query string.
• Used to create complex queries based on user input or
predefined query structures.

Apache Lucene framework (Cont.)

org.apache.lucene.search.IndexSearcher
• IndexSearcher is the class responsible for executing search
queries against the Lucene index.
• It uses a DirectoryReader to read the index and perform searches
based on queries.
• You use IndexSearcher to search the indexed documents and
retrieve relevant results.

Apache Lucene framework (Cont.)
org.apache.lucene.search.Query
• Query represents a search query in Lucene. There are different
types of queries (e.g., TermQuery, BooleanQuery) depending on
how you want to search the index (e.g., for specific terms,
phrases, or ranges).
• A Query object is typically created by the QueryParser (or constructed
directly, e.g., as a TermQuery) and then executed by the IndexSearcher
to retrieve matching documents.
• It forms the backbone of how searches are made in Lucene.

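
How term and boolean queries select documents can be sketched over a toy inverted index in Python. The functions term_query and boolean_query and the sample postings are assumptions for this illustration; the must, should, and must_not parameters loosely mirror the MUST, SHOULD, and MUST_NOT clause types of Lucene's BooleanQuery, and scoring is left out entirely.

```python
def term_query(index, term):
    """Return the set of document IDs whose postings contain the term."""
    return set(index.get(term, ()))


def boolean_query(index, must=(), should=(), must_not=()):
    """Combine term lookups: intersect MUST, union SHOULD, subtract MUST_NOT."""
    if must:
        # With required clauses, optional clauses do not change which
        # documents match (in a real engine they would affect the score).
        result = set.intersection(*(term_query(index, t) for t in must))
    else:
        result = set().union(*(term_query(index, t) for t in should))
    for t in must_not:
        result -= term_query(index, t)
    return result


# Hypothetical postings: term -> document IDs.
index = {
    "lucene": [0, 1, 3],
    "search": [0, 2, 3],
    "java": [1, 3],
}
print(sorted(boolean_query(index, must=["lucene", "search"])))            # [0, 3]
print(sorted(boolean_query(index, must=["lucene"], must_not=["java"])))   # [0]
print(sorted(boolean_query(index, should=["search", "java"])))            # [0, 1, 2, 3]
```
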
Apache Lucene framework (Cont.)

org.apache.lucene.search.ScoreDoc
• ScoreDoc represents a single document's score in the search
results. It holds the document ID and the score of how well that
document matches the search query.
• Lucene ranks search results, and ScoreDoc helps to keep track of
which documents matched and their relevance scores.
• Used to access individual results in the search output.

Apache Lucene framework (Cont.)

org.apache.lucene.search.TopDocs
• TopDocs contains the top-ranked search results. It includes an
array of ScoreDoc objects representing the documents that
matched the query and the total number of hits.
• This is the structure returned after executing a search query in
Lucene.
• It stores the results of the query (usually the top N results).
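
The relationship between ScoreDoc and TopDocs can be mimicked in a few lines of Python. These are stand-in types written for the example, not the Lucene classes; the helper top_n and the sample scores are assumptions.

```python
import heapq
from typing import NamedTuple


class ScoreDoc(NamedTuple):
    doc: int      # document ID
    score: float  # relevance score for the query


class TopDocs(NamedTuple):
    total_hits: int   # number of documents that matched at all
    score_docs: list  # the best n hits, highest score first


def top_n(scores, n):
    """Keep only matching documents and return the n best-scoring ones."""
    hits = [ScoreDoc(doc, s) for doc, s in scores.items() if s > 0]
    best = heapq.nlargest(n, hits, key=lambda h: h.score)
    return TopDocs(total_hits=len(hits), score_docs=best)


# Hypothetical per-document scores for one query.
results = top_n({0: 1.7, 1: 0.4, 2: 0.0, 3: 2.3}, n=2)
print(results.total_hits)                   # 3
print([h.doc for h in results.score_docs])  # [3, 0]
```

Selecting only the top N with a heap avoids sorting every matching document, which matters when a query matches a large share of the index.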

Apache Lucene framework (Cont.)

org.apache.lucene.store.Directory
• Directory is an abstract class that represents where the index is
stored. Lucene supports multiple types of storage for its index,
such as in-memory (RAMDirectory) or on-disk (FSDirectory).
• This class manages how the index data is read and written.
• Used as the storage medium for the search index, either in
memory or on disk.

Apache Lucene framework
org.apache.lucene.store.RAMDirectory
• RAMDirectory is a subclass of Directory that stores the index entirely
in memory. It’s useful for situations where you need fast, temporary
access to the index but do not require persistent storage.
• Since everything is in memory, it offers very fast access but is not
suitable for large-scale or persistent storage scenarios.
• This is often used for testing or small applications where persistence
is not required.
• Note that RAMDirectory is deprecated in Lucene 8 and removed in Lucene 9;
ByteBuffersDirectory is the in-memory replacement in those versions.

Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine
built on top of Apache Lucene. While Lucene is a search library,
Elasticsearch is a full-fledged search engine that provides a
flexible, scalable, and user-friendly way to perform complex search
queries. It also supports powerful features such as distributed
search, near real-time indexing, and a REST API.

Core Concepts in Elasticsearch (Cont.)

• Index: Similar to Lucene, an Elasticsearch index stores
documents. Each index is partitioned into shards, each of which is
itself a Lucene index, which allows Elasticsearch to scale
horizontally across multiple machines.
• Document: A single unit of information stored in Elasticsearch
(similar to a row in a relational database). Each document is
stored in JSON format.
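
The idea of spreading documents over shards can be sketched in Python as hashing a document ID and taking the remainder by the number of shards. This is a simplified model: the function shard_for and the document IDs are hypothetical, and CRC32 is a stand-in chosen for the sketch, not the hash function Elasticsearch uses.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard deterministically from the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


num_shards = 3
shards = {s: [] for s in range(num_shards)}
for doc_id in ["review-1", "review-2", "review-3", "review-4", "review-5"]:
    shards[shard_for(doc_id, num_shards)].append(doc_id)

print(shards)
```

Because the shard is computed from the ID alone, any node can work out where a document lives without a lookup table. It also suggests why the number of primary shards is normally fixed when an index is created: changing it would change where existing documents map.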

Core Concepts in Elasticsearch
• Cluster and Node: Elasticsearch operates in a distributed fashion.
A cluster is a collection of one or more nodes. A node is a running
Elasticsearch instance, typically one per machine; each node holds a
subset of the data and is capable of handling search and indexing
operations.
• Mapping: Defines the structure of documents in an index, such
as the fields, data types, and analyzers.
• Query: Elasticsearch supports complex query operations such as
filtering, aggregation, and full-text search.

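
A mapping is sent to Elasticsearch as a JSON body when an index is created. The Python sketch below only builds and prints such a body; it makes no network call. The field names (title, body, author, published) are assumptions for the example, while text, keyword, and date are standard field types and standard and english are built-in analyzers.

```python
import json

# Body for creating an index with an explicit mapping.
mapping_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "standard"},
            "body": {"type": "text", "analyzer": "english"},  # stemming, stop words
            "author": {"type": "keyword"},                   # exact values, not analyzed
            "published": {"type": "date"},
        }
    }
}
print(json.dumps(mapping_body, indent=2))
```

The split between text and keyword fields parallels the tokenized and untokenized fields in Lucene: text fields go through an analyzer, keyword fields are indexed as a single term.
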
How Elasticsearch Works

• Indexing: Data (documents) is indexed into Elasticsearch using
REST APIs. An analyzer processes the text into terms, and an
inverted index is created, just as in Lucene.
• Searching: Elasticsearch uses the inverted index to perform fast
search operations across distributed nodes. It provides features
like fuzzy matching, proximity queries, full-text search, and
aggregation.
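
The indexing and searching steps map onto two REST requests. The Python sketch below builds the request bodies and prints them without contacting a server. The index name reviews, the document contents, and the aggregation name by_author are assumptions for the example; the query uses a match query with fuzziness and a terms aggregation.

```python
import json

# Index one JSON document under an explicit ID.
index_request = (
    "PUT",
    "/reviews/_doc/1",
    {"title": "Great phone", "body": "Battery life is excellent", "author": "alice"},
)

# Full-text search that tolerates the typo "excelent", plus a count per author.
search_request = (
    "GET",
    "/reviews/_search",
    {
        "query": {"match": {"body": {"query": "excelent battery", "fuzziness": "AUTO"}}},
        "aggs": {"by_author": {"terms": {"field": "author"}}},
        "size": 10,
    },
)

for method, path, body in (index_request, search_request):
    print(method, path)
    print(json.dumps(body, indent=2))
```

The aggregation assumes author is mapped as a keyword field; text fields are analyzed into terms and are not normally used for this kind of grouping.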

Elasticsearch in NLP Tasks
• Search Engines: It is commonly used to build search engines where
users can search for documents or products using natural language
queries.
• Text Analytics: Elasticsearch supports NLP features like tokenization,
stemming, and text analysis, making it suitable for text mining and
document analysis.
• Sentiment Analysis and Opinion Mining: Elasticsearch’s aggregations
and full-text search can be used to analyze user reviews or social
media posts.
