NLP 05


WEEK 5

Natural Language Processing
CSC 4106

MUHAMMAD ATIF SAEED


LECTURER (Artificial Intelligence & Robotics)
Lucene and Elasticsearch in NLP

Lucene and Elasticsearch are two widely used search technologies that
provide indexing and search capabilities for text data. They are
used extensively in information retrieval systems such as web
search engines, enterprise search platforms, and document
management systems.

Natural Language Processing (NLP) | MUHAMMAD ATIF SAEED (Lecturer)
Apache Lucene

Apache Lucene is a high-performance, full-featured open-source
search library written in Java. It allows developers to add indexing
and search functionality to their applications. Lucene provides the
core algorithms for text indexing and search but does not come
with a server or REST API (as Elasticsearch does), so it is typically
used as a library within larger systems.

Core Concepts in Lucene (Cont.)

• Document: The fundamental unit in Lucene. A document consists
of fields, where each field contains a specific piece of
information, such as text, keywords, or metadata.
• Index: A collection of documents is indexed in Lucene for fast
retrieval. The index is built using inverted indexing, where for
each term, Lucene keeps track of which documents contain that
term.
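The inverted index idea can be sketched in a few lines of Python. This is a toy illustration, not Lucene's on-disk format: the function name build_inverted_index and the sample documents are made up for the example, and the "analysis" is just lowercasing and whitespace splitting.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "lucene is a search library",
    2: "elasticsearch is built on lucene",
    3: "a search engine ranks documents",
}
index = build_inverted_index(docs)
print(index["lucene"])   # IDs of documents containing "lucene"
print(index["search"])   # IDs of documents containing "search"
```

Looking up a term is a single dictionary access, which is why retrieval stays fast even when the collection grows.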

Core Concepts in Lucene

• Analyzer: Lucene uses analyzers to process text into terms.
Analyzers break down text into tokens, apply filters (like
removing stop words), and normalize the data (e.g., stemming or
lowercasing).
• Query: Lucene provides a variety of query types, such as term
queries, phrase queries, and wildcard queries, which allow users
to search for information efficiently.
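The analyzer steps (tokenize, filter, normalize) can be approximated with a small Python sketch. Everything here is an assumption made for illustration: the stop-word set is a tiny hand-picked list, and naive_stem is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Illustrative stop-word set; real analyzers ship much longer lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "are"}

def naive_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Turn raw text into a list of index terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word filter
    return [naive_stem(t) for t in tokens]                # normalize by stemming

print(analyze("The analyzers are indexing the documents"))
# prints ['analyzer', 'index', 'document']
```

The same analysis has to be applied to both documents and queries, otherwise a query for "indexing" would never match the stored term "index".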

How Lucene Works
• Indexing: Text is processed using an analyzer, which tokenizes the
input into terms, applies filters (such as stemming or removing stop
words), and stores the terms in an inverted index. This index contains
mappings from terms to documents.
• Searching: When a query is made, Lucene retrieves all the documents
that match the query terms by consulting the inverted index. It ranks
the matches with a scoring function such as TF-IDF (Term Frequency -
Inverse Document Frequency) or BM25 (a refinement of TF-IDF and the
default similarity since Lucene 6).
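To make the ranking step concrete, here is a minimal sketch of BM25 scoring in Python. It uses the textbook form of the formula with k1 = 1.2 and b = 0.75; Lucene's own implementation differs in details (for example, how document length is encoded), so treat the numbers as illustrative. The function name bm25_scores and the sample documents are made up for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical, already-analyzed documents.
docs = [
    "lucene search library".split(),
    "lucene lucene index search".split(),
    "distributed rest engine".split(),
]
scores = bm25_scores(["lucene"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document positions, best match first
```

The second document ranks first because it mentions the query term twice, and the third scores zero because it never mentions it.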


Lucene in NLP Tasks
• Text Search and Retrieval: Used for building custom search engines
that need high-performance indexing and retrieval capabilities.
• Document Classification: Indexes can be used to group or categorize
large sets of documents based on the terms they contain.
• Sentiment Analysis: Lucene can index documents, and queries built
from sentiment-related terms (positive or negative words) can give a
rough, lexicon-based indication of document sentiment.

Apache Lucene framework (Cont.)
org.apache.lucene.analysis.standard.StandardAnalyzer
• StandardAnalyzer is used for text analysis in Lucene. It processes and
tokenizes the text into individual words or terms.
• This class tokenizes text at word boundaries, lowercases the tokens, and
removes stop words (common words like "and", "the") when a stop-word set
is supplied; since Lucene 8 its default stop-word set is empty.
• It's one of the most commonly used analyzers in Lucene for standard
English text.
• When you are indexing or searching text in Lucene, the StandardAnalyzer
will help break down the text into searchable terms.
Apache Lucene framework (Cont.)
org.apache.lucene.document.Document
• A Document in Lucene is a collection of fields. Each Document
represents a record or an entity that you want to index and later
search.
• Think of it as a row in a database, where each field represents a
column with specific data, such as a title, description, or ID.
• You create a Document to represent each item you want to store in
the Lucene index. Each document can have multiple fields (such as
text, keywords, etc.).


Apache Lucene framework (Cont.)
org.apache.lucene.document.Field
• Field represents a piece of data (or a column) within a Document. It
could be text, a number, or any other type of information.
• Fields allow you to store values in a document in such a way that they
can be searched later.
• A field can be indexed, stored, and tokenized based on your needs.
• You use Field objects to add various pieces of information to a
document, such as a title or content.


Apache Lucene framework (Cont.)
org.apache.lucene.document.TextField
• TextField is a subclass of Field. It is used specifically for text fields
that need to be tokenized (broken down into individual
searchable terms) and indexed.
• TextField always indexes its value; to also store the original text
in the index so it can be returned with search results, construct it
with Field.Store.YES.
• Text fields are often used for things like the content of
documents, titles, or descriptions.
Apache Lucene framework (Cont.)
org.apache.lucene.index.DirectoryReader
• DirectoryReader provides read access to a Lucene index. It opens the
index so that an IndexSearcher can run queries against it and fetch
the matching documents.
• It reads from a Directory, which is where the index data is stored
(whether in memory or on disk).
• It’s used when you need to read and search an existing Lucene
index.
Apache Lucene framework (Cont.)

org.apache.lucene.index.IndexWriter
• IndexWriter is responsible for writing and updating the Lucene
index. It adds, removes, or updates documents in the index.
• When you create new documents and want to add them to the
search index, IndexWriter handles that process.
• You use IndexWriter to create or update the search index.

Apache Lucene framework (Cont.)
org.apache.lucene.index.IndexWriterConfig
• IndexWriterConfig is used to configure how the IndexWriter
should behave. It defines the analyzer to use and other
parameters like open mode (e.g., create a new index or append
to an existing one).
• It sets up the configuration necessary for indexing documents.
• You configure IndexWriter through this class, including what
Analyzer to use (e.g., StandardAnalyzer).
Apache Lucene framework (Cont.)
org.apache.lucene.queryparser.classic.QueryParser
• QueryParser allows you to parse text queries into Lucene Query
objects. It converts user input (such as keywords) into a query
that Lucene can process to retrieve relevant documents.
• It also uses an Analyzer (e.g., StandardAnalyzer) to tokenize and
process the input query string.
• Used to create complex queries based on user input or
predefined query structures.
Apache Lucene framework (Cont.)

org.apache.lucene.search.IndexSearcher
• IndexSearcher is the class responsible for executing search
queries against the Lucene index.
• It uses a DirectoryReader to read the index and perform searches
based on queries.
• You use IndexSearcher to search the indexed documents and
retrieve relevant results.


Apache Lucene framework (Cont.)
org.apache.lucene.search.Query
• Query represents a search query in Lucene. There are different
types of queries (e.g., TermQuery, BooleanQuery) depending on
how you want to search the index (e.g., for specific terms,
phrases, or ranges).
• A Query object can be constructed directly or produced by the
QueryParser, and is then executed by the IndexSearcher to retrieve
matching documents.
• It forms the backbone of how searches are made in Lucene.
Apache Lucene framework (Cont.)

org.apache.lucene.search.ScoreDoc
• ScoreDoc represents a single document's score in the search
results. It holds the document ID and the score of how well that
document matches the search query.
• Lucene ranks search results, and ScoreDoc helps to keep track of
which documents matched and their relevance scores.
• Used to access individual results in the search output.


Apache Lucene framework (Cont.)

org.apache.lucene.search.TopDocs
• TopDocs contains the top-ranked search results. It includes an
array of ScoreDoc objects representing the documents that
matched the query and the total number of hits.
• This is the structure returned after executing a search query in
Lucene.
• It stores the results of the query (usually the top N results).
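How the writer, searcher, and top-N results fit together can be sketched with a toy in-memory index in Python. This is not the Lucene API: the class ToyIndex and its methods are hypothetical names chosen for the example, and the score is a raw term-frequency sum rather than TF-IDF or BM25.

```python
from collections import defaultdict

class ToyIndex:
    """In-memory stand-in for the index writer and searcher roles."""

    def __init__(self):
        self.stored = []                    # stored field values, by doc ID
        self.postings = defaultdict(dict)   # term -> {doc_id: term frequency}

    def add_document(self, fields):
        """Index every field of a document and return its new doc ID."""
        doc_id = len(self.stored)
        self.stored.append(fields)
        for value in fields.values():
            for term in value.lower().split():
                self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1
        return doc_id

    def search(self, query, n):
        """Return (total_hits, top-n list of (doc_id, score)), best first."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf   # raw term frequency as the score
        ranked = sorted(scores.items(), key=lambda hit: (-hit[1], hit[0]))
        return len(ranked), ranked[:n]

index = ToyIndex()
index.add_document({"title": "Lucene basics", "content": "lucene indexes text"})
index.add_document({"title": "Elasticsearch", "content": "built on lucene"})
index.add_document({"title": "Cooking", "content": "pasta recipes"})

total_hits, top = index.search("lucene text", n=1)
print(total_hits, top)
# Use the doc ID from a hit to fetch the stored fields.
print(index.stored[top[0][0]]["title"])
```

The search result mirrors the two pieces of information described above: the total number of matching documents, and a short list of (document ID, score) pairs for the best matches.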


Apache Lucene framework (Cont.)

org.apache.lucene.store.Directory
• Directory is an abstract class that represents where the index is
stored. Lucene supports multiple types of storage for its index,
such as in-memory (RAMDirectory, replaced by ByteBuffersDirectory in
recent versions) or on-disk (FSDirectory).
• This class manages how the index data is read and written.
• Used as the storage medium for the search index, either in
memory or on disk.


Apache Lucene framework
org.apache.lucene.store.RAMDirectory
• RAMDirectory is a subclass of Directory that stores the index entirely
in memory. It’s useful for situations where you need fast, temporary
access to the index but do not require persistent storage.
• Since everything is in memory, it offers very fast access but is not
suitable for large-scale or persistent storage scenarios.
• This is often used for testing or small applications where persistence
is not required.
• RAMDirectory was deprecated in Lucene 8 and removed in Lucene 9;
ByteBuffersDirectory is the in-memory replacement.
Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine
built on top of Apache Lucene. While Lucene is a search library,
Elasticsearch is a full-fledged search engine that provides a
flexible, scalable, and user-friendly way to perform complex search
queries. It also supports powerful features such as distributed
search, near real-time indexing, and a REST API.

Core Concepts in Elasticsearch (Cont.)

• Index: Similar to Lucene, an Elasticsearch index stores
documents. Each index is partitioned into shards, each of which is
itself a Lucene index, which allows Elasticsearch to scale
horizontally across multiple machines.
• Document: A single unit of information stored in Elasticsearch
(similar to a row in a relational database). Each document is
stored in JSON format.
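The idea of spreading documents over shards can be illustrated with a short Python sketch: hash a routing value (here the document ID) and take it modulo the number of primary shards. This is a simplification. Elasticsearch uses its own hash function and extra routing factors; zlib.crc32, the function name route_to_shard, and the document IDs are stand-ins chosen for the example.

```python
import zlib

def route_to_shard(doc_id, num_primary_shards):
    """Pick a shard by hashing the routing value (here, the document ID)."""
    # zlib.crc32 is a stand-in hash chosen because it is stable across runs.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

NUM_SHARDS = 3  # hypothetical number of primary shards
shards = {n: [] for n in range(NUM_SHARDS)}
for doc_id in ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6"]:
    shards[route_to_shard(doc_id, NUM_SHARDS)].append(doc_id)
print(shards)
```

Because the shard is derived from the ID, the same document always lands on the same shard, and a lookup by ID only needs to contact that one shard.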

Core Concepts in Elasticsearch
• Cluster and Node: Elasticsearch operates in a distributed fashion.
A cluster is a collection of one or more nodes, where each node is a
running Elasticsearch instance (typically one per machine). Each
data node holds a subset of the data and is capable of handling search
and indexing operations.
• Mapping: Defines the structure of documents in an index, such
as the fields, data types, and analyzers.
• Query: Elasticsearch supports complex query operations such as
filtering, aggregation, and full-text search.
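As an illustration of a mapping and a matching JSON document, the Python sketch below builds both as dictionaries and serializes them. The field names (title, body, rating, posted) and the review data are hypothetical; the overall shape (a "mappings" object with "properties", field types such as text, integer, and date, and the built-in "english" analyzer) follows the Elasticsearch mapping format. Nothing is sent to a server here.

```python
import json

# Hypothetical mapping for an index of product reviews.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "english"},
            "body": {"type": "text", "analyzer": "english"},
            "rating": {"type": "integer"},
            "posted": {"type": "date"},
        }
    }
}

# A document that fits the mapping; Elasticsearch stores it as JSON.
document = {
    "title": "Great headphones",
    "body": "Comfortable, with clear sound.",
    "rating": 5,
    "posted": "2024-01-15",
}

print(json.dumps(mapping, indent=2))
print(json.dumps(document))
```

The text fields are analyzed into terms for full-text search, while the integer and date fields are kept as typed values for filtering and aggregation.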
How Elasticsearch Works

• Indexing: Data (documents) is indexed into Elasticsearch using
REST APIs. An analyzer processes the text into terms, and an
inverted index is created, just as in Lucene.
• Searching: Elasticsearch uses the inverted index to perform fast
search operations across distributed nodes. It provides features
like fuzzy matching, proximity queries, full-text search, and
aggregation.
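A search request that combines full-text search, fuzzy matching, filtering, and aggregation might look like the body built in this Python sketch. The field names body and rating, the misspelled query text, and the aggregation name by_rating are assumptions made for the example; the structure (bool query with must and filter clauses, a match query with fuzziness, a terms aggregation) follows the Elasticsearch Query DSL. The code only builds and prints the JSON; it does not contact a cluster.

```python
import json

search_body = {
    "query": {
        "bool": {
            "must": [
                # Full-text match that tolerates small typos in the query.
                {"match": {"body": {"query": "comfortible sound", "fuzziness": "AUTO"}}}
            ],
            "filter": [
                # Exact filter: restricts results without affecting scores.
                {"range": {"rating": {"gte": 4}}}
            ],
        }
    },
    "aggs": {
        # Count the matching documents per rating value.
        "by_rating": {"terms": {"field": "rating"}}
    },
    "size": 10,  # return at most the ten best hits
}

print(json.dumps(search_body, indent=2))
```

A body like this would be sent to an index's _search endpoint over the REST API, and the response would contain both the ranked hits and the aggregation buckets.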

Elasticsearch in NLP Tasks
• Search Engines: It is commonly used to build search engines where
users can search for documents or products using natural language
queries.
• Text Analytics: Elasticsearch supports NLP features like tokenization,
stemming, and text analysis, making it suitable for text mining and
document analysis.
• Sentiment Analysis and Opinion Mining: Elasticsearch’s aggregations
and full-text search can be used to analyze user reviews or social
media posts.
