Search Engine Functionality for LLP
Apache Lucene Library and Solr Enterprise Search Server
Apache Lucene
• A high-performance, full-featured text search engine
library written entirely in Java.
• It is a technology suitable for nearly any application
that requires full-text search, especially cross-platform.
Features: Lucene is designed to make it easy to add indexing and
search capability to a broad range of applications, including:
• Searchable email: An email application could let users
search archived messages and add new messages to the
index as they arrive.
• Online documentation search: A documentation reader --
CD-based, Web-based, or embedded within the application --
could let users search online documentation or archived
publications.
• Searchable Webpages: A Web browser or proxy server
could build a personal search engine to index every
Webpage a user has visited, allowing users to easily revisit
pages.
• Website search: A CGI program could let users search your
Website.
• Content search: An application could let the user search
saved documents for specific content; this could be
integrated into the Open Document dialog.
• Version control and content management: A document
management system could index documents, or document
versions, so they can be easily retrieved.
• News and wire service feeds: A news server or relay
could index articles as they arrive.
Usage: Lucene can be used as follows:
• Indexing Side: Write code to add Documents to the index.
• Search Side: Write code to transform the user's query into
Lucene Query instances.
• Submit the Query to Lucene to run the search.
• Display the results.
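The four steps above can be sketched with a toy in-memory index. This is a plain-Java illustration only: it does not use Lucene's API, and the class and method names (TinyIndexSketch, add, search) are invented for the example. It assumes a query matches a document when the document contains every query term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java illustration of the index/search steps; this is NOT Lucene's API.
final class TinyIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Indexing side: store the text and record every term it contains.
    int add(String text) {
        int id = documents.size();
        documents.add(text);
        for (String term : terms(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Search side: turn the user query into terms, then keep only the
    // documents that contain all of them (an assumption of this sketch).
    List<Integer> search(String query) {
        TreeSet<Integer> hits = null;
        for (String term : terms(query)) {
            TreeSet<Integer> ids = postings.getOrDefault(term, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(ids);
            } else {
                hits.retainAll(ids);
            }
        }
        if (hits == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(hits);
    }

    String document(int id) {
        return documents.get(id);
    }

    // Lowercase the text and split it on anything that is not a letter or digit.
    private static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TinyIndexSketch index = new TinyIndexSketch();
        index.add("Lucene is a search library");
        index.add("Solr is a search server built on Lucene");
        // Display the results: print each matching document.
        for (int id : index.search("search server")) {
            System.out.println(id + ": " + index.document(id));
        }
    }
}
```

A real Lucene application would replace this class with Lucene's own index writer, query, and searcher types; the sketch only shows how the steps relate.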
A Document consists of one or more Fields. A Field consists of a
name, content, and metadata on how to handle the content. Content
is made searchable by analyzing it. Analysis is performed by
chaining together a Tokenizer, which splits an input stream into
words (tokens), and zero or more TokenFilters, which can alter (for
example, stem) or remove tokens.
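The tokenizer-plus-filters chain described above might look like the following in plain Java. This is a sketch and not Lucene's Tokenizer or TokenFilter classes: the method names, the stop-word list, and the crude plural stripping (standing in for real stemming) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of an analysis chain; these are NOT Lucene classes.
final class AnalysisChainSketch {
    // Hypothetical stop-word list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "to"));

    // "Tokenizer": split the input into tokens on whitespace.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String t : input.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter that alters tokens: lowercase each one.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter that removes tokens: drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Filter that alters tokens: crude plural stripping, a stand-in for stemming.
    static List<String> stripPlural(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            boolean plural = t.length() > 3 && t.endsWith("s") && !t.endsWith("ss");
            out.add(plural ? t.substring(0, t.length() - 1) : t);
        }
        return out;
    }

    // The chain: one tokenizer followed by three filters.
    static List<String> analyze(String input) {
        return stripPlural(removeStopWords(lowerCase(tokenize(input))));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Users search archived Messages"));
    }
}
```

With this chain, the input "The Users search archived Messages" yields the tokens user, search, archived, message. The same chain would normally be applied to both indexed text and queries so that their tokens match.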
Indexing: The process of preparing text and adding it to Lucene.
The key point is that Lucene only indexes Strings, i.e.
• Lucene doesn’t care about XML, Word, PDF, etc.
• There are many good open source extractors available.
• We need to convert whatever file format we have into
plain text that Lucene can index.
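As an illustration of converting a file format into plain strings before indexing, the sketch below pulls the text out of an XML document using only the JDK's built-in DOM parser. The class name and the sample markup are assumptions. Formats such as Word or PDF would need one of the third-party extractors mentioned above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Sketch: reduce an XML document to the plain text that would be indexed.
final class XmlTextExtractorSketch {
    static String extractText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getTextContent() concatenates the text of all descendant nodes;
            // runs of whitespace are then collapsed to single spaces.
            return doc.getDocumentElement().getTextContent()
                    .trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<article><title>Lucene</title>"
                + "<body> indexes <b>plain</b> text</body></article>";
        System.out.println(extractText(xml));
    }
}
```

The resulting string, not the original markup, is what would be handed to the indexing code.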
Solr
• Solr is an open source enterprise search server based on the
Lucene Java search library, with XML/HTTP and JSON APIs, hit
highlighting, faceted search, caching, replication, a web
administration interface and many more features. It runs in a
Java servlet container such as Tomcat.
Features: Solr is packaged as a Java 5 web application (WAR) with a
web-services-like API. We put documents into it (called "indexing")
via XML over HTTP, and we query it via HTTP GET and receive XML
results.
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML and HTTP
• Server statistics exposed over JMX for monitoring
• Scalability - Efficient Replication to other Solr Search Servers
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
[Figure: the Solr admin console]
Usage: Conceptually, Solr can be broken down into four main
areas:
• Schema (schema.xml) - describes the data
• Configuration (solrconfig.xml) - describes how people can
interact with the data
• Indexing
• Searching
As with Lucene, content is made searchable by analyzing it with a
Tokenizer chained to zero or more TokenFilters. The Solr schema
makes it easy to configure this analysis process without code.
Configuration: The solrconfig.xml file specifies how Solr should
handle indexing, highlighting, faceting, search, and other
requests, as well as attributes specifying how caching should be
handled and how Lucene should manage the index.
Indexing and searching: Both happen via HTTP requests sent to the
Solr server. The index is modified by POSTing XML documents
containing instructions to add (or update) documents, delete
documents, or commit pending adds and deletes.
• Loading data: Send XML add commands over HTTP. For example:
<add><doc>
<field name="id">canes</field>
<field name="name">Carolina Hurricanes</field>
</doc></add>
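A client could assemble such update messages as plain strings. The sketch below builds the add message shown above, plus the delete and commit messages mentioned earlier, escaping any XML special characters in the values. The class and method names are hypothetical, and sending the message (an HTTP POST to the server's update URL, which this text does not give) is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: build Solr XML update messages as strings (names are illustrative).
final class SolrUpdateXmlSketch {
    // Message that commits pending adds and deletes.
    static final String COMMIT = "<commit/>";

    // Escape characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build an <add> message containing one document with the given fields.
    static String add(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(f.getKey())).append("\">")
               .append(escape(f.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    // Build a <delete> message that removes a document by its id.
    static String deleteById(String id) {
        return "<delete><id>" + escape(id) + "</id></delete>";
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "canes");
        fields.put("name", "Carolina Hurricanes");
        System.out.println(add(fields));
        System.out.println(deleteById("canes"));
        System.out.println(COMMIT);
    }
}
```

A LinkedHashMap is used so the fields appear in insertion order, which keeps the output identical to the hand-written example.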
• Querying data: Send an HTTP GET or POST request whose parameters
specify the query options:
o http://solr/select?q=electronics
o http://solr/select?q=electronics&sort=price+desc
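Query URLs like these can be built programmatically so that parameter values are encoded correctly. The following is a minimal sketch: the class and method names are invented, and the base URL http://solr is taken from the examples above rather than from a real deployment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: assemble a Solr select URL from query parameters.
final class SolrQueryUrlSketch {
    static String selectUrl(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl).append("/select");
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(encode(p.getKey()))
               .append('=')
               .append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // Form-encode a value; a space becomes '+', as in sort=price+desc.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "electronics");
        params.put("sort", "price desc");
        System.out.println(selectUrl("http://solr", params));
    }
}
```

Run as written, this prints http://solr/select?q=electronics&sort=price+desc, matching the second example URL.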
• The canonical response format is XML:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<result name="response" numFound="14" start="0">
<doc>
<arr name="cat">
<str>electronics</str>
<str>connector</str>
</arr>
<arr name="features">
<str>car power adapter, white</str>
</arr>
<str name="id">F8V7067APLKIT</str> ...
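A client might read such a response with the JDK's DOM parser. The sketch below extracts the numFound attribute and the id of each returned document. Because the response above is truncated, the sample in main is an abridged, well-formed version assembled for this example. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch: pull the hit count and document ids out of a Solr XML response.
final class SolrResponseSketch {
    // Parse the XML and return its first <result> element.
    private static Element resultElement(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName("result").item(0);
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed XML response", e);
        }
    }

    // Total number of matching documents, from the numFound attribute.
    static int numFound(String xml) {
        return Integer.parseInt(resultElement(xml).getAttribute("numFound"));
    }

    // Text of every <str name="id"> element inside the result.
    static List<String> ids(String xml) {
        List<String> ids = new ArrayList<>();
        NodeList strs = resultElement(xml).getElementsByTagName("str");
        for (int i = 0; i < strs.getLength(); i++) {
            Element str = (Element) strs.item(i);
            if ("id".equals(str.getAttribute("name"))) {
                ids.add(str.getTextContent());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Abridged sample response, completed only enough to be well-formed.
        String sample = "<response>"
                + "<lst name=\"responseHeader\">"
                + "<int name=\"status\">0</int><int name=\"QTime\">1</int></lst>"
                + "<result name=\"response\" numFound=\"14\" start=\"0\">"
                + "<doc><arr name=\"cat\"><str>electronics</str>"
                + "<str>connector</str></arr>"
                + "<str name=\"id\">F8V7067APLKIT</str></doc>"
                + "</result></response>";
        System.out.println(numFound(sample) + " found, ids on this page: " + ids(sample));
    }
}
```

Note that numFound counts all matches, while the ids list covers only the documents included in this page of results.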
Lucene v. Solr
Lucene                                                  Solr
Embedded/lightweight                                    Server-side
No container                                            HTTP as communication language
Provide low-level control over all aspects of process   Want ease of setup and configuration
Thick clients                                           Can be used for non-Java clients
Need to use features not available in Solr              Distributed replication/caching out of the box
JDK 1.4                                                 JDK 1.5
Links for installation and documentation:
Lucene:
http://lucene.apache.org/java/2_4_0/gettingstarted.html (official
website)
http://www.ibm.com/developerworks/web/library/wa-lucene2/?S_TACT=105AGY82&S_CMP=GENSITE
Solr:
http://lucene.apache.org/solr/tutorial.html (official website)
http://www.ibm.com/developerworks/opensource/library/j-solr-update/index.html?ca=drs-