Lucene in Action
Erik Hatcher and Otis Gospodnetić
2004 | 456 pages | ISBN: 1932394281
$22.50 | ThoutReader + PDF ebook
$44.95 | Softbound print book
About this Book
Lucene in Action delivers details, best practices, caveats, tips, and tricks for using the best open-source Java search engine available.
This book assumes the reader is familiar with basic Java programming. Lucene itself is a single Java Archive (JAR) file and integrates into the simplest Java stand-alone console program as well as the most sophisticated enterprise application.
Roadmap
We organized part 1 of this book to cover the core Lucene Application Programming Interface (API) in the order you're likely to encounter it as you integrate Lucene into your applications:
- In chapter 1, you meet Lucene. We introduce some basic information-retrieval terminology, and we note Lucene's primary competition. Without wasting any time, we immediately build simple indexing and searching applications that you can put right to use or adapt to your needs. This example application opens the door for exploring the rest of Lucene's capabilities.
- Chapter 2 familiarizes you with Lucene's basic indexing operations. We describe the various field types and techniques for indexing numbers and dates. We also cover tuning the indexing process, optimizing an index, and dealing with thread safety.
- Chapter 3 takes you through basic searching, including details of how Lucene ranks documents based on a query. We discuss the fundamental query types as well as how they can be created through human-entered query expressions.
- Chapter 4 delves deep into the heart of Lucene's indexing magic, the analysis process. We cover the analyzer building blocks including tokens, token streams, and token filters. Each of the built-in analyzers gets its share of attention and detail. We build several custom analyzers, showcasing synonym injection and Metaphone (like Soundex) replacement. Analysis of non-English languages is given attention, with specific examples of analyzing Chinese text.
- Chapter 5 picks up where the searching chapter left off, with analysis now in mind. We cover several advanced searching features, including sorting, filtering, and leveraging term vectors. The advanced query types make their appearance, including the spectacular SpanQuery family. Finally, we cover Lucene's built-in support for querying multiple indexes, even in parallel and remotely.
- Chapter 6 goes well beyond advanced searching, showing you how to extend Lucene's searching capabilities. You'll learn how to customize search results sorting, extend query expression parsing, implement hit collecting, and tune query performance. Whew!
Part 2 goes beyond Lucene's built-in facilities and shows you what can be done around and above Lucene:
- In chapter 7, we create a reusable and extensible framework for parsing documents in Word, HTML, XML, PDF, and other formats.
- Chapter 8 includes a smorgasbord of extensions and tools around Lucene. We describe several Lucene index viewing and developer tools as well as the many interesting toys in Lucene's Sandbox. Highlighting search terms is one such Sandbox extension that you'll likely need, along with other goodies like building an index from an Ant build process, using noncore analyzers, and leveraging the WordNet synonym index.
- Chapter 9 demonstrates the ports of Lucene to various languages, such as C++, C#, Perl, and Python.
- Chapter 10 brings all the technical details of Lucene back into focus with many wonderful case studies contributed by those who have built interesting, fast, and scalable applications with Lucene at their core.
Who should read this book?
Developers who need powerful search capabilities embedded in their applications should read this book. Lucene in Action is also suitable for developers who are curious about Lucene or indexing and search techniques, but who may not have an immediate need to use it. Adding Lucene know-how to your toolbox is valuable for future projects: search is a hot topic and will continue to be in the future.
This book primarily uses the Java version of Lucene (from Apache Jakarta), and the majority of the code examples use the Java language. Readers familiar with Java will be right at home. Java expertise will be helpful; however, Lucene has been ported to a number of other languages including C++, C#, Python, and Perl. The concepts, techniques, and even the API itself are comparable between the Java and other language versions of Lucene.
Code examples
The source code for this book is available from Manning's website at http://www.manning.com/hatcher2. Instructions for using this code are provided in the README file included with the source-code package.
The majority of the code shown in this book was written by us and is included in the source-code package. Some code (particularly the case-study code) isn't provided in our source-code package; the code snippets shown there are owned by the contributors and are donated as is. In a couple of cases, we have included a small snippet of code from Lucene's codebase, which is licensed under the Apache Software License (http://www.apache.org/licenses/LICENSE-2.0).
Code examples don't include package and import statements, to conserve space; refer to the actual source code for these details.
Why JUnit?
We believe code examples in books should be top-notch quality and real-world applicable. The typical "hello world" examples often insult our intelligence and generally do little to help readers see how to adapt the code to their own environment.
We've taken a unique approach to the code examples in Lucene in Action. Many of our examples are actual JUnit test cases (http://www.junit.org). JUnit, the de facto Java unit-testing framework, easily allows code to assert that a particular assumption works as expected in a repeatable fashion. Automating JUnit test cases through an IDE or Ant allows one-step (or no steps with continuous integration) confidence building. We chose to use JUnit in this book because we use it daily in our other projects and want you to see how we really code. Test Driven Development (TDD) is a development practice we strongly espouse.
If you're unfamiliar with JUnit, please read the following primer. We also suggest that you read Pragmatic Unit Testing in Java with JUnit by Dave Thomas and Andy Hunt, followed by Manning's JUnit in Action by Vincent Massol and Ted Husted.
JUnit primer
This section is a quick and admittedly incomplete introduction to JUnit. We'll provide the basics needed to understand our code examples. First, our JUnit test cases extend junit.framework.TestCase, and many extend it indirectly through our custom LiaTestCase base class. Our concrete test classes adhere to a naming convention: we suffix class names with Test. For example, our QueryParser tests are in QueryParserTest.java.
JUnit runners automatically execute all methods with the signature public void testXXX(), where XXX is an arbitrary but meaningful name. JUnit test methods should be concise and clear, keeping good software design in mind (such as not repeating yourself, creating reusable functionality, and so on).
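To make the naming convention concrete, here is a minimal sketch of how a runner could discover methods with the public void testXXX() shape using reflection. It is not JUnit's own implementation; SampleTest and TestMethodFinder are hypothetical names invented for this illustration, and only JDK classes are used.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class standing in for a test case; not from the book's source.
class SampleTest {
    public void testAddition() { }               // matches the convention
    public void testSubtraction() { }            // matches the convention
    public void helper() { }                     // ignored: no "test" prefix
    public int testReturnsValue() { return 0; }  // ignored: not void
    public void testWithArgument(int x) { }      // ignored: takes a parameter
}

class TestMethodFinder {
    // Returns the sorted names of methods shaped like "public void testXXX()".
    static List<String> findTestMethods(Class<?> testClass) {
        List<String> names = new ArrayList<>();
        // getMethods() returns only public methods, including inherited ones.
        for (Method m : testClass.getMethods()) {
            boolean namedLikeTest = m.getName().startsWith("test");
            boolean returnsVoid = m.getReturnType() == void.class;
            boolean noParameters = m.getParameterTypes().length == 0;
            if (namedLikeTest && returnsVoid && noParameters) {
                names.add(m.getName());
            }
        }
        Collections.sort(names); // reflection order is unspecified, so sort
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findTestMethods(SampleTest.class));
        // prints [testAddition, testSubtraction]
    }
}
```

Sorting the names is a choice made for this sketch so that the output is stable; it says nothing about the order in which a real runner executes tests.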
Assertions
JUnit is built around a set of assert statements, freeing you to code tests clearly and letting the JUnit framework handle failed assumptions and reporting the details. The most frequently used assert statement is assertEquals; there are a number of overloaded variants of the assertEquals method signature for various data types. An example test method looks like this:
    public void testExample() {
        SomeObject obj = new SomeObject();
        assertEquals(10, obj.someMethod());
    }
The assert methods throw an AssertionFailedError, which fails the test, if the expected value (10, in this example) isn't equal to the actual value (the result of calling someMethod on obj, in this example). Besides assertEquals, there are several other assert methods for convenience. We also use assertTrue(expression), assertFalse(expression), and assertNull(expression) statements. These test whether the expression is true, false, and null, respectively.
The assert statements have overloaded signatures that take an additional String parameter as the first argument. This String argument is used entirely for reporting purposes, giving the developer more information when a test fails. We use this String message argument to be more descriptive (or sometimes comical).
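The role of the message argument may be easier to see in a hand-rolled version. The following is a simplified sketch, assuming nothing beyond the JDK: MiniAssert and its Failure type are hypothetical stand-ins for JUnit's assert methods and failure type, and only the int overloads are shown.

```java
class MiniAssert {
    // Thrown when an expectation does not hold; stands in for JUnit's failure type.
    static class Failure extends Error {
        Failure(String message) {
            super(message);
        }
    }

    // The message comes first and is used only when reporting a failure.
    static void assertEquals(String message, int expected, int actual) {
        if (expected != actual) {
            String prefix = (message == null) ? "" : message + " ";
            throw new Failure(prefix + "expected:<" + expected + "> but was:<" + actual + ">");
        }
    }

    // Overload without a message, delegating to the variant above.
    static void assertEquals(int expected, int actual) {
        assertEquals(null, expected, actual);
    }

    public static void main(String[] args) {
        assertEquals(10, 5 + 5); // passes silently
        try {
            assertEquals("JDwA", 1, 0); // fails and reports the message
        } catch (Failure f) {
            System.out.println(f.getMessage());
            // prints JDwA expected:<1> but was:<0>
        }
    }
}
```

A passing assertion produces no output at all, which is why a descriptive message pays off only at the moment something breaks.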
By coding our assumptions and expectations in JUnit test cases in this manner, we free ourselves from the complexity of the large systems we build and can focus on fewer details at a time. With a critical mass of test cases in place, we can remain confident and agile. This confidence comes from knowing that changing code, such as optimizing algorithms, won't break other parts of the system, because if it did, our automated test suite would let us know long before the code made it to production. Agility comes from being able to keep the codebase clean through refactoring. Refactoring is the art (or is it a science?) of changing the internal structure of the code so that it accommodates evolving requirements without affecting the external interface of a system.
JUnit in context
Let's take what we've said so far about JUnit and frame it within the context of this book. JUnit test cases ultimately extend from junit.framework.TestCase, and test methods have the public void testXXX() signature. One of our test cases (from chapter 3) is shown here:
    public class BasicSearchingTest extends LiaTestCase {
        public void testTerm() throws Exception {
            // directory is inherited from LiaTestCase
            IndexSearcher searcher = new IndexSearcher(directory);
            Term t = new Term("subject", "ant");
            Query query = new TermQuery(t);
            Hits hits = searcher.search(query);
            // "JDwA" is the message reported if this assertion fails
            assertEquals("JDwA", 1, hits.length());
            t = new Term("subject", "junit");
            hits = searcher.search(new TermQuery(t));
            assertEquals(2, hits.length());
            searcher.close();
        }
    }
Of course, we'll explain the Lucene API used in this test case later. Here we'll focus on the JUnit details. A variable used in testTerm, directory, isn't defined in this class. JUnit provides an initialization hook that executes prior to every test method; this hook is a method with the protected void setUp() signature. Our LiaTestCase base class implements setUp in this manner:
    public abstract class LiaTestCase extends TestCase {
        private String indexDir = System.getProperty("index.dir");
        protected Directory directory;
        protected void setUp() throws Exception {
            directory = FSDirectory.getDirectory(indexDir, false);
        }
    }
If our first assert in testTerm fails, we see an exception like this:
    junit.framework.AssertionFailedError: JDwA expected:<1> but was:<0>
        at lia.searching.BasicSearchingTest.testTerm(BasicSearchingTest.java:20)
This failure indicates that our test data differs from what we expect.
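The setUp hook described earlier can be illustrated with a small stand-alone sketch. It assumes a simplified runner that creates a new instance per test method and calls setUp before invoking it; MiniTestCase, FixtureTest, and MiniRunner are hypothetical names for this illustration and are not part of JUnit or of the book's source code.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class playing the role that TestCase plays in JUnit.
abstract class MiniTestCase {
    protected void setUp() throws Exception { }
}

// Hypothetical test class: setUp builds a fresh fixture for every test method.
class FixtureTest extends MiniTestCase {
    static final List<String> log = new ArrayList<>();
    private List<String> fixture;

    protected void setUp() {
        fixture = new ArrayList<>();
        log.add("setUp");
    }

    public void testFirst() {
        fixture.add("x");
        log.add("testFirst sees " + fixture.size() + " item(s)");
    }

    public void testSecond() {
        fixture.add("y");
        log.add("testSecond sees " + fixture.size() + " item(s)");
    }
}

class MiniRunner {
    // Runs each named test on a new instance, calling setUp first.
    static void run(Class<? extends MiniTestCase> testClass, String... methodNames) {
        try {
            for (String name : methodNames) {
                MiniTestCase instance = testClass.getDeclaredConstructor().newInstance();
                instance.setUp();
                Method test = testClass.getMethod(name);
                test.invoke(instance);
            }
        } catch (Exception e) {
            throw new IllegalStateException("test run failed", e);
        }
    }

    public static void main(String[] args) {
        run(FixtureTest.class, "testFirst", "testSecond");
        for (String line : FixtureTest.log) {
            System.out.println(line);
        }
    }
}
```

Because setUp runs before each test, both test methods see a fixture containing exactly one item; neither observes what the other added.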
Testing Lucene
The majority of the tests in this book test Lucene itself. In practice, is this realistic? Isn't the idea to write test cases that test our own code, not the libraries themselves? There is an interesting twist to Test Driven Development used for learning an API: Test Driven Learning. It's immensely helpful to write tests directly to a new API in order to learn how it works and what you can expect from it. This is precisely what we've done in most of our code examples, so that tests are testing Lucene itself. Don't throw these learning tests away, though. Keep them around to ensure your expectations of the API hold true when you upgrade to a new version of the API, and refactor them when the inevitable API change is made.
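As a small illustration of a learning test, the sketch below records two expectations about a JDK method, String.split, rather than about Lucene. The class name, the method names, and the check helper are invented for this example; a real learning test in the style of this book would be a JUnit test case.

```java
import java.util.Arrays;

// A learning test written against a library API rather than our own code.
class SplitLearningTest {
    // Minimal stand-in for an assert method.
    static void check(boolean condition, String description) {
        if (!condition) {
            throw new AssertionError("Expectation no longer holds: " + description);
        }
    }

    // What we learned: split(regex) drops trailing empty strings.
    static void testSplitDropsTrailingEmptyStrings() {
        String[] parts = "a,b,,".split(",");
        check(Arrays.equals(new String[] {"a", "b"}, parts),
              "trailing empty strings are dropped");
    }

    // What we learned: a negative limit keeps trailing empty strings.
    static void testNegativeLimitKeepsTrailingEmptyStrings() {
        String[] parts = "a,b,,".split(",", -1);
        check(parts.length == 4, "a negative limit keeps trailing empty strings");
    }

    public static void main(String[] args) {
        testSplitDropsTrailingEmptyStrings();
        testNegativeLimitKeepsTrailingEmptyStrings();
        System.out.println("all expectations hold");
    }
}
```

If a later version of the library changed either behavior, the corresponding check would fail and point directly at the expectation that needs revisiting.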
Mock objects
In a couple of cases, we use mock objects for testing purposes. Mock objects are used as probes sent into real business logic in order to assert that the business logic is working properly. For example, in chapter 4, we have a SynonymEngine interface (see section 4.6). The real business logic that uses this interface is an analyzer. When we want to test the analyzer itself, it's unimportant what type of SynonymEngine is used, but we want to use one that has well-defined and predictable behavior. We created a MockSynonymEngine, allowing us to reliably and predictably test our analyzer. Mock objects help simplify test cases such that they test only a single facet of a system at a time rather than having intertwined dependencies that lead to complexity in troubleshooting what really went wrong when a test fails. A nice effect of using mock objects comes from the design changes it leads us to, such as separation of concerns and designing using interfaces instead of direct concrete implementations.
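The pattern can be sketched without any Lucene classes. In the example below, the getSynonyms signature is an assumption (the interface in chapter 4 may differ), the mock's vocabulary is made up, and SynonymExpander is a hypothetical stand-in for the analyzer that plays the part of the code under test.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Assumed shape of the interface; the book's actual signature may differ.
interface SynonymEngine {
    String[] getSynonyms(String word);
}

// Mock with a small, fixed vocabulary so that tests are predictable.
class MockSynonymEngine implements SynonymEngine {
    private final Map<String, String[]> synonyms = new HashMap<>();

    MockSynonymEngine() {
        synonyms.put("quick", new String[] {"fast", "speedy"});
    }

    public String[] getSynonyms(String word) {
        String[] found = synonyms.get(word);
        return found == null ? new String[0] : found;
    }
}

// Hypothetical stand-in for the analyzer: the logic we actually want to test.
class SynonymExpander {
    private final SynonymEngine engine;

    SynonymExpander(SynonymEngine engine) {
        this.engine = engine;
    }

    // Lowercases and splits on whitespace, then emits each token
    // followed by any synonyms the engine supplies for it.
    List<String> expand(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            out.add(token);
            for (String synonym : engine.getSynonyms(token)) {
                out.add(synonym);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SynonymExpander expander = new SynonymExpander(new MockSynonymEngine());
        System.out.println(expander.expand("The quick fox"));
        // prints [the, quick, fast, speedy, fox]
    }
}
```

Because the expander depends only on the interface, a test can verify where synonyms are injected without caring how a real engine would look them up.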
Our test data
Most of our book revolves around a common set of example data to provide consistency and avoid having to grok an entirely new set of data for each section. This example data consists of book details. Table 1 shows the data so that you can reference it and make sense of our examples.
Table 1 Sample data used throughout this book

| Title / Author | Category | Subject |
|---|---|---|
| A Modern Art of Education / Rudolf Steiner | /education/pedagogy | education philosophy psychology practice Waldorf |
| Imperial Secrets of Health and Longevity / Bob Flaws | /health/alternative/Chinese | diet chinese medicine qi gong health herbs |
| Tao Te Ching / Stephen Mitchell | /philosophy/eastern | taoism |
| Gödel, Escher, Bach: an Eternal Golden Braid / Douglas Hofstadter | /technology/computers/ai | artificial intelligence number theory mathematics music |
| Mindstorms / Seymour Papert | /technology/computers/programming/education | children computers powerful ideas LOGO education |
| Java Development with Ant / Erik Hatcher, Steve Loughran | /technology/computers/programming | apache jakarta ant build tool junit java development |
| JUnit in Action / Vincent Massol, Ted Husted | /technology/computers/programming | junit unit testing mock objects |
| Lucene in Action / Otis Gospodnetić, Erik Hatcher | /technology/computers/programming | lucene search |
| Extreme Programming Explained / Kent Beck | /technology/computers/programming/methodology | extreme programming agile test driven development methodology |
| Tapestry in Action / Howard Lewis-Ship | /technology/computers/programming | tapestry web user interface components |
| The Pragmatic Programmer / Dave Thomas, Andy Hunt | /technology/computers/programming | pragmatic agile methodology developer tools |
The data, besides the fields shown in the table, includes fields for ISBN, URL, and publication month. The fields for category and subject are our own subjective values, but the other information is objectively factual about the books.
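To show how the subject keywords in Table 1 relate to the term searches in our examples, here is a plain-Java sketch that looks up titles by subject term. It does not use Lucene and is not how Lucene stores or analyzes fields; it assumes the subject value is simply split on whitespace, includes only four rows of the table to stay short, and the class and method names are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SampleSubjectLookup {
    // Title -> subject keywords, copied from four rows of Table 1.
    static Map<String, String> sampleSubjects() {
        Map<String, String> books = new LinkedHashMap<>();
        books.put("Java Development with Ant",
                  "apache jakarta ant build tool junit java development");
        books.put("JUnit in Action", "junit unit testing mock objects");
        books.put("Lucene in Action", "lucene search");
        books.put("Tao Te Ching", "taoism");
        return books;
    }

    // Maps each whitespace-separated subject term to the titles containing it.
    static Map<String, List<String>> indexBySubjectTerm(Map<String, String> books) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (Map.Entry<String, String> book : books.entrySet()) {
            for (String term : book.getValue().split("\\s+")) {
                List<String> titles = index.get(term);
                if (titles == null) {
                    titles = new ArrayList<>();
                    index.put(term, titles);
                }
                if (!titles.contains(book.getKey())) {
                    titles.add(book.getKey());
                }
            }
        }
        return index;
    }

    // Returns the titles whose subject contains the exact term, or an empty list.
    static List<String> titlesFor(Map<String, List<String>> index, String term) {
        List<String> titles = index.get(term);
        return titles == null ? Collections.<String>emptyList() : titles;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = indexBySubjectTerm(sampleSubjects());
        System.out.println(titlesFor(index, "ant"));
        // prints [Java Development with Ant]
        System.out.println(titlesFor(index, "junit"));
        // prints [Java Development with Ant, JUnit in Action]
    }
}
```

The counts line up with the testTerm example shown earlier: one book has "ant" in its subject and two have "junit".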
Code conventions and downloads
Source code in listings or in text is in a fixed-width font to separate it from ordinary text. Java method names, within text, generally won't include the full method signature.
In order to accommodate the available page space, code has been formatted with a limited width, including line continuation markers where appropriate.
We don't include import statements and rarely refer to fully qualified class names; doing so gets in the way and takes up valuable space. Refer to Lucene's Javadocs for this information. All decent IDEs have excellent support for automatically adding import statements; Erik blissfully codes without knowing fully qualified class names using IntelliJ IDEA, and Otis does the same with XEmacs. Add the Lucene JAR to your project's classpath, and you're all set. Also on the classpath issue (which is a notorious nuisance), we assume that the Lucene JAR and any other necessary JARs are available in the classpath and don't show it explicitly.
We've created a lot of examples for this book that are freely available to you. A .zip file of all the code is available from Manning's web site for Lucene in Action: http://www.manning.com/hatcher2. Detailed instructions on running the sample code are provided in the main directory of the expanded archive as a README file.
Author online
The purchase of Lucene in Action includes free access to a private web forum run by Manning Publications, where you can discuss the book with the authors and other readers. To access the forum and subscribe to it, point your web browser to http://www.manning.com/hatcher2. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.
About the authors
Erik Hatcher codes, writes, and speaks on technical topics that he finds fun and challenging. He has written software for a number of diverse industries using many different technologies and languages. Erik coauthored Java Development with Ant (Manning, 2002) with Steve Loughran, a book that has received wonderful industry acclaim. Since the release of Erik's first book, he has spoken at numerous venues including the No Fluff, Just Stuff symposium circuit, JavaOne, O'Reilly's Open Source Convention, the Open Source Content Management Conference, and many Java User Group meetings. As an Apache Software Foundation member, he is an active contributor and committer on several Apache projects including Lucene, Ant, and Tapestry. Erik currently works at the University of Virginia's Humanities department supporting Applied Research in Patacriticism. He lives in Charlottesville, Virginia with his beautiful wife, Carole, and two astounding sons, Ethan and Jakob.
Otis Gospodnetić has been an active Lucene developer for four years and maintains the jGuru Lucene FAQ. He is a Software Engineer at Wireless Generation, a company that develops technology solutions for educational assessments of students and teachers. In his spare time, he develops Simpy, a Personal Web service that uses Lucene, which he created out of his passion for knowledge, information retrieval, and management. Previous technical publications include several articles about Lucene, published by O'Reilly Network and IBM developerWorks. Otis also wrote To Choose and Be Chosen: Pursuing Education in America, a guidebook for foreigners wishing to study in the United States; it's based on his own experience. Otis is from Croatia and currently lives in New York City.
About the title
By combining introductions, overviews, and how-to examples, the In Action books are designed to help learning and remembering. According to research in cognitive science, the things people remember are things they discover during self-motivated exploration.
Although no one at Manning is a cognitive scientist, we are convinced that for learning to become permanent it must pass through stages of exploration, play, and, interestingly, re-telling of what is being learned. People understand and remember new things, which is to say they master them, only after actively exploring them. Humans learn in action. An essential part of an In Action guide is that it is example-driven. It encourages the reader to try things out, to play with new code, and explore new ideas.
There is another, more mundane, reason for the title of this book: our readers are busy. They use books to do a job or solve a problem. They need books that allow them to jump in and jump out easily and learn just what they want just when they want it. They need books that aid them in action. The books in this series are designed for such readers.
About the cover illustration
The figure on the cover of Lucene in Action is "An inhabitant of the coast of Syria." The illustration is taken from a collection of costumes of the Ottoman Empire published on January 1, 1802, by William Miller of Old Bond Street, London. The title page is missing from the collection and we have been unable to track it down to date. The book's table of contents identifies the figures in both English and French, and each illustration bears the names of two artists who worked on it, both of whom would no doubt be surprised to find their art gracing the front cover of a computer programming book… two hundred years later.
The collection was purchased by a Manning editor at an antiquarian flea market in the Garage on West 26th Street in Manhattan. The seller was an American based in Ankara, Turkey, and the transaction took place just as he was packing up his stand for the day. The Manning editor did not have on his person the substantial amount of cash that was required for the purchase and a credit card and check were both politely turned down.
With the seller flying back to Ankara that evening the situation was getting hopeless. What was the solution? It turned out to be nothing more than an old-fashioned verbal agreement sealed with a handshake. The seller simply proposed that the money be transferred to him by wire and the editor walked out with the seller's bank information on a piece of paper and the portfolio of images under his arm. Needless to say, we transferred the funds the next day, and we remain grateful and impressed by this unknown person's trust in one of us. It recalls something that might have happened a long time ago.
The pictures from the Ottoman collection, like the other illustrations that appear on our covers, bring to life the richness and variety of dress customs of two centuries ago. They recall the sense of isolation and distance of that period, and of every other historic period except our own hyperkinetic present.
Dress codes have changed since then and the diversity by region, so rich at the time, has faded away. It is now often hard to tell the inhabitant of one continent from another. Perhaps, trying to view it optimistically, we have traded a cultural and visual diversity for a more varied personal life. Or a more varied and interesting intellectual and technical life.
We at Manning celebrate the inventiveness, the initiative, and, yes, the fun of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by the pictures from this collection.
DESCRIPTION
Lucene is a gem in the open-source world--a highly scalable, fast search engine. It delivers performance and is disarmingly easy to use. Lucene in Action is the authoritative guide to Lucene. It describes how to index your data, including types you definitely need to know such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, filtering, and highlighting search results.
Lucene powers search in surprising places--in discussion groups at Fortune 100 companies, in commercial issue trackers, in email search from Microsoft, in the Nutch web search engine (that scales to billions of pages). It is used by diverse companies including Akamai, Overture, Technorati, HotJobs, Epiphany, FedEx, Mayo Clinic, MIT, New Scientist Magazine, and many others.
Adding search to your application can be easy. With many reusable examples and good advice on best practices, Lucene in Action shows you how. And if you would like to search through Lucene in Action over the Web, you can do so using Lucene itself as the search engine--take a look at the authors' awesome Search Inside solution. Its results page resembles Google's and provides a novel yet familiar interface to the entire book and book blog.
What's Inside
- How to integrate Lucene into your applications
- Ready-to-use framework for rich document handling
- Case studies including Nutch, TheServerSide, jGuru, etc.
- Lucene ports to Perl, Python, C#/.Net, and C++
- Sorting, filtering, term vectors, and multiple and remote index searching
- The new SpanQuery family, extending query parser, hit collecting
- Performance testing and tuning
- Lucene add-ons (hit highlighting, synonym lookup, and others)
- Foreword by Doug Cutting, the inventor of Lucene
WHAT THE EXPERTS SAY ABOUT THIS BOOK...
"...packed with examples and advice on how to effectively use this incredibly powerful tool." -- Brian Goetz, Principal Consultant, Quiotix Corporation
"...it unlocked for me the amazing power of Lucene." -- Reece Wilton, Staff Engineer, Walt Disney Internet Group
"...the code examples are useful and reusable." -- Scott Ganyo, Jakarta Lucene Committer
"...code samples as JUnit test cases are incredibly helpful." -- Norman Richards, co-author XDoclet in Action
WHAT THE READERS SAY ABOUT THIS BOOK...
"My suggestion to you: pick up a copy of Lucene in Action. You'll get plenty of support on this mailing list, but you can educate yourself much more effectively with that book...It's the cheapest consulting ($40) you can get." -- Participant on java-user@lucene.apache.org
"Our development team has been able to fully implement/integrate Lucene in our system in just a week. That's an absolute record, for us, for the adoption of any component and Lucene in Action has been invaluable in achieving it, as well as the easy, nice architecture of Lucene, of course, that is so well explained in the book." -- Irakli N.
"I bought the Lucene in Action ebook, which is excellent and I can strongly recommend [it]. ...Thanks to the authors for Lucene in Action, it's given me the high level best practices I was needing." -- Steve S.
ABOUT THE AUTHORS...
A committer on the Ant, Lucene, and Tapestry open-source projects, Erik Hatcher is coauthor of Manning's award-winning Java Development with Ant.
Otis Gospodnetić is a Lucene committer, a member of the Apache Jakarta Project Management Committee, and maintainer of jGuru's Lucene FAQ.
Both authors have published numerous technical articles including several on Lucene.

