Lucene in Action
Erik Hatcher and Otis Gospodnetić
2004 | 456 pages | ISBN: 1932394281

$44.95 | Softbound print book
$22.50 | ThoutReader + PDF ebook
Preface
From Erik Hatcher
I've been intrigued with searching and indexing from the early days of the Internet. I have fond memories (circa 1991) of managing an email list using majordomo, MUSH (Mail User's Shell), and a handful of Perl, awk, and shell scripts. I implemented a CGI web interface to allow users to search the list archives and other users' profiles using grep tricks under the covers. Then along came Yahoo!, AltaVista, and Excite, all of which I visited regularly.
After my first child, Jakob, was born, my digital photo archive began growing rapidly. I was intrigued with the idea of developing a system to manage the pictures so that I could attach meta-data to each picture, such as keywords and date taken, and, of course, locate the pictures easily in any dimension I chose. In the late 1990s, I prototyped a filesystem-based approach using Microsoft technologies, including Microsoft Index Server, Active Server Pages, and a third COM component for image manipulation. At the time, my professional life was consumed with these same technologies. I was able to cobble together a compelling application in a couple of days of spare-time hacking.
My professional life shifted toward Java technologies, and my computing life consisted of less and less Microsoft Windows. In an effort to reimplement my personal photo archive and search engine in Java technologies in an operating system-agnostic way, I came across Lucene. Lucene's ease of use far exceeded my expectations; I had experienced numerous other open-source libraries and tools that were far simpler conceptually yet far more complex to use.
In 2001, Steve Loughran and I began writing Java Development with Ant (Manning). We took the idea of an image search engine application and generalized it as a document search engine. This application example is used throughout the Ant book and can be customized as an image search engine. The tie to Ant comes not only from a simple compile-and-package build process but also from a custom Ant task, <index>, we created that indexes files during the build process using Lucene. This Ant task now lives in Lucene's Sandbox and is described in section 8.4 of this book.
This Ant task is in production use for my custom blogging system, which I call BlogScene (http://www.blogscene.org/erik). I run an Ant build process, after creating a blog entry, which indexes new entries and uploads them to my server. My blog server consists of a servlet, some Velocity templates, and a Lucene index, allowing for rich queries, even syndication of queries. Compared to other blogging systems, BlogScene is vastly inferior in features and finesse, but the full-text search capabilities are very powerful.
I'm now working with the Applied Research in 'Patacriticism group at the University of Virginia (http://www.patacriticism.org), where I'm putting my text analysis, indexing, and searching expertise to the test and stretching my mind with discussions of how quantum physics relates to literature. Poets are the unacknowledged engineers of the world.
From Otis Gospodnetić
My interest in and passion for information retrieval and management began during my student years at Middlebury College. At that time, I discovered an immense source of information known as the Web. Although the Web was still in its infancy, the long-term need for gathering, analyzing, indexing, and searching was evident. I became obsessed with creating repositories of information pulled from the Web, began writing web crawlers, and dreamed of ways to search the collected information. I viewed search as the killer application in a largely uncharted territory. With that in the back of my mind, I began the first in my series of projects that share a common denominator: gathering and searching information.
In 1995, fellow student Marshall Levin and I created WebPh, an open-source program used for collecting and retrieving personal contact information. In essence, it was a simple electronic phone book with a web interface (CGI), one of the first of its kind at that time. (In fact, it was cited as an example of prior art in a court case in the late 1990s!) Universities and government institutions around the world have been the primary adopters of this program, and many are still using it. In 1997, armed with my WebPh experience, I proceeded to create Populus, a popular white pages at the time. Even though the technology (similar to that of WebPh) was rudimentary, Populus carried its weight and was a comparable match to the big players such as WhoWhere, Bigfoot, and Infospace.
After two projects that focused on personal contact information, it was time to explore new territory. I began my next venture, Infojump, which involved culling high-quality information from online newsletters, journals, newspapers, and magazines. In addition to my own software, which consisted of large sets of Perl modules and scripts, Infojump utilized a web crawler called Webinator and a full-text search product called Texis. The service provided by Infojump in 1998 was much like that of FindArticles.com today.
Although WebPh, Populus, and Infojump served their purposes and were fully functional, they all had technical limitations. The missing piece in each of them was a powerful information-retrieval library that would allow full-text searches backed by inverted indexes. Instead of trying to reinvent the wheel, I started looking for a solution that I suspected was out there. In early 2000, I found Lucene, the missing piece I'd been looking for, and I fell in love with it.
I joined the Lucene project early on when it still lived at SourceForge and, later, at the Apache Software Foundation when Lucene migrated there in 2002. My devotion to Lucene stems from its being a core component of many ideas that had queued up in my mind over the years. One of those ideas was Simpy, my latest pet project. Simpy is a feature-rich personal web service that lets users tag, index, search, and share information found online. It makes heavy use of Lucene, with thousands of its indexes, and is powered by Nutch, another project of Doug Cutting's (see chapter 10). My active participation in the Lucene project resulted in an offer from Manning to co-author Lucene in Action with Erik Hatcher.
Lucene in Action is the most comprehensive source of information about Lucene. The information contained in the next 10 chapters encompasses all the knowledge you need to create sophisticated applications built on top of Lucene. It's the result of a very smooth and agile collaboration process, much like that within the Lucene community. Lucene and Lucene in Action exemplify what people can achieve when they have similar interests, the willingness to be flexible, and the desire to contribute to the global knowledge pool, despite the fact that they have yet to meet in person.
DESCRIPTION
Lucene is a gem in the open-source world--a highly scalable, fast search engine. It delivers performance and is disarmingly easy to use. Lucene in Action is the authoritative guide to Lucene. It describes how to index your data, including types you definitely need to know such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, filtering, and highlighting search results.
Lucene powers search in surprising places--in discussion groups at Fortune 100 companies, in commercial issue trackers, in email search from Microsoft, in the Nutch web search engine (that scales to billions of pages). It is used by diverse companies including Akamai, Overture, Technorati, HotJobs, Epiphany, FedEx, Mayo Clinic, MIT, New Scientist Magazine, and many others.
Adding search to your application can be easy. With many reusable examples and good advice on best practices, Lucene in Action shows you how. And if you would like to search through Lucene in Action over the Web, you can do so using Lucene itself as the search engine--take a look at the authors' awesome Search Inside solution. Its results page resembles Google's and provides a novel yet familiar interface to the entire book and book blog.
What's Inside
- How to integrate Lucene into your applications
- Ready-to-use framework for rich document handling
- Case studies including Nutch, TheServerSide, jGuru, etc.
- Lucene ports to Perl, Python, C#/.Net, and C++
- Sorting, filtering, term vectors, and multiple and remote index searching
- The new SpanQuery family, extending the query parser, hit collecting
- Performance testing and tuning
- Lucene add-ons (hit highlighting, synonym lookup, and others)
- Foreword by Doug Cutting, the inventor of Lucene
WHAT THE EXPERTS SAY ABOUT THIS BOOK...
"...packed with examples and advice on how to effectively use this incredibly powerful tool." -- Brian Goetz, Principal Consultant, Quiotix Corporation
"...it unlocked for me the amazing power of Lucene." -- Reece Wilton, Staff Engineer, Walt Disney Internet Group
"...the code examples are useful and reusable." -- Scott Ganyo, Jakarta Lucene Committer
"...code samples as JUnit test cases are incredibly helpful." -- Norman Richards, co-author XDoclet in Action
WHAT THE READERS SAY ABOUT THIS BOOK...
"My suggestion to you: pick up a copy of Lucene in Action. You'll
get plenty of support on this mailing list, but you can educate yourself much
more effectively with that book...It's the cheapest consulting ($40) you can get."
-- Participant on java-user@lucene.apache.org
"Our development team has been able to fully
implement/integrate Lucene in our system in just a
week. That's an absolute record, for us, for the
adoption of any component and Lucene in Action has
been invaluable in achieving it, as well as the easy,
nice architecture of Lucene, of course, that is so
well explained in the book."
-- Irakli N.
"I bought the Lucene in Action ebook, which is
excellent and I can strongly recommend [it].
...Thanks to the authors for Lucene in Action,
it's given me the high level best practices
I was needing."
-- Steve S.
ABOUT THE AUTHORS...
A committer on the Ant, Lucene, and Tapestry open-source projects, Erik Hatcher is coauthor of Manning's award-winning Java Development with Ant.
Otis Gospodnetić is a Lucene committer, a member of the Apache Jakarta Project Management Committee, and maintainer of jGuru's Lucene FAQ.
Both authors have published numerous technical articles, including several on Lucene.

