Tuesday, September 8, 2009

What is Hibernate Search?


I'm getting this question relatively often, so I think the existing information online either assumes too much or is too practical; I'll try to fill this gap with a very basic introduction.
Hibernate Search is an open source Java project which integrates Hibernate with Lucene; both libraries have proven themselves extremely useful, are stable, and are in widespread use; in practice many projects face the need to use both.
Unfortunately the string-oriented world of Lucene is quite different from the Hibernate world, and every project trying to integrate the two is doomed to face the same problems: rewriting more or less the same glue code, and having more code to maintain because of its own bugs or because of API changes in one or both frameworks.

Lucene
Lucene is an Apache library which provides full-text capabilities: you create an index (in memory, on the filesystem, in a database, ...) and then you can search this index with keywords, phrases, boolean queries, etc.
The results are commonly returned by relevance, so the best matching documents are returned first (think of a web search engine like Google). The main point is that you have full control over how your items are parsed before they enter the string world of the index, so you choose which information is important for your business and how you define the matching rules. It is very fast and generally considered stable, and new features are constantly added.
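To make this concrete, here is a minimal sketch of using Lucene directly, written against the Lucene 2.x-era API (method signatures changed in later versions, so treat this as illustrative): it indexes one document in memory, then searches it by keyword.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class LuceneHelloWorld {
    public static void main(String[] args) throws Exception {
        RAMDirectory directory = new RAMDirectory();   // in-memory index
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index a single document with a tokenized "title" field
        IndexWriter writer = new IndexWriter(directory, analyzer, true);
        Document doc = new Document();
        doc.add(new Field("title", "Hibernate Search in Action",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // Search the index for the keyword "hibernate"
        IndexSearcher searcher = new IndexSearcher(directory);
        QueryParser parser = new QueryParser("title", analyzer);
        TopDocs hits = searcher.search(parser.parse("hibernate"), 10);
        System.out.println(hits.totalHits + " matching document(s)");
        searcher.close();
    }
}
```

Note how you decide, field by field, what is stored and how it is tokenized: that is the "full control" mentioned above.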
Being extremely flexible, working directly with Lucene feels like "low level" programming, so applications often introduce a separation layer to standardize the way it is used across the application, hiding some of the flexibility and possibly introducing some helpers.

Hibernate
The aim of this very successful open source project is to simplify the interaction between the application and the database; technically it's an Object-Relational Mapping service, and you'll find plenty of information and tutorials about it on the web. The important point for introducing Hibernate Search is that Hibernate lets you use POJOs to define the domain model of your application, annotating them to define the mapping to the database, and provides good APIs and even an object-oriented query language to interact with the database, all nicely fitted into a transactional world.

Hibernate Search
Hibernate Search is built on top of Lucene, like Hibernate is built on top of your SQL database. As Hibernate maps POJOs to tables, Hibernate Search maps them to Lucene's index, introducing a new set of annotations. The interesting point here is that you annotate the same entities with both families of annotations, and whether you make a Hibernate query to the database or a Lucene query to the index, you get Hibernate-managed entities in both cases. You define your domain model - which is unique - and how it maps to the database and to the index. When you make changes to your data, the service will update both database and index at transaction commit.
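A minimal sketch of such a doubly-annotated entity, using the Hibernate Search 3.x annotations (the Book class is just an example; getters and setters omitted):

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

// The same POJO carries both mapping families: JPA annotations for the
// database, Hibernate Search annotations for the Lucene index.
@Entity       // maps to a database table
@Indexed      // maps to a Lucene index
public class Book {

    @Id          // primary key in the database
    @DocumentId  // identifier of the Lucene document
    private Long id;

    @Field       // this property is indexed for full-text search
    private String title;

    // getters and setters omitted
}
```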
The API to run and paginate queries is an extension of Hibernate's (and JPA's) API, so the changes needed in an application to introduce full-text capabilities are minimal.
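For example, a full-text query is created from a regular Hibernate Session and paginated with the familiar methods. This is a sketch against the Hibernate Search 3.1 API; the Book entity, the "title" field and the searchBooks helper are illustrative:

```java
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

public class BookSearch {

    // Runs a full-text query and paginates it with the usual Hibernate methods
    @SuppressWarnings("unchecked")
    public static List<Book> searchBooks(Session session, String keywords) throws Exception {
        FullTextSession fullTextSession = Search.getFullTextSession(session);
        org.apache.lucene.search.Query luceneQuery =
                new QueryParser("title", new StandardAnalyzer()).parse(keywords);
        // createFullTextQuery returns a regular org.hibernate.Query, so
        // pagination works exactly as with any other Hibernate query;
        // the result list contains managed Book entities
        return fullTextSession.createFullTextQuery(luceneQuery, Book.class)
                              .setFirstResult(0)
                              .setMaxResults(20)
                              .list();
    }
}
```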
When using Lucene directly the code usually gets quite verbose, for example when defining Analyzers or Filters; with Hibernate Search you can define these declaratively and reuse them by name. Last but not least, it makes use of several performance-improving tricks, like sharing file buffers across concurrent reading sessions, caching filter results, batching index changes, and clustering solutions. All nice capabilities which you don't need to know about, but they are there in case you'll need them.
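As a sketch of the declarative style, here is Hibernate Search 3.1's @AnalyzerDef, which reuses Solr's analyzer factories; the name "customAnalyzer" and the Article entity are just examples:

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.SnowballPorterFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

@Entity
@Indexed
// Declare the analyzer once, giving it a name...
@AnalyzerDef(name = "customAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = SnowballPorterFilterFactory.class,
                        params = @Parameter(name = "language", value = "English"))
    })
public class Article {

    @Id @DocumentId
    private Long id;

    // ...and reuse it wherever needed, again by name
    @Field
    @Analyzer(definition = "customAnalyzer")
    private String body;
}
```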

Flexibility
Even being a simplifying layer between the application and Lucene, it doesn't hide any advanced feature but provides tools to make use of them. Developers can customize all aspects: from defining custom bridges for your types up to replacing/extending whole parts of the framework. Each major component can be replaced with custom code: define your own index storage strategy by creating a custom DirectoryProvider, use your own LockManager, create a new IndexShardingStrategy, fine-tune all performance settings which Lucene exposes. If you're still missing something, you're free to change the code and submit patches.
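As a configuration sketch, following the Hibernate Search 3.x property naming scheme (the com.example class names are hypothetical placeholders for your own implementations):

```properties
# Global defaults
hibernate.search.default.directory_provider = filesystem
hibernate.search.default.indexBase = /var/lucene/indexes

# Per-index overrides: plug custom components into the "Books" index only
hibernate.search.Books.directory_provider = com.example.MyDirectoryProvider
hibernate.search.Books.sharding_strategy = com.example.MyShardingStrategy
hibernate.search.Books.sharding_strategy.nbr_of_shards = 4
```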

Websites:
Hibernate Search - website
Hibernate Search - forums
Lucene's Java implementation website

Books:
Hibernate Search in Action
Java Persistence with Hibernate
Lucene in Action, Second Edition

Sunday, September 28, 2008

Ideas on Hibernate Search

I'm reading the preview of the upcoming excellent Hibernate Search in Action and I am getting inspiration for some improvements I would like to implement, in particular:

Multiple backends
As each index has its own DirectoryProvider, why not also its own backend? Some use cases:
  • needing an entity to be indexed "sync" while others "async".
  • wanting an index using JMS, another async / local.
  • using different JMS queues without a selector.
The implementation is not hard at all: I'll have to move some classes from the refactored org.hibernate.search.backend.impl.lucene (I'm working on it already) to org.hibernate.search.backend; probably most of the work will be about discussing the best and simplest way for users to configure them.
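A hypothetical configuration sketch of what per-index backends might look like; note that at the time of writing only the global hibernate.search.worker.* keys exist, the per-index form shown here is exactly what is being proposed:

```properties
# index Orders synchronously (hypothetical per-index key)
hibernate.search.Orders.worker.execution = sync
# Logs can lag a little behind
hibernate.search.Logs.worker.execution = async
# Articles delegated to a JMS queue
hibernate.search.Articles.worker.backend = jms
```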

A scalability improving ReaderProvider
The ReaderProviders all have to guarantee that the returned IndexReader is absolutely up to date; however this doesn't make much sense in "async" mode, as the user would probably prefer to trade a slightly outdated reader for some extra throughput.
In my opinion most users still use the "async" mode for most entities, in particular if we enable different backends, as in my experience there usually are a few entities which need "sync" mode.
I was thinking about implementing a new ReaderProvider which could "wrap" another implementation (for flexibility) and then periodically retrieve a new IndexReader from the wrapped one, at a configurable interval.
So two initialization arguments: the backing implementation (class or name) and the refresh period (ms).
This way, if the wrapped ReaderProvider is a plain NotSharedReaderProvider, the index would be reopened every X ms:
  • drawback: potentially opening more than needed under low load.
  • advantage: the actual rate is controlled even under high load.
Additionally, if it is wrapping a smarter implementation like shared or shared-segments, the drawback degenerates into just some useless checks for new files, instead of really reading all data.
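The wrapping idea can be sketched independently of Lucene: a generic provider that caches whatever its delegate returns and asks the delegate for a fresh one only when the configured period has elapsed. All names here are illustrative, not Hibernate Search API; the clock is injectable just to keep the sketch testable:

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Sketch of a "wrap another provider, refresh on a timer" strategy.
class PeriodicRefreshProvider<T> {
    private final Supplier<T> delegate;   // the wrapped provider
    private final long periodMillis;      // refresh frequency
    private final LongSupplier clock;     // injectable clock, for testing
    private T cached;
    private long lastRefresh;

    PeriodicRefreshProvider(Supplier<T> delegate, long periodMillis, LongSupplier clock) {
        this.delegate = delegate;
        this.periodMillis = periodMillis;
        this.clock = clock;
    }

    // Returns the cached resource, refreshing it from the delegate only
    // when the period has elapsed: callers accept a slightly stale result
    // in exchange for not hitting the delegate on every call.
    synchronized T get() {
        long now = clock.getAsLong();
        if (cached == null || now - lastRefresh >= periodMillis) {
            cached = delegate.get();
            lastRefresh = now;
        }
        return cached;
    }
}
```

A real implementation would of course wrap a ReaderProvider and also handle closing the previously cached IndexReader.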

I am actually sorry I didn't have this idea earlier, as I think the implementation is trivial but needs good docs and explanation... too late for inclusion in the book?

Automatic Sharding strategies
As Emmanuel explains in the book, the IdHashShardingStrategy provided with Hibernate Search is more of a "demo" strategy, as the most interesting strategies depend on the user's needs.
It occurs to me that the tips he gives could apply very well to an entity having an Enum property, in which case a great IndexShardingStrategy could be generated automatically and "autoconfigured", as we already know the number of elements and are guaranteed they have different names - good for index name postfixes or something like that.
Just add an annotation to the field, something like
@ShardDiscriminator.
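The core of the idea can be sketched in plain Java: with an Enum discriminator, both the shard index and a stable index-name suffix fall out of the enum itself (Status, shardFor and indexNameFor are hypothetical names, not an actual API):

```java
// Hypothetical enum property acting as the shard discriminator
enum Status { DRAFT, PUBLISHED, ARCHIVED }

class EnumShardingSketch {
    // One shard per enum value: the ordinal doubles as the shard index,
    // so the number of shards is known at configuration time
    static int shardFor(Status status) {
        return status.ordinal();
    }

    // The enum names are guaranteed unique, so they make safe index-name
    // postfixes
    static String indexNameFor(String baseName, Status status) {
        return baseName + "." + status.name();
    }
}
```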

Improved Filters exploiting improved Sharding...
Having created such an optimal IndexShardingStrategy would further enable the code to create a special Filter to be used during searches, which can improve search performance by selecting the correct index to search in (avoiding searching all indexes).
When adding a new value the SearchFactory must be restarted anyway, so the fact that you can't add a new Enum value dynamically should not be a limit: you shouldn't anyway, and I actually like the fact that you can't. It looks like the Java compiler enforces correct Hibernate Search configuration and usage...

The same logic could be applied to any String field or @ManyToOne pointing to an entity whose number of instances will not change and which has a way to be transformed into a unique index identifier (require a toString or have it implement an interface?); we could start by supporting Enums and see how good it is.

Monday, July 23, 2007

Updating to JBoss 4.2.1

I'm going to update an enterprise application from JBoss 4.0.5 to JBoss 4.2, writing here a memorandum of the most important steps and considerations, as requested by friends and my boss.
Why update?
  • Version 4.2 fixes some bugs which were affecting our application.
  • Red Hat announced they are going to use 4.2 in their Enterprise Application Platform, so they are committed to supporting this version for a long time.
  • I had already updated some libraries of JBoss 4.0.5 to use some features we needed; the new version comes with the same updated libraries, so I think they will be more compatible than my souped-up, unsupported version.
Technologies
The application to migrate is a JavaEE web application, developed in Eclipse and using these technologies:
  • Seam 1.2.1 is the integrating framework.
  • Facelets for page design.
  • Hibernate 3.2 : core, search, annotations and entitymanager.
  • Lucene 2.2 for fast full-text searching: both using hibernate search and custom code.
  • JTDS as JDBC driver.
  • SQL Server 2000 as database, on a Windows server.
  • Fedora Linux 6 for the webserver.
  • "some" JSF implementation...
We were using the myfaces JSF implementation, but the JBoss and Seam people now recommend using the Sun reference implementation, which comes bundled as the default in the new application server. You have two options:
  1. keep the myfaces JSF implementation.
  2. update to Sun's RI JSF implementation.
We had no particular need to keep myfaces, and since the second option is the one recommended by the JBoss team, we are actually going to see how painless it is to switch implementations.

So now we can begin our
Migration checklist to JBoss 4.2.1 and Sun's JSF
from JBoss 4.0.5 and myfaces.
In application.xml, remove the following modules:
<module>
<java>el-api.jar</java>
</module>
<module>
<java>el-ri.jar</java>
</module>
and add this one instead:
<module>
<java>commons-collections-3.1.jar</java>
</module>
You will need to remove and add the corresponding jars in the root of your ear. commons-collections-3.1 is needed by Ajax4jsf; JBoss now ships with a different version.

In faces-config.xml add

<application>
<el-resolver>org.jboss.seam.jsf.SeamELResolver</el-resolver>
<message-bundle>messages</message-bundle>
</application>
and update the headers to:
<faces-config version="1.2"
xmlns="http://java.sun.com/xml/ns/javaee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee
http://java.sun.com/xml/ns/javaee/web-facesconfig_1_2.xsd">

In web.xml remove this listener:

org.apache.myfaces.webapp.StartupServletContextListener
and all other references to myfaces classes. No new listener should be needed.
When using Tomahawk you may want to keep some org.apache.myfaces context parameters, as Tomahawk should work on Sun's RI but still reads parameters named in the myfaces style.

You may also want to update the header of jboss-app.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE jboss-app
PUBLIC "-//JBoss//DTD J2EE Application 4.2//EN"
"http://www.jboss.org/j2ee/dtd/jboss-app_4_2.dtd">
Problems & some fixes
Content-Type
Using myfaces the rendered pages had "Content-Type: text/html;" (as reported by Firefox's "Web Developer" plugin); using Sun's RI the content type is now "Content-Type: application/xhtml+xml;".
Technically the second one should be better, as discussed here, but it brings some issues:
  1. Even with the same HTML and CSS, the pages could look different.
  2. Internet Explorer (up to 6) doesn't like "xhtml+xml".
  3. Some redirects won't work.
The good news is that Firefox will now do a full check of your pages, so maybe you'll find some errors faster. To get the content served as text/html, define the contentType attribute in your views:
<f:view contentType="text/html"...
Still, this leaves a problem with Seam's PDF rendering: there's no way to modify the content type of the redirect servlet that brings you from the page to the PDF download link. I'm going to see if I can get Apache to force the "correct" contentType served to clients.