Coding Obsession: Hibernate Search

Starting today I'm also writing on the very cool in.relation.to blog.
This is the collective blog of experts from the Seam and Hibernate teams - I've been reading it for years - and I am honored they invited me to write about my contribution to Hibernate Search.
It's amazing how these highly esteemed developers welcome contributions and are open to any kind of discussion.

Recently I saw some very sad statements about the JBoss community not being truly open, or not being meritocratic; I think that people who believe that either never really tried to contribute, or met the wrong person at the wrong moment: like all communities, this one is big and made of humans.
When I first wanted to contribute I wasn't an expert at all; still, I was welcomed for my interest in the project and I always got - and still get - polite answers even to my silliest questions and doubts. After the traditional couple of patches were accepted, I slowly began to feel part of a team. I might have been lucky, but the luck has held, as every single person I keep meeting in these groups is at the same time very kind, smart and helpful. Just keep in mind they're all very busy: an answer could take some time.

So today I wrote this post about Hibernate Search's new MassIndexer: read it, comment on it, make use of it! Then ask for improvements and join the fun :-)

I'm reading the preview of the excellent upcoming Hibernate Search in Action and it is giving me inspiration for some improvements I would like to implement, in particular:

Multiple backends
As each index has its own DirectoryProvider, why not also its own backend? This would help when:

needing an entity to be indexed "sync" while others "async".
wanting an index using JMS, another async / local.
using different JMS queues without a selector.

The implementation is not hard at all: I'll have to move some classes from the refactored org.hibernate.search.backend.impl.lucene (I'm working on it already) to org.hibernate.search.backend. Most of the work will probably be discussing the best and simplest way for users to configure them.
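To make the configuration question more concrete, here is a minimal sketch of one possible lookup rule: a per-index setting wins, otherwise the "default" setting applies. The BackendResolver class and the per-index property key pattern are assumptions made up for illustration, not existing Hibernate Search configuration; only the backend names "lucene" and "jms" are the usual ones.

```java
import java.util.Properties;

// Hypothetical helper: neither this class nor the per-index key pattern
// is part of Hibernate Search; it only illustrates the fallback idea.
final class BackendResolver {
    static final String PREFIX = "hibernate.search.";
    static final String SUFFIX = ".worker.backend";

    private final Properties config;

    BackendResolver(Properties config) {
        this.config = config;
    }

    /** Backend for an index: its own setting, else the "default" one, else "lucene". */
    String backendFor(String indexName) {
        String specific = config.getProperty(PREFIX + indexName + SUFFIX);
        if (specific != null) {
            return specific;
        }
        return config.getProperty(PREFIX + "default" + SUFFIX, "lucene");
    }
}
```

With such a rule a user would only need to configure the few indexes that differ from the default, for example one index going through JMS while all others stay local.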

A scalability-improving ReaderProvider
All ReaderProvider implementations have to guarantee that the returned IndexReader is fully up to date; however this doesn't make much sense in "async" mode, as the user would probably prefer to trade a slightly outdated reader for some extra throughput.
In my opinion most users will use the "async" mode for most entities, in particular if we enable different backends, as in my experience there are usually only a few entities which need "sync" mode.
I was thinking about implementing a new ReaderProvider which could "wrap" another implementation (for flexibility) and retrieve a new IndexReader from the wrapped one at a configurable interval.
So two initialization arguments: backing implementation (class or name) and refresh period (ms).
This way, if the wrapped ReaderProvider is a plain NotSharedReaderProvider, the index would be reopened every X ms:

drawback: potentially opening more than needed under low load.

advantage: the actual rate is controlled even under high load.

Additionally, if it is wrapping a smarter implementation like shared or shared-segments, the drawback degenerates into just some useless checks for new files, instead of really reading all the data.
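The wrapping idea can be sketched roughly as follows. This is not the real ReaderProvider interface: PeriodicRefresher is a made-up generic stand-in where R plays the role of the IndexReader and the delegate plays the role of the wrapped provider. Note also that this variant refreshes lazily on access instead of using a background timer, so it would not reopen anything while nobody is searching; closing the old reader is left out entirely.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Stand-in for the proposed wrapper, not Hibernate Search API.
// R stands for IndexReader, "delegate" for the wrapped ReaderProvider.
final class PeriodicRefresher<R> {
    private final Supplier<R> delegate;
    private final long periodMs;
    private final LongSupplier clockMs; // injectable clock, e.g. System::currentTimeMillis
    private R current;
    private long lastRefreshMs;

    PeriodicRefresher(Supplier<R> delegate, long periodMs, LongSupplier clockMs) {
        this.delegate = delegate;
        this.periodMs = periodMs;
        this.clockMs = clockMs;
    }

    /** Returns the cached instance, asking the delegate again once the period has elapsed. */
    synchronized R get() {
        long now = clockMs.getAsLong();
        if (current == null || now - lastRefreshMs >= periodMs) {
            current = delegate.get();
            lastRefreshMs = now;
        }
        return current;
    }
}
```

The two constructor arguments besides the clock map directly to the two initialization arguments mentioned above: the backing implementation and the period in milliseconds.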

I am actually sorry I didn't have this idea earlier, as I think the implementation is trivial but needs good docs and explanation... too late for inclusion in the book?

Automatic Sharding strategies
As Emmanuel explains in the book, the IdHashShardingStrategy provided with Hibernate Search is more of a "demo" strategy, as the most interesting strategies depend on the user's needs.
It occurs to me that the tips he gives could apply very well to an entity having an Enum property, in which case a great IndexShardingStrategy could be generated automatically and "autoconfigured", as we already know the number of elements and are guaranteed they have different names... good for index name suffixes or something like that.
Just add an annotation to the field, something like @ShardDiscriminator.
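A rough sketch of how the "autoconfiguration" could work: find the annotated Enum field, derive one index name per constant, and route each entity by its value. Everything here is hypothetical (the annotation itself, the EnumShards helper, the Book and Genre examples and the "Base.CONSTANT" naming scheme), and it deliberately does not implement the real IndexShardingStrategy interface.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical annotation, as proposed in the text; it does not exist in Hibernate Search.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ShardDiscriminator {}

enum Genre { FICTION, SCIENCE, HISTORY }

// Example entity, made up for this sketch.
final class Book {
    @ShardDiscriminator
    Genre genre;

    Book(Genre genre) {
        this.genre = genre;
    }
}

final class EnumShards {
    /** Finds the first Enum-typed field annotated with @ShardDiscriminator. */
    static Field discriminator(Class<?> entity) {
        for (Field f : entity.getDeclaredFields()) {
            if (f.isAnnotationPresent(ShardDiscriminator.class) && f.getType().isEnum()) {
                f.setAccessible(true);
                return f;
            }
        }
        throw new IllegalArgumentException("No Enum @ShardDiscriminator field on " + entity);
    }

    /** One index name per Enum constant, in declaration order. */
    static List<String> shardNames(String baseIndex, Class<?> entity) {
        List<String> names = new ArrayList<>();
        for (Object constant : discriminator(entity).getType().getEnumConstants()) {
            names.add(baseIndex + "." + ((Enum<?>) constant).name());
        }
        return names;
    }

    /** The index an entity instance belongs to, chosen by its discriminator value. */
    static String shardFor(String baseIndex, Object entity) {
        try {
            Enum<?> value = (Enum<?>) discriminator(entity.getClass()).get(entity);
            if (value == null) {
                throw new IllegalStateException("Discriminator value is null");
            }
            return baseIndex + "." + value.name();
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the set of constants is fixed at compile time, the number of shards and their names are known before any entity is indexed.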

Improved Filters exploiting improved Sharding...
Having created such an optimal IndexShardingStrategy would further enable the code to create a special Filter to be used during searches, capable of improving search performance by selecting the correct index to search in (avoiding a search across all indexes).
When adding a new value the SearchFactory must be restarted anyway, so the fact that you can't add a new Enum value dynamically should not be a limitation: you shouldn't anyway, and I actually like the fact that you can't. It looks like the Java compiler enforces correct Hibernate Search configuration and usage...
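The index selection such a Filter would rely on could look roughly like this: when the search is restricted to some Enum values, only the matching indexes are opened, otherwise all of them. ShardSelector, the Language enum and the "Base.CONSTANT" index naming are assumptions for illustration only; no real Filter or Hibernate Search API is used.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Example discriminator type, made up for this sketch.
enum Language { EN, IT, FR }

// Hypothetical helper, not Hibernate Search API.
final class ShardSelector {
    /**
     * Index names to search: only those matching the restriction,
     * or every shard when the restriction is empty.
     */
    static <E extends Enum<E>> List<String> shardsToSearch(
            String baseIndex, Class<E> type, Set<E> restrictedTo) {
        Set<E> wanted = restrictedTo.isEmpty() ? EnumSet.allOf(type) : restrictedTo;
        List<String> names = new ArrayList<>();
        for (E constant : type.getEnumConstants()) {
            if (wanted.contains(constant)) {
                names.add(baseIndex + "." + constant.name());
            }
        }
        return names;
    }
}
```

A query restricted to Language.IT would then touch a single index instead of three.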

The same logic could be applied to any String field or @ManyToOne pointing to an entity whose number of distinct values will not change and which has a way to be transformed into a unique index identifier (require a toString or have it implement an interface?). We could start by supporting the Enum and see how good it is.

Coding Obsession

Tuesday, December 8, 2009

Blogging now also at in.relation.to

Sunday, September 28, 2008

Ideas on Hibernate Search
