Sunday, September 28, 2008

Ideas on Hibernate Search

I'm reading the preview of the upcoming (and excellent) Hibernate Search in Action, and it is inspiring some improvements I would like to implement, in particular:

Multiple backends
As each index has its own DirectoryProvider, why not its own backend as well? Some use cases:
  • needing one entity to be indexed "sync" while others are "async".
  • wanting one index to use JMS and another to be async / local.
  • using different JMS queues without a selector.
The implementation is not hard at all: I'll have to move some classes from the refactored org.hibernate.search.backend.impl.lucene (I'm working on it already) to org.hibernate.search.backend. Most of the work will probably be in discussing the best and simplest way for users to configure the backends.
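To make the configuration question concrete, here is a minimal sketch of one possible lookup rule: a per-index setting overrides a default. Everything in it is an assumption for illustration only: the BackendResolver class, the "example.search." key prefix and the backend names are invented, and none of this is an existing Hibernate Search option.

```java
import java.util.Properties;

// Hypothetical sketch: resolves the backend name for an index,
// letting a per-index key override a default key.
final class BackendResolver {
    // Invented prefix, not a real Hibernate Search property namespace.
    private static final String PREFIX = "example.search.";
    private final Properties config;

    BackendResolver(Properties config) {
        this.config = config;
    }

    String backendFor(String indexName) {
        // 1. Per-index setting, e.g. example.search.Orders.backend
        String specific = config.getProperty(PREFIX + indexName + ".backend");
        if (specific != null) {
            return specific;
        }
        // 2. Default setting, falling back to "sync" if nothing is configured.
        return config.getProperty(PREFIX + "default.backend", "sync");
    }
}
```

With a default of "async", an "Orders" index could still be declared "sync" and an "Audit" index "jms", each with a single extra line of configuration.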

A scalability improving ReaderProvider
All the ReaderProviders have to guarantee that the returned IndexReader is completely up to date; however, this doesn't make much sense in "async" mode, as the user would probably prefer to trade a slightly outdated reader for some extra throughput.
In my opinion most users will use the "async" mode for most entities, in particular if we enable different backends, as in my experience there are usually only a few entities which need "sync" mode.
I was thinking about implementing a new ReaderProvider which could "wrap" another implementation (for flexibility) and then periodically retrieve a new IndexReader from the wrapped one at a configurable interval.
So two initialization arguments: backing implementation (class or name) and refresh period (ms).
This way, if the wrapped ReaderProvider is a plain NotSharedReaderProvider, the index would be reopened every X ms:
  • drawback: potentially opening more than needed under low load.
  • advantage: the actual rate is controlled even under high load.
Additionally, if it is wrapping a smarter implementation like shared or shared-segments, the drawback is reduced to a few unnecessary new-file checks instead of actually rereading all the data.
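The wrapping idea can be sketched roughly as follows. This is not the real ReaderProvider interface: to keep the example self-contained, the wrapped provider is modeled as a plain Supplier, and the PeriodicRefreshProvider name and the injectable clock are assumptions of mine. A real implementation would also have to close the replaced reader once no search is using it, which is left out here.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: hands out a cached reader and asks the wrapped
// provider for a new one only when the refresh period has elapsed.
final class PeriodicRefreshProvider<R> {
    private final Supplier<R> wrapped;   // stands in for the backing ReaderProvider
    private final long periodMs;         // refresh period in milliseconds
    private final LongSupplier clock;    // e.g. System::currentTimeMillis
    private R current;
    private long lastRefresh;

    PeriodicRefreshProvider(Supplier<R> wrapped, long periodMs, LongSupplier clock) {
        this.wrapped = wrapped;
        this.periodMs = periodMs;
        this.clock = clock;
    }

    synchronized R get() {
        long now = clock.getAsLong();
        // Open on first use, then at most once per period, however
        // many searches arrive in between.
        if (current == null || now - lastRefresh >= periodMs) {
            current = wrapped.get();
            lastRefresh = now;
            // NOTE: closing the previous reader safely is omitted.
        }
        return current;
    }
}
```

Note that this variant refreshes lazily, on the first request after the period expires, so an idle index is not reopened at all; a background timer would instead give the fixed reopen rate described above.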

I am actually sorry I didn't have this idea earlier, as I think the implementation is trivial but needs good docs and explanation... too late for inclusion in the book?

Automatic Sharding strategies
As Emmanuel explains in the book, the IdHashShardingStrategy provided with H.Search is more of a "demo" strategy, as the most interesting strategies depend on the user's needs.
It occurs to me that the tips he gives could apply very well to an entity having an Enum property, in which case a great IndexShardingStrategy could be generated automatically and "autoconfigured": we already know the number of elements and are guaranteed that they have different names, which makes them good for index name suffixes or something like that.
Just add an annotation to the field, something like
@ShardDiscriminator.
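A rough sketch of how this could work, independent of the real IndexShardingStrategy interface: @ShardDiscriminator is only the annotation name proposed above, and EnumShardResolver, Customer, Region and the "base.CONSTANT" naming scheme are all made up for the example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Proposed marker annotation (does not exist in Hibernate Search).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ShardDiscriminator {}

enum Region { EUROPE, ASIA, AMERICAS }

// Example entity, invented for the sketch.
class Customer {
    @ShardDiscriminator
    Region region;
    String name;

    Customer(String name, Region region) {
        this.name = name;
        this.region = region;
    }
}

// Hypothetical helper: derives one shard name per enum constant.
final class EnumShardResolver {
    private final Field discriminator;
    private final List<String> shardNames;

    EnumShardResolver(Class<?> entityType, String baseIndexName) {
        Field found = null;
        for (Field f : entityType.getDeclaredFields()) {
            if (f.isAnnotationPresent(ShardDiscriminator.class) && f.getType().isEnum()) {
                found = f;
                break;
            }
        }
        if (found == null) {
            throw new IllegalArgumentException("No enum field annotated with @ShardDiscriminator");
        }
        found.setAccessible(true);
        this.discriminator = found;
        // The enum fixes both the shard count and the (unique) suffixes.
        List<String> names = new ArrayList<>();
        for (Object constant : found.getType().getEnumConstants()) {
            names.add(baseIndexName + "." + ((Enum<?>) constant).name());
        }
        this.shardNames = Collections.unmodifiableList(names);
    }

    List<String> allShardNames() {
        return shardNames;
    }

    String shardFor(Object entity) {
        try {
            Enum<?> value = (Enum<?>) discriminator.get(entity);
            if (value == null) {
                throw new IllegalStateException("Discriminator value is null");
            }
            return shardNames.get(value.ordinal());
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here new EnumShardResolver(Customer.class, "customers") would know about three shards without any further configuration.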

Improved Filters exploiting improved Sharding...
Having such an optimal IndexShardingStrategy will further enable the code to create a special Filter to be used during searches, capable of improving search performance by selecting the correct index to search in (avoiding a search across all indexes).
When adding a new value the SearchFactory must be restarted anyway, so the fact that you can't add a new Enum value dynamically should not be a limitation. You should not do that anyway, and I actually like the fact that you can't: it looks like the Java compiler enforces correct Hibernate Search configuration and usage...
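The search-side selection could look something like the sketch below. It only computes which shard names a query needs to touch; how that would plug into a real Filter is left open, and ShardSelector, DocumentType and the naming scheme are hypothetical.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

enum DocumentType { INVOICE, CONTRACT, REPORT }

// Hypothetical sketch: narrows a search to the shards matching the
// enum values the query is restricted to.
final class ShardSelector<E extends Enum<E>> {
    private final Class<E> enumType;
    private final String baseIndexName;

    ShardSelector(Class<E> enumType, String baseIndexName) {
        this.enumType = enumType;
        this.baseIndexName = baseIndexName;
    }

    List<String> shardsToSearch(Set<E> wanted) {
        // No restriction means every shard has to be searched.
        Set<E> targets = (wanted == null || wanted.isEmpty())
                ? EnumSet.allOf(enumType)
                : EnumSet.copyOf(wanted);
        List<String> names = new ArrayList<>();
        for (E value : targets) {
            names.add(baseIndexName + "." + value.name());
        }
        return names;
    }
}
```

A query restricted to DocumentType.CONTRACT would then open a single index instead of three.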

The same logic could be applied to any String field or @ManyToOne pointing to an entity whose set of values will not change and which can be transformed into a unique index identifier (require a toString, or have it implement an interface?). We could start by supporting the Enum and see how good it is.