Most commercial search engines include a more or less advanced document processing pipeline for transforming raw input into something that can be indexed. The process typically involves steps such as normalization, entity extraction, linguistic processing, annotation and data cleansing.
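To make the idea concrete, here is a minimal sketch of such a pipeline in Python. It is not taken from any particular product: the `Document` class, the stage functions and `run_pipeline` are all hypothetical names invented for illustration, and the "entity extraction" is just a simple regular expression for e-mail addresses.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical container for the text plus whatever the stages annotate."""
    text: str
    annotations: Dict[str, List[str]] = field(default_factory=dict)


# A stage is any function that takes a document and returns a document.
Stage = Callable[[Document], Document]


def strip_markup(doc: Document) -> Document:
    # Data cleansing: crude removal of HTML-like tags (illustrative only).
    doc.text = re.sub(r"<[^>]+>", " ", doc.text)
    return doc


def normalize(doc: Document) -> Document:
    # Normalization: unify Unicode forms and collapse runs of whitespace.
    text = unicodedata.normalize("NFKC", doc.text)
    doc.text = re.sub(r"\s+", " ", text).strip()
    return doc


def extract_emails(doc: Document) -> Document:
    # Entity extraction: a toy extractor that annotates e-mail addresses.
    doc.annotations["emails"] = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", doc.text)
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Apply each stage in order; the output of one feeds the next.
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    raw = Document("<p>Contact   bob@example.com</p>")
    result = run_pipeline(raw, [strip_markup, normalize, extract_emails])
    print(result.text)
    print(result.annotations)
```

The point of the design is that every stage has the same shape, so stages can be added, removed or reordered without touching the rest, and the final document is what gets handed to the indexer.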
Open source search engines are getting pretty good at the core of indexing and search, but they typically lack a proper document processing pipeline. When I started looking for such frameworks a few days ago, I came across this post announcing that Dieselpoint has just released its own document processing pipeline as open source at www.openpipeline.org. I have not tried it out yet, but it looks very promising and could become the preferred pipeline for deployments of Apache Solr and other open source engines.
There are also similar initiatives such as OpenPipe, which you can read more about in Rogério Pereira Araújo’s blog post on the same subject. I might find time for a comparison later on.
Good luck, Dieselpoint, in contributing to open source. I hope that you will let the open source community really contribute and help adapt and improve this framework going forward.
Chris Cleveland