We gave a talk at the open source conference GoOpen 2011 in Oslo today, together with our customer NHST, represented by Hans Jørgen Hoel. The talk covered the migration from FAST ESP to Apache Solr for all of NHST’s news publications and other data sources.
The presentation is in Norwegian.
English transcript
NHST Media Group publishes many online newspapers, including DN.no (financial), Tradewinds.no (shipping) and ReCharge (renewable energy). This presentation was given by NHST and Cominvent.
Agenda:
- Project background
- Architecture
- Search ABC
- The project
- Summary
Project Background
A large number of news articles, in print and online.
FAST ESP as search platform since 2006, Solr for tax report search since 2009.
Open source, Linux and Java are heavily used in the organization.
FAST was acquired by Microsoft in 2008, and Linux support was discontinued. This prompted a new evaluation of the search architecture, and Solr was chosen for the future, with Cominvent as technology partner.
Architecture before
FAST uses one monolithic index, so all sources shared the same data schema (index-profile). Escenic is the main source of content. A plugin existed to push content to FAST based on triggers.
On the search side, each publication was either using the FAST search API directly or some flavor of home-grown search middleware. Each publication also had its own result presentation logic, so innovations in one publication did not benefit the others.
Search ABC
Search is NOT a database. A search engine is optimized for free text, but also handles boolean logic well.
Commercial engines: FAST/Microsoft, Google Search Appliance (GSA), Autonomy IDOL
Open source engines: Apache Solr/Lucene, Xapian, ElasticSearch
Usage areas: Intranet, shopping, social media, news etc.
Solr is an open source, Java-based search server with Lucene at its core. It is released by the Apache Software Foundation under the permissive Apache License 2.0, meaning you can do almost anything you like with the software, including redistributing it or building a closed, commercial product on it.
The project
We introduced a new, common search and indexing middleware which all publications use. The role of the middleware is to isolate the clients from details and changes in the search engine. There is also a presentation layer in the middleware which provides JSP taglibs for delivering a standard result page with pagination, facets, did-you-mean etc. This makes it very quick to plug search into a new publication.
All the data sources now also use the same middleware for indexing, which takes care of sending the content to the right search core.
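As an illustration of how a middleware can route content to the right core, here is a minimal Python sketch. It is not the actual NHST middleware (which is not shown in the talk); the source names, core names and the `Document` type are hypothetical. The only Solr-specific assumption is the usual multi-core URL layout, where each core has its own `/update` endpoint under the Solr base URL.

```python
from dataclasses import dataclass

# Hypothetical mapping from data source to Solr core name.
# The real core names used in the project are not given in the talk.
CORE_BY_SOURCE = {
    "articles": "articles",
    "tax-list": "taxlist",
}


@dataclass
class Document:
    source: str   # which data source produced the document
    doc_id: str
    title: str


def core_for(doc: Document) -> str:
    """Return the name of the core that should receive this document."""
    try:
        return CORE_BY_SOURCE[doc.source]
    except KeyError:
        raise ValueError(f"no core configured for source {doc.source!r}") from None


def update_url(base_url: str, doc: Document) -> str:
    """Build the per-core update endpoint for the document."""
    return f"{base_url.rstrip('/')}/{core_for(doc)}/update"
```

With this in place, the indexing clients only state where a document comes from, and the choice of core stays an internal detail of the middleware.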
Challenges
Some features of FAST did not exist in Solr. FAST is more of a search platform, while Solr is a search server. The major difference was linguistic support, which is strong in FAST. We addressed this on the Solr side with stemming and with language handling in the middleware.
We were using entity extraction in FAST, but did not include that in Solr in this project, as it does not come out of the box and needs integration with third-party solutions.
Differences
While FAST uses a monolithic index, Solr can be split into cores, each having its own data schema and configuration. This means that if you need to reconfigure or re-index one data source, such as the tax list, you do not affect the rest of the articles. It also allows for easy staging of new content in a new core, which is then swapped into production when ready, without the need for a separate physical staging server as was needed with FAST.
FAST ships with lemmatization, while in Solr we use stemming, which is inferior and causes some problems. These are mitigated by tuning the stemming dictionaries.
To give Solr language support, we implemented some language abstractions in the middleware: we add a language field to each document and use separate fields, title_no for Norwegian content and title_en for English content, while keeping this implementation detail transparent to the search clients.
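The staging-and-swap step can be sketched as follows. Solr's CoreAdmin handler accepts an `action=SWAP` request with `core` and `other` parameters; the Python helper below only builds that request URL. The base URL and the core names are assumptions made for the example, not the names used in the project.

```python
from urllib.parse import urlencode


def swap_url(base_url: str, live_core: str, staging_core: str) -> str:
    """Build the CoreAdmin request that swaps a staging core into production.

    After the swap, requests addressed to `live_core` are served by the
    index that was built in `staging_core`, and vice versa.
    """
    params = urlencode({"action": "SWAP", "core": live_core, "other": staging_core})
    return f"{base_url.rstrip('/')}/admin/cores?{params}"


# Hypothetical core names, for illustration only.
print(swap_url("http://localhost:8983/solr", "articles", "articles_staging"))
# prints http://localhost:8983/solr/admin/cores?action=SWAP&core=articles&other=articles_staging
```

Issuing the request (for example with an HTTP GET) is left out here, so the sketch runs without a Solr server.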
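A rough sketch of this language abstraction, in Python rather than the Java of the actual middleware: the field names title_no and title_en come from the talk, while the `id` and `language` field names and both function names are assumptions for illustration.

```python
# Languages with a dedicated title field (title_no, title_en).
SUPPORTED_LANGUAGES = ("no", "en")


def to_index_fields(doc_id: str, title: str, language: str) -> dict:
    """Map a language-neutral document to language-specific index fields."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    return {
        "id": doc_id,                # assumed unique key field
        "language": language,        # assumed name of the language field
        f"title_{language}": title,  # title_no or title_en
    }


def title_field(language: str) -> str:
    """Return the field a query should search for the given language."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    return f"title_{language}"
```

Clients pass a title and a language; only the middleware knows that each language is stored in its own field, so each field can have its own stemming configuration.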
Tuning
News is fresh meat. You need immediate indexing as things change (push instead of pull). We also implemented date boost through Solr’s function queries. There are tons of functions available, and there is almost no limit to what you can tune and boost.
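The talk does not say which formula was used for the date boost. A common choice in Solr is the reciprocal function `recip(x,m,a,b) = a / (m*x + b)` applied to the document age in milliseconds, for example `recip(ms(NOW,date),3.16e-11,1,1)`. The Python sketch below reproduces that arithmetic to show how the boost decays; the constants are the commonly quoted example values, not NHST's tuning, and the name of the date field is an assumption.

```python
MS_PER_DAY = 24 * 3600 * 1000
MS_PER_YEAR = 365.25 * MS_PER_DAY


def recip(x: float, m: float, a: float, b: float) -> float:
    """Same arithmetic as Solr's recip(x,m,a,b): a / (m*x + b)."""
    return a / (m * x + b)


def date_boost(age_ms: float, m: float = 3.16e-11, a: float = 1.0, b: float = 1.0) -> float:
    """Boost factor for a document that is age_ms milliseconds old.

    With m close to 1 / (milliseconds per year), a brand new document
    gets a boost of 1.0 and a one-year-old document roughly 0.5.
    """
    return recip(age_ms, m, a, b)


for label, age in [("now", 0), ("1 day", MS_PER_DAY), ("1 year", MS_PER_YEAR)]:
    print(f"{label:>7}: {date_boost(age):.3f}")
```

Raising `m` makes the boost fall off faster, which suits news where yesterday's article should already rank clearly below today's.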
Summary
Solr is a lot less resource demanding than FAST. Can easily run virtualized or in the cloud. NHST scaled into the Amazon EC2 cloud during the peak period of the tax list search last year.
Each developer can run a local copy of Solr on their own laptop; this was very hard with FAST.
Cleaner architecture than before, more flexible with multiple cores.
A big win to gather all search-related business logic in a common search middleware, including a JSP presentation layer.
Superb tuning possibilities, easier to tune than the old engine.
Although there were challenges and we had to sacrifice entity extraction in the first phase, we’re very happy with the decision to migrate to Solr.