Mongosphinx with MongoDB and MongoMapper

Posted: 20th January 2010 by M. E. Patterson in Coding

Can that title have the word ‘mongo’ in it any more times? Well, fear not, I’m about to use it even more…

So, I had to fool around a bit to get the so-called “Mongosphinx” gem working with my app architecture. Thought it might be helpful to others to demonstrate how I did it. I’ll boil it down to a generic sort of implementation. Hit the jump to see the whole bloody mess…

app/models/document.rb

class Document
  include MongoMapper::Document
  key :title,  String
  key :content,  String
  timestamps!

  #cached for the sphinx indexer
  key :sphinx_tags, String

  # for mongosphinx
  fulltext_index :title, :content
  REINDEX_INTERVAL = 10.minutes
  INDEXED_FIELDS = '_sphinx_id, title, content, sphinx_tags'
  before_save :cache_for_indexer
  after_save :reindex

  def self.search(query)
    # by_fulltext_index returns a sphinx resultset object with its own each()
    # map over it to pull each element out as an old-fashioned array of Documents
    by_fulltext_index(query).map { |p| p }
  end

  def self.xml_for_sphinx_pipe
    puts MongoSphinx::Indexer::XMLDocset.new(Document.all(:fields => INDEXED_FIELDS)).to_s
  end

  def cache_for_indexer
    self.sphinx_tags = tag_words.join(' ')
    true
  end

  def reindex
    require 'mongo/gridfs'
    # we run this method whenever a doc is saved, but we only actually reindex every 10 minutes, max
    unless RAILS_ENV == 'test'
      db = MongoMapper.database
      file = "indexer_next_run"
      if GridFS::GridStore.exist?(db, file)
        line = GridFS::GridStore.new(db, file, 'r').readlines[0]
        return false unless Time.now > Time.parse(line)
      end
      next_run = (Time.now + REINDEX_INTERVAL).to_s
      GridFS::GridStore.open(db, file, 'w') { |f| f.puts next_run }
      logger.info "Running document re-indexer"
      Process.fork { `rake sphinx:index rotate=true` }
    end
  end
end

I’ll illuminate some of that for you.

If you’ve been using MongoMapper, most of the key stuff should be obvious already. The ‘sphinx_tags’ is a special trick I cooked up to make this work well with my acts_as_mongo_taggable plugin. Basically, whenever a Document is saved/updated, I jam a single string into the sphinx_tags field in Mongo. This lets sphinx index those tags easily.
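Stripped of MongoMapper, the caching step is just flattening an array of tag words into one space-separated string before the save goes through. A minimal plain-Ruby sketch (the `tag_words` method here is a stand-in for what acts_as_mongo_taggable actually provides):

```ruby
# Sketch of the sphinx_tags caching trick, outside MongoMapper.
# tag_words is a stand-in for the array acts_as_mongo_taggable exposes.
class FakeDocument
  attr_accessor :sphinx_tags

  def tag_words
    ['mongodb', 'sphinx', 'ruby']
  end

  # mirrors the before_save callback: collapse tags into one indexable string
  def cache_for_indexer
    self.sphinx_tags = tag_words.join(' ')
    true # a callback returning false would abort the save, so always return true
  end
end

doc = FakeDocument.new
doc.cache_for_indexer
doc.sphinx_tags # => "mongodb sphinx ruby"
```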

The search just takes advantage of Mongosphinx’s normal by_fulltext_index method.
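The reason for mapping over the resultset instead of just returning it: `each` hands back the receiver, while `map` flattens it into a plain Array. A stand-in resultset class makes the difference visible:

```ruby
# Mongosphinx's by_fulltext_index returns its own enumerable resultset
# object, not an Array. A fake resultset shows why search() maps over it:
class FakeResultSet
  include Enumerable

  def initialize(docs)
    @docs = docs
  end

  def each(&block)
    @docs.each(&block)
    self
  end
end

results = FakeResultSet.new([:doc_a, :doc_b])
results.each { |p| p }.class # => FakeResultSet (each returns the receiver)
results.map  { |p| p }.class # => Array (map flattens into a plain array)
```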

The reindex thing, while probably a hacky solution, works well enough for now that I don’t have a need to do anything fancier. This lets us have the app be reasonably quick to include newly-created documents in the index to be available for search. And I take advantage of GridFS (built into Mongo) to store my last_run value so I don’t even need a cron job for this. If my app starts getting significant document-creation traffic, I might want to do something more sophisticated like delta indexing or whatnot, but for now this is fine for me.
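The throttle itself doesn't depend on GridFS at all; it's just "read a stored next-run time, bail if we're early, otherwise write a new one and proceed." Here's the same logic sketched with a plain temp file standing in for the GridFS `indexer_next_run` file:

```ruby
require 'time'
require 'tmpdir'

# Sketch of the reindex throttle; a local temp file stands in for the
# GridFS "indexer_next_run" file, but the logic is identical.
REINDEX_INTERVAL = 10 * 60 # seconds

def due_for_reindex?(path)
  if File.exist?(path)
    next_run = Time.parse(File.read(path).lines.first)
    return false unless Time.now > next_run
  end
  # record when the next reindex is allowed, then let this one proceed
  File.write(path, (Time.now + REINDEX_INTERVAL).to_s)
  true
end

path = File.join(Dir.tmpdir, 'indexer_next_run')
File.delete(path) if File.exist?(path)
due_for_reindex?(path) # => true  (no timestamp stored yet, so reindex runs)
due_for_reindex?(path) # => false (next allowed run is 10 minutes away)
```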

lib/tasks/sphinx.rake

Okay, moving on to the aforementioned rake tasks… Here are the contents of my lib/tasks/sphinx.rake file:

namespace :sphinx do
  desc "generate xml that is sphinx-friendly"
  task :genxml => :environment do
    # this will just puts() to stdout; useful for debugging
    Document.xml_for_sphinx_pipe
  end
 
  desc "start up the sphinx daemon"
  task :start => :environment do
    cmd = %( searchd --config "#{Rails.root}/config/sphinx.conf" )
    system! cmd
  end
 
  desc "stop the sphinx daemon"
  task :stop => :environment do
    system! %( searchd --config "#{Rails.root}/config/sphinx.conf" --stop )
  end
 
  desc "run the sphinx indexer"
  task :index => :environment do
    cmd = %( indexer --config "#{Rails.root}/config/sphinx.conf" --all )
    cmd << ' --rotate' if ENV['rotate'] && ENV['rotate'].downcase == 'true'
    system! cmd
  end
end

# a fail-fast, hopefully helpful version of system
def system!(cmd)
  unless system(cmd)
    raise <<-SYSTEM_CALL_FAILED
The following command failed:
  #{cmd}
SYSTEM_CALL_FAILED
  end
end

So this should be pretty self-explanatory, especially if you've already used acts_as_sphinx with a standard ActiveRecord-backed app. Once you've got everything installed (sphinx and the mongosphinx gem, specifically), you should be able to use these rake tasks to start and stop the searchd daemon, as well as run the indexer.
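The fail-fast `system!` wrapper is worth a quick sanity check on its own. Here it is exercised against shell commands that succeed and fail (`true` and `false`, which exist on any POSIX box):

```ruby
# fail-fast wrapper around Kernel#system, as used by the rake tasks above
def system!(cmd)
  unless system(cmd)
    raise <<-SYSTEM_CALL_FAILED
The following command failed:
  #{cmd}
SYSTEM_CALL_FAILED
  end
end

system!('true') # exits zero, so nothing happens
begin
  system!('false') # non-zero exit raises, naming the failing command
rescue RuntimeError => e
  puts e.message
end
```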

(update 1/22/10: Note that I’m suggesting you get dacort’s fork of mongosphinx. He’s done a nice job of adding excerpting, pagination, and better compatibility with the latest mongomapper. He also pulled in my fix that makes it play nice with ruby 1.9.)

One nicety here is the sphinx:genxml task. Running this is helpful when you're trying to get everything set up, to prove that you've done things right. It should output a big XML file of all the documents it would index. If it doesn't, or you get something weird, then ur doin' it wrong.
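For reference, sphinx's xmlpipe2 input has a fixed envelope: a schema block naming the fields, then one sphinx:document element per record. Roughly the shape you should see from genxml, sketched here as a plain Ruby string (the exact attributes mongosphinx emits may differ):

```ruby
# Rough shape of sphinx xmlpipe2 output; mongosphinx's actual markup may differ.
sample = <<-XML
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="title"/>
    <sphinx:field name="content"/>
    <sphinx:field name="sphinx_tags"/>
  </sphinx:schema>
  <sphinx:document id="1">
    <title>My first document</title>
    <content>Hello, sphinx.</content>
    <sphinx_tags>mongodb sphinx</sphinx_tags>
  </sphinx:document>
</sphinx:docset>
XML

puts sample
```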

config/sphinx.conf

Finally, to help you get up and running, here’s my config/sphinx.conf file. Pretty standard:

searchd {
  listen = 127.0.0.1
  port = 9312

  log = ./sphinx/searchd.log
  query_log = ./sphinx/searchd.query.log
  pid_file = ./sphinx/searchd.pid
}

source mongo_project {
  type = xmlpipe2

  xmlpipe_command = ./script/runner "Document.xml_for_sphinx_pipe"
}

index mongo_project {
  source = mongo_project

  charset_type = utf-8
  path = ./sphinx/sphinx_index_main
}

Again, all of this may fall down spectacularly once you get up to some serious data being pushed from the app into the indexer. At that point, do something else, something more awesome. But consider this a basic start on jamming data from your mongo collection(s) right up sphinx’s pipe.

  1. yan says:

    Should mongosphinx work with rails3? I've added to my mongomapper model this: fulltext_index :product_id, :product_title, :server => 'localhost', :port => 9312

    And get this error: NoMethodError: undefined method `fulltext_index' for Product:Class

  2. star says:

    I use this for my app(rails 3)
    but an error is “Connection to 0.0.0.0 on 9312 failed. Connection refused – connect(2)”
    I don’t know why?

  3. ketan says:

    rake ts:start

    Failed to start searchd daemon. Check log/searchd.log.
    Failed to start searchd daemon. Check log/searchd.log

    rake ts:in

    FATAL: no indexes found in config file ‘config/development.sphinx.conf’

  4. Solved!

    I patched up (..)Ruby/Gems/1.8/gems/mongosphinx-0.1.1/./lib/mixins/properties.rb (line 56). It seems to be working fine now.

    def sphinx_id
      if (match = self.id.to_s.match(/#{self.class}-([0-9]+)/)) # .to_s inserted
        return match[1]
      else
        return nil
      end
    end

    Cheers and thx again

    Martin

  5. Hello,

    I'm having an annoying issue with /usr/lib/ruby/gems/1.8/gems/rails-2.3.5/lib/commands/runner.rb:48: undefined method `match' for ObjectID('4bab806e51e63a319e000004'):Mongo::ObjectID (NoMethodError)
    from /usr/lib/ruby/gems/1.8/gems/mongosphinx-0.1.1/./lib/indexer.rb:108:in `initialize'

    mongo (0.19.1)
    mongo_ext (0.19.1)
    mongo_mapper (0.7.1)
    mongosphinx (0.1.1)

    Sphinx .98 OR .99

    it also happens anytime i try: puts MongoSphinx::Indexer::XMLDocset.new(Page.all(WHATEVER)).to_s

    I tried over and over again on ubuntu and snow leopard. :(

    Help!

    thx.

    • M. E. Patterson says:

      Just a guess (I’ve moved on to Sunspot/Solr for use with Mongo now, instead of Sphinx), but I’m betting this is due to you using a newer version of MongoMapper. In the more recent versions, Nunemaker has switched MM’s treatment of mongo object IDs from being a straight string to being an actual ObjectID object that can be coerced to a string. It sounds like runner.rb is getting called at some point and told to run match() on the id, assuming that it’s a String (which has a match method), but it’s an ObjectID (which does not).

      Again, just a guess based on the error you are seeing. Likely you’ll have to hack around in MongoSphinx and find wherever it’s trying to use that object id and add .to_s

  6. Matt Beedle says:

    I solved this issue in case anyone else has the same problem, although I'm still not entirely sure of the cause. I changed my genxml task to actually write the XML to a /tmp/accounts.xml file (via libxml2), and then in my sphinx config I changed my xmlpipe_command to `cat /tmp/accounts.xml`. So the problem appears to have been extra characters somehow getting added to the puts output from xml_for_sphinx_pipe.

    I also had a separate problem starting sphinx. It kept telling me that port 9132 was in use, which it wasn’t. I removed port 9132 from the sphinx.conf file, and now it starts fine, but still on port 9132.

  7. Matt Beedle says:

    Did you have any trouble getting sphinx to parse the xml? Mine keeps blowing up with “XML parse error: junk after document element”. I created a thread on the sphinx forum too, but no reply ;( http://www.sphinxsearch.com/forum/view.html?id=5209

  8. Jarin Udom says:

    ^^^^ Haha that is a great spam post, it looks relevant but it could be used on any technology-related blog.

    Great article by the way :)

  9. Lee Pyotr says:

    Fine blog. I got a lot of effective information. I’ve been following this technology for awhile. It’s intriguing how it keeps varying, yet some of the core factors stay the same. Have you seen much change since Google made their most recent acquisition in the field?