Category Archives: rails

Parallel threads are still a myth in Ruby

Giant Padlock

In the beginning there was a giant lock. It wrapped every request and let only one run at a time. Then it was removed in Rails 2.2; there was much rejoicing, and some naysaying too. The world moved on, and no one knew how many people were actually using the new feature. A Passenger app instance kept processing only one request at a time anyway. We did our bit of chest thumping and forgot about what good old m.on.key said.

But the debate seems to have been revived, with Rubinius planning to remove its GIL, JRuby already running threads in parallel, and MRI finally running native threads. On a more serious note, I have been working with Ruby professionally for more than 4 years and I am astonished at anyone who is planning to run Ruby in a parallel environment. Igvita brought up the issue in his post (http://www.igvita.com/2008/11/13/concurrency-is-a-myth-in-ruby/), but Yehuda disagrees.

The problem isn’t whether the underlying platform supports parallel native threads; the problem is whether the Ruby ecosystem supports them.

require "thread"
 
class Foo
  def hello
    puts "Original hello"
  end
end
 
threads = []
threads << Thread.new do
  sleep(2)
  Foo.class_eval do
    def hello
      puts "Hello from modified thread #1 "
    end
  end
end
 
threads << Thread.new do
  a = Foo.new()
  a.hello
  sleep(3)
  puts "After sleeping"
  a.hello
  b = Foo.new()
  b.hello
end
 
threads.each { |t| t.join }

The above code runs as expected, but as far as I know the Ruby runtime offers no guarantees about the visibility of modifications to class objects across threads. Let’s take another example:

def synchronize(*methods)
  options = methods.extract_options!
  unless options.is_a?(Hash) && with = options[:with]
    raise ArgumentError, "Synchronization needs a mutex. Supply an options hash with a :with key as the last argument (e.g. synchronize :hello, :with => :@mutex)."
  end

  methods.each do |method|
    aliased_method, punctuation = method.to_s.sub(/([?!=])$/, ''), $1

    if method_defined?("#{aliased_method}_without_synchronization#{punctuation}")
      raise ArgumentError, "#{method} is already synchronized. Double synchronization is not currently supported."
    end

    module_eval(<<-EOS, __FILE__, __LINE__ + 1)
      def #{aliased_method}_with_synchronization#{punctuation}(*args, &block)
        #{with}.synchronize do
          #{aliased_method}_without_synchronization#{punctuation}(*args, &block)
        end
      end
    EOS

    alias_method_chain method, :synchronization
  end
end

The idea is that you don’t need to open a synchronize block inside your method; if you want the entire method body wrapped in a synchronized block, you can specify the whole contract at the class level. Neat!
Except not quite: how does it work when classes are getting reloaded in development mode? Will this metaprogrammatically created synchronization hold? The answer will vary from Ruby implementation to implementation. In a parallel environment, it will not.

The problem I am trying to drive at is that Ruby’s memory model makes no guarantees about class state. It’s a problem no one talks about.

Let’s take another example of “thread safe” code, from the ActiveRecord connection pool (http://github.com/rails/rails/blob/master/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb).
Anyone reading the code closely can find a few threading problems in it. For example, @reserved_connections seems to be read and written by multiple threads without any locking. It may work fine in Ruby 1.8 and 1.9, but the code is definitely not thread safe if multiple threads are allowed to run in parallel. A solution would perhaps be to use a concurrent hash (http://stackoverflow.com/questions/1080993/pure-ruby-concurrent-hash).
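To make the point concrete, here is a minimal sketch of guarding such a shared hash with a Mutex. This is not the actual ActiveRecord code; TinyPool and its methods are made up for illustration.

```ruby
require "thread"

# A toy connection pool: every access to the shared hash of reserved
# connections happens inside a Mutex, so two parallel threads can never
# interleave the check-then-write on the same key.
class TinyPool
  def initialize
    @reserved_connections = {}
    @lock = Mutex.new
  end

  # Returns the connection reserved for the current thread, creating one
  # inside the lock if necessary.
  def connection
    @lock.synchronize do
      @reserved_connections[Thread.current.object_id] ||= checkout
    end
  end

  private

  # Stand-in for a real checkout; here it just returns a fresh object.
  def checkout
    Object.new
  end
end
```

The same idea, applied consistently, is what a concurrent hash gives you for free.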

Where are we going with this? In my opinion, until the Ruby runtime makes guarantees about class state and module state, the core threading primitives (Queue, Monitor, MonitorMixin, ConditionVariable) get exhaustively tested, and concurrent collections are added, Ruby isn’t ready for parallel threads. In my professional experience I have had several problems with Ruby’s default primitives. I was listening to a talk by ThoughtWorks folks who run Rails for Mingle on JRuby, with multiple threads in a single VM, and their problems were simply too hard to catch and debug.

In the meanwhile, the Ruby community should focus on getting evented and cooperative threading right. Fiber is a good start.

Thanks to the folks in #ruby-pro for reviewing the post, especially James Tucker (raggi).

Debug RESTful calls for fun and profit

When using ActiveResource it can be quite useful to print the web service calls going out and the XML responses coming in. It lets me debug things, and it lets me create useful HttpMocks for testing code that uses the resource.

A somewhat unobtrusive way of doing this is to stick the following code in some library file:

module ActiveResource
  class Connection
    alias_method :old_request, :request

    def request(method, path, *arguments)
      response = old_request(method, path, *arguments)
      if response && response.code.to_i == 200 && APP_CONFIG['debug_resource_call']
        puts("********** method is #{method} and path is #{path} **********")
        puts response.body
      end
      response
    end
  end
end

And off I can go with HttpMock.
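For the curious, the alias-and-wrap trick above works the same way in plain Ruby. Here is a toy sketch (Fetcher and its methods are made up for illustration) showing that the original method keeps working while the wrapper merely observes the call:

```ruby
# Original class with the method we want to observe.
class Fetcher
  def request(path)
    "<xml>#{path}</xml>"
  end
end

# Reopen the class: keep the old implementation under a new name,
# then redefine the method to log and delegate.
class Fetcher
  alias_method :old_request, :request

  def request(path)
    response = old_request(path)
    puts "********** path is #{path} **********"
    puts response
    response
  end
end
```

The return value is untouched, so callers never notice the wrapper; that is what makes the technique unobtrusive.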

Autotest and add_exception method

Stuff floating around the intrawebs about adding more files to be ignored using add_exception may not work, because by the time the run hook gets a chance to fire, the regexp of files to be ignored has already been compiled. The alternative is to use the initialize hook, like this:

Autotest.add_hook :initialize do |at|
  at.add_exception(/^(coverage|\.git)/)
end

This is with ZenTest 3.11.0, and YMMV.

Rails debugging wormhole

Since we started scaling to a multi-machine cluster, our Rails application had been showing one weird problem. We generate stock market technical charts and cache them in memcache (with a 1 hour TTL). During testing we found that if a request got routed to the mongrel cluster running on the other machine, the charts wouldn’t appear, which was weird because we had the memcache cluster configured properly in the production.rb file:

memcache_options = {
  :c_threshold => 10_000,
  :compression => true,
  :debug => false,
  :namespace => 'foobar',
  :readonly => false,
  :urlencode => false
}
CACHE = MemCache.new(memcache_options)
CACHE.servers = SIMPLE_CONFIG_FILE["memcache_servers"]

Here SIMPLE_CONFIG_FILE[‘memcache_servers’] contains the list of memcache servers participating in the cluster. After debugging for a few hours (*gasp*) and turning on verbose logging on all the participating memcache servers, I found that trusty CacheFu was replacing the CACHE constant with the following code:

silence_warnings do
  Object.const_set :CACHE, memcache_klass.new(config)
  Object.const_set :SESSION_CACHE, memcache_klass.new(config) if config[:session_servers]
end
 
CACHE.servers = Array(config.delete(:servers))
SESSION_CACHE.servers = Array(config[:session_servers]) if config[:session_servers]

Now, this deal is real (and sucks). After a couple of minutes of hacking, I took the cache_fu config file out of svn and wrote code to generate it on the fly during deployment. Now pigs can fly.
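The on-the-fly generation can be as simple as rendering an ERB template with the real server list at deploy time. This is only a sketch: the file name and keys follow the usual cache_fu memcached.yml layout, and the server list is hardcoded here where a real deploy task would read it from something like SIMPLE_CONFIG_FILE.

```ruby
require "erb"

# Template for the config file CacheFu reads; the servers block is
# filled in from whatever list the deployment environment provides.
template = <<-ERB
production:
  benchmarking: false
  servers:
<% servers.each do |server| -%>
    - <%= server %>
<% end -%>
ERB

# Stand-in for the real server list (a deploy task would load this
# from the environment or a central config source).
servers = ["10.0.0.1:11211", "10.0.0.2:11211"]

yaml = ERB.new(template, trim_mode: "-").result(binding)
File.write("memcached.yml", yaml)
```

Because the file is generated per environment, CacheFu and our own CACHE constant can never disagree about which servers are in the cluster.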

Making Ruby Bacon play with Mocha

This post is not about pork and coffee. So stay clear if Google has landed you here thinking I am going to describe some sort of recipe for making nice mocha coffee with chunky bacon.

It’s about using a shiny new testing library called Bacon, by Chris. Mocha is, of course, the venerable mocking library for Ruby and Rails. Here is a tiny bit of code that will get you started with Bacon and Mocha:

require "rubygems"
require "bacon"
require "mocha/standalone"
require "mocha/object"

class Bacon::Context
  include Mocha::Standalone
  alias_method :old_it, :it

  def it(description, &block)
    mocha_setup
    old_it(description, &block)
    mocha_verify
    mocha_teardown
  end
end

That’s it. Happy baking.

Unthreaded threads of hobbiton

Update: With the 1.0.4 release this method has been removed. It was introduced as a workaround for the thread-unsafe register_status, but it’s no longer required, since result caching is threadsafe in this version anyway.

You know the story too well: in your BackgrounDRb worker you want to run 10 tasks concurrently using the thread pool, collect the results in an instance variable, and return them. Now, threads are funny little beasts and the simplest of things can easily get out of hand. For example, one of the BackgrounDRb users wrote something like this:

pages = Array.new
pages_to_scrape.each do |url|
  thread_pool.defer(url) do |url|
    begin
      # model object performs the scraping
      page = ScrapedPage.new(url)
      pages << page
    rescue
      logger.info "page scrape failed"
    end
  end
end
return pages

There are many things wrong with the above code, and one of them is that it modifies a shared variable without acquiring a lock. Remember, anything inside thread_pool.defer happens inside a separate thread and hence should be thread safe. Another is that pages is returned before the deferred tasks have necessarily finished.
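Here is a sketch of the fix, using plain Ruby threads in place of thread_pool.defer (the URLs and the scraping itself are stubbed out for illustration): every write to the shared array happens inside a Mutex, and all threads are joined before the results are used.

```ruby
require "thread"

pages_to_scrape = ["http://a.example", "http://b.example"]
pages = []
lock  = Mutex.new

threads = pages_to_scrape.map do |url|
  # Pass url in explicitly rather than relying on closure capture.
  Thread.new(url) do |u|
    begin
      page = "scraped #{u}" # stand-in for ScrapedPage.new(u)
      # The only access to the shared array is inside the lock.
      lock.synchronize { pages << page }
    rescue => e
      warn "page scrape failed: #{e.message}"
    end
  end
end

# Wait for every thread before touching the results.
threads.each(&:join)
pages
```

The join at the end is as important as the lock: without it, the caller can observe a half-filled array.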

Another use case would be: say you don’t want to spawn gazillions of workers, and rather, within one worker, you want to process requests from n users and save the results back using register_status with user_id as the identifier. thread_pool.defer is for fire-and-forget kind of jobs, and using register_status within a block supplied to thread_pool.defer is dangerous.

Enter thread_pool.fetch_parallely(args, request_proc, response_proc). Let’s walk through an example. Inside one of your workers:

def barbar(args)
  request_proc = lambda do |foo|
    sleep(2)
    "Hello #{foo}"
  end

  callback = lambda do |result|
    register_status(result)
  end

  thread_pool.fetch_parallely(args, request_proc, callback)
end

The first argument to fetch_parallely is a data argument that is simply passed on to request_proc. Note that this is necessary: you should not rely on the fact that a closure captures the local scope, because within threads that can be dangerous. The return value of request_proc is then passed as an argument to the callback proc, once request_proc finishes execution and is ready with the result.

The difference here is that although request_proc runs within a new thread, the callback proc gets executed within the main thread and hence need not be thread safe. You can do pretty much anything you want within the callback proc.

This is available in the git version of BackgrounDRb. Here is the link on how to install the git version of BackgrounDRb:

http://gnufied.org/2008/05/21/bleeding-edge-version-of-backgroundrb-for-better-memory-usage/

Bleeding Edge Version of BackgrounDRb for better memory usage

Update: Do not forget to get rid of the autogenerated backgroundrb files (and friends) that were generated during the last run of rake backgroundrb:setup.

A plain fork() is evil under Ruby, and hence there were some issues with the memory usage of BackgrounDRb. I have made some changes to the packet library so that it uses fork and exec rather than just fork, for better memory usage. However, there were quite a few changes and BackgrounDRb is affected by them. You can try the bleeding edge version as follows. You will also need rspec for building the packet gem.

Clone the packet git repo:

git clone git://github.com/gnufied/packet.git
cd packet;rake gem
cd pkg; sudo gem install --local packet-0.1.6.gem

Go to the vendor/plugins directory of your Rails app and remove or
back up the older version of the backgroundrb plugin; back up the
related config file as well.

Then, from the vendor/plugins directory:

git clone git://github.com/gnufied/backgroundrb.git
cd RAILS_ROOT
rake backgroundrb:setup
./script/backgroundrb start

Let me know how it goes.

Tips for budding Ruby hacker

I am no expert in Ruby, but over time I have accumulated some thoughts that may help you write better Ruby code.

  • Always create a directory hierarchy for your library/application, such as:
       |__ bin
       |__ lib
       |__ tests
       |__ yaml_specs
  • If you are not writing a library but rather an executable application, then have a separate file that loads/requires the needed libraries and does some basic setup. For example, I have a boot.rb in my Comet server that looks like:

    require "rubygems"
    require "eventmachine"
    require "buftok"
    require "sequel/mysql"
    PUSH_SERVER_PATH = File.expand_path(File.join(File.dirname(__FILE__),'..'))
    ["lib","channels"].each {|x| $:.unshift(File.join(PUSH_SERVER_PATH,x)) }
    require "push_server"

    Why? Because such a file comes in handy when you are writing a test_helper for your application. There, you can simply require the boot.rb above, so you don’t have to copy stuff back and forth when your required libs change.

  • If your project hierarchy is like the above and you are writing a library, not an application, don’t make the mistake of dumping all your files straight into the lib directory. Rather, have a setup like:
      Root
      |__ bin
      |__ lib
      |__ lib/packet.rb
      |__ lib/packet/other files go here

    And use relative requires in the “packet.rb” file, like:

    $:.unshift(File.dirname(__FILE__)) unless
      $:.include?(File.dirname(__FILE__)) || $:.include?(File.expand_path(File.dirname(__FILE__)))

    require "packet/packet_parser"
    require "packet/packet_meta_pimp"
    require "packet/packet_core"
    require "packet/packet_master"
    require "packet/packet_connection"
    require "packet/packet_worker"

    PACKET_APP = File.expand_path('../') unless defined?(PACKET_APP)

    module Packet
      VERSION = '0.1.4'
    end

    It helps you avoid the package name collisions that your users would otherwise report.

  • As chris2 once mentioned on #ruby-lang, you shouldn’t be overly clever with test cases. Don’t try to be too DRY in your test cases.
  • Write code that can be easily tested. What the fuck does that mean? When I started with Ruby and was doing network programming, I used to write methods that always manipulated state through instance variables, using either threads or EventMachine. One of the issues with EventMachine is that the code usually relies on a state machine, so it can be notoriously difficult to unit test, because most of the time your methods work according to the state of instance variables. That was bad. Try to write code in a more functional way, where methods take some parameters and return values based on those arguments. Minimize methods with side effects as much as possible. This will make your code more readable and easily testable.
  • Read code of some good libraries, such as Ramaze , Rake, standard library.
  • Use FastRi rather than ri. If possible, generate your own set of documentation by running rdoc on the Ruby source code. I spend time just looking through methods and classes for fun. However, I don’t like the default RDoc template; use the Jamis RDoc template if you like the Rails documentation. For gems installed on your machine, you can often use gem server (or gem_server) to view their documentation.
  • #ruby-lang on freenode is generally a good place to shoot general Ruby questions. Be polite, don’t repeat yourself, and you will get your answers.
  • Avoid monkey patching core classes if your code is a library and will run alongside third-party code.
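The testability point above can be made concrete with a made-up before-and-after sketch: the stateful version can only be tested by driving the object through the right sequence of calls, while the functional version is testable with a plain argument.

```ruby
# Hard to test: the result depends on hidden instance state built up
# by a sequence of feed calls.
class StatefulParser
  def feed(chunk)
    @buffer = (@buffer || "") + chunk
  end

  def parsed
    @buffer.to_s.split(",")
  end
end

# Easy to test: same input, same output, no hidden state.
def parse(buffer)
  buffer.split(",")
end
```

A unit test for parse is one line; a unit test for StatefulParser has to reconstruct the exact event sequence that produces the buffer.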

BackgrounDRb best practises

  • The best place for BackgrounDRb documentation is the README file that comes with the plugin. Read it thoroughly before looking anywhere else.
  • When passing arguments from Rails to BackgrounDRb workers, don’t pass huge ActiveRecord objects; it’s asking for trouble. You can easily circumvent the situation by passing the ids of the AR objects instead.
  • It’s always a good idea to run the trunk version rather than older tagged releases.
  • To debug BackgrounDRb problems, it’s always a good idea to start bdrb in foreground mode by skipping the ‘start’ argument when starting the bdrb server. After that, fire up the Rails console, try invoking bdrb tasks from it, and find out what’s happening. John Yerhot has posted an excellent write-up about this, here.
  • Whenever you update the plugin code from svn, don’t forget to remove the old backgroundrb script and run:
     rake backgroundrb:setup
  • When deploying the plugin in production, change backgroundrb.yml so that the production environment is loaded in the backgroundrb server. You should avoid keeping backgroundrb.yml in svn; rather, have a cap task that generates backgroundrb.yml on the production servers.
  • When you are processing many tasks from Rails, use the built-in thread pool rather than firing up new workers.
  • BackgrounDRb needs Ruby >= 1.8.5
  • When you are starting a worker using
     MiddleMan.new_worker() 

    from Rails and using a job_key to start the worker (you must use unique job keys anyway if you want more than one instance of the same worker running at the same time), you must always access that instance of the worker with the same job key. That is, every MiddleMan method that invokes a method on that instance of the worker must carry the job_key as a parameter. For example:

       session[:job_key] = MiddleMan.new_worker(:worker => :fibonacci_worker, :job_key => 'the_key', :data => params[:input])
       MiddleMan.send_request(:worker => :fibonacci_worker, :worker_method => :do_work, :data => params[:input],:job_key => session[:job_key])

    Omitting the job_key in subsequent calls will be an error if your worker was started with a job_key.

BackgrounDRb 1.0 released

Although it’s been quite some time since the 1.0 release of BackgrounDRb went out into the wild, a belated post mentioning its features is nonetheless welcome.

Although the README available at http://backgroundrb.rubyforge.org is quite comprehensive and there is precious little I can add, I shall try.

  • BackgrounDRb is a Ruby job server and scheduler. Its main intent is to be
    used with Ruby on Rails applications for offloading long-running tasks. However, unlike other libraries, BackgrounDRb offers tight integration with Rails: you can check the status of your workers, pass data to workers and get responses back, register the status of your workers, and dynamically start or stop workers from Rails.
  • BackgrounDRb doesn’t have any DRb in its skin now. It’s based on the networking library packet (http://packet.googlecode.com).
  • It’s stable.
  • It has support for thread pools and for storing results in memcache clusters.
  • It comes with its own scheduler, so you don’t need to muck around with crontab anymore.

A quick overview of installation:

  • Get the plugin using:
     piston import http://svn.devjavu.com/backgroundrb/trunk/ backgroundrb
  • Remove or back up the older backgroundrb scripts/config files in your Rails root directory.
  • Run the following command from the root directory of your Rails app:
     rake backgroundrb:setup
  • Have a look at the generated config file, RAILS_ROOT/config/backgroundrb.yml, and see if there is anything you would like to change.
  • Generate a new worker using:
     ./script/generate worker foo
  • Read the detailed documentation about writing workers and more at http://backgroundrb.rubyforge.org.
  • Start your BackgrounDRb server with:
    ./script/backgroundrb start
  • Stop your BackgrounDRb server with:
    ./script/backgroundrb stop