delayed_job stops running after some time in production

Posted 2023-03-04


【Title】: delayed_job stops running after some time in production 【Posted】: 2011-05-18 20:23:02 【Question】:

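In production, our delayed_job process keeps dying for some reason. I can't tell whether it is crashing, being killed by the operating system, or something else. I don't see any errors in the delayed_job.log file.

What can I do to troubleshoot this? I was thinking of installing monit to monitor it, but that will only tell me when it dies. It won't really tell me why it died.

Is there a way to make the log file more verbose, so I can find out why it is dying?

Any other suggestions?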

【Comments】:

How are you starting the process? 【Answer 1】:

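I have come across two causes of delayed_job failing silently. The first is actual segfaults when people use libxml in forked processes (this came up on the mailing list a while back).

The second is a problem with version 1.1.0 of the daemons gem, which delayed_job depends on (https://github.com/collectiveidea/delayed_job/issues#issue/81). It is easily worked around by using 1.0.10, which is the version my own Gemfile uses.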

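Logging

delayed_job does have logging, so if a worker dies without printing an error, it is usually because no exception was raised (for example, a segfault) or because something external is killing the process.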
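One way to tell a crash from an external kill is to run the worker as a child of a small wrapper and inspect its exit status. The sketch below is only an illustration and is not part of delayed_job: `describe_exit` is a hypothetical helper name, and it assumes the worker runs in the foreground (not daemonized) so that the wrapper can wait on it. A stand-in child process is used here in place of a real worker.

```ruby
require "rbconfig"

# Hypothetical helper (not part of delayed_job): turn a Process::Status
# into a human-readable reason for the child's exit.
def describe_exit(status)
  if status.signaled?
    # The child was terminated by a signal, e.g. SEGV for a segfault
    # or KILL when something outside the process killed it.
    name = Signal.signame(status.termsig) || "unknown"
    "killed by signal #{status.termsig} (SIG#{name})"
  else
    "exited with status #{status.exitstatus}"
  end
end

# Demo with a stand-in child process instead of a real worker.
pid = Process.spawn(RbConfig.ruby, "-e", "sleep 30")
Process.kill("KILL", pid)
_, status = Process.wait2(pid)
puts describe_exit(status)
```

A signal such as SEGV or KILL in the output points at a segfault or an outside killer, while a plain non-zero exit status suggests the worker ended on its own.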

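Monitoring

I use bluepill to monitor my delayed_job instances, and so far it has been very successful at keeping the jobs running. The steps to get bluepill running for an application are quite simple.

Add the bluepill gem to your Gemfile: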

# Monitoring
gem 'i18n' # Not sure why but it complained I didn't have it
gem 'bluepill'

I created a bluepill configuration file:

app_home = "/home/mi/production"
workers = 5
Bluepill.application("mi_delayed_job", :log_file => "#{app_home}/shared/log/bluepill.log") do |app|
  # One monitored process per worker, each with its own identifier and pid file
  (0...workers).each do |i|
    app.process("delayed_job.#{i}") do |process|
      process.working_dir = "#{app_home}/current"

      process.start_grace_time    = 10.seconds
      process.stop_grace_time     = 10.seconds
      process.restart_grace_time  = 10.seconds

      process.start_command = "cd #{app_home}/current && RAILS_ENV=production ruby script/delayed_job start -i #{i}"
      process.stop_command  = "cd #{app_home}/current && RAILS_ENV=production ruby script/delayed_job stop -i #{i}"

      process.pid_file = "#{app_home}/shared/pids/delayed_job.#{i}.pid"
      process.uid = "mi"
      process.gid = "mi"
    end
  end
end
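
The pid file is what lets a monitor decide whether a worker is still running. As a rough illustration of that kind of check (this is not bluepill's actual implementation, and `worker_alive?` is a hypothetical name), a minimal sketch could read the pid and probe it with signal 0, which tests for the process without delivering anything to it:

```ruby
require "tmpdir"

# Hypothetical helper: report whether the process named in a pid file
# is still running. Signal 0 performs the existence check without
# delivering a signal.
def worker_alive?(pid_file)
  pid = Integer(File.read(pid_file).strip)
  Process.kill(0, pid)
  true
rescue Errno::ENOENT, Errno::ESRCH, ArgumentError
  # Missing pid file, no such process, or an unparsable pid.
  false
rescue Errno::EPERM
  # The process exists but belongs to another user.
  true
end

# Demo: a temporary pid file that points at the current process.
Dir.mktmpdir do |dir|
  path = File.join(dir, "delayed_job.0.pid")
  File.write(path, Process.pid.to_s)
  puts worker_alive?(path)
end
```

A check like this also explains stale pid files: if the worker died without cleaning up, the file is still there but the probe fails.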

Then, in my capistrano deploy file, I just added:

# Bluepill related tasks
after "deploy:update", "bluepill:quit", "bluepill:start"
namespace :bluepill do
  desc "Stop processes that bluepill is monitoring and quit bluepill"
  task :quit, :roles => [:app] do
    run "cd #{current_path} && bundle exec bluepill --no-privileged stop"
    run "cd #{current_path} && bundle exec bluepill --no-privileged quit"
  end

  desc "Load bluepill configuration and start it"
  task :start, :roles => [:app] do
    run "cd #{current_path} && bundle exec bluepill --no-privileged load /home/mi/production/current/config/delayed_job.bluepill"
  end

  desc "Print the status of the processes bluepill is monitoring"
  task :status, :roles => [:app] do
    run "cd #{current_path} && bundle exec bluepill --no-privileged status"
  end
end

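Hope this helps.

【Comments】:

This was helpful. I just downgraded my daemons gem; hopefully that was it. I'm also going to try adding bluepill monitoring. Thanks!

Those who cannot start DelayedJob through Bluepill in special environments can find some useful information here

This doesn't seem to work. I think the command is no longer run in a shell environment, so I end up with an unknown command cd.

Does this mean we need to remove require 'capistrano/delayed_job' from the Capfile, since bluepill will now start the process? 【Answer 2】: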

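The most common cause of this problem that I have run into is a database issue (MySQL connection errors and the like). Nothing is logged by default.

So I suggest using god to control your delayed_job (you can then see its log file!).

Assuming you use delayed_job with Rails 4, you should:

1. Install the god gem: $ gem install god

2. Create this script file: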

# filename: cache_cleaner.god
RAILS_ROOT = '/sg552/workspace/m-api-cache-cleaner'
God.watch do |w|
  w.name = 'cache_cleaner'
  w.dir = RAILS_ROOT
  w.start = "cd #{RAILS_ROOT} && RAILS_ENV=production bundle exec bin/delayed_job -n 5 start"
  w.stop = "cd #{RAILS_ROOT} && RAILS_ENV=production bundle exec bin/delayed_job stop"
  w.restart = "cd #{RAILS_ROOT} && RAILS_ENV=production bundle exec bin/delayed_job -n 5 restart"
  w.log = "#{RAILS_ROOT}/log/cache_cleaner_stdout.log"
  w.pid_file = File.join(RAILS_ROOT, "log/delayed_job.total.pid")
  # You should NEVER use this setting (always keep it commented out!):
  # w.keepalive
end

3. To start/stop/restart delayed_job, change your command from:

$ bundle exec bin/delayed_job -n 3 start

to:

$ god -c cache_cleaner.god -D  
$ god start/stop/restart cache_cleaner

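See my personal blog: http://siwei.me/blog/posts/using-delayed-job-with-god

【Comments】:

The point about database-related causes of crashes was especially useful: we had a similar crash with no error logged. Checking the database logs showed a correlation between deadlock errors and the crashed delayed_job worker processes.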
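
Since crashes like these can leave nothing in the log, one option is to record, from inside the worker, the exception that is ending the process. The sketch below is an assumption-laden illustration, not a delayed_job feature: `exit_reason` is a hypothetical helper, the logger target is a placeholder, and the code would have to be loaded by the worker process itself. It relies on Ruby setting `$!` to the terminating exception inside an `at_exit` block. It cannot catch a segfault or a SIGKILL, because `at_exit` hooks do not run in those cases.

```ruby
require "logger"

# Hypothetical helper: describe the exception (if any) that is ending
# the process. Inside an at_exit block, Ruby's $! holds that exception.
def exit_reason(err)
  if err.nil? || (err.is_a?(SystemExit) && err.success?)
    "exiting normally"
  else
    "exiting because of #{err.class}: #{err.message}"
  end
end

# Placeholder log target; a real setup would point at the app's log file.
logger = Logger.new($stderr)

# Record why the process is going away, just before it exits.
at_exit { logger.info("worker #{Process.pid} #{exit_reason($!)}") }
```

An unhandled database connection error would then show up as a final log line naming the exception class, even if nothing else was written.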
