xxl-job Scheduling Efficiency and the Distributed Lock Wait Problem
Posted by ascii_he
xxl-job is a popular distributed scheduling platform. It is a centralized scheduling component: xxl-job-admin performs all scheduling from one place, and the core scheduling logic lives in JobScheduleHelper.
Scheduling Principles and Problem Analysis
The scheduling thread scheduleThread polls the database for jobs that have missed their trigger time or are about to be triggered, and processes them. Ideally, each job lands in the time wheel's slot queue just before its trigger time. This processing must be fast enough, otherwise jobs miss their trigger time or firing becomes uneven. If there are many densely scheduled jobs and a single polling pass takes too long, jobs are not placed into the time-wheel slot queues in time and miss their schedule: they may be triggered directly on the next polling pass (misfire compensation), or pile up in the time wheel and be fired repeatedly when the ring thread next reads the slot queue, since a slot queue is a plain List with no deduplication. Processing each individual job must also be fast, or one polling pass of the scheduling thread takes too long in total.
The time-wheel thread ringThread is simpler: once per second it reads the current slot's queue and hands the jobs off for asynchronous triggering. This is fast; even as a single thread its efficiency is acceptable.
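The wheel's slot mapping, used later in the code, is simply the second-of-minute of the trigger timestamp. A minimal illustration (the class name here is ours, not xxl-job's):

```java
class TimeRingSlot {
    // Same computation the scheduler uses when pushing a job into the wheel:
    // (triggerNextTime / 1000) % 60 -> one slot per second of the minute.
    static int ringSecond(long triggerTimeMillis) {
        return (int) ((triggerTimeMillis / 1000) % 60);
    }
}
```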
To guarantee consistency, the admin uses a DB lock so that, across the cluster, each job schedule fires exactly once. The lock is a select ... for update exclusive lock, and it is very coarse: with a large job count it becomes a performance bottleneck. Worse, if a network failure or crash leaves the transaction uncommitted, the lock is never released and all job scheduling blocks; even restarting xxl-job-admin does not recover, and someone has to release the lock in the database by hand. With many jobs, even a clustered xxl-job-admin deployment effectively degrades to single-threaded serial execution because of this one lock. We have hit the unreleased-lock problem in production: our database takes a full backup at 4 a.m. every day, the backup also takes locks, and this occasionally leaves the xxl-job lock unreleased. The high frequency of this lock also interferes with the backup itself.
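For reference, in recent 2.x versions the stock implementation serializes scheduling on a row of the xxl_job_lock table; the flow is roughly:

```sql
-- Sketch of the stock locking flow in JobScheduleHelper (simplified):
-- the entire scheduling pass runs inside this transaction.
SET autocommit = 0;
SELECT * FROM xxl_job_lock WHERE lock_name = 'schedule_lock' FOR UPDATE;
-- ... query due jobs, trigger them, update next trigger times ...
COMMIT;  -- the row lock is held until here
```

If the COMMIT never arrives (crash, dropped connection, backup interference), every other admin instance stays blocked on the SELECT ... FOR UPDATE.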
In short, xxl-job's scheduling design has real flaws: with many jobs, or with unfortunate database operations, scheduling efficiency suffers. The optimization ideas are not complicated: first, shard the scheduling work and run it on multiple threads; second, replace the exclusive database lock with a distributed lock that is more efficient, safer, and has fewer side effects, such as a Redis distributed lock.
Scheduling Efficiency: Optimization Ideas and Design
Sharded Scheduling
Sharding the scheduling work is straightforward. The original query pages through all jobs by trigger status and next trigger time; we can shard on xxl_job_info's id, for example by adding a condition like id % shardingTotal = shardingIndex.
<select id="scheduleJobQuery" parameterType="java.util.HashMap" resultMap="XxlJobInfo">
    SELECT <include refid="Base_Column_List" />
    FROM xxl_job_info AS t
    WHERE t.trigger_status = 1
      AND t.trigger_next_time <![CDATA[ <= ]]> #{maxNextTime}
    ORDER BY id ASC
    LIMIT #{pagesize}
</select>
The modified query:
<select id="scheduleJobQueryWithSharding" parameterType="java.util.HashMap" resultMap="XxlJobInfo">
    SELECT
    <include refid="Base_Column_List"/>
    FROM xxl_job_info AS t
    <where>
        t.trigger_status = 1
        AND t.trigger_next_time <![CDATA[ <= ]]> #{maxNextTime}
        <if test="shardingTotal != 1">
            AND t.id % #{shardingTotal} = #{shardingNow}
        </if>
    </where>
    ORDER BY id ASC
    LIMIT #{pagesize}
</select>
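The added modulo condition partitions the job ids into disjoint, complete shards. A quick illustrative check (this helper class is ours, purely for demonstration):

```java
import java.util.List;
import java.util.stream.Collectors;

class JobSharding {
    // Mirrors the SQL condition "id % shardingTotal = shardingIndex".
    static List<Long> shardOf(List<Long> jobIds, int shardingTotal, int shardingIndex) {
        return jobIds.stream()
                .filter(id -> id % shardingTotal == shardingIndex)
                .collect(Collectors.toList());
    }
}
```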
The scheduling thread changes from a single thread to multiple threads, one per shard:
final XxlJobAdminConfig adminConfig = XxlJobAdminConfig.getAdminConfig();
// total number of shards, externalized configuration
final int scheduleShardingCount = adminConfig.getScheduleShardingCount();
// scheduling thread pool, one thread per shard
scheduleExecutor = new ThreadPoolExecutor(scheduleShardingCount,
        scheduleShardingCount, 300, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(), new NamedThreadFactory("xxl-job-scheduler", true),
        new ThreadPoolExecutor.AbortPolicy());
for (int i = 0; i < scheduleShardingCount; i++) {
    final int lockFlag = i; // lock id of this shard, also the shard index
    scheduleExecutor.submit(() -> {
        // ... pre-processing
        // per-shard scheduling logic, using the sharded query
        List<XxlJobInfo> scheduleList = adminConfig.getXxlJobInfoDao()
                .scheduleJobQueryWithSharding(nowTime + PRE_READ_MS, preReadCount,
                        scheduleShardingCount, lockFlag);
        // ... scheduling
    });
}
Because scheduling is now multi-threaded, synchronization must be considered: the time-wheel slot queues are changed from List to BlockingQueue, which simplifies the synchronization.
private volatile static Map<Integer, BlockingQueue<Integer>> ringData = new ConcurrentHashMap<>();
Pushing a job into a time-wheel slot accordingly becomes a BlockingQueue add:
private void pushTimeRing(int ringSecond, int jobId) {
    // push async ring
    final BlockingQueue<Integer> queue = ringData.computeIfAbsent(ringSecond, k -> new LinkedBlockingQueue<>());
    queue.add(jobId);
    if (logger.isDebugEnabled()) {
        logger.debug(">>>>>>>>>>> xxl-job, schedule push time-ring : " + ringSecond
                + " = " + new ArrayList<>(queue));
    }
}
The ring thread's slot handling changes accordingly:
// second data
// also check one slot back, in case processing overran a tick
int nowSecond = Calendar.getInstance().get(Calendar.SECOND);
for (int i = 0; i < 2; i++) {
    final BlockingQueue<Integer> queue = ringData.get((nowSecond + 60 - i) % 60);
    if (queue != null && !queue.isEmpty()) {
        queue.drainTo(ringItemData);
    }
}
// ring trigger
if (logger.isDebugEnabled()) {
    logger.debug(">>>>>>>>>>> xxl-job, time-ring beat : " + nowSecond
            + " = " + Collections.singletonList(ringItemData));
}
if (ringItemData.size() > 0) {
    // do trigger
    for (int jobId : ringItemData) {
        JobTriggerPoolHelper.trigger(jobId, TriggerTypeEnum.CRON, -1, null, null, null);
    }
    // clear
    ringItemData.clear();
}
Replacing the Distributed Lock
The distributed lock has to address two things: lock granularity and automatic release. Mature options exist: a Redis lock, implemented yourself with SET NX EX (or a Lua script) or via Redisson's lock API; or a ZooKeeper lock via Curator's InterProcessMutex. Whether a database optimistic lock would also work is left for the reader to explore. A Redis lock is simple and fast, and is what I would generally recommend. For project reasons, though, this component already integrates Dubbo with ZooKeeper as the registry, and we did not want to introduce Redis as another dependency, so we used a ZooKeeper distributed lock directly.
With a ZooKeeper lock, acquire/release efficiency must be considered, because the scheduling threads poll continuously. If each thread is pinned to one lock id, contention only happens across instances (processes); how can we reduce it further? Looking closely, this is not a scenario that needs high-frequency locking, so the simple fix is to hold the lock longer. Each scheduling thread's workload is already fixed by the sharding strategy, so acquiring the lock once and reusing it across many polling passes is perfectly fine, unless the instance fails; and if the instance does fail, the lock is released automatically, so nothing is lost. A ZooKeeper lock acquired once and used many times therefore fits well, and the low frequency also eases the load on ZooKeeper. Apart from the two passes that acquire and release the lock, polling passes carry no locking overhead, which also shortens processing time. The maximum lock-hold time is an externalized configuration; when it expires, the lock is released and the scheduling threads compete for it again. The same reasoning applies equally to a DB lock or a Redis lock.
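The "acquire once, reuse across polls, release after a maximum hold time" policy can be sketched in isolation. This is an illustrative sketch only (LockLease and its method names are ours, not part of xxl-job or Curator); the real implementation below wires the same idea into the polling loop with an InterProcessMutex:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongPredicate;

// Illustrative lease tracker for the lock-holding policy described above.
class LockLease {
    private final long maxHoldMillis;
    private long acquiredAt = -1;   // -1 means the lock is not held

    LockLease(long maxHoldSeconds) {
        this.maxHoldMillis = TimeUnit.SECONDS.toMillis(maxHoldSeconds);
    }

    /** Called at the top of each poll; returns true if this thread may schedule. */
    boolean ensureHeld(long now, LongPredicate tryAcquire) {
        if (acquiredAt < 0 && tryAcquire.test(now)) {
            acquiredAt = now;       // acquire once, then reuse on later polls
        }
        return acquiredAt >= 0;
    }

    /** Called at the end of each poll; releases only after the max hold time. */
    boolean releaseIfExpired(long now, Runnable release) {
        if (acquiredAt >= 0 && now - acquiredAt > maxHoldMillis) {
            release.run();
            acquiredAt = -1;        // other instances may now compete for the lock
            return true;
        }
        return false;
    }
}
```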
Defining the CuratorFramework for the ZooKeeper lock:
@Configuration
public class CuratorConfig {

    @Value("${zookeeper.address}")
    private String zookeeperAddress;

    @Value("${zookeeper.timeout:20000}")
    private int zookeeperTimeout;

    @Bean
    public CuratorFramework curatorFramework() {
        CuratorFramework curatorFramework = CuratorFrameworkFactory
                .builder()
                .connectString(zookeeperAddress)
                .sessionTimeoutMs(zookeeperTimeout)
                .connectionTimeoutMs(zookeeperTimeout)
                .retryPolicy(new ExponentialBackoffRetry(2000, 10))
                .build();
        curatorFramework.start();
        return curatorFramework;
    }
}
Complete Code
Externalized configuration:

adminConfig.getScheduleShardingCount(); // total number of shards
adminConfig.getLockMaxSeconds();        // maximum lock-hold time, in seconds
/**
 * @author xuxueli 2019-05-21
 */
public class JobScheduleHelper {
    private static final Logger logger = LoggerFactory.getLogger(JobScheduleHelper.class);

    private static final JobScheduleHelper instance = new JobScheduleHelper();
    public static JobScheduleHelper getInstance() {
        return instance;
    }

    public static final long PRE_READ_MS = 5000; // pre read

    private Thread ringThread;
    private volatile boolean scheduleThreadToStop = false;
    private volatile boolean ringThreadToStop = false;
    private volatile static Map<Integer, BlockingQueue<Integer>> ringData = new ConcurrentHashMap<>();
    private volatile ExecutorService scheduleExecutor;
    private final Map<Integer, InterProcessMutex> lockMap = new ConcurrentHashMap<>();
    private final Map<Integer, Long> lockAcquireTimeMap = new ConcurrentHashMap<>();

    public void start() {
        final XxlJobAdminConfig adminConfig = XxlJobAdminConfig.getAdminConfig();
        final int scheduleShardingCount = adminConfig.getScheduleShardingCount();
        scheduleExecutor = new ThreadPoolExecutor(scheduleShardingCount,
                scheduleShardingCount, 300, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(), new NamedThreadFactory("xxl-job-scheduler", true),
                new ThreadPoolExecutor.AbortPolicy());
        for (int i = 0; i < scheduleShardingCount; i++) {
            final int lockFlag = i;
            scheduleExecutor.submit(new LogAspectRunnable(() -> {
                try {
                    TimeUnit.MILLISECONDS.sleep(5000 - System.currentTimeMillis() % 1000);
                } catch (InterruptedException e) {
                    if (!scheduleThreadToStop) {
                        logger.error(e.getMessage(), e);
                    }
                }
                logger.info(">>>>>>>>> init xxl-job admin scheduler success, shard = {}", lockFlag);

                // pre-read count: treadpool-size * trigger-qps (each trigger cost 50ms, qps = 1000/50 = 20)
                int preReadCount = (adminConfig.getTriggerPoolFastMax() + adminConfig.getTriggerPoolSlowMax()) * 20;

                // zookeeper lock
                InterProcessMutex lock = null;
                while (!scheduleThreadToStop) {
                    // define lock
                    try {
                        lock = lockMap.computeIfAbsent(lockFlag,
                                k -> new InterProcessMutex(adminConfig.getCuratorFramework(),
                                        "/xxl-job/schedule_lock_" + k));
                    } catch (Throwable e) {
                        logger.warn(">>>>>>>>>>> xxl-job scheduler {}, failed to create distributed lock", lockFlag, e);
                        try {
                            TimeUnit.SECONDS.sleep(10);
                        } catch (InterruptedException interruptedException) {
                            logger.error(interruptedException.getMessage(), interruptedException);
                        }
                        continue;
                    }

                    // check lock
                    try {
                        // if the current thread does not hold the lock, try to acquire it; otherwise keep polling
                        if (lockAcquireTimeMap.get(lockFlag) == null) {
                            if (lock.acquire(10, TimeUnit.SECONDS)) {
                                logger.info("acquired distributed lock: /xxl-job/schedule_lock_" + lockFlag);
                                lockAcquireTimeMap.put(lockFlag, System.currentTimeMillis());
                            } else {
                                // TimeUnit.SECONDS.sleep(10);
                                continue;
                            }
                        }
                        // verify lock ownership
                        if (!lock.isOwnedByCurrentThread()) {
                            lockAcquireTimeMap.remove(lockFlag); // clear the acquire timestamp
                            logger.warn(">>>>>>>>>>> xxl-job scheduler {}, current thread does not own the distributed lock", lockFlag);
                            try {
                                TimeUnit.SECONDS.sleep(10);
                            } catch (InterruptedException e) {
                                logger.error(e.getMessage(), e);
                            }
                            continue;
                        }
                    } catch (Exception e) {
                        logger.warn(">>>>>>>>>>> xxl-job scheduler {}, polling distributed lock failed", lockFlag, e);
                        continue;
                    }

                    // Scan Job
                    final long start = System.currentTimeMillis();
                    boolean preReadSuc = true;
                    try {
                        // 1、pre read
                        long nowTime = System.currentTimeMillis();
                        // sharding query
                        List<XxlJobInfo> scheduleList = adminConfig.getXxlJobInfoDao()
                                .scheduleJobQueryWithSharding(nowTime + PRE_READ_MS, preReadCount,
                                        scheduleShardingCount, lockFlag);
                        if (scheduleList == null || scheduleList.size() <= 0) {
                            preReadSuc = false;
                            continue;
                        }

                        // 2、push time-ring
                        final List<Integer> jobIds = new ArrayList<>();
                        for (XxlJobInfo jobInfo : scheduleList) {
                            try {
                                // time-ring jump
                                if (nowTime > jobInfo.getTriggerNextTime() + PRE_READ_MS) {
                                    // 2.1、trigger-expire > 5s:pass && make next-trigger-time
                                    logger.warn(">>>>>>>>>>> xxl-job, schedule misfire, jobId = "
                                            + jobInfo.getId());

                                    // 1、misfire match
                                    MisfireStrategyEnum misfireStrategyEnum = MisfireStrategyEnum
                                            .match(jobInfo.getMisfireStrategy(),
                                                    MisfireStrategyEnum.DO_NOTHING);
                                    if (MisfireStrategyEnum.FIRE_ONCE_NOW == misfireStrategyEnum) {
                                        // FIRE_ONCE_NOW > trigger
                                        JobTriggerPoolHelper.trigger(jobInfo.getId(),
                                                TriggerTypeEnum.MISFIRE, -1, null, null, null);
                                        if (logger.isDebugEnabled()) {
                                            logger.debug(">>>>>>>>>>> xxl-job, schedule push trigger : jobId = "
                                                    + jobInfo.getId());
                                        }
                                    }

                                    // 2、fresh next
                                    refreshNextValidTime(jobInfo, new Date());
                                } else if (nowTime > jobInfo.getTriggerNextTime()) {
                                    // 2.2、trigger-expire < 5s:direct-trigger && make next-trigger-time

                                    // 1、trigger
                                    JobTriggerPoolHelper.trigger(jobInfo.getId(), TriggerTypeEnum.CRON, -1,
                                            null, null, null);
                                    if (logger.isDebugEnabled()) {
                                        logger.debug(">>>>>>>>>>> xxl-job, schedule push trigger : jobId = "
                                                + jobInfo.getId());
                                    }

                                    // 2、fresh next
                                    refreshNextValidTime(jobInfo, new Date());

                                    // next-trigger-time in 5s, pre-read again
                                    if (jobInfo.getTriggerStatus() == 1
                                            && nowTime + PRE_READ_MS > jobInfo.getTriggerNextTime()) {
                                        // 1、make ring second
                                        int ringSecond = (int) ((jobInfo.getTriggerNextTime() / 1000) % 60);
                                        // 2、push time ring
                                        pushTimeRing(ringSecond, jobInfo.getId());
                                        // 3、fresh next
                                        refreshNextValidTime(jobInfo, new Date(jobInfo.getTriggerNextTime()));
                                    }
                                } else {
                                    // 2.3、trigger-pre-read:time-ring trigger && make next-trigger-time
                                    // 1、make ring second
                                    int ringSecond = (int) ((jobInfo.getTriggerNextTime() / 1000) % 60);
                                    // 2、push time ring
                                    pushTimeRing(ringSecond, jobInfo.getId());
                                    // 3、fresh next
                                    refreshNextValidTime(jobInfo, new Date(jobInfo.getTriggerNextTime()));
                                }
                                // record successfully processed jobIds
                                jobIds.add(jobInfo.getId());
                            } catch (Throwable e) {
                                logger.error("P3|XXLJobFail|job schedule failed|{}|{}|jobId: {}, msg = {}",
                                        jobInfo.getJobTag(), jobInfo.getJobDesc(),
                                        jobInfo.getId(), e.getMessage(), e);
                            }
                        }

                        // 3、update trigger jobInfo
                        for (XxlJobInfo jobInfo : scheduleList) {
                            if (jobIds.contains(jobInfo.getId())) {
                                adminConfig.getXxlJobInfoDao().scheduleUpdate(jobInfo);
                            }
                        }
                    } catch (Throwable e) {
                        logger.error(">>>>>>>>>>> xxl-job, JobScheduleHelper#scheduleThread error, "
                                + "scheduleThreadToStop={}", scheduleThreadToStop, e);
                    } finally {
                        try {
                            if (lock.isOwnedByCurrentThread()) {
                                final Long acquireTime = lockAcquireTimeMap.get(lockFlag);
                                if (acquireTime != null
                                        && System.currentTimeMillis() - acquireTime
                                        > adminConfig.getLockMaxSeconds() * 1000L) {
                                    // the lock has been held for the maximum time; release it
                                    lock.release();
                                    lockAcquireTimeMap.remove(lockFlag);
                                    logger.info("released distributed lock: /xxl-job/schedule_lock_" + lockFlag);
                                }
                            }
                        } catch (Exception e) {
                            logger.warn("failed to check/release lock", e);
                        }

                        long cost = System.currentTimeMillis() - start;
                        // Wait seconds, align second
                        if (cost < 1000) { // scan-overtime, not wait
                            try {
                                // pre-read period: success > scan each second; fail > skip this period;
                                TimeUnit.MILLISECONDS
                                        .sleep((preReadSuc ? 1000 : PRE_READ_MS) - System.currentTimeMillis() % 1000);
                            } catch (InterruptedException e) {
                                if (!scheduleThreadToStop) {
                                    logger.error(e.getMessage(), e);
                                }
                            }
                        }
                    }
                }

                try {
                    if (lock != null && lock.isOwnedByCurrentThread()) {
                        lock.release();
                        lockAcquireTimeMap.remove(lockFlag);
                        logger.info("released distributed lock: /xxl-job/schedule_lock_" + lockFlag);
                    }
                } catch (Exception e) {
                    logger.warn("failed to release lock", e);
                }
                logger.info(">>>>>>>>>>> xxl-job, JobScheduleHelper#scheduleThread stop, shard = {}", lockFlag);
            }));
        }

        // ring thread
        ringThread = new Thread(() -> {
            final List<Integer> ringItemData = new ArrayList<>(1024);
            while (!ringThreadToStop) {
                // align second
                try {
                    TimeUnit.MILLISECONDS.sleep(1000 - System.currentTimeMillis() % 1000);
                } catch (InterruptedException e) {
                    if (!ringThreadToStop) {
                        logger.error(e.getMessage(), e);
                    }
                }
                try {
                    // second data
                    // also check one slot back, in case processing overran a tick
                    int nowSecond = Calendar.getInstance().get(Calendar.SECOND);
                    for (int i = 0; i < 2; i++) {
                        final BlockingQueue<Integer> queue = ringData.get((nowSecond + 60 - i) % 60);
                        if (queue != null && !queue.isEmpty()) {
                            queue.drainTo(ringItemData);
                        }
                    }
                    // ring trigger
                    if (logger.isDebugEnabled()) {
                        logger.debug(">>>>>>>>>>> xxl-job, time-ring beat : " + nowSecond
                                + " = " + Collections.singletonList(ringItemData));
                    }
                    if (ringItemData.size() > 0) {
                        // do trigger
                        for (int jobId : ringItemData) {
                            JobTriggerPoolHelper.trigger(jobId, TriggerTypeEnum.CRON, -1, null, null, null);
                        }
                        // clear
                        ringItemData.clear();
                    }
                } catch (Throwable e) {
                    logger.error(">>>>>>>>>>> xxl-job, JobScheduleHelper#ringThread error, "
                            + "ringThreadToStop={}", ringThreadToStop, e);
                }
            }
            logger.info(">>>>>>>>>>> xxl-job, JobScheduleHelper#ringThread stop");
        });
        ringThread.setDaemon(true);
        ringThread.setName("xxl-job-admin-JobScheduleHelper#ringThread");
        ringThread.start();
    }

    private static void refreshNextValidTime(XxlJobInfo jobInfo, Date fromTime) {
        Date nextValidTime = null;
        try {
            nextValidTime = generateNextValidTime(jobInfo, fromTime);
        } catch (Exception e) {
            logger.warn("P3|XXLJobFail|job failed|{}|{}|jobId: {}, failed to compute next trigger time, job auto-disabled, scheduleType={}, scheduleConf={}, errMsg={}",
                    jobInfo.getJobTag(), jobInfo.getJobDesc(), jobInfo.getId(), jobInfo.getScheduleType(),
                    jobInfo.getScheduleConf(), e.getMessage(), e);
            jobInfo.setTriggerStatus(0);
            jobInfo.setTriggerLastTime(0);
            jobInfo.setTriggerNextTime(0);
            return;
        }
        if (nextValidTime != null) {
            jobInfo.setTriggerLastTime(jobInfo.getTriggerNextTime());
            jobInfo.setTriggerNextTime(nextValidTime.getTime());
        } else {
            jobInfo.setTriggerStatus(0);
            jobInfo.setTriggerLastTime(0);
            jobInfo.setTriggerNextTime(0);
            logger.warn(">>>>>>>>>>> xxl-job, refreshNextValidTime fail for job: jobId={}, "
                            + "scheduleType={}, scheduleConf={}",
                    jobInfo.getId(), jobInfo.getScheduleType(), jobInfo.getScheduleConf());
        }
    }

    private void pushTimeRing(int ringSecond, int jobId) {
        // push async ring
        final BlockingQueue<Integer> queue = ringData.computeIfAbsent(ringSecond, k -> new LinkedBlockingQueue<>());
        queue.add(jobId);
        if (logger.isDebugEnabled()) {
            logger.debug(">>>>>>>>>>> xxl-job, schedule push time-ring : " + ringSecond
                    + " = " + new ArrayList<>(queue));
        }
    }

    public void toStop() {
        // 1、stop schedule
        scheduleThreadToStop = true;
        try {
            TimeUnit.SECONDS.sleep(1); // wait
        } catch (InterruptedException e) {
            logger.error(e.getMessage(), e);
        }
        if (scheduleExecutor != null) {
            scheduleExecutor.shutdown();
        }

        // if has ring data
        boolean hasRingData = false;
        if (!ringData.isEmpty()) {
            for (int second : ringData.keySet()) {
                BlockingQueue<Integer> tmpData = ringData.get(second);
                if (tmpData != null && tmpData.size() > 0) {
                    hasRingData = true;
                    break;
                }
            }
        }
        if (hasRingData) {
            try {
                TimeUnit.SECONDS.sleep(8);
            } catch (InterruptedException e) {
                logger.error(e.getMessage(), e);
            }
        }

        // stop ring (wait job-in-memory stop)
        ringThreadToStop = true;
        try {
            TimeUnit.SECONDS.sleep(1);
        } catch (InterruptedException e) {
            logger.error(e.getMessage(), e);
        }
        if (ringThread.getState() != Thread.State.TERMINATED) {
            // interrupt and wait
            ringThread.interrupt();
            try {
                ringThread.join();
            } catch (InterruptedException e) {
                logger.error(e.getMessage(), e);
            }
        }

        logger.info(">>>>>>>>>>> xxl-job, JobScheduleHelper stop");
    }

    // ---------------------- tools ----------------------

    public static Date generateNextValidTime(XxlJobInfo jobInfo, Date fromTime) throws Exception {
        ScheduleTypeEnum scheduleTypeEnum = ScheduleTypeEnum.match(jobInfo.getScheduleType(), null);
        if (ScheduleTypeEnum.CRON == scheduleTypeEnum) {
            return new CronExpression(jobInfo.getScheduleConf()).getNextValidTimeAfter(fromTime);
        } else if (ScheduleTypeEnum.FIX_RATE
                == scheduleTypeEnum /*|| ScheduleTypeEnum.FIX_DELAY == scheduleTypeEnum*/) {
            return new Date(fromTime.getTime() + Integer.parseInt(jobInfo.getScheduleConf()) * 1000);
        }
        return null;
    }
}