Upgrading Kafka to v1.1: Errors Encountered and a Performance Summary

Posted 2023-02-28

Our department recently upgraded Kafka from v0.8.2 to v1.1.1 and ran into a few errors along the way. This post records them.

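We hit the following problem during the canary rollout of the producers.
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
At first we assumed a configuration problem and were baffled. Our workload is somewhat unusual: the producer caches data up to a certain volume before calling send, so we suspected oversized messages and tuned many parameters, none of which helped. Later, a related issue on GitHub mentioned that this error also appears when sending to a wrong topic. When we went back to check, we found that automatic topic creation was disabled on our cluster. We create topics by hand, and this one had been created with the wrong name, so the producer could never fetch metadata for the topic it was writing to.
There was no clear hint about the real cause, only a metadata timeout, which makes this an easy trap to fall into.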
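
Since the client only reports a metadata timeout, it can help to check the topic name against the cluster's actual topic list before starting the producer. The following is a minimal sketch, not the author's method: `check_topic` and the topic names are hypothetical, and it assumes the list of existing topics has already been obtained some other way (for example from the output of `kafka-topics.sh --list`).

```python
import difflib


def check_topic(topic, existing_topics):
    """Check a topic name against the topics that exist on the cluster.

    Returns (True, []) when the topic exists, otherwise (False, suggestions),
    where suggestions are existing names that look like a near miss.
    """
    if topic in existing_topics:
        return True, []
    # With auto-creation disabled, producing to a missing topic only
    # surfaces as "Failed to update metadata", so flag likely typos here.
    suggestions = difflib.get_close_matches(topic, existing_topics, n=3, cutoff=0.6)
    return False, suggestions


# Hypothetical topic names, for illustration only.
existing = ["order-events", "user-clicks"]
ok, suggestions = check_topic("order-event", existing)
if not ok:
    print("topic not found; did you mean:", suggestions)
```

A check like this turns a 60-second timeout with no explanation into an immediate, readable failure at startup.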

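After the producers were rolled out, we started the canary rollout of the consumers and noticed occasional spikes in the corresponding data. The consumer logs contained the following errors: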
[2020-04-07 22:56:35] [ERROR][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:... Offset commit failed on partition [topic-partition] at offset 277387: The request timed out.]
[2020-04-07 22:43:58] [WARN] [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:... failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.]
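Following the advice in the log, we increased max.poll.interval.ms and lowered max.poll.records. Things improved but did not fully recover. The official documentation shows that the default for max.poll.interval.ms is already large (5 minutes), so configuration was unlikely to be the cause. Given the earlier experience with the producer rollout, I suspected another hidden pitfall and checked the broker logs, but they only showed consumers leaving the group. As a last resort I asked the ops engineer on our team, who found that one machine had a faulty disk, which caused offset commits to fail occasionally.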
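
The reasoning that the settings were not to blame can be checked with simple arithmetic: a batch is only at risk when max.poll.records multiplied by the per-record processing time approaches max.poll.interval.ms. The sketch below is illustrative only. The function names and the 200 ms per-record cost are assumptions; 300000 ms and 500 records are the client defaults.

```python
def batch_fits(max_poll_records, per_record_ms, max_poll_interval_ms):
    """True if a full batch can be processed before the poll interval expires."""
    return max_poll_records * per_record_ms < max_poll_interval_ms


def max_records_within_budget(max_poll_interval_ms, per_record_ms, safety_factor=0.5):
    """Largest max.poll.records that keeps one batch within a fraction
    (safety_factor) of max.poll.interval.ms."""
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    budget_ms = max_poll_interval_ms * safety_factor
    return max(1, int(budget_ms // per_record_ms))


# Defaults: max.poll.interval.ms = 300000 (5 min), max.poll.records = 500.
# The 200 ms per-record processing cost is a made-up figure.
print(batch_fits(500, 200, 300_000))
print(max_records_within_budget(300_000, 200))
```

If the batch fits comfortably and rebalances still happen, as in this case, the cause is more likely outside the consumer configuration, for example on the broker hosts.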

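We upgraded to v1.1 in order to use LZ4 compression. Comparing before and after, inbound traffic on the brokers dropped by more than 50%, so in theory half the machines could handle the previous data volume. On the client side, the producers showed no obvious drop in throughput. The consumers now have to decompress, so poll latency rose noticeably, but they still handled the previous data volume without adding any consumer instances. Overall, the improvement is substantial.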
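
The "half the machines" estimate is proportional arithmetic. The sketch below spells it out under a loud assumption: that broker capacity is bound by inbound traffic and scales linearly, which ignores disk, replication, and headroom. The function name and the cluster size of 10 are hypothetical.

```python
def brokers_needed(current_brokers, inbound_reduction_percent):
    """Estimate the broker count after inbound traffic shrinks by a percentage.

    Assumes capacity scales linearly with inbound traffic (a simplification).
    """
    if not 0 <= inbound_reduction_percent < 100:
        raise ValueError("inbound_reduction_percent must be in [0, 100)")
    remaining_percent = 100 - inbound_reduction_percent
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, -(-current_brokers * remaining_percent // 100))


# Hypothetical cluster of 10 brokers with a 50% drop in inbound traffic.
print(brokers_needed(10, 50))
```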
