Hi, I'm Hu Xi. Today I'd like to talk with you about handling the CommitFailedException.
If you've used the Kafka Java Consumer client API, I'm sure this exception is no stranger to you. **As its name suggests, a CommitFailedException means the Consumer client ran into an error while committing offsets — and a serious, non-recoverable one at that.** If the failure were a recoverable transient error, the offset-commit API could work around it on its own, because many of the offset-commit methods support automatic retry on error, such as the **commitSync method** we covered in the previous installment.
Every appearance of CommitFailedException comes with a rather famous piece of commentary. Why do I call it "famous"? First, I can't think of any other exception class in Kafka's nearly 500,000 lines of source code that enjoys this kind of treatment: such a long comment devoted to explaining what the exception means. Second, despite all that explanatory text, many people remain confused about what the exception is actually trying to say.
Now, let's take in this passage together and see the community's latest explanation of the exception:
> Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
The first half of this passage says that this offset commit failed because the consumer group has already begun a Rebalance and has assigned the partitions whose offsets were being committed to another consumer instance. This happens when the interval between two consecutive calls to the poll method on your consumer instance exceeds the configured max.poll.interval.ms value. It usually indicates that your consumer instance is spending too long processing messages, delaying the next call to poll.
In the second half, the community offers two corresponding remedies:
1. Increase the configured interval, i.e., raise the max.poll.interval.ms parameter value.
2. Reduce the number of messages a single poll call returns, i.e., lower the max.poll.records parameter value.
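Both remedies boil down to ordinary Consumer configuration entries. Here is a minimal sketch of such a configuration; the class name and the specific numbers are illustrative assumptions on my part, not values from the community's text (the Kafka defaults are 300000 ms for max.poll.interval.ms and 500 for max.poll.records):

```java
import java.util.Properties;

public class PollTuning {
    public static Properties tunedProps() {
        Properties props = new Properties();
        // Remedy 1: give the poll loop more headroom between poll() calls.
        // Kafka's default is 300000 (5 minutes); raise it if processing is slow.
        props.put("max.poll.interval.ms", "600000");
        // Remedy 2: shrink each batch so one loop iteration finishes sooner.
        // Kafka's default is 500 records per poll() call.
        props.put("max.poll.records", "100");
        return props;
    }

    public static void main(String[] args) {
        Properties props = tunedProps();
        System.out.println(props.getProperty("max.poll.interval.ms"));
        System.out.println(props.getProperty("max.poll.records"));
    }
}
```

In practice you would pick only the knob that matches your bottleneck: raise the interval when processing is inherently slow, or shrink the batch when processing time scales with batch size.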
Before we discuss this text in detail, let me mention that there have actually been three versions of it in total. Besides the latest version above, there are two earlier ones:
> Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
> Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
These two earlier versions differ only slightly from the latest one, so I won't explain them in detail. The differences come down to which parameter the message blames and recommends: the oldest version attributes the problem to session.timeout.ms and suggests increasing the session timeout, while the middle version attributes it to max.poll.interval.ms yet still suggests increasing the session timeout. I list them here so that when you run into them later, you'll know they are all describing the same thing.
Whichever version of the text you encounter, they all signal that an offset commit failed. Let's now discuss when this exception gets thrown. From the source-code perspective, CommitFailedException typically arises during a manual offset commit, i.e., when the user explicitly calls the KafkaConsumer.commitSync() method. In terms of usage, there are two typical scenarios in which you may run into this exception.
**Scenario 1**
Let's start with the most common scenario. When the total message-processing time exceeds the configured max.poll.interval.ms value, the Kafka Consumer throws a CommitFailedException. This is the exception's most "orthodox" way of making an entrance. To reproduce it, you only need to write a Consumer program that subscribes to any topic with the KafkaConsumer.subscribe method, set the Consumer parameter max.poll.interval.ms to 5 seconds, and then, between successive calls to KafkaConsumer.poll in the loop, insert a Thread.sleep(6000) plus a manual offset commit. Here is the main code logic.
```java
…
Properties props = new Properties();
…
// 5-second max poll interval, deliberately shorter than the simulated processing time
props.put("max.poll.interval.ms", 5000);
consumer.subscribe(Arrays.asList("test-topic"));
while (true) {
    ConsumerRecords<String, String> records =
        consumer.poll(Duration.ofSeconds(1));
    // Use Thread.sleep to simulate real message-processing logic
    Thread.sleep(6000L);
    consumer.commitSync();
}
```
To prevent the exception from being thrown in this scenario, you need to streamline your message-processing logic. Specifically, there are four ways to do this.