当前位置: 首页 > news >正文

Flink-SQL join 优化 -- MiniBatch + local-global

背景

  • 问题1. 近期在开发flink-sql期间,发现数据在启动后,任务总是进行重试,运行一段时间后,container心跳超时,内存溢出,作业无法进行正常工作
023-10-07 14:53:30,408 | INFO  | [flink-akka.actor.default-dispatcher-29] | Stopping worker container_e03_1678102291469_2749_01_000002(node-group-1jPmk0002.mrs-qrmc.com:8041). | org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager.internalStopWorker(ActiveResourceManager.java:461)
2023-10-07 14:53:30,408 | INFO  | [flink-akka.actor.default-dispatcher-29] | Stopping container container_e03_1678102291469_2749_01_000002(node-group-1jPmk0002.mrs-qrmc.com:8041). | org.apache.flink.yarn.YarnResourceManagerDriver.releaseResource(YarnResourceManagerDriver.java:298)
2023-10-07 14:53:30,409 | INFO  | [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #1] | Processing Event EventType: STOP_CONTAINER for Container container_e03_1678102291469_2749_01_000002 | org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$ContainerEventProcessor.run(NMClientAsyncImpl.java:955)
2023-10-07 14:53:30,824 | WARN  | [flink-akka.actor.default-dispatcher-29] | Remote connection to [/10.155.0.9:42366] failed with java.io.IOException: Connection reset by peer | akka.event.slf4j.Slf4jLogger$$anonfun$receive$1.$anonfun$applyOrElse$2(Slf4jLogger.scala:90)
2023-10-07 14:53:30,825 | WARN  | [flink-akka.actor.default-dispatcher-29] | Association with remote system [akka.tcp://flink@node-group-1jpmk0002.mrs-qrmc.com:32331] has failed, address is now gated for [50] ms. Reason: [Disassociated]  | akka.event.slf4j.Slf4jLogger$$anonfun$receive$1.$anonfun$applyOrElse$2(Slf4jLogger.scala:90)
2023-10-07 14:53:30,825 | WARN  | [flink-metrics-6] | Association with remote system [akka.tcp://flink-metrics@node-group-1jpmk0002.mrs-qrmc.com:28852] has failed, address is now gated for [50] ms. Reason: [Disassociated]  | akka.event.slf4j.Slf4jLogger$$anonfun$receive$1.$anonfun$applyOrElse$2(Slf4jLogger.scala:90)
2023-10-07 14:53:32,171 | INFO  | [flink-akka.actor.default-dispatcher-29] | Received 1 containers. | org.apache.flink.yarn.YarnResourceManagerDriver$YarnContainerEventHandler.lambda$onContainersAllocated$1(YarnResourceManagerDriver.java:620)
2023-10-07 14:53:32,172 | INFO  | [flink-akka.actor.default-dispatcher-29] | Received 1 containers with priority 1, 1 pending container requests. | org.apache.flink.yarn.YarnResourceManagerDriver.onContainersOfPriorityAllocated(YarnResourceManagerDriver.java:352)
2023-10-07 14:53:32,172 | INFO  | [flink-akka.actor.default-dispatcher-29] | Removing container request Capability[<memory:4096, vCores:3>]Priority[1]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]. | org.apache.flink.yarn.YarnResourceManagerDriver.removeContainerRequest(YarnResourceManagerDriver.java:405)
2023-10-07 14:53:32,172 | INFO  | [flink-akka.actor.default-dispatcher-29] | Accepted 1 requested containers, returned 0 excess containers, 0 pending container requests of resource <memory:4096, vCores:3>. | org.apache.flink.yarn.YarnResourceManagerDriver.onContainersOfPriorityAllocated(YarnResourceManagerDriver.java:392)
2023-10-07 14:53:32,172 | INFO  | [cluster-io-thread-4] | TaskExecutor container_e03_1678102291469_2749_01_000003(node-group-1jPmk0002.mrs-qrmc.com:8041) will be started on node-group-1jPmk0002.mrs-qrmc.com with TaskExecutorProcessSpec {cpuCores=3.0, frameworkHeapSize=128.000mb (134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes), taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemorySize=1.340gb (1438814063 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes), jvmOverheadSize=409.600mb (429496736 bytes), numSlots=6}. | org.apache.flink.yarn.YarnResourceManagerDriver.createTaskExecutorLaunchContext(YarnResourceManagerDriver.java:472)
2023-10-07 14:53:32,178 | INFO  | [cluster-io-thread-4] | Creating container launch context for TaskManagers | org.apache.flink.yarn.Utils.createTaskExecutorContext(Utils.java:500)
2023-10-07 14:53:32,179 | INFO  | [cluster-io-thread-4] | Starting TaskManagers | org.apache.flink.yarn.Utils.createTaskExecutorContext(Utils.java:520)
2023-10-07 14:53:32,182 | INFO  | [flink-akka.actor.default-dispatcher-29] | Requested worker container_e03_1678102291469_2749_01_000003(node-group-1jPmk0002.mrs-qrmc.com:8041) with resource spec WorkerResourceSpec {cpuCores=3.0, taskHeapSize=1.425gb (1530082070 bytes), taskOffHeapSize=0 bytes, networkMemSize=343.040mb (359703515 bytes), managedMemSize=1.340gb (1438814063 bytes), numSlots=6}. | org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager.lambda$requestNewWorker$1(ActiveResourceManager.java:419)
2023-10-07 14:53:32,182 | INFO  | [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #2] | Processing Event EventType: START_CONTAINER for Container container_e03_1678102291469_2749_01_000003 | org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$ContainerEventProcessor.run(NMClientAsyncImpl.java:955)
2023-10-07 14:53:32,554 | INFO  | [flink-akka.actor.default-dispatcher-29] | Heartbeat of TaskManager with id container_e03_1678102291469_2749_01_000002(node-group-1jPmk0002.mrs-qrmc.com:8041) timed out. | org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1372)
2023-10-07 14:53:32,558 | INFO  | [flink-akka.actor.default-dispatcher-29] | ConstraintEnforcer[20] (1/6) (1a65cb345c504de7b9704270217856a7) switched from RUNNING to FAILED on container_e03_1678102291469_2749_01_000002 @ node-group-1jPmk0002.mrs-qrmc.com (dataPort=32396). | org.apache.flink.runtime.executiongraph.Execution.transitionState(Execution.java:1424)
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container_e03_1678102291469_2749_01_000002(node-group-1jPmk0002.mrs-qrmc.com:8041) timed out.at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1373) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:155) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_332]at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_332]at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:443) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:443) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.actor.Actor.aroundReceive(Actor.scala:537) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.actor.Actor.aroundReceive$(Actor.scala:535) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.actor.ActorCell.invoke(ActorCell.scala:548) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.dispatch.Mailbox.run(Mailbox.scala:231) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at akka.dispatch.Mailbox.exec(Mailbox.scala:243) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_332]at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_332]at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_332]at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) [?:1.8.0_332]
2023-10-07 14:53:32,567 | INFO  | [flink-akka.actor.default-dispatcher-29] | Call stack:at java.lang.Thread.getStackTrace(Thread.java:1564)at org.apache.flink.runtime.executiongraph.Execution.transitionState(Execution.java:1432)at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1124)at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:908)at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.updateStateInternal(DefaultExecutionGraph.java:1318)at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.updateState(DefaultExecutionGraph.java:1277)at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:733)at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:51)at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(DefaultExecutionGraph.java:1543)at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1119)at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1059)at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:760)at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:195)at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:182)at org.apache.flink.runtime.scheduler.SharedSlot.lambda$release$4(SharedSlot.java:271)at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)at java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683)at java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010)at org.apache.flink.runtime.scheduler.SharedSlot.release(SharedSlot.java:271)at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:152)at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.releasePayload(DefaultDeclarativeSlotPool.java:482)at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.freeAndReleaseSlots(DefaultDeclarativeSlotPool.java:474)at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.releaseSlots(DefaultDeclarativeSlotPool.java:445)at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.internalReleaseTaskManager(DeclarativeSlotPoolService.java:249)at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.releaseTaskManager(DeclarativeSlotPoolService.java:230)at org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:505)at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.handleTaskManagerConnectionLoss(JobMaster.java:1378)at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1373)
  • 问题2. 未出现container心跳超时的,作业运行缓慢,超过一天 ,作业仍存在反压情况
    在这里插入图片描述

问题分析

  • 查看日志内容发现,出现内存溢出,
2023-10-07 17:58:27,526 | INFO  | [Checkpoint Timer] | Triggering checkpoint 11 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1696672707523 for job f80d756ccb03ed8d6c19d114e0ba9e63. | org.apache.flink.runtime.checkpoint.CheckpointCoordinator.createPendingCheckpoint(CheckpointCoordinator.java:832)
2023-10-07 17:59:31,393 | INFO  | [flink-metrics-4] | No response from remote for outbound association. Handshake timed out after [60000 ms]. | akka.event.slf4j.Slf4jLogger$$anonfun$receive$1.$anonfun$applyOrElse$3(Slf4jLogger.scala:96)
2023-10-07 17:59:31,396 | WARN  | [flink-metrics-4] | Association with remote system [akka.tcp://flink-metrics@node-group-1jpmk0002.mrs-qrmc.com:28852] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@node-group-1jpmk0002.mrs-qrmc.com:28852]] Caused by: [No response from remote for outbound association. Handshake timed out after [60000 ms].] | akka.event.slf4j.Slf4jLogger$$anonfun$receive$1.$anonfun$applyOrElse$2(Slf4jLogger.scala:90)
**2023-10-07 18:00:15,005 | INFO  | [flink-akka.actor.default-dispatcher-22] | LookupJoin[29] (3/3) (803c659fe1f2ef1c5886f73bab51a914) switched from RUNNING to FAILED on container_e03_1678102291469_2752_01_000006 @ node-group-1jPmk0002.mrs-qrmc.com (dataPort=32396). | org.apache.flink.runtime.executiongraph.Execution.transitionState(Execution.java:1424)
java.lang.OutOfMemoryError: GC overhead limit exceeded**at org.apache.flink.core.memory.MemorySegmentFactory.wrap(MemorySegmentFactory.java:48) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.data.binary.BinaryStringData.fromBytes(BinaryStringData.java:95) ~[flink-table-api-java-uber-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.data.StringData.fromBytes(StringData.java:59) ~[flink-table-api-java-uber-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.data.columnar.ColumnarRowData.getString(ColumnarRowData.java:123) ~[flink-table-api-java-uber-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.data.RowData.lambda$createFieldGetter$245ca7d1$1(RowData.java:221) ~[flink-table-api-java-uber-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.data.RowData$$Lambda$794/901417218.getFieldOrNull(Unknown Source) ~[?:?]at org.apache.flink.table.data.RowData.lambda$createFieldGetter$25774257$1(RowData.java:296) ~[flink-table-api-java-uber-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.data.RowData$$Lambda$788/1820263644.getFieldOrNull(Unknown Source) ~[?:?]at org.apache.flink.table.runtime.typeutils.RowDataSerializer.copyRowData(RowDataSerializer.java:170) ~[flink-table-runtime-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.runtime.typeutils.RowDataSerializer.copy(RowDataSerializer.java:131) ~[flink-table-runtime-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.runtime.typeutils.RowDataSerializer.copy(RowDataSerializer.java:48) ~[flink-table-runtime-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.connectors.hive.FileSystemLookupFunction.checkCacheReload(FileSystemLookupFunction.java:138) ~[flink-connector-hive_2.12-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.connectors.hive.FileSystemLookupFunction.eval(FileSystemLookupFunction.java:107) ~[flink-connector-hive_2.12-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at LookupFunction$606.flatMap(Unknown Source) ~[?:?]at org.apache.flink.table.runtime.operators.join.lookup.LookupJoinRunner.fetchElements(LookupJoinRunner.java:90) ~[flink-table-runtime-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.runtime.operators.join.lookup.LookupJoinRunner.lambda$processElement$0(LookupJoinRunner.java:76) ~[flink-table-runtime-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.runtime.operators.join.lookup.LookupJoinRunner$$Lambda$1805/947327052.run(Unknown Source) ~[?:?]at org.apache.flink.table.runtime.operators.join.lookup.AbstractLookupJoinRunner.withRetry(AbstractLookupJoinRunner.java:104) ~[flink-table-runtime-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.runtime.operators.join.lookup.LookupJoinRunner.processElement(LookupJoinRunner.java:71) ~[flink-table-runtime-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.table.runtime.operators.join.lookup.LookupJoinRunner.processElement(LookupJoinRunner.java:31) ~[flink-table-runtime-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:40) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:524) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$743/1473507981.runDefaultAction(Unknown Source) ~[?:?]at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:216) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:812) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:761) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]at org.apache.flink.runtime.taskmanager.Task$$Lambda$1317/775704149.run(Unknown Source) ~[?:?]at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2023-10-07 18:00:15,071 | INFO  | [flink-akka.actor.default-dispatcher-22] | Call stack:at java.lang.Thread.getStackTrace(Thread.java:1564)
  • checkpoint失败
2023-10-07 15:03:42,678 | WARN  | [flink-akka.actor.default-dispatcher-36] | **Failed to trigger or complete checkpoint 6 for job 0595b0727d9241894b541fd6e82af814. (0 consecutive failed attempts so far) |** org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:114)
2253 org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
2254         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1909) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2255         at org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2256         at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifyJobStatusChange(DefaultExecutionGraph.java:1512) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2257         at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1113) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2258         at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1082) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2259         at org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:590) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2260         at org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:369) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2261         at org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:345) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2262         at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:328) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2263         at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:303) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2264         at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:291) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2265         at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:282) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2266         at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:739) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2267         at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:51) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2268         at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(DefaultExecutionGraph.java:1543) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2269         at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1119) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2270         at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1059) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2271         at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:760) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2272         at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:195) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2273         at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:182) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2274         at org.apache.flink.runtime.scheduler.SharedSlot.lambda$release$4(SharedSlot.java:271) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2275         at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670) ~[?:1.8.0_332]
2276         at java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683) ~[?:1.8.0_332]
2277         at java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010) ~[?:1.8.0_332]
2278         at org.apache.flink.runtime.scheduler.SharedSlot.release(SharedSlot.java:271) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2279         at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:152) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2280         at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.releasePayload(DefaultDeclarativeSlotPool.java:482) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2281         at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.freeAndReleaseSlots(DefaultDeclarativeSlotPool.java:474) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2282         at org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.releaseSlots(DefaultDeclarativeSlotPool.java:445) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2283         at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.internalReleaseTaskManager(DeclarativeSlotPoolService.java:249) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2284         at org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.releaseTaskManager(DeclarativeSlotPoolService.java:230) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2285         at org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:505) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2286         at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.handleTaskManagerConnectionLoss(JobMaster.java:1378) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2287         at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1373) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2288         at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:155) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2289         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_332]
2290         at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_332]
2291         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:443) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2292         at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2293         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:443) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2294         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2295         at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2296         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) ~[flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2297         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2298         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2299         at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2300         at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2301         at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2302         at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2303         at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2304         at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2305         at akka.actor.Actor.aroundReceive(Actor.scala:537) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2306         at akka.actor.Actor.aroundReceive$(Actor.scala:535) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2307         at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2308         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2309         at akka.actor.ActorCell.invoke(ActorCell.scala:548) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2310         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2311         at akka.dispatch.Mailbox.run(Mailbox.scala:231) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2312         at akka.dispatch.Mailbox.exec(Mailbox.scala:243) [flink-rpc-akka_427efea8-e9d4-4177-abbe-967ec8f15a58.jar:1.15.0-h0.cbu.mrs.320.r33]
2313         at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_332]
2314         at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_332]
2315         at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_332]
2316         at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) [?:1.8.0_332]
2317 2023-10-07 15:03:42,681 | WARN  | [flink-akka.actor.default-dispatcher-19] | Remote connection to [null] failed with java.net.ConnectException: Connection refused: node-group-1jPmk0002.mrs-qrmc.com/10.155.0.9:32331 | akka.event.slf4j.Slf4jLogger$$anonfun$receive$1.$anonfun$applyOrElse$2(Slf4jLogger.scala:90)
2318 2023-10-07 15:03:42,681 | WARN  | [flink-akka.actor.default-dispatcher-19] | Association with remote system [akka.tcp://flink@node-group-1jpmk0002.mrs-qrmc.com:32331] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@node-group-1jpmk0002.mrs-qrmc.com:32331]] Caused by: [java.net.ConnectException     : Connection refused: node-group-1jPmk0002.mrs-qrmc.com/10.155.0.9:32331] | akka.event.slf4j.Slf4jLogger$$anonfun$receive$1.$anonfun$applyOrElse$2(Slf4jLogger.scala:90)
2319 2023-10-07 15:03:42,681 | INFO  | [SourceCoordinator-Source: dws_hljy_logistics_ykt_rytb_m_rec_consume[11]] | Removing registered reader after failure for subtask 0 of source Source: dws_hljy_logistics_ykt_rytb_m_rec_consume[11]. | org.apache.flink.runtime.source.coordinator.SourceCoordinator.lambda$subtaskFailed$3(SourceCoordinator.java:267     )
2320 2023-10-07 15:03:42,681 | INFO  | [flink-akka.actor.default-dispatcher-36] | Source: dws_hljy_logistics_ykt_rytb_m_rec_consume[11]
  • akka 超时
8 2023-10-07 15:02:01,864 | INFO  | [flink-scheduler-1] | Triggering Checkpoint 5 for job 0595b0727d9241894b541fd6e82af814 failed due to **java.util.concurrent.TimeoutException: Invocation of [RemoteRpcInvocation(TaskExecutorGateway.triggerCheckpoint(ExecutionAttemptID, long, long, CheckpointOptions))] at recipient [akka.tcp://flink@node-group-1     jpmk0002.mrs-qrmc.com:32331/user/rpc/taskmanager_0] timed out. This is usually caused by: 1) Akka failed sending the message silently, due to problems like oversized payload or serialization failures. In that case, you should find detailed error information in the logs. 2) The recipient needs more time for responding, due to problems like slow machines or network jitters. In that case, you can try to increase akka.ask.timeout. |** org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$triggerCheckpointRequest$10(CheckpointCoordinator.java:681)
2079 2023-10-07 15:02:01,865 | WARN  | [Checkpoint Timer] | Failed to trigger or complete checkpoint 5 for job 0595b0727d9241894b541fd6e82af814. (0 consecutive failed attempts so far) | org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:114)
2080 org.apache.flink.runtime.checkpoint.CheckpointException: Trigger checkpoint failure.
2081         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$triggerCheckpointRequest$10(CheckpointCoordinator.java:691) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2082         at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_332]
2083         at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_332]
2084         at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_332]
2085         at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_332]
2086         at org.apache.flink.util.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:914) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2087         at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_332]
2088         at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_332]
2089         at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_332]
2090         at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_332]
2091         at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$1(AkkaInvocationHandler.java:252) ~[?:?]
2092         at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_332]
2093         at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_332]
2094         at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_332]
2095         at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_332]
2096         at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1387) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2097         at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93) ~[?:?]
2098         at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[?:?]
2099         at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92) ~[?:?]
2100         at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_332]
2101         at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_332]
2102         at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_332]
2103         at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_332]
2104         at org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$1.onComplete(AkkaFutureUtils.java:45) ~[?:?]
2105         at akka.dispatch.OnComplete.internal(Future.scala:299) ~[?:?]
2106         at akka.dispatch.OnComplete.internal(Future.scala:297) ~[?:?]
2107         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:224) ~[?:?]
2108         at akka.dispatch.japi$CallbackBridge.apply(Future.scala:221) ~[?:?]
2109         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) ~[flink-scala_2.12-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2110         at org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$DirectExecutionContext.execute(AkkaFutureUtils.java:65) ~[?:?]
2111         at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) ~[flink-scala_2.12-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2112         at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) ~[flink-scala_2.12-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2113         at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) ~[flink-scala_2.12-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2114         at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) ~[flink-scala_2.12-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2115         at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:729) ~[?:?]
2116         at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:479) ~[?:?]
2117         at akka.dispatch.internal.SameThreadExecutionContext$$anon$1.unbatchedExecute(SameThreadExecutionContext.scala:21) ~[?:?]
2118         at akka.dispatch.BatchingExecutor.execute(BatchingExecutor.scala:133) ~[?:?]
2119         at akka.dispatch.BatchingExecutor.execute$(BatchingExecutor.scala:124) ~[?:?]
2120         at akka.dispatch.internal.SameThreadExecutionContext$$anon$1.execute(SameThreadExecutionContext.scala:20) ~[?:?]
2121         at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:365) ~[?:?]
2122         at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:314) ~[?:?]
2123         at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:318) ~[?:?]
2124         at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:270) ~[?:?]
2125         at java.lang.Thread.run(Thread.java:750) [?:1.8.0_332]
2126 Caused by: java.util.concurrent.TimeoutException: Invocation of [RemoteRpcInvocation(TaskExecutorGateway.triggerCheckpoint(ExecutionAttemptID, long, long, CheckpointOptions))] at recipient [akka.tcp://flink@node-group-1jpmk0002.mrs-qrmc.com:32331/user/rpc/taskmanager_0] timed out. This is usually caused by: 1) Akka failed sending the message      silently, due to problems like oversized payload or serialization failures. In that case, you should find detailed error information in the logs. 2) The recipient needs more time for responding, due to problems like slow machines or network jitters. In that case, you can try to increase akka.ask.timeout.
2127         at org.apache.flink.runtime.jobmaster.RpcTaskManagerGateway.triggerCheckpoint(RpcTaskManagerGateway.java:128) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2128         at org.apache.flink.runtime.executiongraph.Execution.triggerCheckpointHelper(Execution.java:854) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2129         at org.apache.flink.runtime.executiongraph.Execution.triggerCheckpoint(Execution.java:830) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2130         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerTasks(CheckpointCoordinator.java:745) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2131         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpointRequest(CheckpointCoordinator.java:678) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2132         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:645) ~[flink-dist-1.15.0-h0.cbu.mrs.320.r33.jar:1.15.0-h0.cbu.mrs.320.r33]
2133         at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_332]
2134         at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_332]
2135         at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) ~[?:1.8.0_332]
2136         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_332]
2137         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_332]
2138         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_332]
2139         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_332]
2140         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_332]
2141         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_332]

作业优化调整

初步优化

  • 根据上述3个原因,并且不存在数据倾斜的情况,适当增加内存,增加checkpoint超时/间隔时间,akka请求超时时间

-- 增加内存
set 'taskmanager.heap.size' = '5092MB';-- 增加checkpoint间隔/超时时间
set 'execution.checkpointing.timeout'='60min';
set 'execution.checkpointing.interval' = '10 s';-- 增加异步通信时间
set 'akka.ask.timeout' = '30 s';
  • 优化发现仍然未解决上述问题,再次查看flink-sql作业,发现仍然看到任务不断attempt,进一步分析发现flink-sql 数据流和外部维表进行join时耗时较久,将维表去除进行测试,发现反压很快消失确定问题方向后,进行针对性

二次优化

  • 根据发现的问题后,一是内存溢出,尽量减少维表数据量,根据业务需求、数据建模需要,只选择必须字段加工新的维表,减少join时的缓存数据量
如 维表 dim_a 有  col1,col2,col3,col4,col5,col6,确认只需要,col2,,col6,,则可以加工出 col2,col6的维表
  • 对指标计算,为减少数据计算,利用两阶聚合(先分桶和 group key聚合,再根据group key进行聚合)优势,minibatch(批次计算)优势,在牺牲较低延迟的基础,批次计算指标后,计算效率得到大幅提升,优化前 运行4d 40min 仍存在反压,优化后35min后,反压完全消失
  set 'table.exec.mini-batch.enabled'='true';set 'table.exec.mini-batch.allow-latency'='10 s';set 'table.exec.mini-batch.size'='5000';set 'table.optimizer.agg-phase-strategy'='TWO_PHASE';

优化对比

  • 优化前

在这里插入图片描述

  • 优化后
    在这里插入图片描述
  • 从下图发现,经过两阶聚合后,左边经过mini-batch,两阶聚合优化后,处理数据量明显减少
    在这里插入图片描述

总结

  • 定位问题时,需要根据具体日志信息进行分析,结合flink运行原理和监控页面一步步定位出现问题蒜子
  • 参考文档 https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/tuning/
    在这里插入图片描述
    在这里插入图片描述

相关文章:

Flink-SQL join 优化 -- MiniBatch + local-global

背景 问题1. 近期在开发flink-sql期间&#xff0c;发现数据在启动后&#xff0c;任务总是进行重试&#xff0c;运行一段时间后&#xff0c;container心跳超时&#xff0c;内存溢出&#xff0c;作业无法进行正常工作 023-10-07 14:53:30,408 | INFO | [flink-akka.actor.defa…...

在c#中使用NPOI结合Magicodes.IE.excel将xlsx文件内存中转换为xls文件

项目中使用Magicodes.IE作为导出excel的组件&#xff0c;但只支持新格式xlsx&#xff0c;有需求要导出旧格式xls文件&#xff0c;因此只能考虑转换的方案&#xff0c;经多种方案尝试和查找相关解决方案&#xff0c;在一份使用NPOI转换的xlsx到xls的文章到找到相关代码&#xff…...

面试经典 150 题 14 —(数组 / 字符串)— 134. 加油站

134. 加油站 方法一 class Solution { public:int canCompleteCircuit(vector<int>& gas, vector<int>& cost) {int minSpare std::numeric_limits<int>::max(); // 初始化最小剩余汽油量为整型的最大值int spare 0; // 当前剩余汽油量int len g…...

如何设计一个安全的对外接口?

转载 https://blog.csdn.net/weixin_46742102/article/details/108831868?spm1001.2101.3001.6650.1&utm_mediumdistribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1-108831868-blog-125359890.235%5Ev38%5Epc_relevant_anti_t3_base&depth_1-utm_…...

模拟pdf运行js脚本触发xss攻击及防攻击

一、引入pdfbox依赖 <dependency><groupId>org.apache.pdfbox</groupId><artifactId>pdfbox</artifactId><version>3.0.0</version> </dependency> 二、生成一个带js脚本的pdf文件 //Creating PDF document object PDDocum…...

【数据结构】树和二叉树概念及其结构

目录 一 树概念及结构 1 树的概念 2 树的相关概念 3 树的表示 二 二叉树概念及结构 1 概念 2 特殊二叉树 3 二叉树的性质 一 树概念及结构 1 树的概念 树是一种非线性的数据结构&#xff0c;它是由n&#xff08;n>0&#xff09;个有限结点组成一个具有层次关系的集…...

刘京城:我的《软件方法》学习经历(有彩蛋)

DDD领域驱动设计批评文集 做强化自测题获得“软件方法建模师”称号 《软件方法》各章合集 写在前面&#xff08;潘加宇&#xff09; 下面是刘京城写的关于他学习《软件方法》的经历。我在前面啰嗦几句。 我做软件建模方面的研究和普及工作已经24年了&#xff0c;和各行各业…...

浏览器详解(四) 渲染

大家好&#xff0c;我是半虹&#xff0c;这篇文章来讲浏览器渲染 1、基本介绍 浏览器是多进程多线程的架构&#xff0c;包括有浏览器进程、渲染器进程、GPU 进程、插件进程等 在上篇文章中我们介绍过浏览器进程&#xff0c;作为浏览器主进程&#xff0c;负责浏览器基本界面的…...

idea新建一个module时,文件夹显示灰色/pom.xml文件显示灰色且中间有条横线

1.问题 2.解决方法 File->Settings->Ignored Files->找到勾选的pom.xml文件&#xff0c;取消勾选&#xff0c;点击ok即可。 3.已解决...

NoSQL数据库(林子雨慕课课程)

文章目录 5.1 NoSQL数据库5.2 NoSQL和关系数据库的比较5.3 四大类型NoSQL数据库5.3.1 键值数据库和列族数据库5.3.2 文档数据库、图数据库、以及不同数据库比较分析 5.4 NoSQL数据库的理论基石CAP理论&#xff1a;BASE理论&#xff1a;Eventual consistency&#xff08;最终一致…...

模拟器运行在AndroidStudio内部,设置其独立窗口显示

在窗口内部运行 设置成独立窗口 Android Studio->Settings或Preferences->Tools->Emulator->取消勾选Launch in the Running Devices tool window --->点击右下角的OK按钮 ---> 重启Android Studio 再次启动模拟器...

计算机网络 | 体系结构

计算机网络 | 体系结构 计算机网络 | 体系结构概念及功能计算机网络简介计算机网络的功能因特网发展阶段小结 组成与分类计算机网络的组成计算机网络的分类小结 标准化工作及相关组织速率相关性能指标速率带宽吞吐量小结 时延相关性能指标时延时延带宽积往返时延RTT利用率小结 …...

ELK 处理 SpringCloud 日志

在排查线上异常的过程中&#xff0c;查询日志总是必不可缺的一部分。现今大多采用的微服务架构&#xff0c;日志被分散在不同的机器上&#xff0c;使得日志的查询变得异常困难。工欲善其事&#xff0c;必先利其器。如果此时有一个统一的实时日志分析平台&#xff0c;那可谓是雪…...

mac使用python递归删除文件夹下所有的.DS_Store文件

import osfolder_path "yourself file path"for root, dirs, files in os.walk(folder_path):for filename in files:if filename .DS_Store:file_path os.path.join(root, filename)os.remove(file_path)print("delete ok")...

Gitlab+Jenkins自动化部署,解放双手

项目打包 ​ 在部署项目前需要对源码进行打包&#xff0c;一个简单的SpringBoot项目默认是打包为jar包&#xff0c;也就是在pom.xml中的<packaging>jar</packaging>方式&#xff0c;当然也会有一些打包成war包方式&#xff0c;使用外置的Tomcat应用服务器部署war包…...

NNDL:作业3

在Softmax回归的风险函数(公式(3.39))中如果加上正则化项会有什么影响? (1) 在 Softmax 回归的风险函数中加入正则化项会对模型的训练产生影响。正则化项的作用是对模型的复杂度进行惩罚&#xff0c;防止过拟合的发生。 (2) 原书公式为&#xff1a; 在加入正则化后损失函数…...

dockers --cap-add 哪些值可以设置

--cap-add 参数可以用于向 Docker 容器添加不同的权限。除了 NET_ADMIN&#xff0c;还有一些其他常用的权限值&#xff0c;包括&#xff1a; SYS_ADMIN&#xff1a;添加系统管理员权限&#xff0c;允许容器内的进程执行系统级别的管理操作&#xff0c;如挂载文件系统、设置时间…...

golang常用库之-HTTP客户端请求库 grequests

文章目录 golang常用库之-HTTP客户端请求库 grequests什么是grequests使用 golang常用库之-HTTP客户端请求库 grequests 什么是grequests 官网&#xff1a;github.com/levigross/grequests A Go “clone” of the great and famous Requests library Go语言的grequests库是一…...

17基于matlab卡尔曼滤波的行人跟踪算法,并给出算法估计误差结果,判断算法的跟踪精确性,程序已调通,可直接运行,基于MATLAB平台,可直接拍下。

17基于matlab卡尔曼滤波的行人跟踪算法&#xff0c;并给出算法估计误差结果&#xff0c;判断算法的跟踪精确性&#xff0c;程序已调通&#xff0c;可直接运行&#xff0c;基于MATLAB平台&#xff0c;可直接拍下。 17matlab卡尔曼滤波行人跟踪 (xiaohongshu.com)...

SpringCloud之Stream框架集成RocketMQ消息中间件

Spring Cloud Stream 是一个用来为微服务应用构建消息驱动能力的框架。它可以基于 Spring Boot 来创建独立的、可用于生产的 Spring 应用程序。Spring Cloud Stream 为一些供应商的消息中间件产品提供了个性化的自动化配置实现&#xff0c;并引入了发布-订阅、消费组、分区这三…...

与创新者同行!Apache Doris 首届线下峰会即将开启,最新议程公开!|即刻预约

点击此处 即刻报名 Doris Summit Asia 2023 回顾人类的发展史&#xff0c;地球起源于 46 亿年前的原始星云、地球生命最初出现于 35 亿年前的原始海洋、人类物种诞生于数百万年前&#xff0c;而人类生产力的真正提升源于十八世纪六十年代的工业革命&#xff0c;自此以后&#…...

vue解决:Parsing error: No Babel config file detected for ....

报错信息 Parsing error: No Babel config file detected for C:\Users\Admin\Desktop\shabi\work\src\App.vue. Either disable config file checking with requireConfigFile: false, or configure Babel so that it can find the config files. 分析错误&#xff1a;没有检测…...

算法题:K 次取反后最大化的数组和(典型的贪心算法问题)

这道题没有看题解&#xff0c;直接提交&#xff0c;成绩超越99.5%&#xff0c;说明思路是优的。就是考虑的情况里面弯弯绕比较多&#xff0c;需要考虑全面一点。&#xff08;本题完整题目附在了最后面&#xff09; 具体思路如下&#xff1a; 1、首先排序&#xff0c;然后从最…...

Go语言中向[]byte数组中增加一个元素

要向http.Request的body中添加一个键值对&#xff0c;可以先将其转换为一个map&#xff0c;然后对其进行修改&#xff0c;最后再将其转回为byte数组。 以下是一个示例代码&#xff1a; import ("net/http""io/ioutil""encoding/json" )type Re…...

CSS 布局案例: 2行、多行每行格数不定,最后一列对齐

布局期望的效果如下&#xff1a; 第二行最后一格与第一行最后一格对齐。每行格数不定。自动拉伸填充整个宽度 实现&#xff1a; 一开始打算用display:flex&#xff0c; 自动分散&#xff0c;但是第二行对齐第一行最后一格控制不了。 使用grid fr均分单位控制。 <!DOCTYPE…...

数据结构--算法、数据结构的基本概念

&#x1f4d5;参考&#xff1a;王道 一、算法的基本概念 1.程序数据结构算法 2.算法的特性 &#xff08;1&#xff09;有穷性 执行有穷步之后结束&#xff0c;且每一步都可在有穷时间内完成。 &#xff08;2&#xff09;确定性 &#xff08;3&#xff09;可行性 可通过已经实…...

Edge浏览器下载文件被保存为 .crdownload 文件的问题小记

问题 近期使用Edge浏览器下载文件时&#xff0c;文件都被保存为 .crdownload 格式的文件了&#xff0c;不确定从哪个版本开始的。除非下载未完成导致文件不完整&#xff0c;否则不会被保存为 .crdownload 格式的文件&#xff1b;实际上文件已完成了下载&#xff0c;且手工修改…...

6-10 单链表分段逆转 分数 15

void K_Reverse( List L, int K ) { //此题已经默认size > K 因为当size < K时 反转后将不再符合链表的定义//求出表中元素个数int size 0;for (List cur L->Next; cur ! NULL; cur cur->Next)size; List prv, cur, next, first, head L;//共需要反转 si…...

【单片机】17-温度传感器DS18B20

1.DS18B20相关背景知识 1.温度传感器 &#xff08;1&#xff09;测温度的方式&#xff1a;物理&#xff08;汞柱&#xff0c;气压&#xff09;&#xff0c;电子&#xff08;金属电性能随温度变化&#xff09; &#xff08;2&#xff09;早期&#xff1a;热敏电阻&#xff08;模…...

力扣 -- 5. 最长回文子串

解题步骤&#xff1a; 参考代码&#xff1a; class Solution { public:string longestPalindrome(string s) {int ns.size();vector<vector<bool>> dp(n,vector<bool>(n));//最长回文串的起始位置int start0;//最长回文串的长度int len0;for(int in-1;i>…...