08网站建设百度提交入口网站怎么看
web/
2025/10/3 17:44:09/
文章来源:
08网站建设,百度提交入口网站怎么看,wordpress在线浏览pdf,安阳网站建设哪里最好背景#xff1a; 由于测试环境的磁盘满了#xff0c;导致多个NodeManager出现不健康状态#xff0c;查看了下#xff0c;基本都是data空间满导致#xff0c;不是删除日志文件等就能很快解决的#xff0c;只能删除一些历史没有用的数据。于是从大文件列表中#xff0c;找…背景 由于测试环境的磁盘满了导致多个NodeManager出现不健康状态查看了下基本都是data空间满导致不是删除日志文件等就能很快解决的只能删除一些历史没有用的数据。于是从大文件列表中找出2018年的spark作业的历史中间文件并彻底删除(跳过回收站)
/usr/local/hadoop-2.6.3/bin/hdfs dfs -rm -r -skipTrash /user/hadoop/.sparkStaging/application_1542856934835_1063/Job_20181128135837.jar 问题产生过程 hdfs删除大量历史文件过程中standby的namenode节点gc停顿时间过长退出了 当时没注意stanby已经退出了还在继续删除数据后面发现stanby停了后于是重启stanby的NN 启动时active的namenode已经删除了许多文件导致两个namenode保存的数据块信息不一致了出现大量数据块不一致的报错使得所有的DataNode在与NameNode节点汇报心跳时超时而被当做dead节点。 问题现象 active的namenode的webUI里datanode状态正常 但是standby的webUI里datanode全部dead日志显示datanode频繁连接standby的NN且被远程standby的NN连接关闭standby的NN显示一直在添加新的数据块 解决过程: 【重启stanby的NN】重启stanby的NN,重启后stanby的NN存在GC停顿时间长的日志之后出现大量写数据时管道断开(java.io.IOException: Broken pipe)的报错stanby的节点列表还都是dead状态且DataNode节点的日志大量报与stanby的NN的rpc端连接被重置错误(Connection reset by peer)这个过程之前还不知道原理原来DN也会往stanby的NN发送报告信息。 【active的NN诡异的挂掉】stanby的NN重启一段时间后发现active的NN也挂掉了而且日志没有明显的报错于是重启active的NN过后发现active和stanby的NN的50070的webUI中DataNode列表都是dead了且DataNode节点的日志依然大量报与NN的rpc端连接被重置错误(Connection reset by peer) 【尝试重启DataNode】尝试重启DataNode让其重新注册发现重启后还是依然报与NN的rpc连接重置的错 【刷新节点操作】问了位大神后尝试刷新节点让全部节点重新注册发现刷新节点失败也是报的rpc连接被重置问题 【排查网络问题】由于服务并没有挂掉且对应的rpc端口也有监听猜测是网络、dns、防火墙、时间同步等问题让运维一起排查都反馈没问题不过运维帮忙反馈该rpc端口有时可以连接有时超时连接中有大量的close_wait一般情况下说明程序已经没有响应了导致客户端主动断开请求。 【调整active的NN堆内存大小重启并刷新节点】于是猜想是不是现在active的NN的堆内存不足了导致大量的rpc请求被阻塞于是尝试调大active的NN的堆内存大小停止可能影响NN性能的JobHistoryServer、Balancer和自身的agent监控服务重启重启后发现active的DN节点列表已恢复正常但是stanby的DN节点列表还都是dead尝试再次刷新节点发现有刷新成功和刷新失败的rpc连接重置的日志观察节点列表仍然还不能恢复正常 【发现JN挂掉了一个】于是查看stanby的DN的日志发现报了JN连接异常的错误发现确实active的NN中的JN挂了重启JN以为节点恢复了发现还是不行大神指点JN挂掉其实无所谓确实理论上数据块信息都是在NN挂掉最多导致部分新数据没有后面还会补上的 【重启stanby的NN】在大神的指点下这时候重启stanby的NN能解决90%的问题重启后果然active和stanby的DN列表都恢复正常了。 相关日志 stanby的NameNode因gc停顿时间过长导致退出日志
2019-08-20 14:06:38,841 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /usr/local/hadoop-2.6.3/data/dfs/name/current/edits_inprogress_0000000000203838246 - /usr/local/hadoop-2.6.3/data/dfs/name/current/edits_0000000000203838246-0000000000203838808
2019-08-20 14:06:38,841 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 203838809
2019-08-20 14:06:40,408 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1069ms
GC pool PS MarkSweep had collection(s): count1 time1555ms
2019-08-20 14:06:45,513 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2139ms
GC pool PS MarkSweep had collection(s): count2 time2638ms
2019-08-20 14:06:45,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30749 milliseconds
2019-08-20 14:07:03,010 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3326ms
GC pool PS MarkSweep had collection(s): count11 time14667ms
2019-08-20 14:07:14,188 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1009ms
GC pool PS MarkSweep had collection(s): count1 time1509ms
2019-08-20 14:07:18,175 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2179ms
GC pool PS MarkSweep had collection(s): count2 time2678ms
2019-08-20 14:07:19,723 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1047ms
GC pool PS MarkSweep had collection(s): count1 time1540ms standby的Namenode重启后因合并editLog日志和数据块删除等操作导致gc停顿日志
The number of live datanodes 15 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
2019-08-20 14:54:31,557 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode ON.
The reported blocks 447854 needs additional 1040262 blocks to reach the threshold 0.9990 of total blocks 1489605.
The number of live datanodes 15 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
2019-08-20 14:54:32,799 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1387ms
GC pool PS MarkSweep had collection(s): count1 time1886ms
2019-08-20 14:54:39,305 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3923ms
GC pool PS MarkSweep had collection(s): count4 time4422ms
2019-08-20 14:54:55,588 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2695ms
GC pool PS MarkSweep had collection(s): count1 time3195ms
2019-08-20 14:56:11,593 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1670ms
GC pool PS MarkSweep had collection(s): count6 time6936ms
2019-08-20 14:56:14,517 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2424ms
GC pool PS MarkSweep had collection(s): count30 time41545ms
2019-08-20 14:56:16,459 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1441ms
GC pool PS MarkSweep had collection(s): count1 time1942ms
2019-08-20 14:56:17,653 ERROR org.mortbay.log: Error for /jmx
2019-08-20 14:56:28,608 ERROR org.mortbay.log: /jmx?qryHadoop:serviceNameNode,nameFSNamesystemState
2019-08-20 14:56:26,419 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1309ms
GC pool PS MarkSweep had collection(s): count1 time1809ms
2019-08-20 14:56:23,558 ERROR org.mortbay.log: Error for /jmx
2019-08-20 14:56:21,164 ERROR org.mortbay.log: handle failed
2019-08-20 14:56:19,957 ERROR org.mortbay.log: Error for /jmx
standby的Namenode重启后的写数据报错
The reported blocks 448402 needs additional 1039714 blocks to reach the threshold 0.9990 of total blocks 1489605.
The number of live datanodes 15 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
2019-08-20 14:58:08,273 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9001, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReceivedAndDeleted from 10.104.108.220:63143 Call#320840 Retry#0: output error
2019-08-20 14:58:08,290 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9001 caught an exception
java.io.IOException: Broken pipeat sun.nio.ch.FileDispatcherImpl.write0(Native Method)at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)at sun.nio.ch.IOUtil.write(IOUtil.java:65)at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2574)at org.apache.hadoop.ipc.Server.access$1900(Server.java:135)at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:978)at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1043)at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2095)
2019-08-20 14:58:08,273 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9001, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReceivedAndDeleted from 10.104.101.45:8931 Call#2372642 Retry#0: output error
2019-08-20 14:58:08,290 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9001 caught an exception
java.io.IOException: Broken pipeat sun.nio.ch.FileDispatcherImpl.write0(Native Method)at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)at sun.nio.ch.IOUtil.write(IOUtil.java:65)at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2574)at org.apache.hadoop.ipc.Server.access$1900(Server.java:135)at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:978)at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1043)at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2095) active的NN诡异地无明显报错的挂掉
2019-08-20 16:27:55,477 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /user/hive/warehouse/bi_ucar.db/fact_complaint_detail/.hive-staging_hive_2019-08-20_16-27-19_940_4374739327044819866-1/-ext-10000/_temporary/0/_temporary/attempt_201908201627_0030_m_000199_0/part-00199. BP-535417423-10.104.104.128-1535976912717 blk_1089864054_16247563{blockUCStateUNDER_CONSTRUCTION, primaryNodeIndex-1, replicas[ReplicaUnderConstruction[[DISK]DS-77fcebb8-363e-4b79-8eb6-974db40231cb:NORMAL:10.104.108.157:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6425eb5e-a10e-4f44-ae1e-eb0170d7e5c5:NORMAL:10.104.108.212:50010|RBW], ReplicaUnderConstruction[[DISK]DS-16b95ffb-ac8a-4c34-86bc-e0ee58380a60:NORMAL:10.104.108.170:50010|RBW]]}
2019-08-20 16:27:55,488 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.104.108.170:50010 is added to blk_1089864054_16247563{blockUCStateUNDER_CONSTRUCTION, primaryNodeIndex-1, replicas[ReplicaUnderConstruction[[DISK]DS-77fcebb8-363e-4b79-8eb6-974db40231cb:NORMAL:10.104.108.157:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6425eb5e-a10e-4f44-ae1e-eb0170d7e5c5:NORMAL:10.104.108.212:50010|RBW], ReplicaUnderConstruction[[DISK]DS-16b95ffb-ac8a-4c34-86bc-e0ee58380a60:NORMAL:10.104.108.170:50010|RBW]]} size 0
2019-08-20 16:27:55,489 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.104.108.212:50010 is added to blk_1089864054_16247563{blockUCStateUNDER_CONSTRUCTION, primaryNodeIndex-1, replicas[ReplicaUnderConstruction[[DISK]DS-77fcebb8-363e-4b79-8eb6-974db40231cb:NORMAL:10.104.108.157:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6425eb5e-a10e-4f44-ae1e-eb0170d7e5c5:NORMAL:10.104.108.212:50010|RBW], ReplicaUnderConstruction[[DISK]DS-16b95ffb-ac8a-4c34-86bc-e0ee58380a60:NORMAL:10.104.108.170:50010|RBW]]} size 0
2019-08-20 16:27:55,489 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.104.108.157:50010 is added to blk_1089864054_16247563{blockUCStateUNDER_CONSTRUCTION, primaryNodeIndex-1, replicas[ReplicaUnderConstruction[[DISK]DS-77fcebb8-363e-4b79-8eb6-974db40231cb:NORMAL:10.104.108.157:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6425eb5e-a10e-4f44-ae1e-eb0170d7e5c5:NORMAL:10.104.108.212:50010|RBW], ReplicaUnderConstruction[[DISK]DS-16b95ffb-ac8a-4c34-86bc-e0ee58380a60:NORMAL:10.104.108.170:50010|RBW]]} size 0
2019-08-20 16:27:55,492 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hive/warehouse/bi_ucar.db/fact_complaint_detail/.hive-staging_hive_2019-08-20_16-27-19_940_4374739327044819866-1/-ext-10000/_temporary/0/_temporary/attempt_201908201627_0030_m_000199_0/part-00199 is closed by DFSClient_NONMAPREDUCE_1289526722_42
2019-08-20 16:27:55,511 INFO BlockStateChange: BLOCK* BlockManager: ask 10.104.132.196:50010 to delete [blk_1089864025_16247534, blk_1089863850_16247357]
2019-08-20 16:27:55,511 INFO BlockStateChange: BLOCK* BlockManager: ask 10.104.108.213:50010 to delete [blk_1089864033_16247542, blk_1089864028_16247537]
2019-08-20 16:27:55,568 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user hadoop
2019-08-20 16:27:55,616 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user hadoop
2019-08-20 16:27:55,715 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user hadoop
2019-08-20 16:27:55,715 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: there are no corrupt file blocks.
2019-08-20 16:27:56,661 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user hadoop
2019-08-20 16:27:56,665 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user hadoop
active节点刷新节点失败报rpc连接重置错误DataNode节点大量报与NN的RPC连接重置错误2019-08-20 16:20:15,284 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Copied BP-535417423-10.104.104.128-1535976912717:blk_1089743067_16126311 to /10.104.132.198:22528
2019-08-20 16:20:15,288 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Copied BP-535417423-10.104.104.128-1535976912717:blk_1089743066_16126310 to /10.104.132.198:22540
2019-08-20 16:20:17,097 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-535417423-10.104.104.128-1535976912717:blk_1089862867_16246374 src: /10.104.108.170:55257 dest: /10.104.108.156:50010
2019-08-20 16:20:17,111 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received BP-535417423-10.104.104.128-1535976912717:blk_1089862867_16246374 src: /10.104.108.170:55257 dest: /10.104.108.156:50010 of size 43153
2019-08-20 16:20:19,102 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: datanodetest17.bi/10.104.108.156; destination host is: namenodetest02.bi.10101111.com:9001; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)at org.apache.hadoop.ipc.Client.call(Client.java:1473)at org.apache.hadoop.ipc.Client.call(Client.java:1400)at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)at com.sun.proxy.$Proxy13.sendHeartbeat(Unknown Source)at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:140)at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:617)at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:715)at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:889)at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peerat sun.nio.ch.FileDispatcherImpl.read0(Native Method)at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)at sun.nio.ch.IOUtil.read(IOUtil.java:197)at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)at java.io.FilterInputStream.read(FilterInputStream.java:133)at java.io.FilterInputStream.read(FilterInputStream.java:133)at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)at java.io.BufferedInputStream.read(BufferedInputStream.java:265)at java.io.DataInputStream.readInt(DataInputStream.java:387)at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
2019-08-20 16:20:20,458 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-535417423-10.104.104.128-1535976912717:blk_1089862871_16246378 src: /10.104.108.170:55263 dest: /10.104.108.156:50010
2019-08-20 16:20:20,712 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.104.108.170:55263, dest: /10.104.108.156:50010, bytes: 196672, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1795425359_1146224, offset: 0, srvID: 9f1f3a39-a45d-4961-859f-c1953bde9a73, blockid: BP-535417423-10.104.104.128-1535976912717:blk_1089862871_16246378, duration: 98410715 其他方案 在尝试解决的过程中大神指点了尝试调大以下参数默认是10
dfs.namenode.handler.count dfs.namenode.service.handler.count 我认为目前在未有明确报错信息的情况下不要盲目更改参数否则可能有其他副作用 总结 大量删除文件过程中必须时刻关注active和stanby的NN的状态和日志一旦发现异常信息及时停止删除避免发生后续NN挂掉或者其数据丢失问题 当active和stanby的NN都启动且WebUI中的DN列表如果都是dead的情况下可以尝试先刷新节点让其重新注册有机会恢复正常 stanby的NN在大量的操作导致edits过大standby节点合并的时候就可能发生gc暂停时间过长而退出应避免连续的大量文件操作 rpc端口有时可以连接有时超时连接中有大量的close_wait一般情况下说明程序已经没有响应了导致客户端主动断开请求可能是NN所在节点的对内存不足触发gc导致大量线程阻塞使得rpc请求超时而重置 调整NameNode的堆内存大小在hadoop-env.sh中配置HADOOP_NAMENODE_OPTS
本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/web/86341.shtml
如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!