HDFS "DataXceiver error processing WRITE_BLOCK operation" Taking Down HRegionServers: Troubleshooting and Fix

I recently stood up a new test environment on fairly recent versions: HBase (2.3.x) + Hadoop (3.2.x).

Every night in the small hours, nodes would start going down, then failures would cascade, then the whole cluster would avalanche, and the next day I would get dissed for it...

After combing through the logs, I found several problem spots:

1. ERROR [regionserver/hbase1:16020] zookeeper.ZKWatcher: regionserver:16020-0xxxxxx, quorum=master1:2181,master2:2181,master3:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/master

2. INFO [main-SendThread(master2:2181)] zookeeper.ClientCnxn: Session establishment complete on server master2/172.xxx.xxx.1:2181, sessionid = 0xxxxxx, negotiated timeout = 30000

3. INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3213ms
GC pool 'ParNew' had collection(s): count=1 time=48ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=3608ms

4. INFO [AsyncFSWAL-0hdfs://master/hbase] wal.AbstractFSWAL: Slow sync cost: 321 ms, current pipeline: [DatanodeInfoWithStorage[172.xxx.xxx.1:9866,DS-xxx,DISK], DatanodeInfoWithStorage[172.xxx.xxx.2:9866,DS-xxx,DISK], DatanodeInfoWithStorage[172.xxx.xxx.3:9866,DS-xxx,DISK]]
INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=9.13 GB, freeSize=454.38 MB, max=9.57 GB, blockCount=142290, accesses=26146057, hits=22935485, hitRatio=87.72%, , cachingAccesses=25568452, cachingHits=22876827, cachingHitsRatio=89.47%, evictions=2841, evicted=2481248, evictedPerRun=873.3713481168603
INFO [hbase2:16020Replication Statistics #0] regionserver.Replication: Global stats: WAL Edits Buffer Used=0B, Limit=268435456B

And one more entry, this time from the ZooKeeper leader's log:

5. WARN [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.xxx.xxx.7:39505, session = 0xxxxx
at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Weighing all of these log entries together, my read was that heavy WAL sync traffic plus the long GC pauses was making the RegionServer miss its ZooKeeper heartbeats and blow past the tickTime-derived session timeout. As a first attempt I simply raised the timeout to 3 minutes:

1. Add/modify hbase-site.xml

<property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
</property>

2. Add/modify zoo.cfg

tickTime=9000
minSessionTimeout=18000
maxSessionTimeout=180000

After pushing the config changes out with Ansible, I restarted the ZooKeeper cluster and then the HBase cluster.
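It is worth confirming afterwards that the new timeout actually took effect, since the ZooKeeper server clamps whatever zookeeper.session.timeout the client asks for into [minSessionTimeout, maxSessionTimeout]; the value that ends up mattering is the negotiated one printed in log entry 2 above. A minimal check, assuming nc is installed and the conf four-letter word is allowed (on ZooKeeper 3.5+ it has to be listed in 4lw.commands.whitelist); the log path is just an example:

# Limits the ZooKeeper server is actually enforcing
echo conf | nc master2 2181 | grep -Ei 'tickTime|SessionTimeout'

# What the RegionServer negotiated after the restart (adjust the log path to your install)
grep 'negotiated timeout' /var/log/hbase/hbase-*-regionserver-*.log | tail -n 3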

--------------------------------------------------

Update, February 21, 2022:

It ran for another two days and still crashed, though less often, and right before each crash memory usage was above 98%, so something else clearly still needed changing. I went back through every log on the whole cluster and found a problem I had missed before:

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hbase01:9866:DataXceiver error processing WRITE_BLOCK operation src: /172.xxx.xxx.1:52236 dst: /172.xxx.xxx.2:9866
java.io.IOException: Premature EOF from inputStream

This is the classic "file operation exceeded its lease" symptom: in practice the write stream gets cut off partway through the block transfer, which points at the DataNode running out of transfer threads, so the ceiling on dfs.datanode.max.transfer.threads needed to be raised.

I checked the config file, found it was set to 4096, and simply doubled it:

<property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>8192</value>
</property>
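To confirm which value a DataNode will actually load, a quick sketch (run on a DataNode host with the Hadoop CLI on the PATH; it resolves the key from the configuration files on that host, not from the live daemon, so the DataNode still needs a restart to pick it up):

# Should print 8192 after the config push; 4096 means the node still sees the old hdfs-site.xml
hdfs getconf -confKey dfs.datanode.max.transfer.threads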

I also touched the WAL-related settings and, for our read-heavy, write-light workload, adjusted the read and write cache fractions (the two values below must add up to no more than 0.8 of the heap, otherwise the RegionServer refuses to start):

  <property>
    <name>hfile.block.cache.size</name>
    <value>0.3</value>
  </property>

  <property>
    <name>hbase.regionserver.global.memstore.size</name>
    <value>0.2</value>
  </property>
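To see which fractions a running RegionServer actually loaded, its info server exposes the live configuration over HTTP via the standard /conf servlet the HBase daemons inherit from Hadoop. A sketch, assuming the default info port 16030 and using hbase2 from the logs above as the host:

# Dump the RegionServer's live config and pull out the two cache settings
curl -s 'http://hbase2:16030/conf?format=json' \
  | tr ',' '\n' \
  | grep -A1 -E 'hfile\.block\.cache\.size|hbase\.regionserver\.global\.memstore\.size'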

After pushing the configs out with Ansible, I restarted the HBase cluster and the Hadoop cluster.

--------------------------------------------------

Update, February 24, 2022:

Another avalanche today. I was livid. I combed through the logs and config files on every single node and finally found the real problem.

I had updated Hadoop's hdfs-site.xml.

BUT

BUT

BUT

......

I never copied the updated hdfs-site.xml into HBase's conf directory

I never copied the updated hdfs-site.xml into HBase's conf directory

I never copied the updated hdfs-site.xml into HBase's conf directory

AAAARGH
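The fix itself is a one-liner per node: make HBase's bundled HDFS client read the same files as Hadoop. A sketch, assuming HADOOP_HOME and HBASE_HOME point at the installs (a copy kept in sync by Ansible works just as well as a symlink):

# Let HBase's HDFS client see the cluster's hdfs-site.xml (and core-site.xml) instead of defaults
ln -sf "$HADOOP_HOME/etc/hadoop/hdfs-site.xml" "$HBASE_HOME/conf/hdfs-site.xml"
ln -sf "$HADOOP_HOME/etc/hadoop/core-site.xml" "$HBASE_HOME/conf/core-site.xml"
# Restart the RegionServers afterwards so the new client settings are loaded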

--------------------------------------------------

Update, February 26, 2022:

Today... it went down again... I had no words left. I was starting to suspect a combined bug between this 2.3.x release and CMS.

On the advice of someone more experienced, I raised nofile, sigpending and nproc (limits.conf entries; "user" below stands for the account the HBase/Hadoop processes run as):

user soft sigpending 655350
user hard sigpending 655350
user soft nofile 655360
user hard nofile 655360
user soft nproc 655360
user hard nproc 655360
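Two caveats with these limits: they only apply to new login sessions, and anything launched through systemd takes LimitNOFILE/LimitNPROC from the unit file instead. To make sure they actually reach the RegionServer, check the live process once it is back up; a sketch:

# Limits of the running RegionServer process, not of a fresh shell
RS_PID=$(pgrep -f HRegionServer | head -n 1)
grep -E 'open files|processes|pending signals' "/proc/${RS_PID}/limits"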

Restart... and wait... for good news...

--------------------------------------------------

Update, February 27, 2022:

Well, it went down again today. I was completely stumped.

Time for the last resort: a downgrade.

I rolled HBase back from 2.3.7 to 2.3.6.

Now to wait... for the result.

--------------------------------------------------

Update, March 2, 2022:

I switched to G1GC in hbase-env.sh:

export HBASE_OPTS="-XX:+UseG1GC"
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:+UseG1GC -Xmx12g -Xms12g -XX:+UnlockExperimentalVMOptions -XX:-ResizePLAB -XX:MaxGCPauseMillis=100 -XX:ParallelGCThreads=4 -XX:ConcGCThreads=1 -XX:G1HeapWastePercent=10 -XX:InitiatingHeapOccupancyPercent=45 -XX:G1MixedGCLiveThresholdPercent=80 -XX:G1NewSizePercent=1 -XX:G1MaxNewSizePercent=15"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:+UseG1GC -Xmx28g -Xms28g -XX:+UnlockExperimentalVMOptions -XX:-ResizePLAB -XX:MaxGCPauseMillis=100 -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2 -XX:G1HeapWastePercent=10 -XX:InitiatingHeapOccupancyPercent=45 -XX:G1MixedGCLiveThresholdPercent=80 -XX:G1NewSizePercent=1 -XX:G1MaxNewSizePercent=15"
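To confirm the RegionServer really came back with G1 and the intended heap, the JDK's own tools are enough. A sketch (jcmd ships with the JDK and must be run as the same OS user as the RegionServer):

# Print the JVM flags of the live RegionServer and check for G1 / heap size / pause target
RS_PID=$(pgrep -f HRegionServer | head -n 1)
jcmd "$RS_PID" VM.flags | tr ' ' '\n' | grep -E 'UseG1GC|MaxHeapSize|MaxGCPauseMillis'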

This time, it finally held up. No more crashes.
