记一次Alluxio HA master启动失败

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了记一次Alluxio HA master启动失败相关的知识,希望对你有一定的参考价值。

1. 今天遇到一个情况,就是alluxio不能正常访问,经过日志查看,发现下面错误。

2018-05-14 03:35:58,680 ERROR logger.type (HdfsUnderFileSystem.java:open) - 4 try to open hdfs://sandy-bridge/user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001 : Cannot obtain block length for LocatedBlock{BP-1941630157-10.16.13.73-1486732586674:blk_1322900685_252817168; getBlockSize()=254; corrupt=false; offset=0; locs=[10.16.13.189:1019, 10.16.13.84:1019, 10.16.13.128:1019]; storageIDs=[DS-30126b4d-afdf-449a-8de1-e479c1abf33d, DS-ed2e905e-fa43-4f51-801f-3305da180d2a, DS-0e1946c8-dccb-4143-8d74-c11d8d429d02]; storageTypes=[DISK, DISK, DISK]}
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1941630157-10.16.13.73-1486732586674:blk_1322900685_252817168; getBlockSize()=254; corrupt=false; offset=0; locs=[10.16.13.189:1019, 10.16.13.84:1019, 10.16.13.128:1019]; storageIDs=[DS-30126b4d-afdf-449a-8de1-e479c1abf33d, DS-ed2e905e-fa43-4f51-801f-3305da180d2a, DS-0e1946c8-dccb-4143-8d74-c11d8d429d02]; storageTypes=[DISK, DISK, DISK]}
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:400)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:305)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:242)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:235)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1487)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:302)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:298)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at alluxio.underfs.hdfs.HdfsUnderFileSystem.open(HdfsUnderFileSystem.java:387)
at alluxio.underfs.BaseUnderFileSystem.open(BaseUnderFileSystem.java:124)
at alluxio.master.journal.JournalReader.getNextInputStream(JournalReader.java:114)
at alluxio.master.journal.JournalTailer.processNextJournalLogFiles(JournalTailer.java:118)
at alluxio.master.AbstractMaster.start(AbstractMaster.java:140)
at alluxio.master.file.FileSystemMaster.start(FileSystemMaster.java:419)
at alluxio.master.DefaultAlluxioMaster.startMasters(DefaultAlluxioMaster.java:263)
at alluxio.master.FaultTolerantAlluxioMaster.start(FaultTolerantAlluxioMaster.java:91)
at alluxio.ServerUtils.run(ServerUtils.java:38)

2. 首先是怀疑文件log.00000000000000000001损坏,经过hfs fsck的检查,并没有发现corruption,但是Total size: 0,这是个问题

[[email protected] hdfs]$ hdfs fsck /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001
Connecting to namenode via http://hdfs-namenode.eu-central-1.compute.internal:50070/fsck?ugi=hdfs&path=%2Fuser%2Falluxio%2Fjournal%2FFileSystemMaster%2Fcompleted%2Flog.00000000000000000001
FSCK started by hdfs (auth:KERBEROS_SSL) from /10.16.13.73 for path /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001 at Mon May 14 03:53:11 UTC 2018
Status: HEALTHY
Total size: 0 B (Total open files size: 254 B)
Total dirs: 0
Total files: 0
Total symlinks: 0 (Files currently being written: 1)
Total blocks (validated): 0 (Total open file blocks (not validated): 1)
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 41
Number of racks: 1
FSCK ended at Mon May 14 03:53:11 UTC 2018 in 1 milliseconds
The filesystem under path '/user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001' is HEALTHY

3. 将这个问题件mv走,再启动alluxio HA master,启动成功。

[[email protected] hdfs]$ hdfs dfs -mv /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001 /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001.bak
[[email protected] hdfs]$ hdfs dfs -ls /user/alluxio/journal/FileSystemMaster/completed/
Found 2 items
-rw-r--r-- 3 alluxio alluxio 254 2018-01-29 09:32 /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001.bak
-rw-r--r-- 3 alluxio alluxio 397 2018-05-14 03:03 /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000002

4. 其中尝试过,将文件再mv回来,但是alluxio依然启动失败,还是最开始的错误。


5. 直接cat这个文件,发现也不能访问。

hdfs dfs -cat /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001.bak
cat: Cannot obtain block length for LocatedBlock{BP-1941630157-10.16.13.73-1486732586674:blk_1322900685_252817168; getBlockSize()=254; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.16.13.189:1019,DS-30126b4d-afdf-449a-8de1-e479c1abf33d,DISK], DatanodeInfoWithStorage[10.16.13.128:1019,DS-0e1946c8-dccb-4143-8d74-c11d8d429d02,DISK], DatanodeInfoWithStorage[10.16.13.84:1019,DS-ed2e905e-fa43-4f51-801f-3305da180d2a,DISK]]}

6. 而正常的文件,输出如下:

[[email protected] hdfs]$ hdfs dfs -cat /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000002
NOT_PERSISTED(0,@ HPXhdatadownloadz_20180510130731077.zip"
NOT_PERSISTED(0,@ HPXhdatadownloadzdatadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip"
NOT_PERSISTED(0,@ HPXhdatadownloadzdatadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip"
NOT_PERSISTED(0,@ HPXhdatadownloadzdatadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip"
NOT_PERSISTED(0,@ HPXhdatadownloadz datadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip"

7. Alluxio master是启动成功了,但是丢了一部分数据。

 

这个问题,有时间,还要继续研究一下,看是否能将数据找回。

以上是关于记一次Alluxio HA master启动失败的主要内容,如果未能解决你的问题,请参考以下文章

【Ambari-部署】记一次HDFS HA启用失败恢复过程

k8s记一次kubelet启动后master无法获取

k8s记一次kubelet启动后master无法获取

记一次启动 SpringBoot 失败的问题

记一次centos6升级salt-minion启动失败的问题

记一次Postgresql异常中断导致的启动失败