记一次Alluxio HA master启动失败
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了记一次Alluxio HA master启动失败相关的知识,希望对你有一定的参考价值。
1. 今天遇到一个情况,就是alluxio不能正常访问,经过日志查看,发现下面错误。2018-05-14 03:35:58,680 ERROR logger.type (HdfsUnderFileSystem.java:open) - 4 try to open hdfs://sandy-bridge/user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001 : Cannot obtain block length for LocatedBlock{BP-1941630157-10.16.13.73-1486732586674:blk_1322900685_252817168; getBlockSize()=254; corrupt=false; offset=0; locs=[10.16.13.189:1019, 10.16.13.84:1019, 10.16.13.128:1019]; storageIDs=[DS-30126b4d-afdf-449a-8de1-e479c1abf33d, DS-ed2e905e-fa43-4f51-801f-3305da180d2a, DS-0e1946c8-dccb-4143-8d74-c11d8d429d02]; storageTypes=[DISK, DISK, DISK]} java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1941630157-10.16.13.73-1486732586674:blk_1322900685_252817168; getBlockSize()=254; corrupt=false; offset=0; locs=[10.16.13.189:1019, 10.16.13.84:1019, 10.16.13.128:1019]; storageIDs=[DS-30126b4d-afdf-449a-8de1-e479c1abf33d, DS-ed2e905e-fa43-4f51-801f-3305da180d2a, DS-0e1946c8-dccb-4143-8d74-c11d8d429d02]; storageTypes=[DISK, DISK, DISK]} at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:400) at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:305) at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:242) at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:235) at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1487) at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:302) at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:298) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766) at alluxio.underfs.hdfs.HdfsUnderFileSystem.open(HdfsUnderFileSystem.java:387) at alluxio.underfs.BaseUnderFileSystem.open(BaseUnderFileSystem.java:124) at alluxio.master.journal.JournalReader.getNextInputStream(JournalReader.java:114) at alluxio.master.journal.JournalTailer.processNextJournalLogFiles(JournalTailer.java:118) at alluxio.master.AbstractMaster.start(AbstractMaster.java:140) at alluxio.master.file.FileSystemMaster.start(FileSystemMaster.java:419) at alluxio.master.DefaultAlluxioMaster.startMasters(DefaultAlluxioMaster.java:263) at alluxio.master.FaultTolerantAlluxioMaster.start(FaultTolerantAlluxioMaster.java:91) at alluxio.ServerUtils.run(ServerUtils.java:38)
2. 首先是怀疑文件log.00000000000000000001损坏,经过hfs fsck的检查,并没有发现corruption,但是Total size: 0,这是个问题。
[[email protected] hdfs]$ hdfs fsck /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001 Connecting to namenode via http://hdfs-namenode.eu-central-1.compute.internal:50070/fsck?ugi=hdfs&path=%2Fuser%2Falluxio%2Fjournal%2FFileSystemMaster%2Fcompleted%2Flog.00000000000000000001 FSCK started by hdfs (auth:KERBEROS_SSL) from /10.16.13.73 for path /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001 at Mon May 14 03:53:11 UTC 2018 Status: HEALTHY Total size: 0 B (Total open files size: 254 B) Total dirs: 0 Total files: 0 Total symlinks: 0 (Files currently being written: 1) Total blocks (validated): 0 (Total open file blocks (not validated): 1) Minimally replicated blocks: 0 Over-replicated blocks: 0 Under-replicated blocks: 0 Mis-replicated blocks: 0 Default replication factor: 3 Average block replication: 0.0 Corrupt blocks: 0 Missing replicas: 0 Number of data-nodes: 41 Number of racks: 1 FSCK ended at Mon May 14 03:53:11 UTC 2018 in 1 milliseconds The filesystem under path '/user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001' is HEALTHY
3. 将这个问题件mv走,再启动alluxio HA master,启动成功。
[[email protected] hdfs]$ hdfs dfs -mv /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001 /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001.bak [[email protected] hdfs]$ hdfs dfs -ls /user/alluxio/journal/FileSystemMaster/completed/ Found 2 items -rw-r--r-- 3 alluxio alluxio 254 2018-01-29 09:32 /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001.bak -rw-r--r-- 3 alluxio alluxio 397 2018-05-14 03:03 /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000002
4. 其中尝试过,将文件再mv回来,但是alluxio依然启动失败,还是最开始的错误。
5. 直接cat这个文件,发现也不能访问。
hdfs dfs -cat /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001.bak cat: Cannot obtain block length for LocatedBlock{BP-1941630157-10.16.13.73-1486732586674:blk_1322900685_252817168; getBlockSize()=254; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.16.13.189:1019,DS-30126b4d-afdf-449a-8de1-e479c1abf33d,DISK], DatanodeInfoWithStorage[10.16.13.128:1019,DS-0e1946c8-dccb-4143-8d74-c11d8d429d02,DISK], DatanodeInfoWithStorage[10.16.13.84:1019,DS-ed2e905e-fa43-4f51-801f-3305da180d2a,DISK]]}
6. 而正常的文件,输出如下:
[[email protected] hdfs]$ hdfs dfs -cat /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000002 NOT_PERSISTED(0,@ HPXhdatadownloadz_20180510130731077.zip" NOT_PERSISTED(0,@ HPXhdatadownloadzdatadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip" NOT_PERSISTED(0,@ HPXhdatadownloadzdatadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip" NOT_PERSISTED(0,@ HPXhdatadownloadzdatadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip" NOT_PERSISTED(0,@ HPXhdatadownloadz datadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip"
7. Alluxio master是启动成功了,但是丢了一部分数据。
这个问题,有时间,还要继续研究一下,看是否能将数据找回。
以上是关于记一次Alluxio HA master启动失败的主要内容,如果未能解决你的问题,请参考以下文章