How to Print Thread Stacks for Impala in a CDH Cluster

Posted by hadoop123

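The previous article described how to print thread stacks for an Impala process. The JVM part is straightforward with jstack, but the C++ part needs gdb or breakpad and also requires compiling the source, which is tedious. This article shows how to print the thread stacks of an Impala process directly in a CDH cluster, with no need to compile the source. The first run still requires downloading a few tools, so it helps to pick one fixed machine in the cluster for the environment setup; later runs are then much quicker.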


1. Generate a Minidump File

Log in to the machine running impalad and find the impalad process ID.

$ ps aux | grep impalad
root 4374 0.0 0.0 12944 972 pts/0 S+ 16:49 0:00 grep --color=auto impalad
impala 29645 1.0 3.0 2999416 231972 ? Sl 16:17 0:20 /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad --flagfile=/run/cloudera-scm-agent/process/55-impala-IMPALAD/impala-conf/impalad_flags
impala 29652 0.0 0.1 197888 13556 ? Sl 16:17 0:00 python2.7 /usr/lib/cmf/agent/build/env/bin/cmf-redactor /usr/lib/cmf/service/impala/impala.sh impalad impalad_flags false

The process with PID 29645 above is the impalad process. Send it the SIGUSR1 signal to trigger a minidump:

$ kill -s SIGUSR1 29645
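
The lookup and the signal can also be scripted. The following is a minimal sketch, assuming the `ps aux` column layout shown above; the helper names `find_impalad_pids` and `request_minidump` are hypothetical and not part of Impala or CDH.

```python
import os
import signal


def find_impalad_pids(ps_output):
    """Return the PIDs whose command (argv[0]) is an impalad binary.

    Assumes `ps aux` output: ten leading columns, then the command line.
    Lines such as `grep ... impalad` or `python2.7 ... impalad` are
    skipped because their argv[0] is not named impalad.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        argv0 = fields[10].split()[0]
        if os.path.basename(argv0) == "impalad":
            pids.append(int(fields[1]))
    return pids


def request_minidump(pid):
    """Send SIGUSR1 to the process, the same as `kill -s SIGUSR1 <pid>`."""
    os.kill(pid, signal.SIGUSR1)
```

With the `ps` output above, `find_impalad_pids` would return only 29645. Calling `request_minidump` needs the same permissions as the `kill` command.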

The log file /var/log/impalad/impalad.INFO then contains a line like:

Wrote minidump to /var/log/impala-minidumps/impalad/3745e5d7-9281-4548-2fd5b4b1-adc7f7eb.dmp
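
To pick the minidump path out of the log automatically, a small parser is enough. This is a sketch that assumes the log message has exactly the "Wrote minidump to <path>.dmp" form shown above; `find_minidump_paths` is a hypothetical helper name.

```python
import re

# Matches the message shown above and captures the .dmp path.
MINIDUMP_LINE = re.compile(r"Wrote minidump to (\S+\.dmp)")


def find_minidump_paths(log_text):
    """Return every minidump path mentioned in the log, in order."""
    return MINIDUMP_LINE.findall(log_text)
```

The last element of the returned list is the most recently written minidump.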

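2. Generate Breakpad Symbol Files

2.1 Set Up the Breakpad Tools

The Impala source tree contains a script (bin/dump_breakpad_symbols.py) that generates symbol files in breakpad format. Download the Impala source that matches your version; releases are listed on the Cloudera GitHub release page: https://github.com/cloudera/Impala/releases

The CDH version in this example is 5.16.2, so download and extract https://github.com/cloudera/Impala/archive/cdh5.16.2-release.tar.gz (692MB).

Note: the Cloudera Impala repo is very large (15GB). If you only need the code for one version, there is no need to git clone it.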

wget https://github.com/cloudera/Impala/archive/cdh5.16.2-release.tar.gz
tar zxf cdh5.16.2-release.tar.gz
cd Impala-cdh5.16.2-release
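
For other CDH versions the URL and directory name can be derived from the version string. The sketch below assumes that other releases follow the same naming pattern as 5.16.2; that pattern is inferred from this one example, so check the release page before relying on it. Both function names are hypothetical.

```python
def release_tarball_url(cdh_version):
    """Build the source tarball URL, e.g. for cdh_version '5.16.2'.

    Assumption: every release uses the cdh<version>-release tag name.
    """
    return ("https://github.com/cloudera/Impala/archive/"
            "cdh%s-release.tar.gz" % cdh_version)


def extracted_dir_name(cdh_version):
    """Name of the directory created by extracting the tarball."""
    return "Impala-cdh%s-release" % cdh_version
```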

Before bin/dump_breakpad_symbols.py can run, the environment needs some setup. Make sure JAVA_HOME points to the correct directory, then run:

# Make sure JAVA_HOME is set and points to the correct directory
$ export JAVA_HOME=/usr/java/jdk1.8.0_162-cloudera
$ source bin/impala-config.sh

# Users in mainland China can use the Aliyun PyPI mirror
$ export PYPI_MIRROR="http://mirrors.aliyun.com/pypi"
$ $IMPALA_HOME/infra/python/deps/download_requirements
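
A quick sanity check of JAVA_HOME before sourcing impala-config.sh can save a failed run. This is a minimal sketch: `java_home_problem` is a hypothetical helper, and treating "the directory contains bin/java" as the meaning of "correct" is an assumption of this example.

```python
import os


def java_home_problem(environ):
    """Return a description of what is wrong with JAVA_HOME, or None.

    Assumption: a usable JAVA_HOME contains an executable at bin/java.
    """
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    java_bin = os.path.join(java_home, "bin", "java")
    if not os.path.isfile(java_bin):
        return "no java binary at %s" % java_bin
    return None
```

Calling `java_home_problem(os.environ)` checks the current shell environment.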

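Next, initialize breakpad inside the toolchain with bin/bootstrap_toolchain.py. By default this script downloads the entire toolchain, which takes a long time. Only the breakpad part is needed here, so modify bin/bootstrap_toolchain.py as follows: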

  # LLVM and Kudu are the largest packages. Sort them first so that
  # their download starts as soon as possible.
- packages = map(Package, ["llvm", "kudu",
-     "avro", "binutils", "boost", "breakpad", "bzip2", "cmake", "crcutil",
-     "flatbuffers", "gcc", "gflags", "glog", "gperftools", "gtest", "libev",
-     "lz4", "openldap", "openssl", "protobuf",
-     "rapidjson", "re2", "snappy", "thrift", "tpc-h", "tpc-ds", "zlib"])
- packages.insert(0, Package("llvm", "3.9.1-asserts"))
+ packages = map(Package, ["breakpad"])
  bootstrap(toolchain_root, packages)

In other words, in the last part of bootstrap_toolchain.py, drop every other package and keep only breakpad. Then run the script:

$ bin/bootstrap_toolchain.py
INFO:bootstrap_virtualenv:Creating python virtualenv
INFO:bootstrap_virtualenv:Installing packages into the virtualenv
INFO:bootstrap_virtualenv:Installing stage 2 packages into the virtualenv
2019-11-10 01:31:23,683 Thread-3 INFO: Downloading https://native-toolchain.s3.amazonaws.com/build/257-0847514126/breakpad/97a98836768f8f0154f8f86e5e14c2bb7e74132e-p2-gcc-4.9.2/breakpad-97a98836768f8f0154f8f86e5e14c2bb7e74132e-p2-gcc-4.9.2-ec2-package-ubuntu-16-04.tar.gz to /root/Impala-cdh5.16.2-release/toolchain/breakpad-97a98836768f8f0154f8f86e5e14c2bb7e74132e-p2-gcc-4.9.2-ec2-package-ubuntu-16-04.tar.gz (attempt 1)
2019-11-10 01:31:24,452 Thread-3 INFO: Extracting breakpad-97a98836768f8f0154f8f86e5e14c2bb7e74132e-p2-gcc-4.9.2-ec2-package-ubuntu-16-04.tar.gz

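2.2 Generate the Symbol Files

2.2.1 Using the Executable from the Local Parcel

dump_breakpad_symbols.py is now ready to use. The earlier ps output showed that the impalad process runs the executable /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad. Generate symbol files for it and place them under /tmp/syms: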

$ bin/dump_breakpad_symbols.py -f /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad -d /tmp/syms
INFO:root:Processing binary file: /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.

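2.2.2 Using the Executable from the deb Package

Symbol files generated this way contain no file names or line numbers. To map frames back to the source code as closely as possible, download and process the rpm/deb packages for your OS. These packages are available at http://archive.cloudera.com; for example, the Ubuntu packages for cdh5 are all under http://archive.cloudera.com/cdh5/ubuntu. This example uses Ubuntu 16.04, and every version of the Impala CDH package can be found under http://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/pool/contrib/i/impala. Download the following two files:

  • The deb package with the executables (345MB): http://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/pool/contrib/i/impala/impala_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb

  • The deb package with the debug information for those executables (471MB): http://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/pool/contrib/i/impala/impala-dbg_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb

Then run dump_breakpad_symbols.py again, this time on the two packages: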

$ bin/dump_breakpad_symbols.py -r ~/Downloads/impala_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb -s ~/Downloads/impala-dbg_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb -d /tmp/syms
INFO:root:Extracting to /tmp/tmpBDEwFI: /home/quanlong/Downloads/impala_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb
INFO:root:Extracting to /tmp/tmpBDEwFI: /home/quanlong/Downloads/impala-dbg_2.12.0+cdh5.16.2+0-1.cdh5.16.2.p0.22~xenial-cdh5.16.2_amd64.deb
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/libstdc++.so.6.0.20
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/libgcc_s.so.1
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/libkudu_client.so.0.1.0
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/libstdc++.so.6
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/libkudu_client.so.0
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/openssl/libssl.so
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/openssl/libcrypto.so
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/openssl/libcrypto.so.1.0.0
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/lib/openssl/libssl.so.1.0.0
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/sbin-debug/libfesupport.so
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/sbin-debug/impalad
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/sbin-retail/libfesupport.so
INFO:root:Processing binary file: /tmp/tmpBDEwFI/usr/lib/impala/sbin-retail/impalad

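The symbol information in /tmp/syms now includes file names and line numbers.

3. Resolve the Minidump with the Symbol Files

The minidump_stackwalk tool, located in the breakpad directory under toolchain in the Impala source tree, resolves a minidump against the symbol files. To write the resolved output to /tmp/resolved.txt and the breakpad log to /tmp/breakpad.log, run: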

$ toolchain/breakpad-97a98836768f8f0154f8f86e5e14c2bb7e74132e-p2/bin/minidump_stackwalk /var/log/impala-minidumps/impalad/3745e5d7-9281-4548-2fd5b4b1-adc7f7eb.dmp /tmp/syms > /tmp/resolved.txt 2>/tmp/breakpad.log
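
The same invocation can be wrapped in a script so that the two redirections are not forgotten. This is a sketch only: `resolve_minidump` is a hypothetical helper that mirrors the shell command above (stdout to the resolved file, stderr to the breakpad log) and passes the same two positional arguments.

```python
import subprocess


def resolve_minidump(stackwalk_bin, minidump_path, syms_dir,
                     out_path, log_path):
    """Run `<stackwalk_bin> <minidump> <syms_dir> > out 2> log`.

    Returns the exit code of the tool.
    """
    argv = [stackwalk_bin, minidump_path, syms_dir]
    with open(out_path, "w") as out, open(log_path, "w") as log:
        return subprocess.call(argv, stdout=out, stderr=log)
```

For the example in this article, the arguments would be the minidump_stackwalk path under toolchain, the .dmp file, /tmp/syms, /tmp/resolved.txt and /tmp/breakpad.log.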

The generated resolved.txt looks like this:

Operating system: Linux
0.0.0 Linux 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64
CPU: amd64
family 6 model 63 stepping 0
2 CPUs

GPU: UNKNOWN

Crash reason: DUMP_REQUESTED
Crash address: 0x217a097
Process uptime: not available

Thread 0 (crashed)
0 impalad!google_breakpad::ExceptionHandler::WriteMinidump() + 0x57
rax = 0x0000000002149a7e rdx = 0x0000000000000000
rcx = 0x000000000217a07f rbx = 0x0000000000000000
rsi = 0x0000000000000001 rdi = 0x00007ffed049f068
rbp = 0x00007ffed049f770 rsp = 0x00007ffed049efd0
r8 = 0x0000000000000000 r9 = 0x0000000000000024
r10 = 0x0000000002288a89 r11 = 0x0000000000000000
r12 = 0x00007ffed049f630 r13 = 0x0000000000d5cff0
r14 = 0x0000000000000000 r15 = 0x00007ffed049f690
rip = 0x000000000217a097
Found by: given as instruction pointer in context
1 impalad!google_breakpad::ExceptionHandler::WriteMinidump(std::string const&, bool (*)(google_breakpad::MinidumpDescriptor const&, void*, bool), void*) + 0xf0
rbx = 0x00007f92561325a0 rbp = 0x00007ffed049f770
rsp = 0x00007ffed049f620 r12 = 0x00007ffed049f630
r13 = 0x0000000000d5cff0 r14 = 0x0000000000000000
r15 = 0x00007ffed049f690 rip = 0x000000000217a960
Found by: call frame info
2 libpthread-2.23.so + 0x11390
rbx = 0x0000000000000000 rbp = 0x00007ffed049fdd0
rsp = 0x00007ffed049f780 r12 = 0x0000000007ada458
r13 = 0x0000000007ada480 r14 = 0x0000000000000000
r15 = 0x00007ffed049fdf0 rip = 0x00007f92556fe390
Found by: call frame info
3 impalad!boost::thread::join_noexcept() + 0x5c
rbp = 0x00007ffed049fdf0 rsp = 0x00007ffed049fde0
rip = 0x0000000001334cec
Found by: previous frame's frame pointer
4 impalad!impala::ThriftServer::Join() [thread.hpp : 767 + 0x8]
rbx = 0x000000000648b420 rbp = 0x00007ffed049fe80
rsp = 0x00007ffed049fe40 r12 = 0x00007f91fef44700
r13 = 0x00007ffed049ff20 r14 = 0x0000000006acbae0
r15 = 0x0000000000000002 rip = 0x0000000000b34f4f
Found by: call frame info
5 impalad!impala::ImpalaServer::Join() [impala-server.cc : 2151 + 0xc]
rbx = 0x0000000006621800 rbp = 0x00007ffed049feb0
rsp = 0x00007ffed049fe90 r12 = 0x00007ffed049ffb0
r13 = 0x00007ffed049ff20 r14 = 0x0000000006acbae0
r15 = 0x0000000000000002 rip = 0x0000000000c28f8a
Found by: call frame info
6 impalad!ImpaladMain(int, char**) [impalad-main.cc : 98 + 0xc]
rbx = 0x00007ffed049ff90 rbp = 0x00007ffed04a0130
rsp = 0x00007ffed049fec0 r12 = 0x00007ffed049ffb0
r13 = 0x00007ffed049ff20 r14 = 0x0000000006acbae0
r15 = 0x0000000000000002 rip = 0x0000000000c238e1

Found by: call frame info

......

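The first thread (Thread 0) is marked as crashed, but it is simply the thread that wrote the minidump; the Crash reason above already says DUMP_REQUESTED. When the process really crashes, a specific reason is shown instead. The resolved output includes many register values, which get in the way of reading the stacks. They can be filtered out: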

grep -v = /tmp/resolved.txt | grep -v 'Found by' | less

The stacks are then much easier to read:

Thread 119
0 libpthread-2.23.so + 0xd360
1 impalad!impala::io::DiskIoMgr::WorkLoop(impala::io::DiskIoMgr::DiskQueue*) [disk-io-mgr.cc : 977 + 0x5]
2 impalad!impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long>*) [function_template.hpp : 767 + 0x7]
3 impalad!boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long>*), boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long>*> > > >::run() [bind.hpp : 525 + 0x6]
4 impalad!thread_proxy + 0xda
5 libpthread-2.23.so + 0x76ba
6 libc-2.23.so + 0x10741d
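
The grep filter drops every line that contains "=", which would also drop a frame whose symbol name contains that character (an `operator==`, for instance). The sketch below is a slightly stricter variation, not the command from this article: it removes only lines that consist entirely of `name = 0x...` register pairs, plus the "Found by" lines. `strip_registers` is a hypothetical helper name.

```python
import re

# A line made up only of "reg = 0x<hex>" pairs, as in the output above.
REGISTER_LINE = re.compile(r"^\s*(?:\w+ = 0x[0-9a-fA-F]+\s*)+$")


def strip_registers(resolved_text):
    """Drop register dumps and 'Found by' lines, keep everything else."""
    kept = []
    for line in resolved_text.splitlines():
        if REGISTER_LINE.match(line):
            continue
        if line.strip().startswith("Found by"):
            continue
        kept.append(line)
    return "\n".join(kept)
```

Feeding it the contents of /tmp/resolved.txt yields the same kind of compact listing as the grep pipeline.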

4. Example of a Mistake

If the resolved file contains no function names, the symbol files did not match the minidump. breakpad.log may then contain entries like these:

2019-11-09 23:57:23: minidump_processor.cc:201: INFO: Looking at thread /var/log/impala-minidumps/impalad/9e41139b-a5b1-4f94-df3da8b6-c0c66040.dmp:0/155 id 0x73cd
2019-11-09 23:57:23: minidump.cc:473: INFO: MinidumpContext: looks like AMD64 context
2019-11-09 23:57:23: minidump.cc:473: INFO: MinidumpContext: looks like AMD64 context
2019-11-09 23:57:23: simple_symbol_supplier.cc:196: INFO: No symbol file at /tmp/syms/impalad/DD8351C4C1817BE1D142C187FA70CCAC0/impalad.sym
2019-11-09 23:57:23: stackwalker.cc:103: INFO: Couldn't load symbols for: /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad|DD8351C4C1817BE1D142C187FA70CCAC0
2019-11-09 23:57:23: simple_symbol_supplier.cc:196: INFO: No symbol file at /tmp/syms/libpthread-2.23.so/23E017CE2254FC6511D9BC8F534BB4F00/libpthread-2.23.so.sym
2019-11-09 23:57:23: stackwalker.cc:103: INFO: Couldn't load symbols for: /lib/x86_64-linux-gnu/libpthread-2.23.so|23E017CE2254FC6511D9BC8F534BB4F00

The key line is "No symbol file at /tmp/syms/impalad/DD...C0/impalad.sym", which means the required symbol file could not be found. Listing the /tmp/syms/impalad directory confirms that the ID does not match; the log asks for DD8351C4C1817BE1D142C187FA70CCAC0:

$ ls /tmp/syms/impalad/
7F9EC4C10024BDC531665853311E9CCE0

This happened because I picked the wrong impalad file to generate the symbols. The file to use is the one the impalad process is actually running, namely /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad
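
This comparison between the ID the log asks for and the IDs present on disk can be automated. The sketch below assumes the `<syms dir>/<module>/<ID>/<module>.sym` layout visible in the log lines above; `missing_symbols` and `available_ids` are hypothetical helper names.

```python
import os
import re

# Captures the module name and the ID from a "No symbol file at" entry.
NO_SYMBOL = re.compile(
    r"No symbol file at \S*/([^/\s]+)/([0-9A-Fa-f]+)/[^/\s]+\.sym")


def missing_symbols(breakpad_log):
    """Return {module name: ID the minidump asked for}."""
    return {m.group(1): m.group(2)
            for m in NO_SYMBOL.finditer(breakpad_log)}


def available_ids(syms_dir, module):
    """Return the IDs for which symbols were generated for this module."""
    path = os.path.join(syms_dir, module)
    return sorted(os.listdir(path)) if os.path.isdir(path) else []
```

If the ID returned by `missing_symbols` for impalad is not in `available_ids("/tmp/syms", "impalad")`, the symbols were generated from a different binary.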

The CDH parcel directory contains several impalad files, so take care not to pick the wrong one:

$ find /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8 -name impalad
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-retail/impalad
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/impala/sbin-debug/impalad
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/debug/usr/lib/impala/sbin-retail/impalad
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/debug/usr/lib/impala/sbin-debug/impalad
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/bin/impalad
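
One way to avoid guessing among these files is to ask the kernel which executable the process is running. On Linux, /proc/<pid>/exe is a symbolic link to that file. The sketch below is an illustration rather than part of the original procedure; `running_binary` is a hypothetical helper, and the `proc_root` parameter exists only to make it testable.

```python
import os


def running_binary(pid, proc_root="/proc"):
    """Return the resolved path of the executable behind a PID.

    Reads the <proc_root>/<pid>/exe symlink (Linux only).
    """
    return os.path.realpath(os.path.join(proc_root, str(pid), "exe"))
```

Reading the link for a process owned by another user (here, impala) typically requires root or that same user. The returned path is the one to pass to dump_breakpad_symbols.py with -f.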

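That said, it is still best to dump the symbols from the deb packages, because the result carries more complete information; see 2.2.2.

5. Summary

Steps:

  1. Trigger the minidump: kill -s SIGUSR1 $PID

  2. Generate the Breakpad symbol files: bin/dump_breakpad_symbols.py -f <impalad binary> -d /tmp/syms

  3. Resolve the minidump file: minidump_stackwalk <minidump file> /tmp/syms > /tmp/resolved.txt 2>/tmp/breakpad.log

See the sections above for the environment setup steps.

References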

https://cwiki.apache.org/confluence/display/IMPALA/Debugging+Impala+Minidumps

