How do I use subprocess.run() to run a Hive query?
So I'm trying to execute a Hive query using the subprocess module, saving the output to a file data.txt as well as the logs (to log.txt), but I seem to be running into some trouble. I've looked at this gist and this SO question, but neither seems to give me what I need.
Here's what I'm running:
import subprocess
query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"
log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")
# note - "hive -e [query]" would normally just print all the results
# to the console after finishing
proc = subprocess.run(["hive", "-e", '"{}"'.format(query)],
                      stdin=subprocess.PIPE,
                      stdout=data_buff,
                      stderr=log_buff,
                      shell=True)
log_buff.close()
data_buff.close()
I also looked at this SO question regarding subprocess.run() vs subprocess.Popen, and I believe I want .run() because I want the call to block until the process finishes.
The final output should be the file data.txt with the tab-delimited results of the query, and log.txt with all of the logging produced by the hive job. Any help would be wonderful.
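For what it's worth, the blocking behavior can be exercised in isolation. Below is a minimal sketch (using echo as a stand-in for the hive binary, since Hive isn't assumed to be installed here) showing that subprocess.run() blocks until the child exits and that a file object passed as stdout receives the child's output:

```python
import subprocess
import tempfile
import os

query = "select 1;"  # placeholder query, not a real Hive statement
outfile = os.path.join(tempfile.mkdtemp(), "data.txt")

with open(outfile, "w") as out:
    # run() blocks until the child exits, then returns a CompletedProcess.
    # ["echo", query] stands in for ["hive", "-e", query].
    proc = subprocess.run(["echo", query], stdout=out)

print(proc.returncode)               # 0 on success
print(open(outfile).read().strip())  # the child's stdout, captured in the file
```

Nothing here needs shell=True: the argument list is passed straight to the child, and the redirection is done by the stdout= file object, not by the shell.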
Update:
With the above, I'm currently getting the following output:
log.txt:
[ralston@tpsci-gw01-vm tmp]$ cat log.txt
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/y/share/hadoop-2.8.3.0.1802131730/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/y/libexec/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/y/libexec/hive/conf/hive-log4j.properties
data.txt:
[ralston@tpsci-gw01-vm tmp]$ cat data.txt
hive> [ralston@tpsci-gw01-vm tmp]$
And I can verify that the java/hive processes are running:
[ralston@tpsci-gw01-vm tmp]$ ps -u ralston
PID TTY TIME CMD
14096 pts/0 00:00:00 hive
14141 pts/0 00:00:07 java
14259 pts/0 00:00:00 ps
16275 ? 00:00:00 sshd
16276 pts/0 00:00:00 bash
But it doesn't seem to be finishing, and it isn't logging everything I want.
So I managed to get this working with the following setup:
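One likely explanation for the stray hive> prompt (my reading, not something confirmed in the thread): on POSIX, when shell=True is combined with a list, only the first element is run as the shell command and the remaining elements become positional parameters of the shell itself, so hive is launched with no -e at all and drops into its interactive prompt, waiting on stdin. The effect is easy to demonstrate with echo:

```python
import subprocess

args = ["echo", "hello"]

# With shell=True, the list becomes: /bin/sh -c "echo" "hello"
# so "hello" is only the shell's $0, not an argument to echo.
with_shell = subprocess.run(args, shell=True, capture_output=True, text=True)

# Without shell=True, echo actually receives "hello".
without_shell = subprocess.run(args, capture_output=True, text=True)

print(repr(with_shell.stdout))     # '\n'  -- echo ran with no arguments
print(repr(without_shell.stdout))  # 'hello\n'
```

So dropping shell=True (or passing the whole command as a single string if shell=True is kept) would let hive actually see the -e flag.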
import subprocess
import time  # needed for the timeout loop below

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"
log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")
# Remove shell=True from proc, and add "> outfile.txt" to the command
proc = subprocess.Popen(["hive", "-e", '"{}"'.format(query), ">", "{}".format(outfile)],
                        stdin=subprocess.PIPE,
                        stdout=data_buff,
                        stderr=log_buff)
# keep track of job runtime and set limit
start, elapsed, finished, limit = time.time(), 0, False, 60
while not finished:
    try:
        outs, errs = proc.communicate(timeout=10)
        print("job finished")
        finished = True
    except subprocess.TimeoutExpired:
        elapsed = abs(time.time() - start) / 60.
        if elapsed >= limit:
            print("Job took over 60 mins")
            break
        print("Comm timed out. Continuing")
        continue
print("done")
log_buff.close()
data_buff.close()
which produces the output as desired. I knew about process.communicate(), but it previously didn't work. I believe the issue was related to not adding the output file with > ${outfile} to the hive query.
Feel free to add any details. I've never seen anyone have to loop over proc.communicate(), so I suspect I may be doing something wrong.
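The communicate() loop above can be exercised with a short-lived placeholder command (sleep here, instead of hive) to see the pattern in isolation:

```python
import subprocess
import time

# sleep stands in for the long-running hive job.
proc = subprocess.Popen(["sleep", "1"])

start, finished, limit_minutes = time.time(), False, 60
while not finished:
    try:
        # Raises TimeoutExpired if the child is still running after 0.5 s.
        proc.communicate(timeout=0.5)
        finished = True
    except subprocess.TimeoutExpired:
        if (time.time() - start) / 60.0 >= limit_minutes:
            break  # give up after the limit

print(finished, proc.returncode)
```

Two caveats on the original snippet, offered as my reading rather than anything confirmed in the thread: since nothing is piped, Popen.wait(timeout=...) in the same try/except shape would do the same job as the communicate() loop; and without shell=True, the ">" and outfile list elements are passed to hive as literal arguments rather than performing a shell redirection, so the actual redirection comes from stdout=data_buff.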