Big Data Basics: Common Oozie Problems

Posted by barneywill


1 How do I view an Oozie job's logs?

Given the Oozie job ID, you can view the workflow details with the following command:

oozie job -info 0012077-180830142722522-oozie-hado-W
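If all you need is the job log itself, the same CLI can also fetch it directly (same job ID; pass -oozie http://<oozie-host>:11000/oozie if the OOZIE_URL environment variable is not set):

oozie job -log 0012077-180830142722522-oozie-hado-W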

 

The workflow details look like this:

Job ID : 0012077-180830142722522-oozie-hado-W

------------------------------------------------------------------------------------------------------------------------------------

Workflow Name :  $workflow_name

App Path      : hdfs://$hdfs_name/oozie/wf/$workflow_name.xml

Status        : KILLED

Run           : 0

User          : hadoop

Group         : -

Created       : 2018-09-25 02:51 GMT

Started       : 2018-09-25 02:51 GMT

Last Modified : 2018-09-25 02:53 GMT

Ended         : 2018-09-25 02:53 GMT

CoordAction ID: -

 

Actions

------------------------------------------------------------------------------------------------------------------------------------

ID                                                   Status    Ext ID                           Ext Status     Err Code
------------------------------------------------------------------------------------------------------------------------------------
0012077-180830142722522-oozie-hado-W@:start:         OK        -                                OK             -
------------------------------------------------------------------------------------------------------------------------------------
0012077-180830142722522-oozie-hado-W@$action_name    ERROR     application_1537326594090_5663   FAILED/KILLED  JA018
------------------------------------------------------------------------------------------------------------------------------------
0012077-180830142722522-oozie-hado-W@$kill_name      OK        -                                OK             E0729
------------------------------------------------------------------------------------------------------------------------------------
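To drill into the failed action itself, -info also accepts an action ID (the workflow ID plus @ plus the node name):

oozie job -info 0012077-180830142722522-oozie-hado-W@$action_name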

 

The failed action is defined as follows:

<action name="$action_name"> 

        <spark xmlns="uri:oozie:spark-action:0.1"> 

            <job-tracker>${job_tracker}</job-tracker> 

            <name-node>${name_node}</name-node> 

            <master>${jobmaster}</master> 

            <mode>${jobmode}</mode> 

            <name>${jobname}</name> 

            <class>${jarclass}</class> 

            <jar>${jarpath}</jar> 

            <spark-opts>${sparkopts}</spark-opts> 

        </spark>
        <!-- ok/error transitions restored; node names are masked placeholders like the rest of the workflow -->
        <ok to="$end_name"/>
        <error to="$kill_name"/>
</action>
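As an aside, the ${sparkopts} property is where the usual spark-submit flags end up; an illustrative value (sizes and paths made up for the example) might be:

--executor-memory 4G --num-executors 4 --conf spark.yarn.jars=hdfs://$hdfs_name/spark/sparkjars/*.jar

This is also how the spark.yarn.jars setting discussed in question 2 below gets onto the command line.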

 

On YARN, application_1537326594090_5663 shows up as:

application_1537326594090_5663       hadoop oozie:launcher:T=spark:W=$workflow_name:A=$action_name:ID=0012077-180830142722522-oozie-hado-W         Oozie Launcher
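The launcher's own container log, which the excerpt below comes from, can be pulled with the standard yarn CLI (the application must have finished and YARN log aggregation must be enabled):

yarn logs -applicationId application_1537326594090_5663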

 

The log of application_1537326594090_5663 contains:

2018-09-25 10:52:05,237 [main] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl  - Submitted application application_1537326594090_5664

 

And on YARN, application_1537326594090_5664 is:

application_1537326594090_5664       hadoop    $app_name SPARK

 

So application_1537326594090_5664 is the actual Spark job for the action. Why is there an extra application in between?

In short: when Oozie executes an action, it does so through an ActionExecutor (the most important subclass is JavaActionExecutor; the hive, spark, and similar actions are all handled by subclasses of it). JavaActionExecutor first submits a LauncherMapper (a one-map MapReduce job) to YARN; that mapper runs a LauncherMain (each concrete action type has its own subclass, such as JavaMain or SparkMain). For a spark action, SparkMain runs and calls org.apache.spark.deploy.SparkSubmit to submit the real Spark job.
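So when an Oozie-submitted Spark job fails, the log you usually want is the second application's, fetched the same way:

yarn logs -applicationId application_1537326594090_5664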

 

2 How do I add dependencies to a Spark job submitted through Oozie?

The usual ways to add dependencies to a Spark job are:

if it runs in local mode, pass them with --jars;

if it runs in yarn mode, add them via spark.yarn.jars.
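For reference, outside of Oozie the yarn-mode variant is normally passed on the spark-submit command line, along these lines (every name and path below is a placeholder):

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.jars=hdfs://$hdfs_name/spark/sparkjars/*.jar \
    --class $jarclass \
    $jarpath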

Neither approach works under Oozie: first, on Oozie you cannot (and should not) run in local mode; second, if you configure spark.yarn.jars you will find it simply never takes effect. Let's see why.

Look at the LauncherMapper's log (see question 1 above for how to find it):

 

Spark Version 2.1.1

Spark Action Main class        : org.apache.spark.deploy.SparkSubmit

 

Oozie Spark action configuration

=================================================================

...

                    --conf

                    spark.yarn.jars=hdfs://$hdfs_name/spark/sparkjars/*.jar

                    --conf

                    spark.yarn.jars=hdfs://$hdfs_name/oozie/share/lib/lib_20180801121138/spark/spark-yarn_2.11-2.1.1.jar

 

Notice that Oozie appends its own spark.yarn.jars entry. What does Spark do when the same key is supplied twice? Follow the call path through the Spark source:

 

org.apache.spark.deploy.SparkSubmit

    val appArgs = new SparkSubmitArguments(args)

 

org.apache.spark.launcher.SparkSubmitOptionParser

        if (!handle(name, value)) {

 

org.apache.spark.deploy.SparkSubmitArguments

  override protected def handle(opt: String, value: String): Boolean = {

  ...

      case CONF =>

        value.split("=", 2).toSeq match {

          case Seq(k, v) => sparkProperties(k) = v

          case _ => SparkSubmit.printErrorAndExit(s"Spark config without '=': $value")

        }
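The assignment sparkProperties(k) = v writes into a mutable map, so a repeated key is silently overwritten. A minimal standalone sketch of the same parsing logic (the two jar paths are the values from the LauncherMapper log above):

import scala.collection.mutable

object DuplicateConfDemo {
  def main(args: Array[String]): Unit = {
    val sparkProperties = mutable.HashMap[String, String]()
    // the two --conf values from the LauncherMapper log, in submission order
    val confs = Seq(
      "spark.yarn.jars=hdfs://$hdfs_name/spark/sparkjars/*.jar",
      "spark.yarn.jars=hdfs://$hdfs_name/oozie/share/lib/lib_20180801121138/spark/spark-yarn_2.11-2.1.1.jar")
    confs.foreach { value =>
      value.split("=", 2) match {
        case Array(k, v) => sparkProperties(k) = v // mirrors SparkSubmitArguments: the later value overwrites
        case _           => sys.error(s"Spark config without '=': $value")
      }
    }
    // prints the Oozie-injected path, not the application's own
    println(sparkProperties("spark.yarn.jars"))
  }
}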

 

As the code shows, the value is simply overwritten: the last occurrence wins, which is Oozie's entry rather than the one the application supplied. The application therefore has to bundle its special dependencies into its own jar, e.g. with Maven's maven-assembly-plugin, listing them under <dependencySets><dependencySet><includes><include>. The full descriptor:

 

<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"

          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

          xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">

    <!-- TODO: a jarjar format would be better -->

    <id>jar-with-dependencies</id>

    <formats>

        <format>jar</format>

    </formats>

    <includeBaseDirectory>false</includeBaseDirectory>

    <dependencySets>

        <dependencySet>

            <outputDirectory>/</outputDirectory>

            <useProjectArtifact>true</useProjectArtifact>

            <unpack>true</unpack>

            <scope>runtime</scope>

            <includes>

                <include>redis.clients:jedis</include>

                <include>org.apache.commons:commons-pool2</include>

            </includes>

        </dependencySet>

    </dependencySets>

</assembly>

 

This is just the stock jar-with-dependencies.xml descriptor copied out, with the <includes> section added.
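To wire the descriptor into the build, reference it from maven-assembly-plugin in the pom (a typical setup; the descriptor path src/main/assembly/assembly.xml is an assumption about the project layout):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <descriptors>
            <!-- the descriptor shown above; adjust the path to your layout -->
            <descriptor>src/main/assembly/assembly.xml</descriptor>
        </descriptors>
    </configuration>
    <executions>
        <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Running mvn package then produces an extra ${artifactId}-${version}-jar-with-dependencies.jar containing the listed dependencies, which is the jar to point ${jarpath} at.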

 
