Hadoop in the Big Data Era: Dissecting the Hadoop Scripts
"Before the troops move, the provisions go first." To understand Hadoop in depth, I believe the scripts that start and stop it are the first thing you need to study. After all, Hadoop is a distributed storage and computation framework, so how does this distributed environment actually get started and managed? Let me take you through it, beginning with the scripts. Honestly, the Hadoop startup scripts are very well written; they are thorough about corner cases such as spaces in paths and symbolic links.
1. Overview of the Hadoop scripts
Hadoop's scripts live under $HADOOP_HOME, in the bin and conf directories. The main ones are:
Under bin:
hadoop - the core, lowest-level script; every distributed daemon is ultimately launched through it.
hadoop-config.sh - sourced by nearly every other script; its job is to parse the optional command-line arguments (--config, the path to the Hadoop conf directory, and --hosts).
hadoop-daemon.sh - starts or stops, on the local machine, the daemon named by its command argument, by calling the hadoop script.
hadoop-daemons.sh - starts a given Hadoop daemon on all machines, by calling slaves.sh.
slaves.sh - runs a given command on all the slave machines (over passwordless ssh), for use by the higher-level scripts.
start-dfs.sh - starts the namenode on the local machine, the datanodes on the slaves hosts, and the secondarynamenode on the masters hosts, by calling hadoop-daemon.sh and hadoop-daemons.sh.
start-mapred.sh - starts the jobtracker on the local machine and the tasktrackers on the slaves hosts, by calling hadoop-daemon.sh and hadoop-daemons.sh.
start-all.sh - starts every Hadoop daemon, by calling start-dfs.sh and start-mapred.sh.
start-balancer.sh - starts the balancer daemon, which rebalances storage load across the nodes of the cluster.
There are also the matching stop scripts, which need no separate explanation.
Under conf:
hadoop-env.sh - sets the environment variables Hadoop needs at run time, such as JAVA_HOME, HADOOP_LOG_DIR, and HADOOP_PID_DIR.
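A minimal hadoop-env.sh might look like the following sketch; every path here is an illustrative placeholder, not a value from the original article:

# minimal hadoop-env.sh sketch; all paths are illustrative placeholders
export JAVA_HOME=/usr/lib/jvm/java-6-sun   # required: location of the JDK
export HADOOP_HEAPSIZE=1000                # daemon heap size, in MB
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs  # where daemon logs are written
export HADOOP_PID_DIR=/var/hadoop/pids     # where daemon pid files are written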
2. The magic of the scripts (a detailed look)
2.1 hadoop-config.sh
This script does three things:
1. Resolve symlinks and convert to an absolute path

# resolve symlinks
this="$0"
while [ -h "$this" ]; do
  ls=`ls -ld "$this"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    this="$link"
  else
    this=`dirname "$this"`/"$link"
  fi
done

# convert relative path to absolute path
bin=`dirname "$this"`
script=`basename "$this"`
bin=`cd "$bin"; pwd`
this="$bin/$script"

# the root of the Hadoop installation
export HADOOP_HOME=`dirname "$this"`/..
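To see what the two expr calls are doing, here is a small standalone demo (the paths are invented for illustration):

# fake `ls -ld` output for a symlinked hadoop script
ls_out='lrwxrwxrwx 1 root root 22 Jan  1 00:00 /usr/local/bin/hadoop -> /opt/hadoop/bin/hadoop'

# extract the link target: prints /opt/hadoop/bin/hadoop
expr "$ls_out" : '.*-> \(.*\)$'

# test whether the target contains a slash (i.e. is not a bare file name);
# expr exits 0 on a non-empty match, which is all the script cares about
expr "/opt/hadoop/bin/hadoop" : '.*/.*' > /dev/null && echo "target contains a path"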
2. Parse and assign the optional --config command-line argument
#check to see if the conf dir is given as an optional argument
if [ $# -gt 1 ]
then
    if [ "--config" = "$1" ]
    then
        shift
        confdir=$1
        shift
        HADOOP_CONF_DIR=$confdir
    fi
fi
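For example, to point the whole script family at an alternate configuration directory (an illustrative invocation; the path is made up):

bin/hadoop --config /etc/hadoop-alt/conf fs -ls /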
3. Parse and assign the optional --hosts command-line argument
#check to see it is specified whether to use the slaves or the
# masters file
if [ $# -gt 1 ]
then
    if [ "--hosts" = "$1" ]
    then
        shift
        slavesfile=$1
        shift
        export HADOOP_SLAVES="${HADOOP_CONF_DIR}/$slavesfile"
    fi
fi
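start-dfs.sh (section 2.6) relies on exactly this hook to start the secondarynamenode on the hosts listed in the masters file rather than the default slaves file:

"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode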
2.2 hadoop
1. Declare usage

# if no args specified, show usage
if [ $# = 0 ]; then
  echo "Usage: hadoop [--config confdir] COMMAND"
  echo "where COMMAND is one of:"
  echo "  namenode -format     format the DFS filesystem"
  echo "  secondarynamenode    run the DFS secondary namenode"
  echo "  namenode             run the DFS namenode"
  echo "  datanode             run a DFS datanode"
  echo "  dfsadmin             run a DFS admin client"
  echo "  mradmin              run a Map-Reduce admin client"
  echo "  fsck                 run a DFS filesystem checking utility"
  echo "  fs                   run a generic filesystem user client"
  echo "  balancer             run a cluster balancing utility"
  echo "  jobtracker           run the MapReduce job Tracker node"
  echo "  pipes                run a Pipes job"
  echo "  tasktracker          run a MapReduce task Tracker node"
  echo "  job                  manipulate MapReduce jobs"
  echo "  queue                get information regarding JobQueues"
  echo "  version              print the version"
  echo "  jar <jar>            run a jar file"
  echo "  distcp <srcurl> <desturl> copy file or directories recursively"
  echo "  archive -archiveName NAME <src>* <dest> create a hadoop archive"
  echo "  daemonlog            get/set the log level for each daemon"
  echo " or"
  echo "  CLASSNAME            run the class named CLASSNAME"
  echo "Most commands print help when invoked w/o parameters."
  exit 1
fi
2. Set up the Java runtime environment
The code is simple, so I will not reproduce it here. It sets JAVA_HOME, JAVA_HEAP_MAX, CLASSPATH, HADOOP_LOG_DIR, and HADOOP_POLICYFILE. One detail worth noting: it sets the IFS environment variable (the internal field separator, whose default value is whitespace: newline, tab, or space) so that file names containing spaces are handled correctly in later loops.
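Since the author skips this code, here is a hedged sketch of what the section looks like in a 0.20-era hadoop script (reconstructed from memory; treat the exact lines as approximate):

# sketch of the omitted environment setup (0.20-era, approximate)
if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

# HADOOP_HEAPSIZE (in MB) overrides the 1000 MB default heap
if [ "$HADOOP_HEAPSIZE" != "" ]; then
  JAVA_HEAP_MAX="-Xmx""$HADOOP_HEAPSIZE""m"
fi

# CLASSPATH starts with the conf dir, then tools.jar and the hadoop jars
CLASSPATH="${HADOOP_CONF_DIR}"
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
for f in $HADOOP_HOME/hadoop-*-core.jar; do
  CLASSPATH=${CLASSPATH}:$f
done

# so that filenames with spaces are handled correctly in the loops below
IFS=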
3. Choose the class to run based on the command
# figure out which class to run
if [ "$COMMAND" = "namenode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.namenode.NameNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_NAMENODE_OPTS"
elif [ "$COMMAND" = "secondarynamenode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_SECONDARYNAMENODE_OPTS"
elif [ "$COMMAND" = "datanode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
elif [ "$COMMAND" = "fs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfsadmin" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSAdmin
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "mradmin" ] ; then
  CLASS=org.apache.hadoop.mapred.tools.MRAdmin
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "fsck" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSck
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "balancer" ] ; then
  CLASS=org.apache.hadoop.hdfs.server.balancer.Balancer
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_BALANCER_OPTS"
elif [ "$COMMAND" = "jobtracker" ] ; then
  CLASS=org.apache.hadoop.mapred.JobTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOBTRACKER_OPTS"
elif [ "$COMMAND" = "tasktracker" ] ; then
  CLASS=org.apache.hadoop.mapred.TaskTracker
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_TASKTRACKER_OPTS"
elif [ "$COMMAND" = "job" ] ; then
  CLASS=org.apache.hadoop.mapred.JobClient
elif [ "$COMMAND" = "queue" ] ; then
  CLASS=org.apache.hadoop.mapred.JobQueueClient
elif [ "$COMMAND" = "pipes" ] ; then
  CLASS=org.apache.hadoop.mapred.pipes.Submitter
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "version" ] ; then
  CLASS=org.apache.hadoop.util.VersionInfo
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "jar" ] ; then
  CLASS=org.apache.hadoop.util.RunJar
elif [ "$COMMAND" = "distcp" ] ; then
  CLASS=org.apache.hadoop.tools.DistCp
  CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "daemonlog" ] ; then
  CLASS=org.apache.hadoop.log.LogLevel
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "archive" ] ; then
  CLASS=org.apache.hadoop.tools.HadoopArchives
  CLASSPATH=${CLASSPATH}:${TOOL_PATH}
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "sampler" ] ; then
  CLASS=org.apache.hadoop.mapred.lib.InputSampler
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
else
  CLASS=$COMMAND
fi
4. Set up the native libraries
# setup 'java.library.path' for native-hadoop code if necessary
JAVA_LIBRARY_PATH=''
if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" ]; then
  # determine the current platform by running a Java class -- a nice trick
  JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`
  if [ -d "$HADOOP_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi
  if [ -d "${HADOOP_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
fi
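On a 64-bit Linux machine, PlatformName typically prints something like Linux-amd64-64 (with any spaces replaced by underscores via the sed), so the search path resolves to, for example (an illustrative value):

JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/Linux-amd64-64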
5. Launch the program
# run it
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"
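To make the exec concrete: a client command such as bin/hadoop fs -ls / ends up exec'ing roughly the following (the JDK path is an illustrative placeholder, and HADOOP_OPTS is whatever was accumulated above):

exec /usr/java/jdk1.6/bin/java -Xmx1000m $HADOOP_OPTS \
  -classpath "$CLASSPATH" org.apache.hadoop.fs.FsShell -ls /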
2.3 hadoop-daemon.sh
Starts or stops, on the local machine, the daemon named by the command argument, by calling the hadoop script. It is actually quite simple.
1. Declare usage
usage="Usage: hadoop-daemon.sh [--config <conf-dir>] [--hosts hostlistfile] (start|stop) <hadoop-command> <args...>" # if no args specified, show usage if [ $# -le 1 ]; then echo $usage exit 1 fi
2. Set up environment variables
It first sources the hadoop-env.sh script, then sets HADOOP_PID_DIR and a few other environment variables.
3. Start or stop the daemon
case $startStop in

  (start)

    mkdir -p "$HADOOP_PID_DIR"

    if [ -f $pid ]; then
      # if the daemon is already running, complain and exit
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo $command running as process `cat $pid`. Stop it first.
        exit 1
      fi
    fi

    if [ "$HADOOP_MASTER" != "" ]; then
      echo rsync from $HADOOP_MASTER
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
    fi

    # rotate any existing log
    hadoop_rotate_log $log
    echo starting $command, logging to $log
    cd "$HADOOP_HOME"
    # launch the daemon via nohup and the bin/hadoop script
    nohup nice -n $HADOOP_NICENESS "$HADOOP_HOME"/bin/hadoop --config $HADOOP_CONF_DIR $command "$@" > "$log" 2>&1 < /dev/null &
    # capture the pid of the new process and write it to the pid file
    echo $! > $pid
    sleep 1; head "$log"
    ;;

  (stop)

    if [ -f $pid ]; then
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo stopping $command
        kill `cat $pid`
      else
        echo no $command to stop
      fi
    else
      echo no $command to stop
    fi
    ;;

  (*)
    echo $usage
    exit 1
    ;;

esac
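The hadoop_rotate_log helper called above is defined near the top of the same script. In 0.20-era Hadoop it reads roughly as follows (a sketch; minor details may differ by version):

# keep up to $num rotated logs: log -> log.1, log.1 -> log.2, and so on
hadoop_rotate_log ()
{
    log=$1
    num=5
    if [ -n "$2" ]; then
        num=$2
    fi
    if [ -f "$log" ]; then # rotate logs
        while [ $num -gt 1 ]; do
            prev=`expr $num - 1`
            [ -f "$log.$prev" ] && mv "$log.$prev" "$log.$num"
            num=$prev
        done
        mv "$log" "$log.$num"
    fi
}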
2.4 slaves.sh
Runs a given command on all the slave machines (over passwordless ssh), for use by the higher-level scripts.
1. Declare usage
usage="Usage: slaves.sh [--config confdir] command..." # if no args specified, show usage if [ $# -le 0 ]; then echo $usage exit 1 fi
2. Build the list of remote hosts
# If the slaves file is specified in the command line,
# then it takes precedence over the definition in
# hadoop-env.sh. Save it here.
HOSTLIST=$HADOOP_SLAVES

if [ -f "${HADOOP_CONF_DIR}/hadoop-env.sh" ]; then
  . "${HADOOP_CONF_DIR}/hadoop-env.sh"
fi

if [ "$HOSTLIST" = "" ]; then
  if [ "$HADOOP_SLAVES" = "" ]; then
    export HOSTLIST="${HADOOP_CONF_DIR}/slaves"
  else
    export HOSTLIST="${HADOOP_SLAVES}"
  fi
fi
3. Run the command on each remote host
# This is the interesting part, and there is real craft in it: the sed strips
# comments and blank lines from the host list; the ${@// /\\ } substitution
# escapes spaces in the command before it is run on each target host over ssh;
# the final wait blocks until the command has finished on every host.
for slave in `cat "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do
  ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" \
    2>&1 | sed "s/^/$slave: /" &
  if [ "$HADOOP_SLAVE_SLEEP" != "" ]; then
    sleep $HADOOP_SLAVE_SLEEP
  fi
done

wait
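A quick way to see slaves.sh in action (an illustrative invocation; it assumes passwordless ssh to every host in conf/slaves):

# run `uptime` on every slave; each output line is prefixed with "host: "
bin/slaves.sh --config $HADOOP_CONF_DIR uptime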
2.5 hadoop-daemons.sh
1. Declare usage
# Run a Hadoop command on all slave hosts.

usage="Usage: hadoop-daemons.sh [--config confdir] [--hosts hostlistfile] [start|stop] command args..."

# if no args specified, show usage
if [ $# -le 1 ]; then
  echo $usage
  exit 1
fi
2. Invoke the command on the remote hosts
# implemented via slaves.sh
exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_HOME" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$@"
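The `\;` is escaped so that it survives local parsing and reaches the remote shell as a command separator. As a result, for `start datanode` every slave effectively executes the following (with the variables already expanded on the calling side):

cd "$HADOOP_HOME" ; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR start datanode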
2.6 start-dfs.sh
Starts the namenode on the local machine (the one invoking this script), the datanodes on the slaves hosts, and the secondarynamenode on the masters hosts, by calling hadoop-daemon.sh and hadoop-daemons.sh.
1. Declare usage
# Start hadoop dfs daemons.
# Optinally upgrade or rollback dfs state.
# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback]"
2. Start the daemons
# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
# start the namenode on the local machine (the one invoking this script)
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
# start the datanodes on the slaves hosts
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
# start the secondarynamenode on the masters hosts
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
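The $nameStartOpt and $dataStartOpt variables used above are filled in earlier in the script by parsing the optional -upgrade/-rollback argument, roughly as follows (0.20-era source; a sketch):

# get arguments
if [ $# -ge 1 ]; then
    nameStartOpt=$1
    shift
    case $nameStartOpt in
      (-upgrade)
        ;;
      (-rollback)
        dataStartOpt=$nameStartOpt
        ;;
      (*)
        echo $usage
        exit 1
        ;;
    esac
fi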
2.7 start-mapred.sh
# start mapred daemons
# start jobtracker first to minimize connection errors at startup
# start the jobtracker on the local machine (the one invoking this script)
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
# start the tasktrackers on the slaves hosts
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker
The remaining scripts are all very simple and need no detailed walkthrough; a quick read is enough to follow them.
One last thing: the shell interpreter declaration used in the Hadoop scripts.
The shebang line #!/usr/bin/env bash makes the scripts portable across Linux systems: env looks up bash on the PATH, wherever it happens to be installed, and uses it to interpret the script. A handy trick.