Big Data: From Beginner to XX
Having taken a quick look at Apache's open-source projects, the next step is to see Hadoop itself. Go straight to the official Hadoop website — that is the authoritative channel for learning Hadoop. The following is quoted from the site:
What Is Apache Hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Hadoop's description of each module is concise and to the point; no elaboration is needed. At the moment the Hadoop project has two release lines under active development, 2.6.x and 2.7.x. The latest release is 2.7.2, which can be downloaded from the official releases page.
Click "Learn about" on the official home page to reach the documentation, or open .\hadoop-2.7.2\share\doc\hadoop\index.html from the binary distribution — the same docs ship inside it.
Hadoop can be deployed in three modes: Local (Standalone) Mode, Pseudo-Distributed Mode, and Fully-Distributed Mode. Let's start with the simplest, standalone mode, which has three characteristics: it runs against the local filesystem, it runs in a single Java process, and it is convenient for debugging programs.
Hadoop supports two families of operating systems, GNU/Linux and Windows. I won't explore the Windows platform here; I am using Red Hat Enterprise Linux 6.3. For a virtual machine used for study, I generally select every installable package, so as not to create trouble for myself later on.
Installing Hadoop 2.7.2 in standalone mode:
1. Confirm the operating system version
[root@localhost ~]# lsb_release -a
LSB Version:    :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 6.3 (Santiago)
Release:        6.3
Codename:       Santiago
[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
2. Determine the required Java version and download the JDK
Hadoop 2.7.2 requires Java 1.7 or later. The current available JDK release is 1.8.0_92, which can be downloaded from Oracle's site.
3. Check for JDKs already installed on the system and remove them
[root@localhost ~]# rpm -qa | grep 'openjdk'
java-1.6.0-openjdk-javadoc-1.6.0.0-1.45.1.11.1.el6.x86_64
java-1.6.0-openjdk-devel-1.6.0.0-1.45.1.11.1.el6.x86_64
java-1.6.0-openjdk-1.6.0.0-1.45.1.11.1.el6.x86_64
[root@localhost ~]# rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.45.1.11.1.el6.x86_64
[root@localhost ~]# rpm -e --nodeps java-1.6.0-openjdk-devel-1.6.0.0-1.45.1.11.1.el6.x86_64
[root@localhost ~]# rpm -e --nodeps java-1.6.0-openjdk-javadoc-1.6.0.0-1.45.1.11.1.el6.x86_64
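To confirm nothing was left behind, re-running the query should come back empty — a sanity check added here, not part of the original write-up:
[root@localhost ~]# rpm -qa | grep 'openjdk'
[root@localhost ~]#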
4. Install JDK 1.8.0_92
[root@localhost local]# rpm -ivh jdk-8u92-linux-x64.rpm
Preparing...                ########################################### [100%]
   1:jdk1.8.0_92            ########################################### [100%]
Unpacking JAR files...
        tools.jar...
        plugin.jar...
        javaws.jar...
        deploy.jar...
        rt.jar...
        jsse.jar...
        charsets.jar...
        localedata.jar...
5. Edit /etc/profile, appending the following lines at the end of the file
[root@localhost etc]# vi /etc/profile
JAVA_HOME=/usr/java/jdk1.8.0_92
JRE_HOME=/usr/java/jdk1.8.0_92/jre
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME PATH CLASSPATH
6. Make the updated environment variables take effect
[root@localhost etc]# source /etc/profile
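At this point java -version should report the new JDK. The lines below are what JDK 8u92 typically prints; the exact build numbers on your system may differ:
[root@localhost etc]# java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)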
7. Add a hadoop group, create a hadoop user in it, and set the hadoop user's password
[root@localhost ~]# groupadd hadoop
[root@localhost ~]# useradd -m -g hadoop hadoop
[root@localhost ~]# passwd hadoop
8. Unpack the Hadoop tarball into the hadoop user's home directory
[hadoop@localhost ~]$ tar zxvf hadoop-2.7.2.tar.gz
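The unpacked tree should follow the standard 2.7.2 binary layout; a quick listing from a typical extraction (your view may differ slightly):
[hadoop@localhost ~]$ ls hadoop-2.7.2
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share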
9. Set environment variables for the hadoop user by appending the following two lines to the end of .bash_profile
[hadoop@localhost ~]$ vi .bash_profile
export HADOOP_COMMON_HOME=~/hadoop-2.7.2
export PATH=$PATH:~/hadoop-2.7.2/bin:~/hadoop-2.7.2/sbin
10. Make the environment variable changes take effect
[hadoop@localhost ~]$ source .bash_profile
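A quick way to verify the PATH change took effect is hadoop version; on this release the first line should read Hadoop 2.7.2 (the remaining build details are elided here):
[hadoop@localhost ~]$ hadoop version
Hadoop 2.7.2
......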
11. Run a test job that greps the files under the input directory and counts how many times each string matching a regular expression occurs
[hadoop@localhost ~]$ mkdir input
[hadoop@localhost ~]$ cp ./hadoop-2.7.2/etc/hadoop/*.xml input
[hadoop@localhost ~]$ hadoop jar ./hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'der[a-z.]+'
16/03/11 11:11:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/11 11:11:40 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/03/11 11:11:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/03/11 11:11:40 INFO input.FileInputFormat: Total input paths to process : 8
16/03/11 11:11:40 INFO mapreduce.JobSubmitter: number of splits:8
......
......
16/03/11 11:14:35 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=1159876
                FILE: Number of bytes written=2227372
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=8
                Map output records=8
                Map output bytes=228
                Map output materialized bytes=250
                Input split bytes=116
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=250
                Reduce input records=8
                Reduce output records=8
                Spilled Records=16
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=59
                Total committed heap usage (bytes)=265175040
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=390
        File Output Format Counters
                Bytes Written=192
12. View the results, which show the expected matches
[hadoop@localhost ~]$ cat output/*
2       der.
1       der.zookeeper.path
1       der.zookeeper.kerberos.principal
1       der.zookeeper.kerberos.keytab
1       der.zookeeper.connection.string
1       der.zookeeper.auth.type
1       der.uri
1       der.password
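The job writes its results to the usual MapReduce output files, with an empty _SUCCESS marker alongside them indicating the job completed; the file names below are the MapReduce defaults, shown for illustration:
[hadoop@localhost ~]$ ls output
part-r-00000  _SUCCESS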
13. To run the test again, remove the output directory first (or direct the job to a fresh directory, as sketched after the command below)
[hadoop@localhost ~]$ rm -rf output/
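Re-running against an existing output directory makes the job fail fast with an "output directory already exists" error. An alternative to deleting, sketched here as a variation on the original command, is to write each run to a fresh directory such as output2:
[hadoop@localhost ~]$ hadoop jar ./hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output2 'der[a-z.]+'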
This post originally appeared on the "沈进群" blog. Please do not repost.