Big Data: From Beginner to XX

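    After a brief survey of Apache's open-source projects, the next step is a first look at Hadoop itself. The official Hadoop website is the authoritative place to learn Hadoop; the following is quoted from it: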

    What Is Apache Hadoop?

        The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

    The project includes these modules:

        Hadoop Common: The common utilities that support the other Hadoop modules.

        Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

        Hadoop YARN: A framework for job scheduling and cluster resource management.

        Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

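    Hadoop's description of each module is concise and to the point, so nothing needs adding. Two release lines, 2.6.x and 2.7.x, are currently developed in parallel. The latest release at the time of writing is 2.7.2, available from the official download page.

    Click "Learn about" on the website's home page to open the documentation. The same documentation ships in the binary distribution as .\hadoop-2.7.2\share\doc\hadoop\index.html.

    Hadoop can be deployed in three modes: Local (Standalone) Mode, Pseudo-Distributed Mode, and Fully-Distributed Mode. We start with the simplest, standalone mode, which has three characteristics: it runs on the local file system, it runs in a single Java process, and it is convenient for debugging.

    Hadoop supports two families of operating systems, GNU/Linux and Windows. Windows is not explored here; I use Red Hat Enterprise Linux 6.3. For a virtual machine used for learning, I generally select every installable package so that missing software does not get in the way.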

    Installing Hadoop 2.7.2 in standalone mode:

    1. Confirm the operating system version

[root@localhost ~]# lsb_release -a
LSB Version:    :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 6.3 (Santiago)
Release:        6.3
Codename:       Santiago
[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

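    2. Determine the required Java version and download the JDK

    Hadoop 2.7.2 requires Java 1.7 or later. The JDK release available at the time of writing is 1.8.0_92; download its Linux x64 RPM package (jdk-8u92-linux-x64.rpm).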
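
The "1.7 or later" requirement can be checked mechanically. The following is a minimal sketch, not part of Hadoop or the JDK: `java_version_ok` is a hypothetical helper name, and it only tests the lower bound stated above against a version string such as `1.8.0_92`.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a Java version string meets the
# "1.7 or later" lower bound. It does not check any upper bound.
java_version_ok() {
    version=$1
    major=${version%%.*}       # text before the first dot, e.g. "1"
    rest=${version#*.}         # text after the first dot, e.g. "8.0_92"
    minor=${rest%%[!0-9]*}     # leading digits of the remainder, e.g. "8"
    [ -n "$minor" ] || minor=0
    if [ "$major" -gt 1 ]; then
        return 0               # newer numbering scheme such as 11.0.2
    fi
    [ "$major" -eq 1 ] && [ "$minor" -ge 7 ]
}
```

One possible way to feed it the installed version is `java_version_ok "$(java -version 2>&1 | awk -F'"' 'NR==1 {print $2}')"`, which assumes the first line of `java -version` contains the version in double quotes.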

    3. Check for JDK packages already installed on Linux and remove them

[root@localhost ~]# rpm -qa | grep 'openjdk'
java-1.6.0-openjdk-javadoc-1.6.0.0-1.45.1.11.1.el6.x86_64
java-1.6.0-openjdk-devel-1.6.0.0-1.45.1.11.1.el6.x86_64
java-1.6.0-openjdk-1.6.0.0-1.45.1.11.1.el6.x86_64

[root@localhost ~]# rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.45.1.11.1.el6.x86_64
[root@localhost ~]# rpm -e --nodeps java-1.6.0-openjdk-devel-1.6.0.0-1.45.1.11.1.el6.x86_64
[root@localhost ~]# rpm -e --nodeps java-1.6.0-openjdk-javadoc-1.6.0.0-1.45.1.11.1.el6.x86_64

    4. Install JDK 1.8.0_92

[root@localhost local]# rpm -ivh jdk-8u92-linux-x64.rpm
Preparing...                ########################################### [100%]
   1:jdk1.8.0_92            ########################################### [100%]
Unpacking JAR files...
        tools.jar...
        plugin.jar...
        javaws.jar...
        deploy.jar...
        rt.jar...
        jsse.jar...
        charsets.jar...
        localedata.jar...

    5. Edit /etc/profile and append the following lines to the end of the file

[root@localhost etc]# vi /etc/profile


JAVA_HOME=/usr/java/jdk1.8.0_92
JRE_HOME=/usr/java/jdk1.8.0_92/jre
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib

export JAVA_HOME JRE_HOME PATH CLASSPATH

6. Reload the profile so that the modified environment variables take effect

[root@localhost etc]# source /etc/profile
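
After reloading the profile, it can be worth confirming that JAVA_HOME really points at a JDK. The sketch below is an illustration only: `check_java_home` is a hypothetical helper, and the test it applies (an executable `bin/java` under the directory) is an assumption about what "usable" means here.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory looks like a usable JAVA_HOME.
check_java_home() {
    dir=$1
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$dir/bin/java" ]; then
        echo "no executable bin/java under $dir"
        return 1
    fi
    echo "JAVA_HOME looks usable: $dir"
}
```

It would be called as `check_java_home "$JAVA_HOME"` in the shell that sourced /etc/profile.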

7. Create the hadoop group and the hadoop user, then set the hadoop user's password

[root@localhost ~]# groupadd hadoop
[root@localhost ~]# useradd -m -g hadoop hadoop
[root@localhost ~]# passwd hadoop

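8. Extract the Hadoop archive into the hadoop user's home directory

[hadoop@localhost ~]$ tar xzvf hadoop-2.7.2.tar.gz

9. As the hadoop user, set the environment variables by appending the following two lines to the end of .bash_profile

[hadoop@localhost ~]$ vi .bash_profile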


export HADOOP_COMMON_HOME=~/hadoop-2.7.2

export PATH=$PATH:~/hadoop-2.7.2/bin:~/hadoop-2.7.2/sbin

10. Reload .bash_profile so that the changes take effect

[hadoop@localhost ~]$ source .bash_profile

11. Run a test job that counts how many times each string matching a regular expression occurs in the files under the input directory

[hadoop@localhost ~]$ mkdir input
[hadoop@localhost ~]$ cp ./hadoop-2.7.2/etc/hadoop/*.xml input
[hadoop@localhost ~]$ hadoop jar ./hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'der[a-z.]+'

16/03/11 11:11:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/11 11:11:40 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/03/11 11:11:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/03/11 11:11:40 INFO input.FileInputFormat: Total input paths to process : 8
16/03/11 11:11:40 INFO mapreduce.JobSubmitter: number of splits:8
......
......
16/03/11 11:14:35 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=1159876
                FILE: Number of bytes written=2227372
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=8
                Map output records=8
                Map output bytes=228
                Map output materialized bytes=250
                Input split bytes=116
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=250
                Reduce input records=8
                Reduce output records=8
                Spilled Records=16
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=59
                Total committed heap usage (bytes)=265175040
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=390
        File Output Format Counters
                Bytes Written=192

12. View the results; the listing below is what a successful run produces

[hadoop@localhost ~]$ cat output/*
2       der.
1       der.zookeeper.path
1       der.zookeeper.kerberos.principal
1       der.zookeeper.kerberos.keytab
1       der.zookeeper.connection.string
1       der.zookeeper.auth.type
1       der.uri
1       der.password
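
To make the result easier to sanity-check, the same kind of count can be approximated with ordinary command-line tools. This is a rough sketch rather than what the Hadoop example runs internally: `count_matches` is a hypothetical helper, it relies on the `-o` and `-h` options of grep (available in GNU grep), and the ordering of entries with equal counts is an assumption chosen to resemble the listing above.

```shell
#!/bin/sh
# Hypothetical helper: count every string matching an extended regular
# expression across the files in a directory, most frequent first.
count_matches() {
    pattern=$1
    dir=$2
    grep -ohE -- "$pattern" "$dir"/* |    # print each match on its own line
        sort |                            # group identical matches together
        uniq -c |                         # prefix each match with its count
        awk '{ print $1 "\t" $2 }' |      # normalize to "count<TAB>match"
        sort -k1,1nr -k2,2r               # highest count first
}
```

For the job above, the comparable call would be `count_matches 'der[a-z.]+' input`.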

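13. Before running the test again, remove the output directory first, because the job fails if the output directory already exists

[hadoop@localhost ~]$ rm -rf output/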
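
When the test is repeated often, the cleanup and the run can be combined. The wrapper below is only a sketch under that assumption: `rerun_with_clean_output` is a hypothetical name, and it simply deletes the given directory if present before running whatever command follows.

```shell
#!/bin/sh
# Hypothetical wrapper: remove a stale output directory, then run the
# command given as the remaining arguments.
rerun_with_clean_output() {
    out_dir=$1
    shift
    if [ -e "$out_dir" ]; then
        rm -r -- "$out_dir"    # skipped when the path is empty or missing
    fi
    "$@"
}
```

A call for the job above might look like `rerun_with_clean_output output hadoop jar ./hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'der[a-z.]+'`.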


This article is from the "沈进群" blog. Reposting is not permitted.
