Dataset for Hadoop

Posted: 2014-11-12 01:55:15

[Question]

Here is my Pig code. When I try to execute it I see an error, and I am unable to debug it. Can anyone help me debug this code? Please post your answer together with the input you used and the output you got.

[Comments]:

[Answer 1]:

The problem with this dataset is the multi-character delimiter "::". In Pig, you cannot use more than one character as a delimiter. To work around this you have 3 options:

1. Use the built-in REGEX_EXTRACT_ALL function (you need to write a regex for this input)
2. Write a custom UDF
3. Replace the multi-character delimiter with a single-character delimiter (this is very simple).
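Before wiring a pattern into REGEX_EXTRACT_ALL (option 1), it helps to verify that the regex actually splits a "::"-delimited line into the expected fields. A hedged sketch outside Pig, in plain Python, using a made-up sample line in the movies.dat format:

```python
import re

# A hypothetical movies.dat line: MovieID::Title::Genres
line = "1287::Ben-Hur (1959)::Action|Adventure|Drama"

# Three non-greedy groups separated by the literal two-character
# delimiter "::" -- the same pattern REGEX_EXTRACT_ALL would need.
pattern = re.compile(r"^(.*?)::(.*?)::(.*)$")

movie_id, title, genres = pattern.match(line).groups()
print(movie_id, title, genres, sep=" | ")  # → 1287 | Ben-Hur (1959) | Action|Adventure|Drama
```

The non-greedy `(.*?)` groups matter: greedy groups would let a title containing "::"-like text swallow the wrong span.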

I downloaded the dataset from http://www.grouplens.org/datasets/movielens/ and tried option 3:

    1. Go to your input folder /home/bigdata/sample/inputs/
    2. Run these sed commands:
        >> sed 's/::/$/g' movies.dat > Testmovies.dat
        >> sed 's/::/$/g' ratings.dat > Testratings.dat
        >> sed 's/::/$/g' users.dat > Testusers.dat

This converts the multi-character delimiter '::' into the single-character delimiter '$'. I chose '$' as the delimiter because it does not occur in any of the three files.

    3. Now load the new input files (Testmovies.dat, Testratings.dat, Testusers.dat) in the Pig script using '$' as the delimiter

Modified Pig script:

    -- filtering Action and War movies
    A = LOAD 'Testmovies.dat' USING PigStorage('$') AS (MOVIEID: chararray, TITLE: chararray, GENRE: chararray);
    B = FILTER A BY ((GENRE MATCHES '.*Action.*') AND (GENRE MATCHES '.*War.*'));
    -- finding ratings of the Action and War movies
    C = LOAD 'Testratings.dat' USING PigStorage('$') AS (UserID: chararray, MovieID: chararray, Rating: int, Timestamp: chararray);
    D = JOIN B BY $0, C BY MovieID;
    -- calculating the average rating per movie
    E = GROUP D BY $0;
    F = FOREACH E GENERATE group AS mvId, AVG(D.Rating) AS avgRating;
    -- finding the maximum average rating
    G = GROUP F ALL;
    H = FOREACH G GENERATE MAX(F.$1) AS avgMax;
    -- keeping the movie(s) with the maximum average rating
    I = FILTER F BY (float)avgRating == (float)H.avgMax;
    -- filtering female users aged 20-30
    J = LOAD 'Testusers.dat' USING PigStorage('$') AS (UserID: chararray, Gender: chararray, Age: int, Occupation: chararray, Zip: chararray);
    K = FILTER J BY ((Gender == 'F') AND (Age >= 20 AND Age <= 30));
    L = FOREACH K GENERATE UserID;
    -- ratings made by the filtered female users
    M = JOIN L BY $0, C BY UserID;
    -- those users' ratings of the highest-rated Action/War movies
    N = JOIN I BY $0, M BY $2;
    -- distinct female users
    O = FOREACH N GENERATE $2 AS User;
    Q1 = DISTINCT O;
    DUMP Q1;

    Sample Output:
    (5763)
    (5785)
    (5805)
    (5808)
    (5812)
    (5825)
    (5832)
    (5852)
    (5869)
    (5878)
    (5920)
    (5955)
    (5972)
    (5974)
    (6009)
    (6036)
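As a cross-check of the join/aggregate logic outside Pig, the same pipeline can be sketched in plain Python over a handful of invented records (the values below are made up for illustration, not taken from MovieLens):

```python
from collections import defaultdict

# Toy stand-ins for the three '$'-delimited files (values invented).
movies = [("1", "War Film A", "Action|War"),
          ("2", "Comedy B", "Comedy"),
          ("3", "War Film C", "Action|War")]
ratings = [("10", "1", 5), ("11", "1", 3), ("10", "3", 4), ("11", "2", 5)]
users = [("10", "F", 25), ("11", "F", 45), ("12", "M", 22)]

# A/B: movies whose genre matches both Action and War
b = {mid for mid, _, genre in movies if "Action" in genre and "War" in genre}

# C/D/E/F: join ratings to the filtered movies, average per movie
sums = defaultdict(lambda: [0, 0])
for uid, mid, rating in ratings:
    if mid in b:
        sums[mid][0] += rating
        sums[mid][1] += 1
avg = {mid: total / n for mid, (total, n) in sums.items()}

# G/H/I: keep the movie(s) whose average equals the maximum average
top = {mid for mid, a in avg.items() if a == max(avg.values())}

# J/K/L: female users aged 20-30
fem = {uid for uid, gender, age in users if gender == "F" and 20 <= age <= 30}

# M/N/O/Q1: distinct filtered users who rated a top movie
result = sorted({uid for uid, mid, _ in ratings if uid in fem and mid in top})
print(result)  # → ['10']
```

Each comment names the Pig aliases it mirrors, so a discrepancy on real data can be localized to one step of the script.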

[Discussion]:

I tried executing the code above in Pig local mode as you suggested, but I got no output and saw the following:

    Input(s): Successfully read 2000360 records from: "/home/bigdata/movies/Testratings.dat"
    Successfully read 6040 records from: "/home/bigdata/movies/Testusers.dat"
    Successfully read 58 records from: "/home/bigdata/movies/Testratings.dat"
    Successfully read 3883 records from: "/home/bigdata/movies/Testmovies.dat"
    Output(s): Successfully stored 0 records in: "file:/tmp/temp1612815615/tmp206540224"
    Counters: Total records written: 0  Total bytes written: 0
    Spillable Memory Manager spill count: 0  Total bags proactively spilled: 0  Total records proactively spilled: 0
    2014-12-07 16:32:40,346 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 6009020 time(s).
    2014-12-07 16:32:40,348 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
    2014-12-07 16:32:40,355 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
    2014-12-07 16:32:40,359 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    2014-12-07 16:32:40,359 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
