Dataset for Hadoop
Posted 2014-11-12 01:55:15
Question: Here is my Pig code. When I try to execute it I get an error, and I am unable to debug it. Can anyone help me debug the code? Please post answers together with their input and output results.
Answer 1: The problem with this dataset is its multi-character delimiter "::". In Pig you cannot use more than one character as a field delimiter. To work around this you have three options:
1. Use the built-in REGEX_EXTRACT_ALL function (you need to write a regex for this input).
2. Write a custom UDF.
3. Replace the multi-character delimiter with a single-character delimiter (this is the simplest).
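For option 1, the idea is to load each record as a single chararray and split it with a regex that has one capture group per field. The pattern Pig's REGEX_EXTRACT_ALL would use can be checked in plain Python; the sample line below is just an illustrative MovieLens-style record:

```python
import re

# One capture group per field, with the literal '::' separator between them;
# this mirrors what REGEX_EXTRACT_ALL(line, '(.*)::(.*)::(.*)') returns in Pig.
pattern = re.compile(r'(.*)::(.*)::(.*)')

line = "1::Toy Story (1995)::Animation|Children's|Comedy"  # sample movies.dat record
fields = pattern.match(line).groups()
print(fields)  # ('1', 'Toy Story (1995)', "Animation|Children's|Comedy")
```

Backtracking makes the greedy `(.*)` groups settle on the actual `::` boundaries, so the title can safely contain parentheses or pipes.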
I downloaded the dataset from http://www.grouplens.org/datasets/movielens/ and tried option 3:
1. Go to your input folder /home/bigdata/sample/inputs/
2. Run these sed commands:
>> sed 's/::/$/g' movies.dat > Testmovies.dat
>> sed 's/::/$/g' ratings.dat > Testratings.dat
>> sed 's/::/$/g' users.dat > Testusers.dat
This converts the multi-character delimiter '::' into the single-character delimiter '$'. I chose '$' because it does not occur anywhere in the three files.
3. Load the new input files (Testmovies.dat, Testratings.dat, Testusers.dat) in the Pig script using '$' as the delimiter.
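The conversion that sed performs above can also be sketched in Python, with a safety check for the point made in step 2 (the replacement character must not already appear in the data). The sample record is a made-up users.dat-style line, not taken from the actual files:

```python
# Equivalent of `sed 's/::/$/g'`, applied line by line.
def to_single_delim(line, old='::', new='$'):
    # If the new delimiter already occurs in the data, PigStorage('$')
    # would split fields incorrectly, so refuse rather than corrupt.
    if new in line:
        raise ValueError(f"replacement delimiter {new!r} already present in: {line!r}")
    return line.replace(old, new)

converted = to_single_delim('1::F::1::10::48067')  # sample users.dat record
print(converted)  # 1$F$1$10$48067
```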
The modified Pig script:
-- filter Action and War movies
A = LOAD 'Testmovies.dat' USING PigStorage('$') AS (MOVIEID:chararray, TITLE:chararray, GENRE:chararray);
B = FILTER A BY ((GENRE matches '.*Action.*') AND (GENRE matches '.*War.*'));
-- find the ratings of Action and War movies
C = LOAD 'Testratings.dat' USING PigStorage('$') AS (UserID:chararray, MovieID:chararray, Rating:int, Timestamp:chararray);
D = JOIN B BY $0, C BY MovieID;
-- calculate the average rating per movie
E = GROUP D BY $0;
F = FOREACH E GENERATE group AS mvId, AVG(D.Rating) AS avgRating;
-- find the maximum average rating
G = GROUP F ALL;
H = FOREACH G GENERATE MAX(F.$1) AS avgMax;
-- find the movie(s) with the maximum average rating
I = FILTER F BY (float)avgRating == (float)H.avgMax;
-- filter female users aged 20-30
J = LOAD 'Testusers.dat' USING PigStorage('$') AS (UserID:chararray, Gender:chararray, Age:int, Occupation:chararray, Zip:chararray);
K = FILTER J BY ((Gender == 'F') AND (Age >= 20 AND Age <= 30));
L = FOREACH K GENERATE UserID;
-- find the movies rated by the filtered female users
M = JOIN L BY $0, C BY UserID;
-- find the filtered female users who rated the highest-rated Action and War movies
N = JOIN I BY $0, M BY $2;
-- keep distinct female users
O = FOREACH N GENERATE $2 AS User;
Q1 = DISTINCT O;
DUMP Q1;
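To make the dataflow above easier to follow, here is an in-memory Python sketch of the same logic. The records are tiny made-up samples in the converted '$'-delimited format, not real MovieLens data:

```python
from collections import defaultdict

# Hand-made samples standing in for the three converted files.
movies = [m.split('$') for m in [          # MovieID $ Title $ Genres
    '1$Saving Private Ryan (1998)$Action|Drama|War',
    '2$Toy Story (1995)$Animation|Children',
    '3$Patton (1970)$Action|War',
]]
ratings = [r.split('$') for r in [         # UserID $ MovieID $ Rating $ Timestamp
    '10$1$5$0', '11$1$4$0', '10$3$3$0', '12$3$3$0', '11$2$5$0',
]]
users = [u.split('$') for u in [           # UserID $ Gender $ Age $ Occupation $ Zip
    '10$F$25$4$48067', '11$M$25$7$55117', '12$F$45$1$02460',
]]

# B: movies whose genre list contains both Action and War
war_action = {mid for mid, title, genre in movies
              if 'Action' in genre and 'War' in genre}

# D/E/F: join ratings to those movies, group by movie, average the ratings
by_movie = defaultdict(list)
for uid, mid, rating, ts in ratings:
    if mid in war_action:
        by_movie[mid].append(int(rating))
avg = {mid: sum(v) / len(v) for mid, v in by_movie.items()}

# G/H/I: keep the movie(s) whose average equals the maximum average
best = max(avg.values())
top_movies = {mid for mid, a in avg.items() if a == best}

# J/K/L: female users aged 20-30
women = {uid for uid, g, age, occ, z in users
         if g == 'F' and 20 <= int(age) <= 30}

# M/N/O/Q1: distinct such users who rated a top movie
result = sorted({uid for uid, mid, r, ts in ratings
                 if uid in women and mid in top_movies})
print(result)  # ['10']
```

On this sample, movie 1 averages 4.5 and movie 3 averages 3.0, so only user 10 (female, 25) who rated movie 1 survives, matching the shape of the DUMP output above.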
Sample Output:
(5763)
(5785)
(5805)
(5808)
(5812)
(5825)
(5832)
(5852)
(5869)
(5878)
(5920)
(5955)
(5972)
(5974)
(6009)
(6036)
Comments:
I tried executing the above code in Pig local mode as you suggested, but I got no output and saw the warnings below.
Input(s): Successfully read 2000360 records from "/home/bigdata/movies/Testratings.dat"; successfully read 6040 records from "/home/bigdata/movies/Testusers.dat"; read 58 records from "/home/bigdata/movies/Testratings.dat"; successfully read 3883 records from "/home/bigdata/movies/Testmovies.dat"
Output(s): Successfully stored 0 records in "file:/tmp/temp1612815615/tmp206540224"
Counters: Total records written: 0; Total bytes written: 0; Spillable Memory Manager spill count: 0; Total bags proactively spilled: 0; Total records proactively spilled: 0
2014-12-07 16:32:40,346 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 6009020 time(s).
2014-12-07 16:32:40,348 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-12-07 16:32:40,355 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2014-12-07 16:32:40,359 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process: 1
2014-12-07 16:32:40,359 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process: 1