JOIN in Apache Pig


Posted: 2016-09-19 17:10:39

Question:

I have two files in two different HDFS locations, each consisting of JSON objects, and I need to join the two files on a common field.

The first file contains tweet data and has 34 fields (I literally counted them). It looks like:

"contributors": null, "truncated": false, "text": "US Bank Loans And credit card capitol one business", "avl_brand_all": ["US Bank"], "is_quote_status": false    , "in_reply_to_status_id": null, "id": 770150015968825344, "favorite_count": 0, "avl_num_sentences": 1, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</    a>", "retweeted": false, "coordinates": null, "entities": "symbols": [], "user_mentions": [], "hashtags": [], "urls": ["url": "<link>": [51, 74], "expand    ed_url": "http://usbanklogins.com/bank/", "display_url": "usbanklogins.com/bank/"], "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "avl_word_tags": ["distance": 1, "    word": "u", "pos": "OTHER", "distance": 1, "word": "bank", "pos": "NOUN", "distance": 1, "word": "loan", "pos": "NOUN", "distance": 1, "word": "credit", "pos": "NOUN", "distan    ce": 1, "word": "card", "pos": "NOUN", "distance": 1, "word": "capitol", "pos": "VERB", "distance": 1, "word": "one", "pos": "OTHER", "distance": 1, "word": "business", "pos": "    NOUN"], "avl_brand_1": "US Bank", "retweet_count": 0, "avl_lexicon_text": "us bank loans and credit card capitol one business", "id_str": "770150015968825344", "favorited": false, "a    vl_sentences": ["us bank loans and credit card capitol one business"], "user": "follow_request_sent": false, "has_extended_profile": false, "profile_use_background_image": true, "id"    : 485610502, "verified": false, "profile_text_color": "0C3E53", "profile_image_url_https": "<link>", "profile    _sidebar_fill_color": "FFF7CC", "geo_enabled": false, "entities": "url": "urls": ["url": "link", "indices": [0, 22], "expanded_url": "http://www.seowithme.com", "    display_url": "seowithme.com"], "description": "urls": [], "followers_count": 347, "profile_sidebar_border_color": "F2E195", "location": "", "default_profile_image": false, "id_s    tr": "485610502", "is_translation_enabled": false, "utc_offset": null, "statuses_count": 117, "description": 
"seowithme", "friends_count": 959, "profile_link_color": "FF0000", "profil    e_image_url": "http://pbs.twimg.com/profile_images/2334489262/qyznw08zjrgv3vlxtdvt_normal.jpeg", "notifications": false, "profile_background_image_url_https": "https://abs.twimg.com/i    mages/themes/theme12/bg.gif", "profile_background_color": "BADFCD", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme12/bg.gif", "screen_name": "sajanshrestha2    2", "lang": "en", "following": false, "profile_background_tile": false, "favourites_count": 2, "name": "sajan shrestha", "url": "<link>", "created_at": "Tue Feb 07 11:    40:39 +0000 2012", "contributors_enabled": false, "time_zone": null, "protected": false, "default_profile": false, "is_translator": false, "listed_count": 0, "avl_num_paragraphs": 1,     "geo": null, "in_reply_to_user_id_str": null, "possibly_sensitive": false, "lang": "en", "created_at": "Mon Aug 29 06:44:07 +0000 2016", "avl_source": "individual", "in_reply_to_stat    us_id_str": null, "place": null, "metadata": "iso_language_code": "en", "result_type": "recent", "avl_num_words": 8

The second file has JSON objects with only two fields each. It looks like:

"avl_syntaxnet_tags": ["pos_tag": "PRP", "position": "1", "dep_rel": "dep", "parent": "3", "word": "us", "pos_tag": "NN", "position": "2", "dep_rel": "nn", "parent": "3", "word":     "bank", "pos_tag": "NNS", "position": "3", "dep_rel": "nsubj", "parent": "7", "word": "loans", "pos_tag": "CC", "position": "4", "dep_rel": "cc", "parent": "3", "word": "and", "    pos_tag": "NN", "position": "5", "dep_rel": "nn", "parent": "6", "word": "credit", "pos_tag": "NN", "position": "6", "dep_rel": "conj", "parent": "3", "word": "card", "pos_tag": "    VBP", "position": "7", "dep_rel": "ROOT", "parent": "0", "word": "capitol", "pos_tag": "CD", "position": "8", "dep_rel": "num", "parent": "9", "word": "one", "pos_tag": "NN", "pos    ition": "9", "dep_rel": "dobj", "parent": "7", "word": "business"], "avl_lexicon_text": "us bank loans and credit card capitol one business"

Now, both JSON objects have a common field named avl_lexicon_text, and I want to join the two objects on that common field.

I wrote the following Pig script for the join:

a = LOAD 'file1' AS (a1, a2);
b = LOAD 'file2' AS (b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15, b16, b17, b18, b19, b20, b21, b22, b23, b24, b25, b26, b27, b28, b29, b30, b31, b32, b33, b34);
x = JOIN b BY b19 FULL, a BY a2;
STORE x INTO '$SYNTAXNET_OUTPUT';

I checked that b19 is the avl_lexicon_text field in b and that a2 is the same field in a. The result I get is really strange. When I dump x, I don't get new JSON objects containing all the fields from a and b; instead I get all the objects from b, followed by all the objects from a.

Can someone suggest the right way to do this?

EDIT: Also, is there a way to do this without loading a schema? If the format of either file changes at some point in the future (a new field is added or an existing one is removed), I don't want to have to change the Pig script. Is there a way to do the JOIN without referring to field positions, accessing the fields by name instead? Thanks!


Answer 1:

The behavior is expected, since you specified a FULL outer join. Remove FULL to get only the matching records. See here for details on FULL outer joins.

x = JOIN b BY b19, a BY a2;
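To see why the FULL join looked like "all of b followed by all of a", here is a tiny illustration with hypothetical two-column relations (not the actual data from the question):

-- Sketch only: a and b below are hypothetical tiny relations.
-- a: (1,x) (2,y)        b: (2,p) (3,q)

-- FULL OUTER join keeps unmatched rows from BOTH sides, padding the
-- missing side with nulls:
full_j  = JOIN a BY $0 FULL OUTER, b BY $0;
-- -> (1,x,,)  (2,y,2,p)  (,,3,q)

-- A plain (inner) JOIN keeps only rows whose keys match on both sides:
inner_j = JOIN a BY $0, b BY $0;
-- -> (2,y,2,p)

If no keys match at all, a FULL outer join degenerates into every row of one side with nulls for the other, then the reverse, which is exactly the "all of b, then all of a" output described in the question.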

Discussion:

- When I do that, there is no output at all. Success! Output(s): Successfully stored 0 records in: "/tmp/syntaxnet_output2"
- Show sample records from a and b where the join should happen.
- They are in the question. Both JSON objects should be joined on avl_lexicon_text.
- I think it is a problem with how the data is loaded. I wasn't specifying the correct schema and delimiter. I can use the elephant-bird parser, which works best for Twitter data. I'll use it and update. Thanks for your time.
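A sketch of that elephant-bird approach, which also addresses the EDIT about joining by field name instead of position. The jar names/versions and paths below are assumptions, not taken from the thread:

-- Sketch, assuming elephant-bird's Pig JSON loader and its dependencies
-- are available; jar names/versions and HDFS paths are placeholders.
REGISTER 'elephant-bird-core-4.17.jar';
REGISTER 'elephant-bird-pig-4.17.jar';
REGISTER 'elephant-bird-hadoop-compat-4.17.jar';
REGISTER 'json-simple-1.1.1.jar';

-- Each input line becomes a single map, so no 34-column positional schema
-- is needed, and added/removed JSON fields don't break the script.
tweets = LOAD '/path/to/file1' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
tags   = LOAD '/path/to/file2' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- Join by field NAME via map dereference (#) rather than by position:
joined = JOIN tweets BY (chararray)$0#'avl_lexicon_text',
              tags   BY (chararray)$0#'avl_lexicon_text';
STORE joined INTO '/tmp/joined_output';

Because each record is a map, the join key is looked up as $0#'avl_lexicon_text' regardless of where that field appears in the JSON object, which is what the EDIT asked for.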
