Create Hive table for JSON data


【Title】: Create Hive table for JSON data 【Posted】: 2016-01-10 05:09:49 【Question】:

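How can I create a Hive table for the following Twitter JSON data, which is stored at an HDFS path? I tried several CREATE TABLE queries found on the web, but ran into problems.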


{
    "extended_entities": {
        "media": [{
            "display_url": "pic.twitter.com/9SoA83sVvP",
            "indices": [100, 123],
            "sizes": {
                "small": {
                    "w": 340,
                    "h": 340,
                    "resize": "fit"
                },
                "large": {
                    "w": 480,
                    "h": 480,
                    "resize": "fit"
                },
                "thumb": {
                    "w": 150,
                    "h": 150,
                    "resize": "crop"
                },
                "medium": {
                    "w": 480,
                    "h": 480,
                    "resize": "fit"
                }
            },
            "id_str": "685710180164579329",
            "expanded_url": "http://twitter.com/add7dave/status/685710518456209408/video/1",
            "media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/685710180164579329/pu/img/4wOqavTprNIaMgjK.jpg",
            "id": 685710180164579329,
            "type": "video",
            "media_url": "http://pbs.twimg.com/ext_tw_video_thumb/685710180164579329/pu/img/4wOqavTprNIaMgjK.jpg",
            "url": "https://t.co/9SoA83sVvP",
            "video_info": {
                "aspect_ratio": [1, 1],
                "duration_millis": 7567,
                "variants": [{
                    "content_type": "application/x-mpegURL",
                    "url": "https://video.twimg.com/ext_tw_video/685710180164579329/pu/pl/6JnchC_1FWviydJV.m3u8"
                }, {
                    "content_type": "application/dash+xml",
                    "url": "https://video.twimg.com/ext_tw_video/685710180164579329/pu/pl/6JnchC_1FWviydJV.mpd"
                }, {
                    "content_type": "video/mp4",
                    "bitrate": 320000,
                    "url": "https://video.twimg.com/ext_tw_video/685710180164579329/pu/vid/240x240/W7suov-YC1Iq1-QT.mp4"
                }, {
                    "content_type": "video/webm",
                    "bitrate": 832000,
                    "url": "https://video.twimg.com/ext_tw_video/685710180164579329/pu/vid/480x480/bDG_UfEw3jBM7z4e.webm"
                }, {
                    "content_type": "video/mp4",
                    "bitrate": 832000,
                    "url": "https://video.twimg.com/ext_tw_video/685710180164579329/pu/vid/480x480/bDG_UfEw3jBM7z4e.mp4"
                }]
            }
        }]
    },
    "in_reply_to_status_id_str": null,
    "in_reply_to_status_id": null,
    "created_at": "Sat Jan 09 06:31:42 +0000 2016",
    "in_reply_to_user_id_str": null,
    "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>",
    "retweet_count": 0,
    "retweeted": false,
    "geo": null,
    "filter_level": "low",
    "in_reply_to_screen_name": null,
    "is_quote_status": false,
    "id_str": "685710518456209408",
    "in_reply_to_user_id": null,
    "favorite_count": 0,
    "id": 685710518456209408,
    "text": "New video NO-17\n#BritanniaFilmfareAwards\n@GoodDayCookies\n@BritanniaIndLtd\nAmitabh Bachchan dialogue https://t.co/9SoA83sVvP",
    "place": null,
    "lang": "en",
    "favorited": false,
    "possibly_sensitive": false,
    "coordinates": null,
    "truncated": false,
    "timestamp_ms": "1452321102142",
    "entities": {
        "urls": [],
        "hashtags": [{
            "indices": [16, 40],
            "text": "BritanniaFilmfareAwards"
        }],
        "media": [{
            "display_url": "pic.twitter.com/9SoA83sVvP",
            "indices": [100, 123],
            "sizes": {
                "small": {
                    "w": 340,
                    "h": 340,
                    "resize": "fit"
                },
                "large": {
                    "w": 480,
                    "h": 480,
                    "resize": "fit"
                },
                "thumb": {
                    "w": 150,
                    "h": 150,
                    "resize": "crop"
                },
                "medium": {
                    "w": 480,
                    "h": 480,
                    "resize": "fit"
                }
            },
            "id_str": "685710180164579329",
            "expanded_url": "http://twitter.com/add7dave/status/685710518456209408/video/1",
            "media_url_https": "https://pbs.twimg.com/ext_tw_video_thumb/685710180164579329/pu/img/4wOqavTprNIaMgjK.jpg",
            "id": 685710180164579329,
            "type": "photo",
            "media_url": "http://pbs.twimg.com/ext_tw_video_thumb/685710180164579329/pu/img/4wOqavTprNIaMgjK.jpg",
            "url": "https://t.co/9SoA83sVvP"
        }],
        "user_mentions": [{
            "indices": [41, 56],
            "screen_name": "GoodDayCookies",
            "id_str": "2197439803",
            "name": "Britannia Good Day",
            "id": 2197439803
        }, {
            "indices": [57, 73],
            "screen_name": "BritanniaIndLtd",
            "id_str": "3281245460",
            "name": "Britannia Industries",
            "id": 3281245460
        }],
        "symbols": []
    },
    "contributors": null,
    "user": {
        "utc_offset": 19800,
        "friends_count": 1517,
        "profile_image_url_https": "https://pbs.twimg.com/profile_images/593327096736256001/TT8Ds75__normal.jpg",
        "listed_count": 1,
        "profile_background_image_url": "http://abs.twimg.com/images/themes/theme19/bg.gif",
        "default_profile_image": false,
        "favourites_count": 25,
        "description": "Sharukhan, Kapil sharma , Narendra modi Fan (Supporter) be happy *↓*",
        "created_at": "Thu Sep 15 08:04:58 +0000 2011",
        "is_translator": false,
        "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme19/bg.gif",
        "protected": false,
        "screen_name": "add7dave",
        "id_str": "373836462",
        "profile_link_color": "9266CC",
        "id": 373836462,
        "geo_enabled": false,
        "profile_background_color": "FFF04D",
        "lang": "en",
        "profile_sidebar_border_color": "000000",
        "profile_text_color": "000000",
        "verified": false,
        "profile_image_url": "http://pbs.twimg.com/profile_images/593327096736256001/TT8Ds75__normal.jpg",
        "time_zone": "Chennai",
        "url": null,
        "contributors_enabled": false,
        "profile_background_tile": false,
        "profile_banner_url": "https://pbs.twimg.com/profile_banners/373836462/1428993069",
        "statuses_count": 21397,
        "follow_request_sent": null,
        "followers_count": 438,
        "profile_use_background_image": true,
        "default_profile": false,
        "following": null,
        "name": "aditya dave",
        "location": "Bhavnagar, Gujarat",
        "profile_sidebar_fill_color": "000000",
        "notifications": null
    }
}

I tried the following table definition, but it fails with an error.

hive> CREATE EXTERNAL TABLE tweets (
    id BIGINT,
    created_at STRING,
    source STRING,
    favorited BOOLEAN,
    retweeted_status STRUCT<
      text:STRING,
      user:STRUCT<screen_name:STRING,name:STRING>,
      retweet_count:INT>,
    entities STRUCT<
      urls:ARRAY<STRUCT<expanded_url:STRING>>,
      user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
      hashtags:ARRAY<STRUCT<text:STRING>>>,
    text STRING,
    user STRUCT<
      screen_name:STRING,
      name:STRING,
      friends_count:INT,
      followers_count:INT,
      statuses_count:INT,
      verified:BOOLEAN,
      utc_offset:INT,
      time_zone:STRING>,
    in_reply_to_screen_name STRING
  )
  PARTITIONED BY (datehour INT)
  ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
  LOCATION '/user/flume/tweets/01092015';

【Comments】:

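Can you post the error you are getting?

FailedPredicateException(identifier,{useSQL11ReservedKeywordsForIdentifier()}?) at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:10924) at org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:45850) FAILED: ParseException line 9:2 Failed to recognize predicate 'user'. Failed rule: 'identifier' in column specification

I think user is a keyword and you are using it as a column name, which is probably what causes the problem.

I answered this question for you last night at ***.com/questions/34696792/…, you ignored it, and now you have posted the same question again? That is rude.

I have explained how to create a table in Hive for Twitter data, follow this link

【Answer 1】: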

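You can use this utility, https://github.com/quux00/hive-json-schema, to generate a Hive schema from JSON. However, as Ben Watson points out in his answer here, if some columns (such as user) use reserved names, you must either enclose them in backticks or use a SerDe such as https://github.com/rcongiu/Hive-JSON-Serde, which can map Hive columns to JSON elements.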
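
As a rough sketch of both options, and not a tested fix for this exact cluster, the DDL from the question could be adjusted as follows. The first statement keeps the question's columns and SerDe and only quotes `user` in backticks wherever it appears as an identifier, including inside the nested struct. The second statement assumes the Hive-JSON-Serde jar is available to Hive; the table name `tweets_mapped` and the column name `tweet_user` are hypothetical, chosen only to show the `mapping.` property that the project uses to bind a Hive column to a differently named JSON attribute.

```sql
-- Option 1: quote the reserved word with backticks.
-- Same columns, SerDe, and location as in the question.
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    `user`:STRUCT<screen_name:STRING,name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  `user` STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets/01092015';

-- Queries need the same quoting when they reference the column.
SELECT `user`.screen_name, text FROM tweets LIMIT 10;

-- Option 2: avoid the reserved word by renaming the Hive column and
-- mapping it back to the JSON attribute "user" (Hive-JSON-Serde).
-- tweets_mapped and tweet_user are hypothetical names for illustration.
CREATE EXTERNAL TABLE tweets_mapped (
  id BIGINT,
  text STRING,
  tweet_user STRUCT<screen_name:STRING,name:STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.tweet_user" = "user")
LOCATION '/user/flume/tweets/01092015';
```

Backticks are the smaller change because the existing SerDe and column names stay as they are. The mapping approach may be preferable when downstream queries should not have to quote the column every time.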
