Uploading Newline-delimited JSON From a File to BigQuery

Posted: 2013-05-21 23:07:47

Question:

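I recently wrote a Python script that uploads a local, newline-delimited JSON file to a BigQuery table. It closely follows the example given in the official documentation. The problem I'm running into is that non-ASCII characters in the file I'm trying to upload are breaking my POST request.

Here is the relevant part of the script...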

def upload(dataFilePath, loadJob, recipeJSON, logger):
    body = '--xxx\n'
    body += 'Content-Type: application/json; charset=UTF-8\n\n'
    body += loadJob
    body += '\n--xxx\n' 
    body += 'Content-Type: application/octet-stream\n\n'

    dataFile = io.open(dataFilePath, 'r', encoding = 'utf-8')
    body += dataFile.read()
    dataFile.close()

    body += '\n--xxx--\n'

    credentials = buildCredentials(recipeJSON['keyPath'], recipeJSON['accountEmail'])
    http = httplib2.Http()
    http = credentials.authorize(http)
    service = build('bigquery', 'v2', http=http)

    projectId = recipeJSON['projectId']

    url = BIGQUERY_URL_BASE + projectId + "/jobs"

    headers = {'Content-Type': 'multipart/related; boundary=xxx'}
    response, content = http.request(url, method="POST", body=body, headers=headers)

...and here is the stack trace I get when I run it...

Traceback (most recent call last):
  File "/usr/local/uploader/upload_data.py", line 179, in <module>
    main(sys.argv)
  File "/usr/local/uploader/upload_data.py", line 170, in main
    if (upload(unprocessedFile, loadJob, recipeJSON, logger)):
  File "/usr/local/uploader/upload_data.py", line 100, in upload
    response, content = http.request(url, method="POST", body=body, headers=headers)
  File "/usr/local/lib/python2.7/site-packages/oauth2client/util.py", line 128, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/oauth2client/client.py", line 490, in new_request
    redirections, connection_type)
  File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1570, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1317, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1253, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/usr/local/lib/python2.7/httplib.py", line 973, in request
    self._send_request(method, url, body, headers)
  File "/usr/local/lib/python2.7/httplib.py", line 1007, in _send_request
    self.endheaders(body)
  File "/usr/local/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/usr/local/lib/python2.7/httplib.py", line 833, in _send_output
    self.send(message_body)
  File "/usr/local/lib/python2.7/httplib.py", line 805, in send
    self.sock.sendall(data)
  File "/usr/local/lib/python2.7/ssl.py", line 229, in sendall
    v = self.send(data[count:])
  File "/usr/local/lib/python2.7/ssl.py", line 198, in send
    v = self._sslobj.write(data)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4586-4611: ordinal not in range(128)

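I'm using Python 2.7 and the following libraries: distribute (0.6.36), google-api-python-client (1.1), httplib2 (0.8), oauth2client (1.1), pyOpenSSL (0.13), python-gflags (2.0), wsgiref (0.1.2)

Has anyone else run into this problem?

It looks like httplib2's request method takes "body" as a string, which means it has to be encoded later, before it is sent over the network. I've been looking for a way to override that encoding to UTF-8, but so far I've had no luck.

Thanks in advance!

Edit:

I was able to solve this by doing two things:
1.) Reading the raw contents of my file without decoding them. (I could also have encoded "body" in my first attempt above...)
2.) Encoding the url and headers as bytes.

The code ended up looking like this: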

def upload(dataFilePath, loadJob, recipeJSON, logger):
    part_one = '--xxx\n'
    part_one += 'Content-Type: application/json; charset=UTF-8\n\n'
    part_one += loadJob
    part_one += '\n--xxx\n'
    part_one += 'Content-Type: application/octet-stream\n\n'

    dataFile = io.open(dataFilePath, 'rb')
    part_two = dataFile.read()
    dataFile.close()

    part_three = '\n--xxx--\n'

    body = part_one.encode('utf-8')
    body += part_two
    body += part_three.encode('utf-8')

    credentials = buildCredentials(recipeJSON['keyPath'], recipeJSON['accountEmail'])
    http = httplib2.Http()
    http = credentials.authorize(http)
    service = build('bigquery', 'v2', http=http)

    projectId = recipeJSON['projectId']

    url = BIGQUERY_URL_BASE + projectId + "/jobs"

    headers = {'Content-Type'.encode('utf-8'): 'multipart/related; boundary=xxx'.encode('utf-8')}
    response, content = http.request(url.encode('utf-8'), method="POST", body=body, headers=headers)


Answer 1:

io.open() opens the file as Unicode text. Either use the plain open(), or use binary mode:

dataFile = io.open(dataFilePath, 'rb')

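You are sending the file contents directly over the network, so you need to send bytes, not Unicode. As you discovered, mixing Unicode and bytes leads to painful errors, because Python 2 converts implicitly between the two types with the ASCII codec whenever they are combined or when text has to be written out as bytes. There is no need to decode to Unicode here at all.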
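As a rough illustration of the bytes-only approach described above, the sketch below reads a file in binary mode and assembles the multipart body without ever decoding it. The helper names build_multipart_body and read_raw, the temporary sample file, and the placeholder load job JSON are all assumptions made for this example; only the boundary and the Content-Type lines come from the question.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def build_multipart_body(load_job_json, data_bytes, boundary=b'xxx'):
    """Assemble a multipart/related body entirely from byte strings."""
    if not isinstance(data_bytes, bytes):
        # Decoded text here is exactly what triggers the implicit ASCII codec.
        raise TypeError('data_bytes must be bytes, not decoded text')
    if not isinstance(load_job_json, bytes):
        load_job_json = load_job_json.encode('utf-8')
    parts = [
        b'--' + boundary + b'\n',
        b'Content-Type: application/json; charset=UTF-8\n\n',
        load_job_json,
        b'\n--' + boundary + b'\n',
        b'Content-Type: application/octet-stream\n\n',
        data_bytes,
        b'\n--' + boundary + b'--\n',
    ]
    return b''.join(parts)


def read_raw(path):
    """Read a file in binary mode so its contents stay as bytes."""
    with io.open(path, 'rb') as data_file:
        return data_file.read()


# Hypothetical sample data containing a non-ASCII character.
sample = u'{"name": "caf\u00e9"}\n'.encode('utf-8')
fd, path = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'wb') as handle:
    handle.write(sample)
try:
    # The load job JSON is a placeholder, not a real job configuration.
    body = build_multipart_body(u'{"configuration": {}}', read_raw(path))
finally:
    os.remove(path)
```

Joining a list of byte strings keeps every piece in one type, so nothing is left for Python to convert implicitly before the body reaches the socket.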

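Comments:

Yes, ultimately it makes no sense to have to decode the file's contents. I tried reading just the binary data into "body", but that produced: ... _send_request self.endheaders(body) File "/usr/local/lib/python2.7/httplib.py", line 969, in endheaders self._send_output(message_body) File "/usr/local/lib/python2.7/httplib.py", line 827, in _send_output msg += message_body UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 4586: ordinal not in range(128)

Make sure everything is bytes. For example, since you use JSON data to build the credentials, that most likely means credentials is Unicode too. The same applies to url.

Thanks, @Martijn. I ended up converting the headers and url to bytes and it worked. It looks like the credentials object of type SignedJwtAssertionCredentials wasn't causing a problem.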
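The "make sure everything is bytes" advice from the comments could be sketched roughly as follows for the Python 2.7 setup in the question. The helpers to_bytes and bytes_headers and the example URL are assumptions made for illustration; they are not part of httplib2 or of the original script, and text/bytes handling differs on Python 3.

```python
# -*- coding: utf-8 -*-


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding it only if it is text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def bytes_headers(headers):
    """Return a copy of headers with every key and value as bytes."""
    return dict((to_bytes(k), to_bytes(v)) for k, v in headers.items())


# Hypothetical values standing in for BIGQUERY_URL_BASE and projectId.
url = to_bytes(u'https://example.invalid/projects/' + u'my-project' + u'/jobs')
headers = bytes_headers({u'Content-Type': u'multipart/related; boundary=xxx'})
```

Values that are already bytes pass through unchanged, so the helper can be applied to everything handed to http.request without encoding anything twice.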
