Converting from BigQuery to MySQL, variable array size

Posted 2023-03-25

【Title】: Converting from BigQuery to MySQL, variable array size 【Posted】: 2021-04-20 07:47:53 【Question】:

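I am translating the BigQuery code below to MySQL. The problem is GENERATE_ARRAY, which produces a variable-length array that MySQL does not accept. I think I need to build a virtual table with WITH array AS ( ... ), but that does not seem easy because the array size is not fixed. Is there a workaround?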
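
To make the difficulty concrete: GENERATE_ARRAY(a, b) yields the integers a through b inclusive, and CROSS JOIN UNNEST turns each array element into its own row, so every stay expands into a different number of rows. A minimal Python sketch of that expansion follows. The helper names and the stay values are made up for illustration and are not taken from the real table.

```python
def generate_array(start, stop):
    """Inclusive integer range, like BigQuery's GENERATE_ARRAY(start, stop)."""
    return list(range(start, stop + 1))


def unnest(rows):
    """Emit one (stay_id, hr) pair per array element, like CROSS JOIN UNNEST."""
    return [(stay_id, hr) for stay_id, hrs in rows for hr in hrs]


# Hypothetical stays as (stay_id, last hour offset); values are illustrative only.
stays = [(1, 2), (2, 0)]
all_hours = [(stay_id, generate_array(-24, last_hr)) for stay_id, last_hr in stays]
expanded = unnest(all_hours)
print(len(expanded))  # 52: 27 rows for stay 1 plus 25 rows for stay 2
```

The row count per stay depends on the data, which is why a fixed-size helper table does not fit directly.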

The table used looks like this→

-- This query generates a row for every hour the patient is in the ICU.
-- The hours are based on clock-hours (i.e. 02:00, 03:00).
-- The hour clock starts 24 hours before the first heart rate measurement.
-- Note that the time of the first heart rate measurement is ceilinged to the hour.

-- this query extracts the cohort and every possible hour they were in the ICU
-- this table can be joined to other tables on ICUSTAY_ID and (ENDTIME - 1 hour,ENDTIME]

-- get first/last measurement time
CREATE TABLE IF NOT EXISTS icustay_hourly(
with all_hours as
(
select
  it.stay_id

  -- ceiling the intime to the nearest hour by adding 59 minutes then truncating
  -- note that we truncate by parsing as string, rather than using DATETIME_TRUNC
  -- this is done to enable compatibility with psql
  , PARSE_DATETIME(
      '%Y-%m-%d %H:00:00',
      FORMAT_DATETIME(
        '%Y-%m-%d %H:00:00',
          DATE_ADD(it.intime_hr, INTERVAL '59' MINUTE)
  )) AS endtime

  -- create integers for each charttime in hours from admission
  -- so 0 is admission time, 1 is one hour after admission, etc, up to ICU disch
  --  we allow 24 hours before ICU admission (to grab labs before admit)
  , GENERATE_ARRAY(-24, CEIL(TIMESTAMP_DIFF(HOUR, it.outtime_hr, it.intime_hr))) as hrs

  from mimic_derived.icustay_times it
)
SELECT stay_id
, CAST(hr AS BIGINT) as hr
, DATE_ADD(endtime, INTERVAL CAST(hr AS BIGINT) HOUR) as endtime
FROM all_hours
CROSS JOIN UNNEST(all_hours.hrs) AS hr
);
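
One possible workaround that stays in SQL is a recursive common table expression, which MySQL 8.0 supports through WITH RECURSIVE: start every stay at hour -24 and keep adding 1 until the stay's last hour is reached. The sketch below runs the idea on SQLite through Python's sqlite3 module so that it is self-contained. The sample rows are made up, the "elapsed hours rounded up" rule is an assumption about what the original CEIL(...) intends, and the date functions (strftime, datetime) are SQLite's; a MySQL version would need equivalents such as DATE_FORMAT, TIMESTAMPDIFF and DATE_ADD.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE icustay_times (stay_id INTEGER, intime_hr TEXT, outtime_hr TEXT);
-- Made-up sample rows, not real MIMIC data.
INSERT INTO icustay_times VALUES
  (1, '2021-01-01 10:20:00', '2021-01-01 12:40:00'),
  (2, '2021-01-02 00:00:00', '2021-01-02 00:30:00');
""")

query = """
WITH RECURSIVE bounds AS (
  SELECT stay_id,
         -- ceil intime to the hour: add 59 minutes, then truncate
         strftime('%Y-%m-%d %H:00:00', intime_hr, '+59 minutes') AS start_hour,
         -- assumption: last offset = elapsed seconds rounded up to whole hours
         (CAST(strftime('%s', outtime_hr) AS INTEGER)
          - CAST(strftime('%s', intime_hr) AS INTEGER) + 3599) / 3600 AS last_hr
  FROM icustay_times
  WHERE intime_hr IS NOT NULL AND outtime_hr IS NOT NULL
),
hours(stay_id, start_hour, last_hr, hr) AS (
  -- anchor: every stay starts 24 hours before admission
  SELECT stay_id, start_hour, last_hr, -24 FROM bounds
  UNION ALL
  -- recursive step: add one hour until the stay's own upper bound
  SELECT stay_id, start_hour, last_hr, hr + 1 FROM hours WHERE hr < last_hr
)
SELECT stay_id, hr, datetime(start_hour, hr || ' hours') AS endtime
FROM hours
ORDER BY stay_id, hr
"""

rows = con.execute(query).fetchall()
print(len(rows), rows[0], rows[-1])
```

MySQL limits recursion through the cte_max_recursion_depth system variable (default 1000), so stays longer than roughly 40 days would probably require raising it for the session.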

【Comments】:

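Most of this code has nothing to do with MySQL. You should ask a new question. Provide sample data, the desired result, and a clear explanation of the logic you want to implement. Your existing code can serve as a reference. 【Answer 1】: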

I used Python to generate the expected table.

import pandas as pd
import pymysql
import datetime

com = pymysql.connect(  host="111.111.111.111",  # placeholder host
                        port=3306, 
                        user="root",
                        password="pw",  # placeholder password
                        db="mimic_derived"  )

icustay_times = pd.read_sql_query('select * from icustay_times', com)

icustay_hourly = []
for row in range(len(icustay_times)): # 12 minutes
    intime = icustay_times['intime_hr'][row]
    if pd.isnull(intime): continue # NULL intime_hr & outtime_hr
        
    stay_id = icustay_times['stay_id'][row]
    
    endtime = intime + datetime.timedelta(minutes=59) # DATETIME_ADD
    endtime = endtime - datetime.timedelta(days=1) # a day earlier (starts from -24)

    outtime = icustay_times['outtime_hr'][row]

    dt = outtime - intime.floor('H') # DATETIME_DIFF
    dt_in_hours = dt.components.days*24 + dt.components.hours

    rng = range(-24, dt_in_hours+1) # GENERATE_ARRAY
    for i, hr in enumerate(rng): # CROSS JOIN UNNEST
        icustay_hourly.append([ stay_id,  # DATETIME_ADD
                                hr, 
                                endtime.floor('H') + datetime.timedelta(hours=i)])
        
icustay_hourly = pd.DataFrame(icustay_hourly, columns=('stay_id', 'hr', 'endtime'))
com.close()

# <Create Table>  (run before com.close())
# with com.cursor() as cur:
#     cur.execute('''
#     CREATE TABLE icustay_hourly(
#         stay_id INT UNSIGNED NOT NULL,
#         hr INT NOT NULL,
#         endtime DATETIME NOT NULL
#     )''')

# <Case 1>
# icustay_hourly.to_csv('icustay_hourly.csv', index=False)
# with com.cursor() as cur:
#     cur.execute('''
#     LOAD DATA INFILE 'icustay_hourly.csv' INTO TABLE icustay_hourly
#     FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 ROWS
#     ''')
# com.commit()

# <Case 2>
# query = '''
#     INSERT INTO icustay_hourly (stay_id, hr, endtime)
#     VALUES (%s, %s, %s)'''
# with com.cursor() as cur:
#     for row in range(len(icustay_hourly)): # 40 minutes
#         # pass the values as parameters instead of formatting them into the SQL
#         cur.execute(query, (
#             int(icustay_hourly['stay_id'][row]),
#             int(icustay_hourly['hr'][row]),
#             icustay_hourly['endtime'][row].to_pydatetime()
#         ))
# com.commit()
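
A possible middle ground between the CSV load and one INSERT per row is to send the rows in batches with the DB-API executemany call. The sketch below is an illustration under assumptions, not the answer's method: it uses Python's built-in sqlite3 as a stand-in for the MySQL connection so that it runs without a server, the sample rows are made up, and insert_in_batches and batch_size are hypothetical names. With pymysql the placeholders would be %s instead of ?, and the call would go through a cursor.

```python
import sqlite3
from datetime import datetime

# Hypothetical rows in the (stay_id, hr, endtime) layout of icustay_hourly.
rows = [
    (1, -24, datetime(2020, 12, 31, 11)),
    (1, -23, datetime(2020, 12, 31, 12)),
    (2, -24, datetime(2021, 1, 1, 0)),
]

con = sqlite3.connect(":memory:")  # stand-in for the pymysql connection
con.execute(
    "CREATE TABLE icustay_hourly ("
    "stay_id INTEGER NOT NULL, hr INTEGER NOT NULL, endtime TEXT NOT NULL)"
)


def insert_in_batches(con, rows, batch_size=1000):
    """Insert rows in chunks with executemany instead of one statement per row."""
    sql = "INSERT INTO icustay_hourly (stay_id, hr, endtime) VALUES (?, ?, ?)"
    for start in range(0, len(rows), batch_size):
        # format datetimes explicitly as 'YYYY-MM-DD HH:MM:SS' strings
        batch = [(s, h, e.isoformat(sep=" "))
                 for s, h, e in rows[start:start + batch_size]]
        con.executemany(sql, batch)
    con.commit()


insert_in_batches(con, rows, batch_size=2)
print(con.execute("SELECT COUNT(*) FROM icustay_hourly").fetchone()[0])  # 3
```

Whether batching closes most of the gap to LOAD DATA INFILE depends on the server and network, so it is worth timing on a small sample first.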

【Comments】:
