Converting from BigQuery to MySQL, variable array size
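【Title】: Converting from BigQuery to MySQL, variable array size 【Posted】: 2021-04-20 07:47:53 【Question】: I am translating the BigQuery source code below into MySQL. The problem is GENERATE_ARRAY, which produces a variable-length array, something MySQL does not accept. I think I need to build a virtual table with WITH array AS ( ... ), but that does not seem easy because the array size is not fixed. Is there a workaround?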
The table used looks like this→
-- This query generates a row for every hour the patient is in the ICU.
-- The hours are based on clock-hours (i.e. 02:00, 03:00).
-- The hour clock starts 24 hours before the first heart rate measurement.
-- Note that the time of the first heart rate measurement is ceilinged to the hour.
-- this query extracts the cohort and every possible hour they were in the ICU
-- this table can be joined to other tables on ICUSTAY_ID and (ENDTIME - 1 hour,ENDTIME]
-- get first/last measurement time
CREATE TABLE IF NOT EXISTS icustay_hourly(
with all_hours as
(
select
it.stay_id
-- ceiling the intime to the nearest hour by adding 59 minutes then truncating
-- note that we truncate by parsing as string, rather than using DATETIME_TRUNC
-- this is done to enable compatibility with psql
, PARSE_DATETIME(
'%Y-%m-%d %H:00:00',
FORMAT_DATETIME(
'%Y-%m-%d %H:00:00',
DATE_ADD(it.intime_hr, INTERVAL '59' MINUTE)
)) AS endtime
-- create integers for each charttime in hours from admission
-- so 0 is admission time, 1 is one hour after admission, etc, up to ICU disch
-- we allow 24 hours before ICU admission (to grab labs before admit)
, GENERATE_ARRAY(-24, CEIL(TIMESTAMP_DIFF(HOUR, it.outtime_hr, it.intime_hr))) as hrs
from mimic_derived.icustay_times it
)
SELECT stay_id
, CAST(hr AS BIGINT) as hr
, DATE_ADD(endtime, INTERVAL CAST(hr AS BIGINT) HOUR) as endtime
FROM all_hours
CROSS JOIN UNNEST(all_hours.hrs) AS hr
);
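To make the target output concrete, here is a minimal pure-Python sketch of what the GENERATE_ARRAY plus CROSS JOIN UNNEST combination produces for a single stay. The function names `ceil_to_hour` and `expand_stay` and the sample timestamps are hypothetical, and the upper bound assumes the elapsed time between intime_hr and outtime_hr in hours, rounded up, which is how the CEIL(...) call in the query reads.

```python
from datetime import datetime, timedelta
import math


def ceil_to_hour(ts):
    """Add 59 minutes and truncate to the hour, mirroring the query's approach."""
    shifted = ts + timedelta(minutes=59)
    return shifted.replace(minute=0, second=0, microsecond=0)


def expand_stay(stay_id, intime, outtime):
    """Return one (stay_id, hr, endtime) tuple per hour, starting at hr = -24."""
    base = ceil_to_hour(intime)
    # Assumption: the last hour index is the elapsed time in hours, rounded up.
    last_hr = math.ceil((outtime - intime).total_seconds() / 3600)
    # GENERATE_ARRAY includes both endpoints, hence the + 1 on the range.
    return [(stay_id, hr, base + timedelta(hours=hr))
            for hr in range(-24, last_hr + 1)]


# Hypothetical stay: admitted 07:47, discharged 10:05 on the same day.
rows = expand_stay(1, datetime(2021, 4, 20, 7, 47), datetime(2021, 4, 20, 10, 5))
```

For this sample stay the result has 28 rows, running from hr = -24 at 2021-04-19 08:00 to hr = 3 at 2021-04-20 11:00. The number of rows varies per stay, which is the part a fixed-size virtual table cannot express.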
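【Comments】:
Most of this code has nothing to do with MySQL. You should ask a new question that provides sample data, the desired results, and a clear explanation of the logic you want to implement. Your existing code can serve as a reference. 【Answer 1】: I used Python to generate the expected table.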
import pandas as pd
import pymysql
import datetime

# Connection values are placeholders; substitute your own.
com = pymysql.connect(host="111.111.111.111",
                      port=3306,
                      user="root",
                      password="pw",
                      db="mimic_derived")
icustay_times = pd.read_sql_query('select * from icustay_times', com)

icustay_hourly = []
for row in range(len(icustay_times)):  # 12 minutes
    intime = icustay_times['intime_hr'][row]
    if pd.isnull(intime):  # NULL intime_hr & outtime_hr
        continue
    stay_id = icustay_times['stay_id'][row]
    endtime = intime + datetime.timedelta(minutes=59)  # DATETIME_ADD
    endtime = endtime - datetime.timedelta(days=1)  # a day earlier (starts from -24)
    outtime = icustay_times['outtime_hr'][row]
    dt = outtime - intime.floor('H')  # DATETIME_DIFF
    dt_in_hours = dt.components.days*24 + dt.components.hours
    rng = range(-24, dt_in_hours+1)  # GENERATE_ARRAY
    for i, hr in enumerate(rng):  # CROSS JOIN UNNEST
        icustay_hourly.append([stay_id,  # DATETIME_ADD
                               hr,
                               endtime.floor('H') + datetime.timedelta(hours=i)])

icustay_hourly = pd.DataFrame(icustay_hourly, columns=('stay_id', 'hr', 'endtime'))
com.close()
# The alternatives below need the connection to still be open.
# <Create Table>
# pd.read_sql_query('''
#     CREATE TABLE icustay_hourly(
#         stay_id INT UNSIGNED NOT NULL,
#         hr INT NOT NULL,
#         endtime DATETIME NOT NULL
#     )''', com)
# <Case 1>
# icustay_hourly.to_csv('icustay_hourly.csv', index=False)
# pd.read_sql_query('''
#     LOAD DATA INFILE 'icustay_hourly.csv' INTO TABLE icustay_hourly
#     FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' IGNORE 1 ROWS
#     ''', com)
# <Case 2>
# for row in range(len(icustay_hourly)):  # 40 minutes
#     query = f'''
#         INSERT INTO icustay_hourly (stay_id, hr, endtime)
#         VALUES (
#             {icustay_hourly["stay_id"][row]},
#             {icustay_hourly["hr"][row]},
#             '{icustay_hourly["endtime"][row]}'
#         )'''
#     try:
#         pd.read_sql_query(query, com)
#     except TypeError:  # pd.read_sql_query requires return
#         pass
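The row-by-row INSERT loop in Case 2 is noted as taking about 40 minutes. One possible alternative, sketched here under assumptions, is to send rows in batches through the DB-API `executemany` method, which pymysql cursors provide. The helper name `insert_in_batches`, the batch size, and the stand-in `RecordingCursor` used for the demonstration are hypothetical, and no timing has been measured for this variant.

```python
from itertools import islice

INSERT_SQL = "INSERT INTO icustay_hourly (stay_id, hr, endtime) VALUES (%s, %s, %s)"


def insert_in_batches(cursor, rows, batch_size=10000):
    """Pass rows to cursor.executemany in chunks; return the number of rows sent."""
    it = iter(rows)
    total = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        cursor.executemany(INSERT_SQL, batch)
        total += len(batch)
    return total


class RecordingCursor:
    """Hypothetical stand-in for a database cursor so the sketch runs offline."""

    def __init__(self):
        self.batches = []

    def executemany(self, sql, params):
        # Record each batch instead of talking to a database.
        self.batches.append(list(params))


cursor = RecordingCursor()
sample_rows = [(1, hr, None) for hr in range(-24, 1)]  # 25 placeholder rows
sent = insert_in_batches(cursor, sample_rows, batch_size=10)
```

With a real pymysql connection, a cursor from `com.cursor()` would take the place of `RecordingCursor`, followed by `com.commit()`. The row values would likely need to be plain Python types rather than pandas objects.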
【Comments】: