Redshift: partitioning an external table by part of a string

Posted 2023-03-30

【Title】redshift partition external table by part of a string 【Posted】2021-02-19 15:14:29 【Question】:

After several days of searching I still haven't found an answer, so here goes.

I have an Athena database with a table foo. To add it to Redshift I use this command:

create external schema athena_schema from data catalog 
database 'my-catalog-db' 
iam_role '...role/my_redshift_role';

My table foo has 45 fields, one of which is a timestamp stored as a string. I want to partition the table's data by the date part of that string.

The string looks like "2021/02/09 20:10:09:001"; let's call it mydate.

So I tried this:

alter table athena_schema.foo
partition(left(mydate, 10) = '2021/02/09')
location 's3://my s3 location/foo_2021_02_09/';

Redshift does not accept a substring expression on an existing field here. I have tried. Any ideas? Thanks for your time.

【Answer 1】:

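When you define partitions on a Redshift Spectrum (or Athena) external table, the partition column becomes a separate column of the table. This means you cannot map a partition to a column that also exists in the table's data files, so an expression such as left(mydate, 10) cannot serve as the partition key. The partition value comes from the partition definition and its S3 location, not from the contents of the files.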

In the example DDL from "Partitioning Redshift Spectrum external tables", you can see the partition column saledate added to the table as an additional column.

CREATE EXTERNAL TABLE spectrum.sales_part (
      salesid     INTEGER
    , listid      INTEGER
    , sellerid    INTEGER
    , buyerid     INTEGER
    , eventid     INTEGER
    , dateid      SMALLINT
    , qtysold     SMALLINT
    , pricepaid   DECIMAL(8,2)
    , commission  DECIMAL(8,2)
    , saletime    TIMESTAMP               )
PARTITIONED BY (saledate CHAR(10))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://awssampledbuswest2/tickit/spectrum/sales_partition/'
TABLE PROPERTIES ('numRows'='172000');

--Add partitions
ALTER TABLE spectrum.sales_part ADD
PARTITION (saledate='2008-01') LOCATION 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/'
PARTITION (saledate='2008-02') LOCATION 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/'
PARTITION (saledate='2008-03') LOCATION 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-03/';

--Query using partition column `saledate`
SELECT TOP 5 
       spectrum.sales_part.eventid
     , SUM(spectrum.sales_part.pricepaid)
FROM spectrum.sales_part, event
WHERE spectrum.sales_part.eventid = event.eventid
  AND spectrum.sales_part.pricepaid > 30
  AND saledate = '2008-01'
GROUP BY spectrum.sales_part.eventid
ORDER BY 2 DESC;
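
Applied to the table in the question, the same pattern might look like the sketch below. It assumes several things the question does not state: that the files for each day already sit under their own S3 prefix, that they are stored as Parquet, and that a new external table can be created next to foo. The names foo_part and mydate_day are hypothetical, and 's3://my s3 location/' is the question's own placeholder. The partition column is given a name different from mydate because it is not read from the data files.

```sql
-- Hypothetical partitioned variant of foo (foo_part and mydate_day are illustrative names).
CREATE EXTERNAL TABLE athena_schema.foo_part (
      mydate      VARCHAR(32)   -- full timestamp string, e.g. '2021/02/09 20:10:09:001'
      -- the other 44 columns of foo would be listed here, comma-separated
)
PARTITIONED BY (mydate_day CHAR(10))   -- separate column, not present in the files
STORED AS PARQUET                      -- assumption: change to match the real file format
LOCATION 's3://my s3 location/';

--Add one partition per day, pointing at that day's prefix
ALTER TABLE athena_schema.foo_part ADD
PARTITION (mydate_day='2021/02/09') LOCATION 's3://my s3 location/foo_2021_02_09/';

--Query using partition column `mydate_day`
SELECT COUNT(*)
FROM athena_schema.foo_part
WHERE mydate_day = '2021/02/09';
```

The value of mydate_day is whatever string the ADD PARTITION statement assigns. Redshift does not check it against the mydate values inside the files, so keeping the two consistent depends on how the S3 prefixes are laid out.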
