Extract parts of a string in Amazon Redshift

Posted 2023-03-31

Asked: 2021-05-24 14:53:48

Question:

I'm trying to extract certain elements from a URL like the one below (there are many such URLs):

https://google.com/?utm_medium=cpc&utm_source=google&utm_campaign=c_lp_generic_us_2021-03-23&gclid=Cj0KCQjwkZiFBhD9ARIsAGxFX8AienwPdwPa_-qZnqbRzFoK98BU3VvTvdI4La5IrPW7anUaBOX5QSQaAs01EALw_wcB

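I'd like to extract utm_medium, utm_source and utm_campaign from the URL.

The output should be 4 columns:

url (the URL above)
utm_medium = cpc
utm_source = google
utm_campaign = c_lp_generic_us_2021-03-23

All of the URLs follow this utm_medium, utm_source, utm_campaign format, so I was hoping those parameter names could be used as some kind of reference point.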

Comments on the question:

Use substring with a regular expression, as shown in the documentation: postgresql.org/docs/9.3/functions-matching.html

Answer 1:

Try this.

In Postgres:

with cte as (
select 'https://google.com/?utm_medium=cpc&utm_source=google&utm_campaign=c_lp_generic_us_2021-03-23&gclid=Cj0KCQjwkZiFBhD9ARIsAGxFX8AienwPdwPa_-qZnqbRzFoK98BU3VvTvdI4La5IrPW7anUaBOX5QSQaAs01EALw_wcB' url_
)
-- PostgreSQL only: per the comments below, this query fails on Redshift.
select
    split_part(url_, '?', 1),
    max(split_part(t.split_, '=', 2)) filter (where split_part(t.split_, '=', 1) = 'utm_medium') "utm_medium",
    max(split_part(t.split_, '=', 2)) filter (where split_part(t.split_, '=', 1) = 'utm_source') "utm_source",
    max(split_part(t.split_, '=', 2)) filter (where split_part(t.split_, '=', 1) = 'utm_campaign') "utm_campaign"
from cte
cross join lateral regexp_split_to_table(split_part(url_, '?', 2), '&') t(split_)
group by 1

Edit based on the comments: this version works in both PostgreSQL and Redshift.

with cte as (
select 'https://google.com/?utm_medium=cpc&utm_source=google&utm_campaign=c_lp_generic_us_2021-03-23&gclid=Cj0KCQjwkZiFBhD9ARIsAGxFX8AienwPdwPa_-qZnqbRzFoK98BU3VvTvdI4La5IrPW7anUaBOX5QSQaAs01EALw_wcB' url_
)
select 
split_part(url_,'?',1),
substring(split_part(url_,'utm_medium=',2),1,position('&' in split_part(url_,'utm_medium=',2))-1) "utm_medium",
substring(split_part(url_,'utm_source=',2),1,position('&' in split_part(url_,'utm_source=',2))-1) "utm_source",
substring(split_part(url_,'utm_campaign=',2),1,position('&' in split_part(url_,'utm_campaign=',2))-1) "utm_campaign"
from cte

DEMO
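One caveat with the Redshift-compatible query above: when the parameter is the last one in the URL there is no trailing `&`, so `position('&' in ...)` returns 0 and the length passed to `substring` becomes -1. A minimal sketch of a variant that avoids this is to take the text after the parameter name and cut it at the next `&` with a second `split_part`. It reuses the `cte` and `url_` names from above. The explicit `varchar(2048)` cast is an assumption, aimed at the "unknown to text" conversion error reported in the comments below, and should not be needed when `url_` is a real table column.

```sql
with cte as (
    -- The cast gives the literal a concrete type (assumption: this avoids the
    -- "unknown" to text conversion error; unnecessary for a real column).
    select 'https://google.com/?utm_medium=cpc&utm_source=google&utm_campaign=c_lp_generic_us_2021-03-23'::varchar(2048) as url_
)
select
    url_,
    -- Inner split_part: everything after 'utm_medium='.
    -- Outer split_part: everything before the next '&' (or the whole rest
    -- of the string when no '&' follows).
    split_part(split_part(url_, 'utm_medium=', 2), '&', 1)   as utm_medium,
    split_part(split_part(url_, 'utm_source=', 2), '&', 1)   as utm_source,
    split_part(split_part(url_, 'utm_campaign=', 2), '&', 1) as utm_campaign
from cte;
```

Selecting `url_` itself, rather than `split_part(url_, '?', 1)`, returns the full URL as the first of the four columns the question asks for.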

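Comments:

Thanks! But I'm getting a syntax error near filter (where ... (line 6). I'm pulling from a Redshift database.

You are getting the error because you are not using PostgreSQL. DEMO. If you are using Redshift, please tag the correct database in your question. Redshift and Postgres are different databases.

Added an answer for Redshift as well.

Thank you so much. As I can see in your demo, it works fine (apologies for my confusing question). When I run this query in TablePlus I still get an error: Query 1 ERROR: ERROR: syntax error at or near "(" LINE 10: regexp_split_to_table(split_part(url_,'?',2),'&') t(split_)... Super frustrating... I'm very new to this.

Thanks, and sorry, now I'm getting another error: Query 1 ERROR: ERROR: could not find a conversion function from "unknown" to text

Answer 2:

The split_part approach above works well. If you need another way, you can use substring with position, as follows.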

select substring('https://google.com/?utm_medium=cpc&utm_source=google&utm_campaign=c_lp_generic_us_2021-03-23' 
from  position('utm_medium=' IN 'https://google.com/?utm_medium=cpc&utm_source=google&utm_campaign=c_lp_generic_us_2021-03-23') +11
for position('&utm_source' IN 'https://google.com/?utm_medium=cpc&utm_source=google&utm_campaign=c_lp_generic_us_2021-03-23') - 
(position('utm_medium=' IN 'https://google.com/?utm_medium=cpc&utm_source=google&utm_campaign=c_lp_generic_us_2021-03-23') + 11) )
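A regular expression, as suggested in the comments on the question, is another option that does not depend on which parameter comes next. The following is only a sketch, assuming Redshift's `regexp_substr` with the `'e'` parameter, which returns the first parenthesized subexpression instead of the whole match. This five-argument form is Redshift-specific, and the subquery alias `t` and column name `url_` are illustrative.

```sql
select
    -- 'utm_medium=([^&]*)' matches the parameter name, then captures every
    -- character up to the next '&' or the end of the string.
    -- Arguments 1, 1 mean: start at position 1, take the first occurrence.
    regexp_substr(url_, 'utm_medium=([^&]*)',   1, 1, 'e') as utm_medium,
    regexp_substr(url_, 'utm_source=([^&]*)',   1, 1, 'e') as utm_source,
    regexp_substr(url_, 'utm_campaign=([^&]*)', 1, 1, 'e') as utm_campaign
from (
    select 'https://google.com/?utm_medium=cpc&utm_source=google&utm_campaign=c_lp_generic_us_2021-03-23'::varchar(2048) as url_
) t;
```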
