Get the street name and number of an address using regular expressions in BigQuery
【Posted】:2020-12-07 20:14:31
【Question】:
I have a bunch of addresses that do not look much alike. For example, I can have
STREET NAME 12501 -ADP 11 LT P 1 -A , PS 1 Y ZOC
or
av some avenue 5640 j aguirre conchali
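Usually, the string starts with the street name, followed by its number. How can I get all the characters up to and including the first number?
For example: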
av some avenue 5640 j aguirre conchali --> av some avenue 5640
STREET NAME 12501 -ADP 11 LT P 1 -A , PS 1 Y ZOC --> STREET NAME 12501
pje. 27 de abril 5492 --> pje. 27 de abril 5492 (in this case, the street name corresponds to the date April 27th)
1 poniente 643, valparaiso --> 1 poniente 643 (in this case, the street name is "1 poniente")
I am trying to do this in BigQuery with REGEXP_EXTRACT, but so far without much success.
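To make the difficulty concrete, here is a minimal Python sketch of the most literal reading of the rule "everything up to and including the first number", written as ^\D*\d+. Python's re module is used as a stand-in for BigQuery's RE2 engine (an assumption made only so the example can be run locally), and the name naive_extract is hypothetical.

```python
import re

# Literal reading of the rule: non-digits from the start, then the first run of digits.
NAIVE_RE = re.compile(r"^\D*\d+", re.ASCII)


def naive_extract(address):
    """Return the text up to and including the first number, or None if there is no digit."""
    match = NAIVE_RE.search(address)
    return match.group(0) if match else None


for address in [
    "av some avenue 5640 j aguirre conchali",
    "STREET NAME 12501 -ADP 11 LT P 1 -A , PS 1 Y ZOC",
    "pje. 27 de abril 5492",
    "1 poniente 643, valparaiso",
]:
    print(repr(address), "->", repr(naive_extract(address)))
```

The first two addresses come out as wanted, but the last two are cut short at "pje. 27" and "1", because the street name itself contains a number.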
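【Comments】:
You need to define a regular expression for each format and join them with the alternation operator to build one big OR expression, such as format1|format2|format3|format4
But would that still use REGEXP_EXTRACT, or should I switch to REGEXP_CONTAINS and run REGEXP_EXTRACT only where REGEXP_CONTAINS is true? And does this approach mean each format should go into its own column?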
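A minimal Python sketch of the alternation idea from the first comment. The three per-format patterns and the names FORMATS and extract_by_formats are hypothetical, chosen only to cover the question's examples, and Python's re module stands in for BigQuery's RE2 engine. A single extraction call is enough here, with no REGEXP_CONTAINS pre-check and no extra columns, since a failed match simply yields no value (None in Python; REGEXP_EXTRACT likewise returns NULL when nothing matches).

```python
import re

# Hypothetical per-format patterns, most specific first.
FORMATS = [
    r"\D*\d{2}\s+de\s+\w+\s+\d{4}",  # date-named street: "pje. 27 de abril 5492"
    r"\d+\D+\d+",                    # name starting with a number: "1 poniente 643"
    r"\D+\d+",                       # plain name, then number: "av some avenue 5640"
]

# Join with | inside a non-capturing group so the ^ anchor applies to every alternative.
COMBINED_RE = re.compile(r"^(?:" + "|".join(FORMATS) + r")", re.ASCII)


def extract_by_formats(address):
    """Return the street name and number, or None if no format matches."""
    match = COMBINED_RE.search(address)
    return match.group(0) if match else None


print(extract_by_formats("pje. 27 de abril 5492"))
print(extract_by_formats("1 poniente 643, valparaiso"))
print(extract_by_formats("av some avenue 5640 j aguirre conchali"))
```

Order matters in an alternation: earlier alternatives are preferred, so the more specific formats are listed before the generic one. When porting this to REGEXP_EXTRACT, keep the inner groups non-capturing, because the function accepts at most one capturing group.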
【Answer 1】:
Use
REGEXP_EXTRACT(column, r'^((?:\D*(?:\d{2}\s+de\s+\w+\s+\d{4})|\d+)?\D*\d*)')
See proof.
Explanation
--------------------------------------------------------------------------------
^                         the beginning of the string
--------------------------------------------------------------------------------
(                         group and capture to \1:
--------------------------------------------------------------------------------
  (?:                     group, but do not capture (optional
                          (matching the most amount possible)):
--------------------------------------------------------------------------------
    \D*                   non-digits (all but 0-9) (0 or more
                          times (matching the most amount
                          possible))
--------------------------------------------------------------------------------
    (?:                   group, but do not capture:
--------------------------------------------------------------------------------
      \d{2}               digits (0-9) (2 times)
--------------------------------------------------------------------------------
      \s+                 whitespace (\n, \r, \t, \f, and " ")
                          (1 or more times (matching the most
                          amount possible))
--------------------------------------------------------------------------------
      de                  'de'
--------------------------------------------------------------------------------
      \s+                 whitespace (\n, \r, \t, \f, and " ")
                          (1 or more times (matching the most
                          amount possible))
--------------------------------------------------------------------------------
      \w+                 word characters (a-z, A-Z, 0-9, _)
                          (1 or more times (matching the most
                          amount possible))
--------------------------------------------------------------------------------
      \s+                 whitespace (\n, \r, \t, \f, and " ")
                          (1 or more times (matching the most
                          amount possible))
--------------------------------------------------------------------------------
      \d{4}               digits (0-9) (4 times)
--------------------------------------------------------------------------------
    )                     end of grouping
--------------------------------------------------------------------------------
   |                      OR
--------------------------------------------------------------------------------
    \d+                   digits (0-9) (1 or more times
                          (matching the most amount possible))
--------------------------------------------------------------------------------
  )?                      end of grouping
--------------------------------------------------------------------------------
  \D*                     non-digits (all but 0-9) (0 or more
                          times (matching the most amount
                          possible))
--------------------------------------------------------------------------------
  \d*                     digits (0-9) (0 or more times (matching
                          the most amount possible))
--------------------------------------------------------------------------------
)                         end of \1
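The pattern explained above can be checked against the four examples from the question. This is a minimal sketch that uses Python's re module as a stand-in for BigQuery's RE2 engine (an assumption for local testing; re.ASCII keeps \d, \w and \s limited to ASCII, as in RE2). The helper name extract_street is hypothetical.

```python
import re

# Same pattern as in the answer; group 1 holds the street name and number.
STREET_RE = re.compile(
    r"^((?:\D*(?:\d{2}\s+de\s+\w+\s+\d{4})|\d+)?\D*\d*)",
    re.ASCII,
)


def extract_street(address):
    """Return the leading street name and number of an address string."""
    match = STREET_RE.search(address)
    # Every part of the pattern is optional, so a match (possibly empty) is always found.
    return match.group(1) if match else None


# Inputs and wanted outputs, copied from the question.
EXAMPLES = {
    "av some avenue 5640 j aguirre conchali": "av some avenue 5640",
    "STREET NAME 12501 -ADP 11 LT P 1 -A , PS 1 Y ZOC": "STREET NAME 12501",
    "pje. 27 de abril 5492": "pje. 27 de abril 5492",
    "1 poniente 643, valparaiso": "1 poniente 643",
}

for address, wanted in EXAMPLES.items():
    print(repr(extract_street(address)), "wanted:", repr(wanted))
```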
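【Comments】:
Thank you so much! Your answer is awesome! If I understand this correctly, to handle other special words besides de, should I add something like de|la|el, or should it be de+la+el? Also, where did you learn regular expressions? I need to add them to my tool belt.
@PedroPabloSeverinHonorato Yes, (?:de|la|el).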
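The difference between the two spellings in this exchange can be sketched in Python, used here as a stand-in for RE2. The names build_street_pattern and extract_street and the sample address "pje. 18 la paz 1234" are hypothetical. In a regex, + is a quantifier rather than a separator, so de+la+el describes one string made of d, one or more e, l, one or more a, then el, whereas (?:de|la|el) matches any one of the three words.

```python
import re


def build_street_pattern(connectors):
    """Build the answer's pattern with the literal 'de' replaced by an alternation."""
    words = "|".join(re.escape(word) for word in connectors)
    return re.compile(
        r"^((?:\D*(?:\d{2}\s+(?:" + words + r")\s+\w+\s+\d{4})|\d+)?\D*\d*)",
        re.ASCII,
    )


def extract_street(address, connectors=("de", "la", "el")):
    # Every part of the pattern is optional, so search() always finds a match.
    return build_street_pattern(connectors).search(address).group(1)


# + quantifies the preceding character; it does not mean "or".
assert re.fullmatch(r"de+la+el", "la") is None
assert re.fullmatch(r"de+la+el", "deelaael") is not None
assert re.fullmatch(r"(?:de|la|el)", "la") is not None

print(extract_street("pje. 18 la paz 1234", connectors=("de",)))
print(extract_street("pje. 18 la paz 1234"))
```

With only de in the list, the hypothetical address is cut at "pje. 18"; with la and el added, the whole "pje. 18 la paz 1234" is kept.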