BigQuery returns null for REGEXP_MATCH or _EXTRACT


Posted: 2017-04-12 19:41:39

Question:

I am using the following query to return a subset of the string in a column:

    SELECT
    REGEXP_EXTRACT(content, r'/\ :dependencies\ \[(.*?)\]]\ /g') AS deps
    FROM x[my-test-162023:lab.clj_files_results_030904]

But it returns the following:

Row deps     
1   null     
2   null     
3   null     
4   null     
5   null     
6   null  

I have tested the regex pattern on http://www.regexpal.com/ and https://regex101.com/r/Gjre2i/2, and it seems to work fine there.

Any help or hints are appreciated.

Update: the table I am querying looks like this:

Row content  
1   (defproject spaghetti "0.1.0-SNAPSHOT" :description "FIXME: write description" :url "http://example.com/FIXME" :license :name "Eclipse Public License" :url "http://www.eclipse.org/legal/epl-v10.html" :source-paths ["src/clj" "src/cljs"] :dependencies [[org.clojure/clojure "1.6.0"] [org.clojure/clojurescript "0.0-2371" :scope "provided"] [org.clojure/core.async "0.1.346.0-17112a-alpha"] [ring "1.3.1"] [compojure "1.2.0"] [enlive "1.1.5"] [om "0.7.3"] [figwheel "0.1.4-SNAPSHOT"] [environ "1.0.0"] [com.cemerick/piggieback "0.1.3"] [weasel "0.4.3-SNAPSHOT"] [leiningen "2.5.0"] [http-kit "2.1.19"] [com.cognitect/transit-cljs "0.8.188"] ; [devcards "0.1.2-SNAPSHOT"] [sablono "0.2.22"] [prismatic/om-tools "0.3.3"]] :plugins [[lein-cljsbuild "1.0.3"] [lein-environ "1.0.0"]] :min-lein-version "2.5.0" :uberjar-name "spaghetti.jar" :cljsbuild :builds :app :source-paths ["src/cljs/spaghetti"] :compiler :output-to "resources/public/js/app.js" :output-dir "resources/public/js/out" :source-map "resources/public/js/out.js.map" :optimizations :none :preamble ["react/react.min.js" "public/js/adsr/index.js" "public/js/WebMIDIAPIWrapper/js/WebMIDIAPIWrapper.js" "public/js/hammerjs/hammer.min.js" "public/js/wavy-jones/wavy-jones.js"] :externs ["react/externs/react.js" "public/js/adsr/adsr.externs.js" "public/js/WebMIDIAPIWrapper/WebMIDIAPIWrapper.externs.js" "public/js/hammerjs/hammerjs.externs.js" "public/js/wavy-jonewavy-jones.externs.js"] :pretty-print true :profiles :dev :repl-options :init-ns spaghetti.server :timeout 120000 :nrepl-middleware [cemerick.piggieback/wrap-cljs-repl] :plugins [[lein-figwheel "0.1.4-SNAPSHOT"]] :figwheel :http-server-root "public" :port 3449 :css-dirs ["resources/public/css"] :env :is-dev true :cljsbuild :builds :app :source-paths ["env/dev/cljs"] :uberjar :hooks [leiningen.cljsbuild] :env :production true :omit-source true :aot :all :cljsbuild :builds :app :source-paths ["env/prod/cljs"] :compiler :optimizations :advanced :pretty-print false )     
2   (defproject pomodoro "0.0.4" :license :name "MIT" :url "http://opensource.org/licenses/MIT" :distribution :repo :description "A simple pomodoro timer" :url "https://github.com/landau/cljs-pomodoro" :dependencies [[org.clojure/clojure "1.6.0"] [org.clojure/clojurescript "0.0-2322"] [org.clojure/core.async "0.1.338.0-5c5012-alpha"] [com.andrewmcveigh/cljs-time "0.1.6"] [reagent "0.4.2"]] :plugins [[lein-ring "0.8.11"] [lein-cljsbuild "1.0.3"] [lein-environ "0.5.0"]] :ring :handler server.core/app :profiles :uberjar :aot :all :dev :dependencies [[ring-mock "0.1.5"] [ring/ring-devel "1.3.0"] [compojure "1.1.9"]] :env :dev true :release :ring :open-browser? false :stacktraces? false :auto-reload? false :source-paths ["src"] :main server.core :cljsbuild  :builds [:id "dev" :source-paths ["src-cljs"] :compiler :output-to "public/js/pomodoro.js" :output-dir "public/js/dev" :optimizations :none :pretty-print tru :source-map true :id "prod" :source-paths ["src-cljs"] :compiler :output-to "public/js/main.js" :optimizations :advanced :pretty-print false :externs ["public/js/react-min-0.11.2.js"] ])    
3   (defproject datascript-mori "0.15.2" :description "Wrapper for datascript interplay mori" :url "https://github.com/typeetfunc/datascript-mori" :license :name "Eclipse Public License" :url "http://www.eclipse.org/legal/epl-v10.html" :min-lein-version "2.5.3" :dependencies [[org.clojure/clojure "1.7.0"] [org.clojure/clojurescript "1.7.170"] [datascript "0.15.0"]] :plugins [[lein-cljsbuild "1.1.2" :exclusions [[org.clojure/clojure]]] [lein-git-deps "0.0.2-SNAPSHOT"]] :git-dependencies [["https://github.com/swannodette/mori.git"]] :source-paths ["src" ".lein-git-deps/mori/src"] :clean-targets ^:protect false ["target"] :cljsbuild :builds [:id "min" :source-paths ["src" ".lein-git-deps/mori/src"] :compiler  :output-to "release-js/datascript-mori.bare.js" :main datascript-mori.core :optimizations :advanced :pretty-print false  :notify-command ["release-js/wrap_bare.sh"]] )    
4   (defproject pandect "0.6.1-SNAPSHOT" :description "Message Digest and Checksum Library for Clojure" :url "https://github.com/xsc/pandect" :license :name "MIT License" :url "https://opensource.org/licenses/MIT" :year 2014 :key "mit" :dependencies [[org.clojure/clojure "1.8.0" :scope "provided"] [org.bouncycastle/bcprov-jdk15on "1.54" :scope "provided"]] :exclusions [org.clojure/clojure] :source-paths ["src/clojure" "target/generated"] :java-source-paths ["src/java"] :profiles :dev :plugins [[lein-codox "0.9.4"]] :codox :project :name "pandect" :metadata :doc/format :markdown :output-path "doc" :namespaces [pandect.core pandect.buffer #"^pandect\.algo\.[a-z\-]+"] :benchmark :dependencies [[criterium "0.4.3"] [clj-message-digest "1.0.0"] [digest "1.4.4"]] :source-paths ["shootout"] :jvm-opts ^:replace ["-Xmx1g" "-server"] :1.5 :dependencies [[org.clojure/clojure "1.5.1"]] :1.6 :dependencies [[org.clojure/clojure "1.6.0"]] :1.7 :dependencies [[org.clojure/clojure "1.7.0"]] :prep-tasks ["codegen"] :aliases "benchmark" ["with-profile" "dev,benchmark" "run" "-m"] "codegen" ["run" "-m" "pandect.codegen"] "all" ["with-profile" "+dev:+1.5:+1.6:+1.7"] :pedantic? :abort)  
5   (defproject stch-library/sql "0.1.1" :description "A DSL in Clojure for SQL query, DML, and DDL." :url "https://github.com/stch-library/sql" :license :name "Eclipse Public License" :url "http://www.eclipse.org/legal/epl-v10.html" :dependencies [[org.clojure/clojure "1.5.1"] [stch-library/schema "0.3.3"]] :profiles :dev :dependencies [[speclj "3.0.2"]] :plugins [[speclj "3.0.2"] [codox "0.6.7"]] :codox :src-dir-uri "https://github.com/stch-library/sql/blob/master/" :src-linenum-anchor-prefix "L" :test-paths ["spec"])    
6   (defproject laboratory "0.1.0-SNAPSHOT" :description "do science in production" :url "https://github.com/yeller/laboratory" :license :name "Eclipse Public License" :url "http://www.eclipse.org/legal/epl-v10.html" :dependencies [[org.clojure/clojure "1.8.0"]] :profiles :dev :dependencies [[org.clojure/tools.namespace "0.2.4"]] :benches :dependencies [[criterium "0.4.1"]] :source-paths ["src" "benches"] :global-vars *warn-on-reflection* true *unchecked-math* :warn-on-boxed ;*compiler-options* :disable-locals-clearing true *assert* true)     

Comments:

The regex does not need delimiters here. r' :dependencies \[(.*?)\]] ' should work.

I just tried that, @WiktorStribiżew, but no luck...

Try r' :dependencies \[((?:\[[^\][]*(?:\[[^\][]*][^\][]*)*]|[^\][])*)] ' (see demo)

I tried the following, @WiktorStribiżew: SELECT REGEXP_EXTRACT(content, r' :dependencies \[((?:\[[^\][]*(?:\[[^\][]*][^\][]*)*]|[^\][])*)] ') FROM [my-test-23:lab.clj_project_files] but no luck...

Answer 1:

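/g means that you want to extract all matches, not just the first one, which is all that REGEXP_EXTRACT returns. You should use REGEXP_EXTRACT_ALL together with UNNEST instead.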

So try the following:

#standardSQL
SELECT deps
FROM `my-test-23.lab.clj_project_files`, 
  UNNEST(REGEXP_EXTRACT_ALL(content, r' :dependencies \[(.*?)\]] ')) AS deps
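Both points made so far (the delimiters and the first-match-only behaviour) can be replayed outside BigQuery. The sketch below uses Python's `re` module as a stand-in for BigQuery's RE2 engine, which is an assumption: the two engines are not identical, but both treat `/` and `g` as ordinary characters to match, not as delimiters or flags. The `content` string is a shortened, made-up sample, not a row from the real table.

```python
import re

# Shortened, hypothetical sample in the shape of the `content` column.
content = ('(defproject demo "0.1.0" :dependencies [[a "1"] [b "2"]] '
           ':profiles :dev :dependencies [[c "3"]] :env :dev true)')

# Pattern copied from the question: the leading "/" and trailing "/g" are
# matched literally, so the text itself would have to contain them.
with_delimiters = r'/\ :dependencies\ \[(.*?)\]]\ /g'

# Same pattern without the JavaScript-style delimiters and flag.
without_delimiters = r' :dependencies \[(.*?)\]] '

# No match at all, which BigQuery reports as NULL.
no_match = re.search(with_delimiters, content)

# First match only, comparable to REGEXP_EXTRACT.
first_match = re.search(without_delimiters, content)

# Every match, comparable to REGEXP_EXTRACT_ALL.
all_matches = re.findall(without_delimiters, content)

print(no_match)
print(first_match.group(1))
print(all_matches)
```

Note that the capture group stops before the closing `]]`, so the last inner bracket is not part of what gets extracted.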

I get "Query returned zero records"....

Try the following with dummy data based on your example (with only the first two lines removed):

#standardSQL
WITH yourTable AS (
  SELECT 1 AS id, '(defproject :dependencies [[org.clojure/clojure "1.6.0"] [org.clojure/clojurescript "0.0-2371" :scope "provided"] [org.clojure/core.async "0.1.346.0-17112a-alpha"] [ring "1.3.1"] [compojure "1.2.0"] [enlive "1.1.5"] [om "0.7.3"] [figwheel "0.1.4-SNAPSHOT"] [environ "1.0.0"] [com.cemerick/piggieback "0.1.3"] [weasel "0.4.3-SNAPSHOT"] [leiningen "2.5.0"] [http-kit "2.1.19"] [com.cognitect/transit-cljs "0.8.188"] ; [devcards "0.1.2-SNAPSHOT"] [sablono "0.2.22"] [prismatic/om-tools "0.3.3"]] :plugins [[lein-cljsbuild "1.0.3"] [lein-environ "1.0.0"]]  '  AS content UNION ALL
  SELECT 2, '(defproject :dependencies [[org.clojure/clojure "1.6.0"] [org.clojure/clojurescript "0.0-2322"] [org.clojure/core.async "0.1.338.0-5c5012-alpha"] [com.andrewmcveigh/cljs-time "0.1.6"] [reagent "0.4.2"]] :plugins [[lein-ring "0.8.11"] [lein-cljsbuild "1.0.3"] [lein-environ "0.5.0"]] :ring :handler server.core/app :profiles :uberjar :aot :all :dev :dependencies [[ring-mock "0.1.5"] [ring/ring-devel "1.3.0"] [compojure "1.1.9"]] :env :dev true  ' 
)
SELECT id, deps
FROM yourTable, UNNEST(SPLIT(REPLACE(
    REGEXP_EXTRACT(content, r' :dependencies \[(\[.*?])*] ') 
  ,'] [', '],['))) AS deps
ORDER BY id   

The result is as follows:

Row id  deps     
1   1   [org.clojure/clojurescript "0.0-2371" :scope "provided"]     
2   1   [prismatic/om-tools "0.3.3"]     
3   1   [sablono "0.2.22"]   
4   1   [com.cognitect/transit-cljs "0.8.188"] ; [devcards "0.1.2-SNAPSHOT"]     
5   1   [http-kit "2.1.19"]  
6   1   [leiningen "2.5.0"]  
7   1   [weasel "0.4.3-SNAPSHOT"]    
8   1   [com.cemerick/piggieback "0.1.3"]    
9   1   [environ "1.0.0"]    
10  1   [figwheel "0.1.4-SNAPSHOT"]  
11  1   [om "0.7.3"]     
12  1   [enlive "1.1.5"]     
13  1   [compojure "1.2.0"]  
14  1   [ring "1.3.1"]   
15  1   [org.clojure/core.async "0.1.346.0-17112a-alpha"]    
16  1   [org.clojure/clojure "1.6.0"]    
17  2   [org.clojure/clojure "1.6.0"]    
18  2   [org.clojure/clojurescript "0.0-2322"]   
19  2   [org.clojure/core.async "0.1.338.0-5c5012-alpha"]    
20  2   [com.andrewmcveigh/cljs-time "0.1.6"]    
21  2   [reagent "0.4.2"]      
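The extract, replace, split chain used in that query can be traced step by step on a small string. This is only a sketch: Python's `re` stands in for BigQuery's RE2 (an assumption), and the `content` value is invented for illustration.

```python
import re

# Hypothetical single-line sample, similar in shape to the dummy data above.
content = ('(defproject demo :dependencies [[a "1"] '
           '[b "2" :scope "provided"] [c "3"]] :plugins [[d "4"]]  ')

# Step 1 (REGEXP_EXTRACT): capture everything between the outer brackets.
# The lazy ".*?" keeps growing until a "]" is followed by "] ".
match = re.search(r' :dependencies \[(\[.*?])*] ', content)
inner = match.group(1)

# Step 2 (REPLACE): turn the "] [" separators into "],[" so there is a
# delimiter to split on.
delimited = inner.replace('] [', '],[')

# Step 3 (SPLIT): split on the comma, one dependency per element.
deps = delimited.split(',')

print(deps)
```

Because the split is on a plain comma, a dependency vector that itself contains a comma would be cut in two. Likewise, anything between two entries that is not exactly `] [` stays glued together, which is why the `; [devcards ...]` entry appears on the same row as its neighbour in the result above.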

...but if I remove that and run it against my own dataset/table, I get no results...

It looks like your real data is "slightly" different from the data shown in your question.

Try the following - it should work now :o)

#standardSQL
SELECT id, deps
FROM `my-test-23.lab.clj_project_files`, 
  UNNEST(SPLIT(REGEXP_REPLACE(
    REGEXP_EXTRACT(content, r'(?s) :dependencies \[(\[.*?])]') 
  , r']\n *\[', '],['))) AS deps
ORDER BY id   
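The two changes in this version are the (?s) flag and the newline-aware separator. A minimal sketch of why they matter, assuming the real rows hold multi-line project.clj text with one dependency per line (the thread never shows the real data, so the sample below is hypothetical) and again using Python's `re` in place of RE2:

```python
import re

# Hypothetical multi-line sample: one dependency per line.
content = ('(defproject demo "0.1.0"\n'
           '  :dependencies [[a "1"]\n'
           '                 [b "2"]\n'
           '                 [c "3"]]\n'
           '  :plugins [[d "4"]])')

pattern = r' :dependencies \[(\[.*?])]'

# Without (?s), "." does not match "\n", so the match cannot span lines.
without_s = re.search(pattern, content)

# With (?s), "." also matches "\n" and the whole block is captured.
with_s = re.search(r'(?s)' + pattern, content)

# Collapse "]", newline, optional spaces, "[" into "],[" and split.
deps = re.sub(r']\n *\[', '],[', with_s.group(1)).split(',')

print(without_s)
print(deps)
```

If the real data separates entries some other way (tabs, `\r\n`, comments between entries), the separator pattern would need adjusting accordingly.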

Comments:

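For some reason I get "Error: Table name cannot be resolved: dataset name is missing."

@user228137 - instead of yourTable you need to put your real, fully qualified table name, such as project.dataset.table, enclosed in backticks - see the updated answer

You should not use [ and ] here - use backticks as in the answer - because this is standard SQL. Please try again and let us know

So just change/fix the pattern :o) - I did not have a chance to follow this post closely :o) - if you still have a problem, you should provide a sample of your actual data, otherwise there is not much more I can do to help

@user228137 - please check the update in my answer. I had to tweak it a bit to make it work the way I expect. Could you try it and tell me what you need to be different - so we can get to a final version. As I said, I made some adjustments with replace and split, when really it should be a better regex. In the meantime, replace and split seem to do the job :o)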
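On the bracket versus backtick point: legacy SQL writes a table as [project:dataset.table], while standard SQL expects `project.dataset.table` in backticks. A small helper can do the rewrite mechanically. The function name `legacy_to_standard` is made up for this sketch and is not part of any BigQuery client library.

```python
import re


def legacy_to_standard(table_ref: str) -> str:
    """Convert a legacy SQL reference such as [project:dataset.table]
    into the backtick-quoted standard SQL form `project.dataset.table`.

    Hypothetical helper for illustration only.
    """
    m = re.fullmatch(r'\[([^:\]]+):([^.\]]+)\.([^\]]+)\]', table_ref)
    if m is None:
        raise ValueError(f'not a legacy table reference: {table_ref!r}')
    project, dataset, table = m.groups()
    return f'`{project}.{dataset}.{table}`'


print(legacy_to_standard('[my-test-23:lab.clj_project_files]'))
```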
