clojure - string-concat with group by in maps of maps


Given input data from a JDBC source, for example:

  (def input-data
    [{:doc_id 1 :doc_seq 1 :doc_content "this is a very long "}
     {:doc_id 1 :doc_seq 2 :doc_content "sentence from a mainframe "}
     {:doc_id 1 :doc_seq 3 :doc_content "system that was built before i was "}
     {:doc_id 1 :doc_seq 4 :doc_content "born."}
     {:doc_id 2 :doc_seq 1 :doc_content "this is a another very long "}
     {:doc_id 2 :doc_seq 2 :doc_content "sentence from the same mainframe "}
     {:doc_id 3 :doc_seq 1 :doc_content "Ok here we are again. "}
     {:doc_id 3 :doc_seq 2 :doc_content "The mainframe only had 40 char per field so"}
     {:doc_id 3 :doc_seq 3 :doc_content "they broke it into multiple rows "}
     {:doc_id 3 :doc_seq 4 :doc_content "which seems to be common"}
     {:doc_id 3 :doc_seq 5 :doc_content " for the time. "}
     {:doc_id 3 :doc_seq 6 :doc_content "thanks for your help."}])

I want to group by doc_id and concatenate the doc_content strings, so that my output looks like this:

  [{:doc_id 1 :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
   {:doc_id 2 :doc_content "this is a another very long sentence ... clip..."}
   {:doc_id 3 :doc_content "... clip..."}]

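I was thinking about using group-by, but that outputs a map, and I need to output something lazy because the input data set could be very large. Maybe I could run some combination of group-by and reduce-kv to get what I am looking for... or maybe something with frequencies, if I can coerce it to be lazy.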

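I can guarantee that the input will be sorted; I will put the ORDER BY (in SQL) on doc_id and doc_seq, so the only thing this program is responsible for is the aggregate / string-concat part. The input sequence as a whole may be large, but any particular doc_id in it should only have a few dozen doc_seq rows.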

Any tips appreciated.

Answer

partition-by is lazy, so as long as the rows of each individual document fit in memory, this should work:

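(defn collapse-docs
  "Merges the rows of one document into a single map: string values are
  concatenated in order; for any other value the last row's value wins."
  [docs]
  (apply merge-with
         (fn [l r]
           (if (string? r)
             (str l r)   ; append the next chunk of text
             r))         ; non-string keys (:doc_id, :doc_seq): keep the latest
         docs))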

(sequence ;; you may want to use eduction here, depending on use case
  (comp
    (partition-by :doc_id)
    (map collapse-docs))
  input-data)
=>
({:doc_id 1,
  :doc_seq 4,
  :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
 {:doc_id 2,
  :doc_seq 2,
  :doc_content "this is a another very long sentence from the same mainframe "}
 {:doc_id 3,
  :doc_seq 6,
  :doc_content "Ok here we are again. The mainframe only had 40 char per field sothey broke it into multiple rows which seems to be common for the time. thanks for your help."})
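
Note that the result above still carries the last :doc_seq of each document, while the question asked for maps with only :doc_id and :doc_content. A minimal sketch of one way to get that shape is below. It is a variation on the answer, not part of it: the names collapse-doc, sample-rows, collapsed and total-chars are hypothetical, and sample-rows is a tiny stand-in for the real JDBC result. It also shows the eduction form that the comment in the answer hints at, which may suit a single pass over the data with reduce.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the JDBC rows, already sorted by doc_id, doc_seq.
(def sample-rows
  [{:doc_id 1 :doc_seq 1 :doc_content "a "}
   {:doc_id 1 :doc_seq 2 :doc_content "b"}
   {:doc_id 2 :doc_seq 1 :doc_content "c"}])

(defn collapse-doc
  "Collapses the rows of one document (one partition) into a single map
  holding only :doc_id and the concatenated :doc_content."
  [rows]
  {:doc_id      (:doc_id (first rows))
   :doc_content (str/join (map :doc_content rows))})

;; Same transducer pipeline as in the answer, with the alternative collapse fn.
(def collapsed
  (sequence
    (comp (partition-by :doc_id)
          (map collapse-doc))
    sample-rows))

;; eduction variant: nothing is cached; the pipeline runs each time it is reduced.
(def total-chars
  (reduce (fn [n doc] (+ n (count (:doc_content doc))))
          0
          (eduction (partition-by :doc_id) (map collapse-doc) sample-rows)))
```

As in the answer, this relies on the rows arriving sorted by doc_id: partition-by only groups adjacent rows, so an unsorted input would yield several maps for the same document.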
