clojure - string-concat with group by in maps of maps
Given input data from a JDBC source such as:
(def input-data
  [{:doc_id 1 :doc_seq 1 :doc_content "this is a very long "}
   {:doc_id 1 :doc_seq 2 :doc_content "sentence from a mainframe "}
   {:doc_id 1 :doc_seq 3 :doc_content "system that was built before i was "}
   {:doc_id 1 :doc_seq 4 :doc_content "born."}
   {:doc_id 2 :doc_seq 1 :doc_content "this is a another very long "}
   {:doc_id 2 :doc_seq 2 :doc_content "sentence from the same mainframe "}
   {:doc_id 3 :doc_seq 1 :doc_content "Ok here we are again. "}
   {:doc_id 3 :doc_seq 2 :doc_content "The mainframe only had 40 char per field so"}
   {:doc_id 3 :doc_seq 3 :doc_content "they broke it into multiple rows "}
   {:doc_id 3 :doc_seq 4 :doc_content "which seems to be common"}
   {:doc_id 3 :doc_seq 5 :doc_content " for the time. "}
   {:doc_id 3 :doc_seq 6 :doc_content "thanks for your help."}])
I want to group by doc_id and string-concat the doc_content, so my output would look like this:
[{:doc_id 1 :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
 {:doc_id 2 :doc_content "this is a another very long sentence ... clip..."}
 {:doc_id 3 :doc_content "... clip..."}]
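I was thinking about using group-by, but that returns a map, and I need the output to be lazy because the input data set could be very large. Maybe I could run some combination of group-by and reduce-kv to get what I am looking for... or maybe something with frequencies, if I can coerce it to be lazy.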
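I can guarantee that the input will be sorted; I will put the ordering (via SQL) on doc_id and doc_seq, so the only thing this program is responsible for is the aggregate/string-concat part. The input sequence as a whole may be large, but any particular doc_id in it should have only a few dozen doc_seq rows.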
Any tips appreciated.
Answer
partition-by is lazy, so as long as the rows for each individual doc fit in memory, this should work:
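;; Merge all rows for one doc_id into a single map: string values are
;; concatenated in order; for any other value the last one wins.
(defn collapse-docs [docs]
  (apply merge-with
         (fn [l r]
           (if (string? r)
             (str l r)
             r))
         docs))

(sequence ;; you may want to use eduction here, depending on use case
 (comp
  (partition-by :doc_id)
  (map collapse-docs))
 input-data)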
=>
({:doc_id 1,
:doc_seq 4,
:doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
{:doc_id 2, :doc_seq 2, :doc_content "this is a another very long sentence from the same mainframe "}
{:doc_id 3,
:doc_seq 6,
:doc_content "Ok here we are again. The mainframe only had 40 char per field sothey broke it into multiple rows which seems to be common for the time. thanks for your help."})
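Note that the result above still carries the last :doc_seq of each group, whereas the output requested in the question has only :doc_id and :doc_content. If that matters, one possible variation is sketched below. The names collapse-doc, collapse-xf, sample-rows, collapsed and total-chars are hypothetical and not part of the original answer, and the sample data is a shortened stand-in for input-data. It also shows the eduction option mentioned in the comment, consumed with reduce.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not from the original answer): collapse the rows of
;; one doc into a map that keeps only :doc_id and the joined :doc_content.
(defn collapse-doc [rows]
  {:doc_id      (:doc_id (first rows))
   :doc_content (str/join (map :doc_content rows))})

;; Transducer: group consecutive rows that share a :doc_id, then collapse
;; each group. Only the rows of the current doc are held at any one time.
(def collapse-xf
  (comp (partition-by :doc_id)
        (map collapse-doc)))

;; Shortened stand-in for input-data, already ordered by :doc_id and
;; :doc_seq as the question guarantees.
(def sample-rows
  [{:doc_id 1 :doc_seq 1 :doc_content "this is "}
   {:doc_id 1 :doc_seq 2 :doc_content "doc one."}
   {:doc_id 2 :doc_seq 1 :doc_content "doc "}
   {:doc_id 2 :doc_seq 2 :doc_content "two."}])

;; Sequence of collapsed docs, realized incrementally as it is consumed.
(def collapsed (sequence collapse-xf sample-rows))

;; eduction builds no intermediate sequence; the work happens when it is
;; reduced. Here: total number of characters across all collapsed docs.
(def total-chars
  (reduce (fn [n doc] (+ n (count (:doc_content doc))))
          0
          (eduction collapse-xf sample-rows)))
```

Like the answer's version, this relies on the rows arriving sorted by doc_id and doc_seq: partition-by only groups adjacent rows, so an unsorted input would yield several partial maps for the same doc_id.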