Efficiently read and sort a file containing lines of text in Haskell

Posted 2022-01-19 02:33:34

I happen to be sorting a list of German words by natural frequency¹. I am not happy with the memory performance of my algorithm.
The graphs were created with hp/D3.js. They show the runtime heap for V1, V2 and V3 as in the code below.

I uploaded the full code, including short instructions on how to run it with profiling (via stack and via nix), on GitHub here. It is also pasted in full below.
Version 1 reads the two big files using strict IO from Data.Text.IO. Versions 2 and 3 use lazy IO from Data.Text.Lazy.IO. The difference shows nicely in the heap profiles: with strict IO everything is there right away, while versions 2 and 3 build up memory gradually.
Sizes of the data structures

Based on these formulae I can give fairly accurate sizes, and I know what is in the files: the German words there are about 16 characters long on average. These numbers are not read off the profiler output but computed independently:
mapFrequencies: 550 MB (HashMap Text Int)
ls: 200 MB ([Text])
vec: 167 MB (Vector Text)
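For reference, a back-of-the-envelope sketch of how such estimates can be computed (the constants are assumptions in the spirit of the linked formulae, not measured values, and the word count of 2 million is made up for illustration):

-- all constants here are assumptions, not measurements
wordSize :: Int
wordSize = 8 -- bytes per machine word on a 64-bit system

-- text < 2.0 stores UTF-16: roughly 6 words of overhead plus 2 bytes per character
textBytes :: Int -> Int
textBytes nChars = 6 * wordSize + 2 * nChars

-- a list costs about 3 words per cons cell, plus the elements themselves
listBytes :: Int -> Int -> Int
listBytes n elemBytes = n * (3 * wordSize + elemBytes)

main :: IO ()
main = print (listBytes 2000000 (textBytes 16)) -- ~208 MB, close to the 200 MB above

The HashMap estimate works the same way, just with the per-node overhead of its tree structure instead of the cons cells.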
我不明白的地方
除此之外,我完全迷失了。我正在尝试理解这些问题:
Why are my {-# SCC foo #-} annotations being ignored? I don't have control over the cost centres in profiling. This happens on GHC 8.8.4 as well as GHC 9.2.1, with nix/cabal and with stack.
The profile suggests a peak memory usage of a bit more than 1 GB. Yet, running top, I can see the algorithm really uses 2.6 GB. That is nearly twice the amount. Shouldn't those amounts be equal?
Where does garbage collection happen? My suspicion is: nowhere. Versions 2 and 3 show some garbage collection, but only of the additional memory they build up beyond the usage of version 1.
Can I expect a leaner memory profile at all, given my choice of a hashmap, a list, and a vector? Just the hashmap and the vector together add up to 717 MB, less than half of what I see in top. How would I get there?
Are there other data structures better suited for this kind of task? I picked a vector for the sorting algorithm. Because of Text I cannot move to any of Storable, Unboxed or Primitive (at least I don't know how; a possible workaround is sketched after this list).
In the summary of the runtime statistics (see below), it says "Productivity 43.5%". My guess is that the profiling itself is part of the cause. But, judging by the numbers, is there also excessive garbage-collector activity?
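As for the Storable/Unboxed/Primitive question above, one workaround (this is a sketch of mine, not code from the question) is to keep the Text values in a boxed vector but sort an unboxed vector of Int indices, keyed by frequencies that are looked up once per word:

import Control.Monad.ST (runST)
import Data.Ord (Down (Down), comparing)
import Data.Text (Text)
import Data.Vector (Vector)
import qualified Data.Vector as Vector
import qualified Data.Vector.Algorithms.Tim as Tim
import qualified Data.Vector.Unboxed as Unboxed

-- sort descending by frequency without ever thawing the boxed vector:
-- precompute one frequency per word, sort unboxed Int indices by those
-- keys, then permute the words once at the end
sortByFrequency :: (Text -> Int) -> Vector Text -> Vector Text
sortByFrequency freqOf ws = Vector.generate n ((ws Vector.!) . (idx Unboxed.!))
  where
    n = Vector.length ws
    keys = Unboxed.generate n (freqOf . (ws Vector.!)) :: Unboxed.Vector Int
    idx = runST $ do
      mIdx <- Unboxed.thaw (Unboxed.enumFromN 0 n)
      Tim.sortBy (comparing (Down . (keys Unboxed.!))) mIdx
      Unboxed.freeze mIdx

As a bonus, the frequencies are memoized: one lookup per word instead of one per comparison.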
-- app/Main.hs
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Category ((<<<))
import Control.Monad.ST (runST)
import Data.Functor ((<&>))
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.Maybe (catMaybes, fromMaybe)
import Data.Ord (Down (Down), comparing)
import Data.Text (Text)
import qualified Data.Text as Text
import qualified Data.Text.IO as Text
import qualified Data.Text.Lazy as Lazy
import qualified Data.Text.Lazy.IO as Lazy
import Data.Vector (Vector, freeze, thaw)
import qualified Data.Vector as Vector
import qualified Data.Vector.Algorithms.Tim as Tim
import System.IO (hFlush, stdout)
import GHC.Conc (pseq)
main :: IO ()
main = do
  putStr ""
  putStr "Running v1 ..."
  hFlush stdout
  u1 <- runV1
  putStrLn $ u1 `seq` " done."
  putStrLn ""
  putStr "Running v2 ..."
  hFlush stdout
  u2 <- runV2
  putStrLn $ u2 `seq` " done."
  putStrLn ""
  putStr "Running v3 ..."
  hFlush stdout
  u3 <- runV3
  putStrLn $ u3 `seq` " done."
fileFrequencies :: FilePath
fileFrequencies = "deu_news_2020_freq.txt"
fileData :: FilePath
fileData = "german.utf8.dic"
fileSorted :: FilePath
fileSorted = "german.utf8.sorted.dic"
{- |
straightforward implementation, using Text-based IO
-}
runV1 :: IO ()
runV1 = do
  mapFrequencies <- readFrequencies
  ls <- Text.lines <$> Text.readFile fileData
  let sorted = quicksort mapFrequencies $ {-# SCC vec #-} Vector.fromList ({-# SCC ls #-} ls)
  Text.writeFile fileSorted $ Text.unlines $ {-# SCC lsSorted #-} Vector.toList ({-# SCC sorted #-} sorted)
  where
    {-# SCC readFrequencies #-}
    readFrequencies :: IO (HashMap Text Int)
    readFrequencies = do
      ls <- Text.lines <$> Text.readFile fileFrequencies
      pure $ {-# SCC hmap #-} mkHashMap ({-# SCC ls #-} ls)
{- |
why not Lazy? read the file line by line, no need to hold it all in memory
-}
runV2 :: IO ()
runV2 = do
  mapFrequencies <- readFrequencies
  ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileData
  let sorted = quicksort mapFrequencies $ {-# SCC vec #-} Vector.fromList ({-# SCC ls #-} ls)
  Text.writeFile fileSorted $ Text.unlines $ {-# SCC lsSorted #-} Vector.toList ({-# SCC sorted #-} sorted)
  where
    {-# SCC readFrequencies #-}
    readFrequencies :: IO (HashMap Text Int)
    readFrequencies = do
      ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
      pure $ {-# SCC hmap #-} mkHashMap ({-# SCC ls #-} ls)
{- |
trying to help with garbage collection, only making it worse
-}
runV3 :: IO ()
runV3 = do
  mapFrequencies <- readFrequencies
  ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileData
  let -- alternatives:
      --   Vector.fromListN (length ls) ls
      --   Vector.generate (length ls) $ \i -> ls !! i
      vec = {-# SCC vec #-} Vector.fromList ({-# SCC ls #-} ls)
      -- the idea: ls can get garbage-collected ...
      sorted = vec `seq` {-# SCC sorted #-} quicksort mapFrequencies vec
  -- ... before we sort and write to the file
  sorted `pseq` Lazy.writeFile fileSorted (Lazy.unlines $ Lazy.fromStrict <$> {-# SCC lsSorted #-} Vector.toList sorted)
  where
    readFrequencies :: IO (HashMap Text Int)
    readFrequencies = do
      ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
      pure $ {-# SCC hmap #-} mkHashMap ({-# SCC ls #-} ls)
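A note on the seq/pseq in runV3 (my addition, not the author's): they only evaluate the vector to weak head normal form, which materializes the array but not the Text elements inside, so the toStrict thunks can still retain chunks of the lazy input. A sketch of forcing the elements too, assuming a dependency on the deepseq package (Vector and Text both come with NFData instances):

import Control.DeepSeq (force)
import Data.Text (Text)
import Data.Vector (Vector)

-- evaluates the vector *and* every Text inside it to normal form, so
-- the lazy file contents behind the thunks can be collected early
forceVector :: Vector Text -> Vector Text
forceVector = force

In runV3, sorted = quicksort mapFrequencies (forceVector vec) would then be the fully strict variant.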
freq :: HashMap Text Int -> Text -> Int
freq m w = fromMaybe 0 $ HashMap.lookup w m
quicksort ::
  HashMap Text Int -> Vector Text -> Vector Text
quicksort freqs vec = runST $ do
  mvec <- thaw vec
  Tim.sortBy (comparing $ Down <<< freq freqs) mvec
  freeze mvec
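Despite its name, quicksort actually runs Timsort from vector-algorithms on a mutable copy. A usage sketch with made-up data (it sorts descending by frequency):

-- >>> let freqs = HashMap.fromList [("die", 3), ("Igel", 1)]
-- >>> quicksort freqs (Vector.fromList ["Igel", "die"])
-- ["die","Igel"]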
mkHashMap :: [Text] -> HashMap Text Int
mkHashMap ls =
  HashMap.fromList $
    catMaybes $
      ls <&> \l -> case Text.head l of
        '#' -> Nothing
        _ ->
          let [w, f] = Text.splitOn "\t" l
           in Just (w, read $ Text.unpack f)
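Note that mkHashMap is partial: Text.head crashes on an empty line, and the irrefutable pattern [w, f] crashes on any line without exactly one tab. A more defensive sketch (my variant, reusing the imports of the module above plus Data.Maybe (mapMaybe) and Data.Text.Read):

import Data.Maybe (mapMaybe)
import qualified Data.Text.Read as Text.Read

mkHashMapSafe :: [Text] -> HashMap Text Int
mkHashMapSafe = HashMap.fromList . mapMaybe parse
  where
    -- skip comment lines and anything that is not "word<TAB>count"
    parse l = case Text.splitOn "\t" l of
      [w, f]
        | not (Text.null w)
        , not ("#" `Text.isPrefixOf` w)
        , Right (n, rest) <- Text.Read.decimal f
        , Text.null rest -> Just (w, n)
      _ -> Nothing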
Summary of the runtime statistics (+RTS -s):
343,377,611,904 bytes allocated in the heap
1,345,257,485,736 bytes copied during GC
1,489,914,240 bytes maximum residency (1608 sample(s))
203,039,648 bytes maximum slop
2829 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 328286 colls, 0 par 12.067s 12.117s 0.0000s 0.0114s
Gen 1 1608 colls, 0 par 1001.504s 1001.547s 0.6229s 1.0471s
INIT time 0.000s ( 0.000s elapsed)
MUT time 160.134s (160.481s elapsed)
GC time 663.692s (663.771s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 349.879s (349.893s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1173.705s (1174.145s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,144,311,061 bytes per MUT second
Productivity 43.5% of total user, 43.5% of total elapsed
¹ The word-frequency information was provided to me by the Natural Language Processing Group at Uni Leipzig. It is generated from a corpus of 35 million sentences and distributed under the Creative Commons Attribution-NonCommercial 4.0 International Public Licence.
Comments:

imgur.com/a/Rq35gug Heap profiles broken down by type. Quite a bit of memory sits in tuples, but only in V2 and V3.

Answer 1:

Edit: I am adding new results at the top. Below you can still find the less interesting results of earlier optimizations.
The short peak on the right is the optimized code; the big peak on the left is there for comparison.
This is the code (courtesy of @bodigrim on discourse.haskell.org):
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BS
import qualified Data.ByteString.Builder as BSB
import qualified Data.ByteString.Char8 as BS (lines, readInt)
import Data.List (sortOn)
import qualified Data.Map.Strict as Map
main :: IO ()
main = do
  mapFrequencies <- Map.fromList . parseFrequencies <$> BS.readFile fileFrequencies
  ls <- BS.lines <$> BS.readFile fileData
  let sorted = sortOn (\k -> Map.findWithDefault 0 k mapFrequencies) ls
  BSB.writeFile fileSorted $ foldMap ((<> "\n") . BSB.byteString) sorted
fileFrequencies :: FilePath
fileFrequencies = "deu_news_2020_freq.txt"
fileData :: FilePath
fileData = "german.utf8.dic"
fileSorted :: FilePath
fileSorted = "german.utf8.sorted.dic"
parseFrequencies :: BS.ByteString -> [(BS.ByteString, Int)]
parseFrequencies bs = case BS.uncons bs of
  Nothing -> []
  -- this is admittedly brittle, just to demonstrate single-pass parsing with readInt
  -- (35, 10 and 9 are the ASCII codes of '#', '\n' and '\t')
  Just (35, _) -> parseFrequencies (BS.unsafeTail (BS.dropWhile (/= 10) bs))
  _ -> let (w, f) = BS.break (== 9) bs in
    case BS.readInt (BS.unsafeTail f) of
      Just (i, bs') -> (w, i) : parseFrequencies (BS.unsafeTail bs')
      Nothing -> []
Takeaways

Custom bytestring parsing pays off big time in terms of performance.
Data.List.sort is awesome: it allows for early garbage collection during sorting (a note on how sortOn works follows below).
In my case, the runtime performance gains of a HashMap don't justify the extra memory, so a Map is fine, even though my lookups involve bytestring comparisons.
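Regarding the second point: sortOn decorates each element with its key before sorting, so the Map lookup runs only once per line, at the cost of one tuple per element. Its definition in base is essentially:

sortOn :: Ord b => (a -> b) -> [a] -> [a]
sortOn f =
  map snd . sortBy (comparing fst) . map (\x -> let y = f x in y `seq` (y, x))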
My old results

I gained some insight and a few optimizations (new code here):
The leftmost graph is the heap profile of version 1, using strict IO via Data.Text.IO. It is the same heap as before; I only switched to profiling by type (via +RTS -hT), given that my manual cost centres don't work:
ls <- Text.lines <$> Text.readFile fileData
-- ls then converted into vector
ls <- Text.lines <$> Text.readFile fileFrequencies
-- ls then converted into a strict hashmap
The graph in the middle is what I got from switching to lazy IO:
ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileData
-- ls then converted into vector
ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
-- ls then converted into a strict hashmap
There are some gains for ARR_WORDS, but apart from that, things got worse. I got the cleanest results in version 3, using:
ls <- (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile fileData
-- ls then converted into vector
ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile fileFrequencies
-- ls then converted into a strict hashmap
Tweaking strict/lazy evaluation

In conclusion, for reading a big file into my strict hashmap,
ls <- fmap Lazy.toStrict . Lazy.lines <$> Lazy.readFile file
seems to be the way to go. For reading a big file and converting the data into a vector via Vector.fromList,
ls <- (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile file
seems to be required. The latter line doesn't have any downsides compared to the former (as far as I can tell) and might become my standard way of reading text files line by line.
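To capture that pattern, a small helper could look like this (a sketch; the name is made up):

import Control.Monad ((<$!>))
import Data.Text (Text)
import qualified Data.Text.Lazy as Lazy
import qualified Data.Text.Lazy.IO as Lazy

-- reads the file lazily, but converts each line to strict Text as soon
-- as the list cell is demanded, leaving no lazy-text thunks behind
readLinesStrictly :: FilePath -> IO [Text]
readLinesStrictly file = (Lazy.toStrict <$!>) . Lazy.lines <$> Lazy.readFile file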
Tweaking the garbage collector

I learned that GHC's (copying) garbage collector uses twice the amount of live memory. So, given the heap profile, a memory usage of 2.4 GB is to be expected.

I could optimize that via +RTS -A, i.e. setting the allocation area size, which defaults to 1 MB:
/usr/bin/env time -f '%M' cabal run readFilePerformance -- +RTS -s -A64M
45,138,177,376 bytes allocated in the heap
1,917,960,640 bytes copied during GC
569,104,376 bytes maximum residency (9 sample(s))
136,677,384 bytes maximum slop
1494 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 665 colls, 0 par 3.446s 3.446s 0.0052s 0.0248s
Gen 1 9 colls, 0 par 1.186s 1.186s 0.1318s 0.6386s
INIT time 0.000s ( 0.000s elapsed)
MUT time 19.486s ( 19.591s elapsed)
GC time 4.632s ( 4.632s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 24.118s ( 24.224s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,316,458,656 bytes per MUT second
Productivity 80.8% of total user, 80.9% of total elapsed
1532300
Compare this with the default settings:
/usr/bin/env time -f '%M' cabal run readFilePerformance -- +RTS -s
45,188,257,608 bytes allocated in the heap
2,943,601,200 bytes copied during GC
836,171,600 bytes maximum residency (14 sample(s))
197,200,048 bytes maximum slop
1889 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 43127 colls, 0 par 3.157s 3.163s 0.0001s 0.0117s
Gen 1 14 colls, 0 par 1.954s 1.954s 0.1396s 1.0358s
INIT time 0.000s ( 0.000s elapsed)
MUT time 17.347s ( 17.465s elapsed)
GC time 5.111s ( 5.117s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 22.458s ( 22.582s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,604,931,438 bytes per MUT second
Productivity 77.2% of total user, 77.3% of total elapsed
1937440
By increasing the allocation area size to 64 MB, I got down from 1.9 GB to about 1.5 GB of RAM usage, at a slightly increased runtime. Going beyond 64 MB did not yield significantly better results.
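To bake that setting into the binary instead of passing it on every run, GHC's -with-rtsopts link flag can be used, e.g. in the .cabal file:

ghc-options: -rtsopts "-with-rtsopts=-A64M"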
Switching to the garbage collector's compacting algorithm (via +RTS -c) did not improve the memory footprint.