Understanding memory usage of this Haskell program

Posted: 2016-09-06 16:51:23

Question:

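I should start by saying that I'm a beginner with Haskell and the pipes library, and I'd like to understand what causes the high memory usage of this program in the test function.

In particular, in the fold that produces the r1 value in test, I see MyRecord values accumulate until the final result is produced, unless deepseq is used. On my sample dataset of ~500000 lines / ~230 MB, memory usage grows beyond 1.5 GB.

The fold that produces the r2 value runs in constant memory.

What I'd like to understand is:

1) What could be causing the build-up of MyRecord values in the first fold, and why does using deepseq fix it? I was throwing things at it fairly randomly before I arrived at deepseq to get constant memory usage, but I'd like to understand why it works. Is it possible to achieve constant memory usage without deepseq while still producing the same Maybe Int result type?

2) What is different about the second fold that keeps it from exhibiting the same problem?

I know that if I were only working with integers instead of tuples, I could use the built-in sum function from Pipes.Prelude, but I'll eventually want to process the second element, which contains any parsing errors.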

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE ScopedTypeVariables #-}

module Test where

import           Control.Arrow
import           Control.DeepSeq
import           Control.Monad
import           Data.Aeson
import           Data.Function
import           Data.Maybe
import           Data.Monoid
import           Data.Text (Text)

import           Pipes
import qualified Pipes.Aeson as PA (DecodingError(..))
import qualified Pipes.Aeson.Unchecked as PA
import qualified Pipes.ByteString as PB
import qualified Pipes.Group as PG
import qualified Pipes.Parse as PP
import qualified Pipes.Prelude as P

import           System.IO
import           Control.Lens
import qualified Control.Foldl as Fold

data MyRecord = MyRecord
  { myRecordField1 :: !Text
  , myRecordField2 :: !Int
  , myRecordField3 :: !Text
  , myRecordField4 :: !Text
  , myRecordField5 :: !Text
  , myRecordField6 :: !Text
  , myRecordField7 :: !Text
  , myRecordField8 :: !Text
  , myRecordField9 :: !Text
  , myRecordField10 :: !Int
  , myRecordField11 :: !Text
  , myRecordField12 :: !Text
  , myRecordField13 :: !Text
  } deriving (Eq, Show)

instance FromJSON MyRecord where
  parseJSON (Object o) =
    MyRecord <$> o .: "field1" <*> o .: "field2" <*> o .: "field3" <*>
    o .: "field4" <*>
    o .: "field5" <*>
    o .: "field6" <*>
    o .: "field7" <*>
    o .: "field8" <*>
    o .: "field9" <*>
    (read <$> o .: "field10") <*>
    o .: "field11" <*>
    o .: "field12" <*>
    o .: "field13"
  parseJSON x = fail $ "MyRecord: expected Object, got: " <> show x

instance ToJSON MyRecord where
    toJSON _ = undefined

test :: IO ()
test = do
  withFile "some-file" ReadMode $ \hIn
  {-

      the pipeline is composed as follows:

      1 a producer reading a file with Pipes.ByteString, splitting chunks into lines,
        and parsing the lines as JSON to produce tuples of (Maybe MyRecord, Maybe
        ByteString), the second element being an error if parsing failed

      2 a pipe filtering that tuple on a field of Maybe MyRecord, passing matching
        (Maybe MyRecord, Maybe ByteString) downstream

      3 and a pipe that picks an Int field out of Maybe MyRecord, passing (Maybe Int,
        Maybe ByteString) downstream

      pipeline == 1 >-> 2 >-> 3

      memory profiling indicates the memory build up is due to accumulation of
      MyRecord "objects", and data types comprising their fields (mainly
      Text/ARR_WORDS)

  -}
   -> do
    let pipeline = f1 hIn >-> f2 >-> f3
    -- need to use deepseq to avoid leaking memory
    r1 <-
      P.fold
        (\acc (v, _) -> (+) <$> (acc `deepseq` acc) <*> pure (fromMaybe 0 v))
        (Just 0)
        id
        (pipeline :: Producer (Maybe Int, Maybe PB.ByteString) IO ())
    print r1
    hSeek hIn AbsoluteSeek 0
    -- this works just fine as is and streams in constant memory
    r2 <-
      P.fold
        (\acc v ->
           case fst v of
             Just x -> acc + x
             Nothing -> acc)
        0
        id
        (pipeline :: Producer (Maybe Int, Maybe PB.ByteString) IO ())
    print r2
    return ()
  return ()

f1
  :: (FromJSON a, MonadIO m)
  => Handle -> Producer (Maybe a, Maybe PB.ByteString) m ()
f1 hIn = PB.fromHandle hIn & asLines & resumingParser PA.decode

f2
  :: Pipe (Maybe MyRecord, Maybe PB.ByteString) (Maybe MyRecord, Maybe PB.ByteString) IO r
f2 = filterRecords (("some value" ==) . myRecordField5)

f3 :: Pipe (Maybe MyRecord, d) (Maybe Int, d) IO r
f3 = P.map (first (fmap myRecordField10))

filterRecords
  :: Monad m
  => (MyRecord -> Bool)
  -> Pipe (Maybe MyRecord, Maybe PB.ByteString) (Maybe MyRecord, Maybe PB.ByteString) m r
filterRecords predicate =
  for cat $ \(l, e) ->
    when (isNothing l || (predicate <$> l) == Just True) $ yield (l, e)

asLines
  :: Monad m
  => Producer PB.ByteString m x -> Producer PB.ByteString m x
asLines p = Fold.purely PG.folds Fold.mconcat (view PB.lines p)

parseRecords
  :: (Monad m, FromJSON a, ToJSON a)
  => Producer PB.ByteString m r
  -> Producer a m (Either (PA.DecodingError, Producer PB.ByteString m r) r)
parseRecords = view PA.decoded

resumingParser
  :: Monad m
  => PP.StateT (Producer a m r) m (Maybe (Either e b))
  -> Producer a m r
  -> Producer (Maybe b, Maybe a) m ()
resumingParser parser p = do
  (x, p') <- lift $ PP.runStateT parser p
  case x of
    Nothing -> return ()
    Just (Left _) -> do
      (x', p'') <- lift $ PP.runStateT PP.draw p'
      yield (Nothing, x')
      resumingParser parser p''
    Just (Right b) -> do
      yield (Just b, Nothing)
      resumingParser parser p'

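Comments:

Check the info section of the haskell tag, and please post how you compiled and ran the binary.

Because seq (Just undefined) () = (), whereas seq (undefined :: Int) () = undefined.

Consider forceMaybe Nothing = Nothing; forceMaybe x@(Just !_) = x

Thanks for the comments, everyone! If I understand what has been said, the problem comes from accumulating the result inside a Maybe, with the fold function creating a dependency on, or reference to, the MyRecord values. The intermediate values inside the constructor are not evaluated until the result of the whole fold is demanded, which can cause the MyRecord values to pile up. deepseq forces whatever is "hidden" inside the Maybe and allows the MyRecord values to be garbage collected. Is that a fair assessment of what is going on?

By the way, you could use P.sum (pipeline >-> P.map (fromMaybe 0 . fst))

Answer 1: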

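As mentioned in the docs for Pipes.Prelude.fold, the fold is strict. However, the strictness is implemented with $!, which only forces evaluation to WHNF (weak head normal form). WHNF is enough to fully evaluate a simple type like Int, but it is not strong enough to fully evaluate a more complex type like Maybe Int. For a Maybe Int, WHNF only determines whether the value is Nothing or Just; the Int inside the Just can remain an unevaluated thunk.

Some examples: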

main1 = do
  let a = 3 + undefined
      b = seq a 10
  print b                -- error: Exception: Prelude.undefined

main2 = do
  let a = Just (3 + undefined)
      b = seq a 10
  print b                -- no exception

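In the first case, the variable acc is a Just wrapping a big thunk: the sum of all of the elements. On each iteration acc goes from Just a to Just (a+b) to Just (a+b+c), and so on. The addition is not performed during the fold; it is only done at the very end. The large memory usage comes from keeping this ever-growing sum in memory.

In the second case, $! reduces the sum to a plain Int on every iteration.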
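The difference between the two accumulators can be observed without pipes at all. The following is a minimal sketch, assuming foldl' from Data.List as a stand-in for P.fold (both force the accumulator only to WHNF); stepMaybe, stepInt and inputs are illustrative names, and one input is deliberately undefined so that we can see whether it is ever added.

```haskell
import Control.Exception (SomeException, evaluate, try)
import Data.List (foldl')
import Data.Maybe (fromMaybe, isJust)

-- Same shape as the first fold in the question, minus deepseq.
stepMaybe :: Maybe Int -> Maybe Int -> Maybe Int
stepMaybe acc v = (+) <$> acc <*> pure (fromMaybe 0 v)

-- Same shape as the second fold: a plain Int accumulator.
stepInt :: Int -> Maybe Int -> Int
stepInt acc v = case v of
  Just x  -> acc + x
  Nothing -> acc

-- One element is poisoned so we can tell whether it is ever evaluated.
inputs :: [Maybe Int]
inputs = [Just 1, Just undefined, Just 3]

main :: IO ()
main = do
  -- WHNF of a Maybe only exposes the Just constructor, so the additions
  -- (including the poisoned one) stay behind as an unevaluated thunk.
  let r1 = foldl' stepMaybe (Just 0) inputs
  print (isJust r1) -- prints True

  -- WHNF of an Int is the number itself, so the poisoned addition is forced.
  r2 <- try (evaluate (foldl' stepInt 0 inputs)) :: IO (Either SomeException Int)
  putStrLn (either (const "Int fold hit undefined") show r2)
```

The Maybe fold completes without ever touching the undefined element, which is the same deferred work that shows up as retained memory in the question's program.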

Besides using deepseq, you can also use force:

force x = x `deepseq` x

As mentioned in the deepseq docs, in combination with ViewPatterns you can create a pattern that fully evaluates a function argument:

{-# LANGUAGE BangPatterns, ViewPatterns #-}

...
P.fold
  (\(force -> !acc) (v,_) -> (+) <$> acc <*> pure (fromMaybe 0 v))
  (Just 0)
  ...
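For the Maybe Int accumulator specifically, deepseq is probably not required. Here is a sketch along the lines of the forceMaybe suggestion in the comments on the question, written with seq instead of a bang pattern. It assumes a small list-backed producer (sample, a hypothetical stand-in for the real pipeline) and String in place of ByteString for the error slot.

```haskell
import Data.Maybe (fromMaybe)
import Pipes (Producer, each)
import qualified Pipes.Prelude as P

-- Force the Int inside a Just before handing the Maybe back.
forceMaybe :: Maybe Int -> Maybe Int
forceMaybe Nothing    = Nothing
forceMaybe m@(Just x) = x `seq` m

-- Same step as the first fold, but the new sum is evaluated as soon as
-- P.fold forces the accumulator to WHNF.
step :: Maybe Int -> (Maybe Int, Maybe String) -> Maybe Int
step acc (v, _) = forceMaybe ((+) <$> acc <*> pure (fromMaybe 0 v))

-- Hypothetical stand-in for f1 hIn >-> f2 >-> f3.
sample :: Monad m => Producer (Maybe Int, Maybe String) m ()
sample = each [(Just 1, Nothing), (Nothing, Just "bad line"), (Just 3, Nothing)]

main :: IO ()
main = do
  r <- P.fold step (Just 0) id sample
  print r -- prints Just 4
```

Because the accumulator never holds more than an evaluated Int, nothing in it should keep earlier records alive. For accumulators with deeper structure, force remains the more general tool.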

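Comments:

Thanks @ErikR. Am I right in thinking that the accumulation of these thunks in the fold accumulator would also cause the MyRecord values not to be released?

Yes, because the sum contains references to your MyRecord values through the calls to myRecordField10.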
