Erlang: monitoring multiple processes

Posted 2023-03-15

【Title】Erlang monitor multiple processes 【Asked】2017-10-30 05:28:56 【Question】:

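I need to monitor a bunch of worker processes. Currently I can monitor 1 process with 1 monitor. How do I extend this to monitor N worker processes? Do I need to spawn N monitors as well? If so, what happens if one of the spawned monitors fails or crashes?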

【Comments】:

【Answer 1】:

> Do I need to spawn N monitors as well?

No. A single process can monitor all N workers:

-module(mo).
-compile(export_all).

worker(Id) ->
    timer:sleep(1000 * rand:uniform(5)),
    io:format("Worker~w: I'm still alive~n", [Id]),
    worker(Id).

create_workers(N) ->
    Workers = [  % {{Pid, Ref}, Id}
        {spawn_monitor(?MODULE, worker, [Id]), Id}
        || Id <- lists:seq(1, N)
    ],
    monitor_workers(Workers).

monitor_workers(Workers) ->
    receive
        {'DOWN', Ref, process, Pid, Why} ->
            Worker = {Pid, Ref},
            case is_my_worker(Worker, Workers) of
                true  ->
                    NewWorkers = replace_worker(Worker, Workers, Why),
                    io:format("Old Workers:~n~p~n", [Workers]),
                    io:format("New Workers:~n~p~n", [NewWorkers]),
                    monitor_workers(NewWorkers);
                false ->
                    monitor_workers(Workers)
            end;
        _Other ->
            monitor_workers(Workers)
    end.

is_my_worker(Worker, Workers) ->
    lists:keymember(Worker, 1, Workers).

replace_worker(Worker, Workers, Why) ->
    {{Pid, _}, Id} = lists:keyfind(Worker, 1, Workers),
    %% ~p rather than ~s: the exit reason can be any term, not just an atom.
    io:format("Worker~w (~w) went down: ~p~n", [Id, Pid, Why]),
    NewWorkers = lists:keydelete(Worker, 1, Workers),
    NewWorker = spawn_monitor(?MODULE, worker, [Id]),
    [{NewWorker, Id}|NewWorkers].

start() ->
    observer:start(),  %%In the Processes tab, you can right click on a worker and kill it.
    create_workers(4).

In the shell:

$ ./run
Erlang/OTP 19 [erts-8.2] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V8.2  (abort with ^G)


1> Worker3: I'm still alive
Worker1: I'm still alive
Worker2: I'm still alive
Worker4: I'm still alive
Worker3: I'm still alive
Worker1: I'm still alive
Worker4: I'm still alive
Worker2: I'm still alive
Worker3: I'm still alive
Worker1: I'm still alive
Worker4: I'm still alive
Worker3 (<0.87.0>) went down: killed
Old Workers:
[{{<0.85.0>,#Ref<0.0.4.292>},1},
 {{<0.86.0>,#Ref<0.0.4.293>},2},
 {{<0.87.0>,#Ref<0.0.4.294>},3},
 {{<0.88.0>,#Ref<0.0.4.295>},4}]
New Workers:
[{{<0.2386.0>,#Ref<0.0.1.416>},3},
 {{<0.85.0>,#Ref<0.0.4.292>},1},
 {{<0.86.0>,#Ref<0.0.4.293>},2},
 {{<0.88.0>,#Ref<0.0.4.295>},4}]
Worker2: I'm still alive
Worker1: I'm still alive
Worker2: I'm still alive
Worker1: I'm still alive
Worker1: I'm still alive
Worker4: I'm still alive
Worker3: I'm still alive
Worker2: I'm still alive
Worker1: I'm still alive
Worker3: I'm still alive
Worker4: I'm still alive
Worker1: I'm still alive
Worker4 (<0.88.0>) went down: killed
Old Workers:
[{{<0.2386.0>,#Ref<0.0.1.416>},3},
 {{<0.85.0>,#Ref<0.0.4.292>},1},
 {{<0.86.0>,#Ref<0.0.4.293>},2},
 {{<0.88.0>,#Ref<0.0.4.295>},4}]
New Workers:
[{{<0.5322.0>,#Ref<0.0.1.9248>},4},
 {{<0.2386.0>,#Ref<0.0.1.416>},3},
 {{<0.85.0>,#Ref<0.0.4.292>},1},
 {{<0.86.0>,#Ref<0.0.4.293>},2}]
Worker3: I'm still alive
Worker2: I'm still alive
Worker4: I'm still alive
Worker1: I'm still alive
Worker3: I'm still alive
Worker3: I'm still alive
Worker2: I'm still alive
Worker1 (<0.85.0>) went down: killed
Old Workers:
[{{<0.5322.0>,#Ref<0.0.1.9248>},4},
 {{<0.2386.0>,#Ref<0.0.1.416>},3},
 {{<0.85.0>,#Ref<0.0.4.292>},1},
 {{<0.86.0>,#Ref<0.0.4.293>},2}]
New Workers:
[{{<0.5710.0>,#Ref<0.0.1.10430>},1},
 {{<0.5322.0>,#Ref<0.0.1.9248>},4},
 {{<0.2386.0>,#Ref<0.0.1.416>},3},
 {{<0.86.0>,#Ref<0.0.4.293>},2}]
Worker2: I'm still alive
Worker3: I'm still alive
Worker4: I'm still alive
Worker3: I'm still alive

I think the following version may be more efficient: it uses lists:map() to find and replace the crashed worker, so it traverses the list of workers only once:

-module(mo).
-compile(export_all).

worker(Id) ->
    timer:sleep(1000 * rand:uniform(5)),
    io:format("Worker~w: I'm still alive~n", [Id]),
    worker(Id).

create_workers(N) ->
    Workers = [  % {{Pid, Ref}, Id}
        {spawn_monitor(?MODULE, worker, [Id]), Id}
        || Id <- lists:seq(1, N)
    ],
    monitor_workers(Workers).

monitor_workers(Workers) ->
    receive
        {'DOWN', Ref, process, Pid, Why} ->
            CrashedWorker = {Pid, Ref},
            NewWorkers = replace(CrashedWorker, Workers, Why),
            io:format("Old Workers:~n~p~n", [Workers]),
            io:format("New Workers:~n~p~n", [NewWorkers]),
            monitor_workers(NewWorkers);
        _Other ->
            monitor_workers(Workers)
    end.

replace(CrashedWorker, Workers, Why) ->
    lists:map(fun(PidRefId) ->
                      {{Pid, _Ref} = Worker, Id} = PidRefId,
                      case Worker =:= CrashedWorker of
                          true ->  % replace worker
                              %% ~p rather than ~s: the reason can be any term.
                              io:format("Worker~w (~w) went down: ~p~n",
                                        [Id, Pid, Why]),
                              {spawn_monitor(?MODULE, worker, [Id]), Id}; % => {{Pid, Ref}, Id}
                          false ->  % leave worker alone
                              PidRefId
                      end
              end,
              Workers).

start() ->
    observer:start(),  %%In the Processes tab, you can right click on a worker and kill it.
    create_workers(4).

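> If so, what happens if one of the spawned monitors fails or crashes?

Erlang has multiple server farms in different countries, and Erlang has multiple redundant power grids, so Erlang will restart everything in a fault-tolerant distributed system that never fails. It's all built in. You don't have to worry about anything. :)

Actually... anywhere you can imagine a failure, it has to be backed up, e.g. by another monitoring process on another computer.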
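To make the "backed up by another monitoring process" idea concrete, here is a minimal sketch of a watchdog that monitors the monitor and restarts it whenever it goes down. The module name `watchdog` and its functions are hypothetical, not part of the answer above, and the sketch runs on a single node; a watchdog on another machine would additionally need distributed Erlang.

```erlang
-module(watchdog).              % hypothetical module name
-export([start/1, loop/1]).

%% MonitorFun is a zero-arity fun that runs the monitoring loop,
%% e.g. fun() -> mo:create_workers(4) end (an assumed usage).
start(MonitorFun) ->
    spawn(?MODULE, loop, [MonitorFun]).

loop(MonitorFun) ->
    {Pid, Ref} = spawn_monitor(MonitorFun),
    receive
        {'DOWN', Ref, process, Pid, _Why} ->
            %% The monitor died, so start a fresh one.
            loop(MonitorFun)
    end.
```

Note one caveat under these assumptions: workers started with spawn_monitor are not linked to their monitor, so they keep running when the monitor dies, and a restarted monitor would start a second set of workers. The watchdog itself is also a single point of failure, which is the problem OTP supervision trees are designed to address.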

【Comments】:

【Answer 2】:

Don't spawn and then monitor; that has caused production problems in the past. Use spawn_monitor instead.

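You can start and monitor multiple processes from your supervising process. If you look at the documentation for monitor, you will see that every time a monitored process dies, a message like this:

{'DOWN', MonitorRef, Type, Object, Info}

is sent to the supervising process that was monitoring the process that just died.

You can then decide what to do. MonitorRef is the reference you got back when you started monitoring the process. Object is the pid of the process that died, or {RegisteredName, Node} if you monitored it by its registered name. Info is the exit reason.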
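The shape of the 'DOWN' message can be checked with a short sketch. The module name `down_demo` is hypothetical. It first receives the 'DOWN' for a process that exits with reason `boom`, then monitors the same, already dead, pid again to show that a 'DOWN' message is still delivered in that case, with the reason `noproc`.

```erlang
-module(down_demo).             % hypothetical module name
-export([run/0]).

run() ->
    %% Spawn and monitor atomically; the process exits at once.
    {Pid, Ref} = spawn_monitor(fun() -> exit(boom) end),
    Reason = receive
                 {'DOWN', Ref, process, Pid, Why} -> Why
             after 1000 -> timeout
             end,
    %% Pid is dead by now. Monitoring it anyway still yields a
    %% 'DOWN' message, this time with the reason noproc.
    Ref2 = erlang:monitor(process, Pid),
    Late = receive
               {'DOWN', Ref2, process, Pid, Why2} -> Why2
           after 1000 -> timeout
           end,
    {Reason, Late}.
```

Calling `down_demo:run()` returns `{boom, noproc}`.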

Writing some sample code with monitors is a good exercise, but try to stick to the OTP libraries and OTP supervisors.
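As a rough idea of what the OTP route could look like, here is a minimal sketch of a supervisor that starts N workers and restarts any that die. The module name `mo_sup`, the worker functions, and the restart limits (5 restarts in 10 seconds) are assumptions for illustration; real workers would usually be gen_server or other proc_lib-based processes rather than bare spawn_link loops.

```erlang
-module(mo_sup).                % hypothetical module name
-behaviour(supervisor).
-export([start_link/1, init/1, worker_start_link/1, worker_loop/1]).

%% Start the supervisor, then add N permanent workers to it.
start_link(N) ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    lists:foreach(
      fun(Id) ->
              Spec = #{id => {worker, Id},
                       start => {?MODULE, worker_start_link, [Id]},
                       restart => permanent},
              {ok, _Pid} = supervisor:start_child(Sup, Spec)
      end,
      lists:seq(1, N)),
    {ok, Sup}.

init([]) ->
    %% one_for_one: only the worker that died is restarted.
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    {ok, {SupFlags, []}}.

%% Called in the supervisor process, so the worker is linked to it.
worker_start_link(Id) ->
    {ok, spawn_link(?MODULE, worker_loop, [Id])}.

%% Placeholder worker: idles until it receives stop.
worker_loop(Id) ->
    receive
        stop -> ok
    after 1000 ->
        worker_loop(Id)
    end.
```

After `{ok, Sup} = mo_sup:start_link(4)`, killing any worker (for example from the observer) should lead the supervisor to start a replacement, and `supervisor:count_children(Sup)` reports the number of active children.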

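【Comments】:

"Don't spawn and then monitor": I have done exactly that, and the monitoring process still received the exit message, unlike with link().

spawn_monitor exists for historical reasons, to avoid the error where a process dies before it is monitored. That does not happen often, and in fact if you monitor a dead process you will still receive a message. Even so, while you are learning to write your own supervisor, it is good practice to use spawn_monitor rather than spawn followed by monitor.

"To avoid the error where a process dies before it is monitored": as far as I can tell, there is zero difference. I understand why spawn_link() was added, but monitor() does not seem to suffer from the same problem.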
