将正在运行的 ipython 笔记本与进程匹配

Posted

技术标签:

【中文标题】将正在运行的 ipython 笔记本与进程匹配【英文标题】:Match a running ipython notebook to a process 【发布时间】:2014-07-16 05:20:52 【问题描述】:

我的服务器运行许多长时间运行的笔记本,我想监控笔记本的内存。

有没有办法在 pid 或进程名称和笔记本之间进行匹配?

【问题讨论】:

【参考方案1】:

这是可能的,尽管我只能想到我在下面概述的相当老套的解决方案。总结:

    从驻留在服务器security 目录中的相应 json 连接文件中获取每个内核 (id) 正在侦听的端口 解析对netstat 的调用输出以确定哪个 pid 正在侦听步骤 1 中找到的端口 查询服务器的会话 url 以查找哪个内核 id 映射到哪个会话,从而映射到哪个笔记本。见the ipython wiki for the api。尽管运行 IPython 2.1.0 并不是所有这些都对我有用,但会话 url 可以。

我怀疑有一个更简单的方法,但我还不确定在哪里可以找到它。

import glob
import os.path
import posixpath
import re
import json
import subprocess
import urllib2

# the url and port at which your notebook server listens
server_path = 'http://localhost'
server_port = 8888
# the security directory of the notebook server, containing its connections files
server_sec_dir = 'C:/Users/Josh/.ipython/profile_default/security/'

# part 1 : open all the connection json files to find their port numbers
kernels = 
for json_path in glob.glob(os.path.join(server_sec_dir, 'kernel-*.json')):
    control_port = json.load(open(json_path, 'r'))['control_port']
    key = os.path.basename(json_path)[7:-5]
    kernels[control_port] = 'control_port': control_port, 'key': key

# part2 : get netstat info for which processes use which tcp ports
netstat_ouput = subprocess.check_output(['netstat', '-ano'])
# parse the netstat output to map ports to PIDs
netstat_regex = re.compile(
    "^\s+\w+\s+" # protocol word
    "\d+(\.\d+)3:(\d+)\s+" # local ip:port
    "\d+(\.\d+)3:(\d+)\s+" # foreign ip:port 
    "LISTENING\s+" # connection state
    "(\d+)$" # PID
)
for line in netstat_ouput.splitlines(False):
    match = netstat_regex.match(line)
    if match and match.lastindex == 5:
        port = int(match.group(2))
        if port in kernels:
            pid = int(match.group(5))
            kernels[port]['pid'] = pid

# reorganize kernels to use 'key' as keys
kernels = kernel['key']: kernel for kernel in kernels.values()

# part 3 : find which kernel corresponds to which notebook
# by querying the notebook server api for sessions
sessions_url = posixpath.join('%s:%d' % (server_path, server_port),
                              'api','sessions')
response = urllib2.urlopen(sessions_url).read()
for session in json.loads(response):
    key = session['kernel']['id']
    if key in kernels:
        nb_path = os.path.join(session['notebook']['path'],
                               session['notebook']['name'])
        kernels[key]['nb_path'] = nb_path

# now do what you will with the dict. I just print a pretty list version:
print json.dumps(kernels.values(), sort_keys=True, indent=4)

输出(对我来说,目前):

[
    
        "key": "9142896a-34ca-4c01-bc71-e5709652cac5",
        "nb_path": "2015/2015-01-16\\serhsdh.ipynb",
        "pid": 11436,
        "port": 56173
    ,
    
        "key": "1ddedd95-5673-45a6-b0fb-a3083debb681",
        "nb_path": "Untitled0.ipynb",
        "pid": 11248,
        "port": 52191
    ,
    
        "key": "330343dc-ae60-4f5c-b9b8-e5d05643df19",
        "nb_path": "ipynb\\temp.ipynb",
        "pid": 4680,
        "port": 55446
    ,
    
        "key": "888ad49b-5729-40c8-8d53-0e025b03ecc6",
        "nb_path": "Untitled2.ipynb",
        "pid": 7584,
        "port": 55401
    ,
    
        "key": "26d9ddd2-546a-40b4-975f-07403bb4e048",
        "nb_path": "Untitled1.ipynb",
        "pid": 10916,
        "port": 55351
    
]

【讨论】:

【参考方案2】:

由于问题是关于监控笔记本的内存,我编写了一个完整的示例来显示正在运行的笔记本的内存消耗。它基于出色的@jcb91 answer 和其他一些答案(1、2、3、4)。

import json
import os
import os.path
import posixpath
import subprocess
import urllib2

import pandas as pd
import psutil


def show_notebooks_table(host, port):
    """Show table with info about running jupyter notebooks.

    Args:
        host: host of the jupyter server.
        port: port of the jupyter server.

    Returns:
        DataFrame with rows corresponding to running notebooks and following columns:
            * index: notebook kernel id.
            * path: path to notebook file.
            * pid: pid of the notebook process.
            * memory: notebook memory consumption in percentage.
    """
    notebooks = get_running_notebooks(host, port)
    prefix = long_substr([notebook['path'] for notebook in notebooks])
    df = pd.DataFrame(notebooks)
    df = df.set_index('kernel_id')
    df.index.name = prefix
    df.path = df.path.apply(lambda x: x[len(prefix):])
    df['pid'] = df.apply(lambda row: get_process_id(row.name), axis=1)
    # same notebook can be run in multiple processes
    df = expand_column(df, 'pid')
    df['memory'] = df.pid.apply(memory_usage_psutil)
    return df.sort_values('memory', ascending=False)


def get_running_notebooks(host, port):
    """Get kernel ids and paths of the running notebooks.

    Args:
        host: host at which the notebook server is listening. E.g. 'localhost'.
        port: port at which the notebook server is listening. E.g. 8888.
        username: name of the user who runs the notebooks.

    Returns:
        list of dicts kernel_id: notebook kernel id, path: path to notebook file.
    """
    # find which kernel corresponds to which notebook
    # by querying the notebook server api for sessions
    sessions_url = posixpath.join('http://%s:%d' % (host, port), 'api', 'sessions')
    response = urllib2.urlopen(sessions_url).read()
    res = json.loads(response)
    notebooks = ['kernel_id': notebook['kernel']['id'],
                  'path': notebook['notebook']['path'] for notebook in res]
    return notebooks


def get_process_id(name):
    """Return process ids found by (partial) name or regex.

    Source: https://***.com/a/44712205/304209.
    >>> get_process_id('kthreadd')
    [2]
    >>> get_process_id('watchdog')
    [10, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61]  # ymmv
    >>> get_process_id('non-existent process')
    []
    """
    child = subprocess.Popen(['pgrep', '-f', name], stdout=subprocess.PIPE, shell=False)
    response = child.communicate()[0]
    return [int(pid) for pid in response.split()]


def memory_usage_psutil(pid=None):
    """Get memory usage percentage by current process or by process specified by id, like in top.

    Source: https://***.com/a/30014612/304209.

    Args:
        pid: pid of the process to analyze. If None, analyze the current process.

    Returns:
        memory usage of the process, in percentage like in top, values in [0, 100].
    """
    if pid is None:
        pid = os.getpid()
    process = psutil.Process(pid)
    return process.memory_percent()


def long_substr(strings):
    """Find longest common substring in a list of strings.

    Source: https://***.com/a/2894073/304209.

    Args:
        strings: list of strings.

    Returns:
        longest substring which is found in all of the strings.
    """
    substr = ''
    if len(strings) > 1 and len(strings[0]) > 0:
        for i in range(len(strings[0])):
            for j in range(len(strings[0])-i+1):
                if j > len(substr) and all(strings[0][i:i+j] in x for x in strings):
                    substr = strings[0][i:i+j]
    return substr


def expand_column(dataframe, column):
    """Transform iterable column values into multiple rows.

    Source: https://***.com/a/27266225/304209.

    Args:
        dataframe: DataFrame to process.
        column: name of the column to expand.

    Returns:
        copy of the DataFrame with the following updates:
            * for rows where column contains only 1 value, keep them as is.
            * for rows where column contains a list of values, transform them
                into multiple rows, each of which contains one value from the list in column.
    """
    tmp_df = dataframe.apply(
        lambda row: pd.Series(row[column]), axis=1).stack().reset_index(level=1, drop=True)
    tmp_df.name = column
    return dataframe.drop(column, axis=1).join(tmp_df)

这是show_notebooks_table('localhost', 8888) 的示例输出:

【讨论】:

非常好!我必须做的一个小修正,我的笔记本没有公共前缀,所以函数long_substr 返回了.ipynb,所以我删除了过去,因为它只是从路径的每个开头截断了 6 个字符 感谢您接受答复!随时更新其中的代码,以便它也适用于您的情况。 使用python3,将urllib.request 替换为urllib2 后,我在此行收到403 错误:response = urllib.request.urlopen(sessions_url).read().... @Him 我认为问题在于授权 - jupyter 提高了安全性,现在您需要添加授权标头,如令牌或密码。这是一个要点gist.github.com/2torus/dbeb47b81eb18f2baf20bff6adf7b805,它会提示您输入令牌,然后将其添加到请求标头中。【参考方案3】:

添加到 Dennis Golomazov 的 answer 到:

使代码与 Python 3 兼容 允许登录到受密码保护的会话

我用这个 (source) 替换了 get_running_notebooks 函数:

import requests
import posixpath
import json

def get_running_notebooks(host, port, password=''):
    """
    Get kernel ids and paths of the running notebooks.

    Args:
        host: host at which the notebook server is listening. E.g. 'localhost'.
        port: port at which the notebook server is listening. E.g. 8888.
            
    Returns:
        list of dicts kernel_id: notebook kernel id, path: path to notebook file.
    """
        
    BASE_URL = 'http://0:1/'.format(host, port)
        
    # Get the cookie data
    s = requests.Session()
    url = BASE_URL + 'login?next=%2F'
    resp = s.get(url)
    xsrf_cookie = resp.cookies['_xsrf']

    # Login with the password
    params = '_xsrf': xsrf_cookie, 'password': password
    res = s.post(url, data=params)

    # Find which kernel corresponds to which notebook
    # by querying the notebook server api for sessions
    url = posixpath.join(BASE_URL, 'api', 'sessions')
    ret = s.get(url)
    #print('Status code:', ret.status_code)

    # Get the notebook list
    res = json.loads(ret.text)
    notebooks = ['kernel_id': notebook['kernel']['id'],
                  'path': notebook['notebook']['path'] for notebook in res]
    return notebooks

【讨论】:

【参考方案4】:

我来这里是为了寻找这个问题的简单答案,所以我会把它发布给其他人。

import os
os.getpid()

【讨论】:

以上是关于将正在运行的 ipython 笔记本与进程匹配的主要内容,如果未能解决你的问题,请参考以下文章

在 ipython 笔记本中显示 matplotlib 时出错

无法建立 iPython 笔记本 Websocket 连接

如何在 IPython 笔记本中打开交互式 matplotlib 窗口?

Tensorflow Numpy不能在ipython笔记本上运行Python2.7和3.5

调整 ipython 笔记本输出窗口的大小

在 AWS EMR v4.0.0 上使用 Pyspark 配置 Ipython/Jupyter 笔记本