How to normalize seaborn distplot?

Posted 2023-03-12

【Title】: How to normalize seaborn distplot? 【Posted】: 2019-08-03 08:06:04 【Question】:

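For reproducibility, I am sharing the dataset [here][1].

Here is what I am doing: starting from column 2, I read the current row and compare it with the value in the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I want to divide the current value (smaller) by the previous value (larger). Hence the following code: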
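A minimal sketch of the step described above, assuming the data consists of two numeric columns (time, value). The function name `decreasing_quotients` and the demo numbers are hypothetical, not taken from the shared dataset:

```python
import numpy as np

def decreasing_quotients(times, values):
    """Return (times, quotients) for every row whose value dropped."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    previous = values[:-1]   # value in the row above
    current = values[1:]     # value in the current row
    # positions (indexed on the previous row) where the value decreased
    drop = np.where(current < previous)[0]
    return times[drop], current[drop] / previous[drop]

# Hypothetical data: time column and value column
demo_times = [0, 1, 2, 3, 4]
demo_values = [10, 12, 6, 8, 4]
demo_t, demo_q = decreasing_quotients(demo_times, demo_values)
print(demo_t, demo_q)
```

Because a smaller value is always divided by a larger one, every quotient is below 1 as long as the values are positive.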

This produces the following plot.

sns.distplot(quotient, hist=False, label=protname)

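As the plot shows,

Data-V has a quotient of 0.8 when quotient_times is less than 3, and [...] if quotient_times is greater than 3.

I want to normalize the values so that the y-axis of the second plot lies between 0 and 1. How can we do that in Python?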

【Comments】:

norm_hist=True? No, that does not help. 【Answer 1】:

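Preamble

As far as I understand, seaborn distplot performs a KDE estimate by default. If you want a normalized distplot, it is probably because you assume the y values of the plot should be bounded in [0;1]. If so, a Stack Overflow question has already raised the issue of kde estimators showing values above 1.

Quoting one answer:

A continuous pdf (pdf = probability density function) never says the value is less than 1; for a continuous random variable, the pdf function p(x) is not a probability. You can refer to continuous random variables and their distributions.

Quoting the first comment from importanceofbeingernest:

The integral over a pdf is 1. There is no contradiction here.

As far as I know, it is the CDF (cumulative distribution function) whose values are supposed to be in [0; 1].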
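To make the distinction concrete, here is a small sketch, assuming a normal distribution with a standard deviation of 0.1 (an arbitrary choice for illustration). Its density peaks well above 1, yet the area under the curve is 1 and the CDF stays within [0, 1]:

```python
import numpy as np
from scipy import stats

# Narrow normal distribution: small spread forces a tall density peak
dist = stats.norm(loc=0.0, scale=0.1)

x = np.linspace(-1.0, 1.0, 2001)   # covers +/- 10 standard deviations
pdf = dist.pdf(x)
cdf = dist.cdf(x)

# Trapezoidal rule for the area under the density curve
area = np.sum((pdf[1:] + pdf[:-1]) / 2.0 * np.diff(x))

print("max pdf:", pdf.max())        # larger than 1
print("area under pdf:", area)      # close to 1
print("cdf range:", cdf.min(), cdf.max())
```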

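Note: all the continuous distributions that can be fitted are listed on the SciPy site and available in the package scipy.stats.

Maybe also have a look at probability mass functions?

If you really want to normalize the same graph, then you should gather the actual data points of the plotted function (Option 1) or of the function definition (Option 2), normalize them yourself, and plot them again.

Option 1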

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy version            : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2) = plt.subplots(1,2, sharey=False, sharex=False)
    g = sns.distplot(quotient, hist=True, label=protname, ax=ax1, rug=True)
    ax1.set_title('basic distplot (kde=True)')
    # get distplot line points
    line = g.get_lines()[0]
    xd = line.get_xdata()
    yd = line.get_ydata()
    # https://***.com/questions/29661574/normalize-numpy-array-columns-in-python
    def normalize(x):
        return (x - x.min(0)) / np.ptp(x, 0)  # np.ptp also works on NumPy 2.x, where ndarray.ptp was removed
    #normalize points
    yd2 = normalize(yd)
    # plot them in another graph
    ax2.plot(xd, yd2)
    ax2.set_title('basic distplot (kde=True)\nwith normalized y plot values')

    plt.show()

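Option 2

Below, I tried to perform a KDE and normalize the resulting estimate. I am not a statistics expert, so my use of the KDE may be wrong in some respects. It differs from seaborn's, as the screenshot shows, because seaborn does the job far better than I do; my code only tries to mimic the KDE fit with scipy. The result is not too bad, I guess.

Screenshot:

Code: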

import numpy as np
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy version            : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2, ax3, ax4) = plt.subplots(1,4, sharey=False, sharex=False)
    diff=quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()

    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')

    # taken from seaborn's source code (utils.py and distributions.py)
    def seaborn_kde_support(data, bw, gridsize, cut, clip):
        if clip is None:
            clip = (-np.inf, np.inf)
        support_min = max(data.min() - bw * cut, clip[0])
        support_max = min(data.max() + bw * cut, clip[1])
        return np.linspace(support_min, support_max, gridsize)

    kde_estim = stats.gaussian_kde(quotient, bw_method='scott')

    # manual linearization of data
    #linearized = np.linspace(quotient.min(), quotient.max(), num=500)

    # or better: mimic seaborn's internal stuff
    bw = kde_estim.scotts_factor() * np.std(quotient)
    linearized = seaborn_kde_support(quotient, bw, 100, 3, None)

    # computes values of the estimated function on the estimated linearized inputs
    Z = kde_estim.evaluate(linearized)

    # https://***.com/questions/29661574/normalize-numpy-array-columns-in-python
    def normalize(x):
        return (x - x.min(0)) / np.ptp(x, 0)  # np.ptp also works on NumPy 2.x, where ndarray.ptp was removed

    # normalize so it is between 0;1
    Z2 = normalize(Z)
    for name, func in {'min': np.min, 'max': np.max}.items():
        print('{}: source={}, normalized={}'.format(name, func(Z), func(Z2)))

    # plot is different from seaborns because not exact same method applied
    ax3.plot(linearized, Z, ".", label=protname, color="orange")
    ax3.set_title('Non normalized gaussian kde values')

    # manual kde result with Y axis values normalized (between 0;1)
    ax4.plot(linearized, Z2, ".", label=protname, color="green")
    ax4.set_title('Normalized gaussian kde values')

    plt.show()

Output:

System versions          : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions          : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy version            : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version          : 0.9.0
min: source=0.0021601491646143518, normalized=0.0
max: source=9.67319154426489, normalized=1.0

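Contrary to the comments, plotting:

[(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]

does not change the behavior! It only changes the source data of the kernel density estimate. The shape of the curve stays the same.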
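This can be checked numerically. The sketch below assumes scipy's gaussian_kde with its default bandwidth rule and uses random stand-in data. Min-max scaling the data only rescales the axes of the estimate: the density of the scaled data equals the original density multiplied by the data range, so the curve keeps its shape and is still not bounded by 1.

```python
import numpy as np
from scipy import stats

# Stand-in for the real `quotient` array (hypothetical data)
rng = np.random.default_rng(0)
data = rng.uniform(0.4, 0.9, size=200)

lo = data.min()
span = data.max() - data.min()
scaled = (data - lo) / span          # min-max scaled copy of the data

kde_raw = stats.gaussian_kde(data)
kde_scaled = stats.gaussian_kde(scaled)

# Evaluate both estimates at corresponding points
grid = np.linspace(data.min(), data.max(), 50)
raw_curve = kde_raw(grid)
scaled_curve = kde_scaled((grid - lo) / span)

# Change of variables: density of (x - lo) / span is span * density of x
same_shape = np.allclose(scaled_curve, raw_curve * span)
print("same shape up to a constant factor:", same_shape)
```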

Quoting seaborn's distplot doc:

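This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data.

By default:

kde : bool, optional (set to True). Whether to plot a gaussian kernel density estimate.

So the KDE is used by default. Quoting seaborn's kde doc:

Fit and plot a univariate or bivariate kernel density estimate.

Quoting the SciPy gaussian kde method doc:

Representation of a kernel-density estimate using Gaussian kernels.

Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.

Note that I do believe your data are bimodal, as you mentioned yourself. They also look discrete. As far as I know, discrete distributions may not be analyzed the same way as continuous ones, and fitting them can prove tricky.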
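If the quotients really take only a handful of distinct values, an empirical probability mass function may be closer to what is wanted, since its values are genuine probabilities in [0, 1]. A minimal sketch, assuming that rounding to three decimals is an acceptable way to group equal values; `empirical_pmf` and the demo numbers are hypothetical:

```python
import numpy as np

def empirical_pmf(samples, decimals=3):
    """Return the distinct (rounded) values and their relative frequencies."""
    rounded = np.round(np.asarray(samples, dtype=float), decimals)
    values, counts = np.unique(rounded, return_counts=True)
    return values, counts / counts.sum()

# Hypothetical quotients with few distinct values
demo = [0.8, 0.8, 0.8, 0.5, 0.5, 0.75]
vals, probs = empirical_pmf(demo)
for v, p in zip(vals, probs):
    print(v, p)
```

The resulting pairs can be drawn with a bar or stem plot; the heights sum to 1.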

Here is an example with various distribution laws:

import numpy as np
from scipy.stats import uniform, powerlaw, logistic
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sys

print('System versions          : {}'.format(sys.version))
print('System versions          : {}'.format(sys.version_info))
print('Numpy version            : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version          : {}'.format(sns.__version__))

protocols = {}

types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }
    fig, [(ax1, ax2, ax3), (ax4, ax5, ax6)] = plt.subplots(2,3, sharey=False, sharex=False)
    diff=quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()
    quotient2 = [(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]
    print(quotient2)
    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')
    sns.distplot(quotient2, hist=True, label=protname, ax=ax3, rug=True)
    ax3.set_title('distplot of min-max scaled quotient')

    sns.distplot(quotient, hist=True, label=protname, ax=ax4, rug=True, kde=False, fit=uniform)
    ax4.set_title('uniform distplot')
    sns.distplot(quotient, hist=True, label=protname, ax=ax5, rug=True, kde=False, fit=powerlaw)
    ax5.set_title('powerlaw distplot')
    sns.distplot(quotient, hist=True, label=protname, ax=ax6, rug=True, kde=False, fit=logistic)
    ax6.set_title('logistic distplot')
    plt.show()

Output:

System versions          : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions          : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy version            : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version          : 0.9.0
[1.0, 0.05230125523012544, 0.0433775382360589, 0.024590765616971128, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.02836946874603772, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.03393500048652319, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.0037013196009011043, 0.0, 0.05230125523012544]

Screenshot:

【Comments】:

Have a look at the scipy link I provided: docs.scipy.org/doc/scipy-0.14.0/reference/generated/… 【Answer 2】:

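As of the latest update, sns.distplot is deprecated and sns.displot or sns.histplot should be used instead. So, to get a normalized histogram/density, use the following syntax:

sns.displot(x, kind='hist', stat='density');

or

sns.histplot(x, stat='density');

instead of

sns.distplot(x, kde=False, norm_hist=True);

PS: to get a density curve instead of a histogram, change the kind value to 'kde' in sns.displot.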
