How to generate a dataset from a given count, mean, standard deviation, min, max, etc.?
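【Title】: How to generate dataset from given count, mean, standard deviation, min, max etc?
【Posted】: 2020-08-30 19:12:01
【Question】:
I have all the statistics returned by the pandas DataFrame.describe() method, such as count, mean, standard deviation, min, max, and so on. I need to generate a dataset from these details. Is there an application or Python code that can do this? I want to generate any random dataset that has these statistics: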
count    263
mean     35.790875
std      24.874763
min      0.0000000
25%      16.000000
50%      32.000000
75%      49.000000
max      99.000000
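【Question comments】:
Could you add all of the statistics from describe to your question?
I'm a new user and can't embed images, so I've listed all the details as text.
Levi, I know you're new to SO. If you feel an answer solved the problem, please mark it as "accepted" by clicking the green check mark to the left of the answer. This helps keep the focus on older SO questions that still have no answers. Of course, if you're waiting for other answers, that's fine.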
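【Answer 1】:
Hi, and welcome to the forum! This is a great question; I like it.
I think this is non-trivial in the general case. You can create a dataset with the right count, mean, min, and percentiles, but the standard deviation is quite tricky.
Here is one way to get a dataset that meets the requirements of your example. It can be adapted to the general case, but expect many corner cases. The basic idea is to satisfy the requirements one at a time, from easiest to hardest, taking care not to invalidate an earlier requirement as you go.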
import math
from math import floor

import pandas as pd
COUNT = 263
MEAN = 35.790875
STD = 24.874763
MIN = 0
P25 = 16
P50 = 32
P75 = 49
MAX = 99
#Positions of the percentiles
P25_pos = floor(0.25 * COUNT) - 1
P50_pos = floor(0.5 * COUNT) - 1
P75_pos = floor(0.75 * COUNT) - 1
MAX_pos = COUNT -1
#Count requirement
v = [0] * COUNT
#Min requirement
v[0] = MIN
#Max requirement
v[MAX_pos] = MAX
#Good, we already satisfied the easiest 3 requirements. Notice that these are deterministic,
#there is only one way to satisfy them

#This will satisfy the 25th percentile requirement
for i in range(1, P25_pos):
    #We could also interpolate the value from P25 to P50, even adding a bit of randomness.
    v[i] = P25
v[P25_pos] = P25

#Actually pandas does some linear interpolation (https://***.com/questions/39581893/pandas-find-percentile-stats-of-a-given-column)
#when calculating percentiles but we can simulate that by letting the next value be also P25
if P25_pos + 1 != P50_pos:
    v[P25_pos + 1] = P25

#We do something extremely similar with the other percentiles
#(note that v[P25_pos + 2] keeps its initial value of 0 here)
for i in range(P25_pos + 3, P50_pos):
    v[i] = P50
v[P50_pos] = P50
if P50_pos + 1 != P75_pos:
    v[P50_pos + 1] = P50

for i in range(P50_pos + 1, P75_pos):
    v[i] = P50
v[P75_pos] = P75
#Compare positions with positions (the original compared against the value v[MAX_pos])
if P75_pos + 1 != MAX_pos:
    v[P75_pos + 1] = P75

for i in range(P75_pos + 1, MAX_pos):
    v[i] = P75
#This will give us correct 25%, 50%, 75%, min, max, and count values. We are still missing MEAN and std.

#The mean is still too low at this point (about 32.32 with this construction) and we need 35.790875.
#So we manually tweak the numbers between the 75th and 100th percentile.
#That is, numbers between pos 197 and 261.
#This would be much harder to do automatically instead of with a hardcoded example.

#This increases the average a bit, but not enough!
for i in range(P75_pos + 1, 215):
    v[i] = MAX

#We solve an equation to get the value v[215] needs for the mean to be what we want.
#This equation comes from the formula for the average: AVG = SUM/COUNT. We simply solve that formula for v[215].
new_value = MEAN * COUNT - sum(v) + v[215]

#The new value for v[215] should be between P75 and MAX so we don't invalidate the percentiles.
assert(P75 <= new_value)
assert(new_value <= MAX)
v[215] = new_value
#Now comes the tricky part: we need the correct std. As of now, it is 20.916364, and it should be higher: 24.874763
#For this, as we don't want to change the average, we are going to change values in pairs,
#as we need to compensate each absolute increase with an absolute decrease
for i in range(1, P25_pos - 3):
    #We can move the values between the 0th and 25th percentile between 0 and 16
    v[i] -= 12
    #Between the 25th and 50th percentile, we can move the values between 32 and 49
    v[P25_pos + 1 + i] += 12

#As of now, this got us a std of 24.258115. We need it to be a bit higher: 24.874763
#The trick we did before of imposing a value for getting the correct mean is much harder to do here,
#because the equation is much more complicated
#So we'll just approximate the value instead with a while loop. There are faster ways than this, see: https://en.wikipedia.org/wiki/Root-finding_algorithms
current_std = math.sqrt(sum([(val - MEAN)**2 for val in v])/(COUNT - 1))
while STD - current_std >= 10e-5:
    for i in range(1, P25_pos - 3):
        #We can move the values between the 0th and 25th percentile between 0 and 16
        v[i] -= 0.00001
        #Between the 25th and 50th percentile, we can move the values between 32 and 49
        v[P25_pos + 1 + i] += 0.00001
    current_std = math.sqrt(sum([(val - MEAN)**2 for val in v])/(COUNT - 1))

#We tweak some further decimal points now
while STD - current_std >= 10e-9:
    v[1] += 0.0001
    #Between the 25th and 50th percentile, we can move the values between 32 and 49
    v[P25_pos + 2] -= 0.0001
    current_std = math.sqrt(sum([(val - MEAN)**2 for val in v])/(COUNT - 1))
#Build the DataFrame from a dict that maps the column name to the values
df = pd.DataFrame({'col': v})

#Voila!
print(df.describe())
Output:
col
count 263.000000
mean 35.790875
std 24.874763
min 0.000000
25% 16.000000
50% 32.000000
75% 49.000000
max 99.000000
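A side note on the linear interpolation mentioned in the code comments: with pandas' default `interpolation='linear'`, the quantile q of n sorted values is read at the fractional position (n - 1) * q. The sketch below is only an illustration in plain Python; `quantile_position` and `linear_quantile` are hypothetical helper names, not part of pandas or of the answer above.

```python
import math

def quantile_position(n, q):
    """Return the two sorted indices and the weight of the upper one
    that linear interpolation uses for quantile q of n values."""
    pos = (n - 1) * q
    lower = math.floor(pos)
    upper = math.ceil(pos)
    return lower, upper, pos - lower

def linear_quantile(values, q):
    """Linearly interpolated quantile of a list of numbers."""
    ordered = sorted(values)
    lower, upper, weight = quantile_position(len(ordered), q)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

# Positions read for the 263-value example
for q in (0.25, 0.5, 0.75):
    print(q, quantile_position(263, q))
# 0.25 (65, 66, 0.5)
# 0.5 (131, 131, 0.0)
# 0.75 (196, 197, 0.5)
```

These are positions in the sorted data, so they do not have to coincide with the P25_pos, P50_pos, and P75_pos bookkeeping indices used in the code above. What matters is that both sorted neighbours of each quartile hold the target value.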
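【Comments】:
I want to generate the original dataset from the given statistics.
That is not possible. The statistics are a reduction of the original dataset.
You can of course simulate a dataset with these characteristics; is that what you want?
Yes, but how do I find random numbers with this mean, standard deviation, and so on?
Awesome! This really worked. Thanks. But it would be better if the numbers were more random.
【Answer 2】:
I just thought of another way to make the numbers look less artificial. It is much slower, so use it only if you don't mind the dataset being small. Here is an example for a dataset of size 40; to generate a larger one, change the value of the COUNT variable. The code can also be adapted to other target values: just change the constants at the top.
We start the same way as in my previous answer, satisfying every requirement except MEAN and STD: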
from math import floor
lr = 10e-6
COUNT = 40.0
MEAN = 35.790875
STD = 24.874763
MIN = 0.0
P25 = 16.0
P50 = 32.0
P75 = 49.0
MAX = 99.0
#Positions of the percentiles
P25_pos = floor(0.25 * COUNT) - 1
P50_pos = floor(0.5 * COUNT) - 1
P75_pos = floor(0.75 * COUNT) - 1
MAX_pos = int(COUNT -1)
#Count requirement
X = [0.0] * int(COUNT)
#Min requirement
X[0] = MIN
#Max requirement
X[MAX_pos] = MAX
#Good, we already satisfied the easiest 3 requirements. Notice that these are deterministic,
#there is only one way to satisfy them
#This will satisfy the 25th percentile requirement
for i in range(1, P25_pos):
    #We could also interpolate the value from P25 to P50, even adding a bit of randomness.
    X[i] = 0.0
X[P25_pos] = P25

#Actually pandas does some linear interpolation (https://***.com/questions/39581893/pandas-find-percentile-stats-of-a-given-column)
#when calculating percentiles but we can simulate that by letting the next value be also P25
if P25_pos + 1 != P50_pos:
    X[P25_pos + 1] = P25

#We do something extremely similar with the other percentiles
for i in range(P25_pos + 2, P50_pos):
    X[i] = P25
X[P50_pos] = P50
if P50_pos + 1 != P75_pos:
    X[P50_pos + 1] = P50

for i in range(P50_pos + 1, P75_pos):
    X[i] = P50
X[P75_pos] = P75
#Compare positions with positions (the original compared against the value X[MAX_pos])
if P75_pos + 1 != MAX_pos:
    X[P75_pos + 1] = P75

for i in range(P75_pos + 2, MAX_pos):
    X[i] = P75
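We then treat this as a (constrained) gradient descent problem: we want to minimize the difference between our MEAN and STD and the target MEAN and STD while keeping the quartile values fixed. The values we want to learn are the values in the dataset itself, excluding the quartile positions, since those are already constrained.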
def std(X):
    #Sample standard deviation (the original returned the variance because the square root was missing)
    m = sum(X) / len(X)
    return (sum([(val - m)**2 for val in X]) / (len(X) - 1)) ** 0.5

#This function measures the difference between our STD and MEAN and the expected values
def cost(X):
    m = sum(X) / len(X)
    return ((sum([(val - m)**2 for val in X])/(len(X) - 1) - STD**2)) ** 2 + (m - MEAN)**4

#You have to install this library (pip install autograd)
from autograd import grad #for automatically calculating gradients of functions

#This is the derivative of the cost and it is used in the gradient descent to update the values of the dataset
grad_cost = grad(cost)

def learn(lr, epochs):
    for j in range(0, epochs):
        #Evaluate the gradient once per epoch (the original re-evaluated it for every index,
        #which gives the same numbers but is much slower), then scale it by the learning rate
        full_grad = grad_cost(X)
        gr = [g * lr for g in full_grad]

        #Update only the free values, and only if they stay inside their quartile band
        for i in range(1, P25_pos):
            if X[i] - gr[i] >= MIN and X[i] - gr[i] <= P25:
                X[i] -= gr[i]

        for i in range(P25_pos+2, P50_pos):
            if X[i] - gr[i] >= P25 and X[i] - gr[i] <= P50:
                X[i] -= gr[i]

        for i in range(P50_pos + 2, P75_pos):
            if X[i] - gr[i] >= P50 and X[i] - gr[i] <= P75:
                X[i] -= gr[i]

        for i in range(P75_pos + 2, MAX_pos):
            if X[i] - gr[i] >= P75 and X[i] - gr[i] <= MAX:
                X[i] -= gr[i]

        if j % 100 == 0:
            print(cost(X))
        #if j % 200 == 0:
        #    print(gr)

    print(cost(X))
    print(X)
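You can now run the gradient descent with the learn(learning_rate, epochs) function. I used learning rates between 10e-7 and 10e-4.
For this case, after training for a while (about 100K epochs, which took about an hour), I got a STD of 24.871 (compared with the target of 24.874) and a mean of 31.730 (compared with the target of 35.790). These are the results I got: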
col
count 40.000000
mean 31.730694
std 24.871651
min 0.000000
25% 16.000000
50% 32.000000
75% 49.000000
max 99.000000
with the following sorted column values:
[0.0, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 16.0, 16.0, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 32.0, 32.0, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 49.0, 49.0, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 99.0]
These results can certainly be improved with more training. I'll update the answer when I get better results.
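For readers without autograd, the same cost has a simple closed-form gradient, so the idea can be sketched with NumPy alone. This is only a sketch under my own assumptions, not the code that produced the numbers above: it clips each proposed value into its quartile band (a projection) instead of skipping out-of-range updates, and the names `lower`, `upper`, `free`, and `step` are hypothetical.

```python
import numpy as np

COUNT = 40
MEAN, STD = 35.790875, 24.874763
MIN, P25, P50, P75, MAX = 0.0, 16.0, 32.0, 49.0, 99.0

# Pinned positions, using the same formula as the answer: floor(q * COUNT) - 1
p25, p50, p75 = (int(q * COUNT) - 1 for q in (0.25, 0.5, 0.75))

# Start with every value at the lower edge of its quartile band
x = np.empty(COUNT)
x[:p25], x[p25:p50], x[p50:p75], x[p75:] = MIN, P25, P50, P75
x[-1] = MAX

# Per-element bounds: each free value must stay inside its band
lower = x.copy()
upper = np.empty(COUNT)
upper[:p25], upper[p25:p50], upper[p50:p75], upper[p75:] = P25, P50, P75, MAX

# Min, max, and the two neighbours around each quartile never move
free = np.ones(COUNT, dtype=bool)
free[[0, p25, p25 + 1, p50, p50 + 1, p75, p75 + 1, COUNT - 1]] = False

def cost(x):
    """Same cost as the answer: (var - STD^2)^2 + (mean - MEAN)^4."""
    return (x.var(ddof=1) - STD**2) ** 2 + (x.mean() - MEAN) ** 4

def grad_cost(x):
    """Analytic gradient: d var / d x_i = 2 (x_i - mean) / (n - 1), d mean / d x_i = 1 / n."""
    n, m = x.size, x.mean()
    dvar = 2.0 * (x - m) / (n - 1)
    return 2.0 * (x.var(ddof=1) - STD**2) * dvar + 4.0 * (m - MEAN) ** 3 / n

def step(x, lr):
    """One projected gradient step that leaves the pinned values untouched."""
    proposal = np.clip(x - lr * grad_cost(x), lower, upper)
    return np.where(free, proposal, x)

start = cost(x)
for _ in range(2000):
    x = step(x, lr=1e-5)

print("cost before / after:", start, cost(x))
print("quartiles:", np.percentile(x, [25, 50, 75]))
```

Because the pinned neighbours never move and every free value stays inside its band, the quartiles, min, and max are preserved by construction. How close the mean and STD get still depends on the learning rate and the number of steps, as in the answer.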
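【Comments】:
You can also continue the training yourself, starting from the column values I got.
【Answer 3】:
I had a similar problem, though not as complicated. For your reference: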
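import numpy as np

def simulate_data(COUNT,MIN,P25,P50,P75,MAX):
    #Draw COUNT rounded samples from a normal distribution
    c = np.round(np.random.normal(0.5*COUNT, 0.25 * COUNT, COUNT),0)
    #Target quartile values and the matching quartiles of the raw sample
    y = [MIN,P25,P50,P75,MAX]
    x = [min(c),np.percentile(c,25),np.percentile(c,50),np.percentile(c,75),max(c)]
    #Piecewise-linear map that sends the sample's quartiles to the targets
    y_I = np.interp(c, x, y)
    return y_I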
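A quick way to see what this approach does and does not control is to run it once and inspect the result. The sketch below is a seeded variant of the function above, written only for illustration: the `seed` parameter and the use of `np.random.default_rng` are my additions, not part of the answer.

```python
import numpy as np

def simulate_data(count, vmin, p25, p50, p75, vmax, seed=None):
    """Seeded variant of the answer's function (the seed is an added assumption)."""
    rng = np.random.default_rng(seed)
    # Rounded normal samples, as in the answer
    c = np.round(rng.normal(0.5 * count, 0.25 * count, count))
    # Map the sample's min, quartiles, and max onto the target values
    knots_x = [c.min(), *np.percentile(c, [25, 50, 75]), c.max()]
    knots_y = [vmin, p25, p50, p75, vmax]
    return np.interp(c, knots_x, knots_y)

sample = simulate_data(263, 0, 16, 32, 49, 99, seed=0)

print("count:", len(sample))
print("min / max:", sample.min(), sample.max())
print("quartiles:", np.percentile(sample, [25, 50, 75]))
print("mean / std:", sample.mean(), sample.std(ddof=1))
```

The count, min, and max are hit exactly and the quartiles land on or near their targets, but nothing in this approach steers the mean or the standard deviation, so those will generally differ from the values in the question.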