How to generate records and spread them among pairs from a table?

Posted 2023-03-31

【Title】How to generate records and spread them among pairs from a table? 【Posted】2011-06-05 21:24:59 【Question】:

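I have to generate about one million random trips between roughly 40 thousand destinations. Each destination has its own weight (total_probability); the larger the weight, the more trips should start or end at that place.

Trips should be generated randomly, but the destinations (start and end points) should be weighted by probability. Alternatively, the exact number of trips per destination can simply be pre-computed: divide each weight by the sum of the weights, multiply by 1M, and round to an integer.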
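The pre-computation described above can be sketched in Python (the language the asker ends up using). This is only an illustration under assumptions: the function name `trip_quotas` and the sample weights are made up, and instead of plain rounding it uses largest-remainder rounding, a swapped-in technique chosen so that the quotas add up to exactly the requested total.

```python
from typing import Dict


def trip_quotas(weights: Dict[int, float], total_trips: int) -> Dict[int, int]:
    """Split total_trips among destinations in proportion to their weights.

    Hypothetical helper: weights maps a destination id to its
    total_probability. Largest-remainder rounding is used instead of
    plain rounding so that the quotas sum to total_trips.
    """
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must have a positive sum")

    # Exact (fractional) share of each destination.
    exact = {d: w / weight_sum * total_trips for d, w in weights.items()}
    # Start from the integer part of each share.
    quotas = {d: int(share) for d, share in exact.items()}

    # Hand the remaining trips to the largest fractional parts.
    leftover = max(total_trips - sum(quotas.values()), 0)
    by_fraction = sorted(exact, key=lambda d: exact[d] - quotas[d], reverse=True)
    for d in by_fraction[:leftover]:
        quotas[d] += 1
    return quotas


if __name__ == "__main__":
    # Made-up weights for three destinations.
    print(trip_quotas({1: 5.0, 2: 3.0, 3: 2.0}, 1_000_000))
```

The quotas only fix how many trips touch each destination; pairing origins with destinations still needs a random matching step.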

The question is how to do this in PostgreSQL without generating a 40K*40K table containing all destination pairs.

          Table "public.dests"
   Column          |       Type       | Modifiers 
-------------------+------------------+-----------
 id                | integer          | 
 total_probability | double precision | 

          Table "public.trips"
   Column   |       Type       | Modifiers 
------------+------------------+-----------
 from_id    | integer          | 
 to_id      | integer          | 
 trips_num  | integer          | 
 ...
 some other metrics...

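The primary key of trips is (from_id, to_id).

Should I generate a table with 1M records and then update it iteratively, or would a for loop with 1M inserts be fast enough? I am working on a lightweight 2-core laptop.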

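P.S. I gave up and did this in Python. Since the set of queries and transformations now runs in Python, I will run the SQL scripts from Python rather than from a shell script. Thanks for your suggestions!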
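The asker's Python code is not shown. A minimal sketch of how weighted trips could be drawn in Python, assuming the weights have already been read from dests into a dict, might look like this. The function name `generate_trips`, the fixed seed, and the rule that a trip may not start and end at the same destination are assumptions for illustration.

```python
import random
from collections import Counter
from typing import Dict


def generate_trips(weights: Dict[int, float], n_trips: int, seed: int = 0) -> Counter:
    """Draw n_trips weighted (from_id, to_id) pairs and count duplicates.

    Hypothetical helper: weights maps a destination id to its
    total_probability. The returned Counter maps (from_id, to_id)
    to the number of trips for that pair (trips_num).
    """
    ids = [d for d, w in weights.items() if w > 0]
    if len(ids) < 2:
        raise ValueError("need at least two destinations with positive weight")
    w = [weights[d] for d in ids]

    rng = random.Random(seed)
    # Draw all origins and destinations in two weighted batches.
    origins = rng.choices(ids, weights=w, k=n_trips)
    destinations = rng.choices(ids, weights=w, k=n_trips)

    trips: Counter = Counter()
    for a, b in zip(origins, destinations):
        # Assumption: redraw the destination if it equals the origin.
        while b == a:
            b = rng.choices(ids, weights=w, k=1)[0]
        trips[(a, b)] += 1
    return trips


if __name__ == "__main__":
    trips = generate_trips({1: 5.0, 2: 3.0, 3: 2.0}, 1000)
    for (from_id, to_id), trips_num in sorted(trips.items()):
        print(from_id, to_id, trips_num)
```

Because duplicates are counted rather than stored, the result has at most one row per (from_id, to_id) pair, which matches the primary key of trips, and it never builds the full 40K*40K set of pairs.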

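【Comments】:

Is there any reason not to use a view or a function that returns a single row and generates these trips on the fly?

@Denis: I could not find a way to order by the probability column.

@Denis, #2: Well, it seems there is a way (forums.devarticles.com/database-development-6/…), but I would need to run 1M queries. :-/

【Answer 1】: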

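In PostgreSQL 9.1 you can use TRIGGERs on VIEWs, which effectively lets you build a materialized view (albeit a manually maintained one). Your first run will probably be expensive, and a loop may be the way to go for it, but after that I would use a series of TRIGGERs to maintain the data in the table.

At the end of the day, you need to decide whether you want to compute the result for every query or memoize the results through a materialized view.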

【Comments】:

【Answer 2】:

I am confused by your requirements, but I guess this can get you started:

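-- Join two random samples of 1010 destinations each
-- (1010 * 1010 is just over 1M candidate pairs), then keep 1M pairs.
select 
    f.id as "from", t.id as "to", 
    f.total_probability as from_prob, t.total_probability as to_prob
from 
    (
        select id, total_probability
        from dests
        order by random()
        limit 1010
    ) f
    inner join
    (
        select id, total_probability
        from dests
        order by random()
        limit 1010
    ) t on f.id != t.id  -- no trip from a destination to itself
order by random()
limit 1000000
;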

Edit:

This took about ten minutes on my not-so-modern desktop:

create table trips (from_id integer, to_id integer, trip_prob double precision);

-- Keep each ordered pair with probability equal to the product of the
-- two weights, then take a random 1M of the surviving pairs.
insert into trips (from_id, to_id, trip_prob)
select 
    f.id, t.id, f.total_probability * t.total_probability
from 
    (
        select id, total_probability
        from dests
    ) f
    inner join
    (
        select id, total_probability
        from dests
    ) t on f.id != t.id
where random() <= f.total_probability * t.total_probability
order by random()
limit 1000000
;

alter table trips add primary key (from_id, to_id);

select * from trips limit 5;
 from_id | to_id |     trip_prob      
---------+-------+--------------------
       1 |     6 | 0.0728749980226821
       1 |    11 |  0.239824750923743
       1 |    14 |  0.235899211677577
       1 |    15 |  0.176168172647811
       1 |    17 |   0.19708509944588
(5 rows)

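【Comments】:

How do you weight them by probability so that those with a larger total_prob come up more often?

You can use a weighting factor. See use.perl.org/~bart/journal/33630, where I derive the weighting factor mathematically: -log(1 - rand())/weight (1 - rand() because rand() can be 0 but never 1; the minus sign because the logarithm of a number below 1 is negative). Then pick the item with the lowest value.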
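The weighting factor from the last comment can be sketched in Python as follows. The function name `weighted_pick` and the sample weights are assumptions for illustration; the key formula -log(1 - rand())/weight is the one quoted in the comment. Each item gets a random key, and the items with the lowest keys are picked, so heavier items tend to come first and no item is picked twice.

```python
import math
import random
from typing import Dict, List, Optional


def weighted_pick(weights: Dict[int, float], k: int,
                  rng: Optional[random.Random] = None) -> List[int]:
    """Pick k distinct items, favoring those with larger weights.

    Hypothetical helper: each item gets the key -log(1 - r) / weight,
    with r uniform in [0, 1), and the k items with the lowest keys win.
    """
    rng = rng or random.Random()
    keys = {
        # random() is in [0, 1), so 1 - random() is in (0, 1] and the
        # logarithm is always defined (and never positive).
        item: -math.log(1.0 - rng.random()) / w
        for item, w in weights.items()
        if w > 0  # items with zero weight are never picked
    }
    return sorted(keys, key=keys.get)[:k]


if __name__ == "__main__":
    # Made-up weights: destination 1 should usually come first.
    print(weighted_pick({1: 5.0, 2: 3.0, 3: 2.0}, k=2))
```

In SQL the same idea would amount to ordering by such a key and taking the first rows, which is one way to address the asker's earlier point about ordering by the probability column.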
