Real-Time Recommendation at Hundred-Million Scale with Millisecond Response: Exploring a Solution
Posted by DB印象
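Reading makes a full man, discussion a ready man, and writing an exact man.
>>> Preface
While working on recommendation-related business requirements recently, I drew on many of digoal's (德哥) articles and learned a great deal from them. I strongly recommend them: https://github.com/digoal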
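Business Scenario
Suppose a live-streaming platform has 20 million streamers. The operations team notices that a user, Aken, is fond of streamer A and wants to recommend more streamers similar to A, either one or many.
From a technical standpoint, how would we implement this?
If each request requires a full search-and-compute pass over tens of millions of rows, can existing components such as MongoDB, MySQL, or Oracle answer in milliseconds?
In the DT (data technology) era, quickly extracting a precise target group by combining user-profile conditions, and then marketing to that group, has become a common need across the industry. For example:
Taobao recommends products based on users' purchasing habits.
Toutiao recommends news based on users' browsing habits.
NetEase Cloud Music recommends songs based on users' listening preferences.
So at its core, this is a real-time recommendation system problem.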
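The General Solution
Marketing recommendation is built on user profiles. The general approach is to tag users and then select the required users by combining tags.
At the database level, this usually means wide tables and a distributed system. The wide table stores the user tags, for example one column per tag. Business queries combine conditions on the tag columns to find the matching ids, which are the target users or the objects to recommend.
Returning to the streamer recommendation scenario above, the profile table would be designed like this: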
create table tab_aken_signature (
uid int primary key,
tag1 float4, -- streamer attribute 1
tag2 float4, -- streamer attribute 2
...
tagn float4 -- streamer attribute n
);
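The tag-combination lookup described above can be sketched in a few lines of Python. This is only an illustration: the rows, tag names, and thresholds are hypothetical stand-ins for a real wide table, and the function name `select_uids` is invented for this example.

```python
# Hypothetical in-memory stand-in for a wide tag table: one dict per row,
# one key per tag column. All values and thresholds are made up.
rows = [
    {"uid": 1, "tag1": 0.9, "tag2": 0.1, "tag3": 0.7},
    {"uid": 2, "tag1": 0.2, "tag2": 0.8, "tag3": 0.6},
    {"uid": 3, "tag1": 0.8, "tag2": 0.3, "tag3": 0.9},
]


def select_uids(rows, conditions):
    """Return the uids whose tag values meet every (tag, minimum) condition."""
    return [
        row["uid"]
        for row in rows
        if all(row[tag] >= minimum for tag, minimum in conditions.items())
    ]


# Combine two tag conditions, as a business query would combine tag columns.
matched = select_uids(rows, {"tag1": 0.5, "tag3": 0.6})
print(matched)
```

A filter like this inspects every row for every request, which hints at why a full pass over tens of millions of rows is hard to finish in milliseconds without a suitable index.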
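First, as I understand it, the core idea of a recommendation system comes down to two points:
1. Recommend to users items that are similar to the items they like.
2. How to represent the liked items, and how to compute the similarity between items, are the questions we need to focus on.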
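One common way to make the second point concrete is to represent each item as a fixed-width bit signature and measure similarity by Hamming distance, the number of differing bits. The sketch below assumes that representation; the function name and the two 16-bit sample values are chosen for illustration only.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


# Two illustrative 16-bit signatures; a smaller distance means more similar.
liked = int("0100001010100000", 2)
candidate = int("0101001001100010", 2)

print(hamming_distance(liked, candidate))
```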
uid int --- primary key; each uid represents one recommendation candidate
tagval bit(64) --- text tags converted to a bitmap
tagarr text[] --- list of elements, or weighted tuples
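How a 64-bit `tagval` maps to the four prefixed elements of `tagarr` can be sketched as follows. The offsets are copied from the INSERT statement that appears later in this article (converted to 0-based), and the sample signature is the first row of the sample output; the function name `signature_to_tagarr` is an assumption made for this example.

```python
# 0-based start offsets taken from the article's INSERT statement
# (substring positions 1, 17, 33, 41). The last segment starts at bit 40,
# so it overlaps the second half of the third segment.
SEGMENT_STARTS = (0, 16, 32, 40)
SEGMENT_WIDTH = 16


def signature_to_tagarr(bits: str) -> list:
    """Split a 64-character bit string into prefixed 16-bit segments."""
    if len(bits) != 64 or set(bits) - {"0", "1"}:
        raise ValueError("expected a 64-character string of 0s and 1s")
    return [
        f"{n}_{bits[start:start + SEGMENT_WIDTH]}"
        for n, start in enumerate(SEGMENT_STARTS, start=1)
    ]


# tagval of the row with id 1 in the sample output.
tagval = (
    "0101011101100100"
    "1001011110010001"
    "0001011010101110"
    "1111111001100111"
)
print(signature_to_tagarr(tagval))
```

The numeric prefix keeps identical bit patterns in different positions from being treated as the same element when arrays are compared.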
akendb=# \d+ tab_aken_signature
Table "public.tab_aken_signature"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
id | integer | | | | plain | |
tagval | bit(64) | | | | extended | |
tagarr | text[] | | | | extended | |
Indexes:
"idx_tagarr" gin (tagarr _text_sml_ops)
Access method: heap
akendb=#
akendb=# create extension smlar;
CREATE EXTENSION
akendb=# create index idx_signature on tab_aken_signature using gin(tagarr _text_sml_ops);
select id,tagarr from tab_aken_signature
where tagarr % '{1_0110110101111001,2_0111100111100000,3_1010110010111110,4_1011111011110100}'
limit 10;
-- Note: the fourth segment starts at offset 41, so it overlaps bits 41-48 of the third segment; the sample rows below were generated this way.
insert into tab_aken_signature
select id,val::bit(64),regexp_split_to_array('1_'||substring(val,1,16)||',2_'||substring(val,17,16)||',3_'||substring(val,33,16)||',4_'||substring(val,41,16), ',')
from (select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,156250) t(id)) t;
nohup pgbench -h 9.107.133.92 -p 11005 -U tbase -d akendb -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 10 -t 10 &
akendb=# select count(*) from tab_aken_signature;
count
-----------
100000000
(1 row)
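The row count is consistent with the load command, assuming `./test.sql` contains the INSERT statement above (the article does not show the file). A quick check of the arithmetic:

```python
rows_per_insert = 156_250       # generate_series(1,156250) in the INSERT
clients = 64                    # pgbench -c 64
transactions_per_client = 10    # pgbench -t 10

# Each transaction runs the script once, so the INSERT executes
# clients * transactions_per_client times in total.
total_rows = rows_per_insert * clients * transactions_per_client
print(total_rows)
```

Under the same assumption, every run generates ids 1 through 156250 again, so ids repeat across runs while the random signatures differ.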
akendb=# select * from tab_aken_signature limit 10;
id | tagval | tagarr
----+------------------------------------------------------------------+-------------------------------------------------------------------------------
1 | 0101011101100100100101111001000100010110101011101111111001100111 | {1_0101011101100100,2_1001011110010001,3_0001011010101110,4_1010111011111110}
2 | 1110001100010000101010110010010000111110110000011000100001001011 | {1_1110001100010000,2_1010101100100100,3_0011111011000001,4_1100000110001000}
3 | 0110111011001011001101110111111111001000011101010010010000100100 | {1_0110111011001011,2_0011011101111111,3_1100100001110101,4_0111010100100100}
akendb=# set smlar.type = overlap;
akendb=# set smlar.threshold = 2; -- among streamers sharing at least 2 of the 4 segments (50% similarity), recommend the 1 exact match
akendb=# select smlar( tagarr, '{1_0100001010100000,2_0100111110001111,3_0100111111101101,4_1110110100101101}') as similarity,
length(replace(bitxor(bit'0100001010100000010011111000111101001111111011010010110100101111', tagval)::text,'0','')) hm_distance,*
from tab_aken_signature
where tagarr % '{1_0100001010100000,2_0100111110001111,3_0100111111101101,4_1110110100101101}'
and length(replace(bitxor(bit'0100001010100000010011111000111101001111111011010010110100101111', tagval)::text,'0','')) < 2
limit 100;
similarity | hm_distance | id | tagval | tagarr
------------+-------------+----+------------------------------------------------------------------+-------------------------------------------------------------------------------
4 | 0 | 8 | 0100001010100000010011111000111101001111111011010010110100101111 | {1_0100001010100000,2_0100111110001111,3_0100111111101101,4_1110110100101101}
(1 row)
Time: 2.643 ms
akendb=#
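The filtering step in the query above can be mirrored outside the database. The sketch below assumes that the `overlap` similarity type counts the elements two arrays have in common, and that a row passes when the count is at least the threshold (the result listings in this article include rows whose similarity equals the threshold of 2). The function names are hypothetical; the two rows are ids 8 and 73473 from the result listings.

```python
def overlap_similarity(a, b):
    """Count of distinct elements that two tag arrays share."""
    return len(set(a) & set(b))


def passes_threshold(a, b, threshold):
    """Assumed semantics of the % filter: similarity >= threshold."""
    return overlap_similarity(a, b) >= threshold


query = ["1_0100001010100000", "2_0100111110001111",
         "3_0100111111101101", "4_1110110100101101"]
row_8 = ["1_0100001010100000", "2_0100111110001111",
         "3_0100111111101101", "4_1110110100101101"]
row_73473 = ["1_0101001001100010", "2_0100011000001110",
             "3_0100111111101101", "4_1110110100101101"]

print(overlap_similarity(query, row_8))
print(overlap_similarity(query, row_73473))
print(passes_threshold(query, row_73473, threshold=2))
```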
akendb=# set smlar.type = overlap;
akendb=# set smlar.threshold = 2; -- among streamers sharing at least 2 of the 4 segments (50% similarity), recommend the 30 most similar
akendb=# select smlar( tagarr, '{1_0100001010100000,2_0100111110001111,3_0100111111101101,4_1110110100101101}') as similarity ,
length(replace(bitxor(bit'0100001010100000010011111000111101001111111011010010110100101111', tagval)::text,'0','')) as hm_distance,id,tagarr
from tab_aken_signature
where tagarr % '{1_0100001010100000,2_0100111110001111,3_0100111111101101,4_1110110100101101}' order by similarity desc,hm_distance
limit 30;
similarity | hm_distance | id | tagarr
------------+-------------+--------+-------------------------------------------------------------------------------
4 | 0 | 8 | {1_0100001010100000,2_0100111110001111,3_0100111111101101,4_1110110100101101}
2 | 13 | 73473 | {1_0101001001100010,2_0100011000001110,3_0100111111101101,4_1110110100101101}
2 | 16 | 27276 | {1_0101001111000101,2_0101010000001111,3_0100111111101101,4_1110110100101101}
2 | 16 | 82574 | {1_0101010011110001,2_0111001101001010,3_0100111111101101,4_1110110100101101}
2 | 18 | 55714 | {1_0111110101111011,2_1100111110111110,3_0100111111101101,4_1110110100101101}
2 | 18 | 68023 | {1_0101101010001100,2_1000111111110011,3_0100111111101101,4_1110110100101101}
2 | 20 | 153629 | {1_0111001011001110,2_0001100111111010,3_0100111111101101,4_1110110100101101}
2 | 20 | 100863 | {1_0100100000101101,2_1011101101101110,3_0100111111101101,4_1110110100101101}
2 | 24 | 149608 | {1_1010110011111011,2_0011001010000001,3_0100111111101101,4_1110110100101101}
2 | 24 | 56512 | {1_0010101100010110,2_1001101000100010,3_0100111111101101,4_1110110100101101}
2 | 26 | 78559 | {1_1011001010001110,2_0011010101000000,3_0100111111101101,4_1110110100101101}
(11 rows)
Time: 2.797 ms
akendb=#
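The whole two-stage query (filter by segment overlap, then order by similarity descending and Hamming distance ascending) can be sketched in plain Python. This is an illustrative model, not the database's execution plan: the candidate rows are hypothetical variations of the query signature, all helper names are invented here, and the `>=` threshold semantics are assumed from the result listings.

```python
SEGMENT_STARTS = (0, 16, 32, 40)  # 0-based offsets from the article's INSERT


def to_tags(bits):
    """Split a 64-bit string into the prefixed 16-bit elements of tagarr."""
    return [f"{n}_{bits[s:s + 16]}" for n, s in enumerate(SEGMENT_STARTS, start=1)]


def overlap(a, b):
    """Number of elements the two tag arrays share."""
    return len(set(a) & set(b))


def hamming_distance(a, b):
    """XOR the bit strings, then count the 1s by deleting the 0s (as the SQL does)."""
    xor = format(int(a, 2) ^ int(b, 2), "064b")
    return len(xor.replace("0", ""))


def flip(bits, positions):
    """Return a copy of bits with the given 0-based positions inverted."""
    chars = list(bits)
    for p in positions:
        chars[p] = "1" if chars[p] == "0" else "0"
    return "".join(chars)


def recommend(query_bits, candidates, threshold=2, limit=30):
    """Filter by overlap, then sort by similarity desc and distance asc."""
    query_tags = to_tags(query_bits)
    scored = []
    for cid, bits in candidates:
        similarity = overlap(to_tags(bits), query_tags)
        if similarity >= threshold:
            scored.append((similarity, hamming_distance(bits, query_bits), cid))
    scored.sort(key=lambda row: (-row[0], row[1]))
    return scored[:limit]


# Query signature from the article; the candidates below are made up.
query_bits = (
    "0100001010100000" "0100111110001111"
    "0100111111101101" "0010110100101111"
)
candidates = [
    (1, query_bits),                             # identical
    (2, flip(query_bits, [0, 5])),               # segment 1 differs
    (3, flip(query_bits, [0, 1, 2, 20, 21])),    # segments 1 and 2 differ
    (4, flip(query_bits, [3, 18, 35, 60])),      # only segment 4 still matches
    (5, flip(query_bits, [7, 30, 63])),          # segments 1 and 2 differ
]

for similarity, distance, cid in recommend(query_bits, candidates):
    print(similarity, distance, cid)
```

In the database, the first stage is what the GIN index on `tagarr` is meant to speed up, so that the Hamming distance only has to be computed for the rows that survive the overlap filter.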
References
1. https://github.com/digoal
2. https://github.com/jirutka/smlar
3. https://github.com/eulerto/pg_similarity
------ Make learning a habit. -Aken