Computing aggregates per group using combineByKey on a Spark RDD in Python (PySpark)

Posted: 2017-02-09 09:30:13

Question:

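I'm new to Spark RDDs and I want to compute aggregates by grouping records by key using Spark's shuffle operations. My first approach was rdd.groupBy(), but it takes a long time to finish and runs very low on memory; I understand this operation is very expensive in terms of shuffling. I came across another operation, rdd.combineByKey(), but I'm having trouble understanding and using it.

Here is the data I have stored in an rdd, call it "customerrdd":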

[(u'1', u'Customer#000000001', u'IVhzIApeRb ot,c,E', u'15', u'25-989-741-2988', u'711.56', u'BUILDING', u'to the even, regular platelets. regular, ironic epitaphs nag e', u''), (u'2', u'Customer#000000002', u'XSTf4,NCwDVaWNe6tEgvwfmRchLXak', u'13', u'23-768-687-3665', u'121.65', u'AUTOMOBILE', u'l accounts. blithely ironic theodolites integrate boldly: caref', u''), (u'3', u'Customer#000000003', u'MG9kdTD2WBHm', u'1', u'11-719-748-3364', u'7498.12', u'AUTOMOBILE', u' deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov', u''), (u'4', u'Customer#000000004', u'XxVSJsLAGtn', u'4', u'14-128-190-5944', u'2866.83', u'MACHINERY', u' requests. final, regular ideas sleep final accou', u''), (u'5', u'Customer#000000005', u'KvpyuHCplrB84WgAiGV6sYpZq7Tj', u'3', u'13-750-942-6364', u'794.47', u'HOUSEHOLD', u'n accounts will have to unwind. foxes cajole accor', u''), (u'6', u'Customer#000000006', u'sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn', u'20', u'30-114-968-4951', u'7638.57', u'AUTOMOBILE', u'tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious', u''), (u'7', u'Customer#000000007', u'TcGe5gaZNgVePxU5kRrvXBfkasDTea', u'18', u'28-190-982-9759', u'9561.95', u'AUTOMOBILE', u'ainst the ironic, express theodolites. express, even pinto beans among the exp', u''), (u'8', u'Customer#000000008', u'I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5', u'17', u'27-147-574-9335', u'6819.74', u'BUILDING', u'among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide', u''), (u'9', u'Customer#000000009', u'xKiAFTjUsCuxfeleNqefumTrjS', u'8', u'18-338-906-3675', u'8324.07', u'FURNITURE', u'r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl', u''), (u'10', u'Customer#000000010', u'6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2', u'5', u'15-741-346-9870', u'2753.54', u'HOUSEHOLD', u'es regular deposits haggle. 
fur', u''), (u'11', u'Customer#000000011', u'PkWS 3HlXqwTuzrKg633BEi', u'23', u'33-464-151-3439', u'-272.60', u'BUILDING', u'ckages. requests sleep slyly. quickly even pinto beans promise above the slyly regular pinto beans. ', u''), (u'12', u'Customer#000000012', u'9PWKuhzT4Zr1Q', u'13', u'23-791-276-1263', u'3396.49', u'HOUSEHOLD', u' to the carefully final braids. blithely regular requests nag. ironic theodolites boost quickly along', u''), (u'13', u'Customer#000000013', u'nsXQu0oVjD7PM659uC3SRSp', u'3', u'13-761-547-5974', u'3857.34', u'BUILDING', u'ounts sleep carefully after the close frays. carefully bold notornis use ironic requests. blithely', u''), (u'14', u'Customer#000000014', u'KXkletMlL2JQEA ', u'1', u'11-845-129-3851', u'5266.30', u'FURNITURE', u', ironic packages across the unus', u''), (u'15', u'Customer#000000015', u'YtWggXoOLdwdo7b0y,BZaGUQMLJMX1Y,EC,6Dn', u'23', u'33-687-542-7601', u'2788.52', u'HOUSEHOLD', u' platelets. regular deposits detect asymptotes. blithely unusual packages nag slyly at the fluf', u''), (u'16', u'Customer#000000016', u'cYiaeMLZSMAOQ2 d0W,', u'10', u'20-781-609-3107', u'4681.03', u'FURNITURE', u'kly silent courts. thinly regular theodolites sleep fluffily after ', u''), (u'17', u'Customer#000000017', u'izrh 6jdqtp2eqdtbkswDD8SG4SzXruMfIXyR7', u'2', u'12-970-682-3487', u'6.34', u'AUTOMOBILE', u'packages wake! blithely even pint', u''), (u'18', u'Customer#000000018', u'3txGO AiuFux3zT0Z9NYaFRnZt', u'6', u'16-155-215-1315', u'5494.43', u'BUILDING', u's sleep. carefully even instructions nag furiously alongside of t', u''), (u'19', u'Customer#000000019', u'uc,3bHIx84H,wdrmLOjVsiqXCq2tr', u'18', u'28-396-526-5053', u'8914.71', u'HOUSEHOLD', u' nag. furiously careful packages are slyly at the accounts. 
furiously regular in', u''), (u'20', u'Customer#000000020', u'JrPk8Pqplj4Ne', u'22', u'32-957-234-8742', u'7603.40', u'FURNITURE', u'g alongside of the special excuses-- fluffily enticing packages wake ', u''), (u'21', u'Customer#000000021', u'XYmVpr9yAHDEn', u'8', u'18-902-614-8344', u'1428.25', u'MACHINERY', u' quickly final accounts integrate blithely furiously u', u''), (u'22', u'Customer#000000022', u'QI6p41,FNs5k7RZoCCVPUTkUdYpB', u'3', u'13-806-545-9701', u'591.98', u'MACHINERY', u's nod furiously above the furiously ironic ideas. ', u''), (u'23', u'Customer#000000023', u'OdY W13N7Be3OC5MpgfmcYss0Wn6TKT', u'3', u'13-312-472-8245', u'3332.02', u'HOUSEHOLD', u'deposits. special deposits cajole slyly. fluffily special deposits about the furiously ', u''), (u'24', u'Customer#000000024', u'HXAFgIAyjxtdqwimt13Y3OZO 4xeLe7U8PqG', u'13', u'23-127-851-8031', u'9255.67', u'MACHINERY', u'into beans. fluffily final ideas haggle fluffily', u''), (u'25', u'Customer#000000025', u'Hp8GyFQgGHFYSilH5tBfe', u'12', u'22-603-468-3533', u'7133.70', u'FURNITURE', u'y. accounts sleep ruthlessly according to the regular theodolites. unusual instructions sleep. ironic, final', u''), (u'26', u'Customer#000000026', u'8ljrc5ZeMl7UciP', u'22', u'32-363-455-4837', u'5182.05', u'AUTOMOBILE', u'c requests use furiously ironic requests. slyly ironic dependencies us', u''), (u'27', u'Customer#000000027', u'IS8GIyxpBrLpMT0u7', u'3', u'13-137-193-2709', u'5679.84', u'BUILDING', u' about the carefully ironic pinto beans. accoun', u''), (u'28', u'Customer#000000028', u'iVyg0daQ,Tha8x2WPWA9m2529m', u'8', u'18-774-241-1462', u'1007.18', u'FURNITURE', u' along the regular deposits. 
furiously final pac', u''), (u'29', u'Customer#000000029', u'sJ5adtfyAkCK63df2,vF25zyQMVYE34uh', u'0', u'10-773-203-7342', u'7618.27', u'FURNITURE', u'its after the carefully final platelets x-ray against ', u''), (u'30', u'Customer#000000030', u'nJDsELGAavU63Jl0c5NKsKfL8rIJQQkQnYL2QJY', u'1', u'11-764-165-5076', u'9321.01', u'BUILDING', u'lithely final requests. furiously unusual account', u''), (u'31', u'Customer#000000031', u'LUACbO0viaAv6eXOAebryDB xjVst', u'23', u'33-197-837-7094', u'5236.89', u'HOUSEHOLD', u's use among the blithely pending depo', u''), (u'32', u'Customer#000000032', u'jD2xZzi UmId,DCtNBLXKj9q0Tlp2iQ6ZcO3J', u'15', u'25-430-914-2194', u'3471.53', u'BUILDING', u'cial ideas. final, furious requests across the e', u''), (u'33', u'Customer#000000033', u'qFSlMuLucBmx9xnn5ib2csWUweg D', u'17', u'27-375-391-1280', u'-78.56', u'AUTOMOBILE', u's. slyly regular accounts are furiously. carefully pending requests', u''), (u'34', u'Customer#000000034', u'Q6G9wZ6dnczmtOx509xgE,M2KV', u'15', u'25-344-968-5422', u'8589.70', u'HOUSEHOLD', u'nder against the even, pending accounts. even', u''), (u'35', u'Customer#000000035', u'TEjWGE4nBzJL2', u'17', u'27-566-888-7431', u'1228.24', u'HOUSEHOLD', u'requests. special, express requests nag slyly furiousl', u''), (u'36', u'Customer#000000036', u'3TvCzjuPzpJ0,DdJ8kW5U', u'21', u'31-704-669-5769', u'4987.27', u'BUILDING', u'haggle. enticing, quiet platelets grow quickly bold sheaves. carefully regular acc', u''), (u'37', u'Customer#000000037', u'7EV4Pwh,3SboctTWt', u'8', u'18-385-235-7162', u'-917.75', u'FURNITURE', u'ilent packages are carefully among the deposits. furiousl', u''), (u'38', u'Customer#000000038', u'a5Ee5e9568R8RLP 2ap7', u'12', u'22-306-880-7212', u'6345.11', u'HOUSEHOLD', u'lar excuses. closely even asymptotes cajole blithely excuses. 
carefully silent pinto beans sleep carefully fin', u''), (u'39', u'Customer#000000039', u'nnbRg,Pvy33dfkorYE FdeZ60', u'2', u'12-387-467-6509', u'6264.31', u'AUTOMOBILE', u'tions. slyly silent excuses slee', u''), (u'40', u'Customer#000000040', u'gOnGWAyhSV1ofv', u'3', u'13-652-915-8939', u'1335.30', u'BUILDING', u'rges impress after the slyly ironic courts. foxes are. blithely ', u''), (u'41', u'Customer#000000041', u'IM9mzmyoxeBmvNw8lA7G3Ydska2nkZF', u'10', u'20-917-711-4011', u'270.95', u'HOUSEHOLD', u'ly regular accounts hang bold, silent packages. unusual foxes haggle slyly above the special, final depo', u''), (u'42', u'Customer#000000042', u'ziSrvyyBke', u'5', u'15-416-330-4175', u'8727.01', u'BUILDING', u'ssly according to the pinto beans: carefully special requests across the even, pending accounts wake special', u''), (u'43', u'Customer#000000043', u'ouSbjHk8lh5fKX3zGso3ZSIj9Aa3PoaFd', u'19', u'29-316-665-2897', u'9904.28', u'MACHINERY', u'ial requests: carefully pending foxes detect quickly. carefully final courts cajole quickly. carefully', u''), (u'44', u'Customer#000000044', u'Oi,dOSPwDu4jo4x,,P85E0dmhZGvNtBwi', u'16', u'26-190-260-5375', u'7315.94', u'AUTOMOBILE', u'r requests around the unusual, bold a', u''), (u'45', u'Customer#000000045', u'4v3OcpFgoOmMG,CbnF,4mdC', u'9', u'19-715-298-9917', u'9983.38', u'AUTOMOBILE', u'nto beans haggle slyly alongside of t', u''), (u'46', u'Customer#000000046', u'eaTXWWm10L9', u'6', u'16-357-681-2007', u'5744.59', u'AUTOMOBILE', u'ctions. accounts sleep furiously even requests. regular, regular accounts cajole blithely around the final pa', u''), (u'47', u'Customer#000000047', u'b0UgocSqEW5 gdVbhNT', u'2', u'12-427-271-9466', u'274.58', u'BUILDING', u'ions. express, ironic instructions sleep furiously ironic ideas. furi', u''), (u'48', u'Customer#000000048', u'0UU iPhBupFvemNB', u'0', u'10-508-348-5882', u'3792.50', u'BUILDING', u're fluffily pending foxes. pending, bold platelets sleep slyly. 
even platelets cajo', u''), (u'49', u'Customer#000000049', u'cNgAeX7Fqrdf7HQN9EwjUa4nxT,68L FKAxzl', u'10', u'20-908-631-4424', u'4573.94', u'FURNITURE', u'nusual foxes! fluffily pending packages maintain to the regular ', u''), (u'50', u'Customer#000000050', u'9SzDYlkzxByyJ1QeTI o', u'6', u'16-658-112-3221', u'4266.13', u'MACHINERY', u'ts. furiously ironic accounts cajole furiously slyly ironic dinos.', u''), (u'51', u'Customer#000000051', u'uR,wEaiTvo4', u'12', u'22-344-885-4251', u'855.87', u'FURNITURE', u'eposits. furiously regular requests integrate carefully packages. furious', u''), (u'52', u'Customer#000000052', u'7 QOqGqqSy9jfV51BC71jcHJSD0', u'11', u'21-186-284-5998', u'5630.28', u'HOUSEHOLD', u'ic platelets use evenly even accounts. stealthy theodolites cajole furiou', u''), (u'53', u'Customer#000000053', u'HnaxHzTfFTZs8MuCpJyTbZ47Cm4wFOOgib', u'15', u'25-168-852-5363', u'4113.64', u'HOUSEHOLD', u'ar accounts are. even foxes are blithely. fluffily pending deposits boost', u''), (u'54', u'Customer#000000054', u',k4vf 5vECGWFy,hosTE,', u'4', u'14-776-370-4745', u'868.90', u'AUTOMOBILE', u'sual, silent accounts. furiously express accounts cajole special deposits. final, final accounts use furi', u''), (u'55', u'Customer#000000055', u'zIRBR4KNEl HzaiV3a i9n6elrxzDEh8r8pDom', u'10', u'20-180-440-8525', u'4572.11', u'MACHINERY', u'ully unusual packages wake bravely bold packages. unusual requests boost deposits! blithely ironic packages ab', u''), (u'56', u'Customer#000000056', u'BJYZYJQk4yD5B', u'10', u'20-895-685-6920', u'6530.86', u'FURNITURE', u'. notornis wake carefully. carefully fluffy requests are furiously even accounts. slyly expre', u''), (u'57', u'Customer#000000057', u'97XYbsuOPRXPWU', u'21', u'31-835-306-1650', u'4151.93', u'AUTOMOBILE', u'ove the carefully special packages. 
even, unusual deposits sleep slyly pend', u''), (u'58', u'Customer#000000058', u'g9ap7Dk1Sv9fcXEWjpMYpBZIRUohi T', u'13', u'23-244-493-2508', u'6478.46', u'HOUSEHOLD', u'ideas. ironic ideas affix furiously express, final instructions. regular excuses use quickly e', u''), (u'59', u'Customer#000000059', u'zLOCP0wh92OtBihgspOGl4', u'1', u'11-355-584-3112', u'3458.60', u'MACHINERY', u'ously final packages haggle blithely after the express deposits. furiou', u''), (u'60', u'Customer#000000060', u'FyodhjwMChsZmUz7Jz0H', u'12', u'22-480-575-5866', u'2741.87', u'MACHINERY', u'latelets. blithely unusual courts boost furiously about the packages. blithely final instruct', u''), (u'61', u'Customer#000000061', u'9kndve4EAJxhg3veF BfXr7AqOsT39o gtqjaYE', u'17', u'27-626-559-8599', u'1536.24', u'FURNITURE', u'egular packages shall have to impress along the ', u''), (u'62', u'Customer#000000062', u'upJK2Dnw13,', u'7', u'17-361-978-7059', u'595.61', u'MACHINERY', u'kly special dolphins. pinto beans are slyly. quickly regular accounts are furiously a', u''), (u'63', u'Customer#000000063', u'IXRSpVWWZraKII', u'21', u'31-952-552-9584', u'9331.13', u'AUTOMOBILE', u'ithely even accounts detect slyly above the fluffily ir', u''), (u'64', u'Customer#000000064', u'MbCeGY20kaKK3oalJD,OT', u'3', u'13-558-731-7204', u'-646.64', u'BUILDING', u'structions after the quietly ironic theodolites cajole be', u''), (u'65', u'Customer#000000065', u'RGT yzQ0y4l0H90P783LG4U95bXQFDRXbWa1sl,X', u'23', u'33-733-623-5267', u'8795.16', u'AUTOMOBILE', u'y final foxes serve carefully. theodolites are carefully. pending i', u''), (u'66', u'Customer#000000066', u'XbsEqXH1ETbJYYtA1A', u'22', u'32-213-373-5094', u'242.77', u'HOUSEHOLD', u'le slyly accounts. 
carefully silent packages benea', u''), (u'67', u'Customer#000000067', u'rfG0cOgtr5W8 xILkwp9fpCS8', u'9', u'19-403-114-4356', u'8166.59', u'MACHINERY', u'indle furiously final, even theodo', u''), (u'68', u'Customer#000000068', u'o8AibcCRkXvQFh8hF,7o', u'12', u'22-918-832-2411', u'6853.37', u'HOUSEHOLD', u' pending pinto beans impress realms. final dependencies ', u''), (u'69', u'Customer#000000069', u'Ltx17nO9Wwhtdbe9QZVxNgP98V7xW97uvSH1prEw', u'9', u'19-225-978-5670', u'1709.28', u'HOUSEHOLD', u'thely final ideas around the quickly final dependencies affix carefully quickly final theodolites. final accounts c', u''), (u'70', u'Customer#000000070', u'mFowIuhnHjp2GjCiYYavkW kUwOjIaTCQ', u'22', u'32-828-107-2832', u'4867.52', u'FURNITURE', u'fter the special asymptotes. ideas after the unusual frets cajole quickly regular pinto be', u''), (u'71', u'Customer#000000071', u'TlGalgdXWBmMV,6agLyWYDyIz9MKzcY8gl,w6t1B', u'7', u'17-710-812-5403', u'-611.19', u'HOUSEHOLD', u'g courts across the regular, final pinto beans are blithely pending ac', u''), (u'72', u'Customer#000000072', u'putjlmskxE,zs,HqeIA9Wqu7dhgH5BVCwDwHHcf', u'2', u'12-759-144-9689', u'-362.86', u'FURNITURE', u'ithely final foxes sleep always quickly bold accounts. final wat', u''), (u'73', u'Customer#000000073', u'8IhIxreu4Ug6tt5mog4', u'0', u'10-473-439-3214', u'4288.50', u'BUILDING', u'usual, unusual packages sleep busily along the furiou', u''), (u'74', u'Customer#000000074', u'IkJHCA3ZThF7qL7VKcrU nRLl,kylf ', u'4', u'14-199-862-7209', u'2764.43', u'MACHINERY', u'onic accounts. blithely slow packages would haggle carefully. qui', u''), (u'75', u'Customer#000000075', u'Dh 6jZ,cwxWLKQfRKkiGrzv6pm', u'18', u'28-247-803-9025', u'6684.10', u'AUTOMOBILE', u' instructions cajole even, even deposits. finally bold deposits use above the even pains. slyl', u''), (u'76', u'Customer#000000076', u'm3sbCvjMOHyaOofH,e UkGPtqc4', u'0', u'10-349-718-3044', u'5745.33', u'FURNITURE', u'pecial deposits. 
ironic ideas boost blithely according to the closely ironic theodolites! furiously final deposits n', u''), (u'77', u'Customer#000000077', u'4tAE5KdMFGD4byHtXF92vx', u'17', u'27-269-357-4674', u'1738.87', u'BUILDING', u'uffily silent requests. carefully ironic asymptotes among the ironic hockey players are carefully bli', u''), (u'78', u'Customer#000000078', u'HBOta,ZNqpg3U2cSL0kbrftkPwzX', u'9', u'19-960-700-9191', u'7136.97', u'FURNITURE', u'ests. blithely bold pinto beans h', u''), (u'79', u'Customer#000000079', u'n5hH2ftkVRwW8idtD,BmM2', u'15', u'25-147-850-4166', u'5121.28', u'MACHINERY', u'es. packages haggle furiously. regular, special requests poach after the quickly express ideas. blithely pending re', u''), (u'80', u'Customer#000000080', u'K,vtXp8qYB ', u'0', u'10-267-172-7101', u'7383.53', u'FURNITURE', u'tect among the dependencies. bold accounts engage closely even pinto beans. ca', u''), (u'81', u'Customer#000000081', u'SH6lPA7JiiNC6dNTrR', u'20', u'30-165-277-3269', u'2023.71', u'BUILDING', u'r packages. fluffily ironic requests cajole fluffily. ironically regular theodolit', u''), (u'82', u'Customer#000000082', u'zhG3EZbap4c992Gj3bK,3Ne,Xn', u'18', u'28-159-442-5305', u'9468.34', u'AUTOMOBILE', u's wake. bravely regular accounts are furiously. regula', u''), (u'83', u'Customer#000000083', u'HnhTNB5xpnSF20JBH4Ycs6psVnkC3RDf', u'22', u'32-817-154-4122', u'6463.51', u'BUILDING', u'ccording to the quickly bold warhorses. final, regular foxes integrate carefully. bold packages nag blithely ev', u''), (u'84', u'Customer#000000084', u'lpXz6Fwr9945rnbtMc8PlueilS1WmASr CB', u'11', u'21-546-818-3802', u'5174.71', u'FURNITURE', u'ly blithe foxes. 
special asymptotes haggle blithely against the furiously regular depo', u''), (u'85', u'Customer#000000085', u'siRerlDwiolhYR 8FgksoezycLj', u'5', u'15-745-585-8219', u'3386.64', u'FURNITURE', u'ronic ideas use above the slowly pendin', u''), (u'86', u'Customer#000000086', u'US6EGGHXbTTXPL9SBsxQJsuvy', u'0', u'10-677-951-2353', u'3306.32', u'HOUSEHOLD', u'quests. pending dugouts are carefully aroun', u''), (u'87', u'Customer#000000087', u'hgGhHVSWQl 6jZ6Ev', u'23', u'33-869-884-7053', u'6327.54', u'FURNITURE', u'hely ironic requests integrate according to the ironic accounts. slyly regular pla', u''), (u'88', u'Customer#000000088', u'wtkjBN9eyrFuENSMmMFlJ3e7jE5KXcg', u'16', u'26-516-273-2566', u'8031.44', u'AUTOMOBILE', u's are quickly above the quickly ironic instructions; even requests about the carefully final deposi', u''), (u'89', u'Customer#000000089', u'dtR, y9JQWUO6FoJExyp8whOU', u'14', u'24-394-451-5404', u'1530.76', u'FURNITURE', u'counts are slyly beyond the slyly final accounts. quickly final ideas wake. r', u''), (u'90', u'Customer#000000090', u'QxCzH7VxxYUWwfL7', u'16', u'26-603-491-1238', u'7354.23', u'BUILDING', u'sly across the furiously even ', u''), (u'91', u'Customer#000000091', u'S8OMYFrpHwoNHaGBeuS6E 6zhHGZiprw1b7 q', u'8', u'18-239-400-3677', u'4643.14', u'AUTOMOBILE', u'onic accounts. fluffily silent pinto beans boost blithely according to the fluffily exp', u''), (u'92', u'Customer#000000092', u'obP PULk2LH LqNF,K9hcbNqnLAkJVsl5xqSrY,', u'2', u'12-446-416-8471', u'1182.91', u'MACHINERY', u'. pinto beans hang slyly final deposits. ac', u''), (u'93', u'Customer#000000093', u'EHXBr2QGdh', u'7', u'17-359-388-5266', u'2182.52', u'MACHINERY', u'press deposits. carefully regular platelets r', u''), (u'94', u'Customer#000000094', u'IfVNIN9KtkScJ9dUjK3Pg5gY1aFeaXewwf', u'9', u'19-953-499-8833', u'5500.11', u'HOUSEHOLD', u'latelets across the bold, final requests sleep according to the fluffily bold accounts. 
unusual deposits amon', u''), (u'95', u'Customer#000000095', u'EU0xvmWvOmUUn5J,2z85DQyG7QCJ9Xq7', u'15', u'25-923-255-2929', u'5327.38', u'MACHINERY', u'ithely. ruthlessly final requests wake slyly alongside of the furiously silent pinto beans. even the', u''), (u'96', u'Customer#000000096', u'vWLOrmXhRR', u'8', u'18-422-845-1202', u'6323.92', u'AUTOMOBILE', u'press requests believe furiously. carefully final instructions snooze carefully. ', u''), (u'97', u'Customer#000000097', u'OApyejbhJG,0Iw3j rd1M', u'17', u'27-588-919-5638', u'2164.48', u'AUTOMOBILE', u'haggle slyly. bold, special ideas are blithely above the thinly bold theo', u''), (u'98', u'Customer#000000098', u'7yiheXNSpuEAwbswDW', u'12', u'22-885-845-6889', u'-551.37', u'BUILDING', u'ages. furiously pending accounts are quickly carefully final foxes: busily pe', u''), (u'99', u'Customer#000000099', u'szsrOiPtCHVS97Lt', u'15', u'25-515-237-9232', u'4088.65', u'HOUSEHOLD', u'cajole slyly about the regular theodolites! furiously bold requests nag along the pending, regular packages. somas', u''), (u'100', u'Customer#000000100', u'fptUABXcmkC5Wx', u'20', u'30-749-445-4907', u'9889.89', u'FURNITURE', u'was furiously fluffily quiet deposits. silent, pending requests boost against ', u'')]

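I applied groupBy() to customerrdd on the attribute at index 6. Then, for the aggregation (say, a sum over the attribute at index 3), I applied reduceByKey after a series of flatMapValues and mapValues steps. Here is the code: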

def func(x):
    return x


def stringconverfunc(z):
    return str(z)


def floatconverfunc(l):
    return float(l)

def aggonvalfunc(y):
    return y[3]


grouprdd=customerrdd.groupBy(lambda w:(w[6]))


result=grouprdd.flatMapValues(lambda q: func(q)).mapValues(lambda p: aggonvalfunc(p)) \
        .mapValues(lambda line: stringconverfunc(line)).mapValues(lambda line: line.strip()) \
        .mapValues(lambda line: floatconverfunc(line)).reduceByKey(lambda x, y: x + y).collect()
print result

Output:

[(u'BUILDING', 20), (u'AUTOMOBILE', 21), (u'HOUSEHOLD', 21), (u'MACHINERY', 16), (u'FURNITURE', 22)]

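However, the approach above is quite costly in terms of shuffling and does not work for larger datasets. So I want to implement the same idea with rdd.combineByKey() so that it computes faster and can be used on larger datasets. I tried to implement it by following the combineByKey reference, but I'm confused about how to supply the key and the value on which the aggregation should be performed. Can anyone help? I'd appreciate any suggestions.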


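Answer 1:

OK, for a beginner it's hard to take all of this in, so I'll try to give you some hints.

You can assign keys without grouping; this can be done with keyBy and does not involve shuffling. In the end, a key-value pair rdd is just an rdd consisting of tuples of size 2, where the first entry is the key and the second entry is the value. Any performance boost you could get from reduceByKey or combineByKey becomes useless if you do a grouping beforehand that could otherwise be avoided.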
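To see why grouping first throws away the benefit, it can help to count how many records would have to cross the shuffle in each case. The following is a plain-Python sketch, not Spark code: the two lists stand in for partitions, the data is made up, and all helper names are hypothetical. It only models the map-side combine that reduceByKey performs before shuffling.

```python
from operator import add

# Hypothetical toy data: two "partitions" of (key, value) pairs.
partitions = [
    [("BUILDING", 15.0), ("AUTOMOBILE", 13.0), ("BUILDING", 17.0)],
    [("AUTOMOBILE", 1.0), ("BUILDING", 23.0), ("AUTOMOBILE", 20.0)],
]


def records_shuffled_by_grouping(parts):
    """Grouping moves every record to its key's partition unchanged."""
    return sum(len(part) for part in parts)


def combine_within_partition(part, func):
    """Pre-aggregate one partition locally, as a map-side combine would."""
    combined = {}
    for key, value in part:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def records_shuffled_by_reducing(parts, func):
    """After local combining, one record per key and partition remains."""
    return sum(len(combine_within_partition(part, func)) for part in parts)


def reduce_by_key_local(parts, func):
    """Merge the locally combined partitions into the final per-key result."""
    totals = {}
    for part in parts:
        for key, value in combine_within_partition(part, func).items():
            totals[key] = func(totals[key], value) if key in totals else value
    return totals


grouped_count = records_shuffled_by_grouping(partitions)
reduced_count = records_shuffled_by_reducing(partitions, add)
totals = reduce_by_key_local(partitions, add)
```

In this toy case the grouping route moves all six records, while the reducing route moves four. On real data with few keys and many rows per partition the gap is far larger.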

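Also, you can call float on a string that has leading and trailing whitespace; it strips the string automatically. You also don't need to create lambdas of the form lambda x: f(x); just use f directly, without any parentheses, and it has the same effect. For the same reason you don't need to wrap str and float in another function. The operator module provides functions for adding and retrieving values, so you don't need to define those either. Have a look at the Python docs for more information.

My solution would be: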
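These points can be checked in plain Python without Spark. The snippet below is a small sketch; the shortened row and the padded strings are made-up stand-ins for the real records.

```python
from operator import add, itemgetter

# Hypothetical shortened record; the string at index 3 has extra whitespace.
row = (u'1', u'Customer#000000001', u'some address', u' 15 ')

# float() ignores leading and trailing whitespace, so no strip() is needed.
value = float(row[3])

# itemgetter(3) behaves like `lambda x: x[3]`.
same_field = itemgetter(3)(row) == (lambda x: x[3])(row)

# Passing `float` directly is equivalent to wrapping it in a lambda.
raw = [u' 15 ', u'13', u'1 ']
direct = list(map(float, raw))
wrapped = list(map(lambda s: float(s), raw))

# operator.add behaves like `lambda x, y: x + y`.
total = add(direct[0], direct[1])
```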

from operator import itemgetter, add

# `itemgetter(6)` is equivalent to `lambda x: x[6]`. Therefore we'll use element at
# index 6 to key the rdd's entries.
# This operation is equivalent to `customerrdd.map(lambda x: (x[6], x))`
rdd = customerrdd.keyBy(itemgetter(6))

# Now extract element at index 3 from the values so we no longer have a tuple
rdd = rdd.mapValues(itemgetter(3))

# Convert those elements to floats
rdd = rdd.mapValues(float)

# We could've done the previous steps in one by doing
# rdd = customerrdd.map(lambda x: (x[6], float(x[3])))

# Sum them up and collect the result
result = rdd.reduceByKey(add).collect()

Without comments:

from operator import itemgetter, add

result = customerrdd.keyBy(itemgetter(6))\
    .mapValues(itemgetter(3))\
    .mapValues(float)\
    .reduceByKey(add).collect()

which returns:

[(u'BUILDING', 204.0),
 (u'AUTOMOBILE', 280.0),
 (u'MACHINERY', 135.0),
 (u'HOUSEHOLD', 255.0),
 (u'FURNITURE', 224.0)]

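Admittedly, this result differs from the one you posted, but I ran your code and got the same result as with mine. So I guess your output came from a different rdd.

Comments:

Hi @swenzel, first of all thanks for your reply, but I want to implement this with combineByKey or a similar operation to reduce the shuffle overhead. reduceByKey involves a high shuffle overhead during computation. The reason I want to avoid groupBy(), groupByKey() and reduceByKey() is that performance suffers at execution time, and because of the memory complexity of these operations, jobs sometimes fail on larger datasets. So I'm looking for a solution in terms of operations like combineByKey.

@ShafaatHussain I'm pretty sure reduceByKey is not slower than combineByKey, which is complete overkill for a simple sum. Your bottleneck is the groupBy operation. Have you tried the code I provided?

Thanks again, but I don't only need a sum; I will extend this to different aggregation operations. Yes, I will try the solution you provided :)

In that case (if your aggregations aren't too special) you should consider converting your rdd to a DataFrame and using the SQL-like groupby and aggregation functions. Many are already implemented in the functions module. If you still prefer combineByKey, I'm afraid I don't know how to explain it better than the sources you have already found.

Yes, I know about DataFrames in Spark, but I actually need to perform the operations on the rdd only :)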
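For an aggregation that is not a plain sum, such as a per-key average, combineByKey takes three functions: one that turns the first value seen for a key into an accumulator, one that folds a further value into an accumulator, and one that merges two accumulators from different partitions. The following is a minimal sketch and not part of the original answer. The helper combine_by_key_local and the toy partitions are hypothetical and only imitate locally what Spark does with these functions; the commented lines at the end show how the same three functions would be passed to a real pair rdd.

```python
def create_combiner(value):
    """First value seen for a key in a partition: start a (sum, count) pair."""
    return (value, 1)


def merge_value(acc, value):
    """Fold another value from the same partition into the accumulator."""
    return (acc[0] + value, acc[1] + 1)


def merge_combiners(a, b):
    """Merge accumulators for the same key coming from different partitions."""
    return (a[0] + b[0], a[1] + b[1])


def combine_by_key_local(parts, create_combiner, merge_value, merge_combiners):
    """Plain-Python stand-in (hypothetical) for combineByKey on toy partitions."""
    per_partition = []
    for part in parts:
        combiners = {}
        for key, value in part:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)
            else:
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)

    merged = {}
    for combiners in per_partition:
        for key, acc in combiners.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged


# Made-up (key, value) pairs split over two "partitions".
partitions = [
    [("BUILDING", 15.0), ("AUTOMOBILE", 13.0), ("BUILDING", 17.0)],
    [("AUTOMOBILE", 1.0), ("BUILDING", 23.0), ("AUTOMOBILE", 20.0)],
]

sums_counts = combine_by_key_local(
    partitions, create_combiner, merge_value, merge_combiners
)
averages = {key: s / n for key, (s, n) in sums_counts.items()}

# With a real pair rdd, the same functions would be used roughly like this:
# pairs = customerrdd.map(lambda x: (x[6], float(x[3])))
# averages = pairs.combineByKey(create_combiner, merge_value, merge_combiners) \
#     .mapValues(lambda acc: acc[0] / acc[1]).collect()
```

The key is the first element of each pair and the value is the second, so choosing what to aggregate on comes down to how the pairs are built before combineByKey is called.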
