Posted by 算法美食屋
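I. pyspark 🍎 or spark-scala 🔥?
pyspark is stronger for analysis; spark-scala is stronger for engineering.
If the application has very high performance requirements, choose spark-scala.
If the application relies heavily on visualization and machine learning algorithms, pyspark is recommended, since it works better with the related Python libraries.
In addition, spark-scala supports the Spark GraphX graph computation module, which pyspark does not.
pyspark has a gentle learning curve; spark-scala has a steep one.
The steep curve for spark-scala comes not only from Scala being a difficult language, but even more from the endless environment configuration pain waiting for the reader down the road.
pyspark is comparatively cheap to learn, and its environment is comparatively easy to configure. If the learning cost of pyspark is 3, that of spark-scala is roughly 9.
If you learn quickly and have plenty of study time, spark-scala is the suggested choice: it unlocks all of Spark's capabilities and gives the best performance, and it is the most common way Spark is used in industry.
If your study time is limited and you have a soft spot for Python, pyspark is the suggested choice. pyspark is also becoming increasingly common in industry.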
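II. Who this book 📚 is for 👼
This book assumes the reader has basic Python coding skills and is familiar with the basic usage of the numpy and pandas libraries.
It also assumes some experience with SQL, including familiarity with syntax such as select, join, and group by.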
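As a rough self-check of the SQL fluency assumed here, the sketch below runs one query that combines select, join, and group by. It uses Python's built-in sqlite3 module as a stand-in, not Spark, and the `users` and `orders` tables and their rows are made up for illustration.

```python
import sqlite3

# In-memory database; the tables and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (user_id INTEGER, amount REAL);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# select + join + group by: total order amount per user.
query = """
SELECT u.name, SUM(o.amount) AS total
FROM users u
JOIN orders o ON o.user_id = u.id
GROUP BY u.name
ORDER BY u.name
"""
totals = conn.execute(query).fetchall()
print(totals)
conn.close()
```

If this query reads naturally, the SQL used later in the book should pose no obstacle.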
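III. Writing style of this book 🍉
This book is a pyspark primer and reference that is extremely friendly to human users. "Don't let me think" is its highest aim.
It is compiled mainly from the official Spark documentation, combined with the author's own experience learning and using Spark.
Unlike the official Spark documentation, which is verbose and full of fragmented code, this book puts a great deal of work into chapter structure and choice of examples, and comes out ahead on user friendliness.
The content is organized by difficulty, by how readers tend to look things up, and by Spark's own layered structure. It progresses step by step with a clear hierarchy, so examples are easy to find by feature.
The examples are kept as minimal and structured as possible to improve readability and generality; most code snippets can be used in practice as they are.
If the difficulty of mastering pyspark through the official Spark documentation is about 5, the difficulty of mastering it through this book should be about 2.
The figure below alone contrasts the official Spark documentation with this book, 《10天吃掉那只pyspark》 (Eat That pyspark in 10 Days).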
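IV. Study plan for this book ⏰
1. Study schedule
The author wrote this book in about ? month(s) of spare time outside work; most readers should be able to learn it fully in 10 days.
The expected study time is between 30 minutes and 2 hours per day.
The book also works well as a pyspark handbook, serving as an example library to consult when putting projects into production.
Click a blue title in the study content column to jump to that chapter.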
Date | Study content | Difficulty | Estimated study time | Status |
---|---|---|---|---|
Part 1: Basics | | | | |
day1 | 1-1, Quickly set up your Spark development environment | ⭐️⭐️ | 1hour | ✅ |
day2 | 1-2, Understand Spark's basic principles in 60 minutes | ⭐️⭐️⭐️ | 1hour | ✅ |
Part 2: Core | | | | |
day3 | 2-1, Get started with Spark RDD programming in 2 hours | ⭐️⭐️⭐️ | 0.5hour | |
day4 | 2-2, 7 RDD programming exercises | ⭐️⭐️⭐️ | 1hour | |
day5 | 2-3, Get started with SparkSQL programming in 2 hours | ⭐️⭐️⭐️ | 2hour | |
day6 | 2-4, 7 SparkSQL programming exercises | ⭐️⭐️⭐️ | 1hour | |
Part 3: Advanced | | | | |
day7 | 3-1, Spark performance tuning methods | ⭐️⭐️⭐️⭐️⭐️ | 2hour | |
day8 | 3-2, Combined applications of RDD and SparkSQL | ⭐️⭐️⭐️⭐️⭐️ | 2hour | |
Part 4: Extensions | | | | |
day9 | 4-1, Exploring MLlib machine learning | ⭐️⭐️⭐️⭐️ | 2hour | |
day10 | 4-2, A first look at StructuredStreaming | ⭐️⭐️⭐️ | 1hour | |
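To get a feel for the overall time budget, the per-day estimates from the table above can be added up. This is only arithmetic on the listed values; the variable names are made up for this sketch.

```python
# Estimated hours per day, copied from the estimated-study-time column above.
estimated_hours = {
    "day1": 1, "day2": 1,
    "day3": 0.5, "day4": 1, "day5": 2, "day6": 1,
    "day7": 2, "day8": 2,
    "day9": 2, "day10": 1,
}

total_hours = sum(estimated_hours.values())
shortest_day = min(estimated_hours.values())
longest_day = max(estimated_hours.values())

print(f"total: {total_hours} hours over {len(estimated_hours)} days")
print(f"per day: {shortest_day} to {longest_day} hours")
```

The listed estimates add up to 13.5 hours, and every day falls within the range of 30 minutes to 2 hours.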
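2. Study environment
All source code in this book was written and tested in jupyter. It is recommended to clone it locally with git and run it interactively in jupyter.
To open the markdown files directly in jupyter, installing jupytext is recommended; it converts markdown into ipynb files.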
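A minimal sketch of that conversion, assuming jupytext is installed and its command-line tool is on the PATH. The helper names `jupytext_command` and `convert_markdown` are hypothetical and not part of the book's source.

```python
import subprocess
from pathlib import Path


def jupytext_command(md_path):
    """Build the jupytext CLI call that converts a markdown file to ipynb."""
    return ["jupytext", "--to", "ipynb", str(md_path)]


def convert_markdown(md_path):
    """Run the conversion and return the expected path of the new notebook."""
    subprocess.run(jupytext_command(md_path), check=True)
    return Path(md_path).with_suffix(".ipynb")
```

Calling `convert_markdown` on a markdown file (for example a hypothetical `chapter.md`) should leave a notebook with the same base name next to it.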
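For simplicity, this book practices on a single-machine Spark 3.0.1 environment configured in the following 2 steps.
step1: Install Java 8
Java installation tutorial: https://www.runoob.com/java/java-environment-setup.html
step2: Install pyspark and findspark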
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark
pip install findspark
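After both steps, a quick check like the following can confirm that the pieces are visible from Python. It is a sketch using only the standard library: it checks that a `java` executable is on the PATH and that the two packages can be imported, but it does not verify the Java version. The function name `check_environment` is made up for this example.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether java, pyspark, and findspark are visible from Python."""
    return {
        "java": shutil.which("java") is not None,
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        "findspark": importlib.util.find_spec("findspark") is not None,
    }


for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'missing'}")
```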
Alternatively, you can run pyspark directly in a kesci cloud notebook:
https://www.kesci.com/home/project
import findspark
# Specify spark_home and the Python interpreter path (adjust both to your machine)
spark_home = "/Users/liangyun/anaconda3/lib/python3.7/site-packages/pyspark"
python_path = "/Users/liangyun/anaconda3/bin/python"
findspark.init(spark_home,python_path)
import pyspark
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("test").setMaster("local[4]")
sc = SparkContext(conf=conf)
print("spark version:",pyspark.__version__)
rdd = sc.parallelize(["hello","spark"])
print(rdd.reduce(lambda x,y:x+' '+y))
spark version: 3.0.1
hello spark
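The literal paths in the snippet above are specific to the author's machine. One way to derive equivalent values elsewhere is sketched below, assuming pyspark was installed with pip into the same interpreter that runs the code. `locate_spark_home` is a hypothetical helper, not part of findspark or pyspark.

```python
import importlib.util
import os
import sys


def locate_spark_home():
    """Return the directory of the installed pyspark package, or None."""
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at pyspark/__init__.py; its folder is the package dir.
    return os.path.dirname(spec.origin)


spark_home = locate_spark_home()
python_path = sys.executable  # the interpreter running this script

print("spark_home:", spark_home)
print("python_path:", python_path)
```

If both values look right, they could be passed to `findspark.init(spark_home, python_path)` in place of the hard-coded strings.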
V. Encouraging and contacting the author
If this book has helped you and you would like to encourage the author, remember to give this project a star ⭐️ and share it with your friends 😊!