Big Data Interview Questions
Spark Core Interview Questions, Part 01
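As Spark is adopted ever more widely in industry, it has become a must-have skill for big data developers. We have already shared many Spark learning videos and articles. To help readers consolidate and master Spark further, we are adding a "2000 Spark Interview Questions" series on top of the existing Spark series. The question sets cover basic concepts, internals, coding, performance tuning, operations, source code, and the ecosystem around Spark. Some questions come from the Internet and were collected and organized by Meifenggu volunteers; others were designed by Meifenggu volunteers from problems met in production. We hope they help.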
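I. Short-Answer Questions
1. When the Spark master uses ZooKeeper for HA, which metadata is stored in ZooKeeper?
A: The parameter spark.deploy.zookeeper.dir specifies where the master's metadata is stored in ZooKeeper. That metadata covers Workers, Drivers, Applications, and Executors. A standby node must read this metadata from ZooKeeper and restore the cluster's running state before it can serve requests such as job submissions and resource requests; until recovery finishes it cannot accept requests. Two further points about a master switchover:
1) During the switchover, every program that is already running keeps running normally. A Spark application obtains its compute resources through the cluster manager before it starts, so scheduling and processing its jobs at run time does not involve the master.
2) The only effect of the switchover is that new jobs cannot be submitted. On one hand, new applications cannot be submitted to the cluster, because only the active master accepts submission requests. On the other hand, programs that are already running cannot issue new submission requests to the master as the result of an action.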
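2. Why does a master failover in Spark master HA not affect jobs that are already running on the cluster?
A: Because the program requested its resources before it started running. From then on the driver talks to the executors directly and does not need to communicate with the master.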
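3. In Spark on Mesos, what is coarse-grained allocation, what is fine-grained allocation, and what are the pros and cons of each?
A: 1) Coarse-grained: resources are allocated when the program starts, and the program keeps using those resources afterwards without any further allocation. Advantage: when there are very many jobs, resource reuse is high, so coarse-grained mode fits well. Disadvantage: resources are easily wasted. If a job has 1000 tasks and 999 have finished while one is still running, the resources of the 999 finished tasks sit idle under coarse-grained allocation.
2) Fine-grained: resources are allocated when they are needed and reclaimed as soon as they are no longer used. Startup is more cumbersome, because every launch requires a new allocation.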
4. How do you configure HA for the Spark master?
1) Configure ZooKeeper.
2) Edit spark-env.sh: stop specifying the Spark master parameter there, and add the following line on every master node:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk01:2181,zk02:2181,zk03:2181 -Dspark.deploy.zookeeper.dir=/spark"
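3) Distribute spark-env.sh to every node.
4) On one master node run ./start-all.sh; this starts the active master on that node. On each of the other (standby) master nodes, start a master with: ./sbin/start-master.sh
5) When submitting a program, list all three masters in the master URL, for example: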
./spark-shell --master spark://master01:7077,master02:7077,master03:7077
5. What are the common stable versions of Apache Spark, and what does each number in Spark 1.6.0 mean?
A: Common major stable versions include Spark 1.3, Spark 1.6, and Spark 2.0. The numbers in Spark 1.6.0 mean:
1) First number: 1
Major version: a major release, which usually brings some API changes as well as large optimizations or structural changes.
2) Second number: 6
Minor version: a minor release, which usually adds new APIs, optimizes existing ones, or updates other things such as the web UI.
3) Third number: 0
Patch version: fixes bugs in the current minor release, with essentially no API changes or new features. An expert once advised that when switching Spark versions it is best to choose one whose patch version is not 0. Releases such as 1.2.0, … 1.6.0 are large updates and may contain hidden bugs or instabilities, so versions such as 1.2.1, … 1.6.1 are the safer choice.
From this explanation it is easy to see that Spark 2.1.1 is a bug-fix release for the 2.1 line: it adds no features and no APIs, and it is more stable than 2.1.0.
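The version scheme above can be illustrated with a small sketch. `SparkVersion`, `parse`, and `isBugfixRelease` are hypothetical names made up for this example; they are not part of Spark.

```scala
// Hypothetical helper: splits a version string of the form major.minor.patch.
final case class SparkVersion(major: Int, minor: Int, patch: Int) {
  // A non-zero patch number marks a bug-fix release of an existing minor line.
  def isBugfixRelease: Boolean = patch > 0
}

object SparkVersion {
  // Returns None when the string does not consist of exactly three numeric parts.
  def parse(s: String): Option[SparkVersion] = s.trim.split('.') match {
    case Array(a, b, c) if Seq(a, b, c).forall(p => p.nonEmpty && p.forall(_.isDigit)) =>
      Some(SparkVersion(a.toInt, b.toInt, c.toInt))
    case _ => None
  }
}
```

With this sketch, `SparkVersion.parse("2.1.1")` yields a value whose `isBugfixRelease` is true, while `1.6.0` is the first release of its minor line.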
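6. What does the driver do?
A: 1) A running Spark job has one Driver process, which is the job's main process. It contains the main function and holds the SparkContext instance, so it is the entry point of the program. 2) Responsibilities: it requests resources from the cluster, registers with the master, schedules the job, parses the job, generates stages, and dispatches tasks to executors. It contains the DAGScheduler and the TaskScheduler.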
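7. What deployment modes does Spark have, and what characterizes each?
1) Local mode
Spark does not have to run on a Hadoop cluster. It can run locally with a chosen number of threads, which runs the Spark application as a multi-threaded program directly on the local machine, usually for easy debugging. Local mode has three variants:
· local: runs with a single worker thread
· local[k]: runs with k worker threads
· local[*]: runs with as many worker threads as there are CPU cores
2) Standalone mode
A distributed cluster deployment that ships with all the services it needs. Spark itself handles resource management and task monitoring. This mode is also the basis for the other modes.
3) Spark on YARN mode
A distributed cluster deployment in which YARN handles resource management and task monitoring. At present it supports only coarse-grained resource allocation. It offers cluster and client run modes: cluster mode suits production, with the driver running on a node inside the cluster and therefore fault tolerant; client mode suits debugging, with the driver running on the client.
4) Spark on Mesos mode. This is the officially recommended mode (one reason, of course, being shared lineage). Spark was designed with Mesos support from the start, so for now Spark runs more flexibly and more naturally on Mesos than on YARN. Users can choose one of two scheduling modes for their applications:
1) Coarse-grained mode: each application's runtime environment consists of one Driver and several Executors. Each Executor holds a certain amount of resources and can run several tasks internally (corresponding to that many "slots"). Before the application's tasks start, all resources of the runtime environment must be acquired, and they stay occupied for the whole run even when unused. They are reclaimed only after the program finishes.
2) Fine-grained mode: because coarse-grained mode wastes a lot of resources, Spark on Mesos also offers fine-grained mode. It resembles today's cloud computing, and the idea is to allocate on demand.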
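8. What components make up the Spark technology stack, what does each do, and which scenarios does each suit?
A: You can first draw a stack diagram like the one below and then explain the function and use case of each component.
[Figure: Spark technology stack diagram]
1) Spark Core: the foundation of the other components and the kernel of Spark. It mainly includes the directed acyclic graph (DAG), RDD, lineage, cache, and broadcast, and it wraps the underlying communication framework.
2) Spark Streaming: a high-throughput, fault-tolerant stream processing system for real-time data streams. It can apply complex operations such as map, reduce, and join to many data sources (for example Kafka, Flume, Twitter, ZeroMQ, and TCP sockets), and it breaks a streaming computation into a series of short batch jobs.
3) Spark SQL: Shark was its predecessor. An important feature of Spark SQL is that it handles relational tables and RDDs uniformly, so developers can easily query external data with SQL commands while also performing more complex analysis.
4) BlinkDB: a massively parallel query engine for running interactive SQL queries on very large data. It lets users trade accuracy for response time, keeping the error of the result within a permitted range.
5) MLBase: the part of the Spark ecosystem that focuses on machine learning. It lowers the barrier so that users who may not know machine learning well can still use it conveniently. MLBase has four parts: MLlib, MLI, ML Optimizer, and MLRuntime.
6) GraphX: the Spark component for graphs and graph-parallel computation.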
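9. What is the main job of a Worker in Spark?
A: Main functions: it manages the memory and CPU usage of its node, receives resource instructions from the master, and starts programs through ExecutorRunner to carry out the assigned work. The worker is like a foreman: it manages and launches new processes that do the computing, much like a process service. Points to note: 1) Does the worker report its current state to the master? The heartbeat a worker sends to the master mainly carries only the worker ID. It does not send resource information in the heartbeat, because the master already knows the worker's resources from the time it allocated them. Resource information is sent only when a failure occurs. 2) The worker does not run application code. The Executor is what runs the business logic written in the application; the worker itself never runs the program's code.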
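10. Why is Spark faster than MapReduce?
A: 1) It computes in memory, which reduces inefficient disk interaction. 2) It uses an efficient scheduling algorithm based on the DAG. 3) It uses lineage for fault tolerance. The essence lies in the DAG and lineage.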
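11. Briefly describe the similarities and differences between the Hadoop shuffle and the Spark shuffle.
A: 1) At a high level the two do not differ much. Both partition the output of the mapper (ShuffleMapTask in Spark) and send different partitions to different reducers (in Spark a reducer may be a ShuffleMapTask in the next stage or a ResultTask). The reducer uses memory as a buffer, aggregates data while shuffling, and runs reduce() once the data has been aggregated (in Spark this may be a series of subsequent operations).
2) At a low level the two differ considerably. Hadoop MapReduce is sort-based: records entering combine() and reduce() must be sorted first. The benefit is that combine/reduce() can handle large data, because their input can be produced by external sorting (the mapper sorts each segment, and the reducer's shuffle merges the sorted segments). Spark currently defaults to hash-based shuffle, usually using a HashMap to aggregate shuffled data, and does not sort data in advance. Users who need sorted data must call an operation such as sortByKey() themselves. In Spark 1.1 you can set spark.shuffle.manager to sort to have the data sorted. In Spark 1.2, sort becomes the default shuffle implementation.
3) From an implementation point of view there are also many differences. Hadoop MapReduce divides processing into distinct phases: map(), spill, merge, shuffle, sort, reduce(), and so on. Each phase has its own job, and each can be implemented one by one in a procedural style. Spark has no such clearly separated phases, only different stages and a series of transformation() calls, so spill, merge, aggregate, and similar operations have to be embedded in transformation().
If we call the map-side process of partitioning and persisting data shuffle write, and the reducer-side process of reading and aggregating data shuffle read, then the question in Spark becomes: how do we add the shuffle write and shuffle read logic to the job's logical or physical execution graph, and how do we implement both efficiently?
Shuffle write: because the data need not be sorted, the task of shuffle write is simple. It partitions the data and persists it. Persisting serves two purposes: it relieves pressure on memory, and it provides fault tolerance.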
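12. MapReduce and Spark both do parallel computation. What do they have in common and how do they differ?
A: Both use the MapReduce model for parallel computation.
1) In Hadoop a unit of work is called a job. A job is divided into map tasks and reduce tasks, each task runs in its own process, and the process ends when the task ends.
2) In Spark what the user submits is called an application. One application corresponds to one SparkContext and contains multiple jobs; each action triggers one job. These jobs can run in parallel or serially. Each job has multiple stages, which the DAGScheduler produces by splitting the job at shuffle boundaries according to the dependencies between RDDs. Each stage contains multiple tasks, which form a TaskSet that the TaskScheduler dispatches to the executors. An executor lives as long as the application, even when no job is running, so tasks can start quickly and compute on data in memory.
3) A Hadoop job has only map and reduce operations, so its expressiveness is limited. A MapReduce pipeline also reads and writes HDFS repeatedly, which causes a lot of I/O, and the relationships between multiple jobs have to be managed by the user.
Spark's iterative computation happens in memory, its API offers many RDD operations such as join and groupBy, and the DAG enables good fault tolerance.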
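13. How does the RDD mechanism work?
A: An RDD is a resilient distributed dataset. Put simply, it is a data structure, and it is the common currency of the Spark framework.
All operators work on RDDs. Different scenarios use different RDD implementation classes, but they can all be converted into one another.
Executing RDDs forms a DAG, from which lineage is derived to guarantee fault tolerance. Physically, an RDD stores the mapping between blocks and nodes.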
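14. What components does Spark have?
A: The main components are:
1) master: manages the cluster and its nodes; does not take part in computation.
2) worker: a compute node; the worker process itself does not compute, and it reports to the master.
3) Driver: runs the program's main method and creates the SparkContext object.
4) SparkContext: controls the life cycle of the whole application and includes components such as the DAGScheduler and the TaskScheduler.
5) client: the entry point through which users submit programs.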
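15. How does Spark work?
A: After the user submits a job from the client, the Driver runs the main method and creates the SparkContext.
Executing RDD operators builds a DAG that is handed to the DAGScheduler. The DAGScheduler divides it into stages according to the dependencies between RDDs and passes them to the TaskScheduler. The TaskScheduler turns each stage into a task set and dispatches it to the executors on the nodes.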
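16. How do you optimize Spark?
A: Spark tuning is fairly complex, but it can be approached on three levels:
1) Platform level: avoid distributing unnecessary jars, improve data locality, and choose an efficient storage format such as Parquet.
2) Application level: optimize filter operators to avoid too many small tasks, reduce the per-record resource cost, handle data skew, cache RDDs that are reused, run jobs in parallel, and so on.
3) JVM level: set appropriate resource amounts, set reasonable JVM options, enable an efficient serializer such as Kryo, increase off-heap memory, and so on.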
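17. Briefly describe the steps for setting up a distributed Spark cluster.
1) Prepare the Linux environment: create the account and user group for the cluster, set up ssh, turn off the firewall, disable SELinux, and configure hosts and hostname.
2) Add the JDK to the environment variables.
3) Set up the Hadoop cluster. For master HA, also set up a ZooKeeper cluster.
Edit configuration files such as hdfs-site.xml, hadoop-env.sh, yarn-site.xml, and slaves.
4) Start the Hadoop cluster. Format the NameNode before the first start.
5) Configure the Spark cluster: edit configuration files such as spark-env.sh and slaves, and copy the relevant Hadoop configuration into the Spark conf directory.
6) Start the Spark cluster.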
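18. What are wide and narrow RDD dependencies?
The relationship between an RDD and the parent RDD(s) it depends on comes in two types: narrow dependency and wide dependency.
1) Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.
2) Wide dependency: multiple partitions of the child RDD depend on the same partition of the parent RDD.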
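19. How do you include external jars with spark-submit?
Method 1: spark-submit --jars
According to the Spark website, pass --jars when submitting the job, with the jars separated by commas. The drawback is that the jars must be listed on every submission. That is fine for a few jars but tedious for many.
Command: spark-submit --master yarn-client --jars ***.jar,***.jar
Method 2: extraClassPath
Set the parameters in spark-defaults.conf, copy all required jars into one directory, and point the parameters at that directory. This is much more convenient than method 1:
spark.executor.extraClassPath=/home/hadoop/wzq_workspace/lib/*
spark.driver.extraClassPath=/home/hadoop/wzq_workspace/lib/*
Note that the directory must exist on every machine that might run Spark tasks, and the jars must be copied to all of them. The advantage is that you no longer type a long list of jars at submission time. The drawback is that every jar has to be copied to every machine.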
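20. What is the difference between cache and persist?
A: 1) Both cache and persist store an RDD so that later uses do not recompute it, which can save a great deal of run time. 2) cache has only one storage level, the default MEMORY_ONLY. cache calls persist, and persist lets you choose other storage levels as needed. 3) When an executor runs, by default 60% of its memory is used for caching and 40% for task execution. persist is the most fundamental, lowest-level function of the two.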
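II. Multiple-Choice Questions
1. Which of the following is not one of Spark's four major components? (D)
A. Spark Streaming B. MLlib
C. GraphX D. SparkR
2. Which of the following ports is not used by a service that ships with Spark? (C)
A.8080 B.4040 C.8090 D.18080
Note: 8080 is the Spark cluster web UI port, 4040 is the Spark job monitoring port, and 18080 is the job history port.
3. What was the biggest change in Spark 1.4? (B)
A. Release version of Spark SQL B. Introduction of SparkR
C. DataFrame D. Support for dynamic resource allocation
4. What is the default scheduling mode for Spark jobs? (A)
A. FIFO B. FAIR
C. None D. Specified at run time
5. Which of the following is not a condition for running in local mode? (D)
A. spark.localExecution.enabled=true
B. Local execution is specified explicitly
C. The finalStage has no parent stage
D. The partition default value
6. Which of the following is not a characteristic of an RDD? (C)
A. Partitionable B. Serializable C. Mutable D. Persistable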
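7. Which statement about broadcast variables is wrong? (D)
A. Can be used in any function call B. Read-only
C. Stored on every node D. Stored on disk or HDFS
8. Which statement about accumulators is wrong? (D)
A. Support addition B. Support numeric types
C. Can be used in parallel D. Do not support custom types
9. Which of the following is not a distributed deployment mode supported by Spark? (D)
A. standalone B. spark on mesos
C. spark on YARN D. Spark on local
10. What determines the number of tasks in a stage? (A)
A. Partition B. Job C. Stage D. TaskScheduler
11. Which of the following operations is a narrow dependency? (B)
A. join B. filter
C. group D. sort
12. Which of the following operations is definitely a wide dependency? (C)
A. map B. flatMap
C. reduceByKey D. sample
13. How do the Spark master and workers communicate? (D)
A. http B. nio C. netty D. Akka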
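14. What is the default storage level? (A)
A. MEMORY_ONLY B. MEMORY_ONLY_SER
C. MEMORY_AND_DISK D. MEMORY_AND_DISK_SER
15. Which of the following does spark.deploy.recoveryMode not support? (D)
A. ZooKeeper B. FileSystem
C. NONE D. Hadoop
16. Which of the following is not an RDD caching method? (C)
A. persist() B. cache()
C. Memory()
17. A task is a unit of work that runs on an executor in which of the following? (C)
A. Driver program B. Spark master
C. Worker node D. Cluster manager
18. What is the difference between storing Hive metadata in Derby and in MySQL? (B)
A. No difference B. Multiple sessions
C. Support for networked environments D. A difference between databases
19. What is the biggest difference between a DataFrame and an RDD? (B)
A. Scientific statistics support B. It adds a schema
C. Different storage method D. External data source support
20. What does the Master do after the ElectedLeader event? (D)
A. Notify the driver B. Notify the workers
C. Register applications D. Go directly to ALIVE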
-----------------------------------------------------------------------------------------------------------------------------
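[2000 Spark Interview Questions, 41-70] Spark Core Interview Questions, Part 02
This batch of Spark interview questions was provided by the volunteer Taffry (a graduate student at a university). Many thanks for the high-quality set. If you have good interview questions, you can send them privately to the group owner (the volunteer QQ group is 233864572). To keep quality high, the group owner and members of the Meifenggu platform team review the sets that volunteers contribute and may edit them slightly in places; we ask for the volunteers' understanding.
I. 30 Interview Questions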
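1. Can other operators follow cache, and is cache an action?
A: Other operators can follow cache, but chaining one directly onto it defeats the purpose of caching, because the cache is triggered again.
cache is not an action.
2. Is reduceByKey an action?
A: No. Many people think it is, but it is reduce on an RDD that is an action.
3. At which step is data locality determined?
A: Which machine a given task runs on is decided when the DAG is divided into stages.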
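4. In what ways is an RDD resilient?
1) It switches storage between memory and disk automatically.
2) It offers efficient fault tolerance based on lineage.
3) A failed task is retried automatically a set number of times.
4) A failed stage is retried automatically a set number of times, and only the failed partitions are recomputed.
5) checkpoint and persist: data can be persisted or cached after it is computed.
6) Scheduling is elastic: DAG and task scheduling are independent of resources.
7) Data partitioning is highly elastic: a. many small fragmented partitions can be merged into larger ones; b. par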
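5. What are the conventional approaches to fault tolerance?
1) Data checkpoints: data gets copied, which wastes resources.
2) Logging data updates: every update is recorded, which is complex and costly in performance.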
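6. Why is lineage (recording how data was derived) such an efficient approach for RDDs?
1) Laziness records where data comes from. RDDs are immutable and lazy, and RDDs are linked into a chain; laziness is the cornerstone of resilience. Because an RDD is immutable, every operation produces a new RDD, so there is no global modification to deal with and control becomes easier. The whole computation chain is stored, and a computation traces back from the end toward the beginning. If step 900 is the end of the previous stage, recovery restarts from there; otherwise use a checkpoint.
2) Recording every individual modification of the original data is very expensive, whereas modifying a whole collection at once is cheap. The official position is that RDD operations are coarse-grained, for efficiency and simplicity: every write or update operates on a collection of data. RDD writes are coarse-grained, while RDD reads can be either coarse-grained or fine-grained, since records can be read one at a time.
3) Reduced complexity is one side of the efficiency. Coarse-grained writes do limit the use cases (a web crawler, for example), but in the real world most writes are coarse-grained.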
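7. What are the shortcomings of RDDs?
1) They do not support fine-grained writes and updates (as needed by a web crawler, for example). Spark writes data in a coarse-grained way, meaning it writes data in batches for efficiency. Reads, however, are fine-grained: records can be read one at a time.
2) They do not support incremental iterative computation, which Flink does.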
8. What are the usual steps for writing a Spark program?
A: Initialization, resources, data source, parallelization, RDD transformations, and then an action that prints the result or stores it in a suitable storage medium. See the figure below for details:
[Figure: general steps of writing a Spark program]
9. What are the two kinds of Spark operators?
A: Transformation operators and action operators.
10. Which command do you use to submit your jar to Spark?
A: spark-submit.
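11. What aggregation operators does Spark have, and which kinds of operators should we try to avoid?
A: During development, avoid shuffle-inducing operators such as reduceByKey, join, distinct, and repartition wherever possible, and prefer non-shuffle operators of the map family. A Spark job with no shuffle, or with few shuffles, has far lower performance overhead.
12. How do you understand the Spark shuffle process?
A: Answer along these three points:
1) how the shuffle process is divided
2) how the intermediate shuffle results are stored
3) how the shuffle data is fetched
See this blog post: http://www.cnblogs.com/jxhd1/p/6528540.html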
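13. How do you get data from Kafka?
1) Receiver-based approach
This approach uses a Receiver to obtain data. The Receiver is implemented with Kafka's high-level Consumer API. The data it fetches from Kafka is stored in the memory of Spark executors, and the jobs that Spark Streaming launches then process that data.
2) Direct approach
This newer, receiver-less direct approach was introduced in Spark 1.3 to provide a more robust mechanism. Instead of using a Receiver, it periodically queries Kafka for the latest offset of each topic+partition and uses that to define the offset range of each batch. When the job that processes the data starts, it uses Kafka's simple consumer API to read the specified offset range from Kafka.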
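14. What good solutions do you have for data skew in Spark?
1) First locate the skew: is the job hitting OOM, or are tasks running slowly? Check the logs and the web UI.
2) Solutions, from several angles:
· Avoid unnecessary shuffles, for example by broadcasting the small table to turn a reduce-side join into a map-side join.
· Split off the records that cause the skew, process them in several parts, and then merge the join results.
· Change the parallelism; it may be too low, putting heavy data pressure on individual tasks.
· Aggregate in two stages: local aggregation first, then global aggregation.
· Write a custom partitioner that spreads the keys more evenly.
For detailed solutions see the blog post "Spark Data Skew Optimization Methods": https://mp.weixin.qq.com/s?__biz=MzIzNzI1NzY3Nw==&mid=2247484221&idx=1&sn=7e20f08bfb490b91f0920aefb29ca271&chksm=e8ca159fdfbd9c89f610dd230e07f414521b4dd13018994ee9b873421d1e8efcdc535c810225&scene=21#wechat_redirect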
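The two-stage aggregation idea can be sketched without a cluster. The following is a minimal illustration on plain Scala collections, not Spark code; `TwoStageAggregation`, `sumByKey`, and the `salts` parameter are hypothetical names chosen for this example. In a real job each stage would be a separate shuffle, so the hot key's records are first spread over several salted keys.

```scala
import scala.util.Random

object TwoStageAggregation {
  // Sums values per key in two passes so that a hot key is first split across `salts` sub-keys.
  def sumByKey(records: Seq[(String, Int)], salts: Int, rng: Random = new Random(42)): Map[String, Int] = {
    // Stage 1 (local aggregation): attach a random salt to each key and sum per salted key.
    val partial: Map[(Int, String), Int] = records
      .map { case (k, v) => ((rng.nextInt(salts), k), v) }
      .groupBy(_._1)
      .map { case (saltedKey, group) => saltedKey -> group.map(_._2).sum }
    // Stage 2 (global aggregation): drop the salt and sum the partial results per original key.
    partial.toSeq
      .map { case ((_, k), v) => (k, v) }
      .groupBy(_._1)
      .map { case (k, group) => k -> group.map(_._2).sum }
  }
}
```

The final sums are the same as a direct per-key sum; only the intermediate grouping changes.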
15. In what ways can an RDD be created?
1) From a collection in the program
2) From the local file system
3) From HDFS
4) From a database
5) From a NoSQL store such as HBase
6) From S3
7) From a data stream such as a socket
Naming only the first three is not enough and suggests entry-level experience; in practice there are many ways to create an RDD.
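16. What is a suitable setting for Spark parallelism?
A: Have each core handle 2 to 4 partitions. With 32 cores, for example, the parallelism should be between 64 and 128, that is, 64 to 128 partitions. Parallelism is unrelated to data size; it depends only on memory usage and CPU time.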
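The rule of thumb above is simple arithmetic. A minimal sketch, where `Parallelism.suggestedRange` is a hypothetical helper name used only for this example:

```scala
object Parallelism {
  // Returns the (low, high) partition counts for 2 to 4 partitions per core.
  def suggestedRange(totalCores: Int): (Int, Int) = (totalCores * 2, totalCores * 4)
}
```

For 32 cores this gives the 64 to 128 range quoted in the answer.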
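17. Who manages the location of data in Spark?
A: Every data partition corresponds to a concrete physical location, and that location is managed by the BlockManager. Whether the data is on disk, in memory, or in Tachyon, the BlockManager manages it.
18. What kinds of data locality does Spark have?
A: Spark has three kinds of data locality:
a. PROCESS_LOCAL: reading data cached on the local node
b. NODE_LOCAL: reading data from the local node's disk
c. ANY: reading data from a non-local node
In general the read preference is PROCESS_LOCAL > NODE_LOCAL > ANY, so try to have data read as PROCESS_LOCAL or NODE_LOCAL. PROCESS_LOCAL is also tied to caching: if an RDD is used often, cache it in memory. Note that cache is lazy, so an action must be triggered before the RDD is really cached in memory.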
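19. How many kinds of RDD operations are there?
1) transformation: turns one RDD into another RDD
2) action
3) controller: control operators such as cache and persist, which do a lot for performance and efficiency
There are three kinds; do not answer that there are only two.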
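20. How does Spark handle objects that cannot be serialized?
A: Wrap the non-serializable content in an object.
21. What does collect do, and how is it implemented underneath?
A: With collect the driver gathers the contents of every node in the cluster and combines them into the result. collect returns an Array. It fetches the data from each node as an Array and then merges the fetched Arrays; after merging, the Array holds a single element of tuple (key-value) type.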
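22. Why does a Spark program sometimes produce a great many tasks by default, and how do you change the default number of tasks?
A: 1) Because the input data is split into many tasks, especially when there are many small files: as many tasks are started as there are input blocks. 2) Spark has the concept of a partition, and every partition corresponds to one task. More tasks mean more efficient processing of large data, but more is not always better. For routine testing, or when the data is not that large, a large number of tasks is unnecessary. 3) The values can be set in the configuration file spark_home/conf/spark-defaults.conf:
spark.sql.shuffle.partitions 50
spark.default.parallelism 10
The first sets the number of tasks for Spark SQL.
The second takes effect for programs that do not use Spark SQL.
23. Why does a job start running before the Spark application has obtained enough resources, and what problems can that cause?
A: The cluster resources may be insufficient while the job runs, so the job may even finish without ever having been given enough resources. The job starts running tasks as soon as some executors have been allocated, presumably because the task scheduling thread and the executor resource requests are asynchronous. To wait until all resources have been acquired before running the job, set spark.scheduler.maxRegisteredResourcesWaitingTime to a large value and spark.scheduler.minRegisteredResourcesRatio to 1. Weigh this against the actual situation,
because otherwise the job can easily wait a long time for resources and never run.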
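24. What is the difference between map and flatMap?
map: transforms each element of the RDD. Applied to a file with a function that splits a line, it returns one array object per line.
flatMap: transforms each element of the RDD and then flattens the results, merging them into a single sequence. Applied to the same file, it returns one flat collection for all lines. Empty results (for example an empty collection or `None`) contribute nothing to the output.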
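The difference is easiest to see on a small input. The sketch below uses plain Scala collections rather than an RDD, on the assumption that `map` and `flatMap` on an RDD follow the same element-wise semantics; `MapVsFlatMap` and its fields are names made up for this example.

```scala
object MapVsFlatMap {
  val lines: Seq[String] = Seq("hello spark", "hello hadoop")
  // map: one output element per input element (here, one array of words per line).
  val mapped: Seq[Array[String]] = lines.map(_.split(" "))
  // flatMap: the per-line arrays are flattened into a single sequence of words.
  val flattened: Seq[String] = lines.flatMap(_.split(" "))
}
```

Here `mapped` has two elements (one array per line), while `flattened` has four (one per word).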
25. List the actions you use most often.
A: collect, reduce, take, count, saveAsTextFile, and so on.
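26. Why does Spark persist data, and in which scenarios should you call persist?
Why persist?
persist shows up in every moderately complex Spark algorithm. Spark keeps data in memory by default, and much of what Spark does lives in memory, which suits fast iteration: in a pipeline of 1000 steps only the first step has input data and no temporary data is produced in between. Distributed systems are risky and fail easily, however, so fault tolerance is needed. A failed RDD or partition can be recomputed from its lineage, but if the parent RDD was never persisted or cached, the computation has to start from the very beginning.
The following scenarios call for persist:
1) A step is very expensive to compute.
2) The computation chain is very long, so recovery would mean recomputing many steps.
3) The RDD being checkpointed should be persisted. checkpoint is lazy: when the framework finds a checkpoint, it triggers a separate job for it, which recomputes the RDD. So persist before checkpointing: call rdd.cache or rdd.persist to save the result, then call checkpoint. This runs much faster because the RDD chain does not need to be recomputed. Always persist before a checkpoint.
4) After a shuffle: a shuffle transfers data over the network, which is risky, and recovering lost data by recomputation is very costly.
5) Before a shuffle: the framework persists the data to disk by default; this is done automatically.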
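27. Why serialize?
A: Serialization reduces the size of data, so it saves storage space and makes storing and transferring data efficient. The downside is that the data must be deserialized before use, which costs a lot of CPU.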
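28. Describe your experience optimizing join operations.
A: Joins commonly fall into two classes: map-side join and reduce-side join. When joining a large table with a small one, a map-side join improves efficiency markedly. Associating several datasets is a very common need in data processing, but in a distributed system it tends to become troublesome, because the join operation the framework provides generally sends all data to the reduce partitions by key, which is the shuffle. This causes heavy network and disk I/O and runs very inefficiently; the process is usually called a reduce-side join. If one of the tables is small, we can perform the association on the map side ourselves and skip shuffling large amounts of data. Run time drops greatly, with speedups from several times to tens of times depending on the data.
Note: this question comes up in interviews with very high probability. Be sure to study related material; this answer is only a starting point.
29. Describe how the cogroup RDD is implemented. In which scenario have you used it?
A: Implementation of cogroup: from the two RDDs to be combined it creates a CoGroupedRDD instance. This RDD merges the values of identical keys from each of the two RDDs separately. The value of the returned RDD is a pair of two Iterables: the first holds the values for the key in RDD1 and the second holds the values for the same key in RDD2. Because cogroup needs to repartition through a partitioner, running it requires one shuffle (if both RDDs are already the result of a shuffle and have the same partitioner, no shuffle is needed).
Scenario: table join queries.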
30. What does the following code print?
--------------------------
def joinRdd(sc:SparkContext) {
val name= Array(
Tuple2(1,"spark"),
Tuple2(2,"tachyon"),
Tuple2(3,"hadoop")
)
val score= Array(
Tuple2(1,100),
Tuple2(2,90),
Tuple2(3,80)
)
val namerdd=sc.parallelize(name);
val scorerdd=sc.parallelize(score);
val result = namerdd.join(scorerdd);
result .collect.foreach(println);
}
--------------------------
Answer (the order of the lines may vary, because join does not guarantee ordering):
(1,(spark,100))
(2,(tachyon,90))
(3,(hadoop,80))
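[2000 Spark Interview Questions, 71-100] Spark Core Interview Questions, Part 03
Spark Core is the cornerstone of Spark and covers a lot of ground. The topics in these interview sets jump around and are scattered, so we recommend studying Spark systematically before reading them. Today we continue with the newest batch of the "2000 Spark Interview Questions" series, for reference and study only. This post is original to Meifenggu; please credit the source when reposting. If you find it helpful, please give it a like: your likes keep the volunteers going and bring the goal of 2000 high-quality Spark interview questions closer. If anything is inaccurate, please leave a comment.
I. 30 Interview Questions (Questions 71-100)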
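1. What benefits does the Parquet file format bring to Spark?
1) If HDFS is the de facto standard distributed file system of the big data era, Parquet is the de facto standard file storage format of that era.
2) Speed: comparing Spark SQL on plain CSV files with Spark SQL on Parquet files, Parquet is in most cases around 10 times faster. In some cases where plain files cannot be processed successfully on Spark, Parquet often can.
3) Parquet's compression is very stable and effective. Compression handling in Spark SQL can sometimes fail (causing lost tasks or lost executors, for example), whereas the same work completes normally with Parquet.
4) It greatly reduces disk I/O, typically saving about 75% of storage space, which greatly reduces the input that Spark SQL has to read. In Spark 1.6.x in particular, predicate pushdown (pushed-down filters) can in some cases greatly reduce disk I/O and memory usage.
5) In Spark 1.6.x Parquet scan throughput improved greatly, which speeds up data lookup considerably. Spark 1.6 is roughly twice as fast as Spark 1.5.x, and CPU usage when working with Parquet was also heavily optimized.
6) Parquet can greatly improve Spark's scheduling and execution. In our tests, using Parquet effectively reduced the execution cost of stages and improved the execution path.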
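2. How do executors share data?
A: Through HDFS or through Tachyon.
3. What characterizes Spark accumulators?
1) An accumulator is globally unique and only ever increases; it records a single cluster-wide state.
2) It is modified in executors and read on the driver.
3) It is shared at the executor level, whereas broadcast variables are shared at the task level.
Two applications cannot share an accumulator, but different jobs of the same application can.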
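4. How do you sort data whose size is not known in advance?
For efficiency, split the data into ranges, and keep the ranges themselves ordered (ascending or descending).
Reservoir sampling: the goal is to pick samples from a collection that is too large to fit in memory.
It draws K items out of N, where N is not known in advance.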
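Reservoir sampling itself is short enough to sketch. This is a minimal, generic version of the classic algorithm on a plain Scala iterator, not Spark's own sampling code; `ReservoirSampling.sample` is a hypothetical name used for this example.

```scala
import scala.util.Random

object ReservoirSampling {
  // Keeps a uniform random sample of up to k items from an iterator of unknown length.
  def sample[T](items: Iterator[T], k: Int, rng: Random = new Random()): Vector[T] = {
    require(k > 0, "k must be positive")
    var reservoir = Vector.empty[T]
    var seen = 0L
    items.foreach { item =>
      seen += 1
      if (reservoir.size < k) {
        reservoir = reservoir :+ item // fill the reservoir with the first k items
      } else {
        // Pick a slot in [0, seen); the new item replaces a kept one with probability k / seen.
        val j = (rng.nextDouble() * seen).toLong
        if (j < k) reservoir = reservoir.updated(j.toInt, item)
      }
    }
    reservoir
  }
}
```

The input is traversed once and only k items are held in memory at any time, which is why the technique suits data of unknown or very large size.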
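5. What is the drawback of Spark's HashPartitioner?
A: HashPartitioner works very simply. For a given key it computes the hashCode and takes the remainder after dividing by the number of partitions. If the remainder is negative, it adds the number of partitions. The resulting value is the ID of the partition the key belongs to. The drawback is uneven data distribution, which easily leads to data skew; in extreme cases a few partitions hold all of the RDD's data.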
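The partition-ID computation described above can be written out as a small sketch. `HashPartitionSketch` is a hypothetical stand-in written for illustration, not Spark's `HashPartitioner` class; sending null keys to partition 0 is a choice made in this sketch.

```scala
object HashPartitionSketch {
  // Remainder that always falls in [0, mod), even when x is negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition ID for a key; null keys go to partition 0 in this sketch.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)
}
```

Because the partition depends only on the key's hash, many records with the same key (or with colliding hashes) all land in the same partition, which is the source of the skew mentioned in the answer.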
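6. How does RangePartitioner partition data?
A: RangePartitioner tries to keep the amount of data in each partition even, and the partitions are ordered relative to one another: every element in one partition is smaller (or larger) than every element in another. Order within a partition is not guaranteed. Put simply, it maps numbers in a given range to a given partition. It is based on reservoir sampling. See this blog post: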
https://www.iteblog.com/archives/1522.html
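7. How are partitions and blocks related?
A: 1) A block in HDFS is the smallest unit of distributed storage. Blocks are equally sized and can be replicated. This design wastes some disk space, but uniform block sizes make it quick to find and read the corresponding content. 2) A partition in Spark is the smallest unit of an RDD (resilient distributed dataset); an RDD consists of partitions distributed across nodes. A partition is the smallest unit, in compute space, of the data Spark produces during computation. The partitions of one dataset (RDD) vary in size and number, which are determined by the operators in the application and by the number of blocks initially read. 3) Blocks live in storage space and partitions live in compute space. Block size is fixed and partition size is not; they are two different views of the data.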
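8. How is a Spark application executed?
1) The runtime environment of the Spark application is built (the SparkContext is started). The SparkContext registers with the resource manager (Standalone, Mesos, or YARN) and requests resources for running executors.
2) The resource manager allocates executor resources and starts StandaloneExecutorBackend. Executor status is sent to the resource manager with the heartbeat.
3) The SparkContext builds the DAG, splits it into stages, and sends the TaskSets to the TaskScheduler. Executors request tasks from the SparkContext, and the TaskScheduler hands tasks to the executors to run while the SparkContext ships the application code to the executors.
4) Tasks run on the executors, and all resources are released when they finish.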
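9. Is the number of HBase pre-split regions the same as the number of reducers in a Spark job?
A: It is the same as the number of Spark map tasks. If the number of reducers is not set, it equals the number of map tasks before the reduce.
10. How should we understand the statement that resource allocation in Standalone mode is coarse-grained?
A: By default Spark allocates resources in a coarse-grained way: resources are allocated when the program is submitted and used during execution, and they are reallocated only if a failure occurs. For example, once a Spark shell has started, been submitted, and registered, the workers allocate resources to its executors even when there are no tasks.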
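11. How do you write a custom partitioner in Spark?
A: 1) Spark implements two partitioning strategies by default, HashPartitioner and RangePartitioner, and we can add our own. A custom partitioner extends the class org.apache.spark.Partitioner and implements three methods:
def numPartitions: Int: returns the number of partitions you want to create.
def getPartition(key: Any): Int: computes the partition ID for the given key; the result must be between 0 and numPartitions-1.
equals(): the standard Java equality method. Users must implement it because Spark internally compares whether two RDDs are partitioned the same way.
2) To use it, pass the custom partitioner object to the partitionBy method.
Reference: http://blog.csdn.net/high2011/article/details/68491115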
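12. How many types of task does Spark have?
A: Two: 1) ResultTask, used for the tasks of the final stage; 2) ShuffleMapTask, used for everything before the final stage.
13. Does union produce a wide or a narrow dependency?
A: A narrow dependency.
14. What characterizes the RangePartitioner?
A: RangePartitioner tries to keep the amount of data in each partition even, and the partitions are ordered relative to one another: every element in one partition is smaller (or larger) than every element in another. Order within a partition is not guaranteed. Put simply, it maps numbers in a given range to a given partition. In the implementation the algorithm for choosing the boundaries is especially important; the corresponding function is rangeBounds.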
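15. What is secondary sort, and how do you implement it in Spark? (Often asked at Internet companies)
A: It means sorting on two dimensions: how to order records when the keys are equal. See this blog post: http://blog.csdn.net/sundujing/article/details/51399606
16. How do you solve the TopN problem with Spark? (Often asked at Internet companies)
A: A common interview question. See this blog post: http://www.cnblogs.com/yurunmiao/p/4898672.html
17. How do you solve the grouped-sort problem with Spark? (Often asked at Internet companies)
Data format: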
aa 11
bb 11
cc 34
aa 22
bb 67
cc 29
aa 36
bb 33
cc 30
aa 42
bb 44
cc 49
Requirements:
1. Group the data above by key.
2. Sort the values within each group.
3. Take the top 3 values of each group and return the result in key-value form.
Answer:
----------------------
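// Read "key value" lines from HDFS
val groupTopNRdd = sc.textFile("hdfs://db02:8020/user/hadoop/groupsorttop/groupsorttop.data")
groupTopNRdd
  .map(_.split(" "))
  .map(x => (x(0), x(1).toInt)) // parse the value as Int so the sort is numeric, not lexicographic
  .groupByKey()
  .map { case (key, values) =>
    // sort each group's values in descending order and keep the top 3
    (key, values.toList.sorted.reverse.take(3))
  }
  .collect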
---------------------
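To check the expected result without a cluster, the same grouping logic can be run on an in-memory collection. This is a sketch on plain Scala collections, assuming whitespace-separated "key value" lines as in the sample data; `GroupTopN.topN` is a hypothetical helper name.

```scala
object GroupTopN {
  // Parses "key value" lines, groups by key, and keeps the n largest values per key.
  def topN(lines: Seq[String], n: Int): Map[String, List[Int]] =
    lines
      .map(_.trim.split("\\s+"))
      .collect { case Array(k, v) => (k, v.toInt) } // skip lines that are not exactly two fields
      .groupBy(_._1)
      .map { case (k, pairs) =>
        k -> pairs.map(_._2).sorted(Ordering[Int].reverse).take(n).toList
      }
}
```

On the sample data above, with n = 3, this gives aa -> 42, 36, 22; bb -> 67, 44, 33; cc -> 49, 34, 30.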
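18. In a narrow dependency, is the relationship between parent RDD partitions and child RDD partitions always one-to-one?
A: Not necessarily. Besides one-to-one narrow dependencies there are narrow dependencies on a fixed number of parent partitions, meaning that the number of parent partitions depended on does not change with the size of the RDD. For example, when each partition of a join is joined only with known partitions, that join is a narrow dependency on a fixed number of parent RDD partitions, because the partition relationship is determined.
19. Which Spark operators correspond to the mapper and reducer phases of Hadoop MapReduce?
A: Roughly the map operator and the reduceByKey operator in Spark, although there are differences: MapReduce sorts automatically, while in Spark it depends on the partitioner you use.
20. What is a shuffle, and why is it needed?
A: A shuffle is like shuffling cards. It is needed so that data sharing some common feature can be gathered on one compute node for computation.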
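21. Is hash shuffle, which needs no sorting, always faster than sort shuffle, which does?
A: Not necessarily. With small data, hash shuffle is faster than sort shuffle. With large data, sort shuffle is much faster than hash shuffle, because large data produces many small files, uneven distribution, possibly data skew, and high memory consumption. Early 1.x versions of Spark used hash shuffle, which suits small and medium workloads; later 1.x versions added sort shuffle, which makes Spark better at large-scale processing.
22. What are the shortcomings of hash shuffle in Spark?
A: 1) The shuffle produces a huge number of small files on disk, which causes a lot of slow, inefficient I/O. 2) It easily runs short of memory, because memory has to hold a huge number of file handles and temporary buffers; with large data, OOM is likely. 3) It is prone to data skew, which leads to OOM.
23. How does consolidation reduce the small files that hash shuffle produces on the map side?
A: 1) Consolidation addresses two problems of hash shuffle: too many files open at once, which makes the writer handlers use too much memory, and too many files overall, which causes heavy random I/O and inefficient disk access. 2) Consolidation decides how many files the map side of the shuffle produces based on the number of CPU cores rather than the number of tasks. In the original example, with 10 tasks, 100 reducers, and 10 CPUs,
hash shuffle produces 10*100=1000 files, while consolidation produces 10*10=100 files.
Note: consolidation only partly reduces the number of files and file handles. With very high parallelism (many tasks) there are still many files.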
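24. How many temporary files does sort-based shuffle produce?
A: 2 * the total number of tasks in the map phase. The total number of partitions processed in parallel in the mapper phase is simply the number of mapper-side tasks.
25. What are the shortcomings of sort-based shuffle?
1) If the number of mapper tasks is too large, many small files are still produced. While data is transferred during the shuffle, the reducer side then has to deserialize a large number of records at once, which consumes a lot of memory and puts a heavy load on GC, slowing the system down or even crashing it.
2) If sorting is also needed within partitions, two sorts are required, one on the mapper side and one on the reducer side.
26. Does starting the Spark shell start Derby?
A: Starting the Spark shell starts Spark SQL, and Spark SQL stores its metadata in Derby by default. Avoid Derby where you can: it is single-instance and cannot support several users at once, which is inconvenient for development. It creates a metastore_db file locally; if startup fails with an error, delete that file.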
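27. What does spark.default.parallelism mean, and how should it be set in production?
A: 1) It sets the default number of tasks per stage. It is extremely important, and leaving it unset can directly hurt the performance of your Spark job. 2) Many people never set it, which makes the cluster very inefficient: however much CPU and memory you have, they are wasted if the number of tasks stays at 1. The Spark website suggests 2 to 3 times (CPU cores * number of executors) tasks.
28. What does spark.storage.memoryFraction mean, and how do you tune it in production?
A: 1) It sets the fraction of executor memory that persisted RDD data may occupy. The default is 0.6, meaning 60% of executor memory can hold persisted RDD data. Depending on the persistence strategy you choose, when memory runs short the data may not be persisted or may be written to disk. 2) If you persist a lot, you can raise spark.storage.memoryFraction so that more persisted data stays in memory and reads are faster. If you shuffle a lot, with many data reads and writes in the JVM, lower it to leave more memory for the JVM and avoid excessive GC. If the web UI shows long GC times, set spark.storage.memoryFraction lower.
29. What does spark.shuffle.memoryFraction mean, and what is your experience tuning it?
A: 1) spark.shuffle.memoryFraction is an important parameter in shuffle tuning. A shuffle pulls data from the tasks of the previous stage and aggregates it in the executor; this parameter determines the fraction of executor memory used for that aggregation. The default is 20%.
If the data being aggregated exceeds that size, it spills to disk, which hurts performance badly. 2) If the Spark job persists few RDDs and shuffles a lot, lower the memory share for persistence and raise the share for shuffle, so that the shuffle does not run out of memory and spill to disk. In addition, if the job runs slowly because of frequent GC, the memory available to tasks for user code is insufficient, and lowering this parameter is likewise recommended.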
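30. Describe your understanding of the Unified Memory Management model.
A: Memory use in Spark falls into two parts: execution and storage. Execution memory is mainly used for shuffles, joins, sorts, and aggregations; storage memory is used for caching and for propagating internal data across nodes. Before 1.6, the memory of an executor consisted of these parts:
1) ExecutionMemory. This region provides the buffers needed during shuffles, joins, sorts, and aggregations to avoid frequent I/O. It is configured with spark.shuffle.memoryFraction (default 0.2).
2) StorageMemory. This region holds the block cache (what you get when you explicitly call rdd.cache, rdd.persist, and similar methods), broadcasts, and task results. It is configured with spark.storage.memoryFraction (default 0.6).
3) OtherMemory. Reserved for the system, because the program itself also needs memory to run (default 0.2).
Shortcomings of the legacy memory management:
1) Shuffle gets only 0.2*0.8 of the memory. With so little, data may spill to disk, and frequent disk I/O is a heavy burden. Storage gets 0.6, mainly for iterative processing. The legacy allocation demands a lot from the operator. (Memory for shuffle is allocated through ShuffleMemoryManager, TaskMemoryManager, and ExecutorMemoryManager.) If one task takes all of the execution memory, other tasks find none left and can only wait.
2) By default a task may fill the whole memory from its thread, which happens when a partition is particularly large. Other tasks then have no memory and the remaining cores sit idle, which is a huge waste. This also results from improper operation.
3) With the MEMORY_AND_DISK_SER storage level, RDD data is obtained record by record through an iterator. If memory is short (spark.storage.unrollFraction), unrolling checks whether there is enough memory and, if so, moves on to the next record. The unroll space comes out of storage memory. If unrolling fails, the data goes straight to disk.
4) By default a task keeps part of its data in memory before spilling to disk. If it cannot get memory it does not run, and it waits indefinitely, consuming CPU and memory.
On this basis Spark introduced the UnifiedMemoryManager. It no longer separates ExecutionMemory from StorageMemory. In practice the two regions still exist, but execution memory can use storage memory and storage memory can use execution memory: when one side runs short, it borrows from the other.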
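The "0.2*0.8" figure in the legacy model is plain arithmetic and can be made concrete. The sketch below only multiplies the fractions quoted in the text; `LegacyMemory` and its members are hypothetical names, and the 0.8 factor is taken from the text as a safety factor applied on top of the shuffle fraction.

```scala
object LegacyMemory {
  // Fractions quoted in the text: shuffle fraction 0.2 with a 0.8 safety factor.
  val ShuffleFraction = 0.2
  val ShuffleSafetyFraction = 0.8

  // Heap (in MB) usable for shuffle aggregation under the pre-1.6 static model.
  def shuffleMemoryMb(executorHeapMb: Double): Double =
    executorHeapMb * ShuffleFraction * ShuffleSafetyFraction
}
```

For a 10240 MB executor heap this comes to about 1638 MB, which shows how little room the static model leaves for shuffle-heavy jobs.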
---------------------------------------------------------------------------------------------------------------------
[2000 Spark Interview Questions, 101-130] Spark on YARN Interview Questions, Part 04
This set mainly contains interview questions about Spark on YARN, covering Spark on YARN, YARN, and MapReduce.
I. 30 Interview Questions
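1. What are the shortcomings of MRv1?
1) Scalability (the ability to cope with change)
a) The JobTracker keeps the information of user jobs in memory.
b) The JobTracker uses coarse-grained locks.
2) Reliability and availability
a) A JobTracker failure loses all running jobs in the cluster, and users have to resubmit them and recover their workflows by hand.
3) Support for different programming models
The MapReduce-centric design of Hadoop v1 supports a wide range of use cases but does not suit every kind of large-scale computation, such as Storm or Spark.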
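2. Describe how YARN executes a job.
1) The client submits an application to the ResourceManager. The ResourceManager accepts it and, based on the state of cluster resources, picks a node on which to start the application's task scheduler, the driver (ApplicationMaster).
2) The ResourceManager tells the NodeManager on that node to start a new JVM process that runs the driver (ApplicationMaster) part of the program. On startup the driver (ApplicationMaster) first registers with the ResourceManager to declare that it is responsible for running the current program.
3) The driver (ApplicationMaster) downloads the relevant jars and other resources, and based on them decides which resources to request from the ResourceManager.
4) On receiving the request, the ResourceManager satisfies it as far as it can and sends the metadata of the allocated resources to the driver (ApplicationMaster).
5) On receiving that metadata, the driver (ApplicationMaster) instructs the NodeManagers on the specific machines to start the containers.
6) Each NodeManager receives the instruction and starts a container. Once started, the container must register with the driver (ApplicationMaster).
7) After receiving the container registrations, the driver (ApplicationMaster) schedules and runs the tasks until the job is complete.
Note: if the ResourceManager cannot satisfy the driver's (ApplicationMaster's) request the first time and later finds idle resources, it proactively sends the metadata of the available resources to the driver (ApplicationMaster) so that the current program gets more resources.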
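3. Who is responsible for destroying containers in Yarn, and can containers be reused in Hadoop MapReduce?
A: The ApplicationMaster is responsible for destroying them. Containers cannot be reused in Hadoop MapReduce, but they can be reused in a Spark on Yarn program.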
4. How do you specify the run mode of a Spark Application when submitting it?
1) cluster mode: ./spark-submit --class xx.xx.xx --master yarn --deploy-mode cluster xx.jar
2) client mode: ./spark-submit --class xx.xx.xx --master yarn --deploy-mode client xx.jar
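5. Can a Spark program run without starting the Master and Worker services of a Spark cluster?
A: Yes, as long as resources are managed by a third-party resource manager such as Yarn; Spark can then be used without starting the Spark cluster. Starting a Spark cluster only starts the Workers and the Master, which together are simply a resource management framework. In Yarn, the ResourceManager plays the role of the Master and the NodeManager plays the role of the Worker. The computation itself is done by Executors, which do not depend on the Workers and Master of a Spark cluster. In the end everything runs on the JVM, so it is enough for Spark to be installed on the machines where those JVMs run.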
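6. What is port 4040 used for in Spark?
A: It serves the web UI of a running Spark application, which collects and displays information about the jobs it runs.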
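7. In Spark on Yarn cluster mode, do the ApplicationMaster and the driver run in the same process?
A: Yes. The driver lives inside the ApplicationMaster process. That process is responsible for requesting resources and for monitoring the program and the changing state of its resources.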
8. Which command shows the logs of an application run?
A: yarn logs -applicationId <app ID>
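9. What are the advantages of Spark on Yarn?
1) Cluster resources are shared with other compute frameworks (e.g. when Spark and MapReduce run at the same time without Yarn allocating resources, MapReduce ends up with very little memory and runs inefficiently). Resources are allocated on demand, which raises cluster utilization.
2) Compared with Spark's built-in Standalone mode, Yarn allocates resources at a finer granularity.
3) Application deployment is simplified: applications from frameworks such as Spark and Storm are submitted by the client, after which Yarn handles resource management and scheduling, using the Container as the unit of resource isolation through which memory, CPU and so on are consumed.
4) Yarn uses queues to manage the multiple services running in the cluster at the same time, and can adjust the resource usage of each according to the load of the different application types, which gives elastic resource management.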
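10. Explain your understanding of a container.
1) A Container is the basic unit of resource allocation and scheduling. It encapsulates resources such as memory, CPU, disk, and network bandwidth. At present Yarn only encapsulates memory and CPU.
2) Containers are requested from the ResourceManager by the ApplicationMaster, and are allocated to the ApplicationMaster asynchronously by the resource scheduler inside the ResourceManager.
3) A Container is launched by the ApplicationMaster through the NodeManager that owns the resources. The command of the task to execute inside the Container must be supplied at launch.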
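11. How many kinds of containers does an Application running on Yarn have?
1) The Container that runs the ApplicationMaster: it is requested (from the internal resource scheduler) and started by the ResourceManager. When submitting the application, the user can specify the resources needed by its single ApplicationMaster.
2) The Containers that run the various tasks: they are requested from the ResourceManager by the ApplicationMaster, which then communicates with the NodeManagers to start them.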
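12. What does the Spark on Yarn architecture look like? (You should be able to draw this diagram.)
The App Master in Yarn can be thought of as the driver in Spark's Standalone mode. Executors run inside Containers, and each Executor runs its Tasks in parallel as multiple threads. The execution flow is similar to the one in question 2.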
13. Which parameters specify the resources when Executors are started?
1) num-executors is the number of executors
2) executor-memory is the memory used by each executor
3) executor-cores is the number of CPU cores allocated to each executor
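A minimal Python sketch of how these three options fit into a spark-submit command line. The helper names build_submit_command and total_task_slots, the class name com.example.App, and the jar name app.jar are hypothetical; the spark-submit flags are the ones named in this document. The slot count assumes each task uses one core (the default value of spark.task.cpus).

```python
def build_submit_command(main_class, jar, num_executors, executor_memory,
                         executor_cores, deploy_mode="cluster"):
    """Assemble a spark-submit argument list for Yarn (illustrative helper)."""
    return [
        "spark-submit",
        "--class", main_class,
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        jar,
    ]


def total_task_slots(num_executors, executor_cores):
    # Assumption: one core per task, so the number of tasks that can run
    # at the same time is executors times cores per executor.
    return num_executors * executor_cores


cmd = build_submit_command("com.example.App", "app.jar", 4, "2g", 2)
print(" ".join(cmd))
print(total_task_slots(4, 2))
```

The three values are usually chosen together, since they jointly determine how much of the cluster one application occupies.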
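14. Why was Yarn created, what problems does it solve, and what are its advantages?
1) Why it was created: Yarn is a resource management framework proposed to address the various shortcomings of MRv1.
2) For the problems it solves and its advantages, see this post: http://www.aboutyun.com/forum.php?mod=viewthread&tid=6785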
15. What is the execution process of MapReduce?
Stage 1: input/map/partition/sort/spill
Stage 2: merge on the mapper side
Stage 3: merge/reduce/output on the reducer side
For the detailed process, see http://www.cnblogs.com/hipercomer/p/4516581.html
16. What determines the number of maps of a job?
In general, when the input source is a set of files, the number of maps of a job is determined by splitSize, which in turn is determined by the following:
goalSize = totalSize / mapred.map.tasks
minSize = max {mapred.min.split.size, minSplitSize}
splitSize = max (minSize, min(goalSize, dfs.block.size))
The number of reduces of a job is determined by the number of partitions.
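The three formulas above can be worked through with a short Python sketch. The function names and the example sizes are assumptions for illustration; the parameters mirror totalSize, mapred.map.tasks, mapred.min.split.size, minSplitSize and dfs.block.size from the text. The sketch ignores any tolerance Hadoop applies to the last split, so the split count is an approximation.

```python
import math

MB = 1024 * 1024


def split_size(total_size, map_tasks, min_split_size, block_size, format_min_split=1):
    """Apply goalSize, minSize and splitSize exactly as written in the text."""
    goal_size = total_size // max(map_tasks, 1)      # totalSize / mapred.map.tasks
    min_size = max(min_split_size, format_min_split)  # max{mapred.min.split.size, minSplitSize}
    return max(min_size, min(goal_size, block_size))


def estimated_maps(total_size, size):
    # One map per split; the last split may be smaller than the others.
    return math.ceil(total_size / size)


total = 1024 * MB  # a 1 GB input, chosen only for illustration

# Requesting 2 maps: goalSize is 512 MB, but the 128 MB block size caps the split.
default_split = split_size(total, map_tasks=2, min_split_size=1, block_size=128 * MB)

# Requesting 16 maps: goalSize is 64 MB, below the block size, so splits shrink.
small_split = split_size(total, map_tasks=16, min_split_size=1, block_size=128 * MB)

# Raising the minimum split size to 256 MB forces larger splits and fewer maps.
large_split = split_size(total, map_tasks=2, min_split_size=256 * MB, block_size=128 * MB)

print(default_split // MB, estimated_maps(total, default_split))
print(small_split // MB, estimated_maps(total, small_split))
print(large_split // MB, estimated_maps(total, large_split))
```

In other words, mapred.map.tasks can only push the split size below the block size, while mapred.min.split.size is the setting that pushes it above.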
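17. How large is the data output after reduce?
The interviewer does not really want the exact data volume. The point is to find out whether you understand how MR executes and whether you seriously evaluated the efficiency of your program after writing it.
1) Describe the resources used for the reduce tasks. For MRv1, say how many resources were given to map and how many reduces there were. For MRv2, mention how much memory and CPU the cluster gives Yarn for computation.
2) Answer from a real scenario: how large the input was, roughly how many records it had, which logical operations were applied, how many records were output, how long the job ran, and whether the data was skewed during reduce.
3) Also mention which optimizations you made to the MapReduce job and how much faster it became. Listing one or two optimizations is enough.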
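18. How large was the data when your project submitted its job?
A: 1) State the data format, whether compression was used and, if so, the approximate compression ratio. 2) Give the approximate file size: how many maps and how many reduces were started, how much data the map stage read, how much data the reduce stage read, and roughly how long the program ran. 3) Describe the cluster: how many nodes, how much memory, how many CPU cores, and so on. Work these points into the answer instead of just giving a number.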
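19. Roughly how many jobs do you submit, and how long do they take to finish?
This again checks whether you carefully observed your program running after writing it and whether you evaluated its efficiency.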
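20. How large is your business data? How many rows does it have?
This also checks whether you have real experience. If you have no production experience, focus your answer on how MR runs, on MR efficiency, and on how to optimize MR programs (study other people's optimization demos, then test them on a virtual machine).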
22. How do you kill a running job?
To kill a job:
MRv1: hadoop job -kill <jobId>
YARN: yarn application -kill <applicationId>
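23. List the schedulers you know and explain how they work.
a) FIFO scheduler: the default scheduler, first in first out
b) Capacity scheduler: schedules by computing capacity, picking jobs with a small memory footprint and high priority
c) Fair scheduler: all jobs get an equal share of resources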
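24. In YarnClient mode, running Spark SQL fails with Exception in thread "Thread-2" java.lang.OutOfMemoryError: PermGen space, but it runs normally in Yarn Cluster mode. What could be the cause?
1) Cause: the query calls into Hive to fetch metadata and parse SQL, and uses Cglib and similar libraries for serialization and deserialization. This can generate a large number of classes, so the JVM's permanent generation fills up.
The default permanent generation size is 64M in Cluster mode and 32M in Client mode, while SQL processing on the Driver side can use up to 90M of permanent generation, which causes the OOM and the job fails.
The error shows up in yarn-cluster mode while yarn-client mode runs normally, because $SPARK_HOME/bin/spark-class already sets the permanent generation size:
JAVA_OPTS="-XX:MaxPermSize=256m $OUR_JAVA_OPTS"
2) Fix: in spark-defaults.conf under Spark's conf directory, add JVM options for the Driver, because it is the Driver that parses SQL and fetches metadata. The configuration is:
spark.driver.extraJavaOptions -XX:PermSize=128M -XX:MaxPermSize=256M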
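25. What does the spark.driver.extraJavaOptions parameter mean, and what value do you use in production?
It is a string of extra JVM options passed to the driver, for example GC settings or other logging settings. Note that it is not allowed to set Spark properties or the heap size through this option. Spark properties must be set through a SparkConf object or through the spark-defaults.conf file used by the spark-submit script. The driver heap size is set through spark.driver.memory. The executor-side counterparts are spark.executor.extraJavaOptions and spark.executor.memory.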
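26. What causes an Executor to run a Full GC, and what problems can it lead to?
A: It can leave the Executor unresponsive. Shuffling massive amounts of data and data skew can both cause Full GC. Take shuffle as an example: with a large number of shuffle writes, the JVM's young generation collects constantly. When Eden Space fills up, objects move to Survivor Space, and data above a certain size is written directly to the old generation. When the young generation is full, older objects are also promoted to the old generation. If the old generation runs out of space, a Full GC is triggered. If there is still not enough space afterwards, an OOM error occurs. At that point threads are blocked, and the whole Executor process stalls.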
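27. What are the roles of the Combiner and the partitioner?
combine happens on both the map side and the reduce side. Its job is to merge the key-value pairs that share the same key, and it can be customized. The combine function merges the <key,value> pairs produced by one map function (many key,value pairs) into new <key2,value2> pairs, which are then fed to the reduce function. value2 can also be called values, because there are several of them. The purpose of this merge is to reduce network transfer.
partition splits the result of each map node and maps it to different reduces by key. It can also be customized. It can be understood as classification of tangled data. Imagine a farm with cows, sheep, chickens, ducks and geese all mixed together during the day. At night the cows go back to the cowshed, the sheep to the sheepfold, and the chickens to the coop. partition does the same for data. When you write a program, MapReduce classifies the data for you with the hash-based HashPartitioner, and you can replace it with your own.
shuffle is the process between map and reduce, and it includes the combine and partition steps on both sides. Map results are distributed to the Reducers through partition. After a Reducer finishes the reduce operation, it writes its output through OutputFormat. The main function of the shuffle stage is fetchOutputs(), which copies the output of the map stage to the local disk of the reduce node.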
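The combine and partition steps described above can be sketched in plain Python. This is an illustration, not Hadoop code: the function names are hypothetical, the combiner simply sums values per key, and the partition rule follows the usual hash-partitioning idea (non-negative hash modulo the number of reducers), using a reimplementation of Java's string hash so that results are stable between runs.

```python
from collections import defaultdict


def java_string_hash(text):
    """32-bit string hash in the style of Java's String.hashCode (unsigned form)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def hash_partition(key, num_reducers):
    # Drop the sign bit, then take the remainder: every key maps to one reducer.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def combine(pairs):
    """Map-side combine: merge values of the same key before they cross the network."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def shuffle(map_outputs, num_reducers):
    """Send each combined pair to the reducer chosen by the partitioner."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in combine(output):
            buckets[hash_partition(key, num_reducers)][key].append(value)
    return buckets


map1 = [("cow", 1), ("sheep", 1), ("cow", 1)]
map2 = [("cow", 1), ("chicken", 1)]
buckets = shuffle([map1, map2], num_reducers=2)
for reducer_id, bucket in enumerate(buckets):
    print(reducer_id, dict(bucket))
```

Without the combine step, map1 would ship three pairs instead of two. The partitioner guarantees that all values for "cow" from both maps arrive at the same reducer.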
28. When running a Spark job you see java.lang.OutOfMemoryError: GC overhead limit exceeded and java.lang.OutOfMemoryError: java heap space. What are the causes and fixes?
A: Cause: too much data was loaded into memory, the local machine is also underpowered, and too much time is spent in GC.
Fixes:
1) Add the option -XX:-UseGCOverheadLimit to turn this check off, and increase the heap size at the same time, e.g. -Xmx1024m
2) Increase the following two settings
export SPARK_EXECUTOR_MEMORY=6000M
export SPARK_DRIVER_MEMORY=7000M
See also: http://www.cnblogs.com/hucn/p/3572384.html
29. List the languages you have used to develop map/reduce programs in your previous work.
A: Java, Scala, Python, shell
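30. What effect does a misconfigured /etc/hosts have on the cluster?
A: 1) Directly, host names cannot be resolved, so the master node cannot communicate with the worker nodes and the worker nodes cannot communicate with each other. 2) Indirectly, services on the misconfigured nodes misbehave or fail to start, jobs fail, and so on.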
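Spark Core Interview, Part 05
Original 2017-06-12 梅峰谷 大数据梅峰谷
The Spark RDD is the foundation of Spark programming, and mastering RDDs and RDD programming techniques is an essential skill for real enterprise development. This part collects common RDD questions into a question set to deepen the understanding of RDDs and RDD programming. The questions are listed first so that interested readers can work through them on their own. In the next part 梅峰谷 will publish the answers through a cloud drive link, so stay tuned if you are interested.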
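1. What is the difference between the private and private[this] modifiers in Scala?
2. How do inner classes in Scala differ from inner classes in Java?
3. What are the characteristics of Spark's standalone mode, and what are its advantages and disadvantages?
4. What are the basic principle, advantages and disadvantages of the FIFO scheduling mode?
5. What are the advantages and disadvantages of the FAIR scheduling mode?
6. What are the advantages and disadvantages of the CAPACITY scheduling mode?
7. List the serialization methods you know, and explain the benefits of serialization.
8. What are the common data compression methods, which one does your production cluster use, and how much did it improve efficiency?
9. Briefly describe how Spark writes data.
10. What is the basic principle of Lineage in Spark?
11. Implement WordCount in shell and in Scala.
12. List CPU-intensive scenarios you have encountered. Which optimizations did you make?
13. What is the difference between Spark RDD and MR2?
14. Spark reads a file from HDFS and then counts its lines. Can you describe the process? Is the count computed in memory or on disk?
15. Which is faster, Spark or MapReduce? Why is it faster, and where does the speed come from?
16. Why is Spark SQL faster than Hive?
17. What does the data structure of an RDD look like?
18. Inside an RDD operator you modify an external map, for example by putting data into it, and then iterate over the map outside the operator. What problem does this cause?
19. Describe the Hadoop ecosystem as you understand it.
20. How do you tune the JVM? Describe your Spark JVM tuning experience.
21. What is the structure of the JVM? How many regions does the heap have?
22. How do you use Spark for data cleaning?
23. How does Spark integrate with Hive?
24. When Spark reads data, how many Partitions does it create?
25. At what size does an HBase region split, and how does Spark divide partitions when reading HBase data?
26. Draw the working modes of Spark and its distributed deployment architecture.
27. Draw and explain the Spark workflow and how it maps to the roles in the cluster.
28. Which kinds of thread pools does Java provide out of the box?
29. Draw and explain the shuffle process. How do you avoid the related performance problems when programming?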