Big Data Interview Questions


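Spark Core Interview Questions, Part 01

As Spark is adopted more and more widely in industry, it has become a required skill for big data development. We have previously shared many Spark learning videos and articles. To help readers consolidate and master Spark, we are adding a "Spark Interview 2000 Questions" series on top of the existing Spark series. The question set covers basic concepts, principles, coding, performance tuning, operations, source code, and the surrounding Spark ecosystem. Some questions come from the Internet and were collected and organized by Meifenggu volunteers; others were designed by the volunteers from problems met in production. We hope they are helpful.

I. Short-answer questions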
 
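1. When the Spark master uses ZooKeeper for HA, which metadata is stored in ZooKeeper?
A: The parameter spark.deploy.zookeeper.dir specifies where the master's metadata is stored in ZooKeeper. The metadata covers Workers, Drivers, Applications, and Executors. A standby node must read this metadata from ZooKeeper and restore the cluster's running state before it can serve requests such as job submission and resource requests; it cannot accept requests until recovery completes. Two points deserve attention during a Master switchover:
1) All applications that are already running keep running normally during the switchover. A Spark application obtains its compute resources through the cluster manager before it runs, so job scheduling and processing at run time do not depend on the Master.
2) The only impact of the switchover is that new work cannot be submitted. New applications cannot be submitted to the cluster, because only the active Master accepts submission requests. Likewise, a running application cannot have new submission requests served by the Master, for example requests arising from a job triggered by an action, until recovery is done.
2. Why does a Spark master HA failover not affect jobs already running on the cluster?
A: Because the application requested its resources before it started running. The driver communicates with the Executors directly and does not need to talk to the master.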
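3. In Spark on Mesos, what are coarse-grained and fine-grained allocation, and what are the advantages and disadvantages of each?
A: 1) Coarse-grained: resources are allocated once at startup. After the application starts, it uses the resources already allocated and needs no further allocation. Advantage: when there are many jobs, resource reuse is high, which suits coarse-grained allocation. Disadvantage: resources are easily wasted. If a job has 1000 tasks and 999 have finished while one is still running, the resources of the 999 finished tasks sit idle. 2) Fine-grained: resources are allocated when needed and reclaimed immediately after use. Startup is more cumbersome because every launch requires a new allocation.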
4. How do you configure HA for the Spark master?
1) Configure ZooKeeper.
2) Edit spark-env.sh. The Spark master parameter is no longer specified there; instead add the following line on every master node:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk01:2181,zk02:2181,zk03:2181 -Dspark.deploy.zookeeper.dir=/spark"
3) Distribute spark-env.sh to every node.
4) Pick one master node and run ./sbin/start-all.sh; the primary master starts there. On each of the other (standby) master nodes, start a master with: ./sbin/start-master.sh
5) When submitting an application, list all three masters, for example:
./spark-shell --master spark://master01:7077,master02:7077,master03:7077
5. What are the common stable versions of Apache Spark, and what does each number in Spark 1.6.0 mean?
A: Common major stable versions include Spark 1.3, Spark 1.6, and Spark 2.0. The digits in Spark 1.6.0 mean the following:
1) First number: 1
Major version: a major release, which usually brings some API changes, large optimizations, or structural changes.
2) Second number: 6
Minor version: a minor release, which usually adds new APIs, optimizes existing APIs, or updates other things such as the web UI.
3) Third number: 0
Patch version: fixes bugs in the current minor release, with essentially no API changes or new features. An experienced engineer once advised that when switching Spark versions it is best to choose one whose patch version is not 0. Releases such as 1.2.0, … 1.6.0 are large updates and may contain hidden bugs or instability, so releases such as 1.2.1, … 1.6.1 are the safer choice.
From this explanation it is easy to see that Spark 2.1.1 is a set of bug fixes for the 2.1 line. It adds no features and no APIs, and it is more stable than 2.1.0.
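The version scheme above can be made concrete with a small parser. This is only an illustrative sketch in plain Scala; `SparkVersion` and `isFirstReleaseOfMinorLine` are hypothetical names, not part of Spark.

```scala
// Hypothetical helper, not part of Spark: models a "major.minor.patch" release string.
final case class SparkVersion(major: Int, minor: Int, patch: Int) {
  // True for x.y.0 releases, i.e. the first release of a minor line.
  def isFirstReleaseOfMinorLine: Boolean = patch == 0
}

object SparkVersion {
  // Returns None when the string is not exactly three dot-separated integers.
  def parse(s: String): Option[SparkVersion] = s.trim.split('.') match {
    case Array(ma, mi, pa) =>
      scala.util.Try(SparkVersion(ma.toInt, mi.toInt, pa.toInt)).toOption
    case _ => None
  }
}
```

With this sketch, `SparkVersion.parse("1.6.0")` is flagged as the first release of its minor line, while `SparkVersion.parse("2.1.1")` is not, which matches the advice to prefer releases with a non-zero patch number.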
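6. What does the driver do?
A: 1) A running Spark job includes one Driver process, which is the job's main process. It contains the main function and holds the SparkContext instance, and it is the entry point of the program. 2) Function: it requests resources from the cluster, registers with the master, schedules the job, parses the job, generates Stages, and dispatches Tasks to Executors. It contains the DAGScheduler and the TaskScheduler.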
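7. What deployment modes does Spark have, and what characterizes each?
1) Local mode
Spark does not have to run on a Hadoop cluster; it can run locally with a specified number of threads. The application runs directly on the local machine in multiple threads, usually to make debugging easier. Local mode has three variants:
· local: starts a single worker thread
· local[k]: starts k worker threads
· local[*]: starts as many worker threads as there are CPU cores
2) Standalone mode
A distributed cluster deployment with a complete set of built-in services. Spark itself handles resource management and task monitoring. This mode is also the basis of the other modes.
3) Spark on YARN mode
A distributed cluster deployment in which resource management and task monitoring are handed to YARN. It currently supports only coarse-grained resource allocation. It has a cluster mode and a client mode: cluster mode suits production, the driver runs on a node inside the cluster, and it is fault tolerant; client mode suits debugging, and the driver runs on the client.
4) Spark on Mesos mode. This is the officially recommended mode (one reason, of course, being the shared heritage of the two projects). Because Spark was designed with Mesos support from the start, Spark currently runs more flexibly and more naturally on Mesos than on YARN. Users can run their applications in one of two scheduling modes:
1) Coarse-grained mode: each application's runtime environment consists of one Driver and several Executors. Each Executor holds a certain amount of resources and can run multiple Tasks internally (one per "slot"). Before any task of the application starts, all resources of the runtime environment must be acquired, and they stay occupied for the whole run even when unused. They are reclaimed only after the application finishes.
2) Fine-grained mode: because coarse-grained mode wastes a lot of resources, Spark on Mesos also offers fine-grained mode, which resembles today's cloud computing in that resources are allocated on demand.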
8. What components make up the Spark technology stack, what does each do, and what scenarios does each suit?
A: Start by sketching the technology stack, then explain the function and use case of each component.
1) Spark Core: the foundation of the other components and the kernel of Spark. It mainly contains the directed acyclic graph (DAG), RDD, Lineage, Cache, broadcast, and so on, and it wraps the underlying communication framework.
2) Spark Streaming: a high-throughput, fault-tolerant stream processing system for real-time data streams. It can apply complex operations such as Map, Reduce, and Join to many data sources (such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets), and it breaks the streaming computation into a series of short batch jobs.
3) Spark SQL: Shark was the predecessor of Spark SQL. An important feature of Spark SQL is that it handles relational tables and RDDs uniformly, so developers can easily query external data with SQL commands and at the same time perform more complex data analysis.
4) BlinkDB: a massively parallel query engine for running interactive SQL queries on very large data. It lets users trade data accuracy for faster query response, keeping the accuracy within an allowed error bound.
5) MLBase: the part of the Spark ecosystem focused on machine learning. It lowers the barrier to machine learning so that users who do not know much about it can still use it conveniently. MLBase has four parts: MLlib, MLI, ML Optimizer, and MLRuntime.
6) GraphX: the Spark component for graphs and graph-parallel computation.
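9. What is the main job of a Worker in Spark?
A: Main functions: it manages the memory and CPU usage of its node, receives resource instructions from the master, and starts processes through ExecutorRunner to which tasks are assigned. The worker is like a foreman: it manages and allocates new processes that provide the compute service. Two points to note: 1) The worker does not report its current resource information to the master. The heartbeat a worker sends to the master mainly carries only the worker id and no resource information. The master already knows the worker's resources at allocation time, and resource information is sent only when a failure occurs. 2) The worker does not run application code. The Executor is what runs the business logic written in the application; the worker itself never runs the program's code.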
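10. Why is Spark faster than MapReduce?
A: 1) It computes in memory, reducing inefficient disk I/O. 2) It uses an efficient DAG-based scheduling algorithm. 3) It uses Lineage for fault tolerance. The essence is the DAG and Lineage.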
11. Briefly describe the similarities and differences between shuffle in Hadoop and in Spark.
A: 1) At a high level the two are not very different. Both partition the output of the mapper (ShuffleMapTask in Spark) and send different partitions to different reducers (in Spark the reducer may be a ShuffleMapTask in the next stage or a ResultTask). The reducer uses memory as a buffer, aggregating data while shuffling, and runs reduce() once the data is aggregated (in Spark this may be a series of subsequent operations).
2) At a low level the two differ considerably. Hadoop MapReduce is sort-based: records entering combine() and reduce() must be sorted first. The benefit is that combine/reduce() can handle large-scale data, because the input can be produced by external sorting (the mapper sorts each segment of data, and the reducer's shuffle merges the sorted segments). Spark at the time defaulted to hash-based shuffle, usually using a HashMap to aggregate shuffled data without sorting it in advance. If you need sorted data, you must call an operation such as sortByKey() yourself. On Spark 1.1 you can set spark.shuffle.manager to sort, and the data will be sorted. In Spark 1.2, sort becomes the default shuffle implementation.
3) In terms of implementation there are also many differences. Hadoop MapReduce divides processing into clearly separated phases: map(), spill, merge, shuffle, sort, reduce(), and so on. Each phase has its own job, and each can be implemented one by one in a procedural style. Spark has no such clearly delineated phases, only stages and a series of transformation() calls, so operations such as spill, merge, and aggregate have to be embedded inside transformation().
If we call the map-side process of partitioning and persisting data "shuffle write", and the reducer-side process of reading and aggregating data "shuffle read", then the question in Spark becomes: how do we add shuffle write and shuffle read logic to a job's logical or physical execution plan, and how do we implement both efficiently?
Shuffle write: because the data need not be ordered, the task of shuffle write is simple: partition the data and persist it. It is persisted partly to reduce memory pressure and partly for fault tolerance.
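12. MapReduce and Spark both do parallel computation. What do they have in common and how do they differ?
A: Both use the MapReduce model for parallel computation.
1) In Hadoop a unit of work is called a job. A job is split into map tasks and reduce tasks, and each task runs in its own process. When the task ends, the process ends too.
2) In Spark the work a user submits is called an application, and one application corresponds to one SparkContext. An application contains multiple jobs, and each action triggers one job. These jobs may run in parallel or serially. Each job has multiple stages, which the DAGScheduler derives by splitting the job at shuffle boundaries according to the dependencies between RDDs. Each stage has multiple tasks, grouped into a TaskSet that the TaskScheduler distributes to the executors. An executor's lifetime equals that of the application and it exists even when no job is running, so tasks can start quickly and compute on data in memory.
3) A Hadoop job has only map and reduce operations, so its expressiveness is limited. The MapReduce process repeatedly reads and writes HDFS, causing heavy I/O, and the user must manage the relationships between multiple jobs.
Spark's iterative computation is done in memory. Its API offers many RDD operations such as join and groupBy, and the DAG enables good fault tolerance.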
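13. What is the RDD mechanism?
A: An RDD is a resilient distributed dataset. It can be understood simply as a data structure, and it is the common currency of the Spark framework.
All operators execute on RDDs. Different scenarios have different RDD implementation classes, but they can all be converted into one another.
As RDDs execute they form a DAG, which in turn forms the lineage that provides fault tolerance. Physically, an RDD stores the mapping between blocks and nodes.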
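14. What components does Spark have?
A: The main components are:
1) master: manages the cluster and its nodes; does not take part in computation.
2) worker: a compute node; the worker process itself does not compute, and it reports to the master.
3) Driver: runs the program's main method and creates the SparkContext object.
4) SparkContext: controls the lifecycle of the whole application and contains components such as the DAGScheduler and the TaskScheduler.
5) client: the entry point through which users submit programs.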
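15. How does Spark work?
A: After the user submits a job from the client, the Driver runs the main method and creates the SparkContext.
Executing RDD operators builds a DAG, which is passed to the DAGScheduler. The DAGScheduler splits it into stages according to the dependencies between RDDs and passes them to the TaskScheduler. The TaskScheduler turns each stage into a task set and distributes it to the executors on the nodes for execution.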
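16. How do you optimize Spark?
A: Spark tuning is fairly complex, but it broadly falls into three areas. 1) Platform-level tuning: avoid unnecessary jar distribution, improve data locality, and choose an efficient storage format such as Parquet. 2) Application-level tuning: optimize filter operators to reduce the number of small tasks, reduce per-record resource overhead, handle data skew, reuse RDDs through caching, run jobs in parallel, and so on. 3) JVM-level tuning: set appropriate resource sizes, configure the JVM sensibly, enable an efficient serializer such as Kryo, increase off-heap memory, and so on.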
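17. Briefly describe the steps for building a distributed Spark cluster.
1) Prepare the Linux environment: create the account and user group for the cluster, set up SSH, turn off the firewall, disable SELinux, and configure hosts and the hostname.
2) Add the JDK to the environment variables.
3) Build the Hadoop cluster; if you want master HA, also build a ZooKeeper cluster.
Edit configuration files such as hdfs-site.xml, hadoop-env.sh, yarn-site.xml, and slaves.
4) Start the Hadoop cluster; format the NameNode before the first start.
5) Configure the Spark cluster: edit spark-env.sh, slaves, and other configuration files, and copy the relevant Hadoop configuration into Spark's conf directory.
6) Start the Spark cluster.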
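18. What are wide and narrow RDD dependencies?
The relationship between an RDD and the parent RDD(s) it depends on is of one of two types: narrow dependency and wide dependency.
1) Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.
2) Wide dependency: multiple partitions of the child RDD depend on the same partition of the parent RDD.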
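19. How do you include external jars when using spark-submit?
Method 1: spark-submit --jars
According to the Spark documentation, specify --jars at submission time with the jars separated by commas. The drawback is that the jars must be listed on every submission. This is fine for a few jars but tedious for many.
Command: spark-submit --master yarn-client --jars ***.jar,***.jar
Method 2: extraClassPath
Set the parameters in spark-defaults at submission time: copy all required jars into one directory and point the parameters at that directory. This is much more convenient than method 1:
spark.executor.extraClassPath=/home/hadoop/wzq_workspace/lib/* spark.driver.extraClassPath=/home/hadoop/wzq_workspace/lib/*
Note that the directory must exist on every machine that may run Spark tasks, and the jars must be copied to all of them. The advantage is that you no longer write a long list of jars when submitting; the drawback is that all jars have to be copied to every machine.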
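20. What is the difference between cache and persist?
A: 1) Both cache and persist cache an RDD so that it need not be recomputed when used later, which can save a great deal of run time. 2) cache has only one default storage level, MEMORY_ONLY. cache calls persist, and persist can be given other storage levels as needed. 3) When an executor runs, by default 60% of its memory is used for caching and 40% for task execution. persist is the more fundamental, lower-level function.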
 
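II. Multiple-choice questions
1. Which of the following is not one of Spark's four major components? (D)
A. Spark Streaming B. MLlib
C. GraphX D. Spark R

2. Which of the following ports is not used by a built-in Spark service? (C)
A. 8080 B. 4040 C. 8090 D. 18080
Note: 8080 is the Spark cluster web UI port, 4040 is the Spark job monitoring port, and 18080 is the job history port.

3. What was the biggest change in Spark 1.4? (B)
A. Release version of Spark SQL B. Introduction of Spark R
C. DataFrame D. Support for dynamic resource allocation

4. What is the default scheduling mode for Spark jobs? (A)
A. FIFO B. FAIR
C. None D. Specified at run time

5. Which of the following is not a condition for running in local mode? (D)
A. spark.localExecution.enabled=true
B. Local execution is explicitly specified
C. The finalStage has no parent stage
D. The partition default value

6. Which of the following is not a characteristic of an RDD? (C)
A. Partitionable B. Serializable C. Modifiable D. Persistable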
 
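7. Which statement about broadcast variables is wrong? (D)
A. Can be used in any function call B. Read-only
C. Stored on every node D. Stored on disk or HDFS

8. Which statement about accumulators is wrong? (D)
A. Support addition B. Support numeric types
C. Can be used in parallel D. Do not support custom types

9. Which of the following is not a distributed deployment mode supported by Spark? (D)
A. standalone B. Spark on Mesos
C. Spark on YARN D. Spark on local

10. What determines the number of tasks in a stage? (A)
A. Partition B. Job C. Stage D. TaskScheduler

11. Which of the following operations is a narrow dependency? (B)
A. join B. filter
C. group D. sort

12. Which of the following operations is definitely a wide dependency? (C)
A. map B. flatMap
C. reduceByKey D. sample

13. How do the Spark master and worker communicate? (D)
A. http B. nio C. netty D. Akka

14. What is the default storage level? (A)
A. MEMORY_ONLY B. MEMORY_ONLY_SER
C. MEMORY_AND_DISK D. MEMORY_AND_DISK_SER

15. Which value does spark.deploy.recoveryMode not support? (D)
A. ZooKeeper B. FileSystem
C. NONE D. Hadoop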
 
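16. Which of the following is not an RDD caching method? (C)
A. persist() B. cache()
C. Memory()

17. A task is a unit of work running on an Executor. In which of the following does it run? (C)
A. Driver program B. Spark master
C. Worker node D. Cluster manager

18. What is the difference between storing Hive metadata in Derby and in MySQL? (B)
A. No difference B. Multiple sessions
C. Network support D. The difference between the databases

19. What is the biggest difference between a DataFrame and an RDD? (B)
A. Support for scientific statistics B. It adds a schema
C. Different storage method D. Support for external data sources

20. What does the Master do after the ElectedLeader event? (D)
A. Notify the driver B. Notify the worker
C. Register the application D. Go directly to ALIVE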
 
 
 
 
 
-----------------------------------------------------------------------------------------------------------------------------
 
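[Spark Interview 2000 Questions, 41-70] Spark Core Interview Questions, Part 02
This batch of Spark interview questions was contributed by the volunteer Taffry (a graduate student at a university). Many thanks for the high-quality questions. If you have good interview questions, you can send them privately to the group owner (you can join the volunteer QQ group: 233864572). To keep the quality high, the group owner and the Meifenggu platform team review contributed questions and may edit them slightly in places; we ask for the volunteers' understanding.
I. 30 interview questions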
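1. Can other operators be chained after cache, and is cache an action?
A: Other operators can follow cache, but doing so defeats the intended caching effect, because the cache is triggered again.
cache is not an action.
2. Is reduceByKey an action?
A: No. Many people assume it is, but it is reduce on an RDD that is an action.
3. At which step is data locality determined?
It determines which machine a given task runs on, and it is decided when the DAG is split into stages.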
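4. In what ways is an RDD "resilient"?
1) Storage switches automatically between memory and disk.
2) Efficient fault tolerance based on Lineage.
3) A failed task is retried automatically a set number of times.
4) A failed stage is retried automatically a set number of times, and only the failed partitions are recomputed.
5) checkpoint and persist: data can be persisted or cached after it is computed.
6) Resilient scheduling: DAG and task scheduling are independent of resource management.
7) Highly elastic partitioning: a. many small fragments can be merged into larger partitions, b. par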
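5. What are the common types of fault tolerance?
1) Data checkpointing: copies are made, which wastes resources.
2) Logging data updates: every update is recorded, which is complex and costly in performance.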
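6. Why is the RDD's Lineage approach (recording data updates) so efficient?
1) Laziness records where the data came from. RDDs are immutable and lazy, and they form a chain; laziness is the cornerstone of resilience. Because an RDD is immutable, every operation produces a new RDD, so there is no problem of global modification and control becomes easier. The whole computation chain is stored, and recovery traces backward from the end. Recomputation starts from the end of the previous stage (for example step 900 of a long chain) or from a checkpoint.
2) Recording raw data on every modification is very expensive. If a whole collection is modified at once, the cost is small. The official position is that RDD operations are coarse-grained, for efficiency and simplicity: every operation works on a collection of data, and writes and modifications are collection-based. RDD writes are coarse-grained, while RDD reads can be coarse-grained or fine-grained, since individual records can be read one by one.
3) Reduced complexity is part of what makes it efficient. Coarse-grained writes do limit the use cases (a web crawler, for example), but in the real world most writes are coarse-grained.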
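7. What are the shortcomings of RDDs?
1) They do not support fine-grained writes and updates (as in a web crawler). Spark writes data in a coarse-grained way, meaning data is written in bulk for efficiency. Reads, however, are fine-grained: records can be read one at a time.
2) They do not support incremental iterative computation, which Flink does.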
8. Describe the general steps of writing a Spark program.
A: Initialization, resources, data source, parallelization, RDD transformations, then an action operator that prints the result or saves it to the appropriate storage medium.
9. What are the two kinds of operators in Spark?
A: Transformation operators and Action operators.
10. Which command do you use to submit your jar to Spark?
A: spark-submit.
11. What aggregation operators does Spark have, and which kinds of operators should we try to avoid?
A: During development, avoid operators that shuffle, such as reduceByKey, join, distinct, and repartition, whenever possible, and prefer map-style operators that do not shuffle. A Spark job with no shuffle, or with only a few shuffle operations, has far lower performance overhead.
12. What is your understanding of Spark's shuffle process?
A: Discuss it from these three angles:
1) How the shuffle process is divided
2) How the intermediate shuffle results are stored
3) How the shuffle data is fetched
See this blog post: http://www.cnblogs.com/jxhd1/p/6528540.html
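13. How do you read data from Kafka?
1) The Receiver-based approach
This approach uses a Receiver to obtain the data. The Receiver is implemented with Kafka's high-level Consumer API. Data the receiver fetches from Kafka is stored in the memory of Spark Executors, and the jobs launched by Spark Streaming then process it.
2) The Direct approach
This newer, receiver-less direct approach was introduced in Spark 1.3 to provide a more robust mechanism. Instead of using a Receiver, it periodically queries Kafka for the latest offset of each topic+partition and uses those to define the offset range of each batch. When the job that processes the data starts, it uses Kafka's simple consumer API to read the specified offset ranges from Kafka.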
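14. What good solutions do you have for data skew in Spark?
1) First locate the skew: is it an OOM or a slow task? Check the logs and the web UI.
2) Solutions come from several directions:
· Avoid unnecessary shuffles, for example by broadcasting the small table to turn a reduce-side join into a map-side join
· Split out the skewed records, process them in several parts, and then merge the joined results
· Change the parallelism; it may be too low, putting too much data on individual tasks
· Two-stage aggregation: aggregate locally first, then globally
· Use a custom partitioner to spread the keys more evenly
For detailed solutions see the blog post "Spark Data Skew Optimization Methods": https://mp.weixin.qq.com/s?__biz=MzIzNzI1NzY3Nw==&mid=2247484221&idx=1&sn=7e20f08bfb490b91f0920aefb29ca271&chksm=e8ca159fdfbd9c89f610dd230e07f414521b4dd13018994ee9b873421d1e8efcdc535c810225&scene=21#wechat_redirect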
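The two-stage aggregation idea from the list above can be sketched without a cluster. The following is a plain-Scala simulation on local collections, not Spark code; `TwoStageAggregationSketch`, `sumByKey`, and the `salts` parameter are hypothetical names chosen for illustration. In a real job the same two steps would typically be two keyed aggregations on an RDD.

```scala
import scala.util.Random

object TwoStageAggregationSketch {
  // Sums values per key in two stages. `salts` must be positive.
  def sumByKey(records: Seq[(String, Int)], salts: Int,
               rng: Random = new Random()): Map[String, Int] = {
    // Stage 1: attach a random salt to each key so one hot key spreads over
    // several groups, then aggregate per salted key (the "local" aggregation).
    val partial: Map[(Int, String), Int] = records
      .map { case (k, v) => ((rng.nextInt(salts), k), v) }
      .groupBy(_._1)
      .map { case (saltedKey, group) => saltedKey -> group.map(_._2).sum }

    // Stage 2: drop the salt and aggregate the partial sums per original key.
    partial.toSeq
      .map { case ((_, k), v) => (k, v) }
      .groupBy(_._1)
      .map { case (k, group) => k -> group.map(_._2).sum }
  }
}
```

The result is the same as a single-stage sum, whatever salts are drawn; the point of the extra stage is only that the heavy key is split across several groups in the first pass.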
15. In what ways can an RDD be created?
1) From a collection in the program
2) From the local file system
3) From HDFS
4) From a database
5) From a NoSQL store such as HBase
6) From S3
7) From a data stream such as a socket
Answering with only the first three is not enough; it suggests an entry-level understanding. In practice there are many ways to create an RDD.
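16. How should Spark parallelism be set?
A: For Spark parallelism, let each core handle 2 to 4 partitions. With 32 cores, for example, the parallelism should be between 64 and 128, that is, 64 to 128 partitions. Parallelism is independent of data size; it depends only on memory usage and CPU time.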
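As a quick check of the arithmetic, the rule of thumb above can be written as a one-line helper. This is a minimal sketch; `ParallelismSketch` and `suggestedPartitions` are hypothetical names, and the 2x and 4x factors are simply the ones quoted in the answer.

```scala
object ParallelismSketch {
  // Suggested partition-count range for a given core count,
  // using the 2-4 partitions per core rule of thumb from the text.
  def suggestedPartitions(totalCores: Int): Range.Inclusive =
    (totalCores * 2) to (totalCores * 4)
}
```

For the 32-core example in the answer, `ParallelismSketch.suggestedPartitions(32)` spans 64 to 128.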
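17. What manages the location of data in Spark?
A: Every data partition corresponds to a concrete physical location, and data locations are managed by the BlockManager. Whether the data is on disk, in memory, or in Tachyon, the BlockManager manages it.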
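18. What kinds of data locality does Spark have?
A: Three data locality levels are commonly cited in Spark:
a. PROCESS_LOCAL: reading data cached on the local node
b. NODE_LOCAL: reading data from the local node's disk
c. ANY: reading data from a non-local node
Read performance usually ranks PROCESS_LOCAL > NODE_LOCAL > ANY, so try to have data read as PROCESS_LOCAL or NODE_LOCAL. PROCESS_LOCAL is also tied to caching: if an RDD is used often, cache it in memory. Note that because cache is lazy, an action must be triggered before the RDD is actually cached in memory.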
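19. How many types of RDD operations are there?
1) transformation: turns one RDD into another RDD
2) action
3) controller: control operators such as cache and persist, which give strong support for performance and efficiency
There are three types; do not answer that there are only two.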
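20. How does Spark handle objects that cannot be serialized?
Wrap the non-serializable content in an object.
21. What does collect do, and how is it implemented underneath?
A: The driver uses collect to gather the contents from every node in the cluster and combine them into one result. collect returns an Array: it fetches the data from each node as arrays and merges them into a single Array (for a key-value RDD, the elements are tuples).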
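22. Why does a Spark program sometimes generate many tasks by default, and how do you change the default number of tasks?
A: 1) Because the input data produces many tasks, especially when there are many small files: as many tasks start as there are input blocks. 2) Spark has the concept of a partition, and each partition corresponds to one task. More tasks make processing more efficient for large data, but more is not always better: for routine testing or modest data volumes there is no need for many tasks. 3) The parameters can be set in the configuration file spark_home/conf/spark-defaults.conf:
spark.sql.shuffle.partitions 50
spark.default.parallelism 10
The first sets the number of tasks for Spark SQL.
The second takes effect for non-Spark-SQL programs.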
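23. Why does a job start running before the Spark application has obtained enough resources, and what problems can this cause?
A: The job may run while cluster resources are insufficient, and it may even finish without ever being given enough resources. The job starts running tasks as soon as some Executors have been allocated, presumably because the task scheduling thread and the Executor resource requests are asynchronous. If you want to wait until all resources have been obtained before running the job, set spark.scheduler.maxRegisteredResourcesWaitingTime to a large value and spark.scheduler.minRegisteredResourcesRatio to 1. Weigh this against your actual situation, though; otherwise the job can easily wait a long time for resources and never run.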
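24. What is the difference between map and flatMap?
map: transforms each element of the RDD; for a text file, each line yields one array object.
flatMap: transforms each element of the RDD and then flattens the result, merging all the objects into one collection. For a text file, all lines together yield a single collection of elements, and null values are discarded.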
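The difference is easiest to see on a small example. The sketch below uses ordinary Scala collections, whose `map` and `flatMap` have the same shape as the RDD operators; the sample lines and the name `MapVsFlatMap` are made up for illustration.

```scala
object MapVsFlatMap {
  val lines: List[String] = List("hello spark", "hello hadoop")

  // map: exactly one output element per input element.
  // Each line becomes one array of words, so the result is a list of arrays.
  val mapped: List[Array[String]] = lines.map(_.split(" "))

  // flatMap: each line may produce several elements, and the per-line
  // results are flattened into one list of words.
  val flattened: List[String] = lines.flatMap(_.split(" ").toList)
}
```

Here `mapped` has two elements (one array per line), while `flattened` has four (one per word).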
25. List the actions you use most often.
collect, reduce, take, count, saveAsTextFile, and so on.
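26. Why does Spark need persistence, and in which scenarios do you call persist?
Why persist?
Every moderately complex Spark algorithm uses persist. Spark keeps data in memory by default, and much of Spark's content lives in memory, which suits fast iteration: across 1000 steps only the first has input data and no temporary data is produced in between. But distributed systems are risky and errors are likely, so fault tolerance is needed. A failed RDD or partition can be recomputed from its lineage, but if the parent RDD was not persisted or cached, the computation has to start from the beginning.
persist is used in the following scenarios:
1) A step is very expensive to compute and should be persisted.
2) The computation chain is very long and recovery would require recomputing many steps; persist helps a lot here.
3) The RDD on which checkpoint is called should be persisted.
checkpoint is lazy. When the framework sees a checkpoint, it triggers a separate job at checkpoint time and recomputes the RDD. So persist before checkpointing: call rdd.cache or rdd.persist to save the result and then call checkpoint. This runs much faster because the RDD chain need not be recomputed. Always persist before a checkpoint.
4) Why persist after a shuffle? A shuffle involves network transfer, which is risky; if data is lost and must be redone, recovery is very costly.
5) Before a shuffle, the framework persists the data to disk by default; the framework does this automatically.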
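27. Why serialize?
Serialization reduces the size of data and the storage space it needs, making storage and transfer efficient. The downside is that data must be deserialized before use, which consumes a lot of CPU.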
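28. Describe your experience optimizing join operations.
A: Joins commonly fall into two classes: map-side join and reduce-side join. When joining a large table with a small one, a map-side join significantly improves efficiency. Joining multiple datasets is very common in data processing, but in a distributed system it often becomes troublesome, because the framework's join operation generally sends all data to the reduce partitions according to key, which is the shuffle process. This costs a great deal of network and disk I/O and runs very inefficiently; the process is usually called a reduce-side join. If one of the tables is small, we can implement the join on the map side ourselves and skip shuffling large amounts of data. Run time drops substantially, with speedups from a few times to dozens of times depending on the data.
Note: this question comes up very often in interviews. Be sure to look up related material and master it; the answer here is only a starting point.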
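To make the map-side join idea concrete, here is a minimal sketch on plain Scala collections rather than RDDs. `MapSideJoinSketch` and `join` are hypothetical names; the in-memory `Map` stands in for what would be a broadcast variable holding the small table in a real Spark job.

```scala
object MapSideJoinSketch {
  // Joins each record of `large` against a lookup table built from `small`.
  // Assumes keys in `small` are unique; toMap keeps the last value otherwise.
  def join[K, A, B](large: Seq[(K, A)], small: Seq[(K, B)]): Seq[(K, (A, B))] = {
    val lookup: Map[K, B] = small.toMap // stand-in for a broadcast variable
    // Each large-side record is matched locally, with no regrouping by key.
    large.flatMap { case (k, a) => lookup.get(k).map(b => (k, (a, b))) }
  }
}
```

Because every record of the large side is resolved with a local lookup, nothing on the large side has to be redistributed by key, which is the step a reduce-side join pays for.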
29. Explain how the cogroup RDD is implemented. In what scenarios have you used it?
A: The cogroup function takes the two RDDs to be combined and creates a CoGroupedRDD instance. For each key, the resulting RDD groups the values from each of the two RDDs separately. The value of the returned RDD is a Pair containing two Iterables: the first holds the values for that key in RDD1, and the second holds the values for that key in RDD2. Because cogroup needs to repartition the data through a partitioner, it requires one shuffle. (If both RDDs are already the output of a shuffle and use the same partitioner, no shuffle is needed.)
Scenario: join queries across tables.
30. What does the following code print?
--------------------------
import org.apache.spark.SparkContext

def joinRdd(sc: SparkContext): Unit = {
  // (id, name) pairs
  val name = Array(
    Tuple2(1, "spark"),
    Tuple2(2, "tachyon"),
    Tuple2(3, "hadoop")
  )
  // (id, score) pairs
  val score = Array(
    Tuple2(1, 100),
    Tuple2(2, 90),
    Tuple2(3, 80)
  )
  val namerdd = sc.parallelize(name)
  val scorerdd = sc.parallelize(score)
  // Inner join on the id key
  val result = namerdd.join(scorerdd)
  result.collect.foreach(println)
}
--------------------------
Answer (the order of the rows is not guaranteed):
(1,(spark,100))
(2,(tachyon,90))
(3,(hadoop,80))
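[Spark Interview 2000 Questions, 71-100] Spark Core Interview Questions, Part 03
Spark Core is the cornerstone of Spark and covers many topics. The topics in this question set jump around and are scattered, so we suggest studying Spark systematically before reading it. Today we continue with the newly compiled and designed "Spark Interview 2000 Questions" set, for reference and study only. This post is an original work of Meifenggu; please credit the source when reposting. If you find it helpful, please give it a like: your likes are what keep the volunteers going and what drives us to finish 2000 high-quality Spark interview questions soon. If anything is inaccurate, please leave a comment.

I. 30 interview questions (questions 71-100)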
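1. What benefits does the Parquet file format bring to Spark?
1) If HDFS is the de facto standard distributed file system of the big data era, Parquet is the de facto standard file storage format of the big data era.
2) Faster: comparing Spark SQL on ordinary CSV files and on Parquet files, Parquet is in most cases about 10 times faster than ordinary files such as CSV. In some cases where ordinary files cannot be processed successfully on Spark, Parquet often works.
3) Parquet's compression is very stable and effective. Spark SQL's handling of compression may fail to complete normally (causing lost tasks or lost executors, for example), whereas with Parquet the work completes normally.
4) It greatly reduces disk I/O and typically cuts storage space by 75%, which greatly reduces the input data Spark SQL has to process. In particular, Spark 1.6.x has a pushdown filter that in some cases greatly reduces disk I/O and memory usage.
5) Parquet in Spark 1.6.x greatly improves scan throughput and data lookup speed. Compared with Spark 1.5.x, Spark 1.6 is roughly twice as fast. Spark 1.6.x also heavily optimizes CPU usage when working with Parquet, effectively lowering CPU consumption.
6) Parquet can greatly improve Spark's scheduling and execution. In our tests, using Parquet effectively reduced the execution cost of stages and also optimized the execution path.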
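2. How do Executors share data with one another?
A: Through HDFS or through Tachyon.
3. What are the characteristics of Spark accumulators?
1) An accumulator is globally unique and only increases, never decreases. It records a single global state for the cluster.
2) It is modified in executors and read in the driver.
3) It is shared at the executor level, whereas a broadcast variable is shared at the task level.
Two applications cannot share an accumulator, but different jobs in the same application can.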
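4. How do you sort data whose size is not known in advance?
For efficiency the data must be partitioned, and the partition ranges themselves must be ordered (ascending or descending).
Reservoir sampling: the goal is to draw a sample from a very large collection, and it is used when the data does not fit in memory. It draws K items out of N, where N is not known in advance.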
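5. What is the drawback of Spark's HashPartitioner?
A: The principle of HashPartitioner is simple: for a given key, compute its hashCode and take the remainder after dividing by the number of partitions. If the remainder is negative, add the number of partitions to it. The final value is the ID of the partition the key belongs to. The drawback is uneven data distribution, which easily leads to data skew; in extreme cases a few partitions hold all the data of the RDD.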
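The rule just described fits in a few lines. The sketch below is plain Scala written to mirror the description, not a copy of Spark's class; `HashPartitionSketch` and `partitionFor` are hypothetical names, and mapping a null key to partition 0 is an assumption added for completeness.

```scala
object HashPartitionSketch {
  // Maps a key to a partition id in [0, numPartitions).
  def partitionFor(key: Any, numPartitions: Int): Int = {
    if (key == null) 0 // assumption: null keys go to partition 0
    else {
      val rawMod = key.hashCode % numPartitions
      // The JVM remainder can be negative for negative hash codes, so shift it.
      if (rawMod < 0) rawMod + numPartitions else rawMod
    }
  }
}
```

The skew risk is easy to reproduce: with 4 partitions, the keys 0, 4, 8, and 12 all map to partition 0, so a key set with that pattern leaves the other three partitions empty.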
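6. How does RangePartitioner work?
A: RangePartitioner tries to keep the amount of data in each partition even, and the partitions are ordered relative to one another: every element in one partition is smaller (or larger) than every element in another. The order of elements within a partition is not guaranteed, however. Put simply, it maps values in a given range to a particular partition. It is based on reservoir sampling. See this blog post:
https://www.iteblog.com/archives/1522.html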
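7. How are partitions and blocks related?
A: 1) A block in HDFS is the smallest unit of distributed storage. Blocks are equal in size and can be replicated. This design wastes some disk space, but the uniform block size makes it fast to locate and read the corresponding content. 2) A partition in Spark is the smallest unit of a resilient distributed dataset (RDD); an RDD is made up of partitions spread across the nodes. A partition is the smallest unit of data Spark produces in the compute space during computation. The partitions of one dataset (RDD) vary in size and number, depending on the operators in the application and on the number of blocks in the data first read in. 3) Blocks live in the storage space and partitions in the compute space. Block size is fixed and partition size is not; they are two different views of the data.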
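8. How is a Spark application executed?
1) Build the runtime environment of the Spark application (start the SparkContext). The SparkContext registers with the resource manager (Standalone, Mesos, or YARN) and requests resources for running Executors.
2) The resource manager allocates Executor resources and starts StandaloneExecutorBackend. Executors report their status to the resource manager with heartbeats.
3) The SparkContext builds the DAG, splits it into Stages, and sends TaskSets to the Task Scheduler. Executors request Tasks from the SparkContext, and the Task Scheduler hands Tasks to Executors to run, while the SparkContext ships the application code to the Executors.
4) Tasks run on the Executors, and all resources are released when they finish.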
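9. Is the number of HBase pre-split regions the same as the number of reducers in a Spark job?
A: It is the same as the number of map tasks in Spark. If the number of reducers is not set, it equals the number of map tasks before the reduce.
10. How should one understand that Spark resource allocation is coarse-grained in Standalone mode?
A: By default Spark allocates resources in a coarse-grained way: resources are allocated when the application is submitted, and later execution uses them. Resources are reallocated only if they fail. For example, once a Spark shell has started, been submitted, and registered, the worker allocates resources to its executors even if there are no tasks.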
11. How do you define a custom partitioner in Spark?
A: 1) Spark implements two partitioning strategies by default, HashPartitioner and RangePartitioner. We can also add our own strategy. A custom partitioner extends the org.apache.spark.Partitioner class and implements three methods:
def numPartitions: Int: returns the number of partitions you want to create.
def getPartition(key: Any): Int: computes on the input key and returns its partition ID, which must be in the range 0 to numPartitions-1.
equals(): the standard Java equality method. Users must implement it because Spark internally compares whether two RDDs are partitioned the same way.
2) To use it, pass the custom partitioner object to the partitionBy method.
Reference: http://blog.csdn.net/high2011/article/details/68491115
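The three methods can be illustrated with a small class. To keep the sketch compilable without Spark on the classpath, it extends a local stand-in, `PartitionerLike`, which is a hypothetical abstract class mirroring the two abstract members named above; in a real job the class would extend org.apache.spark.Partitioner instead. `FirstCharPartitioner` is likewise a made-up example.

```scala
// Stand-in for org.apache.spark.Partitioner so this compiles without Spark.
abstract class PartitionerLike extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Hypothetical partitioner: routes string keys by their first character.
class FirstCharPartitioner(override val numPartitions: Int) extends PartitionerLike {
  require(numPartitions > 0, "numPartitions must be positive")

  // Always returns a value in [0, numPartitions).
  override def getPartition(key: Any): Int = key match {
    case s: String if s.nonEmpty => s.charAt(0).toInt % numPartitions
    case _                       => 0 // empty, null, or non-string keys
  }

  // Two partitioners are interchangeable when they route every key identically.
  override def equals(other: Any): Boolean = other match {
    case p: FirstCharPartitioner => p.numPartitions == numPartitions
    case _                       => false
  }

  // Keep hashCode consistent with equals.
  override def hashCode: Int = numPartitions
}
```

Overriding `hashCode` together with `equals` is a general JVM convention rather than something the text above requires, but it avoids surprises when partitioners are compared or used as map keys.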
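12. How many types of task are there in Spark?
A: Two. 1) ResultTask, the tasks of the last stage. 2) ShuffleMapTask, the tasks of every stage except the last.
13. Does union produce a wide or a narrow dependency?
A: A narrow dependency.
14. What are the characteristics of RangePartitioner?
A: RangePartitioner tries to keep the amount of data in each partition even, and the partitions are ordered relative to one another: every element in one partition is smaller (or larger) than every element in another. The order of elements within a partition is not guaranteed. Put simply, it maps values in a given range to a particular partition. In the implementation, the algorithm that chooses the boundaries is especially important; the corresponding function is rangeBounds.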
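15. What is secondary sort, and how do you implement it in Spark? (Often asked at Internet companies)
A: It means sorting on two dimensions: how to order records when their keys are equal. See this blog post: http://blog.csdn.net/sundujing/article/details/51399606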
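The idea of ordering by key and then by value within equal keys can be shown on a local collection. This is a plain-Scala sketch with hypothetical names (`SecondarySortSketch`, `secondarySort`); it relies on the standard tuple ordering and is not the approach from the linked post, which may differ.

```scala
object SecondarySortSketch {
  // Orders (key, value) pairs by key first, then by value within equal keys.
  // The implicit Ordering for tuples compares components left to right.
  def secondarySort(pairs: Seq[(String, Int)]): Seq[(String, Int)] =
    pairs.sortBy { case (k, v) => (k, v) }
}
```

In a Spark job the same effect is usually obtained by sorting on a composite key, so that the second field breaks ties among records with the same first field.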
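16. How do you solve the TopN problem with Spark? (Often asked at Internet companies)
A: A common interview question. See this blog post: http://www.cnblogs.com/yurunmiao/p/4898672.html
17. How do you solve the grouped sort problem with Spark? (Often asked at Internet companies)
Input data format: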
aa 11
bb 11
cc 34
aa 22
bb 67
cc 29
aa 36
bb 33
cc 30
aa 42
bb 44
cc 49
Requirements:
1. Group the data above by key.
2. Sort the values within each group.
3. Take the top 3 values of each group and return the result as key-value pairs.
Answer:
----------------------
val groupTopNRdd = sc.textFile("hdfs://db02:8020/user/hadoop/groupsorttop/groupsorttop.data")
groupTopNRdd
  .map(_.split(" "))
  // Parse the value as Int so the sort is numeric rather than lexicographic
  .map(x => (x(0), x(1).toInt))
  .groupByKey()
  .map { case (key, values) =>
    // Sort descending and keep the three largest values per key
    (key, values.toList.sorted.reverse.take(3))
  }
  .collect
---------------------
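18. In a narrow dependency, are the partitions of the parent RDD and the child RDD always in a one-to-one relationship?
A: Not necessarily. Besides one-to-one narrow dependencies, there are narrow dependencies on a fixed number of partitions (the number of parent partitions depended on does not change with the size of the RDD's data). For example, when each partition in a join is joined only with known partitions, the join is a narrow dependency on a fixed number of parent RDDs, because the partition relationship is determined.
19. Which Spark operators correspond to the mapper and reducer phases of Hadoop MapReduce?
A: They correspond to the map operator and the reduceByKey operator in Spark. There are differences, of course: MapReduce sorts automatically, whereas in Spark it depends on which partitioner you use.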
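20. What is shuffle, and why is it needed?
Shuffle literally means shuffling cards. It is needed because data sharing some common characteristic must be gathered on one compute node to be processed together.
21. Is hash shuffle, which needs no sorting, always faster than sort shuffle, which does?
A: Not necessarily. With small data, hash shuffle is faster than sort shuffle. With large data, sort shuffle is much faster than hash shuffle, because at scale hash shuffle produces many small files, distributes data unevenly, may even cause data skew, and uses a lot of memory. Early Spark 1.x used hash shuffle, which suits small and medium workloads. Later 1.x versions added sort shuffle, which makes Spark better suited to large-scale processing.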
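22. What are the shortcomings of hash shuffle in Spark?
A: 1) Shuffle produces a huge number of small files on disk, which causes many slow, inefficient I/O operations. 2) It easily runs out of memory: memory has to hold a huge number of file handles and temporary buffers, so large workloads are prone to OOM. 3) It is prone to data skew, which leads to OOM.
23. How does consolidation reduce the small files that hash shuffle produces on the map side?
A: 1) Consolidation addresses two problems of hash shuffle: opening too many files at once, which makes writer handlers use too much memory, and producing too many files, which causes a lot of random reads and writes and therefore inefficient disk I/O. 2) Consolidation uses the number of CPU cores to determine how many files the map side of the shuffle produces. The source gives this example: with 10 tasks, 100 reducers, and 10 CPU cores, plain hash shuffle produces 10*100=1000 files, whereas consolidation produces 10*10=100 files.
Note: consolidation only partly reduces the number of files and file handles. With high parallelism (many tasks) there are still many files.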
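24. How many temporary files does sort-based shuffle produce?
A: 2 * the number of tasks in the map phase. That number is the total number of partitions processed in parallel in the mapper phase, which is simply the number of mapper-side tasks.
25. What are the shortcomings of sort-based shuffle?
1) If the number of mapper tasks is very large, many small files are still produced. During shuffle transfer on the reducer side, the reducer then has to deserialize a large number of records at once, which consumes a lot of memory, puts a heavy load on GC, and can slow the system down or even crash it.
2) If sorting within partitions is also needed, the data must be sorted twice, once on the mapper side and once on the reducer side.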
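26. Does starting the Spark shell start Derby?
A: Starting the Spark shell starts Spark SQL, and Spark SQL uses Derby by default to store metadata. Derby creates a metastore_db file locally; if startup fails with an error, delete that file. Avoid Derby where possible: it is single-instance, cannot support several users at the same time, and is inconvenient for development.
27. What does the parameter spark.default.parallelism mean, and how should it be set in production?
A: 1) It sets the default number of tasks per stage. This parameter is extremely important; leaving it unset may directly hurt the performance of your Spark job. 2) Many people never set this parameter, which makes the cluster very inefficient. However much CPU and memory you have, it is wasted if the number of tasks stays at 1. The Spark documentation recommends a task count of 2 to 3 times (CPU cores per executor * number of executors).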
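28. What does the parameter spark.storage.memoryFraction mean, and how do you tune it in production?
A: 1) It sets the fraction of Executor memory that persisted RDD data may occupy. The default is 0.6, meaning that by default 60% of Executor memory can hold persisted RDD data. Depending on the persistence level you choose, when memory is insufficient the data may not be persisted or may be written to disk. 2) If there are many persist operations, raise spark.storage.memoryFraction so that more persisted data stays in memory and reads are faster. If there are many shuffle operations, with lots of data being read and written in the JVM, lower it to leave more memory for the JVM and avoid excessive GC. If the web UI shows long GC times, set spark.storage.memoryFraction lower.
29. What does the parameter spark.shuffle.memoryFraction mean, and what is your tuning experience with it?
A: 1) spark.shuffle.memoryFraction is an important parameter in shuffle tuning. When shuffle pulls data from the tasks of the previous stage, the data is aggregated in the Executor. The fraction of Executor memory available for this aggregation is set by this parameter, and the default is 20%. If the data being aggregated exceeds this size, it is spilled to disk, which greatly reduces performance. 2) If the job has few RDD persist operations and many shuffle operations, lower the memory fraction for persistence and raise the memory fraction for shuffle. This avoids running out of memory during shuffle and having to spill to disk, which hurts performance. In addition, if the job runs slowly because of frequent GC, the tasks do not have enough memory for user code, and it is likewise advisable to lower this parameter.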
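30. Describe your understanding of the Unified Memory Management model.
A: Memory use in Spark falls into two parts: execution and storage. Execution memory is mainly used for shuffles, joins, sorts, and aggregations. Storage memory is used for caching and for transferring internal data across nodes. Before 1.6, an Executor's memory consisted of the following parts:
1) ExecutionMemory. This region provides the buffers that shuffles, joins, sorts, and aggregations need to avoid frequent I/O. It is configured with spark.shuffle.memoryFraction (default 0.2).
2) StorageMemory. This region is for the block cache (used when you explicitly call rdd.cache, rdd.persist, and similar methods), as well as for broadcasts and the storage of task results. It is set with spark.storage.memoryFraction (default 0.6).
3) OtherMemory. Reserved for the system, because the program itself also needs memory to run (default 0.2).
Shortcomings of the traditional memory management:
1) Shuffle gets 0.2*0.8 of memory. With so little allocated, data may be spilled to disk, and frequent disk I/O is a heavy burden. Storage gets 0.6, mainly for iterative processing. The traditional Spark memory allocation demands a lot from the operator. (Shuffle memory allocation: ShuffleMemoryManager, TaskMemoryManager, ExecutorMemoryManager.) If one Task takes all the Execution memory, other Tasks have none and can only wait.
2) By default a Task's thread may fill all the memory, which happens when a partition is especially large. Other Tasks then have no memory and the remaining cores sit idle, which is a huge waste. This is also caused by operator error.
3) With the MEMORY_AND_DISK_SER storage level, RDD data is fetched record by record through an iterator. Unrolling (see spark.storage.unrollFraction) reads the data while checking whether enough memory remains, and moves on to the next record if it does. Unroll space comes out of Storage memory. If unrolling fails, the data goes straight to disk.
4) By default, before spilling to disk a Task keeps part of the data in memory. If it cannot get that memory, it does not run and waits indefinitely, consuming CPU and memory.
On this basis Spark introduced UnifiedMemoryManager, which no longer separates ExecutionMemory and StorageMemory. In practice they are still separate, but Execution memory can use Storage memory and Storage memory can use Execution memory: when one side runs short, it borrows from the other.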
 
 
---------------------------------------------------------------------------------------------------------------------
[Spark Interview 2000 Questions, 101-130] Spark on YARN Interview Questions, Part 04

This question set is mainly about Spark on YARN and covers interview questions on Spark on YARN, YARN, and MapReduce.

I. 30 interview questions
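1. What are the shortcomings of MRv1?
1) Scalability (the ability to cope with change)
a) The JobTracker keeps information about user jobs in memory
b) The JobTracker uses coarse-grained locks
2) Reliability and availability
a) A JobTracker failure loses all jobs running on the cluster, and users must manually resubmit and recover their workflows
3) Support for different programming models
Hadoop v1's MapReduce-centric design supports a wide range of use cases but does not suit every kind of large-scale computation, such as Storm or Spark.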
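2. Describe how Yarn executes a job.
1) The client submits the Application to the ResourceManager. The ResourceManager accepts it and, based on the cluster's resource status, picks a node on which to start the Application's task scheduler, the driver (ApplicationMaster).
2) The ResourceManager contacts that node and orders its NodeManager to start a new JVM process that runs the driver (ApplicationMaster) part of the program. On startup the driver (ApplicationMaster) first registers with the ResourceManager, declaring that it is responsible for running the current program.
3) The driver (ApplicationMaster) downloads the required jars and other resources, and decides from them exactly which resources to request from the ResourceManager.
4) After receiving the request from the driver (ApplicationMaster), the ResourceManager satisfies it as far as possible and sends the resource metadata back to the driver (ApplicationMaster).
5) After receiving the resource metadata, the driver (ApplicationMaster) sends instructions to the NodeManagers on the specific machines, telling them to start the containers.
6) Each NodeManager receives the instruction from the driver and starts the container. Once started, the container must register with the driver (ApplicationMaster).
7) The driver (ApplicationMaster) receives the container registrations and starts scheduling and computing tasks until the job is finished.
Note: if the ResourceManager cannot satisfy the resource request of the driver (ApplicationMaster) the first time and later finds idle resources, it proactively sends the metadata of the available resources to the driver (ApplicationMaster), so that more resources are available to the running program.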
 
 
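3. Who is responsible for destroying containers in Yarn, and can containers be reused in Hadoop MapReduce?
Answer: The ApplicationMaster destroys them. Containers cannot be reused in Hadoop MapReduce; in a Spark on Yarn program they can.
4. How do you specify the run mode of a Spark Application when submitting a job?
1) cluster mode: ./spark-submit --class xx.xx.xx --master yarn --deploy-mode cluster xx.jar
2) client mode: ./spark-submit --class xx.xx.xx --master yarn --deploy-mode client xx.jar
5. Can a Spark program run without starting the Master and Worker services of a Spark cluster?
Answer: Yes, as long as a third-party resource manager is in charge, for example Yarn. Spark can then be used without starting the Spark cluster. What a Spark cluster starts is the Worker and the Master, which together are just a resource management framework. In Yarn the ResourceManager plays the role of the Master and the NodeManager plays the role of the Worker; the computation itself is done by Executors and need not involve the Spark cluster's Workers or Master. In the end it all comes down to running a JVM: as long as Spark is installed where that JVM runs, the program works.
6. What is port 4040 used for in Spark?
Answer: It collects and serves information about running Spark jobs.
7. In Spark on Yarn cluster mode, are the ApplicationMaster and the driver in the same process?
Answer: Yes, the driver lives inside the ApplicationMaster process. This process requests resources and also monitors the program and the state of its resources.
8. Which command shows the log of a running application?
Answer: yarn logs -applicationId <app ID>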
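9. What are the advantages of Spark on Yarn mode?
1) It shares cluster resources with other computing frameworks (for example, when Spark and MapReduce run at the same time without Yarn allocating resources, MapReduce gets very little memory and runs inefficiently). Resources are allocated on demand, which raises cluster utilization.
2) Compared with Spark's own Standalone mode, Yarn allocates resources at a finer granularity.
3) Application deployment is simpler. Applications from frameworks such as Spark and Storm are submitted by the client, after which Yarn handles resource management and scheduling. The Container is the unit of resource isolation, and memory, CPU and so on are consumed per Container.
4) Yarn uses queues to manage the multiple services running on the Yarn cluster at the same time. Resource usage can be adjusted to the load of each type of application, giving elastic resource management.
10. Describe your understanding of a container.
1) A Container is the basic unit of resource allocation and scheduling. It encapsulates resources such as memory, CPU, disk and network bandwidth. At present Yarn only encapsulates memory and CPU.
2) A Container is requested from the ResourceManager by the ApplicationMaster, and the resource scheduler inside the ResourceManager allocates it to the ApplicationMaster asynchronously.
3) A Container is launched by the ApplicationMaster through the NodeManager that owns the resource. When a Container is launched, the command of the task it runs must be supplied.
11. How many types of container does an Application running on Yarn have?
1) The Container that runs the ApplicationMaster. It is requested (from the internal resource scheduler) and started by the ResourceManager. When submitting the application, the user can specify the resources needed by its single ApplicationMaster.
2) The Containers that run the various tasks. These are requested from the ResourceManager by the ApplicationMaster, which then communicates with the NodeManager to start them.
12. What does the Spark on Yarn architecture look like? (You should be able to draw this diagram.)
The App Master in Yarn can be understood as the driver in Spark's Standalone mode. Executors run inside Containers, and each Executor runs Tasks in parallel on multiple threads. The flow is similar to question 2.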
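13. Which parameters specify the resources when Executors start?
1) num-executors is the number of executors
2) executor-memory is the memory used by each executor
3) executor-cores is the number of CPU cores allocated to each executor
14. Why was Yarn created, what problems does it solve, and what are its advantages?
1) Why Yarn was created: it is a resource management framework proposed to address the various defects of MRV1.
2) For the problems it solves and its advantages, see this post: http://www.aboutyun.com/forum.php?mod=viewthread&tid=6785
15. What is the execution flow of MapReduce?
Stage 1: input/map/partition/sort/spill
Stage 2: merge on the mapper side
Stage 3: merge/reduce/output on the reducer side
For the detailed flow see http://www.cnblogs.com/hipercomer/p/4516581.html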
 
16. What determines the number of maps in a job?
In general, when the input source is a file, the number of maps in a job is determined by splitSize, which in turn is determined as follows:
goalSize = totalSize / mapred.map.tasks
minSize = max {mapred.min.split.size, minSplitSize}
splitSize = max (minSize, min(goalSize, dfs.block.size))
The number of reduces in a job is determined by the partitioning.
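As a worked example of the three formulas above, here is a minimal sketch. The function and parameter names are illustrative, and the map count is a plain ceiling division, which ignores the slack the real FileInputFormat allows on the last split.

```python
# Sketch of the split-size formulas above; all names here are illustrative.

def compute_split_size(total_size, requested_maps, min_split_size, block_size,
                       format_min_split=1):
    """splitSize = max(minSize, min(goalSize, blockSize))."""
    goal_size = total_size // max(requested_maps, 1)   # totalSize / mapred.map.tasks
    min_size = max(min_split_size, format_min_split)   # max{mapred.min.split.size, minSplitSize}
    return max(min_size, min(goal_size, block_size))


def estimate_map_count(total_size, split_size):
    """Approximate number of map tasks: one per split (ceiling division)."""
    return -(-total_size // split_size)


if __name__ == "__main__":
    MB = 1024 * 1024
    split = compute_split_size(1024 * MB, requested_maps=2,
                               min_split_size=1, block_size=64 * MB)
    print(split // MB, estimate_map_count(1024 * MB, split))
```

The example shows why asking for fewer maps has no effect on its own: goalSize is capped by the block size, so the minimum split size has to be raised to get larger splits.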
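17. How large is the data output after reduce?
The interviewer does not want the exact data volume. The point is to find out whether you understand how MR executes and whether you seriously evaluated the efficiency of your program after writing it.
1) Describe the resources used for the reduce tasks. For MRV1, say how many resources were given to map and how many reduces there were. For MRV2, mention how much memory and CPU the cluster gave to Yarn for computation.
2) Answer with a real scenario: how large the input was, roughly how many records, which logical operations were performed, how many records were output, how long it ran, whether the data was skewed during reduce, and so on.
3) Also mention which optimizations you made for MapReduce and how much faster it became. Listing one or two optimizations is enough.
18. How large was the data when your project submitted its job?
Answer: 1) State the data format, whether compression was used and, if so, the approximate compression ratio. 2) State roughly how large the files were: how many maps and reduces were started, how much data the map and reduce stages read, and how long the program ran. 3) State the scale of the cluster: number of nodes, memory, CPU cores and so on. Work these points into the answer rather than just giving a number.
19. Roughly how many jobs do you submit, and how long do they take to finish?
This again checks whether you have observed your programs running and evaluated their efficiency.
20. How large is your business data? How many rows?
This also checks whether you have real experience. If you have no hands-on experience, focus your answer on how MR runs, on MR efficiency, and on how to optimize MR programs (study someone else's optimization demo and test it on a virtual machine).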
22. How do you kill a running job?
To kill a job:
MRV1: hadoop job -kill <jobId>
YARN: yarn application -kill <applicationId>
23. List the schedulers you know and explain how they work.
a) FIFO scheduler: the default scheduler, first in first out
b) Capacity scheduler: picks jobs that use little memory and have high priority
c) Fair scheduler: all jobs get an equal share of resources
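24. In yarn-client mode, running Spark SQL reports Exception in thread "Thread-2" java.lang.OutOfMemoryError: PermGen space, but it runs normally in yarn-cluster mode. What could be the cause?
1) Cause: the query calls into Hive to fetch metadata and parse SQL, and uses Cglib and similar libraries for serialization and deserialization. This can generate many class files, so the JVM's permanent generation is heavily used.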
The default permanent generation size is 64M in cluster mode and 32M in client mode, while the driver may use up to 90M of permanent generation during SQL processing, which leads to an OOM and task failure.
The error appeared in yarn-cluster mode while yarn-client mode ran normally, because the $SPARK_HOME/bin/spark-class file already sets the permanent generation size:
JAVA_OPTS="-XX:MaxPermSize=256m $OUR_JAVA_OPTS"
2) Solution: in spark-defaults.conf under Spark's conf directory, add JVM settings for the Driver, because the Driver is what parses SQL and fetches metadata. The configuration is:
spark.driver.extraJavaOptions -XX:PermSize=128M -XX:MaxPermSize=256M
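25. What does the spark.driver.extraJavaOptions parameter mean, and what value do you use in production?
A string of extra JVM options passed to the driver, for example GC settings or other logging options (spark.executor.extraJavaOptions is the executor-side counterpart). Note that it is illegal to set Spark properties or the heap size through this option. Spark properties should be set with a SparkConf object or in the spark-defaults.conf file used by the spark-submit script. Heap size is set with spark.driver.memory for the driver and spark.executor.memory for executors.
26. What causes an Executor to run a Full GC, and what problems can that cause?
Answer: It can make the Executor hang. Shuffling massive amounts of data and data skew can both trigger a Full GC. Take shuffle as an example: with a large number of shuffle writes, the JVM's young generation collects garbage constantly. When Eden is full, objects are written to the Survivor space, and objects above a certain size go straight to the old generation. When the young generation fills up, older objects are also promoted to the old generation. If the old generation runs out of space, a Full GC is triggered; if there is still not enough space, an OOM error follows. At that point threads are blocked and the whole Executor process stalls.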
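27. What do the Combiner and the partitioner do?
Combining happens on both the map side and the reduce side. It merges the key-value pairs that share a key, and it can be customized. The combine function merges the <key,value> pairs produced by one map function (several pairs per key) into a new <key2,value2>, which becomes the input to the reduce function. This value2 can also be called values, because there are several. The purpose of the merge is to reduce network traffic.
Partitioning splits the result of each map node and maps it to different reduces by key; it can also be customized. You can think of it as classification of messy data. In a zoo the cows, sheep, chickens, ducks and geese are all mixed together during the day, but at night the cows go back to the cowshed, the sheep to the sheepfold and the chickens to the coop. Partitioning sorts the data in the same way. When writing a program, MapReduce does this for us with HashPartitioner, and we can supply our own partitioner.
Shuffle is the process between map and reduce, and it includes the combine and partition steps on both sides. The map results are distributed to the Reducers by the partitioner; after the Reducer finishes the reduce operation, it writes the output through OutputFormat. The main function of the shuffle stage is fetchOutputs(), which copies the output of the map stage to the local disk of the reduce node.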
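The interplay of combiner and partitioner described above can be sketched in a few lines. This is a toy in-memory model, not Hadoop code: the function names are made up, and `zlib.crc32` stands in for the key's hashCode (Hadoop's HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`).

```python
# Toy model of map-side combine, hash partitioning and reduce.
import zlib
from collections import defaultdict


def combine(pairs):
    """Map-side combiner: pre-aggregate the values of each key locally."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def partition_for(key, num_reducers):
    """Deterministic hash partitioner: the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every combined (key, value) to the bucket of its reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for combined in map_outputs:
        for key, value in combined.items():
            buckets[partition_for(key, num_reducers)][key].append(value)
    return buckets


def reduce_all(buckets):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for bucket in buckets for key, values in bucket.items()}


if __name__ == "__main__":
    map1 = combine([("cow", 1), ("sheep", 1), ("cow", 1)])
    map2 = combine([("cow", 1), ("hen", 1)])
    print(reduce_all(shuffle([map1, map2], num_reducers=2)))
```

Because `map1` is combined before the shuffle, only one `("cow", 2)` record crosses the network instead of two `("cow", 1)` records, which is the traffic saving the answer refers to.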
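28. When running a Spark job you see java.lang.OutOfMemoryError: GC overhead limit exceeded and java.lang.OutOfMemoryError: java heap space. What are the causes and solutions?
Answer: Cause: too many resources were loaded into memory, the local performance is poor, and GC takes up a lot of time.
Solutions:
1) Add the parameter -XX:-UseGCOverheadLimit to turn this check off, and increase the heap size at the same time, for example -Xmx1024m
2) Increase the following two parameters
export SPARK_EXECUTOR_MEMORY=6000M
export SPARK_DRIVER_MEMORY=7000M
See also: http://www.cnblogs.com/hucn/p/3572384.html
29. List the languages you have used to develop map/reduce programs in your previous work.
Answer: Java, Scala, Python, shell
30. What effect does a misconfigured /etc/hosts have on the cluster?
Answer: 1) Directly, host names cannot be resolved, so the master cannot communicate with the worker nodes and the worker nodes cannot communicate with each other. 2) Indirectly, the services on the misconfigured nodes behave abnormally or fail to start, jobs fail, and so on.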
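Spark Core Interview Questions, Part 05
Original, 2017-06-12, Meifenggu (Big Data Meifenggu)
The RDD is the foundation of Spark programming, and mastering RDDs and RDD programming techniques is essential for real-world development. This part collects common RDD questions into a problem set to deepen the understanding of RDDs and RDD programming. The questions are listed first so that interested readers can work through them; the answers will be published through a network drive in the next post.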
 
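1. What is the difference between the private and private[this] modifiers in Scala?
2. How do inner classes in Scala differ from inner classes in Java?
3. What are the characteristics of Spark's standalone mode, and what are its advantages and disadvantages?
4. What are the basic principle, advantages and disadvantages of FIFO scheduling?
5. What are the advantages and disadvantages of FAIR scheduling?
6. What are the advantages and disadvantages of CAPACITY scheduling?
7. List the serialization methods you know and discuss the benefits of serialization.
8. What are the common data compression methods, which one does your production cluster use, and how much efficiency did it gain?
9. Briefly describe how Spark writes data.
10. What is the basic principle of Lineage in Spark?
11. Implement WordCount with shell and with Scala code.
12. List the CPU-intensive scenarios you have encountered. Which optimizations did you make?
13. What is the difference between a Spark RDD and MR2?
14. Spark reads a file on HDFS and counts its lines. Can you describe the process? Is the count computed in memory or on disk?
15. Is Spark faster than MapReduce? Why, and where does the speed come from?
16. Why is Spark SQL faster than Hive?
17. What does the data structure of an RDD look like?
18. Inside an RDD operator you modify an external map, for example by putting data into it, and then iterate over the map outside the operator. What problems can occur?
19. What do you know about the Hadoop ecosystem?
20. How do you tune the JVM? Describe your Spark JVM tuning experience.
21. What is the structure of the JVM? How many regions does the heap have?
22. How do you clean data with Spark?
23. How does Spark integrate with Hive?
24. How many partitions does Spark use when reading data?
25. At what size does an HBase region split, and how does Spark divide partitions when reading HBase data?
26. Draw Spark's working modes and its deployment architecture.
27. Draw and explain the Spark workflow and how it maps to the roles in the cluster.
28. Which thread pools does Java provide out of the box?
29. Draw and explain the shuffle process. How do you avoid these performance problems when programming?
30. How does the BlockManager manage disk and memory?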
---------------------------------------------------------------------------------------------------------------------
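[Spark Interview 2000 Questions, 161-190] Spark Core Interview Questions, Part 06
Original, 2017-06-13, Meifenggu (Big Data Meifenggu)

This is the sixth installment of the "Spark Interview 2000 Questions" series. To get the reference answers for the previous installment, send the message "Part 5 answers" to the official account.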
 
 
 
 
 
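1. How does Kafka collect data?
2. Explain the underlying format of the columnar Parquet file.
3. What are Dataset and DataFrame?
4. What are the features and uses of a trait in Scala?
5. What is the difference between Redis and Memcached?
6. List the common ports in Spark and what each is used for.
7. How does the Spark master achieve HA through Zookeeper?
8. Which modules of the Spark website do you use most?
9. Which causes of data skew have you seen, and how did you solve them?
10. Briefly describe wide and narrow dependencies and their characteristics.
11. How does Yarn work?
12. How does the BlockManager manage disk and memory?
13. Which operators involve a shuffle?
14. Have you read the source code? Which parts are you familiar with?
15. What is the numerical relationship between NodeManagers and ResourceManagers in a cluster?
16. How does Spark process structured data, and how does it process unstructured data?
17. What are the main techniques for Spark performance optimization?
18. Briefly describe the steps to set up a distributed Spark cluster.
19. What do you see as Spark's strengths and weaknesses in the current big data landscape?
20. Have you researched or designed algorithms on your own?
21. Briefly describe the data mining algorithms you know.
22. When does a join not cause a shuffle?
23. Describe the Spark shuffle process in detail. How many shuffle implementations do you know?
24. How do you prevent out-of-memory errors in Spark?
25. Describe the ways to implement a join in Hadoop.
26. What are the two ways to convert an RDD into a DataFrame?
27. List the in-memory systems you are familiar with and their pros and cons.
28. Which ways are there to make the Spark Master highly available?
29. What are the characteristics of functional programming?
30. What are the shortcomings of sort-based shuffle?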
 
---------------------------------------------------------------------------------------------------------------------
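Interview | Big Data Questions, Interview Part 07
------------------------------------------
The interview series resumes. The questions below were collected from the web. They are all good questions, and the answers are fine as a reference; treat them as study material only.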
 
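1. Briefly describe Hadoop's map-reduce programming model.
First the map task reads data from the local file system and converts it into a collection of key-value pairs.
Hadoop's built-in data types are used, such as LongWritable and Text.
The key-value pairs are fed to the mapper for business processing, which converts them into the required key-value pairs and outputs them.
Next comes the partition step. HashPartitioner is used by default, and the partitioning rule can be customized by overriding its getPartition method.
Then the keys are sorted, and a grouping step merges the values with the same key into one group for output. Custom data types can be used here: override the compare method of WritableComparator to customize the sort order, and the compare method of RawComparator to customize the grouping.
After that comes the combiner step, which is a local reduce pre-processing that lowers the workload of the subsequent shuffle and the reducer.
The reduce task collects the data over the network and reduces it, finally saving or displaying the result, which ends the job.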
 
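2. What does Hadoop's TextInputFormat do, and how do you implement a custom one?
An InputFormat pre-processes the data in two ways before the map operation:
1) getSplits returns an array of InputSplit, splitting the data so that each split is handed to one map operation.
2) getRecordReader returns a RecordReader object, which converts each split into key-value pairs and passes them to map.
The commonly used InputFormat is TextInputFormat. It uses LineRecordReader to convert each split into key-value pairs, with the line offset as the key and the line content as the value.
A custom class extends InputFormat and overrides the createRecordReader and isSplitable methods.
A custom delimiter can be defined in createRecordReader.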
 
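3. Hadoop and Spark are both parallel computing frameworks. What do they have in common and how do they differ?
Both use the MR model for parallel computation. A unit of work in Hadoop is called a job. A job is divided into map tasks and reduce tasks, each task runs in its own process, and the process ends when the task ends.
A unit of work submitted by a Spark user is called an application. One application corresponds to one SparkContext. An application contains multiple jobs, and each action triggers one job.
These jobs can run in parallel or serially. Each job has multiple stages; stages are produced when the DAGScheduler splits the job at shuffle boundaries according to the dependencies between RDDs. Each stage contains multiple tasks, which form a task set that the TaskScheduler distributes to the executors. An executor lives as long as the application, even when no job is running, so tasks can start quickly and compute on data in memory.
A Hadoop job has only the map and reduce operations, so its expressiveness is limited. The MR process also reads and writes HDFS repeatedly, causing a lot of IO, and the relationships between multiple jobs have to be managed by the user.
Spark's iterative computation happens in memory. The API offers many RDD operations such as join and groupBy, and the DAG makes good fault tolerance possible.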
 
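4. Why use Flume to import data into HDFS, and what is the architecture of HDFS?
Flume can import data into HDFS in real time. A file is rolled when the data on HDFS reaches a specified size or when a specified time has passed.
Files are stored on the DataNodes. The NameNode records the metadata of the DataNodes, and the NameNode's metadata is kept in memory, so a very large number of small files can bring it to a halt.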
 
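5. What problems are common when running map-reduce programs?
For example, most of the job has finished but a few reduces keep running.
This happens because those reduces process far more data than the others, probably because the key-value pairs were divided unevenly, which is data skew.
One fix is to redefine the partitioning rule so that keys with very many values are split up or spread evenly. Another is to pre-process the data in the map-side combiner.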
 
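6. Briefly describe the shuffle process in Hadoop and in Spark.
Hadoop: the map side stores the partitioned data, which is collected over the network by the reduce side.
Spark: the shuffle arises when the DAGScheduler divides stages. The TaskScheduler distributes the stages to the executors on the workers. Reducing shuffles improves performance.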
 
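7. What is stored in Hive?
Tables (data plus metadata). What Hive stores is the mapping to HDFS. Hive is a logical data warehouse; what is actually operated on is files on HDFS, and HQL is an MR program written in SQL syntax.

8. What is the relationship between Hive and a relational database?
There is none. Hive is a data warehouse and cannot perform real-time CRUD operations the way a database does.
It is a write-once, read-many system and can be regarded as an ETL tool.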
 
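9. How does Flume work?
The core concept is the agent, which contains three components: source, channel and sink.
The source runs on the log collection node and gathers logs, which are then stored temporarily in the channel. The sink sends the data in the channel to the destination.
Data in the channel is deleted only after it has been sent successfully.
To use it, first write the Flume configuration file defining the agent, source, channel and sink, wire them together, and then run the flume-ng command.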
 
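10. How does Sqoop work?
It is a data transfer tool in the Hadoop ecosystem.
It can import data from a relational database into HDFS, Hive or HBase, and it can export data from HDFS to a relational database or to text files.
It runs its work as MR programs and talks to the relational database through JDBC.
Import: the data is split by the specified delimiter and the splits are passed to the maps. Each map task writes its rows; there is no reduce.
Export: a Java class is generated from the name of the target table. Its metadata and the delimiter are read to match the unstructured data, and multiple map jobs write to the relational database at the same time.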
 
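11. What are the row key and column family in HBase, what is the physical model, and what are the table design principles?
Row key: built into every HBase table; each row key corresponds to one row of data.
Column family: specified when the table is created; it is a set of columns. Each column family is stored as a separate file. The stored data is all byte arrays, and a cell can hold many versions, distinguished by timestamp.
Physical model: an HBase table is split into multiple regions. Each region records the start of its row key range and is stored on a different node, so a query runs in parallel across nodes. When there are many regions, the .META table stores the start point of each region, and -ROOT in turn stores the start points of .META.
Row key design principles: keep the data balanced across column families; follow the length principle and the adjacency principle; when creating the table, set it to be cached on the region server; avoid auto-incrementing keys and timestamps; use byte arrays instead of strings; the maximum length is 64 KB, and it is best kept within 16 bytes; split tables by day, using two bytes for the hash and four bytes for the hour, minute and millisecond.
Column family design principles: use as few as possible (data is stored by column family and read by region, so extra families cause unnecessary IO); put frequently and infrequently used data into different column families; keep column family names as short as possible.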
 
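12. What is the difference between Spark Streaming and Storm?
One is real-time with millisecond latency, the other is near-real-time with sub-second latency, but Storm's throughput is lower.

13. Which algorithms does MLlib support?
They fall into four broad categories: classification, clustering, regression and collaborative filtering.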
 
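14. Briefly describe Hadoop's map-reduce programming model.
First the map task reads data from the local file system and converts it into a collection of key-value pairs.
The key-value pairs are fed to the mapper for business processing, which converts them into the required key-value pairs and outputs them.
Next comes the partition step. HashPartitioner is used by default, and the partitioning rule can be customized by overriding its getPartition method.
Then the keys are sorted, and a grouping step merges the values with the same key into one group for output.
Custom data types can be used here: override the compare method of WritableComparator to customize the sort order, and the compare method of RawComparator to customize the grouping.
After that comes the combiner step, which is a local reduce pre-processing that lowers the workload of the subsequent shuffle and the reducer.
The reduce task collects the data over the network and reduces it, finally saving or displaying the result, which ends the job.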
 
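15. How do you configure a Hadoop platform cluster and set its environment variables?
zookeeper: edit zoo.cfg and configure dataDir and the server address and port of each zk node. The heartbeat interval tickTime defaults to 2000 ms, and the other timeouts are integer multiples of it. Then write a myid file in the dataDir directory that matches the server entry in zoo.cfg.
hadoop: edit
hadoop-env.sh to configure the Java environment variable
core-site.xml to configure the zk address, temporary directory and so on
hdfs-site.xml to configure the NameNode information, RPC and HTTP addresses, automatic NameNode failover, zk connection timeout and so on
yarn-site.xml to configure the ResourceManager address
mapred-site.xml to configure the use of Yarn
slaves to configure the node list
Then format the NameNode and zk.
hbase: edit
hbase-env.sh to configure the Java environment variable and whether to use the bundled zk
hbase-site.xml to configure the data path on HDFS, the zk address and communication timeout, and the master node
regionservers to configure the region nodes
and copy zoo.cfg into the conf directory
spark:
install Scala
edit spark-env.sh to configure environment variables and the master and worker node settings
Setting environment variables: put the installation paths directly in /etc/profile, or in the .bashrc file in the current user's home directory. With .bashrc there is no need to run source; opening a new shell window is enough. Settings in .bash_profile apply only to the current user.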
 
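16. How do you tune Hadoop performance?
Tuning can be done through system configuration, program design and the job scheduling algorithm.
The HDFS block size can be raised to 128 or 256 MB (when the network is good; the default is 64 MB).
The big items: mapred.map.tasks and mapred.reduce.tasks set the number of MR tasks (both default to 1).
mapred.tasktracker.map.tasks.maximum is the maximum number of map tasks per machine.
mapred.tasktracker.reduce.tasks.maximum is the maximum number of reduce tasks per machine.
mapred.reduce.slowstart.completed.maps sets the percentage of completed map tasks at which reduce tasks start.
These parameters should be set according to the actual nodes. A reduce task finishes its copy phase at 33% of its progress, and the map tasks should be done before that (maps may finish earlier).
mapred.compress.map.output and mapred.output.compress configure compression, which spends CPU to improve network and disk IO.
Use the combiner sensibly.
Reuse Writable objects.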
 
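17. How does Hadoop handle high concurrency?
First the cluster must be highly reliable so that it does not go down under high concurrency. If it cannot cope, scale out horizontally.
If a DataNode goes down, restart it with the Hadoop scripts.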
 
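18. What does Hadoop's TextInputFormat do, and how do you implement a custom one?
An InputFormat pre-processes the data in two ways before the map operation.
1) getSplits returns an array of InputSplit, splitting the data so that each split is handed to one map operation.
2) getRecordReader returns a RecordReader object, which converts each split into key-value pairs and passes them to map.
The commonly used InputFormat is TextInputFormat. It uses LineRecordReader to convert each split into key-value pairs, with the line offset as the key and the line content as the value.
A custom class extends InputFormat and overrides the createRecordReader and isSplitable methods.
A custom delimiter can be defined in createRecordReader.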
 
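19. Hadoop and Spark are both parallel computing frameworks. What do they have in common and how do they differ?
Both use the MR model for parallel computation. A unit of work in Hadoop is called a job. A job is divided into map tasks and reduce tasks, each task runs in its own process, and the process ends when the task ends.
A unit of work submitted by a Spark user is called an application. One application corresponds to one SparkContext. An application contains multiple jobs, and each action triggers one job.
These jobs can run in parallel or serially. Each job has multiple stages; stages are produced when the DAGScheduler splits the job at shuffle boundaries according to the dependencies between RDDs. Each stage contains multiple tasks, which form a task set that the TaskScheduler distributes to the executors. An executor lives as long as the application, even when no job is running, so tasks can start quickly and compute on data in memory.
A Hadoop job has only the map and reduce operations, so its expressiveness is limited. The MR process also reads and writes HDFS repeatedly, causing a lot of IO, and the relationships between multiple jobs have to be managed by the user.
Spark's iterative computation happens in memory. The API offers many RDD operations such as join and groupBy, and the DAG makes good fault tolerance possible.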
 
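20. Why use Flume to import data into HDFS, and what is the architecture of HDFS?
Flume can import data into HDFS in real time. A file is rolled when the data on HDFS reaches a specified size or when a specified time has passed.
Files are stored on the DataNodes. The NameNode records the metadata of the DataNodes, and the NameNode's metadata is kept in memory, so a very large number of small files can bring it to a halt.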
 
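21. What problems are common when running map-reduce programs?
For example, most of the job has finished but a few reduces keep running.
This happens because those reduces process far more data than the others, probably because the key-value pairs were divided unevenly, which is data skew.
One fix is to redefine the partitioning rule so that keys with very many values are split up or spread evenly. Another is to pre-process the data in the map-side combiner.
16. Briefly describe the shuffle process in Hadoop and in Spark.
Hadoop: the map side stores the partitioned data, which is collected over the network by the reduce side.
Spark: the shuffle arises when the DAGScheduler divides stages. The TaskScheduler distributes the stages to the executors on the workers.
Reducing shuffles improves performance.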
 
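22. How does the RDD mechanism work?
An RDD is a resilient distributed dataset. It can be understood simply as a data structure, and it is the common currency of the Spark framework.
All operators run on RDDs. Different scenarios have different RDD implementation classes, but they can all be converted into one another.
Executing RDDs builds a DAG and then a lineage, which guarantees fault tolerance.
From a physical point of view, an RDD stores the mapping between blocks and nodes.
18. What components does Spark have?
(1) master: manages the cluster and its nodes, and does not take part in computation.
(2) worker: a compute node. The worker process itself does not take part in computation; it reports to the master.
(3) Driver: runs the program's main method and creates the SparkContext object.
(4) SparkContext: controls the life cycle of the whole application and includes components such as the DAGScheduler and the TaskScheduler.
(5) client: the entry point through which the user submits the program.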
 
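23. How does Spark work?
After the user submits a job on the client, the Driver runs the main method and creates the SparkContext.
Executing RDD operators builds a DAG that is handed to the DAGScheduler, which divides it into stages according to the dependencies between RDDs and passes them to the TaskScheduler.
The TaskScheduler turns each stage into a task set and distributes it to the executors on the nodes for execution.

24. How do you optimize Spark?
Settings are made through the spark-env file, through SparkConf in the program, and through system properties.
(1) When the computation is heavy and the lineage grows too long, add a checkpoint to RDDs that are already cached, to reduce the cost of fault recovery.
(2) Merge small partitions. Partitions that are too small cause too much task-switching overhead; use repartition (or coalesce, which avoids a full shuffle when only reducing the number of partitions).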
 
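25. How does Kafka work?
Producers send events to brokers, and consumers consume events from brokers.
Events are separated by topic, and every consumer belongs to a group.
Consumers in the same group do not consume the same event twice, while each event is delivered to the consumers of every different group.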
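The group semantics above can be illustrated with a small in-memory model. It is a sketch only: the round-robin assignment and the function names are assumptions for illustration, not Kafka's actual assignor or client API.

```python
# Toy model of consumer groups: inside a group each partition is owned by
# exactly one consumer, while every group sees every event.

def assign_partitions(partitions, consumers):
    """Round-robin assignment (an assumed strategy, for illustration only)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


def deliver(topic, groups):
    """topic: {partition: [events]}; groups: {group: [consumers]}."""
    delivered = {}
    for group, consumers in groups.items():
        assignment = assign_partitions(sorted(topic), consumers)
        delivered[group] = {
            consumer: [event for p in owned for event in topic[p]]
            for consumer, owned in assignment.items()
        }
    return delivered


if __name__ == "__main__":
    topic = {0: ["a", "b"], 1: ["c"], 2: ["d"]}
    print(deliver(topic, {"g1": ["c1", "c2"], "g2": ["c3"]}))
```

In the example, group g1 splits the four events between its two consumers with no overlap, while the single consumer of g2 receives all four.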
 
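26. How does the ALS algorithm work?
Answer: For user-product-rating data, ALS builds a sparse rating matrix, and its goal is to fill in this sparse matrix according to certain rules.
ALS factorizes the sparse matrix into a user-feature matrix and a product-feature matrix. A user's rating of a product is obtained by multiplying these two matrices.
It fixes one of the unknown factor matrices and solves for the other, then alternates, applying least squares repeatedly until the sum of squared errors is minimal. The result is the desired pair of matrices.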
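The alternation described above can be sketched for the simplest case, a single latent feature per user and per product. This is a toy illustration in plain Python, not MLlib's implementation; the names, the small regularization term `reg` and the fixed iteration count are assumptions.

```python
# Rank-1 alternating least squares on the observed ratings only.

def als_rank1(ratings, num_users, num_items, iterations=20, reg=0.01):
    """ratings: {(user, item): rating}. Returns (user_factors, item_factors)."""
    users = [1.0] * num_users
    items = [1.0] * num_items
    for _ in range(iterations):
        # Fix the item factors; each user factor has a closed-form least-squares solution.
        for u in range(num_users):
            num = sum(r * items[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(items[i] ** 2 for (uu, i) in ratings if uu == u) + reg
            users[u] = num / den
        # Fix the user factors and solve for each item factor the same way.
        for i in range(num_items):
            num = sum(r * users[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(users[u] ** 2 for (u, ii) in ratings if ii == i) + reg
            items[i] = num / den
    return users, items


def squared_error(ratings, users, items):
    """Sum of squared differences between observed and predicted ratings."""
    return sum((r - users[u] * items[i]) ** 2 for (u, i), r in ratings.items())


if __name__ == "__main__":
    observed = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 4, (2, 0): 3}
    u, v = als_rank1(observed, num_users=3, num_items=2)
    print(squared_error(observed, u, v), u[2] * v[1])  # fit error and the filled-in rating
```

The rating of user 2 for item 1 is missing from the input; the product `u[2] * v[1]` is the value ALS fills in for it.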
 
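27. How does the k-means algorithm work?
Initialize the center points randomly, assign each point to its nearest center, and compute the mean of each cluster to get the new center points.
Recompute the distance from each point to the centers and reassign the points, then compute the means again to get new centers. Repeat until the mean of each cluster no longer changes.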
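A compact sketch of that loop follows. To keep the result reproducible the initial centers are passed in instead of being drawn at random, which is a simplification of the description above; the function name is illustrative and `math.dist` requires Python 3.8 or later.

```python
# Minimal k-means: assign points to the nearest center, recompute the means,
# stop when the centers no longer move.
import math


def kmeans(points, centers, max_iterations=100):
    for _ in range(max_iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda k: math.dist(point, centers[k]))
            clusters[nearest].append(point)
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centers[k]          # keep the old center if a cluster is empty
            for k, cluster in enumerate(clusters)
        ]
        if new_centers == centers:              # means unchanged: converged
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, centers=[(0, 0), (10, 10)]))
```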
 
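28. How does the canopy algorithm work?
The data is divided using two thresholds, with a randomly chosen data point as the canopy center.
Compute the distance from every other point to the center and place the points according to t1 and t2. Points that fall within t2 are removed from the dataset; the other points within t1 stay in the dataset for later rounds. Repeat until the dataset is empty.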
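The procedure can be sketched as below, under the usual assumption that t1 is the loose threshold and t2 the tight one (t1 > t2). For reproducibility the sketch takes the first remaining point as the center rather than a random one; the function name and the distance callback are illustrative.

```python
# Canopy clustering sketch: loose threshold t1 decides membership,
# tight threshold t2 decides removal from the remaining dataset.

def canopy(points, t1, t2, distance):
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)       # the text picks a random point; first point used here
        members = [center]
        still_remaining = []
        for point in remaining:
            d = distance(center, point)
            if d < t1:
                members.append(point)   # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(point)  # not tightly bound, stays for later rounds
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


if __name__ == "__main__":
    result = canopy([0, 1, 2, 5, 6, 10], t1=3, t2=1.5,
                    distance=lambda a, b: abs(a - b))
    print(result)
```

A point such as 2 in the example lies within t1 of the first center but outside t2, so it joins the first canopy and later also becomes the center of its own canopy. Canopies may overlap by design.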
 
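29. How does the naive Bayes classification algorithm work?
Given the data to classify and the set of classes, it uses the probability of each feature attribute of the data occurring in each class to decide which class the data belongs to.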
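A minimal categorical version of this idea is sketched below. The add-one (Laplace) smoothing and the log-probabilities are standard additions that the answer above does not mention, and the function names and the tiny weather dataset are made up for illustration.

```python
# Categorical naive Bayes with add-one smoothing.
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (features_tuple, label)."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)   # (label, position) -> value counts
    values = defaultdict(set)               # position -> distinct values seen
    for features, label in samples:
        for position, value in enumerate(features):
            feature_counts[(label, position)][value] += 1
            values[position].add(value)
    return class_counts, feature_counts, values


def predict(model, features):
    """Return the label with the highest log posterior."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)                      # class prior
        for position, value in enumerate(features):
            seen = feature_counts[(label, position)][value]
            score += math.log((seen + 1) / (count + len(values[position])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
            (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes"),
            (("overcast", "hot"), "yes")]
    print(predict(train(data), ("rainy", "cool")))
```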
 
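30. How does the association rule mining algorithm Apriori work?
Every subset of a frequent itemset is also a frequent itemset. From the data, build the support count of each product and filter out the items whose support is below the preset threshold. Combine the remaining items into larger candidate sets, recompute the support counts and filter again. Repeat until no more candidates can be formed. The result is the frequent itemsets and their support counts.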
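The loop above can be sketched as follows. Candidate generation here joins frequent (k-1)-itemsets and prunes any candidate with an infrequent subset, which is the standard Apriori step rather than a literal full permutation; the function name and the sample baskets are illustrative.

```python
# Apriori sketch: count support, filter, build larger candidates, repeat.
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        kept = {c: n for c, n in ((c, support(c)) for c in current)
                if n >= min_support}
        frequent.update(kept)
        k += 1
        # Join step: unions of two frequent sets that have exactly k items.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in kept for s in combinations(c, k - 1))}
    return frequent


if __name__ == "__main__":
    baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
               {"bread", "eggs"}, {"milk", "eggs"}]
    for itemset, count in apriori(baskets, min_support=2).items():
        print(sorted(itemset), count)
```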
 
 
--------------------------------------------------------------------------------------------------------------------
If you were handed questions like these in an interview, you might well curse at how nasty they are, yet anyone who writes code regularly will find them familiar. The questions and their answers follow, for reference and study only.
 
1. What causes "Operation category READ is not supported in state standby"? The full error is org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
Answer: Log in to the Hadoop management UI and check whether the node you are running against is in standby state,
for example at http://xx.xx.xx.xx:50070/dfshealth.html#tab-overview
If it is, Spark computations cannot be run against that machine, because it is the standby machine.
 
2. What is the downside of not setting spark.deploy.recoveryMode to ZOOKEEPER?
If spark.deploy.recoveryMode is not set, all running state of the cluster is lost when the Master restarts. See the implementation of BlackHolePersistenceEngine.

3. How do you configure multiple Masters?
Because several Masters are involved, application submission changes slightly: the application needs to know the IP address and port of the current Master. This HA scheme handles that simply. Point the SparkContext at a list of Masters, such as spark://host1:port1,host2:port2,host3:port3, and the application polls the list.
 
4. No Space Left on the device (too many shuffle temporary files)
Spark stores intermediate results in the /tmp directory during computation, and current Linux systems support tmpfs, which mounts /tmp in memory.
The problem is that too many intermediate results fill up /tmp and produce the following error:
No Space Left on the device
Solutions
Option 1: edit the configuration file spark-env.sh and point the temporary files to a custom directory
export SPARK_LOCAL_DIRS=/home/utoken/datadir/spark/tmp
Option 2: the lazy way, do not use tmpfs for /tmp by editing /etc/fstab directly
 
5. java.lang.OutOfMemory, unable to create new native thread
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
The error above essentially means that the Linux operating system cannot create more processes; it does not mean that the system is out of memory. To solve it, allow Linux to create more processes by raising the maximum number of processes.
$ ulimit -a
Temporarily change the maximum number of processes allowed
$ ulimit -u 65535
Temporarily change the number of file handles allowed
$ ulimit -n 65535
Permanently change the Linux maximum number of processes
$ vim /etc/security/limits.d/90-nproc.conf
* soft nproc 60000
root soft nproc unlimited
Permanently change the maximum number of file handles a user can open. The default is 1024, which is usually not enough; the common symptom is a "cannot open file" error
$ vim /etc/security/limits.conf
bdata soft nofile 65536
bdata hard nofile 65536
 
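6. The work directory on Worker nodes takes up a lot of disk space
Directory: /home/utoken/software/spark-1.3.0-bin-hadoop2.4/work
These are files uploaded to the worker by the Driver. They need to be cleaned up manually on a regular basis, otherwise they take up a lot of disk space.

7. How do you resolve library dependencies when submitting a Spark Application from spark-shell?
With spark-shell, use the --driver-class-path option to specify the jar files the application depends on. Note that if --driver-class-path is followed by several jar files, they are separated by a colon (:).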
 
8. When deploying an application, Spark cannot connect to the master, as shown below
15/11/19 11:35:50 INFO AppClient$ClientEndpoint: Connecting to master spark://s1:7077…
15/11/19 11:35:50 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:7077] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
Solution
Check that the clocks on all machines agree, that hosts mappings are configured everywhere, that the client and server use the same Scala version, and that the Scala version is compatible with Spark.
For compatibility details, refer to the official website.
 
9. When developing a Spark application (combined with Flume-NG), deployment may fail with org.jboss.netty.channel.ChannelException: Failed to bind to: /192.168.10.156:18800
15/11/27 10:33:44 ERROR ReceiverSupervisorImpl: Stopped receiver with error: org.jboss.netty.channel.ChannelException: Failed to bind to: /192.168.10.156:18800
15/11/27 10:33:44 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 70)
org.jboss.netty.channel.ChannelException: Failed to bind to: /192.168.10.156:18800
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
Caused by: java.net.BindException: Cannot assign requested address
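When Spark deploys through the Master, it automatically picks one of the worker nodes to run on, so the port has to be bound on that worker. We cannot know in advance which server Spark will deploy to, hence the startup error: the Driver program was not started on the machine at 192.168.10.156:18800 but on another server, so nothing could listen on that address. We therefore need to distribute the Spark package to all worker nodes, or, after deploying, find out which machine the application landed on, change the bound IP and deploy again, which succeeds with some probability. For details see the article 《印象笔记-战5渣系列——Spark Streaming启动问题 - 推酷》.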
 
10. Fixing spark-shell not finding the Hadoop native library (.so)
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
In Spark's conf directory, edit spark-env.sh and add the LD_LIBRARY_PATH environment variable, set to the path of Hadoop's native libraries.

11. ERROR XSDB6: Another instance of Derby may have already booted the database /home/bdata/data/metastore_db.
This error is reported when working with Hive data in Hive on Spark mode. The cause is that Hive uses the embedded Derby database as its metastore, and Derby does not support concurrent access by multiple users. The fix is to replace the Derby database with a MySQL database.
 
12. java.lang.IllegalArgumentException: java.net.UnknownHostException: dfscluster
Solution:
The HDFS cluster name dfscluster cannot be resolved. It is defined in the hdfs-site.xml file under etc/hadoop in the Hadoop installation. Copy that file into Spark's conf directory and restart.
For example, run a script to distribute it to all machines in the Spark cluster:
$ for i in 34 35 36 37 38; do scp hdfs-site.xml 192.168.10.$i:/u01/spark-1.5.1/conf/; done
 
13. Exception in thread "main" java.lang.Exception: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
Problem: the error above is reported when running in yarn cluster or client mode,
$ ./spark-sql --master yarn-client
Exception in thread "main" java.lang.Exception: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
Solution
As the message says, set the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable
export HADOOP_HOME=/u01/hadoop-2.6.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$SQOOP_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin
 
14. Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in
[Stage 0:> (0 + 4) / 42]2016-01-15 11:28:16,512 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 0 on 192.168.10.38: remote Rpc client disassociated
[Stage 0:> (0 + 4) / 42]2016-01-15 11:28:23,188 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 1 on 192.168.10.38: remote Rpc client disassociated
[Stage 0:> (0 + 4) / 42]2016-01-15 11:28:29,203 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 2 on 192.168.10.38: remote Rpc client disassociated
[Stage 0:> (0 + 4) / 42]2016-01-15 11:28:36,319 [org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 3 on 192.168.10.38: remote Rpc client disassociated
2016-01-15 11:28:36,321 [org.apache.spark.scheduler.TaskSetManager]-[ERROR] Task 3 in stage 0.0 failed 4 times; aborting job
Exception in thread “main” org.apache.spark.SparkException : Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 14, 192.168.10.38): ExecutorLostFailure (executor 3 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
Solution
The main cause here is that the source data is too large for the memory of the machines. The long-running execution times out and disconnects, and the data cannot be processed effectively, so the memory needs to be increased.
 
15. The job waits a long time without responding. The web UI on the server shows memory and cores available, but nothing is allocated, as shown below
[Stage 0:> (0 + 0) / 42]
Or the log shows:
16/01/15 14:18:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Solution
The main cause is that the memory requested through spark.executor.memory is too large and exceeds what the machines actually have, so the job cannot run and waits until enough memory is available. Reduce the executor memory to a smaller value.
 
16. Executor Lost caused by insufficient memory or data skew (submitted with spark-submit)
TaskSetManager: Lost task 1.0 in stage 6.0 (TID 100, 192.168.10.37): java.lang.OutOfMemoryError: Java heap space
16/01/15 14:29:51 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.10.37:57139 (size: 42.0 KB, free: 24.2 MB)
16/01/15 14:29:53 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.10.38:53816 (size: 42.0 KB, free: 24.2 MB)
16/01/15 14:29:55 INFO TaskSetManager: Starting task 3.0 in stage 6.0 (TID 102, 192.168.10.37, ANY, 2152 bytes)
16/01/15 14:29:55 WARN TaskSetManager: Lost task 1.0 in stage 6.0 (TID 100, 192.168.10.37): java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
…….
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 142, 192.168.10.36): ExecutorLostFailure (executor 4 lost)
……
WARN TaskSetManager: Lost task 4.1 in stage 6.0 (TID 137, 192.168.10.38): java.lang.OutOfMemoryError: GC overhead limit exceeded
Solution:
When the Spark job reads its source data, the data volume is too large, so the tasks assigned to the Worker do not have enough memory and run out of it. The memory on the Workers therefore needs to be increased to meet the program's needs.
In Spark Streaming and other Spark jobs you will meet problems that are common in Spark, typically ones related to Executor Lost (shuffle fetch failures, task failures and retries, and so on). These indicate insufficient memory or data skew. The following points should be considered when looking for a solution:
A. With the same resources, increasing the number of partitions reduces memory problems. With more partitions, each task processes less data, the tasks running at any one time process much less in total, and all Executors use less memory. This relieves both data skew and memory pressure.
B. Pay attention to the parallelism of the shuffle read stage. Functions such as reduce and group take a second parameter, the parallelism (number of partitions), which most people leave unset. Setting it once a problem shows up is worthwhile.
C. Giving one Executor too many cores means that, at any moment, the memory pressure on that Executor is higher and GC is more frequent. I usually keep this to just a few cores and raise the number of Executors to keep the total resources unchanged.
 
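17. Spark Streaming integrated with Kafka reports an error when reading messages: OffsetOutOfRangeException
Solution: when used together with the Kafka message broker, check whether the message body is larger than the default limit of 1m. If it is, raise fetch.message.max.bytes above that default.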
 
18. java.io.IOException : Could not locate executable null\bin\winutils.exe in the Hadoop binaries. (A Spark SQL on Hive job triggers a HiveContext NullPointerException.)
Solution
When developing a Hive and Spark integration on Windows without the HADOOP_HOME environment variable set, the winutils.exe tool may not be found. Hive depends on this command, so do not ignore the error; otherwise the HiveContext cannot be created and you keep getting Exception in thread "main" java.lang.RuntimeException: java.lang.NullPointerException
There are two ways to solve this:
A. Package the job as a jar and upload it to the server. The server has HADOOP_HOME configured and does not depend on winutils, so you only need to submit with spark-submit, for example:
$ spark-submit --class com.pride.hive.HiveOnSparkTest --master spark://bdata4:7077 spark-simple-1.0.jar
B. Make winutils.exe available: set the HADOOP_HOME environment variable on Windows, or set the corresponding property at the very start of the program. Note that the latest releases no longer ship winutils and similar exe commands, so you need to download it elsewhere and put it in Hadoop's bin directory. You can also point the environment variable directly at the downloaded project; the variable name must be HADOOP_HOME.
Download: https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip (a proxy may be needed to reach it)
To make it take effect for every project, set the Windows environment variable. To make it take effect only in one program, set it in the program, for example:
// Works around winutils.exe not being found on Windows
System.setProperty("hadoop.home.dir", "E:\\Software\\hadoop-common-2.2.0-bin");
 
19. The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
Solution
1. Set the user name in the program: System.setProperty("HADOOP_USER_NAME", "bdata")
2. Change the permissions of the HDFS directory
Update the permission of your /tmp/hive HDFS directory using the following command
hadoop dfs -chmod 777 /tmp/hive
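This problem is not fully solved yet. The likely cause is that the winutils workaround from the previous item is incomplete. The best option is to deploy the job to a server and run it there.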
 
20. Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=Administrator, access=WRITE, inode="/data":bdata:supergroup:drwxr-xr-x
Solution
1. Add HADOOP_USER_NAME to the system environment variables or to the JVM properties, for example by adding System.setProperty("HADOOP_USER_NAME", "bdata"); to the program. The value is the Linux user name under which Hadoop will run. When using Eclipse, restart Eclipse after the change, or it may not take effect.
2. Use hdfs dfs -chmod 777 on the affected path to change its permissions.
 
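21. Spark SQL fails with: org.apache.spark.sql.AnalysisException: unresolved operator 'Project
Solution:
This error appears occasionally when running certain SQL statements, either with Spark SQL combined with Hive or with Spark SQL alone. Check the SQL itself first. In this case the statement had too many levels of nesting, so Spark could not resolve it, and the SQL had to be rewritten or replaced with another approach. Note that the same statement may run without error in Hive and fail only in Spark.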
 
22. JAVA_HOME is not set; Exception: Java gateway process exited before sending the driver its port number
Variables set in $SPARK_HOME/conf/spark-env.sh seem to take effect only in a terminal shell environment.
Yet the variable is set on the command line:
$ echo $JAVA_HOME
/home/pipi/ENV/jdk
Solution 1: add JAVA_HOME to os.environ in the Python code
JAVA_HOME = /home/pipi/ENV/jdk
os.environ['JAVA_HOME'] = conf.get(SECTION, 'JAVA_HOME')
Solution 2: alternatively, configure JAVA_HOME in Hadoop.
 
23.ValueError: Cannot run multiple SparkContexts at once
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Python version 3.5.2 (default, Sep 10 2016 08:21:44)
SparkSession available as 'spark'.
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local) created by <module> at <frozen importlib._bootstrap>:222
The cause is this import: from pyspark.shell import sqlContext
The imported module also defines sc = spark.sparkContext, which duplicates the SparkContext defined in the user code.
 
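24. Spark prints too many warning messages
The fix turned up while debugging the log output.
When trimming Spark's output earlier (see the Spark installation and configuration notes), $SPARK_HOME/conf/log4j.properties.template had been edited to print only WARN messages. Even after changing the level to ERROR, messages were still reset to WARN on output, with one extra hint:
Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel).
That hint pointed to the solution:
Following it, add sc.setLogLevel('ERROR') to the code and the problem is solved.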
 
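25. org.apache.spark.shuffle.FetchFailedException. This usually happens during heavy shuffle operations: tasks keep failing and being re-executed in a loop, which wastes a great deal of time.
Raising executor memory usually fixes it. Increase the number of CPU cores per executor at the same time, so that task parallelism does not drop.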
 
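26. Executor & Task Lost
Because of network or GC problems, the worker or executor did not receive the heartbeat from the executor or task:
WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, aa.local): ExecutorLostFailure (executor lost)
Raise spark.network.timeout, to 300 (5 min) or higher depending on the situation.
The default is 120 (120 s). It sets the timeout for all network interactions and is used in place of the following properties when they are not set explicitly: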
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs
spark.shuffle.io.connectionTimeout
spark.rpc.askTimeout or spark.rpc.lookupTimeout
 
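27. The Master dies and restarting the standby fails too. The Master uses 512M of memory by default and dies when the cluster runs a very large number of jobs. The reason is that the Master reads the event log of every task to build the Spark UI, so it runs out of memory (OOM). This is visible in the Master's log. A Master started through HA fails for the same reason.
1) Increase the Master's memory. On the Master node, set in spark-env.sh:
export SPARK_DAEMON_MEMORY=10g # adjust to your situation
2) Reduce the job information kept in the Master's memory
spark.ui.retainedJobs 500 # both default to 1000
spark.ui.retainedStages 500
28. A worker dies or hangs. Sometimes a worker node disappears from the web UI or is shown as dead, and the tasks running on it report various lost worker errors. The cause is much the same as above: the worker keeps a large amount of UI information in memory, and during GC it loses its heartbeat with the master.
Solution
1) Increase the Worker daemon's memory. On the Worker node, set in spark-env.sh:
export SPARK_DAEMON_MEMORY=2g # adjust to your situation
2) Reduce the Driver and Executor information kept in the Worker's memory
spark.worker.ui.retainedExecutors 200 # both default to 1000
spark.worker.ui.retainedDrivers 200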
 
29. Error: ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /hadoop/application_1415632483774_448143/spark-local-20141127115224-9ca8/04/shuffle_1_1562_27
java.io.FileNotFoundException: /hadoop/application_1415632483774_448143/spark-local-20141127115224-9ca8/04/shuffle_1_1562_27 (No such file or directory)
On the surface, the shuffle has nowhere to write. If the rest of the stack trace points to local disk space, clearing the disk is enough. The case shown above, however, occurs because an executor was given too little memory. Reducing executor-cores and raising executor-memory should resolve it.
 
30. Error: ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://[email protected]:48586] -> [akka.tcp://[email protected]:41656] disassociated! Shutting down.
15/07/23 10:50:56 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
This error is rather obscure and the message does not reveal the cause, but in the end it is again a memory problem. There are two ways to fix it. The first, as described above, is to raise executor-memory and reduce executor-cores. The second is to raise executor.overhead, although that does not address the root cause. If the cluster has the resources, use the first approach.
The error also appears with partitionBy(new HashPartitioner(partition-num)) when partition-num is too large or too small. This is again a memory issue, but adding memory or overhead does not help here. The value of partition-num has to be adjusted.
---------------------------------------------------------------------------------------------------------------------
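To join the ranks of the BAT engineers mentioned above as soon as possible, keep studying and then work through practice questions. Below is the most recently compiled question bank.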
 
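1. Given two files a and b, each holding 5 billion URLs of 64 bytes each, and a memory limit of 4G, find the URLs common to a and b.
Solution 1: each file is about 5G × 64 = 320G, far above the 4G memory limit, so neither file can be loaded into memory whole. Use divide and conquer.
Scan file a, compute hash(url) % 1000 for each URL, and write the URL to one of 1000 small files (a0, a1, …, a999) according to that value. Each small file is then about 300M.
Scan file b and split it the same way into 1000 small files (b0, b1, …, b999). After this step, any URL present in both files must be in a matching pair of small files (a0 vs b0, a1 vs b1, …, a999 vs b999), and non-matching pairs cannot share a URL. It remains to find the common URLs in each of the 1000 pairs.
For each pair, load the URLs of one small file into a hash_set, then scan the URLs of the other file and check whether each is in the hash_set. If it is, it is a common URL and is written to the output file.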
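The partition-then-intersect steps above can be sketched in Python as follows. This is a minimal illustration, not the original author's code: the function names `partition` and `common_urls` are made up for this example, `zlib.crc32` stands in for the unspecified hash function, and the default bucket count is kept small so the sketch does not run into open-file limits (the text uses 1000).

```python
import os
import tempfile
import zlib


def partition(path, out_dir, prefix, buckets):
    """Split a one-URL-per-line file into `buckets` files by hash(url) % buckets."""
    outs = [open(os.path.join(out_dir, f"{prefix}{i}"), "w") for i in range(buckets)]
    try:
        with open(path) as src:
            for line in src:
                url = line.strip()
                if url:
                    # crc32 is stable across runs, unlike Python's built-in hash() for str
                    outs[zlib.crc32(url.encode()) % buckets].write(url + "\n")
    finally:
        for f in outs:
            f.close()


def common_urls(path_a, path_b, buckets=16):
    """Return the URLs present in both files, comparing only matching buckets."""
    common = set()  # for real data volumes, write matches to a file instead
    with tempfile.TemporaryDirectory() as tmp:
        partition(path_a, tmp, "a", buckets)
        partition(path_b, tmp, "b", buckets)
        for i in range(buckets):
            with open(os.path.join(tmp, f"a{i}")) as fa:
                seen = {line.strip() for line in fa}  # the hash_set for bucket i
            with open(os.path.join(tmp, f"b{i}")) as fb:
                for line in fb:
                    if line.strip() in seen:
                        common.add(line.strip())
    return common
```

Only one bucket of file a is held in memory at a time, which is the point of the split: peak memory depends on the bucket size, not on the size of the input files.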
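Solution 2: if a certain error rate is acceptable, use a Bloom filter. 4G of memory holds roughly 34 billion bits. Map the URLs of one file onto these 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and check each against the filter. If the filter reports a match, the URL is probably a common URL (note the error rate).
Bloom filters will be covered in detail in a later post on this blog. Addendum: another idea is to convert each URL to a number with some algorithm, so that matching becomes a comparison of numeric values.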
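As a rough illustration of the Bloom filter idea, here is a minimal Python sketch. The class name, the use of SHA-256, and the double-hashing scheme for deriving several bit positions are choices made for this example only; the source does not specify them. A production filter would size `num_bits` and `num_hashes` from the expected item count and the target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Bit-array set with no false negatives and a tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes positions from two 64-bit values (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))
```

To apply it to the question, add every URL of file a to the filter, then report each URL of file b for which `url in bloom` is true. Every true common URL is reported; a small fraction of the reported URLs may be false positives.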
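2. A 1G file contains one word per line. No word is longer than 16 bytes, and the memory limit is 1M. Return the 100 most frequent words.
Step 1: read the file sequentially. For each word x, compute hash(x) % 5000 and write the word to one of 5000 small files (f0, f1, ..., f4999) according to that value. Each file is then about 200k. If any file is larger than 1M, split it again the same way until no small file exceeds 1M.
Step 2: for each small file, count the words and their frequencies (with a trie, a hash_map, or similar), take the 100 most frequent words (a min-heap with 100 nodes works), and write those 100 words and their frequencies to a file. This yields another 5000 files.
Step 3: merge these 5000 files (much like merge sort).
In outline: split the big problem, solve the small problems, merge.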
 
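3. A huge volume of log data is stored in one very large file that cannot be read into memory directly. Extract the IP that visited Baidu the most times on a given day.
Step 1: from that day's log data, take the IPs that visited Baidu and write them one by one to a large file.
Step 2: an IP is 32 bits, so there are at most 2^32 distinct IPs. The same mapping approach applies: for example, take the value modulo 1000 and map the large file onto 1000 small files.
Step 3: find the most frequent IP in each small file, with its frequency (use a hash_map to count frequencies, then pick the highest).
Step 4: among these 1000 per-file winners, the IP with the highest frequency is the answer.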
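4. Compared with HAProxy, what are the drawbacks of LVS?
I had used LVS to load-balance a MySQL cluster and knew something about HAProxy, but I had never compared the two side by side. When this came up in an interview, the interviewer's answer was that LVS is quite tedious to configure. I later looked into it and learned more about both load-balancing approaches. The load-balancing performance of LVS is strong enough to reach about 60 percent of an F5 hardware load balancer, whereas HAProxy and Nginx both reach only about 10 percent of hardware load balancing. So the complex configuration does pay off visibly. While researching, I also went through the 10 scheduling algorithms of LVS. Ten sounds like a lot, but some of them differ only in small details. Of the 10, four are static scheduling algorithms and six are dynamic.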
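Static scheduling algorithms:
① RR (round-robin) scheduling
This algorithm ignores server state, so it is stateless, and it also ignores each server's capacity. With servers 1 through N and N incoming requests, the first request goes to the first server, the second to the second, and so on until the Nth request goes to the Nth server.
② Weighted round-robin
This algorithm takes server capacity into account: assign each server a weight according to its capacity and distribute requests accordingly.
③ Destination address hashing
This works on the same principle as source address hashing, and both exist to maintain a session. Destination hashing remembers the destination address of a request and sends all requests for that address to the same server. In short, every request bound for a given destination address goes to the same server, while with source address hashing every request from a given source address goes to the same server.
④ Source address hashing
Covered above.
Dynamic scheduling
① Least-connection scheduling
This algorithm tracks the number of connections established on each server. Each time a request is received, that server's connection count is incremented by 1, and a new request is assigned to the server with the fewest current connections.
② Weighted least-connection scheduling
This algorithm extends least-connection scheduling by taking server capacity into account, and for good reason. With identical servers, more connections always means more load, so plain least-connection scheduling balances the load sensibly. But what if the servers differ? Take a small server that is already at 30 percent of its connection limit and a large server that can handle 1000 connections and is at under 1 percent of its capacity, yet holds slightly more connections in absolute terms. Plain least-connection scheduling picks the small server, even though it is far more heavily loaded relative to its capacity. That is clearly unreasonable, which is why weights are added. The formula is simple: active*256/weight.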
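③ Shortest expected delay scheduling
This algorithm avoids a special case of weighted least-connection scheduling in which the scheduler treats servers as equal even with weights applied. For example:
Suppose three servers A, B, and C currently have 1, 2, and 3 connections, and their weights are also 1, 2, and 3. Weighted least-connection scheduling then gives:
  A: 1*256/1 = 256
  B: 2*256/2 = 256
  C: 3*256/3 = 256
Even with weights, A, B, and C come out the same, so the scheduler picks one of them arbitrarily and sends the request there.
Shortest expected delay changes the formula from active*256/weight to (active+1)*256/weight.
With the same example:
  A: (1+1)*256/1 = 2/1*256 = 2*256
  B: (2+1)*256/2 = 3/2*256 = 1.5*256
  C: (3+1)*256/3 = 4/3*256 ≈ 1.3*256
  C clearly has the smallest value, so the request goes to C.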
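The two formulas can be checked with a few lines of Python. This is only an arithmetic sketch of the example above; the function names and the `servers` dictionary are made up for the illustration and are not part of LVS.

```python
def wlc_overhead(active, weight):
    """Weighted least-connection: active*256/weight."""
    return active * 256 / weight


def sed_overhead(active, weight):
    """Shortest expected delay: (active+1)*256/weight."""
    return (active + 1) * 256 / weight


def pick(servers, overhead):
    """Return the name of the server with the lowest overhead."""
    return min(servers, key=lambda name: overhead(*servers[name]))


# name -> (active connections, weight), as in the example above
servers = {"A": (1, 1), "B": (2, 2), "C": (3, 3)}

wlc = {name: wlc_overhead(*v) for name, v in servers.items()}
sed = {name: sed_overhead(*v) for name, v in servers.items()}
```

Here `wlc` holds 256.0 for all three servers, so weighted least-connection cannot tell them apart, while `sed` ranks C lowest and `pick(servers, sed_overhead)` selects C.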
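④ Never-queue scheduling
  Send the request to a server whose current connection count is 0.
⑤ Locality-based least-connection scheduling
This algorithm is used for cache systems and maintains a mapping from a request to one server. The least-connection algorithms above consider only server state and capacity, but a request is not a one-off, context-free event. Asking an idle expert you have never worked with to solve a problem you have met before may work less well than asking an ordinary colleague who has handled it with you before, even if that colleague is busy. The mapping serves this purpose: when a request arrives and its mapped server is not overloaded, the request goes to that familiar server. If the mapped server does not exist, or it is overloaded while another server is at half load, a server is chosen from the rest of the cluster by least-connection scheduling and the request is assigned to it.
⑥ Locality-based least-connection with replication
This algorithm is also used for cache systems, but it maintains a mapping to a group of servers rather than to a single one. When a new request arrives, a server is chosen from the mapped group by the least-connection rule. If that server is not overloaded, it handles the request. If it is overloaded, a server from outside the group is chosen by the least-connection rule and added to the group. When the group has not been modified for some time, the busiest server is removed from it.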
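5. How does Sqoop feel in practice?
Honestly, Sqoop's data import speed really is impressive. Looking further, Sqoop1 and Sqoop2 differ noticeably in architecture, and Sqoop2 brings clear improvements in data types, security permissions, and password exposure. Compared with other heterogeneous data synchronization tools such as Taobao's DataX or Kettle, Sqoop holds up well both in import efficiency and in the range of supported plugins.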
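6. What are the roles in ZooKeeper, and how does ZooKeeper work?
Sure enough, human memory follows a forgetting curve. When the interviewer asked this, I could name only two roles (leader and follower) and had all but forgotten how it works, so I went back and studied it. ZooKeeper has roughly four roles: leader, learner (follower), observer, and client. The leader makes decisions and coordinates. Followers and observers both forward client requests to the leader. The difference is that observers do not take part in voting, and they were introduced to cope with excessive voting pressure. Clients simply issue requests. The distributed consensus used with ZooKeeper, including leader election, resembles a primitive tribe in which whoever holds the sacred relic is chief, or whoever holds the imperial seal is emperor: whoever has the smallest id is the leader. An id is generated under the corresponding node according to your configuration, each node calls getChildren() to read the ids created under the agreed parent node, and the smallest one is the leader. If that leader dies or misbehaves, the next smallest takes over. The ZooKeeper configuration file also contains entries of the form server.x=AAAA:BBBB:CCCC, where x is the node number, AAAA is the IP address of the host running that ZooKeeper instance, BBBB is the port followers use to communicate with the leader (client connections use the separate clientPort setting), and CCCC is the port used for leader election.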
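7. What is the difference between Insert and Update in HBase?
This question came out of a recent project in which the three methods for talking to HBase were insert, delete, and update. The project was an integration with another team, and after discussion we agreed not to merge update into insert. Had we merged them, an insert that overwrites would have acted as an indirect update in that project, and in essence, or at least in what is displayed, there would have been no difference, including in the underlying put call. That holds only for that project's code, though. At the HBase shell level, inserting data with the same rowkey still shows a single row, but each write has a different timestamp, and the maximum number of versions kept can be set in the configuration.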
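8. Briefly describe the ways big data results are presented.
1) Reports
Data reports derived from data mining, including data tables, matrices, charts, and custom-format reports. They are easy to use and flexible to design.
2) Graphical presentation
Curves, pie charts, stacked charts, dashboards, fishbone diagrams, and similar graphics that show the overall distribution of model data and so support decision making.
3) KPI presentation
Tabular performance summaries with customizable views, such as data tables or trend charts, so that managers can quickly assess progress against measurable targets.
4) Query presentation
Query results are summarized in data tables according to the query conditions and content. Detailed queries are supported, and drill-up, drill-down, pivot, and similar operations can be applied to the result tables.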
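9. Give examples of big data in everyday life.
i. Data generated by social applications such as QQ and Weibo
ii. Data generated by e-commerce sites such as Tmall and JD
iii. All kinds of data on the internet
10. Briefly describe how big data is managed.
Answer: data of varied types such as images, video, URLs, and geographic locations is hard to describe in a traditional structured form. It therefore calls for a column-oriented data management system built from multidimensional tables. In other words, data is sorted by row and stored by column, and data of the same field is aggregated into a column family. Different column families correspond to different attributes of the data, and attributes can be added dynamically as needed. A distributed, real-time columnar database of this kind stores and manages the data in a uniform structure and avoids the join queries of traditional storage.
11. What is big data?
Answer: big data is data that cannot be captured, managed, and processed with conventional software tools within a tolerable time.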
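12. From a huge volume of log data, extract the IP that visited Baidu the most times on a given day.
First take the IPs from that day's Baidu access logs and write them one by one to a large file. An IP is 32 bits, so there are at most 2^32 distinct IPs. The same mapping approach applies: for example, take the value modulo 1000 to map the large file onto 1000 small files, then find the most frequent IP in each small file with its frequency (use a hash_map to count, then pick the highest). Among those 1000 per-file winners, the IP with the highest frequency is the answer.
An alternative description:
Approach: divide and conquer plus hashing
1) An IP address has at most 2^32 = 4G possible values, so the data cannot be loaded into memory whole.
2) Apply divide and conquer: according to Hash(IP) % 1024, write the IP log entries to 1024 small files. Each small file then contains at most 4M distinct IP addresses.
3) For each small file, build a hash map with the IP as key and the occurrence count as value, and keep track of the IP with the highest count so far.
4) This gives the most frequent IP of each of the 1024 small files. An ordinary sort over these yields the most frequent IP overall.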
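13. A search engine logs every search string that users submit. Each query string is 1 to 255 bytes long.
Suppose there are currently ten million records. The query strings repeat heavily, so although the total is ten million, far fewer distinct strings remain after removing duplicates. The more often a query string repeats, the more users searched for it and the more popular it is. Find the 10 most popular query strings using no more than 1G of memory.
This is a typical Top K problem, discussed in the article "Part 11: A thorough analysis of hash table algorithms from start to finish".
The final algorithm given there is:
Step 1: preprocess the data and count occurrences with a hash table in O(N) time (an earlier version said sorting, corrected here. July, 2011.04.27).
Step 2: use a heap to find the Top K, in N' log K time.
With a heap, lookups and adjustments take logarithmic time. So maintain a min-heap of size K (10 in this question), iterate over the distinct queries, and compare each with the root element. The overall time complexity is O(N) + N' * O(log K), where N is the ten million records and N' is the number of distinct queries. See the original article for more detail.
Alternatively, use a trie whose key field stores the number of occurrences of the query string (0 if it has not occurred), and finally rank the frequencies with a min-heap of 10 elements.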
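The two steps (hash-table counting, then a size-K min-heap) can be sketched with Python's standard library. This is an illustrative sketch rather than the referenced article's code; the function name `top_k` is made up here.

```python
import heapq
from collections import Counter


def top_k(queries, k=10):
    """Return the k most frequent queries as (count, query), most frequent first."""
    counts = Counter(queries)  # step 1: O(N) hash-table counting

    heap = []  # step 2: min-heap holding the k best (count, query) pairs seen so far
    for query, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, query))
        elif count > heap[0][0]:
            # The root is the weakest of the current top k; replace it in O(log k).
            heapq.heapreplace(heap, (count, query))

    return sorted(heap, reverse=True)
```

For example, `top_k(["a", "b", "a", "c", "a", "b"], k=2)` returns `[(3, "a"), (2, "b")]`. The heap never grows beyond k entries, so the second step costs O(N' log K) for N' distinct queries, matching the analysis above.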
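14. A 1G file contains one word per line. No word is longer than 16 bytes, and the memory limit is 1M. Return the 100 most frequent words.
Solution: read the file sequentially. For each word x, compute hash(x) % 5000 and write the word to one of 5000 small files (x0, x1, … x4999) according to that value. Each file is then about 200k.
If any file is larger than 1M, split it again the same way until no small file exceeds 1M.
For each small file, count the words and their frequencies (with a trie, a hash_map, or similar), take the 100 most frequent words (a min-heap with 100 nodes works), and write those 100 words and their frequencies to a file. This yields another 5000 files. The last step is to merge these 5000 files (much like merge sort).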
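15. There are 10 files of 1G each. Every line of every file holds a user query, and queries may repeat within each file. Sort the queries by frequency.
This is again a typical Top K problem. Possible solutions:
Solution 1:
Read the 10 files sequentially and write each query to one of another 10 files according to hash(query) % 10. Each new file is then also about 1G (assuming the hash function is random).
On a machine with about 2G of memory, process the new files one at a time: use a hash_map(query, query_count) to count the occurrences of each query, sort by count with quicksort, heapsort, or merge sort, and write the sorted queries with their query_count to a file. This yields 10 sorted files.
Merge-sort these 10 files (combining internal and external sorting).
Solution 2:
The number of distinct queries is usually limited and only the repetition is high, so all distinct queries may fit in memory at once. In that case, count the occurrences of each query directly with a trie or hash_map, then sort by count with quicksort, heapsort, or merge sort.
Solution 3:
Similar to solution 1, but after hashing the data into multiple files, hand the files to multiple machines and process them with a distributed framework (such as MapReduce), then merge the results.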
16. JVM and the garbage collection mechanism
Three generations: the Young Generation, the Old Generation, and the Permanent Generation.
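17. Find the integers that appear only once among 250 million integers. Note that memory cannot hold all 250 million integers.
Solution 1: use a 2-bit bitmap, with 2 bits per number: 00 means absent, 01 means seen once, 10 means seen more than once, and 11 is unused. This needs 2^32 * 2 bit = 1GB of memory, which is acceptable. Scan the 250 million integers and update the corresponding bits: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, go through the bitmap and output every integer whose bits are 01.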
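The 2-bit state machine (00 → 01 → 10) can be sketched in Python as follows. The class name `TwoBitMap` and its methods are made up for this illustration, and the example uses a tiny value range. Covering all 32-bit integers would need the 1GB array mentioned above.

```python
class TwoBitMap:
    """Two bits per integer: 0 = absent, 1 = seen once, 2 = seen more than once."""

    def __init__(self, universe):
        # Four 2-bit slots fit into each byte.
        self.data = bytearray((universe * 2 + 7) // 8)

    def _get(self, n):
        return (self.data[n >> 2] >> ((n & 3) * 2)) & 0b11

    def _set(self, n, state):
        shift = (n & 3) * 2
        cleared = self.data[n >> 2] & ~(0b11 << shift) & 0xFF
        self.data[n >> 2] = cleared | (state << shift)

    def add(self, n):
        state = self._get(n)
        if state < 2:  # 00 -> 01, 01 -> 10, and 10 stays as it is
            self._set(n, state + 1)

    def unique(self):
        """Return, in ascending order, the integers that were added exactly once."""
        return [n for n in range(len(self.data) * 4) if self._get(n) == 1]
```

Adding the values 3, 5, 3, 7, 5, 5, 9 and calling `unique()` yields 7 and 9, the only values seen exactly once.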
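Solution 2: use the same approach as in the earlier questions and split the data into small files. Find the non-repeated integers in each small file and sort them, then merge, taking care to remove duplicates.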
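18. Tencent interview question: given 4 billion distinct, unsorted unsigned int values and then one more number, how do you quickly decide whether that number is among the 4 billion?
The first idea is quicksort plus binary search. Better methods follow:
Solution 1: allocate 512M of memory, with one bit representing one unsigned int value. Read the 4 billion numbers and set the corresponding bits. Then read the number to look up and check its bit: 1 means present, 0 means absent.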
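The 512M figure follows from 2^32 possible values at one bit each: 2^32 / 8 bytes = 512 MiB. A minimal Python sketch of such a bit set is shown below. The class name `BitSet` is made up for this example, which uses a small range instead of the full 32-bit range.

```python
class BitSet:
    """One bit per possible value: 1 = present, 0 = absent."""

    def __init__(self, universe):
        self.bits = bytearray((universe + 7) // 8)

    def add(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def __contains__(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))


# Memory needed to cover every 32-bit unsigned value, in bytes.
FULL_RANGE_BYTES = 2 ** 32 // 8
```

Building the set takes one pass over the data, and each later lookup is a single byte read and mask, regardless of how many numbers were stored.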
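Solution 2: this problem is described well in Programming Pearls, and the approach given there is worth consulting.
19. How do you find the item that repeats the most in a huge data set?
Solution 1: hash the items, then map them to small files by taking the hash modulo some number. Find the most repeated item in each small file and record its count. The item with the highest count among these per-file results is the answer (see the earlier questions for details).
20. Given tens of millions or hundreds of millions of records (with duplicates), find the top N most frequent records.
Solution 1: tens or hundreds of millions of records should fit in the memory of a current machine. So count occurrences with a hash_map, a binary search tree, a red-black tree, or similar, then take the N most frequent records with the heap technique from question 2.
21. A text file has about ten thousand lines with one word per line. Find the 10 most frequent words. Describe the approach and analyze the time complexity.
Solution 1: this question is about time efficiency. Count the occurrences of each word with a trie, which takes O(n*le) time, where le is the average word length. Then find the 10 most frequent words with a heap as in the earlier questions, which takes O(n*lg10) time. The total time complexity is the larger of O(n*le) and O(n*lg10).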
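22. WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
The resources currently available in the cluster cannot satisfy what the application requests.
There are two kinds of resources: cores and RAM.
Cores are the executor slots available for execution.
RAM is the free memory each Worker needs in order to run your application.
Solution:
Do not let the application request more than the free resources available.
Shut down applications that have already finished.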
23. Application isn't using all of the Cores: How to set the Cores used by a Spark App
Set the number of cores each application can obtain.
Solution:
Set spark.deploy.defaultCores or spark.cores.max in spark-env.sh.
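24. Spark Executor OOM: How to set Memory Parameters on Spark
OOM means too much has piled up in memory.
1) Increase the job's parallelism, that is, the number of partitions, so that the large data set is cut into smaller pieces and less data is loaded into memory at once. The partition count is determined by the InputFormat's getSplits.
2) spark.storage.memoryFraction
This controls the ratio in the executor between memory for RDDs and memory for running tasks. If shuffles are small and need only a little shuffle memory, raise this ratio. The default is 0.6. It should not be larger than the old generation, and anything beyond that is wasted.
3) spark.executor.memory. If that still does not help, increase the Executor's memory. A restart is needed after changing it.
25. Shark Server / Long Running Application Metadata Cleanup
The metadata of a Spark program accumulates in memory without bound. Use spark.cleaner.ttl to prevent OOM. The problem appears mainly in Spark Streaming and Shark Server.
export SPARK_JAVA_OPTS+=" -Dspark.kryoserializer.buffer.mb=10 -Dspark.cleaner.ttl=43200"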
26. Class Not Found: Classpath Issues
Problem 1: a jar is missing from the classpath.
Problem 2: jar conflict, with different versions of the same jar.
Fix 1:
Package all dependency jars into a single fat jar, then set the dependency manually to the designated directory on each machine.
val conf = new SparkConf().setAppName(appName).setJars(Seq(System.getProperty("user.dir") + "/target/scala-2.10/sparktest.jar"))
Fix 2:
Put all required dependency jars on the default classpath and distribute them to every worker node.
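23. Write a wordcount program with MR, Spark, and Spark SQL
Plenty of examples are available online.
24. How do you set the number of mappers for a Hadoop job?
Forcing the number of splits with job.setNumMapTasks(int n) is unreliable.
The official documentation says "Note: This is only a hint to the framework", meaning the method is advisory and not decisive.
In practice the split size is computed with this formula:
max(min.split, min(max.split, block)), so you set the minimum and maximum split sizes, and computeSplitSize() applies them.
See this article: http://blog.csdn.net/strongerbit/article/details/7440111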
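The formula above can be written out in Python to see how the minimum and maximum settings move the split size. This is a sketch of the arithmetic only, not Hadoop code; the function and parameter names are chosen for this example, and the sizes are illustrative.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size, max_size):
    """max(min.split, min(max.split, block)) from the text above."""
    return max(min_size, min(max_size, block_size))


# With permissive limits, the split size equals the block size.
default_split = compute_split_size(128 * MB, 1, 2 ** 63 - 1)

# Raising the minimum above the block size gives larger splits (fewer mappers).
larger_split = compute_split_size(128 * MB, 256 * MB, 2 ** 63 - 1)

# Lowering the maximum below the block size gives smaller splits (more mappers).
smaller_split = compute_split_size(128 * MB, 1, 64 * MB)
```

Since roughly one map task is created per split, changing the minimum or maximum split size is the practical way to influence the mapper count.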
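25. Can a Hadoop job write its output to multiple directories? If so, how?
Answer: from version 1.X onward, use the MultipleOutputs class.
26. How do you set the number of reducers to create for a Hadoop job?
Call job.setNumReduceTasks(int n),
or adjust the default value of mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml.
27. What is the difference between Spark Streaming and Storm?
Storm is real-time with millisecond latency, while Spark Streaming is near-real-time with sub-second latency. Storm's throughput, however, is lower.
28. If your company asked you to write a design for a Hadoop platform, how would you plan the production Hadoop cluster?
This question tests the big picture. Think about it from an architect's point of view.
29. For Hadoop cluster monitoring, which metrics would you watch?
This question focuses on cluster operations.
30. In real production, how do you migrate a traditional relational database to the Hadoop platform, and what problems did you meet during the migration?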

 
