Implementing the K-means Algorithm on Hadoop

Posted 2023-01-12 GarfieldEr007

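Some time ago, going from configuring Hadoop to running a K-means MapReduce program cost me several frustrating days. Yesterday I finally resolved the configuration and runtime problems I had run into. The K-means algorithm looks simple, but writing it as a first MapReduce program is still a challenge; fortunately I now understand most of it. The K-means code below was downloaded from the internet, and this post walks through how it runs.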

KMeans.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeans {
    public static void main(String[] args) throws Exception {
        // Choose the initial centers
        CenterInitial centerInitial = new CenterInitial();
        centerInitial.run(args);
        int times = 0;
        double s = 0, shold = 0.1; // shold is the preset convergence threshold
        do {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://localhost:9000");
            Job job = new Job(conf, "KMeans"); // create the KMeans MapReduce job
            job.setJarByClass(KMeans.class); // class used to locate the job jar
            job.setOutputKeyClass(Text.class); // output key type: Text
            job.setOutputValueClass(Text.class); // output value type: Text
            job.setMapperClass(KMapper.class); // Mapper class
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setReducerClass(KReducer.class); // Reducer class
            FileSystem fs = FileSystem.get(conf);
            // args[2] is the output directory; delete it if it already exists
            fs.delete(new Path(args[2]), true);
            // args[0] is the job input, args[2] is the job output
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[2]));
            // Run the job once (the original called waitForCompletion twice)
            // and continue only if it succeeded
            if (job.waitForCompletion(true)) {
                // Compare the previous centers with the new ones. If the distance
                // between them is below the threshold, stop; otherwise the newest
                // centers become the centers for the next round.
                NewCenter newCenter = new NewCenter();
                s = newCenter.run(args);
                times++;
            }
        } while (s > shold); // stop once the error is no longer above the threshold
        System.out.println("Iterator: " + times); // number of iterations
    }
}
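The loop above keeps running while the value returned by NewCenter.run is greater than shold. NewCenter is not shown in this post, so the following is only a sketch of one plausible way to compute such a value: the sum of the Euclidean distances between each old center and the new center at the same position. The class name CenterShift, the method totalShift and the sample numbers are all hypothetical and are not taken from the downloaded code.

```java
// Hypothetical helper for illustration; this is NOT the NewCenter class from the post.
final class CenterShift {
    /** Sums the Euclidean distance between each old center and the new center at the same index. */
    static double totalShift(double[][] oldCenters, double[][] newCenters) {
        if (oldCenters.length != newCenters.length) {
            throw new IllegalArgumentException("center counts differ");
        }
        double total = 0.0;
        for (int i = 0; i < oldCenters.length; i++) {
            double sumSq = 0.0;
            for (int d = 0; d < oldCenters[i].length; d++) {
                double diff = oldCenters[i][d] - newCenters[i][d];
                sumSq += diff * diff;
            }
            total += Math.sqrt(sumSq);
        }
        return total;
    }

    public static void main(String[] args) {
        // Arbitrary sample values: the first center moves by 5, the second does not move.
        double[][] before = {{0, 0}, {1, 1}};
        double[][] after = {{3, 4}, {1, 1}};
        double s = totalShift(before, after);
        double shold = 0.1; // same threshold value as in KMeans.java
        System.out.println(s > shold ? "keep iterating" : "converged");
    }
}
```

With these sample values the total shift is 5, which is above the 0.1 threshold, so a driver built this way would run another iteration.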

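Question: what is args[]? It took me several days to work this out. args[] holds the arguments passed to the program at startup. They are set under Run Configurations, as follows: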

hdfs://localhost:9000/home/administrator/hadoop/kmeans/input hdfs://localhost:9000/home/administrator/hadoop/kmeans hdfs://localhost:9000/home/administrator/hadoop/kmeans/output

The comments in the code explain what each part does.

Input data, stored in 2.txt: (1,1) (9,9) (2,3) (10,30) (4,4) (34,40) (5,6) (15,20)

3.txt holds the temporary centers.

part-r-00000 holds the reduce output.
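To make the data flow concrete, here is a minimal plain-Java sketch of a single K-means iteration over the eight input points, starting from the initial centers (10,30) and (2,3) that appear in the log. It assumes the standard K-means scheme: the map step assigns each point to its nearest center and the reduce step averages the points assigned to each center. KMapper and KReducer are not shown in this post, so this is not their actual code, and the class KMeansStep and its methods are hypothetical.

```java
import java.util.Arrays;

// Illustrative only: plain Java without Hadoop, not the KMapper/KReducer from the post.
final class KMeansStep {
    /** Returns the index of the center closest to p, using squared Euclidean distance. */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** One iteration: assign every point to a center ("map"), then average each group ("reduce"). */
    static double[][] step(double[][] points, double[][] centers) {
        int dims = centers[0].length;
        double[][] sums = new double[centers.length][dims];
        int[] counts = new int[centers.length];
        for (double[] p : points) {
            int c = nearest(p, centers);
            counts[c]++;
            for (int d = 0; d < dims; d++) {
                sums[c][d] += p[d];
            }
        }
        double[][] next = new double[centers.length][dims];
        for (int c = 0; c < centers.length; c++) {
            for (int d = 0; d < dims; d++) {
                // A center with no assigned points keeps its old position.
                next[c][d] = counts[c] == 0 ? centers[c][d] : sums[c][d] / counts[c];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {9, 9}, {2, 3}, {10, 30}, {4, 4}, {34, 40}, {5, 6}, {15, 20}};
        double[][] centers = {{10, 30}, {2, 3}};
        System.out.println(Arrays.deepToString(step(points, centers)));
    }
}
```

Under these assumptions, (10,30), (34,40) and (15,20) go to the first center and the remaining five points go to the second, so the new centers come out at roughly (19.67, 30) and (4.2, 4.6). The assignment of (1,1) to center (2,3) matches the first Mapper line in the log.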

The MapReduce run and its output:

Initialization: (10,30) (2,3)
13/01/26 08:58:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/01/26 08:58:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/01/26 08:58:38 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/01/26 08:58:38 INFO input.FileInputFormat: Total input paths to process : 2
13/01/26 08:58:38 WARN snappy.LoadSnappy: Snappy native library not loaded
13/01/26 08:58:38 INFO mapred.JobClient: Running job: job_local_0001
13/01/26 08:58:39 INFO util.ProcessTree: setsid exited with exit code 0
13/01/26 08:58:39 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@15718f2
13/01/26 08:58:39 INFO mapred.MapTask: io.sort.mb = 100
13/01/26 08:58:39 INFO mapred.MapTask: data buffer = 79691776/99614720
13/01/26 08:58:39 INFO mapred.MapTask: record buffer = 262144/327680
0list:1
0c:10
1list:1
1c:30
Center (2,3) matched with point (1,1)
Mapper output: (2,3) (1,1)
0list:9
0c:10