How to implement a group comparator for Hadoop?

Posted 2023-04-17

[Original title]: how to implement a group comparator for hadoop?
[Asked]: 2014-04-02 06:48:46
[Question]:

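Given a class named KeyLabelDistance, which I pass as both the key and the value in Hadoop, I want to perform a secondary sort on it: first sort by increasing value of key, then by decreasing distance.

To do this, I need to write my own GroupingComparator. My question is: since the setGroupingComparator() method only accepts a class that extends RawComparator as its argument, how do I perform this comparison on bytes inside the grouping comparator? Do I need to explicitly serialize and deserialize the objects? And does having the KeyLabelDistance class implement WritableComparable, as below, make a SortComparator redundant?

I learned how SortComparator and GroupComparator are used from this answer: What are the differences between Sort Comparator and Group Comparator in Hadoop?

Here is the implementation of KeyLabelDistance: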

public class KeyLabelDistance implements WritableComparable<KeyLabelDistance>
{
        private int key;
        private int label;
        private double distance;
        KeyLabelDistance()
        {
            key = 0;
            label = 0;
            distance = 0;
        }
        KeyLabelDistance(int key, int label, double distance)
        {
            this.key = key;
            this.label = label;
            this.distance = distance;
        }
        public int getKey() {
            return key;
        }
        public void setKey(int key) {
            this.key = key;
        }
        public int getLabel() {
            return label;
        }
        public void setLabel(int label) {
            this.label = label;
        }
        public double getDistance() {
            return distance;
        }
        public void setDistance(double distance) {
            this.distance = distance;
        }

        public int compareTo(KeyLabelDistance lhs, KeyLabelDistance rhs)
        {
            if(lhs == rhs)
                return 0;
            else
            {
                if(lhs.getKey() < rhs.getKey())
                    return -1;
                else if(lhs.getKey() > rhs.getKey())
                    return 1;
                else
                {
                    //If the keys are equal, look at the distances -> since more is the "distance" more is the "similarity", the comparison is counterintuitive
                    if(lhs.getDistance() < rhs.getDistance() )
                        return 1;
                    else if(lhs.getDistance() > rhs.getDistance())
                        return -1;
                    else return 0;
                }
            }
        }
}

The code for the group comparator is as follows:

public class KeyLabelDistanceGroupingComparator extends WritableComparator {
    public int compare (KeyLabelDistance lhs, KeyLabelDistance rhs)
    {
        if(lhs == rhs)
            return 0;
        else
        {
            if(lhs.getKey() < rhs.getKey())
                return -1;
            else if(lhs.getKey() > rhs.getKey())
                return 1;
            return 0;
        }
    }
}

Any help is appreciated. Thanks in advance.

[Answer 1]:

You can extend WritableComparator, which in turn implements RawComparator. Both your sort comparator and your grouping comparator would extend WritableComparator.
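On the question of comparing bytes: a RawComparator receives each serialized key as a byte array plus an offset and a length. The sketch below uses only the JDK (no Hadoop classes) to illustrate what a comparison at that level might look like. The 16-byte big-endian layout (int key, int label, double distance) and the names RawKeyCompareSketch, serialize, and compareKeyBytes are assumptions made for illustration; the question does not show how KeyLabelDistance is actually serialized.

```java
import java.nio.ByteBuffer;

public class RawKeyCompareSketch {

    // Assumed (hypothetical) wire layout: int key, int label, double distance.
    // ByteBuffer defaults to big-endian, the same byte order DataOutput uses.
    public static byte[] serialize(int key, int label, double distance) {
        return ByteBuffer.allocate(16)
                .putInt(key)
                .putInt(label)
                .putDouble(distance)
                .array();
    }

    // Mirrors the parameter shape of RawComparator.compare(b1, s1, l1, b2, s2, l2):
    // reads only the leading int key from each byte range, without building objects.
    public static int compareKeyBytes(byte[] b1, int s1, int l1,
                                      byte[] b2, int s2, int l2) {
        int k1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int k2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(k1, k2);
    }

    public static void main(String[] args) {
        byte[] a = serialize(1, 7, 0.9);
        byte[] b = serialize(1, 3, 0.2);
        byte[] c = serialize(2, 7, 0.5);
        // Same key, different label and distance: treated as equal for grouping.
        System.out.println(compareKeyBytes(a, 0, a.length, b, 0, b.length));
        // Smaller key sorts first: negative result.
        System.out.println(compareKeyBytes(a, 0, a.length, c, 0, c.length));
    }
}
```

When a comparator extends WritableComparator with instance creation enabled, the inherited byte-level compare deserializes both keys and delegates to the object-level compare, so hand-written byte parsing of this kind is an optional optimization rather than a requirement.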

If you do not provide these comparators, Hadoop will internally end up using the compareTo of the Writable object that serves as your key.
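To make the roles of the two comparators concrete, here is a small JDK-only simulation of a secondary sort. It is not Hadoop's implementation: the class SecondarySortSketch, the stand-in type Rec, and the method sortAndGroup are hypothetical names. The sort comparator orders records by key ascending and then distance descending, while the grouping comparator looks at the key alone to decide where one group ends and the next begins.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    // Plain stand-in for the question's KeyLabelDistance (no Hadoop types).
    public static final class Rec {
        public final int key;
        public final int label;
        public final double distance;

        public Rec(int key, int label, double distance) {
            this.key = key;
            this.label = label;
            this.distance = distance;
        }
    }

    // Sort order: key ascending, then distance descending.
    public static final Comparator<Rec> SORT = (a, b) -> {
        int byKey = Integer.compare(a.key, b.key);
        return byKey != 0 ? byKey : Double.compare(b.distance, a.distance);
    };

    // Grouping order: only the key decides whether two records share a group.
    public static final Comparator<Rec> GROUP = (a, b) -> Integer.compare(a.key, b.key);

    // Sorts with SORT, then opens a new group whenever GROUP reports a change.
    public static List<List<Rec>> sortAndGroup(List<Rec> input) {
        List<Rec> sorted = new ArrayList<>(input);
        sorted.sort(SORT);
        List<List<Rec>> groups = new ArrayList<>();
        for (Rec r : sorted) {
            boolean startNewGroup = groups.isEmpty()
                    || GROUP.compare(groups.get(groups.size() - 1).get(0), r) != 0;
            if (startNewGroup) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(r);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Rec> input = new ArrayList<>();
        input.add(new Rec(2, 1, 0.1));
        input.add(new Rec(1, 5, 0.2));
        input.add(new Rec(1, 6, 0.9));
        input.add(new Rec(2, 2, 0.7));
        for (List<Rec> group : sortAndGroup(input)) {
            StringBuilder line = new StringBuilder("key " + group.get(0).key + ":");
            for (Rec r : group) {
                line.append(" (label ").append(r.label)
                    .append(", distance ").append(r.distance).append(")");
            }
            System.out.println(line);
        }
    }
}
```

If the grouping step reused SORT instead of GROUP, every distinct (key, distance) pair would form its own group, which is why a secondary sort needs a separate, coarser grouping comparator.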

[Comments]:

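Thanks, I tried that. I have now included the code for the group comparator in my question as well, but I get the following error:

    KeyLabelDistanceGroupingComparator.java:3: cannot find symbol
    symbol  : constructor WritableComparator()
    location: class org.apache.hadoop.io.WritableComparator

That is the error you get when you extend a class in Java and the superclass has no default constructor. Create a constructor in your code and call super() from it. For example:

    XYZKeyValueComparator() { super(MyWritable.class, true); }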
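Applied to the comparator from the question, the suggestion above might look like the following sketch. It needs the Hadoop libraries on the classpath, and it assumes that KeyLabelDistance is a complete WritableComparable (with compareTo(KeyLabelDistance), write, and readFields), which the question's listing does not yet show. The sketch also overrides compare(WritableComparable, WritableComparable): the question's compare(KeyLabelDistance, KeyLabelDistance) is only an overload, so the framework would not call it.

```java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class KeyLabelDistanceGroupingComparator extends WritableComparator {

    // Pass the key class to the superclass; "true" asks WritableComparator to
    // create key instances so it can deserialize raw bytes before comparing.
    public KeyLabelDistanceGroupingComparator() {
        super(KeyLabelDistance.class, true);
    }

    // Group on the key field only, ignoring label and distance.
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        KeyLabelDistance lhs = (KeyLabelDistance) a;
        KeyLabelDistance rhs = (KeyLabelDistance) b;
        return Integer.compare(lhs.getKey(), rhs.getKey());
    }
}
```

With the org.apache.hadoop.mapreduce.Job API, a class like this would typically be registered through job.setGroupingComparatorClass(KeyLabelDistanceGroupingComparator.class), and a sort comparator through job.setSortComparatorClass(...).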
