Integrating Hive with Mahout for Recommendation

Posted 2023-02-16

【Title】: Integration Hive with Mahout for Recommendation 【Posted】: 2014-02-26 18:48:46 【Question】:

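I want to use Mahout together with Hive: fetch the data from Hive, populate a DataModel with it, and use Mahout for recommendations. Is this possible? As far as I can see, Mahout only works with files. 1) How can I load data from a Hive table into Mahout? 2) Is there any other way to use Mahout recommendations with Hive or something else?

Here I have a Hive JDBC result that I want to load into a Mahout DataModel. How do I populate it?

I want to run Mahout recommendations on database results instead of reading from a file. For example: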

Hive:

    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveJdbcClient {
      private static String driverName = "org.apache.hive.jdbc.HiveDriver";

      /**
       * @param args
       * @throws SQLException
       */
      public static void main(String[] args) throws SQLException {
        try {
          Class.forName(driverName);
        } catch (ClassNotFoundException e) {
          // TODO Auto-generated catch block
          e.printStackTrace();
          System.exit(1);
        }
        // replace "hive" here with the name of the user the queries should run as
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        stmt.execute("drop table if exists " + tableName);
        stmt.execute("create table " + tableName + " (key int, value string)");
        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        if (res.next()) {
          System.out.println(res.getString(1));
        }
        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
        String filepath = "/tmp/a.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        stmt.execute(sql);

        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }

        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(res.getString(1));
        }
      }
    }

Mahout:

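    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    // Wrapper class name is illustrative.
    public class UserBasedRecommenderExample {
      public static void main(String[] args) throws Exception {
        // Create a data source from the CSV file
        File userPreferencesFile = new File("data/dataset1.csv");
        DataModel dataModel = new FileDataModel(userPreferencesFile);

        UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
        UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(2, userSimilarity, dataModel);

        // Create a generic user based recommender with the dataModel, the userNeighborhood and the userSimilarity
        Recommender genericRecommender = new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);

        // Recommend 5 items for each user
        for (LongPrimitiveIterator iterator = dataModel.getUserIDs(); iterator.hasNext();) {
          long userId = iterator.nextLong();

          // Generate a list of 5 recommendations for the user
          List<RecommendedItem> itemRecommendations = genericRecommender.recommend(userId, 5);

          System.out.format("User Id: %d%n", userId);

          if (itemRecommendations.isEmpty()) {
            System.out.println("No recommendations for this user.");
          } else {
            // Display the list of recommendations
            for (RecommendedItem recommendedItem : itemRecommendations) {
              System.out.format("Recommended Item Id %d. Strength of the preference: %f%n", recommendedItem.getItemID(), recommendedItem.getValue());
            }
          }
        }
      }
    }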

【Question comments】:

【Answer 1】:

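Mahout 0.9 provides data models (source data) for JDBC-compliant databases such as MySQL, Oracle, and PostgreSQL, for NoSQL databases such as MongoDB, HBase, and Cassandra, and for the file system, as you mentioned.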

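As of this version, Hive is not a 100% SQL-standard-compliant database. The data models MySQLJDBCDataModel and SQL92JDBCDataModel are therefore not suitable for Hive tables, because Hive's SQL syntax differs considerably from that of JDBC-compliant databases.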

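For your first question, you may want to extend AbstractJDBCDataModel and override the constructor to pass in a Hive data source and Hive-specific SQL queries for getting a preference, the preference time, a user, all users, and so on, similar to the queries passed in the AbstractJDBCDataModel constructor.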

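For the second question, the approach above applies if you are using the non-distributed (Taste) algorithms. If you are using the distributed algorithms, Mahout can run on Hadoop and read the HDFS files that back the Hive tables. See here for running Mahout on Hadoop.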
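One practical detail when reading the files behind a Hive table: Mahout's FileDataModel expects comma- or tab-delimited lines of user ID, item ID, and an optional preference, while Hive text tables use ctrl-A as the default field delimiter (as the note about /tmp/a.txt in the question also says). The following is a minimal sketch of that conversion, not part of the original answer. It assumes the table is stored as plain text with the default delimiter and has exactly three columns (user ID, item ID, preference); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Converts ctrl-A delimited Hive text rows into comma-delimited preference lines. */
class HiveTextToCsv {
    // Hive's default field delimiter for text tables is ctrl-A (0x01).
    private static final String HIVE_DELIMITER = "\001";

    /** Converts one row; assumes exactly three fields: user ID, item ID, preference. */
    static String toCsvLine(String hiveLine) {
        String[] fields = hiveLine.split(HIVE_DELIMITER, -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields but found " + fields.length);
        }
        // Parsing validates the values before they reach the recommender.
        long userId = Long.parseLong(fields[0].trim());
        long itemId = Long.parseLong(fields[1].trim());
        float preference = Float.parseFloat(fields[2].trim());
        return userId + "," + itemId + "," + preference;
    }

    /** Converts every non-empty row in the list. */
    static List<String> toCsvLines(List<String> hiveLines) {
        List<String> csv = new ArrayList<>();
        for (String line : hiveLines) {
            if (!line.isEmpty()) {
                csv.add(toCsvLine(line));
            }
        }
        return csv;
    }

    public static void main(String[] args) {
        // Two hand-built rows stand in for lines read from the table's files.
        List<String> rows = Arrays.asList(
            String.join(HIVE_DELIMITER, "1", "101", "5.0"),
            String.join(HIVE_DELIMITER, "2", "101", "2.5"));
        toCsvLines(rows).forEach(System.out::println);
    }
}
```

An alternative that avoids the conversion step is to declare the Hive table with ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' so that its files are already comma-delimited.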

【Comments】:

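Instead of DataModel dataModel = new FileDataModel(userPreferencesFile); UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel); can I do it like this, reading the data from Hive as in the example above? Is my understanding correct?

    DataModel dataModel = null;
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
      dataModel = new GenericDataModel(); // how do I load the data?
    }
    // use the data model with Mahout
    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);

The Mahout base libraries read data from the data model that is passed in, so the data model cannot be passed as null. Yes, GenericDataModel can also be extended, overriding the constructor to pass in Hive-specific SQL queries.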
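As a rough illustration of the loading step the comment asks about, the rows of a JDBC result set can first be grouped by user. This is only a sketch under assumptions that are not in the thread: the query is taken to return three columns (user ID, item ID, preference value), which differs from the (key, value) test table in the question, and the class and method names are hypothetical. Only the JDK is used, so it runs without Hive or Mahout.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Groups (user, item, preference) rows by user before they are handed to a data model. */
class HivePreferenceLoader {

    /** Records one preference, creating the user's map the first time the user is seen. */
    static void addPreference(Map<Long, Map<Long, Float>> byUser,
                              long userId, long itemId, float value) {
        byUser.computeIfAbsent(userId, id -> new LinkedHashMap<>()).put(itemId, value);
    }

    /**
     * Drains a JDBC result set into per-user preference maps.
     * Assumes column 1 = user ID, column 2 = item ID, column 3 = preference value.
     */
    static Map<Long, Map<Long, Float>> load(ResultSet res) throws SQLException {
        Map<Long, Map<Long, Float>> byUser = new LinkedHashMap<>();
        while (res.next()) {
            addPreference(byUser, res.getLong(1), res.getLong(2), res.getFloat(3));
        }
        return byUser;
    }

    public static void main(String[] args) {
        // In-memory rows stand in for a Hive result set so the sketch runs without a server.
        Map<Long, Map<Long, Float>> byUser = new LinkedHashMap<>();
        addPreference(byUser, 1L, 101L, 5.0f);
        addPreference(byUser, 1L, 102L, 3.0f);
        addPreference(byUser, 2L, 101L, 2.5f);
        System.out.println(byUser); // prints {1={101=5.0, 102=3.0}, 2={101=2.5}}
    }
}
```

With Mahout on the classpath, each per-user map could then be copied into a GenericUserPreferenceArray, the arrays collected in a FastByIDMap<PreferenceArray>, and that map passed to the GenericDataModel constructor, so the data model is never null when it reaches PearsonCorrelationSimilarity.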
