Using mahout in java code, not cli

【Posted】: 2013-08-31 11:25:32

【Question】:

I want to be able to build a model from Java. I can already do it with the CLI, like this:

    ./mahout trainlogistic --input Candy-Crush.twtr.csv \
       --output ./model \
       --target hd_click --categories 2 \
       --predictors click_frequency country_code ctr      device_price_range hd_conversion  time_of_day num_clicks phone_type twitter is_weekend app_entertainment app_wallpaper app_widgets arcade books_and_reference brain business cards casual comics communication education entertainment finance game_wallpaper game_widgets health_and_fitness health_fitness libraries_and_demo libraries_demo lifestyle media_and_video media_video medical music_and_audio news_and_magazines news_magazines personalization photography productivity racing shopping social sports sports_apps sports_games tools transportation travel_and_local weather app_entertainment_percentage app_wallpaper_percentage app_widgets_percentage arcade_percentage books_and_reference_percentage brain_percentage business_percentage cards_percentage casual_percentage comics_percentage communication_percentage education_percentage entertainment_percentage finance_percentage game_wallpaper_percentage game_widgets_percentage health_and_fitness_percentage health_fitness_percentage libraries_and_demo_percentage libraries_demo_percentage lifestyle_percentage media_and_video_percentage media_video_percentage medical_percentage music_and_audio_percentage news_and_magazines_percentage news_magazines_percentage personalization_percentage photography_percentage productivity_percentage racing_percentage shopping_percentage social_percentage sports_apps_percentage sports_games_percentage sports_percentage tools_percentage transportation_percentage travel_and_local_percentage weather_percentage reads_magazine_sum reads_magazine_count interested_in_gardening_sum interested_in_gardening_count kids_birthday_coming_sum kids_birthday_coming_count job_seeker_sum job_seeker_count friends_sum friends_count married_sum married_count charity_donor_sum charity_donor_count student_sum student_count interested_in_real_estate_sum interested_in_real_estate_count sports_fan_sum sports_fan_count bascketball_sum bascketball_count 
interested_in_politics_sum interested_in_politics_count gamer_sum gamer_count activist_sum activist_count traveler_sum traveler_count likes_soccer_sum likes_soccer_count interested_in_celebs_sum interested_in_celebs_count auto_racing_sum auto_racing_count age_group_sum age_group_count healthy_lifestyle_sum healthy_lifestyle_count interested_in_finance_sum interested_in_finance_count sports_teams_usa_sum sports_teams_usa_count interested_in_deals_sum interested_in_deals_count business_oriented_sum business_oriented_count interested_in_cooking_sum interested_in_cooking_count music_lover_sum music_lover_count beauty_sum beauty_count follows_fashion_sum follows_fashion_count likes_wrestling_sum likes_wrestling_count name_sum name_count shopper_sum shopper_count golf_sum golf_count vegetarian_sum vegetarian_count dating_sum dating_count interested_in_fashion_sum interested_in_fashion_count interested_in_news_sum interested_in_news_count likes_tennis_sum likes_tennis_count male_sum male_count interested_in_cars_sum interested_in_cars_count follows_bloggers_sum follows_bloggers_count entertainment_sum entertainment_count interested_in_books_sum interested_in_books_count has_kids_sum has_kids_count interested_in_movies_sum interested_in_movies_count musicians_sum musicians_count tech_oriented_sum tech_oriented_count female_sum female_count has_pet_sum has_pet_count practicing_sports_sum practicing_sports_count \
       --types      numeric         word         numeric  word               word           word        numeric    word       word    word        numeric       \
       --features 100 --passes 1 --rate 50

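I can't make sense of the 20 newsgroups example because it is too large to learn from. Can anyone give me code that does the same thing as the CLI command?

To clarify:

I need something like this: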

    model.train(1,0,"monday",6,44,1,7,4,6,78,7,3,4,6,........,"good");
    model.train(1,0,"sunday",6,44,5,7,9,2,4,6,78,7,3,4,6,........,"bad");
    model.train(1,0,"monday",4,99,2,4,6,3,4,6,........,"good");

    model.writeTofile("myModel.model");

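If you are not familiar with classification and only want to tell me how to execute a CLI command from Java, please don't answer.

【Comments】:

I don't understand your question. What do you mean by wanting to be able to build a model? You say you can do it with the CLI, but in the last sentence you ask for code that does the same thing as the CLI command.

Yes... I need the code that does it.

But you said you can already do it with the CLI. Why should we write your code for you, then?

Because I have a production environment where I cannot run CLI commands. I need to do it in pure Java.

Then please don't post here. I'm asking for help with Mahout, not for help on how not to run Java. Mahout is a Java library.

【Answer 1】: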

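I'm not 100% familiar with the Mahout API (and I agree the documentation is very sparse), so I can only offer pointers, but I hope they help:

The Java source of the trainlogistic example is actually available in the mahout-examples library, which is on Maven [0] (in org.apache.mahout.classifier.sgd.TrainLogistic). You could use exactly that source if you wanted, but it depends on several utility classes in the mahout-examples library (and it isn't very clean either).

The class that performs the training in this example is org.apache.mahout.classifier.sgd.OnlineLogisticRegression [1]. Given the large number of predictors you have, though, you may want to use AdaptiveLogisticRegression [2] (same package), which uses a number of OnlineLogisticRegression instances internally. You will have to check for yourself which one works best with your data.

The API is fairly straightforward: there is a train method that takes a Vector of input data, a classify method to test your model, and learningRate and other methods to change the model's parameters.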
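To make the train/classify shape above more concrete, here is a minimal plain-Java sketch of online logistic regression trained by stochastic gradient descent over a hashed feature vector. It has no Mahout dependency and is not Mahout's implementation: `TinyOnlineLogistic`, `addWord`, `addNumeric` and the fixed learning rate are all hypothetical names and simplifications chosen for illustration. It only shows the general idea of turning `word` and `numeric` predictors into a fixed-size vector (compare `--features 100`) and updating weights one row at a time.

```java
// Plain-Java illustration only: TinyOnlineLogistic is a hypothetical class, not part of Mahout.
final class TinyOnlineLogistic {
    private final double[] weights;     // one weight per hashed feature slot
    private final double learningRate;  // fixed step size (assumption: no decay schedule)

    TinyOnlineLogistic(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Hashes a categorical ("word") predictor into one slot and adds 1.0 there. */
    static void addWord(String name, String value, double[] vector) {
        int slot = Math.floorMod((name + ":" + value).hashCode(), vector.length);
        vector[slot] += 1.0;
    }

    /** Hashes a numeric predictor by its name and adds its value to that slot. */
    static void addNumeric(String name, double value, double[] vector) {
        int slot = Math.floorMod(name.hashCode(), vector.length);
        vector[slot] += value;
    }

    /** Returns the estimated probability that the instance belongs to category 1. */
    double classifyScalar(double[] features) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
            dot += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One stochastic gradient step towards the observed category (0 or 1). */
    void train(int actual, double[] features) {
        double error = actual - classifyScalar(features);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * features[i];
        }
    }
}
```

In use, each CSV row would be encoded into a fresh `double[100]` with `addWord` and `addNumeric`, then passed to `train` together with the value of the target column. With Mahout itself, the corresponding pieces are the Vector passed to train and the classify method mentioned above; check the Javadoc [1] for the exact signatures.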

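To save the model to disk the way the command-line tool does, use org.apache.mahout.classifier.sgd.ModelSerializer, which has a simple API for writing and reading your model. (The OLR class itself also has write and readFields methods, but frankly I'm not sure what they do or whether they differ from ModelSerializer; they are not documented either.)

Finally, besides the source in mahout-examples, there are two other examples that use the Mahout API directly and may be useful [3, 4].

Sources: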

[0]http://repo1.maven.org/maven2/org/apache/mahout/mahout-examples/0.8/

[1]http://archive.cloudera.com/cdh4/cdh/4/mahout/mahout-core/org/apache/mahout/classifier/sgd/OnlineLogisticRegression.html

[2]http://archive.cloudera.com/cdh4/cdh/4/mahout/mahout-core/org/apache/mahout/classifier/sgd/AdaptiveLogisticRegression.html

[3]http://mail-archives.apache.org/mod_mbox/mahout-user/201206.mbox/%3CCAJwFCa3X2fL_SRxT7f7v9uMjS3Tc9WrT7vuMQCVXyH71k0H0zQ@mail.gmail.com%3E

[4]http://skife.org/mahout/2013/02/14/first_steps_with_mahout.html

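【Answer 2】:

This blog has a good post on how to use the Mahout Java API for training and classification: http://nigap.blogspot.com/2012/02/bayes-algorithm-with-apache-mahout.html

【Comments】:

The tutorial is nice, but it is based on Mahout 0.5, which was released a long time ago. :(

【Answer 3】: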

You can execute the same command line from Java using Runtime.exec.

The simple way is:

Process p = Runtime.getRuntime().exec("/usr/bin/bash -ic \"<path_to_mahout>/mahout trainlogistic --input Candy-Crush.twtr.csv " + "--output ./model " + "--target hd_click --categories 2 " + "--predictors click_frequency country_code ctr device_price_range hd_conversion time_of_day num_clicks phone_type twitter is_weekend app_entertainment app_wallpaper app_widgets arcade books_and_reference brain business cards casual comics communication education entertainment finance game_wallpaper game_widgets health_and_fitness health_fitness libraries_and_demo libraries_demo lifestyle media_and_video media_video medical music_and_audio news_and_magazines news_magazines personalization photography productivity racing shopping social sports sports_apps sports_games tools transportation travel_and_local weather app_entertainment_percentage app_wallpaper_percentage app_widgets_percentage arcade_percentage books_and_reference_percentage brain_percentage business_percentage cards_percentage casual_percentage comics_percentage communication_percentage education_percentage entertainment_percentage finance_percentage game_wallpaper_percentage game_widgets_percentage health_and_fitness_percentage health_fitness_percentage libraries_and_demo_percentage libraries_demo_percentage lifestyle_percentage media_and_video_percentage media_video_percentage medical_percentage music_and_audio_percentage news_and_magazines_percentage news_magazines_percentage personalization_percentage photography_percentage productivity_percentage racing_percentage shopping_percentage social_percentage sports_apps_percentage sports_games_percentage sports_percentage tools_percentage transportation_percentage travel_and_local_percentage weather_percentage reads_magazine_sum reads_magazine_count interested_in_gardening_sum interested_in_gardening_count kids_birthday_coming_sum kids_birthday_coming_count job_seeker_sum job_seeker_count friends_sum friends_count married_sum married_count charity_donor_sum 
charity_donor_count student_sum student_count interested_in_real_estate_sum interested_in_real_estate_count sports_fan_sum sports_fan_count bascketball_sum bascketball_count interested_in_politics_sum interested_in_politics_count gamer_sum gamer_count activist_sum activist_count traveler_sum traveler_count likes_soccer_sum likes_soccer_count interested_in_celebs_sum interested_in_celebs_count auto_racing_sum auto_racing_count age_group_sum age_group_count healthy_lifestyle_sum healthy_lifestyle_count interested_in_finance_sum interested_in_finance_count sports_teams_usa_sum sports_teams_usa_count interested_in_deals_sum interested_in_deals_count business_oriented_sum business_oriented_count interested_in_cooking_sum interested_in_cooking_count music_lover_sum music_lover_count beauty_sum beauty_count follows_fashion_sum follows_fashion_count likes_wrestling_sum likes_wrestling_count name_sum name_count shopper_sum shopper_count golf_sum golf_count vegetarian_sum vegetarian_count dating_sum dating_count interested_in_fashion_sum interested_in_fashion_count interested_in_news_sum interested_in_news_count likes_tennis_sum likes_tennis_count male_sum male_count interested_in_cars_sum interested_in_cars_count follows_bloggers_sum follows_bloggers_count entertainment_sum entertainment_count interested_in_books_sum interested_in_books_count has_kids_sum has_kids_count interested_in_movies_sum interested_in_movies_count musicians_sum musicians_count tech_oriented_sum tech_oriented_count female_sum female_count has_pet_sum has_pet_count practicing_sports_sum practicing_sports_count " + "--types numeric word numeric word word word numeric word word word numeric " + "--features 100 --passes 1 --rate 50\"");

If you go this way, I suggest you read this first: When Runtime.exec() won't

This way the application will run in a separate process.
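A common way to avoid the usual Runtime.exec problems (quoting inside one long command string, and a child process blocking because nobody reads its output) is to pass the arguments as a list to ProcessBuilder and drain the output before waiting. The following is a rough sketch under those assumptions; `MahoutCliRunner`, `buildCommand` and `run` are hypothetical names, the path to the mahout script is a placeholder you must supply, and this still starts a separate process, so it does not help where that is not allowed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Mahout.
final class MahoutCliRunner {

    /** Builds the argument list. Each flag and value is its own element, so no shell quoting is needed. */
    static List<String> buildCommand(String mahoutPath, String input, String output, String target,
                                     List<String> predictors, List<String> types) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                mahoutPath, "trainlogistic",
                "--input", input,
                "--output", output,
                "--target", target,
                "--categories", "2"));
        cmd.add("--predictors");
        cmd.addAll(predictors);
        cmd.add("--types");
        cmd.addAll(types);
        cmd.addAll(Arrays.asList("--features", "100", "--passes", "1", "--rate", "50"));
        return cmd;
    }

    /** Starts the command, echoes its combined stdout/stderr, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout so one reader drains both
        Process process = builder.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        return process.waitFor();
    }
}
```

Keeping the predictor names in a List also avoids maintaining one very long string literal: the list can be read from the CSV header instead of being typed out by hand.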

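In addition, you can follow the "Integration with your application" section on the following site: Recommender Documentation

This is also a good reference for writing a recommender: Introducing Apache Mahout

Hope this helps. Cheers

【Comments】:

I know how to use the Java Runtime. That is not what I asked for.

@DimaGoltsman Sorry, I was just trying to help. I thought you wanted it to run automatically rather than manually. You didn't mention that you are not allowed to create new processes. In any case, your question being unclear doesn't make my answer wrong.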
