Convert a CSV with categorical data to libsvm

Posted 2023-03-12

【Title】: Convert csv with categorical data to libsvm 【Posted】: 2015-10-05 09:30:42 【Question】:

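I am using Spark MLlib to build machine learning models. If the data contains categorical variables, I need to provide a libsvm-format file as input.

I tried converting the CSV file to libsvm using (1) Convert.c, as suggested on the libsvm website, and (2) Csvtolibsvm.py from the phraug repository on GitHub.

Neither script seems to convert categorical data. I also installed Weka and tried saving the data in libsvm format, but I could not find that option in the Weka Explorer.

Please suggest another way to convert a CSV with categorical data to libsvm format, or let me know if I am missing something here.

Thanks in advance for your help.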

【Comments】:

【Answer 1】:

I guess you want to train an SVM. It takes an RDD[LabeledPoint] as input.

https://spark.apache.org/docs/1.4.1/api/scala/#org.apache.spark.mllib.classification.SVMWithSGD

I suggest encoding your categorical columns along the lines of the second answer here:

How to transform a categorical variable in Spark into a set of columns coded as 0,1?

The LogisticRegression case is very similar to the SVM case.
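To make the one-hot idea concrete without Spark, here is a minimal pure-Python sketch that turns each distinct (column, value) pair into its own 0/1 indicator feature and emits libsvm-style lines. The function name `csv_to_libsvm`, its parameters, and the sample data are made up for illustration; it also assumes the label column already holds numeric labels.

```python
import csv
import io


def csv_to_libsvm(csv_text, label_col, categorical_cols, numeric_cols=()):
    """One-hot encode categorical columns and return libsvm-formatted lines."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Numeric columns take the first feature indices (libsvm indices start at 1).
    index = {col: i for i, col in enumerate(numeric_cols, start=1)}
    # Every distinct (column, value) pair gets its own indicator feature.
    for col in categorical_cols:
        for value in sorted({row[col] for row in rows}):
            index[(col, value)] = len(index) + 1

    lines = []
    for row in rows:
        feats = {index[col]: float(row[col]) for col in numeric_cols}
        for col in categorical_cols:
            feats[index[(col, row[col])]] = 1.0
        # libsvm expects "label index:value ..." with ascending indices.
        body = " ".join("{}:{:g}".format(i, v) for i, v in sorted(feats.items()))
        lines.append("{} {}".format(row[label_col], body))
    return lines


# Hypothetical sample data, not taken from the question.
sample = """label,age,city,os
1,31,chennai,android
0,27,lafayette,ios
1,45,kanpur,android
"""
for line in csv_to_libsvm(sample, "label", ["city", "os"], ["age"]):
    print(line)
# prints:
# 1 1:31 2:1 5:1
# 0 1:27 4:1 6:1
# 1 1:45 3:1 5:1
```

This builds the vocabulary from the data in memory, so it only suits files that fit on one machine; the same mapping would have to be reused at prediction time.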

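【Comments】:

【Answer 2】:

You can try the hashing trick to turn the categorical features into numbers, then convert the DataFrame to an RDD and map a conversion function over each row. The toy example below uses pyspark.

Suppose the DataFrame to convert is df: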

>> df.show(5)

+------+----------------+-------+-------+
|gender|            city|country|     os|
+------+----------------+-------+-------+
|     M|         chennai|     IN|android|
|     F|       hyderabad|     IN|ANDROID|
|     M|leighton buzzard|     GB|ANDROID|
|     M|          kanpur|     IN|ANDROID|
|     F|       lafayette|     US|    ios|
+------+----------------+-------+-------+

I want to predict gender from the features city, country and os.

import hashlib
from pyspark.sql import Row, SparkSession
from pyspark.ml.linalg import SparseVector

spark = SparkSession \
    .builder \
    .appName("Spark-app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()  # create the Spark session

# Number of hash buckets. Use a large value when there are many categorical
# features with many distinct values, but keep memory in mind.
NR_BINS = 100000

def hashnum(value):
    # md5 needs bytes; the result is a bucket index in 1..NR_BINS
    return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16) % NR_BINS + 1

def libsvm_converter(row):
    target = "gender"
    features = ['city', 'country', 'os']
    if row[target] == "M":
        lab = 1
    elif row[target] == "F":
        lab = 0
    else:
        return None  # unknown label; filtered out below
    sparse_vector = []
    for f in features:
        v = '{}-{}'.format(f, row[f])  # e.g. "city-chennai"
        hashv = hashnum(v)  # the index in libsvm
        sparse_vector.append((hashv, 1))  # the value is always 1 for a categorical feature
    sparse_vector = list(set(sparse_vector))  # drop duplicates in case of collisions (NR_BINS not big enough)
    # indices run from 1 to NR_BINS, so the vector needs NR_BINS + 1 slots
    return Row(label=lab, features=SparseVector(NR_BINS + 1, sparse_vector))


libsvm = df.rdd.map(libsvm_converter).filter(lambda r: r is not None)
data = spark.createDataFrame(libsvm)

If you inspect the data, it looks something like this:

>> data.show()
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(100000,[12626,68...|    1|
|(100000,[59866,68...|    0|
|(100000,[66386,68...|    1|
|(100000,[53746,68...|    1|
|(100000,[6966,373...|    0|
+--------------------+-----+
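The DataFrame above still holds sparse vectors rather than libsvm text. As a rough sketch of what the corresponding text lines would look like, the pure-Python snippet below applies the same md5 bucketing and formats one record as `label index:value ...`. The helper names `hash_index` and `to_libsvm_line` and the sample record are illustrative assumptions, not part of the answer.

```python
import hashlib

NR_BINS = 100000  # same bucket count as in the example above


def hash_index(feature, value, nr_bins=NR_BINS):
    """Map a feature/value pair to a bucket index in 1..nr_bins."""
    key = "{}-{}".format(feature, value).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % nr_bins + 1


def to_libsvm_line(label, record, features):
    # The set drops duplicate indices caused by hash collisions;
    # sorting gives the ascending index order that libsvm expects.
    indices = sorted({hash_index(f, record[f]) for f in features})
    return "{} {}".format(label, " ".join("{}:1".format(i) for i in indices))


# Hypothetical record shaped like a row of df.
record = {"city": "chennai", "country": "IN", "os": "android"}
print(to_libsvm_line(1, record, ["city", "country", "os"]))
```

Because the index depends only on the feature name and value, no vocabulary has to be stored, at the cost of occasional collisions when NR_BINS is too small.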

【Comments】: