ALINK (20): Data Processing, Numeric Data Processing: Vector Normalization VectorNormalizeBatchOp / Vector Standard Scaler Training VectorStandardScalerTrainBatchOp / Vector
Posted by 秋华
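Vector Normalization (VectorNormalizeBatchOp)
Java class name: com.alibaba.alink.operator.batch.dataproc.vector.VectorNormalizeBatchOp
Python class name: VectorNormalizeBatchOp
Overview
Normalizes a Vector by dividing it by its p-norm.
The parameter p sets the order of the norm. For example, with p = 2 and a vector <x1, x2, x3>, the operator sums the squares of the components and takes the square root of that sum, called norm. The result is <x1/norm, x2/norm, x3/norm>.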
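The arithmetic described above can be sketched in plain Python. This is not Alink code: the helper names `parse_sparse` and `normalize` are hypothetical, and the general formula (sum of |x|^p, raised to 1/p) is the textbook p-norm, assumed here to match what the operator does for finite p. A zero vector would cause a division by zero in this sketch.

```python
def parse_sparse(text):
    """Parse "index:value" pairs such as "1:3,2:4,4:7" into a dict."""
    pairs = (item.split(":") for item in text.split(","))
    return {int(i): float(v) for i, v in pairs}


def normalize(vec, p=2.0):
    """Divide every stored value by the p-norm of the vector (hypothetical helper)."""
    norm = sum(abs(v) ** p for v in vec.values()) ** (1.0 / p)
    return {i: v / norm for i, v in vec.items()}


# The three sparse vectors used in the example for this operator.
for text in ["1:3,2:4,4:7", "0:3,5:5", "2:4,4:5"]:
    print(text, "->", normalize(parse_sparse(text)))
```

For the vector 1:3,2:4,4:7 the norm is the square root of 9 + 16 + 49 = 74, so the three values come out near 0.3487, 0.4650 and 0.8137.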
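Parameters
| Name | Display name | Description | Type | Required? | Default |
| --- | --- | --- | --- | --- | --- |
| selectedCol | Selected column | Name of the column to compute on | String | ✓ | |
| outputCol | Output column | Name of the output column; optional, defaults to null | String | | null |
| reservedCols | Reserved columns | Columns the algorithm keeps in the output | String[] | | null |
| p | Order of the norm | Order of the norm; defaults to 2 | Double | | 2.0 |
| numThreads | Thread count | Number of threads the component uses | Integer | | 1 |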
Code Examples
Python code

    from pyalink.alink import *
    import pandas as pd

    useLocalEnv(1)

    # Sparse vectors in "index:value" form, plus an id column.
    df = pd.DataFrame([
        ["1:3,2:4,4:7", 1],
        ["0:3,5:5", 3],
        ["2:4,4:5", 4]
    ])
    data = BatchOperator.fromDataframe(df, schemaStr="vec string, id bigint")

    # Normalize "vec" with the default p = 2 and write the result to "vec_norm".
    VectorNormalizeBatchOp() \
        .setSelectedCol("vec") \
        .setOutputCol("vec_norm") \
        .linkFrom(data) \
        .collectToDataframe()

Java code

    import org.apache.flink.types.Row;

    import com.alibaba.alink.operator.batch.BatchOperator;
    import com.alibaba.alink.operator.batch.dataproc.vector.VectorNormalizeBatchOp;
    import com.alibaba.alink.operator.batch.source.MemSourceBatchOp;
    import org.junit.Test;

    import java.util.Arrays;
    import java.util.List;

    public class VectorNormalizeBatchOpTest {
        @Test
        public void testVectorNormalizeBatchOp() throws Exception {
            // Sparse vectors in "index:value" form, plus an id column.
            List<Row> df = Arrays.asList(
                Row.of("1:3,2:4,4:7", 1),
                Row.of("0:3,5:5", 3),
                Row.of("2:4,4:5", 4)
            );
            BatchOperator<?> data = new MemSourceBatchOp(df, "vec string, id int");
            // Normalize "vec" with the default p = 2 and write the result to "vec_norm".
            new VectorNormalizeBatchOp()
                .setSelectedCol("vec")
                .setOutputCol("vec_norm")
                .linkFrom(data)
                .print();
        }
    }

Output
| vec | id | vec_norm |
| --- | --- | --- |
| 1:3,2:4,4:7 | 1 | 1:0.34874291623145787 2:0.46499055497527714 4:0.813733471206735 |
| 0:3,5:5 | 3 | 0:0.5144957554275265 5:0.8574929257125441 |
| 2:4,4:5 | 4 | 2:0.6246950475544243 4:0.7808688094430304 |
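Vector Standard Scaler Training (VectorStandardScalerTrainBatchOp)
Java class name: com.alibaba.alink.operator.batch.dataproc.vector.VectorStandardScalerTrainBatchOp
Python class name: VectorStandardScalerTrainBatchOp
Overview
The standard scaler is a component that standardizes vector data, centering each dimension on its mean and scaling it by its standard deviation.
This operator produces the standardization model. VectorStandardScalerPredictBatchOp loads that model and applies it to the data.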
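Parameters
| Name | Display name | Description | Type | Required? | Default |
| --- | --- | --- | --- | --- | --- |
| selectedCol | Selected column | Name of the column to compute on | String | ✓ | |
| withMean | Use mean | Whether to subtract the mean; enabled by default | Boolean | | true |
| withStd | Use standard deviation | Whether to divide by the standard deviation; enabled by default | Boolean | | true |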
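To make the roles of withMean and withStd concrete, here is a minimal plain-Python sketch of the training step. It is not Alink code: `parse_dense` and `fit_scaler` are hypothetical helpers, the "model" is just a pair of lists, and treating a disabled flag as mean 0 or standard deviation 1 is an assumption. The sketch uses the sample standard deviation (n - 1 denominator), which reproduces the values in this section's example output. Dimensions with zero variance are not handled.

```python
import statistics


def parse_dense(text):
    """Parse a dense vector string such as "10.0, 100" into a list of floats."""
    return [float(tok) for tok in text.split(",")]


def fit_scaler(vectors, with_mean=True, with_std=True):
    """Return per-dimension (means, stds); hypothetical stand-in for the trained model."""
    columns = list(zip(*vectors))
    # Assumption: a disabled flag leaves that part of the transform as a no-op.
    means = [statistics.mean(col) if with_mean else 0.0 for col in columns]
    stds = [statistics.stdev(col) if with_std else 1.0 for col in columns]
    return means, stds


raw = ["10.0, 100", "-2.5, 9", "100.2, 1", "-99.9, 100",
       "1.4, 1", "-2.2, 9", "100.9, 1"]
vectors = [parse_dense(r) for r in raw]

means, stds = fit_scaler(vectors)
print("means:", means)
print("stds: ", stds)

# With with_std=False only centering would be applied later.
means_only, unit_stds = fit_scaler(vectors, with_std=False)
print("stds with with_std=False:", unit_stds)
```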
Code Examples
Python code

    from pyalink.alink import *
    import pandas as pd

    useLocalEnv(1)

    # Dense vectors written as comma-separated values, plus a label column.
    df = pd.DataFrame([
        ["a", "10.0, 100"],
        ["b", "-2.5, 9"],
        ["c", "100.2, 1"],
        ["d", "-99.9, 100"],
        ["a", "1.4, 1"],
        ["b", "-2.2, 9"],
        ["c", "100.9, 1"]
    ])
    data = BatchOperator.fromDataframe(df, schemaStr="col string, vector string")

    # Train the scaler model on the "vector" column.
    trainOp = VectorStandardScalerTrainBatchOp().setSelectedCol("vector")
    model = trainOp.linkFrom(data)

    # Apply the trained model to the same data.
    VectorStandardScalerPredictBatchOp().linkFrom(model, data).collectToDataframe()

Java code

    import org.apache.flink.types.Row;

    import com.alibaba.alink.operator.batch.BatchOperator;
    import com.alibaba.alink.operator.batch.dataproc.vector.VectorStandardScalerPredictBatchOp;
    import com.alibaba.alink.operator.batch.dataproc.vector.VectorStandardScalerTrainBatchOp;
    import com.alibaba.alink.operator.batch.source.MemSourceBatchOp;
    import org.junit.Test;

    import java.util.Arrays;
    import java.util.List;

    public class VectorStandardScalerTrainBatchOpTest {
        @Test
        public void testVectorStandardScalerTrainBatchOp() throws Exception {
            // Dense vectors written as comma-separated values, plus a label column.
            List<Row> df = Arrays.asList(
                Row.of("a", "10.0, 100"),
                Row.of("b", "-2.5, 9"),
                Row.of("c", "100.2, 1"),
                Row.of("d", "-99.9, 100"),
                Row.of("a", "1.4, 1"),
                Row.of("b", "-2.2, 9"),
                Row.of("c", "100.9, 1")
            );
            BatchOperator<?> data = new MemSourceBatchOp(df, "col string, vector string");
            // Train the scaler model on the "vector" column.
            BatchOperator<?> trainOp = new VectorStandardScalerTrainBatchOp().setSelectedCol("vector");
            BatchOperator<?> model = trainOp.linkFrom(data);
            // Apply the trained model to the same data.
            new VectorStandardScalerPredictBatchOp().linkFrom(model, data).print();
        }
    }

Output
| col1 | vec |
| --- | --- |
| a | -0.07835182408093559,1.4595814453461897 |
| c | 1.2269606224811418,-0.6520885789229323 |
| b | -0.2549018445693762,-0.4814485769617911 |
| a | -0.20280511721213143,-0.6520885789229323 |
| c | 1.237090541689495,-0.6520885789229323 |
| b | -0.25924323851581327,-0.4814485769617911 |
| d | -1.6687491397923802,1.4595814453461897 |
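Vector Standard Scaler Prediction (VectorStandardScalerPredictBatchOp)
Java class name: com.alibaba.alink.operator.batch.dataproc.vector.VectorStandardScalerPredictBatchOp
Python class name: VectorStandardScalerPredictBatchOp
Overview
The standard scaler is a component that standardizes vector data, centering each dimension on its mean and scaling it by its standard deviation.
This operator loads the model produced by VectorStandardScalerTrainBatchOp and uses it to standardize vector data as a preprocessing step.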
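The prediction step only applies statistics that training already computed. A minimal plain-Python sketch, assuming the model boils down to per-dimension means and sample standard deviations (the real Alink model format is not shown in this article, and `transform` is a hypothetical helper):

```python
import statistics


def transform(vector, means, stds):
    """Standardize one vector with precomputed per-dimension statistics."""
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]


# Training data from the example, already parsed into floats.
train = [
    [10.0, 100.0], [-2.5, 9.0], [100.2, 1.0], [-99.9, 100.0],
    [1.4, 1.0], [-2.2, 9.0], [100.9, 1.0],
]

# Stand-in for the trained model: statistics come from the training set only.
means = [statistics.mean(col) for col in zip(*train)]
stds = [statistics.stdev(col) for col in zip(*train)]

# Apply the model; here the prediction input is the training data itself.
scaled = [transform(v, means, stds) for v in train]
for v in scaled:
    print(",".join(str(x) for x in v))
```

Because the statistics are fixed at training time, the same `means` and `stds` can be applied to vectors that were never part of the training set.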
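Parameters
| Name | Display name | Description | Type | Required? | Default |
| --- | --- | --- | --- | --- | --- |
| outputCol | Output column | Name of the output column; optional, defaults to null | String | | null |
| numThreads | Thread count | Number of threads the component uses | Integer | | 1 |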
Code Examples
Python code

    from pyalink.alink import *
    import pandas as pd

    useLocalEnv(1)

    # Dense vectors written as comma-separated values, plus a label column.
    df = pd.DataFrame([
        ["a", "10.0, 100"],
        ["b", "-2.5, 9"],
        ["c", "100.2, 1"],
        ["d", "-99.9, 100"],
        ["a", "1.4, 1"],
        ["b", "-2.2, 9"],
        ["c", "100.9, 1"]
    ])
    data = BatchOperator.fromDataframe(df, schemaStr="col string, vector string")

    # Train the scaler model on the "vector" column.
    trainOp = VectorStandardScalerTrainBatchOp().setSelectedCol("vector")
    model = trainOp.linkFrom(data)

    # Load the model and standardize the data.
    VectorStandardScalerPredictBatchOp().linkFrom(model, data).collectToDataframe()

Java code

    import org.apache.flink.types.Row;

    import com.alibaba.alink.operator.batch.BatchOperator;
    import com.alibaba.alink.operator.batch.dataproc.vector.VectorStandardScalerPredictBatchOp;
    import com.alibaba.alink.operator.batch.dataproc.vector.VectorStandardScalerTrainBatchOp;
    import com.alibaba.alink.operator.batch.source.MemSourceBatchOp;
    import org.junit.Test;

    import java.util.Arrays;
    import java.util.List;

    public class VectorStandardScalerPredictBatchOpTest {
        @Test
        public void testVectorStandardScalerPredictBatchOp() throws Exception {
            // Dense vectors written as comma-separated values, plus a label column.
            List<Row> df = Arrays.asList(
                Row.of("a", "10.0, 100"),
                Row.of("b", "-2.5, 9"),
                Row.of("c", "100.2, 1"),
                Row.of("d", "-99.9, 100"),
                Row.of("a", "1.4, 1"),
                Row.of("b", "-2.2, 9"),
                Row.of("c", "100.9, 1")
            );
            BatchOperator<?> data = new MemSourceBatchOp(df, "col string, vector string");
            // Train the scaler model on the "vector" column.
            BatchOperator<?> trainOp = new VectorStandardScalerTrainBatchOp().setSelectedCol("vector");
            BatchOperator<?> model = trainOp.linkFrom(data);
            // Load the model and standardize the data.
            new VectorStandardScalerPredictBatchOp().linkFrom(model, data).print();
        }
    }

Output
| col1 | vec |
| --- | --- |
| a | -0.07835182408093559,1.4595814453461897 |
| c | 1.2269606224811418,-0.6520885789229323 |
| b | -0.2549018445693762,-0.4814485769617911 |
| a | -0.20280511721213143,-0.6520885789229323 |
| c | 1.237090541689495,-0.6520885789229323 |
| b | -0.25924323851581327,-0.4814485769617911 |
| d | -1.6687491397923802,1.4595814453461897 |