基于逻辑回归信用卡欺诈检测

Posted jiaxinhuang

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了基于逻辑回归信用卡欺诈检测相关的知识,希望对你有一定的参考价值。

文件读取

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
#由于数据太多,只读取前1000行
data = pd.read_csv("creditcard.csv",nrows=1000)
data.head()

数据预处理

缺失值、异常值的处理、删除多余列

#判断是否有缺失值
data.isnull()
#如果有缺失值,是否进行填补
data.fillna(method=ffill)
#判断是否有异常值,可以采用箱型图
data[[V1,V2,V3]].boxplot()
#对于异常值可以删除,或则修改,根据情况而做出判断
#由于time这一列不起作用,可以直接删除,还有种方法是做pca来选择最佳的列,利用pca
data.drop([‘time‘])

技术图片

 

特征标准化

主要解决特征因为数值差距过大而导致迭代次数过慢或则结果

#对Amount的那一列数进行特征归一化
from sklearn.preprocessing import StandardScaler
data[normAmount] = StandardScaler().fit_transform(data[Amount].values.reshape(-1, 1))#因为fiy_transform(data),data:2D数据

技术图片技术图片

 

 

数据分析阶段

#查看标签的数量
count_classes = pd.value_counts(data[Class], sort = True)
#可以做条形图进行统计

技术图片

 

 

 

 采样方法

因为当数据不足的时候需要采取上采样或者下采样,因为从上图分析可知,类别相差很多,为了预测的正确性,采用过采样的方法,增加类别的样本数量

X = data.loc[:, data.columns != Class]
y = data.loc[:, data.columns == Class]

# Number of data points in the minority class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select "x" number (number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)
print("random_normal_indices:",random_normal_indices)


# # Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# # Under sample dataset
under_sample_data = data.iloc[under_sample_indices,:]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != Class]
y_undersample = under_sample_data.loc[:, under_sample_data.columns == Class]

# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))
random_normal_indices: [979 686]
Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resampled data:  4
建立预测模型
from sklearn.model_selection import train_test_split

# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample
                                                                                                   ,y_undersample
                                                                                                   ,test_size = 0.3
                                                                                                   ,random_state = 0)
lr = LogisticRegression(C = best_c, penalty = l1)
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

 



以上是关于基于逻辑回归信用卡欺诈检测的主要内容,如果未能解决你的问题,请参考以下文章

机器学习实战分享:用 Python 进行信用卡欺诈检测

机器学习总结四:逻辑回归与反欺诈检测案例

学习笔记Flink—— 基于Flink 在线交易反欺诈检测

学习笔记Flink—— 基于Flink 在线交易反欺诈检测

学习笔记Flink—— 基于Flink 在线交易反欺诈检测

学习笔记Flink—— 基于Flink 在线交易反欺诈检测