基于逻辑回归信用卡欺诈检测

Posted 2020-12-14 jiaxinhuang

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了基于逻辑回归信用卡欺诈检测相关的知识，希望对你有一定的参考价值。

文件读取

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
#由于数据太多，只读取前1000行
data = pd.read_csv("creditcard.csv",nrows=1000)
data.head()

数据预处理

缺失值、异常值的处理、删除多余列

#判断是否有缺失值
data.isnull()
#如果有缺失值，是否进行填补
data.fillna(method=‘ffill‘)
#判断是否有异常值，可以采用箱型图
data[[‘V1‘,‘V2‘,‘V3‘]].boxplot()
#对于异常值可以删除，或则修改，根据情况而做出判断
#由于time这一列不起作用，可以直接删除，还有种方法是做pca来选择最佳的列，利用pca
data.drop([‘time‘])

技术图片

特征标准化

主要解决特征因为数值差距过大而导致迭代次数过慢或则结果

#对Amount的那一列数进行特征归一化
from sklearn.preprocessing import StandardScaler
data[‘normAmount‘] = StandardScaler().fit_transform(data[‘Amount‘].values.reshape(-1, 1))#因为fiy_transform(data),data:2D数据

技术图片

数据分析阶段

#查看标签的数量
count_classes = pd.value_counts(data[‘Class‘], sort = True)
#可以做条形图进行统计

技术图片

采样方法

因为当数据不足的时候需要采取上采样或者下采样，因为从上图分析可知，类别相差很多，为了预测的正确性，采用过采样的方法，增加类别的样本数量

X = data.loc[:, data.columns != ‘Class‘]
y = data.loc[:, data.columns == ‘Class‘]

# Number of data points in the minority class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select "x" number (number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)
print("random_normal_indices:",random_normal_indices)


# # Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# # Under sample dataset
under_sample_data = data.iloc[under_sample_indices,:]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != ‘Class‘]
y_undersample = under_sample_data.loc[:, under_sample_data.columns == ‘Class‘]

# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

random_normal_indices: [979 686]
Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resampled data:  4
建立预测模型

from sklearn.model_selection import train_test_split

# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample
                                                                                                   ,y_undersample
                                                                                                   ,test_size = 0.3
                                                                                                   ,random_state = 0)
lr = LogisticRegression(C = best_c, penalty = ‘l1‘)
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

以上是关于基于逻辑回归信用卡欺诈检测的主要内容，如果未能解决你的问题，请参考以下文章