141 Natural Language Processing Handbook: Review Analysis of Takeout Orders

Posted 2023-05-10


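As food-delivery platforms have matured and logistics has developed rapidly, ordering takeout has become an everyday activity for most people in China. Although customers and restaurants have no direct contact, the platforms' real-time review and rating mechanism is an effective check on restaurants and protects customers' interests. These reviews also give restaurants and platforms a large body of text data that, used well, is a valuable resource. For example, by analyzing reviews a restaurant can learn the taste preferences of users in a given area and what positive and negative reviews focus on, and adjust its menu promptly. A platform, by analyzing reviews at scale, can study the eating habits of people of different ages, regions, and occupations from a macro perspective and feed that information into business decisions.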

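This experiment uses a batch of takeout order reviews, labeled as positive or negative. The article has two main tasks:
- Analyze the review text and mine useful information from it.
- Train an automatic classifier on the data that labels a review as positive or negative.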

First, read the text and get familiar with the data format.

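Next, run a series of analyses on the text, including word clouds for all reviews and for the positive and negative samples separately, and high-frequency word statistics for each class, to get a high-level picture of the data.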

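Word cloud built from all reviews; the larger a word is drawn, the higher its frequency:

Intuitively, the reviews focus mainly on taste and delivery speed, followed by service attitude, portion size, packaging, and price. The most informative keywords are mostly verbs, adjectives, and nouns, though there are also many meaningless high-frequency words such as “的” and “了”.
Word cloud built from all positive samples: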

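Broadly, positive reviews express things like “the food tastes good”, “delivery was fast”, “thanks to the courier”, and “good service attitude”.
Word cloud built from all negative samples:

Broadly, negative reviews express two things: “the food tastes bad” and “delivery was slow”. They also contain many negation and hedging words, such as “还是”, “一点”, “不是”, “不”, and “没有”. One interesting observation is that “好吃” (tasty) is a high-frequency word in both positive and negative reviews, so we filter the negative reviews that contain “好吃” to see why.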
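
The filtering code itself is not reproduced in this copy of the article. A minimal sketch of the idea, assuming the reviews are held as `(label, text)` pairs with label 0 for negative reviews; the sample sentences and the variable names are invented for illustration:

```python
# Hypothetical sample data: (label, text), where 1 = positive and 0 = negative.
reviews = [
    (1, "很好吃，送餐很快"),
    (0, "没有以前好吃"),
    (0, "送餐太慢了"),
    (0, "不太好吃，份量也少"),
]

# Keep only the negative reviews whose text contains the word "好吃".
negative_with_haochi = [
    text for label, text in reviews if label == 0 and "好吃" in text
]

for text in negative_with_haochi:
    print(text)
```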

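Looking at some of these texts shows that “好吃” usually appears together with a negation word, which turns it into a negative statement, as in “没有以前好吃” (not as tasty as before), “不好吃” (not tasty), “不太好吃” (not very tasty), and “不是特别好吃” (not particularly tasty).
With this intuition in place, we next analyze the vocabulary quantitatively: count word frequencies separately for positive and negative reviews and look at the characteristics of high- and low-frequency words in each class, to guide further text preprocessing.
Word frequencies for the positive samples:

Word frequencies for the negative samples: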
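
The counting code is missing from this copy. One way to sketch it with the standard library, assuming each review has already been segmented into a list of tokens (the article does not say which segmenter was used, and the token lists below are made up):

```python
from collections import Counter

# Hypothetical, already-segmented reviews for each class.
positive_tokens = [["味道", "很", "好吃", "，"], ["送餐", "很", "快", "，", "好吃"]]
negative_tokens = [["送餐", "太", "慢", "了"], ["不", "好吃", "，", "太", "慢"]]

def word_frequencies(tokenized_reviews):
    """Count how often each token occurs across a list of tokenized reviews."""
    counter = Counter()
    for tokens in tokenized_reviews:
        counter.update(tokens)
    return counter

positive_freq = word_frequencies(positive_tokens)
negative_freq = word_frequencies(negative_tokens)

# most_common(n) lists the n highest-frequency tokens with their counts.
print(positive_freq.most_common(3))
print(negative_freq.most_common(3))
```

Reading the tail of `most_common()` gives the low-frequency words in the same way.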

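From the printed high- and low-frequency words we can draw two conclusions:
- The high-frequency words of the positive and negative samples share some tokens, such as “了”, “的”, “也”, and some punctuation marks. In theory these tokens have no discriminative power and can be removed as stop words.
- Most low-frequency words appear, intuitively, to have no direct connection with positive or negative sentiment, so in theory every word that occurs only once can also be removed as a stop word.

Based on this reasoning, we build a stop-word list.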
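
The two rules above could be turned into a stop-word list roughly as follows. This is only a sketch: the counts are invented, and `top_n` (how many of the most frequent tokens count as “high-frequency”) is an assumed parameter that the article does not specify.

```python
from collections import Counter

# Hypothetical token counts per class (normally the output of the counting step).
positive_freq = Counter({"的": 50, "了": 40, "好吃": 30, "，": 60, "快": 20, "暖心": 1})
negative_freq = Counter({"的": 45, "了": 55, "慢": 25, "，": 70, "难吃": 22, "凉透": 1})

def build_stopwords(pos_freq, neg_freq, top_n=3):
    """Stop words = tokens in the top_n of BOTH classes, plus tokens seen exactly once overall."""
    top_pos = {word for word, _ in pos_freq.most_common(top_n)}
    top_neg = {word for word, _ in neg_freq.most_common(top_n)}
    shared_high = top_pos & top_neg          # rule 1: frequent in both classes
    total = pos_freq + neg_freq
    singletons = {word for word, count in total.items() if count == 1}  # rule 2
    return shared_high | singletons

stopwords = build_stopwords(positive_freq, negative_freq)
print(sorted(stopwords))
```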

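Next we try several preprocessing variants of the data:
- the original text
- keeping only the Chinese characters in the text
- removing stop words
- keeping only words with certain parts of speech

Segment the original text:

For each review, remove non-Chinese characters and segment it: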
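
A minimal sketch of the character-filtering half of this step, using a regular expression over the common CJK range `\u4e00-\u9fa5`. Word segmentation is not shown, since it needs a third-party segmenter and the article does not name the one it used; `keep_chinese` is a hypothetical helper name.

```python
import re

# Matches every run of characters outside the common CJK Unified Ideographs range.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")

def keep_chinese(text):
    """Drop digits, Latin letters, punctuation and whitespace, keeping Chinese characters."""
    return NON_CHINESE.sub("", text)

cleaned = keep_chinese("味道不错！！！送餐30分钟，nice~")
print(cleaned)  # 味道不错送餐分钟
```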

Filter out stop words:
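
Given a stop-word set and a segmented review, the filter can be a single list comprehension. The set and tokens below are illustrative, not taken from the original data:

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stop-word set, preserving their order."""
    return [token for token in tokens if token not in stopwords]

# Hypothetical stop-word set and segmented review.
stopwords = {"的", "了", "，"}
tokens = ["送餐", "的", "速度", "太", "慢", "了"]

filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

Storing the stop words in a `set` keeps each membership check constant-time, which matters once the list holds every word that occurs only once.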

Keep only words with certain parts of speech:

All the resulting datasets:

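With the preprocessing variants ready, we call models from the machine learning library scikit-learn to train classifiers:
- Train the same model on each preprocessed variant to see which preprocessing works best.
- Fix the preprocessed data and try several models to find the one that performs best.

First, because the labels in the raw data appear in a fixed order, shuffle the data randomly: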
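
The important detail when shuffling is that texts and labels must be permuted together. A standard-library sketch under that assumption; the sample data and the fixed seed are illustrative choices, not the article's:

```python
import random

# Hypothetical data in its original order: all positives first, then all negatives.
texts = ["好吃", "很快", "满意", "难吃", "太慢", "差评"]
labels = [1, 1, 1, 0, 0, 0]

def shuffle_together(texts, labels, seed=42):
    """Shuffle texts and labels with the same permutation so the pairs stay aligned."""
    paired = list(zip(texts, labels))
    random.Random(seed).shuffle(paired)   # a seeded generator makes the order reproducible
    shuffled_texts, shuffled_labels = zip(*paired)
    return list(shuffled_texts), list(shuffled_labels)

shuffled_texts, shuffled_labels = shuffle_together(texts, labels)
print(list(zip(shuffled_texts, shuffled_labels)))
```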

Convert the text to TF-IDF form:
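
To make the representation concrete, here is a hand-rolled sketch of the textbook formula, tf × log(N / df), on already-segmented documents. It is for illustration only: scikit-learn's `TfidfVectorizer` applies IDF smoothing and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical segmented reviews.
docs = [["好吃", "很", "快"], ["难吃", "很", "慢"]]
weights = tf_idf(docs)
print(weights[0])
```

A term such as “很” that appears in every document gets weight 0, which is the same intuition as the stop-word argument earlier: words shared by all reviews carry no discriminative information.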

Split the data into a training set and a test set:
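
A simple hold-out split can be sketched as below, assuming the data has already been shuffled. The 20% test ratio and the function name are assumptions; the article does not state the ratio it used.

```python
def train_test_split_simple(features, labels, test_ratio=0.2):
    """Split already-shuffled data: the last test_ratio share becomes the test set."""
    n_test = int(len(features) * test_ratio)
    split_at = len(features) - n_test
    return (features[:split_at], features[split_at:],
            labels[:split_at], labels[split_at:])

# Hypothetical feature vectors and labels.
features = [[0.1], [0.4], [0.3], [0.9], [0.7]]
labels = [0, 1, 0, 1, 1]

x_train, x_test, y_train, y_test = train_test_split_simple(features, labels)
print(len(x_train), len(x_test))
```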

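Train the model and print the test results:

The main function that ties the steps above together:

Each data variant is trained with logistic regression for comparison: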

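Here is a brief introduction to the evaluation metrics in the results above. Start with the following table (called a confusion matrix):

Note that “positive example” here does not mean a positive review, and “negative example” does not mean a negative review; the table is defined separately for each class. Our task has two classes, positive reviews and negative reviews. Each class has its own confusion matrix, and therefore its own set of metrics.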

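Depending on how the predicted value compares with the true value, each classification result falls into one of four cases:
- TP (true positives): samples that are actually positive examples and are classified as positive. For instance, when the class in question is “positive review”, a sample that really is a positive review and is predicted as one.
- FN (false negatives): samples that are actually positive examples but are classified as negative.
- FP (false positives): samples that are actually negative examples but are classified as positive.
- TN (true negatives): samples that are actually negative examples and are classified as negative.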
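
The four counts, and the usual metrics derived from them, can be computed per class as in the following sketch. The labels are invented; the formulas are the standard ones: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean.

```python
def confusion_counts(y_true, y_pred, target):
    """Count TP, FN, FP, TN while treating `target` as the positive class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == target and pred == target:
            tp += 1
        elif truth == target:
            fn += 1      # a real member of the class that was missed
        elif pred == target:
            fp += 1      # wrongly assigned to the class
        else:
            tn += 1
    return tp, fn, fp, tn

def precision_recall_f1(tp, fn, fp):
    """Return (precision, recall, F1); each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Each class gets its own confusion matrix, as described above.
for target in (1, 0):
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, target)
    print(target, (tp, fn, fp, tn), precision_recall_f1(tp, fn, fp))
```

Accuracy, by contrast, is a single number for the whole test set: the share of samples whose prediction matches the true label.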

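Note that accuracy is usually translated as 准确率 or 正确率, and precision as 精确率, 精准率, or 查准率. These Chinese terms are easy to confuse, so it is best to stick with the English terms, which are easier to keep apart.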

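Finally, Micro-F1 and Macro-F1 summarize performance across all classes:
- Micro-F1: the micro avg row in the results above. First compute the overall precision and recall pooled over all classes, then compute F1 from them.
- Macro-F1: the macro avg row in the results above. Compute precision and recall for each class, compute each class's F1, then average the F1 values.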
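
The difference between the two averages is easiest to see on imbalanced data. A self-contained sketch with invented labels; it uses F1 = 2·TP / (2·TP + FP + FN), which is algebraically the same as the harmonic mean of precision and recall:

```python
def f1_from_counts(tp, fp, fn):
    """F1 computed from raw counts; returns 0.0 when undefined."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def per_class_counts(y_true, y_pred, classes):
    """Return {class: (tp, fp, fn)}, treating each class in turn as the positive one."""
    counts = {}
    for target in classes:
        tp = sum(t == target and p == target for t, p in zip(y_true, y_pred))
        fp = sum(t != target and p == target for t, p in zip(y_true, y_pred))
        fn = sum(t == target and p != target for t, p in zip(y_true, y_pred))
        counts[target] = (tp, fp, fn)
    return counts

def micro_macro_f1(y_true, y_pred, classes):
    counts = per_class_counts(y_true, y_pred, classes)
    # Micro: pool the counts of all classes, then compute a single F1.
    total_tp = sum(tp for tp, _, _ in counts.values())
    total_fp = sum(fp for _, fp, _ in counts.values())
    total_fn = sum(fn for _, _, fn in counts.values())
    micro = f1_from_counts(total_tp, total_fp, total_fn)
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    return micro, macro

# Hypothetical imbalanced labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0]

micro_f1, macro_f1 = micro_macro_f1(y_true, y_pred, classes=(0, 1))
print(micro_f1, macro_f1)
```

When every sample has exactly one label, as here, Micro-F1 equals accuracy (7 of 8 correct, so 0.875). Macro-F1 comes out lower because the weak F1 on the small negative class counts as much as the strong F1 on the large positive class.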

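Beyond these metrics there are also the ROC curve, AUC, and others. With so many metrics, which one should be trusted, and how should they be weighed together? If we care about the classifier's overall discriminative ability, accuracy or F1 can be the main metric. If we want to find as many negative reviews as possible, recall on the negative class matters more. If we want samples predicted as negative to be wrong as rarely as possible, precision on the negative class matters more. The choice depends on the scenario and on what the task prioritizes; there is no single answer.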

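The results above show that keeping only Chinese characters and removing stop words both give slightly better results, while keeping only words with certain parts of speech makes things worse. The lesson is to avoid blind preprocessing based on subjective reasoning; whether a preprocessing step helps has to be confirmed by comparing results. Next, we use the Chinese-only text as training data and try several models:
- support vector machine
- naive Bayes
- GBDT
- perceptron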

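The results above show that the support vector machine is slightly better than logistic regression, while the other models perform worse. Is there room for further optimization? Yes. Readers can try further improvements in these directions:
- data augmentation
- changing how the data is represented
- tuning the machine learning model's hyperparameters
- deep learning models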

Algorithm Handbook practice notes 2: array sorting with bubble sort and selection sort

Updated continuously.

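Contents

Algorithm Handbook practice notes 2: array sorting with bubble sort and selection sort
Bubble sort problems [#](https://algo.itcharge.cn/01.Array/02.Array-Sort/11.Array-Sort-List/#冒泡排序题目)
Selection sort problems [#](https://algo.itcharge.cn/01.Array/02.Array-Sort/11.Array-Sort-List/#选择排序题目)
- 0215 Kth Largest Element in an Array
Filling in the basics 🙅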

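Bubble sort problems

| No. | Title | Solution | Tags | Difficulty |
| --- | --- | --- | --- | --- |
| 剑指 Offer 45 | Arrange an Array into the Smallest Number (把数组排成最小的数) | Python | greedy, string, sorting | Medium |
| 0283 | Move Zeroes (移动零) | Python | array, two pointers | Easy |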

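剑指 Offer 45: Arrange an Array into the Smallest Number

I spent fifteen minutes on it and couldn't get it working. I had plenty of ideas (they're written in the comments below).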

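from typing import List

class Solution:
    def minNumber(self, nums: List[int]) -> str:
        # Strings can be compared too
        # This problem needs a return value
        # Should leading zeros be stripped? Oh, the problem says there's no need
        # This involves permutations
        # Does bubble sort really need a list? Compare each item as it arrives, with just a temp variable
        # How to concatenate? Converting to strings first is surely better
        num2strs = []
        for i in range(len(nums)):
            s = str(nums[i])
            num2strs.append(s)
        # Write a loop over all permutations?
        # That takes factorial time, which is probably a key point of this problem... definitely not the way to write it
        
        """for i in range(len(nums)):
            for j in range(i+1,len(nums)):"""
        # The above is brute-force enumeration. Can I prune instead: compare first, then sort?
        # Digits in high positions should be small, digits in low positions should be large
        # Start with each number's leading digit. In example 2, compare 3 3 3 5 9 first, which gives a rough order: 5 and 9 go toward the back
        # In example 1, comparing 1 and 2 settles it immediately
        # Use bubble sort for the comparisons
        # After the first digit, compare the second (next-highest) digit
        # The hard case is something like 3, 30, 34: different lengths but a shared prefix
        # I've got it: pad 3 to 30, put 30 before 3 and 34 after 3. Padding 3 with 0 shows 34 must come after 3, because 1..9 are all greater than 0, and 30 has to go before 3. Likewise every positive digit beats 0: between 330 and 303, the smaller one puts the 0 earlier
        # No no no, scrap that idea. I should split off the 3 and the 4 and compare them
        # A counterexample: 54 and 5. Since 5 is greater than 4, 554 is greater than 545, and we want the smaller number, so 54 should go before 5
        # Suddenly a bold idea: split every number into single digits
        # Then example 1 becomes [1,0,2]
        # Ah, no no no, that doesn't work either
        # More examples. Take 54 and 5491: 545491 is clearly smaller than 549154. So what should be compared? Whichever is smaller at the join point. There are only two candidates for the join point: the first digits of the string left after removing the common prefix (91 here), and the common prefix itself
        # Another example: 54 and 5411. Since 1 is smaller than 5, 541154 is smaller than 545411
        # nice, nice, nice
        # Now I've really found the pattern!
        # Keep bubble-sorting on the prefixes!
        
        
        
        return ''.join(num2strs)  # unfinished: the sorting step was never written, so this keeps the input order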

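Time to read the solutions…

I didn't solve it, but my analysis did get close 😎

The solution from the Algorithm Handbook

orz, the expert's code is so concise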

import functools

class Solution:
    def minNumber(self, nums: List[int]) -> str:
        def cmp(a, b):
            if a + b == b + a:
                return 0
            elif a + b > b + a:
                return 1
            else:
                return -1

        nums_s = list(map(str, nums))
        nums_s.sort(key=functools.cmp_to_key(cmp))
        return ''.join(nums_s)

The method first builds nums_s with list(map(str, nums)), which converts each integer in nums to a string. It then sorts nums_s in place, passing functools.cmp_to_key(cmp) as the key so that any two elements a and b are ordered by comparing the concatenations a + b and b + a. Finally, the method joins all the elements of nums_s into a single string and returns it.

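Now for a solution from the LeetCode comments section

Reference link: https://leetcode.cn/problems/ba-shu-zu-pai-cheng-zui-xiao-de-shu-lcof/solution/mian-shi-ti-45-ba-shu-zu-pai-cheng-zui-xiao-de-s-4/

Source: LeetCode (力扣)

Let x and y be the string forms of any two numbers in the array nums. The ordering rule is:

If the concatenation x + y > y + x, then x is “greater than” y.
Conversely, if x + y < y + x, then x is “less than” y.

Here, x “less than” y means that putting x in front of y gives a smaller number than putting y in front of x, so after sorting, x should be to the left of y in the array.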
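
One way to build confidence in this rule, short of reading the proofs, is to check it against brute force on small inputs. The sketch below is my own test harness, not part of the referenced solution; `concat_cmp`, `min_number`, and `min_number_brute_force` are names made up for it.

```python
import functools
from itertools import permutations

def concat_cmp(x, y):
    """Negative if x should come before y, i.e. if x + y < y + x as strings."""
    if x + y < y + x:
        return -1
    if x + y > y + x:
        return 1
    return 0

def min_number(nums):
    """Sort the string forms with the concatenation rule, then join them."""
    strs = sorted(map(str, nums), key=functools.cmp_to_key(concat_cmp))
    return "".join(strs)

def min_number_brute_force(nums):
    """Try every ordering; only feasible for very short inputs."""
    strs = [str(n) for n in nums]
    # Every ordering has the same total length, so the smallest string
    # is also the smallest number.
    return min("".join(p) for p in permutations(strs))

for nums in ([10, 2], [3, 30, 34, 5, 9], [54, 5, 5411]):
    assert min_number(nums) == min_number_brute_force(nums)
    print(nums, "->", min_number(nums))
```

Comparing x + y with y + x as strings is safe because both concatenations have the same length, so lexicographic order and numeric order agree.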

The first method the solution author gives uses quicksort (I've forgotten how quicksort works, so I couldn't follow it 🤐)

class Solution:
    def minNumber(self, nums: List[int]) -> str:
        def quick_sort(l , r):
            if l >= r: return
            i, j = l, r
            while i < j:
                while strs[j] + strs[l] >= strs[l] + strs[j] and i < j: j -= 1
                while strs[i] + strs[l] <= strs[l] + strs[i] and i < j: i += 1
                strs[i], strs[j] = strs[j], strs[i]
            strs[i], strs[l] = strs[l], strs[i]
            quick_sort(l, i - 1)
            quick_sort(i + 1, r)
        
        strs = [str(num) for num in nums]
        quick_sort(0, len(strs) - 1)
        return ''.join(strs)

The second uses the built-in sort

import functools
from typing import List

class Solution:
    def minNumber(self, nums: List[int]) -> str:
        def sort_rule(x, y):
            a, b = x + y, y + x
            if a > b: return 1
            elif a < b: return -1
            else: return 0
        
        strs = [str(num) for num in nums]
        strs.sort(key = functools.cmp_to_key(sort_rule))
        return ''.join(strs)

This is essentially the same approach as the Algorithm Handbook author's solution. The hardest part to come up with is the cmp rule itself. So how would anyone think of it? orz
- The comments section has two mathematical proofs, orz
  - https://leetcode.cn/problems/ba-shu-zu-pai-cheng-zui-xiao-de-shu-lcof/solution/mian-shi-ti-45-ba-shu-zu-pai-cheng-zui-xiao-de-s-4/378349
  - https://leetcode.cn/problems/ba-shu-zu-pai-cheng-zui-xiao-de-shu-lcof/solution/mian-shi-ti-45-ba-shu-zu-pai-cheng-zui-xiao-de-s-4/378553

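Writing it again myself

This problem is filed under bubble sort, but the author used the built-in sort, haha
- So how would you do it with bubble sort?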
```
class Solution:
    def minNumber(self, nums: List[int]) -> str:
        nums_s = [str(num) for num in nums]
        n = len(nums_s)
        for i in range(n):
            for j in range(0, n-i-1):
                if nums_s[j] + nums_s[j+1] > nums_s[j+1] + nums_s[j]:
                    nums_s[j], nums_s[j+1] = nums_s[j+1], nums_s[j]
        return ''.join(nums_s)
```
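  The key is simply to adapt the ordering rule; the basic bubble sort I learned before had no custom rule.
  
  The basic idea of bubble sort is to compare adjacent elements and swap them if the first is greater than the second, repeating until the whole list is sorted.
  
  My implementation keeps bubble sort's compare-and-swap operations but changes the comparison: it compares the two concatenations of a pair of numbers rather than the numbers themselves. This ensures that the number that makes the concatenation smaller ends up in front.
  
  In essence, bubble sort repeatedly traverses the sequence, comparing and swapping adjacent elements. I built a custom comparison on top of that idea to solve the problem.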

Let's optimize that bubble sort!

I introduced a flag called ‘exchanges’ that records whether any elements were swapped during a pass through the list. If a pass finishes with no swaps, the list is already sorted and the algorithm can stop, which saves the remaining passes. The check comes right after the inner loop: if exchanges is still False, break out of the outer loop.

class Solution:
    def minNumber(self, nums: List[int]) -> str:
        nums_s = [str(num) for num in nums]
        n = len(nums_s)
        for i in range(n):
            exchanges = False
            for j in range(0, n-i-1):
                if nums_s[j] + nums_s[j+1] > nums_s[j+1] + nums_s[j]:
                    exchanges = True
                    nums_s[j], nums_s[j+1] = nums_s[j+1], nums_s[j]
            if not exchanges:
                break
        return ''.join(nums_s)

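I've seen a lot of fancier techniques, so let's try lambda expressions and the like

Reference: 面试代码题记录3-leetcode179-最大数 (interview coding notes 3, LeetCode 179, Largest Number)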
```
import functools
from typing import List

class Solution:
    def minNumber(self, nums: List[int]) -> str:
        strs = list(map(str, nums))
        strs.sort(key = functools.cmp_to_key(lambda x,y: int(x+y) - int(y+x)))
        return ''.join(strs)
```
This code solves the problem with the built-in sort() method. It sorts the list of strings with a custom comparison based on the numerical value of the two possible concatenations of each pair.

sort() itself only accepts a one-argument ‘key’ function, so the two-argument lambda lambda x,y: int(x+y) - int(y+x) is wrapped with functools.cmp_to_key. The lambda converts both concatenations to integers and subtracts the second from the first, so a negative result means x should come before y.
After sorting, for any two adjacent strings x and y, x + y is no larger than y + x.
The sorted strings are then joined into one string, which is returned as the final output.
This is a more concise and Pythonic way to solve the problem.

What would the equivalent if/else version of this lambda look like? It's just the rule function from before

The equivalent of the lambda function lambda x,y: int(x+y) - int(y+x) in an if-else statement would be:
```
def compare(x, y):
    if int(x+y) > int(y+x):
        return 1
    elif int(x+y) < int(y+x):
        return -1
    else:
        return 0
```
This function takes two strings x and y and compares their two concatenations after converting them to integers. It returns 1 if x + y is greater than y + x, -1 if it is less, and 0 if they are equal. Wrapped in functools.cmp_to_key, it can replace the lambda in the call to sort():
```
strs.sort(key = functools.cmp_to_key(compare))
```
It will work in the same way.

So how does the sign of the lambda's return value affect sort()?

The built-in sort() method in Python 3 does not take a comparison function directly. Its ‘key’ argument expects a one-argument function that returns a comparison key for each element.

functools.cmp_to_key bridges the two: it wraps a two-argument comparison function in a key whose ordering is decided by calling that function on the two original elements.

In this case the comparison function is lambda x,y: int(x+y) - int(y+x), which converts the two concatenations to integers and subtracts the second from the first.
- If it returns a positive value, x + y is greater than y + x, so y is considered smaller and is placed before x.
- If it returns a negative value, x + y is less than y + x, so x is considered smaller and is placed before y.
- If it returns 0, the two concatenations are equal and the elements are treated as equal; because sort() is stable, their relative order is unchanged.
So the sign of the return value decides which of the two strings comes first: the one that gives the smaller number when placed in front.
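
A small experiment makes the sign convention concrete. This is an illustrative sketch; `descending` and `by_concatenation` are names I made up.

```python
import functools

def descending(a, b):
    """Negative when a should come first; here, when a is the larger number."""
    return b - a

numbers = [3, 1, 4, 1, 5]
descending_result = sorted(numbers, key=functools.cmp_to_key(descending))
print(descending_result)  # [5, 4, 3, 1, 1]

def by_concatenation(x, y):
    """Negative when putting x in front of y gives the smaller number."""
    return int(x + y) - int(y + x)

pieces = ["3", "30", "34"]
concat_result = sorted(pieces, key=functools.cmp_to_key(by_concatenation))
print(concat_result)  # ['30', '3', '34']
```

Only the sign of the return value matters, not its magnitude, which is why returning the difference works just as well as returning -1, 0, or 1.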

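0283 Move Zeroes

Selection sort problems

| No. | Title | Solution | Tags | Difficulty |
| --- | --- | --- | --- | --- |
| 0215 | Kth Largest Element in an Array (数组中的第K个最大元素) | Python | array, heap sort | Medium |

0215 Kth Largest Element in an Array

Filling in the basics 🙅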

map

The map() function in Python is used to apply a given function to each item in an iterable (e.g. list, tuple, etc.) and returns an iterator that produces the modified items.

Back to list(map(str, nums)) from 剑指 Offer 45 above:
- map(str, nums) applies the built-in str() function to each item in the nums list, which converts each integer to a string.
- The list() function is then used to create a new list from the iterator returned by map().
Here is an example that uses map() to square every element of a list:
```
numbers = [1,2,3,4,5]
squared_numbers = list(map(lambda x: x**2, numbers))
print(squared_numbers)
```
This will output: [1, 4, 9, 16, 25]

In this example, map(lambda x: x**2, numbers) applies a lambda function to each item in the numbers list, which squares the number, and returns an iterator. The list() function is then used to create a new list from the iterator.
In general, the syntax of the map() function is:
```
map(function, iterable)
```

functools

Again, let's learn this through the 剑指 Offer 45 solution from the Algorithm Handbook

import functools

class Solution:
    def minNumber(self, nums: List[int]) -> str:
        def cmp(a, b):
            if a + b == b + a:
                return 0
            elif a + b > b + a:
                return 1
            else:
                return -1

        nums_s = list(map(str, nums))
        nums_s.sort(key=functools.cmp_to_key(cmp))
        # the `''.join(nums_s)` returns a concatenation of all elements of `nums_s` into a single string.
        return ''.join(nums_s)

functools is a module in the Python standard library that provides higher-order functions and other tools for working with callables. One of them is cmp_to_key(), which converts an old-style comparison function into a key function.

nums_s.sort(key=functools.cmp_to_key(cmp)) uses the built-in sort() method of lists, which sorts nums_s in place in ascending order. The key argument normally specifies a function that extracts a comparison key from each element. Here functools.cmp_to_key(cmp) is passed as the key, which converts the cmp function defined inside the method into a key function that sort() can use.
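
To see what “converts a cmp function to a key function” means in practice, it helps to call the result by hand. In this sketch `make_key`, `key_3`, and `key_30` are illustrative names; `cmp` follows the same rule as in the solution above.

```python
import functools

def cmp(a, b):
    if a + b == b + a:
        return 0
    return 1 if a + b > b + a else -1

# cmp_to_key returns a callable; calling it on an element yields a wrapper object.
make_key = functools.cmp_to_key(cmp)
key_3, key_30 = make_key("3"), make_key("30")

# Comparing two wrappers calls cmp on the wrapped values.
print(key_30 < key_3)            # True, because cmp("30", "3") returns -1
print(key_3 == make_key("3"))    # True, because cmp("3", "3") returns 0

# sort() and sorted() compare these wrappers instead of the raw strings.
ordered = sorted(["3", "30", "34"], key=make_key)
print(ordered)                   # ['30', '3', '34']
```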

Read a couple of blog posts to study lambda expressions properly

Hmm, I skipped this one 🕊. Saving the link here to read later:
- https://blogboard.io/blog/knowledge/python-sorted-lambda/