python difflib文本比较利器,入手不亏
Posted Coderusher
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了python difflib文本比较利器,入手不亏相关的知识,希望对你有一定的参考价值。
@[toc]
一、引言
difflib模块:是提供的类和方法用来进行序列的差异化比较,它能够比对文件并生成差异结果文本或者html格式的差异化比较页面。其作为 python 的内置库,有着强大的文本比对功能,此篇介绍五个常用的 difflib 功能
二、正文
1. 字符串比较
1.1 计算原理
1.2 参数介绍
1.3 举个栗子
- 实例1
>>> s = SequenceMatcher(None, "abcd", "bcde") >>> s.ratio() 0.75 >>> s.quick_ratio() 0.75
如果想要去除掉多余的字符再进行比较
- 实例2
>>> s = SequenceMatcher(lambda x: x in "|\\", "abcd|", "dc\\fa") # 去除两个字符中的 ( | ) 以及 ( \\ ) 符号后比较 >>> s.ratio() 0.75 >>> s.quick_ratio() 0.75
1.4 相关函数性能比较
函数 | 计算速度 | 内存开销 |
---|---|---|
ratio() | 快 | 大 |
quick_ratio() | 慢 | 小 |
- 论证过程
将相似度比对过程遍历100000遍得到计算速度与内存占用上的差异
# 导入第三方库
import os
import psutil
import time
def show_info():
pid = os.getpid()
#模块名比较容易理解:获得当前进程的pid
p = psutil.Process(pid)
#根据pid找到进程,进而找到占用的内存值
info = p.memory_full_info()
memory = info.uss/1024/1024
return memory
def func(ratio_func):
start_time = time.time() # 记录起始时间
initial_memory = show_info() # 记录起始内存
if ratio_func == "ratio":
ratio = [similarity.ratio() for i in range(1000000)]
else:
ratio = [similarity.quick_ratio() for i in range(1000000)]
final_memory = show_info() # 记录终止内存
end_time = time.time() # 记录终止时间
print(f"耗时:end_time-start_times")
print(f内存占用:final_memory-initial_memory:.2fMB)
if __name__ == __main__:
similarity = difflib.SequenceMatcher(None, 需要比对的字符1, 需要比对的字符2)
func("ratio")
func("quick_ratio")
- 输出结果
>>> func("ratio") 耗时:0.9709699153900146s 内存占用:36.58MB >>> func("quick_ratio") 耗时:2.730135917663574s 内存占用:32.68MB
2. 文本比较
2.1 相关符号与含义
符号 | 含义 |
---|---|
- | 仅在片段1中存在 |
+ | 仅在片段2中存在 |
(空格) | 片段1和2中都存在 |
? | 下标显示 |
^ | 存在差异字符 |
2.2 Differ
text1 =
- Beautiful is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
.splitlines(keepends=True)
text2 =
- Beautifu is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
.splitlines(keepends=True)
#以文本方式展示两个文本的不同:
d = difflib.Differ()
result = list(d.compare(text1, text2))
result = " ".join(result)
print(result)
- 结果展示
-
- Beautiful is better than ugly.
? ^
- Beautiful is better than ugly.
-
- Beautifu is better than ugly.
? ^- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
- Beautifu is better than ugly.
2.3 HtmlDiff
text1 =
- Beautiful is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
.splitlines(keepends=True)
text2 =
- Beautifu is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
.splitlines(keepends=True)
#以html方式展示两个文本的不同, 浏览器打开:
d = difflib.HtmlDiff()
with open("passwd.html", w) as f:
f.write(d.make_file(text1, text2))- 结果展示 ![image.png](https://s2.51cto.com/images/20220819/1660911791958174.png?x-oss-process=image/watermark,size_14,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)
2.4 context_diff
s1 = [bacon\\n, eggs\\n, ham\\n, guido\\n]
s2 = [python\\n, eggy\\n, hamster\\n, guido\\n]
for line in context_diff(s1, s2, fromfile=before.py, tofile=after.py):
sys.stdout.write(line)
对于字符串列表进行比较,可以看出只有第四个元素是相同的,每个元素会依次进行比较,而不是按照索引进行比较,假使s1 = [eggs\\n, ham\\n, guido\\n]为三个元素
- 结果展示
```python
*** before.py
--- after.py
***************
*** 1,4 ****
! bacon
! eggs
! ham
guido
--- 1,4 ----
! python
! eggy
! hamster
guido
2.5 get_close_matches
d=get_close_matches(appel, [ape, apple, peach, puppy])
print(d)
- 结果展示
```python
[apple, ape]
2.6 ndiff
diff = ndiff(one\\ntwo\\nthree\\n.splitlines(1),
ore\\ntree\\nemu\\n.splitlines(1))
print(.join(diff))
- 结果展示
```python
- one
? ^
+ ore
? ^
- two
- three
? -
+ tree
+ emu
2.7 restore
diff = ndiff(one\\ntwo\\nthree\\n.splitlines(1),
ore\\ntree\\nemu\\n.splitlines(1))
diff = list(diff) # materialize the generated delta into a list
print(.join(restore(diff, 1)))
- 结果展示
```python
one
two
three
3. 参考链接
以上是关于python difflib文本比较利器,入手不亏的主要内容,如果未能解决你的问题,请参考以下文章