Character Encoding and File Handling

Posted 2020-12-21 syy1757528181


Character encoding

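    Character encoding applies to text. It concerns text files only and has nothing to do with video or audio files.
    Editing text in a text editor involves two separate steps: input and output.
A person types characters that humans can read, but a computer only understands binary data such as 0110, so every typed character has to be converted to binary through a character encoding table.
    For a computer to handle a country's language, an encoding table covering that language's characters must exist.
    Encoding:
        converting human-readable text into binary data the computer can process
    Decoding:
        converting binary data back into human-readable text
    Mojibake (garbled text):
        The symptom is that characters are not displayed correctly; the cause is that the data was decoded with a different encoding than the one used to encode it.
        Decode with the same encoding that was used to encode, and the text will not be garbled.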
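
The mismatch described above is easy to reproduce. The sketch below encodes a sample string as UTF-8 and then decodes the same bytes twice, once with the wrong table (GBK) and once with the right one. The sample text and variable names are chosen only for illustration.

```python
# Encode with one table, decode with another: the classic cause of garbled text.
text = '上海'
data = text.encode('utf-8')            # str -> bytes (encoding)

# Wrong table: the UTF-8 bytes are reinterpreted as GBK byte pairs.
garbled = data.decode('gbk', errors='replace')
# Right table: the original text comes back.
restored = data.decode('utf-8')

print(garbled == text)    # False
print(restored == text)   # True
```

Passing errors='replace' keeps the demonstration from raising if a byte pair happens to be invalid in the wrong encoding.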

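    A character encoding table records the mapping between characters and numbers.
    The first character encoding table was ASCII, created in the United States. Each English character is stored in one byte (8 bits). Standard ASCII uses only 7 of those bits and defines 2**7 = 128 characters; a full 8-bit byte can hold 2**8 = 256 distinct values (0 to 255).
    Eight binary digits are 8 bits:
        8 bits = 1 byte
        1024 bytes = 1 KB
        1024 KB = 1 MB
        1024 MB = 1 GB
        1024 GB = 1 TB
        1024 TB = 1 PB

    GBK, created in China, represents a Chinese character with 2 bytes (16 bits), which allows at most 2**16 = 65536 combinations. English characters still take a single byte, so GBK stays compatible with ASCII.

    Unicode, the "universal code", was originally designed to represent every character with 2 bytes (16 bits, at most 2**16 = 65536 characters). The standard has since outgrown that limit, and its code space now extends to U+10FFFF.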
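
To make these sizes concrete, the following sketch asks Python how many bytes one character occupies in each table and shows the number Unicode assigns to a character. The sample characters and variable names are only illustrative.

```python
# Byte length of one character under different encoding tables.
ascii_len = len('a'.encode('ascii'))   # English letter in ASCII
gbk_en_len = len('a'.encode('gbk'))    # GBK keeps ASCII letters at 1 byte
gbk_zh_len = len('上'.encode('gbk'))   # a Chinese character in GBK

print(ascii_len, gbk_en_len, gbk_zh_len)   # 1 1 2

# Unicode maps each character to a number (its code point).
print(hex(ord('上')))                      # 0x4e0a
print(ord('a'))                            # 97, the same value as in ASCII
```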
    
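Two properties of Unicode:
    1. It covers the characters of all languages.
    2. The Unicode table maps to each national encoding, so text in any of those encodings can be converted to and from Unicode. This was the goal of its design. (UTF-8 is the storage format derived from it.)

Problems if every country stored text as fixed-width 2-byte Unicode:
    1. Wasted storage space (English text doubles in size).
    2. More I/O, which lowers program performance.

When data in memory is saved to disk, it is encoded as UTF-8 (Unicode Transformation Format):
    1. English characters go from 2 bytes in Unicode to 1 byte.
    2. Chinese characters go from 2 bytes in Unicode to 3 bytes.
    ...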
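
The size change can be checked directly. This sketch compares a fixed 2-byte encoding (UTF-16 little-endian, used here as a stand-in for "2-byte Unicode") with UTF-8 for one English and one Chinese character. The choice of characters is an assumption made for the example.

```python
# Compare a fixed 2-byte encoding (UTF-16, little-endian, no BOM)
# with variable-width UTF-8 for the same characters.
for ch in ('a', '上'):
    fixed = len(ch.encode('utf-16-le'))
    utf8 = len(ch.encode('utf-8'))
    print(ch, fixed, utf8)
# a 2 1
# 上 2 3

# The UTF-8 sizes collected in one place.
sizes = {ch: len(ch.encode('utf-8')) for ch in ('a', '上')}
```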
    
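How computers store characters today:
    In memory: Unicode
    On disk: usually UTF-8

Saving data from memory to disk:
    application text  >>>  Unicode data in memory (default)  >>> (encode) >>>  UTF-8 bytes on disk
Reading data from disk into memory:
    UTF-8 bytes on disk  >>> (decode) >>>  Unicode data in memory  >>>  application text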
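
Both directions can be traced with a small file. The sketch below writes a str to a temporary file as UTF-8, inspects the raw bytes on disk, and decodes them back. The temporary file and the sample text are assumptions made so the example does not touch real data.

```python
import os
import tempfile

text = '大灰狼shift'                       # Unicode str in memory

# A throwaway file created just for this example.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

# Memory -> disk: open() encodes the str to UTF-8 bytes while writing.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Look at the raw bytes on disk, with no decoding applied.
with open(path, 'rb') as f:
    raw = f.read()
print(len(text), len(raw))                # 8 14

# Disk -> memory: decoding the bytes gives the original str back.
loaded = raw.decode('utf-8')
print(loaded == text)                     # True

os.remove(path)
```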
    
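Python 2 and Python 3:
    Python 2 uses ASCII as its default source encoding (Unicode was not yet widespread when Python 2 was developed). To support other languages, Python 2 added a separate unicode data type.
    Python 3 represents str as Unicode, and its default source encoding is UTF-8.

File header:
    It is good practice to declare the encoding at the top of every Python file: # -*- coding: utf-8 -*-
        1. The header works because the common encodings all represent English (ASCII) characters the same way, so the interpreter can read this line before it knows the file's encoding.

Ways to specify an encoding:
    1. In the editor or application settings
    2. With the file header
    3. When defining a value (Python 2): x = u'上'

P.S.:
    1. PyCharm uses a Unicode encoding (UTF-8) by default.
    2. In code written for the Python 2 interpreter, Chinese string literals should be prefixed with u so that Python 2 stores them in memory as Unicode, rather than in Python 2's default ASCII encoding or the encoding declared in the file header.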

# encode(): converts the Unicode str in memory into UTF-8 bytes that can be stored on disk or sent over a network

s = '上'
print(s.encode('utf-8'))    # b'\xe4\xb8\x8a'
print(type(s.encode('utf-8')))    # <class 'bytes'>, the binary data type

# bytes(): type conversion; in effect the same as encode()
s = '上'
print(bytes(s, encoding='utf-8'))    # b'\xe4\xb8\x8a'
print(type(bytes(s, encoding='utf-8')))    # <class 'bytes'>

# decode(): converts UTF-8 bytes (for example, read from disk) back into a Unicode str
res1 = s.encode('utf-8')
res2 = res1.decode('utf-8')
print(res2)    # 上

# str(): type conversion; in effect the same as decode()
res1 = s.encode('utf-8')
res2 = str(res1, encoding='utf-8')
print(res2)    # 上

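File handling

File:
    A simple interface the operating system gives users for working with (saving to and reading from) complex hardware such as a hard disk.

Why operate on files:
    An application needs to store data permanently or read it back later.

# Approach 1:
# Open the file with open(); the r prefix makes a raw string, so backslashes in the path are not treated as escapes
f = open(r'E:\python_test\a.txt.py', encoding='utf-8')    # without encoding=, Chinese-language Windows defaults to GBK
print(f)    # f is the file object (file handle), like a remote control for the file
# Read the file
print(f.read())    # read the whole file
f.close()    # tell the operating system to release the file

# Approach 2:
# with closes the file automatically
# several files can be opened in one with statement
with open(r'E:\python_test\a.txt.py', encoding='utf-8') as f, \
        open(r'E:\python_test\b.txt.py', encoding='utf-8') as g, \
        open(r'E:\python_test\c.txt.py', encoding='utf-8') as h:
    print(f)
    print(f.read())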

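File modes (mode)

r        read-only (the default)
w        write-only
a        append-only (content can only be added at the end of the file)

r, w, and a are the three basic modes.

r+       read and write; the file must exist and is not cleared
w+       read and write; the file is created or cleared first
a+       read and append; writes always go to the end of the file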
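
As a quick check of the table above, this sketch opens one temporary file in each mode and records what readable() and writable() report. The temporary file and the capabilities dictionary are assumptions introduced for the example.

```python
import os
import tempfile

# A throwaway file that exists, so that 'r' and 'r+' can open it.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

capabilities = {}
for mode in ('r', 'w', 'a', 'r+', 'w+', 'a+'):
    with open(path, mode, encoding='utf-8') as f:
        capabilities[mode] = (f.readable(), f.writable())

for mode, (can_read, can_write) in capabilities.items():
    print(mode, can_read, can_write)
# r True False
# w False True
# a False True
# r+ True True
# w+ True True
# a+ True True

os.remove(path)
```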

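Unit of operation

t        text (the default); pass the encoding parameter, otherwise the operating system's default encoding is used
b        binary; encoding must not be passed. Typically used for non-text files, or to store binary data received over a network as-is

with open(r'E:\python_test\test', mode='rt', encoding='utf-8') as f:
    print(f.readable())    # True
    print(f.writable())    # False
    print(f.read())    # read the entire file
    f.write('x')    # io.UnsupportedOperation: not writable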
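
The difference between the two units shows up in the type that read() returns. A minimal sketch, assuming a temporary file that holds a single Chinese character:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, 'w', encoding='utf-8') as f:
    f.write('上')

with open(path, 'rt', encoding='utf-8') as f:
    as_text = f.read()        # str, decoded for us
with open(path, 'rb') as f:
    as_bytes = f.read()       # bytes, exactly as stored

print(type(as_text).__name__, type(as_bytes).__name__)   # str bytes

# Combining b with encoding= is rejected before the file is opened.
try:
    open(path, 'rb', encoding='utf-8')
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)   # ValueError

os.remove(path)
```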

r mode

# Opening a file that does not exist raises an error
with open(r'E:\python_test\test1', mode='r', encoding='utf-8') as f:    # FileNotFoundError
    print(f)

# Absolute and relative paths
with open(r'E:\python_test\test', mode='rb') as f:
    print(f)    # <_io.BufferedReader name='E:\\python_test\\test'>

with open(r'test', mode='rb') as f:
    print(f)    # <_io.BufferedReader name='test'>; a relative path works when the file is in the same directory as this script

# The cursor while reading
with open(r'E:\python_test\test', mode='rb') as f:
    print('>>>1: ')
    print(f.read())
    print('>>>2: ')
    print(f.read())

>>>1: 
b'...'    # the whole file as bytes (output abridged)
>>>2: 
b''    # empty: the first read() left the cursor at the end of the file

# readlines()
with open(r'E:\python_test\test', encoding='utf-8', mode='r') as f:
    print(f.readlines())    # ['第一行\n', '第二行\n', '第三行\n']

# A file object can be iterated with for, one line at a time, which avoids the memory cost of read() loading the whole file at once
with open(r'E:\python_test\test', encoding='utf-8', mode='r') as f:
    for i in f:
        print(i)
第一行

第二行

第三行
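
The blank lines in this output appear because each line read from the file still ends with \n and print() adds another. A small sketch, using a temporary file with assumed contents, that strips the trailing newline while still holding only one line in memory at a time:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, 'w', encoding='utf-8') as f:
    f.write('第一行\n第二行\n第三行\n')

lines = []
with open(path, 'r', encoding='utf-8') as f:
    for line in f:                        # one line in memory at a time
        lines.append(line.rstrip('\n'))   # drop the trailing newline

print(lines)    # ['第一行', '第二行', '第三行']

os.remove(path)
```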

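# readline()
with open(r'E:\python_test\test', encoding='utf-8', mode='r') as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    print(f.readline())    # prints an empty line: the end of the file has been reached
    print(f.readline())    # prints an empty line

# \n is the newline character; in text mode the Windows line ending \r\n is read as \n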

w mode

# If the file does not exist, it is created and then written
with open(r'xxx.txt', encoding='utf-8', mode='w') as f:
    print(f)    # <_io.TextIOWrapper name='xxx.txt' mode='w' encoding='utf-8'>

# If the file exists, its content is cleared first and then written
with open(r'xxx.txt', encoding='utf-8', mode='w') as f:
    print(f.readable())    # False
    print(f.writable())    # True
    f.write('今天的天气不错')

# Writing several lines
with open(r'xxx.txt', encoding='utf-8', mode='w') as f:
    f.write('今天的天气不错\n')
    f.write('今天的天气不错\n')
    f.write('今天的天气不错\n')
    f.write('今天的天气不错\n')

# writelines()
l = ['111', '222', '333']    # any iterable of strings
with open(r'E:\python_test\xxx.txt', mode='w', encoding='utf-8') as f:
    f.writelines(l)    # file content: 111222333 (no newlines are added; writelines() returns None)

l = ['1111', '2222', '3333']
with open(r'E:\python_test\xxx.txt', mode='w', encoding='utf-8') as f:
    for i in l:
        f.write(i)    # equivalent to f.writelines(l)
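
Because writelines() does not add line breaks, each item needs its own \n if the items should end up on separate lines. A minimal sketch, assuming a temporary file and a sample list:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

items = ['111', '222', '333']
with open(path, 'w', encoding='utf-8') as f:
    # Append the newline to each item ourselves.
    f.writelines(item + '\n' for item in items)

with open(path, 'r', encoding='utf-8') as f:
    content = f.read()
print(repr(content))    # '111\n222\n333\n'

os.remove(path)
```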

a mode (append-only)

# If the file does not exist, it is created and then written
with open(r'E:\python_test\xxxx.txt', mode='a', encoding='utf-8') as f:
    print(f)    # <_io.TextIOWrapper name='E:\\python_test\\xxxx.txt' mode='a' encoding='utf-8'>

# If the file exists, its content is kept and new data is appended (the cursor starts at the end of the file)
with open(r'E:\python_test\xxx.txt', mode='a', encoding='utf-8') as f:
    print(f.readable())    # False
    print(f.writable())    # True

# Cursor
with open(r'E:\python_test\xxxx.txt', mode='a', encoding='utf-8') as f:
    f.write('大灰狼')    # appended at the end of the file

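r+ mode

# Readable and writable; after read() the cursor is at the end, so the write is appended
with open(r'E:\python_test\xxx.txt', mode='r+', encoding='utf-8') as f:
    print(f.readable())    # True
    print(f.writable())    # True
    print(f.read())    # ...
    print(f.write('\n233'))    # returns the number of characters written

# Writing after readline()
# In text mode the write position after a partial read is not obvious; call f.seek() first when the position matters
with open(r'E:\python_test\xxx.txt', mode='r+', encoding='utf-8') as f:
    print(f.readline())
    f.write('\nhahaha')
    print(f.write('哈'))    # prints the return value; f was written twice, and in this test both writes landed at the end of the file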

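w+ mode

# Readable; writing replaces the old content
    # In this mode the file is cleared first, so read() returns nothing until something has been written
with open(r'E:\python_test\xxx.txt', mode='w+', encoding='utf-8') as f:
    print(f.readable())    # True
    print(f.writable())    # True
    print(f.read())    # empty string: the file was just cleared
    print(f.write('哈哈哈'))    # the previous content is gone

with open(r'E:\python_test\xxxx.txt', mode='w+', encoding='utf-8') as f:
    f.readline()
    print(f.write('哈'))    # the previous content is gone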

a+ mode

# Readable; writing appends
with open(r'E:\python_test\xxx.txt', mode='a+', encoding='utf-8') as f:
    print(f.readable())    # True
    print(f.writable())    # True
    print(f.read())    # empty string: the cursor starts at the end of the file
    print(f.write('\n哈哈哈'))

with open(r'E:\python_test\xxxx.txt', mode='a+', encoding='utf-8') as f:
    f.readline()
    print(f.write('哈'))

1哈
2哈
3哈    # writes only ever land at the end

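r+b mode

# Readable and writable in bytes; after read() the cursor is at the end, so the write is appended
with open(r'E:\python_test\xxx.txt', mode='r+b') as f:
    print(f.readable())    # True
    print(f.writable())    # True
    print(f.read())    # b'hahahahahahahahaha'
    print(f.write(b'\xe5\xa4\xa7'))    # the return value is the number of bytes written

# Overwriting: after readline() the cursor is at the start of the second line
with open(r'E:\python_test\xxxx.txt', mode='r+b') as f:
    f.readline()
    print(f.write(b'\xe5\xa4\xa7'))

hahahahahaha
大aha大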

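Moving the cursor within a file

# rt mode
    # mode can be omitted (rt is the default)
    # Only in text mode does the number passed to read() mean a number of characters
    # In binary mode the number is a number of bytes
with open(r'E:\python_test\xxxx.txt', 'r', encoding='utf-8') as f:
    print(f.read(5))    # 大灰狼大灰

# rb mode
    # In UTF-8 a Chinese character takes 3 bytes and an English character takes 1 byte
with open(r'E:\python_test\xxxx.txt', 'rb') as f:
    print(f.read(3))    # reads 3 bytes, one Chinese character: b'\xe5\xa4\xa7'
    print(f.read(3).decode('utf-8'))    # 灰
    print(f.read(3).decode('utf-8'))    # 狼; note how the cursor advances with each read

# seek(offset, whence)
    # offset: how far to move, in bytes
    # whence: the reference point
        # 0: start of the file             works in both t and b modes
        # 1: current cursor position       b mode only (text mode accepts only offset 0)
        # 2: end of the file               b mode only (text mode accepts only offset 0)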
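
The three reference points can be tried without touching the disk. In this sketch an in-memory io.BytesIO stream stands in for a file opened in rb mode; the sample content and variable names are assumptions.

```python
import io

# An in-memory binary stream stands in for a file opened with 'rb'.
f = io.BytesIO('大灰狼shift'.encode('utf-8'))   # 9 + 5 = 14 bytes

f.seek(9, 0)               # whence=0: 9 bytes from the start
first = f.read(1)          # b's'
f.seek(1, 1)               # whence=1: skip 1 byte from the current position
second = f.read(1)         # b'i'
f.seek(-2, 2)              # whence=2: 2 bytes before the end
third = f.read()           # b'ft'

print(first, second, third, f.tell())   # b's' b'i' b'ft' 14
```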

seek() in r mode

# seek(N, 0) in rt mode
with open(r'E:\python_test\xxxx.txt', 'r', encoding='utf-8') as f:
    print(f.read(1))    # 大
    f.seek(6, 0)    # the offset is in bytes: watch out when Chinese and English are mixed
    print(f.read(3))    # 狼sh

with open(r'E:\python_test\xxxx.txt', 'r', encoding='utf-8') as f:
    print(f.read(3))    # 大灰狼
    f.seek(9, 0)
    print(f.read())    # shift

# seek(N, 0) in rb mode
    # the numbers passed to read() and seek() are both bytes
    # remember that one Chinese character is 3 bytes
with open(r'E:\python_test\xxxx.txt', 'rb') as f:
    print(f.read(3).decode('utf-8'))    # 大
    f.seek(9, 0)
    print(f.read())    # b'shift'

# seek(N, 1) in rt mode
with open(r'E:\python_test\xxxx.txt', 'r', encoding='utf-8') as f:
    print(f.read(3))
    f.seek(2, 1)    # not supported: raises io.UnsupportedOperation
    print(f.read(1))

# seek(N, 1) in rb mode
with open(r'E:\python_test\xxxx.txt', 'rb') as f:
    print(f.read(9).decode('utf-8'))    # 大灰狼
    f.seek(2, 1)
    print(f.read(1))    # b'i'

# seek(N, 2) in rt mode
with open(r'E:\python_test\xxxx.txt', 'r', encoding='utf-8') as f:
    print(f.read(8))
    f.seek(-3, 2)    # not supported: raises io.UnsupportedOperation
    print(f.read(1))

# seek(N, 2) in rb mode
with open(r'E:\python_test\xxxx.txt', 'rb') as f:
    print(f.read().decode('utf-8'))    # 大灰狼shift
    f.seek(-3, 2)
    print(f.read(1))    # b'i'

with open(r'E:\python_test\xxxx.txt', 'rb') as f:
    print(f.read(9).decode('utf-8'))    # 大灰狼
    f.seek(-3, 2)
    print(f.read(1))    # b'i'
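
The "not supported" cases in text mode can be observed safely by catching the exception. This sketch assumes a temporary file holding 大灰狼shift and collects the error messages instead of letting the program stop.

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, 'w', encoding='utf-8') as f:
    f.write('大灰狼shift')

errors = []
with open(path, 'r', encoding='utf-8') as f:
    f.read(3)
    # Nonzero offsets relative to the cursor (1) or the end (2) are rejected.
    for offset, whence in ((2, 1), (-3, 2)):
        try:
            f.seek(offset, whence)
        except io.UnsupportedOperation as exc:
            errors.append(str(exc))
    end = f.seek(0, 2)      # offset 0 is allowed: jump to the end

print(len(errors))   # 2
print(end)           # 14, the file size in bytes

os.remove(path)
```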

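seek() in w mode

# The file is cleared first, then written
with open(r'E:\python_test\xxxx.txt', 'w', encoding='utf-8') as f:
    f.seek(3, 0)    # pointless: the file is empty, and writing past the end only leaves padding (typically NUL bytes) in the gap
    print(f.write('233'))

seek() in a mode

# Data can only be appended at the end of the file
with open(r'E:\python_test\xxxx.txt', 'a', encoding='utf-8') as f:
    f.seek(3, 0)    # pointless: writes in append mode always go to the end
    print(f.write('233'))

seek() in r+ mode

# Overwrites character by character, starting at the given position
with open(r'E:\python_test\xxxx.txt', 'r+', encoding='utf-8') as f:
    f.seek(9, 0)
    f.write('233')    # 大灰狼233ft

seek() in w+ mode

# The old content is replaced
with open(r'E:\python_test\xxxx.txt', 'w+', encoding='utf-8') as f:
    f.seek(9, 0)
    f.write('233')    # the old content is gone; only 233 remains, after the padding left by the seek

seek() in a+ mode

# Appends
with open(r'E:\python_test\xxxx.txt', 'a+', encoding='utf-8') as f:
    f.seek(9, 0)
    f.write('233')    # 大灰狼shift233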
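
The contrast between r+ and a+ can be reproduced side by side. The helper write_at below is hypothetical, introduced only for this sketch: it creates a temporary file holding 大灰狼shift, reopens it in the given mode, seeks to byte 9, writes 233, and returns the resulting content.

```python
import os
import tempfile

def write_at(mode):
    """Create a file holding 大灰狼shift, reopen it in `mode`,
    seek to byte 9, write '233', and return the final content."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    with open(path, 'w', encoding='utf-8') as f:
        f.write('大灰狼shift')
    with open(path, mode, encoding='utf-8') as f:
        f.seek(9, 0)
        f.write('233')
    with open(path, 'r', encoding='utf-8') as f:
        content = f.read()
    os.remove(path)
    return content

print(write_at('r+'))   # 大灰狼233ft     (overwrites from the seek position)
print(write_at('a+'))   # 大灰狼shift233  (the seek is ignored for writing)
```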

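Watching a file for new content

# The time module
import time
res = time.strftime('%Y-%m-%d %X')
print(res, type(res))    # 2020-11-03 15:43:02 <class 'str'>

# Use a while loop to watch the end of the file in real time
import time
with open(r'xxxx.txt', 'a+', encoding='utf-8') as f:
    f.seek(0, 2)    # jump to the end of the file
    while True:
        res = f.readline()
        print(f.tell())    # current cursor position, in bytes
        if res:
            my_time = time.strftime('%Y-%m-%d %X')
            print('%s: new content: %s' % (my_time, res))    # res is already a str in text mode
            f.flush()    # push any buffered data to disk immediately
        else:
            print('nobody is writing to the file')

# Complete code
import time
with open(r'xxxx.txt', 'rb') as f:
    f.seek(0, 2)    # jump to the end of the file
    while True:
        res = f.readline()
        if res:
            my_time = time.strftime('%Y-%m-%d %X')
            print('%s new content: %s' % (my_time, res.decode('utf-8')))
        else:
            time.sleep(0.5)    # nothing new yet; wait briefly instead of spinning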
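
The loop above runs forever, which makes it hard to try out. The sketch below isolates the core step in a hypothetical helper, read_new_lines, and simulates another program appending to a temporary file. The helper name, file, and sample text are assumptions.

```python
import os
import tempfile

def read_new_lines(f):
    """Return the lines appended to binary file f since the last call."""
    lines = []
    while True:
        line = f.readline()
        if not line:          # b'' means nothing more to read right now
            break
        lines.append(line.decode('utf-8'))
    return lines

fd, path = tempfile.mkstemp(suffix='.log')
os.close(fd)

with open(path, 'rb') as watcher:
    watcher.seek(0, 2)                       # start at the current end
    before = read_new_lines(watcher)         # nothing new yet
    with open(path, 'ab') as writer:         # simulate another program
        writer.write('大灰狼\n'.encode('utf-8'))
    after = read_new_lines(watcher)

print(before, after)    # [] ['大灰狼\n']

os.remove(path)
```

In a real watcher, read_new_lines would be called inside a loop with a short time.sleep() between polls.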

truncate()

# Truncates the file
    # everything after the given number of bytes is deleted
with open(r'xxxx.txt', 'a', encoding='utf-8') as f:
    f.truncate(9)    # 大灰狼 (3 characters x 3 bytes)
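
Since truncate() counts bytes, cutting in the middle of a multi-byte character leaves data that can no longer be decoded. A minimal sketch, assuming a temporary file holding 大灰狼shift:

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, 'w', encoding='utf-8') as f:
    f.write('大灰狼shift')

with open(path, 'r+', encoding='utf-8') as f:
    f.truncate(9)                 # keep the first 9 bytes = 3 Chinese characters
with open(path, 'r', encoding='utf-8') as f:
    kept = f.read()
print(kept)                       # 大灰狼

with open(path, 'r+b') as f:
    f.truncate(8)                 # cuts the third character in the middle
with open(path, 'rb') as f:
    broken = f.read()
try:
    broken.decode('utf-8')
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)                  # False

os.remove(path)
```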

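Two ways to modify a file

# Approach 1:
    1. Read the data from disk into memory
    2. Make the changes in memory
    3. Write the result back over the original content on disk

# Implementation
with open(r'xxxx.txt', 'r', encoding='utf-8') as f:
    data = f.read()
    # print(type(data))    # <class 'str'>

with open(r'xxxx.txt', 'w', encoding='utf-8') as f:
    sp = data.replace('大灰狼', 'syy')
    f.write(sp)    # syyshift

    # Advantage:
        Only one file ever exists on disk, so no extra disk space is used

    # Disadvantage:
        1. With a very large file, read() can exhaust memory

# Approach 2:
    1. Create a new file
    2. Read the old file line by line, modify each line in memory, and write it to the new file
    3. Delete the old file and rename the new file to the old name

# Implementation
import os
with open(r'xxxx.txt', 'r', encoding='utf-8') as read_f, \
        open(r'xxxx.txt.swap', 'w', encoding='utf-8') as write_f:
    for line in read_f:    # iterates line by line (split on newlines, not spaces)
        newline = line.replace('大灰狼', 'syy')
        write_f.write(newline)
os.remove(r'xxxx.txt')
os.rename(r'xxxx.txt.swap', 'xxxx.txt')

    # Advantage:
        Uses little memory

    # Disadvantage:
        For a moment there are two files on disk, which takes more disk space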
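
One possible variation on approach 2 wraps the steps in a function and uses os.replace, which overwrites the destination in a single call, so the separate os.remove step is not needed. The function name replace_in_file, the .swap suffix, and the temporary test file are assumptions made for this sketch.

```python
import os
import tempfile

def replace_in_file(path, old, new, encoding='utf-8'):
    """Rewrite `path` line by line, replacing `old` with `new`."""
    swap_path = path + '.swap'
    with open(path, 'r', encoding=encoding) as read_f, \
            open(swap_path, 'w', encoding=encoding) as write_f:
        for line in read_f:                     # one line in memory at a time
            write_f.write(line.replace(old, new))
    # os.replace overwrites the destination in one step.
    os.replace(swap_path, path)

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, 'w', encoding='utf-8') as f:
    f.write('大灰狼shift\n大灰狼\n')

replace_in_file(path, '大灰狼', 'syy')

with open(path, 'r', encoding='utf-8') as f:
    result = f.read()
print(repr(result))    # 'syyshift\nsyy\n'

os.remove(path)
```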
    
    
# Replacing in place
with open(r'xxxx.txt', 'r+', encoding='utf-8') as f:
    f.seek(9, 0)
    f.write('ha')    # 大灰狼haift
