Python Files and I/O

Posted 2022-11-29 HT . WANG

1. Reading and Writing Text Data

Reading a file

f = open('somefile.txt', 'rt')
data = f.read()
f.close()

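After reading a file this way, you must remember to close it manually.

To avoid forgetting, use the with statement.

The with statement establishes a context for the file being used; when control leaves the with block, the file is closed automatically.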

# Read the entire file as a single string
with open('somefile.txt', 'rt') as f:
    data = f.read()

# Iterate over the lines of the file
with open('somefile.txt', 'rt') as f:
    for line in f:
        # process line
        ...

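By default, files are read and written using the platform's default text encoding, which is UTF-8 on most machines. If you know the text uses a different encoding, pass the optional encoding argument to open().

Reading text of unknown encoding as latin-1 never raises a decoding error, because every byte value maps to a character, although the result may not be the intended text.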

with open('somefile.txt', 'rt', encoding='latin-1') as f:
    ...
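
To make the effect of the encoding argument concrete, here is a minimal sketch. The file name and sample text are assumptions made up for illustration, and errors='replace' is a standard open() option that the text above does not mention.

```python
import os
import tempfile

# Hypothetical sample file, written as UTF-8 for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wt', encoding='utf-8') as f:
    f.write('Spicy Jalape\u00f1o\n')

# Decoding with the correct encoding recovers the original text.
with open(path, 'rt', encoding='utf-8') as f:
    good = f.read()

# latin-1 never raises, but each UTF-8 byte becomes a separate character.
with open(path, 'rt', encoding='latin-1') as f:
    mojibake = f.read()

# errors='replace' substitutes U+FFFD for every byte that cannot be decoded.
with open(path, 'rt', encoding='ascii', errors='replace') as f:
    replaced = f.read()

print(repr(good), repr(mojibake), repr(replaced))
```

The latin-1 read succeeds, but the two bytes that encode the accented letter show up as two unrelated characters, which is why latin-1 is a last resort rather than a real fix.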

Writing a file

Overwriting:

# Write chunks of text data
with open('somefile.txt', 'wt') as f:
    f.write(text1)
    f.write(text2)
    ...

# Redirected print statement
with open('somefile.txt', 'wt') as f:
    print(line1, file=f)
    print(line2, file=f)
    ...

Appending:

# Write chunks of text data
with open('somefile.txt', 'at') as f:
    f.write(text1)
    f.write(text2)
    ...

# Redirected print statement
with open('somefile.txt', 'at') as f:
    print(line1, file=f)
    print(line2, file=f)
    ...

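2. Printing Output to a File

print() output is buffered: unless the buffer is flushed (for example with flush=True), the text may not appear until the buffer fills or the stream is closed.

Pass the file keyword argument to print() to redirect its output to a file.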

with open('d:/work/test.txt', 'wt') as f:
    print('Hello World!', file=f)

3. Printing with a Different Separator or Line Ending

Use the sep and end keyword arguments of print() to format the output the way you want.

>>> print('ACME', 50, 91.5)
ACME 50 91.5
>>> print('ACME', 50, 91.5, sep=',')
ACME,50,91.5
>>> print('ACME', 50, 91.5, sep=',', end='!!\n')
ACME,50,91.5!!
>>>

The end argument can also be used to suppress the newline in the output:

>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
>>> for i in range(5):
...     print(i, end=' ')
...
0 1 2 3 4 >>>

4. Reading and Writing Binary Data

# Read the entire file as a single byte string
with open('somefile.bin', 'rb') as f:
    data = f.read()  # the data comes back as a byte string, not a text string

# Write binary data to a file
with open('somefile.bin', 'wb') as f:
    f.write(b'Hello World')  # the argument must be a bytes-like object (bytes, bytearray, etc.)

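When reading binary data, the semantic differences between byte strings and text strings can be a trap. In particular, indexing and iterating over a byte string return integer byte values rather than one-byte strings.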

>>> # Text string
>>> t = 'Hello World'
>>> t[0]
'H'
>>> for c in t:
...     print(c)
...
H
e
l
l
o
...
>>> # Byte string
>>> b = b'Hello World'
>>> b[0]
72
>>> for c in b:
...     print(c)
...
72
101
108
108
111
...
>>>

To read or write text through a file opened in binary mode, you must decode and encode the data yourself.

with open('somefile.bin', 'rb') as f:
    data = f.read(16)
    text = data.decode('utf-8')  # decode the bytes to text

with open('somefile.bin', 'wb') as f:
    text = 'Hello World'
    f.write(text.encode('utf-8'))  # encode the text string to UTF-8 bytes

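5. Writing Only If the File Does Not Exist

You want to write data to a file, but only if it does not already exist on the filesystem; that is, existing file contents must never be overwritten.

Use the x mode of open() instead of the w mode. If the file already exists, open() raises FileExistsError.

>>> with open('somefile', 'xt') as f:
...     f.write('Hello\n')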
...
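
A program usually wants to react when the file is already there instead of crashing. The sketch below wraps the x mode in a small helper; write_once and the temporary path are hypothetical names introduced only for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'somefile')  # hypothetical path


def write_once(path, text):
    """Write text to path only if the file does not exist yet."""
    try:
        with open(path, 'xt') as f:
            f.write(text)
        return True
    except FileExistsError:
        # open() refused to touch the existing file.
        return False


first = write_once(path, 'Hello\n')   # the file is created
second = write_once(path, 'Again\n')  # the file exists, nothing is written

with open(path, 'rt') as f:
    contents = f.read()

print(first, second, repr(contents))
```

Compared with checking os.path.exists() first and then opening with w, the x mode performs the check and the creation in one step, so another process cannot create the file in between.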

6. Performing I/O on Strings

Use the io.StringIO() and io.BytesIO() classes to create file-like objects that operate on string data.

>>> import io
>>> s = io.StringIO()
>>> s.write('Hello World\n')
12
>>> print('This is a test', file=s)  # redirect print output to s
>>> # Get everything written so far
>>> s.getvalue()
'Hello World\nThis is a test\n'
>>>

>>> s = io.BytesIO()  # io.StringIO handles text only; use io.BytesIO for binary data
>>> s.write(b'binary data')
11
>>> s.getvalue()
b'binary data'
>>>
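
A common use of these classes is feeding in-memory data to code that expects a file. In the sketch below, count_words is a hypothetical function written for this illustration; it works on any iterable of lines, so io.StringIO can stand in for a real file.

```python
import io


def count_words(f):
    """Count whitespace-separated words in a file-like object."""
    return sum(len(line.split()) for line in f)


# An in-memory text file instead of a file on disk.
fake_file = io.StringIO('Hello World\nThis is a test\n')
total = count_words(fake_file)
print(total)
```

This is handy in unit tests, where creating real files would be slower and would need cleanup.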

7. Reading and Writing Compressed Files

The gzip and bz2 modules read and write files compressed in gzip or bz2 format.

# gzip compression
import gzip
with gzip.open('somefile.gz', 'rt') as f:
    text = f.read()

# bz2 compression
import bz2
with bz2.open('somefile.bz2', 'rt') as f:
    text = f.read()

# gzip compression
import gzip
with gzip.open('somefile.gz', 'wt') as f:
    f.write(text)

# bz2 compression
import bz2
with bz2.open('somefile.bz2', 'wt') as f:
    f.write(text)

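When writing compressed data, the optional compresslevel keyword argument sets the compression level.

The default level is 9, which is also the highest. Lower levels run faster but compress less.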

with gzip.open('somefile.gz', 'wt', compresslevel=5) as f:
    f.write(text)
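
To see what the level changes in practice, the following sketch writes the same text at two levels and records the resulting file sizes. The sample text and file names are assumptions for illustration; actual sizes depend on the data.

```python
import gzip
import os
import tempfile

text = 'Hello World\n' * 1000  # sample data, highly repetitive
tmpdir = tempfile.mkdtemp()

sizes = {}
for level in (1, 9):
    path = os.path.join(tmpdir, 'level%d.gz' % level)
    with gzip.open(path, 'wt', compresslevel=level, encoding='utf-8') as f:
        f.write(text)
    sizes[level] = os.path.getsize(path)

# Reading back decompresses transparently, whatever level was used to write.
with gzip.open(os.path.join(tmpdir, 'level9.gz'), 'rt', encoding='utf-8') as f:
    restored = f.read()

print(sizes, restored == text)
```

The compression level only matters when writing; readers never need to know which level was used.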

8. Iterating over Fixed-Size Records

You want to iterate over a file as a collection of fixed-size records or chunks rather than line by line.

from functools import partial

RECORD_SIZE = 32

with open('somefile.data', 'rb') as f:
    records = iter(partial(f.read, RECORD_SIZE), b'')  # partial creates a callable that reads a fixed number of bytes on each call
    for r in records:
        ...

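records is an iterable that keeps producing fixed-size chunks until the end of the file is reached. Note that if the file size is not an exact multiple of the record size, the last item returned will be shorter than expected.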
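
The short final record is easy to demonstrate with an in-memory stream. This sketch assumes a 10-byte io.BytesIO object and a record size of 4, both chosen only for illustration.

```python
import io
from functools import partial

RECORD_SIZE = 4
f = io.BytesIO(b'aaaabbbbcc')  # 10 bytes: two full records plus 2 leftover bytes

# iter(callable, sentinel) keeps calling f.read(4) until it returns b''.
records = list(iter(partial(f.read, RECORD_SIZE), b''))

# Keep only complete records if a partial tail should be ignored.
complete = [r for r in records if len(r) == RECORD_SIZE]

print(records, complete)
```

Whether the trailing partial record is an error or just padding depends on the file format, so the check is left to the caller.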

9. Reading Binary Data into a Mutable Buffer

import os.path

def read_into_buffer(filename):
    buf = bytearray(os.path.getsize(filename))
    with open(filename, 'rb') as f:
        f.readinto(buf)  # readinto() fills a preallocated buffer with data
    return buf

>>> # Write a sample file
>>> with open('sample.bin', 'wb') as f:
...     f.write(b'Hello World')
...
>>> buf = read_into_buffer('sample.bin')
>>> buf
bytearray(b'Hello World')

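The contents of an existing buffer can be modified in place with slice assignment, and a memoryview of the buffer makes zero-copy slices possible.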

>>> buf[0:5] = b'hello'
>>> buf
bytearray(b'hello World')
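
Slicing a bytearray directly produces a copy, so the zero-copy behaviour comes from memoryview. A minimal sketch, assuming a small literal buffer instead of one filled by readinto():

```python
buf = bytearray(b'Hello World')

# A memoryview exposes the same memory without copying it.
view = memoryview(buf)
tail = view[-5:]      # zero-copy slice covering b'World'

# Writing through the slice changes the original buffer.
tail[:] = b'WORLD'

print(buf)
```

The replacement must have the same length as the slice, because a memoryview cannot resize the underlying buffer.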

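10. Memory-Mapping Binary Files

You want to memory-map a binary file into a mutable byte array, perhaps to access its contents randomly or to modify them in place.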

import os
import mmap

def memory_map(filename, access=mmap.ACCESS_WRITE):
    size = os.path.getsize(filename)
    fd = os.open(filename, os.O_RDWR)
    return mmap.mmap(fd, size, access=access)

>>> size = 1000000
>>> with open('data', 'wb') as f:
...     f.seek(size-1)  # create a file and extend it to the desired size
...     f.write(b'\x00')
...
>>>

>>> m = memory_map('data')
>>> len(m)
1000000
>>> m[0:10]
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> m[0]
0
>>> # Reassign a slice
>>> m[0:11] = b'Hello World'
>>> m.close()

>>> # Verify that changes were made
>>> with open('data', 'rb') as f:
...     print(f.read(11))
...
b'Hello World'
>>>
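
mmap objects also work as context managers, which closes the mapping automatically. The following sketch is a variation on the session above and not the same helper: it assumes a small temporary file (the name data is reused only as an example) and maps it through a regular file object, with length 0 meaning the whole file.

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'data')  # hypothetical file

# Create a 100-byte file filled with zero bytes.
with open(path, 'wb') as f:
    f.write(b'\x00' * 100)

# Map the whole file (length 0) and modify it in place.
with open(path, 'r+b') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
        mapped_size = len(m)
        m[0:11] = b'Hello World'

# Verify that the change reached the file.
with open(path, 'rb') as f:
    head = f.read(11)

print(mapped_size, head)
```

Here both the mapping and the file object are released when their with blocks end, so no descriptor is left open.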

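11. Manipulating Pathnames

Use the functions in os.path to extract the file name, the directory name, the absolute path, and so on from a pathname.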

>>> import os
>>> path = '/Users/beazley/Data/data.csv'

>>> # Get the last component of the path
>>> os.path.basename(path)
'data.csv'

>>> # Get the directory name
>>> os.path.dirname(path)
'/Users/beazley/Data'

>>> # Join path components together
>>> os.path.join('tmp', 'data', os.path.basename(path))
'tmp/data/data.csv'

>>> # Expand the user's home directory
>>> path = '~/Data/data.csv'
>>> os.path.expanduser(path)
'/Users/beazley/Data/data.csv'

>>> # Split the file extension
>>> os.path.splitext(path)
('~/Data/data', '.csv')
>>>
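
For comparison, the standard library's pathlib module offers the same operations as attributes of a path object. This is an alternative to the os.path calls above, not part of the original recipe; PurePosixPath is used so that the results do not depend on the operating system running the code.

```python
from pathlib import PurePosixPath

p = PurePosixPath('/Users/beazley/Data/data.csv')

name = p.name             # last component, like os.path.basename()
parent = str(p.parent)    # directory part, like os.path.dirname()
suffix = p.suffix         # extension, like os.path.splitext()[1]
stem = p.stem             # file name without the extension

# Joining components, like os.path.join()
joined = str(PurePosixPath('tmp', 'data', p.name))

print(name, parent, suffix, stem, joined)
```

For paths on the local machine, pathlib.Path provides the same attributes plus methods that touch the filesystem.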

12. Testing for the Existence of a File

# Use the os.path module to test whether a file or directory exists
>>> import os
>>> os.path.exists('/etc/passwd')
True
>>> os.path.exists('/tmp/spam')
False
>>>

# Test what kind of file it is
>>> # Is a regular file
>>> os.path.isfile('/etc/passwd')
True

>>> # Is a directory
>>> os.path.isdir('/etc/passwd')
False

>>> # Is a symbolic link
>>> os.path.islink('/usr/local/bin/python3')
True

>>> # Get the file linked to
>>> os.path.realpath('/usr/local/bin/python3')
'/usr/local/bin/python3.3'
>>>

# Get metadata (such as the file size or modification time)
>>> os.path.getsize('/etc/passwd')
3669
>>> os.path.getmtime('/etc/passwd')
1272478234.0
>>> import time
>>> time.ctime(os.path.getmtime('/etc/passwd'))
'Wed Apr 28 13:10:34 2010'
>>>

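13. Getting a Directory Listing

Use os.listdir() to get a list of the entries in a directory.

import os
names = os.listdir('somedir')  # returns every entry in the directory: files, subdirectories, symbolic links

Filtering the results: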

import os.path

# Get all regular files
names = [name for name in os.listdir('somedir')
         if os.path.isfile(os.path.join('somedir', name))]

# Get all dirs
dirnames = [name for name in os.listdir('somedir')
            if os.path.isdir(os.path.join('somedir', name))]

# Get all Python source files
pyfiles = [name for name in os.listdir('somedir')
           if name.endswith('.py')]
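
The filters above can be tried on a scratch directory. The sketch below builds one with made-up entries and also shows fnmatch from the standard library as a pattern-based alternative to endswith(); all file and directory names are assumptions for illustration.

```python
import os
import tempfile
from fnmatch import fnmatch

# Build a scratch directory with a few hypothetical entries.
somedir = tempfile.mkdtemp()
for name in ('a.py', 'b.py', 'notes.txt'):
    open(os.path.join(somedir, name), 'w').close()
os.mkdir(os.path.join(somedir, 'sub'))

# Shell-style pattern matching on the names returned by os.listdir().
pyfiles = sorted(name for name in os.listdir(somedir)
                 if fnmatch(name, '*.py'))

# Only the subdirectories.
dirnames = [name for name in os.listdir(somedir)
            if os.path.isdir(os.path.join(somedir, name))]

print(pyfiles, dirnames)
```

os.listdir() returns entries in arbitrary order, which is why the result is sorted before use.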

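14. Printing Bad Filenames

By default, Python assumes that all filenames are encoded according to sys.getfilesystemencoding(). Some filesystems do not enforce this, however, so it is possible to create files whose names are not properly encoded.

Your program gets a list of filenames from a directory, but crashes with an exception when it tries to print one of them. Catch the error and print an escaped form of the name instead: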

def bad_filename(filename):
    return repr(filename)[1:-1]

try:
    print(filename)
except UnicodeEncodeError:
    print(bad_filename(filename))

Note: passing a badly encoded filename to functions such as open() works fine. The crash happens only when you try to output the name.
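
The mechanism behind this can be reproduced without touching the filesystem. The sketch assumes a made-up byte string that is not valid UTF-8 and decodes it with the surrogateescape error handler, which is the handler Python applies to undecodable bytes in filenames.

```python
def bad_filename(filename):
    # repr() escapes the unprintable characters; strip the surrounding quotes.
    return repr(filename)[1:-1]


# Hypothetical raw name containing a byte that is invalid in UTF-8.
raw = b'b\xe4d.txt'
name = raw.decode('utf-8', 'surrogateescape')  # the bad byte becomes '\udce4'

try:
    name.encode('utf-8')  # the step that fails when the name is printed
    printable = True
except UnicodeEncodeError:
    printable = False

escaped = bad_filename(name)

# The original bytes can always be recovered, which is why open() still works.
roundtrip = name.encode('utf-8', 'surrogateescape')

print(printable, escaped, roundtrip)
```

The lone surrogate character carries the original byte through the program, but it has no valid encoding, so strict output fails while the round trip back to bytes succeeds.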

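15. Adding or Changing the Encoding of an Already Open File

To add Unicode encoding and decoding to a file that is already open in binary mode, wrap it in an io.TextIOWrapper() object.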

import urllib.request
import io

u = urllib.request.urlopen('http://www.python.org')
f = io.TextIOWrapper(u, encoding='utf-8')
text = f.read()
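
The same wrapping works for any binary file-like object, which makes it easy to try offline. A minimal sketch, assuming an in-memory io.BytesIO stream (chosen only for illustration) in place of the network response:

```python
import io

# In-memory binary stream standing in for a file opened in binary mode.
raw = io.BytesIO('Jalape\u00f1o\n'.encode('utf-8'))

# Layer UTF-8 decoding on top of the binary stream.
f = io.TextIOWrapper(raw, encoding='utf-8')
text = f.read()

print(repr(text))
```

Once wrapped, the binary stream should only be used through the wrapper, since the text layer keeps its own buffer.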

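To change the encoding of a file that is already open in text mode, use its detach() method to remove the existing text layer, then wrap the result with the new encoding.

>>> import io
>>> import sys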
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='latin-1')
>>> sys.stdout.encoding
'latin-1'
>>>

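16. Writing Bytes to a Text File

You want to write raw bytes to a file that is open in text mode. Write them to the file's underlying buffer:

>>> import sys
>>> sys.stdout.buffer.write(b'Hello\n')  # write the bytes directly to the underlying buffer
Hello
6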
>>>

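17. Wrapping a File Descriptor as a File Object

A file descriptor (handle) is not the same as an open file object. It is just an integer assigned by the operating system to refer to some I/O channel. If you happen to have such a descriptor, you can wrap it in a Python file object with open(): pass the integer descriptor as the first argument in place of the file name.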

# Open a low-level file descriptor
import os
fd = os.open('somefile.txt', os.O_WRONLY | os.O_CREAT)

# Turn into a proper file
f = open(fd, 'wt')
f.write('hello world\n')
f.close()
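
Closing the file object also closes the underlying descriptor. When the descriptor must outlive the wrapper, open() accepts closefd=False. The sketch below assumes a temporary file created just for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'somefile.txt')  # hypothetical path

# Open a low-level file descriptor.
fd = os.open(path, os.O_WRONLY | os.O_CREAT)

# Wrap it, but leave the descriptor open when the wrapper is closed.
f = open(fd, 'wt', closefd=False)
f.write('hello world\n')
f.close()  # flushes the text; fd itself stays open

# The descriptor is still usable and must be closed explicitly.
os.write(fd, b'more\n')
os.close(fd)

with open(path, 'rt') as check:
    contents = check.read()

print(repr(contents))
```

With the default closefd=True, the os.write() call after f.close() would fail because the descriptor would already be closed.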

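18. Making Temporary Files and Directories

You need a temporary file or directory while your program runs, and want it destroyed automatically once you are done with it.

(1) Creating a temporary file

from tempfile import TemporaryFile

with TemporaryFile('w+t') as f:  # the first argument is the file mode: usually w+t for text, w+b for binary
    f.write('Hello World\n')
    f.write('Testing\n')

    # Seek back to the beginning and read the data
    f.seek(0)
    data = f.read()

# The temporary file is destroyed automatically

# On most Unix systems, a file created by TemporaryFile() is unnamed and does not even have a directory entry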


from tempfile import NamedTemporaryFile

with NamedTemporaryFile('w+t') as f:
    print('filename is:', f.name)  # f.name holds the file name of the open temporary file
    ...
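
Sometimes the named file should survive after it is closed, for example to hand its path to another program. A minimal sketch using the delete, prefix and suffix arguments of NamedTemporaryFile; the prefix and suffix values are made up for illustration.

```python
import os
from tempfile import NamedTemporaryFile

# delete=False keeps the file on disk after the with block closes it.
with NamedTemporaryFile('w+t', prefix='mytemp', suffix='.txt',
                        delete=False) as f:
    f.write('Hello World\n')
    filename = f.name

still_there = os.path.exists(filename)

with open(filename, 'rt') as check:
    contents = check.read()

# With delete=False, removing the file is the caller's responsibility.
os.remove(filename)
gone = not os.path.exists(filename)

print(os.path.basename(filename), still_there, gone)
```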

(2) Creating a temporary directory

from tempfile import TemporaryDirectory

with TemporaryDirectory() as dirname:
    print('dirname is:', dirname)
    # Use the directory
    ...
# The temporary directory is destroyed automatically

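19. Communicating with Serial Ports

You want to read and write data over a serial port, typically to exchange data with a hardware device. The examples use the third-party pySerial package.

Configuring the serial port:

import serial
ser = serial.Serial('/dev/tty.usbmodem641',  # device name
                    baudrate=9600,
                    bytesize=8,
                    parity='N',
                    stopbits=1)

Reading and writing data:

ser.write(b'G1 X50 Y50\r\n')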
resp = ser.readline()

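20. Serializing Python Objects

You want to serialize a Python object into a byte stream so that you can save it to a file, store it in a database, or send it over a network.

import pickle

data = [1, 2, ('a', 'b', {'w': 1, 't': 2})]

# Dump a Python object to a file
with open('somefile', 'wb') as f:
    pickle.dump(data, f)

# Restore the object from the file
with open('somefile', 'rb') as f:
    data = pickle.load(f)

# Dump a Python object to a byte string
s = pickle.dumps(data)
# Restore the object from the byte string
data = pickle.loads(s)

Notes:

(1) For applications in which pickled data is shared between interpreters on different machines, saving data this way can be a problem, because every machine must have access to the same source code.

(2) Some kinds of objects cannot be pickled. These are typically objects that depend on external system state, such as open files, network connections, threads, processes, stack frames, and so on.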
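
The second note can be checked directly. In the sketch below, can_pickle is a hypothetical helper written for this illustration; it reports whether pickle.dumps() accepts an object.

```python
import pickle
import threading


def can_pickle(obj):
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


results = {
    'list': can_pickle([1, 2, ('a', 'b', {'w': 1, 't': 2})]),
    'lock': can_pickle(threading.Lock()),              # tied to OS-level state
    'generator': can_pickle(i for i in range(3)),      # holds a live stack frame
}

print(results)
```

Classes that hold such objects can still be made picklable by defining __getstate__() and __setstate__() to leave out the problematic attribute and recreate it on load.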
