Reading and writing byte-by-byte for compression

Posted: 2016-04-28 14:04:53

Question:

I am trying to implement the Lempel-Ziv-Welch algorithm in Python, but I am having trouble writing the output file in binary format.

# Note: this is Python 2 code (it uses print >> and str.encode('hex'))
import os
import sys

action = sys.argv[3]
if action == "compress":
    # initialize dictionary
    dictionary = {}
for i in range(0,256):
    # for single characters, the value is the same as the key
    # in the compressed file, these would appear as is
    dictionary[chr(i)] = i 
input_file = open(sys.argv[1], 'rb+')
output_file = open(sys.argv[2], 'wb')

data = input_file.read()
i = 0
j = 1
# current_data starts as the first byte of the input
# (data already holds the whole file, so no further read() is needed)
current_data = data[i:j]
# look for the shortest string not in the dictionary
while i < len(data) - 2:
    while current_data in dictionary.keys():
        if j < len(data) + 1:
            j = j + 1
            current_data = data[i:j]
        else:
            break
    # once the shortest string is found, add it to the dictionary 
    if current_data not in dictionary.keys():
        dictionary[current_data] = len(dictionary)
        thing_to_write = dictionary[current_data[:-1]]
        i = j - 1
        current_data = data[i:j]
    else:
        thing_to_write = dictionary[current_data]
        i = i + 1
        j = i + 1
    # then write to the output file the code for the found string minus its last character (the longest string that is in the dictionary)
    mylist = []
    thing_to_write = format(thing_to_write,'x')
    thing_to_write = thing_to_write
    for char in thing_to_write:
        mylist.append(char.encode('hex'))
        for elem in mylist:
            output_file.write(elem)
input_file.close()
output_file.close()
print >> sys.stderr, "The size of " + sys.argv[1] + " is " + str(os.path.getsize(sys.argv[1])) + " bytes." + "\n" + "The size of " + sys.argv[2] + " is " + str(os.path.getsize(sys.argv[2])) + " bytes."

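I have tried writing the codes in many different formats, such as hexadecimal and binary, but I think I am just writing them out as 8-bit characters. How do I write raw binary?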

Comments:

What does "I am having trouble" mean? Do you get an error message? If so, add the full message to the question. See "How to create a Minimal, Complete, and Verifiable example".

Answer 1:

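It is not clear what you are trying to write. The codes you produce can end up larger than 255, so I assume you want to write 2-byte unsigned integers to the output file?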
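To make the size issue concrete, here is a small sketch (assuming Python 3; the helper name `bytes_needed` and the sample code value 300 are made up for illustration). It shows that a code of 256 or more no longer fits in one byte, and contrasts the hex-text form of a code with a fixed 2-byte raw form.

```python
def bytes_needed(code):
    """Return the minimum number of bytes needed to hold a non-negative int."""
    return max(1, (code.bit_length() + 7) // 8)


code = 300  # hypothetical LZW code beyond the initial 0..255 entries

# Writing the hex digits as text stores one ASCII character per digit.
as_hex_text = format(code, 'x').encode('ascii')

# Writing the integer itself as two raw bytes (little-endian is an assumption).
as_raw = code.to_bytes(2, 'little')

print(bytes_needed(255), bytes_needed(256))
print(as_hex_text, as_raw)
```

The hex-text form also has a variable length, so a reader could not tell where one code ends and the next begins without some separator.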

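If that is the case, I suggest looking into Python's struct.pack function, which is designed to convert data from Python types to a binary representation. If your data is byte-sized, you can just use output_file.write(chr(x)) to write each character (that form is for Python 2; under Python 3 the equivalent is output_file.write(bytes([x]))).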
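As a minimal sketch of that suggestion (assuming Python 3, and assuming a little-endian layout via the `'<H'` format, which the answer itself does not specify), the following packs a few example codes into 2-byte fields and reads them back. An `io.BytesIO` buffer stands in for a file opened with mode `'wb'`.

```python
import io
import struct

# Example values only; 65535 is the largest integer the 'H' format can hold.
codes = [65, 256, 4095, 65535]

buf = io.BytesIO()  # stands in for a file opened with mode 'wb'
for code in codes:
    # '<H' = little-endian unsigned 16-bit integer, always exactly 2 bytes
    buf.write(struct.pack('<H', code))

raw = buf.getvalue()

# Reading back: unpack consecutive 2-byte fields from the raw bytes.
decoded = [value for (value,) in struct.iter_unpack('<H', raw)]
print(len(raw), decoded)
```

Because every code occupies exactly two bytes, the reader needs no separators. The trade-off is that codes above 65535 cannot be represented and `struct.pack` raises `struct.error` for them.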

The following version uses Python's struct.pack():

import os
os.chdir(os.path.dirname(os.path.abspath(__file__)))

import sys
import struct

action = sys.argv[3]

if action == "compress":
    # initialize dictionary
    dictionary = {}

for i in range(0,256):
    # for single characters, the value is the same as the key
    # in the compressed file, these would appear as is
    dictionary[chr(i)] = i 

input_file = open(sys.argv[1], 'rb')
output_file = open(sys.argv[2], 'wb')

data = input_file.read()

i = 0
j = 1
# current_data starts as the first byte of the input
# (data already holds the whole file, so no further read() is needed)
current_data = data[i:j]

# look for the shortest string not in the dictionary

while i < len(data) - 2:
    while current_data in dictionary.keys():
        if j < len(data) + 1:
            j = j + 1
            current_data = data[i:j]
        else:
            break

    # once the shortest string is found, add it to the dictionary 
    if current_data not in dictionary.keys():
        dictionary[current_data] = len(dictionary)
        thing_to_write = dictionary[current_data[:-1]]
        i = j - 1
        current_data = data[i:j]
    else:
        thing_to_write = dictionary[current_data]
        i = i + 1
        j = i + 1

    # then write to the output file the code for the found string minus its last character (the longest string that is in the dictionary)
    # 'H' packs each code as a 2-byte unsigned integer in native byte order; codes above 65535 raise struct.error
    output_file.write(struct.pack('H', thing_to_write))

input_file.close()
output_file.close()

print >> sys.stderr, "The size of " + sys.argv[1] + " is " + str(os.path.getsize(sys.argv[1])) + " bytes." + "\n" + "The size of " + sys.argv[2] + " is " + str(os.path.getsize(sys.argv[2])) + " bytes."
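For comparison, here is a sketch of the textbook formulation of LZW with fixed 2-byte codes. It is not the code from the answer above. It assumes Python 3, and the function names `lzw_compress` and `lzw_decompress`, the little-endian `'<H'` layout, and the 65536-entry cap on the dictionary are all illustrative choices.

```python
import struct

MAX_ENTRIES = 65536  # assumption: codes must fit in an unsigned 16-bit field


def lzw_compress(data: bytes) -> bytes:
    """Emit one 16-bit code for each longest prefix already in the table."""
    table = {bytes([i]): i for i in range(256)}
    out = bytearray()
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate
        else:
            out += struct.pack("<H", table[current])
            if len(table) < MAX_ENTRIES:  # stop growing once codes would overflow
                table[candidate] = len(table)
            current = bytes([value])
    if current:
        out += struct.pack("<H", table[current])
    return bytes(out)


def lzw_decompress(blob: bytes) -> bytes:
    """Rebuild the table while decoding; mirrors lzw_compress."""
    table = {i: bytes([i]) for i in range(256)}
    codes = [code for (code,) in struct.iter_unpack("<H", blob)]
    if not codes:
        return b""
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry the compressor added one step ago.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code: %d" % code)
        out += entry
        if len(table) < MAX_ENTRIES:
            table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)


sample = b"TOBEORNOTTOBEORTOBEORNOT"
packed = lzw_compress(sample)
print(len(sample), len(packed), lzw_decompress(packed) == sample)
```

Since every code costs two bytes, short or non-repetitive inputs can come out larger than the original. Variable-width codes would reduce that overhead, at the cost of bit-level packing.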
