How to update numbers of subchapter in the given chapter in a text file?
[Posted]: 2017-03-25 23:49:40
[Question]: I have a text file containing the table of contents of a book. I need to create an index.txt file that can be read by ghostscript.
The text file is available HERE and looks like this:
Chapter 1 Introduction 1
Chapter 2 Fundamental Observations 7
2.1 Dark night sky 7
2.2 Isotropy and homogeneity 11
2.3 Redshift proportional to distance 15
2.4 Types of particles 22
2.5 Cosmic microwave background 28
Chapter 3 Newton Versus Einstein 32
3.1 Equivalence principle 33
3.2 Describing curvature 39
3.3 Robertson-Walker metric 44
3.4 Proper distance 47
This must be changed to:
[/Count -0 /Page 7 /Title (Chapter: 1 Introduction ) /OUT pdfmark
[/Count -5 /Page 13 /Title (Chapter: 2 Fundamental Observations ) /OUT pdfmark
[/Count 0 /Page 13 /Title (Chapter: 2.1 Dark night sky ) /OUT pdfmark
[/Count 0 /Page 17 /Title (Chapter: 2.2 Isotropy and homogeneity ) /OUT pdfmark
[/Count 0 /Page 21 /Title (Chapter: 2.3 Redshift proportional to distance ) /OUT pdfmark
[/Count 0 /Page 28 /Title (Chapter: 2.4 Types of particles ) /OUT pdfmark
[/Count 0 /Page 34 /Title (Chapter: 2.5 Cosmic microwave background ) /OUT pdfmark
[/Count -4 /Page 38 /Title (Chapter: 3 Newton Versus Einstein ) /OUT pdfmark
[/Count 0 /Page 39 /Title (Chapter: 3.1 Equivalence principle ) /OUT pdfmark
[/Count 0 /Page 45 /Title (Chapter: 3.2 Describing curvature ) /OUT pdfmark
[/Count 0 /Page 50 /Title (Chapter: 3.3 Robertson-Walker metric ) /OUT pdfmark
[/Count 0 /Page 53 /Title (Chapter: 3.4 Proper distance ) /OUT pdfmark
In the above, note that:
Count = number of subchapters in the given chapter
Page = page given in the table of contents + 6
How can we do this?
So far, I have tried this:
def get_Count_Page_and_Title(bookmark, offset=6):
    """Get chapters and page numbers."""
    with open(bookmark, 'r') as fi, open('temp_index.txt', 'w') as fo:
        for line in fi:
            # Placeholder: every input line is replaced by the same fixed string.
            line = r'[/Count -0 /Page 0 /Title (Chapter: 1 Introduction ) /OUT pdfmark'
            print(line, file=fo)
Some related links: python reading text file; Read .txt file line by line in Python
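[Answer 1]: Here is one way to parse the file. This code uses simple string matching to tell chapter lines from subchapter lines. It then groups each subchapter with its enclosing chapter. Finally, it iterates over that data to produce the desired output.
Code: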
def print_count_page_and_title(data, page_offset=0):
    """Get chapters and page numbers."""
    chapters = []
    chapter = None
    for line in data:
        if line.startswith('Chapter'):
            if chapter is not None:
                chapters.append(chapter)
            chapter = (line.strip().rsplit(' ', 1), [])
        else:
            chapter[1].append(line.strip().rsplit(' ', 1))
    if chapter is not None:
        chapters.append(chapter)

    def page_num(page):
        return int(page) + page_offset

    fmt_chapter = '[/Count -%d /Page %d /Title (%s) /OUT pdfmark'
    fmt_sub_chapter = '[/Count 0 /Page %d /Title (%s) /OUT pdfmark'
    for chapter in chapters:
        print(fmt_chapter % (
            len(chapter[1]), page_num(chapter[0][1]), chapter[0][0]))
        for sub_chapter in chapter[1]:
            print(fmt_sub_chapter % (
                page_num(sub_chapter[1]), sub_chapter[0]))

print_count_page_and_title(test_data, page_offset=6)
Test data:
from io import StringIO
test_data = StringIO(u'\n'.join([x.strip() for x in """
Chapter 1 Introduction 1
Chapter 2 Fundamental Observations 7
2.1 Dark night sky 7
2.2 Isotropy and homogeneity 11
2.3 Redshift proportional to distance 15
2.4 Types of particles 22
2.5 Cosmic microwave background 28
Chapter 3 Newton Versus Einstein 32
3.1 Equivalence principle 33
3.2 Describing curvature 39
3.3 Robertson-Walker metric 44
3.4 Proper distance 47
""".split('\n')[1:-1]]))
Result:
[/Count -0 /Page 7 /Title (Chapter 1 Introduction) /OUT pdfmark
[/Count -5 /Page 13 /Title (Chapter 2 Fundamental Observations) /OUT pdfmark
[/Count 0 /Page 13 /Title (2.1 Dark night sky ) /OUT pdfmark
[/Count 0 /Page 17 /Title (2.2 Isotropy and homogeneity ) /OUT pdfmark
[/Count 0 /Page 21 /Title (2.3 Redshift proportional to distance ) /OUT pdfmark
[/Count 0 /Page 28 /Title (2.4 Types of particles ) /OUT pdfmark
[/Count 0 /Page 34 /Title (2.5 Cosmic microwave background ) /OUT pdfmark
[/Count -4 /Page 38 /Title (Chapter 3 Newton Versus Einstein) /OUT pdfmark
[/Count 0 /Page 39 /Title (3.1 Equivalence principle ) /OUT pdfmark
[/Count 0 /Page 45 /Title (3.2 Describing curvature ) /OUT pdfmark
[/Count 0 /Page 50 /Title (3.3 Robertson-Walker metric ) /OUT pdfmark
[/Count 0 /Page 53 /Title (3.4 Proper distance) /OUT pdfmark
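The parsing step relies on rsplit(' ', 1), which splits only at the last space and so separates the trailing page number from a title that itself contains spaces. A minimal sketch of that step in isolation; split_title_and_page is a hypothetical helper name used here for illustration and is not part of the answer's code:

```python
def split_title_and_page(line, page_offset=0):
    """Split a table-of-contents line into (title, shifted page number)."""
    # rsplit with maxsplit=1 cuts at the last space only, so spaces
    # inside the title are preserved.
    title, page = line.strip().rsplit(' ', 1)
    return title, int(page) + page_offset


if __name__ == "__main__":
    # Page 15 in the table of contents plus an offset of 6 gives page 21.
    print(split_title_and_page('2.3 Redshift proportional to distance 15', 6))
```

This sketch assumes every line ends with a page number; a blank line or a line without a trailing number would raise an error, as it would in the answer's code.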
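[Comments]:
Thanks for the answer. However, it only works for StringIO objects; when I try to read the data from a text file, it raises an error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1292: ordinal not in range(128). The input was read with test_data = open('toc_ryden.txt','r').readlines()
[Answer 2]: First of all, thanks to @Stephen Rauch. Usage of the code above: if we have a PDF document and want to create bookmarks for it, we can use the following code.
Note: the output of the code above must first be written to a text file named index.txt.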
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Author      : Bhishan Poudel; Physics PhD Student, Ohio University
# Date        : Jan 22, 2017
#
# Imports
import os
import subprocess


def create_bookmarked_pdf(inpdf, outpdf):
    """Create clickable pdf."""
    # ghostscript reads the pdfmark lines in index.txt, then the input pdf
    commands = "gs -sDEVICE=pdfwrite -q -dBATCH -dNOPAUSE -sOutputFile=" +\
        outpdf + ' index.txt -f ' + inpdf
    print('Creating :', outpdf)
    subprocess.call(commands, shell=True)


def main():
    """Run main function."""
    # create clickable index in pdf
    inpdf = 'ryden.pdf'
    outpdf = 'output.pdf'
    create_bookmarked_pdf(inpdf, outpdf)

    # delete tmp files
    if os.path.exists('index.txt'):
        # os.remove('index.txt')
        pass


if __name__ == "__main__":
    import time

    # beginning time
    program_begin_time = time.time()
    begin_ctime = time.ctime()

    # Run the main program
    main()

    # print the time taken
    program_end_time = time.time()
    end_ctime = time.ctime()
    seconds = program_end_time - program_begin_time
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    d, h = divmod(h, 24)
    print("\nBegin time: ", begin_ctime)
    print("End time: ", end_ctime, "\n")
    print("Time taken: {0: .0f} days, {1: .0f} hours, "
          "{2: .0f} minutes, {3: f} seconds.".format(d, h, m, s))
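The ghostscript command can also be passed to subprocess as a list of arguments, which avoids shell=True and keeps file names containing spaces intact. A minimal sketch, assuming gs is the ghostscript executable on the PATH; build_gs_command and run_gs are hypothetical helper names, and the flags are the ones used in the script above:

```python
import shutil
import subprocess


def build_gs_command(inpdf, outpdf, index='index.txt'):
    """Return the ghostscript argument list used to add the bookmarks."""
    return ['gs', '-sDEVICE=pdfwrite', '-q', '-dBATCH', '-dNOPAUSE',
            '-sOutputFile=' + outpdf, index, '-f', inpdf]


def run_gs(inpdf, outpdf, index='index.txt'):
    """Run ghostscript; return its exit code, or None if gs is missing."""
    if shutil.which('gs') is None:
        print('ghostscript (gs) was not found on the PATH')
        return None
    return subprocess.call(build_gs_command(inpdf, outpdf, index))


if __name__ == "__main__":
    # Only builds and shows the command; call run_gs(...) to execute it.
    print(build_gs_command('ryden.pdf', 'output.pdf'))
```

Returning the exit code lets the caller detect a failed conversion instead of assuming output.pdf was written.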
[Answer 3]: I modified the answer above slightly so that it reads the data from one text file and writes the result to another. The code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Author      : Stephen Rauch
# Modified by : Bhishan Poudel; Physics PhD Student, Ohio University
# Date        : Mar 5, 2017
# pastebin link for index.txt: http://pastebin.com/LP8KXAmU


def print_count_page_and_title(data, page_offset=0):
    """Get chapters and page numbers."""
    print('Creating: ', 'index.txt')

    chapters = []
    chapter = None
    for line in data:
        if line.startswith('Chapter'):
            if chapter is not None:
                chapters.append(chapter)
            # chapter is a tuple of two lists:
            #   first list : [title, page], split at the last space by rsplit
            #   second list: the subchapters, initially empty
            # e.g. the line 'Chapter 1 Introduction 1' gives
            #   (['Chapter 1 Introduction', '1'], [])
            chapter = (line.strip().rsplit(' ', 1), [])
        else:
            subchapter = line.strip().rsplit(' ', 1)
            chapter[1].append(subchapter)
    if chapter is not None:
        chapters.append(chapter)

    def page_num(page):
        return int(page) + page_offset

    fmt_chapter = '[/Count -%d /Page %d /Title (%s) /OUT pdfmark'
    fmt_sub_chapter = '[/Count 0 /Page %d /Title (%s) /OUT pdfmark'

    # the with-block closes index.txt even if formatting a line fails
    with open('index.txt', 'w', encoding='utf-8') as fo:
        for chapter in chapters:
            print(fmt_chapter % (
                len(chapter[1]), page_num(chapter[0][1]), chapter[0][0]),
                file=fo)
            for sub_chapter in chapter[1]:
                print(fmt_sub_chapter % (
                    page_num(sub_chapter[1]), sub_chapter[0]), file=fo)


if __name__ == "__main__":
    with open('toc_ryden.txt', 'r', encoding='utf-8') as fi:
        test_data = fi.readlines()
    print_count_page_and_title(test_data, page_offset=6)
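Passing encoding='utf-8' to open is what avoids the UnicodeDecodeError reported in the comment on the first answer: byte 0xe2 is not valid ASCII, but it is a common lead byte of multi-byte UTF-8 characters. A minimal sketch of the difference, assuming (this is a guess; the actual character in toc_ryden.txt is not known) that the byte belonged to an en dash:

```python
# Assumption: an en dash, which UTF-8 encodes as the bytes e2 80 93.
raw = 'Newton \u2013 Einstein'.encode('utf-8')


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(try_decode(raw, 'ascii'))   # None: 0xe2 is outside the ASCII range
    print(try_decode(raw, 'utf-8'))   # the original text, en dash included
```

If the file was saved in a different encoding, the same pattern applies with that encoding's name in place of 'utf-8'.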