如何更新文本文件中给定章节的子章节数?

Posted

技术标签:

【中文标题】如何更新文本文件中给定章节的子章节数?【英文标题】:How to update numbers of subchapter in the given chapter in a text file? 【发布时间】:2017-03-25 23:49:40 【问题描述】:

我有一个包含书籍目录的文本文件。 我必须创建一个可以被 ghostscript 读取的 index.txt 文件。

文本文件可用HERE,看起来像:

Chapter 1 Introduction 1
Chapter 2 Fundamental Observations 7
2.1 Dark night sky   7
2.2 Isotropy and homogeneity  11
2.3 Redshift proportional to distance  15
2.4 Types of particles  22
2.5 Cosmic microwave background  28
Chapter 3 Newton Versus Einstein 32
3.1 Equivalence principle  33
3.2 Describing curvature  39
3.3 Robertson-Walker metric  44
3.4 Proper distance 47   

这必须改为:

[/Count -0 /Page 7 /Title (Chapter: 1 Introduction ) /OUT pdfmark
[/Count -5 /Page 13 /Title (Chapter: 2 Fundamental Observations ) /OUT pdfmark
[/Count 0 /Page 13 /Title (Chapter: 2.1 Dark night sky   ) /OUT pdfmark
[/Count 0 /Page 17 /Title (Chapter: 2.2 Isotropy and homogeneity  ) /OUT pdfmark
[/Count 0 /Page 21 /Title (Chapter: 2.3 Redshift proportional to distance  ) /OUT pdfmark
[/Count 0 /Page 28 /Title (Chapter: 2.4 Types of particles  ) /OUT pdfmark
[/Count 0 /Page 34 /Title (Chapter: 2.5 Cosmic microwave background  ) /OUT pdfmark
[/Count -4 /Page 38 /Title (Chapter: 3 Newton Versus Einstein ) /OUT pdfmark
[/Count 0 /Page 39 /Title (Chapter: 3.1 Equivalence principle  ) /OUT pdfmark
[/Count 0 /Page 45 /Title (Chapter: 3.2 Describing curvature  ) /OUT pdfmark
[/Count 0 /Page 50 /Title (Chapter: 3.3 Robertson-Walker metric  ) /OUT pdfmark
[/Count 0 /Page 53 /Title (Chapter: 3.4 Proper distance   ) /OUT pdfmark

在上面,请注意:

Count = number of sub chapter in the given chapter  
Page = given page in table of content + 6 

我们怎样才能做到这一点?

到目前为止,我已经尝试过了。

def get_Count_Page_and_Title(bookmark, offset=6):
    """Get chapters and page numbers."""
    with open(bookmark, 'r') as fi, open('temp_index.txt', 'w') as fo:
        for line in fi:
            line = r'[/Count -0 /Page 0 /Title (Chapter: 1 Introduction ) /OUT pdfmark'
            print(line, file = fo)

部分相关链接为:python reading text fileRead .txt file line by line in Python

【问题讨论】:

【参考方案1】:

这是解析文件的一种方法。此代码使用简单的字符串匹配来区分章节行和子章节行。然后它将每个子章节与其封闭的章节组合在一起。最后,它将遍历这些数据以生成所需的输出。

代码:

def print_count_page_and_title(data, page_offset=0):
    """Get chapters and page numbers."""
    chapters = []
    chapter = None
    for line in data:
        if line.startswith('Chapter'):
            if chapter is not None:
                chapters.append(chapter)
            chapter = (line.strip().rsplit(' ', 1), [])
        else:
            chapter[1].append(line.strip().rsplit(' ', 1))

    if chapter is not None:
        chapters.append(chapter)

    def page_num(page):
        return int(page) + page_offset

    fmt_chapter = '[/Count -%d /Page %d /Title (%s) /OUT pdfmark'
    fmt_sub_chapter = '[/Count 0 /Page %d /Title (%s) /OUT pdfmark'

    for chapter in chapters:
        print(fmt_chapter % (
            len(chapter[1]), page_num(chapter[0][1]), chapter[0][0]))
        for sub_chapter in chapter[1]:
            print(fmt_sub_chapter % (
                page_num(sub_chapter[1]), sub_chapter[0]))

print_count_page_and_title(test_data, page_offset=6)

测试数据:

from io import StringIO

test_data = StringIO(u'\n'.join([x.strip() for x in """
    Chapter 1 Introduction 1
    Chapter 2 Fundamental Observations 7
    2.1 Dark night sky   7
    2.2 Isotropy and homogeneity  11
    2.3 Redshift proportional to distance  15
    2.4 Types of particles  22
    2.5 Cosmic microwave background  28
    Chapter 3 Newton Versus Einstein 32
    3.1 Equivalence principle  33
    3.2 Describing curvature  39
    3.3 Robertson-Walker metric  44
    3.4 Proper distance 47   
""".split('\n')[1:-1]]))

结果:

[/Count -0 /Page 7 /Title (Chapter 1 Introduction) /OUT pdfmark
[/Count -5 /Page 13 /Title (Chapter 2 Fundamental Observations) /OUT pdfmark
[/Count 0 /Page 13 /Title (2.1 Dark night sky  ) /OUT pdfmark
[/Count 0 /Page 17 /Title (2.2 Isotropy and homogeneity ) /OUT pdfmark
[/Count 0 /Page 21 /Title (2.3 Redshift proportional to distance ) /OUT pdfmark
[/Count 0 /Page 28 /Title (2.4 Types of particles ) /OUT pdfmark
[/Count 0 /Page 34 /Title (2.5 Cosmic microwave background ) /OUT pdfmark
[/Count -4 /Page 38 /Title (Chapter 3 Newton Versus Einstein) /OUT pdfmark
[/Count 0 /Page 39 /Title (3.1 Equivalence principle ) /OUT pdfmark
[/Count 0 /Page 45 /Title (3.2 Describing curvature ) /OUT pdfmark
[/Count 0 /Page 50 /Title (3.3 Robertson-Walker metric ) /OUT pdfmark
[/Count 0 /Page 53 /Title (3.4 Proper distance) /OUT pdfmark

【讨论】:

感谢您的回答,但是,它仅适用于 StringIO 对象,当我尝试从文本文件中读取数据时,它给出了错误:UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2在位置 1292:序数不在范围内(128)输入文件是 test_data = open('toc_ryden.txt','r').readlines()【参考方案2】:

首先,感谢@Stephen Rauch。 上述代码的用法: 如果我们有任何 pdf 文档并且我们想为其创建书签,我们可以使用以下代码:

注意:我们需要将上述代码的输出写入一个名为index.txt的文本文件中

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Author      : Bhishan Poudel; Physics PhD Student, Ohio University
# Date        : Jan 22, 2017
#
# Imports
import io
import subprocess
import os
from pdfrw import PdfReader, PdfWriter
from natsort import natsorted
import glob


def create_bookmarked_pdf(inpdf, outpdf):
    """Create clickable pdf."""
    # input/output files
    inpdf = inpdf
    outpdf = outpdf
    commands = "gs -sDEVICE=pdfwrite -q -dBATCH -dNOPAUSE  -sOutputFile=" +\
        outpdf + ' index.txt -f ' + inpdf
    print('  '.format('Creating : ', outpdf, ''))
    subprocess.call(commands, shell=True)


def main():
    """Run main function."""
    # create clickable index in pdf
    inpdf = 'ryden.pdf'
    outpdf = 'output.pdf'
    create_bookmarked_pdf(inpdf, outpdf)

    # delete tmp files
    if os.path.exists('index.txt'):
        # os.remove('index.txt')
        pass


if __name__ == "__main__":
    import time

    # beginning time
    program_begin_time = time.time()
    begin_ctime        = time.ctime()

    #  Run the main program
    main()

    # print the time taken
    program_end_time = time.time()
    end_ctime        = time.ctime()
    seconds          = program_end_time - program_begin_time
    m, s             = divmod(seconds, 60)
    h, m             = divmod(m, 60)
    d, h             = divmod(h, 24)
    print("nBegin time: ", begin_ctime)
    print("End   time: ", end_ctime, "\n")
    print("Time taken: 0: .0f days, 1: .0f hours, \
      2: .0f minutes, 3: f seconds.".format(d, h, m, s))

【讨论】:

【参考方案3】:

我稍微修改了上面的答案,这样我就可以从一个文本文件中读取数据并写入另一个文本文件。 代码如下:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Author      : Stephen Rauch
# Modified by : Bhishan Poudel; Physics PhD Student, Ohio University
# Date        : Mar 5, 2017
# pastebin link for index.txt: http://pastebin.com/LP8KXAmU


def print_count_page_and_title(data, page_offset=0):
    """Get chapters and page numbers."""
    fo = open('index.txt', 'w', encoding='utf-8')
    print('Creating: ', 'index.txt')
    chapters = []
    chapter = None
    for line in data:
        if line.startswith('Chapter'):
            if chapter is not None:
                chapters.append(chapter)
            chapter = (line.strip().rsplit(' ', 1), [])
            # chapter is tuple of two lists
            # second list is empty list
            # first list has two elements,
            # second element is separated by white space in end by rsplit.
            # print(line)
            # Chapter 1 Introduction 1
            # print(chapter)
            # (['Chapter 1 Introduction', '1'], [])
            # print("\n")
        else:
            subchapter = line.strip().rsplit(' ', 1)
            chapter[1].append(subchapter)

    if chapter is not None:
        chapters.append(chapter)

    def page_num(page):
        return int(page) + page_offset

    fmt_chapter = '[/Count -%d /Page %d /Title (%s) /OUT pdfmark'
    fmt_sub_chapter = '[/Count 0 /Page %d /Title (%s) /OUT pdfmark'

    for chapter in chapters:
        print(fmt_chapter % (
            len(chapter[1]), page_num(chapter[0][1]), chapter[0][0]), file=fo)
        for sub_chapter in chapter[1]:
            print(fmt_sub_chapter % (
                page_num(sub_chapter[1]), sub_chapter[0]), file=fo)
        pass
    fo.close()

if __name__ == "__main__":
    test_data = open('toc_ryden.txt', 'r', encoding='utf-8').readlines()
    print_count_page_and_title(test_data, page_offset=6)

【讨论】:

以上是关于如何更新文本文件中给定章节的子章节数?的主要内容,如果未能解决你的问题,请参考以下文章

PIG 脚本根据特定单词将大型文本文件拆分为多个部分

章节十2-如何点击链接按钮和操作文本框

如何在 LaTeX 中给章节标题加下划线?

UIScrollView里面章节长文本的构造

如何根据给定的行数范围从文本文件中分割数据

使用啥代替 FirstOrDefault 来获取与 C# 中使用 linq 中所选章节关联的 json 文本?