Python 3.6 script surprisingly slow on Windows 10 but not on Ubuntu 17.10

Posted 2023-03-06


【Title】: Python 3.6 script surprisingly slow on Windows 10 but not on Ubuntu 17.10 【Posted】: 2018-02-09 23:58:37 【Question】:

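I recently had to write a coding challenge for a company: merge 3 CSV files into one, keyed on the first attribute of each file (the attribute is repeated in all files).

I wrote the code and sent it to them, but they said it took 2 minutes to run. That was odd, because it ran in 10 seconds on my machine. My machine has the same processor, 16 GB of RAM, and an SSD as well, so the environments are very similar.

I tried to optimize it and resubmitted. This time they said they ran it on an Ubuntu machine and it took 11 seconds, while the same code still took 100 seconds on Windows 10.

Another strange thing: when I tried to profile it with the profile module, it ran seemingly forever and had to be terminated after 450 seconds. I switched to cProfile and it recorded 7 seconds.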

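EDIT: The exact wording of the problem is

Write a console program to merge the files provided in a timely and efficient manner. File paths should be supplied as arguments so that the program can be evaluated on different data sets. The merged file should be saved as CSV; use the id column as the unique merge key; the program should do any necessary data cleaning and error checking.

Feel free to use any language you are comfortable with. The only restriction is no external libraries, as this defeats the purpose of the test. If the language provides CSV parsing libraries (like Python), please avoid using them as well, as this is part of the test.

Without further ado, here is the code: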

#!/usr/bin/python3

import sys
from multiprocessing import Pool

HEADERS = ['id']

def csv_tuple_quotes_valid(a_tuple):
    """
    checks if the quotes in each attribute of an entry (i.e. a tuple) agree with the CSV format

    returns True or False
    """
    for attribute in a_tuple:
        in_quotes = False
        attr_len = len(attribute)
        skip_next = False

        for i in range(0, attr_len):
            if not skip_next and attribute[i] == '\"':
                if i < attr_len - 1 and attribute[i + 1] == '\"':
                    skip_next = True
                    continue
                elif i == 0 or i == attr_len - 1:
                    in_quotes = not in_quotes
                else:
                    return False
            else:
                skip_next = False

        if in_quotes:
            return False
    return True

def check_and_parse_potential_tuple(to_parse):
    """
    receives a string and returns an array of the attributes of the csv line
    if the string was not a valid csv line, then returns False
    """
    a_tuple = []
    attribute_start_index = 0
    to_parse_len = len(to_parse)
    in_quotes = False
    i = 0

    #iterate through the string (line from the csv)
    while i < to_parse_len:
        current_char = to_parse[i]

        #this works the following way: if we meet a quote ("), it must be in one
        #of five cases: "" | ", | ," | "\0 | (start_of_string)"
        #in case we are inside a quoted attribute (i.e. "123"), then commas are ignored
        #the following code also extracts the tuples' attributes 

        if current_char == '\"':
            if i == 0 or (to_parse[i - 1] == ',' and not in_quotes): # (start_of_string)" and ," case
                #not including the quote in the next attr
                attribute_start_index = i + 1

                #starting a quoted attr
                in_quotes = True
            elif i + 1 < to_parse_len:
                if to_parse[i + 1] == '\"': # "" case
                    i += 1 #skip the next " because it is part of a ""
                elif to_parse[i + 1] == ',' and in_quotes: # ", case
                    a_tuple.append(to_parse[attribute_start_index:i].strip())

                    #not including the quote and comma in the next attr
                    attribute_start_index = i + 2

                    in_quotes = False #the quoted attr has ended

                    #skip the next comma - we know what it is for
                    i += 1
                else:
                    #since we cannot have a random " in the middle of an attr
                    return False 
            elif i == to_parse_len - 1: # "\0 case
                a_tuple.append(to_parse[attribute_start_index:i].strip())

                #reached end of line, so no more attr's to extract
                attribute_start_index = to_parse_len

                in_quotes = False
            else:
                return False
        elif current_char == ',':
            if not in_quotes:
                a_tuple.append(to_parse[attribute_start_index:i].strip())
                attribute_start_index = i + 1

        i += 1

    #in case the last attr was left empty or unquoted
    if attribute_start_index < to_parse_len or (not in_quotes and to_parse[-1] == ','):
        a_tuple.append(to_parse[attribute_start_index:])

    #line ended while parsing; i.e. a quote was opened but not closed
    if in_quotes:
        return False

    return a_tuple


def parse_tuple(to_parse, no_of_headers):
    """
    parses a string and returns an array with no_of_headers number of headers

    raises an error if the string was not a valid CSV line
    """

    #get rid of the newline at the end of every line
    to_parse = to_parse.strip()

    # return to_parse.split(',') #if we assume the data is in a valid format

    #the following checking of the format of the data increases the execution
    #time by a factor of 2; if the data is known to be valid, uncomment the line 3 lines above here

    #if there are more commas than fields, then we must take into consideration
    #how the quotes parse and then extract the attributes
    if to_parse.count(',') + 1 > no_of_headers:
        result = check_and_parse_potential_tuple(to_parse)
        if result:
            a_tuple = result
        else:
            raise TypeError('Error while parsing CSV line %s. The quotes do not parse' % to_parse)
    else:
        a_tuple = to_parse.split(',')
        if not csv_tuple_quotes_valid(a_tuple):
            raise TypeError('Error while parsing CSV line %s. The quotes do not parse' % to_parse)

    #if the format is correct but more data fields were provided
    #the following works faster than an if statement that checks the length of a_tuple
    try:
        a_tuple[no_of_headers - 1]
    except IndexError:
        raise TypeError('Error while parsing CSV line %s. Unknown reason' % to_parse)

    #this replaces the use of my own hashtables to store the duplicated values for the attributes
    for i in range(1, no_of_headers):
        a_tuple[i] = sys.intern(a_tuple[i])

    return a_tuple


def read_file(path, file_number):
    """
    reads the csv file and returns (dict, int)

    the dict is the mapping of id's to attributes

    the integer is the number of attributes (headers) for the csv file
    """
    global HEADERS

    try:
        file = open(path, 'r')
    except FileNotFoundError as e:
        print("error in %s:\n%s\nexiting..." % (path, e))
        exit(1)

    main_table = {}
    headers = file.readline().strip().split(',')
    no_of_headers = len(headers)

    HEADERS.extend(headers[1:]) #keep the headers from the file

    lines = file.readlines()
    file.close()

    args = []
    for line in lines:
        args.append((line, no_of_headers))

    #pool is a pool of worker processes parsing the lines in parallel
    with Pool() as workers:
        try:
            all_tuples = workers.starmap(parse_tuple, args, 1000)
        except TypeError as e:
            print('Error in file %s:\n%s\nexiting thread...' % (path, e.args))
            exit(1)

    for a_tuple in all_tuples:
        #add quotes to key if needed
        key = a_tuple[0] if a_tuple[0][0] == '\"' else ('\"%s\"' % a_tuple[0])
        main_table[key] = a_tuple[1:]

    return (main_table, no_of_headers)

def merge_files():
    """
    produces a file called merged.csv 
    """
    global HEADERS

    no_of_files = len(sys.argv) - 1
    processed_files = [None] * no_of_files

    for i in range(0, no_of_files):
        processed_files[i] = read_file(sys.argv[i + 1], i)

    out_file = open('merged.csv', 'w+')

    merged_str = ','.join(HEADERS)

    all_keys = set()
    #this is to ensure that we include all keys in the final file.
    #even those that are missing from some files and present in others
    for processed_file in processed_files:
        all_keys.update(processed_file[0])

    for key in all_keys:
        merged_str += '\n%s' % key
        for i in range(0, no_of_files):
            (main_table, no_of_headers) = processed_files[i]

            try:
                for attr in main_table[key]:
                    merged_str += ',%s' % attr
            except KeyError:
                print('NOTE: no values found for id %s in file \"%s\"' % (key, sys.argv[i + 1]))
                merged_str += ',' * (no_of_headers - 1)

    out_file.write(merged_str)
    out_file.close()

if __name__ == '__main__':
    # merge_files()
    import cProfile
    cProfile.run('merge_files()')

# import time
# start = time.time()

# print(time.time() - start);

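Here is the profiler report I got on Windows.

EDIT: The rest of the csv data provided is here. Pastebin was taking too long to process the files, so...

It may not be the best code, and I know that, but my question is: what slows it down so much on Windows that does not slow it down on Ubuntu? The merge_files() function takes the longest, 94 seconds for itself alone, not counting calls to other functions. And nothing obvious stands out to me as the reason it is so slow.

Thanks

EDIT: Note: we both ran the code on the same dataset.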
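The "94 seconds for itself alone" figure corresponds to what cProfile reports as tottime (time spent inside a function, excluding the functions it calls). As a minimal sketch of how such a number can be read out, the snippet below profiles a small stand-in function and sorts the report by tottime. The names build_by_concatenation and profile_call are hypothetical and not part of the script above.

```python
import cProfile
import io
import pstats


def build_by_concatenation(n):
    # Hypothetical stand-in for the merge loop: grows one string with +=.
    out = ''
    for i in range(n):
        out += ',%d' % i
    return out


def profile_call(func, *args):
    """Run func under cProfile and return the report as text,
    sorted by tottime (time in the function itself, callees excluded)."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats('tottime').print_stats(5)  # show the top 5 entries
    return buffer.getvalue()


if __name__ == '__main__':
    print(profile_call(build_by_concatenation, 10000))
```

A function whose tottime is close to its cumtime is spending its time in its own statements (for example string concatenation) rather than in the functions it calls.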

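【Comments】:

One possibility: multiprocessing works differently on Windows and Linux. That could be one reason for the discrepancy here, but I don't know enough to say more with confidence.

Can you show a sample of the csv files? I'm not sure whether *** allows external links such as pastebin for uploading files, but it might help to better understand your code.

@juanpa.arrivillaga This was happening even before I used multiprocessing. I added it because they said it was too slow.

@DeliriousLettuce pastebin.com/huWNvMtP Here is part of one of the files.

@DimitarDimitrov Thanks for the sample, it helps to see the format, but are the complete files too large to upload all three somewhere? Without all three, it is hard to get accurate results.

【Answer 1】: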

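It turns out that Windows and Linux handle very long strings differently. When I moved out_file.write(merged_str) inside the outer for loop (for key in all_keys:) and stopped appending to merged_str, it ran in 11 seconds as expected. I don't know enough about either OS's memory management to explain why the difference is so large.

But I would say the second way (the Windows one) is the safer approach, because it is unreasonable to keep a 30 MB string in memory. It just turns out that Linux sees that and doesn't always try to keep the string in cache, or to rebuild it every time.

Funnily enough, I initially ran both writing strategies a few times on my Linux machine, and the one with the large string seemed to run faster, so I stuck with it. I guess you never know.

Here is the modified code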

    for key in all_keys:
        merged_str = '%s' % key
        for i in range(0, no_of_files):
            (main_table, no_of_headers) = processed_files[i]

            try:
                for attr in main_table[key]:
                    merged_str += ',%s' % attr
            except KeyError:
                print('NOTE: no values found for id %s in file \"%s\"' % (key, sys.argv[i + 1]))
                merged_str += ',' * (no_of_headers - 1)
        out_file.write(merged_str + '\n')

    out_file.close()
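A common way to avoid growing one large string is to build each row from a list with str.join and hand the rows to the file one at a time. PEP 8 recommends join over repeated += because CPython's in-place concatenation speed-up is an implementation detail. Whether that detail explains the Windows/Linux gap here was not verified in this thread. The sketch below is one possible shape, not the author's code: format_row, write_merged, tables and widths are hypothetical names, with widths holding the number of non-id columns per file.

```python
import io


def format_row(key, tables, widths):
    """Build one output line for `key`.

    tables: list of dicts mapping id -> list of attribute strings
    widths: number of non-id columns for each table
    (both are hypothetical stand-ins for processed_files above)
    """
    parts = [key]
    for table, width in zip(tables, widths):
        # Missing ids get empty fields so every row has the same column count.
        parts.extend(table.get(key, [''] * width))
    return ','.join(parts)


def write_merged(out, header, keys, tables, widths):
    """Write the header and one line per key to the file-like object `out`."""
    out.write(','.join(header) + '\n')
    # writelines consumes the generator row by row; no giant string is built.
    out.writelines(format_row(k, tables, widths) + '\n' for k in keys)


if __name__ == '__main__':
    tables = [{'"a"': ['1', '2'], '"b"': ['3', '4']}, {'"a"': ['x']}]
    buf = io.StringIO()
    write_merged(buf, ['id', 'c1', 'c2', 'c3'], ['"a"', '"b"'], tables, [2, 1])
    print(buf.getvalue())
```

Passing an open file instead of the StringIO buffer gives the same per-row writing behavior as the modified loop above, with the file object's own buffering deciding when data reaches the disk.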

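【Comments】:

I find it strange to see this behavior caused by the underlying memory management alone... 32 MB of data is definitely not that unusual, especially when working with images; if Windows swapped it out all the time, no image processing could be done efficiently. Is it possible that you are comparing 64-bit Linux CPython with 32-bit Windows CPython? Do the timings stay the same if you use 64-bit Windows CPython?

Python strings are immutable, which may explain the performance problem. Instead of appending a short string B to the existing large string A, a new string object C is created and the contents of A and B are copied into it. The memory allocated for A and B may then be freed. Allocating and freeing something like 32 MB in a loop can affect performance.

@J.J.Hakala That's true, but why is it so fast on Linux?

@MatteoItalia I just tried it with 64-bit Python on 64-bit Windows 10 and the performance is the same.

@DimitarDimitrov My guess is that malloc using mmap may have some effect.

【Answer 2】: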

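When I ran your solution on Ubuntu 16.04 with the three given files, it took about 8 seconds to complete. The only modification I made was to uncomment the timing code at the bottom and use it.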

$ python3 dimitar_merge.py file1.csv file2.csv file3.csv
NOTE: no values found for id "aaa5d09b-684b-47d6-8829-3dbefd608b5e" in file "file2.csv"
NOTE: no values found for id "38f79a49-4357-4d5a-90a5-18052ef03882" in file "file2.csv"
NOTE: no values found for id "766590d9-4f5b-4745-885b-83894553394b" in file "file2.csv"
8.039648056030273
$ python3 dimitar_merge.py file1.csv file2.csv file3.csv
NOTE: no values found for id "38f79a49-4357-4d5a-90a5-18052ef03882" in file "file2.csv"
NOTE: no values found for id "766590d9-4f5b-4745-885b-83894553394b" in file "file2.csv"
NOTE: no values found for id "aaa5d09b-684b-47d6-8829-3dbefd608b5e" in file "file2.csv"
7.78482985496521

I rewrote my first attempt without using csv from the standard library, and now get times of about 4.3 seconds.

$ python3 lettuce_merge.py file1.csv file2.csv file3.csv
4.332579612731934
$ python3 lettuce_merge.py file1.csv file2.csv file3.csv
4.305467367172241
$ python3 lettuce_merge.py file1.csv file2.csv file3.csv
4.27345871925354

Here is the code for my solution (lettuce_merge.py):

from collections import defaultdict


def split_row(csv_row):
    return [col.strip('"') for col in csv_row.rstrip().split(',')]


def merge_csv_files(files):
    file_headers = []
    merged_headers = []
    for i, file in enumerate(files):
        current_header = split_row(next(file))
        unique_key, *current_header = current_header
        if i == 0:
            merged_headers.append(unique_key)
        merged_headers.extend(current_header)
        file_headers.append(current_header)

    result = defaultdict(lambda: [''] * (len(merged_headers) - 1))
    for file_header, file in zip(file_headers, files):
        for line in file:
            key, *values = split_row(line)
            for col_name, col_value in zip(file_header, values):
                result[key][merged_headers.index(col_name) - 1] = col_value
        file.close()

    quotes = '"{}"'.format
    with open('lettuce_merged.csv', 'w') as f:
        f.write(','.join(quotes(a) for a in merged_headers) + '\n')
        for key, values in result.items():
            f.write(','.join(quotes(b) for b in [key] + values) + '\n')


if __name__ == '__main__':
    from argparse import ArgumentParser, FileType
    from time import time

    parser = ArgumentParser()
    parser.add_argument('files', nargs='*', type=FileType('r'))
    args = parser.parse_args()

    start_time = time()
    merge_csv_files(args.files)
    print(time() - start_time)

I'm sure this code could be optimized further, but sometimes just seeing another way to solve a problem can help spark new ideas.
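One candidate for such an optimization is the merged_headers.index(col_name) lookup, which searches the header list for every value and returns the first match if two files happen to share a column name. A minimal sketch of an alternative, assuming rows are already split, is to compute each file's starting column once and write its values as a block. The names column_slots, merge_rows and file_rows are hypothetical and not part of lettuce_merge.py.

```python
def column_slots(file_headers):
    """Return ([(offset, width), ...], total) for the merged row, id excluded.

    file_headers: one header list per file, without the id column.
    """
    slots = []
    position = 0
    for header in file_headers:
        slots.append((position, len(header)))
        position += len(header)
    return slots, position


def merge_rows(file_headers, file_rows):
    """Merge rows keyed by id.

    file_rows: one list per file of (key, values) pairs; a hypothetical,
    already-parsed stand-in for the open files used in the answer.
    """
    slots, total = column_slots(file_headers)
    result = {}
    for (offset, width), rows in zip(slots, file_rows):
        for key, values in rows:
            row = result.get(key)
            if row is None:
                row = result[key] = [''] * total
            values = values[:width]  # ignore surplus fields, like zip() does
            # Slice assignment writes this file's block of columns at once.
            row[offset:offset + len(values)] = values
    return result


if __name__ == '__main__':
    headers = [['a', 'b'], ['c']]
    rows = [[('k1', ['1', '2']), ('k2', ['3', '4'])],
            [('k1', ['x']), ('k3', ['y'])]]
    for key, values in merge_rows(headers, rows).items():
        print(key, values)
```

Whether this is measurably faster on the three test files is untested; it mainly removes a per-value list search and the ambiguity with duplicate column names.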

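【Comments】:

This is my first post, so sorry for not providing the necessary information. I have now updated the question to include the 3 csv files and the problem statement. They made the CSV parsing my responsibility, and I spent more time sorting that out than on the actual merging.

@DimitarDimitrov No problem! Now that I have the problem text and the actual files, I'll take another look. One last thing: are you sure your script's output is in the exact format they asked for? Did they confirm that (although slow) it did produce the correct output?

Thanks for taking the time to look into it. They confirmed that my results were correct and matched the results of their model solution.

@DimitarDimitrov Perfect, I'll take another look.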
