Splitting a large Excel file using pandas

Posted 2023-03-24


【Title】: Splitting a large Excel file using pandas 【Posted】: 2021-09-26 04:23:55 【Question】:

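I have tabular data in an xlsx file with many columns. There are gaps (blank rows) between chunks of rows, and I want to separate the chunks into different xlsx files. Each file should be named from the fourth column: the two characters at its start, with the serial number rearranged.

I also want to remove the 5th column from the output files.

For example: my data has several columns, and I want to split it into two chunks.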
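
Since the title asks for pandas, the splitting rule described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a tested solution for the asker's file: it assumes the separator rows arrive as completely empty (all-NaN) rows, that the file name is the first whitespace-delimited token of the fourth column, and that the fifth column is the one to drop. The names `split_on_blank_rows` and `write_chunks` are hypothetical.

```python
import pandas as pd


def split_on_blank_rows(df, name_col=3, drop_col=4):
    """Split df into chunks separated by fully blank rows.

    Returns a list of (name, chunk) pairs. name_col and drop_col are
    0-based positions (3 = fourth column, 4 = fifth column); both are
    assumptions taken from the question text.
    """
    is_blank = df.isna().all(axis=1)
    # Every blank row bumps the running group id; blank rows are then dropped.
    group_ids = is_blank.cumsum()[~is_blank]
    chunks = []
    for i, (_, chunk) in enumerate(df[~is_blank].groupby(group_ids)):
        names = chunk.iloc[:, name_col].dropna().astype(str)
        # File stem: first whitespace-delimited token of the name column,
        # with a fallback when the chunk has no usable value there.
        tokens = names.iloc[0].split() if len(names) else []
        name = tokens[0] if tokens else f"chunk_{i}"
        chunk = chunk.drop(columns=chunk.columns[drop_col])
        chunks.append((name, chunk.reset_index(drop=True)))
    return chunks


def write_chunks(chunks, out_dir="."):
    """Write each chunk to <out_dir>/<name>.xlsx (needs an Excel writer engine such as openpyxl)."""
    for name, chunk in chunks:
        chunk.to_excel(f"{out_dir}/{name}.xlsx", index=False)
```

A possible call sequence, assuming the input file is named x.xlsx, would be `write_chunks(split_on_blank_rows(pd.read_excel("x.xlsx")))`. Unlike the Perl answer in this thread, this sketch does not carry cell formatting over to the output files.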

【Comments】:

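What have you tried so far? I have no idea why this question was upvoted. Also, "pandas" is clearly a Python library and the title says to use pandas, so why the Perl tag?

【Answer 1】:

Could you try the following Perl script: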

#! /usr/bin/env perl

package Main;
use feature qw(say);
use strict;
use warnings;
use Spreadsheet::ParseXLSX;
use Excel::CloneXLSX::Format qw(translate_xlsx_format);
use Excel::Writer::XLSX;



{
    my $self = Main->new(
        input_file => 'x.xlsx',
        skip_cols => [4],   # 0-based index, i.e. the 5th column
    );
    $self->scan_input_file();
    $self->write_chunks();
    say "Done";
}


sub close_chunk_file {
    my ( $self ) = @_;

    if ($self->{workbook}) {
        $self->{workbook}->close();
    }
}

sub new {
    my ( $class, %args ) = @_;

    return bless \%args, $class;
}

# First pass: collect one output file name per chunk of rows.
sub scan_input_file {
    my ( $self ) = @_;

    my $parser = Spreadsheet::ParseXLSX->new;
    my $workbook = $parser->parse($self->{input_file});
    my $worksheet = $workbook->worksheet(0);
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    my $name;
    my @chunks;
    $self->save_header_line( $worksheet, $row_min );
    $row_min++;
    for my $row ( $row_min .. $row_max ) {
        my $col0 = 0;
        my $cell0 = $worksheet->get_cell( $row, $col0 );
        if (!$cell0) {
            # An empty first cell marks the gap between two chunks
            push @chunks, $name if defined $name;
            $name = undef;
            next;
        }
        my $col3 = 3;
        my $cell3 = $worksheet->get_cell( $row, $col3 );
        if ($cell3) {
            # Chunk name: first whitespace-delimited token of the 4th column
            $name = $cell3->unformatted();
            ($name) = $name =~ /^(\S+)/;
        }
    }
    push @chunks, $name if defined $name;
    $self->{chunks} = \@chunks;
}

# Remember the header cells and map each kept input column to its output column.
sub save_header_line {
    my ( $self, $worksheet, $row_min ) = @_;

    my ( $col_min, $col_max ) = $worksheet->col_range();
    my %skip = map { $_ => 1 } @{ $self->{skip_cols} };
    my @header;
    my $row0 = 0;
    my %col_map;
    my $new_col = 0;
    for my $col ( $col_min .. $col_max ) {
        next if exists $skip{$col};
        $col_map{$col} = $new_col;
        $new_col++;
        my $cell = $worksheet->get_cell( $row0, $col );
        push @header, $cell;
    }
    $self->{header} = \@header;
    $self->{skip_col} = \%skip;
    $self->{col_map} = \%col_map;
}

sub start_new_chunk {
    my ( $self, $name) = @_;

    say "--> $name.xlsx";
    $self->close_chunk_file();
    $self->{workbook} = Excel::Writer::XLSX->new( "$name.xlsx" );
    $self->{worksheet} = $self->{workbook}->add_worksheet();
    $self->write_header();
}

sub write_cell {
    my ( $self, $row, $col, $cell) = @_;

    my $fmt = $cell->get_format();
    my $fmt_props  = translate_xlsx_format( $fmt );
    my $new_format = $self->{workbook}->add_format(%$fmt_props);
    # Use // rather than || so that a cell value of 0 is not replaced by ''
    my $value = $cell->unformatted() // '';
    $self->{worksheet}->write($row, $self->{col_map}{$col}, $value, $new_format);
}

# Second pass: copy the rows of each chunk into its own workbook.
sub write_chunks {
    my ( $self ) = @_;

    my $parser = Spreadsheet::ParseXLSX->new;
    my $workbook = $parser->parse($self->{input_file});
    my $worksheet = $workbook->worksheet(0);
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();
    my @chunks = @{ $self->{chunks} };
    die "No chunks to write\n" if @chunks == 0;
    $self->start_new_chunk(shift @chunks);
    my $chunk_row = 1;  # skip header row
    $row_min++; # skip header row
    ROW: for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {
            my $cell = $worksheet->get_cell( $row, $col );
            if ( $col == 0 && !$cell) {
                # Gap row: switch to the next output file, if any is left
                if (@chunks) {
                    $self->start_new_chunk(shift @chunks);
                    $chunk_row = 1;
                    next ROW;
                }
                else {
                    last;
                }
            }
            if ( $cell ) {
                if (!exists $self->{skip_col}{$col}) {
                    $self->write_cell($chunk_row, $col, $cell);
                }
            }
        }
        $chunk_row++;
    }
    $self->close_chunk_file();
}

sub write_header {
    my ( $self ) = @_;

    my $header = $self->{header};
    for my $col ( 0 .. $#$header ) {
        my $cell = $header->[$col];
        next if !$cell;
        my $fmt = $cell->get_format();
        my $fmt_props  = translate_xlsx_format( $fmt );
        my $new_format = $self->{workbook}->add_format(%$fmt_props);
        # Use // rather than || so that a header value of 0 is not replaced by ''
        my $value = $cell->unformatted() // '';
        my $row0 = 0;
        $self->{worksheet}->write($row0, $col, $value, $new_format);
    }
}

【Comments】:

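You mean how to run the script? OK, first install the modules Spreadsheet::ParseXLSX, Excel::CloneXLSX::Format and Excel::Writer::XLSX, then run the script like this: perl script.pl. This assumes the file x.xlsx is in the current directory.

I will try it and let you know. Thank you very much for writing such a long script for me @Hakon Haegland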
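
One detail of the script that is easy to miss: skip_cols => [4] is a 0-based index, so it removes the 5th column, and the remaining columns are shifted left through col_map so that no empty column is left in the output. The sketch below mirrors that remapping in Python (used here instead of Perl to match the pandas context of the question). `build_col_map` is a hypothetical name for illustration and is not part of the script.

```python
def build_col_map(col_min, col_max, skip_cols):
    """Map each kept input column index to its output column index."""
    skip = set(skip_cols)
    col_map = {}
    new_col = 0
    for col in range(col_min, col_max + 1):
        if col in skip:
            continue  # skipped columns get no slot in the output
        col_map[col] = new_col
        new_col += 1
    return col_map


# Seven input columns (0..6) with column index 4 removed:
print(build_col_map(0, 6, [4]))  # {0: 0, 1: 1, 2: 2, 3: 3, 5: 4, 6: 5}
```

Input columns 5 and 6 end up in output columns 4 and 5, which is why the script looks up the output column in col_map when writing data cells.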
