Splitting a large Excel file using pandas
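Posted: 2021-09-26 04:23:55
Question: I have tabular data in an xlsx file with many columns. Blocks of rows are separated by gaps (blank rows), and I want to write each block to its own xlsx file, with each file named from the fourth column: the two characters together with the serial number.
I also want to delete the 5th column in the output files.
For example, my data has several columns and I want to split it into two blocks.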
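A minimal Python sketch of the split described above, working on rows that have already been read into plain lists. It is not the asker's or the answerer's code. The function names `is_blank` and `split_rows` are hypothetical, and it assumes that a gap is a row whose cells are all empty, that the first row is a header to repeat in every block, and that the file name is the first whitespace-delimited token of the fourth column.

```python
from typing import Dict, List, Optional, Sequence

Cell = Optional[object]


def is_blank(row: Sequence[Cell]) -> bool:
    """Return True for a gap row, i.e. one whose cells are all empty."""
    return all(cell is None or cell == "" for cell in row)


def split_rows(
    rows: Sequence[Sequence[Cell]],
    name_col: int = 3,  # fourth column, zero-based
    drop_col: int = 4,  # fifth column, zero-based
) -> Dict[str, List[List[Cell]]]:
    """Group data rows into named blocks separated by blank rows."""
    header, *body = rows
    keep = [i for i in range(len(header)) if i != drop_col]
    blocks: Dict[str, List[List[Cell]]] = {}
    current: List[List[Cell]] = []
    name: Optional[str] = None

    def flush() -> None:
        # Store the finished block, header first, under its name.
        # A block for which no name was found is dropped.
        nonlocal current, name
        if current and name is not None:
            blocks[name] = [[header[i] for i in keep]] + current
        current, name = [], None

    for row in body:
        if is_blank(row):
            flush()
            continue
        if name is None and row[name_col] not in (None, ""):
            # Assumption: the name is the first whitespace-delimited token.
            tokens = str(row[name_col]).split()
            if tokens:
                name = tokens[0]
        current.append([row[i] for i in keep])
    flush()
    return blocks
```

Each returned block could then be written to a file called `<name>.xlsx`, for instance by building a pandas `DataFrame` from it and calling `to_excel`. If the rows come from pandas, note that blank cells arrive as NaN rather than `None` and would need converting first.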
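Comments:
What have you tried so far?
I have no idea why this question was upvoted.
Also, the "pandas" tag clearly refers to the Python library, and the title says to use pandas, so why the Perl tag?

Answer 1:
Could you try the following Perl script: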
#! /usr/bin/env perl

package Main;
use feature qw(say);
use strict;
use warnings;
use Spreadsheet::ParseXLSX;
use Excel::CloneXLSX::Format qw(translate_xlsx_format);
use Excel::Writer::XLSX;

{
    # Column indices are zero-based, so 4 is the fifth column.
    my $self = Main->new(
        input_file => 'x.xlsx',
        skip_cols  => [4],
    );
    $self->scan_input_file();    # first pass: collect the chunk names
    $self->write_chunks();       # second pass: write one file per chunk
    say "Done";
}

sub close_chunk_file {
    my ( $self ) = @_;
    if ($self->{workbook}) {
        $self->{workbook}->close();
    }
}

sub new {
    my ( $class, %args ) = @_;
    return bless \%args, $class;
}

sub scan_input_file {
    my ( $self ) = @_;
    my $parser    = Spreadsheet::ParseXLSX->new;
    my $workbook  = $parser->parse($self->{input_file});
    my $worksheet = $workbook->worksheet(0);
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();
    my $name;
    my @chunks;
    $self->save_header_line( $worksheet, $row_min );
    $row_min++;    # skip header row
    for my $row ( $row_min .. $row_max ) {
        my $col0  = 0;
        my $cell0 = $worksheet->get_cell( $row, $col0 );
        if (!$cell0) {
            # An empty first cell marks a gap: the current chunk ends here.
            push @chunks, $name if defined $name;
            $name = undef;
            next;
        }
        my $col3  = 3;
        my $cell3 = $worksheet->get_cell( $row, $col3 );
        if ($cell3) {
            # The chunk name is the first whitespace-delimited token
            # of the fourth column.
            $name = $cell3->unformatted();
            ($name) = $name =~ /^(\S+)/;
        }
    }
    push @chunks, $name if defined $name;
    $self->{chunks} = \@chunks;
}

sub save_header_line {
    my ( $self, $worksheet, $row_min ) = @_;
    my ( $col_min, $col_max ) = $worksheet->col_range();
    my %skip = map { $_ => 1 } @{ $self->{skip_cols} };
    my @header;
    my $row0 = 0;
    my %col_map;    # input column index => output column index
    my $new_col = 0;
    for my $col ( $col_min .. $col_max ) {
        next if exists $skip{$col};
        $col_map{$col} = $new_col;
        $new_col++;
        my $cell = $worksheet->get_cell( $row0, $col );
        push @header, $cell;
    }
    $self->{header}   = \@header;
    $self->{skip_col} = \%skip;
    $self->{col_map}  = \%col_map;
}

sub start_new_chunk {
    my ( $self, $name ) = @_;
    say "--> $name.xlsx";
    $self->close_chunk_file();
    $self->{workbook}  = Excel::Writer::XLSX->new( "$name.xlsx" );
    $self->{worksheet} = $self->{workbook}->add_worksheet();
    $self->write_header();
}

sub write_cell {
    my ( $self, $row, $col, $cell ) = @_;
    my $fmt        = $cell->get_format();
    my $fmt_props  = translate_xlsx_format( $fmt );
    my $new_format = $self->{workbook}->add_format(%$fmt_props);
    # Use // rather than || so that a cell holding 0 is not written as ''.
    my $value = $cell->unformatted() // '';
    $self->{worksheet}->write($row, $self->{col_map}{$col}, $value, $new_format);
}

sub write_chunks {
    my ( $self ) = @_;
    my $parser    = Spreadsheet::ParseXLSX->new;
    my $workbook  = $parser->parse($self->{input_file});
    my $worksheet = $workbook->worksheet(0);
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();
    my @chunks = @{ $self->{chunks} };
    die "No chunks to write\n" if @chunks == 0;
    $self->start_new_chunk(shift @chunks);
    my $chunk_row = 1;    # skip header row
    $row_min++;           # skip header row
    ROW: for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {
            my $cell = $worksheet->get_cell( $row, $col );
            if ( $col == 0 && !$cell ) {
                # Gap row: switch to the next output file, if any remain.
                if (@chunks) {
                    $self->start_new_chunk(shift @chunks);
                    $chunk_row = 1;
                    next ROW;
                }
                else {
                    last;
                }
            }
            if ( $cell ) {
                if (!exists $self->{skip_col}{$col}) {
                    $self->write_cell($chunk_row, $col, $cell);
                }
            }
        }
        $chunk_row++;
    }
    $self->close_chunk_file();
}

sub write_header {
    my ( $self ) = @_;
    my $header = $self->{header};
    # The saved header already omits skipped columns, so $col is the
    # output column index.
    for my $col ( 0 .. $#$header ) {
        my $cell = $header->[$col];
        next if !$cell;
        my $fmt        = $cell->get_format();
        my $fmt_props  = translate_xlsx_format( $fmt );
        my $new_format = $self->{workbook}->add_format(%$fmt_props);
        my $value = $cell->unformatted() // '';
        my $row0  = 0;
        $self->{worksheet}->write($row0, $col, $value, $new_format);
    }
}
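
For readers approaching this from the Python side of the question, two small steps of the script can be sketched in Python: the mapping from input columns to output columns once some are skipped, and the extraction of a chunk name from the fourth column. This is only an illustration of the logic, not a port of the script, and the function names `build_col_map` and `chunk_name` are hypothetical.

```python
import re
from typing import Dict, Iterable, Optional


def build_col_map(n_cols: int, skip_cols: Iterable[int]) -> Dict[int, int]:
    """Map each kept input column index to a contiguous output index."""
    skip = set(skip_cols)
    col_map: Dict[int, int] = {}
    new_col = 0
    for col in range(n_cols):
        if col in skip:
            continue  # skipped columns get no output position
        col_map[col] = new_col
        new_col += 1
    return col_map


def chunk_name(value: str) -> Optional[str]:
    """Return the leading run of non-whitespace characters, if any."""
    # re.match anchors at the start of the string, like ^ in the Perl regex.
    match = re.match(r"\S+", value)
    return match.group(0) if match else None
```

With seven input columns and column 4 skipped, columns 5 and 6 shift left by one in the output, which is why the script looks up the output position through the map instead of reusing the input index.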
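
Comments:
Do you mean how to run the script? OK, first install the modules Spreadsheet::ParseXLSX, Excel::CloneXLSX::Format and Excel::Writer::XLSX, then run the script with perl script.pl, assuming the file x.xlsx is in the current directory.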
I will try it and let you know. Thank you very much for writing such a long script for me @Hakon Haegland