Extract raw data from tables in Word using Perl?
[Posted]: 2011-10-08 14:34:16
[Question]: I am trying to extract data from several tables in a Word document. I get an error when I try to convert the data in a table to text. The ConvertToText method takes two optional parameters (how to separate the data, and a boolean). Here is my current code:
#!/usr/bin/perl
#OLEWord.pl
#Use strict and print warnings
use strict;
use warnings;
#Using OLE + OLE constants for Variants and OLE enumeration for Enumerations
use Win32::OLE qw(in);
use Win32::OLE::Const 'Microsoft Word';
use Win32::OLE::Variant;
my $var1 = Win32::OLE::Variant->new(VT_BOOL, 'true');
$Win32::OLE::Warn = 3;
#set the file to be opened
my $file = 'C:\work\SCL_International Financial New Fund Setup Questionnaire V1.6.docx';
#Attach to a running Word instance or start a new one, die if neither is possible
my $MSWord = Win32::OLE->GetActiveObject('Word.Application')
    || Win32::OLE->new('Word.Application', 'Quit')
    or die "Could not start Word. Error:", Win32::OLE->LastError();
#Set the screen to Visible, so that you can see what is going on
$MSWord->{'Visible'} = 1;
$MSWord->{'DisplayAlerts'} = 0; #Suppress alerts, such as 'Save As....'
#open the request file or die and print warning message
my $Doc = $MSWord->{'Documents'}->Open($file)
    or die "Could not open ", $file, " Error:", Win32::OLE->LastError();
#$MSWord->ActiveDocument->SaveAs({Filename => 'AlteredTest.docx',
#FileFormat => wdFormatDocument});
my $tables = $MSWord->ActiveDocument->{'Tables'};
for my $table (in $tables) {
    my $tableText = $table->ConverToText(wdSeparateByParagraphs, $var1);
    print "Table: ", $tableText, "\n";
}
$MSWord->ActiveDocument->Close;
$MSWord->Quit;
I get this error:
Bareword "VT_BOOL" not allowed while "strict subs" in use at OLEWord.pl line 31.
Bareword "true" not allowed while "strict subs" in use at OLEWord.pl line 31.
Execution of OLEWord.pl aborted due to compilation errors.
[Comments on the question]:
[Answer 1]: Removing "use strict" will make the "Bareword" error go away.
[Comments]:
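[Answer 2]: The "Bareword" error is caused by a syntax error in the code. A "runaway multi-line" message usually points to where the error starts, and usually means that a line was never finished, typically because of mismatched parentheses or quotes.
As several SO-ers have pointed out, this does not look like Perl! The Perl interpreter balks at the syntax error because it does not speak that particular language! (Source)
Not using strict will stop the error from being reported. (But you should keep using it if you want to write good code.)
Read up on barewords so that you know what they are and can work out how to correct this error yourself.
Here are some links for learning about barewords: 1. perl.com 2. alumnus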
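[Comments]:
Thanks, but how do I extract the data from the tables? Does the code look right?
[Answer 3]: When something like VT_BOOL has not been defined as a constant, perl treats it as a bareword. Others have already provided information about barewords.
The root cause of the problem is that the constants exported by the Win32::OLE::Variant module are missing. Add:
use Win32::OLE::Variant;
to your script to remove the first error. The second error is a similar problem: true is not defined either. Replace it with 1, or define the constant yourself:
use constant true => 1;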
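To make the bareword point concrete, here is a minimal sketch that does not need Word or Win32::OLE at all. The name VT_BOOL_UNDEFINED is made up for the demonstration and simply stands in for any constant that was never imported; the string eval is only there so the compile error can be observed without aborting the script.

```perl
use strict;
use warnings;

# Defining the constant turns the bareword `true` into a real sub call,
# so it compiles under "strict subs".
use constant true => 1;

my $flag = true;
print "flag is $flag\n";

# VT_BOOL_UNDEFINED is a hypothetical name that nothing defines or imports.
# Under strict this is a compile-time error; the string eval captures it.
my $ok  = eval 'my $v = VT_BOOL_UNDEFINED; 1';
my $err = $@;
print "compile failed: $err" unless $ok;
```

Once the module that exports the constant is loaded (or the constant is defined by hand, as above), the same line compiles without complaint.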
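Edit: here is an example of extracting the text of the tables:
my $tables = $MSWord->ActiveDocument->{'Tables'};
for my $table (in $tables) {
    my $tableText = $table->ConvertToText({ Separator => wdSeparateByTabs });
    print "Table: ", $tableText->Text(), "\n";
}
In your code, the method name ConverToText is misspelled. The method also returns a Range object, so you have to use its Text method to get the actual text.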
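Once you have that text, splitting it back into rows and cells is plain string handling. The following is only a sketch: rows_from_converted_text is a hypothetical helper, and it assumes that the converted text separates rows with a carriage return (Word's paragraph mark) and cells with a tab, which is what I would expect from wdSeparateByTabs. The sample string is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: split tab-separated table text into rows of cells.
# Assumes rows end with "\r" (also accepts "\r\n" and "\n").
sub rows_from_converted_text {
    my ($text) = @_;
    my @rows;
    for my $line (split /\r\n?|\n/, $text) {
        next if $line eq '';                      # skip blank paragraphs
        push @rows, [ split /\t/, $line, -1 ];    # -1 keeps trailing empty cells
    }
    return @rows;
}

# Invented sample in the assumed format; the last row has an empty second cell.
my $sample = "Fund\tCurrency\rAlpha\tUSD\rBeta\t\r";
for my $row (rows_from_converted_text($sample)) {
    print join(' | ', @$row), "\n";
}
```

In the loop above you would pass $tableText->Text() instead of the sample string. Note that cells which themselves contain tabs or paragraph breaks would be split incorrectly by this approach.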
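[Comments]:
Yes, I forgot that, thanks. But what about extracting the data from the tables in Word?
@Shahab - please take a look at my answer, which I have updated with table extraction code.
Hmm, I get an error when I run it: an OLE exception from "Microsoft Word" saying that this method or property is not available because some or all of the object does not refer to a table, at $table->ConverToText. I thought the Tables property returned the collection of Tables in the document?
@Shahab - you are right, Tables is the collection, you iterate over it with in, and each table is converted to text. Did you notice the missing t in ConverToText?
[Answer 4]: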
Extract all of the document's tables into a single xls file:
use Carp;
use Win32::OLE qw(in);
use Win32::OLE::Const 'Microsoft Word';
use Win32::OLE::Variant;
use Spreadsheet::WriteExcel;
# $objLogger and $DocFile are expected to be set up elsewhere in the script
sub doParseDoc {
    my $msg = '' ;
    my $ret = 1 ; # assume failure at the beginning ...
    $msg = 'START --- doParseDoc' ;
    $objLogger->LogDebugMsg( $msg );
    $msg = 'using the following DocFile: "' . $DocFile . '"' ;
    $objLogger->LogInfoMsg( $msg );
    #-----------------------------------------------------------------------
    #Using OLE + OLE constants for Variants and OLE enumeration for Enumerations
    # Create a new Excel workbook
    my $objWorkBook = Spreadsheet::WriteExcel->new("$DocFile" . '.xls');
    # Add a worksheet
    my $objWorkSheet = $objWorkBook->add_worksheet();
    my $var1 = Win32::OLE::Variant->new(VT_BOOL, 'true');
    # turn every OLE error into an exception, so that eval can catch it below
    Win32::OLE->Option(Warn => \&Carp::croak);
    use constant true => 1;
    # at this point you should have the Word application opened in UI with
    # the DocFile
    # build the MS Word object during run-time
    my $objMSWord = Win32::OLE->GetActiveObject('Word.Application')
        || Win32::OLE->new('Word.Application', 'Quit');
    # build the doc object during run-time
    my $objDoc = $objMSWord->Documents->Open($DocFile)
        or die "Could not open ", $DocFile, " Error:", Win32::OLE->LastError();
    #Set the screen to Visible, so that you can see what is going on
    $objMSWord->{'Visible'} = 1;
    # try NOT printing directly to the file
    #$objMSWord->ActiveDocument->SaveAs({Filename => 'AlteredTest.docx',
    #FileFormat => wdFormatDocument});
    my $tables = $objMSWord->ActiveDocument->Tables();
    my $tableText = '' ;
    my $xlsRow = 1 ;
    for my $table (in $tables) {
        # extract the table text as a single string
        #$tableText = $table->ConvertToText({ Separator => 'wdSeparateByTabs' });
        # borrowed those properties from here:
        # https://msdn.microsoft.com/en-us/library/aa537149(v=office.11).aspx#officewordautomatingtablesdata_populateatablewithdata
        my $RowsCount = $table->{'Rows'}->{'Count'} ;
        my $ColsCount = $table->{'Columns'}->{'Count'} ;
        # discard the tables that do not have exactly 5 columns
        next unless ( $ColsCount == 5 ) ;
        $msg = "Rows Count: $RowsCount " ;
        $msg .= "Cols Count: $ColsCount " ;
        $objLogger->LogDebugMsg ( $msg ) ;
        #my $tableRange = $table->ConvertToText({ Separator => '##' });
        # OBS !!! simple print WILL print to your doc file use Select ?!
        #$objLogger->LogDebugMsg ( $tableRange . "\n" );
        # Word numbers rows and columns from 1 (start at 2 to skip the header row)
        foreach my $row ( 1..$RowsCount ) {
            foreach my $col ( 1..$ColsCount ) {
                # nope ... $table->Cell($row,$col)->{'WrapText'} = 1 ;
                # nope $table->Cell($row,$col)->{'WordWrap'} = 1 ;
                # so so $table->Cell($row,$col)->WordWrap() ;
                my $txt = '';
                # well some 1% of the values are so nasty that we really give up on them ...
                eval {
                    $txt = $table->Cell($row,$col)->Range->{'Text'};
                    #replace all the ctrl chars by space
                    $txt =~ s/\r/ /g ;
                    $txt =~ s/[^\040-\176]/ /g ;
                    # perform some cleansing - ColName<primary key>=> ColName
                    #$txt =~ s#^(.[a-zA-Z_0-9]*)(\<.*)#$1#g ;
                    # this will most probably break your cmd ...
                    # $objLogger->LogDebugMsg ( "row: $row , col: $col with txt: $txt \n" ) ;
                    1 ; # make the eval block return true on success
                } or $txt = 'N/A' ;
                # Write the string using row and column notation (xls columns start at 0).
                $objWorkSheet->write($xlsRow, $col - 1, $txt);
            } #eof foreach col
            # we just want to dump all the tables into the one sheet
            $xlsRow++ ;
        } #eof foreach row
        sleep 1 ;
    } #eof foreach table
    # close the opened in the UI document
    $objMSWord->ActiveDocument->Close;
    # OBS !!! now we are able to print
    $objLogger->LogDebugMsg ( $tableText . "\n" );
    # exit the whole Word application
    $objMSWord->Quit;
    return ( $ret , $msg ) ;
} #eof sub doParseDoc
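The per-cell cleanup can also be pulled out into a small function that is testable without Word. This is a sketch rather than the exact logic above: clean_cell_text is a hypothetical name, it assumes the raw cell text ends with Word's end-of-cell marker (a carriage return followed by character 7), and unlike the substitutions in the sub it collapses each run of non-printable characters into one space and trims the result.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise the raw text of one Word table cell.
sub clean_cell_text {
    my ($txt) = @_;
    return 'N/A' unless defined $txt;    # same fallback value as in the sub
    $txt =~ s/\r?\x07\z//;               # drop the assumed end-of-cell marker
    $txt =~ s/[^\x20-\x7E]+/ /g;         # collapse runs of non-printable chars
    $txt =~ s/^\s+|\s+$//g;              # trim leading and trailing whitespace
    return $txt;
}

# Invented sample: two paragraphs inside one cell, then the cell marker.
print clean_cell_text("Fund name\rline two\r\x07"), "\n";
```

Inside the inner loop this would be called on the value read from the cell's Range text, before the value is written to the worksheet.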
[Comments]: