How can I convert a docx document to HTML using PHP?


【Title】How can I convert a docx document to html using php? 【Posted】2011-06-03 00:00:09 【Question】:

I want to be able to upload an MS Word document and export it to a page on my site.

Is there any way to do this?

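【Comments on the question】:

I am not very familiar with PHP, but maybe this can help you? phpLiveDocx - Convert DOCX to html in PHP

You can use phpLiveDocx.

One approach is to use LiveDocx, for which you need an account. Then follow this guide, or learn how to use Zend_Service_LiveDocx yourself.

phpLiveDocx seems like overkill... and its service seems very limited (no dynamic tables or charts).

【Answer 1】: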
// FUNCTION :: read a docx file and return its text as a string
function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string; loadXML() is an instance method, so create the object first
            // Skip errors and warnings; LIBXML_NOENT is left out so entities in uploaded files are not expanded
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags and without line breaks
            // (the original split on '\n' in single quotes, which never matched a newline)
            return str_replace("\n", '', strip_tags($xml->saveXML()));
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}
ZipArchive and DOMDocument are both part of PHP, so you do not need to install, include, or require any other libraries.
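The function above returns plain text only. As a rough sketch of how the same two built-in classes could keep basic formatting, the example below walks the paragraphs and runs of word/document.xml with DOMXPath and wraps bold and italic runs in HTML tags. The function names `docxXmlToHtml` and `docxFileToHtml` are hypothetical, and the sketch makes simplifying assumptions: it only looks at paragraphs directly under the body (no tables, hyperlinks, lists, or images) and treats any `w:b` or `w:i` element as "on" without checking its `w:val` attribute.

```php
<?php
// Hypothetical helper: turn WordprocessingML into minimal HTML, keeping bold and italic runs.
function docxXmlToHtml(string $documentXml): string
{
    $dom = new DOMDocument();
    if ($documentXml === '' || !$dom->loadXML($documentXml, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }
    $xp = new DOMXPath($dom);
    $xp->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $html = '';
    foreach ($xp->query('//w:body/w:p') as $p) {
        $line = '';
        foreach ($xp->query('w:r', $p) as $run) {
            // A run may contain several w:t text nodes
            $text = '';
            foreach ($xp->query('w:t', $run) as $t) {
                $text .= $t->textContent;
            }
            if ($text === '') {
                continue;
            }
            $text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
            // Simplification: presence of w:i / w:b is treated as enabled
            if ($xp->query('w:rPr/w:i', $run)->length > 0) {
                $text = "<em>$text</em>";
            }
            if ($xp->query('w:rPr/w:b', $run)->length > 0) {
                $text = "<strong>$text</strong>";
            }
            $line .= $text;
        }
        $html .= "<p>$line</p>\n";
    }
    return $html;
}

// Hypothetical wrapper: read word/document.xml from the archive and convert it.
function docxFileToHtml(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : docxXmlToHtml($xml);
}
```

Calling `echo docxFileToHtml('upload.docx');` would print one `<p>` element per paragraph; the file name is only an illustration.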

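【Comments】:

Thanks, that's great, but is there a way to keep formatting such as bold and italic text?

Thanks... it returns the whole document. But is there a way to get the text of each page separately?

This answer does not provide a solution for converting .docx to HTML, as the strip_tags() call in the code shows, and the OP specifically asked how to convert to HTML.

【Answer 2】: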

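You can use PHPDocX.

It supports almost all HTML CSS styles. In addition, you can use templates and add extra formatting to your HTML via replaceTemplateVariableByHTML.

PHPDocX's HTML methods also allow Word styles to be used directly. You can use something like this:

$docx->embedHTML($myHTML, array('tableStyle' => 'MediumGrid3-accent5PHPDOCX'));

if you want all tables to use the MediumGrid3-accent5 Word style. The embedHTML method and its template counterpart (replaceTemplateVariableByHTML) preserve inheritance, which means you can use a predefined Word style and override any of its properties with CSS.

You can also extract selected parts of the HTML using "jQuery-style" selectors.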

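【Comments】:

I have to say it is not free! At least not anymore. Prices start at $399.00.

Suggestion: let's introduce a "commercial" badge/tag on *** to make this kind of content visible.

【Answer 3】: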

You can convert Word docx documents to HTML using the Print2flash library. Here is a PHP excerpt from my client's website that converts a document to HTML:

include("const.php");
$p2fServ = new COM("Print2Flash4.Server2");
$p2fServ->DefaultProfile->DocumentType=HTML5;
$p2fServ->ConvertFile($wordfile,$htmlFile);

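It converts the document whose path is given in the $wordfile variable into the HTML page file given by the $htmlFile variable. All formatting, hyperlinks, and charts are preserved. You can get the required const.php file, along with more complete examples, from the Print2flash SDK.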

【Comments】:

【Answer 4】:

This is a workaround based on David Lin's answer above: removing the "w:" prefix from the docx XML tags leaves HTML-like tags.

function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings; LIBXML_NOENT is left out so entities in uploaded files are not expanded
            $xml = new DOMDocument("1.0", "utf-8");
            $xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_PARSEHUGE);
            $xml->encoding = "utf-8";
            // Return the XML with every "w:" removed, which leaves HTML-like tag names
            $output = $xml->saveXML();
            $output = str_replace("w:", "", $output);

            return $output;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}
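One caveat with a plain `str_replace("w:", "", ...)` is that it also removes "w:" wherever it appears in the document text or in attribute names. A minimal variant, assuming you only want to rename the elements, is to match the prefix only directly after `<` or `</`. The function name `stripWordPrefixFromTags` is hypothetical.

```php
<?php
// Hypothetical helper: drop the "w:" prefix from element names only.
function stripWordPrefixFromTags(string $xml): string
{
    // Matches "<w:" and "</w:" so that text such as "w: 10cm" is left untouched.
    // Attribute names such as w:val keep their prefix.
    return preg_replace('#<(/?)w:#', '<$1', $xml) ?? $xml;
}
```

The result is still not valid HTML: `<w:p>` becomes `<p>`, but names such as `<r>` and `<t>` have no HTML meaning, so browsers will simply render their text content.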

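【Comments】:

【Answer 5】:

If you do not mind using a REST API, you can use:

Apache Tika, a battle-tested open-source leader in text extraction.

RawText, if you do not want to deal with configuration and want a ready-made solution; it is not free, however.

Sample code for RawText:

$result = $rawText->parse($your_file);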
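For the Apache Tika option, a rough sketch of a REST call could look like the following. It assumes a Tika server is already running and reachable at `http://localhost:9998` (an assumption, not something stated in the answer), that `allow_url_fopen` is enabled, and it uses hypothetical function names. The Tika server accepts the document as the body of a PUT request to `/tika` and returns HTML when the `Accept` header asks for `text/html`.

```php
<?php
// Hypothetical helper: build stream-context options for a PUT request to a Tika server.
function tikaHttpOptions(string $body): array
{
    return [
        'http' => [
            'method'        => 'PUT',
            'header'        => "Accept: text/html\r\n",
            'content'       => $body,
            'ignore_errors' => true, // return the response body even on HTTP error codes
        ],
    ];
}

// Hypothetical helper: send a docx file to the server and return the HTML, or null on failure.
// The default server URL is an assumption for illustration.
function tikaDocxToHtml(string $docxPath, string $serverUrl = 'http://localhost:9998'): ?string
{
    $body = @file_get_contents($docxPath);
    if ($body === false) {
        return null;
    }
    $context = stream_context_create(tikaHttpOptions($body));
    $html = @file_get_contents(rtrim($serverUrl, '/') . '/tika', false, $context);
    return $html === false ? null : $html;
}
```

Tika aims at text extraction, so the returned HTML carries the document structure (paragraphs, headings, tables) rather than a faithful copy of the Word layout.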

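【Comments】:

【Answer 6】:

OK, I am late to the party, but I thought I would post this to save everyone some time. Here is some PHP code I put together that reads not only the text from a docx but also the images. It does not currently support floating images/text, but what I have so far is a big step beyond what has already been posted here. Note that you need to change https://sharinggodslove.uk to your own domain.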

<?php

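// Maps an image inside the docx archive to the location it was extracted to.
class Docx_ws_imglnk {
    public $originalpath = '';
    public $extractedpath = '';
}

// One relationship entry from word/_rels/document.xml.rels.
class Docx_ws_rel {
    public $Id = '';
    public $Target = '';
}

// One style definition from word/styles.xml.
class Docx_ws_def {
    public $styleId = '';
    public $type = '';
    public $color = '000000';
}

// One paragraph: its collected properties plus its concatenated text.
class Docx_p_def {
    public $data = array();
    public $text = "";
}

// One paragraph or run property.
class Docx_p_item {
    public $name = "";
    public $value = "";
    public $innerstyle = "";
    public $type = "text";
}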

class Docx_reader {

    private $fileData = false;
    private $errors = array();
    public $rels = array();
    public $imglnks = array();
    public $styles = array();
    public $document = null;
    public $paragraphs = array();
    public $path = '';
    private $saveimgpath = 'docimages';

    public function __construct() {
    }

    // Opens the docx archive, saves its images, and loads relationships, styles and the document body.
    private function load($file) {
        if (file_exists($file)) {
            $zip = new ZipArchive();
            $openedZip = $zip->open($file);
            if ($openedZip === true) {
                $this->path = $file;

                //read and save images
                for ($i = 0; $i < $zip->numFiles; $i++) {
                    $zip_element = $zip->statIndex($i);
                    if (preg_match('/\.(jpg|jpeg|png|gif|bmp)$/i', $zip_element['name'])) {
                        $imglnk = new Docx_ws_imglnk;
                        $imglnk->originalpath = $zip_element['name'];
                        $imagename = explode('/', $zip_element['name']);
                        $imagename = end($imagename);
                        // Uses $saveimgpath (the original referenced an undefined $savepath);
                        // the directory must already exist and be writable.
                        $imglnk->extractedpath = dirname(__FILE__) . '/' . $this->saveimgpath . '/' . $imagename;

                        file_put_contents($imglnk->extractedpath, $zip->getFromIndex($i));
                        // Turn the filesystem path into a URL; adjust the web root and domain to your site
                        $imglnk->extractedpath = str_replace('var/www/', 'https://sharinggodslove.uk/', $imglnk->extractedpath);
                        $imglnk->extractedpath = substr($imglnk->extractedpath, 1);

                        array_push($this->imglnks, $imglnk);
                    }
                }

                //read relationships
                if (($relsIndex = $zip->locateName('word/_rels/document.xml.rels')) !== false) {
                    $doc = new DOMDocument();
                    $doc->loadXML($zip->getFromIndex($relsIndex));
                    foreach ($doc->documentElement->childNodes as $childnode) {
                        if ($childnode instanceof DOMElement && $childnode->hasAttributes()) {
                            $rel = new Docx_ws_rel;
                            $rel->Id = $childnode->getAttribute('Id');
                            $rel->Target = $childnode->getAttribute('Target');
                            array_push($this->rels, $rel);
                        }
                    }
                }

                //attempt to load styles:
                if (($styleIndex = $zip->locateName('word/styles.xml')) !== false) {
                    $doc = new DOMDocument();
                    $doc->loadXML($zip->getFromIndex($styleIndex));
                    foreach ($doc->documentElement->childNodes as $childnode) {
                        //get style
                        if (strcmp($childnode->nodeName, "w:style") == 0) {
                            $ws_def = new Docx_ws_def;
                            $ws_def->styleId = $childnode->getAttribute('w:styleId');
                            $ws_def->type = $childnode->getAttribute('w:type');
                            //push style to the array of styles (once per w:style node that has both attributes)
                            if ($ws_def->styleId !== '' && $ws_def->type !== '') {
                                array_push($this->styles, $ws_def);
                            }
                        }
                    }
                }

                //load the document body
                if (($index = $zip->locateName('word/document.xml')) !== false) {
                    $this->document = new DOMDocument();
                    $this->document->loadXML($zip->getFromIndex($index));
                } else {
                    $this->errors[] = 'word/document.xml not found.';
                }
                $zip->close();
            } else {
                switch ($openedZip) {
                    case ZipArchive::ER_EXISTS:
                        $this->errors[] = 'File exists.';
                        break;
                    case ZipArchive::ER_INCONS:
                        $this->errors[] = 'Inconsistent zip file.';
                        break;
                    case ZipArchive::ER_MEMORY:
                        $this->errors[] = 'Malloc failure.';
                        break;
                    case ZipArchive::ER_NOENT:
                        $this->errors[] = 'No such file.';
                        break;
                    case ZipArchive::ER_NOZIP:
                        $this->errors[] = 'File is not a zip archive.';
                        break;
                    case ZipArchive::ER_OPEN:
                        $this->errors[] = 'Could not open file.';
                        break;
                    case ZipArchive::ER_READ:
                        $this->errors[] = 'Read error.';
                        break;
                    case ZipArchive::ER_SEEK:
                        $this->errors[] = 'Seek error.';
                        break;
                }
            }
        } else {
            $this->errors[] = 'File does not exist.';
        }
    }

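    public function setFile($path) {
        $this->fileData = $this->load($path);
    }

    // Returns the document text without markup, or false if no document was loaded.
    // load() never fills $fileData, so the text is read from the parsed document instead.
    public function to_plain_text() {
        if ($this->document !== null) {
            return $this->document->textContent;
        } else {
            return false;
        }
    }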

    // Walks w:body and collects one Docx_p_def per paragraph in $this->paragraphs.
    public function processDocument() {
        foreach ($this->document->documentElement->childNodes as $childnode) {
            //get the body of the document
            if (strcmp($childnode->nodeName, "w:body") == 0) {
                foreach ($childnode->childNodes as $subchildnode) {
                    //process every paragraph
                    if (strcmp($subchildnode->nodeName, "w:p") == 0) {
                        $pdef = new Docx_p_def;

                        foreach ($subchildnode->childNodes as $pchildnode) {
                            //process paragraph properties (compare the node name, not the node object)
                            if (strcmp($pchildnode->nodeName, "w:pPr") == 0) {
                                foreach ($pchildnode->childNodes as $prchildnode) {
                                    //process paragraph style
                                    if (strcmp($prchildnode->nodeName, "w:pStyle") == 0) {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'styleId';
                                        $pitem->value = $prchildnode->getAttribute('w:val');
                                        array_push($pdef->data, $pitem);
                                    }

                                    //process text alignment
                                    if (strcmp($prchildnode->nodeName, "w:jc") == 0) {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'align';
                                        $pitem->value = $prchildnode->getAttribute('w:val');

                                        if (in_array($pitem->value, array('left', 'center', 'right'), true)) {
                                            $pitem->innerstyle .= "text-align:" . $pitem->value . ";";
                                        }
                                        // "both" means justified text
                                        if (strcmp($pitem->value, "both") == 0) {
                                            $pitem->innerstyle .= "text-align:justify;";
                                        }

                                        array_push($pdef->data, $pitem);
                                    }

                                    //process spacing (getAttribute() returns '' when the attribute is missing)
                                    if (strcmp($prchildnode->nodeName, "w:spacing") == 0) {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'paragraphSpacing';
                                        $bval = $prchildnode->getAttribute('w:before');
                                        if ($bval === '') {
                                            $bval = 0;
                                        }
                                        $pitem->innerstyle .= "padding-top:" . $bval . "px;";
                                        $aval = $prchildnode->getAttribute('w:after');
                                        if ($aval === '') {
                                            $aval = 0;
                                        }
                                        $pitem->innerstyle .= "padding-bottom:" . $aval . "px;";

                                        array_push($pdef->data, $pitem);
                                    }
                                }
                            }

                            //process runs
                            if (strcmp($pchildnode->nodeName, "w:r") == 0) {
                                foreach ($pchildnode->childNodes as $rchildnode) {
                                    //process text
                                    if (strcmp($rchildnode->nodeName, "w:t") == 0) {
                                        $pdef->text .= $rchildnode->nodeValue;
                                        if (count($pdef->data) == 0) {
                                            $pitem = new Docx_p_item;
                                            $pitem->name = 'styleId';
                                            $pitem->value = '';
                                            array_push($pdef->data, $pitem);
                                        }
                                    }

                                    //process drawing (w:drawing is a child of w:r, not of w:pPr, so it is handled here)
                                    if (strcmp($rchildnode->nodeName, "w:drawing") == 0) {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'drawing';
                                        $pitem->value = '';
                                        $pitem->type = 'graphic';

                                        $extents = $rchildnode->getElementsByTagName('extent')->item(0);
                                        if ($extents !== null) {
                                            // Sizes are stored in EMU; 9525 EMU equal one pixel at 96 dpi
                                            $pcx = (int)$extents->getAttribute('cx') / 9525;
                                            $pcy = (int)$extents->getAttribute('cy') / 9525;
                                            $pitem->innerstyle .= "width:" . $pcx . "px;";
                                            $pitem->innerstyle .= "height:" . $pcy . "px;";
                                        }

                                        $blip = $rchildnode->getElementsByTagName('blip')->item(0);
                                        if ($blip !== null) {
                                            $pitem->value = $blip->getAttribute('r:embed');
                                        }

                                        array_push($pdef->data, $pitem);
                                    }

                                    //process run properties
                                    if (strcmp($rchildnode->nodeName, "w:rPr") == 0) {
                                        foreach ($rchildnode->childNodes as $rPrchildnode) {
                                            if (strcmp($rPrchildnode->nodeName, "w:b") == 0) {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textBold';
                                                $pitem->value = '';
                                                $pitem->innerstyle .= "font-weight: bold;";
                                                array_push($pdef->data, $pitem);
                                            }
                                            if (strcmp($rPrchildnode->nodeName, "w:i") == 0) {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textItalic';
                                                $pitem->value = '';
                                                $pitem->innerstyle .= "font-style: italic;";
                                                array_push($pdef->data, $pitem);
                                            }
                                            if (strcmp($rPrchildnode->nodeName, "w:u") == 0) {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textUnderline';
                                                $pitem->value = '';
                                                $pitem->innerstyle .= "text-decoration: underline;";
                                                array_push($pdef->data, $pitem);
                                            }
                                            if (strcmp($rPrchildnode->nodeName, "w:sz") == 0) {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textSize';

                                                // Note: w:sz is stored in half-points
                                                $sz = $rPrchildnode->getAttribute('w:val');
                                                if ($sz == '') {
                                                    $sz = 0;
                                                }
                                                $pitem->value = $sz;
                                                array_push($pdef->data, $pitem);
                                            }
                                        }
                                    }
                                }
                            }
                        }

                        array_push($this->paragraphs, $pdef);
                    }
                }
            }
        }
    }

    // Renders the collected paragraphs as <p> elements with inline styles.
    public function to_html() {
        $html = '';

        foreach ($this->paragraphs as $para) {
            $styleselect = null;
            $type = 'text';
            $content = $para->text;
            $sz = 0;
            $extent = '';
            $embedid = '';

            $pinnerstylesid = '';
            $pinnerstylesrun = '';

            foreach ($para->data as $node) {
                if (strcmp($node->name, "styleId") == 0) {
                    $type = $node->type;
                    $pinnerstylesid = $node->innerstyle;

                    foreach ($this->styles as $style) {
                        if (strcmp($node->value, $style->styleId) == 0) {
                            $styleselect = $style;
                        }
                    }
                }

                if (strcmp($node->name, "align") == 0) {
                    $pinnerstylesid .= $node->innerstyle;
                }

                if (strcmp($node->name, "drawing") == 0) {
                    $type = $node->type;
                    $extent = $node->innerstyle;
                    $embedid = $node->value;
                }

                if (strcmp($node->name, "textSize") == 0) {
                    $sz = $node->value;
                }

                // Bold and italic were collected but never emitted; apply them together with underline
                if (in_array($node->name, array('textBold', 'textItalic', 'textUnderline'), true)) {
                    $pinnerstylesrun .= $node->innerstyle;
                }
            }

            // Styles shared by text and image paragraphs
            if ($styleselect != null && strcmp($styleselect->color, '') != 0) {
                $pinnerstylesid .= "color:#" . $styleselect->color . ";";
            }
            if ($sz != 0) {
                $pinnerstylesid .= 'font-size:' . $sz . 'px;';
            }

            if (strcmp($type, 'text') == 0) {
                $span = "<p style='" . $pinnerstylesid . $pinnerstylesrun . "'>";
                // Escape the text so characters such as < and & do not break the page
                $span .= htmlspecialchars($content, ENT_QUOTES, 'UTF-8');
                $span .= "</p>";
                $html .= $span;
            }

            if (strcmp($type, 'graphic') == 0) {
                $imglnk = '';

                foreach ($this->rels as $rel) {
                    if (strcmp($embedid, '') != 0 && strcmp($rel->Id, $embedid) == 0) {
                        foreach ($this->imglnks as $imgpathdef) {
                            // strpos() returns false when nothing is found, so ">= 0" was always true.
                            // The relationship target (e.g. media/image1.png) is part of the archive path.
                            if (strpos($imgpathdef->originalpath, $rel->Target) !== false) {
                                $imglnk = $imgpathdef->extractedpath;
                            }
                        }
                    }
                }

                $span = "<p style='" . $pinnerstylesid . $pinnerstylesrun . "'>";
                $span .= "<img style='" . $extent . "' alt='image coming soon' src='" . $imglnk . "'/>";
                $span .= "</p>";
                $html .= $span;
            }
        }

        return $html;
    }

    public function get_errors() {
        return $this->errors;
    }

    private function getStyles() {
    }

}

// Loads the docx at $path and prints it as HTML.
function getDocX($path) {
    $doc = new Docx_reader();
    $doc->setFile($path);

    if (!$doc->get_errors()) {
        $doc->processDocument();
        $html = $doc->to_html();
        echo $html;
    }
    return "";
}
?>
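A note on units, since the class above divides image extents by 9525 but writes the `w:sz` and `w:spacing` values into `px` unchanged. In WordprocessingML, drawing extents are given in EMU (914400 per inch), `w:sz` is in half-points, and paragraph spacing before/after is in twentieths of a point. The helpers below are a small sketch of those conversions; the function and constant names are hypothetical and are not part of the answer's code.

```php
<?php
// 914400 EMU per inch / 96 pixels per inch = 9525 EMU per pixel
const EMU_PER_PIXEL = 9525;

// Hypothetical helper: drawing extent (EMU) to CSS pixels at 96 dpi.
function emuToPx(int $emu): float
{
    return $emu / EMU_PER_PIXEL;
}

// Hypothetical helper: w:sz value (half-points) to points.
function halfPointsToPt(int $halfPoints): float
{
    return $halfPoints / 2;
}

// Hypothetical helper: w:spacing before/after (twentieths of a point) to points.
function twipsToPt(int $twips): float
{
    return $twips / 20;
}
```

With these, a run with `w:sz` of 24 would be emitted as `font-size:12pt;` rather than `font-size:24px;`, which is closer to what Word shows.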

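【Comments】:

【Answer 7】:

A more common approach nowadays is to use the Composer package phpoffice/phpword, a pure PHP library that can convert word-processing documents such as docx to HTML and back without external dependencies.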
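A minimal sketch of that approach, assuming the package was installed with `composer require phpoffice/phpword` and that `vendor/autoload.php` sits next to the script. The function name `convertDocxToHtmlFile` and the file names are hypothetical; `IOFactory::load()` and `IOFactory::createWriter()` are PHPWord's reader and writer entry points.

```php
<?php
// Load Composer's autoloader if it is present (path is an assumption)
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

// Hypothetical helper: read a .docx file and write it out as an HTML file.
function convertDocxToHtmlFile(string $docxPath, string $htmlPath): void
{
    if (!class_exists(\PhpOffice\PhpWord\IOFactory::class)) {
        throw new RuntimeException('PHPWord is not installed (composer require phpoffice/phpword).');
    }
    // The default reader is Word2007, i.e. .docx
    $phpWord = \PhpOffice\PhpWord\IOFactory::load($docxPath);
    $writer = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
    $writer->save($htmlPath);
}
```

For example, `convertDocxToHtmlFile('upload.docx', 'page.html');` would write the converted page to `page.html`. How much of the original layout survives depends on which Word features the library supports.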

【Comments】:
