PHP: How to keep line-breaks using nl2br() with HTML Purifier?

Posted 2023-02-24

Question (asked 2013-07-12 20:00:17):

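The problem: when HTML Purifier processes user-inputted content, line breaks are not translated into <br /> tags.

Consider the following user-inputted content: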

Lorem ipsum dolor sit amet.
This is another line.

<pre>
.my-css-class {
    color: blue;
}
</pre>

Lorem ipsum:

<ul>
<li>Lorem</li>
<li>Ipsum</li>
<li>Dolor</li>
</ul>

Dolor sit amet,
MyName

When processed with HTML Purifier, the content above is changed to the following:

Lorem ipsum dolor sit amet. This is another line.
.my-css-class {
    color: blue;
}
Lorem ipsum:
Lorem Ipsum Dolor Dolor sit amet, MyName

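As you can see, "MyName", which the user intended to appear on a separate line, is now displayed together with the previous line.

How to fix it?

By using the PHP nl2br() function, of course. However, whether we use it before or after purifying the content, new problems appear.

Here is an example of the result when nl2br() is used before HTML Purifier: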

Lorem ipsum dolor sit amet.
This is another line.
.my-css-class {

    color: blue;

}
Lorem ipsum:
Lorem
Ipsum
Dolor
Dolor sit amet,
MyName

nl2br() adds a <br /> for every line break, so even the line breaks inside <pre> blocks and those after each <li> tag are processed.
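To make this concrete, here is a small sketch. The sample string is made up for illustration and is not the exact input from the question. It shows that nl2br() is not HTML-aware and inserts a <br /> before every newline, whatever markup surrounds it:

```php
<?php
// Illustrative input: two text lines, a <pre> block and a short list.
$input = "First line.\nSecond line.\n<pre>\ncolor: blue;\n</pre>\n<ul>\n<li>Lorem</li>\n</ul>";

// nl2br() inserts "<br />" before each newline; it knows nothing about tags.
$output = nl2br($input);

echo $output, PHP_EOL;

// Count newlines in the input and <br /> tags in the output for comparison.
$newlines = substr_count($input, "\n");
$breaks   = substr_count($output, "<br />");
```

Both counts are equal, and the output contains "<pre><br />", which is why the code block renders with extra blank lines.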

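What I tried

I tried a custom nl2br() function that replaces line breaks with <br /> tags and then removes all <br /> tags from the <pre> blocks. It works well, but the problem remains for <li> items.

Trying the same approach for <ul> blocks would also remove all <br /> tags from the <li> children, unless we use a more complex regular expression that removes only the <br /> tags that are inside <ul> elements but outside <li> elements. But what about a <ul> nested inside an <li> item? To handle all of these cases, we would need an even more complex regular expression!

If this is the right approach, could you help me with that regular expression? If it is not the right approach, how could I solve this problem? I am also open to alternatives to HTML Purifier.

Other resources I have already looked at: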

HTMLPurifier: auto br http://htmlpurifier.org/phorum/read.php?2,3034

Comments:

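nl2br is meant to be used on plain text when it is put into an HTML context. In your case, you already have HTML. Why doesn't your HTML already contain proper <br>s for its line breaks?

So if the user is basically writing HTML, he should write <br> tags as well. Maybe he is using line breaks in HTML as they were intended: to make the markup more readable without actually introducing line breaks into the text. As far as I'm concerned, you can't have it both ways. :) You would really need to parse the HTML and apply nl2br only to specific text nodes, excluding <pre> elements.

Regarding "this is crazy": yes, it is. SO uses Markdown line breaks! SO does not simply strip some HTML tags; it has Markdown for basic text formatting, including line breaks. That is my point: if you require your users to write HTML and only HTML, then this is the trade-off.

And you are missing the point that you have a Catch-22. :) To convert bare line breaks into <br> tags, you need to parse the HTML so that you only do it in certain elements. But you have to do that before sanitizing the HTML, which means you may not be able to parse the HTML correctly. It is a really tricky proposition. I understand what you are trying to do, but the fact that it is tricky and error-prone is the very reason Markdown & co. came about in the first place. SO is not a good example, because it does not do this.

OK, to steer my rant in a helpful direction: you need to sanitize your HTML first, which is tricky. HTML Purifier seems to be one of the few libraries reputedly capable of doing so. After that, you should use a DOM processor to walk the HTML and apply nl2br. If Purifier by default messes up the line breaks inherent in the input, so that you cannot do the second step afterwards, you will need to customize Purifier to behave differently and/or roll nl2br directly into its processing. Have you looked into that possibility? I can't offer you a solution in code ATM.

Answer 1: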

This problem can be solved partially (if not completely) with a custom nl2br() function:

function nl2br_special($string)
{
    // Step 1: Add <br /> tags for each line-break
    $string = nl2br($string);

    // Step 2: Remove the actual line-breaks
    $string = str_replace("\n", "", $string);
    $string = str_replace("\r", "", $string);

    // Step 3: Restore the line-breaks that are inside <pre></pre> tags
    // ($match[1] holds the inner content of every <pre> block)
    if (preg_match_all('/\<pre\>(.*?)\<\/pre\>/', $string, $match)) {
        foreach ($match[1] as $b) {
            $string = str_replace('<pre>'.$b.'</pre>', "<pre>".str_replace("<br />", PHP_EOL, $b)."</pre>", $string);
        }
    }

    // Step 4: Remove extra <br /> tags

    // Before <pre> tags
    $string = str_replace("<br /><br /><br /><pre>", '<br /><br /><pre>', $string);
    // After </pre> tags
    $string = str_replace("</pre><br /><br />", '</pre><br />', $string);

    // Around <ul></ul> tags
    $string = str_replace("<br /><br /><ul>", '<br /><ul>', $string);
    $string = str_replace("</ul><br /><br />", '</ul><br />', $string);
    // Inside <ul> </ul> tags
    $string = str_replace("<ul><br />", '<ul>', $string);
    $string = str_replace("<br /></ul>", '</ul>', $string);

    // Around <ol></ol> tags
    $string = str_replace("<br /><br /><ol>", '<br /><ol>', $string);
    $string = str_replace("</ol><br /><br />", '</ol><br />', $string);
    // Inside <ol> </ol> tags
    $string = str_replace("<ol><br />", '<ol>', $string);
    $string = str_replace("<br /></ol>", '</ol>', $string);

    // Around <li></li> tags
    $string = str_replace("<br /><li>", '<li>', $string);
    $string = str_replace("</li><br />", '</li>', $string);

    return $string;
}
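This must be applied to the content before it is HTML-Purified. Never reprocess purified content unless you know what you are doing.

Note that since every single and double line break is already preserved, you should not use HTML Purifier's AutoFormat.AutoParagraph feature: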

// Process line-breaks
$string = nl2br_special($string);

// Initiate HTML Purifier config
$purifier_config = HTMLPurifier_Config::createDefault();
$purifier_config->set('HTML.Allowed', 'p,ul,ol,li,strong,b,em,i,u,a[href],code,pre,blockquote,cite,img[src|alt],br,hr,h3,h4');
//$purifier_config->set('AutoFormat.AutoParagraph', true); // Make sure to NOT use this

// Initiate HTML Purifier
$purifier = new HTMLPurifier($purifier_config);

// Purify the content!
$string = $purifier->purify($string);

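That's it!

Also, since allowing basic HTML tags was originally meant to improve user experience by not adding another markup syntax, you may want to let users post code, especially HTML code, that is not interpreted or removed by HTML Purifier.

HTML Purifier currently allows posting code, but it requires complex CDATA tags: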

<![CDATA[
Place code here
]]>

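These are hard to remember and to write. To simplify the user experience as much as possible, I believe it is best to let users add code by wrapping it in simple <code> (for inline code) and <pre> (for code blocks) tags. Here is how to do that: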

function custom_code_tag_callback($code)
{
    return '<code>'.trim(htmlspecialchars($code[1])).'</code>';
}

function custom_pre_tag_callback($code)
{
    return '<pre><code>'.trim(htmlspecialchars($code[1])).'</code></pre>';
}

// Don't require HTMLPurifier's CDATA enclosing, instead allow simple <code> or <pre> tags
$string = preg_replace_callback("/\<code\>(.*?)\<\/code\>/is", 'custom_code_tag_callback', $string);
$string = preg_replace_callback("/\<pre\>(.*?)\<\/pre\>/is", 'custom_pre_tag_callback', $string);

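Note that, like the nl2br processing, this must be done before the content is HTML-Purified. Also, keep in mind that if users put <code> or <pre> tags inside the code they post, those tags will close the parent <code> or <pre> tag that encloses their code. This cannot be solved, and it also applies to the original CDATA tags or to any other markup, even the one used on *** (for example, using the ` symbol inside a code sample closes the code tag).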
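As a quick illustration of what the <code> callback does, the sketch below runs the same replacement on a sample string. The sample input and the variable names are invented for this example. The markup between the <code> tags is escaped, so it reaches HTML Purifier as text rather than as tags:

```php
<?php
// Same callback as above: escape whatever sits between <code> tags.
function custom_code_tag_callback($code)
{
    return '<code>'.trim(htmlspecialchars($code[1])).'</code>';
}

// Hypothetical user input containing HTML that should be shown, not rendered.
$input = 'Use <code><a href="#">link</a></code> for anchors.';

// Non-greedy match, case-insensitive, "s" so that "." also matches newlines.
$escaped = preg_replace_callback('/<code>(.*?)<\/code>/is', 'custom_code_tag_callback', $input);

echo $escaped, PHP_EOL;
// prints: Use <code>&lt;a href=&quot;#&quot;&gt;link&lt;/a&gt;</code> for anchors.
```

Because the match is non-greedy, a literal </code> inside the posted code ends the match early, which is the limitation described above.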

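Finally, for a great user experience, we may also want to automate a few other things, such as making links clickable. Fortunately, this can be done with HTML Purifier's AutoFormat.Linkify feature.

Here is the final code, with everything included and the final settings: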

// === Declare functions ===

function nl2br_special($string)
{
    // Step 1: Add <br /> tags for each line-break
    $string = nl2br($string);

    // Step 2: Remove the actual line-breaks
    $string = str_replace("\n", "", $string);
    $string = str_replace("\r", "", $string);

    // Step 3: Restore the line-breaks that are inside <pre></pre> tags
    // ($match[1] holds the inner content of every <pre> block)
    if (preg_match_all('/\<pre\>(.*?)\<\/pre\>/', $string, $match)) {
        foreach ($match[1] as $b) {
            $string = str_replace('<pre>'.$b.'</pre>', "<pre>".str_replace("<br />", PHP_EOL, $b)."</pre>", $string);
        }
    }

    // Step 4: Remove extra <br /> tags

    // Before <pre> tags
    $string = str_replace("<br /><br /><br /><pre>", '<br /><br /><pre>', $string);
    // After </pre> tags
    $string = str_replace("</pre><br /><br />", '</pre><br />', $string);

    // Around <ul></ul> tags
    $string = str_replace("<br /><br /><ul>", '<br /><ul>', $string);
    $string = str_replace("</ul><br /><br />", '</ul><br />', $string);
    // Inside <ul> </ul> tags
    $string = str_replace("<ul><br />", '<ul>', $string);
    $string = str_replace("<br /></ul>", '</ul>', $string);

    // Around <ol></ol> tags
    $string = str_replace("<br /><br /><ol>", '<br /><ol>', $string);
    $string = str_replace("</ol><br /><br />", '</ol><br />', $string);
    // Inside <ol> </ol> tags
    $string = str_replace("<ol><br />", '<ol>', $string);
    $string = str_replace("<br /></ol>", '</ol>', $string);

    // Around <li></li> tags
    $string = str_replace("<br /><li>", '<li>', $string);
    $string = str_replace("</li><br />", '</li>', $string);

    return $string;
}


function custom_code_tag_callback($code)
{
    return '<code>'.trim(htmlspecialchars($code[1])).'</code>';
}

function custom_pre_tag_callback($code)
{
    return '<pre><code>'.trim(htmlspecialchars($code[1])).'</code></pre>';
}



// === Process user's input ===

// Process line-breaks
$string = nl2br_special($string);

// Allow simple <code> or <pre> tags for posting code
$string = preg_replace_callback("/\<code\>(.*?)\<\/code\>/is", 'custom_code_tag_callback', $string);
$string = preg_replace_callback("/\<pre\>(.*?)\<\/pre\>/is", 'custom_pre_tag_callback', $string);


// Initiate HTML Purifier config
$purifier_config = HTMLPurifier_Config::createDefault();
$purifier_config->set('HTML.Allowed', 'p,ul,ol,li,strong,b,em,i,u,a[href],code,pre,blockquote,cite,img[src|alt],br,hr,h3,h4');
$purifier_config->set('AutoFormat.Linkify', true); // Make links clickable
//$purifier_config->set('HTML.TargetBlank', true); // Uncomment if you want links to open new tabs
//$purifier_config->set('AutoFormat.AutoParagraph', true); // Leave this commented as it conflicts with nl2br


// Initiate HTML Purifier
$purifier = new HTMLPurifier($purifier_config);

// Purify the content!
$string = $purifier->purify($string);

Cheers!
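For comparison, the comments on the question suggest a different order: purify first, then use a DOM processor to apply the line-break conversion only to suitable text nodes. The following is a minimal sketch of that idea, not a tested drop-in. The function name nl2br_dom is hypothetical, the code assumes PHP's DOM extension is available, and it expects HTML that has already been purified. It does not trim the extra breaks around block elements the way the string-based function above does.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): turn newlines into <br>
// inside the text nodes of already purified HTML.
function nl2br_dom($purifiedHtml)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    // The XML prolog is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $purifiedHtml . '</div>');
    libxml_clear_errors();

    $root  = $doc->getElementsByTagName('div')->item(0); // our wrapper element
    $xpath = new DOMXPath($doc);

    // Skip <pre> content and the whitespace that sits directly inside lists.
    $query = './/text()[not(ancestor::pre) and not(parent::ul) and not(parent::ol)]';
    $nodes = iterator_to_array($xpath->query($query, $root));

    foreach ($nodes as $node) {
        $parts = preg_split('/\r\n|\r|\n/', $node->nodeValue);
        if (count($parts) < 2) {
            continue; // no line break in this text node
        }
        // Rebuild the node as text pieces separated by <br> elements.
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i > 0) {
                $fragment->appendChild($doc->createElement('br'));
            }
            if ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize the wrapper's children, i.e. its inner HTML.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$clean  = "Hello\nWorld\n<pre>a\nb</pre>\n<ul>\n<li>One</li>\n<li>Two</li>\n</ul>\nBye,\nMe";
$result = nl2br_dom($clean);
echo $result, PHP_EOL;
```

Working on text nodes avoids the regular-expression problem with nested lists, because the decision is made per node from its parent and ancestors. Line breaks typed inside an <li> item are still converted, since only whitespace directly under <ul> or <ol> is skipped.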

Comments:

Answer 2:

Maybe this will help.

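function custom_nl2br($html) {
    // Temporarily swap the first <ul>...</ul> block for a placeholder
    // so that nl2br() does not touch the line breaks inside it
    $pattern = "/<ul>(.*?)<\/ul>/s";
    if (!preg_match($pattern, $html, $matches)) {
        return nl2br($html); // no list found: plain nl2br() is enough
    }

    $html = nl2br(str_replace($matches[0], '[placeholder]', $html));
    $html = str_replace('[placeholder]', $matches[0], $html);

    return $html;
}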
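The function above protects only the first <ul> block. One possible generalization, offered as a sketch rather than a tested drop-in, swaps every <pre>, <ul> and <ol> block for a unique placeholder, runs nl2br() on the rest, and then restores the blocks. The function name nl2br_outside_blocks and the placeholder format are invented for this example. Like the original pattern, it assumes tags without attributes, and the non-greedy match still does not handle a list nested inside a list of the same type.

```php
<?php
// Hypothetical helper: apply nl2br() only outside <pre>, <ul> and <ol> blocks.
function nl2br_outside_blocks($html)
{
    $blocks = array();

    // Replace each protected block with a unique token and remember it.
    $html = preg_replace_callback(
        '/<(pre|ul|ol)>.*?<\/\1>/is',
        function ($m) use (&$blocks) {
            $token = '[[BLOCK_' . count($blocks) . ']]';
            $blocks[$token] = $m[0];
            return $token;
        },
        $html
    );

    // Only the remaining text gets <br /> tags.
    $html = nl2br($html);

    // Put the untouched blocks back in place of their tokens.
    return str_replace(array_keys($blocks), array_values($blocks), $html);
}

$input  = "Line one\nLine two\n<ul>\n<li>A</li>\n<li>B</li>\n</ul>\n<pre>\nx\ny\n</pre>\nBye,\nMe";
$output = nl2br_outside_blocks($input);
echo $output, PHP_EOL;
```

In this sample the list and the <pre> block come back byte for byte, while the five newlines outside them each receive a <br />. As with the other string-based approaches here, this would run before purification.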
Comments:
