Extract URLs from text in PHP

Posted 2023-02-24

【Title】: Extract URLs from text in PHP 【Posted】: 2010-10-28 23:54:19 【Question】:

I have this text:

$string = "this is my friend's website http://example.com I think it is coll";

How can I extract the link into another variable?

I know it should be done with a regular expression, probably preg_match(), but I don't know how.

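【Question comments】:

Possible duplicate of Extract URL from string

@Michael Berkowski: How can this be the duplicate? The user asked this question on May 26 '09 at 14:13, but the question you linked was asked on Dec 8 '10 at 17:44. It may well be the other way around.

【Answer 1】: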

preg_match_all('/[a-z]+:\/\/\S+/', $string, $matches);

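This is an easy way that will work for a lot of cases, but not all. All the matches are put in $matches. Note that this does not cover links inside anchor elements.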

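【Discussion】:

-1: You just created an XSS vulnerability, because it also extracts javascript: URLs.

It wasn't stated what he would use it for, so I'm not taking that into account. He just wanted to get URLs into a variable.

@Michael: Finding javascript URLs is not yet a vulnerability; using them without any checking is. Sometimes the presence and number of such URLs is useful information. I would have chosen a different delimiter. :)

【Answer 2】: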

URLs have a fairly complex definition, so you must first decide what you want to capture. A simple example that captures anything starting with http:// or https:// could be:

preg_match_all('!https?://\S+!', $string, $matches);
$all_urls = $matches[0];

Note that this is very basic and could capture invalid URLs. For anything more complex, I would recommend reading up on POSIX and PHP regular expressions.
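One concrete way the basic pattern over-captures: `\S+` runs to the next whitespace, so punctuation that directly follows a URL becomes part of the match. Below is a minimal sketch of one way to tidy that up. The helper name `extract_http_urls` and the set of trimmed characters are assumptions made for illustration, not part of the answer.

```php
<?php

/**
 * Extract http(s) URLs and strip punctuation that trails each match.
 * Hypothetical helper; the trimmed character set is an assumption.
 */
function extract_http_urls(string $text): array
{
    // Same pattern as in the answer: scheme, then everything up to whitespace.
    preg_match_all('!https?://\S+!', $text, $matches);

    // \S+ also swallows sentence punctuation such as "," or ".", so trim it.
    return array_map(function ($url) {
        return rtrim($url, '.,;:!?');
    }, $matches[0]);
}

$urls = extract_http_urls('Visit http://example.com, or https://example.org/a?b=1.');
print_r($urls);
```

The trade-off is that a URL which legitimately ends in one of the trimmed characters loses it, so the character list would need adjusting to the kind of text being processed.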

【Discussion】:

【Answer 3】:

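If the text you extract the URLs from is user-submitted and you are going to display the result as links anywhere, you have to be very, VERY careful to avoid XSS vulnerabilities, most prominently "javascript:" protocol URLs, but also malformed URLs that might trick your regex and/or the displaying browser into executing them as a Javascript URL. At the very least, you should accept only URLs that start with "http", "https" or "ftp".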

There is also a blog entry by Jeff where he describes some other problems with extracting URLs.
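The scheme restriction suggested in this answer could be sketched roughly as follows. The helper name `has_allowed_scheme` is hypothetical, and the default list simply mirrors the three schemes named above. It relies on PHP's built-in parse_url() to read the scheme.

```php
<?php

/**
 * Return true only when the URL's scheme is on an allowlist.
 * Hypothetical helper; the default list mirrors the schemes named above.
 */
function has_allowed_scheme(string $url, array $allowed = ['http', 'https', 'ftp']): bool
{
    // parse_url() returns null when there is no scheme and false on
    // seriously malformed input; both are rejected by the is_string() check.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    return is_string($scheme) && in_array(strtolower($scheme), $allowed, true);
}

var_dump(has_allowed_scheme('https://example.com/page'));
var_dump(has_allowed_scheme('javascript:alert(1)'));
```

A check like this only covers the scheme. The URL still has to be HTML-escaped when it is written into a page.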

【Discussion】:

【Answer 4】:

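Probably the safest way is to use code snippets from WordPress. Download the latest version (currently 3.1.1) and look at wp-includes/formatting.php. There is a function named make_clickable which takes plain text as a parameter and returns a formatted string. You can grab the code that extracts URLs from it. It is quite complex, though.

This one-line regex might be helpful.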

preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $string, $match);

But this regex still cannot filter out some malformed URLs (e.g. http://google:ha.ckers.org).
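One possible second pass for malformed matches like the one above is to run each candidate through parse_url() and drop anything it cannot parse. This is only a sketch under that assumption: the helper name `drop_unparseable_urls` is hypothetical, and parse_url() is expected to return false for the example because the text after the second colon is not a valid port. It is not a full validator.

```php
<?php

/**
 * Keep only candidate URLs that parse_url() accepts and that have a host.
 * Hypothetical helper, sketched as a second pass after regex extraction.
 */
function drop_unparseable_urls(array $candidates): array
{
    return array_values(array_filter($candidates, function ($url) {
        $parts = parse_url($url);
        // parse_url() returns false for input it cannot parse.
        return is_array($parts) && !empty($parts['host']);
    }));
}

$string = 'good http://example.com/page and bad http://google:ha.ckers.org here';
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $string, $match);
print_r(drop_unparseable_urls($match[0]));
```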

See also: How to mimic *** Auto-Link Behavior

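【Discussion】:

I played with WordPress's formatting.php. Using make_clickable is a nice idea, but it ends up pulling in half of WordPress as dependencies.

Good, it makes sure the final character is not a weird one.

This does not recognize URLs without http, such as google.com

This regex will match google:ha.ckers.org: "@https?:\/\/(www\.)?[-a-zA-Z0-9\@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()\@:%_\+.~#?&//=]*)@"; I don't remember where I found it, so I can't give credit.

***.com/questions/23366790/… worked better for me than this (in a WordPress context).

【Answer 5】: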

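preg_match_all ("/a[\s]+[^>]*?href[\s]?=[\s\"\']+".
                "(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
                $var, $matches); // pass $matches normally; call-time &$matches is a fatal error since PHP 5.4

$matches = $matches[1]; // capture group 1 holds the href values
$list = array();

foreach($matches as $var)
{
    print($var."<br>");
}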

【Discussion】:

【Answer 6】:

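I tried to do as Nobu said and use WordPress, but because of its heavy dependence on other WordPress functions I instead took Nobu's regular expression for preg_match_all() and turned it into a function using preg_replace_callback(). The function replaces all links in a text with clickable links. It uses anonymous functions, so you will need PHP 5.3, or you can rewrite the code to use an ordinary function instead.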

<?php 

/**
 * Make clickable links from URLs in text.
 */

function make_clickable($text) {
    $regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#';
    return preg_replace_callback($regex, function ($matches) {
        // $matches[0] is the full URL matched by the pattern
        return "<a href='{$matches[0]}'>{$matches[0]}</a>";
    }, $text);
}
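Since the pattern above still allows quote characters inside a match, a more defensive variant could HTML-escape each URL before writing it into the markup. The sketch below is one way to do that, not the answerer's code; the name `make_clickable_escaped` is hypothetical. It escapes only the matched URLs, so the surrounding text would still need its own escaping if it comes from users.

```php
<?php

/**
 * Variant of make_clickable() that HTML-escapes each URL before output.
 * The name make_clickable_escaped is hypothetical.
 */
function make_clickable_escaped(string $text): string
{
    $regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#';

    return preg_replace_callback($regex, function ($matches) {
        // Escape &, quotes and angle brackets so the URL cannot break out
        // of the href attribute.
        $safe = htmlspecialchars($matches[0], ENT_QUOTES, 'UTF-8');
        return '<a href="' . $safe . '">' . $safe . '</a>';
    }, $text);
}

echo make_clickable_escaped('go to http://example.com now');
```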

【Discussion】:

Please note: I have updated your answer to use an anonymous function as the callback instead of create_function().

【Answer 7】:

You could do it like this:

<?php
$string = "this is my friend's website http://example.com I think it is coll";
echo explode(' ',strstr($string,'http://'))[0]; //"prints" http://example.com

【Discussion】:

【Answer 8】:

Code that worked for me (especially if you have several links in your $string):

$string = "this is my friend's website https://www.example.com I think it is cool, but this one is cooler https://www.***.com :)";
$regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
preg_match_all($regex, $string, $matches);
$urls = $matches[0];
// go over all links
foreach($urls as $url) 
{
    echo $url.'<br />';
}

Hope it helps others as well.

【Discussion】:

I have tested all the answers; this is the only one that removes the HTML tags.

【Answer 9】:

You can try this to find the links and rewrite them (adding an href link).

$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";

// The Text you want to filter for urls
$text = "The text you want to filter goes here. http://example.com";

if(preg_match($reg_exUrl, $text, $url)) {

       // $0 is the full match, so each URL is wrapped in a link to itself
       echo preg_replace($reg_exUrl, '<a href="$0">$0</a> ', $text);

} else {

       echo "No url in the text";

}

Reference: http://php.net/manual/en/function.preg-match.php

【Discussion】:

【Answer 10】:

This regex works great for me, and I have checked it against all kinds of URLs:

<?php
$string = "Thisregexfindurlhttp://www.rubular.com/r/bFHobduQ3n mixedwithstring";
preg_match_all('/(https?|ssh|ftp):\/\/[^\s"]+/', $string, $url);
$all_url = $url[0]; // Returns Array Of all Found URL's
$one_url = $url[0][0]; // Gives the First URL in Array of URL's
?>

The many URLs I checked can be found here: http://www.rubular.com/r/bFHobduQ3n

【Discussion】:

【Answer 11】:

// Declared as a class method ("public function") in the original; shown standalone here.
function find_links($post_content) {
    $reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
    // Check if there is a url in the text
    if(preg_match_all($reg_exUrl, $post_content, $urls)) {
        // make the urls hyper links,
        foreach($urls[0] as $url) {
            $post_content = str_replace($url, '<a href="'.$url.'" rel="nofollow"> LINK </a>', $post_content);
        }
        //var_dump($post_content);die(); //uncomment to see result
        //return text with hyper links
        return $post_content;
    } else {
        // if no urls in the text just return the text
        return $post_content;
    }
}

【Discussion】:

【Answer 12】:

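URLs have a lot of edge cases: a URL can contain parentheses, or come without a protocol, and so on. That is why a regex alone is not enough.

I created a PHP library that can handle a lot of these edge cases: Url highlight.

Example: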

<?php

use VStelmakh\UrlHighlight\UrlHighlight;

$urlHighlight = new UrlHighlight();
$urlHighlight->getUrls("this is my friend's website http://example.com I think it is coll");
// return: ['http://example.com']

For more details, see the readme. For the URL cases covered, see the tests.

【Discussion】:

【Answer 13】:

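Here is a function I use. I can't remember where it came from, but it seems to do a pretty good job of finding links in text and turning them into clickable links.

You can change the function to suit your needs. I just wanted to share it, because while looking around I remembered I had this in one of my helper libraries.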

function make_links($str)
{
  $pattern = '(?xi)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\];:\'".,<>?«»“”‘’]))';

  return preg_replace_callback("#$pattern#i", function($matches) {
    $input = $matches[0];
    // Prepend a scheme when the match has none (e.g. www.example.com)
    $url = preg_match('!^https?://!i', $input) ? $input : "http://$input";
    return '<a href="' . $url . '" rel="nofollow" target="_blank">' . "$input</a>";
  }, $str);
}

Usage:

$subject = "this is a link http://google:ha.ckers.org maybe don't want to visit it?";
echo make_links($subject);

Output:

this is a link <a href="http://google:ha.ckers.org" rel="nofollow" target="_blank">http://google:ha.ckers.org</a> maybe don't want to visit it?

【Discussion】:

【Answer 14】:

<?php
preg_match_all('/(href|src)[\s]?=[\s\"\']?+(.*?)[\s\"\']+.*?/', $webpage_content, $link_extracted);

【Discussion】:
