Editing an HTML string that contains text and several images using regex


【Original title】: editing html string having text and several images using regex 【Posted】: 2018-11-20 02:30:45 【Question】:

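I'm new to this forum and hope to get some help. I have an HTML string that contains text and several base64 images. I need to go through all the image tags and add a slash before the closing > of each one, so that every image tag ends with />, and then return a new HTML string containing the changes.

Each

<IMG src="...."> 

should become

<IMG src="...."/>

I'm not familiar with HTML and would like to know how to do this (using regex?). Here is some pseudocode: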

   Function GetSourceImges(Sourcehtml As String) As List(Of String)
    Dim listOfImgs As New List(Of String)()
       'use regex to find image tags
       'Return list of base64 image tags
   End Function

    For each image in list
        insert a slash appropriately
    next

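Then rebuild a new HTML string with the edited images. Thanks.

【Comments on the question】:

SO is not a forum but a Q&A site. It looks like you have access to the DOM structure; what package are you using? It looks like VB.NET. Please add the relevant tags to the question so that the right users can see it.
Thanks. I just signed up and don't understand tags. As a newbie I'm using VB.NET and some C#. So should the tags be VB.net and c#?
I added the VB.NET tag because the code you posted is in VB.NET. But what code have you tried for modifying the tags? What you have only shows how to extract and set the src attribute value, which seems unrelated to the question. Please update, otherwise the question will be closed as off-topic.
RegEx match open tags except XHTML self-contained tags has some interesting and detailed answers about parsing HTML with regex. A more reliable approach is mentioned in How do I use HTML Agility Pack to edit an HTML snippet.
OK, I just edited my question. I missed a part when copying from the text editor.

【Answer 1】:

Use LINQ to map all the "IMG" tags and use their indexes as anchors to fix the missing "/" character. Please see my comments in the code.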

Imports System.Linq
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim htmlstring As String = "<IMG src=""....""> " & vbCrLf _
            & "<img src=""...."">" & vbCrLf _
            & "<p>blahblah</p>" & vbCrLf _
            & "<IMG src=""...."">" & vbCrLf _
            & "<p>blahblah</p>"

        ' Find the index of every opening "<img" (case-insensitive) using regex and a lambda expression.
        ' The indexes are reversed so that inserting a character never shifts an index still to be processed.
        Dim indexofIMG() As Integer = Regex.Matches(htmlstring, "<img\b", RegexOptions.IgnoreCase) _
            .Cast(Of Match)().Select(Function(x) x.Index).Reverse().ToArray()

        ' From each "<img", scan forward to the closing ">" and check whether "/" is missing.
        For Each itm As Integer In indexofIMG
            Dim counter As Integer = itm
            ' Scan up to and including the last character, because the tag may end the string.
            While counter < htmlstring.Length
                If htmlstring(counter) = ">"c Then
                    If htmlstring(counter - 1) <> "/"c Then
                        ' Fix the missing "/" using the Insert() method.
                        htmlstring = htmlstring.Insert(counter, "/")
                    End If
                    Exit While
                End If
                counter += 1
            End While
        Next

        Console.WriteLine(htmlstring)
        Console.ReadLine()
    End Sub
End Module
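
The same fix can also be written as a single regular-expression replacement instead of an index scan. This is only a minimal sketch, not a definitive implementation: the module name ImgTagFixer and the function name CloseImgTags are made up for illustration, and the pattern assumes that no attribute value contains a literal ">" (true of base64 data, but not guaranteed for HTML in general).

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ImgTagFixer
    ' Hypothetical helper: rewrites every <img ...> or <img .../> as <img ... />.
    ' Group 1 captures "<img" plus its attributes; trailing whitespace and an
    ' optional "/" before ">" are dropped and replaced by a uniform " />".
    Function CloseImgTags(html As String) As String
        Return Regex.Replace(html, "(<img\b[^>]*?)\s*/?>", "$1 />", RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim html As String = "<P>John Doe</P><IMG src=""...."">&nbsp;Jackson5<img src=""....""/>"
        Console.WriteLine(CloseImgTags(html))
    End Sub
End Module
```

For the sample string this should print <P>John Doe</P><IMG src="...." />&nbsp;Jackson5<img src="...." />. Other tags such as <P> are left alone because the pattern only matches tags whose name is exactly img. For arbitrary HTML a real parser is still the safer choice.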

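【Comments】:

As I said, htmlstring contains not only images but other tags as well (<p>blahblah</p> would be replaced incorrectly). So some kind of loop is needed that identifies the image tags, fixes them, and rebuilds a new string. Any help?
Pabdev, this is closer to what I need. The reason for my question is that I'm using the iTextSharp CustomImageTagProcessor (with XMLWorker), and for some unknown reason it displays base64 images whose tags end with "/>" but not those ending with ">". Thanks.
@Gbhskk Welcome to ***. If you found the answer helpful, please click the checkmark icon to mark it as the answer.
Surprisingly, only the first image tag gets modified. The logic seems correct.

【Answer 2】:

Surprisingly, it works in a console application, but not when I view the result in a rich text box, as in the btnEditHTML method below. The generated PDF has only one red dot instead of two. I can't say why. I must say you have been very helpful.

' SetTable and CustomImageTagProcessor are borrowed from [here]: iTextsharp base64 embedded image in header not parsing/showing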

Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.tool.xml
Imports iTextSharp.text.pdf
Imports iTextSharp.tool.xml.parser
Imports iTextSharp.tool.xml.pipeline.css
Imports iTextSharp.tool.xml.pipeline.html
Imports iTextSharp.tool.xml.pipeline.end
Imports iTextSharp.tool.xml.html
Imports System.Text.RegularExpressions

Public Class Form1

    Dim dsktop As String = My.Computer.FileSystem.SpecialDirectories.Desktop
    Public Function GetFormattedHTML(str As String) As String
        'format images by changing > to />
        ' find all indexes of img using regex and lambda expressions
        Dim indexofIMG() As Integer = Regex.Matches(str.ToString, "IMG", RegexOptions.IgnoreCase) _
        .Cast(Of Match)().Select(Function(x) x.Index).ToArray()

        ' check from each index of "IMG" if "/" is missing '
        For Each itm As Integer In indexofIMG
            Dim counter As Integer = itm
            While counter < str.ToString.Length - 1
                If str(counter) = ">" Then
                    If str(counter - 1) <> "/" Then
                        ' fix the missing "/" using Insert() method '
                        str = str.ToString.Insert(counter, " /")
                    End If
                    Exit While
                End If
                counter += 1
            End While
        Next
        Return str.ToString
    End Function
    Private Sub btnEditHTML_Click(sender As Object, e As EventArgs) Handles btnEditHTML.Click
        Rtb.Text = String.Empty
        'the 2 base64 images in the html below are actually just small red dots
        Dim RawHTML As String = "<P>John Doe</P><IMG " &
        "src="""">&nbsp;Jackson5<IMG " &
        "src="""">"
        Rtb.Text = GetFormattedHTML(RawHTML)
        'notice that the 2nd base64 string is not edited as required. 
    End Sub

    Private Sub btnGenerate_Click(sender As Object, e As EventArgs) Handles btnGenerate.Click
        'here I create a 2 column itextsharp table to parse my html into the cells

        Dim doc As New iTextSharp.text.Document(iTextSharp.text.PageSize.A4, 25, 25, 25, 30)
        Dim wri As PdfWriter = PdfWriter.GetInstance(doc, New System.IO.FileStream(dsktop & "\testtable.pdf", System.IO.FileMode.Create))
        doc.Open()
        'set table columnwidths -------------------------------------------------------------
        Dim MainTable As New PdfPTable(2) '2 column table
        MainTable.WidthPercentage = 100
        Dim Wth(1) As Single
        Dim u As Integer = 2
        For i As Integer = 0 To 1
            Wth(i) = CInt(Math.Floor(2 * 500 / u))
        Next
        MainTable.SetWidths(Wth)

        Dim htmlstr As String = GetFormattedHTML("<P>John Doe</P><IMG " &
        "src="""">&nbsp;Jackson5<IMG " &
        "src="""">")

        Dim Elmts = New ElementList()
        Elmts = XMLWorkerHelper.ParseToElementList(htmlstr, Nothing)
        Dim MinorTable As New PdfPTable(1)
        MinorTable = SetTable(Elmts, htmlstr)

        For i = 1 To 2
            Dim Cell As New PdfPCell
            Cell.AddElement(MinorTable)
            MainTable.AddCell(Cell)
        Next
        doc.Add(MainTable)
        doc.Close()

        Process.Start(dsktop & "\testtable.pdf")

    End Sub
    Public Function SetTable(ByVal elements As ElementList, ByVal htmlcode As String) As PdfPTable

        Dim tagProcessors As DefaultTagProcessorFactory = CType(Tags.GetHtmlTagProcessorFactory(), DefaultTagProcessorFactory)
        tagProcessors.RemoveProcessor(HTML.Tag.IMG) ' remove the default processor
        tagProcessors.AddProcessor(HTML.Tag.IMG, New CustomImageTagProcessor()) ' use our new processor

        Dim cssResolver As ICSSResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(True)
        cssResolver.AddCssFile(Application.StartupPath & "\pdf.css", True)
        'see sample css file at https://learnwebcode.com/how-to-create-your-first-css-file/

        'Setup Fonts
        Dim xmlFontProvider As XMLWorkerFontProvider = New XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS)
        xmlFontProvider.RegisterDirectory(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "assets/fonts/"))

        Dim cssAppliers As CssAppliers = New CssAppliersImpl(xmlFontProvider)

        Dim htmlContext As HtmlPipelineContext = New HtmlPipelineContext(cssAppliers)
        htmlContext.SetAcceptUnknown(True)
        htmlContext.SetTagFactory(tagProcessors)

        Dim pdf As ElementHandlerPipeline = New ElementHandlerPipeline(elements, Nothing)
        Dim htmlp As HtmlPipeline = New HtmlPipeline(htmlContext, pdf)
        Dim css As CssResolverPipeline = New CssResolverPipeline(cssResolver, htmlp)

        Dim worker As XMLWorker = New XMLWorker(css, True)
        Dim p As XMLParser = New XMLParser(worker)

        'Dim holderTable As New PdfPTable(1)
        Dim holderTable As PdfPTable = New PdfPTable(1)
        holderTable.WidthPercentage = 100
        holderTable.HorizontalAlignment = Element.ALIGN_LEFT

        Dim holderCell As New PdfPCell()
        holderCell.Padding = 0
        holderCell.UseBorderPadding = False
        holderCell.Border = 0

        p.Parse(New MemoryStream(System.Text.Encoding.ASCII.GetBytes(htmlcode)))

        For Each el As IElement In elements
            holderCell.AddElement(el)
        Next
        holderTable.AddCell(holderCell)
        'Dim holderRow As New PdfPRow(holderCell)
        'holderTable.Rows.Add(holderRow)
        Return holderTable

    End Function

End Class

Public Class CustomImageTagProcessor
    Inherits iTextSharp.tool.xml.html.Image
    Public Overrides Function [End](ctx As IWorkerContext, tag As Tag, currentContent As IList(Of IElement)) As IList(Of IElement)
        Dim attributes As IDictionary(Of String, String) = tag.Attributes
        Dim src As String = String.Empty
        If Not attributes.TryGetValue(iTextSharp.tool.xml.html.HTML.Attribute.SRC, src) Then
            Return New List(Of IElement)(1)
        End If

        If String.IsNullOrEmpty(src) Then
            Return New List(Of IElement)(1)
        End If

        If src.StartsWith("data:image/", StringComparison.InvariantCultureIgnoreCase) Then
            ' data:[<MIME-type>][;charset=<encoding>][;base64],<data>
            Dim base64Data As String = src.Substring(src.IndexOf(",") + 1)
            Dim imagedata As Byte() = Convert.FromBase64String(base64Data)
            Dim image As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(imagedata)

            Dim list As List(Of IElement) = New List(Of IElement)()
            Dim htmlPipelineContext As pipeline.html.HtmlPipelineContext = GetHtmlPipelineContext(ctx)
            list.Add(GetCssAppliers().Apply(New Chunk(DirectCast(GetCssAppliers().Apply(image, tag, htmlPipelineContext), iTextSharp.text.Image), 0, 0, True), tag, htmlPipelineContext))
            Return list
        Else
            If File.Exists(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, src)) Then
                Dim imagedata As Byte() = File.ReadAllBytes(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, src))
                Dim image As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, src))

                Dim list As List(Of IElement) = New List(Of IElement)()
                Dim htmlPipelineContext As pipeline.html.HtmlPipelineContext = GetHtmlPipelineContext(ctx)
                list.Add(GetCssAppliers().Apply(New Chunk(DirectCast(GetCssAppliers().Apply(image, tag, htmlPipelineContext), iTextSharp.text.Image), 0, 0, True), tag, htmlPipelineContext))
                Return list
            End If
            Return MyBase.[End](ctx, tag, currentContent)
        End If
    End Function
End Class
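
One likely reason the second image is left unchanged in btnEditHTML_Click is the loop bound in GetFormattedHTML: "While counter < str.ToString.Length - 1" never inspects the last character of the string, and in RawHTML the ">" of the second tag is that last character. In the console example the string ends with <p>blahblah</p>, so the problem does not show up there. The sketch below reproduces the difference in isolation. ScanAndFix and its inclusive parameter are hypothetical names introduced only for this demonstration, and the code is an illustration under those assumptions, not a tested replacement for the form code.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module LoopBoundDemo
    ' Hypothetical reproduction of the scan in GetFormattedHTML.
    ' inclusive = False keeps the original bound (Length - 1);
    ' inclusive = True also examines the final character.
    Function ScanAndFix(html As String, inclusive As Boolean) As String
        ' Visit matches from last to first so an insertion never shifts a pending index.
        Dim starts() As Integer = Regex.Matches(html, "<img\b", RegexOptions.IgnoreCase) _
            .Cast(Of Match)().Select(Function(m) m.Index).Reverse().ToArray()

        For Each start As Integer In starts
            Dim counter As Integer = start
            Dim limit As Integer = If(inclusive, html.Length, html.Length - 1)
            While counter < limit
                If html(counter) = ">"c Then
                    If html(counter - 1) <> "/"c Then
                        html = html.Insert(counter, " /")
                    End If
                    Exit While
                End If
                counter += 1
            End While
        Next
        Return html
    End Function

    Sub Main()
        Dim raw As String = "<P>John Doe</P><IMG src="""">&nbsp;Jackson5<IMG src="""">"
        Console.WriteLine(ScanAndFix(raw, False)) ' original bound
        Console.WriteLine(ScanAndFix(raw, True))  ' inclusive bound
    End Sub
End Module
```

With the original bound, the first line printed should still end in <IMG src="">, while the inclusive bound closes both tags. Matching "<img\b" rather than a bare "IMG" also avoids accidental hits inside base64 data or ordinary text.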

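【Comments】:

【Answer 3】:

I strongly recommend simply using AngleSharp to parse the HTML, editing the document as needed, and then saving it again.

There are plenty of posts here on why trying to parse HTML with regular expressions is a bad idea.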

var doc = new HtmlParser().Parse(html);

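Since you are not actually changing the HTML content, only fixing the tags, you should be able to just parse it and save it again without modification to get the tags fixed.

【Comments】: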
