Editing an HTML string having text and several images using regex
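【Title】Editing an HTML string having text and several images using regex 【Posted】2018-11-20 02:30:45 【Question】I am new to this forum and hope to get some help. I have an HTML string containing text and several base64 images. I need to loop through all the image tags and add a slash before the closing > so that each image tag ends with />, then return a new HTML string containing the changes.
Each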
<IMG src="....">
should become
<IMG src="...."/>
I am not familiar with HTML and would like to know how to do this (using regex?). Here is some pseudocode:
Function GetSourceImges(Sourcehtml As String) As List(Of String)
Dim listOfImgs As New List(Of String)()
'use regex to find image tags
'Return list of base64 image tags
End Function
For each image in list
insert a slash appropriately
next
Then rebuild a new HTML string from the edited images. Thanks.
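The first step of the pseudocode above (collecting the image tags) could be filled in roughly as follows. This is only a sketch: the names `ImgTagFinder` and `GetSourceImages` and the pattern are assumptions made for illustration, and the pattern assumes that no attribute value contains a literal `>` character.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module ImgTagFinder
    ' Returns every <img ...> tag found in the HTML, in document order.
    ' Assumption: no attribute value contains a literal ">" character.
    Function GetSourceImages(sourceHtml As String) As List(Of String)
        Dim listOfImgs As New List(Of String)()
        ' \b keeps the pattern from matching longer tag names that start with "img".
        For Each m As Match In Regex.Matches(sourceHtml, "<img\b[^>]*>", RegexOptions.IgnoreCase)
            listOfImgs.Add(m.Value)
        Next
        Return listOfImgs
    End Function
End Module
```

Matching on `<img` rather than on the bare letters `IMG` avoids accidental hits inside base64 data, since `<` is not a base64 character.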
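【Comments】
SO is not a forum, it is a Q&A site. It looks like you have access to a DOM structure; which package are you using? It looks like VB.NET. Please add the relevant tags to the question so that the right users can see it.
Thanks. I just signed up and don't understand tags. As a novice I use VB.NET and some C#. So should the tags be VB.NET and C#?
I added the VB.NET tag because the code you posted is VB.NET. But where is your code that tries to modify the tags? What you have only shows how to extract and set the src attribute value, which seems unrelated to the question. Please update, or the question will be closed as off-topic.
"RegEx match open tags except XHTML self-contained tags" has some interesting and detailed answers about parsing HTML with regex. A more reliable approach is mentioned in "How do I use HTML Agility Pack to edit an HTML snippet".
OK, I just edited my question. I missed a part when copying from the text editor.
【Answer 1】Use LINQ to map all the "IMG" tags and use their indexes as anchors to fix the missing "/" characters. Please see my comments in the code.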
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim htmlstring As String = "<IMG src=""....""> " & vbCrLf _
            & "<img src=""...."">" & vbCrLf _
            & "<p>blahblah</p>" & vbCrLf _
            & "<IMG src=""...."">" & vbCrLf _
            & "<p>blahblah</p>"
        ' find all indexes of "IMG" using regex and lambda expressions
        Dim indexofIMG() As Integer = Regex.Matches(htmlstring, "IMG", RegexOptions.IgnoreCase) _
            .Cast(Of Match)().Select(Function(x) x.Index).ToArray()
        ' check from each index of "IMG" if "/" is missing
        For Each itm As Integer In indexofIMG
            Dim counter As Integer = itm
            While counter < htmlstring.Length - 1
                If htmlstring(counter) = ">" Then
                    If htmlstring(counter - 1) <> "/" Then
                        ' fix the missing "/" using the Insert() method
                        htmlstring = htmlstring.Insert(counter, "/")
                    End If
                    Exit While
                End If
                counter += 1
            End While
        Next
        Console.WriteLine(htmlstring)
        Console.ReadLine()
    End Sub
End Module
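Two details of the loop above are worth checking. The condition `counter < htmlstring.Length - 1` never examines the last character, so a tag whose closing `>` is the very last character of the string is skipped. The indexes are also collected before any `Insert()` call, so each insertion leaves the later indexes slightly stale. The variant below is a minimal sketch of the same idea under those observations, not a definitive fix: the names `ImgSlashFixer` and `CloseImgTags` are hypothetical, it matches on `<img` instead of `IMG`, and it assumes no attribute value contains a literal `>`.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module ImgSlashFixer
    ' Walks the "<img" matches from last to first so that the indexes of
    ' earlier tags stay valid after each Insert().
    Function CloseImgTags(html As String) As String
        Dim starts = Regex.Matches(html, "<img\b", RegexOptions.IgnoreCase) _
            .Cast(Of Match)().Select(Function(m) m.Index).Reverse().ToArray()
        For Each start As Integer In starts
            ' IndexOf searches up to and including the last character.
            Dim closeAt As Integer = html.IndexOf(">"c, start)
            If closeAt > 0 AndAlso html(closeAt - 1) <> "/"c Then
                html = html.Insert(closeAt, "/")
            End If
        Next
        Return html
    End Function

    Sub Main()
        ' Here the final ">" is the last character of the input.
        Console.WriteLine(CloseImgTags("<P>John Doe</P><IMG src=""....""> Jackson5<IMG src=""...."">"))
        ' prints: <P>John Doe</P><IMG src="...."/> Jackson5<IMG src="...."/>
    End Sub
End Module
```

Processing the matches in reverse order is a design choice: an insertion only shifts the text after it, so tags that have not been visited yet keep their original positions.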
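【Comments】
Like I said, htmlstring has not only images but also other tags. (blahblah would be wrongly replaced.) So some loop is needed to identify the image tags, apply the change, and rebuild a new string. Any help?
Pabdev, Here is closer to what I need.
The reason for my question is that I am using the iTextSharp CustomImageTagProcessor (with XMLWorker), which for some unknown reason displays base64 images whose tags end with "/>" but not those ending with ">". Thanks.
@Gbhskk Welcome to ***. If you found the answer helpful, please mark it as the accepted answer by clicking the check mark icon.
Surprisingly, only the first image tag was modified. The logic seems correct.
【Answer 2】Surprisingly, it works in a console application but not when I view the result in a rich text box, as in the btnEditHTML method below. The generated PDF has only one red dot instead of two. I can't say why. I must say you have been very helpful.
'SetTable and CustomImageTagProcessor are borrowed from [here]: iTextsharp base64 embedded image in header not parsing/showing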
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.tool.xml
Imports iTextSharp.text.pdf
Imports iTextSharp.tool.xml.parser
Imports iTextSharp.tool.xml.pipeline.css
Imports iTextSharp.tool.xml.pipeline.html
Imports iTextSharp.tool.xml.pipeline.end
Imports iTextSharp.tool.xml.html
Imports System.Text.RegularExpressions
Public Class Form1
Dim dsktop As String = My.Computer.FileSystem.SpecialDirectories.Desktop
Public Function GetFormattedHTML(str As String) As String
'format images by changing > to />
' find all indexes of "IMG" using regex and lambda expressions
Dim indexofIMG() As Integer = Regex.Matches(str.ToString, "IMG", RegexOptions.IgnoreCase) _
.Cast(Of Match)().Select(Function(x) x.Index).ToArray()
' check from each index of "IMG" if "/" is missing '
For Each itm As Integer In indexofIMG
Dim counter As Integer = itm
While counter < str.ToString.Length - 1
If str(counter) = ">" Then
If str(counter - 1) <> "/" Then
' fix the missing "/" using Insert() method '
str = str.ToString.Insert(counter, " /")
End If
Exit While
End If
counter += 1
End While
Next
Return str.ToString
End Function
Private Sub btnEditHTML_Click(sender As Object, e As EventArgs) Handles btnEditHTML.Click
Rtb.Text = String.Empty
'the 2 base64 images in the html below are actually just small red dots
Dim RawHTML As String = "<P>John Doe</P><IMG " &
"src=""""> Jackson5<IMG " &
"src="""">"
Rtb.Text = GetFormattedHTML(RawHTML)
'notice that the 2nd base64 string is not edited as required.
End Sub
Private Sub btnGenerate_Click(sender As Object, e As EventArgs) Handles btnGenerate.Click
'here I create a 2 column itextsharp table to parse my html into the cells
Dim doc As New iTextSharp.text.Document(iTextSharp.text.PageSize.A4, 25, 25, 25, 30)
Dim wri As PdfWriter = PdfWriter.GetInstance(doc, New System.IO.FileStream(dsktop & "\testtable.pdf", System.IO.FileMode.Create))
doc.Open()
'set table columnwidths -------------------------------------------------------------
Dim MainTable As New PdfPTable(2) '2 column table
MainTable.WidthPercentage = 100
Dim Wth(1) As Single
Dim u As Integer = 2
For i As Integer = 0 To 1
Wth(i) = CInt(Math.Floor(2 * 500 / u))
Next
MainTable.SetWidths(Wth)
Dim htmlstr As String = GetFormattedHTML("<P>John Doe</P><IMG " &
"src=""""> Jackson5<IMG " &
"src="""">")
Dim Elmts = New ElementList()
Elmts = XMLWorkerHelper.ParseToElementList(htmlstr, Nothing)
Dim MinorTable As New PdfPTable(1)
MinorTable = SetTable(Elmts, htmlstr)
For i = 1 To 2
Dim Cell As New PdfPCell
Cell.AddElement(MinorTable)
MainTable.AddCell(Cell)
Next
doc.Add(MainTable)
doc.Close()
Process.Start(dsktop & "\testtable.pdf")
End Sub
Public Function SetTable(ByVal elements As ElementList, ByVal htmlcode As String) As PdfPTable
Dim tagProcessors As DefaultTagProcessorFactory = CType(Tags.GetHtmlTagProcessorFactory(), DefaultTagProcessorFactory)
tagProcessors.RemoveProcessor(HTML.Tag.IMG) ' remove the default processor
tagProcessors.AddProcessor(HTML.Tag.IMG, New CustomImageTagProcessor()) ' use our new processor
Dim cssResolver As ICSSResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(True)
cssResolver.AddCssFile(Application.StartupPath & "\pdf.css", True)
'see sample css file at https://learnwebcode.com/how-to-create-your-first-css-file/
'Setup Fonts
Dim xmlFontProvider As XMLWorkerFontProvider = New XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS)
xmlFontProvider.RegisterDirectory(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "assets/fonts/"))
Dim cssAppliers As CssAppliers = New CssAppliersImpl(xmlFontProvider)
Dim htmlContext As HtmlPipelineContext = New HtmlPipelineContext(cssAppliers)
htmlContext.SetAcceptUnknown(True)
htmlContext.SetTagFactory(tagProcessors)
Dim pdf As ElementHandlerPipeline = New ElementHandlerPipeline(elements, Nothing)
Dim htmlp As HtmlPipeline = New HtmlPipeline(htmlContext, pdf)
Dim css As CssResolverPipeline = New CssResolverPipeline(cssResolver, htmlp)
Dim worker As XMLWorker = New XMLWorker(css, True)
Dim p As XMLParser = New XMLParser(worker)
'Dim holderTable As New PdfPTable(1)
Dim holderTable As PdfPTable = New PdfPTable(1)
holderTable.WidthPercentage = 100
holderTable.HorizontalAlignment = Element.ALIGN_LEFT
Dim holderCell As New PdfPCell()
holderCell.Padding = 0
holderCell.UseBorderPadding = False
holderCell.Border = 0
p.Parse(New MemoryStream(System.Text.Encoding.ASCII.GetBytes(htmlcode)))
For Each el As IElement In elements
holderCell.AddElement(el)
Next
holderTable.AddCell(holderCell)
'Dim holderRow As New PdfPRow(holderCell)
'holderTable.Rows.Add(holderRow)
Return holderTable
End Function
End Class
Public Class CustomImageTagProcessor
Inherits iTextSharp.tool.xml.html.Image
Public Overrides Function [End](ctx As IWorkerContext, tag As Tag, currentContent As IList(Of IElement)) As IList(Of IElement)
Dim attributes As IDictionary(Of String, String) = tag.Attributes
Dim src As String = String.Empty
If Not attributes.TryGetValue(iTextSharp.tool.xml.html.HTML.Attribute.SRC, src) Then
Return New List(Of IElement)(1)
End If
If String.IsNullOrEmpty(src) Then
Return New List(Of IElement)(1)
End If
If src.StartsWith("data:image/", StringComparison.InvariantCultureIgnoreCase) Then
' data:[<MIME-type>][;charset=<encoding>][;base64],<data>
Dim base64Data As String = src.Substring(src.IndexOf(",") + 1)
Dim imagedata As Byte() = Convert.FromBase64String(base64Data)
Dim image As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(imagedata)
Dim list As List(Of IElement) = New List(Of IElement)()
Dim htmlPipelineContext As pipeline.html.HtmlPipelineContext = GetHtmlPipelineContext(ctx)
list.Add(GetCssAppliers().Apply(New Chunk(DirectCast(GetCssAppliers().Apply(image, tag, htmlPipelineContext), iTextSharp.text.Image), 0, 0, True), tag, htmlPipelineContext))
Return list
Else
If File.Exists(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, src)) Then
Dim imagedata As Byte() = File.ReadAllBytes(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, src))
Dim image As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, src))
Dim list As List(Of IElement) = New List(Of IElement)()
Dim htmlPipelineContext As pipeline.html.HtmlPipelineContext = GetHtmlPipelineContext(ctx)
list.Add(GetCssAppliers().Apply(New Chunk(DirectCast(GetCssAppliers().Apply(image, tag, htmlPipelineContext), iTextSharp.text.Image), 0, 0, True), tag, htmlPipelineContext))
Return list
End If
Return MyBase.[End](ctx, tag, currentContent)
End If
End Function
End Class
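As a point of comparison, the tag-fixing step done by GetFormattedHTML above can also be written as a single `Regex.Replace` call, which avoids index bookkeeping altogether. This is a hedged sketch, not the approach used in the thread: the names `ImgTagRewriter` and `GetFormattedHtmlWithReplace` are hypothetical, and the pattern assumes that no attribute value contains a literal `>` character.

```vbnet
Imports System.Text.RegularExpressions

Module ImgTagRewriter
    ' Rewrites every <img ...> or <img .../> tag so that it ends in " />".
    ' Assumption: no attribute value contains a literal ">" character.
    Function GetFormattedHtmlWithReplace(html As String) As String
        ' Group 1 captures the tag up to, but not including, any trailing
        ' whitespace and optional "/" before the closing ">".
        Return Regex.Replace(html, "(<img\b[^>]*?)\s*/?>", "$1 />", RegexOptions.IgnoreCase)
    End Function
End Module
```

For example, `GetFormattedHtmlWithReplace("<P>John Doe</P><IMG src=""x"">")` would return `<P>John Doe</P><IMG src="x" />`, and a tag that already ends in `/>` is normalized to the same ` />` ending rather than gaining a second slash.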
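【Comments】
【Answer 3】I strongly recommend simply using AngleSharp to parse the HTML, edit the document as needed, and then save it again.
There are plenty of posts here about why trying to parse HTML with regular expressions is a bad idea.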
var doc = new HtmlParser().Parse(html);
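Since you are not actually changing the HTML content, only fixing the tags, you should be able to just parse it and save it back without changes to fix the tags.
【Comments】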