How to replace a string while ignoring spaces, carriage returns, or line breaks in a given text using C#
【Posted】: 2021-10-10 06:32:39
【Question】:
I want to replace a specific string in a given text (the string differs every time, so it is not limited to the example in this question), with one rule: the match must ignore space characters, carriage returns, and line breaks.
Is this possible?
Take the following HTML document as an example.
<tr>
<td colspan="2" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial; border-top-left-radius: 3px; border-top-right-radius: 3px;">
<b>
<a rel=\"nofollow\" target=\"_blank\" href=\"https://www.monstermmorpg.com\"
style=\"color: rgb(6, 69, 173);
text-decoration-line: none; background: none;\"
title=\"Calyrex (Pokémon)\"><span style=\"color: rgb(0, 0, 0);\">←</span></a>
</b>
</td>
</tr>
The goal is to replace the following string in the document above with something else, for example AAA.
<td style="text-align: right;"><a rel="nofollow" target="_blank" href="https://www.monstermmorpg.com" style="color: rgb(6, 69, 173); text-decoration-line: none; background: none;" title="Calyrex (Pokémon)"><span style="color: rgb(0, 0, 0);">←</span></a></td>
The expected result is:
<tr>
<td colspan="2" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial; border-top-left-radius: 3px; border-top-right-radius: 3px;">
<b>
AAA
</b>
</td>
</tr>
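What I have tried
I tried HtmlAgilityPack, but unfortunately it does not work in my case, because I do not want to replace a single HTML node. I need to replace a partial fragment of the document that may or may not span multiple nodes, and I could not get HtmlAgilityPack to do that.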
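【Comments on the question】:
Why doesn't HtmlAgilityPack work? Share how you tried it.
@JamshaidK. Because the HTML is only a fragment, not a complete page. Believe me, I tried all kinds of approaches. I failed to replace the node's outer HTML or the node itself, because it throws an error at the parent node reference.
This is the part where you need to put your logic. If you cannot systematically track down the node you need, you will not be able to replace the element correctly. It would be best to attach your HtmlAgilityPack attempt so people can see the actual problem in this case.
@JamshaidK. Sample code here: drive.google.com/file/d/1yV7VaulP8Ze53BqK_xAsEyWFr3sxG-vh/…. It can find the node but cannot change or modify it. I want to modify the node that has the href the way I want.
When you say it is only a fragment and not a complete page, do you mean it will not work because of that? That is not true. It will still load your HTML into an HtmlDocument. You just need to pass the correct XPath value to filter your data.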
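【Answer 1】:
Since you say the traditional HTML parsing approach does not seem to work for your use case, have you considered writing a manual parser for this specific case?
I wrote a short but working example of what you may want to consider.
Keep a few things in mind, though: this implementation was written quickly, has no error handling, assumes the target string fits in memory, and misses key edge cases. If you want to use this solution, you should invest the time to fill in the gaps.
The solution simply reads through the whole document one character at a time. Characters that are not part of the target are written straight to the output; when the complete target string is recognized, the replacement string is written instead.
This may not be the best solution, and I encourage you to look for an off-the-shelf HTML parser that offers more functionality.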
// Requires: using System; using System.Collections.Generic; using System.IO; using System.Linq;
public static void RemoveTargetString(TextReader Reader, TextWriter Writer, string TargetString, char[] CharactersToIgnore, string ReplacementString)
{
    HashSet<char> IgnoreCases = CharactersToIgnore.ToHashSet();
    // our buffer need only be the size of the target string
    char[] buffer = new char[TargetString.Length];
    int currentIndex = 0;
    while (Reader.Peek() > -1)
    {
        // read one character to the end of the buffer marked by index
        if (Reader.Read(buffer, currentIndex, 1) != 0)
        {
            // get the last char in the buffer
            ref char firstChar = ref buffer[currentIndex];
            // if the char is on the ignore list blindly write it and continue
            // dont change index so we overwrite the char in the last spot of the buffer
            if (IgnoreCases.Contains(firstChar))
            {
                // write the char and ignore
                Writer.Write(firstChar);
                continue;
            }
            // check to see if the char is in the right order as the target string
            if (firstChar == TargetString[currentIndex])
            {
                // if it is don't write the buffer, increment index so we keep the char without back tracking
                currentIndex++;
                // if we have found the entire string dump the buffer, write the replacement string
                if (currentIndex == TargetString.Length)
                {
                    // write replacement string instead
                    Writer.Write(ReplacementString);
                    // reset index so we overwrite the buffer
                    currentIndex = 0;
                }
            }
            else
            {
                // check to see if the target string is within something that starts with a partial piece of the target string
                // we should not implicitly assume the character we fail at isn't the start of the target as well
                // if it is we should avoid writing it
                if (firstChar == TargetString[0])
                {
                    Writer.Write(buffer, 0, currentIndex);
                    buffer[0] = buffer[currentIndex];
                    // reset index and start searching for start of target
                    currentIndex = 1;
                }
                else
                {
                    // since the char at the last position of the buffer wasn't
                    // either the start or within the target string
                    // write the buffer from 0 - last index
                    Writer.Write(buffer, 0, currentIndex + 1);
                    // reset index and start searching for start of target
                    currentIndex = 0;
                }
            }
        }
    }
    // if for some reason the target string is at the end, but was not complete, we should write the characters in the buffer to the target
    if (currentIndex > 0)
    {
        Writer.Write(buffer, 0, currentIndex);
    }
}

// Example usage
char[] IgnoreCharacters = new char[] { '\n', '\r', ' ' };
string target = "<td style=\"text - align: right; \">\n\r<a rel=\"nofollow\" target=\"_blank\" href=\"https://www.monstermmorpg.com\"\n\r style=\"color: rgb(6, 69, 173);\n\r text-decoration-line: none;\n\r background: none;\"\n\r title=\"Calyrex (Pokémon)\"><span style=\"color: rgb(0, 0, 0);\">←</span></a></td>";
StringReader reader = new($"<tr>\n\r<td colspan=\"2\" style=\"background - image: initial; background - position: initial; background - size: initial; background - repeat: initial; background - attachment: initial; background - origin: initial; background - clip: initial; border - top - left - radius: 3px; border - top - right - radius: 3px; \">\n\r<b>{target}</b>\n\r</td>\n\r</tr>");
// strip the ignored characters from the target before searching
foreach (char item in IgnoreCharacters)
{
    target = target.Replace(item.ToString(), "");
}
StringWriter writer = new();
RemoveTargetString(reader, writer, target, IgnoreCharacters, "AAA");
Console.WriteLine(writer.ToString());
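If you are not familiar with TextReader or TextWriter, they are the base classes of common IO types such as StreamReader and StreamWriter. You can use the method to do find-and-replace on files like this: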
char[] IgnoreCharacters = new char[] { '\n', '\r', ' ' };
string target = "Hello World";
string replacement = "Hello Globe";
using StreamReader reader = new("Test.txt");
using StreamWriter writer = new("Output.txt");
RemoveTargetString(reader, writer, target, IgnoreCharacters, replacement);
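Edit: Fixed an issue where, if the target was being recognized but the match then failed, a single character was not written to the output stream, which made the transcription lossy. Added test cases for common edge cases.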
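For those interested in the performance of this solution: processing a 1,278,518,583-byte (1.19 GB) text file takes about 35 seconds and uses 9 MB of memory. If you need more performance, consider replacing IgnoreCases.Contains(firstChar) with Char.IsWhiteSpace(firstChar), which is about 33% faster.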
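One thing to keep in mind with that swap is that the two checks do not accept the same characters. The sketch below puts them side by side; the class and method names are made up for illustration, and the set mirrors the characters '\n', '\r', ' ' and '\t' rather than any particular caller's list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the answer's code.
public static class IgnoreCheckDemo
{
    // An explicit ignore list: only these four characters are skipped.
    private static readonly HashSet<char> IgnoreCases =
        new HashSet<char> { '\n', '\r', ' ', '\t' };

    // Set-based check: true only for characters in the explicit list.
    public static bool InIgnoreSet(char c) => IgnoreCases.Contains(c);

    // Framework check: true for every Unicode white-space character,
    // which includes form feed and the non-breaking space U+00A0.
    public static bool IsWhiteSpace(char c) => char.IsWhiteSpace(c);
}
```

For example, `IgnoreCheckDemo.InIgnoreSet('\u00A0')` is false while `IgnoreCheckDemo.IsWhiteSpace('\u00A0')` is true, so after the swap a non-breaking space inside the text would be skipped during matching as well. Whether that is wanted depends on the input.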
static char[] IgnoreCharacters = new char[] { '\n', '\r', ' ', '\t' };
[Theory]
[InlineData("1234", "<div>1234</div>", "<div></div>")]
[InlineData("1234", "<div>1\n\r2\t3\n4\r</div>", "<div>\n\r\t\n\r</div>")]
[InlineData("1234", "\n\r\t1\n\r\t2\r\n\t3\n\t\r4", "\n\r\t\n\r\t\r\n\t\n\t\r")]
[InlineData("1234", "1 2 3 4", "   ")]
[InlineData("1234", " \n\r\t1 \n\r\t2 \n\r\t3 \n\r\t4 \n\r\t", " \n\r\t \n\r\t \n\r\t \n\r\t \n\r\t")]
[InlineData("1234", "123412341234", "")]
[InlineData("1234", "4321", "4321")]
[InlineData("1234", "Hello", "Hello")]
[InlineData("1234", "", "")]
[InlineData("1234", "1/2/3/4", "1/2/3/4")]
[InlineData("1", "1111", "")]
[InlineData("1", "12131415", "2345")]
[InlineData("Abcde", "AbcdAbcde", "Abcd")]
[InlineData("Abcde", "AbcdAbcdeAbcd", "AbcdAbcd")]
[InlineData("12345", "121231234123451234123121", "1212312341234123121")]
public void CommonEdgeCases(string Target, string Input, string Expected)
{
    foreach (char item in IgnoreCharacters)
    {
        Target = Target.Replace(item.ToString(), "");
    }
    StringReader reader = new(Input);
    StringWriter writer = new();
    RemoveTargetString(reader, writer, Target, IgnoreCharacters, string.Empty);
    Assert.Equal(Expected, writer.ToString());
}
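When the whole text fits in memory, a different technique may be easier to reason about than the streaming method: strip the white space from both strings, search in the stripped copy, and map each hit back to positions in the original text. The following is only a sketch under that assumption. The class name `WhitespaceInsensitive` is invented for illustration, it uses `char.IsWhiteSpace` instead of a configurable ignore list, and unlike the streaming method it replaces the whole matched span, including any white space inside it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical in-memory helper; an alternative sketch, not the answer's method.
public static class WhitespaceInsensitive
{
    public static string Replace(string input, string target, string replacement)
    {
        // Search key: the target with all white space removed.
        var keyBuilder = new StringBuilder(target.Length);
        foreach (char c in target)
        {
            if (!char.IsWhiteSpace(c)) keyBuilder.Append(c);
        }
        string key = keyBuilder.ToString();
        if (key.Length == 0) return input; // nothing meaningful to search for

        // White-space-free copy of the input, plus the original index of each kept character.
        var stripped = new StringBuilder(input.Length);
        var originalIndex = new List<int>(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            if (!char.IsWhiteSpace(input[i]))
            {
                stripped.Append(input[i]);
                originalIndex.Add(i);
            }
        }
        string haystack = stripped.ToString();

        var output = new StringBuilder(input.Length);
        int copyFrom = 0;   // next original index that has not been copied yet
        int searchFrom = 0; // next index in the stripped copy to search from
        while (true)
        {
            int hit = haystack.IndexOf(key, searchFrom, StringComparison.Ordinal);
            if (hit < 0) break;
            int first = originalIndex[hit];                 // first matched char in the original
            int last = originalIndex[hit + key.Length - 1]; // last matched char in the original
            output.Append(input, copyFrom, first - copyFrom);
            output.Append(replacement);
            copyFrom = last + 1;
            searchFrom = hit + key.Length;
        }
        // Copy whatever follows the last match.
        output.Append(input, copyFrom, input.Length - copyFrom);
        return output.ToString();
    }
}
```

Calling `WhitespaceInsensitive.Replace("<b>\n1 2\n3 4\n</b>", "1234", "AAA")` replaces the span from `1` to `4` with `AAA` and leaves the line breaks around it untouched. The trade-off is memory: the stripped copy and the index list are both proportional to the input size, so this does not suit the multi-gigabyte files the streaming method handles.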
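【Comments】:
A nice solution. So it seems there is no built-in or third-party library for this task?
No idea, unfortunately. I focus on low-level and high-level codebase design and optimization, so I am not up to date on current HTML parsers. Hopefully someone who works with HTML parsing more often than I do knows better.
The problem is that the HTML fragment I have is not a valid, complete fragment. It is partial. That is why HtmlAgilityPack fails.
I assumed that was the case, so I wrote this implementation to avoid making any assumptions about the string it reads. It does not care whether it is HTML, an encrypted char string, etc. =)
I just had time to test it, and unfortunately it does not work properly. See youtube.com/watch?v=eDvFk0eykCw
【Answer 2】:
Given the text you posted, this seems to work: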
var doc = new HtmlDocument();
doc.Load(@"example.txt");
var nodes = doc.DocumentNode.SelectNodes("//a[@class='mw-jump-link']");
foreach (HtmlNode secondParagraph in nodes.ToList())
{
    var moreNode = HtmlNode.CreateNode("AAA");
    var parent = secondParagraph.ParentNode;
    parent.InsertAfter(moreNode, secondParagraph);
    parent.RemoveChild(secondParagraph);
}
doc.Save(@"example2.txt");
【Comments】:
【Answer 3】:
As long as you are fine with a write-once, read-never regular expression :) this might work (Fiddle)
var originalData = @"
<tr>
<td colspan=""2"" style=""background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial; border-top-left-radius: 3px; border-top-right-radius: 3px;"">
<b>
<a rel=\""nofollow\"" target=\""_blank\"" href=\""https://www.monstermmorpg.com\""
style=\""color: rgb(6, 69, 173);
text-decoration-line: none; background: none;\""
title=\""Calyrex (Pokémon)\""><span style=\""color: rgb(0, 0, 0);\"">←</span></a>
</b>
</td>
</tr>
";
var searchString = @"<a rel=""nofollow"" target=""_blank"" href=""https://www.monstermmorpg.com"" style=""color: rgb(6, 69, 173); text-decoration-line: none; background: none;"" title=""Calyrex (Pokémon)"">
<span style=""color: rgb(0, 0, 0);"">←</span>
</a>
";
Console.WriteLine("===== originalData ====");
Console.WriteLine(originalData);
Console.WriteLine("===== originalData ====\n");
Console.WriteLine("===== searchString ====");
Console.WriteLine(searchString);
Console.WriteLine("===== searchString ====\n");
var searchRegexString = searchString.Replace("\n", "").Replace("\r", "").Replace("\t", "").Replace("\"", "\\\\\"").Replace("(", "\\(").Replace(")", "\\)").Replace(" ", "[ \\n\\r\\t]*");
var searchRegex = new Regex(searchRegexString, RegexOptions.IgnoreCase);
var replacedData = searchRegex.Replace(originalData, "***** REPLACED *****");
Console.WriteLine("===== searchRegexString ====");
Console.WriteLine(searchRegexString);
Console.WriteLine("===== searchRegexString ====\n");
Console.WriteLine("===== replacedData ====");
Console.WriteLine(replacedData);
Console.WriteLine("===== replacedData ====\n");
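The hand-written escaping above covers quotes and parentheses only, so a search string containing other regex metacharacters such as `.`, `?` or `[` would need more `Replace` calls. A possible variation, sketched here as an assumption rather than as this answer's method, is to drop the white space from the search string, pass every remaining character through `Regex.Escape`, and join the pieces with `\s*`. The class name `WhitespaceTolerantRegex` is made up for illustration, and the sketch does not add the backslash-before-quote handling that the code above applies for the escaped quotes in the sample data.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; a sketch of a generic pattern builder.
public static class WhitespaceTolerantRegex
{
    public static Regex Build(string search)
    {
        // Keep only the non-white-space characters and escape each one,
        // so metacharacters in the search text are matched literally.
        var pieces = search
            .Where(c => !char.IsWhiteSpace(c))
            .Select(c => Regex.Escape(c.ToString()))
            .ToArray();

        if (pieces.Length == 0)
        {
            throw new ArgumentException("Search string contains only white space.", nameof(search));
        }

        // Allow any amount of white space (including none) between characters.
        return new Regex(string.Join(@"\s*", pieces));
    }
}
```

With this, `WhitespaceTolerantRegex.Build("<a href=\"x\">y (z)</a>").Replace(text, "AAA")` matches the anchor even when `text` breaks it across several lines. Because white space is optional between every pair of characters, the pattern also matches text where a space from the search string is missing entirely, which is consistent with ignoring white space but worth knowing.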
【Comments】: