Safely truncating a string that contains color tags

[Original title]: Safe truncate string contains color tag
[Posted]: 2021-08-16 13:33:18
[Question]:

I have a string that contains color tags.

var myString = "My name is <color=#FF00EE>ABCDE</color> and I love <color=#FFEE00>music</color>";

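The string is displayed as "My name is ABCDE *(in pink)* and I love music *(in yellow)*".

I want to truncate the string when it exceeds a maximum length, while still keeping the color tags intact: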

var myTruncateString = "My name is <color=#FF00EE>ABCDE</color> and I love <color=#FFEE00>mu</color>";

The string is then displayed as "My name is ABCDE *(in pink)* and I love mu *(in yellow)*".

Do you have any suggestions? This is what I have so far:

// Requires: using System.Text.RegularExpressions;
// Strip the color tags to get the visible text (strings are immutable, so no copy is needed)
var stringWithoutFormat = Regex.Replace(myString, "<color.*?>|</color>", "");

var maxLength = 20;
if (stringWithoutFormat.Length > maxLength)
{
    // What should I do next?
}

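[Comments]:

So what exactly do you want? Do you just want to limit the number of characters? In that case: int max = 300; var myTruncateString = mystring[..max];

@Foitn I want to truncate my string but still keep the color tags.

Not that easy, indeed! I would first check the string length. If it is too long, search from the end for any <color> tag. If one is found, truncate its content, or remove the tag entirely if needed. If the string does not end with a color tag, check where the last tag ends and see whether truncating the text after it is enough, or whether its content has to be truncated as well.

You may have to decode the XML structure, check the total length of the values inside the decoded tags, and truncate where needed (eventually removing a whole tag or its child tags)... By the way, what do you want to get if the whole word "music" falls beyond the maximum length?

"I want to truncate the string when it reaches the maximum length but still keep the color tags": does the maximum length also count the tags?

[Answer 1]: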

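Here is a relatively simple example, without much error handling, of what I think you are trying to accomplish:

- Don't count the color tags when checking the maximum length
- Remove characters from the end without breaking the color tags
- If you end up with a pair of color tags with no text between them, remove those tags

Note: this code has not been thoroughly tested. Feel free to use it however you like, but I would write a lot of unit tests here. I am particularly worried about edge cases that could cause an infinite loop.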

// Requires: using System; using System.Collections.Generic; using System.Linq;
public static string Shorten(string input, int requiredLength)
{
    var tokens = Tokenize(input).ToList();
    int current = tokens.Count - 1;

    // Assumption: color tags don't contribute to the *visible* length,
    // so only single-character tokens are counted
    var totalLength = tokens.Count(t => t.Length == 1);

    // State of the previous iteration, used for infinite-loop detection
    int lastCurrent = -1;
    int lastTotalLength = -1;

    while (totalLength > requiredLength && current >= 0)
    {
        // Infinite-loop detection: every iteration must change something
        if (lastCurrent == current && lastTotalLength == totalLength)
            throw new InvalidOperationException("Infinite loop detected");
        lastCurrent = current;
        lastTotalLength = totalLength;

        if (tokens[current].Length > 1)
        {
            // The current token is a tag
            if (current == 0)
                return "";

            if (tokens[current].StartsWith("</") && tokens[current - 1].StartsWith("<c"))
            {
                // Remove a <color></color> pair with no text between
                tokens.RemoveAt(current);
                tokens.RemoveAt(current - 1);
                current -= 2;

                // Since color tags don't contribute to length, don't adjust totalLength
                continue;
            }

            // Remove one character from inside the color tags
            tokens.RemoveAt(current - 1);
            current--;
            totalLength--;
        }
        else
        {
            // Remove last character from string
            tokens.RemoveAt(current);
            current--;
            totalLength--;
        }
    }

    // If we're now at the right length, but the last two tokens are <color></color>, remove them
    if (tokens.Count >= 2 && tokens.Last().StartsWith("</") && tokens[tokens.Count - 2].StartsWith("<c"))
    {
        tokens.RemoveAt(tokens.Count - 1);
        tokens.RemoveAt(tokens.Count - 1);
    }
    return string.Join("", tokens);
}

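// Splits the input into tokens: each tag ("<...>") is one token,
// every other character is a token of length 1
public static IEnumerable<string> Tokenize(string input)
{
    int index = 0;
    while (index < input.Length)
    {
        if (input[index] == '<')
        {
            // Scan forward to the closing '>' (or to the end of the input)
            int endIndex = index;
            while (endIndex < input.Length && input[endIndex] != '>')
                endIndex++;
            if (endIndex < input.Length)
                endIndex++;
            yield return input.Substring(index, endIndex - index);
            index = endIndex;
        }
        else
        {
            yield return input.Substring(index, 1);
            index++;
        }
    }
}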

Sample code:

var myString = "My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>music</color>";
for (int length = 1; length < 100; length++)
    Console.WriteLine($"{length}: {Shorten(myString, length)}");

Output:

1: M
2: My
3: My 
4: My n
5: My na
6: My nam
7: My name
8: My name 
9: My name i
10: My name is
11: My name is 
12: My name is <color=#ff00ee>A</color>
13: My name is <color=#ff00ee>AB</color>
14: My name is <color=#ff00ee>ABC</color>
15: My name is <color=#ff00ee>ABCD</color>
16: My name is <color=#ff00ee>ABCDE</color>
17: My name is <color=#ff00ee>ABCDE</color> 
18: My name is <color=#ff00ee>ABCDE</color> a
19: My name is <color=#ff00ee>ABCDE</color> an
20: My name is <color=#ff00ee>ABCDE</color> and
21: My name is <color=#ff00ee>ABCDE</color> and 
22: My name is <color=#ff00ee>ABCDE</color> and I
23: My name is <color=#ff00ee>ABCDE</color> and I 
24: My name is <color=#ff00ee>ABCDE</color> and I l
25: My name is <color=#ff00ee>ABCDE</color> and I lo
26: My name is <color=#ff00ee>ABCDE</color> and I lov
27: My name is <color=#ff00ee>ABCDE</color> and I love
28: My name is <color=#ff00ee>ABCDE</color> and I love 
29: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>m</color>
30: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>mu</color>
31: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>mus</color>
32: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>musi</color>
33: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>music</color>
34: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>music</color>
35: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>music</color>
36: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>music</color>
37: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>music</color>
38: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>music</color>
39: My name is <color=#ff00ee>ABCDE</color> and I love <color=#eeddff>music</color>
... and so on


[Answer 2]:

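I build two lists:

- one containing the indexes of the real (visible) text
- one containing the start and end indexes of the tags

Then I extract the text up to the maximum length using the first list. Finally, I check whether a tag is still open at the cut, and if so, I close it.

Note: my code does not handle nested tags. You would have to change the closing-tag part.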

// Requires: using System.Collections.Generic;
public static string Truncate(string text, int maxLength)
{
    // Guard: realTextIndexes[maxLength - 1] below needs maxLength >= 1
    if (maxLength <= 0) return string.Empty;
    if (text.Length <= maxLength) return text;

    // tagIndexes holds four entries per tag pair:
    // opening '<', opening '>', closing '<', closing '>'
    var tagIndexes = new List<int>();
    var realTextIndexes = new List<int>();
    bool isInTag = false;
    for (int i = 0; i < text.Length; i++)
    {
        if (text[i] == '<')
        {
            isInTag = true;
            tagIndexes.Add(i);
        }

        if (!isInTag)
        {
            realTextIndexes.Add(i);
        }

        if (text[i] == '>')
        {
            isInTag = false;
            tagIndexes.Add(i);
        }
    }

    // The visible text already fits
    if (realTextIndexes.Count <= maxLength) return text;

    string truncatedText = text.Substring(0, realTextIndexes[maxLength - 1] + 1);

    // Should we close a tag?
    for (int i = 0; i < tagIndexes.Count; i++)
    {
        if (tagIndexes[i] > realTextIndexes[maxLength - 1])
        {
            if ((i % 4) == 2) // If the next tag is a closing tag
            {
                truncatedText += text.Substring(tagIndexes[i], tagIndexes[i + 1] - tagIndexes[i] + 1);
            }

            break;
        }
    }

    return truncatedText;
}
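
The note in this answer says nested tags are not handled. As a rough sketch of how the closing-tag part could be changed, the names of the currently open tags can be kept on a stack and closed innermost-first at the cut. This is only an illustration under assumptions: `RichTextTruncator`, `TruncateNested` and `TagName` are hypothetical names that do not come from either answer, and the input is assumed to contain well-formed tags of the form `<name=value>` or `<name>` with matching `</name>`.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RichTextTruncator
{
    // Hypothetical helper: keeps at most maxVisible visible characters and
    // closes every tag that is still open at the cut.
    // Assumes well-formed tags; tags never count toward the visible length.
    public static string TruncateNested(string text, int maxVisible)
    {
        var result = new StringBuilder();
        var openTags = new Stack<string>(); // names of tags opened but not yet closed
        int visible = 0;
        int i = 0;

        // Stop as soon as the visible budget is used up, so no tag that
        // would open after the cut is copied.
        while (i < text.Length && visible < maxVisible)
        {
            if (text[i] == '<')
            {
                int end = text.IndexOf('>', i);
                if (end < 0) break; // unterminated tag: stop copying

                string tag = text.Substring(i, end - i + 1);
                if (tag.StartsWith("</", StringComparison.Ordinal))
                {
                    if (openTags.Count > 0) openTags.Pop();
                }
                else
                {
                    openTags.Push(TagName(tag));
                }
                result.Append(tag);
                i = end + 1;
            }
            else
            {
                result.Append(text[i]);
                visible++;
                i++;
            }
        }

        // Close whatever is still open, innermost tag first.
        while (openTags.Count > 0)
            result.Append("</").Append(openTags.Pop()).Append('>');

        return result.ToString();
    }

    // "<color=#FF00EE>" -> "color", "<b>" -> "b"
    private static string TagName(string openTag)
    {
        string inner = openTag.Substring(1, openTag.Length - 2);
        int eq = inner.IndexOf('=');
        return eq >= 0 ? inner.Substring(0, eq) : inner;
    }
}
```

With the string from the question, `RichTextTruncator.TruncateNested(myString, 30)` yields `My name is <color=#FF00EE>ABCDE</color> and I love <color=#FFEE00>mu</color>`, and a nested input such as `<b><color=#FF00EE>Hello</color> world</b>` cut at 3 yields `<b><color=#FF00EE>Hel</color></b>`. Unlike Answer 1, this sketch does not remove empty tag pairs that already exist in the middle of the input.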
