Problem converting StreamReader text containing XML data when there is a line break

Posted 2023-04-13

【Title】: Problem converting StreamReader text containing XML data when there is a line break
【Posted】: 2020-06-18 08:32:45
【Question】:

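Another option could be to use a regex to pull the data between the <persons> and </persons> nodes out of log_file_string into xml_lines without looping, because the data is a 3 MB file :(

log_file_string looks like this: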

2020-06-10T10:58:07.0792762Z [data_type_jason] "person_id":"101", "order_id":"123"
2020-06-12T10:58:07.0792762Z [data_type_xml] <?xml version="1.0"?><persons><person id = "101"><name>"Thomas Edison"</name><age>"35"</age><phone>"7777777777"</phone><address>"62  Ross Road, 
    MARSHAM, NR10 6EA"</address><country>"England"</country></person></persons>
2020-06-13T10:58:07.0792762Z [data_type_jason]  "person_id":"102", "order_id":"140"
2020-06-14T10:58:07.0792762Z [data_type_xml]<?xml version="1.0"?><persons><person id = "102"><name>"Louis Pasture"</name><age>"40"</age><phone>"99999999"</phone><address>"145  Thames Street, BOOSBECK, TS12 1AN"</address><country>"England"</country></person></persons>

Here is the complete prototype:

using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Xml.Linq;
namespace Test
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }//Form1

        private void Form1_Load(object sender, EventArgs e)
        {
            String file_folder = @"X:\VS 2019\C Sharp\Test";
            String file_path = Path.Combine(file_folder, "log_file.txt");
            process_log_file_data(file_folder, file_path);
        }//Form1_Load

        private void process_log_file_data(String file_folder, String file_path)
        {
            String log_file_string = read_all_lines_from_file(file_folder, file_path);
            String[] log_file_lines = log_file_string.Split(new String[] { Environment.NewLine }, StringSplitOptions.None);

            //The other option can be getting the data between just the two nodes <persons> and </persons> into xml_lines from log_file_string using regex, but I am not regex savvy :(
            IEnumerable<String> xml_lines = from line in log_file_lines
                                            where line.Contains("data_type_xml")
                                            select line;

            IEnumerable<String> jason_lines = from line in log_file_lines
                                              where line.Contains("data_type_jason")
                                              select line;

            XDocument xml_document = new XDocument(new XDeclaration("1.0", "utf-8", "yes"), new XElement("xml_data"));

            foreach (var xml_line in xml_lines)
            {
                String line = xml_line.Split(new String[] { "[data_type_xml]" }, StringSplitOptions.None)[xml_line.Split(new String[] { "[data_type_xml]" }, StringSplitOptions.None).GetUpperBound(0)].Trim();

                //Here is the issue: there is a line break inside <address> of <person id = "101"> in log_file_lines, so xml_line = 2020-06-12T10:58:07.0792762Z [data_type_xml] <?xml version="1.0"?><persons><person id = "101"><name>"Thomas Edison"</name><age>"35"</age><phone>"7777777777"</phone><address>"62  Ross Road,
                XDocument temp_xml_document = XDocument.Parse(line); //Unexpected end of file has occurred. The following elements are not closed: address, person, persons. Line 1, position 144.'
            }

            foreach (var jason_line in jason_lines)
            {
                //do something
            }
        }//process_log_file_data(String file_folder, String file_path)

        private String read_all_lines_from_file(String file_folder, String file_path)
        {
            FileInfo file_info = new FileInfo(file_path);
            if ((!file_info.Exists) || (file_info.Length == 0))
            {
                return String.Empty;
            }

            FileStream file_stream; StreamReader stream_reader; UTF8Encoding utf8_encoding; String file_text;
            file_stream = new FileStream(file_path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
            utf8_encoding = new UTF8Encoding(false);
            stream_reader = new StreamReader(file_stream, utf8_encoding);
            file_text = stream_reader.ReadToEnd();
            stream_reader.Close();
            file_stream.Close();
            return file_text;
        }//read_all_lines_from_file

    }//Form1 : Form

}//Test

【Comments】:

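Can you give an example of what the raw log file looks like?

Hi Magnus. Thanks for looking into this. I have updated the question and added the log file text below. Thanks again for looking into this.

If there is a line break in the XML, does that line still start with [data_type_xml]?

Personally, I would process the lines in order. If a line does not start with datetime+type, just append it to the last line processed. Then do the actual interpretation in a second pass.

Well, you don't need to read the whole file into memory anyway. You can do the JSON and XML work while reading lines of text from the file.

【Answer 1】: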

This can be done with recursion and a StreamReader, as shown below. The idea is to read two lines at a time and apply a few conditions...

using System.IO;

class Program
{
    static void Main(string[] args)
    {
        using (StreamReader r = new StreamReader("filename"))
        {
            while (!r.EndOfStream)
            {
                var fline = r.ReadLine();
                string sline = "";
                if (!r.EndOfStream)
                    sline = r.ReadLine();

                RecursivelyParse(r, ref fline, ref sline);
            }
        }
    }

    private static void RecursivelyParse(StreamReader r, ref string fline, ref string sline)
    {
        if (fline.Contains("data_type_xml"))
        {
            if (!(sline.Contains("data_type_xml") || sline.Contains("data_type_jason")) && sline != "")
            {
                fline += sline;
                sline = "";
                //The next line is also part of the xml
                //parsing logic for a line containing xml
            }
            else
            {
                //parsing logic for a line containing xml
            }
        }
        else if (fline.Contains("data_type_jason"))
        {
            //parsing logic for a line containing jason
        }

        if (sline.Contains("data_type_xml"))
        {
            if (!r.EndOfStream)
                fline = r.ReadLine();
            else
                fline = "";

            if (!(fline.Contains("data_type_xml") || fline.Contains("data_type_jason")) && fline != "")
            {
                sline += fline;
                fline = "";
                //The next line is also part of the xml
                //parsing logic for a line containing xml
            }
            else
            {
                //parsing logic for a line containing xml
            }
        }
        else if (sline.Contains("data_type_jason"))
        {
            //parsing logic for a line containing jason
        }

        while (!r.EndOfStream)
        {
            fline = r.ReadLine();

            if (!r.EndOfStream)
                sline = r.ReadLine();
            else
                sline = "";

            RecursivelyParse(r, ref fline, ref sline);
        }
    }
}
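
The suggestion from the comments (append any line that does not start with datetime+type to the previous entry) can also be sketched without recursion. This is only a rough illustration: the names LogEntryReader, ReadEntries and EntryStart are made up for the example, and the prefix pattern is guessed from the sample log in the question. It also assumes that every line without a prefix belongs to the entry before it, which will not hold if unrelated lines sit between entries.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class LogEntryReader
{
    // Assumed prefix (from the sample log): timestamp ending in Z, then [data_type_xml] or [data_type_jason].
    private static readonly Regex EntryStart = new Regex(
        @"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z\s*\[data_type_(xml|jason)\]\s*");

    // Yields (type, body) pairs. Lines without a prefix are appended to the current entry.
    public static IEnumerable<KeyValuePair<string, string>> ReadEntries(TextReader reader)
    {
        string type = null;
        var body = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Match m = EntryStart.Match(line);
            if (m.Success)
            {
                // A new entry starts: emit the one collected so far.
                if (type != null)
                    yield return new KeyValuePair<string, string>(type, body.ToString());
                type = m.Groups[1].Value;
                body.Clear();
                body.Append(line.Substring(m.Length));
            }
            else if (type != null)
            {
                // Continuation line: keep the line break that was in the file.
                body.Append('\n').Append(line);
            }
            // Lines before the first prefixed entry are skipped.
        }
        if (type != null)
            yield return new KeyValuePair<string, string>(type, body.ToString());
    }

    public static void Main()
    {
        // Two entries; the XML one is broken across two physical lines, as in the question.
        string sample =
            "2020-06-10T10:58:07.0792762Z [data_type_jason] \"person_id\":\"101\", \"order_id\":\"123\"\n" +
            "2020-06-12T10:58:07.0792762Z [data_type_xml] <?xml version=\"1.0\"?><persons><person id = \"101\">" +
            "<address>\"62  Ross Road, \n    MARSHAM, NR10 6EA\"</address></person></persons>\n";

        foreach (var entry in ReadEntries(new StringReader(sample)))
        {
            if (entry.Key == "xml")
            {
                // The whole entry is now one string, so the parser sees the closing tags.
                XDocument doc = XDocument.Parse(entry.Value);
                Console.WriteLine("xml root: " + doc.Root.Name);
            }
            else
            {
                Console.WriteLine("json fragment: " + entry.Value);
            }
        }
    }
}
```

Because ReadEntries takes a TextReader, the same method should work with a StreamReader over the log file, so the 3 MB file does not have to be loaded into a single string first.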

【Discussion】:

【Answer 2】:

Parsing the log entries:

Try something like this:

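// requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
public static IList<string> ParseLogs(string log_file_string)
{
    // The pattern of the beginning of each log entry
    // (\s* because the sample log has zero, one or two spaces around the type tag)
    var logprefixPattern = @"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z\s*\[data_type_(jason|xml)\]\s*";

    // Replace it with a symbol
    var splitPattern = "[log_entry]";
    var replaced = Regex.Replace(log_file_string, logprefixPattern, splitPattern);

    // Split the string with that symbol, dropping the empty piece before the first entry
    var splited = replaced.Split(new[] { splitPattern }, StringSplitOptions.RemoveEmptyEntries);

    // Now you get the list of logs.
    return splited;
}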

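My idea is that you read everything into the variable log_file_string and then look for the pattern that starts each log entry, instead of reading line by line.

It looks like every log entry starts with the same pattern, so you can split the log entries on that pattern.

If you want to separate the items into 2 groups, XML logs and JSON logs, you can loop through the split results one by one and decide by e.g. the starting character (" is JSON, < is XML).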
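
A minimal sketch of that grouping step, assuming the entries have already been split and that the first non-whitespace character is a reliable marker. LogEntryClassifier and Classify are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LogEntryClassifier
{
    // Sorts already-split entries into XML and JSON lists by their first character.
    public static void Classify(IEnumerable<string> entries,
                                List<string> xmlEntries,
                                List<string> jsonEntries)
    {
        foreach (string raw in entries)
        {
            string entry = raw.Trim();
            if (entry.Length == 0)
                continue;                 // nothing left after trimming

            if (entry[0] == '<')
                xmlEntries.Add(entry);    // starts like <?xml ...?> or <persons>
            else if (entry[0] == '"')
                jsonEntries.Add(entry);   // starts like "person_id":"101"
            // Anything else is ignored here (assumption: it is not needed).
        }
    }

    public static void Main()
    {
        var xml = new List<string>();
        var json = new List<string>();
        Classify(new[]
        {
            " \"person_id\":\"101\", \"order_id\":\"123\"\n",
            "<?xml version=\"1.0\"?><persons><person id = \"101\"></person></persons>\n"
        }, xml, json);

        Console.WriteLine("xml entries: " + xml.Count);
        Console.WriteLine("json entries: " + json.Count);
    }
}
```

Trimming each entry first matters for the XML case, since an XML declaration is only accepted at the very start of the string handed to the parser.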

Parsing the XML in a log entry

Assuming you want to parse the XML into a list of JSON strings, one string per person, you can use Newtonsoft.Json:

// requires: using System.Collections.Generic; using System.Xml; plus the Newtonsoft.Json package
public static IList<string> ParseXml(string xmlStr)
{
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(xmlStr);

    // assuming that each entry contains at least 1 person, and you want to extract those persons
    var ppl = doc.GetElementsByTagName("person");
    var results = new List<string>();
    foreach (XmlNode p in ppl)
    {
        string jsonStr = Newtonsoft.Json.JsonConvert.SerializeXmlNode(p);
        results.Add(jsonStr);
    }
    return results;
}

For simplicity I only return the JSON strings here. You can deserialize jsonStr into a C# model that you define.

【Discussion】:

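Hi Bemn. There is other junk data between these lines. I only posted the data I need.

What do you mean by junk data? Any examples?

@Benn, your idea is great. Sadly I am not regex savvy. Is it possible to use a regex to get the data between 2 XML tags from the log without any loops? I mean between the tags <persons> and </persons>. The junk data contains very confidential information, so I can't share it here. It is of no use for my project anyway.

@JD You can parse the XML string into an XElement and retrieve the content/attributes you want. See this example for how to parse a string into an XElement. Don't use regex to parse XML strings.

Thanks @Bemn. This is really useful :). I will definitely implement this in my project. However, the challenge is still the lines where the XML string is broken :(. That is why I wanted a regex that gives me the collection of strings between <persons> and </persons>, which can be broken across lines in the log file string.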
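
One way to combine the two points in this discussion is to let a regex only locate each <persons>...</persons> block, including blocks that span line breaks, and leave the actual XML parsing to XElement. The following is a hedged sketch rather than a tested solution for the real log: PersonsExtractor and ExtractPersons are illustrative names, and it assumes the literal text <persons> or </persons> never appears inside the junk data or inside element values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class PersonsExtractor
{
    // RegexOptions.Singleline lets '.' match line breaks as well;
    // the lazy '.*?' stops each match at the first closing tag.
    private static readonly Regex PersonsBlock =
        new Regex(@"<persons>.*?</persons>", RegexOptions.Singleline);

    // Returns every <person> element found in any <persons> block of the log text.
    public static List<XElement> ExtractPersons(string logText)
    {
        var result = new List<XElement>();
        foreach (Match m in PersonsBlock.Matches(logText))
        {
            // The regex only cuts the block out; XElement does the real XML parsing.
            XElement persons = XElement.Parse(m.Value);
            result.AddRange(persons.Elements("person"));
        }
        return result;
    }

    public static void Main()
    {
        string log =
            "2020-06-12T10:58:07.0792762Z [data_type_xml] <?xml version=\"1.0\"?><persons><person id = \"101\">" +
            "<name>\"Thomas Edison\"</name><address>\"62  Ross Road, \n    MARSHAM, NR10 6EA\"</address>" +
            "</person></persons>\n";

        foreach (XElement person in ExtractPersons(log))
        {
            // Explicit string casts return null instead of throwing when a node is missing.
            Console.WriteLine((string)person.Attribute("id") + ": " + (string)person.Element("name"));
        }
    }
}
```

The loop here runs over the regex matches, not over the lines of the file, so a line break inside an element such as <address> does not cut the XML short.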
