How to extract specific data from XML data

Posted: 2015-08-27 13:54:57

Question:

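I am using the following code snippet to parse some XML data and convert it to CSV. I can convert the entire XML data and dump it to a file, but my requirements have changed and now I am stuck.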

    public void xmlToCSVfiltered(string p, int e)
    {
        string all_lines1 = File.ReadAllText(p);

        // Wrap the sibling <XML> records in a single root element so they parse as one document.
        all_lines1 = "<Root>" + all_lines1 + "</Root>";
        XmlDocument doc_all = new XmlDocument();
        doc_all.LoadXml(all_lines1);
        StreamWriter write_all = new StreamWriter(FILENAME2);
        XmlNodeList rows_all = doc_all.GetElementsByTagName("XML");

        List<string[]> filtered = new List<string[]>();

        foreach (XmlNode rowtemp in rows_all)
        {
            List<string> children_all = new List<string>();
            foreach (XmlNode childtemp in rowtemp.ChildNodes)
            {
                children_all.Add(Regex.Replace(childtemp.InnerText, "\\s+", " "));     // <------- Fixed the Bug , Advisories dont span
            }
            string.Join(",", children_all.ToArray());

            //write_all.WriteLine(string.Join(",", children_all.ToArray()));

            if (children_all.Contains(e.ToString()))
            {
                filtered.Add(children_all.ToArray());
                write_all.WriteLine(children_all);
            }
        }
        write_all.Flush();
        write_all.Close();

        foreach (var res in filtered)
        {
            Console.WriteLine(string.Join(",", res));
        }
    }

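My input looks like the following. My goal now is to convert only those "events" that contain a certain number and compile them into a CSV. For example, I only want to convert the events where the second data value under the <EVENT> element is 4627. Only those events should be converted; with the input below, both records qualify.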

<XML><HEADER>1.0,770162,20121009133435,3,</HEADER>20121009133435,721,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4627,</EVENT><DRUG>1,1872161156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell
        tham ALL out. For some reason 
        that is not the case
        please press the on button 
        when trying to activate
        device codes also available on
    list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a-71-80,-66</LOCATION><ROUTE></ROUTE><SITE></SITE><POWER>0,50</POWER></XML> 
<XML><HEADER>2.0,773162,20121009133435,3,</HEADER>20121004133435,761,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4627,</EVENT><DRUG>1,18735166156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell
        tham ALL out. For some reason 
        that is not the case
        please press the on button 
        when trying to activate
        device codes also available on
    list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a-71-80,-66</LOCATION><ROUTE></ROUTE><SITE></SITE><POWER>0,50</POWER></XML> 

.. goes on

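My approach so far has been to convert everything to CSV and store it in some kind of data structure, then query that data structure row by row to see whether the number is present and, if it is, write that row to the file. My function takes the path to the XML file and the number we are looking for in the XML data as parameters. I am new to C# and I can't work out how to change the function above. Any help would be greatly appreciated!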

Edit:

Sample input:

<XML><HEADER>1.0,770162,20121009133435,3,</HEADER>20121009133435,721,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4627,</EVENT><DRUG>1,1872161156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell
    tham ALL out. For some reason 
    that is not the case
    please press the on button 
    when trying to activate
    device codes also available on
list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a- 

    <XML><HEADER>1.0,770162,20121009133435,3,</HEADER>20121009133435,721,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4623,</EVENT><DRUG>1,1872161156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell
        tham ALL out. For some reason 
        that is not the case
        please press the on button 
        when trying to activate
        device codes also available on
    list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a- 

Desired output:

1.0,770162,20121009133435,3,,20121009133435,721,5,1,0,0,0,00:00,00:00,,00032134 26064957,4627,1,,1872161156,7,0,10000,1,0,5000000,0,10000000,0,1 ,,Keep it simple or spell
    tham ALL out. For some reason 
    that is not the case
    please press the on button 
    when trying to activate
    device codes also available on
list,,,20121009133435,00-1d-71-0a-71-80,-66,,,0,50 

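The above is what should happen when I call xmlToCSVfiltered(file, 4627);. Also note that the output will be a single horizontal line in the CSV file, but I could not format it here to look like that.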

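Answer 1:

I changed XmlDocument to XDocument so that LINQ to XML can be used. For testing, I also used a StringReader to read from a string instead of reading from a file. You can switch the code back to the original File.ReadAllText.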

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Xml;
    using System.Xml.Linq;
    using System.IO;
    using System.Text.RegularExpressions;

    namespace ConsoleApplication1
    {
        class Program
        {
            const string FILENAME2 = @"c:\temp\test.txt";
            static void Main(string[] args)
            {
                string input =
                "<XML><HEADER>1.0,770162,20121009133435,3,</HEADER>20121009133435,721,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4627,</EVENT><DRUG>1,1872161156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell\n" +
                        "tham ALL out. For some reason \n" +
                        "that is not the case\n" +
                        "please press the on button\n" +
                        "when trying to activate\n" +
                        "device codes also available on\n" +
                    "list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a-71-80,-66</LOCATION><ROUTE></ROUTE><SITE></SITE><POWER>0,50</POWER></XML>\n" +
                "<XML><HEADER>2.0,773162,20121009133435,3,</HEADER>20121004133435,761,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4627,</EVENT><DRUG>1,18735166156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell\n" +
                        "tham ALL out. For some reason\n" +
                        "that is not the case\n" +
                        "please press the on button\n" +
                        "when trying to activate\n" +
                        "device codes also available on\n" +
                    "list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a-71-80,-66</LOCATION><ROUTE></ROUTE><SITE></SITE><POWER>0,50</POWER></XML>\n";

                xmlToCSVfiltered(input, 4627);
            }

            static public void xmlToCSVfiltered(string p, int e)
            {
                //string all_lines1 = File.ReadAllText(p);
                StringReader reader = new StringReader(p);
                string all_lines1 = reader.ReadToEnd();

                all_lines1 = "<Root>" + all_lines1 + "</Root>";
                XDocument doc_all = XDocument.Parse(all_lines1);
                StreamWriter write_all = new StreamWriter(FILENAME2);
                // Keep only the <XML> records whose second comma-separated EVENT value equals e.
                List<XElement> rows_all = doc_all.Descendants("XML").Where(x => x.Element("EVENT").Value.Split(new char[] { ',' }).Skip(1).Take(1).FirstOrDefault() == e.ToString()).ToList();

                List<string[]> filtered = new List<string[]>();

                foreach (XElement rowtemp in rows_all)
                {
                    List<string> children_all = new List<string>();
                    foreach (XElement childtemp in rowtemp.Elements())
                    {
                        children_all.Add(Regex.Replace(childtemp.Value, "\\s+", " "));     // <------- Fixed the Bug , Advisories dont span
                    }
                    string.Join(",", children_all.ToArray());

                    //write_all.WriteLine(string.Join(",", children_all.ToArray()));

                    if (children_all.Contains(e.ToString()))
                    {
                        filtered.Add(children_all.ToArray());
                        write_all.WriteLine(children_all);
                    }
                }
                write_all.Flush();
                write_all.Close();

                foreach (var res in filtered)
                {
                    Console.WriteLine(string.Join(",", res));
                }
            }
        }
    }

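Comments:

I tried your approach, but I get a blank output file. Do you get the same?

I updated the code. Make sure you are using the latest version. I'm a bit dyslexic and typed 4267 instead of 4627.

Yes, I changed that and tried again. I still get a blank output file.

Check my edit; I have added my exact function as it stands so far. I can't work out what is going wrong. From my print statements I can see that it never enters the if block at the bottom.

Answer 2:

I made a few assumptions, because some things were not clear to me from the question.

Assumptions:
1. You know that the node to check is EVENT, and that the element has to be located from there.
2. You know the delimiter between the values in the node, e.g. ',' in EVENT.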

    public void xmlToCSVfiltered(string p, int e, string nodeName, char delimiter)
    {
        //get the xml node
        XDocument xml = XDocument.Load(p);

        //get the required node. I am assuming you would know. For eg. Event Node
        var requiredNode = xml.Descendants(nodeName);

        foreach (var node in requiredNode)
        {
            if (node == null)
                continue;

            //Also here, I am assuming you have the delimiter knowledge.
            var valueSplit = node.Value.Split(delimiter);

            foreach (var value in valueSplit)
            {
                if (value == e.ToString())
                {
                    // AddToCSV is left undefined in this answer; it stands for whatever writes the matching row.
                    AddToCSV();
                }
            }
        }
    }

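Comments:

The main reason this does not work for me is that the position is not constant. In some cases the number I am looking for may be at position 10, and in others at position 15. The end goal is to check every number between the commas and output the rows that contain that value.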
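
The goal stated in the comment above (compare every comma-separated value rather than a fixed position, and emit each matching record as one CSV line) could be sketched roughly as follows. This is only a sketch under assumptions, not code from either answer: the names `RecordFilter`, `FilterRecords` and `Clean` are hypothetical, and it assumes that the records are sibling `<XML>` elements as in the sample input and that loose text between child elements (such as the values between `</HEADER>` and `<EVENT>`) should be kept in the output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the original thread.
static class RecordFilter
{
    // Collapse runs of whitespace so a multi-line ADVISORY stays on one CSV line.
    static string Clean(string s)
    {
        return Regex.Replace(s, "\\s+", " ");
    }

    // Returns one CSV line per <XML> record that contains `target`
    // as a whole comma-separated value anywhere in the record.
    public static List<string> FilterRecords(string rawXml, int target)
    {
        // The input holds many sibling <XML> records, so wrap them in a single root.
        XDocument doc = XDocument.Parse("<Root>" + rawXml + "</Root>");
        string wanted = target.ToString();
        var lines = new List<string>();

        foreach (XElement record in doc.Root.Elements("XML"))
        {
            // Collect child elements and loose text nodes in document order.
            var parts = new List<string>();
            foreach (XNode node in record.Nodes())
            {
                XElement element = node as XElement;
                XText text = node as XText;
                if (element != null) parts.Add(Clean(element.Value));
                else if (text != null) parts.Add(Clean(text.Value));
            }

            // Compare every token, so the position of the number does not matter.
            bool match = parts
                .SelectMany(part => part.Split(','))
                .Any(token => token.Trim() == wanted);

            if (match)
                lines.Add(string.Join(",", parts));
        }
        return lines;
    }
}

static class Program
{
    static void Main()
    {
        // Shortened, made-up records in the same shape as the sample input.
        string sample =
            "<XML><HEADER>1.0,3,</HEADER>7,8,<EVENT>0003,4627,</EVENT></XML>\n" +
            "<XML><HEADER>2.0,3,</HEADER>7,8,<EVENT>0003,4623,</EVENT></XML>";

        foreach (string line in RecordFilter.FilterRecords(sample, 4627))
            Console.WriteLine(line);
    }
}
```

To write a file instead of printing, the returned list can be passed to File.WriteAllLines. Because every token in the record is compared, a small target such as 0 or 1 would match almost every record; if that turns out to be too broad, restricting the check to `record.Element("EVENT")` would narrow it.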
