Trying to parse data from text file to single line separated by | symbol

【Title】: Trying to parse data from text file to single line separated by | symbol 【Posted】: 2022-01-19 06:36:19 【Question】:

I have a text file containing the following data:

#294448
ORDER_STATUS1098988 VALID
24.09.2021 05:17 AM
Customer_ID: 5524335312265537
MMYY: 08/23
Txn_ID: 74627
Name: Krystal Flowers
E-mail: abc@gmail.com
Phone: 9109153030
Address_original: 1656 W Alvarado dr, Pueblo West, Colorado, 81007, United States
ZIP_City_State_Country: -
Type: -
Subtype: -

#294448
ORDER_STATUS1097728 VALID
24.09.2021 05:17 AM
Customer_ID: 5524331591654699
MMYY: 11/23
Txn_ID: 45617
Name: Allen E Prieto
E-mail: xyz@gmail.com
Phone: 5056994899
Address_original: 655 Ives Dairy Rd, Miami, Florida, 33179, United States
ZIP_City_State_Country: -
Type: -
Subtype: -

#294445
ORDER_STATUS537099 VALID
24.09.2021 05:01 AM
Customer_ID: 4118230087730234
MMYY: 09/25
Txn_ID: 24430
Name: tera casey
Phone: 7405863997
Address_original: 13705 Neptune Lane, New Concord, Ohio State, 43762, PE
ZIP_City_State_Country: 43762, New Concord, Ohio State, UNITED STATES
Subtype: N/A

#294445
ORDER_STATUS489401 VALID
24.09.2021 05:01 AM
Customer_ID: 4118230054806983
MMYY: 07/24
Txn_ID: 13183
Name: Nancy Lambert
Address_original: 2600 loop drive, N, N, 44113, PE
ZIP_City_State_Country: 44113, N, N, UNITED STATES
Subtype: N/A

#294445
ORDER_STATUS437355 VALID
24.09.2021 05:01 AM
Customer_ID: 4118230061412668
MMYY: 05/24
Txn_ID: 55474
Name: Sheets Sherry
E-mail: tyd@gmail.com
Phone: (567) 241-5074
Address_original: 37 Martha Avenue, Mansfield, Ohio, 44905, US
ZIP_City_State_Country: 44905, Mansfield, Ohio, UNITED STATES
Subtype: N/A

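The data needs to be organized so that only Customer_ID, MMYY and Txn_ID appear, on a single line per record, separated by the | symbol. Everything else in this text file should be ignored.

Example:

5524335312265537 | 08/23 | 74627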
5524331591654699 | 11/23 | 45617
4118230087730234 | 09/25 | 24430
4118230054806983 | 07/24 | 13183
4118230061412668 | 05/24 | 55474

This is what I have tried, but I get an "Invalid file!" message after opening the text file. Reference taken from this post

private void openFile_Click(object sender, EventArgs e)
{
    OpenFileDialog ofdtmp = new OpenFileDialog();
    if (ofdtmp.ShowDialog() == System.Windows.Forms.DialogResult.OK)
    {
        try
        {
            using (StreamReader sr = File.OpenText(ofdtmp.FileName))
            {
                while (sr.Peek() >= 0)
                {
                    string line = sr.ReadLine();
                    line = line.Trim();
                    if (line.ToString() == "" || line.Contains("#") || line.Contains("ORDER_STATUS") || /*Exclude Date & Time*/ line.Contains(".") || line.Contains("Name:") || line.Contains("E-mail:") || line.Contains("Phone:") || line.Contains("Address_original:") || line.Contains("ZIP_City_State_Country:") || line.Contains("Type:") || line.Contains("Subtype:"))
                        continue; //skip

                    if (line.Contains("CustomerID: "))
                    {
                        string customID = line.Substring(12, 29).Trim();
                        continue;
                    }

                    if (line.Contains("MMYY: "))
                    {
                        string mmyy = line.Substring(6, 11).Trim();
                        continue;
                    }

                    if (line.Contains("Txn_ID: "))
                    {
                        string txnID = line.Substring(10, 16).Trim();
                        continue;
                    }
                }
                richTextBox.Text = sr.ToString();
            }
        }
        catch
        {
            MessageBox.Show("Invalid file!");
        }
    }
}

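I looked for alternative solutions in similar posts online, and it seems that applying a regular expression is the right approach. The difficulty is working out how to skip all the unnecessary characters and symbols in the text file and extract only the required data. What would be the best solution to this problem?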
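For comparison, the line-by-line approach suggested in the comments to this question can also be made to work without a regular expression. The following is only a minimal sketch, not the code from the post: LineParser, Parse and ValueAfter are hypothetical names, and it assumes that every record begins with a line starting with # and that the labels are spelled exactly Customer_ID:, MMYY: and Txn_ID: as in the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineParser
{
    // Returns the text after "key" if the line starts with it, otherwise null.
    static string ValueAfter(string line, string key)
    {
        return line.StartsWith(key, StringComparison.Ordinal)
            ? line.Substring(key.Length).Trim()
            : null;
    }

    // Reads records line by line and emits "Customer_ID | MMYY | Txn_ID"
    // as soon as all three fields of a record have been seen.
    public static List<string> Parse(TextReader reader)
    {
        var results = new List<string>();
        string customerId = null, mmyy = null, txnId = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();

            // Assumption: a line starting with '#' opens a new record,
            // so values left over from an incomplete record are discarded.
            if (line.StartsWith("#", StringComparison.Ordinal))
            {
                customerId = mmyy = txnId = null;
                continue;
            }

            // Keep the previous value when the line is not the matching field.
            customerId = ValueAfter(line, "Customer_ID:") ?? customerId;
            mmyy = ValueAfter(line, "MMYY:") ?? mmyy;
            txnId = ValueAfter(line, "Txn_ID:") ?? txnId;

            if (customerId != null && mmyy != null && txnId != null)
            {
                results.Add(customerId + " | " + mmyy + " | " + txnId);
                customerId = mmyy = txnId = null;
            }
        }
        return results;
    }
}
```

Because StreamReader derives from TextReader, the result could be shown with something like richTextBox.Text = string.Join(Environment.NewLine, LineParser.Parse(sr)); inside the existing using block. Taking the value with Substring(key.Length) avoids the hard-coded offsets and lengths that caused the Substring exception.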

Solution update:

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;
using System.Windows.Forms;

namespace RegExTool
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
        }

        private void openFile_Click(object sender, EventArgs e)
        {
            OpenFileDialog ofdtmp = new OpenFileDialog();
            if (ofdtmp.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                try
                {
                    using (StreamReader sr = new StreamReader(ofdtmp.FileName))
                    {
                        string data = sr.ReadToEnd();
                        richTextBox1.Clear();
                        richTextBox2.Clear();
                        richTextBox1.Text = data;
                        string pattern = @"(?<=CustomerID:).*|(?<=MMYY:).*|(?<=Txn_ID:).*";
                        var en = Regex.Matches(data, pattern, RegexOptions.IgnoreCase).GetEnumerator();
                        while (en.MoveNext())
                        {
                            var ci = en.Current;
                            if (!en.MoveNext())
                                break;
                            var di = en.Current;
                            if (!en.MoveNext())
                                break;
                            var ti = en.Current;
                            string text = ($"{ci}|{di}|{ti}") + System.Environment.NewLine;
                            richTextBox2.Text += text.Replace(" ", string.Empty);
                        }
                    }
                }
                catch (Exception ex)
                {
                    MessageBox.Show(ex.Message);
                }
            }
        }

        private void saveFile_Click(object sender, EventArgs e)
        {
            string tmp = richTextBox2.Text;
            SaveFileDialog svdtmp = new SaveFileDialog();
            if (svdtmp.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                try
                {
                    File.WriteAllText(svdtmp.FileName, (tmp.ToString()));
                    MessageBox.Show("File Saved!");
                }
                catch (Exception ex)
                {
                    MessageBox.Show("Cannot save text to file.");
                }
            }
        }
    }
}

Final solution:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Windows.Forms;

namespace RegExTool
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
        }

        static List<string> GetStrings(string input)
        {
            string pattern = @"Customer_ID: (?<CustomerId>\d+)[\r\n]+MMYY\: (?<ExpiryDate>\d{2}\/\d{2})[\r\n]+Txn_ID: (?<TxnId>\d+)";
            List<string> strings = new List<string>();
            foreach (Match match in Regex.Matches(input, pattern, RegexOptions.Multiline, TimeSpan.FromSeconds(1)))
            {
                strings.Add($"{match.Groups["CustomerId"]} | {match.Groups["ExpiryDate"]} | {match.Groups["TxnId"]}");
            }
            return strings;
        }

        private void openFile_Click(object sender, EventArgs e)
        {
            OpenFileDialog ofdtmp = new OpenFileDialog();
            if (ofdtmp.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                try
                {
                    using (StreamReader sr = new StreamReader(ofdtmp.FileName))
                    {
                        string input = sr.ReadToEnd();
                        richTextBox1.Clear();
                        richTextBox2.Clear();
                        richTextBox1.Text = input;
                        foreach (var value in GetStrings(input))
                        {
                            string text = value + System.Environment.NewLine;
                            richTextBox2.Text += text;
                        }
                    }
                }
                catch (Exception ex)
                {
                    MessageBox.Show(ex.Message);
                }
            }
        }

        private void saveFile_Click(object sender, EventArgs e)
        {
            string tmp = richTextBox2.Text;
            SaveFileDialog svdtmp = new SaveFileDialog();
            if (svdtmp.ShowDialog() == System.Windows.Forms.DialogResult.OK)
            {
                try
                {
                    File.WriteAllText(svdtmp.FileName, (tmp.ToString()));
                    MessageBox.Show("File Saved!");
                }
                catch (Exception ex)
                {
                    MessageBox.Show("Cannot save text to file.");
                }
            }
        }
    }
}

【Comments】:

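Let's start by actually displaying the exception you get instead of the Invalid File message. Add catch (Exception ex) { MessageBox.Show(ex.Message); } to your catch clause... or use Debug.WriteLine()

@Bola OK, the error says "Index and length must refer to a location within the string. Parameter name: length". Update - I edited line.Substring in each section to line.Substring(12, line.Length - 12).Trim(); line.Substring(5, line.Length - 5).Trim(); but this time the result is the text "System.IO.StreamReader" in the richTextBox. What went wrong?

You should trace your code, because then you will see exactly why it is not working as expected. To help... in the... while (sr.Peek() >= 0)... loop... the code reads several lines from the file. Your problem is: how do you know "when" the last line read from the file is the "END" of one of the lines you want to add to the text box? The code sets three variables after reading x lines... string customID = line.Substring(12, 29).Trim();... however, the code does nothing with those values.

I think you may want something like... richTextBox1.Text += customID + " | " + mmyy + " | " + txnID + Environment.NewLine;. As I mentioned before... you need to figure out how to tell "when" to add the line to the text box. Possibly set each variable to empty, then when all three variables are not empty you know to add the line to the text box, and then reset all the variables to empty. Crude, but it should work.

@JohnG I have replaced the line richTextBox.Text = sr.ToString(); with richTextBox.Text += customID + " | " + mmyy + " | " + txnID + Environment.NewLine; but the error says "The name customID/mmyy/txnID does not exist in the current context."

【Answer 1】: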

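My solution uses a regular expression. It handles the line breaks used on:
• Windows \r\n
• Linux \n
• classic Mac OS \r (current macOS uses \n)

You can test/run this code at https://replit.com/@JomaCorpFX/SO70374465

You can check the regular expression at https://regex101.com/r/R7Q5bq/4

Code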

using System.Text.RegularExpressions;
using System.Collections.Generic;
using System;
using System.Linq;

public class Program
{
    static List<string> GetStrings(string input)
    {
        string pattern = @"Customer_ID: (?<CustomerId>\d+)[\r\n]+MMYY\: (?<ExpiryDate>\d{2}\/\d{2})[\r\n]+Txn_ID: (?<TxnId>\d+)";
        List<string> strings = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern, RegexOptions.Multiline, TimeSpan.FromSeconds(1)))
        {
            strings.Add($"{match.Groups["CustomerId"]} | {match.Groups["ExpiryDate"]} | {match.Groups["TxnId"]}");
        }
        return strings;
    }

    public static void Main(string[] args)
    {
        string input = @"#294448
ORDER_STATUS1098988 VALID
24.09.2021 05:17 AM
Customer_ID: 5524335312265537
MMYY: 08/23
Txn_ID: 74627
Name: Krystal Flowers
E-mail: abc@gmail.com
Phone: 9109153030
Address_original: 1656 W Alvarado dr, Pueblo West, Colorado, 81007, United States
ZIP_City_State_Country: -
Type: -
Subtype: -

#294448
ORDER_STATUS1097728 VALID
24.09.2021 05:17 PM
Customer_ID: 5524331591654699
MMYY: 11/23
Txn_ID: 45617
Name: Allen E Prieto
E-mail: xyz@gmail.com
Phone: 5056994899
Address_original: 655 Ives Dairy Rd, Miami, Florida, 33179, United States
ZIP_City_State_Country: -
Type: -
Subtype: -

#294445
ORDER_STATUS537099 VALID
24.09.2021 05:01 AM
Customer_ID: 4118230087730234
MMYY: 09/25
Txn_ID: 24430
Name: tera casey
Phone: 7405863997
Address_original: 13705 Neptune Lane, New Concord, Ohio State, 43762, PE
ZIP_City_State_Country: 43762, New Concord, Ohio State, UNITED STATES
Subtype: N/A

#294445
ORDER_STATUS489401 VALID
24.09.2021 05:01 AM
Customer_ID: 4118230054806983
MMYY: 07/24
Txn_ID: 13183
Name: Nancy Lambert
Address_original: 2600 loop drive, N, N, 44113, PE
ZIP_City_State_Country: 44113, N, N, UNITED STATES
Subtype: N/A

#294445
ORDER_STATUS437355 VALID
24.09.2021 05:01 AM
Customer_ID: 4118230061412668
MMYY: 05/24
Txn_ID: 55474
Name: Sheets Sherry
E-mail: tyd@gmail.com
Phone: (567) 241-5074
Address_original: 37 Martha Avenue, Mansfield, Ohio, 44905, US
ZIP_City_State_Country: 44905, Mansfield, Ohio, UNITED STATES
Subtype: N/A";
        foreach (var value in GetStrings(input))
        {
            Console.WriteLine(value);
        }
        Console.ReadLine();
    }
}

Output

5524335312265537 | 08/23 | 74627
5524331591654699 | 11/23 | 45617
4118230087730234 | 09/25 | 24430
4118230054806983 | 07/24 | 13183
4118230061412668 | 05/24 | 55474
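The point about line breaks can be checked on a single record. This is a small sketch only, assuming the answer's pattern with the {2} quantifiers written out; NewlineCheck and MatchesWith are hypothetical names introduced for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineCheck
{
    // Same structure as the answer's pattern: three labelled fields on consecutive lines.
    const string Pattern =
        @"Customer_ID: (?<CustomerId>\d+)[\r\n]+MMYY: (?<ExpiryDate>\d{2}/\d{2})[\r\n]+Txn_ID: (?<TxnId>\d+)";

    // Builds one record joined with the given line terminator
    // and reports whether the pattern matches it.
    public static bool MatchesWith(string newline)
    {
        string record = string.Join(newline, new[]
        {
            "Customer_ID: 5524335312265537",
            "MMYY: 08/23",
            "Txn_ID: 74627"
        });
        return Regex.IsMatch(record, Pattern);
    }
}
```

MatchesWith("\r\n"), MatchesWith("\n") and MatchesWith("\r") all return true, because [\r\n]+ accepts any run of carriage returns and line feeds. The pattern does not allow spaces between a value and the line break, so a file with padded lines may need something like \s* before each [\r\n]+.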

References

Regex Match Method - Match(String, String, RegexOptions, TimeSpan)

【Comments】:

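@Joma After pasting the code into VS I get the error: No overload for method 'Matches' takes 4 arguments.

Which version of the framework are you using? Maybe your target framework is 4.0. To avoid this problem, remove the last parameter, TimeSpan.FromSeconds(1).

@Joma Yes, it is now set to target framework 4.5. Your solution works, even in the worst case. With the previous solution I was getting results with the text on new lines. Updated the code in the post!

The matchTimeout parameter specifies how long a pattern-matching method should try to find a match before it times out. Setting a time-out interval prevents regular expressions that rely on excessive backtracking from appearing to stop responding when they process input that contains near matches. For more information, see Best Practices for Regular Expressions and Backtracking. If no match is found in that time interval, the method throws a RegexMatchTimeoutException exception. matchTimeout overrides any default time-out value defined for the application domain in which the method executes.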
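The time-out behaviour described in the last comment can be handled explicitly. The following is a hedged sketch rather than part of the answer: SafeExtractor and TryExtract are hypothetical names, and it requires .NET Framework 4.5 or later, where the four-argument Regex.Matches overload is available.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SafeExtractor
{
    const string Pattern =
        @"Customer_ID: (?<CustomerId>\d+)[\r\n]+MMYY: (?<ExpiryDate>\d{2}/\d{2})[\r\n]+Txn_ID: (?<TxnId>\d+)";

    // Returns true and fills "results" when matching completes in time;
    // returns false (with whatever was found so far) when the time-out is hit.
    public static bool TryExtract(string input, TimeSpan timeout, out List<string> results)
    {
        results = new List<string>();
        try
        {
            // MatchCollection is evaluated lazily, so the time-out exception
            // can surface while enumerating; the loop stays inside the try block.
            foreach (Match match in Regex.Matches(input, Pattern, RegexOptions.None, timeout))
            {
                results.Add(match.Groups["CustomerId"].Value + " | "
                    + match.Groups["ExpiryDate"].Value + " | "
                    + match.Groups["TxnId"].Value);
            }
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }
}
```

A caller could show a message when TryExtract returns false instead of letting the exception reach the generic catch block. RegexOptions.None is used here because the pattern contains no ^ or $ anchors, which are the only constructs RegexOptions.Multiline affects.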
