String parsing in Java with delimiter tab "\t" using split

Posted 2023-02-24

【Title】String parsing in Java with delimiter tab "\t" using split 【Posted】2010-12-10 18:46:44 【Question】:

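I am processing a tab-delimited string. I do this with the split function, and it works in most cases. The problem occurs when a field is missing: instead of getting null in that field, I get the next value. I store the parsed values in a String array.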

String[] columnDetail = new String[11];
columnDetail = column.split("\t");

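Any help would be appreciated. If possible, I would like to store the parsed strings in a String array so that I can access the parsed data easily.

【Question comments】:

So field1\tfield2\t\tfield4 gives you field1,field2,field4 instead of field1,field2,[null],field4 ?
***.com/questions/1630092/token-parsing-in-java/… Duplicate?
This is what happens when you do not understand the answers and just copy the code.
You do not need to allocate a new String array. String.split allocates a new one anyway.
@o.k.w ya, actually I have an XML file containing tags, and I have to read its tab-delimited values.
You need to understand what you are looking for and why. Handing you working code for your problem will not teach you anything; you will just end up asking the same question over and over in different scenarios.

【Answer 1】: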

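String.split uses Regular Expressions, and you do not need to allocate an extra array for the split.

The split method gives you an array. The problem is that you try to define in advance how many tabs occur, but how would you know that? Try using Scanner or StringTokenizer, and learn how splitting strings works.

Let me explain why \t does not work and why you need \\\\ to escape \\.

OK, so when you use split, it actually takes a regex (regular expression), and in the regular expression you define which character to split by. If you write \t, that does not actually mean \t, and what you want to split by is \t, right? So by writing just \t you tell your regex processor "hey, split by the character that is an escaped t", NOT "hey, split by all characters that look like \t". Notice the difference? Using \ means escaping something, and \ in a regex means something totally different from what you think.

So this is why you need to use this solution:

\\t

This tells the regex processor to look for \t. OK, so why do you need two of them? Well, the first \ escapes the second, which means it looks like this: \t when the text is processed!

Now let's say you want to split on \

Then you would be left with \\, but see, that does not work, because \ will try to escape the character that follows it! That is why you want the output to be \\, and therefore you need \\\\.

I really hope the examples above help you understand why your solution does not work and how to conquer other ones!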
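The escape levels can be checked directly. The sketch below is not part of the original answer, and the class and method names are made up for illustration. It passes three pattern literals to String.split: "\t" (the Java compiler turns it into a real tab before the regex engine sees it), "\\t" (the regex engine receives a backslash followed by t and reads that as a tab), and "\\\\" (the regex engine receives two backslashes, meaning one literal backslash). Note that the first two behave identically, as a commenter in the discussion below also points out.

```java
import java.util.Arrays;

final class EscapeLevels {
    // Java-level escape: the pattern string contains a single tab character.
    static String[] splitJavaEscape(String s) {
        return s.split("\t");
    }

    // Regex-level escape: the pattern string contains backslash + 't'.
    static String[] splitRegexEscape(String s) {
        return s.split("\\t");
    }

    // Four backslashes in source -> two in the pattern -> one literal backslash.
    static String[] splitOnBackslash(String s) {
        return s.split("\\\\");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitJavaEscape("a\tb")));   // [a, b]
        System.out.println(Arrays.toString(splitRegexEscape("a\tb")));  // [a, b]
        System.out.println(Arrays.toString(splitOnBackslash("a\\b")));  // [a, b]
    }
}
```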

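Now, I have given you this answer before; maybe you should start looking at them now.

Other methods

StringTokenizer

You should look at StringTokenizer; it is a very handy tool for this kind of work.

Example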

 StringTokenizer st = new StringTokenizer("this is a test");
 while (st.hasMoreTokens()) 
     System.out.println(st.nextToken());

This will output

 this
 is
 a
 test

You use the second constructor of StringTokenizer to set the delimiter:

StringTokenizer(String str, String delim)
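One caveat matters for the question asked here: StringTokenizer treats a run of consecutive delimiters as a single separator, so a missing field between two tabs simply disappears. The following is a minimal sketch of that behavior; the class name TokenizerTabs and the method tokens are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class TokenizerTabs {
    // Collects all tokens of a tab-delimited line using StringTokenizer.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "\t");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // The empty third field is silently skipped.
        System.out.println(tokens("field1\tfield2\t\tfield4")); // [field1, field2, field4]
    }
}
```

So if the position of each field matters, StringTokenizer is probably not the right tool for this input.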

Scanner

You could also use a Scanner, as one of the commenters said; that could look something like this

Example

 String input = "1 fish 2 fish red fish blue fish";

 Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");

 System.out.println(s.nextInt());
 System.out.println(s.nextInt());
 System.out.println(s.next());
 System.out.println(s.next());

 s.close();

The output would be

 1
 2
 red
 blue

This means it cuts out the word "fish" and gives you the rest, using "fish" as the delimiter.

examples taken from the Java API
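Applied to the tab-delimited input from the question, a Scanner with a single-tab delimiter does return an empty token for an empty field in the middle of the line. The sketch below is not taken from the Java API documentation; ScannerTabs and tokens are hypothetical names. It only exercises an interior empty field, since Scanner's handling of leading and trailing delimiters differs from split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class ScannerTabs {
    // Collects all tokens of a line, using exactly one tab as the delimiter.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        try (Scanner s = new Scanner(line).useDelimiter("\t")) {
            while (s.hasNext()) {
                out.add(s.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The empty third field shows up as an empty string.
        System.out.println(tokens("field1\tfield2\t\tfield4")); // [field1, field2, , field4]
    }
}
```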

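【Discussion】:

Splitting at a tab should not get you bitten by regexes, though.
Probably not, but if the OP just tried to read the answers and understand them, he would know the answer to this question, because it is similar to what he posted yesterday. I would say that if he had used my approach yesterday and today, he would not have this problem.
Either you have the problem all wrong or you are asking the wrong kind of question. I suggest not using parsers and other things to read XML. Just start simple. Please give us an example; if you cannot use the information I provided (which I find doubtful), there is not much I can do for you.
The output is the same whether you use "\t" or "\\t", and I do not see why you would use StringTokenizer or Scanner. Besides, String.split is much simpler than the other two, and according to the documentation, "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code."
-1 - misinformation on "\t" or "\\t" (***.com/a/3762377/281545) - please edit

【Answer 2】:

Try this: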

String[] columnDetail = column.split("\t", -1);

Read the Javadoc on String.split(java.lang.String, int) for an explanation of the limit parameter of the split function:

split

public String[] split(String regex, int limit)
Splits this string around matches of the given regular expression.
The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

The string "boo:and:foo", for example, yields the following results with these parameters:

Regex   Limit   Result
:       2       { "boo", "and:foo" }
:       5       { "boo", "and", "foo" }
:       -2      { "boo", "and", "foo" }
o       5       { "b", "", ":and:f", "", "" }
o       -2      { "b", "", ":and:f", "", "" }
o       0       { "b", "", ":and:f" }

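When the last few fields are missing (which I guess is your case), you get a line like this:

field1\tfield2\tfield3\t\t

If no limit is passed to split(), the limit is 0, which means "trailing empty strings will be discarded". So you only get 3 fields: "field1", "field2", "field3".

When the limit is set to -1, a non-positive value, trailing empty strings are not discarded. So you get 5 fields, the last two being empty strings: "field1", "field2", "field3", "", "".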
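The difference is easy to confirm with the sample line above. This is a small sketch; the class name TrailingFieldsDemo and its two helper methods are made up for the illustration.

```java
import java.util.Arrays;

final class TrailingFieldsDemo {
    // Default split: equivalent to limit 0, trailing empty strings are dropped.
    static String[] splitDefault(String line) {
        return line.split("\t");
    }

    // Negative limit: trailing empty strings are kept.
    static String[] splitKeepTrailing(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        String line = "field1\tfield2\tfield3\t\t";
        System.out.println(splitDefault(line).length);       // 3
        System.out.println(splitKeepTrailing(line).length);  // 5
        System.out.println(Arrays.toString(splitKeepTrailing(line)));
    }
}
```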

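【Discussion】:

@Happy3: You linked to the Java 1.4 documentation. Shouldn't we refer to a more recent version? :)

【Answer 3】:

Nobody has answered it yet, and that is partly the question's fault: the input string contains 11 fields (that much can be inferred), but how many tabs? Most likely exactly 10. In that case the answer is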

String s = "\t2\t\t4\t5\t6\t\t8\t\t10\t";
String[] fields = s.split("\t", -1);  // in your case s.split("\t", 11) might also do
for (int i = 0; i < fields.length; ++i) 
    if ("".equals(fields[i])) fields[i] = null;

System.out.println(Arrays.asList(fields));
// [null, 2, null, 4, 5, 6, null, 8, null, 10, null]
// with s.split("\t") : [null, 2, null, 4, 5, 6, null, 8, null, 10]

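This will of course not work as expected if the fields themselves happen to contain tabs. The -1 means: apply the pattern as many times as needed, so the trailing field (the 11th) is preserved (as an empty string "" if it is absent, which has to be converted to null explicitly).

If, on the other hand, there are no tabs for the missing fields, so that "5\t6" is a valid input string containing only fields 5 and 6, there is no way to get fields[] via split.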
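One way to notice the second situation early is to check the field count after splitting. The following is only a sketch under the assumption that every well-formed line has a known, fixed number of fields; FixedWidthRow, parseRow and expectedFields are hypothetical names, not something from the question.

```java
import java.util.Arrays;

final class FixedWidthRow {
    // Splits a tab-delimited line, rejects lines with the wrong number of
    // fields, and maps empty fields to null.
    static String[] parseRow(String line, int expectedFields) {
        String[] fields = line.split("\t", -1);
        if (fields.length != expectedFields) {
            throw new IllegalArgumentException(
                "expected " + expectedFields + " fields but found " + fields.length);
        }
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].isEmpty()) {
                fields[i] = null;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseRow("\t2\t\t4", 4))); // [null, 2, null, 4]
    }
}
```

A line such as "5\t6" checked against 11 expected fields would be rejected here instead of being silently misaligned.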

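【Discussion】:

It was not marked as accepted because the OP never returned to the site after asking the question.

【Answer 4】:

If the data in the tab-delimited fields can itself contain newlines, tabs and possibly " characters, an implementation based on String.split has serious limitations.

Tab-delimited formats have been around for many years, but the format is not standardized and varies a lot. Many implementations do not escape characters (newlines and tabs) that appear inside a field. Instead, they follow the CSV convention and wrap any non-trivial field in "double quotes". Then they escape only the double quotes. So one "line" can extend over multiple lines.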
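To make the quoting convention concrete, here is a hand-rolled sketch of a parser for a single record. It assumes one particular dialect: a field may be wrapped in double quotes, a doubled "" inside a quoted field stands for one literal quote, and tabs or newlines inside quotes belong to the field. The class QuotedTsvSketch and the method parseRecord are invented for this illustration and do not reflect how any specific library works; a maintained library is usually the better choice for real data.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedTsvSketch {
    // Parses one record of tab-delimited text with optional double-quoted fields.
    static List<String> parseRecord(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote -> one literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // tabs and newlines are kept verbatim
                }
            } else if (c == '"' && current.length() == 0) {
                inQuotes = true;             // opening quote at the start of a field
            } else if (c == '\t') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        // Second field contains a tab, third contains quotes, fourth is empty.
        System.out.println(parseRecord("a\t\"b\tc\"\t\"say \"\"hi\"\"\"\t"));
    }
}
```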

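While reading up on this I came across the advice "just reuse the Apache tools", which sounds like good advice.

In the end I personally chose opencsv. I found it lightweight, and since it provides options for the escape and quote characters, it should cover the most popular comma- and tab-delimited data formats.

Example: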

CSVReader tabFormatReader = new CSVReader(new FileReader("yourfile.tsv"), '\t');

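【Discussion】:

【Answer 5】:

I just had the same problem and noticed the answer in some kind of tutorial. In general you need to use the second form of the split method, using

split(regex, limit)

Here is the full tutorial: http://www.rgagnon.com/javadetails/java-0438.html

If you set the limit parameter to some negative number, you get empty strings in the array where the actual values are missing. For this to work, your initial string should contain two copies of the delimiter, i.e. you should have \t\t where the value is missing.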
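Once the empty strings are preserved, finding the missing values is a matter of checking each element. A minimal sketch follows; MissingFieldFinder and missingIndexes are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class MissingFieldFinder {
    // Returns the zero-based positions of empty fields in a tab-delimited line.
    static List<Integer> missingIndexes(String line) {
        String[] fields = line.split("\t", -1); // negative limit keeps trailing empties
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].isEmpty()) {
                missing.add(i);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Fields: "a", "", "c", "" -> positions 1 and 3 are missing.
        System.out.println(missingIndexes("a\t\tc\t")); // [1, 3]
    }
}
```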

Hope this helps :)

【Discussion】:

【Answer 6】:

You can use yourstring.split("\\x09"); I tested it, and it works.

【Discussion】:

【Answer 7】:

String[] columnDetail;
columnDetail = column.split("\t", -1); // unlimited
// OR
columnDetail = column.split("\t", 11); // if you are sure about the limit

The limit parameter controls the number of times the
pattern is applied and therefore affects the length of the resulting
array. If the limit n is greater than zero then the pattern
will be applied at most n - 1 times, the array's
length will be no greater than n, and the array's last entry
will contain all input beyond the last matched delimiter. If n
is non-positive then the pattern will be applied as many times as
possible and the array can have any length. If n is zero then
the pattern will be applied as many times as possible, the array can
have any length, and trailing empty strings will be discarded.

【Discussion】:
