Why Two Seemingly Identical Strings Compare as Different

Posted 2022-12-03 在京奋斗者


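I once ran into the following problem. I wanted to find out which entries differed between two files, but the result of my Java code was baffling: strings that were visibly identical were reported as different. Take the string 1000-11-20190225-ZP-1551024000-1632240000, for example. It appeared in both files, yet when I read the values with Java and compared them, they were not equal. I was completely puzzled at the time and even suspected a bug in Eclipse or IDEA.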

Here is the code I wrote:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
// plus the @Test import that matches your JUnit version

@Test
public void testNotEqualData() throws IOException {
	File file = new File("F:\\stockSnapshot\\erp1.txt");
	File file2 = new File("F:\\stockSnapshot\\result1.txt");
	List<String> list = new ArrayList<>();
	String temp;
	// FileReader does not strip a leading BOM; try-with-resources closes both readers
	try (BufferedReader reader = new BufferedReader(new FileReader(file));
	     BufferedReader reader2 = new BufferedReader(new FileReader(file2))) {
		// Collect every line of the first file
		while ((temp = reader.readLine()) != null) {
			System.out.println("temp length: " + temp.length());
			list.add(temp);
		}
		// Report every line of the second file that is missing from the first
		while ((temp = reader2.readLine()) != null) {
			System.out.println("temp2 length: " + temp.length());
			if (!list.contains(temp)) {
				System.out.println("Mismatched data: " + temp);
			}
		}
	}
}

The output was:

temp length: 42
temp2 length: 41
Mismatched data: 1000-11-20190225-ZP-1551024000-1632240000

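It turned out that the two files were encoded slightly differently: one was UTF-8 without BOM, the other UTF-8 with a BOM. Both are UTF-8, but they are not byte-identical. On Windows, when a text editor saves a file as UTF-8 or another Unicode encoding, it may prepend a BOM (byte order mark) to the very start of the file; for UTF-8 this is the three bytes EF BB BF. The marker appears only once, before the first character, so the second and later lines are unaffected.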

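Java does not strip this marker when reading the file, and String.trim() cannot remove it either, because trim() only removes characters up to U+0020. If you read the first line with readLine() into a String, its length is one greater than what you see (42 versus 41 in the output above), and its first character is the BOM.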
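The effect can be reproduced without any files. The following is a minimal sketch that builds the bytes of a BOM-prefixed UTF-8 text by hand and decodes them; the class name BomLengthDemo and the helpers decodeUtf8 and withBom are made up for this illustration and are not part of the original code.

```java
import java.nio.charset.StandardCharsets;

public class BomLengthDemo {
    // Hypothetical helper: decodes raw bytes as UTF-8. The decoder keeps a leading BOM.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: the bytes of a UTF-8 file that starts with a BOM (EF BB BF).
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    public static void main(String[] args) {
        String plain = "1000-11-20190225-ZP-1551024000-1632240000";
        String fromBomFile = decodeUtf8(withBom(plain));

        System.out.println(plain.length());                    // 41
        System.out.println(fromBomFile.length());              // 42
        System.out.println(fromBomFile.equals(plain));         // false
        System.out.println(fromBomFile.trim().length());       // 42, trim() leaves the BOM alone
        System.out.println(fromBomFile.charAt(0) == '\uFEFF'); // true
    }
}
```

The two strings print identically in most consoles, which is why the mismatch is so hard to spot by eye.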

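This is a nuisance. Fortunately, when Java decodes a Unicode file, the BOM always ends up as the single character "\uFEFF", so you can handle it yourself: check for it, then remove it with substring() or replace():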

For example: temp = temp.trim().replace("\uFEFF", ""); removes the BOM in one uniform step.
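A slightly more careful variant removes the marker only when it is the first character, so a U+FEFF that appears elsewhere in the data is left alone. This is a sketch under that assumption; BomStripper and stripLeadingBom are hypothetical names, not something from the original code.

```java
public class BomStripper {
    private static final char BOM = '\uFEFF';

    // Hypothetical helper: drops a single leading U+FEFF and returns everything else unchanged.
    static String stripLeadingBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        String firstLine = "\uFEFF1000-11";
        System.out.println(firstLine.length());                  // 8
        System.out.println(stripLeadingBom(firstLine).length()); // 7
        System.out.println(stripLeadingBom("1000-11").length()); // 7
    }
}
```

In the comparison loop it would only need to be applied to the first line of each file, since the BOM never occurs at the start of later lines.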

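However, this workaround is not ideal: it still misbehaves when the program is packaged as a jar and run on Windows. A likely cause (my inference, not something I verified) is that FileReader decodes with the platform default charset, which before JDK 18 is often not UTF-8 on Windows, so the BOM bytes never turn into \uFEFF. There are two better solutions: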

First: convert both files to UTF-8 without BOM. Neither file then contains a BOM, so the problem disappears.
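The conversion is usually done in an editor, but it can also be scripted. Below is a rough sketch that assumes the files are small enough to load into memory; BomRemover and its methods are hypothetical names introduced for this example, and it rewrites the file in place, so try it on a copy first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class BomRemover {
    // True if the bytes start with the UTF-8 BOM (EF BB BF).
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    // Returns the content without a leading BOM; input without a BOM is returned as is.
    static byte[] withoutBom(byte[] bytes) {
        return hasUtf8Bom(bytes) ? Arrays.copyOfRange(bytes, 3, bytes.length) : bytes;
    }

    // Rewrites the file in place without its BOM; returns true if a BOM was removed.
    static boolean removeBom(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        if (!hasUtf8Bom(bytes)) {
            return false;
        }
        Files.write(file, withoutBom(bytes));
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is treated as the path of a file to convert.
        for (String arg : args) {
            Path file = Paths.get(arg);
            System.out.println(file + (removeBom(file) ? ": BOM removed" : ": no BOM found"));
        }
    }
}
```

Working on raw bytes avoids any dependence on the default charset, since the three BOM bytes are checked before anything is decoded.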

Second: use the BOMInputStream class provided by Apache Commons IO.

Add the following dependency:

<dependency>
  <groupId>commons-io</groupId>
  <artifactId>commons-io</artifactId>
  <version>2.4</version>
</dependency>

Then read the files as shown below (the key piece is the BOMInputStream class):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.io.input.BOMInputStream;
// plus the @Test import that matches your JUnit version

@Test
public void testNotEqualData() throws IOException {
	File file = new File("F:\\stockSnapshot\\erp1.txt");
	File file2 = new File("F:\\stockSnapshot\\result1.txt");
	List<String> list = new ArrayList<>();
	String temp;
	// BOMInputStream skips a leading UTF-8 BOM; decode explicitly as UTF-8
	// instead of relying on the platform default charset
	try (BufferedReader reader = new BufferedReader(new InputStreamReader(
			new BOMInputStream(new FileInputStream(file)), StandardCharsets.UTF_8));
	     BufferedReader reader2 = new BufferedReader(new InputStreamReader(
			new BOMInputStream(new FileInputStream(file2)), StandardCharsets.UTF_8))) {
		while ((temp = reader.readLine()) != null) {
			System.out.println("temp length: " + temp.length());
			list.add(temp);
		}
		while ((temp = reader2.readLine()) != null) {
			System.out.println("temp2 length: " + temp.length());
			if (!list.contains(temp)) {
				System.out.println("Mismatched data: " + temp);
			}
		}
	}
}

The output after the change:

temp length: 41
temp2 length: 41

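Pick whichever fits your situation. If you are not allowed to change the files' encoding, the second approach is clearly the better choice.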
