Java 609. Find Duplicate File in System
import java.util.*;
import java.util.stream.Collectors;

public static List<List<String>> findDuplicate(String[] paths) {
    Map<String, List<String>> map = new HashMap<>();   // content -> list of full file paths
    for (String path : paths) {
        String[] tokens = path.split(" ");
        // tokens[0] is the directory; the rest are "name(content)" entries
        for (int i = 1; i < tokens.length; i++) {
            String file = tokens[i].substring(0, tokens[i].indexOf('('));
            String content = tokens[i].substring(tokens[i].indexOf('(') + 1, tokens[i].indexOf(')'));
            map.putIfAbsent(content, new ArrayList<>());
            map.get(content).add(tokens[0] + "/" + file);
        }
    }
    // Keep only contents shared by more than one file
    return map.values().stream().filter(e -> e.size() > 1).collect(Collectors.toList());
}
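For a quick sanity check, here is how the method above is expected to group the classic example input for this problem (the output order of the groups may vary):

String[] paths = {
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)"
};
// Groups of duplicates by content:
// "abcd" -> [root/a/1.txt, root/c/3.txt]
// "efgh" -> [root/a/2.txt, root/c/d/4.txt, root/4.txt]
System.out.println(findDuplicate(paths));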
Follow up questions:
1. Imagine you are given a real file system, how will you search files? DFS or BFS?
In general, BFS will use more memory than DFS. However, BFS can take advantage of the locality of files inside directories, and will therefore probably be faster (a BFS sketch follows below).
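A minimal sketch of the BFS idea over a real file system using java.nio; the method name bfsListFiles and the starting root path are illustrative assumptions, not part of the original answer:

import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayDeque;
import java.util.Deque;

// BFS over directories: visit a directory, enqueue its subdirectories,
// and inspect regular files level by level.
static void bfsListFiles(Path root) throws IOException {
    Deque<Path> queue = new ArrayDeque<>();
    queue.add(root);
    while (!queue.isEmpty()) {
        Path dir = queue.poll();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isDirectory(entry)) {
                    queue.add(entry);
                } else if (Files.isRegularFile(entry)) {
                    System.out.println(entry); // hook duplicate detection in here
                }
            }
        }
    }
}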
2. If the file content is very large (GB level), how will you modify your solution?
In a real-life solution we would not hash the entire file content, since that is not practical. Instead, we first group all files by size: files with different sizes are guaranteed to be different. We then hash a small part of the files within each equal-size group (using MD5, for example). Only if the MD5 digests match do we compare the files byte by byte (see the sketch below).
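A hedged sketch of that two-stage filter; the helper name findCandidates and the 4 KB prefix length are illustrative assumptions, and files that survive this filter still need a byte-by-byte check:

import java.io.IOException;
import java.nio.file.*;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

// Group files by size first; only files of equal size get a partial hash.
static Map<String, List<Path>> findCandidates(List<Path> files)
        throws IOException, NoSuchAlgorithmException {
    Map<Long, List<Path>> bySize = new HashMap<>();
    for (Path f : files) {
        bySize.computeIfAbsent(Files.size(f), k -> new ArrayList<>()).add(f);
    }
    Map<String, List<Path>> byPrefixHash = new HashMap<>();
    for (List<Path> sameSize : bySize.values()) {
        if (sameSize.size() < 2) continue;           // unique size => unique file
        for (Path f : sameSize) {
            byte[] prefix = new byte[4096];          // assumed prefix length
            try (var in = Files.newInputStream(f)) {
                int n = in.read(prefix);
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                md5.update(prefix, 0, Math.max(n, 0));
                String key = Files.size(f) + ":" + Arrays.toString(md5.digest());
                byPrefixHash.computeIfAbsent(key, k -> new ArrayList<>()).add(f);
            }
        }
    }
    return byPrefixHash;                             // candidates for byte-by-byte comparison
}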
3. If you can only read the file by 1kb each time, how will you modify your solution?
This won't change the solution. We can build the hash incrementally from the 1 KB chunks, and then read the entire file only if a full byte-by-byte comparison is required (see the sketch below).
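A minimal sketch of building the hash from 1 KB chunks with java.security.MessageDigest; the helper name hashInChunks is an assumption:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.*;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Feed the digest one 1 KB buffer at a time; the file never has to fit in memory.
static byte[] hashInChunks(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] buffer = new byte[1024];
    try (InputStream in = Files.newInputStream(file)) {
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
    }
    return md.digest();
}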
4. What is the time complexity of your modified solution? What is the most time-consuming part and memory-consuming part of it? How to optimize?
Time complexity is O(n^2 * k), since in the worst case we might need to compare every file to all others, where n is the number of files and k is the file size. Reading and comparing file contents is the most time-consuming part, and the map of contents/digests is the most memory-consuming part; grouping by size first and hashing only the candidate groups (as in question 2) reduces both.
5. How to make sure the duplicated files you find are not false positives?
We will use several filters to compare: file size, hash, and finally a byte-by-byte comparison (a short sketch follows below).
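A short sketch of the final byte-by-byte confirmation; Files.mismatch (available since Java 12) is one way to express it, and the helper name isExactDuplicate is an assumption:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Two files are true duplicates only if every byte matches.
// Files.mismatch returns -1 when the contents are identical.
static boolean isExactDuplicate(Path a, Path b) throws IOException {
    return Files.size(a) == Files.size(b) && Files.mismatch(a, b) == -1L;
}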
import java.util.*;

public class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        HashMap<String, Integer> map = new HashMap<>();         // content -> occurrence count
        HashMap<String, String> mapPath = new HashMap<>();      // content -> first file path seen
        HashMap<String, Integer> resultIndex = new HashMap<>(); // content -> index of its group in result
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < paths.length; i++) {
            String[] parts = paths[i].split(" ");
            String dir = parts[0];
            for (int j = 1; j < parts.length; j++) {
                String file = parts[j];
                // Separate content and filename
                int s_index = file.indexOf("(");
                String fname = file.substring(0, s_index);
                String content = file.substring(s_index + 1, file.length() - 1);
                if (map.containsKey(content)) {
                    // Duplicate file found
                    if (map.get(content) == 1) {
                        int r_index = result.size();
                        ArrayList<String> newRow = new ArrayList<>();
                        newRow.add(mapPath.get(content)); // add the first file seen
                        newRow.add(dir + "/" + fname);
                        result.add(newRow);
                        resultIndex.put(content, r_index);
                        map.put(content, map.get(content) + 1);
                    } else {
                        result.get(resultIndex.get(content)).add(dir + "/" + fname);
                    }
                } else {
                    map.put(content, 1);
                    mapPath.put(content, dir + "/" + fname);
                }
            }
        } // end for
        return result;
    }
}
import java.util.*;

public class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        Map<String, List<String>> map = new HashMap<>(); // content -> full file paths
        for (String path : paths) {
            String[] pathArr = path.split(" ");
            String dir = pathArr[0] + '/';
            for (int i = 1; i < pathArr.length; i++) {
                int start = pathArr[i].indexOf('(');
                String fileName = dir + pathArr[i].substring(0, start);
                String content = pathArr[i].substring(start + 1, pathArr[i].length() - 1);
                map.putIfAbsent(content, new ArrayList<>());
                map.get(content).add(fileName);
            }
        }
        List<List<String>> res = new ArrayList<>();
        for (String key : map.keySet()) {
            if (map.get(key).size() > 1) {
                res.add(map.get(key));
            }
        }
        return res;
    }
}