Downloading the Kugou Top 500 with a Java Crawler
Posted by 小猪聪聪
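I wanted to download some songs, but the apps and websites all require a paid membership, and I'm broke. So I decided to write some code to download the songs instead.
Open the Kugou Top 500: http://www.kugou.com/yy/rank/home/1-8888.html
The page lists only 22 songs. Looking at the URL, I guessed that the leading number was the page index, and it was: changing the URL to http://www.kugou.com/yy/rank/home/2-8888.html opens the second page.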
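The pagination pattern can be sketched as a small helper. This is only an illustration: the class and method names are made up here, and the page count assumes the list really holds 500 songs at 22 per page (which would give 23 pages), something the post does not verify.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the original code): builds the URL of every rank page.
class RankPageUrls {
    static final int SONGS_PER_PAGE = 22; // observed on the first page
    static final int TOTAL_SONGS = 500;   // assumption taken from the "Top 500" name

    static String pageUrl(int page) {
        return "http://www.kugou.com/yy/rank/home/" + page + "-8888.html";
    }

    static List<String> allPageUrls() {
        // Ceiling division: 500 songs at 22 per page rounds up to 23 pages.
        int pages = (TOTAL_SONGS + SONGS_PER_PAGE - 1) / SONGS_PER_PAGE;
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pages; page++) {
            urls.add(pageUrl(page));
        }
        return urls;
    }

    public static void main(String[] args) {
        allPageUrls().forEach(System.out::println);
    }
}
```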
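Next, click a song to open its playback page: http://www.kugou.com/song/#hash=3337C18539D5BB00D8027D653D536A35&album_id=32440418
In Chrome DevTools, search the Elements panel for MP3.
Requesting the same page from Java, however, returned no MP3 link, so the link had to come from another request or be loaded by JavaScript. I switched to the Network panel and searched for MP3 there.
Then I looked at the request parameters.
Testing with Postman showed that the required parameters are: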
callback=jQuery191027067069941080546_1546235744250
hash=_HASH_
album_id=_ALBUM_ID_
mid=98350860ab960276d13883f7d07d364d
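After switching to another song, callback and mid stayed the same, so only hash and album_id need to be found.
The hash and album_id match the ones in the playback page URL, so the data must be passed along from the Top 500 page.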
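To make the relationship between the two URLs concrete, here is a small JDK-only sketch that reads hash and album_id from a playback-page URL and fills them into the play/getdata request. The class and method names are hypothetical; the endpoint, callback and mid values are simply the ones observed in this post and may well differ for other sessions.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps a playback-page URL to the play/getdata request URL.
class PlayUrlParams {
    static final String GETDATA_TEMPLATE =
            "https://wwwapi.kugou.com/yy/index.php?r=play/getdata"
            + "&callback=jQuery191027067069941080546_1546235744250"
            + "&hash=_HASH_&album_id=_ALBUM_ID_&mid=98350860ab960276d13883f7d07d364d";

    // Splits the part after '#' into key/value pairs.
    static Map<String, String> fragmentParams(String pageUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        String fragment = URI.create(pageUrl).getFragment();
        if (fragment == null) {
            return params;
        }
        for (String pair : fragment.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    static String getDataUrl(String pageUrl) {
        Map<String, String> p = fragmentParams(pageUrl);
        String hash = p.get("hash");
        String albumId = p.get("album_id");
        if (hash == null || albumId == null) {
            throw new IllegalArgumentException("hash or album_id missing in " + pageUrl);
        }
        return GETDATA_TEMPLATE.replace("_HASH_", hash).replace("_ALBUM_ID_", albumId);
    }

    public static void main(String[] args) {
        System.out.println(getDataUrl(
                "http://www.kugou.com/song/#hash=3337C18539D5BB00D8027D653D536A35&album_id=32440418"));
    }
}
```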
Fetching the Top 500 page from Java turned up the following string:
[{"Hash":"7707BE115CF9131E3AEF782D294155D4","FileName":"\u963f\u60a0\u60a0 - \u4e00\u751f\u4e0e\u4f60\u64e6\u80a9\u800c\u8fc7","timeLen":239.464,"privilege":10,"size":3832017,"album_id":32922861,"encrypt_id":"11cb4g06"},
{"Hash":"3337C18539D5BB00D8027D653D536A35","FileName":"\u963f\u5197 - \u4f60\u7684\u7b54\u6848","timeLen":219,"privilege":10,"size":3519376,"album_id":32440418,"encrypt_id":"113a6795"},
{"Hash":"FAEDD01C425118BA343648B5AF35861F","FileName":"en - \u56a3\u5f20","timeLen":254.04,"privilege":10,"size":4065240,"album_id":26433357,"encrypt_id":"ybj0x09"},
{"Hash":"33EB8FE0DC9F70D9F7FE4CB77305D5A8","FileName":"\u6d77\u6765\u963f\u6728\u3001\u963f\u5477\u62c9\u53e4\u3001\u66f2\u6bd4\u963f\u4e14 - \u522b\u77e5\u5df1","timeLen":280.111,"privilege":10,"size":4482365,"album_id":16324799,"encrypt_id":"uajki71"},
{"Hash":"1952AF6E49AF16B4C130282B8A40EEE7","FileName":"\u845b\u4e1c\u742a - \u60ac\u6eba","timeLen":197.12,"privilege":10,"size":3154481,"album_id":23773597,"encrypt_id":"10fdd762"},
{"Hash":"E908C26B69C2873C75CFCEABB5B81F2B","FileName":"\u6768\u5c0f\u58ee - \u6211\u627f\u8ba4\u6211\u81ea\u5351","timeLen":269.348,"privilege":0,"size":4310039,"album_id":30562954,"encrypt_id":"10a05t84"},
{"Hash":"ED8DBD8AE97359912A8AEC71C61758D2","FileName":"\u6768\u5c0f\u58ee - \u4e00\u4e2a\u4eba\u633a\u597d","timeLen":270.027,"privilege":10,"size":4320906,"album_id":26827357,"encrypt_id":"ygc0db4"},
{"Hash":"B30DF1A53446B717C35629952E9B7156","FileName":"\u6768\u987a\u9ad8\u3001\u51ef\u5c0f\u6674 - \u4e0d\u7231\u6211\u5c31\u522b\u4f24\u5bb3\u6211","timeLen":253.44,"privilege":10,"size":4055502,"album_id":27400705,"encrypt_id":"ypwfo0a"},
{"Hash":"247ED2BAA3AE6D3DBA48801F802459F3","FileName":"\u4e8e\u5609\u4e50 - \u9003\u7231","timeLen":242.018,"privilege":10,"size":3875491,"album_id":13882049,"encrypt_id":"sulfsb0"},
{"Hash":"B221F5355343714DC6AA7647B62F9B38","FileName":"\u5468\u6770\u4f26 - \u65ad\u4e86\u7684\u5f26","timeLen":297,"privilege":10,"size":4761507,"album_id":973367,"encrypt_id":"70mo63"},
{"Hash":"B6880A36D02561912C53831ACF305D0D","FileName":"\u5b9d\u77f3Gem\u3001\u9648\u4f1f\u9706 - \u91ce\u72fcDisco","timeLen":240.039,"privilege":10,"size":3841212,"album_id":30181741,"encrypt_id":"102ilgdd"},
{"Hash":"04E4C1D0AFB9DEEF3B0834AA1F71B654","FileName":"\u5d14\u4f1f\u7acb - \u9152\u9189\u7684\u8774\u8776","timeLen":206.184,"privilege":10,"size":3299669,"album_id":26382011,"encrypt_id":"yavyv32"},
{"Hash":"BE7C2E688ED155FA6A6B45392FD0FA50","FileName":"Tones And I - Dance Monkey","timeLen":209.815,"privilege":10,"size":3357510,"album_id":26441038,"encrypt_id":"xcw85d8"},
{"Hash":"EBBDCAA92236F5877FAD2A91C91564F9","FileName":"\u5f20\u9753\u9896 - \u5e7b\u7eb1\u4e4b\u7075","timeLen":248.032,"privilege":0,"size":3968984,"album_id":33385138,"encrypt_id":"11i24me5"},
{"Hash":"DB66B6258041852A7B856729FF3A4893","FileName":"\u738b\u4f73\u6768 - \u9057\u61be","timeLen":255.033,"privilege":10,"size":4081121,"album_id":27347801,"encrypt_id":"yp9nc64"},
{"Hash":"517E05B8121A9E7D6E5629B41CCA61C4","FileName":"\u97f3\u9619\u8bd7\u542c\u3001\u8d75\u65b9\u5a67 - \u8292\u79cd","timeLen":216.032,"privilege":10,"size":3456984,"album_id":22135843,"encrypt_id":"x20jh9b"},
{"Hash":"D46AC87092F83685353E564B1F5C9BD3","FileName":"\u963f\u60a0\u60a0 - \u5ff5\u65e7","timeLen":229.877,"privilege":10,"size":3678626,"album_id":27580208,"encrypt_id":"ys2nmca"},
{"Hash":"5784F2CBA32B65D5E48E4C415F843342","FileName":"\u9ec4\u9704\u96f2 - \u5de6\u624b\u6307\u6708 (\u7eaf\u4eab\u7248)","timeLen":185.086,"privilege":10,"size":2974138,"album_id":12245979,"encrypt_id":"svuie28"},
{"Hash":"07A55D0F8FF088BDCA36B836F7FE115F","FileName":"\u674e\u6615\u878d\u3001\u6a0a\u6850\u821f\u3001\u674e\u51ef\u7a20 - \u4f60\u7b11\u8d77\u6765\u771f\u597d\u770b","timeLen":172.068,"privilege":10,"size":2753687,"album_id":19428374,"encrypt_id":"w7pn29f"},
{"Hash":"A017B769F3D653AC063AE20639C0020F","FileName":"\u8521\u5065\u96c5 - \u7ea2\u8272\u9ad8\u8ddf\u978b","timeLen":206.68,"privilege":10,"size":3321862,"album_id":978459,"encrypt_id":"39nred"},
{"Hash":"6CB8F530E00F6CFEF4A3371FD3F0BEB8","FileName":"\u5c0f\u963f\u67ab - \u6700\u8fdc\u7684\u4f60\u662f\u6211\u6700\u8fd1\u7684\u7231","timeLen":136.15,"privilege":10,"size":2178865,"album_id":25406316,"encrypt_id":"xzj9add"},
{"Hash":"BBDCDDCBD58FF85B60E728726A4B3C9A","FileName":"\u9708\u4e39(\u6d6a\u54e5) - \u9003\u7231","timeLen":242.233,"privilege":0,"size":3876263,"album_id":32104019,"encrypt_id":"10vik0d0"}]
It contains the hash and album_id, so the rest is just string extraction and conversion. It also turned up something useful: the song name, in FileName.
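The extraction step can be sketched with nothing but the JDK. This is not the approach the post's own code takes (that one uses json-lib); it is a regex-based illustration that assumes the field order seen in the sample above (Hash, FileName, ..., album_id) and that FileName contains no escaped quotes. The class, record and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical JDK-only sketch: pulls Hash, album_id and FileName out of the embedded array.
class SongListParser {
    // Relies on the field order seen in the sample: Hash, FileName, ..., album_id.
    private static final Pattern ENTRY = Pattern.compile(
            "\\{\"Hash\":\"([0-9A-Fa-f]+)\",\"FileName\":\"(.*?)\".*?\"album_id\":(\\d+)");
    // Matches a literal backslash, the letter u, and four hex digits.
    private static final Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static final class Song {
        final String hash;
        final String albumId;
        final String fileName;

        Song(String hash, String albumId, String fileName) {
            this.hash = hash;
            this.albumId = albumId;
            this.fileName = fileName;
        }
    }

    // Turns JSON-style unicode escapes into the characters they stand for.
    static String decodeUnicodeEscapes(String s) {
        Matcher m = UNICODE_ESCAPE.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    static List<Song> parse(String page) {
        List<Song> songs = new ArrayList<>();
        Matcher m = ENTRY.matcher(page);
        while (m.find()) {
            songs.add(new Song(m.group(1), m.group(3), decodeUnicodeEscapes(m.group(2))));
        }
        return songs;
    }

    public static void main(String[] args) {
        String page = "<html>[{\"Hash\":\"BE7C2E688ED155FA6A6B45392FD0FA50\","
                + "\"FileName\":\"Tones And I - Dance Monkey\",\"timeLen\":209.815,\"privilege\":10,"
                + "\"size\":3357510,\"album_id\":26441038,\"encrypt_id\":\"xcw85d8\"}]</html>";
        for (Song s : parse(page)) {
            System.out.println(s.hash + " " + s.albumId + " " + s.fileName);
        }
    }
}
```

A real JSON parser is the safer choice once the array has been cut out of the page; the regex is only meant to show which fields matter.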
Then came the code:
package com.kg;

import net.sf.json.JSONArray;
import net.sf.json.JSONObject;

import java.io.IOException;

/**
 * @ClassName: SpiderKugou.java
 * @Description: Downloads the songs listed on one page of the Kugou Top 500
 * @author 小猪聪聪
 * @version V1.0
 * @Date 2019-11-15 17:00:33
 */
public class SpiderKugou {

    public static String filePath = "D:/music/";

    // play/getdata request; _HASH_ and _ALBUM_ID_ are filled in per song
    public static String mp3 = "https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery191027067069941080546_1546235744250&"
            + "hash=_HASH_&album_id=_ALBUM_ID_&mid=98350860ab960276d13883f7d07d364d";

    public static String LINK = "http://www.kugou.com/song/#hash=_HASH_&album_id=_ALBUM_ID_";
    // "https://www.kugou.com/yy/rank/home/PAGE-23784.html?from=rank";

    public static void main(String[] args) throws IOException {
        int index = 4;
        d("http://www.kugou.com/yy/rank/home/" + index + "-8888.html");
    }

    public static String d(String url) throws IOException {
        String content = HttpGetConnect.connect(url, "utf-8");
        // The song list is embedded in the page as a JSON array
        String hashStr = content.substring(content.indexOf("[{"), content.indexOf("}]") + 2);
        JSONArray hashArr = JSONArray.fromObject(hashStr);
        System.out.println(hashArr);
        for (int i = 0; i < hashArr.size(); i++) {
            JSONObject item = hashArr.getJSONObject(i);
            String hash = item.get("Hash").toString();
            String albumId = item.get("album_id").toString();
            String fileName = item.get("FileName").toString();
            String itemUrl = mp3.replace("_HASH_", hash).replace("_ALBUM_ID_", albumId);
            download(itemUrl, fileName, i);
        }
        return null;
    }

    public static String download(String url, String name, int i) throws IOException {
        String mp = HttpGetConnect.connect(url, "utf-8");
        // Strip the JSONP wrapper: everything up to "(" and the trailing ");"
        mp = mp.substring(mp.indexOf("(") + 1, mp.length() - 2);
        JSONObject json = JSONObject.fromObject(mp);
        String playUrl = json.getJSONObject("data").getString("play_url");
        // Use the song name from this response for the file name
        String song_name = json.getJSONObject("data").getString("song_name");
        FileDownload.download(playUrl, filePath + song_name + ".mp3");
        System.out.println(name + " ---- >>> download finished");
        return playUrl;
    }
}
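The `download` method strips the JSONP wrapper by position, which only works when the body ends in exactly `);`. A slightly more tolerant variant, shown here as a hypothetical helper rather than as part of the original code, looks for the first `(` and the last `)` instead. The payload in `main` is a made-up stand-in, not a real Kugou response.

```java
// Hypothetical helper: reduces a JSONP body such as callbackName({...}); to the JSON text.
class JsonpUnwrap {
    static String unwrap(String body) {
        int open = body.indexOf('(');
        int close = body.lastIndexOf(')');
        if (open < 0 || close <= open) {
            throw new IllegalArgumentException("not a JSONP response: " + body);
        }
        // Everything between the outermost parentheses is the JSON payload.
        return body.substring(open + 1, close).trim();
    }

    public static void main(String[] args) {
        String body = "jQuery191027067069941080546_1546235744250({\"data\":{\"song_name\":\"demo\"}});";
        System.out.println(unwrap(body));
    }
}
```

Using the last `)` keeps parentheses inside the payload (for example in a song name) from cutting the JSON short.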
package com.kg;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.BasicHttpClientConnectionManager;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * @ClassName: HttpGetConnect.java
 * @Description: Fetches a URL and returns the response body as text
 * @author 小猪聪聪
 * @version V1.0
 * @Date 2019-11-15 17:01:28
 */
public class HttpGetConnect {

    private static Log log = LogFactory.getLog(HttpGetConnect.class);

    /**
     * Fetches the HTML content of a page.
     *
     * @param url         the address to request
     * @param charsetName e.g. UTF-8 or GB2312
     * @return the response body, or an empty string on a non-2xx status or an error
     * @throws IOException if closing the client fails
     */
    public static String connect(String url, String charsetName) throws IOException {
        BasicHttpClientConnectionManager connManager = new BasicHttpClientConnectionManager();
        CloseableHttpClient httpclient = HttpClients.custom().setConnectionManager(connManager).build();
        String content = "";
        try {
            HttpGet httpget = new HttpGet(url);
            RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(5000).setConnectTimeout(50000)
                    .setConnectionRequestTimeout(50000).build();
            httpget.setConfig(requestConfig);
            // Headers copied from a browser request
            httpget.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
            httpget.setHeader("Accept-Encoding", "gzip,deflate,sdch");
            httpget.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
            httpget.setHeader("Connection", "keep-alive");
            httpget.setHeader("Upgrade-Insecure-Requests", "1");
            httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36");
            // httpget.setHeader("Hosts", "www.oschina.net");
            httpget.setHeader("cache-control", "max-age=0");
            // try-with-resources closes the response and the reader
            try (CloseableHttpResponse response = httpclient.execute(httpget)) {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(entity.getContent(), charsetName))) {
                        StringBuilder sb = new StringBuilder();
                        String line;
                        while ((line = br.readLine()) != null) {
                            sb.append(line);
                        }
                        content = sb.toString();
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            httpclient.close();
        }
        // log.info("content is " + content);
        return content;
    }
}
package com.kg;

import java.io.*;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

/**
 * @ClassName: FileDownload.java
 * @Description: Saves the content of a URL to a local file
 * @author 小猪聪聪
 * @version V1.0
 * @Date 2019-11-15 17:02:22
 */
public class FileDownload {

    private static Log log = LogFactory.getLog(FileDownload.class);

    /**
     * Downloads a file.
     *
     * @param url  the link to download
     * @param path the target path, including the file name
     * @return true if the file was written
     */
    public static boolean download(String url, String path) throws IOException {
        // Record the URL and the target path in music_info.txt
        new WriteText().writeToText(url);
        new WriteText().writeToText(path);
        boolean flag = false;
        CloseableHttpClient httpclient = HttpClients.createDefault();
        RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(2000).setConnectTimeout(2000).build();
        HttpGet get = new HttpGet(url);
        get.setConfig(requestConfig);
        try {
            // Try up to 3 times; stop at the first 200 response
            for (int i = 0; i < 3; i++) {
                try (CloseableHttpResponse result = httpclient.execute(get)) {
                    System.out.println(result.getStatusLine());
                    if (result.getStatusLine().getStatusCode() == 200) {
                        try (InputStream in = new BufferedInputStream(result.getEntity().getContent());
                             OutputStream out = new BufferedOutputStream(new FileOutputStream(new File(path)))) {
                            byte[] buffer = new byte[1024];
                            int len;
                            while ((len = in.read(buffer, 0, 1024)) > -1) {
                                out.write(buffer, 0, len);
                            }
                        }
                        flag = true;
                        break;
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
            flag = false;
        } finally {
            get.releaseConnection();
            httpclient.close();
        }
        return flag;
    }
}
package com.kg;

import java.io.*;

/**
 * @ClassName: WriteText.java
 * @Description: Appends one line of song info to a text file
 * @author 小猪聪聪
 * @version V1.0
 * @Date 2019-11-15 17:02:46
 */
public class WriteText {

    public void writeToText(String musicInfo) throws IOException {
        String path = "D:\\music\\music_info\\music_info.txt";
        File file = new File(path);
        if (!file.exists()) {
            file.getParentFile().mkdirs();
        }
        file.createNewFile();
        // Append to the txt file, ending the entry with a line break
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(file, true))) {
            bw.write(musicInfo + "\r\n");
        }
    }
}
package com.kg;

import java.io.IOException;

/**
 * @ClassName: Test.java
 * @Description: Quick manual check of WriteText
 * @author 小猪聪聪
 * @version V1.0
 * @Date 2019-11-15 17:03:13
 */
public class Test {

    public static void main(String[] args) throws IOException {
        System.out.println("aa");
        new WriteText().writeToText("12376\r\n");
    }
}
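Problems:
During the download I found that only 73 songs came through. Digging into it, I saw that the response from https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery191027067069941080546_1546235744250&hash=7A5C31C00EB66499DE0F72BB67097111&album_id=19615336&mid=98350860ab960276d13883f7d07d364d
was wrong: it no longer contained play_url. Postman gave the same result. I kept looking, found nothing, and the downloads stayed broken.
Then I happened to refresh a page and noticed something: verification. Sure enough, a verification step was required, and an image captcha at that. My guess is that the server tracks how many requests come from the same machine and demands verification once a threshold is reached.
So I changed the code to: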
int index = 3;
d("http://www.kugou.com/yy/rank/home/" + index + "-8888.html");
Remaining issues:
Only one page is loaded per run. When an exception occurs, refresh the page and then carry on.
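One way to make "stop, refresh, carry on" less manual is to remember where the run stopped. The sketch below is only an illustration under stated assumptions: the class name and the `fetchPlayUrl` function are hypothetical stand-ins for the real request, and a missing play URL is treated as the sign that verification has kicked in, which matches what was observed here but is not guaranteed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: process songs in order and stop at the first one without a play URL.
class ResumableRun {
    /**
     * @param hashes       song hashes for one page, in page order
     * @param startIndex   where to resume (0 on the first run)
     * @param fetchPlayUrl stand-in for the real request; returns null or "" when play_url is missing
     * @return index of the first song that still needs downloading, or hashes.size() if all are done
     */
    static int run(List<String> hashes, int startIndex, Function<String, String> fetchPlayUrl) {
        for (int i = startIndex; i < hashes.size(); i++) {
            String playUrl = fetchPlayUrl.apply(hashes.get(i));
            if (playUrl == null || playUrl.isEmpty()) {
                return i; // probably blocked by verification: stop and resume here later
            }
            System.out.println("would download " + playUrl);
        }
        return hashes.size();
    }

    public static void main(String[] args) {
        List<String> hashes = Arrays.asList("A", "B", "C");
        // Fake fetcher: pretends the server stops returning play_url at "C".
        int next = run(hashes, 0, h -> h.equals("C") ? "" : "http://example.invalid/" + h + ".mp3");
        System.out.println("resume from index " + next);
    }
}
```

After clearing the verification by hand, the same call with `startIndex` set to the returned value picks up at the song that failed.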
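To load everything in one go:
Option 1:
Handle the verification in code. But it is an image captcha, which is hard to automate, so I have set this aside for now.
Option 2:
Work out how the server does its counting: by IP, by mid, or by something else? And what is mid actually for?
I also found that some downloaded songs did not match their names, which suggested the song name was being read from the wrong place. The response that carries the download URL also contains the song name (song_name), so the code uses that one instead.
Comments and feedback are welcome.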