A Junk Project: Notes on Using the Stanford Parsing Libraries + Downloading This April's New Anime (Kaguya and Anya as Examples)

Posted by 囚生CY



Preface

Up-front warning: this post is a bit of a bait-and-switch. The real content is in Part 2; Part 1 exists purely to pad the word count and get past review.


1 Overview of Using the Stanford Parsing Libraries (Parse Trees and Dependency Graphs)

NLTK's Stanford parsing modules have recently been emitting warnings that they are about to be deprecated and will be replaced by nltk.parse.corenlp.CoreNLPParser in newer releases. The CoreNLP JAR packages themselves can be downloaded from the Stanford software page. As things stand, at least dependency parsing and parse trees work, and those two are also the most useful; NER works as well. Word segmentation and POS tagging raise errors, but there is no need to insist on Stanford for those two, since plenty of other resources exist: for Chinese there is jieba, and for English NLTK has built-in tokenization and POS-tagging packages. I have not yet worked out the exact usage of CoreNLPParser; a detailed tutorial on using the Stanford JAR packages will follow soon.
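
Incidentally, the alternatives just mentioned take only a couple of lines each. A quick sketch, assuming jieba is installed and the punkt and averaged_perceptron_tagger data packages have been fetched via nltk.download:

import jieba
import nltk

# Chinese word segmentation with jieba
print(jieba.lcut('我在博客园开了一个博客'))

# English tokenization and POS tagging with NLTK's built-ins
# (requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger'))
tokens = nltk.word_tokenize('Good muffins cost $3.88 in New York.')
print(nltk.pos_tag(tokens))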

Several JAR packages can be downloaded from the Stanford software page mentioned above:

Among them, stanford-parser-full-2020-11-17 is the most important package; it can generate both parse trees and dependency graphs. stanford-corenlp-4.4.0 is arguably an integration of all the other packages, but as far as I can tell it is missing many models; for instance its parsing models only cover English, whereas the former includes parsing models for Chinese and various other languages. The code for using these packages follows, part of which is adapted from https://www.cnblogs.com/baiboy/p/nltk1.html

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

# 2022/06/10 13:16:34 On NLTK 3.3.0 at this point
def segmenter_demo():
	# 2022/06/10 13:16:51 Fails to run, and I have no idea why
	from nltk.tokenize.stanford_segmenter import StanfordSegmenter
	segmenter = StanfordSegmenter(
		path_to_jar=r'D:\data\stanford\software\stanford-segmenter-2020-11-17\stanford-segmenter-4.2.0.jar',
		# The slf4j jar is nowhere to be found in stanford-segmenter-2020-11-17, but both stanford-parser-full-2020-11-17 and stanford-corenlp-4.4.0 contain it
		path_to_slf4j=r'D:\data\stanford\software\stanford-parser-full-2020-11-17\slf4j-api.jar',
		path_to_sihan_corpora_dict=r'D:\data\stanford\software\stanford-segmenter-2020-11-17\data',
		path_to_model=r'D:\data\stanford\software\stanford-segmenter-2020-11-17\data\pku.gz',
		path_to_dict=r'D:\data\stanford\software\stanford-segmenter-2020-11-17\data\dict-chris6.ser.gz',
	)
	string = u'我在博客园开了一个博客,我的博客名叫伏草惟存,写了一些自然语言处理的文章。'
	result = segmenter.segment(string)
	print(result)
	return result
	
def tokenizer_demo():
	# 2022/06/10 13:15:03 Fails to run; StanfordTokenizer in nltk.tokenize has been deprecated
	from nltk.tokenize import StanfordTokenizer
	tokenizer = StanfordTokenizer(path_to_jar=r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser.jar')
	sent = 'Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'
	result = tokenizer.tokenize(sent)
	return result


def ner_tagger_demo():
	# 2022/06/10 13:16:56 Works for English, but the Chinese model jar is missing
	from nltk.tag import StanfordNERTagger
	eng_tagger = StanfordNERTagger(model_filename=r'D:\data\stanford\software\stanford-ner-2020-11-17\classifiers\english.all.3class.distsim.crf.ser.gz',
								   path_to_jar=r'D:\data\stanford\software\stanford-ner-2020-11-17\stanford-ner.jar')
	result = eng_tagger.tag('Rami Eid is studying at Stony Brook University in NY'.split())
	print(result)
	# chi_tagger = StanfordNERTagger(model_filename=r'D:\data\stanford\software\stanford-ner-2020-11-17\classifiers\chinese.misc.distsim.crf.ser.gz',
	# 							   path_to_jar=r'D:\data\stanford\software\stanford-ner-2020-11-17\stanford-ner.jar')
	# for word, tag in chi_tagger.tag('我 在 博客 园 开 了 一个 博客'.split()):
	# 	print(word, tag)
	return result

def pos_tagger_demo():
	# 2022/06/10 13:17:35 Passes testing
	from nltk.tag import StanfordPOSTagger
	eng_tagger = StanfordPOSTagger(model_filename=r'D:\data\stanford\software\stanford-postagger-full-2020-11-17\models\english-bidirectional-distsim.tagger',
								   path_to_jar=r'D:\data\stanford\software\stanford-postagger-full-2020-11-17\stanford-postagger.jar')
	print(eng_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))

	chi_tagger = StanfordPOSTagger(model_filename=r'D:\data\stanford\software\stanford-postagger-full-2020-11-17\models\chinese-distsim.tagger',
								   path_to_jar=r'D:\data\stanford\software\stanford-postagger-full-2020-11-17\stanford-postagger.jar')
	result = '四川省 成都 信息 工程 大学 我 在 博客 园 开 了 一个 博客 , 我 的 博客 名叫 伏 草 惟 存 , 写 了 一些 自然语言 处理 的 文章 。\r\n'
	print(chi_tagger.tag(result.split()))
	
def dependency_demo():
	# 2022/06/10 13:21:17 Passes testing
	from nltk.parse.stanford import StanfordDependencyParser
	eng_parser = StanfordDependencyParser(r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser.jar',
										  r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser-4.2.0-models.jar',
										  r'D:\data\stanford\software\stanford-parser-full-2020-11-17\englishPCFG.ser.gz')
	res = list(eng_parser.parse('the quick brown fox jumps over the lazy dog'.split()))
	for row in res[0].triples():
		print(row)

	chi_parser = StanfordDependencyParser(r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser.jar',
										  r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser-4.2.0-models.jar',
										  model_path=r'D:\data\stanford\software\stanford-parser-full-2020-11-17\chinesePCFG.ser.gz')		# This file has to be extracted from stanford-parser-4.2.0-models.jar
	res = list(chi_parser.parse('我 和 他 是 朋友'.split()))
	print(list(res[0].triples()))
	print('#' * 64)
	for row in res[0].triples():
		print(row)
	
def parse_tree_demo():
	# 2022/06/10 13:21:17 Passes testing
	from nltk.parse.stanford import StanfordParser
	parser = StanfordParser(r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser.jar',
							r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser-4.2.0-models.jar',
							model_path=r'D:\data\stanford\software\stanford-parser-full-2020-11-17\chinesePCFG.ser.gz')		# This file has to be extracted from stanford-parser-4.2.0-models.jar
	parse_tree = list(parser.parse(['我', '和', '他', '是', '朋友']))
	print(parse_tree)
	return parse_tree

# segmenter_demo()
# tokenizer_demo()
# ner_tagger_demo()
# pos_tagger_demo()
# dependency_demo()
# parse_tree_demo()
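
A side note on the chinesePCFG.ser.gz file mentioned in the comments above: a models jar is just a zip archive, so the file can be pulled out without any Java tooling. A minimal sketch, where the member path inside the jar is my assumption based on the usual edu/stanford/nlp/models/lexparser layout:

import os
import zipfile

models_jar = r'D:\data\stanford\software\stanford-parser-full-2020-11-17\stanford-parser-4.2.0-models.jar'
member = 'edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz'	# assumed location inside the jar

# A jar is an ordinary zip file, so zipfile can read it directly;
# the file lands under an edu/... subdirectory and can then be moved up
with zipfile.ZipFile(models_jar) as jar:
	jar.extract(member, path=os.path.dirname(models_jar))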

I have since updated nltk to the latest version (3.7), and the corenlp module is now usable. It turns out to work over an HTTP interface rather than calling the JAR packages in-process: CoreNLPParser sends text to a CoreNLP server (http://localhost:9000 by default), so you either launch a server yourself from the downloaded JARs or point the parser at a remote one, and remote servers are easy to lose connection to. It feels as though Stanford would rather wrap their parsers behind a service interface than have them invoked directly. Judging from the docstring, the output is actually quite fancy:

class CoreNLPParser(GenericCoreNLPParser)
 |  CoreNLPParser(url='http://localhost:9000', encoding='utf8', tagtype=None)
 |
 |  >>> parser = CoreNLPParser(url='http://localhost:9000')
 |
 |  >>> next(
 |  ...     parser.raw_parse('The quick brown fox jumps over the lazy dog.')
 |  ... ).pretty_print()  # doctest: +NORMALIZE_WHITESPACE
 |                       ROOT
 |                        |
 |                        S
 |         _______________|__________________________
 |        |                         VP               |
 |        |                _________|___             |
 |        |               |             PP           |
 |        |               |     ________|___         |
 |        NP              |    |            NP       |
 |    ____|__________     |    |     _______|____    |
 |   DT   JJ    JJ   NN  VBZ   IN   DT      JJ   NN  .
 |   |    |     |    |    |    |    |       |    |   |
 |  The quick brown fox jumps over the     lazy dog  .
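
To reproduce that docstring example, a CoreNLP server has to be reachable at the given URL. A minimal local setup might look like the sketch below, assuming the stanford-corenlp-4.4.0 package has been unzipped and Java is on the PATH (the memory limit, port, and timeout are arbitrary choices):

# Launch the server from inside the unzipped stanford-corenlp-4.4.0 directory:
#     java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')
tree = next(parser.raw_parse('The quick brown fox jumps over the lazy dog.'))
tree.pretty_print()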

The stanza package is a similar story: it needs network access to download its models before anything will run, and its API documentation is at https://stanfordnlp.github.io/stanza/index.html. I personally feel the parser above is more or less enough, and stanza fails on me quite often when used without a proxy.
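
For completeness, a typical stanza dependency-parsing call is sketched below; the stanza.download step is the one that needs network access and tends to fail without a proxy:

import stanza

stanza.download('en')	# one-off model download; this is the step that requires network access
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
doc = nlp('The quick brown fox jumps over the lazy dog.')
for sentence in doc.sentences:
	for word in sentence.words:
		print(word.id, word.text, word.deprel, word.head)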


2 The Junk Project (Possibly Useful to Fellow Anime Watchers)

Stealing a moment from a busy schedule to share a junk project.

I have recently been following Kaguya-sama: Love Is War season 3 and SPY×FAMILY on Bilibili. Honestly, Bilibili used to keep new anime in sync with the publishers; as a non-paying viewer you were merely one week (one episode) behind the premium members, which was tolerable. These days Bilibili pulls all kinds of stunts: updates are painfully slow, and on top of that there are holy-light and shadow censorship overlays, cut scenes, and some sensitive shots even get redrawn in-house. It is genuinely hard to accept; if it weren't for what remains of the danmaku atmosphere, who the hell would still follow anime on Bilibili.

Then I found this: the anime section of 蚂蚁Tube (蚂蚁Tube@动画板块).

Basically all of this April's new shows are being updated there continuously, and the back catalogue of older series is fairly complete too. Besides anime there are also movies, TV dramas, and variety shows, which is to say it is pretty nice.

Anyone who frequents free streaming sites of this kind knows their common failing: videos load painfully slowly, and the site regularly dies halfway through an episode, which is maddening. So I wondered whether I could simply download the videos and watch them locally.

That turns out not to be complicated; it is much easier than scraping Bilibili videos. While I am at it, here is my Bilibili crawler script (I adapted it with some modifications from someone else's code; running the few examples in the main section should make it fairly clear). It still works as of this post's publication, and the comments are reasonably detailed. A whole series can be downloaded directly from its episode id, though series that require a premium membership need a premium account. The Cookie used here belongs to my own account and should have expired by now; if you need it, log in on the web side and copy your own Cookie over:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn
# https://github.com/iawia002/annie

import os
import re
import json
import requests
from tqdm import tqdm

class BiliBiliCrawler(object):
	
	def __init__(self) -> None:
		self.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'
		self.video_webpage_link = 'https://www.bilibili.com/video/{}'.format
		self.video_detail_api = 'https://api.bilibili.com/x/player/pagelist?bvid={}&jsonp=jsonp'.format
		self.video_playurl_api = 'https://api.bilibili.com/x/player/playurl?cid={}&bvid={}&qn=64&type=&otype=json'.format
		self.episode_playurl_api = 'https://api.bilibili.com/pgc/player/web/playurl?ep_id={}&jsonp=jsonp'.format
		self.episode_webpage_link = 'https://www.bilibili.com/bangumi/play/ep{}'.format
		self.anime_webpage_link = 'https://www.bilibili.com/bangumi/play/ss{}'.format
		self.chunk_size = 1024
		self.regexs = {
			'host': r'https://(.*\.com)',
			'episode_name': r'meta name="keywords" content="(.*?)"',
			'initial_state': r'<script>window.__INITIAL_STATE__=(.*?);',
			'playinfo': r'<script>window.*?__playinfo__=(.*?)</script>',
		}

	def easy_download_video(self, bvid, save_path=None) -> bool:
		"""Tricky method with available api"""

		# Request for detailed information of the video
		response = requests.get(self.video_detail_api(bvid), headers={'User-Agent': self.user_agent})
		json_response = response.json()

		cid = json_response['data'][0]['cid']
		video_title = json_response['data'][0]['part']
		if save_path is None:
			save_path = f'{video_title}.mp4'

		print(f'Video title: {video_title}')

		# Request for playurl and size of the video
		response = requests.get(self.video_playurl_api(cid, bvid), headers={'User-Agent': self.user_agent})
		json_response = response.json()
		video_playurl = json_response['data']['durl'][0]['url']
		# video_playurl = json_response['data']['durl'][0]['backup_url'][0]
		video_size = json_response['data']['durl'][0]['size']
		total = video_size // self.chunk_size

		print(f'Video size: {video_size}')

		# Download the video
		headers = {
			'User-Agent': self.user_agent,
			'Origin'	: 'https://www.bilibili.com',
			'Referer'	: 'https://www.bilibili.com',
		}
		headers['Host'] = re.findall(self.regexs['host'], video_playurl, re.I)[0]
		headers['Range'] = f'bytes=0-{video_size}'
		response = requests.get(video_playurl, headers=headers, stream=True, verify=False)
		tqdm_bar = tqdm(response.iter_content(self.chunk_size), desc='Download process', total=total)
		with open(save_path, 'wb') as f:
			for byte in tqdm_bar:
				f.write(byte)
		return True

	def easy_download_episode(self, epid, save_path=None) -> bool:
		"""Tricky method with available api"""

		# Request for playurl and size of the episode

		# temp_headers = {
		# 	"Host": "api.bilibili.com",
		# 	"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0",
		# 	"Accept": "application/json, text/plain, */*",
		# 	"Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
		# 	"Accept-Encoding": "gzip, deflate, br",
		# 	"Referer": "https://www.bilibili.com/bangumi/play/ep234407?spm_id_from=333.337.0.0",
		# 	"Origin": "https://www.bilibili.com",
		# 	"Connection": "keep-alive",
		# 	"Cookie": "innersign=0; buvid3=3D8F234E-5DAF-B5BD-1A26-C7CDE57C21B155047infoc; i-wanna-go-back=-1; b_ut=7; b_lsid=1047C7449_1808035E0D6; _uuid=A4884E3F-BF68-310101-E5E6-10EBFDBCC10CA456283infoc; buvid_fp=82c49016c72d24614786e2a9e883f994; buvid4=247E3498-6553-51E8-EB96-C147A773B34357718-022050123-7//HOhRX5o4Xun7E1GZ2Vg%3D%3D; fingerprint=1b7ad7a26a4a90ff38c80c37007d4612; sid=jilve18q; buvid_fp_plain=undefined; SESSDATA=f1edfaf9%2C1666970475%2Cf281c%2A51; bili_jct=de9bcc8a41300ac37d770bca4de101a8; DedeUserID=130321232; DedeUserID__ckMd5=42d02c72aa29553d; nostalgia_conf=-1; CURRENT_BLACKGAP=1; CURRENT_FNVAL=4048; CURRENT_QUALITY=0; rpdid=|(u~||~uukl)0J'uYluRu)l|J",
		# 	"Sec-Fetch-Dest": "empty",
		# 	"Sec-Fetch-Mode": "cors",
		# 	"Sec-Fetch-Site": "same-site",
		# 	"TE": "trailers",
		# }
		# response = requests.get(self.episode_playurl_api(epid), headers=temp_headers)

		# 2022/05/01 23:31:08 The commented block above is the premium-member download path; it can fetch episodes that require a premium account
		response = requests.get(self.episode_playurl_api(epid))
		json_response = response.json()
		# episode_playurl = json_response['result']['durl'][0]['url']
		episode_playurl = json_response['result']['durl'][0]['backup_url'][0]
		episode_size = json_response['result']['durl'][0]['size']
		total = episode_size // self.chunk_size

		print(f'Episode size: {episode_size}')

		# Download the episode
		# 2022/05/01 23:31:41 For premium content it is best to include the cookie below; I am not sure whether it still works without it
		headers = {
			'User-Agent': self.user_agent,
			'Origin'	: 'https://www.bilibili.com',
			'Referer'	: 'https://www.bilibili.com',
			# 'Cookie'	: "innersign=0; buvid3=3D8F234E-5DAF-B5BD-1A26-C7CDE57C21B155047infoc; i-wanna-go-back=-1; b_ut=7; b_lsid=1047C7449_1808035E0D6; _uuid=A4884E3F-BF68-310101-E5E6-10EBFDBCC10CA456283infoc; buvid_fp=82c49016c72d24614786e2a9e883f994; buvid4=247E3498-6553-51E8-EB96-C147A773B34357718-022050123-7//HOhRX5o4Xun7E1GZ2Vg%3D%3D; fingerprint=1b7ad7a26a4a90ff38c80c37007d4612; sid=jilve18q; buvid_fp_plain=undefined; SESSDATA=f1edfaf9%2C1666970475%2Cf281c%2A51; bili_jct=de9bcc8a41300ac37d770bca4de101a8; DedeUserID=130321232; DedeUserID__ckMd5=42d02c72aa29553d; nostalgia_conf=-1; CURRENT_BLACKGAP=1; CURRENT_FNVAL=4048; CURRENT_QUALITY=0; rpdid=|(u~||~uukl)0J'uYluRu)l|J",
		}
		headers['Host'] = re.findall(self.regexs['host'], episode_playurl, re.I)[0]
		headers['Range'] = f'bytes=0-{episode_size}'
		response = requests.get(episode_playurl, headers=headers, stream=True, verify=False)
		tqdm_bar = tqdm(response.iter_content(self.chunk_size), desc='Download process', total=total)
		if save_path is None:
			save_path = f'ep{epid}.mp4'
		with open(save_path, 'wb') as f:
			for byte in tqdm_bar:
				f.write(byte)
		return True

	def download(self, bvid, video_save_path=None, audio_save_path=None) -> dict:
		"""General method by parsing page source"""

		if video_save_path is None:
			video_save_path = f'{bvid}.m4s'
		if audio_save_path is None:
			audio_save_path = f'{bvid}.mp3'

		common_headers = {
			'Accept'			: '*/*',
			'Accept-encoding'	: 'gzip, deflate, br',
			'Accept-language'	: 'zh-CN,zh;q=0.9,en;q=0.8',
			'Cache-Control'		: 'no-cache',
			'Origin'			: 'https://www.bilibili.com',
			'Pragma'			: 'no-cache',
			'Host'				: 'www.bilibili.com',
			'User-Agent'		: self.user_agent,
		}

		# In fact we only need bvid
		# Each episode of an anime also has a bvid and a corresponding bvid-URL which redirects to the episode link
		# e.g. https://www.bilibili.com/video/BV1rK4y1b7TZ is redirected to https://www.bilibili.com/bangumi/play/ep322903
		response = requests.get(self.video_webpage_link(bvid), headers=common_headers)
		html = response.text
		playinfos = re.findall(self.regexs['playinfo'], html, re.S)
		if not playinfos:
			raise Exception(f'No playinfo found in bvid {bvid}\nPerhaps VIP required')
		playinfo = json.loads(playinfos[0])

		# There exist four different URL keys, with the following observations:
		# `baseUrl` is the same as `base_url`, holding a string value
		# `backupUrl` is the same as `backup_url`, holding an array value
		# Hard-coded preferences are used here to select the playurl
		def _select_video_playurl(_videoinfo):
			if 'backupUrl' in _videoinfo:
				return _videoinfo['backupUrl'][-1]
			if 'backup_url' in _videoinfo:
				return _videoinfo['backup_url'][-1]
			if 'baseUrl' in _videoinfo:
				return _videoinfo['baseUrl']
			if 'base_url' in _videoinfo:
				return _videoinfo['base_url']
			raise Exception(f'No video URL found\n{_videoinfo}')

		def _select_audio_playurl(_audioinfo):
			if 'backupUrl' in _audioinfo:
				return _audioinfo['backupUrl'][-1]
			if 'backup_url' in _audioinfo:
				return _audioinfo['backup_url'][-1]
			if 'baseUrl' in _audioinfo:
				return _audioinfo['baseUrl']
			if 'base_url' in _audioinfo:
				return _audioinfo['base_url']
			raise Exception(f'No audio URL found\n{_audioinfo}')

		# with open(f'playinfo-{bvid}.js', 'w') as f:
		# 	json.dump(playinfo, f)

		if 'durl' in playinfo['data']:
			video_playurl = playinfo['data']['durl'][0]['url']
			# video_playurl = playinfo['data']['durl'][0]['backup_url'][1]
			print(video_playurl)
			video_size = playinfo['data']['durl'][0]['size']
			total = video_size // self.chunk_size
			print(f'Video size: {video_size}')
			headers = {
				'User-Agent': self.user_agent,
				'Origin'	: 'https://www.bilibili.com',
				'Referer'	: 'https://www.bilibili.com',
			}
			headers['Host'] = re.findall(self.regexs['host'], video_playurl, re.I)[0]
			headers['Range'] = f'bytes=0-{video_size}'
			response = requests.get(video_playurl, headers=headers, stream=True, verify=False)
			with open(video_save_path, 'wb') as f:
				for byte in tqdm(response.iter_content(self.chunk_size), desc='Download process', total=total):
					f.write(byte)
			return {'video_save_path': video_save_path}

		# NOTE: the source post is cut off mid-line at this point, so everything below
		# is a hedged reconstruction rather than the original code: it handles the
		# DASH-style playinfo (separate video and audio streams), which is presumably
		# what the two _select_*_playurl helpers above were written for.
		video_playurl = _select_video_playurl(playinfo['data']['dash']['video'][0])
		audio_playurl = _select_audio_playurl(playinfo['data']['dash']['audio'][0])
		headers = {
			'User-Agent': self.user_agent,
			'Origin'	: 'https://www.bilibili.com',
			'Referer'	: 'https://www.bilibili.com',
		}
		for playurl, save_path in [(video_playurl, video_save_path), (audio_playurl, audio_save_path)]:
			headers['Host'] = re.findall(self.regexs['host'], playurl, re.I)[0]
			response = requests.get(playurl, headers=headers, stream=True, verify=False)
			with open(save_path, 'wb') as f:
				for byte in response.iter_content(self.chunk_size):
					f.write(byte)
		return {'video_save_path': video_save_path, 'audio_save_path': audio_save_path}

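Finally, a usage sketch for the class above. The BV id and epid below are the ones appearing in the comments inside the class; whether the downloads still succeed depends on Bilibili's current restrictions:

if __name__ == '__main__':
	crawler = BiliBiliCrawler()
	# Ordinary video download via the public APIs
	crawler.easy_download_video('BV1rK4y1b7TZ')
	# Anime episode download by epid (a premium cookie is needed for VIP-only episodes)
	crawler.easy_download_episode(234407, save_path='ep234407.mp4')
	# General page-source route; returns the saved video/audio paths
	crawler.download('BV1rK4y1b7TZ')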