How to use Scrapy rules to crawl from Wikipedia actor and movie pages while following only the links in the cast and filmography sections

Question

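I recently started using Python and Scrapy. I have been trying to use Scrapy to start at either a movie or an actor wiki page, save the name and the cast or filmography, and traverse the links in the cast or filmography sections to other actor/movie wiki pages.

However, I have no idea how rules work (edit: OK, this was a bit of an exaggeration), and the wiki links are extremely nested. I saw that you can restrict by XPath and give an id or class, but most of the links I want don't seem to have a class or id. I am also not sure whether the XPath also includes the other siblings and children.

Therefore, I would like to know what rules to use to exclude the irrelevant links and only go to cast and filmography links.

Edit: Clearly, I should have explained my question better. It's not that I don't understand XPaths and rules at all (that was a bit of an exaggeration, since I was getting frustrated), but I'm clearly not completely clear on how they work. First, let me show what I have so far, and then clarify where I am having trouble.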

import logging
from bs4 import BeautifulSoup
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from Assignment2_0.items import Assignment20Item

logging.basicConfig(filename='spider.log',level = logging.DEBUG)


class WikisoupSpiderSpider(CrawlSpider):
    name = 'wikisoup_spider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Keira_Knightley']

    rules = (
        Rule(LinkExtractor(restrict_css='table.wikitable')),
        Rule(LinkExtractor(allow=('(/wiki/)',)),
             callback='parse_crawl', follow=True),
    )

    actor_counter = 0
    actor_max = 250
    movie_counter = 0
    movie_max = 125

    def parse_crawl(self, response):
        items = []
        soup = BeautifulSoup(response.text, 'lxml')
        item = Assignment20Item()
        occupations = ['Actress', 'Actor']
        logging.debug(soup.title)

        # Infobox "role" cell: present only on pages about people
        tempoccu = soup.find('td', class_ = 'role')
        logging.warning('tempoccu only works for pages of people')

        # Infobox "Directed by" header: present only on pages about movies
        tempdir = soup.find('th', text = 'Directed by')
        logging.warning('tempdir only works for pages of movies')

        if (tempdir is not None) and self.movie_counter < self.movie_max:
            logging.info('Found movie and do not have enough yet')

            item['moviename'] = soup.h1.text
            logging.debug('name is ' + item['moviename'])

            # First number in the "Box office" cell
            finder = soup.find('th', text='Box office')
            gross = finder.next_sibling.next_sibling.text
            gross_float = re.findall(r"[-+]?\d*\.\d+|\d+", gross)
            item['netgross'] = float(gross_float[0])
            logging.debug('Net gross is ' + gross_float[0])

            finder = soup.find('div', text='Release date')
            date = finder.parent.next_sibling.next_sibling.contents[1].contents[1].contents[1].get_text(" ")
            # Replace non-breaking spaces with ordinary spaces
            date = date.replace(u'\xa0', u' ')
            item['releasedate'] = date
            logging.debug('released on ' + item['releasedate'])

            item['type'] = 'movie'
            items.append(item)
            # Count the movie so that movie_max can actually be reached
            self.movie_counter += 1

        elif (tempoccu is not None) and (any(occu in tempoccu for occu in occupations)) and self.actor_counter < self.actor_max:
            logging.info('Found actor and do not have enough yet')

            item['name'] = soup.h1.text
            logging.debug('name is ' + item['name'])

            temp = soup.find('span', class_ = 'noprint ForceAgeToShow').text
            age = re.findall(r'\d+', temp)
            item['age'] = int(age[0])
            logging.debug('age is ' + age[0])

            # Film titles are the italicised entries after the Filmography heading
            filmo = []
            finder = soup.find('span', id='Filmography')
            for x in finder.parent.next_sibling.next_sibling.find_all('i'):
                filmo.append(x.text)
            item['filmography'] = filmo
            logging.debug('has done ' + filmo[0])

            item['type'] = 'actor'
            items.append(item)
            # Count the actor so that actor_max can actually be reached
            self.actor_counter += 1

        elif (self.movie_counter == self.movie_max and self.actor_counter == self.actor_max):
            logging.info('Found enough data')

            raise CloseSpider(reason='finished')

        else:
            logging.info('irrelevant data')

        return items

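Now, my understanding of the rules in my code is that they should allow all wiki links and should take links only from table tags and their children. This is clearly not what is happening, since the crawl very quickly wanders away from movies.

I am clear on what to do when each element has an identifier such as an id or class, but when inspecting the page, the links are buried in multiple nests of id-less tags that don't all seem to follow a single pattern (I would use a regular XPath, but different pages have different paths to the filmography, and it doesn't seem that finding the path to the table under the h2 = Filmography heading would include all the links in the tables below it). Therefore, I would like to know more about how I can get Scrapy to use only the Filmography links (in actor pages, anyway).

I apologize if this is an obvious thing; I started using Python and Scrapy/XPath/CSS only 48 hours ago.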

Answer 1
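
One possible approach, offered as a sketch under assumptions rather than something verified against live pages. In a `CrawlSpider`, every `Rule` extracts links on its own, so the second rule in the question (`allow='/wiki/'` with no restriction) still follows every wiki link on the page; the `restrict_css` on the first rule does not limit it. Merging the two into a single rule whose extractor is restricted to the relevant sections, plus a filter that drops namespace pages such as `File:` or `Help:`, may come closer to what was intended. The names `FILMOGRAPHY_XPATH`, `CAST_XPATH`, `NAMESPACE_PREFIXES`, `is_article_url`, `keep_article_links` and the `Link` stand-in are hypothetical. Both XPaths assume that the wanted table or list is the first one after the element with the matching `id`, which will not hold on every page.

```python
from collections import namedtuple
from urllib.parse import urlparse

# Assumption: on actor pages the filmography is the first "wikitable" after
# the element with id="Filmography"; on movie pages the cast is the first
# <ul> after the element with id="Cast". Neither holds for every page.
FILMOGRAPHY_XPATH = (
    '//*[@id="Filmography"]'
    '/following::table[contains(@class, "wikitable")][1]'
)
CAST_XPATH = '//*[@id="Cast"]/following::ul[1]'

# Not an exhaustive list of Wikipedia namespaces, just common ones.
NAMESPACE_PREFIXES = (
    'File:', 'Help:', 'Category:', 'Special:',
    'Wikipedia:', 'Template:', 'Talk:', 'Portal:',
)


def is_article_url(url):
    """Return True for /wiki/<Title> URLs that are not namespace pages."""
    path = urlparse(url).path
    if not path.startswith('/wiki/'):
        return False
    title = path[len('/wiki/'):]
    return bool(title) and not title.startswith(NAMESPACE_PREFIXES)


def keep_article_links(links):
    """Filter a list of link objects (anything with a .url attribute)."""
    return [link for link in links if is_article_url(link.url)]


# Hypothetical wiring inside the spider (needs Scrapy, so left as a comment):
# rules = (
#     Rule(LinkExtractor(restrict_xpaths=(FILMOGRAPHY_XPATH, CAST_XPATH)),
#          process_links=keep_article_links,
#          callback='parse_crawl', follow=True),
# )

# Offline check with a stand-in for Scrapy's Link objects.
Link = namedtuple('Link', 'url')
sample = [
    Link('https://en.wikipedia.org/wiki/Atonement_(2007_film)'),
    Link('https://en.wikipedia.org/wiki/File:Poster.jpg'),
    Link('https://en.wikipedia.org/w/index.php?title=X&action=edit'),
]
kept = [link.url for link in keep_article_links(sample)]
print(kept)
```

`restrict_xpaths` accepts either a single expression or a list of them, and `process_links` is called with the list of links extracted by that rule, so the filter runs before any request is scheduled. Pages whose filmography is a plain list or lives on a separate "filmography" article would need a different expression.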

Another answer
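
On the narrower question of whether an XPath restriction also covers children and siblings: a restriction such as `restrict_xpaths` or `restrict_css` selects a region, and links are then taken from that element and everything nested inside it at any depth, not from its siblings. The sketch below illustrates the same descendant semantics with the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors (an assumption made so the example runs without Scrapy); the markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup: one table with links nested at different
# depths, plus a sibling paragraph that also contains a link.
page = ET.fromstring("""
<div>
  <p><a href="/wiki/Unrelated">outside the table</a></p>
  <table class="wikitable">
    <tr><td><i><a href="/wiki/Film_A">Film A</a></i></td></tr>
    <tr><td><span><b><a href="/wiki/Film_B">Film B</a></b></span></td></tr>
  </table>
</div>
""")

# Select the region first, then collect links at any depth below it.
table = page.find(".//table[@class='wikitable']")
hrefs = [a.get("href") for a in table.iter("a")]
print(hrefs)
```

The link in the sibling `<p>` is not collected, while both table links are, even though neither `<a>` has an id or class and they sit at different nesting depths. This is why restricting to the table is usually enough and the intermediate id-less tags do not need to be spelled out.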