Scraping news from web pages (URL,)

优采云 Published: 2021-10-24 01:12

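3.2 Follow the news links scraped from a listing page into each news detail page and scrape the required data (mainly the news content)

I now have a set of URLs, and I need to visit each one to scrape the title, time, and content I want. The implementation is simple: at the point where the original code scrapes a URL, it follows that URL and scrapes the corresponding data. So I only need to write one more parse method for the news detail page and call it through scrapy.Request.

Writing the code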

# Parse method for the news detail page
def parse_dir_contents(self, response):
    item = GgglxyItem()
    item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
    # Store the page URL as a string, not the response object itself
    item['href'] = response.url
    item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
    data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
    # string(.) concatenates all descendant text of the content div
    item['content'] = data[0].xpath('string(.)').extract()[0]
    yield item
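To illustrate why the content field uses string(.) rather than text(), here is a small sketch that relies only on Python's standard library. It is an analogue, not Scrapy code: xml.etree.ElementTree stands in for the Scrapy selector, itertext() plays the role of string(.), and the SAMPLE markup and the helper names direct_text and full_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like a news content div (illustration only).
SAMPLE = (
    "<div class='detail_zy_c pb30 mb30'>"
    "<p>First <strong>bold</strong> paragraph.</p>"
    "<p>Second paragraph.</p>"
    "</div>"
)


def direct_text(element):
    """Text that sits directly inside the element, similar to text()."""
    parts = [element.text or ""]
    for child in element:
        # Text that follows a child tag still belongs to the parent element.
        parts.append(child.tail or "")
    return "".join(parts)


def full_text(element):
    """All descendant text joined together, similar to string(.)."""
    return "".join(element.itertext())


root = ET.fromstring(SAMPLE)
print(repr(direct_text(root)))  # the div holds no text of its own
print(repr(full_text(root)))    # text of every nested tag, in document order
```

Because the article body is wrapped in nested tags, selecting only the div's own text would miss most of it, which is presumably why the whole-subtree form is used for the content field.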

After integrating it into the original code, we have:

import scrapy

from ggglxy.items import GgglxyItem


class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            # Follow the link and hand the detail page to the news parse method
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    # Parse method for the news detail page
    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        # Store the page URL as a string, not the response object itself
        item['href'] = response.url
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        # string(.) concatenates all descendant text of the content div
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item

Tested, and it works!
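The spider passes each href through response.urljoin before requesting it. As a rough illustration of what that resolution does, the sketch below uses urllib.parse.urljoin from the standard library, which resolves an href against a base URL in much the same way. The helper name absolute_link and both sample hrefs are hypothetical; the real links on the listing page may look different.

```python
from urllib.parse import urljoin

# The listing page from start_urls in the spider above.
LISTING_URL = "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"


def absolute_link(base_url, href):
    """Resolve an href found on base_url into an absolute URL."""
    return urljoin(base_url, href)


# Hypothetical relative href, for illustration only.
relative = absolute_link(LISTING_URL, "index.php?c=article&id=1")
# An href that is already absolute is returned unchanged.
absolute = absolute_link(LISTING_URL, "http://example.com/news/1")

print(relative)
print(absolute)
```

Resolving the link first means scrapy.Request always receives a full URL, whether the page uses relative or absolute hrefs.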

At this point, we add a loop:

NEXT_PAGE_NUM = 1

NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1

if NEXT_PAGE_NUM
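The loop in the source breaks off at the if line, so its condition cannot be recovered. As a separate, minimal sketch of the same idea (advance the page number and stop at some limit), the helper below builds the next listing URL with the standard library. The function name next_page_url and the max_page parameter are assumptions for illustration, not the author's code, and the real page limit is not given in the source.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, max_page):
    """Return the URL of the following listing page, or None past max_page.

    max_page is a hypothetical cap; the source does not state the real one.
    """
    parts = urlsplit(current_url)
    query = dict(parse_qsl(parts.query))
    page = int(query.get("page", "1")) + 1
    if page > max_page:
        return None
    query["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


first = "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"
print(next_page_url(first, max_page=3))

# Inside a spider's parse method this might be used as (not run here):
#     next_url = next_page_url(response.url, max_page=3)
#     if next_url:
#         yield scrapy.Request(next_url, callback=self.parse)
```

Deriving the next page from the URL of the current response avoids keeping a module-level counter, which can be harder to reason about when requests complete out of order.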
