Scrapy paginated web scraping (how to write this kind of code elegantly with yield? Some code examples)

优采云 Published: 2022-01-15 16:01


Elegantly crawling paginated web data with yield

When scraping web data with Python, we often run into pagination. Sometimes the "next page" button carries a concrete link address; sometimes the link is produced by JavaScript. The scraper therefore has to parse the content of the current page and, at the same time, collect the URL of the next page. How can we write this kind of code elegantly in Python, or put another way, more pythonically?
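Before the full examples, here is a minimal Python 3 sketch of the pattern, assuming requests and BeautifulSoup are available; the link text "下頁" and the selector are placeholders to adapt to the target site:

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def iter_pages(start_url, delay=2):
    """Yield the HTML of each page in a paginated listing, lazily.

    Fetches a page, hands it to the caller, then follows the
    "next page" link until there is none.
    """
    session = requests.Session()
    url = start_url
    while url:
        resp = session.get(url, timeout=30)
        resp.raise_for_status()
        yield resp.text
        soup = BeautifulSoup(resp.text, 'html5lib')
        e_next = soup.find('a', string="下頁")  # placeholder "next page" link
        url = urljoin(url, e_next['href']) if e_next else None
        time.sleep(delay)  # be polite between requests

Because iter_pages is a generator, each page is handed to the caller as soon as it is fetched; the caller simply writes for html in iter_pages(url): and processes pages one by one.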

Below are some fuller, real-world code examples.

def get_next_page(obj):
    '''Get the next page content from a url or from already-fetched content.'''
    error_occurred = False
    content = None  # guard: without this, a total fetch failure would raise NameError below
    for retry2 in xrange(3):
        try:
            if isinstance(obj, basestring):
                # obj is a URL; fetch it (unicode is a subclass of basestring,
                # so the original (basestring, unicode) check was redundant)
                resp = curr_session.get(obj, timeout=TIMEOUT, headers=headers,
                                        cookies=cookies, allow_redirects=True)
                content = resp.content
                save_html_content(obj, content)
                error_occurred = False
            else:
                # obj is already the page content
                content = obj
            soup = BeautifulSoup(content, features='html5lib', from_encoding="utf8")
            e_next_page = soup.find('a', text="下頁")  # the site's "next page" link
            break
        except Exception:
            error_occurred = True
            time.sleep(2)
    if error_occurred:
        yield content
        return
    if e_next_page:
        next_url = "http://www.etnet.com.hk" + e_next_page.get('href')
        time.sleep(2)
        yield content
        # delegate to the recursive call so callers see one flat stream of pages
        for i in get_next_page(next_url):
            yield i
    else:
        yield content
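One remark on the tail of this function: the loop for i in get_next_page(next_url): yield i is manual generator delegation. On Python 3.3+ the same thing is spelled yield from. A minimal sketch, with fetch and find_next_url as hypothetical stand-ins for the request and parsing logic above:

def paginate(url, fetch, find_next_url):
    """Recursively yield pages, delegating with ``yield from``."""
    content = fetch(url)               # hypothetical: download the page
    yield content
    next_url = find_next_url(content)  # hypothetical: extract the next link
    if next_url:
        yield from paginate(next_url, fetch, find_next_url)  # Python 3.3+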

def get_next_page(obj, page=1):
    '''Get the next page content from a url or from already-fetched content.'''
    error_occurred = False
    content = None  # guard against yielding an unbound name if every retry fails
    for retry2 in xrange(3):
        try:
            if isinstance(obj, basestring):
                resp = curr_session.get(obj, timeout=TIMEOUT, headers=headers,
                                        cookies=cookies, allow_redirects=True)
                content = resp.content
                save_html_content(obj, content)
                hrefs = re.findall('industrysymbol=.*&market_id=[^;]+', content)
                if page == 1 and "sh=" not in obj and hrefs:
                    # First page of a listing: rebuild a canonical URL and restart
                    # the crawl from it, delegating every page to the caller.
                    reset_url = ("http://www.aastocks.com/tc/cnhk/market/industry"
                                 "/sector-industry-details.aspx?%s&page=1" %
                                 (hrefs[0].replace('sh=1', 'sh=0').replace('&page=', '')
                                          .replace("'", '').split()[0]))
                    for next_page in get_next_page(reset_url):
                        yield next_page
                    return
                error_occurred = False
            else:
                content = obj
            soup = BeautifulSoup(content, features='html5lib', from_encoding="utf8")
            # the trailing space in the link text matches the site's markup
            e_next_page = soup.find('td', text="下一頁 ")
            break
        except Exception:
            error_occurred = True
            LOG.error(traceback.format_exc())
            time.sleep(2)
    if error_occurred:
        yield content
        return
    if e_next_page:
        hrefs = re.findall('industrysymbol=.*&market_id=[^;]+', content)
        if hrefs:
            next_url = ("http://www.aastocks.com/tc/cnhk/market/industry/sector-industry"
                        "-details.aspx?%s&page=%d" %
                        (hrefs[0].replace('sh=1', 'sh=0').replace('&page=', '')
                                 .replace("'", '').split()[0], page + 1))
            time.sleep(2)
            yield content
            for next_page in get_next_page(next_url, page + 1):
                yield next_page
            return
    # last page, or no usable href found: still hand the page to the caller
    yield content
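Recursion reads well, but every page adds a Python stack frame plus another level of generator delegation, so a very long listing could hit the recursion limit. An iterative loop inside a single generator stays flat. A sketch under the same assumptions as the example above (a requests session, the aastocks URL scheme, and the regex from the original code):

import re
import time

import requests
from bs4 import BeautifulSoup

def iter_industry_pages(first_url, max_pages=500):
    """Walk the paginated listing iteratively instead of recursively.

    ``max_pages`` is a safety cap so a broken "next" link cannot
    loop forever.
    """
    session = requests.Session()
    url, page = first_url, 1
    while url and page <= max_pages:
        resp = session.get(url, timeout=30)
        resp.raise_for_status()
        yield resp.content
        soup = BeautifulSoup(resp.content, 'html5lib')
        if not soup.find('td', string="下一頁 "):  # no "next page" cell: done
            break
        hrefs = re.findall('industrysymbol=.*&market_id=[^;]+', resp.text)
        if not hrefs:
            break
        page += 1
        url = ("http://www.aastocks.com/tc/cnhk/market/industry/"
               "sector-industry-details.aspx?%s&page=%d"
               % (hrefs[0].replace('sh=1', 'sh=0').split()[0], page))
        time.sleep(2)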

# Drive the generators: for each start link, sleep a random polite
# interval, then consume every page the generator yields.
for curr_href in e_href:
    retry_interval = random.randint(MIN_INTERVAL_SECONDS_FOR_RETRIEVING,
                                    MAX_INTERVAL_SECONDS_FOR_RETRIEVING)
    time.sleep(retry_interval)
    contents = get_next_page(curr_href)
    for content in contents:
        get_page_data(content)
