Web page source code scraping tool (a simple single-page data scraping case, with a few pitfalls worth noting)

Published by 优采云: 2022-01-02 10:29

This is a simple single-page data scraping case, but it has a few pitfalls worth noting. Below is a quick walkthrough of the code.

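The goal is to scrape the 51job website: search for "人工智能" (artificial intelligence) and collect basic information such as job title, company name, and salary.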

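The data sits directly in the page source (right-click > View Page Source), and it can also be inspected in the Elements panel (right-click > Inspect).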

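Note that the whole job list sits under the element with class='dw_table'. Its first child, class='el title', is the table header and should not be collected. The header does contain t1, t2, t3 like the rows below it, but its class='t1' element is not a <p> tag, whereas in a normal job row t1 is a <p>.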
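
The header-row check can be tried offline on a tiny hand-written snippet. The sketch below uses the standard library's html.parser instead of BeautifulSoup so that it runs without a network connection or extra packages. SAMPLE and JobTitleParser are hypothetical names, and the markup is an assumed miniature of the listing, not a copy of the real page.

```python
from html.parser import HTMLParser

# Hypothetical miniature of the listing markup (assumption: the real page has more columns)
SAMPLE = """
<div class="dw_table">
  <div class="el title"><span class="t1">Job title</span><span class="t2">Company</span></div>
  <div class="el"><p class="t1"><a title="AI Engineer" href="#">AI Engineer</a></p><span class="t2">Acme</span></div>
  <div class="el"><p class="t1"><a title="ML Researcher" href="#">ML Researcher</a></p><span class="t2">Globex</span></div>
</div>
"""


class JobTitleParser(HTMLParser):
    """Collect the title attribute of links found inside <p class="t1">."""

    def __init__(self):
        super().__init__()
        self.in_job_cell = False  # True while inside <p class="t1">
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "p" and "t1" in classes:
            # Only real job rows use a <p> for t1; the header's t1 is skipped
            self.in_job_cell = True
        elif tag == "a" and self.in_job_cell and "title" in attrs:
            self.titles.append(attrs["title"])

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_job_cell = False


parser = JobTitleParser()
parser.feed(SAMPLE)
titles = parser.titles
print(titles)
```

Filtering on the tag name together with the class is what keeps the header out: matching on class 't1' alone would also pick up the header cell.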

The main code is as follows:

from bs4 import BeautifulSoup
import requests
import time

headers = {
    'User-Agent': 'Mozilla/5.0'
}

url = 'https://search.51job.com/list/070300,000000,0000,00,9,99,%25E4%25BA%25BA%25E5%25B7%25A5%25E6%2599%25BA%25E8%2583%25BD,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='

# Fetch the search results page
html = requests.get(url, headers=headers)

# Watch out for this pitfall! The text comes back decoded as ISO-8859-1
# although the page is GBK, so re-encode to recover the bytes, then decode as GBK
html = html.text.encode('ISO-8859-1').decode('gbk')

soup = BeautifulSoup(html, 'html.parser')

for item in soup.find('div', 'dw_table').find_all('div', 'el'):
    shuchu = []  # output fields for one job row
    if item.find('p', 't1'):  # skip the header row, which has no <p class="t1">
        title = item.find('p', 't1').find('a')['title']  # job title
        company = item.find('span', 't2').string  # company name
        address = item.find('span', 't3').string  # location
        xinzi = item.find('span', 't4').string  # salary
        date = item.find('span', 't5').string  # posting date
        shuchu.append(str(title))
        shuchu.append(str(company))
        shuchu.append(str(address))
        shuchu.append(str(xinzi))
        shuchu.append(str(date))
        print('\t'.join(shuchu))
        time.sleep(1)  # pause one second between rows
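
A note on the long URL: the search keyword appears in the path as %25E4%25BA%25BA..., which is "人工智能" percent-encoded twice. The following minimal sketch reproduces that segment with urllib.parse; the variable names are only illustrative, and the meaning of the other URL parts is not covered here.

```python
from urllib.parse import quote, unquote

keyword = "人工智能"

# First pass: the UTF-8 bytes of the keyword become %XX escapes
once = quote(keyword)
# Second pass: each '%' is itself escaped as '%25'
twice = quote(once)

print(once)
print(twice)

# Decoding twice gets the original keyword back
print(unquote(unquote(twice)))
```

To search for a different keyword, the same double quote() call should produce the segment to substitute into the URL, assuming the site still accepts this URL pattern.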

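There are a few pitfalls to note: the page encoding, which the code handles by re-encoding as ISO-8859-1 and decoding as GBK, and the header row, which shares class='el' with the job rows and has to be filtered out.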
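
The encoding pitfall can be reproduced without any network access. This is a minimal sketch, assuming the server sends GBK bytes (as the decode('gbk') call in the code implies) and that the client first decodes them as ISO-8859-1, which is what requests falls back to for text responses that declare no charset.

```python
# Bytes as a GBK-encoded page would deliver them
raw = "人工智能".encode("gbk")

# Wrong guess: decoding GBK bytes as ISO-8859-1 yields mojibake,
# but loses nothing, because ISO-8859-1 maps every byte to one character
wrong = raw.decode("ISO-8859-1")

# The fix used in the scraper: turn the text back into bytes, then decode as GBK
fixed = wrong.encode("ISO-8859-1").decode("gbk")

# Decoding the raw bytes as GBK directly gives the same result
direct = raw.decode("gbk")

print(repr(wrong))
print(fixed)
```

With requests, two alternatives avoid the round trip: set the response's encoding attribute to 'gbk' before reading .text, or decode the response's .content bytes as GBK directly.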

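The final output is roughly as follows: one tab-separated line per job, containing the title, company, location, salary, and posting date.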
