Scraping web page content with Java (What is it like to scrape web information in Python with the soup library?)

优采云 Published: 2021-11-20 23:08

The article text sits in a <div> tag whose "itemprop" attribute has the value "articleBody", so it can be reached with the .find() function.

    # Take the first news card on the page
    news1 = soup.find_all('div', class_=["news-card-content news-right-box"])[0]
    # Inside the card, locate the <div> whose itemprop is "articleBody"
    content = news1.find('div', attrs={'itemprop': "articleBody"}).string
    print(content)

Output:

    Indian Shuttler Ajay Jayaram clinched $50k Dutch Open Grand Prix at Almere in Netherlands on Sunday, becoming the first Indian to win badminton Grand Prix tournament under a new scoring system. Jayaram defeated Indonesia's Ihsan Maulana Mustofa 10-11, 11-6, 11-7, 1-11, 11-9 in an exciting final clash. The 27-year-old returned to the circuit in August after a seven-month injury layoff.
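As a self-contained illustration of the lookup above, the sketch below parses a small hand-written HTML fragment instead of the live page. The fragment is an assumption modelled on the class and attribute names used in this article, and the real page markup may differ. It also shows a detail worth knowing: .string returns None when a tag has more than one child, in which case .get_text() is the safer choice.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (an assumption, not the real page source).
SAMPLE_HTML = """
<div class="news-card-content news-right-box">
  <div itemprop="articleBody">Jayaram won the Dutch Open.</div>
</div>
<div class="news-card-content news-right-box">
  <div itemprop="articleBody">Score: <b>11-9</b> in the decider.</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# A string containing a space is matched against the exact class attribute value.
cards = soup.find_all("div", class_="news-card-content news-right-box")

first_body = cards[0].find("div", attrs={"itemprop": "articleBody"})
second_body = cards[1].find("div", attrs={"itemprop": "articleBody"})

# Single text child: .string gives the text directly.
first_text = first_body.string
# Mixed children (text plus <b>): .string is None, so fall back to get_text().
second_string = second_body.string
second_text = second_body.get_text()

print(first_text)
print(second_string)
print(second_text)
```

If the article bodies on the target page may contain nested tags such as links or bold text, get_text() is likely the more robust accessor.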

In the same way, we can extract any other piece of information, such as the image, the author's name, and the publication time.
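A minimal sketch of that idea follows. The card markup, the class names "author" and "time", and the helper extract_card_details are all hypothetical and chosen only for illustration; the actual page would need to be inspected to find the right tags and attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical card markup; tag and class names are illustrative only.
CARD_HTML = """
<div class="news-card">
  <img src="https://example.com/photo.jpg" alt="match photo">
  <span class="author">A. Writer</span>
  <span class="time" content="2016-10-16T12:30:00">12:30 pm</span>
</div>
"""


def extract_card_details(html):
    """Return image URL, author and timestamp from one card (None if absent)."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.find("img")
    author = soup.find("span", class_="author")
    time_tag = soup.find("span", class_="time")
    return {
        # Attribute values are read with .get(); text with .get_text().
        "image": img.get("src") if img else None,
        "author": author.get_text(strip=True) if author else None,
        "time": time_tag.get("content") if time_tag else None,
    }


details = extract_card_details(CARD_HTML)
print(details)
```

Checking each lookup for None before reading from it keeps the helper from raising when a card lacks one of the fields.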

Step 3: Create the dataset

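Next, we repeat this operation for three news categories and store every article's content together with its category in a data frame. The author uses three different URLs, performs the same steps for each one, and collects the titles, contents, and categories of all articles in lists.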

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    urls = ["https://inshorts.com/en/read/cricket",
            "https://inshorts.com/en/read/tennis",
            "https://inshorts.com/en/read/badminton"]

    news_data_content, news_data_title, news_data_category = [], [], []

    for url in urls:
        # The last path segment of the URL is the category name
        category = url.split('/')[-1]
        data = requests.get(url)
        soup = BeautifulSoup(data.content, 'html.parser')

        news_title = []
        news_content = []
        news_category = []

        # Pair each headline box with its matching content box
        for headline, article in zip(
                soup.find_all('div', class_=["news-card-title news-right-box"]),
                soup.find_all('div', class_=["news-card-content news-right-box"])):
            news_title.append(headline.find('span', attrs={'itemprop': "headline"}).string)
            news_content.append(article.find('div', attrs={'itemprop': "articleBody"}).string)
            news_category.append(category)

        news_data_title.extend(news_title)
        news_data_content.extend(news_content)
        news_data_category.extend(news_category)

    # Combine the three lists into one data frame
    df1 = pd.DataFrame(news_data_title, columns=["Title"])
    df2 = pd.DataFrame(news_data_content, columns=["Content"])
    df3 = pd.DataFrame(news_data_category, columns=["Category"])
    df = pd.concat([df1, df2, df3], axis=1)
    df.sample(10)

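The output is a random sample of 10 rows from the data frame, with the columns Title, Content, and Category.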
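The dataset step above can also be sketched in a form that runs without network access. This is an illustrative variant, not the author's code: the helper page_to_rows and the two sample pages are assumptions, and the sample markup simply reuses the class and itemprop names from this article. It collects one dictionary per article and builds the data frame in a single call instead of concatenating three frames.

```python
import pandas as pd
from bs4 import BeautifulSoup


def page_to_rows(html, category):
    """Turn one page of HTML into a list of Title/Content/Category rows."""
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.find_all("div", class_="news-card-title news-right-box")
    bodies = soup.find_all("div", class_="news-card-content news-right-box")
    rows = []
    for title_box, body_box in zip(titles, bodies):
        headline = title_box.find("span", attrs={"itemprop": "headline"})
        article = body_box.find("div", attrs={"itemprop": "articleBody"})
        if headline is None or article is None:
            continue  # skip cards that lack either field
        rows.append({
            "Title": headline.get_text(strip=True),
            "Content": article.get_text(strip=True),
            "Category": category,
        })
    return rows


def make_page(title, body):
    """Build a hypothetical one-card page for the offline demonstration."""
    return (
        '<div class="news-card-title news-right-box">'
        f'<span itemprop="headline">{title}</span></div>'
        '<div class="news-card-content news-right-box">'
        f'<div itemprop="articleBody">{body}</div></div>'
    )


# Stand-ins for the downloaded pages (made-up content).
pages = {
    "cricket": make_page("Cricket headline", "Cricket article text."),
    "tennis": make_page("Tennis headline", "Tennis article text."),
}

all_rows = []
for category, html in pages.items():
    all_rows.extend(page_to_rows(html, category))

df = pd.DataFrame(all_rows, columns=["Title", "Content", "Category"])
print(df)
```

To use it against live pages, the html argument would come from requests.get(url).content, as in the loop above.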

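As you can see, scraping web information in Python with the Beautiful Soup library is straightforward, and it lets you collect useful data for any data science project with little effort. With it, you have your own "eyes" for quickly extracting valuable information from web pages.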
