Is Scraping Web Page Data Illegal? (Using soup When the Title Sits Inside a Tag, Part 1, Illustrated)

优采云 · Published: 2022-01-20 21:18


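Under the class "news-card-title news-right-box", you can see that the headline sits in a <span> tag whose "itemprop" attribute is "headline". It can be accessed with the .find() function.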

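# Take the first title card on the page.
news1 = soup.find_all('div', class_=["news-card-title news-right-box"])[0]
# The headline text sits in the <span> whose itemprop is "headline".
title = news1.find('span', attrs={'itemprop': "headline"}).string
print(title)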

We get the following output:

Shuttler Jayaram wins Dutch Open Grand Prix
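To try the lookup above without fetching the live page, here is a minimal sketch that runs the same two calls on an inline HTML string. The markup is an assumption modeled on the class and attribute names used above, not a copy of the real page, and the helper name `first_headline` is hypothetical.

```python
from bs4 import BeautifulSoup

# Assumed markup: a stand-in modeled on the class/attribute names in the text,
# not the real page source.
SAMPLE_HTML = """
<div class="news-card-title news-right-box">
  <a href="#"><span itemprop="headline">Shuttler Jayaram wins Dutch Open Grand Prix</span></a>
</div>
"""


def first_headline(html):
    """Return the text of the first headline span, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # A class_ string containing a space matches the exact class attribute value.
    cards = soup.find_all("div", class_="news-card-title news-right-box")
    if not cards:
        return None
    span = cards[0].find("span", attrs={"itemprop": "headline"})
    # get_text() also works when the tag holds nested tags, where .string is None.
    return span.get_text(strip=True) if span else None


print(first_headline(SAMPLE_HTML))
```

Returning None instead of indexing `[0]` directly means a page with no matching cards does not raise an IndexError.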

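Similarly, to access the news content, look under the <div> with class "news-card-content news-right-box". The article body sits in a <div> tag whose "itemprop" attribute is "articleBody", and it can be accessed with the .find() function.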

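# Take the first content card on the page.
news1 = soup.find_all('div', class_=["news-card-content news-right-box"])[0]
# The article text sits in the <div> whose itemprop is "articleBody".
content = news1.find('div', attrs={'itemprop': "articleBody"}).string
print(content)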

Indian Shuttler Ajay Jayaram clinched $50k Dutch Open Grand Prix at Almere in Netherlands on Sunday, becoming the first Indian to win badminton Grand Prix tournament under a new scoring system. Jayaram defeated Indonesia's Ihsan Maulana Mustofa 10-11, 11-6, 11-7, 1-11, 11-9 in an exciting final clash. The 27-year-old returned to the circuit in August after a seven-month injury layoff.

In a similar way, we can extract any other information, such as images, author names, and timestamps.
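As a rough illustration of pulling out those other fields, the sketch below reads an image URL, an author name, and a timestamp from a single card. Everything about the markup here is a hypothetical assumption (the class names "news-card", "author", and "time", the "datePublished" attribute, and all values), so the selectors would need adjusting to whatever the real page uses.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names, attributes, and values below are
# illustrative assumptions, not taken from the real page.
SAMPLE_CARD = """
<div class="news-card">
  <img src="https://example.com/photo.jpg" alt="placeholder photo">
  <span class="author">Example Author</span>
  <span class="time" itemprop="datePublished" content="2020-01-01T00:00:00Z">12:00 am</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_CARD, "html.parser")
card = soup.find("div", class_="news-card")

# Attribute values are read with dictionary-style access on the tag.
image_url = card.find("img")["src"]
# Visible text is read with get_text().
author = card.find("span", class_="author").get_text(strip=True)
# The same attrs={...} filter used for the headline works for any attribute.
published = card.find("span", attrs={"itemprop": "datePublished"})["content"]

print(image_url, author, published, sep="\n")
```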

Step 3: Create the dataset

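Next, we repeat this for 3 news categories and store every article's title, content, and category in a data frame. I will use three different URLs, run the same steps on each one, and collect the articles, their content, and their category labels in lists.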
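The category label in this step comes from the last path segment of each URL. A minimal sketch of that one piece, using only the standard library, is shown here; the helper name `category_from_url` is hypothetical. A plain `url.split('/')[-1]` returns an empty string when the URL ends with a slash, so this variant skips empty segments.

```python
from urllib.parse import urlparse


def category_from_url(url):
    """Return the last non-empty path segment of a URL, used as the label."""
    path = urlparse(url).path
    segments = [part for part in path.split("/") if part]
    return segments[-1] if segments else ""


for url in [
    "https://inshorts.com/en/read/cricket",
    "https://inshorts.com/en/read/tennis",
    "https://inshorts.com/en/read/badminton",
]:
    print(category_from_url(url))
```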

import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = [
    "https://inshorts.com/en/read/cricket",
    "https://inshorts.com/en/read/tennis",
    "https://inshorts.com/en/read/badminton",
]

news_data_content, news_data_title, news_data_category = [], [], []

for url in urls:
    # The last path segment of the URL is the category label.
    category = url.split('/')[-1]
    data = requests.get(url)
    soup = BeautifulSoup(data.content, 'html.parser')

    news_title = []
    news_content = []
    news_category = []

    # Pair each title card with its content card, in page order.
    for headline, article in zip(
        soup.find_all('div', class_=["news-card-title news-right-box"]),
        soup.find_all('div', class_=["news-card-content news-right-box"]),
    ):
        news_title.append(headline.find('span', attrs={'itemprop': "headline"}).string)
        news_content.append(article.find('div', attrs={'itemprop': "articleBody"}).string)
        news_category.append(category)

    # Add this category's articles to the running lists.
    news_data_title.extend(news_title)
    news_data_content.extend(news_content)
    news_data_category.extend(news_category)

# Build one column per list, then join the columns side by side.
df1 = pd.DataFrame(news_data_title, columns=["Title"])
df2 = pd.DataFrame(news_data_content, columns=["Content"])
df3 = pd.DataFrame(news_data_category, columns=["Category"])
df = pd.concat([df1, df2, df3], axis=1)
df.sample(10)

The output is a random sample of 10 rows with the columns Title, Content, and Category.

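You can see how easy it is to scrape web information in Python with the Beautiful Soup library, and how readily you can collect useful data for any data science project. From here on, sharpen your own eye for quickly extracting valuable information from web pages.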
