Scraping a web page and matching URLs in PHP (an approach using Python's BeautifulSoup library)

Published by 优采云: 2021-09-14 09:05


You seem to be looking for the BeautifulSoup library, one of the most popular web-scraping libraries for Python. You can find the project page here:

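First, you need to get the HTML of the page you want to scrape. You can do this with the requests library or with a standard-library module such as urllib. The example code below uses requests.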
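The paragraph above mentions urllib as an alternative to requests. A minimal sketch of that route, using only the standard library, might look like the following. The helper name fetch_html, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration and are not part of the original answer.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the body of `url` decoded as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header when there is
        # one; fall back to UTF-8 otherwise (an assumption, not a guarantee).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("https://www.google.com/") should return the page as a str, which can be passed to bs4.BeautifulSoup in the same way as the bytes returned by requests.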

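Once you have the HTML, instantiate a BeautifulSoup object and call its find_all method with the argument "a" (the <a> tag is what marks a hyperlink in HTML). Then, for each link found, run a membership test on its href (hypertext reference) attribute to check whether a particular word appears in it.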

    import bs4
    import requests

    # Get the HTML data from the web page.
    html = requests.get("https://www.google.com/").content

    # Instantiate a BeautifulSoup object based on the HTML data.
    soup = bs4.BeautifulSoup(html, "html.parser")

    # Create a list to store the results in.
    urlsContainingWord = []

    # Collect every <a> tag whose href contains the word.
    # .get() avoids a KeyError on <a> tags that have no href attribute.
    for url in soup.find_all("a"):
        if "mail" in url.get("href", ""):
            urlsContainingWord.append(url)

    # Print out the result.
    print(urlsContainingWord)

In this example, I am looking for every URL on the page that contains the word "mail". My output looks like this:

　　[Gmail]

If you only need the href values themselves, you can iterate over the list and read the href of each link you found.

    for url in urlsContainingWord:
        print(url["href"])

Output:

　　https://mail.google.com/mail/?tab=wm
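The filtering and the href extraction shown above can also be folded into a single find_all call by passing a compiled regular expression as the href filter. This is a sketch of a variation, not the original answer's method. The function name hrefs_containing and the inline SAMPLE_HTML markup are assumptions, used so that the example runs without a network connection.

```python
import re

import bs4

# Inline sample markup (an assumption, so the sketch runs offline).
SAMPLE_HTML = """
<a href="https://mail.google.com/mail/?tab=wm">Gmail</a>
<a href="https://www.google.com/imghp">Images</a>
<a name="top">Anchor without href</a>
"""


def hrefs_containing(html, word):
    """Return the href of every <a> tag whose href contains `word`."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # A compiled pattern used as the href filter is matched anywhere in the
    # attribute value; <a> tags with no href attribute are skipped.
    pattern = re.compile(re.escape(word))
    return [a["href"] for a in soup.find_all("a", href=pattern)]


print(hrefs_containing(SAMPLE_HTML, "mail"))
```

Because the filter only matches tags that actually carry an href, indexing with a["href"] inside the list comprehension cannot raise a KeyError.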
