Extracting JavaScript and CSS from a given website (calling session.get(), which returns a response object)

优采云, published: 2021-11-18 17:10


import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

# URL of the web page you want to extract
url = "http://books.toscrape.com"

# initialize a session
session = requests.Session()

# set the User-agent as a regular browser
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"

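To download the full HTML of the page, all we need to do is call session.get(), which returns a response object. We are only interested in the HTML, not the whole response, so we take its content attribute: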

# get the HTML content
html = session.get(url).content

# parse HTML using beautiful soup
soup = bs(html, "html.parser")
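The call above accepts whatever the server returns, including error pages. A minimal sketch of a more defensive fetch might look like the following. The helper name fetch_html and the 10-second timeout are assumptions made for illustration and are not part of the original tutorial; the sketch only relies on the get(), raise_for_status() and text members of a requests session and response.

```python
def fetch_html(session, url, timeout=10):
    """Fetch `url` with the given session and return the decoded HTML.

    `session` is expected to behave like requests.Session. The name
    `fetch_html` and the default timeout are illustrative choices.
    """
    # Without a timeout, a stalled server can block the script indefinitely.
    response = session.get(url, timeout=timeout)
    # Raise an error on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # .text is the body decoded to str; .content (used above) is raw bytes.
    return response.text
```

With this helper, the parsing step could be written as soup = bs(fetch_html(session, url), "html.parser").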

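Now that we have the soup, let's extract all script and CSS files. We use soup.find_all(), which returns every element in the soup that matches the given tag name and attribute filters: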

# get the JavaScript files
script_files = []

for script in soup.find_all("script"):
    if script.attrs.get("src"):
        # if the tag has the attribute 'src'
        script_url = urljoin(url, script.attrs.get("src"))
        script_files.append(script_url)

In short, we search for script tags that have a src attribute; these usually point to the JavaScript files the site needs.
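The same search can also be expressed with an attribute filter passed to find_all(), which additionally makes it easy to count inline scripts that have no src. The following is only a sketch: the sample HTML and the helper name split_scripts are hypothetical and exist just to keep the example self-contained.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical sample page, used only for illustration.
sample_html = """
<html><head>
<script src="/static/js/app.js"></script>
<script>console.log("inline");</script>
</head></html>
"""


def split_scripts(html, base_url):
    """Return (absolute URLs of external scripts, number of inline scripts)."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True keeps only <script> tags that carry a src attribute.
    external = [urljoin(base_url, tag["src"])
                for tag in soup.find_all("script", src=True)]
    # Tags without a src attribute hold their code inline.
    inline = [tag for tag in soup.find_all("script") if not tag.has_attr("src")]
    return external, len(inline)


external, inline_count = split_scripts(sample_html, "http://books.toscrape.com")
print(external, inline_count)
```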

The CSS files can be extracted in the same way:

# get the CSS files
css_files = []

for css in soup.find_all("link"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)

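As you may know, CSS files are referenced through the href attribute of link tags. We use urljoin() to make sure each link is absolute (that is, a full URL rather than a relative path such as /js/script.js).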
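To make the role of urljoin() concrete, here is a small sketch of how it resolves different kinds of links against a base URL. The host cdn.example.com and the page path /catalogue/page-2.html are assumptions chosen only for illustration.

```python
from urllib.parse import urljoin

base = "http://books.toscrape.com"

# Relative path: resolved against the base URL.
relative = urljoin(base, "static/oscar/js/oscar/ui.js")

# Root-relative path: replaces the whole path of the base URL.
root_relative = urljoin(base + "/catalogue/page-2.html", "/js/script.js")

# Already absolute: returned unchanged, so links to other hosts survive.
absolute = urljoin(base, "http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js")

# Protocol-relative: inherits the scheme of the base URL.
protocol_relative = urljoin(base, "//cdn.example.com/lib.js")

for result in (relative, root_relative, absolute, protocol_relative):
    print(result)
```

This is why the jQuery link hosted on ajax.googleapis.com can be passed through the same urljoin() call as the site's own relative paths.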

Finally, let's print the number of script and CSS files found and write the links to separate files:

print("Total script files in the page:", len(script_files))
print("Total CSS files in the page:", len(css_files))

# write file links into files
with open("javascript_files.txt", "w") as f:
    for js_file in script_files:
        print(js_file, file=f)

with open("css_files.txt", "w") as f:
    for css_file in css_files:
        print(css_file, file=f)

After running the script, two files appear: one with the JavaScript links and one with the CSS links:

css_files.txt

http://books.toscrape.com/static/oscar/favicon.ico
http://books.toscrape.com/static/oscar/css/styles.css
http://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css
http://books.toscrape.com/static/oscar/css/datetimepicker.css

javascript_files.txt

http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js
http://books.toscrape.com/static/oscar/js/bootstrap3/bootstrap.min.js
http://books.toscrape.com/static/oscar/js/oscar/ui.js
http://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.js
http://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/locales/bootstrap-datetimepicker.all.js
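Note that the CSS list contains favicon.ico. The loop collects every link tag that has an href, and link tags are also used for icons and other resources. One possible refinement, sketched below, keeps only links whose rel attribute includes "stylesheet". The helper name stylesheet_urls and the sample HTML are hypothetical and are not part of the original tutorial.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical sample markup, used only for illustration.
sample_html = """
<head>
<link rel="shortcut icon" href="static/oscar/favicon.ico">
<link rel="stylesheet" type="text/css" href="static/oscar/css/styles.css">
</head>
"""


def stylesheet_urls(html, base_url):
    """Return absolute URLs of <link> tags whose rel includes 'stylesheet'."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for link in soup.find_all("link", href=True):
        rel = link.get("rel") or []
        # Beautiful Soup normally returns rel as a list of tokens;
        # the str case is handled defensively.
        if isinstance(rel, str):
            rel = rel.split()
        if "stylesheet" in [token.lower() for token in rel]:
            urls.append(urljoin(base_url, link["href"]))
    return urls


css_only = stylesheet_urls(sample_html, "http://books.toscrape.com")
print(css_only)
```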

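Finally, I encourage you to extend this code into a fuller audit tool that identifies the different files, reports their sizes, and perhaps suggests optimizations for the website.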
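As a starting point for such a tool, the sketch below measures the size of each collected file and sorts the results. The function names file_sizes and largest_first and the 10-second timeout are assumptions for illustration; the session argument is expected to behave like the requests session created earlier.

```python
def file_sizes(session, urls, timeout=10):
    """Map each URL to the size of its body in bytes, or None on failure."""
    sizes = {}
    for file_url in urls:
        try:
            response = session.get(file_url, timeout=timeout)
            response.raise_for_status()
            sizes[file_url] = len(response.content)
        except Exception:
            # A broad except keeps this sketch free of imports; with requests
            # one would typically catch requests.RequestException instead.
            sizes[file_url] = None
    return sizes


def largest_first(sizes):
    """Return (url, size) pairs for successfully fetched files, biggest first."""
    known = [(u, s) for u, s in sizes.items() if s is not None]
    return sorted(known, key=lambda pair: pair[1], reverse=True)
```

For example, largest_first(file_sizes(session, script_files + css_files)) would list the heaviest files first, which is often where optimization pays off most.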

As a challenge, try downloading all of these files and storing them on your local disk.
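One possible approach to this challenge is sketched below. The names local_path and download_all, the folder name "downloaded", and the timeout are all assumptions for illustration; the session argument is expected to behave like the requests session used throughout.

```python
from pathlib import Path
from urllib.parse import urlparse


def local_path(file_url, folder):
    """Build a local path for `file_url` inside `folder`."""
    # Use the last path segment as the file name; fall back to "index"
    # when the URL has no file name (for example a bare domain).
    name = Path(urlparse(file_url).path).name or "index"
    return Path(folder) / name


def download_all(session, urls, folder="downloaded", timeout=10):
    """Download every URL into `folder` and return the saved paths."""
    Path(folder).mkdir(parents=True, exist_ok=True)
    saved = []
    for file_url in urls:
        response = session.get(file_url, timeout=timeout)
        response.raise_for_status()
        target = local_path(file_url, folder)
        # Write raw bytes so scripts, stylesheets and icons are stored unchanged.
        target.write_bytes(response.content)
        saved.append(target)
    return saved
```

Because only the last path segment is used, two files with the same name from different directories would overwrite each other; a fuller version would need to keep part of the directory structure.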

I also have another tutorial that shows how to extract all the links of a website.

Also, if the website you are analyzing happens to block your IP address, you will need to use a proxy server.
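With requests, a proxy can be attached to the session so that every later session.get() call goes through it. The snippet below is only a sketch: the proxy address is a hypothetical placeholder and must be replaced with a proxy you are actually allowed to use.

```python
import requests

# Hypothetical proxy address, used only for illustration.
PROXY = "http://127.0.0.1:8080"

session = requests.Session()
# Route both plain HTTP and HTTPS requests through the same proxy.
session.proxies.update({"http": PROXY, "https": PROXY})

print(session.proxies)
```

No request is sent by this snippet; the proxy only takes effect on subsequent calls made through this session.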
