Scraping dynamic web pages with Python (using Python requests.get to parse HTML that does not load all at once)

优采云 | Published: 2022-04-16 08:31

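Hi, I'm having trouble scraping data from a website (Fantsylabs dotcom) to use for modeling. I'm just a hacker, so please forgive my ignorance of comp sci terminology. What I'm trying to accomplish is...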


Use selenium to log in to the website and navigate to the page that contains the data.

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    ## Initialize and load the web page
    url = "website url"
    driver = webdriver.Firefox()
    driver.get(url)
    time.sleep(3)

    ## Fill out forms and login to site
    username = driver.find_element(By.NAME, 'input')
    password = driver.find_element(By.NAME, 'password')
    username.send_keys('username')
    password.send_keys('password')
    login_attempt = driver.find_element(By.CLASS_NAME, "pull-right")
    login_attempt.click()

    ## Find and open the page with the data that I wish to scrape
    link = driver.find_element(By.PARTIAL_LINK_TEXT, 'Player Models')
    link.click()
    time.sleep(10)

    ## UPDATED CODE TO TRY AND SCROLL DOWN TO LOAD ALL THE DYNAMIC DATA
    scroll = driver.find_element(By.CLASS_NAME, "ag-body-viewport")
    driver.execute_script("arguments[0].scrollIntoView();", scroll)

    ## Try to allow time for the full page to load the lazy way then pass to BeautifulSoup
    time.sleep(10)
    html2 = driver.page_source
    # page_source is already a str, so from_encoding does not apply here
    soup = BeautifulSoup(html2, "lxml")
    div = soup.find_all('div', {'class': 'ag-pinned-cols-container'})

    ## continue to scrape what I want
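The fixed `time.sleep` calls in the code above guess how long the page needs. A minimal sketch of the alternative, polling until a condition holds, is given here using only the standard library. `wait_until` is a hypothetical helper, not part of Selenium, which provides its own `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_until(condition, timeout=30.0, poll=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value; raises TimeoutError if the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # back off briefly before checking again
```

With the driver from the question, a call might look like `wait_until(lambda: driver.find_elements(By.CLASS_NAME, "ag-pinned-cols-container"))`, assuming `By` is imported from `selenium.webdriver.common.by`. `find_elements` returns an empty list while nothing matches, so the wait ends as soon as the container exists.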

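The process works as far as logging in and navigating to the correct page, and once the page has finished loading dynamically (30 seconds) it passes the page to Beautiful Soup. I can see roughly 300 instances in the table that I want to scrape... however, the bs4 scraper only spits out 30 of the 300. From my own research, this looks like a problem with data being loaded dynamically via javascript, where only what has been pushed to the html gets parsed by bs4? (See: Using Python requests.get to parse html code that does not load at once.)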
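One way to test that hypothesis: if the grid only keeps the currently visible rows in the DOM, a single `page_source` snapshot can never contain all 300. The rough sketch below scrolls the grid container one screen at a time and accumulates rows as they appear. It is not a confirmed fix for this site. The `.ag-row` class, the `row-index` attribute, and the helper name `collect_rows` are assumptions; only `ag-body-viewport` and `ag-pinned-cols-container` come from the question's code.

```python
import time

# Runs in the browser: return [row key, row text] for every row currently
# rendered that matches the CSS selector passed as arguments[0].
# ASSUMPTION: each row exposes a unique "row-index" attribute.
ROWS_JS = """
return Array.from(document.querySelectorAll(arguments[0])).map(
    r => [r.getAttribute('row-index'), r.innerText]);
"""

# Runs in the browser: scroll the element in arguments[0] down by one
# visible height and report whether the scroll position actually changed.
SCROLL_JS = """
const el = arguments[0];
const before = el.scrollTop;
el.scrollTop = before + el.clientHeight;
return el.scrollTop > before;
"""


def collect_rows(driver, viewport,
                 row_selector=".ag-pinned-cols-container .ag-row",  # assumed
                 pause=0.5, max_steps=500):
    """Scroll `viewport` step by step, keeping each rendered row once."""
    seen = {}
    for _ in range(max_steps):
        # Record whatever rows are in the DOM at this scroll position.
        for key, text in driver.execute_script(ROWS_JS, row_selector):
            seen.setdefault(key, text)
        # Stop once the container cannot scroll any further.
        if not driver.execute_script(SCROLL_JS, viewport):
            break
        time.sleep(pause)  # give the grid time to render the next rows
    return seen
```

Hypothetical usage: `rows = collect_rows(driver, scroll)`, where `scroll` is the `ag-body-viewport` element located in the question's code. `len(rows)` can then be compared with the roughly 300 rows visible in the browser.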

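For anyone offering advice: it may be hard to reproduce my example without creating a profile on the website, but would initializing the browser with phantomJS be all it takes to "grab" every instance and capture all the data I need?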

    driver = webdriver.PhantomJS()  ## instead of webdriver.Firefox()

Any thoughts or experience would be appreciated, since I have never dealt with dynamic pages / scraping javascript, if that is what I am running into.
