Automatic web article collection (controlling an already-open page with python+selenium)

Published by 优采云: 2022-01-05 16:06


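Collecting web page information with Python

Introduction

This is my first hands-on project: helping to download news about China from the RIA Novosti website. My skills are limited, so some steps still have to be done by hand.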

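1. Preparation

Choose the date or other filter options.

The "load more" option appears once on this page and has to be clicked manually; after that, further items load dynamically as you scroll down.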

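2. Scrolling down automatically and saving the loaded page

At first I let selenium open the page directly, but after selecting the date and fetching data the browser would suddenly close. So the first steps have to be manual: open the browser yourself, select the page by hand, and then let selenium take over and continue loading.

1. Controlling an already-open page with python+selenium

Reference link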

Win:

chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\Users\Administrator\Desktop\ria_ru"

Mac:

Directory of the Chrome executable: /Applications/Google Chrome.app/Contents/MacOS/

From that directory, run:

./Google\ Chrome --remote-debugging-port=9222 --user-data-dir="/Users/lee/Documents/selenum/AutomationProfile"

Parameters:

--remote-debugging-port

Any free port can be specified; selenium must use the same port when it attaches.

--user-data-dir

The directory in which a new Chrome profile is created. This makes Chrome start with a separate profile, so your default profile is not polluted.
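
Before attaching selenium, it can help to confirm that Chrome is actually listening on the chosen port. The following is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper name, and 127.0.0.1:9222 simply mirrors the port used in the commands above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    # 9222 is the value passed to --remote-debugging-port above
    if is_port_open("127.0.0.1", 9222):
        print("Debugging port is reachable; selenium can attach.")
    else:
        print("Nothing is listening on 9222; start Chrome with the flags above first.")
```

If the check fails, the most likely cause is that Chrome was started without the flag or with a different port.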

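2. Following the preparation above, manually select the required page in the opened browser

3. Scroll automatically and save the loaded page

I did not know how to choose a stopping condition for this kind of dynamically loaded page. I noticed that the news is sorted from newest to oldest, so I picked an end date (a month) as the stopping condition.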

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

def Stop_Month(browser):
    # Loading is judged complete by the month just before the wanted range,
    # so one extra, earlier date has to be selected on the page.
    htmldateElems = browser.find_elements(By.CLASS_NAME, 'list-item__date')
    month_str = htmldateElems[-1].text.split()
    return month_str[1]

def mouse_move(browser, stop_month):  # keep scrolling to the bottom of the page
    htmlElem = browser.find_element(By.TAG_NAME, 'html')
    while True:
        htmlElem.send_keys(Keys.END)
        time.sleep(1)
        month = Stop_Month(browser)
        print(month)
        if stop_month == month:
            print('****Arrived at the specified month interface****')
            break

# Attach to the Chrome instance started with --remote-debugging-port=9222
options = Options()
options.add_experimental_option('debuggerAddress', '127.0.0.1:9222')
browser = webdriver.Chrome(options=options)
browser.implicitly_wait(3)

stop_month = 'декабря'
mouse_move(browser, stop_month)

# Save the fully loaded page to the working directory
with open('0631-0207.html', 'wb') as f:
    f.write(browser.page_source.encode('utf-8', 'ignore'))
print('****html is written successfully****')
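
The stop condition compares only the month word of the last loaded item, so the loop never ends if that month never shows up (for example, when a date text has a different shape). The sketch below isolates the check in a pure function and adds an upper bound on the number of scrolls. The names `month_of`, `should_stop`, `scroll_until` and `max_scrolls` are hypothetical, and the "<day> <month> ..." date layout is an assumption inferred from the `split()[1]` indexing in the script.

```python
from typing import Callable, List, Optional


def month_of(date_text: str) -> Optional[str]:
    """Return the second whitespace-separated token (assumed to be the month), or None."""
    parts = date_text.split()
    return parts[1] if len(parts) > 1 else None


def should_stop(date_texts: List[str], stop_month: str) -> bool:
    """True once the last (oldest) loaded item falls in stop_month."""
    if not date_texts:
        return False
    return month_of(date_texts[-1]) == stop_month


def scroll_until(
    get_date_texts: Callable[[], List[str]],
    scroll_once: Callable[[], None],
    stop_month: str,
    max_scrolls: int = 500,
) -> bool:
    """Scroll, then check, at most max_scrolls times; report whether stop_month was reached."""
    for _ in range(max_scrolls):
        scroll_once()
        if should_stop(get_date_texts(), stop_month):
            return True
    return False
```

With selenium, `get_date_texts` could return the `.text` of every element found by the `list-item__date` class, and `scroll_once` could send `Keys.END` and sleep for a second, as the script does. A `False` return then signals that the page should be inspected by hand rather than saved blindly.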

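3. Extracting every news link, title and date from the page and generating an Excel sheet

The saved page already contains all the news links, titles and dates. The question is how to extract them.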

import re
import bs4
import openpyxl

def Links_Get(path):
    '''Get the links'''
    with open(path, encoding='utf-8') as downloadFile:
        webdata = bs4.BeautifulSoup(downloadFile.read(), 'html.parser')
    elems = webdata.find_all(attrs={'class': 'list-item__title color-font-hover-only'})
    # non-greedy, so the match stops at the first "html" after "http"
    link_regex = re.compile(r'http.*?html')
    links = []
    for elem in elems:
        a = link_regex.search(str(elem))
        links.append(a.group())
    return links

def Titles_Get(path):
    '''Get the titles'''
    with open(path, encoding='utf-8') as downloadFile:
        webdata = bs4.BeautifulSoup(downloadFile.read(), 'html.parser')
    # find every tag that carries this class attribute
    elems = webdata.find_all(attrs={'class': 'list-item__title color-font-hover-only'})
    titles = []
    for elem in elems:
        titles.append(elem.text)
    return titles

def Get_Link_to_Title(link, title, excel, i):
    '''Write one row to the Excel sheet'''
    excel['A%s' % i] = i
    # the first run of digits in the link is the date
    date_regex = re.compile(r'\d+')
    a = date_regex.search(link)
    excel['B%s' % i] = a.group()
    excel['C%s' % i] = title
    excel['D%s' % i] = link
    print("****%s successful****" % i)

links = Links_Get('0631-0207.html')  # the page saved earlier, in the working directory
titles = Titles_Get('0631-0207.html')
nums1 = len(links)
nums2 = len(titles)

if nums1 == nums2:  # they should normally match one to one; investigate if they do not
    # create the Excel file beforehand, then load it and write into it
    time_title_link = openpyxl.load_workbook('time_title_link.xlsx')
    sheet = time_title_link.create_sheet('0631-0207')
    for i, (link, title) in enumerate(zip(links, titles), start=1):
        Get_Link_to_Title(link, title, sheet, i)
        print(str(i), str(nums1))
    time_title_link.save('time_title_link.xlsx')
    print('Successful save')
    print('****Successful all****')
else:
    print('Error, titles != links')
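
The row layout written to Excel (index, date, title, link) can be built and checked in memory before touching the workbook. This is a minimal sketch, assuming each link carries its date as the first run of eight digits (a path segment such as /20220105/); the names `date_from_link` and `build_rows` are hypothetical.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\d{8}")


def date_from_link(link: str) -> str:
    """Extract the first 8-digit run from the link and return it as YYYY-MM-DD."""
    match = DATE_RE.search(link)
    if match is None:
        raise ValueError(f"no date found in link: {link!r}")
    # strptime also rejects digit runs that are not valid calendar dates
    return datetime.strptime(match.group(), "%Y%m%d").strftime("%Y-%m-%d")


def build_rows(links, titles):
    """Pair links with titles and return (index, date, title, link) tuples."""
    if len(links) != len(titles):
        raise ValueError("titles != links")
    return [
        (i, date_from_link(link), title.strip(), link)
        for i, (link, title) in enumerate(zip(links, titles), start=1)
    ]
```

Each tuple could then be written with openpyxl's `sheet.append(row)`, which fills columns A to D in order.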

4. Fetching the news content for every link in the generated list and producing a docx

import re
import bs4
import docx
import openpyxl
import requests

def Get_News(url, doc):
    res = requests.get(url)
    res.raise_for_status()
    NewsFile = bs4.BeautifulSoup(res.text, 'html.parser')
    elems_titles = NewsFile.select('.article__title')
    # the first run of digits in the URL is the date; the date link points to "/<date>/"
    date_regex = re.compile(r'\d+')
    a = date_regex.search(url)
    date_str = 'a[href="/' + a.group() + '/"]'
    elems_dates = NewsFile.select(date_str)
    elems_texts = NewsFile.select('.article__text')

    head0 = doc.add_heading('', 0)
    for title in elems_titles:
        head0.add_run(title.getText() + ' ')
    print('title write succeed')

    head2 = doc.add_heading('', 2)
    for date in elems_dates:
        head2.add_run(date.getText())
    print('date write succeed')

    for text in elems_texts:
        doc.add_paragraph(text.getText())
    print('text write succeed')
    doc.add_page_break()

workbook = openpyxl.load_workbook(r'time_title_link.xlsx')
sheet = workbook['0631-0207']
doc = docx.Document()

i = 1
for cell in sheet['D']:
    if cell.value == 'URL':  # skip a header row, if there is one
        continue
    if not cell.value:  # openpyxl returns None for an empty cell: end of the data
        break
    Get_News(cell.value, doc)
    print(str(i))
    i += 1

# save once, after the loop, so the file is written even when no empty cell follows the data
doc.save('0631-0207.docx')
print('****Successful save****')
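
openpyxl reports an empty cell as None rather than as '', so the column values are worth filtering before any request is sent. Below is a small sketch of such a filter over plain values; `iter_urls` is a hypothetical name, and 'URL' is the header text the loop above checks for.

```python
def iter_urls(values, header="URL"):
    """Yield non-empty URL strings, skipping a header cell and stopping at the first blank."""
    for value in values:
        if value == header:
            continue
        if value is None or not str(value).strip():
            # First blank cell marks the end of the data
            break
        yield str(value).strip()
```

A possible call would be `iter_urls(cell.value for cell in sheet['D'])`, passing each yielded URL to `Get_News`.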
