Python web page scraping (replacing the heading "Link" with "Last Statement", my requirements)

优采云 Published: 2022-03-06 16:11

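I have read many posts, but I have not found a solution that fits my needs exactly. First, let me say that I am new to Python (I am using Python 2). I am trying to create a dataset in Python by scraping web pages.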

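I am trying to collect data from a web page (). Note the nice HTML table. I have been able to read it into a list without much trouble. However, note that there are two columns of links. I want to remove the first link column, but I am not sure how to do this, since my data is in a list.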
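One possible way to drop a column once the table is already in a list is to rebuild every row without that position. This is only a minimal sketch, assuming each row is a plain list of cells in header order; `drop_column` and the sample rows are hypothetical and made up for illustration, not taken from the page.

```python
def drop_column(rows, index):
    """Return new rows with the cell at position `index` removed from each row."""
    return [row[:index] + row[index + 1:] for row in rows]


# Made-up sample in the shape described above: a header row plus one data row.
table = [
    ["Execution", "Link", "Link", "Last Name"],
    ["1", "Offender Information", "Last Statement", "Doe"],
]

# Remove the first link column (position 1); the second link column shifts to position 1.
trimmed = drop_column(table, 1)
print(trimmed[0])
```

Because the function works purely by position, it should apply just as well to rows stored as lists of (header, value) pairs.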

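The second link column is a bit more complicated. I want to replace the heading "Link" with "Last Statement". Then I want to visit each of the links provided, retrieve the last statement, and put it in the corresponding row of the original table that I built the list from.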
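The renaming and the per-row lookup could be sketched roughly as follows. This assumes Python 3, that the first link column has already been removed (so only one "Link" header is left), and that the link target of each row was collected separately while parsing, since `get_text()` only returns the visible text. `add_last_statements`, `fake_fetch`, and the example.com URL are hypothetical; how the statement is pulled out of each linked page is left to the caller, because the layout of those pages is not shown here.

```python
from urllib.parse import urljoin


def add_last_statements(headers, rows, hrefs, base_url, fetch_statement):
    """Rename the "Link" header and replace each row's link cell with fetched text.

    headers: column names containing exactly one "Link" entry
    rows: rows (lists of cell values) in header order
    hrefs: one link target per row, possibly relative to base_url
    fetch_statement: callable taking an absolute URL and returning the statement text
    """
    col = headers.index("Link")
    new_headers = list(headers)
    new_headers[col] = "Last Statement"

    new_rows = []
    for row, href in zip(rows, hrefs):
        new_row = list(row)
        # Resolve relative links against the page they were found on.
        new_row[col] = fetch_statement(urljoin(base_url, href))
        new_rows.append(new_row)
    return new_headers, new_rows


# Stand-in fetcher so the sketch runs offline; a real one would download and parse the page.
def fake_fetch(url):
    return "statement from " + url


new_headers, new_rows = add_last_statements(
    ["Execution", "Link", "Last Name"],
    [["1", "Last Statement", "Doe"]],
    ["dr_info/doelast.html"],
    "http://example.com/death_row/index.html",
    fake_fetch,
)
print(new_headers)
print(new_rows)
```

Passing the fetcher in as an argument keeps the table logic separate from the network code, so a `requests`-based function can be swapped in later without touching the rest.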

Finally, I want to write this list out as a tab-delimited file that can be read into R as a data frame.
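A tab-delimited file of this kind can be produced with the standard `csv` module by setting the delimiter to a tab. The sketch below assumes Python 3 and rows stored as plain lists; `write_tsv`, the file name `prisoners.tsv`, and the sample row are hypothetical.

```python
import csv


def write_tsv(path, headers, rows):
    """Write a header line and the data rows as a tab-delimited text file."""
    # newline="" lets the csv module control line endings itself (Python 3).
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(headers)
        writer.writerows(rows)


write_tsv(
    "prisoners.tsv",
    ["Execution", "Last Statement", "Last Name"],
    [["1", "An example statement, with a comma.", "Doe"]],
)
```

On the R side, `read.delim("prisoners.tsv")` expects a header line and tab separators by default, so a file written this way should load as a data frame.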

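This is a side project that I need to work through. Please let me know whether I am approaching the problem correctly. Below is the code I have so far. It is missing the parts I want to do, because I do not know how to start on them.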

from bs4 import BeautifulSoup
import requests

# URL of the main page that holds the large table
main_url = "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"

# download the page and parse the HTML with BeautifulSoup
doc = requests.get(main_url)
soup = BeautifulSoup(doc.text, 'lxml')

# find the table in the HTML
tbl = soup.find("table", attrs={"class": "os"})

# create labels for the list rows from the table headers
headings = [th.get_text() for th in tbl.find("tr").find_all("th")]

# convert the unicode to string
# (range(0, len(headings)-1) skipped the last header, so iterate over all of them)
headers = [str(h) for h in headings]

# access the remaining information
prisoners = []
for row in tbl.find_all("tr")[1:]:
    # attach the appropriate header to the corresponding data
    # also converts unicode to string; list() keeps the pairs printable on Python 3
    info = list(zip(headers, (str(td.get_text()) for td in row.find_all("td"))))
    # append each of the newly made rows
    prisoners.append(info)

# print each row of the list to a file for R
# 'w' avoids duplicated rows on reruns; the with block closes the file itself
with open('output.txt', 'w') as output:
    for p in prisoners:
        output.write(str(p) + '\n')

If you could help me figure out any of the three parts I am struggling with, I would greatly appreciate it!

Source: user1723196, 2016-04-24
