Scraping dynamic web pages with Python (algorithm performance, Excel output, and excluding files that do not meet the requirements)

优采云 Published: 2021-12-03 18:32


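  Requirement: find the source files that contain certain specified customer-information fields.

  Version 1: search for a single keyword; if a file contains it, print the file's path to the console.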

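import os

rootDir = os.getcwd()

def scan_file(filename, dirname):
    """Print the file's path if it lies under /src and mentions "hello"."""
    path = os.path.join(dirname, filename)
    if "hello" in filename:
        # The file name itself contains the keyword.
        if "/src" in dirname:
            print(path)
    else:
        # Otherwise search the contents; errors="ignore" keeps
        # binary files from raising a decode error.
        with open(path, errors="ignore") as f:
            for line in f:
                if "hello" in line:
                    if "/src" in dirname:
                        print(path)
                    break  # one hit per file is enough

# Walk the tree below the current working directory.
for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        scan_file(fname, dirName)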

  Version 2: search for several keywords, and write out each matching file together with the keywords it contains.

import os

rootDir = os.getcwd()
keywords = ["hello", "world", "thanks"]

def scan_file(filename, dirname, keyword):
    """Return True if the file lies under /src and its name or contents contain keyword."""
    if "/src" not in dirname:
        return False
    if keyword in filename:
        return True
    # errors="ignore" keeps binary files from raising a decode error.
    with open(os.path.join(dirname, filename), errors="ignore") as f:
        return any(keyword in line for line in f)

for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        # Keywords found in this file, in the order listed above.
        matched = [kw for kw in keywords if scan_file(fname, dirName, kw)]
        if matched:
            # Append the keywords, then the file path on the next line.
            with open('test.txt', 'a') as out:
                for kw in matched:
                    out.write(kw + " ,")
                out.write("\n" + os.path.join(dirName, fname) + "\n")

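  This version covers the basic functionality but is far from finished. Room for further iteration:

  1. Performance of the algorithm, including the code's time complexity, redundancy, and elegance. Each file is reopened and reread once per keyword.

  2. Readability of the output. Ideally the files would be grouped by module and presented in Excel.

  3. Details: exclude files that do not meet the requirements, such as png.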
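  Items 1 and 3 can be tackled together. The following is only a sketch under stated assumptions, not the author's solution: `find_keywords`, `scan_tree` and `SKIP_EXTENSIONS` are hypothetical names, the extension list is a guess to be adjusted, and "under src" is taken to mean that some path component is exactly `src`, which is stricter than the substring test above. Unlike Version 2, it checks both the file name and the contents, and it reads each file once for all keywords.

```python
import os

KEYWORDS = ["hello", "world", "thanks"]               # same keywords as Version 2
SKIP_EXTENSIONS = {".png", ".jpg", ".gif", ".zip"}    # assumed list; extend as needed


def find_keywords(path, keywords=KEYWORDS):
    """Return the keywords found in the file name or contents, reading the file once."""
    name = os.path.basename(path)
    found = {kw for kw in keywords if kw in name}
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            found.update(kw for kw in keywords if kw in line)
            if len(found) == len(keywords):
                break  # every keyword has been seen; stop reading
    # Preserve the order in which the keywords were listed.
    return [kw for kw in keywords if kw in found]


def scan_tree(root, keywords=KEYWORDS, skip_extensions=SKIP_EXTENSIONS):
    """Map each matching file under a 'src' directory to the keywords it contains."""
    results = {}
    for dirname, _subdirs, files in os.walk(root):
        if "src" not in dirname.split(os.sep):
            continue  # only directories below a component named exactly 'src'
        for fname in files:
            if os.path.splitext(fname)[1].lower() in skip_extensions:
                continue  # excluded file type, e.g. images
            path = os.path.join(dirname, fname)
            matched = find_keywords(path, keywords)
            if matched:
                results[path] = matched
    return results
```

  Calling `scan_tree(os.getcwd())` returns a dictionary from file path to matched keywords, so the cost is one pass per file instead of one pass per file per keyword.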

  These are left for the reader to think about.
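  As a possible starting point for item 2, the results could be written to a CSV file, which Excel opens directly. This is a sketch, not a definitive implementation: `write_report` is a hypothetical helper, the scan results are assumed to be a dictionary mapping file paths to lists of keywords, and "module" is assumed to mean the first directory below the scan root.

```python
import csv
import os


def write_report(results, root, out_path):
    """Write one CSV row per file: module, relative path, matched keywords.

    `results` maps file paths to lists of keywords. "Module" is assumed to be
    the first directory component below `root`.
    """
    rows = []
    for path, matched in results.items():
        rel = os.path.relpath(path, root)
        parts = rel.split(os.sep)
        module = parts[0] if len(parts) > 1 else "(root)"
        rows.append((module, rel, ", ".join(matched)))
    rows.sort()  # sorting puts rows of the same module next to each other
    # utf-8-sig adds a byte-order mark so that Excel detects the encoding.
    with open(out_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["module", "file", "keywords"])
        writer.writerows(rows)
    return rows
```

  Sorting by module keeps related files together in the sheet; a genuine .xlsx file with one sheet per module would need a third-party library and is not shown here.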
