Python crawler: searching source code that contains customer information

2022-10-26 17:31:35

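Use case: we need to find certain specified fields in the source code that contain customer information.

Version 1: search for a single keyword and print the path of every matching file to the console.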

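import os

rootDir = os.getcwd()
KEYWORD = "hello"
# Only files below a "src" directory are reported (portable form of "/src").
SRC_MARKER = os.sep + "src"

def scan_file(filename, dirname):
    """Print the file path if its name or its contents contain KEYWORD."""
    if SRC_MARKER not in dirname:
        return
    path = os.path.join(dirname, filename)
    if KEYWORD in filename:
        print(path)
        return
    # errors="ignore" keeps binary or non-UTF-8 files from aborting the scan.
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            if KEYWORD in line:
                print(path)
                break  # one hit is enough for this file

for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        scan_file(fname, dirName)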

Version 2: search for multiple keywords, and write each matching file to an output file together with the keywords it contains.

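import os

rootDir = os.getcwd()
keywords = ["hello", "world", "thanks"]
# Only files below a "src" directory are reported (portable form of "/src").
SRC_MARKER = os.sep + "src"

def scan_file(filename, dirname, keyword):
    """Return True if a file under a src directory has keyword in its name or contents."""
    if SRC_MARKER not in dirname:
        return False
    if keyword in filename:
        return True
    # errors="ignore" keeps binary or non-UTF-8 files from aborting the scan.
    with open(os.path.join(dirname, filename), encoding="utf-8", errors="ignore") as f:
        for line in f:
            if keyword in line:
                return True
    return False

for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        # Collect every keyword that matches this file.
        matched = [k for k in keywords if scan_file(fname, dirName, k)]
        if matched:
            # Append the matched keywords, then the file path on its own line.
            with open("test.txt", "a") as out:
                out.write(" ,".join(matched) + " ,")
                out.write("\n" + os.path.join(dirName, fname) + "\n")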

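This version implements the basic functionality, but it is still far from perfect. There is room for further iteration:

1. Performance of the algorithm, including time complexity, as well as the redundancy and elegance of the code.
2. Readability of the output: ideally the files would be grouped by module and presented in Excel.
3. Details: exclude files that are irrelevant to the task, such as png images.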
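As one possible direction for points 1 and 3, here is a minimal sketch that reads each file only once for all keywords and skips files by extension. The names `SKIP_EXTENSIONS`, `find_keywords`, and `scan_tree` are hypothetical, and the extension list is only an assumption about what counts as irrelevant.

```python
import os

# Hypothetical skip list; adjust it to the file types in your repository.
SKIP_EXTENSIONS = {".png", ".jpg", ".gif", ".ico", ".jar", ".zip"}

def find_keywords(path, keywords):
    """Return the keywords found in the file name or contents, reading the file once."""
    name = os.path.basename(path)
    found = {k for k in keywords if k in name}
    remaining = [k for k in keywords if k not in found]
    if remaining:
        # errors="ignore" keeps undecodable bytes from aborting the scan.
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                hits = [k for k in remaining if k in line]
                if hits:
                    found.update(hits)
                    remaining = [k for k in remaining if k not in hits]
                    if not remaining:
                        break  # every keyword has already been seen
    # Preserve the caller's keyword order in the result.
    return [k for k in keywords if k in found]

def scan_tree(root, keywords, marker=os.sep + "src"):
    """Map each matching file path below a `marker` directory to its keywords."""
    results = {}
    for dirname, _subdirs, files in os.walk(root):
        if marker not in dirname:
            continue
        for fname in files:
            if os.path.splitext(fname)[1].lower() in SKIP_EXTENSIONS:
                continue
            path = os.path.join(dirname, fname)
            matched = find_keywords(path, keywords)
            if matched:
                results[path] = matched
    return results
```

Calling `scan_tree(os.getcwd(), ["hello", "world", "thanks"])` would return a dict of file paths to matched keywords. Each file is opened at most once, instead of once per keyword.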

These are left for the reader to think about.
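For the Excel idea in point 2, one simple option is to write a CSV file, which Excel can open directly. The sketch below assumes the scan results are already collected in a dict that maps each file path to its list of matched keywords. It also assumes, purely for illustration, that a "module" is the first directory below the scan root. `module_of` and `write_report` are hypothetical names.

```python
import csv
import os

def module_of(path, root):
    """Hypothetical rule: the module is the first directory below root."""
    parts = os.path.relpath(path, root).split(os.sep)
    return parts[0] if len(parts) > 1 else "(root)"

def write_report(results, root, out_path):
    """Write module, file, keywords rows, sorted by module and then by file."""
    rows = sorted(
        (module_of(path, root), os.path.relpath(path, root), ", ".join(kws))
        for path, kws in results.items()
    )
    # utf-8-sig adds a BOM so Excel detects the encoding;
    # newline="" is what the csv module expects for files it writes.
    with open(out_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["module", "file", "keywords"])
        writer.writerows(rows)
    return rows
```

Because the rows are sorted by module first, files from the same module end up next to each other in the spreadsheet.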
