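I. Converting a scanned PDF into a PDF with copyable text
1. Use the Baidu API to convert the scanned PDF into a text PDF
Apply at: https://console.bce.baidu.com
After signing in, create a text recognition (OCR) application. The application list then shows the APP_ID, API_KEY, and SECRET_KEY needed to call the API.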
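Since these three values are secrets, one option is to keep them out of the script itself. The sketch below assumes they are stored in environment variables named BAIDU_APP_ID, BAIDU_API_KEY, and BAIDU_SECRET_KEY; those names and the load_credentials helper are illustrative only and are not part of the Baidu SDK.

```python
import os

# Hypothetical variable names; pick whatever fits your setup.
ENV_NAMES = ("BAIDU_APP_ID", "BAIDU_API_KEY", "BAIDU_SECRET_KEY")


def load_credentials(env=os.environ):
    """Return (APP_ID, API_KEY, SECRET_KEY) read from a mapping such as os.environ."""
    missing = [name for name in ENV_NAMES if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return tuple(env[name] for name in ENV_NAMES)
```

The result can be unpacked as APP_ID, API_KEY, SECRET_KEY = load_credentials() and passed to AipOcr in the same way the program in step 4 does.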
2. Install the following Python modules in order
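The program in step 4 relies on PyPDF2, pdfkit, fitz, and AipOcr. A quick way to see which of these still need installing is sketched below. It assumes the import names are aip, fitz, pdfkit, and PyPDF2, and that the matching pip packages are baidu-aip, PyMuPDF, pdfkit, and PyPDF2; the missing_modules helper is illustrative.

```python
import importlib.util

# Import names used by the program in step 4. The matching pip packages are
# assumed to be baidu-aip (aip), PyMuPDF (fitz), pdfkit and PyPDF2.
REQUIRED = ["aip", "fitz", "pdfkit", "PyPDF2"]


def missing_modules(names):
    """Return the names that cannot be imported in this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing:", missing_modules(REQUIRED))
```

Anything the script reports as missing can then be installed with pip3 before running the main program.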
3. Install the wkhtmltopdf software
Download page: https:///downloads.html
Note the location of bin/wkhtmltopdf.exe under the installation directory; the path_wk parameter in the program needs this path.
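A wrong path_wk only shows up once pdfkit is configured, so it can help to check the path first. The following is a minimal sketch: the first_existing helper and the CANDIDATES list are illustrative, and the second path is only a common Windows default that may not match your installation.

```python
import os
import sys


def first_existing(candidates):
    """Return the first path in candidates that is an existing file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Example locations only: the first is the path used in this post,
# the second is a common Windows default and may not apply to you.
CANDIDATES = [
    r"D:/Procedure/wkhtmltopdf/bin/wkhtmltopdf.exe",
    r"C:/Program Files/wkhtmltopdf/bin/wkhtmltopdf.exe",
]

path_wk = first_existing(CANDIDATES)
if path_wk is None:
    print("wkhtmltopdf.exe not found; adjust CANDIDATES", file=sys.stderr)
```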
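4. Program:
import os

import fitz  # PyMuPDF
import pdfkit
from aip import AipOcr  # Baidu OCR SDK
from PyPDF2 import PdfFileReader, PdfFileWriter

# The camelCase calls used below (pageCount, getPixmap, preRotate, writePNG,
# PdfFileReader, getPage, addPage) belong to older PyMuPDF and PyPDF2
# releases; newer releases renamed them.

# Settings. pdfpath, pdfname, APP_ID and API_KEY are placeholders to fill in.
pdfpath = 'path/to/pdf/folder'
pdfname = 'scanned.pdf'
path_wk = r'D:/Procedure/wkhtmltopdf/bin/wkhtmltopdf.exe'
APP_ID = 'your-app-id'
API_KEY = 'your-api-key'
SECRET_KEY = 'qwertyuiop'

# Processing code -----------------------------------------------------------
pdfkit_config = pdfkit.configuration(wkhtmltopdf=path_wk)
pdfkit_options = {'encoding': 'UTF-8', }


def pdf_image():
    """Render every page of the PDF to a PNG and return the page index range."""
    pdf = fitz.open(pdfpath + os.sep + pdfname)
    for pg in range(0, pdf.pageCount):
        page = pdf[pg]
        trans = fitz.Matrix(1.0, 1.0).preRotate(0)  # scale 1.0, no rotation
        pm = page.getPixmap(matrix=trans, alpha=False)
        pm.writePNG(image_path + os.sep + pdfname[:-4] + '_' + '{:0>3d}.png'.format(pg + 1))
    page_range = range(pdf.pageCount)
    pdf.close()
    return page_range


def read_png_str(page_range):
    """Run OCR on each page image and return one HTML string per page."""
    def get_file_content(filePath):
        with open(filePath, 'rb') as fp:
            return fp.read()

    image_list = []
    for page_num in page_range:
        image = get_file_content(image_path + os.sep + r'{}_{}.png'.format(pdfname[:-4], '%03d' % (page_num + 1)))
        image_list.append(image)

    client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
    options = {}
    options["language_type"] = "CHN_ENG"
    options["detect_direction"] = "false"
    options["detect_language"] = "false"
    options["probability"] = "false"
    all_pngstr = []
    for image in image_list:
        pngjson = client.basicGeneral(image, options)
        pngstr = ''
        for x in pngjson['words_result']:
            pngstr = pngstr + x['words'] + '<br/>'
        print('Calling the Baidu API: {} of {}'.format(len(all_pngstr) + 1, len(image_list)))
        all_pngstr.append(pngstr)
    return all_pngstr


def str2pdf(page_range, all_pngstr):
    """Write each page's text to its own single-page PDF."""
    for page_num in page_range:
        print('Writing text to PDF: {} of {}'.format((page_num + 1), len(page_range)))
        pdfkit.from_string((all_pngstr[page_num]), disperse_pdfpath + os.sep + '%s.pdf' % (str(page_num + 1)), configuration=pdfkit_config, options=pdfkit_options)


def pdf_merge(page_range):
    """Merge the single-page PDFs into new_<pdfname>."""
    pdf_output = PdfFileWriter()
    for page_num in page_range:
        print('Merging pages: {} of {}'.format((page_num + 1), len(page_range)))
        pdf_input = PdfFileReader(open(disperse_pdfpath + os.sep + '%s.pdf' % (str(page_num + 1)), 'rb'))
        page = pdf_input.getPage(0)
        pdf_output.addPage(page)
    newPdfPath = pdfpath + os.sep + 'new_{}'.format(pdfname)
    with open(newPdfPath, 'wb') as out:
        pdf_output.write(out)


# Working folders for the page images and the single-page PDFs
image_path = pdfpath + os.sep + "image"
if not os.path.exists(image_path):
    os.mkdir(image_path)
disperse_pdfpath = pdfpath + os.sep + "pdf"
if not os.path.exists(disperse_pdfpath):
    os.mkdir(disperse_pdfpath)

range_count = pdf_image()
all_th = read_png_str(range_count)
str2pdf(range_count, all_th)
pdf_merge(range_count)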
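The program above concatenates the recognized text directly into HTML, so text containing <, > or & could be read as markup by wkhtmltopdf, and a failed API call would surface as a KeyError on 'words_result'. The sketch below shows one way to guard against both. The words_to_html helper is illustrative, and the assumption that a failed call returns 'error_code' and 'error_msg' instead of 'words_result' should be checked against the Baidu documentation.

```python
import html


def words_to_html(response):
    """Turn one OCR response dict into a small HTML page, one line per result."""
    # Assumption: a failed call carries 'error_code' instead of 'words_result'.
    if "error_code" in response:
        raise RuntimeError("OCR failed: {} {}".format(
            response["error_code"], response.get("error_msg", "")))
    # Escape each recognized line so it cannot be parsed as HTML markup.
    lines = [html.escape(item["words"]) for item in response.get("words_result", [])]
    body = "<br/>".join(lines)
    return '<html><head><meta charset="UTF-8"></head><body>{}</body></html>'.format(body)
```

The returned string can be handed to pdfkit.from_string in place of the raw concatenated text.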
II. Converting a scanned PDF into a Word document with copyable text
1. With the environment from the previous section already installed, also install the python-docx Python module
pip3 install python-docx
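2. Program:
import os

import fitz  # PyMuPDF
import pdfkit
from aip import AipOcr  # Baidu OCR SDK
from docx import Document

# Settings. pdfpath, pdfname, APP_ID and API_KEY are placeholders to fill in.
pdfpath = 'path/to/pdf/folder'
pdfname = 'scanned.pdf'
path_wk = r'D:/Procedure/wkhtmltopdf/bin/wkhtmltopdf.exe'
APP_ID = 'your-app-id'
API_KEY = 'your-api-key'
SECRET_KEY = 'qwertyuiop'

# ---------------------------------------------------------------------------
# Carried over from section I; the Word output below does not use pdfkit.
pdfkit_config = pdfkit.configuration(wkhtmltopdf=path_wk)
pdfkit_options = {'encoding': 'UTF-8', }


def pdf_image():
    """Render every page of the PDF to a PNG and return the page index range."""
    pdf = fitz.open(pdfpath + os.sep + pdfname)
    for pg in range(0, pdf.pageCount):
        page = pdf[pg]
        trans = fitz.Matrix(1.0, 1.0).preRotate(0)  # scale 1.0, no rotation
        pm = page.getPixmap(matrix=trans, alpha=False)
        pm.writePNG(image_path + os.sep + pdfname[:-4] + '_' + '{:0>3d}.png'.format(pg + 1))
    page_range = range(pdf.pageCount)
    pdf.close()
    return page_range


def read_png_str(page_range):
    """Run OCR on each page image and return one text string per page."""
    def get_file_content(filePath):
        with open(filePath, 'rb') as fp:
            return fp.read()

    image_list = []
    for page_num in page_range:
        image = get_file_content(image_path + os.sep + r'{}_{}.png'.format(pdfname[:-4], '%03d' % (page_num + 1)))
        image_list.append(image)

    client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
    options = {}
    options["language_type"] = "CHN_ENG"
    options["detect_direction"] = "false"
    options["detect_language"] = "false"
    options["probability"] = "false"
    allPngStr = []
    for image in image_list:
        pngjson = client.basicGeneral(image, options)
        pngstr = ''
        for x in pngjson['words_result']:
            pngstr = pngstr + x['words'] + '\n'
        print('Calling the Baidu API: {} of {}'.format(len(allPngStr) + 1, len(image_list)))
        allPngStr.append(pngstr)
    return allPngStr


def str2word(allPngStr):  # helper name chosen to mirror str2pdf in section I
    """Write each page's text as a paragraph and save the .docx next to the PDF."""
    document = Document()
    for pngstr in allPngStr:
        document.add_paragraph(pngstr)
    document.save(pdfpath + os.sep + pdfname[:-4] + '.docx')


# Working folder for the page images
image_path = pdfpath + os.sep + "image"
if not os.path.exists(image_path):
    os.mkdir(image_path)

range_count = pdf_image()
allPngStr = read_png_str(range_count)
str2word(allPngStr)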
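Because read_png_str joins the recognized lines of a page with '\n', each page arrives as one long string. If one paragraph per recognized line and a page break between source pages is preferred, something like the following could be used. This is a sketch under that assumption; split_lines and write_docx are illustrative names, while Document, add_paragraph, add_page_break, and save are python-docx calls.

```python
def split_lines(all_png_str):
    """Split each page's OCR string into non-empty, stripped lines."""
    return [[ln.strip() for ln in page.split("\n") if ln.strip()]
            for page in all_png_str]


def write_docx(all_png_str, out_path):
    """Write one paragraph per OCR line and a page break between pages."""
    # python-docx is imported here so split_lines can be used without it.
    from docx import Document

    document = Document()
    pages = split_lines(all_png_str)
    for index, lines in enumerate(pages):
        for line in lines:
            document.add_paragraph(line)
        if index < len(pages) - 1:
            document.add_page_break()
    document.save(out_path)
```

A call such as write_docx(allPngStr, pdfpath + os.sep + pdfname[:-4] + '.docx') would produce the same output file name as the program above.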
III. Converting the text in a PDF into a Word document
1. Install the following two Python modules
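2. Program:
import warnings

# This import layout matches the pdfminer3k package; pdfminer.six arranges
# these classes differently.
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from docx import Document

filePath = 'D:/pdf/水滸傳.pdf'


def pdf2word(filePath):  # function name chosen here
    """Extract the text of every page and write it to a .docx next to the PDF."""
    warnings.filterwarnings("ignore")
    document = Document()
    with open(filePath, 'rb') as fn:
        parser = PDFParser(fn)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize('')
        resource = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(resource, laparams=laparams)
        interpreter = PDFPageInterpreter(resource, device)
        for i in doc.get_pages():
            interpreter.process_page(i)
            layout = device.get_result()
            for out in layout:
                # Only text-bearing layout objects expose get_text()
                if hasattr(out, "get_text"):
                    content = out.get_text().replace(u'\xa0', u' ')
                    # 'List Bullet' is the style name; the id form 'ListBullet' is deprecated
                    document.add_paragraph(content, style='List Bullet')
    document.save(filePath[:-4] + '.docx')


if __name__ == '__main__':
    pdf2word(filePath)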
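Text extracted from a PDF can contain control characters (form feeds are common) that XML does not permit, and python-docx may then raise a ValueError when the paragraph is added. A small cleaning step, sketched below, extends the '\xa0' replacement used in the program; clean_text is an illustrative helper, not part of pdfminer or python-docx.

```python
import re

# Control characters that XML 1.0 does not allow (tab, LF and CR are allowed).
_ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_text(text):
    """Normalize non-breaking spaces and drop characters a .docx cannot store."""
    text = text.replace("\xa0", " ")
    return _ILLEGAL_XML.sub("", text).strip()
```

It would be applied to the result of get_text() before the text is passed to add_paragraph.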
Reference blog: https://blog.csdn.net/dianepure/article/details/88568761