
Converting Scanned PDFs to Text and Word (Python 3)

 和相品 2020-05-13

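I. Converting a scanned PDF into a PDF with copyable text

1. Use the Baidu API to convert the scanned PDF into a text PDF

Sign-up URL: https://console.bce.baidu.com

After signing in, create a text recognition (OCR) application. The application list then shows the APP_ID, API_KEY, and SECRET_KEY needed to call the API.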
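
Hard-coding the three credentials in a script makes them easy to leak when the file is shared. The following is a minimal sketch of one alternative, assuming the values have been exported as environment variables. The variable names and the helper `load_baidu_credentials` are illustrative choices, not something the Baidu SDK defines.

```python
import os

# Hypothetical variable names; pick whatever suits your environment.
ENV_NAMES = ("BAIDU_OCR_APP_ID", "BAIDU_OCR_API_KEY", "BAIDU_OCR_SECRET_KEY")


def load_baidu_credentials(env=os.environ):
    """Return (APP_ID, API_KEY, SECRET_KEY) read from env, or raise if any is missing."""
    missing = [name for name in ENV_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in ENV_NAMES)
```

The returned tuple can be unpacked directly: `APP_ID, API_KEY, SECRET_KEY = load_baidu_credentials()`.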

2. Install the following Python modules in order

    pip3 install PyPDF2
    pip3 install baidu-aip
    pip3 install pdfkit
    pip3 install pymupdf

3. Install wkhtmltopdf

Download URL: https:///downloads.html

Note the location of bin/wkhtmltopdf.exe under the installation directory. The path_wk variable in the program must point to it.
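
If the installation directory differs between machines, the path can be resolved at run time instead of being fixed in the script. This is a rough sketch that assumes wkhtmltopdf may also be reachable through PATH. The helper name `find_wkhtmltopdf` is hypothetical.

```python
import os
import shutil


def find_wkhtmltopdf(explicit_path=None):
    """Return a usable wkhtmltopdf path, or None if none can be found."""
    # Prefer the path the user wrote down after installing
    if explicit_path and os.path.isfile(explicit_path):
        return explicit_path
    # Otherwise fall back to whatever executable is on PATH
    return shutil.which("wkhtmltopdf")
```

The result can be assigned to path_wk. A return value of None means the executable still has to be installed or located by hand.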

4. Program:

    # Written against the 2020 releases of PyPDF2 and PyMuPDF. PyPDF2 3.x and
    # newer PyMuPDF releases rename several of the calls used below.
    from PyPDF2 import PdfFileReader, PdfFileWriter
    from aip import AipOcr
    import pdfkit
    import fitz
    import os

    pdfpath = r'D:\pdf3'
    pdfname = '水滸傳.pdf'
    path_wk = r'D:/Procedure/wkhtmltopdf/bin/wkhtmltopdf.exe'

    APP_ID = '1234567'
    API_KEY = 'abcdefg'
    SECRET_KEY = 'qwertyuiop'

    # Processing code below -----------------------------------------------------
    pdfkit_config = pdfkit.configuration(wkhtmltopdf=path_wk)
    pdfkit_options = {'encoding': 'UTF-8'}


    # Render every page of the PDF to a PNG image
    def pdf_image():
        pdf = fitz.open(pdfpath + os.sep + pdfname)
        for pg in range(0, pdf.pageCount):
            # Get the page object
            page = pdf[pg]
            trans = fitz.Matrix(1.0, 1.0).preRotate(0)
            # Render the page to a pixmap
            pm = page.getPixmap(matrix=trans, alpha=False)
            # Save the image as <name>_001.png, <name>_002.png, ...
            pm.writePNG(image_path + os.sep + pdfname[:-4] + '_' + '{:0>3d}.png'.format(pg + 1))
        page_range = range(pdf.pageCount)
        pdf.close()
        return page_range


    # Run OCR on every page image and return one HTML string per page
    def read_png_str(page_range):
        # Read a local image file as bytes
        def get_file_content(filePath):
            with open(filePath, 'rb') as fp:
                return fp.read()

        all_pngstr = []
        image_list = []
        for page_num in page_range:
            # Load the page image
            image = get_file_content(image_path + os.sep + r'{}_{}.png'.format(pdfname[:-4], '%03d' % (page_num + 1)))
            image_list.append(image)

        # Create the AipOcr client
        client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
        options = {}
        options["language_type"] = "CHN_ENG"
        options["detect_direction"] = "false"
        options["detect_language"] = "false"
        options["probability"] = "false"
        for image in image_list:
            print('Calling the Baidu API: image {} of {}'.format(len(all_pngstr) + 1, len(image_list)))
            # General text recognition; the result is a dict
            pngjson = client.basicGeneral(image, options)
            pngstr = ''
            for x in pngjson['words_result']:
                pngstr = pngstr + x['words'] + '<br>'
            all_pngstr.append(pngstr)
        return all_pngstr


    # Write each page's text to its own single-page PDF
    def str2pdf(page_range, all_pngstr):
        for page_num in page_range:
            print('Writing text to PDF: page {} of {}'.format(page_num + 1, len(page_range)))
            pdfkit.from_string(all_pngstr[page_num],
                               disperse_pdfpath + os.sep + '%s.pdf' % str(page_num + 1),
                               configuration=pdfkit_config, options=pdfkit_options)


    # Merge the single-page PDFs into one file
    def pdf_merge(page_range):
        pdf_output = PdfFileWriter()
        open_files = []
        for page_num in page_range:
            print('Merging single-page PDFs: page {} of {}'.format(page_num + 1, len(page_range)))
            # The source files must stay open until the merged PDF is written
            fp = open(disperse_pdfpath + os.sep + '%s.pdf' % str(page_num + 1), 'rb')
            open_files.append(fp)
            pdf_input = PdfFileReader(fp)
            pdf_output.addPage(pdf_input.getPage(0))
        newPdfPath = pdfpath + os.sep + 'new_{}'.format(pdfname)
        with open(newPdfPath, 'wb') as out:
            pdf_output.write(out)
        for fp in open_files:
            fp.close()
        return newPdfPath


    image_path = pdfpath + os.sep + "image"
    if not os.path.exists(image_path):
        os.mkdir(image_path)

    disperse_pdfpath = pdfpath + os.sep + "pdf"
    if not os.path.exists(disperse_pdfpath):
        os.mkdir(disperse_pdfpath)

    range_count = pdf_image()
    all_th = read_png_str(range_count)
    str2pdf(range_count, all_th)
    pdf_merge(range_count)
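
The program indexes `pngjson['words_result']` directly, so a failed OCR call surfaces as a KeyError, and recognized text containing `<` or `&` goes into the HTML unescaped. Below is a small sketch of a more defensive conversion. It assumes a failed call returns a dict with `error_code` and `error_msg` keys. The helper name `ocr_response_to_html` is made up for illustration.

```python
import html


def ocr_response_to_html(response):
    """Turn one basicGeneral-style response dict into an HTML fragment."""
    # Assumption: failures come back as a dict carrying error_code / error_msg
    if "error_code" in response:
        raise RuntimeError("OCR failed: {} {}".format(
            response["error_code"], response.get("error_msg", "")))
    # Escape each recognized line so characters such as < and & render literally
    lines = [html.escape(item["words"]) for item in response.get("words_result", [])]
    return "<br>".join(lines)
```

Inside read_png_str, the inner loop that builds pngstr could then be replaced by `pngstr = ocr_response_to_html(pngjson)`.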

 

II. Converting a scanned PDF into a Word document with copyable text

1. With the environment from the previous section in place, install the python-docx module

    pip3 install python-docx

2. Program:

    # Written against the 2020 release of PyMuPDF; newer releases rename
    # several of the calls used below.
    from docx import Document
    from aip import AipOcr
    import fitz
    import os

    pdfpath = r'D:\pdf'
    pdfname = '水滸傳.pdf'

    APP_ID = '123456789'
    API_KEY = 'abcdefg'
    SECRET_KEY = 'qwertyuiop'


    # Render every page of the PDF to a PNG image
    def pdf_image():
        pdf = fitz.open(pdfpath + os.sep + pdfname)
        for pg in range(0, pdf.pageCount):
            # Get the page object
            page = pdf[pg]
            trans = fitz.Matrix(1.0, 1.0).preRotate(0)
            # Render the page to a pixmap
            pm = page.getPixmap(matrix=trans, alpha=False)
            # Save the image as <name>_001.png, <name>_002.png, ...
            pm.writePNG(image_path + os.sep + pdfname[:-4] + '_' + '{:0>3d}.png'.format(pg + 1))
        page_range = range(pdf.pageCount)
        pdf.close()
        return page_range


    # Convert the text in each page image to a string
    def read_png_str(page_range):
        # Read a local image file as bytes
        def get_file_content(filePath):
            with open(filePath, 'rb') as fp:
                return fp.read()

        allPngStr = []
        image_list = []
        for page_num in page_range:
            # Load the page image
            image = get_file_content(image_path + os.sep + r'{}_{}.png'.format(pdfname[:-4], '%03d' % (page_num + 1)))
            image_list.append(image)

        # Create the AipOcr client
        client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
        # Optional parameters
        options = {}
        options["language_type"] = "CHN_ENG"
        options["detect_direction"] = "false"
        options["detect_language"] = "false"
        options["probability"] = "false"
        for image in image_list:
            print('Calling the Baidu API: image {} of {}'.format(len(allPngStr) + 1, len(image_list)))
            # General text recognition; the result is a dict
            pngjson = client.basicGeneral(image, options)
            pngstr = ''
            for x in pngjson['words_result']:
                pngstr = pngstr + x['words'] + '\n'
            allPngStr.append(pngstr)
        return allPngStr


    # Write one paragraph per page into a Word document
    def str2word(allPngStr):
        document = Document()
        for i in allPngStr:
            document.add_paragraph(i, style='List Bullet')
        document.save(pdfpath + os.sep + pdfname[:-4] + '.docx')
        print('Done')


    image_path = pdfpath + os.sep + "image"
    if not os.path.exists(image_path):
        os.mkdir(image_path)

    range_count = pdf_image()
    allPngStr = read_png_str(range_count)
    str2word(allPngStr)
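
pdf_image and read_png_str must agree on the file name of each page image, and both strip the extension with `pdfname[:-4]`, which only works for a three-letter extension. One way to keep the two in step is a shared helper, sketched here under the assumption that the `<name>_NNN.png` scheme is kept. The function name `page_image_path` is hypothetical.

```python
import os


def page_image_path(image_dir, pdf_name, page_index):
    """Build the PNG path for a zero-based page index, e.g. book_001.png."""
    # splitext handles extensions of any length, unlike pdf_name[:-4]
    stem = os.path.splitext(pdf_name)[0]
    return os.path.join(image_dir, "{}_{:03d}.png".format(stem, page_index + 1))
```

Both functions could call `page_image_path(image_path, pdfname, pg)` in place of building the string by hand.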

III. Converting the text in a PDF into a Word document

1. Install the following two Python modules

    pip3 install pdfminer3k
    pip3 install python-docx

2. Program:

    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.layout import LAParams
    from pdfminer.converter import PDFPageAggregator
    from docx import Document
    import warnings

    filePath = 'D:/pdf/水滸傳.pdf'

    document = Document()
    warnings.filterwarnings("ignore")


    def pdf2word():
        with open(filePath, 'rb') as fn:
            # Connect the parser and the document object to each other
            parser = PDFParser(fn)
            doc = PDFDocument()
            parser.set_document(doc)
            doc.set_parser(parser)
            # Set up layout analysis
            resource = PDFResourceManager()
            laparams = LAParams()
            device = PDFPageAggregator(resource, laparams=laparams)
            interpreter = PDFPageInterpreter(resource, device)
            for i in doc.get_pages():
                interpreter.process_page(i)
                layout = device.get_result()
                for out in layout:
                    # Only layout objects that carry text have get_text
                    if hasattr(out, "get_text"):
                        content = out.get_text().replace(u'\xa0', u' ')
                        document.add_paragraph(content, style='List Bullet')
        document.save(filePath[:-4] + '.docx')
        print('Done')


    if __name__ == '__main__':
        pdf2word()
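
The program replaces non-breaking spaces before writing each block to Word. Text extracted from a PDF can also contain control characters such as form feeds, which cannot be stored in the XML that a .docx file is built from. A minimal sketch of a slightly broader cleanup step follows. The helper name `clean_for_docx` is an illustrative choice.

```python
import re

# Control characters that XML 1.0 forbids (tab, LF and CR are allowed)
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_docx(text):
    """Replace non-breaking spaces and drop characters XML cannot carry."""
    text = text.replace("\xa0", " ")
    return _XML_ILLEGAL.sub("", text)
```

In pdf2word, `content = clean_for_docx(out.get_text())` would take the place of the existing replace call.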

Reference blog: https://blog.csdn.net/dianepure/article/details/88568761

 
