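I. To extract a PDF file's title, date, and content with Python and store them in a MySQL database, follow these steps:
Install the required libraries: pdfminer.six, PyPDF2, and mysql-connector-python. The extract_text helper imported from pdfminer.high_level is provided by pdfminer.six, not by the older pdfminer package.
pip install pdfminer.six PyPDF2 mysql-connector-python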
Import the libraries and connect to the MySQL database.
import mysql.connector
from mysql.connector import Error
import PyPDF2
from pdfminer.high_level import extract_text

try:
    connection = mysql.connector.connect(host='localhost',
                                         database='database_name',
                                         user='username',
                                         password='password')
    if connection.is_connected():
        cursor = connection.cursor()
        print("Connected to MySQL database")
except Error as e:
    print("Error while connecting to MySQL", e)
    raise  # stop here: the later steps need a live connection
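Open the PDF file and extract its title, date, and content.
# PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
with open('file.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    info = pdf_reader.metadata  # None when the PDF has no document information
    title = info.title if info else None
    raw_date = info.get('/CreationDate') if info else None  # e.g. "D:20220320120000+00'00'"
# Keep only YYYYMMDD, a string form MySQL accepts for DATE columns
date = raw_date[2:10] if raw_date else None
content = extract_text('file.pdf')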
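The /CreationDate value is not a plain date: PDF files usually store it in a form such as D:20220320120000+08'00', where everything after the year is optional. The following is a minimal sketch of a parser for that form using only the standard library. The function name parse_pdf_date is hypothetical, and the sketch assumes the common layout described here rather than covering every variant a PDF producer might write.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed layout: D:YYYYMMDDHHmmSS followed by Z or +HH'mm' / -HH'mm'.
# Only the four-digit year is mandatory in this sketch.
_PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Zz])?(?:(?P<tzh>\d{2})'?(?P<tzm>\d{2})?'?)?"
)


def parse_pdf_date(raw: Optional[str]) -> Optional[datetime]:
    """Return a datetime for a PDF date string, or None if it cannot be parsed."""
    if not raw:
        return None
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    parts = match.groupdict()
    try:
        value = datetime(
            int(parts["year"]),
            int(parts["month"] or 1),
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
        )
        sign = parts["sign"]
        if sign in ("Z", "z"):
            return value.replace(tzinfo=timezone.utc)
        if sign in ("+", "-"):
            offset = timedelta(
                hours=int(parts["tzh"] or 0),
                minutes=int(parts["tzm"] or 0),
            )
            if sign == "-":
                offset = -offset
            return value.replace(tzinfo=timezone(offset))
        return value
    except ValueError:
        # Out-of-range fields such as month 13 end up here.
        return None
```

Calling parse_pdf_date(raw).date() on a successfully parsed value gives a datetime.date, which mysql-connector-python can pass to a DATE column as a query parameter.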
Insert the extracted information into the MySQL database.
try:
    cursor.execute(
        "INSERT INTO table_name (title, date, content) VALUES (%s, %s, %s)",
        (title, date, content))
    connection.commit()
    print("Record inserted successfully into MySQL database")
except mysql.connector.Error as error:
    print("Failed to insert record into MySQL database {}".format(error))
finally:
    if connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection is closed")
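Note that you need to replace database_name, username, password, and table_name with your own database details. Also make sure the PDF file is in the same directory as the Python script, or give the file's full path.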
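One way to avoid editing these values directly in the script is to read them from environment variables. The sketch below is only an illustration: the function name load_db_config and the variable names PDF_DB_NAME, PDF_DB_USER, PDF_DB_PASSWORD, and PDF_DB_HOST are assumptions, not part of any library. It returns a dictionary whose keys match the keyword arguments already passed to mysql.connector.connect.

```python
import os
from typing import Dict, Mapping, Optional


def load_db_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build connection settings from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    required = {
        "database": "PDF_DB_NAME",
        "user": "PDF_DB_USER",
        "password": "PDF_DB_PASSWORD",
    }
    # Fail early with a clear message instead of a later connection error.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    config = {key: env[var] for key, var in required.items()}
    # The host is optional and falls back to a local server.
    config["host"] = env.get("PDF_DB_HOST", "localhost")
    return config


# Possible usage, assuming mysql.connector is imported and the variables are set:
# connection = mysql.connector.connect(**load_db_config())
```

The table name is not covered here, because a table name cannot be supplied through a %s query parameter and has to stay in the SQL text.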
II. Detailed example
1. Assumed text content
Title: Sample PDF Document
Date: 2022-03-20
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed at est at lectus viverra malesuada. Pellentesque fermentum dolor vel finibus consequat. Nulla facilisi.
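If the title and date appear only in the page text, as in the sample above, and not in the PDF's document information, they have to be recovered from the extracted text itself. The sketch below shows one way this could be done, assuming each field starts on its own line with the exact labels Title:, Date:, and Content: and that the date uses the YYYY-MM-DD form. The function name parse_labeled_text is hypothetical.

```python
import re
from datetime import date, datetime
from typing import Dict, Union

# A field starts at the beginning of a line with one of the three assumed labels.
_FIELD = re.compile(r"^(Title|Date|Content):\s*", re.MULTILINE)


def parse_labeled_text(text: str) -> Dict[str, Union[str, date, None]]:
    """Split extracted text into title, date, and content by their labels."""
    fields: Dict[str, Union[str, date, None]] = {
        "title": None,
        "date": None,
        "content": None,
    }
    matches = list(_FIELD.finditer(text))
    for index, match in enumerate(matches):
        # A field's value runs until the next label or the end of the text.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        fields[match.group(1).lower()] = text[match.end():end].strip()
    if fields["date"]:
        try:
            fields["date"] = datetime.strptime(fields["date"], "%Y-%m-%d").date()
        except ValueError:
            fields["date"] = None  # unexpected date format
    return fields
```

A limitation of this approach is that a content line which happens to begin with one of the labels would be treated as the start of a new field.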
2. Create a table to store the PDF data
CREATE TABLE pdf_data (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    date DATE,
    content TEXT
);
3. Write the Python code that parses the PDF and stores it in the database
import PyPDF2
from datetime import datetime
import mysql.connector

# Open the PDF file
pdf_file = open('sample.pdf', 'rb')

# Read the PDF metadata (PdfReader replaces PdfFileReader, removed in PyPDF2 3.0)
pdf_reader = PyPDF2.PdfReader(pdf_file)
pdf_info = pdf_reader.metadata
title = pdf_info.title

# Read the PDF content page by page
content = ''
for page in pdf_reader.pages:
    content += page.extract_text()

# Format the date: the raw value looks like "D:20220320120000+00'00'"
date_str = pdf_info.get('/CreationDate')[2:10]
date = datetime.strptime(date_str, '%Y%m%d').date()

# Store the data in the MySQL database
cnx = mysql.connector.connect(user='username', password='password',
                              host='localhost', database='pdf_db')
cursor = cnx.cursor()
add_pdf = ("INSERT INTO pdf_data (title, date, content) "
           "VALUES (%s, %s, %s)")
pdf_data = (title, date, content)
cursor.execute(add_pdf, pdf_data)
cnx.commit()

# Close the database connection and PDF file
cursor.close()
cnx.close()
pdf_file.close()
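When several PDF files are processed in one run, the rows can be collected first and written in a single batch. The sketch below is one possible helper, not part of the original script: the name insert_pdf_rows is hypothetical, and it assumes the pdf_data table defined above plus a connection object that offers cursor(), commit(), and rollback(), as mysql.connector connections do.

```python
from datetime import date
from typing import Iterable, Optional, Tuple

# One row per PDF: (title, date, content), matching the pdf_data columns.
PdfRow = Tuple[Optional[str], Optional[date], str]

INSERT_SQL = "INSERT INTO pdf_data (title, date, content) VALUES (%s, %s, %s)"


def insert_pdf_rows(connection, rows: Iterable[PdfRow]) -> int:
    """Insert all rows in one transaction and return how many were written."""
    batch = list(rows)
    if not batch:
        return 0
    cursor = connection.cursor()
    try:
        # executemany runs the same parameterized statement once per row.
        cursor.executemany(INSERT_SQL, batch)
        connection.commit()
    except Exception:
        # Undo the partial batch so the table is left unchanged.
        connection.rollback()
        raise
    finally:
        cursor.close()
    return len(batch)
```

Keeping the values as query parameters, as the original script does, leaves quoting and escaping to the connector instead of building the SQL string by hand.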
4. After the insert succeeds, query the database
SELECT * FROM pdf_data;
The result looks roughly like this:
| id | title | date | content |
|---|---|---|---|
| 1 | Sample PDF Document | 2022-03-20 | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed at est at lectus viverra malesuada. Pellentesque fermentum dolor vel finibus consequat. Nulla facilisi. |
Source: https://blog.csdn.net/weixin_41772346/article/details/129668586