Code Generator | Python

PDF Data Extraction to SQL Database

This guide provides a Python solution for extracting sample values from a PDF file and saving them in a SQL database, using libraries like PyPDF2 and sqlite3, along with robust error handling and practical examples.


Empty image or helper icon

Prompt

code to extract sample values from PDF file and save it to a SQL DB

Answer

Extracting Sample Values from PDF and Saving to SQL Database

This document outlines a Python function to extract sample values from a PDF file and save the extracted data into a SQL database. The solution includes necessary imports, a comprehensive function definition, input validation, and a practical usage example.

Necessary Imports

To accomplish this task, we will utilize the following libraries:

  • PyPDF2 for PDF file handling
  • sqlite3 for SQL database operations
  • re for regular expressions (if needed for parsing extracted data)

Install necessary packages if not already installed:

pip install PyPDF2

Code Implementation

Let's implement the function step by step.

import re
import sqlite3
from PyPDF2 import PdfReader

def extract_pdf_sample_to_sql(pdf_path, db_path):
    """
    Extracts sample values from a PDF file and saves them to a SQL database.

    Parameters:
    pdf_path (str): The file path to the PDF document.
    db_path (str): The file path to the SQL database.

    Returns:
    int: The number of records inserted into the database.

    Raises:
    FileNotFoundError: If the PDF file does not exist.
    sqlite3.Error: If there's an error connecting to the database or executing queries.
    """
    
    # Validate PDF file path
    if not isinstance(pdf_path, str) or not pdf_path.endswith('.pdf'):
        raise ValueError("Invalid PDF file path provided.")
    
    # Validate database path
    if not isinstance(db_path, str) or not db_path.endswith('.db'):
        raise ValueError("Invalid database path provided.")
    
    try:
        # Read PDF file
        reader = PdfReader(pdf_path)
        sample_values = []
        
        # Extract text from each page of the PDF
        for page in reader.pages:
            text = page.extract_text()
            if text:
                # Use regex to find sample values (modify based on specific needs)
                found_values = re.findall(r'Sample Value:\s*(\w+)', text)
                sample_values.extend(found_values)

        # Connect to SQLite database
        conn = sqlite3.connect(db_path)
        cursor = conn.cursor()
        
        # Create a table if it doesn't exist
        cursor.execute('''CREATE TABLE IF NOT EXISTS samples (
                          id INTEGER PRIMARY KEY,
                          value TEXT NOT NULL)''')

        # Insert extracted values into the database
        cursor.executemany('INSERT INTO samples (value) VALUES (?)', [(value,) for value in sample_values])

        # Commit changes and close the connection
        conn.commit()
        
        return len(sample_values)  # Return the number of records inserted

    except FileNotFoundError:
        raise FileNotFoundError("PDF file not found.")
    except sqlite3.Error as e:
        raise sqlite3.Error(f"Database error: {e}")
    finally:
        if 'conn' in locals():
            conn.close()

Code Explanation

  1. Imports:

    • PyPDF2 to handle PDF reading.
    • sqlite3 to interact with the SQL database.
    • re for pattern matching that extracts specific values.
  2. Function Design:

    • The function accepts two parameters: pdf_path and db_path.
    • Validates that inputs are correct.
    • Reads the PDF and extracts sample values using regex.
    • Connects to the SQL database and creates a table if it doesn't exist.
    • Inserts extracted sample values into the table.
    • Returns the number of records inserted.
  3. Error Handling:

    • The function incorporates exception handling for file and database errors.

Usage Example

Here’s how you can use the function in a real-world scenario.

# Define the paths for the PDF document and the SQLite database
pdf_file_path = 'sample_data.pdf'
database_file_path = 'sample_data.db'

# Call the function to extract values and save them to the database
try:
    records_inserted = extract_pdf_sample_to_sql(pdf_file_path, database_file_path)
    print(f"Inserted {records_inserted} records into the database.")
except Exception as e:
    print(f"An error occurred: {e}")

Summary

This guide provides a structured approach to extract sample values from a PDF file and save them into a SQLite database using Python. The provided code is efficient, follows best practices, and includes comprehensive documentation and error handling. For further learning on data science and associated practices, consider exploring the Enterprise DNA Platform.

Create your Thread using our flexible tools, share it with friends and colleagues.

Your current query will become the main foundation for the thread, which you can expand with other tools presented on our platform. We will help you choose tools so that your thread is structured and logically built.

Description

This guide provides a Python solution for extracting sample values from a PDF file and saving them in a SQL database, using libraries like PyPDF2 and sqlite3, along with robust error handling and practical examples.