Stuff and Things

wiki.helmutkaczmarek.de


cksgen – Generate and compare MD5 checksum lists with Python

This small command-line program, written in Python, is an extended port of my PowerShell script for ensuring the data integrity of my photo archive. It identifies corrupted .jpg, .jpeg, .dng and .cr2 files by generating lists of MD5 checksums and comparing them with the lists from previous runs. cksgen can also be used for any other file type.

This program is primarily intended for archived data that no longer changes. You can also use it for data that is edited from time to time. In that case, keep in mind that the MD5 checksum of a file changes as soon as any byte of it changes, for example when the metadata of a JPG file is edited. The message ATTENTION: Different MD5 checksums found on the command prompt therefore does not necessarily indicate a corrupted file.
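
The effect of a metadata edit on the checksum can be illustrated with a minimal sketch. The byte strings below are made-up stand-ins for file contents, not real JPG data. The point is only that MD5 is computed over every byte of a file, so a changed comment field yields a different digest even though the image data is identical.

```python
import hashlib

# Hypothetical stand-ins for a file's bytes: same "image data",
# different "metadata" (this is not a real JPG structure).
image_data = b"\xff\xd8...pixel data...\xff\xd9"
original = b"comment=holiday;" + image_data
edited = b"comment=Holiday 2023;" + image_data

digest_original = hashlib.md5(original).hexdigest()
digest_edited = hashlib.md5(edited).hexdigest()

# The digests differ although the image data is unchanged.
print(digest_original)
print(digest_edited)
print("changed:", digest_original != digest_edited)
```

A differing checksum therefore only says that the bytes changed, not why they changed.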

cksgen is Free Software licensed under the GNU General Public License (GPL), Version 3.

Features

  • Determines the MD5 checksums of files and writes them to the file yyyyMMdd_HHmmss_checksum.txt.
  • Compares the checksums of the last two checksum files and outputs a corresponding message in the command prompt and in the log file yyyyMMdd_HHmmss_log.txt.
  • Allowed file extensions can be specified.
  • The number of checksum files and log files to be kept can be specified.
  • The form of the time stamp in the file name of the log files and checksum files can be adjusted.
  • The folder names for the log files and checksum files can also be specified.
  • The command prompt will let you know which file is being processed.
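
The file naming scheme from the list above can be sketched as follows. The format string corresponds to the strftime call in the source code further down; the helper name checksum_list_name is a hypothetical one used only for this illustration. The sketch also shows why such names can be sorted as plain strings to find the most recent lists.

```python
import datetime

# Format string matching the yyyyMMdd_HHmmss pattern described above.
TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"


def checksum_list_name(moment):
    """Build a checksum list name for a datetime (hypothetical helper)."""
    return f"{moment.strftime(TIMESTAMP_FORMAT)}_checksum.txt"


older = checksum_list_name(datetime.datetime(2023, 8, 19, 9, 5, 0))
newer = checksum_list_name(datetime.datetime(2023, 8, 20, 18, 6, 30))

print(older)  # 20230819_090500_checksum.txt
print(newer)  # 20230820_180630_checksum.txt

# Zero-padded timestamps that start with the year sort chronologically
# when they are sorted as plain strings.
latest_two = sorted([older, newer], reverse=True)[:2]
print(latest_two)
```

If the time stamp format is changed to one that does not start with the year, sorting the file names no longer matches the chronological order, so the comparison of the "last two" lists would need a different ordering.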

Usage

General Usage:

cksgen [-h] [-c CONFIG] [-e]

If you have not compiled cksgen.py into an executable file (e.g. cksgen.exe on a Windows system, which you can build with pyinstaller --onefile cksgen.py) and added its folder to the PATH environment variable, run the script through the Python interpreter instead:

python /path/to/cksgen.py [-h] [-c CONFIG] [-e]

Help:

cksgen -h

See example configuration file:

cksgen -e

Example usage with a configuration file photos.conf:

cksgen -c photos
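
How the configuration name given on the command line maps to a file name can be sketched like this. The parser below is a simplified stand-in for the one in cksgen.py, not a copy of it, and the arguments are passed as a list instead of being read from the command line.

```python
import argparse

# Simplified stand-in for the parser defined in cksgen.py.
parser = argparse.ArgumentParser(prog="cksgen")
parser.add_argument("-c", "--config", help="Configuration name without extension.")
parser.add_argument("-e", "--exampleconf", action="store_true")

# Equivalent to running: cksgen -c photos
args = parser.parse_args(["-c", "photos"])

# The extension is appended by the program, so it is omitted on the command line.
config_filename = args.config + ".conf"
print(config_filename)  # photos.conf
```

The resulting name is a relative path, so the file is looked up in the current working directory. Starting cksgen from the folder that contains the configuration files avoids surprises.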

Source Code

Here is the source code of cksgen. You can copy or download it by clicking on the cksgen.py link in the upper left corner of the code block.

cksgen.py
"""
cksgen -- Generate and compare MD5 checksum lists with Python
=============================================================
 
Author: Helmut Kaczmarek <email@helmutkaczmarek.de>
Link: https://wiki.helmutkaczmarek.de/code:python:cksgen
License: GNU General Public License (GPL), Version 3 (https://www.gnu.org/licenses/gpl-3.0.html)
 
General Usage:
cksgen [-h] [-c CONFIG] [-e]
 
If you haven't compiled the cksgen.py file into an executable file (e.g. cksgen.exe on a Windows system; you can do so with "pyinstaller --onefile cksgen.py") and haven't registered it in the system environment variables, then:
 
python /path/to/cksgen.py [-h] [-c CONFIG] [-e]
 
Help:
cksgen -h
 
See example configuration file:
cksgen -e
 
Example usage with a configuration file "photos.conf":
cksgen -c photos
"""
 
import os
import hashlib
import glob
import datetime
import configparser
import argparse
 
EXAMPLE_CONFIG = """
# example.conf
# Save the configuration files in the same
# folder where the program is located.
[USER_SETTINGS]
conf_name = Example
# Extensions are separated by commas without spaces.
allowed_extensions = jpg,jpeg,dng,cr2
data_directory = C:\\Users\\Username\\Path\\To\\Example
lists_directory = conf_name\\Lists
logs_directory = conf_name\\Logs
files_to_keep = 10
"""
 
parser = argparse.ArgumentParser(description="Generate MD5 checksums for files in a directory.")
parser.add_argument("-c", "-conf", "--config", help="Specify the configuration file to use.")
parser.add_argument("-e", "-ex", "--exampleconf", action="store_true", help="Display example configuration.")
args = parser.parse_args()
 
if args.exampleconf:
    print(EXAMPLE_CONFIG)
    # exit() is provided by the site module and may be missing in frozen builds.
    raise SystemExit(0)
 
if args.config:
    print("Starting cksgen...")

    def load_config(config_path):
        """Read a configuration file and return its USER_SETTINGS section."""
        config = configparser.ConfigParser()
        # read_file() fails loudly if the file is missing, whereas
        # read() would silently skip it.
        with open(config_path, 'r') as config_file:
            config.read_file(config_file)
        return config['USER_SETTINGS']
 
    def delete_old_files(directory, files_to_keep):
        all_files = glob.glob(os.path.join(directory, '*.*'))
        all_files.sort(key=os.path.getmtime, reverse=True)
 
        files_to_delete = all_files[files_to_keep:]
        for file in files_to_delete:
            os.remove(file)
 
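    def compare_checksum_files(new_file, old_file):
        """Return the paths whose checksum differs between two checksum lists."""
        with open(new_file, 'r') as f_new, open(old_file, 'r') as f_old:
            new_lines = f_new.read().splitlines()[1:]  # Skip the header line
            old_lines = f_old.read().splitlines()[1:]  # Skip the header line

        # An empty older list cannot be compared; keep the placeholder result.
        if not old_lines:
            return ["<single_file>"]

        # Map each path in the older list to its checksum for direct lookup.
        old_checksums = {}
        for line in old_lines:
            checksum, path = line.split('\t', 1)
            old_checksums.setdefault(path, checksum)

        different_files = []
        for line in new_lines:
            checksum, path = line.split('\t', 1)
            if path in old_checksums and old_checksums[path] != checksum:
                different_files.append(path)

        return different_files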
 
    if __name__ == "__main__":
        # The command-line arguments were already parsed at the top of the script.
 
        if args.config:
            config_filename = args.config + ".conf"
            config = load_config(config_filename)
 
            conf_name = config.get('conf_name')
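            # Normalize the entries because file extensions are compared in lower case.
            allowed_extensions = [ext.strip().lower() for ext in config.get('allowed_extensions').split(',')]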
            data_directory = config.get('data_directory')
            files_to_keep = int(config.get('files_to_keep'))
 
            script_directory = os.path.dirname(os.path.abspath(__file__))
 
            lists_directory = os.path.join(script_directory, conf_name, 'Lists')
            logs_directory = os.path.join(script_directory, conf_name, 'Logs')
 
            os.makedirs(lists_directory, exist_ok=True)
            os.makedirs(logs_directory, exist_ok=True)
 
            # Take the time once so that both strings refer to the same moment.
            now = datetime.datetime.now()
            current_timestamp = now.strftime('%Y%m%d_%H%M%S')
            current_datetime = now.strftime('%Y-%m-%d %H:%M:%S')
 
            checksum_filename = os.path.join(lists_directory, f'{current_timestamp}_checksum.txt')
            log_filename = os.path.join(logs_directory, f'{current_timestamp}_log.txt')
 
            def scan_directory(directory):
                # Open the checksum list once instead of re-opening it per file.
                with open(checksum_filename, 'a') as f:
                    f.write(f"MD5 checksums on: {current_datetime}\n")

                    for root, _, files in os.walk(directory):
                        for file in files:
                            file_path = os.path.join(root, file)
                            # splitext() returns '' for files without an extension.
                            extension = os.path.splitext(file)[1][1:].lower()

                            if extension in allowed_extensions:
                                print(f'Processing {file_path}')

                                # Hash in 1 MiB chunks so that large files are
                                # not loaded into memory at once.
                                md5 = hashlib.md5()
                                with open(file_path, 'rb') as data_file:
                                    for chunk in iter(lambda: data_file.read(1024 * 1024), b''):
                                        md5.update(chunk)

                                f.write(f'{md5.hexdigest()}\t{file_path}\n')
 
            scan_directory(data_directory)
 
            log_entry = f'MD5 checksums have been created and stored in {checksum_filename}.\n'
            with open(log_filename, 'a') as f:
                f.write(log_entry)
 
            delete_old_files(lists_directory, files_to_keep)
            delete_old_files(logs_directory, files_to_keep)
 
            last_checksum_files = glob.glob(os.path.join(lists_directory, '*_checksum.txt'))
            if len(last_checksum_files) >= 2:
                last_checksum_files.sort(reverse=True)
                last_checksum_file1 = last_checksum_files[0]
                last_checksum_file2 = last_checksum_files[1]
 
                different_files = compare_checksum_files(last_checksum_file1, last_checksum_file2)
 
                log_message = ''
 
                if different_files:
                    print('ATTENTION: Different MD5 checksums found! See log file in', log_filename)
                    log_message += 'ATTENTION: The following files have different checksums:\n'
                    for file in different_files:
                        log_message += file + '\n'
                else:
                    log_message += 'INFO: No different MD5 checksums found.\n'
                    print('INFO: No different MD5 checksums found.')
 
                with open(log_filename, 'a') as f:
                    f.write(log_message)
 
            elif len(last_checksum_files) == 1:
                with open(log_filename, 'a') as f:
                    f.write("INFO: Checksums could not be compared because there is currently only one checksum file.\n")
 
                print("INFO: Checksums could not be compared because there is currently only one checksum file.")
 
            print(log_entry)

Configuration file

Here is an example configuration file. You can copy or download it by clicking on the example.conf link in the upper left corner of the code block. Different configuration files can be used for different projects.

example.conf
# example.conf
# Save the configuration files in the same
# folder where the program is located.
[USER_SETTINGS]
 
# The configuration's name.
# A subfolder with this name will be created.
conf_name = Example
 
# File types that cksgen will look for.
# Multiple file types are separated by commas without spaces.
allowed_extensions = jpg,jpeg,dng,md
 
# The actual directory where cksgen looks for files.
data_directory = C:\Users\Username\Path\To\Example
 
# The directory where lists of MD5 checksums are placed.
lists_directory = conf_name\Lists
 
# The directory in which the log files are stored.
logs_directory = conf_name\Logs
 
# Specifies how many checksum files and log files cksgen should keep.
files_to_keep = 10
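
How such a file is read can be sketched with Python's configparser module. The text below is a shortened copy of the example above, embedded as a string so that the sketch runs without a file on disk; the variable names are chosen for this illustration only.

```python
import configparser

# Shortened copy of the example configuration above.
CONFIG_TEXT = r"""
[USER_SETTINGS]
# Comment lines like this one are ignored by the parser.
conf_name = Example
allowed_extensions = jpg,jpeg,dng,md
data_directory = C:\Users\Username\Path\To\Example
files_to_keep = 10
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)
settings = config["USER_SETTINGS"]

# The comma-separated value becomes a list of extensions.
allowed_extensions = settings.get("allowed_extensions").split(",")
files_to_keep = settings.getint("files_to_keep")

print(allowed_extensions)  # ['jpg', 'jpeg', 'dng', 'md']
print(files_to_keep)       # 10
print(settings.get("data_directory"))
```

With the default parser settings, a comment placed on the same line as a value is not stripped and becomes part of the value. Comments therefore belong on their own lines, as in the example file.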

Downloads

  • cksgen.zip: Source code (contains the two files shown above). To run the program, Python needs to be installed on your computer.
  • cksgen_windows_bin.zip: Contains an executable binary cksgen.exe for Microsoft Windows. Python does not need to be installed on the system (but you can also compile the source code yourself with pyinstaller --onefile cksgen.py).
code/python/cksgen.txt · Last modified: 2023-08-20 18:06 by Helmut Kaczmarek