This small command-line program, written in Python, is an advanced port of my PowerShell script for ensuring the data integrity of my photo archive. It identifies corrupted .jpg, .jpeg, .dng and .cr2 files by generating lists of MD5 checksums and comparing them with the lists from previous checks. cksgen can also be used for any other file type.
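As a minimal sketch of the checksum step described above, the following computes the MD5 checksum of a single file. The helper name md5_of_file and the chunked reading are assumptions made for illustration (the program itself may read files differently); the throwaway sample.jpg only stands in for a real photo.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks.

    Hypothetical helper for illustration; chunked reading keeps memory
    usage low for large raw files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with a temporary file instead of a real photo.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.jpg")
    with open(sample, "wb") as handle:
        handle.write(b"hello")
    checksum = md5_of_file(sample)
    print(checksum)
```

The digest is always 32 hexadecimal characters, regardless of the file size.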
This program is primarily intended for archived data that no longer changes. You can also use it for data that is edited from time to time, but keep in mind that the MD5 checksum of a file changes with any modification, for example when the metadata of a JPG file is edited. The message ATTENTION: Different MD5 checksums found on the command prompt therefore does not necessarily indicate a corrupted file.
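To illustrate why an edit to the metadata alone is enough to trigger the message, here is a small sketch. The two byte strings are made-up stand-ins, not real JPEG data: they differ only in a pretend comment field, while the pretend pixel data is identical.

```python
import hashlib

# Made-up stand-ins for a photo before and after a metadata edit.
original = b"\xff\xd8 pretend header, comment=holiday 2019 | pixel data"
retagged = b"\xff\xd8 pretend header, comment=holiday 2020 | pixel data"

original_md5 = hashlib.md5(original).hexdigest()
retagged_md5 = hashlib.md5(retagged).hexdigest()

# A single changed byte produces a different checksum.
print(original_md5)
print(retagged_md5)
print("checksums differ:", original_md5 != retagged_md5)
```

A checksum comparison cannot tell an intentional edit from corruption; it only reports that the bytes are no longer the same.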
cksgen is Free Software licensed under the GNU General Public License (GPL), Version 3.
Checksum lists are saved as yyyyMMdd_HHmmss_checksum.txt.
Log files are saved as yyyyMMdd_HHmmss_log.txt.
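The file name pattern corresponds to the strftime format '%Y%m%d_%H%M%S' used in the source code. The sketch below uses a fixed, made-up point in time and made-up file names to show the resulting names, and that sorting such names in reverse puts the newest list first.

```python
import datetime

# Fixed example time (an assumption for illustration; cksgen uses the current time).
moment = datetime.datetime(2024, 3, 9, 14, 5, 7)
stamp = moment.strftime("%Y%m%d_%H%M%S")

checksum_name = f"{stamp}_checksum.txt"
log_name = f"{stamp}_log.txt"
print(checksum_name)
print(log_name)

# Because the timestamp is zero-padded and ordered from year to second,
# plain string sorting is also chronological sorting.
names = [
    "20240309_140507_checksum.txt",
    "20231231_235959_checksum.txt",
    "20240101_000000_checksum.txt",
]
newest_first = sorted(names, reverse=True)
print(newest_first[:2])  # the two lists that would be compared
```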
General Usage:
cksgen [-h] [-c CONFIG] [-e]
If you haven't compiled the cksgen.py file into an executable file (e.g. cksgen.exe on a Windows system; you can do so with pyinstaller --onefile cksgen.py) and haven't registered it in the system environment variables, then:
python /path/to/cksgen.py [-h] [-c CONFIG] [-e]
Help:
cksgen -h
See example configuration file:
cksgen -e
Example usage with a configuration file photos.conf:
cksgen -c photos
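How the command line maps to a configuration file can be sketched with argparse. This is a simplified illustration, not the complete program: the argument list is passed in explicitly so the snippet runs without a real command line.

```python
import argparse

parser = argparse.ArgumentParser(
    prog="cksgen",
    description="Generate MD5 checksums for files in a directory.",
)
parser.add_argument("-c", "--config",
                    help="Configuration file name without the .conf extension.")
parser.add_argument("-e", "--exampleconf", action="store_true",
                    help="Display example configuration.")

# Equivalent to running: cksgen -c photos
args = parser.parse_args(["-c", "photos"])

# The .conf extension is appended to the given name.
config_filename = args.config + ".conf"
print(config_filename)
```

The configuration name is therefore given without the extension: -c photos, not -c photos.conf.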
Here is the source code of cksgen (cksgen.py):
"""
cksgen -- Generate and compare MD5 checksum lists with Python
=============================================================

Author: Helmut Kaczmarek <email@helmutkaczmarek.de>
Link: https://wiki.helmutkaczmarek.de/code:python:cksgen
License: GNU General Public License (GPL), Version 3
(https://www.gnu.org/licenses/gpl-3.0.html)

General Usage:

    cksgen [-h] [-c CONFIG] [-e]

If you haven't compiled the cksgen.py file into an executable file
(e.g. cksgen.exe on a Windows system; you can do so with
"pyinstaller --onefile cksgen.py") and haven't registered it in the
system environment variables, then:

    python /path/to/cksgen.py [-h] [-c CONFIG] [-e]

Help:

    cksgen -h

See example configuration file:

    cksgen -e

Example usage with a configuration file "photos.conf":

    cksgen -c photos
"""

import argparse
import configparser
import datetime
import glob
import hashlib
import os
import sys

EXAMPLE_CONFIG = """
# example.conf
# Save the configuration files in the same
# folder where the program is located.
[USER_SETTINGS]
conf_name = Example
# Extensions separated by commas, without spaces
allowed_extensions = jpg,jpeg,dng,cr2
data_directory = C:\\Users\\Username\\Path\\To\\Example
lists_directory = conf_name\\Lists
logs_directory = conf_name\\Logs
files_to_keep = 10
"""


def load_config(config_name):
    """Return the USER_SETTINGS section of <config_name>.conf."""
    config = configparser.ConfigParser()
    if config_name == 'example':
        # The built-in example can be used without a file on disk.
        config.read_string(EXAMPLE_CONFIG)
    elif not config.read(config_name + '.conf'):
        # ConfigParser.read() silently skips missing files, so check explicitly.
        sys.exit(f'ERROR: Configuration file {config_name}.conf not found.')
    return config['USER_SETTINGS']


def md5_of_file(file_path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()


def scan_directory(directory, allowed_extensions, checksum_filename, current_datetime):
    """Write the checksums of all matching files below directory to a list."""
    with open(checksum_filename, 'a') as f:
        f.write(f'MD5 checksums on: {current_datetime}\n')
        for root, _, files in os.walk(directory):
            for file in files:
                extension = os.path.splitext(file)[1][1:].lower()
                if extension in allowed_extensions:
                    file_path = os.path.join(root, file)
                    print(f'Processing {file_path}')
                    f.write(f'{md5_of_file(file_path)}\t{file_path}\n')


def delete_old_files(directory, files_to_keep):
    """Keep only the newest files_to_keep files in directory."""
    all_files = glob.glob(os.path.join(directory, '*.*'))
    all_files.sort(key=os.path.getmtime, reverse=True)
    for file in all_files[files_to_keep:]:
        os.remove(file)


def read_checksum_list(path):
    """Return a dict mapping file path to checksum, skipping the header line."""
    checksums = {}
    with open(path, 'r') as f:
        for line in f.read().splitlines()[1:]:
            if line:
                checksum, file_path = line.split('\t', 1)
                checksums[file_path] = checksum
    return checksums


def compare_checksum_files(file1, file2):
    """Return the files listed in both checksum lists whose checksums differ."""
    checksums1 = read_checksum_list(file1)
    checksums2 = read_checksum_list(file2)
    if not checksums2:
        return ['<single_file>']
    return [file_path for file_path, checksum in checksums1.items()
            if file_path in checksums2 and checksums2[file_path] != checksum]


def main():
    parser = argparse.ArgumentParser(
        description='Generate MD5 checksums for files in a directory.')
    parser.add_argument('-c', '-conf', '--config',
                        help='Configuration file name without extension.')
    parser.add_argument('-e', '-ex', '--exampleconf', action='store_true',
                        help='Display example configuration.')
    args = parser.parse_args()

    if args.exampleconf:
        print(EXAMPLE_CONFIG)
        sys.exit(0)
    if not args.config:
        parser.print_help()
        sys.exit(1)

    print('Starting cksgen...')
    config = load_config(args.config)
    conf_name = config.get('conf_name')
    allowed_extensions = [ext.strip().lower()
                          for ext in config.get('allowed_extensions').split(',')]
    data_directory = config.get('data_directory')
    files_to_keep = int(config.get('files_to_keep'))

    # Lists and logs are stored in a subfolder named after the configuration.
    script_directory = os.path.dirname(os.path.abspath(__file__))
    lists_directory = os.path.join(script_directory, conf_name, 'Lists')
    logs_directory = os.path.join(script_directory, conf_name, 'Logs')
    os.makedirs(lists_directory, exist_ok=True)
    os.makedirs(logs_directory, exist_ok=True)

    now = datetime.datetime.now()
    current_timestamp = now.strftime('%Y%m%d_%H%M%S')
    current_datetime = now.strftime('%Y-%m-%d %H:%M:%S')
    checksum_filename = os.path.join(lists_directory, f'{current_timestamp}_checksum.txt')
    log_filename = os.path.join(logs_directory, f'{current_timestamp}_log.txt')

    scan_directory(data_directory, allowed_extensions, checksum_filename, current_datetime)
    log_entry = f'MD5 checksums have been created and stored in {checksum_filename}.\n'
    with open(log_filename, 'a') as f:
        f.write(log_entry)

    delete_old_files(lists_directory, files_to_keep)
    delete_old_files(logs_directory, files_to_keep)

    # Compare the two most recent checksum lists (newest first).
    last_checksum_files = sorted(
        glob.glob(os.path.join(lists_directory, '*_checksum.txt')), reverse=True)
    if len(last_checksum_files) >= 2:
        different_files = compare_checksum_files(last_checksum_files[0],
                                                 last_checksum_files[1])
        if different_files:
            print('ATTENTION: Different MD5 checksums found! See log file in', log_filename)
            log_message = 'ATTENTION: The following files have different checksums:\n'
            log_message += ''.join(file + '\n' for file in different_files)
        else:
            log_message = 'INFO: No different MD5 checksums found.\n'
            print('INFO: No different MD5 checksums found.')
        with open(log_filename, 'a') as f:
            f.write(log_message)
    elif len(last_checksum_files) == 1:
        message = ('INFO: Checksums could not be compared because there is '
                   'currently only one checksum file.')
        with open(log_filename, 'a') as f:
            f.write(message + '\n')
        print(message)

    print(log_entry)


if __name__ == '__main__':
    main()
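The comparison step can be tried in isolation on two small in-memory lists. This is a simplified sketch under stated assumptions: the function names parse_checksum_list and changed_files are hypothetical, and the checksums and paths are shortened placeholders. It relies only on the list format visible in the source code, a header line followed by one checksum<TAB>path entry per line.

```python
def parse_checksum_list(text):
    """Map file path -> checksum, skipping the header line (hypothetical helper)."""
    entries = {}
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        checksum, path = line.split("\t", 1)
        entries[path] = checksum
    return entries


def changed_files(newer_text, older_text):
    """Return paths present in both lists whose checksums differ."""
    newer = parse_checksum_list(newer_text)
    older = parse_checksum_list(older_text)
    return [path for path, checksum in newer.items()
            if path in older and older[path] != checksum]


# Placeholder checksums ("aaa", "bbb", ...) instead of real MD5 digests.
older_list = (
    "MD5 checksums on: 2024-01-01 10:00:00\n"
    "aaa\tC:/Photos/a.jpg\n"
    "bbb\tC:/Photos/b.jpg\n"
)
newer_list = (
    "MD5 checksums on: 2024-02-01 10:00:00\n"
    "aaa\tC:/Photos/a.jpg\n"
    "ccc\tC:/Photos/b.jpg\n"
    "ddd\tC:/Photos/new.jpg\n"
)

result = changed_files(newer_list, older_list)
print(result)  # ['C:/Photos/b.jpg']
```

Files that appear in only one of the two lists, such as new.jpg here, are not reported; only files present in both lists are compared.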
Here is an example configuration file (example.conf). Different configuration files can be used for different projects.
# example.conf
# Save the configuration files in the same
# folder where the program is located.

[USER_SETTINGS]

# The configuration's name.
# A subfolder with this name will be created.
conf_name = Example

# File types that cksgen will look for.
# Multiple file types are separated by commas without spaces.
allowed_extensions = jpg,jpeg,dng,md

# The actual directory where cksgen looks for files.
data_directory = C:\Users\Username\Path\To\Example

# The directory where lists of MD5 checksums are placed.
lists_directory = conf_name\Lists

# The directory in which the log files are stored.
logs_directory = conf_name\Logs

# Specifies how many checksum files and log files cksgen should keep.
files_to_keep = 10
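A configuration like this can be read with Python's configparser module. The following is a minimal sketch that parses a shortened copy of the example from a string rather than from a file; the variable names are chosen for illustration.

```python
import configparser

# Shortened copy of the example configuration. The backslashes are doubled
# only because this is a Python string literal; the file itself uses single ones.
SAMPLE = """
[USER_SETTINGS]
conf_name = Example
allowed_extensions = jpg,jpeg,dng,md
data_directory = C:\\Users\\Username\\Path\\To\\Example
files_to_keep = 10
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)
settings = config["USER_SETTINGS"]

# The comma-separated extensions become a list; stripping tolerates stray spaces.
extensions = [ext.strip().lower() for ext in settings["allowed_extensions"].split(",")]
files_to_keep = settings.getint("files_to_keep")

print(settings["conf_name"])
print(extensions)
print(files_to_keep)
```

Note that configparser does not treat text after a value as a comment by default, so comments are safest on their own lines, as in the example file.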
A compiled cksgen.exe is available for Microsoft Windows. Python does not need to be installed on the system (but you can also compile the source code yourself with pyinstaller --onefile cksgen.py).