cksgen – Generate and compare MD5 checksum lists with Python
This little command line program written in Python is an advanced port of my PowerShell script for ensuring the data integrity of my photo archive. It identifies corrupted .jpg, .jpeg, .dng and .cr2 files by generating lists of MD5 checksums and comparing them with the list from the previous run. Of course, cksgen can also be used for any other file types.
This program is primarily intended for archived data that no longer changes. You can also use it for data that is edited from time to time, but keep in mind that the MD5 checksum of a file already changes if, for example, only the metadata of a JPG file is edited. The message ATTENTION: Different MD5 checksums found on the command prompt therefore does not necessarily indicate a corrupted file.
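To illustrate why a metadata edit alone is enough to trigger that message, here is a minimal sketch using Python's hashlib. The byte strings are made-up stand-ins for file contents, not real JPEG data, and the helper name md5_hex is hypothetical (it is not part of cksgen).

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hex digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Made-up stand-ins for a photo before and after a metadata edit:
# the "image" part is identical, only the embedded title differs.
before = b"title=Holiday 2019;" + b"\x00" * 64
after = b"title=Holiday 2020;" + b"\x00" * 64

print(md5_hex(before))
print(md5_hex(after))
print("identical" if md5_hex(before) == md5_hex(after) else "different")
```

A single changed character yields a different digest, so the checksum alone cannot tell an intentional edit from corruption.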
cksgen is Free Software licensed under the GNU General Public License (GPL), Version 3.
Features
- Determines the MD5 checksums of files and writes them to the file yyyyMMdd_HHmmss_checksum.txt.
- Compares the checksums of the last two checksum files and outputs a corresponding message in the command prompt and in the log file yyyyMMdd_HHmmss_log.txt.
- Allowed file extensions can be specified.
- The number of checksum files and log files to be kept can be specified.
- The form of the time stamp in the file name of the log files and checksum files can be adjusted.
- The folder names for the log files and checksum files can also be specified.
- The command prompt will let you know which file is being processed.
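The comparison step from the list above can be sketched as follows. This is an illustration under stated assumptions, not the program's own code: it assumes the list format cksgen writes (one header line, then one "checksum, tab, path" pair per line), and the function names parse_checksum_list and changed_files as well as the sample data are hypothetical.

```python
def parse_checksum_list(text):
    """Map each file path to its checksum, skipping the header line."""
    entries = {}
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        checksum, path = line.split("\t", 1)
        entries[path] = checksum
    return entries


def changed_files(old_text, new_text):
    """Return paths present in both lists whose checksums differ."""
    old = parse_checksum_list(old_text)
    new = parse_checksum_list(new_text)
    return sorted(path for path in new if path in old and old[path] != new[path])


# Made-up sample lists; the checksums are placeholders, not real digests.
old_list = (
    "MD5 checksums on: 2024-01-01 12:00:00\n"
    "aaaa\tphotos/a.jpg\n"
    "bbbb\tphotos/b.jpg\n"
)
new_list = (
    "MD5 checksums on: 2024-02-01 12:00:00\n"
    "aaaa\tphotos/a.jpg\n"
    "cccc\tphotos/b.jpg\n"
    "dddd\tphotos/new.jpg\n"
)

print(changed_files(old_list, new_list))
```

Files that appear in only one of the two lists are ignored in this sketch; only paths present in both lists are compared.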
Usage
General Usage:
cksgen [-h] [-c CONFIG] [-e]
If you haven't compiled the cksgen.py file into an executable file (e.g. cksgen.exe on a Windows system; you can do so with pyinstaller --onefile cksgen.py) and haven't registered it in the system environment variables, then:
python /path/to/cksgen.py [-h] [-c CONFIG] [-e]
Help:
cksgen -h
See example configuration file:
cksgen -e
Example usage with a configuration file photos.conf:
cksgen -c photos
Source Code
Here is the source code of cksgen. You can copy or download it by clicking on the cksgen.py link in the upper left corner of the code block.
- cksgen.py
"""
cksgen -- Generate and compare MD5 checksum lists with Python
=============================================================

Author: Helmut Kaczmarek <email@helmutkaczmarek.de>
Link: https://wiki.helmutkaczmarek.de/code:python:cksgen
License: GNU General Public License (GPL), Version 3
         (https://www.gnu.org/licenses/gpl-3.0.html)

General Usage:

    cksgen [-h] [-c CONFIG] [-e]

If you haven't compiled the cksgen.py file into an executable file
(e.g. cksgen.exe on a Windows system; you can do so with
"pyinstaller --onefile cksgen.py") and haven't registered it in the
system environment variables, then:

    python /path/to/cksgen.py [-h] [-c CONFIG] [-e]

Help:

    cksgen -h

See example configuration file:

    cksgen -e

Example usage with a configuration file "photos.conf":

    cksgen -c photos
"""

import os
import sys
import hashlib
import glob
import datetime
import configparser
import argparse

EXAMPLE_CONFIG = """
# example.conf
# Save the configuration files in the same
# folder where the program is located.
[USER_SETTINGS]
conf_name = Example
# Extensions separated by commas, without spaces
allowed_extensions = jpg,jpeg,dng,cr2
data_directory = C:\\Users\\Username\\Path\\To\\Example
lists_directory = conf_name\\Lists
logs_directory = conf_name\\Logs
files_to_keep = 10
"""


def load_config(config_name):
    """Return the USER_SETTINGS section of <config_name>.conf.

    The special name "example" loads the built-in EXAMPLE_CONFIG instead.
    """
    config = configparser.ConfigParser()
    if config_name == 'example':
        config.read_string(EXAMPLE_CONFIG)
    else:
        with open(config_name + '.conf', 'r') as config_file:
            config.read_file(config_file)
    return config['USER_SETTINGS']


def delete_old_files(directory, files_to_keep):
    """Delete all but the newest files_to_keep files in directory."""
    all_files = glob.glob(os.path.join(directory, '*.*'))
    all_files.sort(key=os.path.getmtime, reverse=True)
    for file in all_files[files_to_keep:]:
        os.remove(file)


def compare_checksum_files(file1, file2):
    """Return the paths whose checksum differs between the two lists."""
    with open(file1, 'r') as f1, open(file2, 'r') as f2:
        checksums1 = f1.read().splitlines()[1:]  # Skip the header line
        checksums2 = f2.read().splitlines()[1:]  # Skip the header line
    if not checksums2:
        return ["<single_file>"]
    # Map each path in the older list to its checksum for direct lookup.
    previous = {}
    for line in checksums2:
        checksum, path = line.split('\t', 1)
        previous[path] = checksum
    different_files = []
    for line in checksums1:
        checksum, path = line.split('\t', 1)
        if path in previous and previous[path] != checksum:
            different_files.append(path)
    return different_files


def md5_of_file(file_path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()


def scan_directory(directory, allowed_extensions, checksum_filename, current_datetime):
    """Write the MD5 checksum of every matching file below directory."""
    with open(checksum_filename, 'a') as f:
        f.write(f"MD5 checksums on: {current_datetime}\n")
        for root, _, files in os.walk(directory):
            for file in files:
                file_path = os.path.join(root, file)
                extension = file_path.split('.')[-1].lower()
                if extension in allowed_extensions:
                    md5_checksum = md5_of_file(file_path)
                    f.write(f'{md5_checksum}\t{file_path}\n')
                    print(f'Processing {file_path}')


def main():
    parser = argparse.ArgumentParser(
        description="Generate MD5 checksums for files in a directory.")
    parser.add_argument("-c", "-conf", "--config",
                        help="Configuration file name without extension.")
    parser.add_argument("-e", "-ex", "--exampleconf", action="store_true",
                        help="Display example configuration.")
    args = parser.parse_args()

    if args.exampleconf:
        print(EXAMPLE_CONFIG)
        sys.exit(0)
    if not args.config:
        return

    print("Starting cksgen...")
    config = load_config(args.config)
    conf_name = config.get('conf_name')
    allowed_extensions = [ext.strip().lower()
                          for ext in config.get('allowed_extensions').split(',')]
    data_directory = config.get('data_directory')
    files_to_keep = int(config.get('files_to_keep'))

    # The list and log folders are created next to the script as
    # <conf_name>/Lists and <conf_name>/Logs.
    script_directory = os.path.dirname(os.path.abspath(__file__))
    lists_directory = os.path.join(script_directory, conf_name, 'Lists')
    logs_directory = os.path.join(script_directory, conf_name, 'Logs')
    os.makedirs(lists_directory, exist_ok=True)
    os.makedirs(logs_directory, exist_ok=True)

    now = datetime.datetime.now()
    current_timestamp = now.strftime('%Y%m%d_%H%M%S')
    current_datetime = now.strftime('%Y-%m-%d %H:%M:%S')
    checksum_filename = os.path.join(lists_directory, f'{current_timestamp}_checksum.txt')
    log_filename = os.path.join(logs_directory, f'{current_timestamp}_log.txt')

    scan_directory(data_directory, allowed_extensions, checksum_filename, current_datetime)
    log_entry = f'MD5 checksums have been created and stored in {checksum_filename}.\n'
    with open(log_filename, 'a') as f:
        f.write(log_entry)

    delete_old_files(lists_directory, files_to_keep)
    delete_old_files(logs_directory, files_to_keep)

    # Compare the two most recent checksum lists (names sort chronologically).
    last_checksum_files = glob.glob(os.path.join(lists_directory, '*_checksum.txt'))
    if len(last_checksum_files) >= 2:
        last_checksum_files.sort(reverse=True)
        different_files = compare_checksum_files(last_checksum_files[0],
                                                 last_checksum_files[1])
        if different_files:
            print('ATTENTION: Different MD5 checksums found! See log file in', log_filename)
            log_message = 'ATTENTION: The following files have different checksums:\n'
            for file in different_files:
                log_message += file + '\n'
        else:
            log_message = 'INFO: No different MD5 checksums found.\n'
            print('INFO: No different MD5 checksums found.')
        with open(log_filename, 'a') as f:
            f.write(log_message)
    elif len(last_checksum_files) == 1:
        message = ("INFO: Checksums could not be compared because there is "
                   "currently only one checksum file.")
        with open(log_filename, 'a') as f:
            f.write(message + "\n")
        print(message)
    print(log_entry)


if __name__ == "__main__":
    main()
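As a side note on the clean-up step: cksgen itself selects old files by modification time (os.path.getmtime). Because the file names start with a sortable yyyyMMdd_HHmmss time stamp, the same selection can also be sketched on the names alone. The function name select_files_to_delete and the sample names below are hypothetical and only serve as an illustration.

```python
def select_files_to_delete(file_names, files_to_keep):
    """Return the names that fall outside the newest files_to_keep entries.

    Assumes every name starts with a yyyyMMdd_HHmmss time stamp, so that
    sorting the names also sorts them chronologically.
    """
    newest_first = sorted(file_names, reverse=True)
    return newest_first[files_to_keep:]


# Made-up file names in arbitrary order.
names = [
    "20240101_120000_checksum.txt",
    "20240301_120000_checksum.txt",
    "20240201_120000_checksum.txt",
]

print(select_files_to_delete(names, 2))
```

With files_to_keep set to 2, only the oldest of the three sample names is returned for deletion.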
Configuration file
Here is an example configuration file. You can copy or download it by clicking on the example.conf link in the upper left corner of the code block. Different configuration files can be used for different projects.
- example.conf
# example.conf
# Save the configuration files in the same
# folder where the program is located.

[USER_SETTINGS]

# The configuration's name.
# A subfolder with this name will be created.
conf_name = Example

# File types that cksgen will look for.
# Multiple file types are separated by commas without spaces.
allowed_extensions = jpg,jpeg,dng,md

# The actual directory where cksgen looks for files.
data_directory = C:\Users\Username\Path\To\Example

# The directory where lists of MD5 checksums are placed.
lists_directory = conf_name\Lists

# The directory in which the log files are stored.
logs_directory = conf_name\Logs

# Specifies how many checksum files and log files cksgen should keep.
files_to_keep = 10
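A minimal sketch of how such a file can be read with Python's configparser module. To keep it runnable without a file on disk, a shortened copy of the example is inlined as a string; the names CONFIG_TEXT and read_settings are hypothetical and not taken from cksgen.

```python
import configparser

# Assumption: a shortened copy of the example configuration, inlined as a
# raw string so the backslashes in the Windows path are kept literally.
CONFIG_TEXT = r"""
[USER_SETTINGS]
conf_name = Example
allowed_extensions = jpg,jpeg,dng,md
data_directory = C:\Users\Username\Path\To\Example
files_to_keep = 10
"""


def read_settings(text):
    """Parse the USER_SETTINGS section into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["USER_SETTINGS"]
    return {
        "conf_name": section["conf_name"],
        # Split the comma-separated list and normalise each extension.
        "allowed_extensions": [
            ext.strip().lower() for ext in section["allowed_extensions"].split(",")
        ],
        "data_directory": section["data_directory"],
        "files_to_keep": section.getint("files_to_keep"),
    }


settings = read_settings(CONFIG_TEXT)
print(settings["allowed_extensions"])
print(settings["files_to_keep"])
```

Note that configparser does not strip inline comments by default, so a comment placed after a value on the same line would become part of that value. Keeping comments on their own lines, as in the example file, avoids this.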
Downloads
- cksgen.zip: Source code (contains the two files shown above). To run the program, Python needs to be installed on your computer.
- cksgen_windows_bin.zip: Contains an executable binary cksgen.exe for Microsoft Windows. Python does not need to be installed on the system (but you can also compile the source code yourself with pyinstaller --onefile cksgen.py).