Calculate an MD5 Checksum of a Directory in Linux
During our daily use of Linux, we may want to check if there are any changes to any of the files in a directory. Or we might want to confirm that the contents of one directory are the same as those of another directory on a different location, disk, or system. In this tutorial we will learn how to compute an MD5 checksum of an entire directory tree on Linux. We will compute a single hash value of all directory contents for comparison purposes.
Get the List of All Files in a Directory Tree
To find out the collective hash of all files in a directory tree, we first need to get a list of these files. We will use the find command for this activity.
Let's run the tree command to see our example directory structure:
.
├── file1.png
├── folder1
│   ├── file2.jpg
│   └── folder3
│       └── file3.txt
└── folder2
    └── file4.sh
As we can see, we have files in multiple subdirectories. Now we can use the find command with the -type f option to obtain a list of all regular files in our directory and its subdirectories, excluding folders and symbolic links:
find . -type f
./folder2/file4.sh
./folder1/folder3/file3.txt
./folder1/file2.jpg
./file1.png
Now we can get a list of all files in a directory and its subdirectories by running a single command.
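To follow along, the example tree can be recreated in a temporary directory. This is only a sketch: the file contents are placeholders invented for illustration, and the extra symbolic link and empty folder are added to show that -type f skips both.

```shell
# Build a throwaway copy of the example tree (contents are made up).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/folder1/folder3" "$demo_dir/folder2"
printf 'png placeholder\n' > "$demo_dir/file1.png"
printf 'jpg placeholder\n' > "$demo_dir/folder1/file2.jpg"
: > "$demo_dir/folder1/folder3/file3.txt"   # empty file
printf '#!/bin/sh\necho hello\n' > "$demo_dir/folder2/file4.sh"

# Extras that are NOT regular files, so find -type f should ignore them.
ln -s file1.png "$demo_dir/link-to-file1"
mkdir "$demo_dir/empty-folder"

# List regular files only; the order is whatever the file system returns.
file_list=$(cd "$demo_dir" && find . -type f)
printf '%s\n' "$file_list"
```

The listing should contain the four regular files and neither link-to-file1 nor empty-folder.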
Sorting and the "Locale Problem"
Now that we can get a list of all our files, our next steps are:
1. Run the md5sum command on each file in that list.
2. Create a string containing the list of file paths along with their hashes.
3. Finally, run md5sum on the string we just created to get a single hash value.
So if anything in our directory changes, including file paths or file contents, the hash will also change. But we have a problem with this approach. The find command does not sort the output by default. For the sake of efficiency, the find command simply prints the individual results it gets as it traverses the file system. So the order can change between different systems, locations, or even different runs. As a result of this, the hash value will change, even if the two directories are exactly the same.
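The effect of ordering can be seen without touching the file system. The sketch below hashes the same two lines in two different orders; the paths are simply taken from the example tree, and no files need to exist for it to run.

```shell
# Same lines, different order: the input bytes differ, so the hashes differ.
hash_a=$(printf '%s\n' ./file1.png ./folder1/file2.jpg | md5sum | cut -d' ' -f1)
hash_b=$(printf '%s\n' ./folder1/file2.jpg ./file1.png | md5sum | cut -d' ' -f1)

echo "order 1: $hash_a"
echo "order 2: $hash_b"
```

The two printed hashes are different even though both listings name exactly the same files.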
We can fix this by sorting our find results using the sort command:
find . -type f | sort
./file1.png
./folder1/file2.jpg
./folder1/folder3/file3.txt
./folder2/file4.sh
But we are still missing something. Sorting is more complex than it seems: the collation order of letters, digits, and punctuation differs from locale to locale, so two systems with different locale settings can sort the same file list differently. We can solve this problem by overriding the locale with the LC_ALL environment variable:
find . -type f | LC_ALL=C sort
./file1.png
./folder1/file2.jpg
./folder1/folder3/file3.txt
./folder2/file4.sh
In the C locale, sort compares strings byte by byte, so the order is the same on every system and sorting inconsistencies are eliminated.
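A small sketch makes the locale effect visible. The file names below are made up for illustration. In the C locale, uppercase letters sort before lowercase ones because their byte values are smaller; the second sort uses whatever locale the current shell has, so its result may or may not match.

```shell
# Hypothetical file names that differ only in letter case.
sample=$(printf '%s\n' ./b.txt ./A.txt ./a.txt ./B.txt)

# Byte-wise order: the same on every system.
c_sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# ./A.txt
# ./B.txt
# ./a.txt
# ./b.txt

# Order under the current locale: depends on the system configuration.
locale_sorted=$(printf '%s\n' "$sample" | sort)
printf '%s\n' "$locale_sorted"
```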
Put It All Together
We can use the -exec parameter of the find command to run the md5sum command on each file found:
find . -type f -exec md5sum {} +
7d2186aaeed78b24f00f782f2346e5f9  ./folder2/file4.sh
d41d8cd98f00b204e9800998ecf8427e  ./folder1/folder3/file3.txt
c6aa7ce9967680b77ea7e72d96949303  ./folder1/file2.jpg
46ffe26d56fe5164570ad43cc79b59d3  ./file1.png
We use curly braces ({}) to specify where the file names will be passed to the md5sum command as arguments. We also added the plus sign (+) to the end so that find passes many files to each md5sum command (md5sum file1 file2 file3...) and starts as few md5sum processes as possible, instead of running a separate process for each file.
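For comparison, find also accepts \; as the terminator, which runs one md5sum process per file. The sketch below, using a temporary directory and made-up file names, checks that both forms produce the same lines; only the number of processes differs.

```shell
# Throwaway directory with two small files (names and contents are made up).
work=$(mktemp -d)
printf 'one\n' > "$work/a.txt"
printf 'two\n' > "$work/b.txt"

# \; runs a separate md5sum process for every file.
per_file=$(cd "$work" && find . -type f -exec md5sum {} \; | LC_ALL=C sort)

# + batches the files into as few md5sum processes as possible.
batched=$(cd "$work" && find . -type f -exec md5sum {} + | LC_ALL=C sort)

if [ "$per_file" = "$batched" ]; then
    echo "both forms print the same hashes"
fi
```

The sort here only makes the two outputs comparable regardless of traversal order.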
However, we still need to address the sorting issue. Let's combine everything with proper sorting:
find . -type f | LC_ALL=C sort | xargs md5sum
46ffe26d56fe5164570ad43cc79b59d3  ./file1.png
c6aa7ce9967680b77ea7e72d96949303  ./folder1/file2.jpg
d41d8cd98f00b204e9800998ecf8427e  ./folder1/folder3/file3.txt
7d2186aaeed78b24f00f782f2346e5f9  ./folder2/file4.sh
To get the final hash, we can create the string containing all file paths and corresponding hash values, then pass it to the md5sum command:
find . -type f | LC_ALL=C sort | xargs md5sum | md5sum
1d0e4d4ed4e4f3c3d0d9a3900b13f3e7 -
The final hash of our directory tree is 1d0e4d4ed4e4f3c3d0d9a3900b13f3e7.
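One caveat: by default xargs splits its input on whitespace, so the pipeline above misreads file names that contain spaces or newlines. A possible variant, assuming GNU find, sort, and xargs, separates names with NUL bytes instead. The sketch wraps it in a helper called dir_md5, which is a hypothetical name chosen for this example, and changes into the directory first so that two trees at different locations yield the same relative paths.

```shell
# dir_md5 DIR: print one MD5 hash for all regular files under DIR.
# Hypothetical helper; assumes GNU tools for -print0, sort -z, xargs -0 -r.
dir_md5() {
    (
        cd "$1" || exit 1
        find . -type f -print0 \
            | LC_ALL=C sort -z \
            | xargs -0 -r md5sum \
            | md5sum \
            | cut -d' ' -f1
    )
}

# Demo: two temporary trees with identical contents, including a name with a space.
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir "$src/sub" "$dst/sub"
printf 'data\n' > "$src/sub/my file.txt"
printf 'data\n' > "$dst/sub/my file.txt"

same_before=no
[ "$(dir_md5 "$src")" = "$(dir_md5 "$dst")" ] && same_before=yes

# Change one file and compare again.
printf 'changed\n' > "$dst/sub/my file.txt"
same_after=no
[ "$(dir_md5 "$src")" = "$(dir_md5 "$dst")" ] && same_after=yes

echo "before: $same_before, after: $same_after"
# prints "before: yes, after: no"
```

The -r option tells GNU xargs not to run md5sum at all when the directory contains no files.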
Alternative Approach Using tar
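Another method is to stream the directory through tar and hash the resulting archive. A plain tar archive is not deterministic on its own: entries are stored in the order the file system returns them, and the archive records metadata such as modification times, owners, and permissions. With GNU tar 1.28 or later, we can sort the entries by name and pin the timestamps and ownership:
tar --sort=name --mtime=@0 --owner=0 --group=0 --numeric-owner -cf - . | md5sum
With these options, the hash depends on the file paths, the file contents, and the permission bits. It is not comparable with the hash from the find-based pipeline, and unlike that pipeline it also changes when a file's permissions change.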
Conclusion
In this tutorial, we learned how to compute an MD5 checksum of an entire directory tree on Linux. We used the find and md5sum commands with proper sorting to ensure consistent results across different systems. The key is using LC_ALL=C sort to eliminate locale-based sorting variations, so that the same file paths and contents always produce the same hash value.
