Updated On : Feb-21,2021 Tags file-comparison, directories-comparison
filecmp - Compare Files and Directories using Python

Python developers often need to compare files in the same directory or in different directories when working on tasks such as data analysis or machine learning, and many end up writing their own comparison routines. To save that effort, Python's standard library includes the filecmp module, which compares files and directories through a small, easy-to-use API. The module provides functions to compare two files, a list of files shared by two directories, and whole directory trees. In this tutorial, we'll explain how to use filecmp for these kinds of comparisons with simple, easy-to-understand examples.

We have created two directory trees, directory1 and directory2, whose files are used in the comparison examples throughout this tutorial.

directory1


                                       directory1
                                           |
                   -----------------------------------------------
                   |               |              |              |
                directory1_1    file2.txt    original.txt    modified.txt
                     |
                ------------------------------------------------------
                |                  |                  |
           directory1_1_1    original1_1.txt     modified1_1.txt
                |
          -------------------------
          |                       |
     original1_1_1.txt   modified1_1_1.txt

directory2


                                       directory2
                                           |
                   -----------------------------------------------
                   |               |              |              |
                directory1_1    file2.txt    original.txt    modified.txt
                     |
                ----------------------------------------------------------
                |                  |                  |                  |
           directory1_1_1    original.txt     modified1_1.txt         file2.txt
                |
          -------------------------
          |                       |
     original.txt          modified.txt

Both directories have almost the same structure: some subdirectories hold files with the same names, while others hold files with different names. All original*.txt files share the same contents, and so do all modified*.txt files. The contents of the files are shown below; the text is adapted from the Zen of Python (import this).

In [1]:
!cat directory1/original.txt
1. Readability counts.
2. Special cases aren't special enough to break the rules.
3. Errors should never pass silently.
4. In the face of ambiguity, refuse the temptation to guess.
5. There should be one-- and preferably only one --obvious way to do it.
6. Although that way may not be obvious at first unless you're Dutch.
7. Now is better than never.
8. Although never is often better than *right* now.
9. If the implementation is hard to explain, it's a bad idea.
10. If the implementation is easy to explain, it may be a good idea.
In [2]:
!cat directory1/modified.txt
1. Simplicy counts as well.
2. Special cases aren't that special enough to break the rules.
3. Errors shall never pass ever.
4. In the face of ambiguity, refuse the temptation to guess.
5. There should be one obvious way to do it.
6. Although that way may not be obvious at first unless you're Dutch.
7. Now is better than never.
8. Although never is often better than immediately.
9. If the implementation is hard to code, it's most probably a bad idea.
In [3]:
!cat directory1/file2.txt
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.

Please NOTE that filecmp only reports whether the contents of files are the same, as a boolean value. If you want the line-by-line differences between two files, please check our tutorial on the difflib module, which provides that functionality.
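
To make the contrast concrete, here is a minimal sketch of a line-by-line diff using difflib.unified_diff from the standard library. The two short lists stand in for lines read from original.txt and modified.txt; they are a shortened illustration, not the full files.

```python
import difflib

# Two tiny stand-ins for the lines of original.txt and modified.txt.
original = ["1. Readability counts.\n", "7. Now is better than never.\n"]
modified = ["1. Simplicy counts as well.\n", "7. Now is better than never.\n"]

# unified_diff yields diff lines lazily; collect them into a list.
diff_lines = list(
    difflib.unified_diff(original, modified, fromfile="original.txt", tofile="modified.txt")
)

print("".join(diff_lines), end="")
```

Lines prefixed with "-" exist only in the first input and lines prefixed with "+" only in the second, which is the detail that filecmp does not report.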

Example 1

As a part of our first example, we'll explain how we can compare two files using cmp() function available from filecmp.


  • cmp(f1, f2, shallow=True) - The function takes the names of two files and returns True if they appear equal, else False. The shallow parameter controls whether the os.stat() signatures of the files (file type, size, and modification time) may be used to decide equality. If shallow is True and both signatures are identical, the files are taken to be equal without reading them. Otherwise, including whenever shallow is False, the files are treated as different if their sizes or contents differ.
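
The effect of shallow is easiest to see on two files that differ in content but share the same size and modification time. The sketch below builds such a pair in a temporary directory; the file names and the fixed timestamp are arbitrary choices made for the illustration.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.txt")
    path_b = os.path.join(tmp, "b.txt")

    # Same length, different contents.
    with open(path_a, "w") as f:
        f.write("abc")
    with open(path_b, "w") as f:
        f.write("xyz")

    # Force identical access/modification times so the os.stat() signatures match.
    os.utime(path_a, (1_000_000_000, 1_000_000_000))
    os.utime(path_b, (1_000_000_000, 1_000_000_000))

    # Signatures match, so the contents are never read.
    shallow_result = filecmp.cmp(path_a, path_b, shallow=True)
    # The contents are read and compared.
    deep_result = filecmp.cmp(path_a, path_b, shallow=False)

print("shallow=True  :", shallow_result)
print("shallow=False :", deep_result)
```

With shallow=True the files are reported as equal even though their contents differ, so shallow=False is the safer choice whenever the contents themselves matter.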

Our code for this example simply compares files in directory1 and directory2 with different parameter settings.

Please NOTE that this function caches the results of its comparisons. If the contents of the files change very often, it’s recommended to call filecmp.clear_cache() before comparing so that a stale result is not returned.
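
One way to apply this advice is to wrap both calls in a small helper. The sketch below is only an illustration: compare_fresh is a hypothetical name, not part of filecmp, and the temporary files exist just to exercise it.

```python
import filecmp
import os
import tempfile


def compare_fresh(f1, f2):
    """Hypothetical helper: drop cached results, then compare file contents."""
    filecmp.clear_cache()
    return filecmp.cmp(f1, f2, shallow=False)


with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.txt")
    path_b = os.path.join(tmp, "b.txt")
    for path in (path_a, path_b):
        with open(path, "w") as f:
            f.write("same text\n")

    before_change = compare_fresh(path_a, path_b)

    # Rewrite the second file with different contents and compare again.
    with open(path_b, "w") as f:
        f.write("something else entirely\n")

    after_change = compare_fresh(path_a, path_b)

print("Before change:", before_change)
print("After change :", after_change)
```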

In [1]:
import filecmp

# Default (shallow) comparison of two different files in the same directory.
result = filecmp.cmp("directory1/original.txt", "directory1/modified.txt")

print("Is {} equal to {}? : {}".format("directory1/original.txt", "directory1/modified.txt", result))

# Content comparison (shallow=False) of same-named files across both directories.
result = filecmp.cmp("directory1/original.txt", "directory2/original.txt", shallow=False)

print("Is {} equal to {}? : {}".format("directory1/original.txt", "directory2/original.txt", result))

result = filecmp.cmp("directory1/modified.txt", "directory2/modified.txt", shallow=False)

print("Is {} equal to {}? : {}".format("directory1/modified.txt", "directory2/modified.txt", result))
Is directory1/original.txt equal to directory1/modified.txt? : False
Is directory1/original.txt equal to directory2/original.txt? : True
Is directory1/modified.txt equal to directory2/modified.txt? : True

Example 2

As a part of our second example, we'll explain how we can compare a list of files that have the same names in two different directories using the cmpfiles() function.


  • cmpfiles(directory1, directory2, common, shallow=True) - It accepts two directory names and a list of file names (common), and checks for each name whether the file has the same contents in both directories. The shallow parameter has the same meaning as in cmp(). It returns three lists as output.
    • List of matched files.
    • List of mismatched files.
    • List of errors (files that could not be compared, for example because they are missing from one or both directories or cannot be read).
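
The three return values can be seen on a throwaway pair of directories. This is a minimal sketch: the write helper and all file names are invented for the illustration and are unrelated to the directory1/directory2 trees used in this tutorial.

```python
import filecmp
import os
import tempfile


def write(path, text):
    """Hypothetical helper: create a small text file."""
    with open(path, "w") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    write(os.path.join(left, "same.txt"), "hello\n")
    write(os.path.join(right, "same.txt"), "hello\n")
    write(os.path.join(left, "changed.txt"), "hello\n")
    write(os.path.join(right, "changed.txt"), "goodbye\n")
    write(os.path.join(left, "left_only.txt"), "hello\n")

    # "missing.txt" exists in neither directory.
    names = ["same.txt", "changed.txt", "left_only.txt", "missing.txt"]
    match, mismatch, errors = filecmp.cmpfiles(left, right, names, shallow=False)

print("Matched    :", match)
print("Mismatched :", mismatch)
print("Errors     :", errors)
```

A name lands in the errors list as soon as it cannot be compared, whether it is missing from one directory or from both.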

Our code for this example first creates a list of file names to check in both directories. Some of the names are present in both directories and some are present in neither. The code then compares these files across the two directories and prints the match, mismatch, and error results.

In [4]:
import filecmp

files_to_compare = ["original.txt", "modified.txt", "file2.txt", "file3.txt", "file4.txt"]

match, mismatch, errors = filecmp.cmpfiles("directory1", "directory2", common=files_to_compare)

print("Matched Files    : {}".format(match))
print("Mismatched Files : {}".format(mismatch))
print("Errors           : {}".format(errors))

match, mismatch, errors = filecmp.cmpfiles("directory1", "directory2", common=files_to_compare, shallow=False)

print("\nMatched Files    : {}".format(match))
print("Mismatched Files : {}".format(mismatch))
print("Errors           : {}".format(errors))
Matched Files    : ['original.txt', 'modified.txt']
Mismatched Files : ['file2.txt']
Errors           : ['file3.txt', 'file4.txt']

Matched Files    : ['original.txt', 'modified.txt']
Mismatched Files : ['file2.txt']
Errors           : ['file3.txt', 'file4.txt']

Example 3

As a part of our third example, we'll explain how we can compare two directories using the dircmp class of filecmp. A dircmp instance lets us compare all the files in the two directories and in all of their subdirectories. It also lets us print a report showing the results of the comparison.


  • dircmp(a, b, ignore=None, hide=None) - It accepts two directory names and returns a dircmp instance that can be used to generate comparison reports. The ignore parameter accepts a list of names (directories or filenames) to skip during the comparison; it defaults to filecmp.DEFAULT_IGNORES. The hide parameter accepts a list of names (directories or filenames) to hide from the reports; it defaults to [os.curdir, os.pardir].

Important Methods of dircmp Instance

  • report() - It prints a report of the file comparison between the two directories. It does not include a comparison of subdirectories.
  • report_partial_closure() - It prints a report of the comparison between the two directories and between their common immediate subdirectories. It does not go beyond the first level.
  • report_full_closure() - It prints a report of the comparison between the two directories and all of their common subdirectories, recursively down to the last level.

All the reports list the identical files and the differing files between the two directories, as well as the files that are present in only one of them. They also list the common subdirectories. Each report has one section per directory comparison.
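
Because the report methods print to standard output rather than returning a value, capturing the text takes an extra step. The sketch below uses contextlib.redirect_stdout for that; the temporary directories and file names are made up for the illustration.

```python
import contextlib
import filecmp
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    # One file with the same contents on both sides.
    for folder in (left, right):
        with open(os.path.join(folder, "same.txt"), "w") as f:
            f.write("hello\n")
    # One file that exists only on the left side.
    with open(os.path.join(left, "extra.txt"), "w") as f:
        f.write("only on the left\n")

    # report() prints its output, so redirect stdout into a string buffer.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        filecmp.dircmp(left, right).report()
    report_text = buffer.getvalue()

print(report_text)
```

The captured string can then be written to a log file or searched for a particular section.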


Our code for this example creates a dircmp instance for comparing directories directory1 and directory2. It then prints reports by calling all three different report generating methods described above.

When we run the code below, we can see that report() compared only the files in the given directories, report_partial_closure() went just one level down, and report_full_closure() compared all subdirectories recursively.

In [50]:
directory_cmp = filecmp.dircmp(a="directory1", b="directory2")

print("=========== Comparison Report =========== \n")
directory_cmp.report()

print("\n========= Comparison Report Partial ============\n")
directory_cmp.report_partial_closure()

print("\n========= Comparison Report Full ============== \n")
directory_cmp.report_full_closure()
=========== Comparison Report ===========

diff directory1 directory2
Identical files : ['modified.txt', 'original.txt']
Differing files : ['file2.txt']
Common subdirectories : ['directory1_1']

========= Comparison Report Partial ============

diff directory1 directory2
Identical files : ['modified.txt', 'original.txt']
Differing files : ['file2.txt']
Common subdirectories : ['directory1_1']

diff directory1/directory1_1 directory2/directory1_1
Only in directory1/directory1_1 : ['modified1_1.txt', 'original1_1.txt']
Only in directory2/directory1_1 : ['file2.txt', 'modified.txt', 'original.txt']
Common subdirectories : ['directory1_1_1']

========= Comparison Report Full ==============

diff directory1 directory2
Identical files : ['modified.txt', 'original.txt']
Differing files : ['file2.txt']
Common subdirectories : ['directory1_1']

diff directory1/directory1_1 directory2/directory1_1
Only in directory1/directory1_1 : ['modified1_1.txt', 'original1_1.txt']
Only in directory2/directory1_1 : ['file2.txt', 'modified.txt', 'original.txt']
Common subdirectories : ['directory1_1_1']

diff directory1/directory1_1/directory1_1_1 directory2/directory1_1/directory1_1_1
Only in directory1/directory1_1/directory1_1_1 : ['modified1_1_1.txt', 'original1_1_1.txt']
Only in directory2/directory1_1/directory1_1_1 : ['modified.txt', 'original.txt']

Example 4

As a part of our fourth example, we'll demonstrate how we can ignore some files during a comparison with a dircmp instance by using the ignore parameter of the constructor.

Our code for this example is exactly the same as our previous example with one minor change. We have added file2.txt to the list of files to be ignored when doing the comparison.

If we compare the output of this example with the previous example then we can clearly see that file2.txt is not present in the output of this example.

In [57]:
directory_cmp = filecmp.dircmp(a="directory1", b="directory2", ignore=["file2.txt"])

print("=========== Comparison Report =========== \n")
directory_cmp.report()

print("\n========= Comparison Report Partial ============\n")
directory_cmp.report_partial_closure()

print("\n========= Comparison Report Full ============== \n")
directory_cmp.report_full_closure()
=========== Comparison Report ===========

diff directory1 directory2
Identical files : ['modified.txt', 'original.txt']
Common subdirectories : ['directory1_1']

========= Comparison Report Partial ============

diff directory1 directory2
Identical files : ['modified.txt', 'original.txt']
Common subdirectories : ['directory1_1']

diff directory1/directory1_1 directory2/directory1_1
Only in directory1/directory1_1 : ['original1_1.txt']
Only in directory2/directory1_1 : ['original.txt']
Identical files : ['modified1_1.txt']
Common subdirectories : ['directory1_1_1']

========= Comparison Report Full ==============

diff directory1 directory2
Identical files : ['modified.txt', 'original.txt']
Common subdirectories : ['directory1_1']

diff directory1/directory1_1 directory2/directory1_1
Only in directory1/directory1_1 : ['original1_1.txt']
Only in directory2/directory1_1 : ['original.txt']
Identical files : ['modified1_1.txt']
Common subdirectories : ['directory1_1_1']

diff directory1/directory1_1/directory1_1_1 directory2/directory1_1/directory1_1_1
Only in directory1/directory1_1/directory1_1_1 : ['modified1_1_1.txt', 'original1_1_1.txt']
Only in directory2/directory1_1/directory1_1_1 : ['modified.txt', 'original.txt']
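
One detail worth keeping in mind is that a custom ignore list replaces the default, filecmp.DEFAULT_IGNORES, instead of adding to it. The sketch below shows one way to keep the defaults by extending that list; the temporary directory contents are invented for the illustration.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    # The left side holds a __pycache__ directory (ignored by default) and two files.
    os.mkdir(os.path.join(left, "__pycache__"))
    for name in ("file2.txt", "notes.txt"):
        with open(os.path.join(left, name), "w") as f:
            f.write("text\n")

    # A custom list replaces the defaults, so __pycache__ shows up again.
    custom_only = sorted(filecmp.dircmp(left, right, ignore=["file2.txt"]).left_only)

    # Extending DEFAULT_IGNORES keeps the defaults and adds file2.txt.
    extended_ignore = filecmp.DEFAULT_IGNORES + ["file2.txt"]
    extended = sorted(filecmp.dircmp(left, right, ignore=extended_ignore).left_only)

print("ignore=['file2.txt']           :", custom_only)
print("DEFAULT_IGNORES + ['file2.txt']:", extended)
```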

Example 5

As a part of our fifth example, we'll explain how we can hide information about a list of files when generating a report using dircmp. We'll use the hide parameter of the constructor for this purpose.

Our code for this example is exactly the same as example 3 with a minor change. We have passed a list of two files (modified1_1_1.txt, original1_1_1.txt) as the value of the hide parameter so that the reports leave them out.

If we compare the output of this example with the previous examples, we can see that information about these two files is not present in any report.

In [59]:
directory_cmp = filecmp.dircmp(a="directory1", b="directory2", hide=["modified1_1_1.txt", "original1_1_1.txt"])

print("=========== Comparison Report =========== \n")
directory_cmp.report()

print("\n========= Comparison Report Partial ============\n")
directory_cmp.report_partial_closure()

print("\n========= Comparison Report Full ============== \n")
directory_cmp.report_full_closure()
=========== Comparison Report ===========

diff directory1 directory2
Identical files : ['modified.txt', 'original.txt']
Differing files : ['file2.txt']
Common subdirectories : ['directory1_1']

========= Comparison Report Partial ============

diff directory1 directory2
Identical files : ['modified.txt', 'original.txt']
Differing files : ['file2.txt']
Common subdirectories : ['directory1_1']

diff directory1/directory1_1 directory2/directory1_1
Only in directory1/directory1_1 : ['original1_1.txt']
Only in directory2/directory1_1 : ['file2.txt', 'original.txt']
Identical files : ['modified1_1.txt']
Common subdirectories : ['directory1_1_1']

========= Comparison Report Full ==============

diff directory1 directory2
Identical files : ['modified.txt', 'original.txt']
Differing files : ['file2.txt']
Common subdirectories : ['directory1_1']

diff directory1/directory1_1 directory2/directory1_1
Only in directory1/directory1_1 : ['original1_1.txt']
Only in directory2/directory1_1 : ['file2.txt', 'original.txt']
Identical files : ['modified1_1.txt']
Common subdirectories : ['directory1_1_1']

diff directory1/directory1_1/directory1_1_1 directory2/directory1_1/directory1_1_1
Only in directory2/directory1_1/directory1_1_1 : ['modified.txt', 'original.txt']
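
The same effect can be checked through the attributes of the dircmp instance instead of the printed report. This is a minimal sketch with made-up file names; it passes os.curdir and os.pardir along with the extra name so that the default hide entries are preserved.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    # keep.txt exists on both sides, secret.txt only on the left.
    for folder in (left, right):
        with open(os.path.join(folder, "keep.txt"), "w") as f:
            f.write("shared\n")
    with open(os.path.join(left, "secret.txt"), "w") as f:
        f.write("left side only\n")

    plain = filecmp.dircmp(left, right)
    hidden = filecmp.dircmp(left, right, hide=[os.curdir, os.pardir, "secret.txt"])

    # Attributes are computed lazily, so read them while the directories still exist.
    plain_left_only = plain.left_only
    hidden_left_only = hidden.left_only
    hidden_left_list = hidden.left_list

print("Without hide:", plain_left_only)
print("With hide   :", hidden_left_only)
```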

Example 6

As a part of our sixth example, we'll be explaining various attributes of dircmp instance.


Important Attributes of dircmp Instance

  • left - It returns the name of the first directory.
  • right - It returns the name of the second directory.
  • left_list - It returns files and subdirectories present in the first directory.
  • right_list - It returns files and subdirectories present in the second directory.
  • common - It returns files and subdirectories present in both directories.
  • common_dirs - It returns subdirectories present in both directories.
  • common_files - It returns files present in both directories.
  • common_funny - It returns names present in both directories whose type differs between the directories, or for which the os.stat() function reported an error.
  • same_files - It returns files that are the same in both directories.
  • diff_files - It returns files that are present in both directories but contents do not match.
  • funny_files - It returns files which are present in both directories but contents could not be compared.
  • subdirs - It returns mapping from directory names present in common_dirs to their dircmp instance.
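
These attributes are what make custom traversals possible when the printed reports are not enough. The sketch below walks the subdirs mapping recursively and collects every differing file; collect_diff_files is a hypothetical helper name, and the temporary tree is invented for the illustration.

```python
import filecmp
import os
import tempfile


def collect_diff_files(dcmp):
    """Hypothetical helper: left-side paths of all differing files, recursively."""
    found = [os.path.join(dcmp.left, name) for name in dcmp.diff_files]
    for sub_dcmp in dcmp.subdirs.values():
        found.extend(collect_diff_files(sub_dcmp))
    return found


with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    for folder, text in ((left, "one\n"), (right, "a different text\n")):
        os.mkdir(os.path.join(folder, "sub"))
        # Same name on both sides, different contents (and sizes).
        with open(os.path.join(folder, "sub", "a.txt"), "w") as f:
            f.write(text)
        # Same name and same contents on both sides.
        with open(os.path.join(folder, "top.txt"), "w") as f:
            f.write("identical\n")

    diffs = collect_diff_files(filecmp.dircmp(left, right))
    relative_diffs = [os.path.relpath(path, left) for path in diffs]

print(relative_diffs)
```

The same pattern works for same_files, left_only, or right_only by swapping the attribute read inside the helper.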

Our code for this part generates dircmp instance for directory directory1_1 which is present in both directory1 and directory2. We then print the value of all the attributes described above.

In [3]:
directory_cmp = filecmp.dircmp(a="directory1/directory1_1", b="directory2/directory1_1")

print("=========== Comparison Report =========== \n")
directory_cmp.report()

print("\n=========== Important Attributes of dircmp Instance ========")
print("\nLeft Directory               : {}".format(directory_cmp.left))
print("Right Directory               : {}".format(directory_cmp.right))
print("Left  List of Files/Directories : {}".format(directory_cmp.left_list))
print("Right List of Files/Directories : {}".format(directory_cmp.right_list))
print("Common Names                    : {}".format(directory_cmp.common))
print("Common Directories              : {}".format(directory_cmp.common_dirs))
print("Common Files                    : {}".format(directory_cmp.common_files))
print("Common Funny                    : {}".format(directory_cmp.common_funny))
print("Identical Files                 : {}".format(directory_cmp.same_files))
print("Different Files                 : {}".format(directory_cmp.diff_files))
print("Funny Files                     : {}".format(directory_cmp.funny_files))
print("Mapping from Dirname to dircmp  : {}".format(directory_cmp.subdirs))
=========== Comparison Report ===========

diff directory1/directory1_1 directory2/directory1_1
Only in directory1/directory1_1 : ['original1_1.txt']
Only in directory2/directory1_1 : ['file2.txt', 'original.txt']
Identical files : ['modified1_1.txt']
Common subdirectories : ['directory1_1_1']

=========== Important Attributes of dircmp Instance ========

Left Directory               : directory1/directory1_1
Right Directory               : directory2/directory1_1
Left  List of Files/Directories : ['directory1_1_1', 'modified1_1.txt', 'original1_1.txt']
Right List of Files/Directories : ['directory1_1_1', 'file2.txt', 'modified1_1.txt', 'original.txt']
Common Names                    : ['directory1_1_1', 'modified1_1.txt']
Common Directories              : ['directory1_1_1']
Common Files                    : ['modified1_1.txt']
Common Funny                    : []
Identical Files                 : ['modified1_1.txt']
Different Files                 : []
Funny Files                     : []
Mapping from Dirname to dircmp  : {'directory1_1_1': <filecmp.dircmp object at 0x7f54e002f748>}

This ends our small tutorial explaining how we can compare files and directories using filecmp module of Python. Please feel free to let us know your views in the comments section.

Sunny Solanki