Python developers often need to compare two sequences (lists of numbers, lists of strings, strings of characters, etc.) to find the subsequences they have in common. These matching subsequences show how similar the two sequences are and where they differ, which is useful in many applications.
An algorithm that does this is useful in situations such as comparing the contents of two files or of two strings.
Python provides us with a module named difflib which can compare sequences of any type for us so that we don't need to write complicated algorithms to find common subsequences between two sequences.
The difflib module provides different classes and methods to perform a comparison of two sequences and generate a delta.
As a part of this tutorial, we'll explain how to use the Python module "difflib" to compare sequences of different types with simple and easy-to-understand examples. Apart from basic comparison, the tutorial explains how to generate match ratios of sequences, handle junk characters, generate one sequence from another, etc. It also explains different ways of formatting (HTML, context diff, unified diff, Differ, etc.) the differences between two sequences. The tutorial covers most of the API of the "difflib" module.
At the core of the difflib module is the SequenceMatcher class, which implements the algorithm responsible for comparing two sequences. It requires all the elements of both sequences to be hashable.
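A quick way to see the hashability requirement in practice is to pass elements that cannot be hashed. The sketch below assumes two small made-up sequences of rows (rows_a and rows_b are illustrative names, not part of the tutorial's examples): lists as elements are rejected, while the same rows converted to tuples work.

```python
import difflib

# Illustrative data (an assumption for this sketch): each element is a list, which is unhashable.
rows_a = [[1, 2], [3, 4], [5, 6]]
rows_b = [[1, 2], [5, 6]]

try:
    difflib.SequenceMatcher(a=rows_a, b=rows_b)
    failed = False
except TypeError as exc:
    failed = True
    print("Unhashable elements are rejected:", exc)

# Converting each row to a tuple makes the elements hashable.
seq_mat = difflib.SequenceMatcher(a=[tuple(r) for r in rows_a],
                                  b=[tuple(r) for r in rows_b])
blocks = [tuple(m) for m in seq_mat.get_matching_blocks()]
print(blocks)
```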
First, it finds the longest contiguous block of elements that appears in both sequences (and contains no junk elements). That match splits each sequence into a left part (the elements before the match) and a right part (the elements after it).
Then, it recursively repeats the same search on the two left parts and on the two right parts.
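The two steps above can be sketched with the public find_longest_match() method. This is only a simplified illustration of the idea, not the code that difflib runs internally; recursive_blocks is a hypothetical helper written for this sketch, and it ignores details such as junk handling.

```python
import difflib

def recursive_blocks(matcher, alo, ahi, blo, bhi):
    """Hypothetical helper: collect matching blocks by recursing left and right."""
    match = matcher.find_longest_match(alo, ahi, blo, bhi)
    if match.size == 0:
        return []
    # Search the parts to the left of the match, then the parts to the right of it.
    left = recursive_blocks(matcher, alo, match.a, blo, match.b)
    right = recursive_blocks(matcher, match.a + match.size, ahi,
                             match.b + match.size, bhi)
    return left + [tuple(match)] + right

a = [1, 2, 3, 5, 6, 7, 8, 9]
b = [2, 3, 6, 7, 8, 10, 11]
matcher = difflib.SequenceMatcher(a=a, b=b)
blocks = recursive_blocks(matcher, 0, len(a), 0, len(b))
print(blocks)

# get_matching_blocks() returns the same blocks plus a final dummy block of size 0.
library_blocks = [tuple(m) for m in matcher.get_matching_blocks()]
print(library_blocks)
```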
The algorithm takes quadratic time in the worst case and linear time in the best case. The expected-case behavior depends in a complicated way on how many elements the two sequences have in common.
The algorithm also has an automatic heuristic for junk elements. If the second sequence has at least 200 elements, any element that accounts for more than roughly 1% of it is considered "popular" and is treated as junk, i.e. ignored when searching for matching subsequences. This heuristic can be turned off by passing autojunk=False to SequenceMatcher.
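A small sketch of the heuristic, using made-up data chosen only so that the second sequence has exactly 200 elements and one value is repeated often enough to count as popular:

```python
import difflib

# Illustrative data (an assumption for this sketch): 200 elements in which 0 occurs 11 times.
a = list(range(50))
b = list(range(190)) + [0] * 10

with_heuristic = difflib.SequenceMatcher(a=a, b=b)  # autojunk=True is the default
without_heuristic = difflib.SequenceMatcher(a=a, b=b, autojunk=False)

print(with_heuristic.bpopular)     # elements treated as junk by the heuristic
print(without_heuristic.bpopular)  # empty when the heuristic is disabled
```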
If you are just interested in comparing the list of files/directories and preparing a report about them then please feel free to check our tutorial on Python module filecmp.
As a part of our first example, we'll explain how to compare two sequences of numbers using a SequenceMatcher instance and its methods.
Our code for this example first creates an instance of SequenceMatcher from two lists of integers. It then finds the longest matching block using find_longest_match() and prints it. Finally, it retrieves all matching blocks using get_matching_blocks() and prints them. Each match is a named tuple (a, b, size), meaning that l1[a:a+size] equals l2[b:b+size]; the last block returned by get_matching_blocks() is always a dummy entry with size 0.
import difflib
l1 = [1,2,3,5,6,7, 8,9]
l2 = [2,3,6,7,8,10,11]
seq_mat = difflib.SequenceMatcher(a=l1, b=l2)
match = seq_mat.find_longest_match(alo=0, ahi=len(l1), blo=0, bhi=len(l2))
print("============ Longest Matching Sequence ==================")
print("\nMatch Object : {}".format(match))
print("Matching Sequence from l1 : {}".format(l1[match.a:match.a+match.size]))
print("Matching Sequence from l2 : {}\n".format(l2[match.b:match.b+match.size]))
print("============ All Matching Sequences ==================")
for match in seq_mat.get_matching_blocks():
    print("\nMatch Object : {}".format(match))
    print("Matching Sequence from l1 : {}".format(l1[match.a:match.a+match.size]))
    print("Matching Sequence from l2 : {}".format(l2[match.b:match.b+match.size]))
As a part of our second example, we'll be explaining various methods of SequenceMatcher instance.
Our code for this example starts by creating an instance of SequenceMatcher without setting any sequences. We then set both sequences using the set_seqs() method and print the longest matching block and the similarity ratios between the two sequences. The ratio() method returns a similarity score between 0 and 1, while quick_ratio() and real_quick_ratio() return progressively cheaper upper bounds on that score.
We then set two different sequences as the first and second sequence of the SequenceMatcher using the set_seq1() and set_seq2() methods, and again print the longest matching block and the similarity ratios for these two new sequences.
import difflib
l1 = [1,2,3,5,6,7, 8,9]
l2 = [2,3,6,7,8,10,11]
seq_mat = difflib.SequenceMatcher()
seq_mat.set_seqs(l1, l2)
match = seq_mat.find_longest_match(alo=0, ahi=len(l1), blo=0, bhi=len(l2))
print("============ Longest Matching Sequence (l1,l2) ==================")
print("\nMatch Object : {}".format(match))
print("Matching Sequence from l1 : {}".format(l1[match.a:match.a+match.size]))
print("Matching Sequence from l2 : {}".format(l2[match.b:match.b+match.size]))
print("\n=========== Similarity Ratios ==============")
print("Similarity Ratio : {}".format(seq_mat.ratio()))
print("Similarity Ratio Quick : {}".format(seq_mat.quick_ratio()))
print("Similarity Ratio Very Quick : {}".format(seq_mat.real_quick_ratio()))
#####################################################
l3 = [0,1,2,3,4,6,7,8,9]
l4 = [2,3,6,7,8,9,10,11,12,13]
seq_mat.set_seq1(l3)
seq_mat.set_seq2(l4)
match = seq_mat.find_longest_match(alo=0, ahi=len(l3), blo=0, bhi=len(l4))
print("\n\n\n============ Longest Matching Sequence (l3,l4) ==================")
print("\nMatch Object : {}".format(match))
print("Matching Sequence from l3 : {}".format(l3[match.a:match.a+match.size]))
print("Matching Sequence from l4 : {}".format(l4[match.b:match.b+match.size]))
print("\n=========== Similarity Ratios ==============")
print("Similarity Ratio : {}".format(seq_mat.ratio()))
print("Similarity Ratio Quick : {}".format(seq_mat.quick_ratio()))
print("Similarity Ratio Very Quick : {}".format(seq_mat.real_quick_ratio()))
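To make the similarity ratio less of a black box, the sketch below recomputes it by hand for the first pair of lists. It relies on the documented definition of ratio() as 2.0*M/T, where M is the number of matched elements and T is the total number of elements in both sequences; matched and manual_ratio are names introduced only for this illustration.

```python
import difflib

l1 = [1, 2, 3, 5, 6, 7, 8, 9]
l2 = [2, 3, 6, 7, 8, 10, 11]
seq_mat = difflib.SequenceMatcher(a=l1, b=l2)

# M: total size of all matching blocks; T: combined length of both sequences.
matched = sum(block.size for block in seq_mat.get_matching_blocks())
manual_ratio = 2.0 * matched / (len(l1) + len(l2))

print("Matched elements : {}".format(matched))
print("Manual ratio     : {}".format(manual_ratio))
print("ratio()          : {}".format(seq_mat.ratio()))

# The two quicker variables are upper bounds on ratio().
print(seq_mat.ratio() <= seq_mat.quick_ratio() <= seq_mat.real_quick_ratio())
```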
As a part of our third example, we explain how to compare strings using SequenceMatcher. We also explain how to pass a function as the isjunk parameter to decide which characters are treated as junk, and which elements are automatically treated as junk once the second sequence has at least 200 elements. Apart from this, we also look at a few important attributes of the SequenceMatcher instance.
Our code first creates two strings. It then creates an instance of SequenceMatcher, passing to the isjunk parameter a function that treats the comma and the dot as junk. It then finds the longest matching block and prints it. It also prints all matching blocks between both strings. Finally, it prints the attributes bjunk, bpopular, and b2j of the SequenceMatcher instance (b2j maps each element of the second sequence that is neither junk nor popular to the list of positions where it occurs).
Our code then extends the second string by appending 150 'e' characters so that its length exceeds 200 characters. This activates the autojunk heuristic of the algorithm. We then print the attributes bjunk, bpopular, and b2j again. The difference is clearly visible: bpopular, which was empty before, now lists the popular characters.
import difflib
l1 = "Hello, Welcome to CoderzColumn."
l2 = "Welcome to CoderzColumn, Have a Great Learning Day."
seq_mat = difflib.SequenceMatcher(isjunk=lambda x: x in [",", "."], a=l1, b=l2, autojunk=True)
match = seq_mat.find_longest_match(alo=0, ahi=len(l1), blo=0, bhi=len(l2))
print("============ Longest Matching Sequence ==================")
print("\nMatch Object : {}".format(match))
print("Matching Sequence from l1 : {}".format(l1[match.a:match.a+match.size]))
print("Matching Sequence from l2 : {}\n".format(l2[match.b:match.b+match.size]))
print("============ All Matching Sequences ==================")
for match in seq_mat.get_matching_blocks():
    if match.size > 0:
        print("\nMatch Object : {}".format(match))
        print("Matching Sequence from l1 : {}".format(l1[match.a:match.a+match.size]))
        print("Matching Sequence from l2 : {}".format(l2[match.b:match.b+match.size]))
print("\nSequence B Junk : {}".format(seq_mat.bjunk))
print("Sequence B Popular : {}".format(seq_mat.bpopular))
print("Sequence B2J : {}".format(seq_mat.b2j))
l2 = l2 + "e"*150 ### Added 150 'e' characters to make the string longer than 200 characters so that autojunk kicks in.
seq_mat.set_seq2(l2)
print("\nSequence B Junk : {}".format(seq_mat.bjunk))
print("Sequence B Popular : {}".format(seq_mat.bpopular))
print("Sequence B2J : {}".format(seq_mat.b2j))
As a part of our fourth example, we'll explain how to obtain the list of operations that transforms the first sequence into the second one, using the get_opcodes() and get_grouped_opcodes() methods of a SequenceMatcher instance.
Our code for this part starts by creating an instance of SequenceMatcher with the same strings that we used in the previous example. We then create a third sequence, l3, which is a copy of the first string stored as a list of characters. We then loop through each operation returned by get_opcodes() and apply it to l3 to transform it into the second sequence. Each operation is a tuple (tag, i1, i2, j1, j2), where tag is one of "replace", "delete", "insert", or "equal" and the indices describe how l1[i1:i2] relates to l2[j1:j2].
import difflib
l1 = "Hello, Welcome to CoderzColumn."
l2 = "Welcome to CoderzColumn, Have a Great Learning Day."
seq_mat = difflib.SequenceMatcher(a=l1, b=l2, autojunk=True)
l3 = list(l1)
offset = 0  # How far positions in l3 have shifted relative to l1 so far.
for operation, i1, i2, j1, j2 in seq_mat.get_opcodes():
    if operation == "delete":
        print("Deleting Sequence : '{}' from l1".format(l1[i1:i2]))
        l3[i1+offset:i2+offset] = []
    elif operation == "replace":
        print("Replacing Sequence : '{}' in l1 with '{}' in l2".format(l1[i1:i2], l2[j1:j2]))
        l3[i1+offset:i2+offset] = l2[j1:j2]
    elif operation == "insert":
        print("Inserting Sequence : '{}' from l2 at {} in l1".format(l2[j1:j2], i1))
        l3[i1+offset:i1+offset] = l2[j1:j2]
    elif operation == "equal":
        print("Equal Sequences. '{}' No Action Needed.".format(l1[i1:i2]))
    offset += (j2 - j1) - (i2 - i1)  # Each operation turns i2-i1 elements into j2-j1 elements.
print("\nFinal Sequence : {}".format("".join(l3)))
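An alternative that avoids editing a list in place is to build the second sequence piece by piece from the opcodes. The following is a minimal sketch of that idea; pieces and rebuilt are names introduced only for this illustration.

```python
import difflib

l1 = "Hello, Welcome to CoderzColumn."
l2 = "Welcome to CoderzColumn, Have a Great Learning Day."
seq_mat = difflib.SequenceMatcher(a=l1, b=l2)

pieces = []
for operation, i1, i2, j1, j2 in seq_mat.get_opcodes():
    if operation == "equal":
        pieces.append(l1[i1:i2])   # unchanged text is taken from the first sequence
    elif operation in ("replace", "insert"):
        pieces.append(l2[j1:j2])   # new text is taken from the second sequence
    # "delete" contributes nothing to the result

rebuilt = "".join(pieces)
print(rebuilt)
print(rebuilt == l2)
```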
Our code for this part again starts by creating an instance of SequenceMatcher with the same strings. We then create a third sequence, l3, which is a copy of the first string stored as a list of characters. We then loop through each group returned by get_grouped_opcodes(), which yields groups of operations with up to n elements of unchanged context around each change, and apply the operations in each group to l3 to transform it into the second sequence.
import difflib
l1 = "Hello, Welcome to CoderzColumn."
l2 = "Welcome to CoderzColumn, Have a Great Learning Day."
seq_mat = difflib.SequenceMatcher(a=l1, b=l2, autojunk=True)
l3 = list(l1)
offset = 0  # How far positions in l3 have shifted relative to l1 so far.
for groups in seq_mat.get_grouped_opcodes(n=8):
    for operation, i1, i2, j1, j2 in groups:
        if operation == "delete":
            print("Deleting Sequence : '{}' from l1".format(l1[i1:i2]))
            l3[i1+offset:i2+offset] = []
        elif operation == "replace":
            print("Replacing Sequence : '{}' in l1 with '{}' in l2".format(l1[i1:i2], l2[j1:j2]))
            l3[i1+offset:i2+offset] = l2[j1:j2]
        elif operation == "insert":
            print("Inserting Sequence : '{}' from l2 at {} in l1".format(l2[j1:j2], i1))
            l3[i1+offset:i1+offset] = l2[j1:j2]
        elif operation == "equal":
            print("Equal Sequences. '{}' No Action Needed.".format(l1[i1:i2]))
        offset += (j2 - j1) - (i2 - i1)  # Each operation turns i2-i1 elements into j2-j1 elements.
print("\nFinal Sequence : {}".format("".join(l3)))
As a part of our fifth example, we'll explain how to compare lists of strings using the Differ class of the difflib module.
Differ uses SequenceMatcher internally, first on the two lists of lines to find the lines they have in common, and then on the characters of similar (near-matching) lines to highlight the differences within those lines.
Our code for this part starts by creating two multi-line strings based on the Zen of Python (import this). We then create an instance of Differ, whose compare() method accepts two lists of strings and compares them. We split each string into a list of lines using the splitlines() method with keepends=True, so that every line keeps its trailing newline character, compare the two lists using compare(), and print the result.
We can notice from the output that there are four kinds of lines: lines starting with '- ' appear only in the first sequence, lines starting with '+ ' appear only in the second sequence, lines starting with two spaces are common to both sequences, and lines starting with '? ' are hint lines that appear in neither input and point at the characters that differ within a line.
import difflib
a = '''
1. Readability counts.
2. Special cases aren't special enough to break the rules.
3. Errors should never pass silently.
4. In the face of ambiguity, refuse the temptation to guess.
5. There should be one-- and preferably only one --obvious way to do it.
6. Although that way may not be obvious at first unless you're Dutch.
7. Now is better than never.
8. Although never is often better than *right* now.
9. If the implementation is hard to explain, it's a bad idea.
10. If the implementation is easy to explain, it may be a good idea.
'''
b = '''
1. Simplicy counts as well.
2. Special cases aren't that special enough to break the rules.
3. Errors shall never pass ever.
4. In the face of ambiguity, refuse the temptation to guess.
5. There should be one obvious way to do it.
6. Although that way may not be obvious at first unless you're Dutch.
7. Now is better than never.
8. Although never is often better than immediately.
9. If the implementation is hard to code, it's most probably a bad idea.
'''
difference = difflib.Differ()
for line in difference.compare(a.splitlines(keepends=True), b.splitlines(keepends=True)):
    print(line, end="")
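The two-character prefixes are easier to spot on a tiny input. The sketch below uses two short made-up lists (old and new are illustrative, not the tutorial's data) and counts how many output lines carry each prefix.

```python
import difflib
from collections import Counter

# Illustrative data (an assumption for this sketch).
old = ["one\n", "two\n", "three\n"]
new = ["one\n", "tree\n", "four\n"]

result = list(difflib.Differ().compare(old, new))
for line in result:
    print(repr(line))

# Every output line starts with one of the four two-character codes.
kinds = Counter(line[:2] for line in result)
print(kinds)
```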
As a part of our sixth example, we again use Differ, this time to compare the contents of two files.
We have saved our strings from the previous example in files named original.txt and modified.txt. We then read the contents of both files and compare them using Differ. This time we also pass a charjunk function, which tells Differ to treat the comma, dot, hyphen, and apostrophe as junk characters when comparing lines character by character. The result closely matches that of the previous example.
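The step of saving the two texts to disk is not shown in the tutorial, so here is one possible way to do it. This is only a sketch: save_texts is a hypothetical helper, and the example writes into a temporary directory so that it does not touch any existing files.

```python
import tempfile
from pathlib import Path

original_text = """\
1. Readability counts.
2. Special cases aren't special enough to break the rules.
3. Errors should never pass silently.
4. In the face of ambiguity, refuse the temptation to guess.
5. There should be one-- and preferably only one --obvious way to do it.
6. Although that way may not be obvious at first unless you're Dutch.
7. Now is better than never.
8. Although never is often better than *right* now.
9. If the implementation is hard to explain, it's a bad idea.
10. If the implementation is easy to explain, it may be a good idea.
"""

modified_text = """\
1. Simplicy counts as well.
2. Special cases aren't that special enough to break the rules.
3. Errors shall never pass ever.
4. In the face of ambiguity, refuse the temptation to guess.
5. There should be one obvious way to do it.
6. Although that way may not be obvious at first unless you're Dutch.
7. Now is better than never.
8. Although never is often better than immediately.
9. If the implementation is hard to code, it's most probably a bad idea.
"""

def save_texts(directory):
    """Hypothetical helper: write both texts into `directory` and return the two paths."""
    directory = Path(directory)
    original_path = directory / "original.txt"
    modified_path = directory / "modified.txt"
    original_path.write_text(original_text, encoding="utf-8")
    modified_path.write_text(modified_text, encoding="utf-8")
    return original_path, modified_path

with tempfile.TemporaryDirectory() as tmp_dir:
    original_path, modified_path = save_texts(tmp_dir)
    # Read the files back the same way the examples do, as lists of lines.
    with open(original_path, "r", encoding="utf-8") as fp:
        original_lines = fp.readlines()
    with open(modified_path, "r", encoding="utf-8") as fp:
        modified_lines = fp.readlines()

print(len(original_lines), len(modified_lines))
```

Calling save_texts(".") instead would place original.txt and modified.txt in the current working directory, which is where the examples in this tutorial look for them.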
import difflib
a = open("original.txt", "r").readlines()
b = open("modified.txt", "r").readlines()
difference = difflib.Differ(charjunk=lambda x: x in [",", ".", "-", "'"])
for line in difference.compare(a, b):
    print(line, end="")
As a part of our seventh example, we'll explain how to generate the difference between two sequences in HTML format, as a table that shows the two sequences side by side and highlights the differences in color. The difflib module provides a class named HtmlDiff for this purpose.
Our code for this part reads the two files that we used in our previous example. We then create an instance of HtmlDiff and call its make_file() method, which compares the two lists of strings and returns the result as a complete HTML document. We store the result in the file compare.html.
We use the display module of IPython to display the HTML file in a Jupyter notebook. The output presented in this format is easy to understand and interpret.
If you are interested in learning about how contents of different types like HTML, audio, video, etc can be displayed in the Jupyter notebook then please feel free to check our tutorial on the same.
import difflib
from IPython import display
a = open("original.txt", "r").readlines()
b = open("modified.txt", "r").readlines()
difference = difflib.HtmlDiff(tabsize=2)
with open("compare.html", "w") as fp:
    html = difference.make_file(fromlines=a, tolines=b, fromdesc="Original", todesc="Modified")
    fp.write(html)
display.HTML(open("compare.html", "r").read())
Our eighth example again shows the differences between two lists of strings in HTML format, this time using the make_table() method of the HtmlDiff instance. This is useful when we only want the difference as an HTML table that we can embed in a page of our own; we might not always need a complete HTML document.
Our code for this example is exactly the same as our previous example, with the only change being that we use the make_table() method instead of make_file().
import difflib
from IPython import display
a = open("original.txt", "r").readlines()
b = open("modified.txt", "r").readlines()
difference = difflib.HtmlDiff(tabsize=2)
with open("compare.html", "w") as fp:
    html = difference.make_table(fromlines=a, tolines=b, fromdesc="Original", todesc="Modified")
    fp.write(html)
display.HTML(open("compare.html", "r").read())
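The difference between the two methods can be checked without a notebook. The sketch below uses two short made-up lists instead of the files and simply inspects the returned strings; full_page and table_only are names introduced for this illustration.

```python
import difflib

# Illustrative data (an assumption for this sketch).
old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "delta\n", "gamma\n"]

html_diff = difflib.HtmlDiff(tabsize=2)
full_page = html_diff.make_file(old, new, fromdesc="Original", todesc="Modified")
table_only = html_diff.make_table(old, new, fromdesc="Original", todesc="Modified")

# make_file() returns a whole document; make_table() returns only the table markup.
print("<html" in full_page.lower(), "<html" in table_only.lower())
print(len(full_page) > len(table_only))
```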
As a part of our ninth example, we'll demonstrate how to show the difference between two lists of strings in context diff format using the context_diff() function. A context diff is a compact way of showing which lines changed, along with a few unchanged lines around them for context.
Our code for this example reads the contents of the two files that we created in one of our previous examples. It then computes the context diff between them using the context_diff() function and prints it. In the output, lines starting with '!' have changed between the two sequences, lines starting with '-' were removed, and lines starting with '+' were added.
import difflib
a = open("original.txt", "r").readlines()
b = open("modified.txt", "r").readlines()
difference = difflib.context_diff(a, b,
                                  fromfile="original.txt", tofile="modified.txt",
                                  fromfiledate="2021-02-19", tofiledate="2021-02-20")
for diff in difference:
    print(diff, end="")
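To see the markers in isolation, here is a small sketch that runs context_diff() on two short made-up lists (old and new are illustrative) containing one changed line and one added line.

```python
import difflib

# Illustrative data (an assumption for this sketch): "b" is changed to "x" and "d" is added.
old = ["a\n", "b\n", "c\n"]
new = ["a\n", "x\n", "c\n", "d\n"]

context = list(difflib.context_diff(old, new, fromfile="old", tofile="new"))
for line in context:
    print(line, end="")
```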
As a part of our tenth example, we'll explain the usage of the ndiff() function of the difflib module, which provides the same functionality as a Differ instance.
Our code for this example is fairly self-explanatory and uses the files that we have been using in many examples. It computes the difference between the contents of the files using the ndiff() function and prints it.
import difflib
a = open("original.txt", "r").readlines()
b = open("modified.txt", "r").readlines()
for diff in difflib.ndiff(a, b):
    print(diff, end="")
As a part of our eleventh example, we demonstrate how to show the difference between two lists of strings in unified diff format using the unified_diff() function of the difflib module. The unified format shows only the lines that changed, plus a few lines around them for context.
Our code for this example, like many of our previous examples, starts by reading the two text files created earlier. It uses the unified_diff() function this time to compute the difference and prints it.
import difflib
a = open("original.txt", "r").readlines()
b = open("modified.txt", "r").readlines()
diff = difflib.unified_diff(a, b,
                            fromfile="original.txt", tofile="modified.txt",
                            fromfiledate="2020-02-19", tofiledate="2020-02-20")
for line in diff:
    print(line, end="")
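The amount of surrounding context is controlled by the n parameter of unified_diff(), which defaults to 3. The sketch below compares the default with n=0 on two short made-up lists; with_context and no_context are names introduced for this illustration.

```python
import difflib

# Illustrative data (an assumption for this sketch): only the middle line differs.
old = ["a\n", "b\n", "c\n"]
new = ["a\n", "x\n", "c\n"]

with_context = list(difflib.unified_diff(old, new, fromfile="old", tofile="new"))
no_context = list(difflib.unified_diff(old, new, fromfile="old", tofile="new", n=0))

print("".join(with_context))
print("".join(no_context))
```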
As a part of our twelfth example, we demonstrate how to regenerate either of the original lists of strings from the difference computed between them. The difflib module provides a function named restore() for this purpose. The restore() function works only with a difference generated by a Differ instance or by the ndiff() function.
Our code for this example first computes the difference between the contents of the two files using a Differ instance. It then uses this difference to recover the contents of both the first and the second file using restore(). Because compare() returns a generator that can be consumed only once, the difference is computed again before the second call to restore().
import difflib
a = open("original.txt", "r").readlines()
b = open("modified.txt", "r").readlines()
difference = difflib.Differ()
diff = difference.compare(a, b)
original_file_contents = difflib.restore(diff, 1)
print("============== Original File Contents ====================\n")
for line in original_file_contents:
    print(line, end="")
difference = difflib.Differ()
diff = difference.compare(a, b)
modified_file_contents = difflib.restore(diff, 2)
print("\n\n============== Modified File Contents ====================\n")
for line in modified_file_contents:
    print(line, end="")
Our code for this example first computes the difference between the contents of the two files using the ndiff() function. It then uses this difference to recover the contents of both the first and the second file using restore().
import difflib
a = open("original.txt", "r").readlines()
b = open("modified.txt", "r").readlines()
diff = difflib.ndiff(a, b)
original_file_contents = difflib.restore(diff, 1)
print("============== Original File Contents ====================\n")
for line in original_file_contents:
    print(line, end="")
diff = difflib.ndiff(a, b)
modified_file_contents = difflib.restore(diff, 2)
print("\n\n============== Modified File Contents ====================\n")
for line in modified_file_contents:
    print(line, end="")
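Instead of computing the difference twice, it can be stored in a list once and passed to restore() as often as needed. A minimal sketch with two short made-up lists (a and b here are illustrative, not the file contents):

```python
import difflib

# Illustrative data (an assumption for this sketch).
a = ["one\n", "two\n", "three\n"]
b = ["one\n", "tree\n", "four\n"]

# Materialize the generator so the same difference can be reused.
diff = list(difflib.ndiff(a, b))

restored_a = list(difflib.restore(diff, 1))
restored_b = list(difflib.restore(diff, 2))

print(restored_a == a, restored_b == b)
```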
We'll use our thirteenth example to demonstrate how to find, in a given list of words, the words that approximately match (not necessarily 100%) a particular input word. We can do this using the get_close_matches() function of difflib.
Our code for this example simply tries different values of the parameters of get_close_matches() to see how they impact the results. The parameter n (default 3) is the maximum number of matches to return, and cutoff (default 0.6) is the minimum similarity score, between 0 and 1, that a word needs in order to be returned.
import difflib
matches = difflib.get_close_matches("micro", ["macro", "crow", "cream", "nonsense", "none"])
print(matches)
matches = difflib.get_close_matches("micro", ["macro", "crow", "cream", "nonsense", "none"], n=1)
print(matches)
matches = difflib.get_close_matches("micro", ["macro", "crow", "cream", "nonsense", "none"], n=3, cutoff=0.4)
print(matches)
matches = difflib.get_close_matches("micro", ["macro", "crow", "cream", "nonsense", "none"], n=2, cutoff=0.4)
print(matches)
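To get a feel for why some words pass the cutoff and others do not, the sketch below prints a SequenceMatcher ratio for every candidate. It assumes that scoring each candidate with ratio() against the input word is a fair stand-in for what get_close_matches() does; scores and best are names introduced for this illustration.

```python
import difflib

word = "micro"
candidates = ["macro", "crow", "cream", "nonsense", "none"]

# Score every candidate against the input word.
scores = {c: difflib.SequenceMatcher(None, c, word).ratio() for c in candidates}
for candidate, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print("{:10s} {:.3f}".format(candidate, score))

best = difflib.get_close_matches(word, candidates, n=1)
print(best)
```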
This ends our small tutorial explaining how we can compare sequences with different types of data using difflib module.