As the number of lines of code in a program or project grows, it becomes hard to keep track of the changes being made, and untracked changes waste time, money, and effort. Being able to pinpoint exactly what changed therefore becomes a necessity. Python provides a built-in library named difflib to handle this kind of problem.
What is difflib?
Python’s difflib library contains functions and classes for computing the differences (deltas) between sequences or files. It is most often used to compare sequences of strings, such as the lines of two text files.
You might have worked with or come across Git. Git is a version control system: it keeps track of changes to one or more files over time, and changes can be rolled back to a previous state, for example when an error is introduced.
Python’s difflib module addresses a similar problem: finding what changed between two versions of some content. In this article, we will look at the built-in difflib module, its relevance, how it works, its classes and functions, and some examples.
Importing difflib
import difflib
Differ class
The Differ class of difflib compares sequences of lines of text and produces differences (deltas) that a person can easily read.
Each line of a Differ delta begins with a two-character code:
Code | Meaning |
---|---|
'- ' | line unique to text 1 |
'+ ' | line unique to text 2 |
'  ' | line common to text 1 and 2 |
'? ' | hint line not present in either text; it points at the characters that differ in the neighbouring lines |
Syntax & Parameters
Syntax:
difflib.Differ(linejunk=None, charjunk=None)
Parameters: linejunk and charjunk are both optional and default to None.
- linejunk – a function that accepts a single string argument and returns True if the string is junk. The default, None, means that no line is considered junk.
- charjunk – a function that accepts a single character argument and returns True if the character is junk. The default, None, means that no character is considered junk.
compare(a, b) – the compare method compares two sequences of lines and generates their differences (deltas). Each sequence must contain individual single-line strings ending with a newline character ('\n').
from difflib import Differ
import sys
differ_inst = Differ()
string1 = """This is a random string.
Lets call it string 1.
This is so random
""".splitlines(keepends=True)
string2 = """This is a random string.
Lets call it string 2.
This is so random.
Or mayble not, or is it.
""".splitlines(keepends=True)
deltas = list(differ_inst.compare(string1,string2))
sys.stdout.writelines(deltas)
Let’s break down the code above:
- First, we import the Differ class from difflib, along with the sys module.
- Next, we create a Differ instance named differ_inst.
- string1 and string2 store the lists of individual split lines; keepends=True preserves the trailing newline of each line.
- The compare method returns a generator. After converting it to a list, we write the delta lines to the screen using the writelines method of sys.stdout.
In the output, '- ', '+ ', and '  ' mark lines that appear only in string1, only in string2, and in both, respectively.
SequenceMatcher class
SequenceMatcher compares two sequences and measures how similar they are. We will demonstrate it using some examples.
Syntax: SequenceMatcher(isjunk=None, a='', b='', autojunk=True)
- isjunk – accepts a function that takes a single element of the sequence and returns True if the element is junk and False otherwise. The default value is None, which means that no element is treated as junk.
- a, b – the two sequences to compare. Their elements must be hashable.
- autojunk – accepts a boolean. If True (the default), the automatic junk heuristic is enabled, which treats very frequent elements of long sequences as junk.
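As a rough illustration of the isjunk parameter, the sketch below treats spaces as junk while comparing two strings. The sample strings and the lambda are chosen for illustration only; any function that returns True for the elements you want ignored would work the same way.

```python
from difflib import SequenceMatcher

# Illustrative inputs; any two strings would do.
first = "private Thread currentThread;"
second = "private volatile Thread currentThread;"

# isjunk receives one element (here, one character) and returns True for junk.
matcher = SequenceMatcher(lambda ch: ch == " ", first, second)
similarity = matcher.ratio()
print(round(similarity, 3))
```

Junk elements are not used as starting points for matches, which tends to keep the matcher from anchoring on uninteresting characters such as blanks.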
Let’s discuss popular sequence matcher methods:
ratio
It returns the similarity of the two sequences as a floating-point value in the range [0, 1], where 1 means the sequences are identical and 0 means they have nothing in common.
import difflib
from difflib import SequenceMatcher
string_one = 'He is right'
string_two = 'He was right'
seq_ratio = SequenceMatcher(a=string_one,b=string_two)
print(f"SequenceMatcher ratio: {seq_ratio.ratio()}")
What is the difference between difflib’s ratio, quick_ratio, and real_quick_ratio?
All three methods estimate the ratio of matching characters to total characters, at different levels of approximation. ratio() computes the exact value but is the most expensive. quick_ratio() and real_quick_ratio() are cheaper to compute, and each returns an upper bound on ratio(), so their results may be larger than ratio() but never smaller.
from difflib import SequenceMatcher
s = SequenceMatcher(None, "abcd", "bcde")
print(s.ratio())             # 0.75
print(s.quick_ratio())       # 0.75
print(s.real_quick_ratio())  # 1.0
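Because the two quick variants are upper bounds, they can serve as cheap filters before the expensive ratio() call. The following is a minimal sketch of that idea; the helper name is_similar and its cutoff parameter are hypothetical and not part of difflib.

```python
from difflib import SequenceMatcher

def is_similar(first, second, cutoff=0.6):
    """Return True if ratio() >= cutoff, trying the cheap upper bounds first."""
    s = SequenceMatcher(None, first, second)
    # If an upper bound is already below the cutoff, ratio() must be too,
    # so the `and` chain stops before the expensive call.
    return (s.real_quick_ratio() >= cutoff
            and s.quick_ratio() >= cutoff
            and s.ratio() >= cutoff)

print(is_similar("abcd", "bcde", cutoff=0.7))
print(is_similar("abcd", "wxyz", cutoff=0.5))
```

The ordering matters: the cheapest check runs first, and ratio() is only evaluated for pairs that survive both bounds.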
difflib’s ratio vs Levenshtein
- Levenshtein – the third-party Levenshtein package is built around the Levenshtein algorithm, which computes the minimum number of edits required to transform one string into the other.
- difflib’s ratio – uses an algorithm similar to Ratcliff/Obershelp. It computes twice the number of matching characters divided by the total number of characters in the two strings.
import Levenshtein
import difflib
print(Levenshtein.ratio('united states of america','america'))
print(difflib.SequenceMatcher(None,'united states of america','america').ratio())
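To make the difflib side of this comparison concrete, the sketch below recomputes the ratio by hand as 2*M/T, where M is the number of matched elements and T is the combined length of both sequences. The input strings are illustrative.

```python
from difflib import SequenceMatcher

first, second = "abcd", "bcde"
s = SequenceMatcher(None, first, second)

# M: total number of matched elements across all matching blocks
matches = sum(block.size for block in s.get_matching_blocks())
# T: total number of elements in both sequences
total = len(first) + len(second)

manual_ratio = 2.0 * matches / total
print(matches, total, manual_ratio)
```

Here the common block is "bcd", so M is 3 and T is 8, giving 0.75, the same value that ratio() reports for these strings.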
find_longest_match
It finds the longest matching block, that is, the longest contiguous run of elements common to both sequences, within the given index ranges.
Syntax: find_longest_match(alo=0, ahi=None, blo=0, bhi=None)
Parameters:
- alo – starting index in the first sequence
- ahi – ending index (exclusive) in the first sequence
- blo – starting index in the second sequence
- bhi – ending index (exclusive) in the second sequence
It returns a Match(a, b, size) named tuple: the block starts at index a in the first sequence and at index b in the second, and is size elements long.
import difflib
list_one = [1,2,3,4,5,6,7,8,9]
list_two = [1,3,4,5,6,8,9,10,11]
match_seq = difflib.SequenceMatcher(a=list_one,b=list_two)
match = match_seq.find_longest_match(alo=0,ahi=len(list_one),blo=0,bhi=len(list_two))
print(f"Match object:{match}")
print(f"Matching sequence list_one: {list_one[match.a:match.a+match.size]}")
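The index ranges and the junk function both influence which block is reported. The sketch below, using illustrative strings, shows the same pair of sequences compared with and without spaces treated as junk, and then with a restricted range in the first sequence.

```python
from difflib import SequenceMatcher

# Without junk handling, the leading space takes part in the match.
plain = SequenceMatcher(None, " abcd", "abcd abcd")
whole = plain.find_longest_match(0, 5, 0, 9)
print(whole)

# With spaces treated as junk, the match cannot start on the space.
no_spaces = SequenceMatcher(lambda ch: ch == " ", " abcd", "abcd abcd")
junk_aware = no_spaces.find_longest_match(0, 5, 0, 9)
print(junk_aware)

# Restricting the ranges limits where the match may be found.
restricted = plain.find_longest_match(1, 5, 0, 4)
print(restricted)
```

The arguments are passed positionally here so that the call does not rely on the keyword defaults.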
get_matching_blocks
This method returns a list of all the matching blocks in the two sequences as Match(a, b, size) tuples. The last tuple is a dummy with size 0 that marks the end of the sequences.
import difflib
list_one = [1,2,3,4,5,6,7,8,9]
list_two = [1,3,4,5,6,8,9,10,11]
match_seq = difflib.SequenceMatcher(a=list_one,b=list_two)
# Each match is a Match(a, b, size) tuple; the final one is a dummy with size 0
for match in match_seq.get_matching_blocks():
    print(f"Match object:{match}")
    print(f"Matching sequence list_one: {list_one[match.a:match.a+match.size]}")
    print(f"Matching sequence list_two: {list_two[match.b:match.b+match.size]}")
    print()
difflib’s functions
context_diff function
difflib.context_diff compares two sequences of lines and returns the delta in context diff format. It returns a generator that yields the delta (difference) lines one at a time.
In context format, the changed lines are shown together with a few lines of surrounding context. Lines that were changed are prefixed with '! ', lines that were added with '+ ', and lines that were removed with '- '.
import difflib
string_one = """Lorem ipsum dolor sit amet.
Pellentesque at leo neque.
Aenean sit amet tempor sem, eu tristique sapien.
Ut id quam at mauris volutpat fringilla sit amet et enim.
Morbi faucibus maximus massa, in commodo erat luctus ut.""".splitlines(keepends=True)
string_two = """Lorem ipsum dolor sit amet.
Pellentesque at leo nequed rattla bisawed.
Aenean sit amet tempor sem, eu tristique sapien.
Cras consequat ornare arcu, ac dapibus elit tincidunt non.
Fusce massa diam, tristique pellentesque ultricies eu, auctor nec ipsum.""".splitlines(keepends=True)
difference = difflib.context_diff(string_one,string_two)
for item in difference:
    print(item, end='')
Let’s break down the above code:
- We have two sequences stored in the string_one and string_two variables.
- The splitlines method splits each string into individual lines and returns them in a list; keepends=True keeps the newline at the end of each line.
- The context_diff function returns the delta in context diff format.
- The difference variable stores the generator returned by context_diff.
- Iterating over the generator yields the individual delta lines.
In the output, '!' marks the lines that differ between the two sequences.
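The header of a context diff can be labelled through the fromfile and tofile parameters of context_diff. The sketch below uses short in-memory lists and made-up file names; the names only label the output and no files are read.

```python
import difflib

before = ["one\n", "two\n", "three\n"]
after = ["one\n", "2\n", "three\n"]

# fromfile/tofile only label the header lines of the delta.
delta = list(difflib.context_diff(before, after,
                                  fromfile="before.txt", tofile="after.txt"))
print("".join(delta), end="")
```

The first two lines of the delta carry the labels, and the changed line appears once in each half of the hunk with the '! ' prefix.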
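unified_diff function
difflib.unified_diff compares two sequences of lines and returns the delta in unified diff format.
In unified format, the changes are shown inline in a single block: each line removed from the first sequence is prefixed with '-', each line added is prefixed with '+', and unchanged context lines are prefixed with a space.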
import difflib
string_one = """Lorem ipsum dolor sit amet.
Pellentesque at leo neque.
Aenean sit amet tempor sem, eu tristique sapien.
Ut id quam at mauris volutpat fringilla sit amet et enim.
Morbi faucibus maximus massa, in commodo erat luctus ut.""".splitlines(keepends=True)
string_two = """Lorem ipsum dolor sit amet.
Pellentesque at leo nequed rattla bisawed.
Aenean sit amet tempor sem, eu tristique sapien.
Cras consequat ornare arcu, ac dapibus elit tincidunt non.
Fusce massa diam, tristique pellentesque ultricies eu, auctor nec ipsum.""".splitlines(keepends=True)
difference = difflib.unified_diff(string_one,string_two)
for item in difference:
    print(item, end='')
The code above works like the context_diff example. The only change is that the returned generator yields a delta in unified diff format instead of context diff format.
In the output, '-' marks the lines removed from the first sequence, and '+' marks the lines added to it.
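The amount of surrounding context is controlled by the n parameter of unified_diff. As a small sketch with made-up inputs and file labels, setting n=0 leaves only the header lines and the changed lines:

```python
import difflib

before = ["one\n", "two\n", "three\n"]
after = ["one\n", "2\n", "three\n"]

# n=0 asks for no context lines around each change.
delta = list(difflib.unified_diff(before, after,
                                  fromfile="before.txt", tofile="after.txt", n=0))
print("".join(delta), end="")
```

With the default n=3, the unchanged lines "one" and "three" would also appear, each prefixed with a space.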
get_close_matches function
The get_close_matches function of difflib takes a word and a list of words to match against. It returns a list of the closest matches to the word, with the most similar first.
Syntax & Parameters
Syntax: get_close_matches(word, possibilities, n=3, cutoff=0.6)
Parameters:
- word – the word for which matches are to be searched
- possibilities – the list of words against which word is matched
- n – an optional parameter giving the maximum number of matches to return. The default value is 3.
- cutoff – an optional parameter in the range [0, 1] with a default value of 0.6. Candidates whose similarity score is below the cutoff are ignored.
Example 1:
from difflib import get_close_matches
possible_words = ['eat','cat','their','beat','here','them']
if __name__ == "__main__":
    matches = get_close_matches('bat',possible_words)
    print(matches)
The code above passes only the word and the list of candidates, so the default values of n and cutoff are used.
Example 2:
from difflib import get_close_matches
possible_words = ['theme','thames','that','this','those']
if __name__ == "__main__":
    matches = get_close_matches('the',possible_words,2,0.7)
    print(matches)
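To see how n and cutoff shape the result, here is a small sketch of a spelling-suggestion lookup. The vocabulary and the misspelled word are illustrative.

```python
from difflib import get_close_matches

# Illustrative vocabulary and misspelled input.
vocabulary = ["ape", "apple", "peach", "puppy"]

default_matches = get_close_matches("appel", vocabulary)           # n=3, cutoff=0.6
best_only = get_close_matches("appel", vocabulary, n=1)            # keep only the top match
strict = get_close_matches("appel", vocabulary, cutoff=0.9)        # demand near-identical words

print(default_matches)
print(best_only)
print(strict)
```

With the defaults this prints ['apple', 'ape']; limiting n to 1 keeps only 'apple', and raising the cutoff to 0.9 returns an empty list because no candidate is that similar.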
ndiff function
difflib’s ndiff also compares two sequences of lines and returns a Differ-style delta (difference).
import difflib
string_one = """Lorem ipsum dolor sit amet.
Aenean sit amet tempor sem, eu tristique sapien.
Ut id quam at mauris volutpat fringilla sit amet et enim.
""".splitlines(keepends=True)
string_two = """Lorem ipsum dolor sit amet.
Pellentesque at leo nequed rattla bisawed.
Cras consequat ornare arcu, ac dapibus elit tincidunt non.
""".splitlines(keepends=True)
difference = difflib.ndiff(string_one,string_two)
for item in difference:
    print(item, end='')
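Since every line of a Differ-style delta starts with a two-character code, the delta can be sorted into added, removed, and unchanged lines by looking at that prefix. The following is a minimal sketch with made-up input lines.

```python
import difflib

old_lines = ["alpha\n", "beta\n", "delta\n"]
new_lines = ["alpha\n", "gamma\n", "delta\n"]

delta = list(difflib.ndiff(old_lines, new_lines))

# The first two characters of every delta line are the code; the rest is the text.
removed = [line[2:] for line in delta if line.startswith("- ")]
added = [line[2:] for line in delta if line.startswith("+ ")]
unchanged = [line[2:] for line in delta if line.startswith("  ")]

print("removed:", removed)
print("added:", added)
print("unchanged:", unchanged)
```

Lines starting with '? ' are hints that belong to neither input, so this sketch simply skips them.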
HtmlDiff
The HtmlDiff class of the difflib module compares two sequences of lines and returns the delta as HTML. Let’s understand it using an example.
import difflib
# Read both files as lists of lines; the with block closes them afterwards
with open("a.txt", "r") as fa, open("b.txt", "r") as fb:
    a = fa.readlines()
    b = fb.readlines()
difference = difflib.HtmlDiff(tabsize=2)
html = difference.make_file(fromlines=a, tolines=b, fromdesc="Original", todesc="Modified")
with open("delta.html", "w") as fp:
    fp.write(html)
Let’s break down the above code:
- We have two files, a.txt and b.txt, containing names of countries. After opening and reading both files, we store their lines in the variables a and b respectively.
- After that, we create an instance of HtmlDiff, which is used for generating the difference in HTML format.
- The make_file method of HtmlDiff returns a string containing a complete HTML document that shows the differences between the two files. The lines of the files are shown side by side in a table.
- The table headers for a.txt and b.txt are named Original and Modified.
- Furthermore, in the rendered output the word America, which was removed, and the word China, which was added, are highlighted in red and green respectively.
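The same comparison can be tried without creating any files. The sketch below keeps the two inputs in memory; the country lists are hypothetical stand-ins for the contents of a.txt and b.txt.

```python
import difflib

# Hypothetical file contents, held in memory instead of read from disk.
original = ["India\n", "America\n", "Japan\n"]
modified = ["India\n", "China\n", "Japan\n"]

html = difflib.HtmlDiff(tabsize=2).make_file(
    fromlines=original, tolines=modified,
    fromdesc="Original", todesc="Modified",
)

# make_file returns one string holding a complete HTML document.
print(type(html).__name__, len(html) > 0)
```

The returned string can be written to a file and opened in a browser, exactly as in the example above.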
FAQs on Python difflib
Python has a built-in module named difflib, which provides the function get_close_matches. It returns a list of the words closest to the target word.
The Difflib library of Python contains functions and classes used for computing the deltas of sequences or files.
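To ignore whitespace differences in difflib, normalize each line before comparing, for instance with a function that strips leading and trailing whitespace and collapses internal runs of whitespace into a single space:
import re
def remove_whitespace(line):
    return re.sub(r"\s+", " ", line.strip())
Then pass every line through this function and compare the results using difflib.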
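A minimal sketch of this approach, assuming two small lists of lines that differ only in spacing, compares the raw lines first and the normalized lines second:

```python
import difflib
import re

def remove_whitespace(line):
    # Strip the ends and collapse internal whitespace runs into one space.
    return re.sub(r"\s+", " ", line.strip())

first = ["def  add(a, b):\n", "\treturn a + b\n"]
second = ["def add(a, b):\n", "    return a  +  b\n"]

raw_delta = list(difflib.unified_diff(first, second))
clean_delta = list(difflib.unified_diff(
    [remove_whitespace(line) for line in first],
    [remove_whitespace(line) for line in second],
))

print(len(raw_delta) > 0, len(clean_delta))
```

The raw comparison reports changes, while the normalized comparison produces an empty delta because the lines become identical once whitespace is normalized. Note that this normalization also removes indentation, which may or may not be acceptable depending on the content being compared.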
Conclusion
In this article, we covered a valuable built-in module named difflib and looked at its classes and functions through examples. We covered the Differ and SequenceMatcher classes and their methods, the HtmlDiff class, and the functions ndiff, context_diff, unified_diff, and get_close_matches.