Learn Python Difflib Library Effectively

As the number of lines of code in a program or project grows, it becomes hard to keep track of the changes being made, which wastes time, money, and effort. Being able to pinpoint exactly what changed therefore becomes essential. Python provides a built-in library named difflib to handle this kind of problem.

What is difflib?

The difflib library of Python contains functions and classes for computing the differences (deltas) between sequences or files. It is most often used to compare sequences of strings.

You may have come across Git, a version control system. It keeps track of changes to one or more files over time, and lets you roll files back to a previous state, for example after an error.

[Image: git changes depicted using red and green representing removed and added lines]

Python's difflib module works in a similar way. In this article, we will look at Python's built-in difflib module, its relevance, how it works, its classes and functions, and some examples.

Importing difflib

import difflib

Differ class

difflib's Differ class compares sequences of lines of text and produces differences (deltas) that a person can easily read.

Each line of a Differ delta begins with a two-character code:

Code	Meaning
'- '	line unique to sequence 1
'+ '	line unique to sequence 2
'  '	line common to both sequences
'? '	line not present in either input sequence; it marks where two similar lines differ

Differ class codes

Syntax & Parameters

Syntax:

difflib.Differ(linejunk=None, charjunk=None)

Parameters: linejunk and charjunk both default to None.

linejunk – a function that accepts a single string argument and returns True if the string is junk. The default, None, means no line is treated as junk.
charjunk – a function that accepts a single character argument and returns True if the character is junk. The default, None, means no character is treated as junk.

compare(firstString, secondString) – the compare method compares two sequences of lines and generates their differences (deltas). Each line should end with a newline character ('\n').

from difflib import Differ
import sys

differ_inst = Differ()

string1 = """This is a random string.
 Lets call it string 1. 
 This is so random
 """.splitlines(keepends=True)

string2 = """This is a random string.
 Lets call it string 2. 
 This is so random.
 Or mayble not, or is it.
 """.splitlines(keepends=True)

deltas = list(differ_inst.compare(string1,string2))
sys.stdout.writelines(deltas)

Let's break down the code above:

First, we import Differ from difflib, along with the sys module.
Second, we create a Differ instance named differ_inst.
string1 and string2 store lists of the individual split lines.
The generator returned by the compare method is converted to a list, which is then written to the screen using sys.stdout's writelines method.

[Image: The output of the Differ example]

In the output image above, '- ' marks lines found only in string1, '+ ' marks lines found only in string2, and '  ' marks lines common to both.
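The '? ' code is easiest to see when two lines are almost identical. The following is a minimal sketch with made-up sample lines; it collects the two-character codes from the delta so you can check which ones appear.

```python
from difflib import Differ

# Two single-line sequences that differ by one character (sample data).
before = ["The quick brown fox\n"]
after = ["The quick brown fix\n"]

delta = list(Differ().compare(before, after))
print("".join(delta), end="")

# Every delta line starts with a two-character code.
codes = [line[:2] for line in delta]
print(codes)
```

The '? ' lines are guides for the reader only; they point at the characters that changed and are not part of either input.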

SequenceMatcher class

SequenceMatcher compares two sequences and returns how close they are. We will try to demonstrate using some examples.

Syntax: SequenceMatcher(isjunk=None, a='', b='', autojunk=True)

isjunk – Accepts a function as an argument. The function takes a single element of the sequence as input and returns True if the element is junk, otherwise False. The default value is None, meaning no element is treated as junk.
autojunk – Accepts a boolean value. If True, enables the automatic junk heuristic, which treats very frequent elements of long sequences (200 items or more) as junk.

Let’s discuss popular sequence matcher methods:

ratio

It returns the similarity of the two sequences as a floating-point value in the range [0, 1], where 1.0 means the sequences are identical and 0.0 means they have nothing in common.

import difflib
from difflib import SequenceMatcher

string_one = 'He is right'
string_two = 'He was right'
seq_ratio = SequenceMatcher(a=string_one,b=string_two)
print(f"SequenceMatcher ratio: {seq_ratio.ratio()}")

[Image: Match ratio of sequences]
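The value returned by ratio() is 2.0 * M / T, where M is the number of matching elements and T is the total number of elements in both sequences. As a rough illustration, the sketch below recomputes it by hand from the matching blocks; the sample strings and the variable names are only for demonstration.

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
matcher = SequenceMatcher(None, a, b)

# M: total number of matching elements across all matching blocks.
matches = sum(block.size for block in matcher.get_matching_blocks())
# T: total number of elements in both sequences.
total = len(a) + len(b)

manual_ratio = 2.0 * matches / total
print(manual_ratio, matcher.ratio())
```

Here "bcd" is the only matching block, so M is 3, T is 8, and both values come out as 0.75.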

Difference between difflib’s ratio, quick_ratio, real_quick_ratio?

The three methods return the ratio of matching to total characters, at different levels of approximation. ratio() computes the exact value but is the most expensive. quick_ratio() and real_quick_ratio() are cheaper to compute, and each returns an upper bound on ratio(): they are never smaller than ratio(), but they may be larger.

from difflib import SequenceMatcher

sequence = SequenceMatcher(None, "abcd", "bcde")
print(sequence.ratio())             # 0.75
print(sequence.quick_ratio())       # 0.75
print(sequence.real_quick_ratio())  # 1.0

[Image: Difference between ratio, quick_ratio, and real_quick_ratio]
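Because the two quick variants are upper bounds, they can serve as cheap filters before paying for the exact ratio(). The sketch below shows one way to do that; is_similar is a hypothetical helper written for this article, not part of difflib, and the 0.6 cutoff is an arbitrary choice.

```python
from difflib import SequenceMatcher

def is_similar(a, b, cutoff=0.6):
    """Return True if ratio() >= cutoff, trying the cheap upper bounds first.

    Hypothetical helper for illustration; not part of difflib.
    """
    matcher = SequenceMatcher(None, a, b)
    # `and` short-circuits, so ratio() only runs if both bounds pass.
    return (
        matcher.real_quick_ratio() >= cutoff
        and matcher.quick_ratio() >= cutoff
        and matcher.ratio() >= cutoff
    )

print(is_similar("abcd", "bcde"))
print(is_similar("abcd", "wxyz"))
```

If an upper bound is already below the cutoff, the exact ratio cannot reach it, so the expensive computation is skipped.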

Difflib’s Ratio vs Levenshtein

Levenshtein – uses the Levenshtein algorithm to compute the minimum number of single-character edits required to transform one string into the other. It comes from the third-party Levenshtein package, which must be installed separately.
difflib's ratio – uses an algorithm based on Ratcliff/Obershelp pattern matching. It computes twice the number of matching characters divided by the total number of characters in the two strings.

import Levenshtein
import difflib

print(Levenshtein.ratio('united states of america','america'))
print(difflib.SequenceMatcher(None,'united states of america','america').ratio())

[Image: Output of Levenshtein and difflib's ratio]

find_longest_match

It compares the two sequences and returns the longest matching block, that is, the longest contiguous run of elements that appears in both.

Syntax: find_longest_match(alo=0, ahi=None, blo=0, bhi=None)

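Parameters:

alo – starting index of the range to search in the first sequence
ahi – ending index (exclusive) of the range to search in the first sequence
blo – starting index of the range to search in the second sequence
bhi – ending index (exclusive) of the range to search in the second sequence

It takes the starting and ending indices of the two sequences and returns a Match(a, b, size) named tuple, where a and b are the starting positions of the block in each sequence and size is its length. The default argument values shown above are available from Python 3.9 onward.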

import difflib

list_one = [1,2,3,4,5,6,7,8,9]
list_two = [1,3,4,5,6,8,9,10,11]

match_seq = difflib.SequenceMatcher(a=list_one,b=list_two)
match = match_seq.find_longest_match(alo=0,ahi=len(list_one),blo=0,bhi=len(list_two))

print(f"Match object:{match}")
print(f"Matching sequence list_one: {list_one[match.a:match.a+match.size]}")
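The isjunk function passed to SequenceMatcher changes what find_longest_match considers a match. The sketch below follows the example in the Python documentation: the same two strings are compared once without a junk function and once with spaces treated as junk.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"

# Without a junk function, the leading space takes part in the match.
plain = SequenceMatcher(None, a, b)
plain_match = plain.find_longest_match(0, len(a), 0, len(b))
print(plain_match)

# Treating spaces as junk keeps them from anchoring a match.
no_spaces = SequenceMatcher(lambda ch: ch == " ", a, b)
junk_match = no_spaces.find_longest_match(0, len(a), 0, len(b))
print(junk_match)
```

The first call matches " abcd" against the tail of the second string, while the second call matches "abcd" against its head.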

get_matching_blocks

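This method returns a list of Match(a, b, size) named tuples describing all the matching blocks in both sequences. The last entry is always a dummy block of size 0 that marks the end of both sequences.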

import difflib

list_one = [1,2,3,4,5,6,7,8,9]
list_two = [1,3,4,5,6,8,9,10,11]

match_seq = difflib.SequenceMatcher(a=list_one,b=list_two)

# Each match is a named tuple (a, b, size); the final one has size 0.
for match in match_seq.get_matching_blocks():
    print(f"Match object:{match}")
    print(f"Matching sequence list_one: {list_one[match.a:match.a+match.size]}")
    print(f"Matching sequence list_two: {list_two[match.b:match.b+match.size]}")
    print()
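When only the real matches matter, the zero-length block at the end can be filtered out. A small sketch using the same two lists (the name real_blocks is just for illustration):

```python
from difflib import SequenceMatcher

list_one = [1, 2, 3, 4, 5, 6, 7, 8, 9]
list_two = [1, 3, 4, 5, 6, 8, 9, 10, 11]

blocks = SequenceMatcher(a=list_one, b=list_two).get_matching_blocks()

# The final block is a zero-length sentinel; drop it before reporting.
real_blocks = [block for block in blocks if block.size > 0]
for block in real_blocks:
    print(list_one[block.a:block.a + block.size])
```

For these lists the matching runs are [1], [3, 4, 5, 6], and [8, 9].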

Difflib’s methods

context_diff method

difflib.context_diff compares two sequences of lines and returns a delta in context diff format. It returns a generator that yields the delta (difference) lines.

In context format, the output shows which lines have been changed by returning the changed lines with a prefix of ‘!’.

import difflib

string_one = """Lorem ipsum dolor sit amet.
Pellentesque at leo neque.
Aenean sit amet tempor sem, eu tristique sapien.
Ut id quam at mauris volutpat fringilla sit amet et enim.
Morbi faucibus maximus massa, in commodo erat luctus ut.""".splitlines(keepends=True)

string_two = """Lorem ipsum dolor sit amet.
Pellentesque at leo nequed rattla bisawed.
Aenean sit amet tempor sem, eu tristique sapien.
Cras consequat ornare arcu, ac dapibus elit tincidunt non.
Fusce massa diam, tristique pellentesque ultricies eu, auctor nec ipsum.""".splitlines(keepends=True)

difference = difflib.context_diff(string_one,string_two)
for item in difference:
	print(item, end='')

[Image: Sequence split into individual lines by the splitlines method]

Let's break down the above code:

We have two sequences stored in the string_one and string_two variables.
The splitlines method splits each string into individual lines and returns them in a list (see image above).
The context_diff function returns the delta in context diff format.
The difference variable stores the generator (delta) returned by the context_diff function.
Iterating over the generator yields the individual differences (deltas).

In the output below, '!' marks the lines that differ between the two sequences.

[Image: The output of the context_diff function]
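context_diff also accepts optional fromfile and tofile labels for the header and an n argument for the number of context lines around each change. The sketch below uses short made-up lines and hypothetical file names to show the shape of the result.

```python
import difflib

before = ["alpha\n", "beta\n", "gamma\n"]
after = ["alpha\n", "BETA\n", "gamma\n"]

# fromfile/tofile label the header; n sets how many context lines surround a change.
# The file names are placeholders; no files are read.
delta = list(difflib.context_diff(before, after,
                                  fromfile="before.txt", tofile="after.txt", n=1))
print("".join(delta), end="")
```

The first two lines of the delta carry the labels, and the changed line appears once per sequence with a '!' prefix.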

unified_diff method

difflib.unified_diff compares two sequences of lines and returns a delta in unified diff format.

In unified format, the output shows each line that was added to or removed from the first sequence in a single combined listing, which is more compact than the context format.

import difflib

string_one = """Lorem ipsum dolor sit amet.
Pellentesque at leo neque.
Aenean sit amet tempor sem, eu tristique sapien.
Ut id quam at mauris volutpat fringilla sit amet et enim.
Morbi faucibus maximus massa, in commodo erat luctus ut.""".splitlines(keepends=True)

string_two = """Lorem ipsum dolor sit amet.
Pellentesque at leo nequed rattla bisawed.
Aenean sit amet tempor sem, eu tristique sapien.
Cras consequat ornare arcu, ac dapibus elit tincidunt non.
Fusce massa diam, tristique pellentesque ultricies eu, auctor nec ipsum.""".splitlines(keepends=True)

difference = difflib.unified_diff(string_one,string_two)
for item in difference:
	print(item, end='')

The code above works like the context_diff example. The only change is that the returned generator yields the delta in unified diff format instead of context diff format.

In the output, '-' marks the lines removed from the first sequence, and '+' marks the lines added to it.
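As a quick illustration of the unified format, the sketch below diffs two short made-up sequences. The fromfile and tofile values are hypothetical labels for the header, and n=1 limits the context to one line on each side of the change.

```python
import difflib

before = ["alpha\n", "beta\n", "gamma\n"]
after = ["alpha\n", "BETA\n", "gamma\n"]

# The file names only label the header; no files are read.
delta = list(difflib.unified_diff(before, after,
                                  fromfile="before.txt", tofile="after.txt", n=1))
print("".join(delta), end="")
```

The delta starts with the '---' and '+++' header lines, followed by an '@@' hunk marker and the changed lines.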

get_close_matches method

The get_close_matches function of difflib takes a word and a list of words to match against. It returns a list of the closest matches to the word, most similar first.

Syntax & Parameters

Syntax: get_close_matches(word, possibilities, n=3, cutoff=0.6)

Parameters:

word – the word for which matches are to be searched
possibilities – the list of words against which word is matched (named possible_words in the examples below)
n – an optional parameter giving the maximum number of matches to return. The default is 3.
cutoff – an optional parameter with a default value of 0.6. It must be in the range [0, 1]; candidates that score below the cutoff are ignored.

Example 1:

from difflib import get_close_matches

possible_words = ['eat','cat','their','beat','here','them']

if __name__ == "__main__":
	matches = get_close_matches('bat',possible_words)
	print(matches)

The above code takes in arguments and returns a list of close matches.

Example 2:

from difflib import get_close_matches

possible_words = ['theme','thames','that','this','those']

if __name__ == "__main__":
	matches = get_close_matches('the',possible_words,2,0.7)
	print(matches)

[Image: List of matches, with n = 2 and cutoff = 0.7]
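A common use of get_close_matches is a "did you mean" suggestion. The sketch below wraps it in a hypothetical helper named suggest (not part of difflib); the word list is the one used in the Python documentation.

```python
from difflib import get_close_matches

def suggest(word, vocabulary, limit=1):
    """Return up to `limit` entries from `vocabulary` that resemble `word`.

    Hypothetical helper for illustration; not part of difflib.
    """
    return get_close_matches(word, vocabulary, n=limit, cutoff=0.6)

fruits = ["ape", "apple", "peach", "puppy"]
print(suggest("appel", fruits))
print(suggest("xyz", fruits))
```

When nothing scores above the cutoff, the function returns an empty list, so callers should handle that case.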

ndiff method

difflib's ndiff also compares two sequences of lines and returns a Differ-style delta (difference).

import difflib

string_one = """Lorem ipsum dolor sit amet.
Aenean sit amet tempor sem, eu tristique sapien.
Ut id quam at mauris volutpat fringilla sit amet et enim.
""".splitlines(keepends=True)

string_two = """Lorem ipsum dolor sit amet.
Pellentesque at leo nequed rattla bisawed.
Cras consequat ornare arcu, ac dapibus elit tincidunt non.
""".splitlines(keepends=True)

difference = difflib.ndiff(string_one,string_two)
for item in difference:
    print(item, end='')

[Image: The output of ndiff]
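ndiff is a convenience wrapper around the Differ class. The sketch below checks that on a small sample; note that ndiff's default charjunk is difflib.IS_CHARACTER_JUNK, so the same filter is passed to Differ to make the two calls comparable.

```python
import difflib

before = ["one\n", "two\n", "three\n"]
after = ["ore\n", "tree\n", "emu\n"]

via_function = list(difflib.ndiff(before, after))
# ndiff's default charjunk is IS_CHARACTER_JUNK, so pass the same filter to Differ.
via_class = list(
    difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK).compare(before, after)
)

print("".join(via_function), end="")
print(via_function == via_class)
```

Both calls produce the same delta lines, each starting with one of the Differ codes.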

HtmlDiff

HtmlDiff of difflib module compares two sequences and returns the delta in an HTML file format. Let’s understand it using an example.

import difflib

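# Read both files; the with blocks close them automatically.
with open("a.txt", "r") as file_a:
    a = file_a.readlines()
with open("b.txt", "r") as file_b:
    b = file_b.readlines()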

difference = difflib.HtmlDiff(tabsize=2)

with open("delta.html", "w") as fp:
    html = difference.make_file(fromlines=a, tolines=b, fromdesc="Original", todesc="Modified")
    fp.write(html)

Let's break down the above code:

We have two files, a.txt and b.txt, containing names of countries. After opening and reading the contents of both files, we store them in the variables a and b respectively.
After that, we create an instance of HtmlDiff, which is used for generating the difference in HTML format.
The make_file method of HtmlDiff returns a string containing HTML that shows the differences between the two files. The lines of the files are shown side by side in a table.
The table headers for a.txt and b.txt are named Original and Modified.
Furthermore, you can see in the output image below that the word America was removed and China was added, highlighted in red and green respectively.

[Image: HtmlDiff delta output in an HTML file]
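HtmlDiff can also be used without touching the file system. The sketch below works on in-memory lists (the country names are sample data) and contrasts make_table, which returns only the table markup, with make_file, which returns a complete HTML document.

```python
import difflib

original = ["India\n", "America\n", "Japan\n"]
modified = ["India\n", "China\n", "Japan\n"]

html_diff = difflib.HtmlDiff(tabsize=2)

# make_table returns only the <table> markup, handy for embedding in an existing page.
table = html_diff.make_table(original, modified,
                             fromdesc="Original", todesc="Modified")
# make_file wraps the same kind of table in a complete HTML document.
page = html_diff.make_file(original, modified,
                           fromdesc="Original", todesc="Modified")

print(len(table), len(page))
```

make_table is usually the better fit when the diff is one part of a larger page that already has its own head and styles.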

FAQs on Python difflib

How do you find close matches in Python?

Python's built-in difflib module provides the get_close_matches function. It returns a list of the words closest to the target word.

What is difflib in Python?

The Difflib library of Python contains functions and classes used for computing the deltas of sequences or files.

How to ignore whitespaces in difflib?

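To ignore whitespace differences in difflib, normalize the lines before comparing them. For instance, the following function collapses every run of whitespace into a single space and strips leading and trailing whitespace:

import re

def remove_whitespace(line):
    return re.sub(r"\s+", " ", line.strip())

Apply it to each line of both sequences, then compare the results using difflib.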
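Put together, the approach might look like the sketch below. The helper name normalize_whitespace and the sample lines are made up for illustration; any difflib comparison function could take the place of ndiff.

```python
import difflib
import re

def normalize_whitespace(line):
    """Collapse each run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", line.strip())

# Sample lines that differ only in spacing and indentation.
first = ["total  =  a + b", "\treturn total"]
second = ["total = a + b", "return   total"]

delta = list(difflib.ndiff(
    [normalize_whitespace(line) for line in first],
    [normalize_whitespace(line) for line in second],
))
print("\n".join(delta))
```

Because the lines are identical after normalization, every delta line carries the "common to both" code.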

Conclusion

In this article, we covered difflib, a useful module in Python's standard library, and looked at its classes and functions through examples. We covered the Differ, SequenceMatcher, and HtmlDiff classes and their methods, as well as the module-level functions ndiff, context_diff, unified_diff, and get_close_matches.