Run Length Encoding in Python Guide

Run length encoding in python is an algorithm using which we replace values inside a string that occurs repetitively. We count the number of similar characters, and instead of sending them multiple times, we send them along with their character count. So, for every character of similar value, only two characters are representing it – its value and its count. In this article, we shall build a function for implementing run length encoding in python.

Contents

What is the need of run length encoding?

Run length encoding is an algorithm for performing lossless data compression. Lossless data compression refers to compressing the data in such a way that the original form of the data can then be derived from it. When a character occurs a large number of times consecutively in a sequence, then we can represent the same consecutive subsequence using only a single occurrence of that character and its count. Using run length encoding, we can save memory space while transmitting data and preserving its original form. It is useful when we want to store or transmit large sequences of data.

For example if the sequence is : AACCCBBBBBAAAAFFFFFFFF

Then, using run length encoding, we can represent it as: A2C3B5A4F8

The 22 length sequence was compressed to a 10 length sequence.

Implementing run length encoding

To implement run length encoding, we will have to store the string first. Then, we have to scan the entire string, store each consecutive character by a single character, and count its occurrences. We will implement run length encoding in python using the list.

Implementing run length encoding using a list can be done using indexing. We can either maintain two different lists – one for storing character and one for storing its count or use nested lists inside a single list. Here, we shall perform the program using nested lists.

1. Defining the function

Now first, we shall define the run length encoding function.

def run_length_encoding(seq):
  compressed = []
  count = 1
  char = seq[0]
  for i in range(1,len(seq)):
    if seq[i] == char:
      count = count + 1
    else :
      compressed.append([char,count])
      char = seq[i]
      count = 1
  compressed.append([char,count])
  return compressed

Here, the name of the function defined is ‘run_length_encoding()’. It accepts one argument, which is the sequence to be compressed. We have taken an empty list named ‘compressed’ to which we will append sub-lists. Then, we have taken a counter variable named ‘count’ which will count the character occurrences. The ‘count’ is set to 1.

Also, we have a string named ‘char’ which will be used to store the previous character, and we shall compare the current character with ‘char.’ Initially, the value of variable ‘char’ is the first character of the sequence.

Then, we run a for loop from the second character of the sequence till the end of the sequence. Using an if condition, we shall compare if the current character is equal to the character stored in the variable ‘char.’ If equal, then it means that two consecutive characters have the same value, so we will then increment ‘count.’

The count will be incremented till two consecutive characters are equal. Once that condition fails, the program will move to the other part. So, it will append a nested list to the main list ‘compressed.’ The nested list shall contain the character and its count.

Then, we shall set the value of ‘char’ to the new character and set ‘count’ to 1. Once the complete list has been scanned, we shall append the last character sequence to the list ‘compressed,’ otherwise remaining excluded from the list.

In the end, we return the list ‘compressed.’

2. Calling the function

First, we shall define a string sequence inside a variable ‘seq.’ Then, we shall pass that sequence into the run_length_encoding() function and store the output string into a list named ‘list1’.

seq = 'AACCCBBBBBAAAAFFFFFFFF'
list1 = run_length_encoding(seq)

The list1 would be something like this:

[['A', 2], ['C', 3], ['B', 5], ['A', 4], ['F', 8]]

Now, we shall take a null string named ‘compressed_seq.’ Using for loop, we will print each nested list and store its value in the form of a string inside ‘compressed_seq.’

compressed_seq = ''

for i in range(0,len(list1)):
  for j in list1[i]:
    compressed_seq += str(j)

print(compressed_seq)

The output compressed string after encoding is:

A2C3B5A4F8

The Entire Code is:

def run_length_encoding(seq):
  compressed = []
  count = 1
  char = seq[0]
  for i in range(1,len(seq)):
    if seq[i] == char:
      count = count + 1
    else :
      compressed.append([char,count])
      char = seq[i]
      count = 1
  compressed.append([char,count])
  return compressed

seq = 'AACCCBBBBBAAAAFFFFFFFF'
list1 = run_length_encoding(seq)

compressed_seq = ''

for i in range(0,len(list1)):
  for j in list1[i]:
    compressed_seq += str(j)

print(compressed_seq)

Also, Read | Python List Length | How to Find the Length of List in Python

How to do decoding?

Decoding the compressed string is not a very complex task. We shall define a function named ‘run_length_decoding()’ to perform decoding.

def run_length_decoding(compressed_seq):
  seq = ''
  for i in range(0,len(compressed_seq)):
    if compressed_seq[i].isalpha() == True:
      for j in range(int(compressed_seq[i+1])):
        seq += compressed_seq[i]

  return(seq)

The function will take one argument, ‘compressed_seq,’ which will be the sequence in its compressed form. We take a local null string named ‘seq.’ Then, we will run a for loop along the length of the compressed sequence. We will check whether the current character is an alphabet or not. We will check that using the isalpha() method.

Since the compressed representation contains the character first and then its count, we will first check for the alphabet character. If it is an alphabet, then the next character will be its count. So, we will run another for loop until the count’s value and append that character to the string ‘seq’. In the end, we shall return the string ‘seq’.

Then we will call the function run_length_decoding() and pass the compressed string as an argument to that function.

print(run_length_decoding(compressed_seq))

The output is:

AACCCBBBBBAAAAFFFFFFFF

As seen, this will return the original sequence.

The entire program containing for implementing both encoding and decoding is :

def run_length_encoding(seq):
  compressed = []
  count = 1
  char = seq[0]
  for i in range(1,len(seq)):
    if seq[i] == char:
      count = count + 1
    else :
      compressed.append([char,count])
      char = seq[i]
      count = 1
  compressed.append([char,count])
  return compressed

def run_length_decoding(compressed_seq):
  seq = ''
  for i in range(0,len(compressed_seq)):
    if compressed_seq[i].isalpha() == True:
      for j in range(int(compressed_seq[i+1])):
        seq += compressed_seq[i]

  return(seq)

seq = 'AACCCBBBBBAAAAFFFFFFFF'
list1 = run_length_encoding(seq)

compressed_seq = ''

for i in range(0,len(list1)):
  for j in list1[i]:
    compressed_seq += str(j)

print(compressed_seq)
print(run_length_decoding(compressed_seq))

Run length encoding using numpy

We can also perform run length encoding on a numpy array. Let us create a separate function for that. The name of the function is run_length_encoding_numpy(). It accepts one argument, which is the array sequence.

First, we shall import the numpy library.

import numpy as np

Then, the function defined is:

def run_length_encoding_numpy(seq_array):
  compressed_seq = ''
  change =  (seq_array[1:] != seq_array[:-1])
  x = np.append(np.where(change), len(seq_array)-1)
  counter = np.diff(np.append(-1,x))

  for i in range(0,len(counter)):
    compressed_seq = compressed_seq + str(seq_array[x][i]) + str(counter[i])

  return compressed_seq

The variable ‘change’ is a list containing boolean values. The value is False if a given sequence has the same consecutive values. It will be True where the values are changing from their previous value. np.where(change) shall contain the indexes where the list change was True. Then, we have a variable named ‘counter’ which will store the count of consecutively equal sequences. We shall add the array element and its respective count to the string ‘compressed_seq’ using a for loop. The function shall return that string.

To call the function, we shall execute the below lines of code.

str_array = np.array(['A','A','B','C','C','C','C','C'])
print(run_length_encoding_numpy(str_array))