Get The Most Out of Python Oshash Module

Hello Geeks! I hope you are doing well. In this article, we are going to discuss the oshash module. We will see how to install it and how to use it. We will also compare its performance with other hashing algorithms and walk through some examples to get a clear understanding. So, let’s get started.

Hashing and Hash Functions

Hashing means using a function or algorithm to map object data to a representative integer value. It is commonly used in hash tables, which store key-value pairs. The data is fed to the hashing function, which returns an integer known as a hash key or hash code. That hash code is then mapped onto the fixed number of slots in the table.
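
As a small illustration of that last step, here is a minimal sketch using Python’s built-in hash() function. The names bucket_index and table_size are made up for this example, and taking the remainder is only one simple way to map a hash code onto a fixed-size table.

```python
def bucket_index(key, table_size):
    """Map a key to a slot in a table that has table_size slots.

    bucket_index and table_size are hypothetical names for this sketch.
    """
    # hash() returns an integer hash code; the remainder keeps it in range.
    return hash(key) % table_size


table_size = 8
for key in ("apple", "banana", 42):
    # String hashes change between Python runs, so the slots can differ.
    print(key, "->", bucket_index(key, table_size))
```

Whatever the key is, the result always falls between 0 and table_size - 1.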

From here we can deduce that a hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, or simply hashes. Now that we understand hashing, we can move on to the module itself, “oshash”.
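
To see the “arbitrary size in, fixed size out” property in practice, here is a short sketch using SHA-256 from the standard hashlib module. SHA-256 is only chosen as a familiar example here, it is not what oshash uses.

```python
import hashlib

# Inputs of very different sizes
messages = (b"a", b"hello world", b"x" * 1_000_000)

# Map each input length to its SHA-256 digest (a hexadecimal string)
digests = {len(m): hashlib.sha256(m).hexdigest() for m in messages}

for size, digest in digests.items():
    print(size, "bytes ->", len(digest), "hex characters")
```

Every digest has the same length, no matter how large the input was.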

Oshash Module

Although there are several hashing algorithms that work quite efficiently, “Oshash” takes a different approach. Its primary goal is speed, where other algorithms lag behind. The main thing that makes them slow is that they read the whole file, which “oshash” does not do. Instead, it reads only parts of the file.

However, we don’t have to worry much about its internal workings. We will focus more on its usage. Let’s see its installation first and then move on to the examples.

Installation

We can install it using pip with the following command.

pip install oshash

Using Oshash

Once we are done with the installation, let’s see how we can use it.

We can use it in either of two ways: inside a program file or through the command-line interface. Let’s see an example of each.

In Program file

# First, import the oshash module
import oshash

# Replace the placeholder with the path to a real file
file_hash = oshash.oshash("<path to video file>")

In Command Line

$ oshash <path to file>

In both cases, it returns the hash of the file.

Working of Oshash

The examples above do not show the algorithm itself. In the background, the hash is computed roughly as in the following pseudocode.

file_buffer = open("/path/to/file/")

head_checksum = checksum(file_buffer.head(64 * 1024))  # 64KB
tail_checksum = checksum(file_buffer.tail(64 * 1024))  # 64KB

file_hash = file_buffer.size + head_checksum + tail_checksum
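
The pseudocode above can be turned into a runnable sketch. The version below is not the library’s actual source: it assumes the checksum is the sum of little-endian unsigned 64-bit words, wrapped at 64 bits, as in the commonly described OpenSubtitles hash, and it assumes the result is formatted as 16 hexadecimal digits. The names sketch_oshash, checksum, CHUNK_SIZE and MASK_64 are made up for this example.

```python
import os
import struct

CHUNK_SIZE = 64 * 1024         # 64KB, as in the pseudocode above
MASK_64 = 0xFFFFFFFFFFFFFFFF   # keep sums within 64 bits (assumption)


def checksum(chunk):
    """Sum a chunk as little-endian unsigned 64-bit words (assumed scheme)."""
    usable = len(chunk) - len(chunk) % 8  # ignore trailing bytes that do not fill a word
    words = struct.unpack("<%dQ" % (usable // 8), chunk[:usable])
    return sum(words) & MASK_64


def sketch_oshash(file_path):
    """Hash a file from its size plus the checksums of its head and tail."""
    file_size = os.path.getsize(file_path)
    if file_size < 2 * CHUNK_SIZE:
        # This sketch only handles files with a full head and tail chunk.
        raise ValueError("file is smaller than 128KB")

    with open(file_path, "rb") as f:
        head = f.read(CHUNK_SIZE)         # first 64KB
        f.seek(-CHUNK_SIZE, os.SEEK_END)  # jump to the last 64KB
        tail = f.read(CHUNK_SIZE)

    file_hash = (file_size + checksum(head) + checksum(tail)) & MASK_64
    return "%016x" % file_hash
```

Note that only 128KB is ever read, however large the file is. That is where the speed comes from.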

Comparison of Oshash with other methods to calculate Checksum

Now that we understand what oshash is, it’s time to see how this module compares with others. To do so, let’s look at a script that times the different algorithms and plots the results as a graph.

import os
import hashlib
import argparse

import oshash

from pprint import pprint
from timeit import timeit

try:
    import matplotlib.pyplot as plt
except ImportError:
    plt = None

try:
    import seaborn  # noqa
    seaborn.set()
except ImportError:
    pass


BLOCK_SIZE = 65536  # 64K: 64 * 1024


def hash_file(file_path, hash_func):
    hasher = hash_func()

    with open(file_path, "rb") as f:
        block_bytes = f.read(BLOCK_SIZE)

        while len(block_bytes) > 0:
            hasher.update(block_bytes)
            block_bytes = f.read(BLOCK_SIZE)

    return hasher.hexdigest()


def main():
    parser = argparse.ArgumentParser(description="OpenSubtitles Hash tool")

    parser.add_argument("file_path", help="File path to test each algorithm")
    parser.add_argument("-n", "--number", type=int, help="How many times to execute each algorithm", default=1)

    args = parser.parse_args()

    file_path = os.path.expanduser(args.file_path)

    algorithm_times = {
        "oshash": timeit(lambda: oshash.oshash(file_path), number=args.number)
    }

    for algorithm_name in hashlib.algorithms_guaranteed:
        if algorithm_name.startswith('shake_'):
            continue

        hash_func = getattr(hashlib, algorithm_name, None)

        algorithm_times[algorithm_name] = timeit(lambda: hash_file(file_path, hash_func), number=args.number)

    pprint(algorithm_times)

    if plt is not None:
        names, times = zip(*sorted(algorithm_times.items(), key=lambda x: x[1]))

        barlist = plt.bar(names, times)

        barlist[names.index("oshash")].set_color("red")  # mark oshash bar

        file_size = os.path.getsize(file_path)
        plt.title("{} ({})".format(file_path, oshash.utils.human_size(file_size)))

        plt.show()


if __name__ == '__main__':
    main()

Figure: Comparison of Oshash with other methods to calculate Checksum

In the above code, we first imported all the modules needed for the comparison and the graph plotting. We also imported the oshash module along with the hashlib module, which provides the SHA-224, SHA-256, SHA-384, and SHA-512 hash algorithms in addition to platform-optimized versions of MD5 and SHA1. If OpenSSL is present, all of its hash algorithms are provided as well.

After the imports, we defined a hash_file() function, which computes the hash of a file with the given algorithm, reading the file in blocks of 64 KB. Then we defined a main() function. In it, we first read the path of the file from the command line and then create a dictionary whose keys are the algorithm names and whose values are the times required to compute the hash.

We first measured the time for the oshash module by passing a lambda function to timeit. After that, we measured the other algorithms: we iterate over every algorithm that hashlib guarantees (skipping the shake_ variants) and store the time taken to hash the file in the dictionary. Once all the algorithms are timed, we plot the results as a bar chart using the matplotlib library, with the oshash bar marked in red.
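
The same timing pattern can be tried without a file, oshash, or matplotlib. The sketch below is a reduced variant of the script, not a replacement for it: it times only the hashlib algorithms on data held in memory. The function name time_algorithms and the 1 MB sample payload are made up for this example.

```python
import hashlib
from timeit import timeit

# 1 MB of sample data, a hypothetical stand-in for a real file
data = b"x" * (1024 * 1024)


def time_algorithms(payload, number=3):
    """Return {algorithm name: seconds taken to hash payload `number` times}."""
    times = {}
    for name in sorted(hashlib.algorithms_guaranteed):
        if name.startswith("shake_"):
            continue  # shake digests need an explicit length, so skip them
        hash_func = getattr(hashlib, name)
        # The lambda runs immediately inside timeit, so it sees the current hash_func.
        times[name] = timeit(lambda: hash_func(payload).hexdigest(), number=number)
    return times


# Print the algorithms from fastest to slowest
for name, seconds in sorted(time_algorithms(data).items(), key=lambda item: item[1]):
    print("{:10s} {:.4f} s".format(name, seconds))
```

The exact numbers depend on the machine, so the output is best read as a ranking rather than as absolute timings.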

From the graph we can easily compare the time taken by the oshash module with the algorithms in the hashlib module, and oshash is way faster than the others, since it reads only the first and last 64 KB of the file instead of the whole file.

FAQs

Can we use other hashes like SHA and MD5 using Oshash?

No, we can’t use other hashes like SHA and MD5 using Oshash. For those, we can use the hashlib module.

What is the Block Size used in reading files using oshash?

The block size used for reading files is 65536 bytes (64 * 1024), i.e. 64 KB.

Conclusion

So, today in this article we learned about the Python “oshash” module. We also learned how to use it to hash files and how its speed compares with the algorithms in hashlib.

Hope this article helped you. Thank you
