Skip to main content

MD5 Hash Function: What is it and How to Implement?

MD5 (Message Digest Algorithm 5)

MD5 hash function is one of the oldest and widely used hash function. MD5 stands for Message Digest Algorithm was developed in 1991 by Ronald Rivest. MD5 accepts input of any size and generates hash value of 32 characters (128 bits). 

Let's get straight into Step-by-Step process of MD5 algorithm and then look into some of the vulnerabilities of MD5 Algorithm.
 

Step-by-Step Process

On a high level below are the steps involved in generating hash value. 
  • Padding the Input
  • Breaking the Input into Blocks
  • Initializing the State variables
  • Processing the Blocks
  • Finalizing the Hash Value
Let's have a look into what does each step involve.

Padding the Input

MD5 Algorithm works on blocks of size 512 bits. So, first thing is to ensure the input message is of 512 bits (or multiples of 512 bits).

Before processing the input message, input would be padded with the below. 
  • '1' bit would be padded immediately after the input message. '1' bit works as a delimiter to indicate the end of the actual input message before any padding. This is mandatory irrespective of the input message size. 
  • '0' bits to be added as a filler to ensure that the input message is a multiple of 512 bits. This is only required if length of the input message (including '1' delimiter and 64 bits indicating the length) is not a multiple of 512. 
  • The last 64 bits representing the length of the actual input message would be padded.
Out of these, '1' bit (delimiter) and 64 bits (length representation) are mandatory and fixed. Number of '0' bits to be padded would be depending upon the length of the input message and calculated based on the below.

Length of the input message should be congruent to 448 modulo 512. Input message in this context is after padding the '1' bit and '0' bits. Below calculation can be used to ensure how many number of '0' bits to be added, so that length of input message would be congruent to 448 modulo 512. 

Padding Length = (448 - (Message Length + 1)%512) % 512

So, what is the importance of 448 here? When MD5 actually processes blocks of 512 bits size? 

64 bits are to be mandatorily padded at the end of input message to indicate the length of actual input message. '512 - 64 = 448', Which would be the required length of the message after any padding ('1' bit, '0' bits). So, any input message which is congruent to 448 modulo 512 would become a multiple of 512 bits after adding it's length at the end.

Let's break down the above formula. 
  • Message Length: Length of the actual input message (before any padding). 
  • (Message Length + 1): '+1' here is to consider the mandatory delimiter '1' bit. 
  • Subtracting from 448: To get the actual number '0' bits to be padded.
 This can be better understood 

If the input message is of 447 bits, 

Padding Length = (448 - (447 + 1)%512)%512 
               = (448 - 448%512)%512 
               = (448 - 448)%512
               = (0)%512 
               = 0

No additional padding is required if the input message is of 447 bits, this is because
  • mandatory '1' bit would make the input message length to 448 bits. 
  • mandatory '64' bits to represent the input length make the input message length to 512 bits, which is equal to the block size required by MD5
If the input message is of 512 bits, 

Padding Length = (448 - (512 + 1)%512)%512 
               = (448 - 513%512)%512 
               = (448 - 1)%512
               = (447)%512 
               = 447
  • mandatory '1' bit would make the input message length to 513 bits. 
  • As per the calculation 447 '0' bits are to be added to the input message after delimiter, this would make the input length to 960 bits
  • mandatory '64' bits to represent the input length make the input message length to 1024 bits, which is a multiple of 512. MD5 splits the message into two blocks of 512 bits and process.
For cases where the input message is greater than 512 bits (E.g.: 2000 bits),

Padding Length = (448 - (2000 + 1)%512)%512 
               = (448 - 2001%512)%512 
               = (448 - 465)%512
               = (-17)%512 # To avoid negative, we can add 512
               = (-17 + 512)%512
               = (495)%512 
               = 495
  • mandatory '1' bit would make the input message length to 2001 bits. 
  • As per the calculation 495 '0' bits are to be added to the input message after delimiter, this would make the input length to 2496 bits
  • mandatory '64' bits to represent the input length make the input message length to 2560 bits, which is a multiple of 512. MD5 splits the message into five blocks of 512 bits and process.

Breaking the Input into Blocks

Input message (which is padded as described above) to be broken into multiple blocks with each block size is of 512 bits. 

Initializing the State variables

State variables hold the internal state of the hash function and are initialized with a specific constant at the beginning of hashing process. 

MD5 algorithm uses four state variables, each of them are identified as A, B, C and D. These variables are initialized with the below values. 

A = 0x67452301
B = 0xEFCDAB89
C = 0x98BADCFE
D = 0x10325476

Processing the Blocks

During the process of hash value generation, each block goes through 64 steps in four different rounds (with each round having 16 steps).

Each step involves some calculations and transformations applied to the current block's data and state variables. These operations are designed to ensure hash value is unique and unpredictable by introducing non-linearity and diffusion. 

Below are the four non-linear functions part of the each step, different non-linear function would be used in each round. 

Each of these functions would take the state variables as input and applies different combinations of logical functions, bitwise operations and modular arithmetic operations on the block. Using different combination for different functions produces different intermediate result. 

Intermediate result from each function would be combined with the corresponding state variables using bitwise addition modulo 2^32. 

The updated state variables would become input for the next step. Finalized state variables after processing all blocks represent the internal state of the MD5 Algorithm. 

Finalizing the Hash Value

Last step in generating the Hash is to process the finalized state variables after processing all the blocks, convert them to Little-Endian format, concatenate state variables and convert the concatenated value to Hexadecimal format. 
  • Convert each of the state variable (which is of 32 bit length) from actual Big-Endian format to Little-Endian format. Below is the brief on what is Big-Endian format and Little-Endian format. 
    • Big-Endian format: This format can be visualized as reading from left to right, with most significant byte (MSB) is stored at the lowest memory address and least significant byte (LSB) is stored at the highest memory address. This is similar to how we write the numbers.
    • Little-Endian format: This format is opposite to the Big-Endian format and can be visualized as reading from right to left, with lost significant byte (LSB) is stored at the lowest memory address and most significant byte (MSB) is stored at the highest memory address. This is different from the way how we write the numbers. 
    • E.g.: A 16 bit number 0x1234, is stored as 0x12 followed by 0x34 in big-endian format and it would be stored as 0x34 followed by 0x12 in little-endian format. 
  • Concatenate all the state variables (converted to little-endian format) in the order of A, B, C and D. The result would form a 128 bit value. 
  • Convert the concatenated 128 bit value to the hexadecimal format. 
  • The resulting 128 bit hexadecimal value is the final Hash value generated by the MD5 algorithm. 

Implementing MD5 Hash

How to generate MD5 for any given string? Let's look at an example in Python. 

For this example we will use the module 'HashLib'. There is a function 'md5' in HashLib which accepts the encoded input string and generates Hash value.

1

2

3

4

5

6

7


8

9

10

11

import hashlib


# Accept the input string for generating Hash

input_string = input("Enter string: ")

# Input string needs to be encoded before generating Hash

encoded_input_string = input_string.encode()

# Encoded string would be passed to 'md5' function from the HashLib module.

hash_value = hashlib.md5(encoded_input_string)


print(hash_value.hexdigest())



In the above example, 
  • Line - 6: Input string needs to encoded. This can be done using the string method encode(). 
  • Line - 8: Encoded input string to be passed to the function md5(), this returns the HASH object. 
  • Line - 10: Method 'hexdigest()' can be used on the HASH object to get the Hexadecimal Hash value. 
Below is the sample result. 

Enter string: This is a String

80b2959a8f7c38e52d7daa1660eafed7


Any minor change to the input string would result in generating a completely different Hash value. Below is the hash value with the same string by converting 'S' to lower case. 

Enter string: This is a string

41fb5b5ae4d57c5ee528adb00e5e8e74



I hope this post has provided a good insight on what is a MD5 hash function and how we can generate Hash using MD5 algorithm in Python. Please note that MD5 is considered to be insecure for cryptographic purposes due to the vulnerabilities identified and more secure hash function (like SHA-256) is recommended. 


If you have any Suggestions or Feedback, Please leave a comment below or use Contact Form.

Comments

Popular posts from this blog

All about READ in RPGLE & Why we use it with SETLL/SETGT?

READ READ is one of the most used Opcodes in RPGLE. As the name suggests main purpose of this Opcode is to read a record from Database file. What are the different READ Opcodes? To list, Below are the five Opcodes.  READ - Read a Record READC - Read Next Changed Record READE - Read Equal Key Record READP - Read Prior Record READPE - Read Prior Equal Record We will see more about each of these later in this article. Before that, We will see a bit about SETLL/SETGT .  SETLL (Set Lower Limit) SETLL accepts Key Fields or Relative Record Number (RRN) as Search Arguments and positions the file at the Corresponding Record (or Next Record if exact match isn't found).  SETGT (Set Greater Than) SETGT accepts Key Fields or Relative Record Number (RRN) as Search Arguments and positions the file at the Next Record (Greater Than the Key value). Syntax: SETLL SEARCH-ARGUMENTS/KEYFIELDS FILENAME SETGT  SEARCH-ARGUMENTS/KEYFIELDS FILENAME One of the below can be passed as Search Arguments. Key Fiel

What we need to know about CHAIN (RPGLE) & How is it different from READ?

CHAIN READ & CHAIN, These are one of the most used (& useful) Opcodes by any RPG developer. These Opcodes are used to read a record from file. So, What's the difference between CHAIN & READ?   CHAIN operation retrieves a record based on the Key specified. It's more like Retrieving Random record from a Database file based on the Key fields.  READ operation reads the record currently pointed to from a Database file. There are multiple Opcodes that start with READ and all are used to read a record but with slight difference. We will see more about different Opcodes and How they are different from each other (and CHAIN) in another article. Few differences to note.  CHAIN requires Key fields to read a record where as READ would read the record currently pointed to (SETLL or SETGT are used to point a Record).  If there are multiple records with the same Key data, CHAIN would return the same record every time. READE can be used to read all the records with the specified Ke

Extract a portion of a Date/Time/Timestamp in RPGLE - IBM i

%SUBDT Extracting Year, Month, Day, Hour, Minutes, Seconds or Milli seconds of a given Date/Time/Timestamp is required most of the times.  This can be extracted easily by using %SUBDT. BIF name looks more similar to %SUBST which is used to extract a portion of string by passing from and two positions of the original string. Instead, We would need to pass a value (i.e., Date, Time or Timestamp ) and Unit (i.e., *YEARS, *MONTHS, *DAYS, *HOURS, *MINUTES, *SECONDS or *MSECONDS) to %SUBDT.  Valid unit should be passed for the type of the value passed. Below are the valid values for each type. Date - *DAYS, *MONTHS, *YEARS Time - *HOURS, *MINUTES, *SECONDS Timestamp - *DAYS, *MONTHS, *YEARS, *HOURS, *MINUTES, *SECONDS, *MSECONDS Syntax: %SUBDT(value : unit { : digits { : decpos} }) Value and Unit are the mandatory arguments.  Digits and Decimal positions are optional and can only be used with *SECONDS for Timestamp. We can either pass the full form for the unit or use the short form. Below i