---
comments: true
---

# 6.3 Hash Algorithm

The previous two sections introduced the working principle of hash tables and the methods to handle hash collisions. However, both open addressing and separate chaining **can only ensure that the hash table functions normally when hash collisions occur, but cannot reduce the frequency of hash collisions**.

If hash collisions occur too frequently, the performance of the hash table will deteriorate drastically. As shown in Figure 6-8, for a separate chaining hash table, in the ideal case, the key-value pairs are evenly distributed across the buckets, achieving optimal query efficiency; in the worst case, all key-value pairs are stored in the same bucket, degrading the time complexity to $O(n)$.

Figure 6-8 Ideal and worst cases of hash collisions
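
To make the two cases in Figure 6-8 concrete, the following minimal sketch distributes the same keys into buckets twice: once with a hash that spreads the keys out, and once with a degenerate hash that sends every key to one bucket. The names `bucket_sizes`, `spread_hash`, and `constant_hash` are hypothetical and exist only for this illustration; they are not part of the hash table implementations from the previous sections.

```python
def bucket_sizes(keys, capacity, hash_func):
    """Count how many keys land in each bucket of a chained hash table."""
    sizes = [0] * capacity
    for key in keys:
        sizes[hash_func(key) % capacity] += 1
    return sizes


def spread_hash(key: int) -> int:
    """Illustrative hash (assumption): for integer keys, use the key itself."""
    return key


def constant_hash(key: int) -> int:
    """Degenerate hash (assumption): every key gets the same hash value."""
    return 0


keys = list(range(100))
ideal = bucket_sizes(keys, 10, spread_hash)
worst = bucket_sizes(keys, 10, constant_hash)
print(max(ideal))  # length of the longest chain with the spread-out hash
print(max(worst))  # length of the longest chain with the degenerate hash
```

With 100 keys and 10 buckets, the longest chain holds 10 pairs in the first case and all 100 pairs in the second, so a lookup in the second table may have to scan every stored pair.
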

**The distribution of key-value pairs is determined by the hash function**. Recall the steps of the hash function: first compute the hash value, then take it modulo the array length:

```shell
index = hash(key) % capacity
```

Observing the above formula, when the hash table capacity `capacity` is fixed, **the hash algorithm `hash()` determines the output value**, thereby determining the distribution of key-value pairs in the hash table.

This means that, to reduce the probability of hash collisions, we should focus on the design of the hash algorithm `hash()`.

## 6.3.1 Goals of Hash Algorithms

To build a hash table that is both fast and robust, a hash algorithm should have the following properties:

- **Determinism**: For the same input, the hash algorithm should always produce the same output. Only then can the hash table be reliable.
- **High efficiency**: The process of computing the hash value should be fast enough. The smaller the computational overhead, the more practical the hash table.
- **Uniform distribution**: The hash algorithm should ensure that key-value pairs are evenly distributed in the hash table. The more uniform the distribution, the lower the probability of hash collisions.

In fact, hash algorithms are not only used to implement hash tables but are also widely applied in other fields.

- **Password storage**: To protect the security of user passwords, systems usually do not store the plaintext passwords but rather the hash values of the passwords. When a user enters a password, the system calculates the hash value of the input and compares it with the stored hash value. If they match, the password is considered correct.
- **Data integrity check**: The data sender can calculate the hash value of the data and send it along; the receiver can recalculate the hash value of the received data and compare it with the received hash value. If they match, the data is considered intact.
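
The data integrity check described above can be sketched with Python's standard `hashlib` module. This is a minimal illustration, not a complete protocol: SHA-256 is chosen here as an assumption, and the helper names `sha256_hex` and `is_intact` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(received: bytes, expected_digest: str) -> bool:
    """Recompute the digest of the received data and compare it with the one sent along."""
    return sha256_hex(received) == expected_digest


message = b"hello algo"
digest = sha256_hex(message)  # the sender computes this and sends it with the data

print(is_intact(message, digest))         # data arrived unchanged
print(is_intact(b"hello algo!", digest))  # data was altered in transit
```

Password verification follows the same compare-the-hashes pattern, although real systems typically also add a per-user salt and use a deliberately slow key-derivation function rather than a single plain hash.
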

For cryptographic applications, hash algorithms need stronger security properties to prevent reverse engineering, such as inferring the original password from a hash value.

- **Unidirectionality**: It should be impossible to deduce any information about the input data from the hash value.
- **Collision resistance**: It should be extremely difficult to find two different inputs that produce the same hash value.
- **Avalanche effect**: Minor changes in the input should lead to significant and unpredictable changes in the output.

Note that **"uniform distribution" and "collision resistance" are two independent concepts**. Satisfying uniform distribution does not necessarily mean collision resistance. For example, under random input `key`, the hash function `key % 100` can produce a uniformly distributed output. However, this hash algorithm is too simple, and all `key` with the same last two digits will have the same output, making it easy to deduce a usable `key` from the hash value, thereby cracking the password.

## 6.3.2 Design of Hash Algorithms

The design of hash algorithms is a complex issue that requires consideration of many factors. However, for some less demanding scenarios, we can also design some simple hash algorithms.

- **Additive hash**: Add up the ASCII codes of each character in the input and use the total sum as the hash value.
- **Multiplicative hash**: Leverage the low correlation introduced by multiplication: multiply by a constant at each step and accumulate the ASCII codes of the characters into the hash value.
- **XOR hash**: Accumulate the hash value by XORing each element of the input data.
- **Rotating hash**: Accumulate the ASCII code of each character into a hash value, performing a rotation operation on the hash value before each accumulation.

=== "Python"

    ```python title="simple_hash.py"
    def add_hash(key: str) -> int:
        """Additive hash"""
        hash = 0
        modulus = 1000000007
        for c in key:
            hash += ord(c)
        return hash % modulus

    def mul_hash(key: str) -> int:
        """Multiplicative hash"""
        hash = 0
        modulus = 1000000007
        for c in key:
            hash = 31 * hash + ord(c)
        return hash % modulus

    def xor_hash(key: str) -> int:
        """XOR hash"""
        hash = 0
        modulus = 1000000007
        for c in key:
            hash ^= ord(c)
        return hash % modulus

    def rot_hash(key: str) -> int:
        """Rotational hash"""
        hash = 0
        modulus = 1000000007
        for c in key:
            hash = (hash << 4) ^ (hash >> 28) ^ ord(c)
        return hash % modulus
    ```

=== "C++"

    ```cpp title="simple_hash.cpp"
    /* Additive hash */
    int addHash(string key) {
        long long hash = 0;
        const int MODULUS = 1000000007;
        for (unsigned char c : key) {
            hash = (hash + (int)c) % MODULUS;
        }
        return (int)hash;
    }

    /* Multiplicative hash */
    int mulHash(string key) {
        long long hash = 0;
        const int MODULUS = 1000000007;
        for (unsigned char c : key) {
            hash = (31 * hash + (int)c) % MODULUS;
        }
        return (int)hash;
    }

    /* XOR hash */
    int xorHash(string key) {
        int hash = 0;
        const int MODULUS = 1000000007;
        for (unsigned char c : key) {
            hash ^= (int)c;
        }
        // Take the modulo (not a bitwise AND) as the final step
        return hash % MODULUS;
    }

    /* Rotational hash */
    int rotHash(string key) {
        long long hash = 0;
        const int MODULUS = 1000000007;
        for (unsigned char c : key) {
            hash = ((hash << 4) ^ (hash >> 28) ^ (int)c) % MODULUS;
        }
        return (int)hash;
    }
    ```

=== "Java"

    ```java title="simple_hash.java"
    /* Additive hash */
    int addHash(String key) {
        long hash = 0;
        final int MODULUS = 1000000007;
        for (char c : key.toCharArray()) {
            hash = (hash + (int) c) % MODULUS;
        }
        return (int) hash;
    }

    /* Multiplicative hash */
    int mulHash(String key) {
        long hash = 0;
        final int MODULUS = 1000000007;
        for (char c : key.toCharArray()) {
            hash = (31 * hash + (int) c) % MODULUS;
        }
        return (int) hash;
    }

    /* XOR hash */
    int xorHash(String key) {
        int hash = 0;
        final int MODULUS = 1000000007;
        for (char c : key.toCharArray()) {
            hash ^= (int) c;
        }
        return hash %
                MODULUS;
    }

    /* Rotational hash */
    int rotHash(String key) {
        long hash = 0;
        final int MODULUS = 1000000007;
        for (char c : key.toCharArray()) {
            hash = ((hash << 4) ^ (hash >> 28) ^ (int) c) % MODULUS;
        }
        return (int) hash;
    }
    ```

=== "C#"

    ```csharp title="simple_hash.cs"
    /* Additive hash */
    int AddHash(string key) {
        long hash = 0;
        const int MODULUS = 1000000007;
        foreach (char c in key) {
            hash = (hash + c) % MODULUS;
        }
        return (int)hash;
    }

    /* Multiplicative hash */
    int MulHash(string key) {
        long hash = 0;
        const int MODULUS = 1000000007;
        foreach (char c in key) {
            hash = (31 * hash + c) % MODULUS;
        }
        return (int)hash;
    }

    /* XOR hash */
    int XorHash(string key) {
        int hash = 0;
        const int MODULUS = 1000000007;
        foreach (char c in key) {
            hash ^= c;
        }
        // Take the modulo (not a bitwise AND) as the final step
        return hash % MODULUS;
    }

    /* Rotational hash */
    int RotHash(string key) {
        long hash = 0;
        const int MODULUS = 1000000007;
        foreach (char c in key) {
            hash = ((hash << 4) ^ (hash >> 28) ^ c) % MODULUS;
        }
        return (int)hash;
    }
    ```

=== "Go"

    ```go title="simple_hash.go"
    /* Additive hash */
    func addHash(key string) int {
        var hash int64
        var modulus int64

        modulus = 1000000007
        for _, b := range []byte(key) {
            hash = (hash + int64(b)) % modulus
        }
        return int(hash)
    }

    /* Multiplicative hash */
    func mulHash(key string) int {
        var hash int64
        var modulus int64

        modulus = 1000000007
        for _, b := range []byte(key) {
            hash = (31*hash + int64(b)) % modulus
        }
        return int(hash)
    }

    /* XOR hash */
    func xorHash(key string) int {
        hash := 0
        modulus := 1000000007
        for _, b := range []byte(key) {
            // Only XOR each byte into the hash value
            hash ^= int(b)
        }
        return hash % modulus
    }

    /* Rotational hash */
    func rotHash(key string) int {
        var hash int64
        var modulus int64

        modulus = 1000000007
        for _, b := range []byte(key) {
            hash = ((hash << 4) ^ (hash >> 28) ^ int64(b)) % modulus
        }
        return int(hash)
    }
    ```

=== "Swift"

    ```swift title="simple_hash.swift"
    /* Additive hash */
    func addHash(key: String) -> Int {
        var hash = 0
        let MODULUS = 1_000_000_007
        for c in key {
            for scalar in
                    c.unicodeScalars {
                hash = (hash + Int(scalar.value)) % MODULUS
            }
        }
        return hash
    }

    /* Multiplicative hash */
    func mulHash(key: String) -> Int {
        var hash = 0
        let MODULUS = 1_000_000_007
        for c in key {
            for scalar in c.unicodeScalars {
                hash = (31 * hash + Int(scalar.value)) % MODULUS
            }
        }
        return hash
    }

    /* XOR hash */
    func xorHash(key: String) -> Int {
        var hash = 0
        let MODULUS = 1_000_000_007
        for c in key {
            for scalar in c.unicodeScalars {
                hash ^= Int(scalar.value)
            }
        }
        // Take the modulo (not a bitwise AND) as the final step
        return hash % MODULUS
    }

    /* Rotational hash */
    func rotHash(key: String) -> Int {
        var hash = 0
        let MODULUS = 1_000_000_007
        for c in key {
            for scalar in c.unicodeScalars {
                hash = ((hash << 4) ^ (hash >> 28) ^ Int(scalar.value)) % MODULUS
            }
        }
        return hash
    }
    ```

=== "JS"

    ```javascript title="simple_hash.js"
    /* Additive hash */
    function addHash(key) {
        let hash = 0;
        const MODULUS = 1000000007;
        for (const c of key) {
            hash = (hash + c.charCodeAt(0)) % MODULUS;
        }
        return hash;
    }

    /* Multiplicative hash */
    function mulHash(key) {
        let hash = 0;
        const MODULUS = 1000000007;
        for (const c of key) {
            hash = (31 * hash + c.charCodeAt(0)) % MODULUS;
        }
        return hash;
    }

    /* XOR hash */
    function xorHash(key) {
        let hash = 0;
        const MODULUS = 1000000007;
        for (const c of key) {
            hash ^= c.charCodeAt(0);
        }
        return hash % MODULUS;
    }

    /* Rotational hash */
    function rotHash(key) {
        let hash = 0;
        const MODULUS = 1000000007;
        for (const c of key) {
            hash = ((hash << 4) ^ (hash >> 28) ^ c.charCodeAt(0)) % MODULUS;
        }
        return hash;
    }
    ```

=== "TS"

    ```typescript title="simple_hash.ts"
    /* Additive hash */
    function addHash(key: string): number {
        let hash = 0;
        const MODULUS = 1000000007;
        for (const c of key) {
            hash = (hash + c.charCodeAt(0)) % MODULUS;
        }
        return hash;
    }

    /* Multiplicative hash */
    function mulHash(key: string): number {
        let hash = 0;
        const MODULUS = 1000000007;
        for (const c of key) {
            hash = (31 * hash + c.charCodeAt(0)) % MODULUS;
        }
        return hash;
    }

    /* XOR hash */
    function xorHash(key: string): number {
        let hash = 0;
        const MODULUS = 1000000007;
        for (const c of key) {
            hash ^= c.charCodeAt(0);
        }
        return hash % MODULUS;
    }

    /* Rotational hash */
    function rotHash(key: string): number {
        let hash = 0;
        const MODULUS = 1000000007;
        for (const c of key) {
            hash = ((hash << 4) ^ (hash >> 28) ^ c.charCodeAt(0)) % MODULUS;
        }
        return hash;
    }
    ```

=== "Dart"

    ```dart title="simple_hash.dart"
    /* Additive hash */
    int addHash(String key) {
        int hash = 0;
        final int MODULUS = 1000000007;
        for (int i = 0; i < key.length; i++) {
            hash = (hash + key.codeUnitAt(i)) % MODULUS;
        }
        return hash;
    }

    /* Multiplicative hash */
    int mulHash(String key) {
        int hash = 0;
        final int MODULUS = 1000000007;
        for (int i = 0; i < key.length; i++) {
            hash = (31 * hash + key.codeUnitAt(i)) % MODULUS;
        }
        return hash;
    }

    /* XOR hash */
    int xorHash(String key) {
        int hash = 0;
        final int MODULUS = 1000000007;
        for (int i = 0; i < key.length; i++) {
            hash ^= key.codeUnitAt(i);
        }
        // Take the modulo (not a bitwise AND) as the final step
        return hash % MODULUS;
    }

    /* Rotational hash */
    int rotHash(String key) {
        int hash = 0;
        final int MODULUS = 1000000007;
        for (int i = 0; i < key.length; i++) {
            hash = ((hash << 4) ^ (hash >> 28) ^ key.codeUnitAt(i)) % MODULUS;
        }
        return hash;
    }
    ```

=== "Rust"

    ```rust title="simple_hash.rs"
    /* Additive hash */
    fn add_hash(key: &str) -> i32 {
        let mut hash = 0_i64;
        const MODULUS: i64 = 1000000007;
        for c in key.chars() {
            hash = (hash + c as i64) % MODULUS;
        }
        hash as i32
    }

    /* Multiplicative hash */
    fn mul_hash(key: &str) -> i32 {
        let mut hash = 0_i64;
        const MODULUS: i64 = 1000000007;
        for c in key.chars() {
            hash = (31 * hash + c as i64) % MODULUS;
        }
        hash as i32
    }

    /* XOR hash */
    fn xor_hash(key: &str) -> i32 {
        let mut hash = 0_i64;
        const MODULUS: i64 = 1000000007;
        for c in key.chars() {
            hash ^= c as i64;
        }
        // Take the modulo (not a bitwise AND) as the final step
        (hash % MODULUS) as i32
    }

    /* Rotational hash */
    fn rot_hash(key: &str) -> i32 {
        let mut hash = 0_i64;
        const MODULUS: i64 = 1000000007;
        for c in key.chars() {
            hash = ((hash << 4) ^ (hash >> 28) ^ c as i64) % MODULUS;
        }
        hash as i32
    }
    ```

=== "C"

    ```c title="simple_hash.c"
    /* Additive hash */
    int addHash(char
                *key) {
        long long hash = 0;
        const int MODULUS = 1000000007;
        for (int i = 0; i < strlen(key); i++) {
            hash = (hash + (unsigned char)key[i]) % MODULUS;
        }
        return (int)hash;
    }

    /* Multiplicative hash */
    int mulHash(char *key) {
        long long hash = 0;
        const int MODULUS = 1000000007;
        for (int i = 0; i < strlen(key); i++) {
            hash = (31 * hash + (unsigned char)key[i]) % MODULUS;
        }
        return (int)hash;
    }

    /* XOR hash */
    int xorHash(char *key) {
        int hash = 0;
        const int MODULUS = 1000000007;
        for (int i = 0; i < strlen(key); i++) {
            hash ^= (unsigned char)key[i];
        }
        // Take the modulo (not a bitwise AND) as the final step
        return hash % MODULUS;
    }

    /* Rotational hash */
    int rotHash(char *key) {
        long long hash = 0;
        const int MODULUS = 1000000007;
        for (int i = 0; i < strlen(key); i++) {
            hash = ((hash << 4) ^ (hash >> 28) ^ (unsigned char)key[i]) % MODULUS;
        }
        return (int)hash;
    }
    ```

=== "Kotlin"

    ```kotlin title="simple_hash.kt"
    /* Additive hash */
    fun addHash(key: String): Int {
        var hash = 0L
        val MODULUS = 1000000007
        for (c in key.toCharArray()) {
            hash = (hash + c.code) % MODULUS
        }
        return hash.toInt()
    }

    /* Multiplicative hash */
    fun mulHash(key: String): Int {
        var hash = 0L
        val MODULUS = 1000000007
        for (c in key.toCharArray()) {
            hash = (31 * hash + c.code) % MODULUS
        }
        return hash.toInt()
    }

    /* XOR hash */
    fun xorHash(key: String): Int {
        var hash = 0
        val MODULUS = 1000000007
        for (c in key.toCharArray()) {
            hash = hash xor c.code
        }
        // Take the modulo (not a bitwise AND) as the final step
        return hash % MODULUS
    }

    /* Rotational hash */
    fun rotHash(key: String): Int {
        var hash = 0L
        val MODULUS = 1000000007
        for (c in key.toCharArray()) {
            hash = ((hash shl 4) xor (hash shr 28) xor c.code.toLong()) % MODULUS
        }
        return hash.toInt()
    }
    ```

=== "Ruby"

    ```ruby title="simple_hash.rb"
    ### Additive hash ###
    def add_hash(key)
      hash = 0
      modulus = 1_000_000_007

      key.each_char { |c| hash += c.ord }

      hash % modulus
    end

    ### Multiplicative hash ###
    def mul_hash(key)
      hash = 0
      modulus = 1_000_000_007

      key.each_char { |c| hash = 31 * hash + c.ord }

      hash % modulus
    end

    ### XOR hash ###
    def xor_hash(key)
      hash = 0
      modulus = 1_000_000_007
      key.each_char { |c| hash ^= c.ord }

      hash % modulus
    end

    ### Rotational hash ###
    def rot_hash(key)
      hash = 0
      modulus = 1_000_000_007

      key.each_char { |c| hash = (hash << 4) ^ (hash >> 28) ^ c.ord }

      hash % modulus
    end
    ```

We can observe that the final step of each hash algorithm is to take the result modulo the large prime $1000000007$, ensuring that the hash value stays within a suitable range. This naturally raises a question: why emphasize using a prime modulus, and what are the drawbacks of using a composite modulus?

In short: **using a large prime as the modulus helps maximize the uniformity of hash values**. Because a prime shares no common factor greater than $1$ with any number that is not a multiple of it, it reduces the periodic patterns introduced by the modulo operation and thus mitigates hash collisions.

For example, suppose we choose the composite number $9$ as the modulus. Since $9$ is divisible by $3$, every `key` divisible by $3$ is mapped to one of the hash values $0$, $3$, $6$.

$$
\begin{aligned}
\text{modulus} & = 9 \newline
\text{key} & = \{ 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, \dots \} \newline
\text{hash} & = \{ 0, 3, 6, 0, 3, 6, 0, 3, 6, 0, 3, 6, \dots \}
\end{aligned}
$$

If the input `key` values happen to follow this kind of arithmetic progression, the hash values will cluster, worsening hash collisions. Now suppose we replace `modulus` with the prime number $13$. Because the common difference $3$ of the `key` values and the `modulus` share no common factor, the output hash values become much more evenly distributed.

$$
\begin{aligned}
\text{modulus} & = 13 \newline
\text{key} & = \{ 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, \dots \} \newline
\text{hash} & = \{ 0, 3, 6, 9, 12, 2, 5, 8, 11, 1, 4, 7, \dots \}
\end{aligned}
$$

It is worth noting that if the `key` is guaranteed to be randomly and uniformly distributed, then either a prime or a composite modulus produces uniformly distributed hash values.
However, when the distribution of `key` has some periodicity, a composite modulus is more likely to result in clustering.

In summary, we usually choose a prime number as the modulus, and this prime number should be large enough to eliminate periodic patterns as much as possible, enhancing the robustness of the hash algorithm.

## 6.3.3 Common Hash Algorithms

It is easy to see that the simple hash algorithms introduced above are fairly "fragile" and fall far short of the design goals of hash algorithms. For example, because addition and XOR are commutative, additive hash and XOR hash cannot distinguish strings with the same characters in a different order, which may worsen hash collisions and introduce security risks.

In practice, we usually use some standard hash algorithms, such as MD5, SHA-1, SHA-2, and SHA-3. They can map input data of any length to a fixed-length hash value.

Over the past several decades, hash algorithms have been continuously upgraded and optimized. Some researchers strive to improve the performance of hash algorithms, while others, including hackers, are dedicated to finding security issues in hash algorithms. Table 6-2 shows hash algorithms commonly used in practical applications.

- MD5 and SHA-1 have been successfully attacked multiple times and have therefore been abandoned in various security applications.
- The SHA-2 series, especially SHA-256, is one of the most secure hash algorithms to date, with no successful attacks reported, and is therefore commonly used in various security applications and protocols.
- SHA-3 has lower implementation costs and higher computational efficiency compared to SHA-2, but its current usage coverage is not as extensive as that of the SHA-2 series.

Table 6-2 Common hash algorithms