String Hash Function Design
I. In-Class Exercises
Programming Exercises
II. Knowledge Summary
Core Idea
String hashing maps a string to an integer hash value such that:
- equal strings must have equal hash values
- different strings have a very high probability of having different hash values
With string hashing, two substrings can be compared in $O(1)$ time after $O(n)$ preprocessing. This turns many operations that would require $O(n)$ character-by-character comparison into $O(1)$ operations.
The core method is polynomial rolling hash: view the string as a number in base base, then take it modulo a large prime.
Algorithm Principle
Polynomial Hash
For a string $s = s0 s1 s2 \cdots s{n-1}$, define:
$$H(s) = \left(\sum{i=0}^{n-1} si \cdot base^{n-1-i}\right) \bmod mod$$
where:
- $s_i$ is the numeric value of the character
baseis the base, often 131 or 13331modis the modulus, often a large prime such as $10^9+7$ or $10^9+9$
Prefix Hash and Substring Hash
Precompute a prefix-hash array $H[i]$ representing the hash of s[0..i-1]:
$$H[i] = H[i-1] \times base + s[i-1]$$
Then the hash of substring s[l..r] is:
$$hash(l, r) = H[r+1] - H[l] \times base^{r-l+1}$$
So any substring hash can be obtained in $O(1)$ time.
Double Hashing
A single hash may still collide. To reduce the collision probability to an extremely low level, use two different pairs of (base, mod) at the same time, and regard two strings as equal only if both hashes are equal.
Collision probability:
- single hash: about $\frac{1}{mod} \approx 10^{-9}$
- double hash: about $\frac{1}{mod1 \times mod2} \approx 10^{-18}$
Natural Overflow Hash
Use the natural overflow of unsigned long long, which is equivalent to taking modulo $2^{64}$. This avoids explicit modulus operations and is extremely fast. However, since the modulus is fixed and known, it can be attacked by specially constructed data.
Code Implementation
Basic String Hash
1#include <iostream>
2#include <string>
3using namespace std;
4
5typedef long long ll;
6const ll BASE = 131;
7const ll MOD = 1e9 + 7;
8const int MAXN = 100005;
9
10ll H[MAXN];
11ll pw[MAXN];
12
13void initHash(const string& s) {
14 int n = s.size();
15 pw[0] = 1;
16 H[0] = 0;
17 for (int i = 1; i <= n; i++) {
18 pw[i] = pw[i - 1] * BASE % MOD;
19 H[i] = (H[i - 1] * BASE + s[i - 1]) % MOD;
20 }
21}
22
23ll getHash(int l, int r) {
24 return (H[r + 1] - H[l] * pw[r - l + 1] % MOD + MOD) % MOD;
25}
26
27int main() {
28 string s;
29 cin >> s;
30 initHash(s);
31
32 int l1, r1, l2, r2;
33 cin >> l1 >> r1 >> l2 >> r2;
34 if (getHash(l1, r1) == getHash(l2, r2))
35 cout << "The two substrings are equal" << endl;
36 else
37 cout << "The two substrings are different" << endl;
38
39 return 0;
40}Double Hashing
1#include <iostream>
2#include <string>
3using namespace std;
4
5typedef long long ll;
6typedef pair<ll, ll> pll;
7
8const ll BASE1 = 131, MOD1 = 1e9 + 7;
9const ll BASE2 = 13331, MOD2 = 1e9 + 9;
10const int MAXN = 100005;
11
12ll H1[MAXN], H2[MAXN];
13ll pw1[MAXN], pw2[MAXN];
14
15void initHash(const string& s) {
16 int n = s.size();
17 pw1[0] = pw2[0] = 1;
18 H1[0] = H2[0] = 0;
19 for (int i = 1; i <= n; i++) {
20 pw1[i] = pw1[i - 1] * BASE1 % MOD1;
21 pw2[i] = pw2[i - 1] * BASE2 % MOD2;
22 H1[i] = (H1[i - 1] * BASE1 + s[i - 1]) % MOD1;
23 H2[i] = (H2[i - 1] * BASE2 + s[i - 1]) % MOD2;
24 }
25}
26
27pll getHash(int l, int r) {
28 ll h1 = (H1[r + 1] - H1[l] * pw1[r - l + 1] % MOD1 + MOD1) % MOD1;
29 ll h2 = (H2[r + 1] - H2[l] * pw2[r - l + 1] % MOD2 + MOD2) % MOD2;
30 return {h1, h2};
31}
32
33int main() {
34 string s;
35 cin >> s;
36 initHash(s);
37
38 int q;
39 cin >> q;
40 while (q--) {
41 int l1, r1, l2, r2;
42 cin >> l1 >> r1 >> l2 >> r2;
43 if (getHash(l1, r1) == getHash(l2, r2))
44 cout << "Yes" << endl;
45 else
46 cout << "No" << endl;
47 }
48 return 0;
49}Natural Overflow Hash
1#include <iostream>
2#include <string>
3using namespace std;
4
5typedef unsigned long long ull;
6const ull BASE = 131;
7const int MAXN = 100005;
8
9ull H[MAXN], pw[MAXN];
10
11void initHash(const string& s) {
12 int n = s.size();
13 pw[0] = 1;
14 H[0] = 0;
15 for (int i = 1; i <= n; i++) {
16 pw[i] = pw[i - 1] * BASE;
17 H[i] = H[i - 1] * BASE + s[i - 1];
18 }
19}
20
21ull getHash(int l, int r) {
22 return H[r + 1] - H[l] * pw[r - l + 1];
23}Application: Finding Repeated Substrings
1#include <iostream>
2#include <string>
3#include <unordered_set>
4using namespace std;
5
6typedef long long ll;
7const ll BASE = 131, MOD = 1e9 + 7;
8const int MAXN = 100005;
9
10ll H[MAXN], pw[MAXN];
11
12void initHash(const string& s) {
13 int n = s.size();
14 pw[0] = 1; H[0] = 0;
15 for (int i = 1; i <= n; i++) {
16 pw[i] = pw[i - 1] * BASE % MOD;
17 H[i] = (H[i - 1] * BASE + s[i - 1]) % MOD;
18 }
19}
20
21ll getHash(int l, int r) {
22 return (H[r + 1] - H[l] * pw[r - l + 1] % MOD + MOD) % MOD;
23}
24
25int longestDupSubstring(const string& s) {
26 int n = s.size();
27 initHash(s);
28
29 int lo = 1, hi = n - 1, ans = 0;
30 while (lo <= hi) {
31 int mid = (lo + hi) / 2;
32 unordered_set<ll> seen;
33 bool found = false;
34 for (int i = 0; i + mid - 1 < n; i++) {
35 ll h = getHash(i, i + mid - 1);
36 if (seen.count(h)) {
37 found = true;
38 break;
39 }
40 seen.insert(h);
41 }
42
43 if (found) {
44 ans = mid;
45 lo = mid + 1;
46 } else {
47 hi = mid - 1;
48 }
49 }
50 return ans;
51}
52
53int main() {
54 string s;
55 cin >> s;
56 cout << "Length of the longest repeated substring: " << longestDupSubstring(s) << endl;
57 return 0;
58}Execution Example
Take s = "abcabc" as an example, with base = 131 and mod = 10^9+7:
1Character values: a=97, b=98, c=99
2
3Prefix hash computation:
4 H[0] = 0
5 H[1] = 0 × 131 + 97 = 97
6 H[2] = 97 × 131 + 98 = 12805
7 H[3] = 12805 × 131 + 99 = 1677554
8 H[4] = 1677554 × 131 + 97 = 219859671
9 H[5] = 219859671 × 131 + 98 = 28801616909 mod (10^9+7) = ...
10 H[6] = ...
11
12Substring hashes:
13 hash("abc", [0,2]) = H[3] - H[0] × 131^3 = 1677554
14 hash("abc", [3,5]) = H[6] - H[3] × 131^3
15
16 They are equal -> conclude s[0..2] == s[3..5] ✓Detailed Problem-Solving Steps
Typical Use Cases
- Substring equality: compare two substrings in $O(1)$
- String matching: Rabin-Karp algorithm for $O(n+m)$ matching
- Longest common substring: binary search + hashing
- Palindrome checking: compare forward hash and reverse hash
- String deduplication / counting: store hashes instead of whole strings
Parameter Selection Guide
| Parameter | Recommended Values | Explanation |
|---|---|---|
base | 131, 13331 | Prime and larger than the alphabet size |
mod | $10^9+7$, $10^9+9$ | Large primes |
| Character mapping | s[i] - 'a' + 1 or ASCII | Avoid mapping characters to 0 |
Important Note: Characters Must Not Map to 0
If a character maps to 0, then prefixes like "a" and "aa" may collide because of leading-zero effects. A common practice is to map 'a' to 1 instead of 0.
Rabin-Karp String Matching
- Compute the hash of the pattern string
- Slide a window over the text and compute the hash of each substring of the same length
- If the hashes are equal, treat it as a match, and optionally verify character-by-character to avoid collisions
Common Mistakes
- Choosing
basetoo small: for examplebase = 26causes more collisions; usuallybase >= 131is safer - Mapping characters to 0:
'a' - 'a' = 0causes leading-zero issues - Negative values after subtraction:
H[r+1] - H[l] * pw[...]may become negative, so addMOD - Forgetting to precompute the powers array: this leads to wrong answers or runtime errors
- Array size too small:
Handpwshould have sizen+1 - Single hashing being attacked: on some platforms, double hashing or randomized bases are safer
- Natural-overflow hashing being attacked: modulus $2^{64}$ is fixed and can be targeted by crafted test data
Evaluation
Time Complexity
| Operation | Time |
|---|---|
| Preprocessing | $O(n)$ |
| Query substring hash | $O(1)$ |
| Rabin-Karp matching | $O(n + m)$ |
| Longest repeated substring via binary search + hashing | $O(n \log n)$ |
Space Complexity
$O(n)$ for storing prefix hashes and powers.
Comparison with Other String Algorithms
| Algorithm | Typical Use | Preprocessing | Query |
|---|---|---|---|
| String hashing | Substring equality | $O(n)$ | $O(1)$ |
| KMP | Pattern matching | $O(m)$ | $O(n)$ |
| Suffix array | Various substring problems | $O(n \log n)$ | $O(\log n)$ |
| Trie | Prefix queries | $O(\sum | s_i |
The strength of string hashing is that it is easy to implement and broadly applicable. It is one of the most commonly used string tools in programming contests.