This lesson has accompanying video explanation, available after purchase (permanent access)

String Hash Function Design

I. In-Class Exercises

Programming Exercises

String Hash Template: L49951
Cow Echoes: L49952

II. Knowledge Summary

Core Idea

String hashing maps a string to an integer hash value such that:

equal strings must have equal hash values
different strings have a very high probability of having different hash values

With string hashing, two substrings can be compared in $O(1)$ time after $O(n)$ preprocessing. This turns many operations that would require $O(n)$ character-by-character comparison into $O(1)$ operations.

The core method is polynomial rolling hash: view the string as a number in base base, then take it modulo a large prime.

Algorithm Principle

Polynomial Hash

For a string $s = s0 s1 s2 \cdots s{n-1}$, define:

$$H(s) = \left(\sum{i=0}^{n-1} si \cdot base^{n-1-i}\right) \bmod mod$$

where:

$s_i$ is the numeric value of the character
base is the base, often 131 or 13331
mod is the modulus, often a large prime such as $10^9+7$ or $10^9+9$

Prefix Hash and Substring Hash

Precompute a prefix-hash array $H[i]$ representing the hash of s[0..i-1]:

$$H[i] = H[i-1] \times base + s[i-1]$$

Then the hash of substring s[l..r] is:

$$hash(l, r) = H[r+1] - H[l] \times base^{r-l+1}$$

So any substring hash can be obtained in $O(1)$ time.

Double Hashing

A single hash may still collide. To reduce the collision probability to an extremely low level, use two different pairs of (base, mod) at the same time, and regard two strings as equal only if both hashes are equal.

Collision probability:

single hash: about $\frac{1}{mod} \approx 10^{-9}$
double hash: about $\frac{1}{mod1 \times mod2} \approx 10^{-18}$

Natural Overflow Hash

Use the natural overflow of unsigned long long, which is equivalent to taking modulo $2^{64}$. This avoids explicit modulus operations and is extremely fast. However, since the modulus is fixed and known, it can be attacked by specially constructed data.

Code Implementation

Basic String Hash

string_hash_basic.cpp

1#include <iostream>
2#include <string>
3using namespace std;
4
5typedef long long ll;
6const ll BASE = 131;
7const ll MOD = 1e9 + 7;
8const int MAXN = 100005;
9
10ll H[MAXN];
11ll pw[MAXN];
12
13void initHash(const string& s) {
14    int n = s.size();
15    pw[0] = 1;
16    H[0] = 0;
17    for (int i = 1; i <= n; i++) {
18        pw[i] = pw[i - 1] * BASE % MOD;
19        H[i] = (H[i - 1] * BASE + s[i - 1]) % MOD;
20    }
21}
22
23ll getHash(int l, int r) {
24    return (H[r + 1] - H[l] * pw[r - l + 1] % MOD + MOD) % MOD;
25}
26
27int main() {
28    string s;
29    cin >> s;
30    initHash(s);
31
32    int l1, r1, l2, r2;
33    cin >> l1 >> r1 >> l2 >> r2;
34    if (getHash(l1, r1) == getHash(l2, r2))
35        cout << "The two substrings are equal" << endl;
36    else
37        cout << "The two substrings are different" << endl;
38
39    return 0;
40}

Double Hashing

string_hash_double.cpp

1#include <iostream>
2#include <string>
3using namespace std;
4
5typedef long long ll;
6typedef pair<ll, ll> pll;
7
8const ll BASE1 = 131, MOD1 = 1e9 + 7;
9const ll BASE2 = 13331, MOD2 = 1e9 + 9;
10const int MAXN = 100005;
11
12ll H1[MAXN], H2[MAXN];
13ll pw1[MAXN], pw2[MAXN];
14
15void initHash(const string& s) {
16    int n = s.size();
17    pw1[0] = pw2[0] = 1;
18    H1[0] = H2[0] = 0;
19    for (int i = 1; i <= n; i++) {
20        pw1[i] = pw1[i - 1] * BASE1 % MOD1;
21        pw2[i] = pw2[i - 1] * BASE2 % MOD2;
22        H1[i] = (H1[i - 1] * BASE1 + s[i - 1]) % MOD1;
23        H2[i] = (H2[i - 1] * BASE2 + s[i - 1]) % MOD2;
24    }
25}
26
27pll getHash(int l, int r) {
28    ll h1 = (H1[r + 1] - H1[l] * pw1[r - l + 1] % MOD1 + MOD1) % MOD1;
29    ll h2 = (H2[r + 1] - H2[l] * pw2[r - l + 1] % MOD2 + MOD2) % MOD2;
30    return {h1, h2};
31}
32
33int main() {
34    string s;
35    cin >> s;
36    initHash(s);
37
38    int q;
39    cin >> q;
40    while (q--) {
41        int l1, r1, l2, r2;
42        cin >> l1 >> r1 >> l2 >> r2;
43        if (getHash(l1, r1) == getHash(l2, r2))
44            cout << "Yes" << endl;
45        else
46            cout << "No" << endl;
47    }
48    return 0;
49}

Natural Overflow Hash

string_hash_overflow.cpp

1#include <iostream>
2#include <string>
3using namespace std;
4
5typedef unsigned long long ull;
6const ull BASE = 131;
7const int MAXN = 100005;
8
9ull H[MAXN], pw[MAXN];
10
11void initHash(const string& s) {
12    int n = s.size();
13    pw[0] = 1;
14    H[0] = 0;
15    for (int i = 1; i <= n; i++) {
16        pw[i] = pw[i - 1] * BASE;
17        H[i] = H[i - 1] * BASE + s[i - 1];
18    }
19}
20
21ull getHash(int l, int r) {
22    return H[r + 1] - H[l] * pw[r - l + 1];
23}

Application: Finding Repeated Substrings

string_hash_application.cpp

1#include <iostream>
2#include <string>
3#include <unordered_set>
4using namespace std;
5
6typedef long long ll;
7const ll BASE = 131, MOD = 1e9 + 7;
8const int MAXN = 100005;
9
10ll H[MAXN], pw[MAXN];
11
12void initHash(const string& s) {
13    int n = s.size();
14    pw[0] = 1; H[0] = 0;
15    for (int i = 1; i <= n; i++) {
16        pw[i] = pw[i - 1] * BASE % MOD;
17        H[i] = (H[i - 1] * BASE + s[i - 1]) % MOD;
18    }
19}
20
21ll getHash(int l, int r) {
22    return (H[r + 1] - H[l] * pw[r - l + 1] % MOD + MOD) % MOD;
23}
24
25int longestDupSubstring(const string& s) {
26    int n = s.size();
27    initHash(s);
28
29    int lo = 1, hi = n - 1, ans = 0;
30    while (lo <= hi) {
31        int mid = (lo + hi) / 2;
32        unordered_set<ll> seen;
33        bool found = false;
34        for (int i = 0; i + mid - 1 < n; i++) {
35            ll h = getHash(i, i + mid - 1);
36            if (seen.count(h)) {
37                found = true;
38                break;
39            }
40            seen.insert(h);
41        }
42
43        if (found) {
44            ans = mid;
45            lo = mid + 1;
46        } else {
47            hi = mid - 1;
48        }
49    }
50    return ans;
51}
52
53int main() {
54    string s;
55    cin >> s;
56    cout << "Length of the longest repeated substring: " << longestDupSubstring(s) << endl;
57    return 0;
58}

Execution Example

Take s = "abcabc" as an example, with base = 131 and mod = 10^9+7:

1Character values: a=97, b=98, c=99
2
3Prefix hash computation:
4  H[0] = 0
5  H[1] = 0 × 131 + 97 = 97
6  H[2] = 97 × 131 + 98 = 12805
7  H[3] = 12805 × 131 + 99 = 1677554
8  H[4] = 1677554 × 131 + 97 = 219859671
9  H[5] = 219859671 × 131 + 98 = 28801616909 mod (10^9+7) = ...
10  H[6] = ...
11
12Substring hashes:
13  hash("abc", [0,2]) = H[3] - H[0] × 131^3 = 1677554
14  hash("abc", [3,5]) = H[6] - H[3] × 131^3
15
16  They are equal -> conclude s[0..2] == s[3..5] ✓

Detailed Problem-Solving Steps

Typical Use Cases

Substring equality: compare two substrings in $O(1)$
String matching: Rabin-Karp algorithm for $O(n+m)$ matching
Longest common substring: binary search + hashing
Palindrome checking: compare forward hash and reverse hash
String deduplication / counting: store hashes instead of whole strings

Parameter Selection Guide

Parameter	Recommended Values	Explanation
`base`	131, 13331	Prime and larger than the alphabet size
`mod`	$10^9+7$, $10^9+9$	Large primes
Character mapping	`s[i] - 'a' + 1` or ASCII	Avoid mapping characters to 0

Important Note: Characters Must Not Map to 0

If a character maps to 0, then prefixes like "a" and "aa" may collide because of leading-zero effects. A common practice is to map 'a' to 1 instead of 0.

Rabin-Karp String Matching

Compute the hash of the pattern string
Slide a window over the text and compute the hash of each substring of the same length
If the hashes are equal, treat it as a match, and optionally verify character-by-character to avoid collisions

Common Mistakes

Choosing base too small: for example base = 26 causes more collisions; usually base >= 131 is safer
Mapping characters to 0: 'a' - 'a' = 0 causes leading-zero issues
Negative values after subtraction: H[r+1] - H[l] * pw[...] may become negative, so add MOD
Forgetting to precompute the powers array: this leads to wrong answers or runtime errors
Array size too small: H and pw should have size n+1
Single hashing being attacked: on some platforms, double hashing or randomized bases are safer
Natural-overflow hashing being attacked: modulus $2^{64}$ is fixed and can be targeted by crafted test data

Evaluation

Time Complexity

Operation	Time
Preprocessing	$O(n)$
Query substring hash	$O(1)$
Rabin-Karp matching	$O(n + m)$
Longest repeated substring via binary search + hashing	$O(n \log n)$

Space Complexity

$O(n)$ for storing prefix hashes and powers.

Comparison with Other String Algorithms

Algorithm	Typical Use	Preprocessing	Query
String hashing	Substring equality	$O(n)$	$O(1)$
KMP	Pattern matching	$O(m)$	$O(n)$
Suffix array	Various substring problems	$O(n \log n)$	$O(\log n)$
Trie	Prefix queries	$O(\sum	s_i

The strength of string hashing is that it is easy to implement and broadly applicable. It is one of the most commonly used string tools in programming contests.