Skip to main content

String Manipulation

TL;DR

Character frequency counting with hash maps enables anagram detection and character-based operations in O(n) time. String validation checks format against rules systematically. Pattern matching locates substrings efficiently. String transformation rearranges or modifies content. Two pointers work on strings like arrays. Understanding ASCII values, character encoding, and efficient string operations is crucial for string algorithm problems.

Core Concepts

String Parsing and Validation

Parsing breaks strings into components following specific rules. Example: parsing "123-456-7890" into phone number components.

Validation checks if string satisfies constraints:

  • Valid parentheses: Balanced and properly nested
  • Valid number: Correct format (digits, decimal point, sign)
  • Valid email: Username, @, domain, extension

Validation typically uses character-by-character state checking or regular expressions.

Anagram Detection

Two strings are anagrams if they contain same characters with same frequencies, possibly rearranged.

Approach 1 - Sorting: Sort both strings, compare. O(n log n) time.

Approach 2 - Character frequency: Count character frequencies in both strings, compare counts. O(n) time, O(1) space (26 letters max).

Character frequency hash map: Tracks count of each character. Increment for first string, decrement for second. All zero at end = anagrams.

Pattern Matching

Substring search: Locate exact pattern in text.

  • Naive: Check every position O(nm) where n is text length, m is pattern length
  • KMP (Knuth-Morris-Pratt): Use failure function to avoid redundant comparisons O(n+m)
  • Boyer-Moore: Skip characters based on pattern O(n/m) average case

Regular expressions: Flexible pattern matching using regex syntax. More powerful but slower than optimized string algorithms.

String Transformation

Reversal: Reverse entire string or parts. Two pointers from ends, swap inward.

Case conversion: Transform upper/lower case.

Character replacement: Substitute characters matching condition.

String compression: Encode repeated sequences: "aaabba" becomes "a3b2a1".

Character Frequency Operations

Counting characters: Hash map stores count of each character.

Unique characters: Set stores unique characters, size is unique count.

Most frequent character: Track max count while iterating.

Character sorting: Sort string by character frequency or alphabetically.

Key Algorithms and Techniques

Anagram Detection

Hash map approach: count characters in both strings, compare. O(n) time, O(1) space.

Group Anagrams

Hash map where key is sorted characters, value is list of anagrams. O(nk log k) where k is average string length.

Longest Palindrome

Character frequency approach: if character count is even, use all. If odd, use all but one. One odd character can be in middle.

Valid Parentheses

Stack tracks open parentheses. For each close, check matches top of stack. O(n) time, O(n) space.

String Compression

Iterate tracking consecutive characters, build compressed string. O(n) time, O(n) space.

Practical Example

# Anagram detection
def is_anagram(s1, s2):
if len(s1) != len(s2):
return False
char_count = {}
for char in s1:
char_count[char] = char_count.get(char, 0) + 1
for char in s2:
if char not in char_count:
return False
char_count[char] -= 1
if char_count[char] < 0:
return False
return all(count == 0 for count in char_count.values())

# Group anagrams
def group_anagrams(strings):
anagram_map = {}
for s in strings:
key = ''.join(sorted(s))
if key not in anagram_map:
anagram_map[key] = []
anagram_map[key].append(s)
return list(anagram_map.values())

# Valid parentheses
def is_valid_parentheses(s):
stack = []
pairs = {'(': ')', '[': ']', '{': '}'}
for char in s:
if char in pairs:
stack.append(char)
elif char in pairs.values():
if not stack or pairs[stack.pop()] != char:
return False
return len(stack) == 0

# String compression
def compress_string(s):
if not s:
return ""
compressed = []
count = 1
for i in range(1, len(s)):
if s[i] == s[i - 1]:
count += 1
else:
compressed.append(s[i - 1] + str(count))
count = 1
compressed.append(s[-1] + str(count))
return ''.join(compressed)

# Valid number
def is_valid_number(s):
s = s.strip()
if not s or (s[0] not in '+-0123456789'):
return False
has_dot = has_e = False
for i, char in enumerate(s):
if char in '+-':
if i != 0:
return False
elif char == '.':
if has_dot or has_e:
return False
has_dot = True
elif char == 'e' or char == 'E':
if has_e or i == 0 or i == len(s) - 1:
return False
has_e = True
elif not char.isdigit():
return False
return True

# Test cases
print(is_anagram("listen", "silent")) # True
print(group_anagrams(["eat", "tea", "ate", "bat", "tab"]))
print(is_valid_parentheses("()[]{()}")) # True
print(compress_string("aaabbacc")) # a3b2a1c2
print(is_valid_number("3.14")) # True

Common Pitfalls

warning

Not handling empty strings: Edge case of empty input breaks many algorithms. Always check length before processing.

Incorrect anagram check: Counting characters in only one direction misses problems. Count both strings and compare counts, or verify all counts zero at end.

Parentheses validation errors: Forgetting to check stack not empty before popping causes crashes. Check both that pair types match.

Not normalizing input: Case sensitivity matters. Normalize to lowercase/uppercase before comparison unless case matters for problem.

Index out of bounds: String operations with indices need careful bounds checking. Off-by-one errors are common.

Inefficient repeated concatenation: String concatenation in loop is O(n²) in many languages. Use StringBuilder/array and join instead.

Self-Check

  1. How does character frequency hash map detect anagrams? Why is it O(n) instead of O(n log n)?
  2. Explain the valid parentheses algorithm. Why must you check stack isn't empty before popping?
  3. How would you find longest substring without repeating characters? What data structure tracks current window?

One Takeaway

info

String problems often boil down to character frequency counting and careful index management. Mastering hash maps for character tracking, two pointers for traversal, and stack for balanced delimiters solves most string algorithm problems efficiently. Always consider edge cases like empty strings, special characters, and case sensitivity.

References

  • Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press.
  • McDowell, G. L. (2015). Cracking the Coding Interview (6th ed.). CareerCup.