Recognizing words (the smallest unit above letters)

In lexical analysis, we divide the program text into words, or tokens.

Units

  • Keywords
  • Variables
  • Operators
  • Constants
  • Punctuation
  • Whitespace

The output of lexical analysis is a stream of tokens.
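
For example (using purely illustrative token-class names), the statement count = count + 1; might be emitted as the stream:

    ID("count")  OP("=")  ID("count")  OP("+")  CONST("1")  PUNCT(";")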

Designing

  • We need to define a finite set of token classes
  • Describe which strings belong to each class (see the sketch after this list)
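
One common way to describe which strings belong to each class is a regular expression per token class. Below is a minimal sketch in Python; the class names and patterns are illustrative assumptions, not part of any particular language definition:

    import re

    # One entry per token class, pairing its name with a regular expression
    # that describes exactly which strings belong to it.
    TOKEN_SPEC = [
        ("KEYWORD", r"\b(?:if|else|while|return)\b"),
        ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),   # variables / identifiers
        ("CONST",   r"\d+"),                      # integer constants
        ("OP",      r"==|=|\+|-|\*|/"),           # longer alternatives listed first
        ("PUNCT",   r"[();{}]"),
        ("WS",      r"\s+"),                      # whitespace, discarded later
    ]

    # Combine everything into one master pattern with named groups.
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))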

Implementation

  • Needs to recognize the substrings corresponding to tokens
  • Returns the value or the lexeme (the matched substring) of each token
  • Usually discards tokens that don’t contribute to parsing, such as comments and whitespace (a sketch of such a scanner follows this list)
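
A minimal left-to-right scanner over the TOKEN_SPEC/MASTER pattern sketched above (again only a sketch): it yields (token class, lexeme) pairs and silently drops whitespace; comments could be filtered out the same way.

    def tokenize(text):
        """Yield (token_class, lexeme) pairs, discarding whitespace."""
        pos = 0
        while pos < len(text):
            match = MASTER.match(text, pos)
            if match is None:
                raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
            kind, lexeme = match.lastgroup, match.group()
            pos = match.end()
            if kind == "WS":        # does not contribute to parsing
                continue
            yield kind, lexeme

For example:

    >>> list(tokenize("count = count + 1;"))
    [('ID', 'count'), ('OP', '='), ('ID', 'count'),
     ('OP', '+'), ('CONST', '1'), ('PUNCT', ';')]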

However, discarding whitespace is not always harmless: in FORTRAN, blanks are insignificant, so VAR1 and VA R1 refer to the same identifier, which is poor language design.

  • The goal is to partition the string, implemented by a left-to-right reading of the input
  • Lookahead may be required to decide the boundaries between tokens
    • Ex. = vs == or unreserved keywords (see the lookahead sketch after this list)
    • Some other ambiguities include:
      • Foo<Bar>
      • cin >> var;
      • Foo<Bar<Bazz>>
    • Here the tokens <, > and >> are used in different ways in C++ (comparison and stream/shift operators vs. template brackets), so lookahead is sometimes required
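
For the = vs == case, a hand-written scanner typically peeks one character ahead before fixing the token boundary and prefers the longer match (maximal munch). A small sketch of that decision in Python (the helper name scan_operator is hypothetical):

    def scan_operator(text, pos):
        """Decide between '=' and '==' using one character of lookahead."""
        if text[pos] != "=":
            raise SyntaxError(f"expected '=' at position {pos}")
        # Peek at the next character before committing to a token boundary.
        if pos + 1 < len(text) and text[pos + 1] == "=":
            return ("OP", "=="), pos + 2   # prefer the longer token (maximal munch)
        return ("OP", "="), pos + 1

The C++ >> cases are harder because the right reading depends on parsing context (nested template argument lists vs. a shift/stream operator), so lookahead in the lexer alone is not always enough.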

Lexer

  • Inspects a string
  • Consumes that string
  • Inspects the next token to see whether it is an ID
  • Consumes the ID
  • Inspects the next token to see whether it is EOF
  • Consumes the EOF token (a sketch of this interface follows)
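
One way to expose this inspect/consume pattern is a thin wrapper around the token stream with peek() and consume() methods; the sketch below reuses the tokenize() function from above, and the interface is an assumption rather than a standard API:

    class Lexer:
        """Wraps the token stream with inspect (peek) and consume operations."""

        def __init__(self, text):
            # Materialize the tokens and append an explicit EOF token
            # so the caller can inspect and consume it like any other.
            self.tokens = list(tokenize(text)) + [("EOF", "")]
            self.index = 0

        def peek(self):
            """Inspect the next token without consuming it."""
            return self.tokens[self.index]

        def consume(self, expected_kind):
            """Consume the next token, checking its class."""
            kind, lexeme = self.tokens[self.index]
            if kind != expected_kind:
                raise SyntaxError(f"expected {expected_kind}, got {kind} ({lexeme!r})")
            self.index += 1
            return lexeme

Used in the way the steps above describe:

    lexer = Lexer("x")
    if lexer.peek()[0] == "ID":   # inspect: is the next token an ID?
        lexer.consume("ID")       # consume the ID
    if lexer.peek()[0] == "EOF":  # inspect: is the next token EOF?
        lexer.consume("EOF")      # consume the EOF token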