Recognizing words (the smallest unit above letters)

In lexical analysis, we divide the program text into words, or tokens.

Units

  • Keywords
  • Variables
  • Operators
  • Constants
  • Punctuation
  • Whitespace

The output of lexical analysis is a stream of tokens.
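
For example (using purely illustrative token-class names), the statement count = count + 1; might be emitted as the stream:

    ID("count")  OP("=")  ID("count")  OP("+")  CONST("1")  PUNCT(";")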

Designing

  • We need to define a finite set of token classes
  • Describe which strings belong to each class (see the sketch after this list)
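
One common way to describe which strings belong to each class is a regular expression per token class. Below is a minimal sketch in Python; the class names and patterns are illustrative assumptions, not part of any particular language definition:

    import re

    # One entry per token class, pairing its name with a regular expression
    # that describes exactly which strings belong to it.
    TOKEN_SPEC = [
        ("KEYWORD", r"\b(?:if|else|while|return)\b"),
        ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),   # variables / identifiers
        ("CONST",   r"\d+"),                      # integer constants
        ("OP",      r"==|=|\+|-|\*|/"),           # longer alternatives listed first
        ("PUNCT",   r"[();{}]"),
        ("WS",      r"\s+"),                      # whitespace, discarded later
    ]

    # Combine everything into one master pattern with named groups.
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))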

Implementation

  • Needs to recognize the substrings corresponding to tokens
  • Returns the value or the lexeme (the matched substring) of each token
  • Usually discards tokens that don’t contribute to parsing, such as comments and whitespace (a sketch of such a scanner follows this list)
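
A minimal left-to-right scanner over the TOKEN_SPEC/MASTER pattern sketched above (again only a sketch): it yields (token class, lexeme) pairs and silently drops whitespace; comments could be filtered out the same way.

    def tokenize(text):
        """Yield (token_class, lexeme) pairs, discarding whitespace."""
        pos = 0
        while pos < len(text):
            match = MASTER.match(text, pos)
            if match is None:
                raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
            kind, lexeme = match.lastgroup, match.group()
            pos = match.end()
            if kind == "WS":        # does not contribute to parsing
                continue
            yield kind, lexeme

For example:

    >>> list(tokenize("count = count + 1;"))
    [('ID', 'count'), ('OP', '='), ('ID', 'count'),
     ('OP', '+'), ('CONST', '1'), ('PUNCT', ';')]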

However, discarding whitespace is not always harmless: in FORTRAN, blanks are insignificant, so VAR1 and VA R1 refer to the same identifier, which is poor language design.

  • The goal is to partition the string, implemented by a left-to-right reading of the input
  • Lookahead may be required to decide the boundaries between tokens
    • Ex. = vs == or unreserved keywords (see the lookahead sketch after this list)
    • Some other ambiguities include:
      • Foo<Bar>
      • cin >> var;
      • Foo<Bar<Bazz>>
    • Here the tokens <, > and >> are used in different ways in C++ (comparison and stream/shift operators vs. template brackets), so lookahead is sometimes required
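
For the = vs == case, a hand-written scanner typically peeks one character ahead before fixing the token boundary and prefers the longer match (maximal munch). A small sketch of that decision in Python (the helper name scan_operator is hypothetical):

    def scan_operator(text, pos):
        """Decide between '=' and '==' using one character of lookahead."""
        if text[pos] != "=":
            raise SyntaxError(f"expected '=' at position {pos}")
        # Peek at the next character before committing to a token boundary.
        if pos + 1 < len(text) and text[pos + 1] == "=":
            return ("OP", "=="), pos + 2   # prefer the longer token (maximal munch)
        return ("OP", "="), pos + 1

The C++ >> cases are harder because the right reading depends on parsing context (nested template argument lists vs. a shift/stream operator), so lookahead in the lexer alone is not always enough.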

Lexer

  • Inspects a string
  • Consumes that string
  • Inspects the next token to see whether it is an ID
  • Consumes the ID
  • Inspects the next token to see whether it is EOF
  • Consumes the EOF token (a sketch of this interface follows)
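
One way to expose this inspect/consume pattern is a thin wrapper around the token stream with peek() and consume() methods; the sketch below reuses the tokenize() function from above, and the interface is an assumption rather than a standard API:

    class Lexer:
        """Wraps the token stream with inspect (peek) and consume operations."""

        def __init__(self, text):
            # Materialize the tokens and append an explicit EOF token
            # so the caller can inspect and consume it like any other.
            self.tokens = list(tokenize(text)) + [("EOF", "")]
            self.index = 0

        def peek(self):
            """Inspect the next token without consuming it."""
            return self.tokens[self.index]

        def consume(self, expected_kind):
            """Consume the next token, checking its class."""
            kind, lexeme = self.tokens[self.index]
            if kind != expected_kind:
                raise SyntaxError(f"expected {expected_kind}, got {kind} ({lexeme!r})")
            self.index += 1
            return lexeme

Used in the way the steps above describe:

    lexer = Lexer("x")
    if lexer.peek()[0] == "ID":   # inspect: is the next token an ID?
        lexer.consume("ID")       # consume the ID
    if lexer.peek()[0] == "EOF":  # inspect: is the next token EOF?
        lexer.consume("EOF")      # consume the EOF token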