Knowledge

Lexer hack

Source 📝

234:
library to determine the nature of the identifier. This allows a simpler and more maintainable architecture than The Lexer Hack. This is also the approach used in most other modern languages, which do not distinguish different classes of identifiers in the lexical grammar, but instead defer them to
100:, such as words, numbers, and strings. The parser analyzes sequences of tokens attempting to match them to syntax rules representing language structures, such as loops and variable declarations. A problem occurs here if a single sequence of tokens can ambiguously match more than one syntax rule. 229:
parser handles the situation in a completely different way, namely by using a non-reference lexical grammar. Clang's lexer does not attempt to differentiate between type names and variable names: it simply reports the current token as an identifier. The parser then uses Clang's
222:-derived BtYacc ("Backtracking Yacc"), give the generated parser the ability to try multiple attempts to parse the tokens. In the problem described here, if an attempt fails because of semantic information about the identifier, it can backtrack and attempt other rules. 49:
Lexer hack is frowned upon in modern compilers as it creates tight coupling between otherwise largely independent steps in the compilation process. Instead, identifier-like tokens are tokenized as identifiers and later disambiguated the parser, allowing cleaner
187:
Without added context, the lexer cannot distinguish type identifiers from other identifiers because all identifiers have the same format. With the hack in the above example, when the lexer finds the identifier
370: 179:
from the lexer to the parser, there is a backchannel from semantic analysis back to the lexer. This mixing of parsing and semantic analysis is generally regarded as inelegant, which is why it is called a
192:
it should be able to classify the token as a type identifier. The rules of the language would be clarified by specifying that typecasts require a type identifier and the ambiguity disappears.
46:, where classifying a sequence of characters as a variable name or a type name requires contextual information, by feeding contextual information backwards from the parser to the lexer. 371:
http://groups.google.com/group/comp.compilers/browse_frm/thread/db7f68e9d8b49002/fa20bf5de9c73472?lnk=st&q=%2B%22the+lexer+hack%22&rnum=1&hl=en#fa20bf5de9c73472
159:
so the parser does not have to handle an ambiguous parse. This gramatical ambiguity is known as the "typedef-name: identifier" problem, due to the name of the problematic
366: 211:
techniques, as these are intrinsically contextual. These are generally seen as less elegant designs, however, because they lack the modularity of having a
273: 385: 329: 231: 160: 62:
The fundamental problem is differentiating types from other identifiers. In the following example, the lexical class of
356: 299: 39: 181: 346: 212: 51: 28: 277: 176: 249: 208: 395: 390: 103:
This ambiguity can happen in C if the lexer does not distinguish between variable and
379: 244: 219: 207:
This problem does not arise (and hence needs no "hack" in order to solve) when using
172: 235:
the parsing or semantic analysis phase, when sufficient information is available.
85:
This code could either be multiplication or a declaration, depending on context.
357:
http://cens.ioc.ee/local/man/CompaqCompilers/ladebug/ladebug-manual-details.html
93: 361: 17: 351: 318:
Based on Berkeley yacc with modifications by Chris Dodd and Vadim Maslov.
171:
The solution generally consists of feeding information from the semantic
89: 347:
http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elkhound/index.html
175:
back into the lexer. That is, rather than functioning as a pure one-way
104: 36: 43: 196: 226: 92:, the lexer performs one of the earliest stages of converting the 330:"How Clang handles the type / variable name ambiguity of C/C++" 313: 66:
cannot be determined without further contextual information:
274:"A YACC-able C++ 2.1 GRAMMAR, AND THE RESULTING AMBIGUITIES" 147:
is a typedef name or not it may be desirable to tokenize
96:
to a program. It scans the text to extract meaningful
107:identifiers. For example, in the C expression: 8: 267: 265: 352:http://cs.nyu.edu/rgrimm/papers/pldi06.pdf 300:"The context sensitivity of C's grammar" 261: 7: 218:Some parser generators, such as the 199:and parsers can use the same hack. 25: 126:the lexer may find these tokens: 272:Roskind, James A. (1991-07-11). 215:lexer and parser in a pipeline. 1: 298:Bendersky, Eli (2007-11-24). 195:The problem also exists in 412: 386:C (programming language) 109: 68: 52:separation of concerns 203:Alternative solutions 143:Depending on whether 88:In more detail, in a 29:computer programming 250:Most vexing parse 232:semantic analysis 209:lexerless parsing 167:The hack solution 42:grammars such as 40:context-sensitive 35:is a solution to 16:(Redirected from 403: 334: 333: 328:Bendersky, Eli. 325: 319: 317: 310: 304: 303: 295: 289: 288: 286: 285: 276:. Archived from 269: 150: 146: 122: 119: 116: 113: 81: 78: 75: 72: 65: 21: 411: 410: 406: 405: 404: 402: 401: 400: 376: 375: 343: 338: 337: 327: 326: 322: 312: 311: 307: 297: 296: 292: 283: 281: 271: 270: 263: 258: 241: 205: 169: 161:production rule 148: 144: 139:punctuation ';' 124: 123: 120: 117: 114: 111: 83: 82: 79: 76: 73: 70: 63: 60: 23: 22: 15: 12: 11: 5: 409: 407: 399: 398: 393: 388: 378: 377: 374: 373: 368: 364: 359: 354: 349: 342: 339: 336: 335: 320: 305: 290: 260: 259: 257: 254: 253: 252: 247: 240: 237: 204: 201: 168: 165: 141: 140: 137: 136:identifier 'B' 134: 131: 110: 69: 59: 56: 24: 18:The lexer hack 14: 13: 10: 9: 6: 4: 3: 2: 408: 397: 394: 392: 389: 387: 384: 383: 381: 372: 369: 367: 365: 363: 360: 358: 355: 353: 350: 348: 345: 344: 340: 331: 324: 321: 315: 309: 306: 301: 294: 291: 280:on 2007-06-22 279: 275: 268: 266: 262: 255: 251: 248: 246: 245:Dangling else 243: 242: 238: 236: 233: 228: 223: 221: 216: 214: 210: 202: 200: 198: 193: 191: 185: 183: 178: 174: 166: 164: 162: 158: 154: 151:as either an 138: 135: 132: 129: 128: 127: 108: 106: 101: 99: 95: 91: 86: 67: 57: 55: 53: 47: 45: 41: 38: 34: 30: 19: 323: 314:"BtYacc 3.0" 308: 293: 282:. Retrieved 278:the original 224: 217: 206: 194: 189: 186: 173:symbol table 170: 156: 152: 142: 133:operator '*' 125: 102: 97: 87: 84: 61: 48: 32: 26: 94:source code 380:Categories 284:2008-11-27 256:References 213:concurrent 153:identifier 33:lexer hack 341:Citations 239:See also 177:pipeline 90:compiler 396:Parsing 362:DOI.org 105:typedef 58:Problem 37:parsing 130:?? 'A' 98:tokens 31:, the 227:Clang 220:byacc 155:or a 225:The 182:hack 157:type 391:C++ 197:C++ 184:". 27:In 382:: 264:^ 163:. 54:. 332:. 316:. 302:. 287:. 190:A 180:" 149:A 145:A 121:; 118:B 115:* 112:A 80:; 77:B 74:* 71:A 64:A 44:C 20:)

Index

The lexer hack
computer programming
parsing
context-sensitive
C
separation of concerns
compiler
source code
typedef
production rule
symbol table
pipeline
hack
C++
lexerless parsing
concurrent
byacc
Clang
semantic analysis
Dangling else
Most vexing parse


"A YACC-able C++ 2.1 GRAMMAR, AND THE RESULTING AMBIGUITIES"
the original
"The context sensitivity of C's grammar"
"BtYacc 3.0"
"How Clang handles the type / variable name ambiguity of C/C++"
http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elkhound/index.html
http://cs.nyu.edu/rgrimm/papers/pldi06.pdf

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.