Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I always felt like NUL-termination, newline-separation, and (eventually) UTF-8 were all sort of complementary ideas: they all take as an axiom that strings are fundamentally streams, not random-access buffers; and they all separate the space of single-byte coding units, by simple one-bitwise-instruction-differentiable means, into multiple lexical token types.

Taking all three together, you end up with the conception of a "string" as a serialized bitstring encoding a sequence of four lexical types: a NUL type (like the EOF "character" in a STL stream), an ASCII control-code type (or a set of individual control codes as types, if you like), a set of UTF-8 "beginning of rune" types for each possible rune length, a "byte continuing rune" type, and an ASCII-printable type. (You then feed this stream into another lexer to put the rune-segment-tokens together into rune-tokens.)

In the end, it's not a surprise that all of these components were effectively from a single coherent design, thought up by Ken Thompson. It's a bit annoying that each part ended up introduced as part of a separate project, though: NULs with Unix, gets() with C, and runes with Plan 9.

One of the pleasant things about Go's string support, I think, it that was an opportunity for Ken to express the entirety of his string semantics as a single ADT type. That part of the compiler is quite lovely.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: