CS 410P/510 Rust W2020: Strings and I/O

Strings (ugh)

Lots of stuff here
Fortunately, we've talked about a lot of it before
tl;dr:
- There's char which is a Unicode code point
- There's str which is a UTF-8 string of bytes
- There's String which is the "owned" version of str

"Character" can mean a lot of things. There's 7-bit ASCII and 8-bit "Latin-1", and there are a million per-language coding standards
In many languages there can be some debate about what constitutes a single "character". Even in English, is a ligature like ﬀ (U+FB00) a character?
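Rust's char is a Unicode "code point" (strictly, a Unicode scalar value). It's a 32-bit quantity, but not all possible values are legal: only 0 through 0x10FFFF, excluding the surrogate range 0xD800 through 0xDFFF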
There are the usual character classifiers and converters, which work on full Unicode
There's a bit of ASCII-only support (is_ascii_digit(), to_ascii_uppercase() and the like), but you probably shouldn't normally use it
You can always cast a char to an integer type with as. Casting to a type too small to hold the value silently truncates, so use u32 unless you know better
You can't cast an integer to char in general: only u8 can be cast with as. For anything else you need to use std::char::from_u32() or something like it. It returns an Option depending on whether the particular input is a legal Unicode code point
Note that case conversions can't return a single character because uppercase ←→ lowercase is not always 1:1. So to_uppercase() and to_lowercase() return a char iterator, which is super-annoying
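
A minimal sketch of the char conversions above. The helper names checked_char and upper are made up for illustration; the standard-library calls are std::char::from_u32() and char::to_uppercase().

```rust
/// Convert an integer to a char, if it is a legal Unicode scalar value.
/// (Hypothetical helper name; it just wraps std::char::from_u32.)
fn checked_char(n: u32) -> Option<char> {
    std::char::from_u32(n)
}

/// Uppercase a single char. The result is collected into a String
/// because one char may uppercase to several.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // char to integer is a plain cast.
    assert_eq!('A' as u32, 65);
    // u8 to char is the one integer-to-char cast that is allowed.
    assert_eq!(97u8 as char, 'a');

    // Surrogates and values above 0x10FFFF are not legal chars.
    assert_eq!(checked_char(0x41), Some('A'));
    assert_eq!(checked_char(0xD800), None);
    assert_eq!(checked_char(0x11_0000), None);

    // U+00DF (German sharp s) uppercases to two chars.
    assert_eq!(upper('\u{df}'), "SS");
    assert_eq!(upper('a'), "A");
}
```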

It is deemed "not best practice" to store strings as sequences of 32-bit code points. So a compressed encoding called UTF-8 is used for strings. This encoding stores ASCII characters as themselves (one byte each), and uses an escape convention to get multibyte coding (two to four bytes) of non-ASCII characters. A UTF-8 string is almost always much smaller than 4x the number of code points
A Rust str is like a slice of bytes ([u8]), except that the bytes are guaranteed to be valid UTF-8 text. A str is unsized, so it is really only useful behind a pointer in type declarations such as &str or Box<str>
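
To make the size claim concrete, here is a small sketch comparing byte length with code point count. The helper byte_and_char_len is a made-up name. Non-ASCII characters are written as \u{...} escapes so the byte counts don't depend on how this page is encoded.

```rust
/// Return (length in bytes, number of code points) for a string.
/// (Hypothetical helper, for illustration only.)
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // Pure ASCII: one byte per code point.
    assert_eq!(byte_and_char_len("hello"), (5, 5));
    // U+00E9 (e with acute accent) takes two bytes.
    assert_eq!(byte_and_char_len("h\u{e9}llo"), (6, 5));
    // Three CJK characters at three bytes each.
    assert_eq!(byte_and_char_len("\u{65e5}\u{672c}\u{8a9e}"), (9, 3));
    // Code points above U+FFFF take four bytes.
    assert_eq!('\u{1F600}'.len_utf8(), 4);
}
```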

Let's just refer to &str and String values collectively as "strings"
An &str is a reference to a str. It is a fat pointer that contains the address of the str and its size in bytes. The normal borrow rules apply
A String owns its str on the heap, much as a Vec<u8> owns its bytes. It holds a pointer, the size of the str in bytes, and the capacity of the allocation
Because String is owned, you can modify the contained bytes. However, all the methods provided for this are guaranteed to preserve UTF-8 encoding of the bytes. This is carefully tuned to avoid trouble
You can get an iterator over a string's code points (.chars()) or u8 bytes (.bytes())
There is no convenient way to go to a given character (code point) position in a string: .chars().nth(n) works, but it scans from the start. If you plan to do that a lot, use .chars().collect() to build a Vec<char> once and index that
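
A short sketch of the points above: iterating by code point and by byte, positional access, and modifying an owned String. The helpers nth_char and shout are hypothetical names.

```rust
/// Return the char at code point position `i`, scanning from the start.
/// This is O(n): UTF-8 is variable-width, so there is no direct indexing.
fn nth_char(s: &str, i: usize) -> Option<char> {
    s.chars().nth(i)
}

/// Build an owned String from a borrowed &str and append to it.
fn shout(s: &str) -> String {
    let mut owned = String::from(s);
    owned.push('!'); // append one char
    owned.push_str("!!"); // append a &str
    owned
}

fn main() {
    // "na\u{ef}ve": the i-with-diaeresis takes two bytes in UTF-8.
    let s = "na\u{ef}ve";
    assert_eq!(s.len(), 6);
    assert_eq!(s.chars().count(), 5);
    assert_eq!(s.bytes().next(), Some(b'n'));
    assert_eq!(nth_char(s, 2), Some('\u{ef}'));

    // For repeated random access, collect the code points once.
    let v: Vec<char> = s.chars().collect();
    assert_eq!(v[3], 'v');

    assert_eq!(shout("hi"), "hi!!!");
}
```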

There are the obvious methods for converting these things around. Read the book carefully to learn about the vocabulary
Many of the string manipulation functions take a "pattern". There are a lot of kinds: read p. 402 of the book for the details. You will use them all, eventually
There's a regex crate. It's not part of the standard library, so you need to add it as a dependency. It's OK.
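
A sketch of a few pattern kinds, not an exhaustive list: a char, a &str, a slice of chars and a closure. The helpers split_fields and words are made-up names.

```rust
/// Split on either ',' or ';' (a slice-of-chars pattern) and trim each field.
fn split_fields(s: &str) -> Vec<&str> {
    s.split(&[',', ';'][..]).map(str::trim).collect()
}

/// Split on anything non-alphabetic (a closure pattern), dropping empties.
fn words(s: &str) -> Vec<&str> {
    s.split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .collect()
}

fn main() {
    let s = "one, two;three";

    // A char as the pattern.
    let parts: Vec<&str> = s.split(',').collect();
    assert_eq!(parts, ["one", " two;three"]);

    // A &str as the pattern; find() returns a byte offset.
    assert_eq!(s.find("two"), Some(5));

    assert_eq!(split_fields(s), ["one", "two", "three"]);
    assert_eq!(words(s), ["one", "two", "three"]);
}
```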
You can use from_str() or .parse() to convert a string into something else. You can use to_string() to convert other things into String
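
A minimal sketch of going both directions. parse_port is a hypothetical helper; .parse(), FromStr::from_str() and .to_string() are the standard calls.

```rust
use std::str::FromStr;

/// Parse a port number, returning None on any failure.
/// (Hypothetical helper; the target type u16 comes from the return type.)
fn parse_port(s: &str) -> Option<u16> {
    s.trim().parse().ok()
}

fn main() {
    assert_eq!(parse_port(" 8080 "), Some(8080));
    assert_eq!(parse_port("99999"), None); // out of range for u16
    assert_eq!(parse_port("http"), None); // not a number

    // from_str() is the trait method that .parse() calls.
    let x = f64::from_str("2.5").unwrap();
    assert_eq!(x, 2.5);

    // to_string() works on anything that implements Display.
    let n = 42;
    assert_eq!(n.to_string(), "42");
    assert_eq!(true.to_string(), "true");
}
```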
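You can use .as_bytes() (borrows) and .into_bytes() (consumes the String) to grab the bytes of a string for free: neither one copies
The from_utf8() functions come in checked (String::from_utf8(), std::str::from_utf8()), lossy (String::from_utf8_lossy()) and unsafe (from_utf8_unchecked()) flavors. Choose wisely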
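
A sketch of the byte conversions in the checked and lossy flavors; the unsafe flavor is left out on purpose. decode_checked and decode_lossy are made-up helper names.

```rust
/// Checked conversion: fails if the bytes are not valid UTF-8.
fn decode_checked(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

/// Lossy conversion: invalid sequences become U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let s = String::from("h\u{e9}");
    // Borrow the bytes without copying.
    assert_eq!(s.as_bytes(), &[0x68u8, 0xc3, 0xa9]);
    // Take ownership of the bytes; s is consumed here.
    let raw: Vec<u8> = s.into_bytes();
    assert_eq!(decode_checked(raw), Some("h\u{e9}".to_string()));

    // The byte 0xff never appears in valid UTF-8.
    assert_eq!(decode_checked(vec![0x68, 0xff]), None);
    assert_eq!(decode_lossy(&[0x68, 0xff]), "h\u{fffd}");
}
```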

Pages 411-424 of the book discuss the details of the format string language we've been using with println!() and the like. Read them
A lot of the details of this section have to do with macros, so we'll deal with them when we get to macros
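
In the meantime, a small sample of the format string language: width, alignment, precision, radix, Debug output, and positional and named arguments. table_row is a hypothetical helper.

```rust
/// Format a name left-aligned in 8 columns and a value right-aligned
/// in 8 columns with two decimal places. (Made-up example function.)
fn table_row(name: &str, value: f64) -> String {
    format!("{:<8}{:>8.2}", name, value)
}

fn main() {
    assert_eq!(format!("{:>5}", 42), "   42"); // right-align, width 5
    assert_eq!(format!("{:<5}|", "ab"), "ab   |"); // left-align, width 5
    assert_eq!(format!("{:05.1}", 3.14159), "003.1"); // zero-pad, 1 decimal
    assert_eq!(format!("{:#x}", 255), "0xff"); // hex with prefix
    assert_eq!(format!("{:?}", "a\tb"), "\"a\\tb\""); // Debug escapes
    assert_eq!(format!("{1} {0}", "a", "b"), "b a"); // positional
    assert_eq!(format!("{name}={val}", name = "x", val = 1), "x=1"); // named

    println!("{}", table_row("pi", 3.14159));
}
```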

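There are Read, BufRead and Write traits
The first two correspond to "unbuffered" and "buffered" input. Wrap a reader in BufReader, or a writer in BufWriter, to add buffering
Any sensible thing, including strings, either implements Read / Write or can be made to (for example via .as_bytes() or std::io::Cursor)
stdin(), stdout() and stderr() are functions returning handles to readers / writers rather than global objects, because access has to be locked across threads
The flush situation is non-optimal: stdout is line-buffered, so a prompt without a newline won't show up until you call .flush() yourself
Path and friends deal with filenames / pathnames. OsString may also be necessary, since filenames need not be valid UTF-8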
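
A sketch of the traits in use, assuming in-memory data stands in for files. count_lines and greet are made-up helpers, generic over BufRead and Write, so they work equally on files, stdin / stdout, or byte buffers.

```rust
use std::io::{self, BufRead, Cursor, Read, Write};

/// Count lines from any buffered reader.
fn count_lines<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut n = 0;
    for line in reader.lines() {
        line?; // propagate any read error
        n += 1;
    }
    Ok(n)
}

/// Write a greeting to any writer, then flush it.
fn greet<W: Write>(mut out: W, name: &str) -> io::Result<()> {
    writeln!(out, "hello, {}", name)?;
    out.flush()
}

fn main() -> io::Result<()> {
    // &[u8] implements Read and BufRead, so a string's bytes can be input.
    assert_eq!(count_lines("a\nb\nc\n".as_bytes())?, 3);

    // Vec<u8> implements Write, so output can be captured in memory.
    let mut buf: Vec<u8> = Vec::new();
    greet(&mut buf, "world")?;
    assert_eq!(buf, b"hello, world\n");

    // Cursor adds a read position to an in-memory buffer.
    let mut text = String::new();
    Cursor::new("xyz").read_to_string(&mut text)?;
    assert_eq!(text, "xyz");

    // stdout() returns a handle; lock() holds the lock across several writes.
    let stdout = io::stdout();
    let mut handle = stdout.lock();
    write!(handle, "prompt> ")?;
    handle.flush()?; // no newline was written, so flush by hand
    Ok(())
}
```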
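
A small sketch of building and picking apart a pathname without touching the filesystem. with_txt is a hypothetical helper and the directory and file names are made up.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Build "<dir>/<stem>.txt" using the platform's path separator.
/// (Made-up helper for illustration.)
fn with_txt(dir: &str, stem: &str) -> PathBuf {
    let mut p = PathBuf::from(dir);
    p.push(stem);
    p.set_extension("txt");
    p
}

fn main() {
    let p = with_txt("notes", "week3");
    assert_eq!(p, Path::new("notes").join("week3.txt"));

    // Path components are OsStr, not str: they need not be valid UTF-8.
    assert_eq!(p.extension(), Some(OsStr::new("txt")));
    // to_str() returns None if the component is not valid UTF-8.
    assert_eq!(p.file_stem().and_then(OsStr::to_str), Some("week3"));

    // display() gives a lossy, printable view of the path.
    println!("{}", p.display());
}
```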

Last modified: Thursday, 17 May 2018, 6:31 PM