CS 410P/510 Rust Su19: Arrays and Strings

Arrays

Array
- Owned type
- Type includes size
- Types like [u8;5], values like [0u8;5] or [0u8, 1u8].
Slice
- Reference to an array — contents not owned
- Reference is "smart pointer" that remembers the length
- Types like &[u8], values like &[0u8;5] or &[0u8, 1u8].
- Can index with a range to get a "slice" of the underlying array, e.g.
```
let a = [0u8, 1, 2, 3, 4];
let s = &a[1..4];
assert_eq!(s.len(), 3);
assert_eq!(s[0], 1);
assert_eq!(s[2], 3);
```
- There are lots of cool operations on slices; see the textbook and The Book and the docs
Vec
- Heap-allocated array-like object
- Length is tracked at runtime: storage is managed
- Types like Vec<u8>, values like Vec::new() or vec![1u8, 2, 3]
- Can append to a Vec with push(), extract last with pop()
- Can slice a Vec, e.g.
```
let mut a = vec![0u8, 1, 2, 3, 4];
assert_eq!(a.len(), 5);
a.push(5);
assert_eq!(a.len(), 6);
let s = &a[1..4];
assert_eq!(s.len(), 3);
assert_eq!(s[0], 1);
assert_eq!(s[2], 3);
```
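
A minimal sketch of how the three types relate: arrays of different lengths are different types, but arrays and Vecs all borrow as the same slice type. The helper name `total` is made up for this example.

```rust
// Hypothetical helper: sum the bytes of any slice.
fn total(xs: &[u8]) -> u32 {
    xs.iter().map(|&x| u32::from(x)).sum()
}

fn main() {
    let a: [u8; 5] = [1; 5];     // array: the length 5 is part of the type
    let b: [u8; 2] = [10, 20];   // a different type from [u8; 5]
    let mut v: Vec<u8> = vec![1, 2, 3];

    // All three coerce to &[u8] when borrowed.
    assert_eq!(total(&a), 5);
    assert_eq!(total(&b), 30);
    assert_eq!(total(&v), 6);
    assert_eq!(total(&v[1..]), 5);

    // pop() returns an Option, since the Vec might be empty.
    v.push(4);
    assert_eq!(v.pop(), Some(4));
    v.clear();
    assert_eq!(v.pop(), None);
}
```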

Can only index with value of type usize
Indices are bounds-checked at runtime, and an out-of-range index panics. The compiler can often hoist the checks out of loops or eliminate them entirely
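
A small sketch of both points, assuming a made-up helper `lookup`: other integer types must be converted to usize before indexing, and get() is the non-panicking alternative to the bounds-checked index.

```rust
// Hypothetical helper: look up a table entry by a u8 key.
fn lookup(table: &[i32], key: u8) -> Option<i32> {
    // Indices must be usize; usize::from is a lossless conversion from u8.
    table.get(usize::from(key)).copied()
}

fn main() {
    let table = [10, 20, 30];
    let i: u8 = 2;
    // table[i] would not compile: the index must be a usize.
    assert_eq!(table[i as usize], 30);
    assert_eq!(lookup(&table, 2), Some(30));
    // Out of range: get() returns None where indexing would panic.
    assert_eq!(lookup(&table, 7), None);
}
```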

It may be more efficient and readable to iterate over values or references than to index

    let mut a: Vec<u8> = (0..5).collect();
    for i in 0..a.len() {
        a[i] += 1;
    }

    let mut a: Vec<u8> = (0..5).collect();
    for v in a.iter_mut() {
        *v += 1;
    }

Strings

Lots of stuff here
tl;dr:
- There's char which is a Unicode code point
- There's str which is a UTF-8 string of bytes
- There's String which is the "owned" version of str
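
A quick sketch of the three types side by side. The helper `lens` is invented for illustration, and the non-ASCII characters are written as escapes so the byte counts are unambiguous.

```rust
use std::mem::size_of;

// Hypothetical helper: (length in bytes, length in code points).
fn lens(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let c: char = '\u{e9}';            // 'é', one Unicode code point
    let s: &str = "caf\u{e9}";         // "café", borrowed UTF-8 text
    let owned: String = s.to_string(); // owned copy on the heap

    assert_eq!(size_of::<char>(), 4);  // a char is always 32 bits
    assert_eq!(lens(s), (5, 4));       // 5 bytes, 4 code points
    assert!(s.ends_with(c));
    assert_eq!(owned, s);              // String and &str compare by content
}
```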

"Character" can mean a lot of things. There's 7-bit ASCII and 8-bit "latin-1". There are a million per-language coding standards
In many languages there can be some debate about what constitutes a single "character". Even in English, is a ligature like ff a character?
Rust's char is a Unicode "code point". It's a 32-bit quantity, but not all possible values are legal
There are the usual character classifiers and converters, which work on full Unicode
There's a bit of ASCII support, but you probably shouldn't normally use it
You can always cast a char to any integer type big enough to hold it
You can't cast an integer wider than u8 to char with as: use std::char::from_u32() or something like it, which returns an Option depending on whether the input is a legal Unicode code point. Only u8 can be cast directly
Note that case conversions can't return a single character because uppercase ←→ lowercase is not always 1:1. So they return a char iterator, which is super-annoying
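
The conversions above can be sketched as follows. The helper `upper` is hypothetical; it collects the case-conversion iterator into a String because the result may be more than one char.

```rust
use std::char;

// Hypothetical helper: uppercase a single char into a String.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // char -> integer: a plain cast works.
    assert_eq!('A' as u32, 0x41);
    assert_eq!('\u{1F600}' as u32, 0x1F600);

    // integer -> char: checked, because not every u32 is a code point.
    assert_eq!(char::from_u32(0x41), Some('A'));
    assert_eq!(char::from_u32(0xD800), None);   // surrogate, not a legal char
    assert_eq!(char::from_u32(0x110000), None); // beyond the Unicode range

    // A u8 can be cast directly, since every u8 value is a legal code point.
    assert_eq!(0x41u8 as char, 'A');

    // Case conversion yields an iterator of chars.
    assert_eq!(upper('a'), "A");
    assert_eq!(upper('\u{df}'), "SS"); // German sharp s uppercases to two chars
}
```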

It is deemed "not best practice" to store strings as sequences of 32-bit code points. So a compressed encoding called UTF-8 is used for strings. This encoding stores ASCII characters as themselves, and uses an escape convention to get multibyte coding of non-ASCII characters. A UTF-8 string is almost always much smaller than 4x the number of code points
A Rust str is like an array, except of UTF-8 text. A str is unsized, so it is really only useful in certain type declarations
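
To make the encoding sizes concrete, here is a sketch using a made-up wrapper `utf8_len` around the standard len_utf8() method. Characters are written as escapes so the code points are explicit.

```rust
// Hypothetical helper: number of UTF-8 bytes used to encode one char.
fn utf8_len(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    assert_eq!(utf8_len('a'), 1);         // ASCII: stored as itself
    assert_eq!(utf8_len('\u{e9}'), 2);    // é
    assert_eq!(utf8_len('\u{20ac}'), 3);  // €
    assert_eq!(utf8_len('\u{1F600}'), 4); // an emoji outside the basic plane

    let s = "na\u{ef}ve"; // "naïve"
    assert_eq!(s.chars().count(), 5);
    assert_eq!(s.len(), 6);            // 6 bytes, versus 20 as 32-bit code points
    assert_eq!(s.as_bytes()[0], b'n'); // ASCII bytes appear unchanged
}
```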

Let's just refer to &str and String values collectively as "strings"
An &str is a reference to a str. It is a fat pointer that contains the length of the str in bytes. The normal borrow rules apply
A String owns its str, which lives on the heap. Like a Vec<u8>, it holds a pointer, the length of the str in bytes, and a capacity
Because String is owned, you can modify the contained bytes. However, all the methods provided for this are guaranteed to preserve UTF-8 encoding of the bytes. This is carefully tuned to avoid trouble
You can get an iterator over a string's code points (.chars()) or u8 bytes (.bytes())
There is no convenient way to go to a given character (code point) position in a string. If you plan to do that a lot, collect the code points into a Vec<char> with chars().collect()
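
A sketch of these operations, assuming a hypothetical helper `nth_char` that walks the string to find a position:

```rust
// Hypothetical helper: the code point at position `n`, found by walking the string.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let mut s = String::from("h\u{e9}llo"); // "héllo"
    s.push('!');        // appends one code point, encoded as UTF-8
    s.push_str(" ok");  // appends a whole &str
    assert_eq!(s, "h\u{e9}llo! ok");

    assert_eq!(s.bytes().count(), 10); // u8 bytes
    assert_eq!(s.chars().count(), 9);  // code points

    // Walking to a position takes time proportional to the position...
    assert_eq!(nth_char(&s, 1), Some('\u{e9}'));
    // ...so for repeated random access, collect the chars once.
    let cs: Vec<char> = s.chars().collect();
    assert_eq!(cs[4], 'o');
    assert_eq!(cs.len(), 9);
}
```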

There are the obvious methods for converting these things around. Read the book carefully to learn about the vocabulary
Many of the string manipulation functions take a "pattern". There are many kinds: read p. 402 of the book for the details. You will use them all, eventually
There's a regex crate. It is not part of std, so it has to be added as a dependency. It's OK.
You can use from_str() or .parse() to convert a string into something else. You can use to_string() to convert other things into String
You can use .as_bytes() and .into_bytes() to grab the bytes of a string for free
The ::from_utf8() methods come in checked, lossy and unsafe flavors. Choose wisely
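
A sketch tying several of these together using only std: a few pattern kinds with split(), parsing and formatting, and the byte conversions. The helper `parse_list` is invented for this example, and the unsafe from_utf8 flavor is deliberately left out.

```rust
use std::num::ParseIntError;
use std::str::FromStr;

// Hypothetical helper: parse a comma-separated list of integers.
fn parse_list(s: &str) -> Result<Vec<i32>, ParseIntError> {
    s.split(',').map(|t| t.trim().parse::<i32>()).collect()
}

fn main() {
    // Patterns: a char, a &str, or a closure over chars all work with split().
    let by_char: Vec<&str> = "a,b,c".split(',').collect();
    let by_str: Vec<&str> = "a::b::c".split("::").collect();
    let by_fn: Vec<&str> = "a1b2c".split(|c: char| c.is_ascii_digit()).collect();
    assert_eq!(by_char, ["a", "b", "c"]);
    assert_eq!(by_str, ["a", "b", "c"]);
    assert_eq!(by_fn, ["a", "b", "c"]);

    // String -> value and value -> String.
    assert_eq!("42".parse::<i32>(), Ok(42));
    assert_eq!(i32::from_str("-7"), Ok(-7));
    assert!("forty-two".parse::<i32>().is_err());
    assert_eq!(42.to_string(), "42");
    assert_eq!(parse_list("1, 2, 3"), Ok(vec![1, 2, 3]));

    // Bytes: as_bytes() borrows, into_bytes() consumes the String.
    let s = String::from("h\u{e9}");
    assert_eq!(s.as_bytes(), &[b'h', 0xC3, 0xA9][..]);
    let bytes: Vec<u8> = s.into_bytes();

    // Checked conversion back; it fails on invalid UTF-8.
    assert_eq!(String::from_utf8(bytes).unwrap(), "h\u{e9}");
    assert!(String::from_utf8(vec![0xFF]).is_err());
    // Lossy conversion substitutes U+FFFD for invalid sequences.
    assert_eq!(String::from_utf8_lossy(&[b'a', 0xFF]), "a\u{FFFD}");
}
```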

Last modified: Monday, 1 July 2019, 1:37 AM