Mistakes with Rust smart pointers: when Deref goes wrong

Prologue

The following is an attempt at highlighting the sources of high
astonishment

arising from the implementation of the
std::ops::Deref trait on
custom types.

I write this, in part, because I had recently used Deref inappropriately while
at $day_job, due to my lack of a full understanding of how Deref interacts
with the rest of the language. It should go without saying that this was a
confusing experience.

Naturally, the aforementioned mishap will serve as a motivating example
throughout this dive into Rust semantics.

Newtypes

I cannot share the full context surrounding my misuse of Deref (for both
pedagogic and legal reasons) so please understand that my motivating example
won’t be well motivated itself.

With that disclaimer out of the way, suppose we’re building a library that
provides an ASCII string type called
AsciiString. A fitting representation of these so-called byte strings would
simply be a plain ol’ Vec.

However, ASCII characters are limited to the range 0..128, and thus not any
value of type Vec would be a valid ASCII string. In other words, it’s not
a good idea to
alias
AsciiString to Vec; they are not
interchangeable
.

In Rust, it is considered good practice to enforce invariants at the type
level. Thus, the idiomatic course of action would be to use a
newtype to
wrap a Vec and then statically guarantee that all wrapped byte vectors
represent valid ASCII strings.

/// An ASCII-encoded, growable string.
struct AsciiString(Vec);

impl AsciiString {
    /// Range of valid ASCII bytes.
    const ASCII_RANGE: Range = 0..128;

    /// Converts a byte vector to an `AsciiString`.
    ///
    /// # Errors
    /// Returns `Err` if any byte in `vec` 
    /// falls outside the range [`Self::ASCII_RANGE`].
    fn from_ascii(vec: Vec) -> Result {
        if vec
            .iter()
            .all(|b| Self::ASCII_RANGE.contains(b))
        {
            Ok(Self(vec))
        } else {
            Err(AsciiError)
        }
    }
}

Explicit conversion

UTF-8 is backward compatible with ASCII. This means that we can freely
reinterpret any well-constructed AsciiString value to a str value. What we
stand to gain from this is compatibility with code written with UTF-8 strings in
mind.

impl AsciiString {
    fn as_str(&self) -> &str {
        debug_assert!(str::from_utf8(&self.0).is_ok());

        // SAFETY: self.0 is ASCII-encoded, 
        // and ASCII is a subset of UTF-8
        unsafe { str::from_utf8_unchecked(&self.0) }
    }
}

Two statements in, and the unspeakable keyword has already been uttered; I’m
feeling pretty tall
today and the temptation is frankly irresistible. Obviously, any bug surrounding
the construction of AsciiString values implies memory unsafely outside debug
mode.

Hold on a second. Both
std::str::from_utf8 and
std::str::from_utf8_unchecked
take in a &[u8] as an argument, while &self.0 is of type &Vec; there
is definitely some kind of implicit coercion going on here. Is Rust not so
strongly typed after all?

Enter Deref

pub trait Deref {
    type Target: ?Sized;

    fn deref(&self) -> &Self::Target;
}

We seem to have inadvertently run into Deref before I even got the chance to
introduce it properly. The simple
explanation of why &Vec was “coerced” to &[u8] is that Vec
implements
Deref. In short, the compiler conceptually transformed the
expression str::from_utf8(&self.0) into something akin to:

str::from_utf8( as Deref>::deref(&self.0))

This mechanism at play here is called Deref
Coercion
,
and is defined by the interaction between two distinct mechanisms defined by The
Rust Reference: Method
Resolution

and Type
Coercion
. Having
already dropped so much jargon, I refrain from uncovering any gory details for
now.

Clean Code™

For all practical purposes, AsciiString:as_str works perfectly as a sure and
simple way to convert any &AsciiString reference to a &str. However, it will
mean one would have to manually add .as_str() every time a conversion is
needed.

For dubious reasons involving “clean code” we shall implement Deref for AsciiString: the compiler will from now on call .as_str() on our
behalf.

impl Deref for AsciiString {
    type Target = str;

    fn deref(&self) -> &Self::Target {
        self.as_str()
    }
}

Idiomatic Code™

For those who care not for aesthetics when it comes to computer programs — a
vulgar but arguably rational position — I shall appeal to Rust
idioms. Chapter 5 of the Rust API Guidelines contains a recommendation
titled
C-DEREF
which states:

The Deref traits are used implicitly by the compiler in many circumstances,
and interact with method resolution. The relevant rules are designed
specifically to accommodate smart pointers, and so the traits should be used
only for that purpose.

Is AsciiString a smart pointer? What is a smart pointer anyway? That is a
debate I’m
uninterested in exploring here. However, the authors of the Rust API Guidelines
go on to give examples of smart pointers in the Standard Library:

  • Box
  • String as a smart pointer to str
  • Rc
  • Arc
  • Cow<'a, T>

On these grounds, I claim that AsciiString ought to be considered a smart
pointer to str. Note that the idea that String is a smart pointer is itself
often contested, since String is not generic over the data it points to the
way Box and friends are.

To my astonishment

Note that AsciiString inherits the implementation of Index>
from str. We get this as a side effect of implementing Deref and it allows
us to borrow string slices from an AsciiString.

let data = vec![b'A', b'B', b'C'];
let ascii = AsciiString::from_ascii(data).unwrap();
assert_eq!(&ascii[1..3], "BC");

Up until now, we have been relatively well-behaved rustaceans. But where is the
fun in that? One cannot be writing production-ready code all the time, for that
is a sure path to madness.

Let’s allow for AsciiString to be indexed using floating-point numbers;
precisely because we can.

impl Index> for AsciiString {
    type Output = str;

    /// Returns an ASCII string slice by rounding 
    /// the given floating-point indices in `index`.
    fn index(&self, index: Range) -> &Self::Output {
        /// Returns the integer nearest 
        /// to the absolute value of `float`.
        fn nearest_index(float: f64) -> usize {
            float.abs().round() as usize
        }

        let start = nearest_index(index.start);
        let end = nearest_index(index.end);
        let string = self.as_str();

        &string[start..end]
    }
}

You may be wondering what happens when your index is exactly halfway between two
integers. The answer is that f64::round will round away from 0.0, thus
0.5 will be rounded to 1. At this point, we might as well have had Index
toss a coin to resolve this ambiguity.

let data = vec![b'A', b'B', b'C'];
let ascii = AsciiString::from_ascii(data).unwrap();
assert_eq!(&ascii[1.5_f64..3_f64], "BC");

We finally arrived at the part which surprised me and served as the motivation
for all that I wrote before and after this point: we no longer inherit
Index> from str. The previous version of the above code
block doesn’t compile anymore:

error[E0308]: mismatched types
  --> src/main.rs:69:23
   |
69 |     assert_eq!(&ascii[1..3], "BC");
   |                       ^^^^ expected `Range`, 
   |                            found `Range<{integer}>`
   |
   = note: expected struct `std::ops::Range`
              found struct `std::ops::Range<{integer}>`

To me, this was incredibly confusing because Index is generic over its Idx
type, which means it’s perfectly valid to have multiple implementations of
Index for different Idx types.

Somehow, the compiler can no longer find the implementation of Index on
&str. If you’re like me, it’s not at all obvious how that came to be. To
better understand all of this, we dig deeper into exactly how the compiler
resolves methods in the presence of a Deref implementation.

Nitty-gritty details

Everything I describe below is limited to commit
6bacf5a
of the Rust project, i.e. Rust 1.69.0
.

Method lookup in rustc is divided into two phases:
probe and
confirm. As
the Rust Compiler Development
Guide
explains,
probe is concerned with finding the callee method while confirm solves for
unknown type variables.

During the probe phase, the compiler would first perform a “Deref-loop”:
essentially repeated dereferencing of the receiver type, i.e. AsciiString. In
our particular case, this would yield two steps: AsciiString and [u8].

Then, the compiler would assemble a list of candidates from each step. These
candidate methods are subsequently searched for one that fits the reliever
type. Up until this point, it seems that rustc should’ve been able to find the
index method from the Index> implementation on [u8] because
the search always includes Deref targets.

The issue, it turns out, is that method search does not check method parameter
types against argument types. Rather, it only checks for method names and
where-clauses such as T: Trait. Types are only resolved in the confirm phase,
at which point rustc would have already picked its method.

But why is method lookup in rustc split into two phases? The answer lies in
the cacheability of the picks provided by the probe phase. In other words,
rustc developers wished to reuse probe search results of identical methods,
hence why probe cannot depend on inferred types local to one particular call
site. I would wager that this was most likely done to address performance
issues.

If you can spot any inaccuracies in the above description, please don’t shy away
from correcting me via email. I would appreciate it, as I am by no means a
rustc developer.

Epilogue

Moral of the story: don’t implement Deref kids!

On a more serious note, if one does implement Deref on a type T,
and it proved necessary to implement a generic trait Trait on T. Then
one should ensure that the implementations of Trait for different
parameters S are not scattered across T and U.

Put simply, either T or U should host all existing Trait
implementations. Note that the preceding “or” is inclusive, meaning that both
T and U can host some Trait implementations, as long as they both host
all of them.


Read More

Leave a Comment