Pointer Analysis and Undefined Behavior in C programs

Recently, I came across the question — can pointer analysis algorithms such as Andersen’s algorithm and Steensgaard’s algorithm correctly detect undefined behavior (such as buffer overflows) in C programs? If yes, how? If not, why not? And if they can’t, how does that affect the soundness of pointer analysis? This is a very interesting question and I’ll try to share my thoughts in this post.

I believe the key to answering this question is not to classify the behavior of C programs as defined or undefined when thinking about pointer analysis, but to think of pointer analysis as answering the question — which objects can a pointer point to during an execution of the program as intended by the programmer?
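
To make this concrete, here is a small illustrative example of my own (not from any paper) of the kind of answer a may-points-to analysis computes:

int a, b;

void example(int cond) {
    int *p = &a;   /* pts(p) = {a} */
    int *q = p;    /* pts(q) includes pts(p) = {a} */
    if (cond)
        q = &b;    /* pts(q) = {a, b}: q may point to either object */
    (void)q;
}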

Also, pointer analysis operates on an abstract model of the program, with only limited knowledge of the full program. Moreover, some “knowledge” isn’t available at static analysis time at all (e.g. the lengths of buffers read over the network).

In light of this, let’s take another look at two interesting undefined behaviors in C programs that affect pointers.

  1. Integer to Pointer casts

    Casting an integer to a pointer is classified as “implementation-defined” in the ANSI C standard (see here). However, identifying the program points that perform this operation is trivial. Therefore, on encountering an integer-to-pointer cast, a pointer analysis implementation can simply fall back to saying “this pointer can point to ALL objects in the program”, maintaining soundness but losing all precision for that particular pointer. The SVF implementation does exactly this by modeling all pointers created from integers as pointing to a “black hole object” (see the first sketch after this list).
  2. Buffer overflows

    Detecting buffer overflows, however, is a different challenge. Usually, buffers are traversed via a loop that looks something like the one below (even if the program uses libc functions such as memcpy; within memcpy there is a loop that is logically equivalent to this):

    for (int i = 0; i < SIZE; i++) {
        *ptr++ = *buff++;
    }


    Now, determining the bounds for this loop might be simple, but in general SIZE might be the result of a pointer dereference, or might even have been read over the network! This kind of bounds analysis on loop counters is very challenging, if not impossible, to perform at analysis time. Also, note that even if it could be determined at analysis time that the pointer goes out of bounds, it could almost never be determined which object resides at the out-of-bounds location, because that depends on the heap allocation algorithm being used, the order of allocations, and so on.

    So most pointer analysis implementations that I have seen assume that a pointer which points into an object at the start of a loop remains within the bounds of that object during all iterations of the loop.

    From a security perspective, this is actually useful when detecting malicious buffer overflows. We can determine at analysis time which object a particular pointer may point to, and we can instrument the program so that each iteration of the loop checks that the pointer still points into the right object (see the second sketch after this list).

    The only situation where I can see pointer analysis *not* being sound is intentional buffer overflows — that is, the same pointer is incremented (or decremented) in such a way that it accesses adjacent objects that have no relation to each other (not part of the same C array, for example).
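
Here is the first sketch mentioned above: a minimal example of my own showing an integer-to-pointer cast. The point is simply that once a pointer is manufactured from an integer, a sound analysis that cannot track the integer’s value has to assume the pointer may point anywhere.

#include <stdint.h>

int g;

int read_through_cast(void) {
    uintptr_t addr = (uintptr_t)&g;  /* pointer-to-integer cast */
    int *p = (int *)addr;            /* integer-to-pointer cast: implementation-defined */
    /* A sound analysis that cannot track the value of 'addr' must assume
       pts(p) may include every object (SVF models this with its "black hole" object). */
    return *p;
}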
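
And here is the second sketch: a rough idea of what the inserted check could look like, assuming the analysis has already told us the base address and size of the object ptr is supposed to stay within. The names checked_copy, obj_base, and obj_size are placeholders of mine, not part of any real instrumentation pass.

#include <assert.h>
#include <stddef.h>

/* Hypothetical instrumented version of the copy loop: the analysis supplies
   obj_base/obj_size for the object ptr was determined to point into. */
void checked_copy(char *ptr, const char *buff, size_t size,
                  char *obj_base, size_t obj_size) {
    for (size_t i = 0; i < size; i++) {
        /* inserted check: ptr must still point inside the intended object */
        assert(ptr >= obj_base && ptr < obj_base + obj_size);
        *ptr++ = *buff++;
    }
}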

I’m sure there are other undefined behaviors of pointers that raise all sorts of intriguing questions about how they can be correctly modeled and analyzed! Super exciting stuff! 😀

Andersen Wave Diff

This is the variant of Andersen’s algorithm that seems to be the “default” in SVF. If you use -ander, this is the analysis that is run. I’ll describe this algorithm in short.

The key idea that differentiates this from vanilla Andersen’s analysis is that, for each CopyEdge at each iteration, instead of propagating the full points-to set it propagates only the difference between its points-to set at the end of the previous iteration and its points-to set at the current iteration.

In order to do this, the constraint graph must not have any cycles, and must be sorted in topological order.
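
As a rough illustration of the “diff” idea, here is a sketch of mine using a toy bit-set representation (not SVF’s actual data structures): propagation along a copy edge only pushes the objects that were added to the source’s points-to set since the last time the edge was processed.

#include <stdbool.h>

/* Toy representation: at most 64 objects, one bit per object id. */
typedef unsigned long long PtsSet;

/* Propagate along a copy edge src -> dst, pushing only the difference between
   pts(src) now and pts(src) as it was when this edge was last processed.
   Returns true if pts(dst) grew, i.e. dst's successors must be revisited. */
bool propagate_copy_diff(PtsSet src_cur, PtsSet *src_done, PtsSet *dst) {
    PtsSet diff = src_cur & ~*src_done;  /* objects not yet propagated along this edge */
    PtsSet before = *dst;
    *dst |= diff;
    *src_done = src_cur;                 /* remember what has already been pushed */
    return *dst != before;
}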

At a high level, each iteration is divided into three phases: 

Phase 1: Collapse all cycles and sort the graph in topological order.

Phase 2: Process the Copy constraints/edges and propagate the differences in the points-to sets up the constraint graph.

Phase 3: Process the Load and Store constraints/edges and add any new Copy Edges.

If a new Copy Edge is added during Phase 3, then repeat.
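
Putting the phases together, the overall solving loop looks roughly like this. This is a structural sketch with placeholder functions of my own naming, not SVF’s actual code.

#include <stdbool.h>

/* Placeholders for the real phases, just to show the control flow. */
static void phase1_collapse_cycles_and_topo_sort(void) { /* SCC detection + ordering */ }
static void phase2_propagate_copy_diffs(void)           { /* push points-to set diffs */ }
static bool phase3_process_loads_and_stores(void)       { return false; /* true if new copy edges were added */ }

void solve(void) {
    bool new_copy_edges = true;
    while (new_copy_edges) {
        phase1_collapse_cycles_and_topo_sort();              /* Phase 1 */
        phase2_propagate_copy_diffs();                       /* Phase 2 */
        new_copy_edges = phase3_process_loads_and_stores();  /* Phase 3 */
    }
}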

The worst-case complexity of this approach is the same as vanilla Andersen’s, but because it propagates only the difference in points-to sets, it is faster in the average case. Phases 2 and 3 can also be parallelized (according to the paper http://compilers.cs.ucla.edu/fernando/publications/papers/CGO09.pdf).

Note: Because Phases 2 and 3 require an acyclic graph, any positive-weight cycles are collapsed and the struct objects involved are made field-insensitive.

Variant GEP

GEP Edges can be of two types — Normal Gep Edges and Variant Gep Edges.

A normal gep edge is one where the index or offset is known. A variant gep edge is one whose index is not a constant. For example —

struct Obj { int a; int b; int c; };
struct Obj sobj;

int *ptr = (int *)&sobj;

for (int i = 0; i < 3; i++) {
    int *c = ptr + i; // ← non-constant offset
}

In this case, a VarGepPE PAGEdge will be inserted for the source ValPN for ptr, and this VarGepPE is converted into a VariantGepCGEdge in the ConstraintGraph and solved in the following way —

Src → Dst

Any object that the src can point to is first made field-insensitive. Then, these field-insensitive objects are added to the pts-to-set(Dst). 
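
In other words, using the same toy bit-set representation as before (a sketch of the rule, not SVF’s implementation), processing a VariantGepCGEdge Src → Dst looks roughly like this:

#include <stdbool.h>

#define MAX_OBJS 64

typedef unsigned long long PtsSet;        /* bit i set => may point to object i */

static bool field_insensitive[MAX_OBJS];  /* per-object flag */

void apply_variant_gep(PtsSet src, PtsSet *dst) {
    for (int o = 0; o < MAX_OBJS; o++) {
        if (src & (1ULL << o)) {
            field_insensitive[o] = true;  /* unknown offset: give up on distinguishing fields */
            *dst |= (1ULL << o);          /* Dst may point to the (whole) object */
        }
    }
}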

VariantGeps and Arrays of struct

The default MemoryModel is field-insensitive when it comes to arrays: all elements of an array are considered to be the same. Now, if there is a variable-based access to this array, it will result in a VarGepPE from this array.

Consider this example —

struct Obj { int *a; int *b; };

int bint;
struct Obj objects[2];
struct Obj *optr = objects;

for (int i = 0; i < 2; i++) { optr[i].b = &bint; }

In this case, the variable i is used to index into the array objects via the pointer optr. The IR for the highlighted part is as follows —

for.body:                                         ; preds = %for.cond
  %1 = load %struct.Obj*, %struct.Obj** %optr, align 8
  %2 = load i32, i32* %i, align 4
  %idxprom = sext i32 %2 to i64
  %arrayidx = getelementptr inbounds %struct.Obj, %struct.Obj* %1, i64 %idxprom
  %b = getelementptr inbounds %struct.Obj, %struct.Obj* %arrayidx, i32 0, i32 1
  store i32* %bint, i32** %b, align 8
  br label %for.inc

There are two gep instructions involved in computing the address of optr[i].b. The first one computes the address of the i-th object in the array, and the second one computes the address of the field ‘b’ within the struct.

The Constraint / PAG graph considering only the first gep is shown in Figure 1.

We’d expect the GEP edge for the second gep instruction to be a normal gep edge because the offset is constant (1), but because the source of this gep originates from a VarGep, the second GEP edge also becomes a VarGep. Logically this makes sense: the array itself is element/field-insensitive, so it’s impossible to distinguish between fields within the elements of this array.

This causes a lot of imprecision in the apr-hook framework in httpd. Figure 2 shows the Constraint Graph after the second gep edge is added.

Figure 1

Figure 2

SVF: Cycle Handling in Field Sensitive Analysis

TL;DR – The AndersenWaveDiff implementation of SVF handles positive-weight cycles by making the structure object involved field-insensitive!

Let’s look at what these positive-weight cycles look like in the IR representation, and at the cycle that shows up in SVF.

The IR for the C example from Pearce’s paper (the same example shown in the next section) is given below. Note the BitCast instructions that are used to assign to and from the void pointer q.
 

  %a = alloca %struct.aggr, align 8
  %p = alloca %struct.aggr*, align 8
  %q = alloca i8*, align 8
  store i32 0, i32* %retval, align 4
  %0 = bitcast %struct.aggr* %a to i8*
  store i8* %0, i8** %q, align 8
  %1 = load i8*, i8** %q, align 8
  %2 = bitcast i8* %1 to %struct.aggr*
  store %struct.aggr* %2, %struct.aggr** %p, align 8
  %3 = load %struct.aggr*, %struct.aggr** %p, align 8
  %f2 = getelementptr inbounds %struct.aggr, %struct.aggr* %3, i32 0, i32 1
  %4 = bitcast i32** %f2 to i8*
  store i8* %4, i8** %q, align 8

The initial constraint Graph for the above example is shown in Figure 8. The BitCast Instructions are specified as CopyEdges. The rest of the IR instructions are handled as expected.

Figure 8

After solving the AddrEdges, and the CopyEdge 9 → 17, we get Figure 9.

Figure 9

Solving the CopyEdges 9 → 17 and 19 → 20, the StoreEdges 17 → 13 and 20 → 12, and the LoadEdge 13 → 19 gives this constraint graph.

Finally, solving the LoadEdge 11 → 22 and the StoreEdge 24 → 13, we get the constraint graph shown in Figure 10.

Figure 10

As you can see, this has a cycle 14 → 19 → 20 → 12 → 22 →(gep edge)→ 23 → 24 → 14. This is what a positive weight cycle (PWC) looks like in SVF’s Constraint Graph.

Because the cycle contains a GepEdge for the struct object 10, that object becomes field-insensitive. The cycle is then collapsed, giving the final constraint graph shown in Figure 11. Note that the GepEdge is gone.

Figure 11

Cycles in Field Sensitive Pointer Analysis

In the original Field-Sensitive paper by Pearce, this problem was described as the Positive Weight Cycle problem. Let’s first look at this problem as formulated in the original paper.

Consider the following C code, which has a void* pointer.

typedef struct { int *f1; int *f2; } aggr;

int main(void) {
    aggr a, *p; void *q;

    q = &a;        // q ⊇ {a}
    p = q;         // p ⊇ q
    q = &(p->f2);  // q ⊇ p + 1

    // do stuff with q
    return 0;
}

This leads to a weighted cycle: a copy edge q → p (from p ⊇ q), and a gep edge with weight 1 from p to q (from q ⊇ p + 1). Unlike the nodes in an unweighted cycle, the nodes in a weighted cycle don’t share the same solution, and the cycle can lead to infinite derivations.

For example, 

q = {a}

p = {a}

q = {a, r}, where r is the field at index 1 of the base object a.

Then, because it’s a cycle, the derivation continues,

p = {a, r}
q = {a, r, s}, where s is the field at index 1 of the base object r.

So on and so forth. Pearce et al. basically put a limit on these derivations, saying that the maximum number of fields supported for an object is N.
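
Here is a minimal sketch of my own illustrating that limit (not Pearce’s actual formulation): a gep derivation only produces a new field object if the resulting field index is still within the N fields declared for the base object’s type, so the chain a, r, s, ... cannot grow without bound.

/* Apply a gep edge with weight k to the field at index field_idx of an object
   whose type declares num_fields fields.  Returns the new field index, or -1
   if the derivation is cut off by the limit. */
int derive_field(int field_idx, int k, int num_fields) {
    int idx = field_idx + k;
    if (idx >= num_fields)
        return -1;  /* beyond the last supported field: stop deriving */
    return idx;
}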