This article demonstrates particular algorithm implementations — but it is important to remember that the source code will vary between different implementations of the standard library, and can differ based on what sort of data you pass in. These examples are intended to be representative. For purposes of this article we are only considering bidirectional iterators.

### Review of current techniques

#### swap vs move

C++11 introduced move semantics, which allow us to skip expensive copying operations when they are not necessary. Moving the value out of an object leaves that object in a “moved-from” state — an empty husk of its former self. We will represent the moved-from state as “?”.

std::swap — and even specialized swapping algorithms — usually boil down to performing three operations. This is why it’s important not to implement move in terms of swap. It may be 3 times more expensive to do it that way. If you have nested objects inside of other objects, all moved using swap, then the overhead of unnecessary swaps may grow geometrically. 3 x 3 x 3 x 3 …

We’ll be analyzing how often our algorithms move objects, as this can have a larger performance impact than may immediately be obvious. Let’s define move complexity to refer to the number of times data is moved during the course of an algorithm. We can then say that swap has a move complexity of 3, while simply moving an object on top of another has a move complexity of 1.

It’s important to remember that compilers are always getting smarter. Swaps can often be optimized away entirely. However, when dealing with complex or large data sets, your chances of having the optimizer fix up unnecessary swaps shrink. It’s a good idea to choose the weakest algorithm that meets your needs.

#### remove_if

Transforms the range [first,last) into a range with all the elements for which pred returns true removed, and returns an iterator to the new end of that range.

```
template<class FwdIt, class Predicate>
FwdIt remove_if(FwdIt first, FwdIt last, Predicate p)
{
    first = std::find_if(first, last, p);
    if (first != last)
        for (auto i = first; ++i != last; )
            if (!p(*i))
                *first++ = std::move(*i);
    return first;
}
```

Demonstrating with remove_if on odd integers:

```
0 1 2 3 4 5 6 7 8 9
0 2 ? 3 4 5 6 7 8 9
0 2 4 3 ? 5 6 7 8 9
0 2 4 6 ? 5 ? 7 8 9
0 2 4 6 8 5 ? 7 ? 9
```

Complexity analysis is very easy for this algorithm. The predicate is applied once for each element, and it performs at most N move operations. This is consistent with the requirements in n4527.

#### partition

Rearranges the elements from the range [first,last), in such a way that all the elements for which pred returns true precede all those for which it returns false. The iterator returned points to the first element of the second group.

```
template <class BidirIt, class Predicate>
BidirIt partition(BidirIt first, BidirIt last, Predicate p)
{
    for (; ; ++first)
    {
        for (; first != last && p(*first); ++first);
        if (first == last) break;
        for (; first != --last && !p(*last); );
        if (first == last) break;
        std::swap(*first, *last);
    }
    return first;
}
```

Demonstrating with partitioning on even:

```
0 1 2 3 4 5 6 7 8 9
0 8 2 3 4 5 6 7 1 9
0 8 2 6 4 5 3 7 1 9
0 8 2 6 4 5 3 7 1 9
```

Like remove_if, this implementation applies the predicate once for each element. n4527 states that partition uses at most N/2 swaps for bidirectional iterators, i.e. 3N/2 move complexity. We will take them at their word rather than padding this article with proofs.

#### Comparisons

These algorithms differ in two orthogonal ways:

- stable (remove_if) vs. unstable (partition)
- destructive (remove_if) vs. nondestructive (partition)

But what would happen if we put on our mad scientist gloves and screwed one algorithm’s head on the other one’s body? What strange monster would awake from slumber?

### unstable_remove_if

We’ll start by taking partition and switching it to use move instead of swap. Out of convenience I will also negate the predicate.

```
template <class BidirIt, class Predicate>
auto unstable_remove_if(BidirIt first, BidirIt last, Predicate p)
{
    for (; ; ++first)
    {
        for (; first != last && !p(*first); ++first);
        if (first == last) break;
        for (; first != --last && p(*last); );
        if (first == last) break;
        *first = std::move(*last);
    }
    return first;
}
```

We demonstrate with an unstable_remove_if on odd integers:

```
0 1 2 3 4 5 6 7 8 9
0 8 2 3 4 5 6 7 ? 9
0 8 2 6 4 5 ? 7 ? 9
0 8 2 6 4 5 ? 7 ? 9
```

This algorithm preserves neither the element order nor the values of unwanted elements. We are left with an algorithm which removes elements from a range, but does not preserve the relative order of remaining elements. remove_if is explicitly defined to be “stable” or order-preserving, so the natural name for this algorithm is unstable_remove_if.

Since partition had at most N/2 swaps, this algorithm has at most N/2 moves. This is cheaper than either of the algorithms we started with— which makes sense, as it inherited the most efficient characteristics of both.

### semistable_partition

Now we’ll try changing remove_if to use swap instead of move. We’ll negate the predicate again for convenience.

```
template<class FwdIt, class Pred>
auto semistable_partition(FwdIt first, FwdIt last, Pred p)
{
    first = std::find_if(first, last, std::not1(p));
    if (first != last)
        for (auto i = first; ++i != last; )
            if (p(*i))
                std::swap(*first++, *i);
    return first;
}
```

We demonstrate with a semistable_partition on even integers:

```
0 1 2 3 4 5 6 7 8 9
0 2 1 3 4 5 6 7 8 9
0 2 4 3 1 5 6 7 8 9
0 2 4 6 1 5 3 7 8 9
0 2 4 6 8 5 3 7 1 9
```

This algorithm has achieved a half-stable partition. The even integers maintained order, but the odd integers did not. All data was preserved.

This algorithm will swap at most N times as it progresses from the front of the range to the back. Since swap is 3 move operations, we say this algorithm has 3N move complexity.

### Summary and Analysis

These algorithms are not entirely novel. D supports both an unstable remove algorithm and a semistable partition algorithm. Alexander Stepanov covered the partition_semistable algorithm in Elements of Programming.

Well, not every mad scientist can be the first.

| algorithm | move complexity | example output |
| --- | --- | --- |
| unstable_remove_if | N/2 | 0 8 2 6 4 5 ? 7 ? 9 |
| remove_if | N | 0 2 4 6 8 5 ? 7 ? 9 |
| partition | 3N/2 | 0 8 2 6 4 5 3 7 1 9 |
| semistable_partition | 3N | 0 2 4 6 8 5 3 7 1 9 |
| stable_partition | 3N log(N) \* | 0 2 4 6 8 1 3 5 7 9 |

\* or 3N/2 with memory allocation

The following performance tests were run under Visual Studio 2015. The source code and raw data [0,1] [2,3] are available on github. We use the implementations provided in this article. We use MSVC’s implementation of stable_partition, which will use the memory allocation approach rather than the 3N log(N) approach.

### Test 0

This test constructs a vector<array<int32_t, 1>>. The arrays are constructed with increasing values in their [0] element. We partition/remove data as in our above demonstrations, segregating odd values. The intent of this test is to measure the performance characteristics for types which are cheap to compare and cheap to move. The X axis is the size of our vector.

### Test 1

This test is the same as Test 0, but uses std::arrays of length 32 instead. This test is intended to demonstrate how the behavior changes as types become expensive to move but remain cheap to compare.

### Test 2

Our above tests operate on very uniformly distributed data, so this test instead uses a fixed vector length of 2000 but varies the number and distribution of odd elements. On the far left we have all even elements, and on the far right we have all odd elements. This introduces significant amounts of branch misprediction. It uses types which are cheap to move and compare, as in Test 0.

### Test 3

This test is like test 2, but uses types which are very expensive to move (array<int32_t, 32>).

We see that semistable_partition performs very poorly in this test, and stable_partition runs comparably with partition. I believe these results can be partly attributed to the fixed length of N for our vector. Allocating and freeing a fixed size block of memory over and over again is likely helping to eliminate most of the memory allocation costs that stable_partition paid for in Tests 0 and 1. The worst case scenario for semistable_partition is when we have a high proportion of ‘bad’ elements which are expensive to move.

| algorithm | Test 0 Mean | Test 1 Mean |
| --- | --- | --- |
| remove_if | 6049.05 | 105596.78 |
| unstable_remove_if | 6849.56 | 67102.81 |
| speedup | 0.88x | 1.57x |
| stable_partition | 25685.09 | 397108.76 |
| semistable_partition | 20401.41 | 167239.83 |
| speedup | 1.26x | 2.37x |

#### Takeaway

We’ve discovered, or rather re-invented, two powerful algorithms and oriented them alongside a pantheon of standard library algorithms.

Our Tests 0 and 1 seem to contradict Tests 2 and 3. Tests 0 and 1 show semistable_partition as far faster than stable_partition; Tests 2 and 3 tell a very different story. I believe this demonstrates that stable_partition can have wildly different performance characteristics based on your allocation pattern, and that semistable_partition has some pathological cases.

unstable_remove_if is shown to be equivalent or faster than remove_if in many scenarios. When moves are cheap and the data size is large (>2k elements), remove_if may outperform unstable_remove_if by a few percent. This may be caused by unstable_remove_if executing more instructions per iteration. When move operations are cheap, the extra care taken to avoid moves may be detrimental.

semistable_partition is shown to be competitive with or faster than stable_partition in most cases. That it does not use extra memory makes it particularly interesting in scenarios where such operations are very expensive or outright forbidden. Some might have expected stable_partition to always be slower than semistable_partition due to the memory allocation, but we showed that isn’t necessarily the case. Moving large objects around counts as an ‘expensive memory operation’ just as much as allocation does. Unfortunately, we are unable to compare the 3N log(N) implementation of stable_partition without artificially inducing a low-memory scenario, so we don’t know for certain how that implementation of the algorithm would compare.

And remember that your mileage may vary based on your use case and implementation.

According to Scott’s article, should swap’s complexity be 3/2 of move’s?


Scott’s article is talking about different ways of implementing move. He optimizes his move implementation by using two pointer assignments instead of three. Scott does not cover what happens when you swap two of his Container classes. In the first version of move, swapping two Containers would cost 3*3 assignments and in the second version it would cost 2*3 assignments. But that factor of three remains, because swap requires three moves.
