###### 2014-11-29

# Proof of work and sorting

**WARNING:** Unfinished math ponderings ahead. There are probably bugs
in this analysis, I had to go work on other things before I managed to
fully convince myself that it was correct. Posting it anyway, please
take it with an appropriate dosage of salt.

I'm still pondering Bram's original idea.

I'm annoyed with Bram himself, as I feel his responses to my comments were both dismissive and disrespectful. If you ask for feedback and people take time to write thoughtful replies, they deserve more than terse oneliners on Twitter in return. Admittedly, most of my feedback had to do with the fact that his original article was just poorly written and confusing, but if people don't understand what you're blathering about then you can't really expect much in response. That's probably why it still has zero comments. :-P

But the idea captured my interest, and has proven an excellent excuse for dusting off my rusty math skills and reading up on interesting things.

So here's some math!

Bram has proposed a proof-of-work scheme, whereby he claims the most efficient way to find a solution is to generate a bunch of hashes and then sort the results.

Involving a sort in the process is desirable, because sorting a list of values requires keeping the entire list in RAM (or some other form of storage), and there are also limits on how efficiently a sort can be parallelised. These are both important properties for a proof of work, because they reduce the advantage gained by an attacker with customized hardware; being forced to add lots of RAM pushes costs up to the point that general purpose computers should remain competitive.

Standard Bitcoin mining is an example of a proof-of-work scheme that lacks both of these properties, which is why nobody bothers mining on normal computers anymore. Rich guys with expensive custom hardware have become impossible to compete with.

So that's the crux of Bram's idea, and it is indeed quite promising.

The bulk of his article goes into some implementation details, but since he doesn't do any actual math, he ends up with a solution that doesn't obviously force the attacker to sort the hashes. It can still be brute forced, and he doesn't adequately explain why his approach wins.

## Restating Bram's Crazy Idea

The initial Crazy Idea can be rephrased like so:

```
Find k input values, A1, A2, ... Ak, which fulfill the following:
hash(A1) ^ hash(A2) ^ ... ^ hash(Ak) = 0.
```

The `^` symbol represents an XOR operation; the hash function itself was left undefined. For the purposes of this analysis, let R be the size of the hash function's solution space, and N be the size of the input space.

Bram's proposal did not clearly define the hash function, but for this
analysis the hash function doesn't really matter, as long as we assume
it has the expected properties of a cryptographic hash function. What
does matter is the size of the result space that needs to be considered,
which is defined by the number of zeros demanded in the solution. So if
we want a solution to have 5 zeros, that is equivalent to `R = 2**5`. It
doesn't matter if the hash function outputs more bits, only the last
ones matter.
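To make this concrete, here is a minimal sketch of the puzzle itself. The hash is undefined in Bram's post, so this uses a toy stand-in (the last `bits` bits of SHA-256, giving `R = 2**bits`); the function names are mine:

```python
import hashlib

def h(x: int, bits: int) -> int:
    # Toy hash: the last `bits` bits of SHA-256 over the input.
    digest = hashlib.sha256(x.to_bytes(8, 'big')).digest()
    return int.from_bytes(digest, 'big') & ((1 << bits) - 1)

def is_solution(inputs, bits: int) -> bool:
    # The puzzle: hash(A1) ^ hash(A2) ^ ... ^ hash(Ak) must equal zero.
    acc = 0
    for a in inputs:
        acc ^= h(a, bits)
    return acc == 0

# Caveat: duplicate inputs cancel out under XOR, so any real scheme
# must also require the k inputs to be distinct.
```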

Bram spent a fair bit of time discussing how to choose k (he likes k=4),
and proposed requiring the input values be either 38 or 19 bits long,
so `N = 2**38` for `k = 2`, or `N = 2**19` for `k = 4`.

Bram claimed that the most efficient solution would have a complexity of
`O(N*logN)`. This breaks down as `O(N)` for calculating all the hashes for
the input space, which is then dominated by the time taken to sort the
results, `O(N*logN)`. He also made the claim that RAM usage would be `O(N)`.
This all seems fair enough - but how well does it work?

## Brute Forcing Sanity

Superficially, Bram's problem appears to have a simple solution that does not require sorting. The following algorithm will (usually) solve the problem using a brute-force, highly parallelisable approach:

- First, calculate a set of k-1 hashes H1, H2, ... Hk-1, such that Hi = hash(i) for all i < k. Then set Hk = XOR(H1, H2, ... Hk-1). This guarantees the property H1 ^ H2 ^ ... ^ Hk = 0 will hold.
- Iteratively calculate hash(x) for different values of x >= k.
- If hash(x) = Hk, then we have a result: 1, 2, ..., k-1, x.
- Fail if the input search space is exhausted with no result.
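The steps above can be sketched as follows, again using a truncated SHA-256 stand-in for the undefined hash (the parameter names are mine):

```python
import hashlib

def h(x: int, bits: int) -> int:
    # Toy stand-in for the undefined hash: the last `bits` bits of SHA-256.
    digest = hashlib.sha256(x.to_bytes(8, 'big')).digest()
    return int.from_bytes(digest, 'big') & ((1 << bits) - 1)

def brute_force(k: int, bits: int, max_x: int):
    # Step 1: fix inputs 1..k-1 and derive the target value Hk,
    # which guarantees H1 ^ H2 ^ ... ^ Hk = 0 for any match.
    target = 0
    for i in range(1, k):
        target ^= h(i, bits)
    # Steps 2-3: scan x >= k until hash(x) equals the target.
    # O(1) RAM, and trivially parallelisable across ranges of x.
    for x in range(k, max_x):
        if h(x, bits) == target:
            return list(range(1, k)) + [x]
    return None  # Search space exhausted; the target may be a unicorn.

# With a small 12 bit result space, a match is found almost immediately.
solution = brute_force(3, 12, 1 << 20)
```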

This approach simply reduces Bram's problem to the well understood
problem of searching for a collision with a known hash, which is `O(1)`
RAM and `O(R)` time, for any values of k, R or N.

The reason this is `O(R)` and not `O(N)` is that we are not guaranteed
to find a solution in the first pass - we may find that there is no x in
N for which `hash(x) = Hk`. In the `k = 2` case, this means H1 is a unicorn;
for other values of k there may be no solution at all. If this happens,
then we need to start over, using different values of i in step 1, until
we've covered most of R.

Note that this does *not* in any way invalidate Bram's claims, as the
values being compared, `O(R)` and `O(N)`, are different by many orders of
magnitude. All this shows is that the problem can be brute forced with a
trivially parallelisable algorithm, not that it's feasible to do so in
practice.

## Breeding Unicorns

To thwart such a brute force attack, we need to balance N and R in such a way that collisions are likely, but unicorns are plentiful enough to get in the way.

Remember that Bram's original algorithm implies exhaustively generating
N hashes, and solving the problem in `O(N*logN)` time.

If `R < N*logN`, then the brute force approach trivially wins (it uses
less RAM and is parallelisable). So R needs to be bigger than `N*logN`,
and that is indeed what Bram proposed. The greater R is, the larger the
potential advantage afforded by Bram's approach.

However, if that is the case, then we're not guaranteed to find a solution at all. We might, but if we're not generating all possible hashes, the problem might have no solution - there might not be a collision at all. It turns out this isn't a big deal in practice, but to understand why, we need some statistics.

How likely are we to have a collision if we choose N results out of R possible values? This classic puzzle is known as the birthday problem, and it has both well defined solutions and convenient approximations.

Using the birthday problem lets us calculate the odds of finding a 76
bit collision, if we make the solver generate and sort `2**38` hashes.
These are numbers from Bram's original post, and we just plug them into
the square approximation to find:

```
2**38 = sqrt(2 * 2**76 * X)  # Square both sides
2**76 = 2 * 2**76 * X        # Simplify the right hand side
2**76 = 2**77 * X            # Divide both sides by 2**77
  1/2 = X
```
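The same arithmetic as a quick sketch (the helper name is mine; the square approximation is only reasonable while the probability stays well below 1):

```python
def birthday_p(n: int, R: int) -> float:
    # Square approximation of the birthday problem, solved for the
    # collision probability X given n samples out of R possible values:
    # n = sqrt(2 * R * X)  =>  X = n**2 / (2 * R)
    return n * n / (2 * R)

# Bram's numbers: 2**38 hashes, 76 bit collision space.
print(birthday_p(2**38, 2**76))  # 0.5
```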

That would give us a 50% chance of finding a 76 bit collision. If we go
to `2**39` inputs, the square approximation suggests success is all but
guaranteed.

However, this isn't actually what Bram's algorithm does. Although it
does indeed generate `2**39` hashes, it does so in two sets of `2**38`, and
Bram is only looking for collisions between the two sets, not within
each one.

So what we've actually shown here is that there is a 50% chance that there's a collision within each of Bram's respective input sets. But that's not actually what we are looking for. We will however use this result later on.

## So what are the odds?

In order to maximize our fun, let's calculate the odds of success in a
general way, not just assuming `k = 2` or `k = 4`.

For starters, we know that for any two equally long bit strings `A1` and
`A2`, there exists an `A3` of equal length such that `A1 ^ A2 ^ A3 = 0`. By
induction it is possible to prove that this holds true for any set
`A1 ... Ak`, where `k > 2`. This is also trivially true for `k = 2`. If this
looks familiar, that's because we used this property in the brute
forcing algorithm above, and showed how to calculate `Ak`.

Intuitively, because multiple solutions are valid, one might expect to see a much higher rate of collisions for combinations of 3 or more variables. Is this intuition correct? How can we quantify the change?

Like before, what we are really interested in is finding the smallest input size N relative to hash solution space R, for which we have a high probability of finding a solution.

A common approach for solving problems of this nature is to turn the question on its head. Instead of asking "how likely is it that I'll find one of the zillions of correct solutions", you ask "how likely is it that I'll be so unlucky that I don't find a solution at all". It's one of the ironies of statistics that answering the pessimistic question is usually much more tractable. Then you bring back the optimism by using the basic identity P(lucky) = 1 - P(unlucky) to find the odds of a happy ending! Hooray!

Since we know that for any sequence `A1, A2, ..., Ak` it's really only
the last one that matters, we can focus solely on the last term when
calculating our odds. However, we also need to keep track of how many
different `A1, A2, ..., Ak-1` combinations we're working with, as every
such combination has a potential solution.

For `N = 1`, we find the odds of all possible `Ak`'s being useless to be
quite overwhelming: `(R-1)/R`, since there is only one equation to
consider and on average only 1 of the available R values is going to
work (this is based on the initial assumption that our hashing
function's output is uniform and well distributed).

So for `N > 1`, what are the odds? For `k = 3`, there are up to `N**2` possible
equations to solve, each of which has exactly one expected solution out
of R possible values. As we discovered earlier, the birthday problem
informs us that if the number of generated hashes is less than `sqrt(R)`,
we can reasonably expect very few duplicate values. This means our
equations are mostly unique, which lets us assume we really do have `N**2`
different potential solutions.

The odds that we won't find any solution at all on our first try are
thus `(R - N**2)/R`, and we have N tries. So the odds of abject failure
are `((R - N**2)/R)**N`. If we generalize to `k >= 3`, then by the same
argument we find the odds of failure are `((R - N**(k - 1))/R)**N`, and
this approximation is valid while `(k - 1)N < sqrt(R)`.

This puts the odds of success at:

```
P = 1 - ((R - N**(k - 1))/R)**N
```
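This formula is easy to evaluate directly; a small sketch (the function name is mine):

```python
def p_success(N: int, R: int, k: int) -> float:
    # P = 1 - ((R - N**(k-1)) / R)**N, per the formula above.
    return 1.0 - ((R - N ** (k - 1)) / R) ** N
```

For example, `p_success(2**38, 2**76, 2)` works out to roughly 0.63.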

This general solution has however diverged somewhat from Bram's
proposal. Instead of trying to solve `N**3` combinations with a set of N,
he compares `N**2` combinations with another `N**2`. This covers the same
solution space, but uses far less RAM because `2 * N**2 < N**3 + N`. This
implies an implementation strategy where we try to pairwise balance the
input sets as equally as possible. It turns out this doesn't change the
odds of success though.

The odds of finding a solution using Bram's algorithm work out to be:

```
P = 1 - ((R - N)/R)**N # for k = 2
P = 1 - ((R - (N**2))/R)**(N**2) # for k = 4
```

So if we plug `k = 2`, `N = 2**38` and `R = 2**76` into this formula,
we can finally calculate the odds of finding a collision using Bram's
algorithm:

```
P = 1 - ((R - N**(k-1)) / R)**N     # General formula, k = 2
  = 1 - ((R - N) / R)**N            # N**(k-1) = N**1 = N
  = 1 - (1 - N/R)**N                # (R - N)/R = 1 - N/R
  = 1 - (1 - 2**38/2**76)**(2**38)  # Plug in N and R
  = 1 - (1 - 2**-38)**(2**38)       # Ask a calculator
  = 0.6321
```
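To sanity-check the calculator step: since `N/R = 2**-38 = 1/N` here, the expression `(1 - 1/N)**N` tends to `1/e` as N grows, which is exactly where the 0.6321 comes from. A quick sketch:

```python
import math

N, R = 2**38, 2**76
P = 1 - (1 - N / R) ** N

print(round(P, 4))                  # 0.6321
print(round(1 - math.exp(-1), 4))   # 0.6321, the limiting value 1 - 1/e
```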

Similarly, if we plug `N = 2**19` and `R = 2**76` into the formula given
above for Bram's algorithm, we get the exact same result. Again, this is
expected, since both methods generate `2**39` hashes, albeit by different
means.

This is where I took a little break to write a program that
implemented Bram's algorithm. Combining four inputs, with `R = N**4`, it
found a solution about 60% of the time, as predicted.

I don't know if this is a high enough ratio to make this a satisfactory proof-of-work puzzle. With only 63% odds, we need five tries before our confidence of finding a result goes over 99%. Conversely, if we are willing to accept a 74 bit collision, we have a 98% chance of success on the first try.

What happens if we use the general approach? Do we get more or fewer
collisions for the same amount of RAM? The general approach will
generate `N + N**(k-1)` combined hashes, so let's calculate the odds:

```
k = 2 N = 2**38 RAM = O(2**39) R = 2**74 P = 0.982
k = 3 N = 2**19 RAM = O(2**39) R = 2**55 P = 0.982
k = 4 N = 2**13 RAM = O(2**39) R = 2**50 P = 0.982
k = 4 N = 2**19 RAM = O(2**39) R = 2**74 P = 0.982 # Bram
```
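These rows can be reproduced with the general formula from earlier (the "Bram" row instead uses his pairwise `N**2` variant, which happens to give the same value); a quick check:

```python
def p_success(N: int, R: int, k: int) -> float:
    # General formula from earlier: P = 1 - ((R - N**(k-1)) / R)**N
    return 1.0 - ((R - N ** (k - 1)) / R) ** N

for k, N, R in [(2, 2**38, 2**74), (3, 2**19, 2**55), (4, 2**13, 2**50)]:
    print(k, round(p_success(N, R, k), 3))  # each row works out to ~1 - e**-4
```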

As we can see, given a fixed RAM budget, Bram's balanced solution does
quite well, and it doesn't matter whether he uses `k = 2` or `k = 4`.

However, it's informative to contrast this with straight results from the birthday problem:

```
k = 1 N = 2**37 RAM = O(2**37) R = 2**74 P = 1
k = 1 N = 2**39 RAM = O(2**39) R = 2**78 P = 1
```

Here we find we need 4x less RAM to find a collision of the same size, or given the same RAM budget we can make R significantly larger, which hits the brute forcers where it hurts the most.

So it seems Bram might actually be better off abandoning his XOR strategy and going with straight hashes instead. For an excellent analysis of exactly that, check out Momentum.

Yay math!

**Addendum:**

Working through this (even though I don't consider it complete) was good fun.

More importantly, looking at the proof-of-work literature led me to read some actual academic papers on proofs of work, and I was particularly happy to discover "Guided Tours", which rely more heavily on physics (the speed of light) than math.

I think for fighting spam that may actually be a more promising approach, and I hope I'll be able to revisit that stuff later on as part of my work on Mailpile.