Goal: get a block to a peer with minimum bandwidth, and without much (or any) more per-peer state. This would reduce bandwidth spikes around new blocks and make Bitcoin more usable on limited connections, as well as limit the worst-case fairness impacts of larger blocks.

Non-goal: minimization of latency; though reducing bandwidth will often reduce latency a lot.

Let's define a short txid: a short transaction ID is siphash(blockhash || wtxid) truncated to 64 bits. The purpose of the short ID is to be short but strongly collision resistant in an adversarial setting. Siphash() could be siphash or any other very fast (for small inputs) cryptographic hash.

Let's define a block sketch: a block sketch is the coinbase txn of a block, the count of txn, then N_txn 8-byte short-ids. Given a block, anyone would compute the same block sketch.

A block shows up, and every node that receives a whole block sends its peers a block header and a hash of the block sketch for the block. A client collects these advertisements, picks a peer, and asks it for the block sketch. The peer responds with the sketch (or times out); if the sketch is inconsistent with the advertisement, ban the peer.

The client uses the blockhash to generate short-IDs from its own mempool (potentially only the top N transactions' worth, if its mempool is big -- that's up to the client; it can start doing this as soon as the header shows up, hiding the work in the sketch-request round trip).

The client determines which transactions from the sketch it already has in its mempool, and which tx indexes (offsets in the block) it is missing. The client requests the missing indexes from peers that have offered the same block sketch, using differential coding and variable-length integers. E.g. say the client wants indexes 2,4,5,9,13: it would send a string of varints 2,2,1,4,4 (maybe bitcoin varints or a more efficient scheme like sipa's uint).

Server(s) return transactions; short-IDs are checked, and peers are banned if they don't match.
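The short-ID computation and the differential index coding above can be sketched as follows. This is illustrative only: blake2b with an 8-byte digest stands in for the keyed siphash (the text says any very fast cryptographic hash works), and the function names are mine.

```python
import hashlib

def short_id(blockhash: bytes, wtxid: bytes) -> bytes:
    # Stand-in for siphash(blockhash || wtxid) truncated to 64 bits;
    # blake2b with an 8-byte digest keeps this dependency-free.
    return hashlib.blake2b(blockhash + wtxid, digest_size=8).digest()

def encode_missing(indexes):
    """Differentially code sorted missing block indexes as deltas."""
    deltas, prev = [], 0
    for i in indexes:
        deltas.append(i - prev)
        prev = i
    return deltas

def decode_missing(deltas):
    """Recover absolute indexes from the delta string."""
    indexes, pos = [], 0
    for d in deltas:
        pos += d
        indexes.append(pos)
    return indexes
```

Each delta would then be serialized as a varint on the wire; for the small deltas typical here, that is usually a single byte per missing transaction.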
(This is why the sketch is hashed in the advertisement.) When all transactions are received, the block is reconstructed. If that fails, fall back to an ordinary block request.

Conclusion: block transfer with around 9-10 bytes of overhead per transaction using the mempool, in 2.5 round-trip times; supporting parallel fetch of the bulk of the block from multiple peers (reducing load spikes).

Potential rocket-science improvements:

(1) Use of "recoverable" short IDs.

Imagine AES-CBC-ing a transaction twice, once front to back, once back to front, with a static well-known key. Now all the bytes of the ciphertext depend strongly on the whole of the transaction. You can use the first N bytes of this data as an N-byte transaction ID: send someone the ID, then when they request the txn, send the rest. This avoids the N bytes of 'overhead' that arise when using a hash as an ID.

Use of very short IDs (under 20 bytes or so) requires a Fiat-Shamir trick to provide collision resistance; that's what the above short-ID scheme does. One could change the AES key for this, but recomputing IDs for the whole mempool when a block shows up would be computationally prohibitive. Instead we can compute a non-salted 256-bit recoverable ID for each txn, and at block time compute an 8-byte salted recoverable ID of that ID, which is fast. The responses to gather requests then send the len-8 remaining bytes of transaction data.

Now we have maybe only 2-3 bytes per transaction of overhead in the case where the transaction wasn't already in the mempool... but still 9-10 bytes for the common case of known ones.

Cost: code, cpu time, and additional memory to store txn in recoverable form.

(2) Use of set reconciliation in the block sketch (UPDATED)

Set reconciliation schemes allow you to transmit the difference between your set A and an unknown set B with data proportional to the size of the symmetric difference. There are several choices of scheme, such as IBLT or polynomial-code based.
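The two-pass "recoverable ID" construction can be illustrated with a toy byte-level cipher standing in for AES-CBC. This is emphatically not secure crypto, only a demonstration of the dependency structure: a forward chained pass makes each output byte depend on all earlier input, then a backward pass makes the first bytes depend on the entire transaction, while the whole transform stays invertible (so the full transaction is recoverable from the transformed bytes).

```python
def _sbox(x):
    return (x * 167 + 89) % 256            # toy invertible byte map

def _inv(x):
    return ((x - 89) * 23) % 256           # 167 * 23 = 3841 == 1 (mod 256)

def enc(data: bytes, iv1=0x5C, iv2=0x36) -> bytes:
    # Forward pass: each byte depends on all earlier plaintext bytes.
    c, prev = [], iv1
    for b in data:
        prev = _sbox(b ^ prev)
        c.append(prev)
    # Backward pass: now each byte also depends on all later bytes, so
    # the first N bytes of the result can serve as an N-byte ID.
    d, prev = [0] * len(c), iv2
    for i in range(len(c) - 1, -1, -1):
        prev = _sbox(c[i] ^ prev)
        d[i] = prev
    return bytes(d)

def dec(data: bytes, iv1=0x5C, iv2=0x36) -> bytes:
    # Undo the backward pass, then the forward pass.
    d = list(data)
    c = [_inv(d[i]) ^ (d[i + 1] if i + 1 < len(d) else iv2)
         for i in range(len(d))]
    p, prev = [], iv1
    for ci in c:
        p.append(_inv(ci) ^ prev)
        prev = ci
    return bytes(p)
```

With this shape, flipping even the last byte of the input changes every byte of the output, including the ID prefix; a real implementation would use an actual block cipher with a well-known key as the text describes.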
IBLT is inefficient in that it requires some small constant multiple more data than the size of the difference. IBLT also doesn't work well incrementally: if an IBLT fails to decode, you cannot just send a bit more data to make it succeed. But IBLT decodes faster. Since we don't care about the lowest possible latency, and because polynomial schemes are fast for the kind of small difference counts expected here, I propose we use polynomial and not IBLT. There is a lovely BCH-code-based implementation at https://github.com/hbhdytf/PinSketch so I'll assume it for now.

To use this, let's define a compressed sketch. When a new block comes in, after checking POW but before accepting the block, you do the same tx selection that CreateNewBlock would do, and record the list of tx that were in the block but weren't in the CNB result; call these the surprises.

If the surprises are more than X% of the block, or the block has under 100 transactions remaining after the surprises, just send a non-compressed sketch.

When sending a compressed sketch, send the header/coinbase/txn as in a normal sketch. Then the block size, and the sum of the tx sizes of the surprises. Then send a varint count of the surprises, and the surprises in the same order they were in the block, just like in the non-compressed sketch. Then send 100 terms of a BCH set reconciliation code.

The receiver will take the list of surprises and the size of the block, and will run CreateNewBlock to generate a list of transactions for a block of the actual size, but with the added constraint that the target size is reduced by the sum of the surprise sizes, and the sizes of known surprises are skipped when enforcing that limit. [This is a best guess at everything not in the surprise set.] The receiver then extracts a candidate set from the mempool, adding in all the signaled surprises, and attempts to recover the unknown values. Then take the recovered set, sort it, and remove all the signaled surprises.
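The surprise selection and the compressed-vs-plain decision above can be sketched like this. The txid sets here stand in for the real block and CreateNewBlock outputs, and the default threshold value for 'X%' is an assumption of mine, not something the text fixes.

```python
def plan_sketch(block_txids, cnb_txids, max_surprise_pct=5.0):
    """Pick a sketch type and compute the surprise list.

    block_txids: txids in block order (coinbase excluded).
    cnb_txids:   txids the local CreateNewBlock pass selected.
    max_surprise_pct: the 'X%' threshold from the text (value assumed).
    """
    cnb = set(cnb_txids)
    # Surprises: block txns the CNB prediction missed, kept in block order
    # (they are transmitted explicitly, in that order).
    surprises = [t for t in block_txids if t not in cnb]
    remaining = len(block_txids) - len(surprises)
    too_many = len(surprises) > len(block_txids) * max_surprise_pct / 100.0
    too_small = remaining < 100
    return ("plain" if too_many or too_small else "compressed", surprises)
```

The point of pulling the surprises out is that everything left over is well predicted by CNB, so the 100-term BCH reconciliation code only has to cover a small residual difference.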
Code the permutation of the block by efficiently coding an integer in the range [0, remaining_txn+1), where the last value is special-cased to mean "take the next entry from the surprises" (as they were sent in block order).

If it's successful, the hash of the full sketch will match and things can continue as normal. If it fails, the client could request additional sketch data, which would accumulate with the last... though I'm not sure it wouldn't just be better to fall back to an uncompressed sketch there.

The goal with the 'surprises' is to take the transactions which are very likely to be set differences out of the reconciliation process, in order to keep it small and fast. For rusty's collected mempool data, this looks like it would make the sketches <2048 bytes for basically all blocks, compared with about 8kb for uncompressed. Recoverable IDs can be combined with this for overall overheads of only a couple of bytes per transaction.

Cost: code, cpu time; performance depends more strongly on block/mempool agreement.

(3) Round-trip elimination.

Since the sketch is small (especially with compression in use), you could ask the peer that was first to offer and successfully deliver a sketch last time to just send you the sketch next time, instead of just the offer. This would save one round trip. Not part of the main proposal, because it optimizes latency; but it would likely 'pay' for the compressed-sketch decoding cpu time.

(4) ~Elimination of time-outs via FEC coding.

Taking the protocol above (with or without the (1)/(2) enhancements): servers can use forward error correction (e.g. an RS code) to take their sketch and code it into (say) 32 parts, such that any 16 of them are enough to recover the sketch. The sketch hash is merkleized over all 32 parts. Clients can then round-robin request one part at a time from offering peers. If the client finds itself waiting on a single peer, it requests one more part from a different peer; then whichever responds first lets the reconstruction complete.
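As a toy illustration of the FEC idea, and not the RS code the text actually suggests, here is the trivial (3,2) case: split the sketch into two halves plus an XOR parity part, so any two of the three parts reconstruct the whole. A real deployment would use Reed-Solomon with, say, 32 parts where any 16 suffice, carry the padded length explicitly, and merkleize the sketch hash over the parts.

```python
def split_3_2(sketch: bytes):
    """Split data into 3 parts such that any 2 reconstruct it."""
    if len(sketch) % 2:
        sketch += b"\x00"          # pad to an even length (length must
                                   # be signaled separately in practice)
    half = len(sketch) // 2
    a, b = sketch[:half], sketch[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def join_3_2(parts):
    """parts: list of (index, data) pairs, any two of indexes 0, 1, 2."""
    got = dict(parts)
    if 0 in got and 1 in got:
        a, b = got[0], got[1]
    elif 0 in got:
        a = got[0]
        b = bytes(x ^ y for x, y in zip(a, got[2]))   # b = a ^ parity
    else:
        b = got[1]
        a = bytes(x ^ y for x, y in zip(b, got[2]))   # a = b ^ parity
    return a + b
```

The client behavior then falls out naturally: fetch part 0 from one peer and part 1 from another, and if one of them stalls, ask a third peer for the parity part; whichever two arrive first complete the sketch.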
Something similar could be done for the gather requests. E.g. make a request for the same missing txn to all peers, and ask each to give you a 1/Nth share of it (along with a hash of the shares).

Cost: code (though sipa has a nice RS code that would be suitable for this); additional latency in the common case (but much less latency in the adversarial case); cpu time in the block-offering server to code up the sketch, though the sketches are small. FEC encoding for gather requests could be a CPU DoS for servers. This also has some mild incompatibilities with the above compressed-sketch scheme, since the surprise set would differ from peer to peer.
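One plausible (assumed, not specified) shape for the 1/Nth shares is plain chunking with a per-share hash list, so a share fetched from any peer can be verified on arrival without waiting for the others:

```python
import hashlib

def make_shares(tx: bytes, n: int):
    """Cut a transaction into n contiguous shares plus a hash list so
    a share from any peer can be checked independently."""
    size = -(-len(tx) // n)                  # ceiling division
    shares = [tx[i * size:(i + 1) * size] for i in range(n)]
    hashes = [hashlib.sha256(s).digest() for s in shares]
    return shares, hashes

def check_share(share: bytes, idx: int, hashes) -> bool:
    """Verify one received share against the advertised hash list."""
    return hashlib.sha256(share).digest() == hashes[idx]
```

Unlike the FEC variant, plain chunking has no redundancy, so a stalled peer still stalls its share; it does, however, avoid the server-side encoding cost noted above.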