Goal: massively reduce the overhead from transaction relay. Non-goal: get the absolute lowest transaction relay latency.

Each node elects one or two preferred relayouts: one if the node has fewer inbound connections than outbound, two otherwise. A preferred relayout is preferentially a long-uptime outbound peer that claims to be a full node, claims to relay transactions, and passes various heuristic health checks (e.g. it notifies us of current blocks). When a node receives a transaction, via whatever means, it immediately announces it to its preferred relayouts via the current INV process. This is the primary mechanism of transaction relay in the network. The low fanout conserves bandwidth, but broken nodes and cycles in the relayout graph will cause transactions to get stuck in cliques and not reach the whole network.

Every transaction in the mempool gets an 8-byte short-id, defined using the hash of the last block as a salt:

  short_id = siphash(blockhash || wtxid)

The keyed short-id allows very short identifiers while making collisions harder (though not impossible) to compute. All transactions in the top 12MB of the mempool get their short-ids added to one of 16 BCH sets; the set is selected by the first 4 bits of the short-id. We will assume we are using BCH sets of size t=1000. BCH codes are completely linear: adding a new entry X involves computing X^2, X^4, X^6, ..., X^(2t), and adding each term to its corresponding one of the t accumulators (in GF(2^m); GF(2^64) in our case).

At random times, a node will pick a peer and request a sync operation. It sends its current tip block hash and how many changes (adds/deletes) were made to the top 12MB of its mempool since the last sync with that peer, excluding deletes resulting from accepting a new block. The peer checks that the tips match, and rejects the sync otherwise. If they match, it computes the number of changes to the top 12MB of its own mempool, and then the target transaction count C = abs(them - us) + Q * min(them, us), where Q is a coefficient between 0 and 1 to be discussed in a moment.

If C is less than the number of terms in our BCH code (1000), we sum all 16 sets and send the first C terms to the peer, along with our estimate of C and which sets were used. If C is over the limit, we recursively split the set until C/parts + ceil(log2(parts)) < 1000, and then send parts set messages with C/parts + ceil(log2(parts)) terms each.

The far end then attempts to construct the set difference(s). If it succeeds, it INVs any transactions it has that you don't, and getdatas any that you have that it doesn't (this needs a short-id getdata/inv). If it fails, and you sent fewer than 1000 terms, it asks you for more terms (double the last amount). When it succeeds, it computes what Q should have been and asks you to update Q. Newly received transactions are relayed via the primary mechanism.

Questions:

The size 1000 in the BCH code was picked at random. I don't know how much data would normally be required. Making it too big just wastes CPU cycles computing higher powers of the ID. It's possible to incrementally compute higher terms just by remembering the highest power computed so far for each txn; this way nodes could adapt their BCH order on the fly (e.g. if they get peers with high Cs they could go and compute the higher-order terms).

The Q scheme above is, uh, guesswork. I have no idea how to predict the inconsistency in the network.
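For concreteness, below is a minimal, self-contained sketch (not the actual design and not PinSketch's API) of what the per-set accumulator update could look like. It uses the odd-power syndromes X, X^3, ..., X^(2t-1) that PinSketch stores (in characteristic 2 the even-power syndromes are redundant, since s_2k = s_k^2). The GF(2^64) reduction polynomial, the struct/function names, and the choice of the top nibble for bucketing are illustrative assumptions only:

// Illustrative sketch only: incremental update of one BCH/syndrome set over
// GF(2^64), assuming PinSketch-style odd syndromes X, X^3, ..., X^(2t-1).
#include <cstdint>
#include <vector>

// Carry-less ("Russian peasant") multiplication in GF(2^64). The reduction
// polynomial x^64 + x^4 + x^3 + x + 1 (low bits 0x1B) is a common low-weight
// irreducible choice; a real protocol would have to fix this explicitly.
uint64_t gf64_mul(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        b >>= 1;
        uint64_t carry = a & 0x8000000000000000ULL;
        a <<= 1;
        if (carry) a ^= 0x1B;
    }
    return r;
}

struct SyndromeSet {
    std::vector<uint64_t> acc;                 // t accumulators
    explicit SyndromeSet(size_t t) : acc(t, 0) {}

    // Adding element x xors x^1, x^3, x^5, ..., x^(2t-1) into the accumulators.
    // Removal is the identical operation (xor is its own inverse), which keeps
    // the sets cheap to maintain across mempool adds/deletes.
    void Add(uint64_t x) {
        uint64_t xsq = gf64_mul(x, x);
        uint64_t pow = x;
        for (uint64_t& a : acc) {
            a ^= pow;
            pow = gf64_mul(pow, xsq);          // step from x^(2i-1) to x^(2i+1)
        }
    }
};

// Bucketing as described above: short_id = siphash(blockhash||wtxid), and the
// first 4 bits (taken here as the top nibble) pick which of the 16 sets it
// goes into.
void AddShortId(std::vector<SyndromeSet>& sets16, uint64_t short_id) {
    sets16[short_id >> 60].Add(short_id);
}

The linearity is the attractive part: tracking a mempool add or delete costs O(t) field multiplications per transaction, and summing sets for a sync is just xor of the accumulators.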
The addition of ceil(log2(parts)) is a rough guess at the efficiency loss from splitting the sets. When I am not feeling lazy I can compute the actual expected loss analytically.

A suitable BCH implementation in C++ using NTL is here: https://github.com/hbhdytf/PinSketch (it's GPL, since NTL is GPL). Presumably we can get sipa drunk and get him to implement the required stuff for a non-GPL version; whatever magic NTL is doing for polynomial factoring is beyond me. PinSketch seems reasonably fast: constructing a BCH set for 40,000 elements takes 20 seconds on my slow laptop, and decoding a set difference of size 1000 takes 5.3 seconds. The way PinSketch works, construction is no slower done incrementally. I believe there is a faster way to perform a batch update.

This all assumes the ID is based on the latest block, so nodes with differing latest blocks can't sync. This could be improved, at a bit of memory cost, by keeping BCH data for some older blocks and using the latest block in common.

As in my efficient block relay writeup, recoverable IDs could be used.

Some more complex stuff involving cycle detection could probably be done to help break cycles in the relayout graph. E.g. support carrying a small set of "rider pubkey hashes" on INVs, and with some low probability replace one of the riders with one of your own. When you see one of yours come back to you, prove knowledge of its discrete log to tell the peer not to use you as a relayout. This would help keep Q down, and make the reconcile data quite small.
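Going back to the reconciliation sizing above, here is a minimal sketch of the arithmetic: C = abs(them - us) + Q * min(them, us), split until C/parts + ceil(log2(parts)) < 1000. The doubling split, the rounding, and the names are assumptions for illustration, not part of the proposal:

// Illustrative only: pick the estimated difference C and how to split it into
// per-message BCH term counts, following the rough rule described above.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct SyncPlan {
    uint32_t parts;           // number of set messages to send
    uint32_t terms_per_part;  // BCH terms carried by each message
};

SyncPlan PlanSync(uint32_t our_changes, uint32_t their_changes, double q,
                  uint32_t t = 1000) {
    int64_t diff = (int64_t)their_changes - (int64_t)our_changes;
    uint32_t c = (uint32_t)(diff < 0 ? -diff : diff) +
                 (uint32_t)std::ceil(q * std::min(our_changes, their_changes));
    // Split (here simply by doubling) until each part, padded by the guessed
    // ceil(log2(parts)) overhead, fits inside the t-term code.
    uint32_t parts = 1;
    auto terms = [&]() {
        return (c + parts - 1) / parts + (uint32_t)std::ceil(std::log2(parts));
    };
    while (terms() >= t) parts *= 2;
    return SyncPlan{parts, terms()};
}

Whether doubling is the right way to split, and how Q should be adapted per peer, are among the open questions noted above.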