Brainstorming BitTorrent Peer Distribution Algorithms
Recently, I’ve been working on etracker, an experimental BitTorrent tracker designed to incentivize healthy client behavior. If you already know about the BitTorrent protocol or about my project, feel free to skip ahead to the actual brainstorming.
Project Background
BitTorrent is a P2P file transfer protocol, which can use centralized trackers to distribute peer lists to clients, who in turn connect to each other. A collection of clients announcing a particular content infohash is a swarm, which consists of some clients “seeding” content they already have, and other “leeching” content they do not yet have. A tracker must keep track of many clients, each in one or more swarms.
A standard challenge in the BitTorrent world is free-riding: Users who only leech content, but do not seed. The protocol does not build in an incentive to seed, so users may leech content and then leave the swarm. This behavior can put additional upload bandwidth requirements on users who do choose to seed, and can reduce the longevity of swarms as only a proper subset of the clients with the content actually offer to seed it to others.
One approach to the free-riding problem, developed by private communities, is to put out-of-protocol restrictions on clients to force them to seed. BitTorrent metadata is distributed in the form of .torrent files. A front-end can restrict access to .torrent files to only those users who have demonstrated that they have uploaded a certain amount of content or seeded torrents for a certain amount of time. This kind of approach requires significant overhead and coordination: a separate database to track user contributions over time, rules for when new .torrent files can be downloaded, authorization that ties clients to a persistent user identity, and moderation to prevent cheating.
In etracker, I pursue a different approach. The tracker’s primary job is to distribute peers to clients. The unofficial extension to the BitTorrent spec states:
As mentioned above, the list of peers is length 50 by default. If there are fewer peers in the torrent, then the list will be smaller. Otherwise, the tracker randomly selects peers to include in the response. The tracker may choose to implement a more intelligent mechanism for peer selection when responding to a request. For instance, reporting seeds to other seeders could be avoided.
One of the goals of etracker is to explore alternative peer distribution algorithms which are designed to incentivize long-term and fast seeding, without relying on restrictions on .torrent file downloads or requiring that users authorize with a central service to gain access to content.
(Another goal of etracker is to experiment with tracker-level cheater detection. Cheating becomes an issue with any system that tries to algorithmically solve the free-riding problem. I hope to discuss cheating in later posts.)
Brainstorming Peer-Distribution Algorithms
The informal idea behind etracker is to reward good peers with more good peers. Formalizing this informal idea requires figuring out what a good seeder is, figuring out what a good peer is, and figuring out how to scale the rewards. At this point I’m just going to be publicly brainstorming, building up in steps towards the algorithm I’m currently targeting.
First Algorithm: Number of announces = Number of peers
The simplest algorithm counts how many non-stale announces a client has (i.e., the number of torrents they are actively announcing from their client), and returns a number of peers equal to that count, up to the maximum that the client requests. A sample implementation of this simple algorithm is available on GitHub.
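The core of this first algorithm can be sketched in a few lines. This is an illustrative sketch, not the GitHub implementation; the staleness window and the representation of announces as timestamps are my assumptions.

```python
import time

# Hypothetical staleness window: two missed 15-minute announce intervals.
STALE_SECONDS = 30 * 60


def peers_to_return(client_announces, requested, now=None):
    """First algorithm sketch: one peer per non-stale announce.

    client_announces: last-announce timestamps for this client,
    one per torrent it is announcing.
    requested: the numwant-style cap from the client's request.
    """
    now = time.time() if now is None else now
    active = sum(1 for t in client_announces if now - t < STALE_SECONDS)
    return min(active, requested)
```

A client announcing three torrents gets at most three peers back, however many it asks for, which is exactly the "too punishing to new peers" problem described below.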
However, it has a number of problems:
- A linear algorithm with an intercept of zero is probably not what we want. It’s too punishing to new peers in few or no existing swarms.
- Large-scale leechers can still get all the peers they request, because they are making announces for each torrent they are trying to leech. Although leechers should get peers, we should do more to incentivize seeding as well.
- No sense of peer quality beyond number of peers returned.
etracker bends the spec by giving many leechers fewer peers than they request. If leechers are only given poor-quality peers, and fewer of them than they request, they will be incentivized to use another tracker rather than to seed more. Poor-quality peers might be slow peers, but there are other ways a peer can be poor quality as well.
From now on, I’m going to assume that Problem #1 can be solved independently. It’s just a matter of exploring some mathematical functions to normalize the curve of peers we distribute.
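As one example of the kind of normalization I have in mind for Problem #1, a nonzero floor plus sublinear growth avoids both the zero intercept and unbounded linear scaling. The floor and scale constants here are placeholders, not tuned values:

```python
import math


def normalized_peer_count(announces, requested, floor=10, scale=8):
    """Hypothetical normalization curve: every client gets a baseline
    number of peers, with sublinear (square-root) growth for clients
    in more swarms, capped at what the client requested."""
    return min(requested, floor + int(scale * math.sqrt(announces)))
```

A brand-new client in zero swarms would still receive 10 peers, while the marginal reward for joining ever more swarms tapers off.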
Second Algorithm: Number of seeds = Normalized number of peers
The next evolution of the algorithm defines a good peer as a good seeder. We can detect seeders in the swarm by tracking announces with the left key set to zero. This solves Problem 2 of the first algorithm, since we only reward seeders, and not leechers. However, some problems remain and other problems arise.
Problems:
- Disadvantages partial seeders: A partial seeder is a client which snatches part of a torrent, not all of it. For example, the Internet Archive hosts large torrents of historical documents, and a user may wish to snatch only some subset of the data. These seeders keep announcing with a non-zero left key, even though they never leech more of the torrent.
- No sense of peer quality beyond the number of peers received.
Third Algorithm: Number of seeds and Amount of upload = Normalized number of peers
Each client announce includes statistics on how much the client has uploaded and downloaded on a particular torrent. The tracker can keep track of this, and this points to a third evolution in the algorithm. Define a good peer as a good seeder, or as a leecher who announces upload on torrents which they have not yet completed. This will capture at least two additional cases beyond the second algorithm: First, it rewards leechers who are simultaneously seeding to others. Even if the leecher leaves the swarm soon, the fact that they have seeded as well before leaving is a sign of swarm health. Second, it rewards partial seeders who report upload to other peers.
This algorithm is not a complete solution to the issue of disadvantaging partial seeders: It will only count partial seeders who are also uploading to other leechers. This will not count partial seeders on unpopular content, who do not seed to anyone else. However, I think this is a reasonable compromise. If content is unpopular, we should incentivize people to seed it all for better retention. (Conversely, if a partial seeder reports no upload because they are a partial seeder of a very well-seeded torrent, there is no reason to particularly incentivize this kind of partial seeding either.)
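The third algorithm's notion of a good peer could be expressed as follows. The field names follow the standard announce keys; keeping the previously reported upload as tracker-side state is my assumption:

```python
def is_good_peer(announce, prev_uploaded=0):
    """Third algorithm sketch: a client is 'good' on a torrent if it
    is seeding (left == 0) or, while still leeching, has reported
    more upload than at its previous announce.

    prev_uploaded: the 'uploaded' value from this client's previous
    announce for the same torrent (tracker-side state)."""
    if announce.get("left") == 0:
        return True  # seeder
    # Leecher (or partial seeder): good only if upload has grown.
    return announce.get("uploaded", 0) > prev_uploaded
```

Note how a partial seeder on popular content qualifies through the upload branch, matching the compromise described above, while a partial seeder who uploads to no one does not.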
Problems:
- No sense of peer quality beyond the number of peers received.
Fourth Algorithm: Number of seeds and Amount of upload = Scaled peers by tier
The fourth and final algorithm finally tries to tackle the problem of returning good peers. The basic idea is that just as we can use upload and download amounts to measure if a particular client is a good enough contributor to be “deserving” of peers, we can also judge the other peers in the same way.
I imagine dividing the set of peers into tiers. Seeding peers with quick upload are in a high tier, and are preferentially distributed to high tier peers which are leeching new content.
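A minimal sketch of this tiering idea might look like the following. The tier thresholds, the bytes-per-second rate estimate, and the fill-downward selection policy are all assumptions on my part, not settled design:

```python
def assign_tier(upload_rate, thresholds=(100_000, 1_000_000)):
    """Bucket a peer into a tier by its estimated upload rate in
    bytes/s. Threshold values are placeholders, not tuned."""
    return sum(1 for limit in thresholds if upload_rate >= limit)


def select_peers(requester_tier, peers_by_tier, want):
    """Fill the response from the requester's own tier downward, so
    high-tier leechers are preferentially given high-tier peers.

    peers_by_tier: dict mapping tier number -> list of peers."""
    out = []
    for tier in range(requester_tier, -1, -1):
        for peer in peers_by_tier.get(tier, []):
            if len(out) == want:
                return out
            out.append(peer)
    return out
```

A low-tier requester never sees the high-tier pool at all in this sketch, which is the bluntest possible version of "reward good peers with more good peers."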
This involves a number of complications in implementation. As you can tell, I’m still working through exactly what this might look like:
- Seeding amount versus seeding speed: Clients seeding a large number of torrents may have a limit on the number of connections they can make. Although we want to distribute peers in such a way as to take advantage of large-bandwidth uploaders, we cannot punish them by trying to make them use too many connections. Ideally, we may want to reserve the upload bandwidth of large seeders for the most unpopular torrents they are sharing.
- Uncertainty in estimating peer quality: Peers do not directly report their connection limits or their bandwidth limits to us. These can only be unreliably inferred from their announce statistics.
- Handling multiple requests for peers from leeching clients: On slower connections or with large files, a leecher may make several announces before completing. It may make sense to give them better peers in later announces to help them finish snatching a torrent. If we do not adequately reward clients with completed content, they won’t use this kind of tracker to begin with.
- Technical issues: Efficiently calculating and updating peer tiers may be expensive, and require additional tables or caching.
- Cheating: Once the algorithm becomes sufficiently sensitive to reported announce statistics, it also increases the incentive for clients to cheat. This is a large topic which I’ll approach in later development.
How is the fourth evolution of the algorithm different from the out-of-protocol solutions I mentioned earlier in this post? Those solutions are all-or-nothing: Authorization for each user is used to prevent any bad contributors from using the tracker at all. etracker is designed to be more generous: All the content is still available to anyone, but it may just be slower to snatch if you are not a good seeder or active uploader. This is designed to incentivize such behavior by rewarding it with faster download speeds. It is implemented without any specific mechanism for client authorization, and the broader requirements for state are limited, since clients are identified by PeerID, which resets on every client restart.