Daniel Moerner

Brainstorming BitTorrent Peer Distribution Algorithms

Recently, I’ve been working on etracker, an experimental BitTorrent tracker designed to incentivize healthy client behavior. If you already know about the BitTorrent protocol or about my project, feel free to skip ahead to the actual brainstorming.

Project Background

BitTorrent is a P2P file transfer protocol, which can use centralized trackers to distribute peer lists to clients, who in turn connect to each other. A collection of clients announcing a particular content infohash is a swarm, which consists of some clients “seeding” content they already have, and other “leeching” content they do not yet have. A tracker must keep track of many clients, each in one or more swarms.

A standard challenge in the BitTorrent world is free-riding: Users who only leech content, but do not seed. The protocol does not build in an incentive to seed, so users may leech content and then leave the swarm. This behavior can put additional upload bandwidth requirements on users who do choose to seed, and can reduce the longevity of swarms as only a proper subset of the clients with the content actually offer to seed it to others.

One approach to the free-riding problem, developed by private communities, is to put out-of-protocol restrictions on clients to force them to seed. BitTorrent metadata is distributed in the form of .torrent files. A front-end can restrict access to .torrent files to users who have demonstrated that they have uploaded a certain amount of content or seeded torrents for a certain amount of time. This approach requires significant overhead and coordination: a separate database to track user contributions over time, rules governing when new .torrent files can be downloaded, authorization that ties clients to a persistent user identity, and moderation to prevent cheating.

In etracker, I pursue a different approach. The tracker’s primary job is to distribute peers to clients. The unofficial extension to the BitTorrent spec states:

As mentioned above, the list of peers is length 50 by default. If there are fewer peers in the torrent, then the list will be smaller. Otherwise, the tracker randomly selects peers to include in the response. The tracker may choose to implement a more intelligent mechanism for peer selection when responding to a request. For instance, reporting seeds to other seeders could be avoided.

One of the goals of etracker is to explore alternative peer distribution algorithms which are designed to incentivize long-term and fast seeding, without relying on restrictions on .torrent file downloads or requiring that users authorize with a central service to gain access to content.

(Another goal of etracker is to experiment with tracker-level cheater detection. Cheating becomes an issue with any system that tries to algorithmically solve the free-riding problem. I hope to discuss cheating in later posts.)

Brainstorming Peer-Distribution Algorithms

The informal idea behind etracker is to reward good peers with more good peers. Formalizing this informal idea requires figuring out what a good seeder is, figuring out what a good peer is, and figuring out how to scale the rewards. At this point I’m just going to be publicly brainstorming, building up in steps towards the algorithm I’m currently targeting.

First Algorithm: Number of announces = Number of peers

The simplest algorithm counts how many non-stale announces a client has (i.e., the number of torrents they have announcing in their client), and returns a number of peers equal to that number, up to the maximum that the client requests. A sample implementation of this simple algorithm is available on GitHub.
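As a rough illustration, the first algorithm can be sketched in a few lines of Python. The function and field names here are illustrative assumptions, not etracker's actual API:

```python
import random

def select_peers(client_announces, swarm, numwant=50):
    """Sketch of the first algorithm: return up to min(announces, numwant)
    peers, chosen at random from the swarm.

    client_announces -- number of non-stale announces this client has
    swarm            -- list of candidate peers for the torrent
    numwant          -- maximum number of peers the client requested
    """
    limit = min(client_announces, numwant, len(swarm))
    return random.sample(swarm, limit)
```

A client announcing on three torrents would receive at most three peers per response, regardless of how many it asked for.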

However, it has a number of problems:

  1. A linear algorithm with an intercept of zero is probably not what we want. It’s too punishing to new peers in few or no existing swarms.
  2. Large-scale leechers can still get all the peers they request, because they are making announces for each torrent they are trying to leech. Although leechers should get peers, we should do more to incentivize seeding as well.
  3. No sense of peer quality beyond the number of peers returned. etracker bends the spec by giving many leechers fewer peers than they request. If leechers are given only poor-quality peers, and fewer of them than they request, they will be incentivized to use another tracker rather than to seed more. Poor-quality peers might be slow peers, or poor in other ways.

From now on, I’m going to assume that Problem #1 can be solved independently. It’s just a matter of exploring some mathematical functions to normalize the curve of peers we distribute.
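To make the normalization idea concrete, here is one candidate curve: a concave function with a nonzero intercept, so that brand-new clients still receive a baseline number of peers and returns grow sublinearly. The constants are placeholders for illustration, not tuned values:

```python
import math

def normalized_peer_count(announces, base=10, scale=8, numwant=50):
    """One possible normalization: a sqrt curve with a nonzero intercept.

    A client with zero announces still gets `base` peers, and the reward
    for additional announces grows sublinearly, capped at `numwant`.
    """
    return min(numwant, base + math.floor(scale * math.sqrt(announces)))
```

With these placeholder constants, a new client gets 10 peers, a client with 4 announces gets 26, and the cap of 50 is reached at 25 announces.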

Second Algorithm: Number of seeds = Normalized number of peers

The next evolution of the algorithm defines a good peer as a good seeder. We can detect seeders in the swarm by tracking announces with the left key set to zero. This solves Problem 2 of the first algorithm, since we only reward seeders, and not leechers. However, some problems remain and other problems arise.
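Detecting seeders from announce data is straightforward, since the announce protocol's left key reports how many bytes the client still needs. A minimal sketch, assuming announces are stored as dicts keyed by the standard announce fields:

```python
def is_seeder(announce):
    # Per the tracker protocol, a client announcing left == 0 has the
    # complete payload and is seeding.
    return announce.get("left") == 0

def count_seeded_torrents(announces):
    # Number of distinct torrents this client is currently seeding.
    return len({a["info_hash"] for a in announces if is_seeder(a)})
```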

Problems:

  1. Disadvantages partial seeders: A partial seeder is a client which snatches part of a torrent, not all of it. For example, the Internet Archive hosts large torrents of historical documents, and a user may wish to only snatch some subset of the data. These seeders keep announcing with a non-zero left key, even though they never leech more of the torrent.
  2. No sense of peer quality beyond the number of peers received.

Third Algorithm: Number of seeds and Amount of upload = Normalized number of peers

Each client announce includes statistics on how much the client has uploaded and downloaded on a particular torrent. The tracker can keep track of this, and this points to a third evolution in the algorithm. Define a good peer as a good seeder, or as a leecher who announces upload on torrents they have not yet completed. This captures at least two additional cases beyond the second algorithm: First, it rewards leechers who are simultaneously seeding to others. Even if the leecher leaves the swarm soon, the fact that they have seeded before leaving is a sign of swarm health. Second, it rewards partial seeders who report upload to other peers.

This algorithm is not a complete solution to the issue of disadvantaging partial seeders: It will only count partial seeders who are also uploading to other leechers. This will not count partial seeders on unpopular content, who do not seed to anyone else. However, I think this is a reasonable compromise. If content is unpopular, we should incentivize people to seed it all for better retention. (Conversely, if a partial seeder reports no upload because they are a partial seeder of a very well-seeded torrent, there is no reason to particularly incentivize this kind of partial seeding either.)
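The third algorithm's definition of a good peer reduces to a simple predicate over announce statistics. Field names again follow the standard announce protocol; the function name is my own:

```python
def is_good_peer(announce):
    # Good peer = a seeder (left == 0), or a client reporting upload
    # while still incomplete (an uploading leecher or a partial seeder
    # who is serving data to others).
    return announce["left"] == 0 or announce["uploaded"] > 0
```

Note that a partial seeder on a well-seeded torrent reports no upload and fails this test, which is exactly the compromise described above.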

Problems:

  1. No sense of peer quality beyond the number of peers received.

Fourth Algorithm: Number of seeds and Amount of upload = Scaled peers by tier

The fourth and final algorithm finally tries to tackle the problem of returning good peers. The basic idea is that just as we can use upload and download amounts to measure if a particular client is a good enough contributor to be “deserving” of peers, we can also judge the other peers in the same way.

I imagine dividing the set of peers into tiers. Seeding peers with quick upload are in a high tier, and are preferentially distributed to high tier peers which are leeching new content.
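One naive way to bucket peers into tiers is by average upload rate inferred from announce deltas. This is only a sketch of the idea: the thresholds, field names, and the assumption that the tracker records the time between announces are all mine, and (as noted below) any rate inferred this way is unreliable:

```python
def assign_tier(stats, thresholds=(1 << 20, 1 << 18)):
    """Map a peer's inferred upload rate to a tier: 0 (best), 1, or 2.

    stats -- dict with bytes uploaded since the last announce and the
             seconds elapsed between announces (tracker-maintained).
    thresholds -- placeholder cutoffs in bytes/s (1 MiB/s, 256 KiB/s).
    """
    rate = stats["uploaded"] / max(stats["interval_seconds"], 1)
    for tier, cutoff in enumerate(thresholds):
        if rate >= cutoff:
            return tier
    return len(thresholds)
```

The peer-distribution step would then draw preferentially from tier 0 when answering a high-tier leecher's announce.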

This involves a number of complications in implementation. As you can tell, I’m still working through exactly what this might look like:

  • Seeding amount versus seeding speed: Clients seeding large numbers of torrents may have a limit on the number of connections they can make. Although we want to distribute peers in such a way as to take advantage of large-bandwidth uploaders, we cannot punish them by trying to make them use too many connections. Ideally, we may want to reserve the upload bandwidth of large seeders for the most unpopular torrents they are sharing.
  • Uncertainty in estimating peer quality: Peers do not directly report their connection limits or their bandwidth limits to us. These can only be unreliably inferred from their announce statistics.
  • Handling multiple requests for peers from leeching clients: On slower connections or with large files, a leecher may make several announces before completing. It may make sense to give them better peers in later announces to help them finish snatching a torrent. If we do not adequately reward clients with completed content, they won’t use this kind of tracker to begin with.
  • Technical issues: Efficiently calculating and updating peer tiers may be expensive, and require additional tables or caching.
  • Cheating: Once the algorithm becomes sufficiently sensitive to reported announce statistics, it also increases the incentive for clients to cheat. This is a large topic which I’ll approach in later development.

How is the fourth evolution of the algorithm different from the out-of-protocol solutions I mentioned earlier in this post? Those solutions are all-or-nothing: Per-user authorization is used to prevent bad contributors from using the tracker at all. etracker is designed to be more generous: All the content is still available to anyone, but it may just be slower to snatch if you are not a good seeder or active uploader. This is designed to incentivize such behavior by rewarding it with faster download speeds. It is implemented without any specific mechanism for client authorization, and the broader requirements for state are limited, since clients are identified by PeerID, which resets on every client restart.