
How SET accelerates P2P filesharing

Yesterday, we reported on the announcement of the SET filesharing system. SET claims to accelerate a typical P2P download by anywhere from 30 to 70 percent by increasing the potential sources of files being downloaded. It works this magic by identifying files that are similar to the one being downloaded and finding identical pieces within those similar files. Thanks to the folks who developed the system, we had a chance to read the paper they are presenting, which describes the system in detail.

The obvious first hurdle in any such system is presenting the files to be shared in a way that allows similarities to be detected. P2P filesharing systems already split files into chunks, typically based on byte offsets, allowing clients to retrieve different chunks from different sources. This would mask similar but nonidentical files, however, as any change that altered a file's length would throw the data in all subsequent chunks out of register. SET works around this by using a process called Rabin fingerprinting.

Using this method, chunk boundaries are set based on a combination of size and the properties of the data itself. For example, chunk boundaries could be determined by a rule such as "at least 16KB after the last boundary, at the first place that four bits in a row are zeroes." Because the boundaries depend on the content rather than on absolute byte offsets, the chunks can share common boundaries even when intervening differences would otherwise throw them out of register.
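A boundary rule of that sort can be sketched in a few lines. The snippet below is our own illustration of content-defined chunking, not SET's actual code: it uses an ordinary hash over a small sliding window where a real implementation would use a rolling Rabin fingerprint, and the function names and parameters are invented for the example.

```python
import hashlib

def chunk_boundaries(data, min_size=16 * 1024, mask=0xF):
    """Place a boundary at least min_size bytes after the previous one,
    at the first position where the fingerprint's low four bits are zero.
    A per-window SHA-1 stands in for a rolling Rabin fingerprint here."""
    boundaries = []
    i = min_size
    while i < len(data):
        # Fingerprint a small window ending at position i.
        window = data[i - 8:i]
        fp = int.from_bytes(hashlib.sha1(window).digest()[:4], "big")
        if fp & mask == 0:  # four low bits are zero -> chunk boundary
            boundaries.append(i)
            i += min_size
        else:
            i += 1
    return boundaries
```

Because a boundary depends only on the few bytes around it, an insertion early in the file shifts the positions of later boundaries but not their relationship to the surrounding data, which is what lets unchanged regions of two similar files produce identical chunks.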

Once a file is split in chunks, a hash value can then be calculated for each chunk, allowing identical chunks to be recognized. But sharing and comparing all the hashes for the chunks of all available files would be horribly inefficient, as it would require significant network traffic and computational resources. To avoid these problems, the authors devised a rapid comparison scheme they termed "handprinting" that involves a single, small data transfer.

Once the hashes are calculated, they are sorted lexicographically (similar to alphabetical sorting), and the first 30 of the resulting hashes are defined as the "handprint" of the file. Those handprints can then be sent to the global filesharing lookup table, allowing new clients to rapidly access them. Comparing two handprints gives a rough measure of the total similarity between two files. Only when that similarity is above a certain threshold are the full lists of chunk hashes compared, and identical pieces of the files can then be downloaded.
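Handprint construction and comparison might look roughly like this. This is a sketch under our own assumptions: the choice of SHA-1 and the set-overlap similarity estimate are illustrative stand-ins, not the paper's exact definitions.

```python
import hashlib

def handprint(chunks, k=30):
    """Hash each chunk, sort the digests lexicographically, and keep the
    first k as the file's handprint (k=30 matches the value in the article)."""
    hashes = sorted(hashlib.sha1(c).hexdigest() for c in chunks)
    return set(hashes[:k])

def estimated_similarity(hp_a, hp_b):
    """Rough similarity estimate from handprint overlap. Only above some
    threshold (around 10 percent, per the article) would clients go on to
    compare the full chunk-hash lists."""
    return len(hp_a & hp_b) / max(len(hp_a | hp_b), 1)
```

Keeping only the lexicographically smallest hashes means two files that share many chunks will, with high probability, also share handprint entries, so a single small transfer is enough to decide whether a full comparison is worthwhile.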

There are three variables that need to be set in the SET system, which the authors chose to do empirically. The first is chunk size: bigger chunks make for more efficient transfers, but produce fewer identical chunks when differences are evenly distributed through the files. They chose a 16KB chunk size because it captured the majority of possible similarities while imposing only a one percent overhead.

A second variable was the number of chunk hashes that go into a handprint. They calculated that using 28 of them would give them a greater than 90 percent chance of identifying any two files that were more than 10 percent similar, so they tested a handprint of 30 chunk hashes against a set of movie and MP3 files obtained from P2P networks. This successfully pulled out 99 percent of the similar files, which they considered to be a pretty good rate. The final issue was where to set the similarity threshold that determines whether a file is similar enough to perform a full comparison. Here, tests showed that a 10 percent similarity provided a good balance between overhead and the identification of similar files.
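The "greater than 90 percent" figure is consistent with a simple back-of-envelope model: if two files share a fraction s of their chunks, and each of the k handprint hashes is treated as an independent chance of landing on a shared chunk, the detection probability falls out directly. This is our own simplification, not the paper's exact analysis.

```python
# Probability of detecting two files that share fraction s of their
# chunks, using k handprint hashes, under a naive independence assumption.
s, k = 0.10, 28
p_detect = 1 - (1 - s) ** k  # comfortably above the 90 percent target
```

With s = 0.10 and k = 28 this comes out just under 95 percent, in line with the authors' claim.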

With these parameters in place, they tested their system on a number of simulated networks, and compared it to BitTorrent clients on the same networks. In general, the speedups from SET were most dramatic when the network performance was worst. Speedup was minimal when any server was on a fast network, because a single server could saturate the clients. Put both servers and clients on slow, asymmetric DSL links, and SET provided a huge boost. In the real world, these inefficient conditions appear to predominate: the authors cite studies showing that over 60 percent of P2P downloads are never completed, and the median transfer time for a 100MB file on Kazaa was over a day.

Given that data, it looks like the SET technique could have a big impact on future P2P services. The authors also noted that their technique is simply a method for rapidly identifying more sources of the file that's being downloaded. There's also research going on into speeding the data transfers themselves, and the two improvements could be combined to produce an aggregate speedup. Hopefully, the resulting software will cut down on the most inefficient aspect of the whole process: the downloads that are started but never completed.