2008-07-06 19:37:04 +02:00
|
|
|
Filename: 151-path-selection-improvements.txt
|
|
|
|
Title: Improving Tor Path Selection
|
|
|
|
Author: Fallon Chen, Mike Perry
|
|
|
|
Created: 5-Jul-2008
|
|
|
|
Status: Draft
|
|
|
|
|
|
|
|
Overview
|
|
|
|
|
|
|
|
The performance of paths selected can be improved by adjusting the
|
2008-07-07 01:36:33 +02:00
|
|
|
CircuitBuildTimeout and avoiding failing guard nodes. This proposal
|
2008-08-12 20:23:38 +02:00
|
|
|
describes a method of tracking buildtime statistics at the client, and
|
|
|
|
using those statistics to adjust the CircuitBuildTimeout.
|
2008-07-06 19:37:04 +02:00
|
|
|
|
|
|
|
Motivation
|
|
|
|
|
|
|
|
Tor's performance can be improved by excluding those circuits that
|
|
|
|
have long buildtimes (and by extension, high latency). For those Tor
|
|
|
|
users who require better performance and have lower requirements for
|
|
|
|
anonymity, this would be a very useful option to have.
|
|
|
|
|
|
|
|
Implementation
|
|
|
|
|
2008-08-12 20:23:38 +02:00
|
|
|
Storing Build Times
|
|
|
|
|
|
|
|
Circuit build times will be stored in the circular array
|
|
|
|
'circuit_build_times' consisting of uint16_t elements as milliseconds.
|
|
|
|
The total size of this array will be based on the number of circuits
|
|
|
|
it takes to converge on a good fit of the long term distribution of
|
|
|
|
the circuit builds for a fixed link. We do not want this value to be
|
|
|
|
too large, because it will make it difficult for clients to adapt to
|
|
|
|
moving between different links.
|
|
|
|
|
|
|
|
From our initial observations, this value appears to be on the order
|
|
|
|
of 1000, but will be configurable in a #define NCIRCUITS_TO_OBSERVE.
|
2008-08-15 06:13:11 +02:00
|
|
|
The exact value for this #define will be determined by performing
|
|
|
|
goodness of fit tests using measurments obtained from the shufflebt.py
|
|
|
|
script from TorFlow.
|
2008-08-12 20:23:38 +02:00
|
|
|
|
|
|
|
Long Term Storage
|
|
|
|
|
|
|
|
The long-term storage representation will be implemented by storing a
|
|
|
|
histogram with BUILDTIME_BIN_WIDTH millisecond buckets (default 50) when
|
|
|
|
writing out the statistics to disk. The format of this histogram on disk
|
|
|
|
is yet to be finalized, but it will likely be of the format
|
2008-08-15 06:13:11 +02:00
|
|
|
'CircuitBuildTime <bin> <count>', with the total specified as
|
|
|
|
'TotalBuildTimes <total>'
|
2008-08-12 20:23:38 +02:00
|
|
|
Example:
|
|
|
|
|
2008-08-15 06:13:11 +02:00
|
|
|
TotalBuildTimes 100
|
|
|
|
CircuitBuildTimeBin 1 50
|
|
|
|
CircuitBuildTimeBin 2 25
|
|
|
|
CircuitBuildTimeBin 3 13
|
2008-08-12 20:23:38 +02:00
|
|
|
...
|
|
|
|
|
|
|
|
Reading the histogram in will entail multiplying each bin by the
|
|
|
|
BUILDTIME_BIN_WIDTH and then inserting <count> values into the
|
|
|
|
circuit_build_times array each with the value of
|
2008-08-15 06:13:11 +02:00
|
|
|
<bin>*BUILDTIME_BIN_WIDTH. In order to evenly distribute the
|
|
|
|
values in the circular array, a form of index skipping must
|
|
|
|
be employed. Values from bin #N with bin count C and total T
|
|
|
|
will occupy indexes specified by N+((T/C)*k)-1, where k is the
|
|
|
|
set of integers ranging from 0 to C-1.
|
|
|
|
|
|
|
|
For example, this would mean that the values from bin 1 would
|
|
|
|
occupy indexes 1+(100/50)*k-1, or 0, 2, 4, 6, 8, 10 and so on.
|
|
|
|
The values for bin 2 would occupy positions 1, 5, 9, 13. Collisions
|
|
|
|
will be inserted at the first empty position in the array greater
|
|
|
|
than the selected index (which may requiring looping around the
|
|
|
|
array back to index 0).
|
2008-08-12 20:23:38 +02:00
|
|
|
|
2008-07-06 19:37:04 +02:00
|
|
|
Learning the CircuitBuildTimeout
|
|
|
|
|
|
|
|
Based on studies of build times, we found that the distribution of
|
2008-08-12 20:23:38 +02:00
|
|
|
circuit buildtimes appears to be a Pareto distribution.
|
2008-07-07 01:36:33 +02:00
|
|
|
|
2008-08-12 20:23:38 +02:00
|
|
|
We will calculate the parameters for a Pareto distribution
|
|
|
|
fitting the data using the estimators at
|
|
|
|
http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation.
|
2008-07-07 01:36:33 +02:00
|
|
|
|
2008-08-12 20:23:38 +02:00
|
|
|
The timeout itself will be calculated by solving the CDF for the
|
|
|
|
a percentile cutoff BUILDTIME_PERCENT_CUTOFF. This value
|
|
|
|
represents the percentage of paths the Tor client will accept out of
|
|
|
|
the total number of paths. We have not yet determined a good
|
|
|
|
cutoff for this mathematically, but 85% seems a good choice for now.
|
2008-07-06 19:37:04 +02:00
|
|
|
|
2008-08-12 20:23:38 +02:00
|
|
|
From http://en.wikipedia.org/wiki/Pareto_distribution#Definition,
|
|
|
|
the calculation we need is pow(BUILDTIME_PERCENT_CUTOFF/100.0, k)/Xm.
|
|
|
|
|
2008-08-15 06:13:11 +02:00
|
|
|
Testing
|
|
|
|
|
|
|
|
After circuit build times, storage, and learning are implemented,
|
|
|
|
the resulting histogram should be checked for consistency by
|
|
|
|
verifying it persists across successive Tor invocations where
|
|
|
|
no circuits are built. In addition, we can also use the existing
|
|
|
|
buildtime scripts to record build times, and verify that the histogram
|
|
|
|
the python produces matches that which is output to the state file in Tor,
|
|
|
|
and verify that the Pareto parameters and cutoff points also match.
|
|
|
|
|
|
|
|
Soft timeout vs Hard Timeout
|
|
|
|
|
|
|
|
At some point, it may be desirable to change the cutoff from a
|
|
|
|
single hard cutoff that destroys the circuit to a soft cutoff and
|
|
|
|
a hard cutoff, where the soft cutoff merely triggers the building
|
|
|
|
of a new circuit, and the hard cutoff triggers destruction of the
|
|
|
|
circuit.
|
|
|
|
|
|
|
|
Good values for hard and soft cutoffs seem to be 85% and 65%
|
|
|
|
respectively, but we should eventually justify this with observation.
|
|
|
|
|
2008-08-12 20:23:38 +02:00
|
|
|
When to Begin Calculation
|
|
|
|
|
|
|
|
The number of circuits to observe (NCIRCUITS_TO_CUTOFF) before
|
|
|
|
changing the CircuitBuildTimeout will be tunable via a #define. From
|
|
|
|
our measurements, a good value for NCIRCUITS_TO_CUTOFF appears to be
|
|
|
|
on the order of 100.
|
|
|
|
|
|
|
|
Dealing with Timeouts
|
|
|
|
|
|
|
|
Timeouts should be counted as the expectation of the region of
|
|
|
|
of the Pareto distribution beyond the cutoff. The proposal will
|
|
|
|
be updated with this value soon.
|
|
|
|
|
|
|
|
Also, in the event of network failure, the observation mechanism
|
|
|
|
should stop collecting timeout data.
|
2008-07-06 19:37:04 +02:00
|
|
|
|
2008-08-12 20:23:38 +02:00
|
|
|
Client Hints
|
2008-07-07 01:36:33 +02:00
|
|
|
|
2008-08-12 20:23:38 +02:00
|
|
|
Some research still needs to be done to provide initial values
|
|
|
|
for CircuitBuildTimeout based on values learned from modem
|
|
|
|
users, DSL users, Cable Modem users, and dedicated links. A
|
|
|
|
radiobutton in Vidalia should eventually be provided that
|
|
|
|
sets CircuitBuildTimeout to one of these values and also
|
|
|
|
provide the option of purging all learned data, should any exist.
|
2008-07-07 01:36:33 +02:00
|
|
|
|
2008-08-12 20:23:38 +02:00
|
|
|
These values can either be published in the directory, or
|
|
|
|
shipped hardcoded for a particular Tor version.
|
|
|
|
|
|
|
|
Issues
|
|
|
|
|
|
|
|
Impact on anonymity
|
2008-07-06 19:37:04 +02:00
|
|
|
|
2008-08-12 20:23:38 +02:00
|
|
|
Since this follows a Pareto distribution, large reductions on the
|
|
|
|
timeout can be achieved without cutting off a great number of the
|
|
|
|
total paths. This will eliminate a great deal of the performance
|
|
|
|
variation of Tor usage.
|