Update spec with new right-censored pareto estimators.

Mike Perry 2010-06-07 20:02:12 -07:00
parent f897154b26
commit 81736f426f

@ -311,15 +311,39 @@ of their choices.
the tail of the distribution with a Pareto curve.
We calculate the parameters for a Pareto distribution fitting the data
using the estimators at
http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation.
using the estimators in equation 4 from:
http://portal.acm.org/citation.cfm?id=1647962.1648139
Because this is not a true Pareto distribution, we alter how Xm is
computed. The Xm parameter is computed as the midpoint of the most
This is:
alpha_m = s/(ln(U(X)/Xm^n))
where s is the total number of completed circuits we have seen, and
U(X) = x_max^u * Prod_s{x_i}
with x_i as our i-th completed circuit time, x_max as the longest
completed circuit build time we have yet observed, u as the
number of unobserved timeouts that have no exact value recorded,
and n as u+s, the total number of circuits that either time out or
complete.
Using log laws, we compute this as the sum of logs to avoid
overflow and ln(1.0+epsilon) precision issues:
alpha_m = s/(u*ln(x_max) + Sum_s{ln(x_i)} - n*ln(Xm))
This estimator is closely related to the parameters present in:
http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation
except they are adjusted to handle the fact that our samples are
right-censored at the timeout cutoff.
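As an illustration only, the sum-of-logs form of alpha_m can be sketched in
Python as follows. The function name estimate_alpha and its arguments are
hypothetical and not part of this spec. Clamping samples below Xm up to Xm
is an assumption based on the rule that times below Xm are counted as Xm.

```python
import math


def estimate_alpha(completed_ms, num_unobserved, xm):
    """Right-censored Pareto shape estimate using the sum-of-logs form.

    completed_ms   -- build times of completed circuits (the x_i)
    num_unobserved -- u, timeouts with no exact value recorded
    xm             -- the Pareto scale parameter Xm
    """
    if not completed_ms:
        raise ValueError("need at least one completed circuit")
    # Assumption: times below Xm are counted as Xm before fitting.
    samples = [max(x, xm) for x in completed_ms]
    s = len(samples)
    u = num_unobserved
    n = s + u
    x_max = max(samples)
    # alpha_m = s/(u*ln(x_max) + Sum_s{ln(x_i)} - n*ln(Xm))
    denom = (u * math.log(x_max)
             + sum(math.log(x) for x in samples)
             - n * math.log(xm))
    if denom <= 0:
        raise ValueError("all samples equal Xm; alpha is undefined")
    return s / denom
```

For example, with completed times of 100, 200 and 400 ms, one unobserved
timeout, and Xm = 100, the denominator reduces to ln(32), so the estimate
is 3/ln(32).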
Additionally, because this is not a true Pareto distribution, we alter
how Xm is computed. The Xm parameter is computed as the midpoint of the most
frequently occurring 50ms histogram bin, until the point where 1000
circuits are recorded. After this point, the weighted average of the top
3 midpoint modes is used as Xm. All times below this value are counted
as having the midpoint value of this weighted average bin.
'cbtnummodes' (default: 3) midpoint modes is used as Xm. All times below
this value are counted as having the midpoint value of this weighted average bin.
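A minimal sketch of the Xm computation, for illustration only. The names
estimate_xm, BIN_WIDTH_MS and min_circuits_for_modes are hypothetical.
Weighting each mode's midpoint by its bin count is an assumption, as is
treating exactly 1000 recorded circuits as "after this point".

```python
from collections import Counter

BIN_WIDTH_MS = 50  # histogram bin width from the text above


def estimate_xm(build_times_ms, num_modes=3, min_circuits_for_modes=1000):
    """Estimate Xm from the most frequent 50ms histogram bins.

    Below min_circuits_for_modes samples, only the single most frequent
    bin is used. Otherwise the top num_modes ('cbtnummodes') bins are
    averaged, each midpoint weighted by its count (an assumption).
    """
    if not build_times_ms:
        raise ValueError("no build times recorded")
    counts = Counter(int(t // BIN_WIDTH_MS) for t in build_times_ms)
    if len(build_times_ms) < min_circuits_for_modes:
        num_modes = 1
    top = counts.most_common(num_modes)
    total = sum(count for _, count in top)
    weighted = sum((bin_idx * BIN_WIDTH_MS + BIN_WIDTH_MS / 2) * count
                   for bin_idx, count in top)
    return weighted / total
```

With samples 110, 120, 130, 260, 270 and 410 ms, the single-mode result is
125 (the midpoint of the 100-150 bin). With the 1000-circuit gate disabled,
the three modes at 125, 275 and 425 with counts 3, 2 and 1 give 225.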
The timeout itself is calculated by using the Pareto Quantile function (the
inverted CDF) to give us the value on the CDF such that 80% of the mass
@ -347,10 +371,18 @@ of their choices.
2.4.3. How to record timeouts
Timeouts should be counted as the expectation of the region
of the Pareto distribution beyond the cutoff. This is done by
generating a random sample for each timeout at points on the
curve beyond the current timeout cutoff up to the 90% quantile marker.
Circuits that pass the timeout threshold should be allowed to continue
building until a time corresponding to the point 'cbtclosequantile'
(default 95) on the Pareto curve, or 60 seconds, whichever is greater.
The actual completion times for these circuits should be recorded.
Implementations should completely abandon a circuit and record a value
as an 'unknown' timeout if the total build time exceeds this threshold.
The reason for this is that right-censored Pareto estimators begin to lose
their accuracy if more than approximately 5% of the values are censored.
Since we wish to set the timeout cutoff so that 20% of circuits exceed it,
we must allow circuits to continue building past this cutoff point up to
the 95th percentile.
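The close threshold described above can be sketched as follows, using the
standard Pareto quantile function Q(p) = Xm/(1-p)^(1/alpha). The function
names and the floor_ms argument are hypothetical; this is an illustration
under those assumptions, not the reference implementation.

```python
def pareto_quantile(p, xm, alpha):
    """Inverse CDF of a Pareto(xm, alpha) distribution, for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return xm / (1.0 - p) ** (1.0 / alpha)


def close_threshold_ms(xm, alpha, close_quantile_pct=95, floor_ms=60_000):
    """Time after which a still-building circuit is abandoned.

    close_quantile_pct mirrors 'cbtclosequantile'. The result is the
    greater of that quantile and the 60 second floor.
    """
    q = pareto_quantile(close_quantile_pct / 100.0, xm, alpha)
    return max(q, floor_ms)


def record_build_time(elapsed_ms, threshold_ms):
    """Return the exact time if within the threshold, else 'unknown'."""
    return elapsed_ms if elapsed_ms <= threshold_ms else "unknown"
```

A circuit that finishes after the timeout cutoff but before this threshold
has its actual build time recorded. Only circuits exceeding the threshold
become 'unknown' (censored) samples.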
2.4.4. Detecting Changing Network Conditions
@ -388,6 +420,14 @@ of their choices.
disabled and history should be discarded. For use in
emergency situations only.
cbtnummodes
Default: 3
Effect: This value governs how many modes to use in the weighted
average calculation of Pareto parameter Xm. A value of 3 introduces
some bias (2-5% of CDF) under ideal conditions, but allows for better
performance in the event that a client chooses guard nodes of radically
different performance characteristics.
cbtrecentcount
Default: 20
Effect: This is the number of circuit build times to keep track of
@ -409,11 +449,11 @@ of their choices.
Effect: This is the position on the quantile curve to use to set the
timeout value. It is a percent (0-99).
cbtmaxsynthquantile
Default: 90
Effect: This is the maximum position on the quantile curve to use to
generate synthetic circuit build times for timeouts. It is a
percent (0-99).
cbtclosequantile
Default: 95
Effect: This is the position on the quantile curve to use to set the
timeout value to use to actually close circuits. It is a percent
(0-99).
cbttestfreq
Default: 60