Add proposed methodology for tracking national usage trends.

svn:r14578
Nick Mathewson 2008-05-08 04:13:36 +00:00
parent 2238d8008d
commit 32065813ac

@@ -0,0 +1,88 @@
Abstract
This document explains how to estimate roughly how many Tor users
there are, and how many of them are in each country. Statistics are
involved.
Motivation
There are a few reasons we need to keep track of which countries
Tor users (in aggregate) are coming from:
- Resource allocation. Knowing which underserved countries have
lots of users tells us where we need to direct translation and
outreach efforts.
- Anticensorship. Sudden drops in usage on a national basis can
indicate the arrival of a censorious firewall.
- Sponsor outreach and self-evaluation. Many people and
organizations who are interested in funding The Tor Project's
work want to know that we're successfully serving parts of the
world they're interested in, and that efforts to expand our
userbase are actually succeeding. So, when you come right
down to it, do we.
Goals
We want to know roughly how many Tor users there are, and which
countries they're in, even in the presence of a hypothetical
"directory guard" feature. Some uncertainty is okay, but we'd like
to be able to put a bound on the uncertainty.
We need to make sure this information isn't exposed in a way that
helps an adversary.
Methods
Every client downloads network status documents. There are
currently three methods (one hypothetical) for clients to get them.
- 0.1.2.x clients (and earlier) fetch a v2 networkstatus
document about every NETWORKSTATUS_CLIENT_DL_INTERVAL [30
minutes].
- 0.2.0.x clients fetch a v3 networkstatus consensus document
at a random interval between when their current document is no
longer freshest, and when their current document is about to
expire.
[In both of the above cases, clients choose a directory cache at
random with odds roughly proportional to its bandwidth.]
- In some future version, clients will choose directory caches
to serve as their "directory guards" to avoid profiling
attacks, similarly to how clients currently start all their
circuits at guard nodes.
We assume that a directory cache can tell which of these three
categories a client is in by the format of its status request.
A directory cache can be made to count distinct client IP
addresses that make a certain request of it in a given timeframe.
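As an illustration only, the counting step could look like the sketch below. It is not Tor code: the class name, the category labels ("v2", "v3"), and the assumption that the caller has already mapped each IP to a country code through some GeoIP lookup are all hypothetical.

```python
from collections import defaultdict


class DistinctClientCounter:
    """Hypothetical sketch: count distinct client IPs per (category, country).

    The category label and country code are supplied by the caller; how a
    real cache would derive them is outside the scope of this example.
    """

    def __init__(self):
        # Maps (category, country) to the set of IPs seen in this timeframe.
        self._seen = defaultdict(set)

    def record(self, ip, category, country):
        """Note one status request; repeat requests from an IP count once."""
        self._seen[(category, country)].add(ip)

    def counts(self):
        """Return the number of distinct IPs seen per (category, country)."""
        return {key: len(ips) for key, ips in self._seen.items()}

    def reset(self):
        """Forget all addresses, e.g. at the end of a measurement timeframe."""
        self._seen.clear()
```

Resetting at the end of each timeframe keeps the raw addresses from accumulating, which fits the goal of not exposing this information in a way that helps an adversary.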
For the first two cases, a cache can get a picture of the overall
number and countries of users in the network by dividing its IP
counts by the probability with which it would be chosen. Suppose
the cache's listed bandwidth is such that it expects to be chosen
with probability P for any given request, and it has been counting
IPs for long enough that it expects the average client to have
made N requests. Then, assuming each request picks a cache
independently, a given client will have visited the cache at least
once with probability P' = 1-(1-P)^N, so the cache divides the IP
counts it has seen by P' to get its estimate.
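The arithmetic above can be sketched as follows. This is a minimal illustration, not code from Tor: the function names and the per-country dictionary layout are assumptions, and it treats each request as an independent choice of cache with probability P.

```python
def visit_probability(p, n):
    """Return P' = 1 - (1 - P)^N.

    p: probability that this cache is chosen for any single request.
    n: expected number of requests the average client made while counting.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (1.0 - p) ** n


def estimate_users(ip_counts, p, n):
    """Scale per-country distinct-IP counts by 1/P' to estimate totals.

    ip_counts: mapping of country code to distinct IPs seen (assumed layout).
    """
    p_visit = visit_probability(p, n)
    return {country: count / p_visit for country, count in ip_counts.items()}
```

For example, a cache chosen with P = 0.5 that has counted long enough for N = 2 requests per client has P' = 0.75, so 30 distinct IPs observed from one country would suggest about 40 users there.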
If directory guards are in use, directory guards get a picture of
all those users who chose them as a guard when they were listed
as a good choice for a guard, and who are also on the network
now. The cleanest data here will come from nodes that were listed
as good new-guard choices for a while, and have not been so for a
while longer (to study decay rates); nodes that have been listed
as good new-guard choices consistently for a long time (to get a
sample of the network); and nodes that have been listed as good
new-guard choices only recently (to get a sample of new users and
users whose guards have died out).
Note that these measurements *shouldn't* be taken at directory
authorities: their picture of the network is too skewed by the
special cases in which clients fetch from them directly.