Abstract

  This document explains how to estimate how many Tor users there are,
  and how many of them are in each country.  Statistics are involved.

Motivation

  There are a few reasons we need to keep track of which countries
  Tor users (in aggregate) are coming from:

    - Resource allocation.  Knowing about underserved countries with
      lots of users can let us know where we need to direct
      translation and outreach efforts.

    - Anticensorship.  Sudden drops in usage on a national basis can
      indicate the arrival of a censorious firewall.

    - Sponsor outreach and self-evaluation.  Many people and
      organizations who are interested in funding The Tor Project's
      work want to know that we're successfully serving parts of the
      world they're interested in, and that efforts to expand our
      userbase are actually succeeding.  So do we.

Goals

  We want to know approximately how many Tor users there are, and which
  countries they're in, even in the presence of a hypothetical
  "directory guard" feature.  Some uncertainty is okay, but we'd like
  to be able to put a bound on the uncertainty.

  We need to make sure this information isn't exposed in a way that
  helps an adversary.

Methods for current clients:

  Every client downloads network status documents.  There are
  currently three methods (one hypothetical) for clients to get them.
    - 0.1.2.x clients (and earlier) fetch a v2 networkstatus
      document about every NETWORKSTATUS_CLIENT_DL_INTERVAL [30
      minutes].

    - 0.2.0.x clients fetch a v3 networkstatus consensus document
      at a random interval between when their current document is no
      longer freshest, and when their current document is about to
      expire.

      [In both of the above cases, clients choose a running
       directory cache at random with odds roughly proportional to
       its bandwidth.  If they're just starting, they know a ]

    - In some future version, clients will choose directory caches
      to serve as their "directory guards" to avoid profiling
      attacks, similarly to how clients currently start all their
      circuits at guard nodes.

  We assume that a directory cache can tell which of these three
  categories a client is in by the format of its status request.

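  As an illustration only, a cache could bucket incoming requests by
  the URL being fetched.  The sketch below (Python, not Tor code)
  assumes the request paths from dir-spec: v2 statuses are fetched
  from under /tor/status/, and the v3 consensus from
  /tor/status-vote/current/consensus; the directory-guard case is
  left as a placeholder, since that fetch isn't specified yet.

    # Rough sketch: classify a directory request by its URL.  The v2
    # and v3 prefixes follow dir-spec; anything else is lumped
    # together, since the directory-guard request format is not yet
    # defined.
    def classify_status_request(url):
        if url.startswith("/tor/status-vote/current/consensus"):
            return "v3"        # 0.2.0.x consensus fetch
        if url.startswith("/tor/status/"):
            return "v2"        # 0.1.2.x-and-earlier networkstatus fetch
        return "other"         # unknown / future (e.g. directory guards)
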
  A directory cache can be made to count the distinct client IP
  addresses that make a certain kind of request of it in a given
  timeframe, and the total number of such requests made to it over
  that timeframe.  For the first two cases, a cache can get a picture
  of the overall number and countries of users in the network by
  dividing its IP count by the probability with which it (as a cache)
  would be chosen.  Assuming that our listed bandwidth is such that we
  expect to be chosen with probability P for any given request, and
  we've been counting IPs for long enough that we expect the average
  client to have made N requests, a given client will have visited us
  at least once with probability P' = 1-(1-P)^N, and so we divide the
  IP counts we've seen by P' for our estimate.  To estimate the total
  number of clients of a given type, determine how many requests a
  client of that type will make over that time, and assume we'll have
  seen a fraction P of them.

  Both of these numbers are useful: the IP counts will give the
  total number of IPs connecting to the network, and the request
  counts will give the total number of users on the network at any
  given time.

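  To make the arithmetic concrete, here is a rough sketch (Python,
  illustrative only, not Tor code) of the estimate described above.
  The inputs are assumed to be gathered by the cache over an H-hour
  window: distinct IPs per country for one kind of request, the total
  number of such requests we answered, our selection probability P,
  and the expected per-client fetch rate for that client type.

    def estimate_users(ip_counts, req_count, p, req_rate, hours):
        # Expected requests per client over the window; e.g. about 2
        # per hour for v2 clients (see the notes below for v3).
        n = req_rate * hours
        # Chance that a given client hit us at least once.
        p_seen = 1 - (1 - p) ** n             # P' = 1 - (1-P)^N
        by_country = {cc: ips / p_seen for cc, ips in ip_counts.items()}
        total_ips = sum(ip_counts.values()) / p_seen
        # We expect to have answered a fraction P of all requests of
        # this type, and each client makes about N of them.
        total_clients = req_count / (p * n)
        return by_country, total_ips, total_clients

  For example, with P = 0.01 and a 24-hour window, a v2 client makes
  about N = 48 requests, P' = 1 - 0.99^48 is roughly 0.38, and 2000
  distinct IPs seen would scale to an estimate of roughly 5200.
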
Notes:
    - [Over H hours, the N for V2 clients is 2*H, and the N for V3
      clients is currently around half or a third of that.]

    - (We should only count requests that we actually intend to answer;
      503 requests shouldn't count.)

    - These measurements should also be taken at a directory
      authority if possible: their picture of the network is skewed
      by clients that fetch from them directly.  These clients,
      however, are all the clients that are just bootstrapping
      (assuming that the fallback-consensus feature isn't yet used
      much).

    - These measurements also overestimate the V2 download rate if
      some downloads fail and clients retry them later after backing
      off.

Methods for directory guards:

  If directory guards are in use, directory guards get a picture of
  all those users who chose them as a guard when they were listed
  as a good choice for a guard, and who are also on the network
  now.  The cleanest data here will come from nodes that were listed
  as good new-guard choices for a while, and have not been so for a
  while longer (to study decay rates); nodes that have been listed
  as good new-guard choices consistently for a long time (to get a
  sample of the network); and nodes that have been listed as good
  new-guard choices only recently (to get a sample of new users and
  users whose guards have died out).

  Since directory guards are currently unspecified, we'll need to
  make some guesses about how they'll turn out to work.  Here are
  a couple of approaches that could work.
    - We could have clients pick completely new directory guards on
      a rolling basis every two months or so.  This would ensure
      that staying a guard for a while would be sufficient to
      see a sample of users.  This is potentially advantageous for
      load-balancing the network as well, though it might lose some
      of the benefits of directory guards.  We need to quantify the
      impact of this; it might not actually make stuff worse in
      practice, if most guards don't stay good guards for a month
      or two.

    - We could try to collect statistics at several directory
      guards and combine their statistics, but we would need to make
      sure that at all times, at least one of the directory guards
      had been recommended as a good choice for new guards.  By
      looking at new-IP rates for guards, we could get an idea of
      user uptake; by looking at old-IP decay rates, we could get
      an idea of turnover.  This approach would entail significant
      complexity, and we'd probably need to record more information
      than we'd really like to.  (A rough sketch of the bookkeeping
      this would involve appears below.)
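
  Since directory guards don't exist yet, the following is only a
  rough, hypothetical sketch (Python, not Tor code) of the kind of
  bookkeeping the second approach would need: the participating
  guards' per-period sets of observed client identifiers (IPs, or
  something less sensitive) are combined, and from the combined data
  we read off a new-identifier rate (uptake) and a decay rate
  (turnover).  Recording this much per client is exactly the
  information-retention concern raised above.

    # Hypothetical bookkeeping for the multi-guard approach; nothing
    # here is specified or implemented.  'periods' maps a period
    # index to the combined set of client identifiers seen by the
    # contributing guards in that period.
    def uptake_and_decay(periods):
        idx = sorted(periods)
        uptake, decay = [], []
        seen_before = set()
        for i, t in enumerate(idx):
            current = periods[t]
            # Uptake: identifiers never seen in any earlier period.
            uptake.append(len(current - seen_before))
            # Decay: identifiers present last period but gone now.
            if i > 0:
                decay.append(len(periods[idx[i - 1]] - current))
            seen_before |= current
        return uptake, decay

  Combining the per-guard sets before computing these rates is what
  makes it matter that at least one contributing guard was a
  recommended new-guard choice at every point in time.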