tor/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt



Abstract

   This document explains how to tell about how many Tor users there
   are, and how many there are in which country.  Statistics are
   involved.

Motivation

   There are a few reasons we need to keep track of which countries
   Tor users (in aggregate) are coming from:

      - Resource allocation.  Knowing about underserved countries with
        lots of users can let us know about where we need to direct
        translation and outreach efforts.

      - Anticensorship.  Sudden drops in usage on a national basis can
        indicate the arrival of a censorious firewall.

      - Sponsor outreach and self-evalutation.  Many people and
        organizations who are interested in funding The Tor Project's
        work want to know that we're successfully serving parts of the
        world they're interested in, and that efforts to expand our
        userbase are actually succeeding.  So do we.

Goals

   We want to know about how many Tor users there are, and which
   countries they're in, even in the presence of a hypothetical
   "directory guard" feature.  Some uncertainty is okay, but we'd like
   to be able to put a bound on the uncertainty.

   We need to make sure this information isn't exposed in a way that
   helps an adversary.

Methods for curent clients:

   Every client downloads network status documents.  There are
   currently three methods (one hypothetical) for clients to get them.
      - 0.1.2.x clients (and earlier) fetch a v2 networkstatus
        document about every NETWORKSTATUS_CLIENT_DL_INTERVAL [30
        minutes].

      - 0.2.0.x clients fetch a v3 networkstatus consensus document
        at a random interval between when their current document is no
        longer freshest, and when their current document is about to
        expire.

        [In both of the above cases, clients choose a running
        directory cache at random with odds roughly proportional to
        its bandwidth.]

      - In some future version, clients will choose directory caches
        to serve as their "directory guards" to avoid profiling
        attacks, similarly to how clients currently start all their
        circuits at guard nodes.

    We assume that a directory cache can tell which of these three
    categories a client is in by the format of its status request.

    A directory cache can be made to count distinct client IP
    addresses that make a certain request of it in a given timeframe,
    and total requests made to it over that timeframe.  For the first
    two cases, a cache can get a  picture of the overall
    number and countries of users in the network by dividing the IP
    count by the probability with which they (as a cache) would be
    chosen.  Assuming that our listed bandwidth is such that we expect
    to be chosen with probability P for any given request, and we've
    been counting IPs for long enough that we expect the average
    client to have made N requests, they will have visited us at least
    once with probability P' = 1-(1-P)^N, and so we divide the IP
    counts we've seen by P' for our estimate.  To estimate total
    number of clients of a given type, determine how many requests a
    client of that type will make over that time, and assume we'll
    have seen P of them.

    Both of these numbers are useful: the IP counts will give the
    total number of IPs connecting to the network, and the request
    counts will give the total number of users on the network at any
    given time.

    Notes:
       - [Over H hours, the N for V2 clients is 2*H, and the N for V3
         clients is currently around N/2 or N/3. [***FIGURE THIS
         OUT***XXXX]]

       - (We should only count requests that we actually intend to answer;
         503 requests shouldn't count.)

       - These measurements *shouldn't* be taken at directory
         authorities: their picture of the network is too skewed by the
         special cases in which clients fetch from them directly.

Methods for directory guards:

    If directory guards are in use, directory guards get a picture of
    all those users who chose them as a guard when they were listed
    as a good choice for a guard, and who are also on the network
    now.  The cleanest data here will come from nodes that were listed
    as good new-guards choices for a while, and have not been so for a
    while longer (to study decay rates); nodes that have been listed
    as good new-guard choices consistently for a long time (to get a
    sample of the network); and nodes that have been listed as good
    new-guard choices only recently (to get a sample of new users and
    users whose guards have died out.)

    Since directory guards are currently unspecified, we'll need to
    make some guesses about how they'll turn out to work.  Here are
    a couple of approaches that could work.
       - We could have clients pick completely new directory guards on
         a rolling basis every two months or so.  This would ensure
         that staying as a guard for a while would be sufficient to
         see a sample of users.  This is potentially advantageous for
         load-balancing the network as well, though it might lose some
         of the benefits of directory guard.  We need to quantify the
         impact of this; it might not actually make stuff worse in
         practice, if most guards don't stay good guards for a month
         or two.

       - We could try to collect statistics at several directory
         guards and combine their statisics, but we would need to make
         sure that for all time, at least one of the directory guards
         had been recommended as a good choice for new guards.  By
         looking at new-IP rates for guards, we could get an idea of
         user uptake; for looking at old-IP decay rates, we could get
         an idea of turnover.  This approach would entail significant
         complexity, and we'd probably need to record more information
         than we'd really like to.
Add proposed methodolody for tracking national usage trends. svn:r14578 2008-05-08 06:13:36 +02:00

			`Abstract`

			`This document explains how to tell about how many Tor users there`
			`are, and how many there are in which country. Statistics are`
			`involved.`

			`Motivation`

			`There are a few reasons we need to keep track of which countries`
			`Tor users (in aggregate) are coming from:`

			`- Resource allocation. Knowing about underserved countries with`
			`lots of users can let us know about where we need to direct`
			`translation and outreach efforts.`

			`- Anticensorship. Sudden drops in usage on a national basis can`
			`indicate the arrival of a censorious firewall.`

			`- Sponsor outreach and self-evalutation. Many people and`
			`organizations who are interested in funding The Tor Project's`
			`work want to know that we're successfully serving parts of the`
			`world they're interested in, and that efforts to expand our`
r16178@tombo: nickm \| 2008-06-11 16:33:06 -0400 Update geoip proposal draft to more closely match reality , and include slightly better ideas about dir guards. svn:r15142 2008-06-11 22:44:22 +02:00			`userbase are actually succeeding. So do we.`
Add proposed methodolody for tracking national usage trends. svn:r14578 2008-05-08 06:13:36 +02:00
			`Goals`

			`We want to know about how many Tor users there are, and which`
			`countries they're in, even in the presence of a hypothetical`
			`"directory guard" feature. Some uncertainty is okay, but we'd like`
			`to be able to put a bound on the uncertainty.`

			`We need to make sure this information isn't exposed in a way that`
			`helps an adversary.`

r16178@tombo: nickm \| 2008-06-11 16:33:06 -0400 Update geoip proposal draft to more closely match reality , and include slightly better ideas about dir guards. svn:r15142 2008-06-11 22:44:22 +02:00			`Methods for curent clients:`
Add proposed methodolody for tracking national usage trends. svn:r14578 2008-05-08 06:13:36 +02:00
			`Every client downloads network status documents. There are`
			`currently three methods (one hypothetical) for clients to get them.`
			`- 0.1.2.x clients (and earlier) fetch a v2 networkstatus`
			`document about every NETWORKSTATUS_CLIENT_DL_INTERVAL [30`
			`minutes].`

			`- 0.2.0.x clients fetch a v3 networkstatus consensus document`
			`at a random interval between when their current document is no`
			`longer freshest, and when their current document is about to`
			`expire.`

r16178@tombo: nickm \| 2008-06-11 16:33:06 -0400 Update geoip proposal draft to more closely match reality , and include slightly better ideas about dir guards. svn:r15142 2008-06-11 22:44:22 +02:00			`[In both of the above cases, clients choose a running`
			`directory cache at random with odds roughly proportional to`
			`its bandwidth.]`
Add proposed methodolody for tracking national usage trends. svn:r14578 2008-05-08 06:13:36 +02:00
			`- In some future version, clients will choose directory caches`
			`to serve as their "directory guards" to avoid profiling`
			`attacks, similarly to how clients currently start all their`
			`circuits at guard nodes.`

			`We assume that a directory cache can tell which of these three`
			`categories a client is in by the format of its status request.`

			`A directory cache can be made to count distinct client IP`
r16178@tombo: nickm \| 2008-06-11 16:33:06 -0400 Update geoip proposal draft to more closely match reality , and include slightly better ideas about dir guards. svn:r15142 2008-06-11 22:44:22 +02:00			`addresses that make a certain request of it in a given timeframe,`
			`and total requests made to it over that timeframe. For the first`
			`two cases, a cache can get a picture of the overall`
Add proposed methodolody for tracking national usage trends. svn:r14578 2008-05-08 06:13:36 +02:00			`number and countries of users in the network by dividing the IP`
			`count by the probability with which they (as a cache) would be`
			`chosen. Assuming that our listed bandwidth is such that we expect`
			`to be chosen with probability P for any given request, and we've`
			`been counting IPs for long enough that we expect the average`
			`client to have made N requests, they will have visited us at least`
			`once with probability P' = 1-(1-P)^N, and so we divide the IP`
r16178@tombo: nickm \| 2008-06-11 16:33:06 -0400 Update geoip proposal draft to more closely match reality , and include slightly better ideas about dir guards. svn:r15142 2008-06-11 22:44:22 +02:00			`counts we've seen by P' for our estimate. To estimate total`
			`number of clients of a given type, determine how many requests a`
			`client of that type will make over that time, and assume we'll`
			`have seen P of them.`

			`Both of these numbers are useful: the IP counts will give the`
			`total number of IPs connecting to the network, and the request`
			`counts will give the total number of users on the network at any`
			`given time.`

			`Notes:`
			`- [Over H hours, the N for V2 clients is 2*H, and the N for V3`
			`clients is currently around N/2 or N/3. [***FIGURE THIS`
			`OUT***XXXX]]`

			`- (We should only count requests that we actually intend to answer;`
			`503 requests shouldn't count.)`

			`- These measurements shouldn't be taken at directory`
			`authorities: their picture of the network is too skewed by the`
			`special cases in which clients fetch from them directly.`

			`Methods for directory guards:`
Add proposed methodolody for tracking national usage trends. svn:r14578 2008-05-08 06:13:36 +02:00
			`If directory guards are in use, directory guards get a picture of`
			`all those users who chose them as a guard when they were listed`
			`as a good choice for a guard, and who are also on the network`
			`now. The cleanest data here will come from nodes that were listed`
			`as good new-guards choices for a while, and have not been so for a`
			`while longer (to study decay rates); nodes that have been listed`
			`as good new-guard choices consistently for a long time (to get a`
			`sample of the network); and nodes that have been listed as good`
			`new-guard choices only recently (to get a sample of new users and`
			`users whose guards have died out.)`

r16178@tombo: nickm \| 2008-06-11 16:33:06 -0400 Update geoip proposal draft to more closely match reality , and include slightly better ideas about dir guards. svn:r15142 2008-06-11 22:44:22 +02:00			`Since directory guards are currently unspecified, we'll need to`
			`make some guesses about how they'll turn out to work. Here are`
			`a couple of approaches that could work.`
			`- We could have clients pick completely new directory guards on`
			`a rolling basis every two months or so. This would ensure`
			`that staying as a guard for a while would be sufficient to`
			`see a sample of users. This is potentially advantageous for`
			`load-balancing the network as well, though it might lose some`
			`of the benefits of directory guard. We need to quantify the`
			`impact of this; it might not actually make stuff worse in`
			`practice, if most guards don't stay good guards for a month`
			`or two.`

			`- We could try to collect statistics at several directory`
			`guards and combine their statisics, but we would need to make`
			`sure that for all time, at least one of the directory guards`
			`had been recommended as a good choice for new guards. By`
			`looking at new-IP rates for guards, we could get an idea of`
			`user uptake; for looking at old-IP decay rates, we could get`
			`an idea of turnover. This approach would entail significant`
			`complexity, and we'd probably need to record more information`
			`than we'd really like to.`

Add proposed methodolody for tracking national usage trends. svn:r14578 2008-05-08 06:13:36 +02:00