mirror of
https://gitlab.torproject.org/tpo/core/tor.git
synced 2024-11-11 05:33:47 +01:00
8de470cf69
ask about source, timestamp of arrival, purpose, etc. We need something like this to help Vidalia not do GeoIP lookups on bridge addresses. svn:r12687
395 lines
18 KiB
Plaintext
395 lines
18 KiB
Plaintext
Filename: 126-geoip-fetching.txt
|
|
Title: Getting GeoIP data and publishing usage summaries
|
|
Version: $Revision$
|
|
Last-Modified: $Date$
|
|
Author: Roger Dingledine
|
|
Created: 2007-11-24
|
|
Status: Open
|
|
|
|
1. Background and motivation
|
|
|
|
Right now we can keep a rough count of Tor users, both total and by
|
|
country, by watching connections to a single directory mirror. Being
|
|
able to get usage estimates is useful both for our funders (to
|
|
demonstrate progress) and for our own development (so we know how
|
|
quickly we're scaling and can design accordingly, and so we know which
|
|
countries and communities to focus on more). This need for information
|
|
is the only reason we haven't deployed "directory guards" (think of
|
|
them like entry guards but for directory information; in practice,
|
|
it would seem that Tor clients should simply use their entry guards
|
|
as their directory guards; see also proposal 125).
|
|
|
|
With the move toward bridges, we will no longer be able to track Tor
|
|
clients that use bridges, since they use their bridges as directory
|
|
guards. Further, we need to be able to learn which bridges stop seeing
|
|
use from certain countries (and are thus likely blocked), so we can
|
|
avoid giving them out to other users in those countries.
|
|
|
|
Right now we already do GeoIP lookups in Vidalia: Vidalia draws relays
|
|
and circuits on its 'network map', and it performs anonymized GeoIP
|
|
lookups to its central servers to know where to put the dots. Vidalia
|
|
caches answers it gets -- to reduce delay, to reduce overhead on
|
|
the network, and to reduce anonymity issues where users reveal their
|
|
knowledge about the network through which IP addresses they ask about.
|
|
|
|
But with the advent of bridges, Tor clients are asking about IP
|
|
addresses that aren't in the main directory. In particular, bridge
|
|
users inform the central Vidalia servers about each bridge as they
|
|
discover it and their Vidalia tries to map it.
|
|
|
|
Also, we wouldn't mind letting Vidalia do a GeoIP lookup on the client's
|
|
own IP address, so it can provide a more useful map.
|
|
|
|
Finally, Vidalia's central servers leave users open to partitioning
|
|
attacks, even if they can't target specific users. Further, as we
|
|
start using GeoIP results for more operational or security-relevant
|
|
goals, such as avoiding or including particular countries in circuits,
|
|
it becomes more important that users can't be singled out in terms of
|
|
their IP-to-country mapping beliefs.
|
|
|
|
2. The available GeoIP databases
|
|
|
|
There are at least two classes of GeoIP database out there: "IP to
|
|
country", which tells us the country code for the IP address but
|
|
no more details, and "IP to city", which tells us the country code,
|
|
the name of the city, and some basic latitude/longitude guesses.
|
|
|
|
A recent ip-to-country.csv is 3421362 bytes. Compressed, it is 564252
|
|
bytes. A typical line is:
|
|
"205500992","208605279","US","USA","UNITED STATES"
|
|
http://ip-to-country.webhosting.info/node/view/5
|
|
|
|
Similarly, the maxmind GeoLite Country database is also about 500KB
|
|
compressed.
|
|
http://www.maxmind.com/app/geolitecountry
|
|
|
|
The maxmind GeoLite City database gives more finegrained detail like
|
|
geo coordinates and city name. Vidalia currently makes use of this
|
|
information. On the other hand it's 16MB compressed. A typical line is:
|
|
206.124.149.146,Bellevue,WA,US,47.6051,-122.1134
|
|
http://www.maxmind.com/app/geolitecity
|
|
|
|
There are other databases out there, like
|
|
http://www.hostip.info/faq.html
|
|
http://www.webconfs.com/ip-to-city.php
|
|
that want more attention, but for now let's assume that all the db's
|
|
are around this size.
|
|
|
|
3. What we'd like to solve
|
|
|
|
Goal #1a: Tor relays collect IP-to-country user stats and publish
|
|
sanitized versions.
|
|
Goal #1b: Tor bridges collect IP-to-country user stats and publish
|
|
sanitized versions.
|
|
|
|
Goal #2a: Vidalia learns IP-to-city stats for Tor relays, for better
|
|
mapping.
|
|
Goal #2b: Vidalia learns IP-to-country stats for Tor relays, so the user
|
|
can pick countries for her paths.
|
|
|
|
Goal #3: Vidalia doesn't do external lookups on bridge relay addresses.
|
|
|
|
Goal #4: Vidalia resolves the Tor client's IP-to-country or IP-to-city
|
|
for better mapping.
|
|
|
|
Goal #5: Reduce partitioning opportunities where Vidalia central
|
|
servers can give different (distinguishing) responses.
|
|
|
|
4. Solution overview
|
|
|
|
Our goal is to allow Tor relays, bridges, and clients to learn enough
|
|
GeoIP information so they can do local private queries.
|
|
|
|
4.1. The IP-to-country db
|
|
|
|
Directory authorities should publish a "geoip" file that contains
|
|
IP-to-country mappings. Directory caches will mirror it, and Tor clients
|
|
and relays (including bridge relays) will fetch it. Thus we can solve
|
|
goals 1a and 1b (publish sanitized usage info). Controllers could also
|
|
use this to solve goal 2b (choosing path by country attributes). It
|
|
also solves goal 4 (learning the Tor client's country), though for
|
|
huge countries like the US we'd still need to decide where the "middle"
|
|
should be when we're mapping that address.
|
|
|
|
The IP-to-country details are described further in Sections 5 and
|
|
6 below.
|
|
|
|
4.2. The IP-to-city db
|
|
|
|
In an ideal world, the IP-to-city db would be small enough that we
|
|
could distribute it in the above manner too. But for now, it is too
|
|
large. Here's where the design choice forks.
|
|
|
|
Option A: Vidalia should continue doing its anonymized IP-to-city
|
|
queries. Thus we can achieve goals 2a and 2b. We would solve goal
|
|
3 by only doing lookups on descriptors that are purpose "general"
|
|
(see Section 4.2.1 for how). We would leave goal 5 unsolved.
|
|
|
|
Option B: Each directory authority should keep an IP-to-city db,
|
|
lookup the value for each router it lists, and include that line in
|
|
the router's network-status entry. The network-status consensus would
|
|
then use the line that appears in the majority of votes. This approach
|
|
also solves goals 2a and 2b, goal 3 (Vidalia doesn't do any lookups
|
|
at all now), and goal 5 (reduced partitioning risks).
|
|
|
|
Option B has the advantage that Vidalia can simplify its operation,
|
|
and the advantage that this consensus IP-to-city data is available to
|
|
other controllers besides just Vidalia. But it has the disadvantage
|
|
that the networkstatus consensus becomes larger, even though most of
|
|
the GeoIP information won't change from one consensus to the next. Is
|
|
there another reasonable location for it that can provide similar
|
|
consensus security properties?
|
|
|
|
4.2.1. Controllers can query for router annotations
|
|
|
|
Vidalia needs to stop doing queries on bridge relay IP addresses.
|
|
It could do that by only doing lookups on descriptors that are in
|
|
the networkstatus consensus, but that precludes designs like Blossom
|
|
that might want to map its relay locations. The best answer is that it
|
|
should learn the router annotations, with a new controller 'getinfo'
|
|
command:
|
|
"GETINFO desc-annotations/id/<OR identity>"
|
|
which would respond with something like
|
|
@downloaded-at 2007-11-29 08:06:38
|
|
@source "128.31.0.34"
|
|
@purpose bridge
|
|
|
|
[We could also make the answer include the digest for the router in
|
|
question, which would enable us to ask GETINFO router-annotations/all.
|
|
Is this worth it? -RD]
|
|
|
|
Then Vidalia can avoid doing lookups on descriptors with purpose
|
|
"bridge". Even better would be to add a new annotation "@private true"
|
|
so Vidalia can know how to handle new purposes that we haven't created
|
|
yet. Vidalia could special-case "bridge" for now, for compatibility
|
|
with the current 0.2.0.x-alphas.
|
|
|
|
4.3. Recommendation
|
|
|
|
My overall recommendation is that we should implement 4.1 soon
|
|
(e.g. early in 0.2.1.x), and we can go with 4.2 option A for now,
|
|
with the hope that later we discover a better way to distribute the
|
|
IP-to-city info and can switch to 4.2 option B.
|
|
|
|
Below we discuss more how to go about achieving 4.1.
|
|
|
|
5. Publishing and caching the GeoIP (IP-to-country) database
|
|
|
|
Each v3 directory authority should put a copy of the "geoip" file in
|
|
its datadirectory. Then its network-status votes should include a hash
|
|
of this file (Recommended-geoip-hash: %s), and the resulting consensus
|
|
directory should specify the consensus hash.
|
|
|
|
There should be a new URL for fetching this geoip db (by "current.z"
|
|
for testing purposes, and by hash.z for typical downloads). Authorities
|
|
should fetch and serve the one listed in the consensus, even when they
|
|
vote for their own. This would argue for storing the cached version
|
|
in a better filename than "geoip".
|
|
|
|
Directory mirrors should keep a copy of this file available via the
|
|
same URLs.
|
|
|
|
We assume that the file would change at most a few times a month. Should
|
|
Tor ship with a bootstrap geoip file? An out-of-date geoip file may
|
|
open you up to partitioning attacks, but for the most part it won't
|
|
be that different.
|
|
|
|
There should be a config option to disable updating the geoip file,
|
|
in case users want to use their own file (e.g. they have a proprietary
|
|
GeoIP file they prefer to use). In that case we leave it up to the
|
|
user to update his geoip file out-of-band.
|
|
|
|
[XXX Should consider forward/backward compatibility, e.g. if we want
|
|
to move to a new geoip file format. -RD]
|
|
|
|
6. Controllers use the IP-to-country db for mapping and for path building
|
|
|
|
Down the road, Vidalia could use the IP-to-country mappings for placing
|
|
on its map:
|
|
- The location of the client
|
|
- The location of the bridges, or other relays not in the
|
|
networkstatus, on the map.
|
|
- Any relays that it doesn't yet have an IP-to-city answer for.
|
|
|
|
Other controllers can also use it to set EntryNodes, ExitNodes, etc
|
|
in a per-country way.
|
|
|
|
To support these features, we need to export the IP-to-country data
|
|
via the Tor controller protocol.
|
|
|
|
Is it sufficient just to add a new GETINFO command?
|
|
GETINFO ip-to-country/128.31.0.34
|
|
250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"
|
|
|
|
6.1. Other interfaces
|
|
|
|
Robert Hogan has also suggested a
|
|
|
|
GETINFO relays-by-country/cn
|
|
|
|
as well as torrc options for ExitCountryCodes, EntryCountryCodes,
|
|
ExcludeCountryCodes, etc.
|
|
|
|
7. Relays and bridges use the IP-to-country db for usage summaries
|
|
|
|
Once bridges have a GeoIP database locally, they can start to publish
|
|
sanitized summaries of client usage -- how many users they see and from
|
|
what countries. This might also be a more useful way for ordinary Tor
|
|
relays to convey the level of usage they see, which would allow us to
|
|
switch to using directory guards for all users by default.
|
|
|
|
But how to safely summarize this information without opening too many
|
|
anonymity leaks?
|
|
|
|
7.1 Attacks to think about
|
|
|
|
First, note that we need to have a large enough time window that we're
|
|
not aiding correlation attacks much. I hope 24 hours is enough. So
|
|
that means no publishing stats until you've been up at least 24 hours.
|
|
And you can't publish follow-up stats more often than every 24 hours,
|
|
or people could look at the differential.
|
|
|
|
Second, note that we need to be sufficiently vague about the IP
|
|
addresses we're reporting. We are hoping that just specifying the
|
|
country will be vague enough. But a) what about active attacks where
|
|
we convince a bridge to use a GeoIP db that labels each suspect IP
|
|
address as a unique country? We have to assume that the consensus GeoIP
|
|
db won't be malicious in this way. And b) could such singling-out
|
|
attacks occur naturally, for example because of countries that have
|
|
a very small IP space? We should investigate that.
|
|
|
|
7.2. Granularity of users
|
|
|
|
Do we only want to report countries that have a sufficient anonymity set
|
|
(that is, number of users) for the day? For example, we might avoid
|
|
listing any countries that have seen less than five addresses over
|
|
the 24 hour period. This approach would be helpful in reducing the
|
|
singling-out opportunities -- in the extreme case, we could imagine a
|
|
situation where one blogger from the Sudan used Tor on a given day, and
|
|
we can discover which entry guard she used.
|
|
|
|
But I fear that especially for bridges, seeing only one hit from a
|
|
given country in a given day may be quite common.
|
|
|
|
As a compromise, we should start out with an "Other" category in
|
|
the reported stats, which is the sum of unlisted countries; if that
|
|
category is consistently interesting, we can think harder about how
|
|
to get the right data from it safely.
|
|
|
|
But note that bridge summaries will not be made public individually,
|
|
since doing so would help people enumerate bridges. Whereas summaries
|
|
from normal relays will be public. So perhaps that means we can afford
|
|
to be more specific in bridge summaries? In particular, I'm thinking the
|
|
"other" category should be used by public relays but not for bridges
|
|
(or if it is, used with a lower threshold).
|
|
|
|
Even for countries that have many Tor users, we might not want to be
|
|
too specific about how many users we've seen. For example, we might
|
|
round down the number of users we report to the nearest multiple of 5.
|
|
My instinct for now is that this won't be that useful.
|
|
|
|
7.3 Other issues
|
|
|
|
Another note: we'll likely be overreporting in the case of users with
|
|
dynamic IP addresses: if they rotate to a new address over the course
|
|
of the day, we'll count them twice. So be it.
|
|
|
|
7.4. Where to publish the summaries?
|
|
|
|
We designed extrainfo documents for information like this. So they
|
|
should just be more entries in the extrainfo doc.
|
|
|
|
But if we want to publish summaries every 24 hours (no more often,
|
|
no less often), aren't we tried to the router descriptor publishing
|
|
schedule? That is, if we publish a new router descriptor at the 18
|
|
hour mark, and nothing much has changed at the 24 hour mark, won't
|
|
the new descriptor get dropped as being "cosmetically similar", and
|
|
then nobody will know to ask about the new extrainfo document?
|
|
|
|
One solution would be to make and remember the 24 hour summary at the
|
|
24 hour mark, but not actually publish it anywhere until we happen to
|
|
publish a new descriptor for other reasons. If we happen to go down
|
|
before publishing a new descriptor, then so be it, at least we tried.
|
|
|
|
7.5. What if the relay is unreachable or goes to sleep?
|
|
|
|
Even if you've been up for 24 hours, if you were hibernating for 18
|
|
of them, then we're not getting as much fuzziness as we'd like. So
|
|
I guess that means that we need a 24-hour period of being "awake"
|
|
before we'll willing to publish a summary. A similar attack works if
|
|
you've been awake but unreachable for the first 18 of the 24 hours. As
|
|
another example, a bridge that's on a laptop might be suspended for
|
|
some of each day.
|
|
|
|
This implies that some relays and bridges will never publish summary
|
|
stats, because they're not ever reliably working for 24 hours in
|
|
a row. If a significant percentage of our reporters end up being in
|
|
this boat, we should investigate whether we can accumulate 24 hours of
|
|
"usefulness", even if there are holes in the middle, and publish based
|
|
on that.
|
|
|
|
What other issues are like this? It seems that just moving to a new
|
|
IP address shouldn't be a reason to cancel stats publishing, assuming
|
|
we were usable at each address.
|
|
|
|
7.6. IP addresses that aren't in the geoip db
|
|
|
|
Some IP addresses aren't in the public geoip databases. In particular,
|
|
I've found that a lot of African countries are missing, but there
|
|
are also some common ones in the US that are missing, like parts of
|
|
Comcast. We could just lump unknown IP addresses into the "other"
|
|
category, but it might be useful to gather a general sense of how many
|
|
lookups are failing entirely, by adding a separate "Unknown" category.
|
|
|
|
We could also contribute back to the geoip db, by letting bridges set
|
|
a config option to report the actual IP addresses that failed their
|
|
lookup. Then the bridge authority operators can manually make sure
|
|
the correct answer will be in later geoip files. This config option
|
|
should be disabled by default.
|
|
|
|
7.7 Bringing it all together
|
|
|
|
So here's the plan:
|
|
|
|
24 hours after starting up (modulo Section 7.5 above), bridges and
|
|
relays should construct a daily summary of client countries they've
|
|
seen, including the above "Unknown" category (Section 7.6) as well.
|
|
|
|
Non-bridge relays lump all countries with less than K (e.g. K=5) users
|
|
into the "Other" category (see Sec 7.2 above), whereas bridge relays are
|
|
willing to list a country even when it has only one user for the day.
|
|
|
|
Whenever we have a daily summary on record, we include it in our
|
|
extrainfo document whenever we publish one. The daily summary we
|
|
remember locally gets replaced with a newer one when another 24
|
|
hours pass.
|
|
|
|
7.8. Some forward secrecy
|
|
|
|
How should we remember addresses locally? If we convert them into
|
|
country-codes immediately, we will count them again if we see them
|
|
again. On the other hand, we don't really want to keep a list hanging
|
|
around of all IP addresses we've seen in the past 24 hours.
|
|
|
|
Step one is that we should never write this stuff to disk. Keeping it
|
|
only in ram will make things somewhat better. Step two is to avoid
|
|
keeping any timestamps associated with it: rather than a rolling
|
|
24-hour window, which would require us to remember the various times
|
|
we've seen that address, we can instead just throw out the whole list
|
|
every 24 hours and start over.
|
|
|
|
We could hash the addresses, and then compare hashes when deciding if
|
|
we've seen a given address before. We could even do keyed hashes. Or
|
|
Bloom filters. But if our goal is to defend against an adversary
|
|
who steals a copy of our ram while we're running and then does
|
|
guess-and-check on whatever blob we're keeping, we're in bad shape.
|
|
|
|
We could drop the last octet of the IP address as soon as we see
|
|
it. That would cause us to undercount some users from cablemodem and
|
|
DSL networks that have a high density of Tor users. And it wouldn't
|
|
really help that much -- indeed, the extent to which it does help is
|
|
exactly the extent to which it makes our stats less useful.
|
|
|
|
Other ideas?
|
|
|