Filename: 166-statistics-extra-info-docs.txt
Title: Including Network Statistics in Extra-Info Documents
Author: Karsten Loesing
Created: 21-Jul-2009
Target: 0.2.2
Status: Open

Change history:

  21-Jul-2009  Initial proposal for or-dev

Overview:

  The Tor network has grown to almost two thousand relays and millions
  of casual users over the past few years. With growth have come
  increasing performance problems and attempts by some countries to
  block access to the Tor network. In order to address these problems,
  we need to learn more about the Tor network. This proposal suggests
  measuring additional statistics and including them in extra-info
  documents to help us understand the Tor network better.

Introduction:

  As of May 2009, relays, bridges, and directories gather the following
  data for statistical purposes:

  - Relays and bridges count the number of bytes that they have pushed
    in 15-minute intervals over the past 24 hours. Relays and bridges
    include these data in extra-info documents that they send to the
    directory authorities whenever they publish their server
    descriptor.

  - Bridges further include a rough number of clients per country that
    they have seen in the past 48 hours in their extra-info documents.

  - Directories can be configured to count the number of clients they
    see per country in the past 24 hours and to write them to a local
    file.

  Since then, we have extended the network statistics in Tor. These
  statistics include:

  - Directories now gather more precise statistics about connecting
    clients. Fixes include measuring in intervals of exactly 24 hours,
    counting unsuccessful requests, measuring download times, etc. The
    directories append their statistics to a local file every 24 hours.

  - Entry guards count the number of clients per country per day like
    bridges do and write them to a local file every 24 hours.

  - Relays measure statistics of the number of cells in their circuit
    queues and how much time these cells spend waiting there.
    Relays write these statistics to a local file every 24 hours.

  - Exit nodes count the number of read and written bytes on exit
    connections per port as well as the number of opened exit streams
    per port in 24-hour intervals. Exit nodes write their statistics to
    a local file.

  The following four sections contain descriptions for adding these
  statistics to the relays' extra-info documents.

Directory request statistics:

  The first type of statistics aims at measuring directory requests
  sent by clients to a directory mirror or directory authority. More
  precisely, these statistics cover requests for v2 and v3 network
  statuses only. These directory requests are sent non-anonymously,
  either via HTTP-like requests to a directory's Dir port or tunneled
  over a 1-hop circuit.

  Measuring directory request statistics is useful for several reasons:
  First, the number of locally seen directory requests can be used to
  estimate the total number of clients in the Tor network. Second, the
  country-wise classification of requests using a GeoIP database can
  help to count the relative and absolute number of users per country.
  Third, the download times can give hints on the available bandwidth
  capacity at clients.

  Directory requests do not give any hints on the contents that clients
  send or receive over the Tor network. Every client requests network
  statuses from the directories, so there are no anonymity-related
  concerns about gathering these statistics. It might be, though, that
  clients wish to hide the fact that they are connecting to the Tor
  network. Therefore, IP addresses are resolved to country codes in
  memory, events are accumulated over 24 hours, and numbers are rounded
  up to multiples of 4 or 8.

  "dirreq-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
    [At most once.]

    YYYY-MM-DD HH:MM:SS defines the end of the included measurement
    interval of length NSEC seconds (86400 seconds by default).

    A "dirreq-stats-end" line, as well as any other "dirreq-*" line,
    is only added when the relay has opened its Dir port and after 24
    hours of measuring directory requests.

  "dirreq-v2-ips" CC=N,CC=N,... NL
    [At most once.]
  "dirreq-v3-ips" CC=N,CC=N,... NL
    [At most once.]

    List of mappings from two-letter country codes to the number of
    unique IP addresses that have connected from that country to
    request a v2/v3 network status, rounded up to the nearest multiple
    of 8. Only those IP addresses are counted that the directory can
    answer with a 200 OK status code.

  "dirreq-v2-reqs" CC=N,CC=N,... NL
    [At most once.]
  "dirreq-v3-reqs" CC=N,CC=N,... NL
    [At most once.]

    List of mappings from two-letter country codes to the number of
    requests for v2/v3 network statuses from that country, rounded up
    to the nearest multiple of 8. Only those requests are counted that
    the directory can answer with a 200 OK status code.

  "dirreq-v2-share" num% NL
    [At most once.]
  "dirreq-v3-share" num% NL
    [At most once.]

    The share of v2/v3 network status requests that the directory
    expects to receive from clients based on its advertised bandwidth
    compared to the overall network bandwidth capacity. Shares are
    formatted in percent with two decimal places. Shares are calculated
    as means over the whole 24-hour interval.

  "dirreq-v2-resp" status=num,... NL
    [At most once.]
  "dirreq-v3-resp" status=num,... NL
    [At most once.]

    List of mappings from response statuses to the number of requests
    for v2/v3 network statuses that were answered with that response
    status, rounded up to the nearest multiple of 4. Only response
    statuses with at least 1 response are reported. New response
    statuses can be added at any time. The current list of response
    statuses is as follows:

    "ok": a network status request is answered; this number
      corresponds to the sum of all requests as reported in
      "dirreq-v2-reqs" or "dirreq-v3-reqs", respectively, before
      rounding up.
    "not-enough-sigs": a version 3 network status is not signed by a
      sufficient number of requested authorities.
    "unavailable": a requested network status object is unavailable.
    "not-found": a requested network status is not found.
    "not-modified": a network status has not been modified since the
      If-Modified-Since time that is included in the request.
    "busy": the directory is busy.

  "dirreq-v2-direct-dl" key=val,... NL
    [At most once.]
  "dirreq-v3-direct-dl" key=val,... NL
    [At most once.]
  "dirreq-v2-tunneled-dl" key=val,... NL
    [At most once.]
  "dirreq-v3-tunneled-dl" key=val,... NL
    [At most once.]

    List of statistics about possible failures in the download process
    of v2/v3 network statuses. Requests are either "direct"
    HTTP-encoded requests over the relay's directory port, or
    "tunneled" requests using a BEGIN_DIR cell over the relay's OR
    port. The list of possible statistics can change, and statistics
    can be left out from reporting. The current list of statistics is
    as follows:

    Successful downloads and failures:

    "complete": a client has finished the download successfully.
    "timeout": a download did not finish within 10 minutes after
      starting to send the response.
    "running": a download is still running at the end of the
      measurement period and has been running for less than 10 minutes
      since the directory started to send the response.

    Download times:

    "min", "max": smallest and largest measured bandwidth in B/s.
    "d[1-4,6-9]": 1st to 4th and 6th to 9th decile of measured
      bandwidth in B/s. For a given decile i, i/10 of all downloads had
      a smaller bandwidth than di, and (10-i)/10 of all downloads had a
      larger bandwidth than di.
    "q[1,3]": 1st and 3rd quartile of measured bandwidth in B/s. One
      fourth of all downloads had a smaller bandwidth than q1, one
      fourth of all downloads had a larger bandwidth than q3, and the
      remaining half of all downloads had a bandwidth between q1 and
      q3.
    "md": median of measured bandwidth in B/s.
      Half of the downloads had a smaller bandwidth than md, the other
      half had a larger bandwidth than md.

Entry guard statistics:

  Entry guard statistics include the number of clients per country and
  per day that are connecting directly to an entry guard.

  Entry guard statistics are important to learn more about the
  distribution of clients across countries. In the future, this
  knowledge can be useful to detect whether clients connecting from
  specific countries are, or start to be, restricted.

  Information about which clients connect to a given entry guard is
  very sensitive. It must not be combined with information about what
  contents are leaving the network at the exit nodes. Therefore, entry
  guard statistics need to be aggregated to prevent them from becoming
  useful for de-anonymization. Aggregation includes resolving IP
  addresses to country codes, counting events over 24-hour intervals,
  and rounding up numbers to the next multiple of 8.

  "entry-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
    [At most once.]

    YYYY-MM-DD HH:MM:SS defines the end of the included measurement
    interval of length NSEC seconds (86400 seconds by default).

    An "entry-stats-end" line, as well as any other "entry-*" line,
    is first added after the relay has been running for at least 24
    hours.

  "entry-ips" CC=N,CC=N,... NL
    [At most once.]

    List of mappings from two-letter country codes to the number of
    unique IP addresses that have connected from that country to the
    relay and that are not known to be other relays, rounded up to the
    nearest multiple of 8.

Cell statistics:

  The third type of statistics has to do with the time that cells spend
  in circuit queues. In order to gather these statistics, the relay
  memorizes when it puts a given cell in a circuit queue and when this
  cell is flushed. The relay further notes the lifetime of the circuit.
  These data are sufficient to determine the mean number of cells in a
  queue over time and the mean time that cells spend in a queue.
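
  As a rough illustration of the two means described above, the
  following Python sketch derives both from per-cell queue and flush
  times plus the circuit's lifetime. The CircuitRecord type, its field
  names, and the millisecond timestamps are assumptions made for this
  example and do not reflect Tor's actual data structures. The sketch
  relies on the observation that the time-averaged queue length equals
  the summed waiting time of all cells divided by the circuit's
  lifetime.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CircuitRecord:
    """Hypothetical per-circuit record; names are illustrative only."""
    created_ms: int                 # when the circuit was created
    closed_ms: int                  # when the circuit was terminated
    cells: List[Tuple[int, int]]    # (queued_ms, flushed_ms) per cell


def total_waiting_time_ms(circ: CircuitRecord) -> int:
    """Sum of the time every cell spent in the circuit queue."""
    return sum(flushed - queued for queued, flushed in circ.cells)


def mean_time_in_queue_ms(circ: CircuitRecord) -> float:
    """Mean time a cell spent waiting in this circuit's queue."""
    if not circ.cells:
        return 0.0
    return total_waiting_time_ms(circ) / len(circ.cells)


def mean_cells_in_queue(circ: CircuitRecord) -> float:
    """Mean number of queued cells between creation and termination."""
    lifetime = circ.closed_ms - circ.created_ms
    if lifetime <= 0:
        return 0.0
    return total_waiting_time_ms(circ) / lifetime


circ = CircuitRecord(
    created_ms=0,
    closed_ms=1000,
    cells=[(100, 300), (200, 300), (600, 900)],
)
print(mean_time_in_queue_ms(circ))
print(round(mean_cells_in_queue(circ), 2))
```

  Per-decile values would then be means of these per-circuit results
  over all circuits in a decile.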

  Cell statistics are necessary to learn more about possible reasons
  for the poor performance of the Tor network, especially high
  latencies. The same statistics are also useful to determine the
  effects of design changes by comparing today's data with future data.

  There are basically no privacy concerns from measuring cell
  statistics, regardless of a node being an entry, middle, or exit
  node.

  "cell-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
    [At most once.]

    YYYY-MM-DD HH:MM:SS defines the end of the included measurement
    interval of length NSEC seconds (86400 seconds by default).

    A "cell-stats-end" line, as well as any other "cell-*" line, is
    first added after the relay has been running for at least 24 hours.

  "cell-processed-cells" num,...,num NL
    [At most once.]

    Mean number of processed cells per circuit, subdivided into deciles
    of circuits by the number of cells they have processed in
    descending order from loudest to quietest circuits.

  "cell-queued-cells" num,...,num NL
    [At most once.]

    Mean number of cells contained in queues by circuit decile. These
    means are calculated by 1) determining the mean number of cells in
    a single circuit between its creation and its termination and
    2) calculating the mean for all circuits in a given decile as
    determined in "cell-processed-cells". Numbers have a precision of
    two decimal places.

  "cell-time-in-queue" num,...,num NL
    [At most once.]

    Mean time cells spend in circuit queues in milliseconds. Times are
    calculated by 1) determining the mean time cells spend in the queue
    of a single circuit and 2) calculating the mean for all circuits in
    a given decile as determined in "cell-processed-cells".

  "cell-circuits-per-decile" num NL
    [At most once.]

    Mean number of circuits that are included in any of the deciles,
    rounded up to the next integer.

Exit statistics:

  The last type of statistics concerns exit nodes, which count the
  number of bytes written and read and the number of streams opened per
  port and per 24 hours.
  Exit port statistics can be measured by looking at the headers of
  BEGIN and DATA cells. A BEGIN cell contains the exit port that is
  required for the exit node to open a new exit stream. Subsequent DATA
  cells coming from the client or being sent back to the client contain
  a length field stating how many bytes of application data are
  contained in the cell.

  Exit port statistics are important to measure in order to identify
  possible load-balancing problems with respect to exit policies. Exit
  nodes that permit more ports than others are very likely overloaded
  with traffic for those ports plus traffic for other ports. Improving
  load balancing in the Tor network improves the overall utilization of
  bandwidth capacity.

  Exit traffic is one of the most sensitive parts of network data in
  the Tor network. Even though these statistics do not require looking
  at traffic contents, statistics are aggregated so that they are not
  useful for de-anonymizing users. Only those ports are reported that
  have seen at least 0.1% of exiting or incoming bytes, numbers of
  bytes are rounded up to full kibibytes (KiB), and stream numbers are
  rounded up to the next multiple of 4.

  "exit-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
    [At most once.]

    YYYY-MM-DD HH:MM:SS defines the end of the included measurement
    interval of length NSEC seconds (86400 seconds by default).

    An "exit-stats-end" line, as well as any other "exit-*" line, is
    first added after the relay has been running for at least 24 hours
    and only if the relay permits exiting (where exiting to a single
    port and IP address is sufficient).

  "exit-kibibytes-written" port=N,port=N,... NL
    [At most once.]
  "exit-kibibytes-read" port=N,port=N,... NL
    [At most once.]

    List of mappings from ports to the number of kibibytes that the
    relay has written to or read from exit connections to that port,
    rounded up to the next full kibibyte.

  "exit-streams-opened" port=N,port=N,... NL
    [At most once.]

    List of mappings from ports to the number of opened exit streams to
    that port, rounded up to the nearest multiple of 4.

Implementation notes:

  Right now, relays that are configured accordingly write similar
  statistics to those described in this proposal to disk every 24
  hours. With this proposal being implemented, relays include the
  contents of these files in extra-info documents.

  The following steps are necessary to implement this proposal:

  1. The current format of [dirreq|entry|buffer|exit]-stats files needs
     to be adapted to the description in this proposal. This step
     basically means renaming keywords.

  2. The timing of writing the four *-stats files should be unified, so
     that they are written exactly 24 hours after starting the relay.
     Right now, the measurement intervals for dirreq, entry, and exit
     stats start with the first observed request, and files are written
     when observing the first request that occurs more than 24 hours
     after the beginning of the measurement interval. With this
     proposal, the measurement intervals should all start at the same
     time, and files should be written exactly 24 hours later.

  3. It is advantageous to cache statistics in local files in the data
     directory until they are included in extra-info documents. The
     reason is that the 24-hour measurement interval can be very
     different from the 18-hour publication interval of extra-info
     documents. If a relay crashed after finishing a measurement
     interval, but before publishing the next extra-info document,
     statistics would be lost. Therefore, statistics are written to
     disk when finishing a measurement interval and read from disk when
     generating an extra-info document. As a result, the *-stats files
     need to be overwritten after 24 hours, rather than appending new
     statistics to them. Further, the contents of the *-stats files
     need to be checked in the process of generating extra-info
     documents.

  4. With the statistics patches being tested, the ./configure options
     should be removed and the statistics code be compiled by default.
     It is still required for relay operators to add configuration
     options (DirReqStatistics, ExitPortStatistics, etc.) to enable
     gathering statistics. However, in the near future, statistics
     shall be gathered by all relays by default, where requiring a
     ./configure option would be a barrier for many relay operators.
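
  The aggregation rules used in the statistics lines of this proposal
  (rounding counts up to a multiple of 4 or 8 and byte counts up to
  full KiB) can be sketched as follows. This is a minimal Python
  illustration, not Tor's implementation; the function names, the
  alphabetical ordering of country codes, and the omission of countries
  with a zero count are assumptions chosen for the example.

```python
from typing import Dict


def round_up_to_multiple(n: int, multiple: int) -> int:
    """Round a non-negative count up to the next multiple (e.g. 4 or 8)."""
    return -(-n // multiple) * multiple


def bytes_to_kibibytes(num_bytes: int) -> int:
    """Round a byte count up to full kibibytes (1 KiB = 1024 bytes)."""
    return -(-num_bytes // 1024)


def format_country_line(keyword: str, counts: Dict[str, int],
                        multiple: int = 8) -> str:
    """Build a line such as 'entry-ips cn=8,de=16,us=24'.

    Assumptions for this sketch: entries are sorted by country code for
    reproducible output, and countries with a zero count are omitted.
    """
    parts = [
        f"{cc}={round_up_to_multiple(n, multiple)}"
        for cc, n in sorted(counts.items())
        if n > 0
    ]
    return f"{keyword} " + ",".join(parts)


line = format_country_line("entry-ips", {"us": 24, "de": 13, "cn": 1})
print(line)  # entry-ips cn=8,de=16,us=24
```

  Rounding up rather than to the nearest value means that even a single
  observation is reported as a full bucket, so exact small counts are
  never exposed.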