sched: Don't get KIST stuck in an infinite loop

When a channel is scheduled and flush cells returns 0 that is no cells to
flush, we flag it back in waiting for cells so it doesn't get stuck in a
possible infinite loop.

It has been observed on moria1 where a closed channel end up in the scheduler
where the flush process returned 0 cells but it was ultimately kept in the
scheduling loop forever. We suspect that this is due to a more deeper problem
in tor where the channel_more_to_flush() is actually looking at the wrong
queue and was returning 1 for an empty channel thus putting the channel in the
"Case 4" of the scheduler which is to go back in pending state thus
re-considered at the next iteration.

This is a fix that allows the KIST scheduler to recover properly from a not
entirelly diagnosed problem in tor.

Fixes #23676

Signed-off-by: David Goulet <dgoulet@torproject.org>
This commit is contained in:
Matt Traudt 2017-09-27 16:11:05 -04:00 committed by David Goulet
parent fc6c0b46fb
commit 3ef7e6f187
2 changed files with 22 additions and 1 deletions

6
changes/bug23676 Normal file
View File

@ -0,0 +1,6 @@
o Major bugfixes (scheduler):
If a channel is put into the scheduler's pending list, then it starts
closing, and then if the scheduler runs before it finishes closing, the
scheduler will get stuck trying to flush its cells while the lower layers
refuse to cooperate. Fix that race condition by given the scheduler an
escape method. Fixes bug 23676; bugfix on 0.3.2.1-alpha

View File

@ -591,8 +591,23 @@ kist_scheduler_run(void)
if (flush_result > 0) {
update_socket_written(&socket_table, chan, flush_result *
(CELL_MAX_NETWORK_SIZE + TLS_PER_CELL_OVERHEAD));
} else {
/* XXX: This can happen because tor sometimes does flush in an
* opportunistic way cells from the circuit to the outbuf so the
* channel can end up here without having anything to flush nor needed
* to write to the kernel. Hopefully we'll fix that soon but for now
* we have to handle this case which happens kind of often. */
log_debug(LD_SCHED,
"We didn't flush anything on a chan that we think "
"can write and wants to write. The channel's state is '%s' "
"and in scheduler state %d. We're going to mark it as "
"waiting_for_cells (as that's most likely the issue) and "
"stop scheduling it this round.",
channel_state_to_string(chan->state),
chan->scheduler_state);
chan->scheduler_state = SCHED_CHAN_WAITING_FOR_CELLS;
continue;
}
/* XXX What if we didn't flush? */
}
/* Decide what to do with the channel now */