Checkpointing Project

The aim of server checkpointing is to allow ntserv or the daemon to be
replaced while a game is in progress.

As of 2.11.1, we can only:

- stop and restart netrekd at whim,
- replace ntserv for use by new connections, provided the shared
  memory layout does not change.

Requirements

- be able to replace ntserv or daemon,
- no significant delay to be experienced by players,
- in case of program-detected failure, fall back to previous version,
- in case of operator-detected failure, fall back to previous version.

Design

- microschedule, immediately after a tick update and before sleeping,
- macroschedule, prefer to avoid if any player in red alert,
- signalling, by flags in shared memory,
- use files to store state, assuming use of filesystem cache,
- processes with network sockets to pass them on to replacement
  program.

Detailed Design - daemon checkpoint commence

- for all slots, set flag slot checkpoint request,
- wait for all ntservs to set flag slot checkpoint done.

Detailed Design - ntserv checkpoint commence and commit

- write all global variables to a per-process file,
- set flag indicating slot checkpoint done,
- exec replacement program, with restart flag.

Detailed Design - daemon checkpoint commit

- write all shared memory variables to a file,
- set flag indicating daemon checkpoint done,
- exec replacement program, with restart flag.

Detailed Design - ntserv restart

- map to shared memory,
- wait for daemon checkpoint done flag to be clear,
- read all global variables from the per-process file,
- clear slot checkpoint done flag,
- clear slot checkpoint request flag,
- wait for next daemon synchronisation update signal.

Detailed Design - daemon restart

- map to shared memory,
- initialise shared memory,
- read all shared memory variables from file,
- clear checkpoint done flag,
- wait for all slots to clear slot checkpoint done flag,
- clear checkpoint request flag,
- resume main loop.
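The flag handshake in the detailed design steps above can be sketched in C as follows. This is only an illustration of the ordering of flag changes, not the server's code: struct ckpt_flags, NSLOTS and every function name are hypothetical, the flags live in an ordinary struct rather than in shared memory, every slot is assumed to be occupied by an ntserv, and the file writes and the exec of the replacement program are reduced to comments.

```c
#include <assert.h>
#include <stdbool.h>

#define NSLOTS 4 /* hypothetical slot count, chosen only for this sketch */

/* Hypothetical flag block standing in for the flags held in shared memory. */
struct ckpt_flags {
    int dcrf;         /* daemon checkpoint request flag */
    int dcdf;         /* daemon checkpoint done flag */
    int scrf[NSLOTS]; /* slot checkpoint request flags */
    int scdf[NSLOTS]; /* slot checkpoint done flags */
};

/* daemon checkpoint commence: raise the request flag for every slot */
void daemon_commence(struct ckpt_flags *f)
{
    f->dcrf = 1;
    for (int i = 0; i < NSLOTS; i++)
        f->scrf[i] = 1;
}

/* ntserv checkpoint commence and commit, for one slot */
bool ntserv_checkpoint(struct ckpt_flags *f, int slot)
{
    assert(slot >= 0 && slot < NSLOTS);
    if (!f->scrf[slot])
        return false; /* no checkpoint requested for this slot */
    /* the per-process state file would be written here */
    f->scdf[slot] = 1;
    /* the exec of the replacement program would follow here */
    return true;
}

/* daemon checkpoint commit: only proceeds once every slot is done */
bool daemon_commit(struct ckpt_flags *f)
{
    for (int i = 0; i < NSLOTS; i++)
        if (!f->scdf[i])
            return false; /* still waiting for an ntserv */
    /* the shared memory variables would be written to a file here */
    f->dcdf = 1;
    /* the exec of the replacement program would follow here */
    return true;
}

/* daemon restart, first half: state has been read back, clear done flag */
void daemon_restart_begin(struct ckpt_flags *f)
{
    f->dcdf = 0;
}

/* ntserv restart: must wait until the daemon done flag is clear */
bool ntserv_restart(struct ckpt_flags *f, int slot)
{
    assert(slot >= 0 && slot < NSLOTS);
    if (f->dcdf)
        return false; /* daemon has not restarted yet */
    /* the per-process state file would be read back here */
    f->scdf[slot] = 0;
    f->scrf[slot] = 0;
    return true;
}

/* daemon restart, second half: wait for all slots, then clear request */
bool daemon_restart_finish(struct ckpt_flags *f)
{
    for (int i = 0; i < NSLOTS; i++)
        if (f->scdf[i])
            return false; /* a slot has not restarted yet */
    f->dcrf = 0;
    return true;
}
```

In this sketch the functions that return false model the "wait for" steps: a caller would retry them on a later tick rather than block. Calling them in the order commence, ntserv checkpoint for each slot, daemon commit, daemon restart begin, ntserv restart for each slot, daemon restart finish returns all flags to zero.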
Timeline

             _________________
DCRF     ___/                 \___  Daemon Checkpoint Request Flag
                     ____
DCDF     ___________/    \________  Daemon Checkpoint Done Flag
               _____________
SCRF[0]  _____/             \_____  Slot #0 Checkpoint Request Flag
                  __________
SCDF[0]  ________/          \_____  Slot #0 Checkpoint Done Flag
               ______________
SCRF[1]  _____/              \____  Slot #1 Checkpoint Request Flag
                  ___________
SCDF[1]  ________/           \____  Slot #1 Checkpoint Done Flag

Utilities

- set daemon checkpoint request flag, monitor, and exit,
- set individual slot checkpoint request flag, monitor, and clear.

Shared Memory Impacts

- daemon checkpointing flags to be positioned at top of shared memory,
  so that they are not affected by redesign of shared memory contents,
- slot checkpointing flags to be positioned immediately after, to
  allow for the possibility of number of slots changed.
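The layout rule under Shared Memory Impacts can be illustrated with a small C sketch. All names here (daemon_ckpt, slot_ckpt, memory_old, memory_new, the slot counts and the game_state members) are hypothetical and do not reflect the server's actual shared memory struct. The point shown is only that, with the daemon flags first and the slot flags immediately after, the offset of any given flag is the same in an old and a new layout, even if the slot count or the remaining contents change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag pairs; field names are assumptions for this sketch. */
struct daemon_ckpt { int request; int done; };
struct slot_ckpt   { int request; int done; };

#define OLD_SLOTS 16 /* hypothetical slot count before replacement */
#define NEW_SLOTS 32 /* hypothetical slot count after replacement */

/* Layout as seen by the old program. */
struct memory_old {
    struct daemon_ckpt daemon;        /* top of shared memory */
    struct slot_ckpt slot[OLD_SLOTS]; /* immediately after */
    int game_state[64];               /* everything else */
};

/* Layout as seen by the replacement program, with redesigned contents. */
struct memory_new {
    struct daemon_ckpt daemon;        /* unchanged position */
    struct slot_ckpt slot[NEW_SLOTS]; /* same start, more slots */
    double game_state[128];           /* redesigned contents */
};

/* Byte offset of one slot's flags in the old layout. */
size_t old_slot_offset(int slot)
{
    assert(slot >= 0 && slot < OLD_SLOTS);
    return offsetof(struct memory_old, slot)
        + (size_t)slot * sizeof(struct slot_ckpt);
}

/* Byte offset of one slot's flags in the new layout. */
size_t new_slot_offset(int slot)
{
    assert(slot >= 0 && slot < NEW_SLOTS);
    return offsetof(struct memory_new, slot)
        + (size_t)slot * sizeof(struct slot_ckpt);
}
```

For any slot number valid in both layouts, old_slot_offset and new_slot_offset agree, so an old ntserv and a new daemon (or the reverse) would address the same flag. Only the contents after the slot flags move when the slot count changes.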