diff options
authorEric W. Biederman <>2019-09-14 07:35:02 -0500
committerIngo Molnar <>2019-09-25 17:42:29 +0200
commit5311a98fef7d0dc2e8040ae0e18f5568d6d1dd5a (patch)
parent154abafc68bfb7c2ef2ad5308a3b2de8968c3f61 (diff)
tasks, sched/core: RCUify the assignment of rq->curr
The current task on the runqueue is currently read with rcu_dereference(). To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs to be paired with rcu_assign_pointer() of rq->curr. Which provides the memory barrier necessary to order assignments to the task_struct and the assignment to rq->curr. Unfortunately the assignment of rq->curr in __schedule is a hot path, and it has already been show that additional barriers in that code will reduce the performance of the scheduler. So I will attempt to describe below why you can effectively have ordinary RCU semantics without any additional barriers. The assignment of rq->curr in init_idle is a slow path called once per cpu and that can use rcu_assign_pointer() without any concerns. As I write this there are effectively two users of rcu_dereference() on rq->curr. There is the membarrier code in kernel/sched/membarrier.c that only looks at "->mm" after the rcu_dereference(). Then there is task_numa_compare() in kernel/sched/fair.c. My best reading of the code shows that task_numa_compare only access: "->flags", "->cpus_ptr", "->numa_group", "->numa_faults[]", "->total_numa_faults", and "->se.cfs_rq". The code in __schedule() essentially does: rq_lock(...); smp_mb__after_spinlock(); next = pick_next_task(...); rq->curr = next; context_switch(prev, next); At the start of the function the rq_lock/smp_mb__after_spinlock pair provides a full memory barrier. Further there is a full memory barrier in context_switch(). This means that any task that has already run and modified itself (the common case) has already seen two memory barriers before __schedule() runs and begins executing. A task that modifies itself then sees a third full memory barrier pair with the rq_lock(); For a brand new task that is enqueued with wake_up_new_task() there are the memory barriers present from the taking and release the pi_lock and the rq_lock as the processes is enqueued as well as the full memory barrier at the start of __schedule() assuming __schedule() happens on the same cpu. This means that by the time we reach the assignment of rq->curr except for values on the task struct modified in pick_next_task the code has the same guarantees as if it used rcu_assign_pointer(). Reading through all of the implementations of pick_next_task it appears pick_next_task is limited to modifying the task_struct fields "->se", "->rt", "->dl". These fields are the sched_entity structures of the varies schedulers. Further "->se.cfs_rq" is only changed in cgroup attach/move operations initialized by userspace. Unless I have missed something this means that in practice that the users of "rcu_dereference(rq->curr)" get normal RCU semantics of rcu_dereference() for the fields the care about, despite the assignment of rq->curr in __schedule() ot using rcu_assign_pointer. Signed-off-by: Eric W. Biederman <> Signed-off-by: Peter Zijlstra (Intel) <> Cc: Chris Metcalf <> Cc: Christoph Lameter <> Cc: Davidlohr Bueso <> Cc: Kirill Tkhai <> Cc: Linus Torvalds <> Cc: Mike Galbraith <> Cc: Oleg Nesterov <> Cc: Paul E. McKenney <> Cc: Peter Zijlstra <> Cc: Russell King - ARM Linux admin <> Cc: Thomas Gleixner <> Link: Signed-off-by: Ingo Molnar <>
1 files changed, 7 insertions, 2 deletions
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e5fefbd4cc7..84c71160beb1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4033,7 +4033,11 @@ static void __sched notrace __schedule(bool preempt)
if (likely(prev != next)) {
- rq->curr = next;
+ /*
+ * RCU users of rcu_dereference(rq->curr) may not see
+ * changes to task_struct made by pick_next_task().
+ */
+ RCU_INIT_POINTER(rq->curr, next);
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
@@ -6060,7 +6064,8 @@ void init_idle(struct task_struct *idle, int cpu)
__set_task_cpu(idle, cpu);
- rq->curr = rq->idle = idle;
+ rq->idle = idle;
+ rcu_assign_pointer(rq->curr, idle);
idle->on_rq = TASK_ON_RQ_QUEUED;
idle->on_cpu = 1;