AgeCommit message (Collapse)AuthorFilesLines
2019-02-01mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zoneMikhail Zaslonko1-0/+3
If memory end is not aligned with the sparse memory section boundary, the mapping of such a section is only partly initialized. This may lead to VM_BUG_ON due to uninitialized struct pages access from test_pages_in_a_zone() function triggered by memory_hotplug sysfs handlers. Here are the the panic examples: CONFIG_DEBUG_VM_PGFLAGS=y kernel parameter mem=2050M -------------------------- page:000003d082008000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: test_pages_in_a_zone+0xde/0x160 show_valid_zones+0x5c/0x190 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: test_pages_in_a_zone+0xde/0x160 Kernel panic - not syncing: Fatal exception: panic_on_oops Fix this by checking whether the pfn to check is within the zone. [ separated this change from] Link: [ separated this change from] Signed-off-by: Michal Hocko <> Signed-off-by: Mikhail Zaslonko <> Tested-by: Mikhail Gavrilov <> Reviewed-by: Oscar Salvador <> Tested-by: Gerald Schaefer <> Cc: Heiko Carstens <> Cc: Martin Schwidefsky <> Cc: Mikhail Gavrilov <> Cc: Pavel Tatashin <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-02-01mm, memory_hotplug: is_mem_section_removable do not pass the end of a zoneMichal Hocko1-1/+2
Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2. Mikhail Zaslonko has posted fixes for the two bugs quite some time ago [1]. I have pushed back on those fixes because I believed that it is much better to plug the problem at the initialization time rather than play whack-a-mole all over the hotplug code and find all the places which expect the full memory section to be initialized. We have ended up with commit 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section") merged and cause a regression [2][3]. The reason is that there might be memory layouts when two NUMA nodes share the same memory section so the merged fix is simply incorrect. In order to plug this hole we really have to be zone range aware in those handlers. I have split up the original patch into two. One is unchanged (patch 2) and I took a different approach for `removable' crash. [1] [2] [3] This patch (of 2): Mikhail has reported the following VM_BUG_ON triggered when reading sysfs removable state of a memory block: page:000003d08300c000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: is_mem_section_removable+0xb4/0x190 show_mem_removable+0x9a/0xd8 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: is_mem_section_removable+0xb4/0x190 Kernel panic - not syncing: Fatal exception: panic_on_oops The reason is that the memory block spans the zone boundary and we are stumbling over an unitialized struct page. Fix this by enforcing zone range in is_mem_section_removable so that we never run away from a zone. Link: Signed-off-by: Michal Hocko <> Reported-by: Mikhail Zaslonko <> Debugged-by: Mikhail Zaslonko <> Tested-by: Gerald Schaefer <> Tested-by: Mikhail Gavrilov <> Reviewed-by: Oscar Salvador <> Cc: Pavel Tatashin <> Cc: Heiko Carstens <> Cc: Martin Schwidefsky <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-02-01oom, oom_reaper: do not enqueue same task twiceTetsuo Handa2-2/+3
Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: Link: Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head") Signed-off-by: Tetsuo Handa <> Reported-by: Arkadiusz Miskiewicz <> Tested-by: Arkadiusz Miskiewicz <> Acked-by: Michal Hocko <> Acked-by: Roman Gushchin <> Cc: Tejun Heo <> Cc: Aleksa Sarai <> Cc: Jay Kamat <> Cc: Johannes Weiner <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-02-01mm: migrate: make buffer_migrate_page_norefs() actually succeedJan Kara1-5/+0
Currently, buffer_migrate_page_norefs() was constantly failing because buffer_migrate_lock_buffers() grabbed reference on each buffer. In fact, there's no reason for buffer_migrate_lock_buffers() to grab any buffer references as the page is locked during all our operation and thus nobody can reclaim buffers from the page. So remove grabbing of buffer references which also makes buffer_migrate_page_norefs() succeed. Link: Fixes: 89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()" Signed-off-by: Jan Kara <> Cc: Sergey Senozhatsky <> Cc: Pavel Machek <> Cc: Mel Gorman <> Cc: Vlastimil Babka <> Cc: Andrea Arcangeli <> Cc: David Rientjes <> Cc: Michal Hocko <> Cc: Zi Yan <> Cc: Johannes Weiner <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-02-01kernel/exit.c: release ptraced tasks before zap_pid_ns_processesAndrei Vagin1-2/+10
Currently, exit_ptrace() adds all ptraced tasks in a dead list, then zap_pid_ns_processes() waits on all tasks in a current pidns, and only then are tasks from the dead list released. zap_pid_ns_processes() can get stuck on waiting tasks from the dead list. In this case, we will have one unkillable process with one or more dead children. Thanks to Oleg for the advice to release tasks in find_child_reaper(). Link: Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()") Signed-off-by: Andrei Vagin <> Signed-off-by: Oleg Nesterov <> Cc: "Eric W. Biederman" <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-02-01x86_64: increase stack size for KASAN_EXTRAQian Cai1-0/+4
If the kernel is configured with KASAN_EXTRA, the stack size is increasted significantly because this option sets "-fstack-reuse" to "none" in GCC [1]. As a result, it triggers stack overrun quite often with 32k stack size compiled using GCC 8. For example, this reproducer triggers a "corrupted stack end detected inside scheduler" very reliably with CONFIG_SCHED_STACK_END_CHECK enabled. There are just too many functions that could have a large stack with KASAN_EXTRA due to large local variables that have been called over and over again without being able to reuse the stacks. Some noticiable ones are size 7648 shrink_page_list 3584 xfs_rmap_convert 3312 migrate_page_move_mapping 3312 dev_ethtool 3200 migrate_misplaced_transhuge_page 3168 copy_process There are other 49 functions are over 2k in size while compiling kernel with "-Wframe-larger-than=" even with a related minimal config on this machine. Hence, it is too much work to change Makefiles for each object to compile without "-fsanitize-address-use-after-scope" individually. [1] Although there is a patch in GCC 9 to help the situation, GCC 9 probably won't be released in a few months and then it probably take another 6-month to 1-year for all major distros to include it as a default. Hence, the stack usage with KASAN_EXTRA can be revisited again in 2020 when GCC 9 is everywhere. Until then, this patch will help users avoid stack overrun. This has already been fixed for arm64 for the same reason via 6e8830674ea ("arm64: kasan: Increase stack size for KASAN_EXTRA"). Link: Signed-off-by: Qian Cai <> Cc: Thomas Gleixner <> Cc: Ingo Molnar <> Cc: Borislav Petkov <> Cc: "H. Peter Anvin" <> Cc: Andrey Ryabinin <> Cc: Alexander Potapenko <> Cc: Dmitry Vyukov <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-02-01mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAITAndrea Arcangeli1-1/+2
hugetlb needs the same fix as faultin_nopage (which was applied in commit 96312e61282a ("mm/gup.c: teach get_user_pages_unlocked to handle FOLL_NOWAIT")) or KVM hangs because it thinks the mmap_sem was already released by hugetlb_fault() if it returned VM_FAULT_RETRY, but it wasn't in the FOLL_NOWAIT case. Link: Fixes: ce53053ce378 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()") Signed-off-by: Andrea Arcangeli <> Tested-by: "Dr. David Alan Gilbert" <> Reported-by: "Dr. David Alan Gilbert" <> Reviewed-by: Mike Kravetz <> Reviewed-by: Peter Xu <> Cc: Mike Rapoport <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-02-01arch: unexport asm/shmparam.h for all architecturesMasahiro Yamada14-7/+7
Most architectures do not export shmparam.h to user-space. $ find arch -name shmparam.h | sort arch/alpha/include/asm/shmparam.h arch/arc/include/asm/shmparam.h arch/arm64/include/asm/shmparam.h arch/arm/include/asm/shmparam.h arch/csky/include/asm/shmparam.h arch/ia64/include/asm/shmparam.h arch/mips/include/asm/shmparam.h arch/nds32/include/asm/shmparam.h arch/nios2/include/asm/shmparam.h arch/parisc/include/asm/shmparam.h arch/powerpc/include/asm/shmparam.h arch/s390/include/asm/shmparam.h arch/sh/include/asm/shmparam.h arch/sparc/include/asm/shmparam.h arch/x86/include/asm/shmparam.h arch/xtensa/include/asm/shmparam.h Strangely, some users of the asm-generic wrapper export shmparam.h $ git grep 'generic-y += shmparam.h' arch/c6x/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/h8300/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/hexagon/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/m68k/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/microblaze/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/openrisc/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/riscv/include/asm/Kbuild:generic-y += shmparam.h arch/unicore32/include/uapi/asm/Kbuild:generic-y += shmparam.h The newly added riscv correctly creates the asm-generic wrapper in the kernel space, but the others (c6x, h8300, hexagon, m68k, microblaze, openrisc, unicore32) create the one in the uapi directory. Digging into the git history, now I guess fcc8487d477a ("uapi: export all headers under uapi directories") was the misconversion. Prior to that commit, no architecture exported to shmparam.h As its commit description said, that commit exported shmparam.h for c6x, h8300, hexagon, m68k, openrisc, unicore32. 83f0124ad81e ("microblaze: remove asm-generic wrapper headers") accidentally exported shmparam.h for microblaze. This commit unexports shmparam.h for those architectures. There is no more reason to export include/uapi/asm-generic/shmparam.h, so it has been moved to include/asm-generic/shmparam.h Link: Signed-off-by: Masahiro Yamada <> Acked-by: Stafford Horne <> Cc: Geert Uytterhoeven <> Cc: Michal Simek <> Cc: Yoshinori Sato <> Cc: Richard Kuo <> Cc: Guan Xuetao <> Cc: Nicolas Dichtel <> Cc: Arnd Bergmann <> Cc: Aurelien Jacquiot <> Cc: Greentime Hu <> Cc: Guo Ren <> Cc: Palmer Dabbelt <> Cc: Stefan Kristiansson <> Cc: Mark Salter <> Cc: Albert Ou <> Cc: Jonas Bonn <> Cc: Vincent Chen <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-02-01proc: fix /proc/net/* after setns(2)Alexey Dobriyan6-1/+155
/proc entries under /proc/net/* can't be cached into dcache because setns(2) can change current net namespace. [ coding-style fixes] [ avoid vim miscolorization] [ write test, add dummy ->d_revalidate hook: necessary if /proc/net/* is pinned at setns time] Link: Link: Fixes: 1da4d377f943fe4194ffb9fb9c26cc58fad4dd24 ("proc: revalidate misc dentries") Signed-off-by: Alexey Dobriyan <> Reported-by: Mateusz Stępień <> Reported-by: Ahmad Fatoum <> Cc: Al Viro <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-02-01mm, memory_hotplug: don't bail out in do_migrate_range() prematurelyOscar Salvador1-16/+2
do_migrate_range() takes a memory range and tries to isolate the pages to put them into a list. This list will be later on used in migrate_pages() to know the pages we need to migrate. Currently, if we fail to isolate a single page, we put all already isolated pages back to their LRU and we bail out from the function. This is quite suboptimal, as this will force us to start over again because scan_movable_pages will give us the same range. If there is no chance that we can isolate that page, we will loop here forever. Issue debugged in [1] has proved that. During the debugging of that issue, it was noticed that if do_migrate_ranges() fails to isolate a single page, we will just discard the work we have done so far and bail out, which means that scan_movable_pages() will find again the same set of pages. Instead, we can just skip the error, keep isolating as much pages as possible and then proceed with the call to migrate_pages(). This will allow us to do as much work as possible at once. [1] Michal said: : I still think that this doesn't give us a whole picture. Looping for : ever is a bug. Failing the isolation is quite possible and it should : be a ephemeral condition (e.g. a race with freeing the page or : somebody else isolating the page for whatever reason). And here comes : the disadvantage of the current implementation. We simply throw : everything on the floor just because of a ephemeral condition. The : racy page_count check is quite dubious to prevent from that. Link: Signed-off-by: Oscar Salvador <> Acked-by: Michal Hocko <> Cc: David Hildenbrand <> Cc: Dan Williams <> Cc: Jan Kara <> Cc: Kirill A. Shutemov <> Cc: William Kucharski <> Cc: Pavel Tatashin <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-02-01Merge tag 'for-linus' of git:// Torvalds18-42/+97
Pull rdma fixes from Jason Gunthorpe: "Still not much going on, the usual set of oops and driver fixes this time: - Fix two uapi breakage regressions in mlx5 drivers - Various oops fixes in hfi1, mlx4, umem, uverbs, and ipoib - A protocol bug fix for hfi1 preventing it from implementing the verbs API properly, and a compatability fix for EXEC STACK user programs - Fix missed refcounting in the 'advise_mr' patches merged this cycle. - Fix wrong use of the uABI in the hns SRQ patches merged this cycle" * tag 'for-linus' of git:// IB/uverbs: Fix OOPs in uverbs_user_mmap_disassociate IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start IB/uverbs: Fix ioctl query port to consider device disassociation RDMA/mlx5: Fix flow creation on representors IB/uverbs: Fix OOPs upon device disassociation RDMA/umem: Add missing initialization of owning_mm RDMA/hns: Update the kernel header file of hns IB/mlx5: Fix how advise_mr() launches async work RDMA/device: Expose ib_device_try_get(() IB/hfi1: Add limit test for RC/UC send via loopback IB/hfi1: Remove overly conservative VM_EXEC flag check IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM IB/mlx4: Fix using wrong function to destroy sqp AHs under SRIOV RDMA/mlx5: Fix check for supported user flags when creating a QP
2019-02-01Merge tag 'iomap-5.0-fixes-1' of git:// Torvalds1-7/+30
Pull iomap fixes from Darrick Wong: "A couple of iomap fixes to eliminate some memory corruption and hang problems that were reported: - fix page migration when using iomap for pagecache management - fix a use-after-free bug in the directio code" * tag 'iomap-5.0-fixes-1' of git:// iomap: fix a use after free in iomap_dio_rw iomap: get/put the page in iomap_page_create/release()
2019-02-01Merge tag 'pm-5.0-rc5' of ↵Linus Torvalds3-7/+7
git:// Pull power management fixes from Rafael Wysocki: "These fix a PM-runtime framework regression introduced by the recent switch-over of device autosuspend to hrtimers and a mistake in the "poll idle state" code introduced by a recent change in it. Specifics: - Since ktime_get() turns out to be problematic for device autosuspend in the PM-runtime framework, make it use ktime_get_mono_fast_ns() instead (Vincent Guittot). - Fix an initial value of a local variable in the "poll idle state" code that makes it behave not exactly as expected when all idle states except for the "polling" one are disabled (Doug Smythies)" * tag 'pm-5.0-rc5' of git:// cpuidle: poll_state: Fix default time limit PM-runtime: Fix deadlock with ktime_get()
2019-02-01Merge tag 'acpi-5.0-rc5' of ↵Linus Torvalds2-1/+3
git:// Pull ACPI Kconfig fixes from Rafael Wysocki: "Prevent invalid configurations from being created (e.g. by randconfig) due to some ACPI-related Kconfig options' dependencies that are not specified directly (Sinan Kaya)" * tag 'acpi-5.0-rc5' of git:// platform/x86: Fix unmet dependency warning for SAMSUNG_Q10 platform/x86: Fix unmet dependency warning for ACPI_CMPC mfd: Fix unmet dependency warning for MFD_TPS68470
2019-02-01Merge tag 'mmc-v5.0-rc4' of ↵Linus Torvalds2-1/+3
git:// Pull MMC host fixes from Ulf Hansson: - mediatek: Fix incorrect register write for tunings - bcm2835: Fixup leakage of DMA channel on probe errors * tag 'mmc-v5.0-rc4' of git:// mmc: mediatek: fix incorrect register setting of hs400_cmd_int_delay mmc: bcm2835: Fix DMA channel leak on probe error
2019-02-01Merge tag 'i3c/fixes-for-5.0-rc5' of ↵Linus Torvalds2-7/+13
git:// Pull i3c fixes from Boris Brezillon: - Fix a deadlock in the designware driver - Fix the error path in i3c_master_add_i3c_dev_locked() * tag 'i3c/fixes-for-5.0-rc5' of git:// i3c: master: dw: fix deadlock i3c: fix missing detach if failed to retrieve i3c dev
2019-02-01x86/kexec: Don't setup EFI info if EFI runtime is not enabledKairui Song1-0/+3
Kexec-ing a kernel with "efi=noruntime" on the first kernel's command line causes the following null pointer dereference: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 #PF error: [normal kernel read fault] Call Trace: efi_runtime_map_copy+0x28/0x30 bzImage64_load+0x688/0x872 arch_kexec_kernel_image_load+0x6d/0x70 kimage_file_alloc_init+0x13e/0x220 __x64_sys_kexec_file_load+0x144/0x290 do_syscall_64+0x55/0x1a0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Just skip the EFI info setup if EFI runtime services are not enabled. [ bp: Massage commit message. ] Suggested-by: Dave Young <> Signed-off-by: Kairui Song <> Signed-off-by: Borislav Petkov <> Acked-by: Dave Young <> Cc: AKASHI Takahiro <> Cc: Andrew Morton <> Cc: Ard Biesheuvel <> Cc: Cc: David Howells <> Cc: Cc: Cc: "H. Peter Anvin" <> Cc: Ingo Molnar <> Cc: Cc: Cc: Cc: Philipp Rudo <> Cc: Cc: Cc: Thomas Gleixner <> Cc: x86-ml <> Cc: Yannik Sembritzki <> Link:
2019-02-01x86: explicitly align IO accesses in memcpy_{to,from}ioLinus Torvalds1-3/+30
In commit 170d13ca3a2f ("x86: re-introduce non-generic memcpy_{to,from}io") I made our copy from IO space use a separate copy routine rather than rely on the generic memcpy. I did that because our generic memory copy isn't actually well-defined when it comes to internal access ordering or alignment, and will in fact depend on various CPUID flags. In particular, the default memcpy() for a modern Intel CPU will generally be just a "rep movsb", which works reasonably well for medium-sized memory copies of regular RAM, since the CPU will turn it into fairly optimized microcode. However, for non-cached memory and IO, "rep movs" ends up being horrendously slow and will just do the architectural "one byte at a time" accesses implied by the movsb. At the other end of the spectrum, if you _don't_ end up using the "rep movsb" code, you'd likely fall back to the software copy, which does overlapping accesses for the tail, and may copy things backwards. Again, for regular memory that's fine, for IO memory not so much. The thinking was that clearly nobody really cared (because things worked), but some people had seen horrible performance due to the byte accesses, so let's just revert back to our long ago version that dod "rep movsl" for the bulk of the copy, and then fixed up the potentially last few bytes of the tail with "movsw/b". Interestingly (and perhaps not entirely surprisingly), while that was our original memory copy implementation, and had been used before for IO, in the meantime many new users of memcpy_*io() had come about. And while the access patterns for the memory copy weren't well-defined (so arguably _any_ access pattern should work), in practice the "rep movsb" case had been very common for the last several years. In particular Jarkko Sakkinen reported that the memcpy_*io() change resuled in weird errors from his Geminilake NUC TPM module. And it turns out that the TPM TCG accesses according to spec require that the accesses be (a) done strictly sequentially (b) be naturally aligned otherwise the TPM chip will abort the PCI transaction. And, in fact, the tpm_crb.c driver did this: memcpy_fromio(buf, priv->rsp, 6); ... memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6); which really should never have worked in the first place, but back before commit 170d13ca3a2f it *happened* to work, because the memcpy_fromio() would be expanded to a regular memcpy, and (a) gcc would expand the first memcpy in-line, and turn it into a 4-byte and a 2-byte read, and they happened to be in the right order, and the alignment was right. (b) gcc would call "memcpy()" for the second one, and the machines that had this TPM chip also apparently ended up always having ERMS ("Enhanced REP MOVSB/STOSB instructions"), so we'd use the "rep movbs" for that copy. In other words, basically by pure luck, the code happened to use the right access sizes in the (two different!) memcpy() implementations to make it all work. But after commit 170d13ca3a2f, both of the memcpy_fromio() calls resulted in a call to the routine with the consistent memory accesses, and in both cases it started out transferring with 4-byte accesses. Which worked for the first copy, but resulted in the second copy doing a 32-bit read at an address that was only 2-byte aligned. Jarkko is actually fixing the fragile code in the TPM driver, but since this is an excellent example of why we absolutely must not use a generic memcpy for IO accesses, _and_ an IO-specific one really should strive to align the IO accesses, let's do exactly that. Side note: Jarkko also noted that the driver had been used on ARM platforms, and had worked. That was because on 32-bit ARM, memcpy_*io() ends up always doing byte accesses, and on 64-bit ARM it first does byte accesses to align to 8-byte boundaries, and then does 8-byte accesses for the bulk. So ARM actually worked by design, and the x86 case worked by pure luck. We *might* want to make x86-64 do the 8-byte case too. That should be a pretty straightforward extension, but let's do one thing at a time. And generally MMIO accesses aren't really all that performance-critical, as shown by the fact that for a long time we just did them a byte at a time, and very few people ever noticed. Reported-and-tested-by: Jarkko Sakkinen <> Tested-by: Jerry Snitselaar <> Cc: David Laight <> Fixes: 170d13ca3a2f ("x86: re-introduce non-generic memcpy_{to,from}io") Signed-off-by: Linus Torvalds <>
2019-02-01apparmor: Fix aa_label_build() error handling for failed mergesJohn Johansen1-1/+4
aa_label_merge() can return NULL for memory allocations failures make sure to handle and set the correct error in this case. Reported-by: Peng Hao <> Signed-off-by: John Johansen <>
2019-02-01arm64: hibernate: Clean the __hyp_text to PoC after resumeJames Morse1-1/+3
During resume hibernate restores all physical memory. Any memory that is accessed with the MMU disabled needs to be cleaned to the PoC. KVMs __hyp_text was previously ommitted as it runs with the MMU enabled, but now that the hyp-stub is located in this section, we must clean __hyp_text too. This ensures secondary CPUs that come online after hibernate has finished resuming, and load KVM via the freshly written hyp-stub see the correct instructions. Signed-off-by: James Morse <> Cc: Signed-off-by: Will Deacon <>
2019-02-01arm64: hyp-stub: Forbid kprobing of the hyp-stubJames Morse1-0/+2
The hyp-stub is loaded by the kernel's early startup code at EL2 during boot, before KVM takes ownership later. The hyp-stub's text is part of the regular kernel text, meaning it can be kprobed. A breakpoint in the hyp-stub causes the CPU to spin in el2_sync_invalid. Add it to the __hyp_text. Signed-off-by: James Morse <> Cc: Signed-off-by: Will Deacon <>
2019-02-01arm64: kprobe: Always blacklist the KVM world-switch codeJames Morse1-3/+3
On systems with VHE the kernel and KVM's world-switch code run at the same exception level. Code that is only used on a VHE system does not need to be annotated as __hyp_text as it can reside anywhere in the kernel text. __hyp_text was also used to prevent kprobes from patching breakpoint instructions into this region, as this code runs at a different exception level. While this is no longer true with VHE, KVM still switches VBAR_EL1, meaning a kprobe's breakpoint executed in the world-switch code will cause a hyp-panic. Move the __hyp_text check in the kprobes blacklist so it applies on VHE systems too, to cover the common code and guest enter/exit assembly. Fixes: 888b3c8720e0 ("arm64: Treat all entry code as non-kprobe-able") Reviewed-by: Christoffer Dall <> Signed-off-by: James Morse <> Acked-by: Masami Hiramatsu <> Signed-off-by: Will Deacon <>
2019-02-01arm64: kaslr: ensure randomized quantities are clean also when kaslr is offArd Biesheuvel1-0/+1
Commit 1598ecda7b23 ("arm64: kaslr: ensure randomized quantities are clean to the PoC") added cache maintenance to ensure that global variables set by the kaslr init routine are not wiped clean due to cache invalidation occurring during the second round of page table creation. However, if kaslr_early_init() exits early with no randomization being applied (either due to the lack of a seed, or because the user has disabled kaslr explicitly), no cache maintenance is performed, leading to the same issue we attempted to fix earlier, as far as the module_alloc_base variable is concerned. Note that module_alloc_base cannot be initialized statically, because that would cause it to be subject to a R_AARCH64_RELATIVE relocation, causing it to be overwritten by the second round of KASLR relocation processing. Fixes: f80fb3a3d508 ("arm64: add support for kernel ASLR") Cc: <> # v4.6+ Signed-off-by: Ard Biesheuvel <> Signed-off-by: Will Deacon <>
2019-02-01arm64: Do not issue IPIs for user executable ptesCatalin Marinas1-1/+5
Commit 3b8c9f1cdfc5 ("arm64: IPI each CPU after invalidating the I-cache for kernel mappings") was aimed at fixing the I-cache invalidation for kernel mappings. However, it inadvertently caused all cache maintenance for user mappings via set_pte_at() -> __sync_icache_dcache() -> sync_icache_aliases() to call kick_all_cpus_sync(). Reported-by: Shijith Thotton <> Tested-by: Shijith Thotton <> Reported-by: Wandun Chen <> Fixes: 3b8c9f1cdfc5 ("arm64: IPI each CPU after invalidating the I-cache for kernel mappings") Cc: <> # 4.19.x- Signed-off-by: Catalin Marinas <> Signed-off-by: Will Deacon <>
2019-02-01drm/sun4i: tcon: Prepare and enable TCON channel 0 clock at initPaul Kocialkowski1-0/+2
When initializing clocks, a reference to the TCON channel 0 clock is obtained. However, the clock is never prepared and enabled later. Switching from simplefb to DRM actually disables the clock (that was usually configured by U-Boot) because of that. On the V3s, this results in a hang when writing to some mixer registers when switching over to DRM from simplefb. Fix this by preparing and enabling the clock when initializing other clocks. Waiting for sun4i_tcon_channel_enable to enable the clock is apparently too late and results in the same mixer register access hang. Signed-off-by: Paul Kocialkowski <> Signed-off-by: Maxime Ripard <> Link:
2019-02-01apparmor: Fix warning about unused function apparmor_ipv6_postroutePetr Vorel1-0/+2
when compiled without CONFIG_IPV6: security/apparmor/lsm.c:1601:21: warning: ‘apparmor_ipv6_postroute’ defined but not used [-Wunused-function] static unsigned int apparmor_ipv6_postroute(void *priv, ^~~~~~~~~~~~~~~~~~~~~~~ Reported-by: Jordan Glover <> Tested-by: Jordan Glover <> Signed-off-by: Petr Vorel <> Signed-off-by: John Johansen <>
2019-02-01Merge branch 'acpi-misc'Rafael J. Wysocki1-0/+2
* acpi-misc: platform/x86: Fix unmet dependency warning for SAMSUNG_Q10 platform/x86: Fix unmet dependency warning for ACPI_CMPC
2019-02-01Merge branch 'pm-cpuidle-fixes'Rafael J. Wysocki1-1/+1
* pm-cpuidle-fixes: cpuidle: poll_state: Fix default time limit
2019-01-31Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds8-32/+35
git:// Pull clk fixes from Stephen Boyd: "Mostly driver fixes, but there's a core framework fix in here too: - Revert the commits that introduce clk management for the SP clk on MMP2 SoCs (used for OLPC). Turns out it wasn't a good idea and there isn't any need to manage this clk, it just causes more headaches. - A performance regression that went unnoticed for many years where we would traverse the entire clk tree looking for a clk by name when we already have the pointer to said clk that we're looking for - A parent linkage fix for the qcom SDM845 clk driver - An i.MX clk driver rate miscalculation fix where order of operations were messed up - One error handling fix from the static checkers" * tag 'clk-fixes-for-linus' of git:// clk: qcom: gcc: Use active only source for CPUSS clocks clk: ti: Fix error handling in ti_clk_parse_divider_data() clk: imx: Fix fractional clock set rate computation clk: Remove global clk traversal on fetch parent index Revert "dt-bindings: marvell,mmp2: Add clock id for the SP clock" Revert "clk: mmp2: add SP clock" Revert "Input: olpc_apsp - enable the SP clock"
2019-01-31Merge branch 'linus' of ↵Linus Torvalds1-4/+6
git:// Pull crypto fix from Herbert Xu: "This fixes a bug in cavium/nitrox where the callback is invoked prior to the DMA unmap" * 'linus' of git:// crypto: cavium/nitrox - Invoke callback after DMA unmap
2019-01-31Merge tag 'pci-v5.0-fixes-3' of ↵Linus Torvalds3-22/+9
git:// Pull PCI fixes from Bjorn Helgaas: - Revert armada8k GPIO reset change that broke Macchiatobin booting (Baruch Siach) - Use actual size config reads on ARM cns3xxx (Koen Vandeputte) - Fix ARM cns3xxx config write alignment issue (Koen Vandeputte) - Fix imx6 PHY device link error checking (Leonard Crestez) - Fix imx6 probe failure on chips without separate PCI power domain (Leonard Crestez) * tag 'pci-v5.0-fixes-3' of git:// Revert "PCI: armada8k: Add support for gpio controlled reset signal" ARM: cns3xxx: Use actual size reads for PCIe ARM: cns3xxx: Fix writing to wrong PCI config registers after alignment PCI: imx: Fix checking pd_pcie_phy device link addition PCI: imx: Fix probe failure without power domain
2019-02-01drm/amdgpu: fix the incorrect external id for raven seriesHuang Rui1-2/+4
This patch fixes the incorrect external id that kernel reports to user mode driver. Raven2's rev_id is starts from 0x8, so its external id (0x81) should start from rev_id + 0x79 (0x81 - 0x8). And Raven's rev_id should be 0x21 while rev_id == 1. Reported-by: Crystal Jin <> Signed-off-by: Huang Rui <> Reviewed-by: Hawking Zhang <> Signed-off-by: Alex Deucher <>
2019-02-01drm/amdgpu: Implement doorbell self-ring for NBIO 7.4Jay Cornwall1-0/+13
Fixes doorbell reflection on Vega20. Change-Id: I0495139d160a9032dff5977289b1eec11c16f781 Signed-off-by: Jay Cornwall <> Reviewed-by: Alex Deucher <> Signed-off-by: Alex Deucher <>
2019-02-01drm/amd/display: Fix fclk idle stateRoman Li1-1/+9
[Why] The earlier change 'Fix 6x4K displays' led to fclk value idling at higher DPM level. [How] Apply the fix only to respective multi-display configuration. Signed-off-by: Roman Li <> Reviewed-by: Feifei Xu <> Acked-by: Alex Deucher <> Signed-off-by: Alex Deucher <>
2019-01-31Revert "PCI: armada8k: Add support for gpio controlled reset signal"Baruch Siach1-16/+0
Revert commit 3d71746c42 ("PCI: armada8k: Add support for gpio controlled reset signal"). That commit breaks boot on Macchiatobin board when a Mellanox NIC is present in the PCIe slot. It turns out that full reset cycle requires first comphy serdes initialization. Reset signal toggle without comphy initialization makes access to PCI configuration registers stall indefinitely. U-Boot toggles the Macchiatobin PCIe reset line already at boot, after initializing the comphy serdes. So while commit 3d71746c42 ("PCI: armada8k: Add support for gpio controlled reset signal") enables PCIe on platforms that U-Boot does not touch the reset line (like Clearfog GT-8K), it breaks PCIe (and boot) on the Macchiatobin board. Revert commit 3d71746c42 ("PCI: armada8k: Add support for gpio controlled reset signal") entirely to fix the Macchiatobin regression. Reported-by: Sven Auhagen <> Signed-off-by: Baruch Siach <> Signed-off-by: Lorenzo Pieralisi <>
2019-01-31ARM: cns3xxx: Use actual size reads for PCIeKoen Vandeputte1-1/+1
commit 802b7c06adc7 ("ARM: cns3xxx: Convert PCI to use generic config accessors") reimplemented cns3xxx_pci_read_config() using pci_generic_config_read32(), which preserved the property of only doing 32-bit reads. It also replaced cns3xxx_pci_write_config() with pci_generic_config_write(), so it changed writes from always being 32 bits to being the actual size, which works just fine. Given that: - The documentation does not mention that only 32 bit access is allowed. - Writes are already executed using the actual size - Extensive testing shows that 8b, 16b and 32b reads work as intended Allow read access of any size by replacing pci_generic_config_read32() with the pci_generic_config_read() accessors. Fixes: 802b7c06adc7 ("ARM: cns3xxx: Convert PCI to use generic config accessors") Suggested-by: Bjorn Helgaas <> Signed-off-by: Koen Vandeputte <> [ updated commit log] Signed-off-by: Lorenzo Pieralisi <> Acked-by: Krzysztof Halasa <> Acked-by: Arnd Bergmann <> CC: Krzysztof Halasa <> CC: Olof Johansson <> CC: Robin Leblon <> CC: Rob Herring <> CC: Russell King <> CC: Tim Harvey <>
2019-01-31ARM: cns3xxx: Fix writing to wrong PCI config registers after alignmentKoen Vandeputte1-1/+1
Originally, cns3xxx used its own functions for mapping, reading and writing config registers. Commit 802b7c06adc7 ("ARM: cns3xxx: Convert PCI to use generic config accessors") removed the internal PCI config write function in favor of the generic one: cns3xxx_pci_write_config() --> pci_generic_config_write() cns3xxx_pci_write_config() expected aligned addresses, being produced by cns3xxx_pci_map_bus() while the generic one pci_generic_config_write() actually expects the real address as both the function and hardware are capable of byte-aligned writes. This currently leads to pci_generic_config_write() writing to the wrong registers. For instance, upon ath9k module loading: - driver ath9k gets loaded - The driver wants to write value 0xA8 to register PCI_LATENCY_TIMER, located at 0x0D - cns3xxx_pci_map_bus() aligns the address to 0x0C - pci_generic_config_write() effectively writes 0xA8 into register 0x0C (CACHE_LINE_SIZE) Fix the bug by removing the alignment in the cns3xxx mapping function. Fixes: 802b7c06adc7 ("ARM: cns3xxx: Convert PCI to use generic config accessors") Signed-off-by: Koen Vandeputte <> [ updated commit log] Signed-off-by: Lorenzo Pieralisi <> Acked-by: Krzysztof Halasa <> Acked-by: Tim Harvey <> Acked-by: Arnd Bergmann <> CC: # v4.0+ CC: Bjorn Helgaas <> CC: Olof Johansson <> CC: Robin Leblon <> CC: Rob Herring <> CC: Russell King <>
2019-01-31PCI: imx: Fix checking pd_pcie_phy device link additionLeonard Crestez1-4/+4
The check on the device_link_add() return value is wrong; this leads to erroneous code execution, so fix it. Fixes: 3f7cceeab895 ("PCI: imx: Add multi-pd support") Signed-off-by: Leonard Crestez <> [ updated commit log] Signed-off-by: Lorenzo Pieralisi <>
2019-01-31PCI: imx: Fix probe failure without power domainLeonard Crestez1-0/+3
On chips without a separate power domain for PCI (such as 6q/6qp) the imx6_pcie_attach_pd() function incorrectly returns an error. Fix by returning 0 if dev_pm_domain_attach_by_name() does not find anything. Fixes: 3f7cceeab895 ("PCI: imx: Add multi-pd support") Reported-by: Lukas F.Hartmann <> Signed-off-by: Leonard Crestez <> [ updated commit log] Signed-off-by: Lorenzo Pieralisi <>
2019-01-31gfs2: Revert "Fix loop in gfs2_rbm_find"Andreas Gruenbacher1-1/+1
This reverts commit 2d29f6b96d8f80322ed2dd895bca590491c38d34. It turns out that the fix can lead to a ~20 percent performance regression in initial writes to the page cache according to iozone. Let's revert this for now to have more time for a proper fix. Cc: # v3.13+ Signed-off-by: Andreas Gruenbacher <> Signed-off-by: Bob Peterson <> Signed-off-by: Linus Torvalds <>
2019-01-31Merge tag 'linux-kselftest-5.0-rc5' of ↵Linus Torvalds5-20/+71
git:// Pull kselftest fixes from Shuah Khan: "This consists of run-time fixes to cpu-hotplug, and seccomp tests, compile fixes to ir, net, and timers Makefiles" * tag 'linux-kselftest-5.0-rc5' of git:// selftests: timers: use LDLIBS instead of LDFLAGS selftests: net: use LDLIBS instead of LDFLAGS selftests/seccomp: Enhance per-arch ptrace syscall skip tests selftests: Use lirc.h from kernel tree, not from system selftests: cpu-hotplug: fix case where CPUs offline > CPUs present
2019-01-31Merge tag 'nfs-for-5.0-3' of git:// Torvalds2-4/+10
Pull NFS client fixes from Anna Schumaker: "This addresses two bugs, one in the error code handling of nfs_page_async_flush() and one to fix a potential NULL pointer dereference in nfs_parse_devname(). Stable bugfix: - Fix up return value on fatal errors in nfs_page_async_flush() Other bugfix: - Fix NULL pointer dereference of dev_name" * tag 'nfs-for-5.0-3' of git:// NFS: Fix up return value on fatal errors in nfs_page_async_flush() nfs: Fix NULL pointer dereference of dev_name
2019-01-31Merge tag 'sound-5.0-rc5' of ↵Linus Torvalds3-34/+54
git:// Pull sound fixes from Takashi Iwai: "Only three fixes. The fix for Realtek HD-audio looks lengthy, but it's just a code shuffling, and the actual changes are fairly small. The rest are a PCM core fix for a long-standing bug that was recently scratched by syzkaller, and a trivial USB-audio quirk for DSD support" * tag 'sound-5.0-rc5' of git:// ALSA: hda/realtek - Fixed hp_pin no value ALSA: pcm: Fix tight loop of OSS capture stream ALSA: usb-audio: Add Opus #3 to quirks for native DSD support
2019-01-31x86/microcode/amd: Don't falsely trick the late loading mechanismThomas Lendacky1-1/+1
The load_microcode_amd() function searches for microcode patches and attempts to apply a microcode patch if it is of different level than the currently installed level. While the processor won't actually load a level that is less than what is already installed, the logic wrongly returns UCODE_NEW thus signaling to its caller reload_store() that a late loading should be attempted. If the file-system contains an older microcode revision than what is currently running, such a late microcode reload can result in these misleading messages: x86/CPU: CPU features have changed after loading microcode, but might not take effect. x86/CPU: Please consider either early loading through initrd/built-in or a potential BIOS update. These messages were issued on a system where SME/SEV are not enabled by the BIOS (MSR C001_0010[23] = 0b) because during boot, early_detect_mem_encrypt() is called and cleared the SME and SEV features in this case. However, after the wrong late load attempt, get_cpu_cap() is called and reloads the SME and SEV feature bits, resulting in the messages. Update the microcode level check to not attempt microcode loading if the current level is greater than(!) and not only equal to the current patch level. [ bp: massage commit message. ] Fixes: 2613f36ed965 ("x86/microcode: Attempt late loading only when new microcode is present") Signed-off-by: Tom Lendacky <> Signed-off-by: Borislav Petkov <> Cc: "H. Peter Anvin" <> Cc: Ingo Molnar <> Cc: Thomas Gleixner <> Cc: x86-ml <> Link:
2019-01-31ide: ensure atapi sense request aren't preemptedJens Axboe5-38/+59
There's an issue with how sense requests are handled in IDE. If ide-cd encounters an error, it queues a sense request. With how IDE request handling is done, this is the next request we need to handle. But it's impossible to guarantee this, as another request could come in between the sense being queued, and ->queue_rq() being run and handling it. If that request ALSO fails, then we attempt to doubly queue the single sense request we have. Since we only support one active request at the time, defer request processing when a sense request is queued. Fixes: 600335205b8d "ide: convert to blk-mq" Reported-by: He Zhe <> Tested-by: He Zhe <> Signed-off-by: Jens Axboe <>
2019-01-31cifs: update internal module version numberSteve French1-1/+1
To 2.17 Signed-off-by: Steve French <>
2019-01-31CIFS: fix use-after-free of the lease keysAurelien Aptel1-2/+2
The request buffers are freed right before copying the pointers. Use the func args instead which are identical and still valid. Simple reproducer (requires KASAN enabled) on a cifs mount: echo foo > foo ; tail -f foo & rm foo Cc: <> # 4.20 Fixes: 179e44d49c2f ("smb3: add tracepoint for sending lease break responses to server") Signed-off-by: Aurelien Aptel <> Signed-off-by: Steve French <> Reviewed-by: Paulo Alcantara <>
2019-01-30cpuidle: poll_state: Fix default time limitDoug Smythies1-1/+1
The default time is declared in units of microsecnds, but is used as nanoseconds, resulting in significant accounting errors for idle state 0 time when all idle states deeper than 0 are disabled. Under these unusual conditions, we don't really care about the poll time limit anyhow. Fixes: 800fb34a99ce ("cpuidle: poll_state: Disregard disable idle states") Signed-off-by: Doug Smythies <> Signed-off-by: Rafael J. Wysocki <>
2019-01-30PM-runtime: Fix deadlock with ktime_get()Vincent Guittot2-6/+6
A deadlock has been seen when swicthing clocksources which use PM-runtime. The call path is: change_clocksource ... write_seqcount_begin ... timekeeping_update ... sh_cmt_clocksource_enable ... rpm_resume pm_runtime_mark_last_busy ktime_get do read_seqcount_begin while read_seqcount_retry .... write_seqcount_end Although we should be safe because we haven't yet changed the clocksource at that time, we can't do that because of seqcount protection. Use ktime_get_mono_fast_ns() instead which is lock safe for such cases. With ktime_get_mono_fast_ns, the timestamp is not guaranteed to be monotonic across an update and as a result can goes backward. According to update_fast_timekeeper() description: "In the worst case, this can result is a slightly wrong timestamp (a few nanoseconds)". For PM-runtime autosuspend, this means only that the suspend decision may be slightly suboptimal. Fixes: 8234f6734c5d ("PM-runtime: Switch autosuspend over to using hrtimers") Reported-by: Biju Das <> Signed-off-by: Vincent Guittot <> Reviewed-by: Ulf Hansson <> Signed-off-by: Rafael J. Wysocki <>
2019-01-30fs/dcache: Track & report number of negative dentriesWaiman Long3-13/+52
The current dentry number tracking code doesn't distinguish between positive & negative dentries. It just reports the total number of dentries in the LRU lists. As excessive number of negative dentries can have an impact on system performance, it will be wise to track the number of positive and negative dentries separately. This patch adds tracking for the total number of negative dentries in the system LRU lists and reports it in the 5th field in the /proc/sys/fs/dentry-state file. The number, however, does not include negative dentries that are in flight but not in the LRU yet as well as those in the shrinker lists which are on the way out anyway. The number of positive dentries in the LRU lists can be roughly found by subtracting the number of negative dentries from the unused count. Matthew Wilcox had confirmed that since the introduction of the dentry_stat structure in 2.1.60, the dummy array was there, probably for future extension. They were not replacements of pre-existing fields. So no sane applications that read the value of /proc/sys/fs/dentry-state will do dummy thing if the last 2 fields of the sysctl parameter are not zero. IOW, it will be safe to use one of the dummy array entry for negative dentry count. Signed-off-by: Waiman Long <> Signed-off-by: Linus Torvalds <>