To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (eg. hot un/plug or off/ onlining).
The crash elfcorehdr describes the CPUs and memory to be written into the
vmcore.
To track CPU changes, callbacks are registered with the cpuhp mechanism
via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The crash hotplug
elfcorehdr update has no explicit ordering requirement (relative to other
cpuhp states), so meets the criteria for utilizing CPUHP_BP_PREPARE_DYN.
CPUHP_BP_PREPARE_DYN is a dynamic state and avoids the need to introduce a
new state for crash hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state
in the PREPARE group, just prior to the STARTING group, which is very
close to the CPU starting up in a plug/online situation, or stopping in a
unplug/ offline situation. This minimizes the window of time during an
actual plug/online or unplug/offline situation in which the elfcorehdr
would be inaccurate. Note that for a CPU being unplugged or offlined, the
CPU will still be present in the list of CPUs generated by
crash_prepare_elf64_headers(). However, there is no need to explicitly
omit the CPU, see justification in 'crash: change
crash_prepare_elf64_headers() to for_each_possible_cpu()'.
To track memory changes, a notifier is registered to capture the memblock
MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().
The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event()
which performs needed tasks and then dispatches the event to the
architecture specific arch_crash_handle_hotplug_event() to update the
elfcorehdr with the current state of CPUs and memory. During the process,
the kexec_lock is held.
Link: https://lkml.kernel.org/r/20230814214446.6659-3-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "crash: Kernel handling of CPU and memory hot un/plug", v28.
Once the kdump service is loaded, if changes to CPUs or memory occur,
either by hot un/plug or off/onlining, the crash elfcorehdr must also be
updated.
The elfcorehdr describes to kdump the CPUs and memory in the system, and
any inaccuracies can result in a vmcore with missing CPU context or memory
regions.
The current solution utilizes udev to initiate an unload-then-reload of
the kdump image (eg. kernel, initrd, boot_params, purgatory and
elfcorehdr) by the userspace kexec utility. In the original post I
outlined the significant performance problems related to offloading this
activity to userspace.
This patchset introduces a generic crash handler that registers with the
CPU and memory notifiers. Upon CPU or memory changes, from either hot
un/plug or off/onlining, this generic handler is invoked and performs
important housekeeping, for example obtaining the appropriate lock, and
then invokes an architecture specific handler to do the appropriate
elfcorehdr update.
Note the description in patch 'crash: change crash_prepare_elf64_headers()
to for_each_possible_cpu()' and 'x86/crash: optimize CPU changes' that
enables further optimizations related to CPU plug/unplug/online/offline
performance of elfcorehdr updates.
In the case of x86_64, the arch specific handler generates a new
elfcorehdr, and overwrites the old one in memory; thus no involvement with
userspace needed.
To realize the benefits/test this patchset, one must make a couple
of minor changes to userspace:
- Prevent udev from updating kdump crash kernel on hot un/plug changes.
Add the following as the first lines to the RHEL udev rule file
/usr/lib/udev/rules.d/98-kexec.rules:
# The kernel updates the crash elfcorehdr for CPU and memory changes
SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
With this changeset applied, the two rules evaluate to false for
CPU and memory change events and thus skip the userspace
unload-then-reload of kdump.
- Change to the kexec_file_load for loading the kdump kernel:
Eg. on RHEL: in /usr/bin/kdumpctl, change to:
standard_kexec_args="-p -d -s"
which adds the -s to select kexec_file_load() syscall.
This kernel patchset also supports kexec_load() with a modified kexec
userspace utility. A working changeset to the kexec userspace utility is
posted to the kexec-tools mailing list here:
http://lists.infradead.org/pipermail/kexec/2023-May/027049.html
To use the kexec-tools patch, apply, build and install kexec-tools, then
change the kdumpctl's standard_kexec_args to replace the -s with
--hotplug. The removal of -s reverts to the kexec_load syscall and the
addition of --hotplug invokes the changes put forth in the kexec-tools
patch.
This patch (of 8):
The crash hotplug support leans on the work for the kexec_file_load()
syscall. To also support the kexec_load() syscall, a few bits of code
need to be move outside of CONFIG_KEXEC_FILE. As such, these bits are
moved out of kexec_file.c and into a common location crash_core.c.
In addition, struct crash_mem and crash_notes were moved to new locales so
that PROC_KCORE, which sets CRASH_CORE alone, builds correctly.
No functionality change intended.
Link: https://lkml.kernel.org/r/20230814214446.6659-1-eric.devolder@oracle.com
Link: https://lkml.kernel.org/r/20230814214446.6659-2-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Eric has pointed out that we still have 3 users of do_each_thread().
Change them to use for_each_process_thread() and kill this helper.
There is a subtle change, after do_each_thread/while_each_thread g == t ==
&init_task, while after for_each_process_thread() they both point to
nowhere, but this doesn't matter.
> Why is for_each_process_thread() better than do_each_thread()?
Say, for_each_process_thread() is rcu safe, do_each_thread() is not.
And certainly
for_each_process_thread(p, t) {
do_something(p, t);
}
looks better than
do_each_thread(p, t) {
do_something(p, t);
} while_each_thread(p, t);
And again, there are only 3 users of this awkward helper left. It should
have been killed years ago and in fact I thought it had already been
killed. It uses while_each_thread() which needs some changes.
Link: https://lkml.kernel.org/r/20230817163708.GA8248@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jiri Slaby <jirislaby@kernel.org> # tty/serial
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A syzbot stress test using a corrupted disk image reported that
mark_buffer_dirty() called from __nilfs_mark_inode_dirty() or
nilfs_palloc_commit_alloc_entry() may output a kernel warning, and can
panic if the kernel is booted with panic_on_warn.
This is because nilfs2 keeps buffer pointers in local structures for some
metadata and reuses them, but such buffers may be forcibly discarded by
nilfs_clear_dirty_page() in some critical situations.
This issue is reported to appear after commit 28a65b49eb ("nilfs2: do
not write dirty data after degenerating to read-only"), but the issue has
potentially existed before.
Fix this issue by checking the uptodate flag when attempting to reuse an
internally held buffer, and reloading the metadata instead of reusing the
buffer if the flag was lost.
Link: https://lkml.kernel.org/r/20230818131804.7758-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+cdfcae656bac88ba0e2d@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/0000000000003da75f05fdeffd12@google.com
Fixes: 8c26c4e269 ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption")
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On the parisc architecture, lockdep reports for all static objects which
are in the __initdata section (e.g. "setup_done" in devtmpfs,
"kthreadd_done" in init/main.c) this warning:
INFO: trying to register non-static key.
The warning itself is wrong, because those objects are in the __initdata
section, but the section itself is on parisc outside of range from
_stext to _end, which is why the static_obj() functions returns a wrong
answer.
While fixing this issue, I noticed that the whole existing check can
be simplified a lot.
Instead of checking against the _stext and _end symbols (which include
code areas too) just check for the .data and .bss segments (since we check a
data object). This can be done with the existing is_kernel_core_data()
macro.
In addition objects in the __initdata section can be checked with
init_section_contains(), and is_kernel_rodata() allows keys to be in the
_ro_after_init section.
This partly reverts and simplifies commit bac59d18c7 ("x86/setup: Fix static
memory detection").
Link: https://lkml.kernel.org/r/ZNqrLRaOi/3wPAdp@p100
Fixes: bac59d18c7 ("x86/setup: Fix static memory detection")
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xchg originated in 6e399cd144 ("prctl: avoid using mmap_sem for exe_file
serialization"). While the commit message does not explain *why* the
change, I found the original submission [1] which ultimately claims it
cleans things up by removing dependency of exe_file on the semaphore.
However, fe69d560b5 ("kernel/fork: always deny write access to current
MM exe_file") added a semaphore up/down cycle to synchronize the state of
exe_file against fork, defeating the point of the original change.
This is on top of semaphore trips already present both in the replacing
function and prctl (the only consumer).
Normally replacing exe_file does not happen for busy processes, thus
write-locking is not an impediment to performance in the intended use
case. If someone keeps invoking the routine for a busy processes they are
trying to play dirty and that's another reason to avoid any trickery.
As such I think the atomic here only adds complexity for no benefit.
Just write-lock around the replacement.
I also note that replacement races against the mapping check loop as
nothing synchronizes actual assignment with with said checks but I am not
addressing it in this patch. (Is the loop of any use to begin with?)
Link: https://lore.kernel.org/linux-mm/1424979417.10344.14.camel@stgolabs.net/ [1]
Link: https://lkml.kernel.org/r/20230814172140.1777161-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
"externs should be avoided in .c files" needs an exception for linker
symbols, like those that mark the start, stop of many kernel sections.
Since checkpatch already checks REALNAME to avoid looking at fragments
changing vmlinux.lds.h, add a new else-if block to look at them
instead. As a simple heuristic, treat all words (in the patch-line)
as possible symbols, to screen later warnings.
For my test case, the possible-symbols included BOUNDED_BY (a macro),
which is extra, but not troublesome - these are just to screen
WARNINGS that might be issued on later fragments (changing .c files)
Where the WARN is issued, precede it with an else-if block to catch
one common extern-in-c use case: "extern struct foo bar[]". Here we
can at least issue a softer warning, after checking for a match with a
maybe-linker-symbol parsed earlier from the patch.
Though heuristic, it worked for my test-case, allowing both start__,
stop__ $symbol's (wo the prefixes specifically named). I've coded it
narrowly, it can be expanded later to cover any other expressions.
It does require that the externs in .c's have the additions to
vmlinux.lds.h in the same patch. And requires vmlinux.lds.h before .c
fragments.
Link: https://lkml.kernel.org/r/20230808033019.21911-2-jim.cromie@gmail.com
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The APIs that allow backtracing across CPUs have always had a way to
exclude the current CPU. This convenience means callers didn't need to
find a place to allocate a CPU mask just to handle the common case.
Let's extend the API to take a CPU ID to exclude instead of just a
boolean. This isn't any more complex for the API to handle and allows the
hardlockup detector to exclude a different CPU (the one it already did a
trace for) without needing to find space for a CPU mask.
Arguably, this new API also encourages safer behavior. Specifically if
the caller wants to avoid tracing the current CPU (maybe because they
already traced the current CPU) this makes it more obvious to the caller
that they need to make sure that the current CPU ID can't change.
[akpm@linux-foundation.org: fix trigger_allbutcpu_cpu_backtrace() stub]
Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
&o2net_debug_lock is acquired by timer o2net_idle_timer() along the
following call chain. Thus the acquisition of the lock under process
context should disable bottom half, otherwise deadlock could happen if the
timer happens to preempt the execution while the lock is held in process
context on the same CPU.
<timer interrupt>
-> o2net_idle_timer()
-> queue_delayed_work()
-> sc_put()
-> sc_kref_release()
-> o2net_debug_del_sc()
-> spin_lock(&o2net_debug_lock);
Several lock acquisition of &o2net_debug_lock under process context do not
disable irq or bottom half. The patch fixes these potential deadlocks
scenerio by using spin_lock_bh() on &o2net_debug_lock.
This flaw was found by an experimental static analysis tool I am
developing for irq-related deadlock. x86_64 allmodconfig using gcc shows
no new warning.
Link: https://lkml.kernel.org/r/20230802131436.17765-1-dg573847474@gmail.com
Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
&qs->qs_lock is acquired by timer o2net_idle_timer() along the following
call chain. Thus the acquisition of the lock under process context should
disable bottom half, otherwise deadlock could happen if the timer happens
to preempt the execution while the lock is held in process context on the
same CPU.
<timer interrupt>
-> o2net_idle_timer()
-> o2quo_conn_err()
-> spin_lock(&qs->qs_lock)
Several lock acquisition of &qs->qs_lock under process contex do not
disable irq or bottom half. The patch fixes these potential deadlocks
scenerio by using spin_lock_bh() on &qs->qs_lock.
This flaw was found by an experimental static analysis tool I am
developing for irq-related deadlock. x86_64 allmodconfig using gcc shows
no new warning.
Link: https://lkml.kernel.org/r/20230802123824.15301-1-dg573847474@gmail.com
Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lx-symbols assumes that module's .text sections is located at
`module->mem[MOD_TEXT].base` and passes it to add-symbol-file command.
However, .text section follows after .plt section in modules built by LLVM
toolchain for arm64 target. Symbol addresses are skewed in GDB.
Fix this issue by using the address of .text section stored in
`module->sect_attrs`.
Link: https://lkml.kernel.org/r/20230801121052.2475183-1-koudai@google.com
Signed-off-by: Koudai Iwahori <koudai@google.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gcov uses global functions that are called from generated code, but these
have no prototype in a header, which causes a W=1 build warning:
kernel/gcov/gcc_base.c:12:6: error: no previous prototype for '__gcov_init' [-Werror=missing-prototypes]
kernel/gcov/gcc_base.c:40:6: error: no previous prototype for '__gcov_flush' [-Werror=missing-prototypes]
kernel/gcov/gcc_base.c:46:6: error: no previous prototype for '__gcov_merge_add' [-Werror=missing-prototypes]
kernel/gcov/gcc_base.c:52:6: error: no previous prototype for '__gcov_merge_single' [-Werror=missing-prototypes]
Just turn off these warnings unconditionally for the two files that
contain them.
Link: https://lore.kernel.org/all/0820010f-e9dc-779d-7924-49c7df446bce@linux.ibm.com/
Link: https://lkml.kernel.org/r/20230725123042.2269077-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch is a minor cleanup to the series "refactor Kconfig to
consolidate KEXEC and CRASH options".
In that series, a new option ARCH_DEFAULT_KEXEC was introduced in order to
obtain the equivalent behavior of s390 original Kconfig settings for
KEXEC. As it turns out, this new option did not fully provide the
equivalent behavior, rather a "select KEXEC" did.
As such, the ARCH_DEFAULT_KEXEC is not needed anymore, so remove it.
Link: https://lkml.kernel.org/r/20230802161750.2215-1-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>