The double indirect bio leads to somewhat suboptimal code generation.
Instead return the (original or split) bio, and make sure the
request_queue arguments to the lower level helpers is passed after the
bio to avoid constant reshuffling of the argument passing registers.
Also give it and the helpers used to implement it more descriptive names.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220727162300.3089193-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for
all disks that do not have separately allocated queues, and thus remove
the need to call blk_cleanup_queue for them.
Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that
this function is intended only for separately allocated blk-mq queues.
This saves an extra queue freeze for devices without a separately
allocated queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, IO priority set in task's IO context is not reflected in the
bio->bi_ioprio for most IO (only io_uring and direct IO set it). This
results in odd results where process is submitting some bios with one
priority and other bios with a different (unset) priority and due to
differing priorities bios cannot be merged. Make sure bio->bi_ioprio is
always set on bio submission.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-9-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert blk-ioprio handling from a rqos policy to a direct call from
blk_mq_submit_bio(). Firstly, blk-ioprio is not much of a rqos policy
anyway, it just needs a hook in bio submission path to set the bio's IO
priority. Secondly, the rqos .track hook gets actually called too late
for blk-ioprio purposes and introducing a special rqos hook just for
blk-ioprio looks even weirder.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__blk_mq_delay_run_hw_queue() disables preemption to get a stable
current CPU number and then invokes __blk_mq_run_hw_queue() if the CPU
number is part the mask.
__blk_mq_run_hw_queue() acquires a spin_lock_t which is a sleeping lock
on PREEMPT_RT and can't be acquired with disabled preemption.
It is not required for correctness to invoke __blk_mq_run_hw_queue() on
a CPU matching hctx->cpumask. Both (async and direct requests) can run
on a CPU not matching hctx->cpumask.
The CPU mask without disabling preemption and invoking
__blk_mq_run_hw_queue().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/YrLSEiNvagKJaDs5@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the introduction of blk_mq_get_hctx_type() the operation type in
the second argument of blk_mq_get_hctx_type() matters. The introduction
of blk_mq_get_hctx_type() caused blk_mq_get_sq_hctx() to select a
hardware queue of type HCTX_TYPE_READ instead of HCTX_TYPE_DEFAULT.
Switch to hardware queue type HCTX_TYPE_DEFAULT since HCTX_TYPE_READ
should only be used for read requests.
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220615225549.1054905-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If rq_qos_throttle() ends up blocking, then we will have invalidated and
flushed our current plug. Since blk_mq_get_cached_request() hasn't
popped the cached request off the plug list just yet, we end holding a
pointer to a request that is no longer valid. This insta-crashes with
rq->mq_hctx being NULL in the validity checks just after.
Pop the request off the cached list before doing rq_qos_throttle() to
avoid using a potentially stale request.
Fixes: 0a5aa8d161 ("block: fix blk_mq_attempt_bio_merge and rq_qos_throttle protection")
Reported-by: Dylan Yudaken <dylany@fb.com>
Tested-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit 364b61818f ("blk-mq: clearing flush request reference in
tags->rqs[]") is added to clear the to-be-free flush request from
tags->rqs[] for avoiding use-after-free on the flush rq.
Yu Kuai reported that blk_mq_clear_flush_rq_mapping() slows down boot time
by ~8s because running scsi probe which may create and remove lots of
unpresent LUNs on megaraid-sas which uses BLK_MQ_F_TAG_HCTX_SHARED and
each request queue has lots of hw queues.
Improve the situation by not running blk_mq_clear_flush_rq_mapping if
disk isn't added when there can't be any flush request issued.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220616014401.817001-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
q->elevator is referred in blk_mq_has_sqsched() without any protection,
no .q_usage_counter is held, no queue srcu and rcu read lock is held,
so potential use-after-free may be triggered.
Fix the issue by adding one queue flag for checking if the elevator
uses single queue style dispatch. Meantime the elevator feature flag
of ELEVATOR_F_MQ_AWARE isn't needed any more.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch prevents that test nvme/004 triggers the following:
UBSAN: array-index-out-of-bounds in block/blk-mq.h:135:9
index 512 is out of range for type 'long unsigned int [512]'
Call Trace:
show_stack+0x52/0x58
dump_stack_lvl+0x49/0x5e
dump_stack+0x10/0x12
ubsan_epilogue+0x9/0x3b
__ubsan_handle_out_of_bounds.cold+0x44/0x49
blk_mq_alloc_request_hctx+0x304/0x310
__nvme_submit_sync_cmd+0x70/0x200 [nvme_core]
nvmf_connect_io_queue+0x23e/0x2a0 [nvme_fabrics]
nvme_loop_connect_io_queues+0x8d/0xb0 [nvme_loop]
nvme_loop_create_ctrl+0x58e/0x7d0 [nvme_loop]
nvmf_create_ctrl+0x1d7/0x4d0 [nvme_fabrics]
nvmf_dev_write+0xae/0x111 [nvme_fabrics]
vfs_write+0x144/0x560
ksys_write+0xb7/0x140
__x64_sys_write+0x42/0x50
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220615210004.1031820-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block request execute cleanups from Jens Axboe:
"This change was advertised in the initial core block pull request, but
didn't actually make that branch as we deferred it to a post-merge
pull request to avoid a bunch of cross branch issues.
This series cleans up the block execute path quite nicely"
* tag 'for-5.19/block-exec-2022-06-02' of git://git.kernel.dk/linux-block:
blk-mq: remove the done argument to blk_execute_rq_nowait
blk-mq: avoid a mess of casts for blk_end_sync_rq
blk-mq: remove __blk_execute_rq_nowait
Pull block fixes from Jens Axboe:
"Just a collection of fixes that have been queued up since the initial
merge window pull request, the majority of which are targeted for
stable as well.
One bio_set fix that fixes an issue with the dm adoption of cached bio
structs that got introduced in this merge window"
* tag 'for-5.19/block-2022-06-02' of git://git.kernel.dk/linux-block:
block: Fix potential deadlock in blk_ia_range_sysfs_show()
block: fix bio_clone_blkg_association() to associate with proper blkcg_gq
block: remove useless BUG_ON() in blk_mq_put_tag()
blk-mq: do not update io_ticks with passthrough requests
block: make bioset_exit() fully resilient against being called twice
block: use bio_queue_enter instead of blk_queue_enter in bio_poll
block: document BLK_STS_AGAIN usage
block: take destination bvec offsets into account in bio_copy_data_iter
blk-iolatency: Fix inflight count imbalances and IO hangs on offline
blk-mq: don't touch ->tagset in blk_mq_get_sq_hctx
Flush or passthrough requests are not accounted as normal IO in completion.
To reflect iostat for slow IO, io_ticks is updated when stat show called
based on inflight numbers.
It may cause inconsistent io_ticks calculation result.
So do not account non-passthrough request when check inflight.
Fixes: 86d7331299 ("block: update io_ticks when io hang")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: samuelliao <samuelliao@tencent.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220530064059.1120058-1-haisuwang@tencent.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of trying to cast a __bitwise 32-bit integer to a larger integer
and then a pointer, just allow a struct with the blk_status_t and the
completion on stack and set the end_io_data to that. Use the
opportunity to move the code to where it belongs and drop rather
confusing comments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220524121530.943123-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block updates from Jens Axboe:
"Here are the core block changes for 5.19. This contains:
- blk-throttle accounting fix (Laibin)
- Series removing redundant assignments (Michal)
- Expose bio cache via the bio_set, so that DM can use it (Mike)
- Finish off the bio allocation interface cleanups by dealing with
the weirdest member of the family. bio_kmalloc combines a kmalloc
for the bio and bio_vecs with a hidden bio_init call and magic
cleanup semantics (Christoph)
- Clean up the block layer API so that APIs consumed by file systems
are (almost) only struct block_device based, so that file systems
don't have to poke into block layer internals like the
request_queue (Christoph)
- Clean up the blk_execute_rq* API (Christoph)
- Clean up various lose end in the blk-cgroup code to make it easier
to follow in preparation of reworking the blkcg assignment for bios
(Christoph)
- Fix use-after-free issues in BFQ when processes with merged queues
get moved to different cgroups (Jan)
- BFQ fixes (Jan)
- Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming,
Wolfgang, me)"
* tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits)
blk-mq: fix typo in comment
bfq: Remove bfq_requeue_request_body()
bfq: Remove superfluous conversion from RQ_BIC()
bfq: Allow current waker to defend against a tentative one
bfq: Relax waker detection for shared queues
blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()
blk-throttle: Set BIO_THROTTLED when bio has been throttled
blk-cgroup: Remove unnecessary rcu_read_lock/unlock()
blk-cgroup: always terminate io.stat lines
block, bfq: make bfq_has_work() more accurate
block, bfq: protect 'bfqd->queued' by 'bfqd->lock'
block: cleanup the VM accounting in submit_bio
block: Fix the bio.bi_opf comment
block: reorder the REQ_ flags
blk-iocost: combine local_stat and desc_stat to stat
block: improve the error message from bio_check_eod
block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone
block: remove superfluous calls to blkcg_bio_issue_init
kthread: unexport kthread_blkcg
blk-cgroup: cleanup blkcg_maybe_throttle_current
...
blk_mq_run_hw_queues() could be run when there isn't queued request and
after queue is cleaned up, at that time tagset is freed, because tagset
lifetime is covered by driver, and often freed after blk_cleanup_queue()
returns.
So don't touch ->tagset for figuring out current default hctx by the mapping
built in request queue, so use-after-free on tagset can be avoided. Meantime
this way should be fast than retrieving mapping from tagset.
Cc: "yukuai (C)" <yukuai3@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Fixes: b6e68ee825 ("blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220522122350.743103-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First we can't add request into plug list in blk_mq_request_bypass_insert
which may be called when flushing plug list, so nested plug is caused.
Second if polled passthrough request is inserted via blk_execute_rq(),
it can't be added to plug list too since io polling needs the request
to be issued to driver.
Fixes the two by moving plugging into blk_execute_rq_no_wait().
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 1c2d2fff6d ("block: wire-up support for passthrough plugging")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220512140010.1458645-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for plugging in passthrough path. When plugging is enabled, the
requests are added to a plug instead of getting dispatched to the driver.
And when the plug is finished, the whole batch gets dispatched via
->queue_rqs which turns out to be more efficient. Otherwise dispatching
used to happen via ->queue_rq, one request at a time.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-3-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 0006707723. It has a
couple problems:
* bio_issue_time() is stored in bio->bi_issue truncated to 51 bits. This
overflows in slightly over 26 days. Setting rq->io_start_time_ns with it
means that io duration calculation would yield >26days after 26 days of
uptime. This, for example, confuses kyber making it cause high IO
latencies.
* rq->io_start_time_ns should record the time that the IO is issued to the
device so that on-device latency can be measured. However,
bio_issue_time() is set before the bio goes through the rq-qos controllers
(wbt, iolatency, iocost), so when the bio gets throttled in any of the
mechanisms, the measured latencies make no sense - on-device latencies end
up higher than request-alloc-to-completion latencies.
We'll need a smarter way to avoid calling ktime_get_ns() repeatedly
back-to-back. For now, let's revert the commit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.16+
Link: https://lore.kernel.org/r/YmmeOLfo5lzc+8yI@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"Either fixes or a few additions that got missed in the initial merge
window pull. In detail:
- List iterator fix to avoid leaking value post loop (Jakob)
- One-off fix in minor count (Christophe)
- Fix for a regression in how io priority setting works for an
exiting task (Jiri)
- Fix a regression in this merge window with blkg_free() being called
in an inappropriate context (Ming)
- Misc fixes (Ming, Tom)"
* tag 'for-5.18/block-2022-04-01' of git://git.kernel.dk/linux-block:
blk-wbt: remove wbt_track stub
block: use dedicated list iterator variable
block: Fix the maximum minor value is blk_alloc_ext_minor()
block: restore the old set_task_ioprio() behaviour wrt PF_EXITING
block: avoid calling blkg_free() in atomic context
lib/sbitmap: allocate sb->map via kvzalloc_node
Pull NVMe write streams removal from Jens Axboe:
"This removes the write streams support in NVMe. No vendor ever really
shipped working support for this, and they are not interested in
supporting it.
With the NVMe support gone, we have nothing in the tree that supports
this. Remove passing around of the hints.
The only discussion point in this patchset imho is the fact that the
file specific write hint setting/getting fcntl helpers will now return
-1/EINVAL like they did before we supported write hints. No known
applications use these functions, I only know of one prototype that I
help do for RocksDB, and that's not used. That said, with a change
like this, it's always a bit controversial. Alternatively, we could
just make them return 0 and pretend it worked. It's placement based
hints after all"
* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
fs: remove fs.f_write_hint
fs: remove kiocb.ki_hint
block: remove the per-bio/request write hint
nvme: remove support or stream based temperature hint
We used to sort the plug list if we had multiple queues before dispatching
requests to the IO scheduler. This usually isn't needed, but for certain
workloads that interleave requests to disks, it's a less efficient to
process the plug list one-by-one if everything is interleaved.
Don't sort the list, but skip through it and flush out entries that have
the same target at the same time.
Fixes: df87eb0fce ("block: get rid of plug list sorting")
Reported-and-tested-by: Song Liu <song@kernel.org>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The queue's top debugfs dir is removed from blk_release_queue(), so all
hctx's debugfs dirs are removed from there. Given blk_mq_exit_queue()
is only called from blk_cleanup_queue(), it isn't necessary to remove
hctx debugfs from blk_mq_exit_queue().
So remove it from blk_mq_exit_queue().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308055200.735835-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First code becomes more clean by switching to xarray from plain array.
Second use-after-free on q->queue_hw_ctx can be fixed because
queue_for_each_hw_ctx() may be run when updating nr_hw_queues is
in-progress. With this patch, q->hctx_table is defined as xarray, and
this structure will share same lifetime with request queue, so
queue_for_each_hw_ctx() can use q->hctx_table to lookup hctx reliably.
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com
[axboe: fix blk_mq_hw_ctx forward declaration]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is inevitable to cause use-after-free on q->queue_hw_ctx between
queue_for_each_hw_ctx() and blk_mq_update_nr_hw_queues(). And converting
to xarray can fix the uaf, meantime code gets cleaner.
Prepare for converting q->queue_hctx_ctx into xarray, one thing is that
xa_for_each() can only accept 'unsigned long' as index, so changes type
of hctx index of queue_for_each_hw_ctx() into 'unsigned long'.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current code always uses default queue map and hw queue index
for figuring out the numa node for hw queue, this way isn't correct
because blk-mq supports three queue maps, and the correct queue map
should be used for the specified hw queue.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 9d497e2941 ("block: don't protect submit_bio_checks by
q_usage_counter") moved blk_mq_attempt_bio_merge and rq_qos_throttle
calls out of q_usage_counter protection. However, these functions require
q_usage_counter protection. The blk_mq_attempt_bio_merge call without
the protection resulted in blktests block/005 failure with KASAN null-
ptr-deref or use-after-free at bio merge. The rq_qos_throttle call
without the protection caused kernel hang at qos throttle.
To fix the failures, move the blk_mq_attempt_bio_merge and
rq_qos_throttle calls back to q_usage_counter protection.
Fixes: 9d497e2941 ("block: don't protect submit_bio_checks by q_usage_counter")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20220308080915.3473689-1-shinichiro.kawasaki@wdc.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blk_mq_delay_run_hw_queues sets an hctx to run in the future, it can
reset the delay length for an already pending delayed work run_work. This
creates a scenario where multiple hctx may have their queues set to run,
but if one runs first and finds nothing to do, it can reset the delay of
another hctx and stall the other hctx's ability to run requests.
To avoid this I/O stall when an hctx's run_work is already pending,
leave it untouched to run at its current designated time rather than
extending its delay. The work will still run which keeps closed the race
calling blk_mq_delay_run_hw_queues is needed for while also avoiding the
I/O stall.
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220131203337.GA17666@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>