Mirror of https://github.com/torvalds/linux.git, synced 2025-11-30 23:16:01 +07:00.
Remove bcachefs core code
bcachefs was marked 'externally maintained' in 6.17 but the code remained to make the transition smoother. It's now a DKMS module, making the in-kernel code stale, so remove it to avoid any version confusion.

Link: https://lore.kernel.org/linux-bcachefs/yokpt2d2g2lluyomtqrdvmkl3amv3kgnipmenobkpgx537kay7@xgcgjviv3n7x/T/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
@@ -1,186 +0,0 @@
.. SPDX-License-Identifier: GPL-2.0

bcachefs coding style
=====================

Good development is like gardening, and codebases are our gardens. Tend to them every day; look for little things that are out of place or in need of tidying. A little weeding here and there goes a long way; don't wait until things have spiraled out of control.

Things don't always have to be perfect - nitpicking often does more harm than good. But appreciate beauty when you see it - and let people know.

The code that you are afraid to touch is the code most in need of refactoring.

A little organizing here and there goes a long way.

Put real thought into how you organize things.

Good code is readable code, where the structure is simple and leaves nowhere for bugs to hide.

Assertions are one of our most important tools for writing reliable code. If in the course of writing a patchset you encounter a condition that shouldn't happen (and will have unpredictable or undefined behaviour if it does), or you're not sure if it can happen and not sure how to handle it yet - make it a BUG_ON(). Don't leave undefined or unspecified behavior lurking in the codebase.

By the time you finish the patchset, you should understand better which assertions need to be handled and turned into checks with error paths, and which should be logically impossible. Leave the BUG_ON()s in for the ones which are logically impossible. (Or, make them debug mode assertions if they're expensive - but don't turn everything into a debug mode assertion, so that we're not stuck debugging undefined behaviour should it turn out that you were wrong.)

Assertions are documentation that can't go out of date. Good assertions are wonderful.

Good assertions drastically and dramatically reduce the amount of testing required to shake out bugs.

Good assertions are based on state, not logic. To write good assertions, you have to think about what the invariants on your state are.

Good invariants and assertions will hold everywhere in your codebase. This means that you can run them in only a few places in the checked in version, but should you need to debug something that caused the assertion to fail, you can quickly shotgun them everywhere to find the codepath that broke the invariant.

A good assertion checks something that the compiler could check for us, and elide - if we were working in a language with embedded correctness proofs that the compiler could check. This is something that exists today, but it'll likely still be a few decades before it comes to systems programming languages. But we can still incorporate that kind of thinking into our code and document the invariants with runtime checks - much like the way people working in dynamically typed languages may add type annotations, gradually making their code statically typed.

Looking for ways to make your assertions simpler - and higher level - will often nudge you towards making the entire system simpler and more robust.
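
For example, a minimal sketch of a state-based assertion - the field names are borrowed from the allocator code removed elsewhere in this commit, but the helper itself is purely illustrative::

  /* Invariant: a bucket's sector counts can never exceed the bucket size. */
  static inline void bucket_check_invariants(const struct bucket *b,
                                             unsigned bucket_size)
  {
          BUG_ON(b->stripe_sectors + b->dirty_sectors + b->cached_sectors >
                 bucket_size);
  }
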
Good code is code where you can poke around and see what it's doing - introspection. We can't debug anything if we can't see what's going on.

Whenever we're debugging, and the solution isn't immediately obvious, if the issue is that we don't know where the issue is because we can't see what's going on - fix that first.

We have the tools to make anything visible at runtime, efficiently - RCU and percpu data structures among them. Don't let things stay hidden.

The most important tool for introspection is the humble pretty printer - in bcachefs, this means `*_to_text()` functions, which output to printbufs.

Pretty printers are wonderful, because they compose and you can use them everywhere. Having functions to print whatever object you're working with will make your error messages much easier to write (therefore they will actually exist) and much more informative. And they can be used from sysfs/debugfs, as well as tracepoints.
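
A minimal sketch of the pattern - prt_printf() and the printbuf type are the real helpers used throughout fs/bcachefs, but this particular function and the fields it prints are illustrative::

  void bch2_bucket_to_text(struct printbuf *out, const struct bucket *b)
  {
          prt_printf(out, "gen %u", b->gen);
          prt_printf(out, " dirty_sectors %u", b->dirty_sectors);
          prt_printf(out, " cached_sectors %u", b->cached_sectors);
  }

The same function can then back an error message, a sysfs/debugfs file and a tracepoint, so the introspection only has to be written once.
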
Runtime info and debugging tools should come with clear descriptions and labels, and good structure - we don't want files with a list of bare integers, like in procfs. Part of the job of the debugging tools is to educate users and new developers as to how the system works.

Error messages should, whenever possible, tell you everything you need to debug the issue. It's worth putting effort into them.

Tracepoints shouldn't be the first thing you reach for. They're an important tool, but always look for more immediate ways to make things visible. When we have to rely on tracing, we have to know which tracepoints we're looking for, and then we have to run the troublesome workload, and then we have to sift through logs. This is a lot of steps to go through when a user is hitting something, and if it's intermittent it may not even be possible.

The humble counter is an incredibly useful tool. They're cheap and simple to use, and many complicated internal operations with lots of things that can behave weirdly (anything involving memory reclaim, for example) become shockingly easy to debug once you have counters on every distinct codepath.

Persistent counters are even better.
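
As a sketch of the idea using generic kernel per-CPU counters (bcachefs's persistent counters have their own machinery; this only shows the shape of "one counter per distinct codepath")::

  static DEFINE_PER_CPU(unsigned long, reclaim_slowpath_count);

  /* called from the codepath we want visibility into */
  static inline void count_reclaim_slowpath(void)
  {
          this_cpu_inc(reclaim_slowpath_count);
  }
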
When debugging, try to get the most out of every bug you come across; don't rush to fix the initial issue. Look for things that will make related bugs easier the next time around - introspection, new assertions, better error messages, new debug tools - and do those first. Look for ways to make the system better behaved; often one bug will uncover several other bugs through downstream effects.

Fix all that first, and then the original bug last - even if that means keeping a user waiting. They'll thank you in the long run, and when they understand what you're doing you'll be amazed at how patient they're happy to be. Users like to help - otherwise they wouldn't be reporting the bug in the first place.

Talk to your users. Don't isolate yourself.

Users notice all sorts of interesting things, and by just talking to them and interacting with them you can benefit from their experience.

Spend time doing support and helpdesk stuff. Don't just write code - code isn't finished until it's being used trouble-free.

This will also motivate you to make your debugging tools as good as possible, and perhaps even your documentation, too. Like anything else in life, the more time you spend at it the better you'll get, and you the developer are the person most able to improve the tools to make debugging quick and easy.

Be wary of how you take on and commit to big projects. Don't let development become product-manager focused. Often an idea is a good one but needs to wait for its proper time - but you won't know if it's the proper time for an idea until you start writing code.

Expect to throw a lot of things away, or leave them half finished for later. Nobody writes all perfect code that all gets shipped, and you'll be much more productive in the long run if you notice this early and shift to something else. The experience gained and lessons learned will be valuable for all the other work you do.

But don't be afraid to tackle projects that require significant rework of existing code. Sometimes these can be the best projects, because they can lead us to make existing code more general, more flexible, more multipurpose and perhaps more robust. Just don't hesitate to abandon the idea if it looks like it's going to make a mess of things.

Complicated features can often be done as a series of refactorings, with the final change that actually implements the feature as a quite small patch at the end. It's wonderful when this happens, especially when those refactorings are things that improve the codebase in their own right. When that happens there's much less risk of wasted effort if the feature you were going for doesn't work out.

Always strive to work incrementally. Always strive to turn the big projects into little bite-sized projects that can prove their own merits.

Instead of always tackling those big projects, look for little things that will be useful, and make the big projects easier.

The question of what's likely to be useful is where junior developers most often go astray - doing something because it seems like it'll be useful often leads to overengineering. Knowing what's useful comes from many years of experience, or talking with people who have that experience - or from simply reading lots of code and looking for common patterns and issues. Don't be afraid to throw things away and do something simpler.

Talk about your ideas with your fellow developers; oftentimes the best things come from relaxed conversations where people aren't afraid to say "what if?".

Don't neglect your tools.

The most important tools (besides the compiler and our text editor) are the tools we use for testing. The shortest possible edit/test/debug cycle is essential for working productively. We learn, gain experience, and discover the errors in our thinking by running our code and seeing what happens. If your time is being wasted because your tools are bad or too slow - don't accept it, fix it.

Put effort into your documentation, commit messages, and code comments - but don't go overboard. A good commit message is wonderful - but if the information was important enough to go in a commit message, ask yourself if it would be even better as a code comment.

A good code comment is wonderful, but even better is the comment that didn't need to exist because the code was so straightforward as to be obvious: organized into small, clean and tidy modules, with clear and descriptive names for functions and variables, where every line of code has a clear purpose.
@@ -1,105 +0,0 @@
Submitting patches to bcachefs
==============================

Here are suggestions for submitting patches to the bcachefs subsystem.

Submission checklist
--------------------

Patches must be tested before being submitted, either with the xfstests suite [0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being touched. Note that ktest wraps xfstests and will be an easier way to run it for most users; it includes single-command wrappers for all the mainstream in-kernel local filesystems.

Patches will undergo more testing after being merged (including lockdep/kasan/preempt/etc. variants); these are not generally required to be run by the submitter. But do put some thought into what you're changing and which tests might be relevant: e.g. if you're dealing with tricky memory layout work, run kasan; if you're doing locking work, then lockdep. ktest includes single-command variants for the debug build types you'll most likely need.

The exception to this rule is incomplete WIP/RFC patches: if you're working on something nontrivial, it's encouraged to send out a WIP patch to let people know what you're doing and make sure you're on the right track. Just make sure it includes a brief note as to what's done and what's incomplete, to avoid confusion.

Rigorous checkpatch.pl adherence is not required (many of its warnings are considered out of date), but try not to deviate too much without reason.

Focus on writing code that reads well and is organized well; code should be aesthetically pleasing.

CI
--

Instead of running your tests locally, when running the full test suite it's preferable to let a server farm do it in parallel, and then have the results in a nice test dashboard (which can tell you which failures are new, and presents results in a git log view, avoiding the need for most bisecting).

That exists [2]_, and community members may request an account. If you work for a big tech company, you'll need to help out with server costs to get access - but the CI is not restricted to running bcachefs tests: it runs any ktest test (which generally makes it easy to wrap other tests that can run in qemu).

Other things to think about
---------------------------

- How will we debug this code? Is there sufficient introspection to diagnose when something starts acting wonky on a user machine?

  We don't necessarily need every single field of every data structure visible with introspection, but having the important fields of all the core data types wired up makes debugging drastically easier - a bit of thoughtful foresight greatly reduces the need to have people build custom kernels with debug patches.

  More broadly, think about all the debug tooling that might be needed.

- Does it make the codebase more or less of a mess? Can we also try to do some organizing, too?

- Do new tests need to be written? New assertions? How do we know and verify that the code is correct, and what happens if something goes wrong?

  We don't yet have automated code coverage analysis or easy fault injection - but for now, pretend we did and ask what they might tell us.

  Assertions are hugely important, given that we don't yet have a systems language that can do ergonomic embedded correctness proofs. Hitting an assert in testing is much better than wandering off into undefined behaviour la-la land - use them. Use them judiciously, and not as a replacement for proper error handling, but use them.

- Does it need to be performance tested? Should we add new performance counters?

  bcachefs has a set of persistent runtime counters which can be viewed with the 'bcachefs fs top' command; this should give users a basic idea of what their filesystem is currently doing. If you're doing a new feature or looking at old code, think about whether anything should be added.

- If it's a new on-disk format feature - have upgrades and downgrades been tested? (Automated tests exist but aren't in the CI, due to the hassle of disk image management; coordinate to have them run.)

Mailing list, IRC
-----------------

Patches should hit the list [3]_, but much discussion and code review happens on IRC as well [4]_; many people appreciate the more conversational approach and quicker feedback.

Additionally, we have a lively user community doing excellent QA work, which exists primarily on IRC. Please make use of that resource; user feedback is important for any nontrivial feature, and documenting it in commit messages would be a good idea.

.. rubric:: References

.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
.. [1] https://evilpiepirate.org/git/ktest.git/
.. [2] https://evilpiepirate.org/~testdashboard/ci/
.. [3] linux-bcachefs@vger.kernel.org
.. [4] irc.oftc.net#bcache, #bcachefs-dev
@@ -1,108 +0,0 @@
.. SPDX-License-Identifier: GPL-2.0

Casefolding
===========

bcachefs has support for case-insensitive file and directory lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`) casefolding attributes.

The main use case for casefolding is compatibility with software written against other filesystems that rely on casefolded lookups (e.g. NTFS and Wine/Proton). Taking advantage of filesystem-level casefolding can lead to great loading time gains in many applications and games.

Casefolding support requires a kernel with `CONFIG_UNICODE` enabled. Once a directory has been flagged for casefolding, a feature bit is enabled on the superblock which marks the filesystem as using casefolding. When the feature bit for casefolding is enabled, it is no longer possible to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.

On the lookup/query side: casefolding is implemented by allocating a new string of `BCH_NAME_MAX` length and using the `utf8_casefold` function to casefold the query string.
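
A rough sketch of that query-side step - `utf8_casefold()` is the generic helper from fs/unicode, but the surrounding buffer handling here is simplified and is not the exact bcachefs code::

  struct qstr name = QSTR_INIT(query, query_len);
  unsigned char *cf_name = kmalloc(BCH_NAME_MAX, GFP_KERNEL);
  int cf_len;

  if (!cf_name)
          return -ENOMEM;

  cf_len = utf8_casefold(sb->s_encoding, &name, cf_name, BCH_NAME_MAX);
  /* cf_len <= 0 means the name wasn't valid UTF-8 and can't be casefolded */

The casefolded copy, not the original query string, is then what gets hashed and compared against the cached casefolded name described below.
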
On the dirent side: casefolding is implemented by ensuring the `bkey`'s hash is made from the casefolded string and storing the cached casefolded name with the regular name in the dirent.

The structure looks like this:

* Regular:    [dirent data][regular name][nul][nul]...
* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...

(Do note, the number of NULs here is merely for illustration; their count can vary per-key, and they may not even be present if the key is aligned to `sizeof(u64)`.)

This is efficient, as it means that all file lookups that require casefolding have identical performance to a regular lookup: a hash comparison and a `memcmp` of the name.
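
Rendered as a struct purely for illustration - the real definition is the `d_cf_name_block` union member of `bch_dirent`, and the field names here are invented for readability::

  struct cf_name_block {
          __le16  regular_name_len;       /* "reg len" above */
          __le16  casefolded_name_len;    /* "cf len" above */
          __u8    names[];                /* [regular name][casefolded name][nul padding] */
  };

The two explicit lengths are what allow the regular and casefolded names to differ in length, which matters for the design discussion below.
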
Rationale
---------

Several designs were considered for this system. One was to introduce a dirent_v2; however, that would be painful, especially as the hash system only has support for a single key type. This would also need `BCH_NAME_MAX` to change between versions, and a new feature bit.

Another option was to store the regular and casefolded names contiguously without the two lengths, and just take half of the combined length as the length of each. This would assume that the regular length == casefolded length, but that could potentially not be true, if the uppercase unicode glyph had a different UTF-8 encoding than the lowercase unicode glyph. It would be possible to disregard the casefold cache for those cases, but it was decided to simply encode the two string lengths in the key to avoid random performance issues if this edge case was ever hit.

The option settled on was to use a free bit in d_type to mark a dirent as having a casefold cache, and then treat the first 4 bytes of the name block as lengths. You can see this in the `d_cf_name_block` member of the union in `bch_dirent`.

The feature bit is used so that casefolding support can be enabled for the majority of users, while still allowing users who have no need for the feature to use bcachefs without it, since `CONFIG_UNICODE` can increase the kernel size by a significant amount due to the tables used, which may be the deciding factor for using bcachefs on e.g. embedded platforms.

Other filesystems like ext4 and f2fs have a superblock-level option for the casefolding encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose any encodings other than a single UTF-8 version. When future encodings are desirable, they will be added trivially using the opts mechanism.

dentry/dcache considerations
----------------------------

Currently, in casefolded directories, bcachefs (like other filesystems) will not cache negative dentries.

This is because doing so currently presents a problem in the following scenario:

- Lookup of file "blAH" in a casefolded directory
- Creation of file "BLAH" in a casefolded directory
- Lookup of file "blAH" in a casefolded directory

The third lookup would incorrectly fail if negative dentries were cached.

This is slightly suboptimal, but could be fixed in future with some vfs work.

References
----------

(from Peter Anvin, on the list)

It is worth noting that Microsoft has basically declared their "recommended" case folding (upcase) table to be permanently frozen (for new filesystem instances in the case where they use an on-disk translation table created at format time). As far as I know they have never supported anything other than 1:1 conversion of BMP code points, nor normalization.

The exFAT specification enumerates the full recommended upcase table, although in a somewhat annoying format (basically a hex dump of compressed data):

https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification
@@ -1,30 +0,0 @@
.. SPDX-License-Identifier: GPL-2.0

bcachefs private error codes
----------------------------

In bcachefs, as a hard rule we do not throw or directly use standard error codes (-EINVAL, -EBUSY, etc.). Instead, we define private error codes as needed in fs/bcachefs/errcode.h.

This gives us much better error messages and makes debugging much easier. Any direct uses of standard error codes you see in the source code are simply old code that has yet to be converted - feel free to clean it up!

Private error codes may subtype another error code; this allows for grouping of related errors that should be handled similarly (e.g. transaction restart errors), as well as specifying which standard error code should be returned at the bcachefs module boundary.

At the module boundary, we use bch2_err_class() to convert to a standard error code; this also emits a trace event so that the original error code can be recovered even if it wasn't logged.
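
A minimal sketch of how callers typically consume these codes - bch2_err_matches() and bch2_err_class() are the real helpers, but the surrounding function is hypothetical::

  int example_outer_op(struct btree_trans *trans)
  {
          int ret = example_inner_op(trans);      /* hypothetical; may throw private codes */

          /* subtyped codes can be matched as a whole class: */
          if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
                  return ret;                     /* caller will retry the transaction */

          /* at the module boundary, convert to the standard errno the code subtypes: */
          return bch2_err_class(ret);
  }
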
Do not reuse error codes! Generally speaking, a private error code should only be thrown in one place. That means that when we see it in a log message we can see, unambiguously, exactly which file and line number it was returned from.

Try to give error codes names that are as reasonably descriptive of the error as possible. Frequently, the error will be logged at a place far removed from where the error was generated; good names for error codes mean much more descriptive and useful error messages.
@@ -1,78 +0,0 @@
Idle/background work classes design doc:

Right now, our behaviour at idle isn't ideal: it was designed for servers that would be under sustained load, to keep pending work at a "medium" level, to let work build up so we can process it in more efficient batches, while also giving headroom for bursts in load.

But for desktops or mobile - scenarios where work is less sustained and power usage is more important - we want to operate differently, with a "rush to idle" so the system can go to sleep. We don't want to be dribbling out background work while the system should be idle.

The complicating factor is that there are a number of background tasks, which form a hierarchy (or a digraph, depending on how you divide it up) - one background task may generate work for another.

Thus proper idle detection needs to model this hierarchy.

- Foreground writes
- Page cache writeback
- Copygc, rebalance
- Journal reclaim

When we implement idle detection and rush to idle, we need to be careful not to disturb too much the existing behaviour that works reasonably well when the system is under sustained load (or perhaps improve it in the case of rebalance, which currently does not actively attempt to let work batch up).

SUSTAINED LOAD REGIME
---------------------

When the system is under continuous load, we want these jobs to run continuously - this is perhaps best modelled with a P/D controller, where they'll be trying to keep a target value (i.e. fragmented disk space, available journal space) roughly in the middle of some range.

The goal under sustained load is to balance our ability to handle load spikes without running out of a given resource (free disk space, free space in the journal), while also letting some work accumulate to be batched (or become unnecessary).

For example, we don't want to run copygc too aggressively, because then it will be evacuating buckets that would have become empty (been overwritten or deleted) anyway, and we don't want to wait until we're almost out of free space, because then the system will behave unpredictably - suddenly we're doing a lot more work to service each write and the system becomes much slower.

IDLE REGIME
-----------

When the system becomes idle, we should start flushing our pending work quicker so the system can go to sleep.

Note that the definition of "idle" depends on where in the hierarchy a task is - a task should start flushing work more quickly when the task above it has stopped generating new work.

e.g. rebalance should start flushing more quickly when page cache writeback is idle, and journal reclaim should only start flushing more quickly when both copygc and rebalance are idle.

It's important to let work accumulate when more work is still incoming and we still have room, because flushing is always more efficient if we let it batch up. New writes may overwrite data before rebalance moves it, and tasks may be generating more updates for the btree nodes that journal reclaim needs to flush.

On idle, how much work we do at each interval should be proportional to the length of time we have been idle for. If we're idle only for a short duration, we shouldn't flush everything right away; the system might wake up and start generating new work soon, and flushing immediately might end up doing a lot of work that would have been unnecessary if we'd allowed things to batch more.

To summarize, we will need (see the sketch below):

- A list of classes for background tasks that generate work, which will include one "foreground" class.
- Tracking for each class - "Am I doing work, or have I gone to sleep?"
- And each class should check the class above it when deciding how much work to issue.
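
A minimal sketch of that tracking - the class list, the names and the scaling policy are all illustrative, not a committed design::

  enum bch_work_class {
          WORK_CLASS_foreground,
          WORK_CLASS_writeback,
          WORK_CLASS_copygc_rebalance,
          WORK_CLASS_journal_reclaim,
          WORK_CLASS_NR,
  };

  struct work_class_state {
          u64     last_active;    /* when this class last issued work */
          bool    active;         /* "am I doing work, or have I gone to sleep?" */
  };

  /* scale a class's batch size by how long the class above it has been idle */
  static unsigned work_class_batch_size(struct work_class_state *parent,
                                        unsigned max_batch)
  {
          u64 idle_ns = ktime_get_ns() - parent->last_active;

          if (parent->active)
                  return 1;       /* parent still busy: let work accumulate */

          return min_t(u64, max_batch, idle_ns >> 30);    /* ~1 unit per second idle */
  }
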
@@ -1,38 +0,0 @@
.. SPDX-License-Identifier: GPL-2.0

======================
bcachefs Documentation
======================

Subsystem-specific development process notes
--------------------------------------------

Development notes specific to bcachefs. These are intended to supplement the :doc:`general kernel development handbook </process/index>`.

.. toctree::
   :maxdepth: 1
   :numbered:

   CodingStyle
   SubmittingPatches

Filesystem implementation
-------------------------

Documentation for filesystem features and their implementation details. At this moment, only a few of these are described here.

.. toctree::
   :maxdepth: 1
   :numbered:

   casefolding
   errorcodes

Future design
-------------

.. toctree::
   :maxdepth: 1

   future/idle_work
@@ -72,7 +72,6 @@ Documentation for filesystem implementations.
   afs
   autofs
   autofs-mount-control
   bcachefs/index
   befs
   bfs
   btrfs
@@ -4217,10 +4217,7 @@ M: Kent Overstreet <kent.overstreet@linux.dev>
L:	linux-bcachefs@vger.kernel.org
S:	Externally maintained
C:	irc://irc.oftc.net/bcache
P:	Documentation/filesystems/bcachefs/SubmittingPatches.rst
T:	git https://evilpiepirate.org/git/bcachefs.git
F:	fs/bcachefs/
F:	Documentation/filesystems/bcachefs/

BDISP ST MEDIA DRIVER
M:	Fabien Dessenne <fabien.dessenne@foss.st.com>
@@ -454,7 +454,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -411,7 +411,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -431,7 +431,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -403,7 +403,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -413,7 +413,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -430,7 +430,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -517,7 +517,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -403,7 +403,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -404,7 +404,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -420,7 +420,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -401,7 +401,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -401,7 +401,6 @@ CONFIG_XFS_FS=m
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
CONFIG_BTRFS_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_FANOTIFY=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_AUTOFS_FS=m

@@ -658,9 +658,6 @@ CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_BTRFS_DEBUG=y
CONFIG_BTRFS_ASSERT=y
CONFIG_NILFS2_FS=m
CONFIG_BCACHEFS_FS=y
CONFIG_BCACHEFS_QUOTA=y
CONFIG_BCACHEFS_POSIX_ACL=y
CONFIG_FS_DAX=y
CONFIG_EXPORTFS_BLOCK_OPS=y
CONFIG_FS_ENCRYPTION=y

@@ -645,9 +645,6 @@ CONFIG_OCFS2_FS=m
CONFIG_BTRFS_FS=y
CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_NILFS2_FS=m
CONFIG_BCACHEFS_FS=m
CONFIG_BCACHEFS_QUOTA=y
CONFIG_BCACHEFS_POSIX_ACL=y
CONFIG_FS_DAX=y
CONFIG_EXPORTFS_BLOCK_OPS=y
CONFIG_FS_ENCRYPTION=y
@@ -51,7 +51,6 @@ source "fs/ocfs2/Kconfig"
source "fs/btrfs/Kconfig"
source "fs/nilfs2/Kconfig"
source "fs/f2fs/Kconfig"
source "fs/bcachefs/Kconfig"
source "fs/zonefs/Kconfig"

endif # BLOCK
@@ -121,7 +121,6 @@ obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_BTRFS_FS) += btrfs/
obj-$(CONFIG_GFS2_FS) += gfs2/
obj-$(CONFIG_F2FS_FS) += f2fs/
obj-$(CONFIG_BCACHEFS_FS) += bcachefs/
obj-$(CONFIG_CEPH_FS) += ceph/
obj-$(CONFIG_PSTORE) += pstore/
obj-$(CONFIG_EFIVAR_FS) += efivarfs/
@@ -1,121 +0,0 @@
config BCACHEFS_FS
	tristate "bcachefs filesystem support (EXPERIMENTAL)"
	depends on BLOCK
	select EXPORTFS
	select CLOSURES
	select CRC32
	select CRC64
	select FS_POSIX_ACL
	select LZ4_COMPRESS
	select LZ4_DECOMPRESS
	select LZ4HC_COMPRESS
	select LZ4HC_DECOMPRESS
	select ZLIB_DEFLATE
	select ZLIB_INFLATE
	select ZSTD_COMPRESS
	select ZSTD_DECOMPRESS
	select CRYPTO_LIB_SHA256
	select CRYPTO_LIB_CHACHA
	select CRYPTO_LIB_POLY1305
	select KEYS
	select RAID6_PQ
	select XOR_BLOCKS
	select XXHASH
	select SRCU
	select SYMBOLIC_ERRNAME
	select MIN_HEAP
	select XARRAY_MULTI
	help
	  The bcachefs filesystem - a modern, copy on write filesystem, with
	  support for multiple devices, compression, checksumming, etc.

config BCACHEFS_QUOTA
	bool "bcachefs quota support"
	depends on BCACHEFS_FS
	select QUOTACTL

config BCACHEFS_ERASURE_CODING
	bool "bcachefs erasure coding (RAID5/6) support (EXPERIMENTAL)"
	depends on BCACHEFS_FS
	select QUOTACTL
	help
	  This enables the "erasure_code" filesystem and inode option, which
	  organizes data into reed-solomon stripes instead of ordinary
	  replication.

	  WARNING: this feature is still undergoing on disk format changes, and
	  should only be enabled for testing purposes.

config BCACHEFS_POSIX_ACL
	bool "bcachefs POSIX ACL support"
	depends on BCACHEFS_FS
	select FS_POSIX_ACL

config BCACHEFS_DEBUG
	bool "bcachefs debugging"
	depends on BCACHEFS_FS
	help
	  Enables many extra debugging checks and assertions.

	  The resulting code will be significantly slower than normal; you
	  probably shouldn't select this option unless you're a developer.

config BCACHEFS_INJECT_TRANSACTION_RESTARTS
	bool "Randomly inject transaction restarts"
	depends on BCACHEFS_DEBUG
	help
	  Randomly inject transaction restarts in a few core paths - may have a
	  significant performance penalty.

config BCACHEFS_TESTS
	bool "bcachefs unit and performance tests"
	depends on BCACHEFS_FS
	help
	  Include some unit and performance tests for the core btree code.

config BCACHEFS_LOCK_TIME_STATS
	bool "bcachefs lock time statistics"
	depends on BCACHEFS_FS
	help
	  Expose statistics in debugfs for how long we held a lock.

config BCACHEFS_NO_LATENCY_ACCT
	bool "disable latency accounting and time stats"
	depends on BCACHEFS_FS
	help
	  This disables device latency tracking and time stats, only for performance testing.

config BCACHEFS_SIX_OPTIMISTIC_SPIN
	bool "Optimistic spinning for six locks"
	depends on BCACHEFS_FS
	depends on SMP
	default y
	help
	  Instead of immediately sleeping when attempting to take a six lock that
	  is held by another thread, spin for a short while, as long as the
	  thread owning the lock is running.

config BCACHEFS_PATH_TRACEPOINTS
	bool "Extra btree_path tracepoints"
	depends on BCACHEFS_FS && TRACING
	help
	  Enable extra tracepoints for debugging btree_path operations; we don't
	  normally want these enabled because they happen at very high rates.

config BCACHEFS_TRANS_KMALLOC_TRACE
	bool "Trace bch2_trans_kmalloc() calls"
	depends on BCACHEFS_FS

config BCACHEFS_ASYNC_OBJECT_LISTS
	bool "Keep async objects on fast_lists for debugfs visibility"
	depends on BCACHEFS_FS && DEBUG_FS

config MEAN_AND_VARIANCE_UNIT_TEST
	tristate "mean_and_variance unit tests" if !KUNIT_ALL_TESTS
	depends on KUNIT
	depends on BCACHEFS_FS
	default KUNIT_ALL_TESTS
	help
	  This option enables the kunit tests for the mean_and_variance module.
	  If unsure, say N.
@@ -1,107 +0,0 @@
|
||||
|
||||
obj-$(CONFIG_BCACHEFS_FS) += bcachefs.o
|
||||
|
||||
bcachefs-y := \
|
||||
acl.o \
|
||||
alloc_background.o \
|
||||
alloc_foreground.o \
|
||||
backpointers.o \
|
||||
bkey.o \
|
||||
bkey_methods.o \
|
||||
bkey_sort.o \
|
||||
bset.o \
|
||||
btree_cache.o \
|
||||
btree_gc.o \
|
||||
btree_io.o \
|
||||
btree_iter.o \
|
||||
btree_journal_iter.o \
|
||||
btree_key_cache.o \
|
||||
btree_locking.o \
|
||||
btree_node_scan.o \
|
||||
btree_trans_commit.o \
|
||||
btree_update.o \
|
||||
btree_update_interior.o \
|
||||
btree_write_buffer.o \
|
||||
buckets.o \
|
||||
buckets_waiting_for_journal.o \
|
||||
chardev.o \
|
||||
checksum.o \
|
||||
clock.o \
|
||||
compress.o \
|
||||
darray.o \
|
||||
data_update.o \
|
||||
debug.o \
|
||||
dirent.o \
|
||||
disk_accounting.o \
|
||||
disk_groups.o \
|
||||
ec.o \
|
||||
enumerated_ref.o \
|
||||
errcode.o \
|
||||
error.o \
|
||||
extents.o \
|
||||
extent_update.o \
|
||||
eytzinger.o \
|
||||
fast_list.o \
|
||||
fs.o \
|
||||
fs-ioctl.o \
|
||||
fs-io.o \
|
||||
fs-io-buffered.o \
|
||||
fs-io-direct.o \
|
||||
fs-io-pagecache.o \
|
||||
fsck.o \
|
||||
inode.o \
|
||||
io_read.o \
|
||||
io_misc.o \
|
||||
io_write.o \
|
||||
journal.o \
|
||||
journal_io.o \
|
||||
journal_reclaim.o \
|
||||
journal_sb.o \
|
||||
journal_seq_blacklist.o \
|
||||
keylist.o \
|
||||
logged_ops.o \
|
||||
lru.o \
|
||||
mean_and_variance.o \
|
||||
migrate.o \
|
||||
move.o \
|
||||
movinggc.o \
|
||||
namei.o \
|
||||
nocow_locking.o \
|
||||
opts.o \
|
||||
printbuf.o \
|
||||
progress.o \
|
||||
quota.o \
|
||||
rebalance.o \
|
||||
rcu_pending.o \
|
||||
recovery.o \
|
||||
recovery_passes.o \
|
||||
reflink.o \
|
||||
replicas.o \
|
||||
sb-clean.o \
|
||||
sb-counters.o \
|
||||
sb-downgrade.o \
|
||||
sb-errors.o \
|
||||
sb-members.o \
|
||||
siphash.o \
|
||||
six.o \
|
||||
snapshot.o \
|
||||
str_hash.o \
|
||||
subvolume.o \
|
||||
super.o \
|
||||
super-io.o \
|
||||
sysfs.o \
|
||||
tests.o \
|
||||
time_stats.o \
|
||||
thread_with_file.o \
|
||||
trace.o \
|
||||
two_state_shared_lock.o \
|
||||
util.o \
|
||||
varint.o \
|
||||
xattr.o
|
||||
|
||||
bcachefs-$(CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS) += async_objs.o
|
||||
|
||||
obj-$(CONFIG_MEAN_AND_VARIANCE_UNIT_TEST) += mean_and_variance_test.o
|
||||
|
||||
# Silence "note: xyz changed in GCC X.X" messages
|
||||
subdir-ccflags-y += $(call cc-disable-warning, psabi)
|
||||
@@ -1,445 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
#include "bcachefs.h"
|
||||
|
||||
#include "acl.h"
|
||||
#include "xattr.h"
|
||||
|
||||
#include <linux/posix_acl.h>
|
||||
|
||||
static const char * const acl_types[] = {
|
||||
[ACL_USER_OBJ] = "user_obj",
|
||||
[ACL_USER] = "user",
|
||||
[ACL_GROUP_OBJ] = "group_obj",
|
||||
[ACL_GROUP] = "group",
|
||||
[ACL_MASK] = "mask",
|
||||
[ACL_OTHER] = "other",
|
||||
NULL,
|
||||
};
|
||||
|
||||
void bch2_acl_to_text(struct printbuf *out, const void *value, size_t size)
|
||||
{
|
||||
const void *p, *end = value + size;
|
||||
|
||||
if (!value ||
|
||||
size < sizeof(bch_acl_header) ||
|
||||
((bch_acl_header *)value)->a_version != cpu_to_le32(BCH_ACL_VERSION))
|
||||
return;
|
||||
|
||||
p = value + sizeof(bch_acl_header);
|
||||
while (p < end) {
|
||||
const bch_acl_entry *in = p;
|
||||
unsigned tag = le16_to_cpu(in->e_tag);
|
||||
|
||||
prt_str(out, acl_types[tag]);
|
||||
|
||||
switch (tag) {
|
||||
case ACL_USER_OBJ:
|
||||
case ACL_GROUP_OBJ:
|
||||
case ACL_MASK:
|
||||
case ACL_OTHER:
|
||||
p += sizeof(bch_acl_entry_short);
|
||||
break;
|
||||
case ACL_USER:
|
||||
prt_printf(out, " uid %u", le32_to_cpu(in->e_id));
|
||||
p += sizeof(bch_acl_entry);
|
||||
break;
|
||||
case ACL_GROUP:
|
||||
prt_printf(out, " gid %u", le32_to_cpu(in->e_id));
|
||||
p += sizeof(bch_acl_entry);
|
||||
break;
|
||||
}
|
||||
|
||||
prt_printf(out, " %o", le16_to_cpu(in->e_perm));
|
||||
|
||||
if (p != end)
|
||||
prt_char(out, ' ');
|
||||
}
|
||||
}
|
||||
|
||||
#ifdef CONFIG_BCACHEFS_POSIX_ACL
|
||||
|
||||
#include "fs.h"
|
||||
|
||||
#include <linux/fs.h>
|
||||
#include <linux/posix_acl_xattr.h>
|
||||
#include <linux/sched.h>
|
||||
#include <linux/slab.h>
|
||||
|
||||
static inline size_t bch2_acl_size(unsigned nr_short, unsigned nr_long)
|
||||
{
|
||||
return sizeof(bch_acl_header) +
|
||||
sizeof(bch_acl_entry_short) * nr_short +
|
||||
sizeof(bch_acl_entry) * nr_long;
|
||||
}
|
||||
|
||||
static inline int acl_to_xattr_type(int type)
|
||||
{
|
||||
switch (type) {
|
||||
case ACL_TYPE_ACCESS:
|
||||
return KEY_TYPE_XATTR_INDEX_POSIX_ACL_ACCESS;
|
||||
case ACL_TYPE_DEFAULT:
|
||||
return KEY_TYPE_XATTR_INDEX_POSIX_ACL_DEFAULT;
|
||||
default:
|
||||
BUG();
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* Convert from filesystem to in-memory representation.
|
||||
*/
|
||||
static struct posix_acl *bch2_acl_from_disk(struct btree_trans *trans,
|
||||
const void *value, size_t size)
|
||||
{
|
||||
const void *p, *end = value + size;
|
||||
struct posix_acl *acl;
|
||||
struct posix_acl_entry *out;
|
||||
unsigned count = 0;
|
||||
int ret;
|
||||
|
||||
if (!value)
|
||||
return NULL;
|
||||
if (size < sizeof(bch_acl_header))
|
||||
goto invalid;
|
||||
if (((bch_acl_header *)value)->a_version !=
|
||||
cpu_to_le32(BCH_ACL_VERSION))
|
||||
goto invalid;
|
||||
|
||||
p = value + sizeof(bch_acl_header);
|
||||
while (p < end) {
|
||||
const bch_acl_entry *entry = p;
|
||||
|
||||
if (p + sizeof(bch_acl_entry_short) > end)
|
||||
goto invalid;
|
||||
|
||||
switch (le16_to_cpu(entry->e_tag)) {
|
||||
case ACL_USER_OBJ:
|
||||
case ACL_GROUP_OBJ:
|
||||
case ACL_MASK:
|
||||
case ACL_OTHER:
|
||||
p += sizeof(bch_acl_entry_short);
|
||||
break;
|
||||
case ACL_USER:
|
||||
case ACL_GROUP:
|
||||
p += sizeof(bch_acl_entry);
|
||||
break;
|
||||
default:
|
||||
goto invalid;
|
||||
}
|
||||
|
||||
count++;
|
||||
}
|
||||
|
||||
if (p > end)
|
||||
goto invalid;
|
||||
|
||||
if (!count)
|
||||
return NULL;
|
||||
|
||||
acl = allocate_dropping_locks(trans, ret,
|
||||
posix_acl_alloc(count, _gfp));
|
||||
if (!acl)
|
||||
return ERR_PTR(-ENOMEM);
|
||||
if (ret) {
|
||||
kfree(acl);
|
||||
return ERR_PTR(ret);
|
||||
}
|
||||
|
||||
out = acl->a_entries;
|
||||
|
||||
p = value + sizeof(bch_acl_header);
|
||||
while (p < end) {
|
||||
const bch_acl_entry *in = p;
|
||||
|
||||
out->e_tag = le16_to_cpu(in->e_tag);
|
||||
out->e_perm = le16_to_cpu(in->e_perm);
|
||||
|
||||
switch (out->e_tag) {
|
||||
case ACL_USER_OBJ:
|
||||
case ACL_GROUP_OBJ:
|
||||
case ACL_MASK:
|
||||
case ACL_OTHER:
|
||||
p += sizeof(bch_acl_entry_short);
|
||||
break;
|
||||
case ACL_USER:
|
||||
out->e_uid = make_kuid(&init_user_ns,
|
||||
le32_to_cpu(in->e_id));
|
||||
p += sizeof(bch_acl_entry);
|
||||
break;
|
||||
case ACL_GROUP:
|
||||
out->e_gid = make_kgid(&init_user_ns,
|
||||
le32_to_cpu(in->e_id));
|
||||
p += sizeof(bch_acl_entry);
|
||||
break;
|
||||
}
|
||||
|
||||
out++;
|
||||
}
|
||||
|
||||
BUG_ON(out != acl->a_entries + acl->a_count);
|
||||
|
||||
return acl;
|
||||
invalid:
|
||||
pr_err("invalid acl entry");
|
||||
return ERR_PTR(-EINVAL);
|
||||
}
|
||||
|
||||
/*
|
||||
* Convert from in-memory to filesystem representation.
|
||||
*/
|
||||
static struct bkey_i_xattr *
|
||||
bch2_acl_to_xattr(struct btree_trans *trans,
|
||||
const struct posix_acl *acl,
|
||||
int type)
|
||||
{
|
||||
struct bkey_i_xattr *xattr;
|
||||
bch_acl_header *acl_header;
|
||||
const struct posix_acl_entry *acl_e, *pe;
|
||||
void *outptr;
|
||||
unsigned nr_short = 0, nr_long = 0, acl_len, u64s;
|
||||
|
||||
FOREACH_ACL_ENTRY(acl_e, acl, pe) {
|
||||
switch (acl_e->e_tag) {
|
||||
case ACL_USER:
|
||||
case ACL_GROUP:
|
||||
nr_long++;
|
||||
break;
|
||||
case ACL_USER_OBJ:
|
||||
case ACL_GROUP_OBJ:
|
||||
case ACL_MASK:
|
||||
case ACL_OTHER:
|
||||
nr_short++;
|
||||
break;
|
||||
default:
|
||||
return ERR_PTR(-EINVAL);
|
||||
}
|
||||
}
|
||||
|
||||
acl_len = bch2_acl_size(nr_short, nr_long);
|
||||
u64s = BKEY_U64s + xattr_val_u64s(0, acl_len);
|
||||
|
||||
if (u64s > U8_MAX)
|
||||
return ERR_PTR(-E2BIG);
|
||||
|
||||
xattr = bch2_trans_kmalloc(trans, u64s * sizeof(u64));
|
||||
if (IS_ERR(xattr))
|
||||
return xattr;
|
||||
|
||||
bkey_xattr_init(&xattr->k_i);
|
||||
xattr->k.u64s = u64s;
|
||||
xattr->v.x_type = acl_to_xattr_type(type);
|
||||
xattr->v.x_name_len = 0;
|
||||
xattr->v.x_val_len = cpu_to_le16(acl_len);
|
||||
|
||||
acl_header = xattr_val(&xattr->v);
|
||||
acl_header->a_version = cpu_to_le32(BCH_ACL_VERSION);
|
||||
|
||||
outptr = (void *) acl_header + sizeof(*acl_header);
|
||||
|
||||
FOREACH_ACL_ENTRY(acl_e, acl, pe) {
|
||||
bch_acl_entry *entry = outptr;
|
||||
|
||||
entry->e_tag = cpu_to_le16(acl_e->e_tag);
|
||||
entry->e_perm = cpu_to_le16(acl_e->e_perm);
|
||||
switch (acl_e->e_tag) {
|
||||
case ACL_USER:
|
||||
entry->e_id = cpu_to_le32(
|
||||
from_kuid(&init_user_ns, acl_e->e_uid));
|
||||
outptr += sizeof(bch_acl_entry);
|
||||
break;
|
||||
case ACL_GROUP:
|
||||
entry->e_id = cpu_to_le32(
|
||||
from_kgid(&init_user_ns, acl_e->e_gid));
|
||||
outptr += sizeof(bch_acl_entry);
|
||||
break;
|
||||
|
||||
case ACL_USER_OBJ:
|
||||
case ACL_GROUP_OBJ:
|
||||
case ACL_MASK:
|
||||
case ACL_OTHER:
|
||||
outptr += sizeof(bch_acl_entry_short);
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
BUG_ON(outptr != xattr_val(&xattr->v) + acl_len);
|
||||
|
||||
return xattr;
|
||||
}
|
||||
|
||||
struct posix_acl *bch2_get_acl(struct inode *vinode, int type, bool rcu)
|
||||
{
|
||||
struct bch_inode_info *inode = to_bch_ei(vinode);
|
||||
struct bch_fs *c = inode->v.i_sb->s_fs_info;
|
||||
struct bch_hash_info hash = bch2_hash_info_init(c, &inode->ei_inode);
|
||||
struct xattr_search_key search = X_SEARCH(acl_to_xattr_type(type), "", 0);
|
||||
struct btree_iter iter = {};
|
||||
struct posix_acl *acl = NULL;
|
||||
|
||||
if (rcu)
|
||||
return ERR_PTR(-ECHILD);
|
||||
|
||||
struct btree_trans *trans = bch2_trans_get(c);
|
||||
retry:
|
||||
bch2_trans_begin(trans);
|
||||
|
||||
struct bkey_s_c k = bch2_hash_lookup(trans, &iter, bch2_xattr_hash_desc,
|
||||
&hash, inode_inum(inode), &search, 0);
|
||||
int ret = bkey_err(k);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
struct bkey_s_c_xattr xattr = bkey_s_c_to_xattr(k);
|
||||
acl = bch2_acl_from_disk(trans, xattr_val(xattr.v),
|
||||
le16_to_cpu(xattr.v->x_val_len));
|
||||
ret = PTR_ERR_OR_ZERO(acl);
|
||||
err:
|
||||
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
|
||||
goto retry;
|
||||
|
||||
if (ret)
|
||||
acl = !bch2_err_matches(ret, ENOENT) ? ERR_PTR(ret) : NULL;
|
||||
|
||||
if (!IS_ERR_OR_NULL(acl))
|
||||
set_cached_acl(&inode->v, type, acl);
|
||||
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
bch2_trans_put(trans);
|
||||
return acl;
|
||||
}
|
||||
|
||||
int bch2_set_acl_trans(struct btree_trans *trans, subvol_inum inum,
|
||||
struct bch_inode_unpacked *inode_u,
|
||||
struct posix_acl *acl, int type)
|
||||
{
|
||||
struct bch_hash_info hash_info = bch2_hash_info_init(trans->c, inode_u);
|
||||
int ret;
|
||||
|
||||
if (type == ACL_TYPE_DEFAULT &&
|
||||
!S_ISDIR(inode_u->bi_mode))
|
||||
return acl ? -EACCES : 0;
|
||||
|
||||
if (acl) {
|
||||
struct bkey_i_xattr *xattr =
|
||||
bch2_acl_to_xattr(trans, acl, type);
|
||||
if (IS_ERR(xattr))
|
||||
return PTR_ERR(xattr);
|
||||
|
||||
ret = bch2_hash_set(trans, bch2_xattr_hash_desc, &hash_info,
|
||||
inum, &xattr->k_i, 0);
|
||||
} else {
|
||||
struct xattr_search_key search =
|
||||
X_SEARCH(acl_to_xattr_type(type), "", 0);
|
||||
|
||||
ret = bch2_hash_delete(trans, bch2_xattr_hash_desc, &hash_info,
|
||||
inum, &search);
|
||||
}
|
||||
|
||||
return bch2_err_matches(ret, ENOENT) ? 0 : ret;
|
||||
}
|
||||
|
||||
int bch2_set_acl(struct mnt_idmap *idmap,
|
||||
struct dentry *dentry,
|
||||
struct posix_acl *_acl, int type)
|
||||
{
|
||||
struct bch_inode_info *inode = to_bch_ei(dentry->d_inode);
|
||||
struct bch_fs *c = inode->v.i_sb->s_fs_info;
|
||||
struct btree_iter inode_iter = {};
|
||||
struct bch_inode_unpacked inode_u;
|
||||
struct posix_acl *acl;
|
||||
umode_t mode;
|
||||
int ret;
|
||||
|
||||
mutex_lock(&inode->ei_update_lock);
|
||||
struct btree_trans *trans = bch2_trans_get(c);
|
||||
retry:
|
||||
bch2_trans_begin(trans);
|
||||
acl = _acl;
|
||||
|
||||
ret = bch2_subvol_is_ro_trans(trans, inode->ei_inum.subvol) ?:
|
||||
bch2_inode_peek(trans, &inode_iter, &inode_u, inode_inum(inode),
|
||||
BTREE_ITER_intent);
|
||||
if (ret)
|
||||
goto btree_err;
|
||||
|
||||
mode = inode_u.bi_mode;
|
||||
|
||||
if (type == ACL_TYPE_ACCESS) {
|
||||
ret = posix_acl_update_mode(idmap, &inode->v, &mode, &acl);
|
||||
if (ret)
|
||||
goto btree_err;
|
||||
}
|
||||
|
||||
ret = bch2_set_acl_trans(trans, inode_inum(inode), &inode_u, acl, type);
|
||||
if (ret)
|
||||
goto btree_err;
|
||||
|
||||
inode_u.bi_ctime = bch2_current_time(c);
|
||||
inode_u.bi_mode = mode;
|
||||
|
||||
ret = bch2_inode_write(trans, &inode_iter, &inode_u) ?:
|
||||
bch2_trans_commit(trans, NULL, NULL, 0);
|
||||
btree_err:
|
||||
bch2_trans_iter_exit(trans, &inode_iter);
|
||||
|
||||
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
|
||||
goto retry;
|
||||
if (unlikely(ret))
|
||||
goto err;
|
||||
|
||||
bch2_inode_update_after_write(trans, inode, &inode_u,
|
||||
ATTR_CTIME|ATTR_MODE);
|
||||
|
||||
set_cached_acl(&inode->v, type, acl);
|
||||
err:
|
||||
bch2_trans_put(trans);
|
||||
mutex_unlock(&inode->ei_update_lock);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_acl_chmod(struct btree_trans *trans, subvol_inum inum,
|
||||
struct bch_inode_unpacked *inode,
|
||||
umode_t mode,
|
||||
struct posix_acl **new_acl)
|
||||
{
|
||||
struct bch_hash_info hash_info = bch2_hash_info_init(trans->c, inode);
|
||||
struct xattr_search_key search = X_SEARCH(KEY_TYPE_XATTR_INDEX_POSIX_ACL_ACCESS, "", 0);
|
||||
struct btree_iter iter;
|
||||
struct posix_acl *acl = NULL;
|
||||
|
||||
struct bkey_s_c k = bch2_hash_lookup(trans, &iter, bch2_xattr_hash_desc,
|
||||
&hash_info, inum, &search, BTREE_ITER_intent);
|
||||
int ret = bkey_err(k);
|
||||
if (ret)
|
||||
return bch2_err_matches(ret, ENOENT) ? 0 : ret;
|
||||
|
||||
struct bkey_s_c_xattr xattr = bkey_s_c_to_xattr(k);
|
||||
|
||||
acl = bch2_acl_from_disk(trans, xattr_val(xattr.v),
|
||||
le16_to_cpu(xattr.v->x_val_len));
|
||||
ret = PTR_ERR_OR_ZERO(acl);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
ret = allocate_dropping_locks_errcode(trans, __posix_acl_chmod(&acl, _gfp, mode));
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
struct bkey_i_xattr *new = bch2_acl_to_xattr(trans, acl, ACL_TYPE_ACCESS);
|
||||
ret = PTR_ERR_OR_ZERO(new);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
new->k.p = iter.pos;
|
||||
ret = bch2_trans_update(trans, &iter, &new->k_i, 0);
|
||||
*new_acl = acl;
|
||||
acl = NULL;
|
||||
err:
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
if (!IS_ERR_OR_NULL(acl))
|
||||
kfree(acl);
|
||||
return ret;
|
||||
}
|
||||
|
||||
#endif /* CONFIG_BCACHEFS_POSIX_ACL */
|
||||
@@ -1,60 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_ACL_H
|
||||
#define _BCACHEFS_ACL_H
|
||||
|
||||
struct bch_inode_unpacked;
|
||||
struct bch_hash_info;
|
||||
struct bch_inode_info;
|
||||
struct posix_acl;
|
||||
|
||||
#define BCH_ACL_VERSION 0x0001
|
||||
|
||||
typedef struct {
|
||||
__le16 e_tag;
|
||||
__le16 e_perm;
|
||||
__le32 e_id;
|
||||
} bch_acl_entry;
|
||||
|
||||
typedef struct {
|
||||
__le16 e_tag;
|
||||
__le16 e_perm;
|
||||
} bch_acl_entry_short;
|
||||
|
||||
typedef struct {
|
||||
__le32 a_version;
|
||||
} bch_acl_header;
|
||||
|
||||
void bch2_acl_to_text(struct printbuf *, const void *, size_t);
|
||||
|
||||
#ifdef CONFIG_BCACHEFS_POSIX_ACL
|
||||
|
||||
struct posix_acl *bch2_get_acl(struct inode *, int, bool);
|
||||
|
||||
int bch2_set_acl_trans(struct btree_trans *, subvol_inum,
|
||||
struct bch_inode_unpacked *,
|
||||
struct posix_acl *, int);
|
||||
int bch2_set_acl(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
|
||||
int bch2_acl_chmod(struct btree_trans *, subvol_inum,
|
||||
struct bch_inode_unpacked *,
|
||||
umode_t, struct posix_acl **);
|
||||
|
||||
#else
|
||||
|
||||
static inline int bch2_set_acl_trans(struct btree_trans *trans, subvol_inum inum,
|
||||
struct bch_inode_unpacked *inode_u,
|
||||
struct posix_acl *acl, int type)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
static inline int bch2_acl_chmod(struct btree_trans *trans, subvol_inum inum,
|
||||
struct bch_inode_unpacked *inode,
|
||||
umode_t mode,
|
||||
struct posix_acl **new_acl)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
#endif /* CONFIG_BCACHEFS_POSIX_ACL */
|
||||
|
||||
#endif /* _BCACHEFS_ACL_H */
|
||||
File diff suppressed because it is too large
@@ -1,361 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_ALLOC_BACKGROUND_H
|
||||
#define _BCACHEFS_ALLOC_BACKGROUND_H
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "alloc_types.h"
|
||||
#include "buckets.h"
|
||||
#include "debug.h"
|
||||
#include "super.h"
|
||||
|
||||
/* How out of date a pointer gen is allowed to be: */
|
||||
#define BUCKET_GC_GEN_MAX 96U
|
||||
|
||||
static inline bool bch2_dev_bucket_exists(struct bch_fs *c, struct bpos pos)
|
||||
{
|
||||
guard(rcu)();
|
||||
struct bch_dev *ca = bch2_dev_rcu_noerror(c, pos.inode);
|
||||
return ca && bucket_valid(ca, pos.offset);
|
||||
}
|
||||
|
||||
static inline u64 bucket_to_u64(struct bpos bucket)
|
||||
{
|
||||
return (bucket.inode << 48) | bucket.offset;
|
||||
}
|
||||
|
||||
static inline struct bpos u64_to_bucket(u64 bucket)
|
||||
{
|
||||
return POS(bucket >> 48, bucket & ~(~0ULL << 48));
|
||||
}
|
||||
|
||||
static inline u8 alloc_gc_gen(struct bch_alloc_v4 a)
|
||||
{
|
||||
return a.gen - a.oldest_gen;
|
||||
}
|
||||
|
||||
static inline void alloc_to_bucket(struct bucket *dst, struct bch_alloc_v4 src)
|
||||
{
|
||||
dst->gen = src.gen;
|
||||
dst->data_type = src.data_type;
|
||||
dst->stripe_sectors = src.stripe_sectors;
|
||||
dst->dirty_sectors = src.dirty_sectors;
|
||||
dst->cached_sectors = src.cached_sectors;
|
||||
dst->stripe = src.stripe;
|
||||
}
|
||||
|
||||
static inline void __bucket_m_to_alloc(struct bch_alloc_v4 *dst, struct bucket src)
|
||||
{
|
||||
dst->gen = src.gen;
|
||||
dst->data_type = src.data_type;
|
||||
dst->stripe_sectors = src.stripe_sectors;
|
||||
dst->dirty_sectors = src.dirty_sectors;
|
||||
dst->cached_sectors = src.cached_sectors;
|
||||
dst->stripe = src.stripe;
|
||||
}
|
||||
|
||||
static inline struct bch_alloc_v4 bucket_m_to_alloc(struct bucket b)
|
||||
{
|
||||
struct bch_alloc_v4 ret = {};
|
||||
__bucket_m_to_alloc(&ret, b);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static inline enum bch_data_type bucket_data_type(enum bch_data_type data_type)
|
||||
{
|
||||
switch (data_type) {
|
||||
case BCH_DATA_cached:
|
||||
case BCH_DATA_stripe:
|
||||
return BCH_DATA_user;
|
||||
default:
|
||||
return data_type;
|
||||
}
|
||||
}
|
||||
|
||||
static inline bool bucket_data_type_mismatch(enum bch_data_type bucket,
|
||||
enum bch_data_type ptr)
|
||||
{
|
||||
return !data_type_is_empty(bucket) &&
|
||||
bucket_data_type(bucket) != bucket_data_type(ptr);
|
||||
}
|
||||
|
||||
/*
|
||||
* It is my general preference to use unsigned types for unsigned quantities -
|
||||
* however, these helpers are used in disk accounting calculations run by
|
||||
* triggers where the output will be negated and added to an s64. unsigned is
|
||||
* right out even though all these quantities will fit in 32 bits, since it
|
||||
* won't be sign extended correctly; u64 will negate "correctly", but s64 is the
|
||||
* simpler option here.
|
||||
*/
|
||||
static inline s64 bch2_bucket_sectors_total(struct bch_alloc_v4 a)
|
||||
{
|
||||
return a.stripe_sectors + a.dirty_sectors + a.cached_sectors;
|
||||
}
|
||||
|
||||
static inline s64 bch2_bucket_sectors_dirty(struct bch_alloc_v4 a)
|
||||
{
|
||||
return a.stripe_sectors + a.dirty_sectors;
|
||||
}
|
||||
|
||||
static inline s64 bch2_bucket_sectors(struct bch_alloc_v4 a)
|
||||
{
|
||||
return a.data_type == BCH_DATA_cached
|
||||
? a.cached_sectors
|
||||
: bch2_bucket_sectors_dirty(a);
|
||||
}
|
||||
|
||||
static inline s64 bch2_bucket_sectors_fragmented(struct bch_dev *ca,
|
||||
struct bch_alloc_v4 a)
|
||||
{
|
||||
int d = bch2_bucket_sectors(a);
|
||||
|
||||
return d ? max(0, ca->mi.bucket_size - d) : 0;
|
||||
}
|
||||
|
||||
static inline s64 bch2_gc_bucket_sectors_fragmented(struct bch_dev *ca, struct bucket a)
|
||||
{
|
||||
int d = a.stripe_sectors + a.dirty_sectors;
|
||||
|
||||
return d ? max(0, ca->mi.bucket_size - d) : 0;
|
||||
}
|
||||
|
||||
static inline s64 bch2_bucket_sectors_unstriped(struct bch_alloc_v4 a)
|
||||
{
|
||||
return a.data_type == BCH_DATA_stripe ? a.dirty_sectors : 0;
|
||||
}
|
||||
|
||||
static inline enum bch_data_type alloc_data_type(struct bch_alloc_v4 a,
|
||||
enum bch_data_type data_type)
|
||||
{
|
||||
if (a.stripe)
|
||||
return data_type == BCH_DATA_parity ? data_type : BCH_DATA_stripe;
|
||||
if (bch2_bucket_sectors_dirty(a))
|
||||
return bucket_data_type(data_type);
|
||||
if (a.cached_sectors)
|
||||
return BCH_DATA_cached;
|
||||
if (BCH_ALLOC_V4_NEED_DISCARD(&a))
|
||||
return BCH_DATA_need_discard;
|
||||
if (alloc_gc_gen(a) >= BUCKET_GC_GEN_MAX)
|
||||
return BCH_DATA_need_gc_gens;
|
||||
return BCH_DATA_free;
|
||||
}
|
||||
|
||||
static inline void alloc_data_type_set(struct bch_alloc_v4 *a, enum bch_data_type data_type)
|
||||
{
|
||||
a->data_type = alloc_data_type(*a, data_type);
|
||||
}
|
||||
|
||||
static inline u64 alloc_lru_idx_read(struct bch_alloc_v4 a)
|
||||
{
|
||||
return a.data_type == BCH_DATA_cached
|
||||
? a.io_time[READ] & LRU_TIME_MAX
|
||||
: 0;
|
||||
}
|
||||
|
||||
#define DATA_TYPES_MOVABLE \
|
||||
((1U << BCH_DATA_btree)| \
|
||||
(1U << BCH_DATA_user)| \
|
||||
(1U << BCH_DATA_stripe))
|
||||
|
||||
static inline bool data_type_movable(enum bch_data_type type)
|
||||
{
|
||||
return (1U << type) & DATA_TYPES_MOVABLE;
|
||||
}
|
||||
|
||||
static inline u64 alloc_lru_idx_fragmentation(struct bch_alloc_v4 a,
|
||||
struct bch_dev *ca)
|
||||
{
|
||||
if (a.data_type >= BCH_DATA_NR)
|
||||
return 0;
|
||||
|
||||
if (!data_type_movable(a.data_type) ||
|
||||
!bch2_bucket_sectors_fragmented(ca, a))
|
||||
return 0;
|
||||
|
||||
/*
|
||||
* avoid overflowing LRU_TIME_BITS on a corrupted fs, when
|
||||
* bucket_sectors_dirty is (much) bigger than bucket_size
|
||||
*/
|
||||
u64 d = min_t(s64, bch2_bucket_sectors_dirty(a),
|
||||
ca->mi.bucket_size);
|
||||
|
||||
return div_u64(d * (1ULL << 31), ca->mi.bucket_size);
|
||||
}
|
||||
|
||||
static inline u64 alloc_freespace_genbits(struct bch_alloc_v4 a)
|
||||
{
|
||||
return ((u64) alloc_gc_gen(a) >> 4) << 56;
|
||||
}
|
||||
|
||||
static inline struct bpos alloc_freespace_pos(struct bpos pos, struct bch_alloc_v4 a)
|
||||
{
|
||||
pos.offset |= alloc_freespace_genbits(a);
|
||||
return pos;
|
||||
}
|
||||
|
||||
static inline unsigned alloc_v4_u64s_noerror(const struct bch_alloc_v4 *a)
|
||||
{
|
||||
return (BCH_ALLOC_V4_BACKPOINTERS_START(a) ?:
|
||||
BCH_ALLOC_V4_U64s_V0) +
|
||||
BCH_ALLOC_V4_NR_BACKPOINTERS(a) *
|
||||
(sizeof(struct bch_backpointer) / sizeof(u64));
|
||||
}
|
||||
|
||||
static inline unsigned alloc_v4_u64s(const struct bch_alloc_v4 *a)
|
||||
{
|
||||
unsigned ret = alloc_v4_u64s_noerror(a);
|
||||
BUG_ON(ret > U8_MAX - BKEY_U64s);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static inline void set_alloc_v4_u64s(struct bkey_i_alloc_v4 *a)
|
||||
{
|
||||
set_bkey_val_u64s(&a->k, alloc_v4_u64s(&a->v));
|
||||
}
|
||||
|
||||
struct bkey_i_alloc_v4 *
|
||||
bch2_trans_start_alloc_update_noupdate(struct btree_trans *, struct btree_iter *, struct bpos);
|
||||
struct bkey_i_alloc_v4 *
|
||||
bch2_trans_start_alloc_update(struct btree_trans *, struct bpos,
|
||||
enum btree_iter_update_trigger_flags);
|
||||
|
||||
void __bch2_alloc_to_v4(struct bkey_s_c, struct bch_alloc_v4 *);
|
||||
|
||||
static inline const struct bch_alloc_v4 *bch2_alloc_to_v4(struct bkey_s_c k, struct bch_alloc_v4 *convert)
|
||||
{
|
||||
const struct bch_alloc_v4 *ret;
|
||||
|
||||
if (unlikely(k.k->type != KEY_TYPE_alloc_v4))
|
||||
goto slowpath;
|
||||
|
||||
ret = bkey_s_c_to_alloc_v4(k).v;
|
||||
if (BCH_ALLOC_V4_BACKPOINTERS_START(ret) != BCH_ALLOC_V4_U64s)
|
||||
goto slowpath;
|
||||
|
||||
return ret;
|
||||
slowpath:
|
||||
__bch2_alloc_to_v4(k, convert);
|
||||
return convert;
|
||||
}
|
||||
|
||||
struct bkey_i_alloc_v4 *bch2_alloc_to_v4_mut(struct btree_trans *, struct bkey_s_c);
|
||||
|
||||
int bch2_bucket_io_time_reset(struct btree_trans *, unsigned, size_t, int);
|
||||
|
||||
int bch2_alloc_v1_validate(struct bch_fs *, struct bkey_s_c,
|
||||
struct bkey_validate_context);
|
||||
int bch2_alloc_v2_validate(struct bch_fs *, struct bkey_s_c,
|
||||
struct bkey_validate_context);
|
||||
int bch2_alloc_v3_validate(struct bch_fs *, struct bkey_s_c,
|
||||
struct bkey_validate_context);
|
||||
int bch2_alloc_v4_validate(struct bch_fs *, struct bkey_s_c,
|
||||
struct bkey_validate_context);
|
||||
void bch2_alloc_v4_swab(struct bkey_s);
|
||||
void bch2_alloc_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
|
||||
void bch2_alloc_v4_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
|
||||
|
||||
#define bch2_bkey_ops_alloc ((struct bkey_ops) { \
|
||||
.key_validate = bch2_alloc_v1_validate, \
|
||||
.val_to_text = bch2_alloc_to_text, \
|
||||
.trigger = bch2_trigger_alloc, \
|
||||
.min_val_size = 8, \
|
||||
})
|
||||
|
||||
#define bch2_bkey_ops_alloc_v2 ((struct bkey_ops) { \
|
||||
.key_validate = bch2_alloc_v2_validate, \
|
||||
.val_to_text = bch2_alloc_to_text, \
|
||||
.trigger = bch2_trigger_alloc, \
|
||||
.min_val_size = 8, \
|
||||
})
|
||||
|
||||
#define bch2_bkey_ops_alloc_v3 ((struct bkey_ops) { \
|
||||
.key_validate = bch2_alloc_v3_validate, \
|
||||
.val_to_text = bch2_alloc_to_text, \
|
||||
.trigger = bch2_trigger_alloc, \
|
||||
.min_val_size = 16, \
|
||||
})
|
||||
|
||||
#define bch2_bkey_ops_alloc_v4 ((struct bkey_ops) { \
|
||||
.key_validate = bch2_alloc_v4_validate, \
|
||||
.val_to_text = bch2_alloc_v4_to_text, \
|
||||
.swab = bch2_alloc_v4_swab, \
|
||||
.trigger = bch2_trigger_alloc, \
|
||||
.min_val_size = 48, \
|
||||
})
|
||||
|
||||
int bch2_bucket_gens_validate(struct bch_fs *, struct bkey_s_c,
|
||||
struct bkey_validate_context);
|
||||
void bch2_bucket_gens_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
|
||||
|
||||
#define bch2_bkey_ops_bucket_gens ((struct bkey_ops) { \
|
||||
.key_validate = bch2_bucket_gens_validate, \
|
||||
.val_to_text = bch2_bucket_gens_to_text, \
|
||||
})
|
||||
|
||||
int bch2_bucket_gens_init(struct bch_fs *);
|
||||
|
||||
static inline bool bkey_is_alloc(const struct bkey *k)
|
||||
{
|
||||
return k->type == KEY_TYPE_alloc ||
|
||||
k->type == KEY_TYPE_alloc_v2 ||
|
||||
k->type == KEY_TYPE_alloc_v3;
|
||||
}
|
||||
|
||||
int bch2_alloc_read(struct bch_fs *);
|
||||
|
||||
int bch2_alloc_key_to_dev_counters(struct btree_trans *, struct bch_dev *,
|
||||
const struct bch_alloc_v4 *,
|
||||
const struct bch_alloc_v4 *, unsigned);
|
||||
int bch2_trigger_alloc(struct btree_trans *, enum btree_id, unsigned,
|
||||
struct bkey_s_c, struct bkey_s,
|
||||
enum btree_iter_update_trigger_flags);
|
||||
|
||||
int bch2_check_discard_freespace_key(struct btree_trans *, struct btree_iter *, u8 *, bool);
|
||||
int bch2_check_alloc_info(struct bch_fs *);
|
||||
int bch2_check_alloc_to_lru_refs(struct bch_fs *);
|
||||
void bch2_dev_do_discards(struct bch_dev *);
|
||||
void bch2_do_discards(struct bch_fs *);
|
||||
|
||||
static inline u64 should_invalidate_buckets(struct bch_dev *ca,
|
||||
struct bch_dev_usage u)
|
||||
{
|
||||
u64 want_free = ca->mi.nbuckets >> 7;
|
||||
u64 free = max_t(s64, 0,
|
||||
u.buckets[BCH_DATA_free]
|
||||
+ u.buckets[BCH_DATA_need_discard]
|
||||
- bch2_dev_buckets_reserved(ca, BCH_WATERMARK_stripe));
|
||||
|
||||
return clamp_t(s64, want_free - free, 0, u.buckets[BCH_DATA_cached]);
|
||||
}
|
||||
|
||||
void bch2_dev_do_invalidates(struct bch_dev *);
|
||||
void bch2_do_invalidates(struct bch_fs *);
|
||||
|
||||
static inline struct bch_backpointer *alloc_v4_backpointers(struct bch_alloc_v4 *a)
|
||||
{
|
||||
return (void *) ((u64 *) &a->v +
|
||||
(BCH_ALLOC_V4_BACKPOINTERS_START(a) ?:
|
||||
BCH_ALLOC_V4_U64s_V0));
|
||||
}
|
||||
|
||||
static inline const struct bch_backpointer *alloc_v4_backpointers_c(const struct bch_alloc_v4 *a)
|
||||
{
|
||||
return (void *) ((u64 *) &a->v + BCH_ALLOC_V4_BACKPOINTERS_START(a));
|
||||
}
|
||||
|
||||
int bch2_dev_freespace_init(struct bch_fs *, struct bch_dev *, u64, u64);
|
||||
int bch2_fs_freespace_init(struct bch_fs *);
|
||||
int bch2_dev_remove_alloc(struct bch_fs *, struct bch_dev *);
|
||||
|
||||
void bch2_recalc_capacity(struct bch_fs *);
|
||||
u64 bch2_min_rw_member_capacity(struct bch_fs *);
|
||||
|
||||
void bch2_dev_allocator_set_rw(struct bch_fs *, struct bch_dev *, bool);
|
||||
void bch2_dev_allocator_remove(struct bch_fs *, struct bch_dev *);
|
||||
void bch2_dev_allocator_add(struct bch_fs *, struct bch_dev *);
|
||||
|
||||
void bch2_dev_allocator_background_exit(struct bch_dev *);
|
||||
void bch2_dev_allocator_background_init(struct bch_dev *);
|
||||
|
||||
void bch2_fs_allocator_background_init(struct bch_fs *);
|
||||
|
||||
#endif /* _BCACHEFS_ALLOC_BACKGROUND_H */
|
||||
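alloc_background.h above packs a bucket position into a single 64-bit value with the device index in the top 16 bits and the bucket number in the low 48 (bucket_to_u64()/u64_to_bucket()). Below is a standalone round-trip check of that packing; struct pos is a minimal stand-in for struct bpos.

/*
 * Standalone round-trip of the 16:48 device:bucket packing used by
 * bucket_to_u64()/u64_to_bucket() above. Only the two fields involved
 * are modelled.
 */
#include <assert.h>
#include <stdint.h>

struct pos { uint64_t inode, offset; };

static uint64_t bucket_to_u64(struct pos b)
{
	return (b.inode << 48) | b.offset;
}

static struct pos u64_to_bucket(uint64_t v)
{
	return (struct pos) { v >> 48, v & ~(~0ULL << 48) };
}

int main(void)
{
	struct pos b = { .inode = 3, .offset = 123456 };	/* device 3, bucket 123456 */
	struct pos r = u64_to_bucket(bucket_to_u64(b));

	assert(r.inode == b.inode && r.offset == b.offset);
	return 0;
}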
@@ -1,95 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ALLOC_BACKGROUND_FORMAT_H
#define _BCACHEFS_ALLOC_BACKGROUND_FORMAT_H

struct bch_alloc {
	struct bch_val	v;
	__u8		fields;
	__u8		gen;
	__u8		data[];
} __packed __aligned(8);

#define BCH_ALLOC_FIELDS_V1()		\
	x(read_time,		16)	\
	x(write_time,		16)	\
	x(data_type,		8)	\
	x(dirty_sectors,	16)	\
	x(cached_sectors,	16)	\
	x(oldest_gen,		8)	\
	x(stripe,		32)	\
	x(stripe_redundancy,	8)

enum {
#define x(name, _bits) BCH_ALLOC_FIELD_V1_##name,
	BCH_ALLOC_FIELDS_V1()
#undef x
};

struct bch_alloc_v2 {
	struct bch_val	v;
	__u8		nr_fields;
	__u8		gen;
	__u8		oldest_gen;
	__u8		data_type;
	__u8		data[];
} __packed __aligned(8);

#define BCH_ALLOC_FIELDS_V2()		\
	x(read_time,		64)	\
	x(write_time,		64)	\
	x(dirty_sectors,	32)	\
	x(cached_sectors,	32)	\
	x(stripe,		32)	\
	x(stripe_redundancy,	8)

struct bch_alloc_v3 {
	struct bch_val	v;
	__le64		journal_seq;
	__le32		flags;
	__u8		nr_fields;
	__u8		gen;
	__u8		oldest_gen;
	__u8		data_type;
	__u8		data[];
} __packed __aligned(8);

LE32_BITMASK(BCH_ALLOC_V3_NEED_DISCARD,struct bch_alloc_v3, flags, 0, 1)
LE32_BITMASK(BCH_ALLOC_V3_NEED_INC_GEN,struct bch_alloc_v3, flags, 1, 2)

struct bch_alloc_v4 {
	struct bch_val	v;
	__u64		journal_seq_nonempty;
	__u32		flags;
	__u8		gen;
	__u8		oldest_gen;
	__u8		data_type;
	__u8		stripe_redundancy;
	__u32		dirty_sectors;
	__u32		cached_sectors;
	__u64		io_time[2];
	__u32		stripe;
	__u32		nr_external_backpointers;
	/* end of fields in original version of alloc_v4 */
	__u64		journal_seq_empty;
	__u32		stripe_sectors;
	__u32		pad;
} __packed __aligned(8);

#define BCH_ALLOC_V4_U64s_V0	6
#define BCH_ALLOC_V4_U64s	(sizeof(struct bch_alloc_v4) / sizeof(__u64))

BITMASK(BCH_ALLOC_V4_NEED_DISCARD,	struct bch_alloc_v4, flags,  0,  1)
BITMASK(BCH_ALLOC_V4_NEED_INC_GEN,	struct bch_alloc_v4, flags,  1,  2)
BITMASK(BCH_ALLOC_V4_BACKPOINTERS_START,struct bch_alloc_v4, flags,  2,  8)
BITMASK(BCH_ALLOC_V4_NR_BACKPOINTERS,	struct bch_alloc_v4, flags,  8, 14)

#define KEY_TYPE_BUCKET_GENS_BITS	8
#define KEY_TYPE_BUCKET_GENS_NR		(1U << KEY_TYPE_BUCKET_GENS_BITS)
#define KEY_TYPE_BUCKET_GENS_MASK	(KEY_TYPE_BUCKET_GENS_NR - 1)

struct bch_bucket_gens {
	struct bch_val	v;
	u8		gens[KEY_TYPE_BUCKET_GENS_NR];
} __packed __aligned(8);

#endif /* _BCACHEFS_ALLOC_BACKGROUND_FORMAT_H */
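BCH_ALLOC_FIELDS_V1()/V2() above are x-macro lists: one macro enumerates (field, bit-width) pairs, and each expansion site defines x() to generate an enum, a size table, or unpack code. Below is a standalone sketch of the same pattern; the table and enum names are invented for the example.

/*
 * Sketch of the x-macro pattern used by BCH_ALLOC_FIELDS_V1() above: the
 * same list expands once into an enum and once into a bit-width table.
 */
#include <stdio.h>

#define ALLOC_FIELDS_V1()		\
	x(read_time,		16)	\
	x(write_time,		16)	\
	x(data_type,		8)	\
	x(dirty_sectors,	16)	\
	x(cached_sectors,	16)	\
	x(oldest_gen,		8)	\
	x(stripe,		32)	\
	x(stripe_redundancy,	8)

enum {
#define x(name, bits)	FIELD_##name,
	ALLOC_FIELDS_V1()
#undef x
	FIELD_NR
};

static const unsigned field_bits[] = {
#define x(name, bits)	[FIELD_##name] = bits,
	ALLOC_FIELDS_V1()
#undef x
};

int main(void)
{
	for (unsigned i = 0; i < FIELD_NR; i++)
		printf("field %u: %u bits\n", i, field_bits[i]);
	return 0;
}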
File diff suppressed because it is too large
@@ -1,318 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_ALLOC_FOREGROUND_H
|
||||
#define _BCACHEFS_ALLOC_FOREGROUND_H
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "buckets.h"
|
||||
#include "alloc_types.h"
|
||||
#include "extents.h"
|
||||
#include "io_write_types.h"
|
||||
#include "sb-members.h"
|
||||
|
||||
#include <linux/hash.h>
|
||||
|
||||
struct bkey;
|
||||
struct bch_dev;
|
||||
struct bch_fs;
|
||||
struct bch_devs_List;
|
||||
|
||||
extern const char * const bch2_watermarks[];
|
||||
|
||||
void bch2_reset_alloc_cursors(struct bch_fs *);
|
||||
|
||||
struct dev_alloc_list {
|
||||
unsigned nr;
|
||||
u8 data[BCH_SB_MEMBERS_MAX];
|
||||
};
|
||||
|
||||
struct alloc_request {
|
||||
unsigned nr_replicas;
|
||||
unsigned target;
|
||||
bool ec;
|
||||
enum bch_watermark watermark;
|
||||
enum bch_write_flags flags;
|
||||
enum bch_data_type data_type;
|
||||
struct bch_devs_list *devs_have;
|
||||
struct write_point *wp;
|
||||
|
||||
/* These fields are used primarily by open_bucket_add_buckets */
|
||||
struct open_buckets ptrs;
|
||||
unsigned nr_effective; /* sum of @ptrs durability */
|
||||
bool have_cache; /* have we allocated from a 0 durability dev */
|
||||
struct bch_devs_mask devs_may_alloc;
|
||||
|
||||
/* bch2_bucket_alloc_set_trans(): */
|
||||
struct dev_alloc_list devs_sorted;
|
||||
struct bch_dev_usage usage;
|
||||
|
||||
/* bch2_bucket_alloc_trans(): */
|
||||
struct bch_dev *ca;
|
||||
|
||||
enum {
|
||||
BTREE_BITMAP_NO,
|
||||
BTREE_BITMAP_YES,
|
||||
BTREE_BITMAP_ANY,
|
||||
} btree_bitmap;
|
||||
|
||||
struct {
|
||||
u64 buckets_seen;
|
||||
u64 skipped_open;
|
||||
u64 skipped_need_journal_commit;
|
||||
u64 need_journal_commit;
|
||||
u64 skipped_nocow;
|
||||
u64 skipped_nouse;
|
||||
u64 skipped_mi_btree_bitmap;
|
||||
} counters;
|
||||
|
||||
unsigned scratch_nr_replicas;
|
||||
unsigned scratch_nr_effective;
|
||||
bool scratch_have_cache;
|
||||
enum bch_data_type scratch_data_type;
|
||||
struct open_buckets scratch_ptrs;
|
||||
struct bch_devs_mask scratch_devs_may_alloc;
|
||||
};
|
||||
|
||||
void bch2_dev_alloc_list(struct bch_fs *,
|
||||
struct dev_stripe_state *,
|
||||
struct bch_devs_mask *,
|
||||
struct dev_alloc_list *);
|
||||
void bch2_dev_stripe_increment(struct bch_dev *, struct dev_stripe_state *);
|
||||
|
||||
static inline struct bch_dev *ob_dev(struct bch_fs *c, struct open_bucket *ob)
|
||||
{
|
||||
return bch2_dev_have_ref(c, ob->dev);
|
||||
}
|
||||
|
||||
static inline unsigned bch2_open_buckets_reserved(enum bch_watermark watermark)
|
||||
{
|
||||
switch (watermark) {
|
||||
case BCH_WATERMARK_interior_updates:
|
||||
return 0;
|
||||
case BCH_WATERMARK_reclaim:
|
||||
return OPEN_BUCKETS_COUNT / 6;
|
||||
case BCH_WATERMARK_btree:
|
||||
case BCH_WATERMARK_btree_copygc:
|
||||
return OPEN_BUCKETS_COUNT / 4;
|
||||
case BCH_WATERMARK_copygc:
|
||||
return OPEN_BUCKETS_COUNT / 3;
|
||||
default:
|
||||
return OPEN_BUCKETS_COUNT / 2;
|
||||
}
|
||||
}
|
||||
|
||||
struct open_bucket *bch2_bucket_alloc(struct bch_fs *, struct bch_dev *,
|
||||
enum bch_watermark, enum bch_data_type,
|
||||
struct closure *);
|
||||
|
||||
static inline void ob_push(struct bch_fs *c, struct open_buckets *obs,
|
||||
struct open_bucket *ob)
|
||||
{
|
||||
BUG_ON(obs->nr >= ARRAY_SIZE(obs->v));
|
||||
|
||||
obs->v[obs->nr++] = ob - c->open_buckets;
|
||||
}
|
||||
|
||||
#define open_bucket_for_each(_c, _obs, _ob, _i) \
|
||||
for ((_i) = 0; \
|
||||
(_i) < (_obs)->nr && \
|
||||
((_ob) = (_c)->open_buckets + (_obs)->v[_i], true); \
|
||||
(_i)++)
|
||||
|
||||
static inline struct open_bucket *ec_open_bucket(struct bch_fs *c,
|
||||
struct open_buckets *obs)
|
||||
{
|
||||
struct open_bucket *ob;
|
||||
unsigned i;
|
||||
|
||||
open_bucket_for_each(c, obs, ob, i)
|
||||
if (ob->ec)
|
||||
return ob;
|
||||
|
||||
return NULL;
|
||||
}
|
||||
|
||||
void bch2_open_bucket_write_error(struct bch_fs *,
|
||||
struct open_buckets *, unsigned, int);
|
||||
|
||||
void __bch2_open_bucket_put(struct bch_fs *, struct open_bucket *);
|
||||
|
||||
static inline void bch2_open_bucket_put(struct bch_fs *c, struct open_bucket *ob)
|
||||
{
|
||||
if (atomic_dec_and_test(&ob->pin))
|
||||
__bch2_open_bucket_put(c, ob);
|
||||
}
|
||||
|
||||
static inline void bch2_open_buckets_put(struct bch_fs *c,
|
||||
struct open_buckets *ptrs)
|
||||
{
|
||||
struct open_bucket *ob;
|
||||
unsigned i;
|
||||
|
||||
open_bucket_for_each(c, ptrs, ob, i)
|
||||
bch2_open_bucket_put(c, ob);
|
||||
ptrs->nr = 0;
|
||||
}
|
||||
|
||||
static inline void bch2_alloc_sectors_done_inlined(struct bch_fs *c, struct write_point *wp)
|
||||
{
|
||||
struct open_buckets ptrs = { .nr = 0 }, keep = { .nr = 0 };
|
||||
struct open_bucket *ob;
|
||||
unsigned i;
|
||||
|
||||
open_bucket_for_each(c, &wp->ptrs, ob, i)
|
||||
ob_push(c, ob->sectors_free < block_sectors(c)
|
||||
? &ptrs
|
||||
: &keep, ob);
|
||||
wp->ptrs = keep;
|
||||
|
||||
mutex_unlock(&wp->lock);
|
||||
|
||||
bch2_open_buckets_put(c, &ptrs);
|
||||
}
|
||||
|
||||
static inline void bch2_open_bucket_get(struct bch_fs *c,
|
||||
struct write_point *wp,
|
||||
struct open_buckets *ptrs)
|
||||
{
|
||||
struct open_bucket *ob;
|
||||
unsigned i;
|
||||
|
||||
open_bucket_for_each(c, &wp->ptrs, ob, i) {
|
||||
ob->data_type = wp->data_type;
|
||||
atomic_inc(&ob->pin);
|
||||
ob_push(c, ptrs, ob);
|
||||
}
|
||||
}
|
||||
|
||||
static inline open_bucket_idx_t *open_bucket_hashslot(struct bch_fs *c,
|
||||
unsigned dev, u64 bucket)
|
||||
{
|
||||
return c->open_buckets_hash +
|
||||
(jhash_3words(dev, bucket, bucket >> 32, 0) &
|
||||
(OPEN_BUCKETS_COUNT - 1));
|
||||
}
|
||||
|
||||
static inline bool bch2_bucket_is_open(struct bch_fs *c, unsigned dev, u64 bucket)
|
||||
{
|
||||
open_bucket_idx_t slot = *open_bucket_hashslot(c, dev, bucket);
|
||||
|
||||
while (slot) {
|
||||
struct open_bucket *ob = &c->open_buckets[slot];
|
||||
|
||||
if (ob->dev == dev && ob->bucket == bucket)
|
||||
return true;
|
||||
|
||||
slot = ob->hash;
|
||||
}
|
||||
|
||||
return false;
|
||||
}
|
||||
|
||||
static inline bool bch2_bucket_is_open_safe(struct bch_fs *c, unsigned dev, u64 bucket)
|
||||
{
|
||||
bool ret;
|
||||
|
||||
if (bch2_bucket_is_open(c, dev, bucket))
|
||||
return true;
|
||||
|
||||
spin_lock(&c->freelist_lock);
|
||||
ret = bch2_bucket_is_open(c, dev, bucket);
|
||||
spin_unlock(&c->freelist_lock);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
enum bch_write_flags;
|
||||
int bch2_bucket_alloc_set_trans(struct btree_trans *, struct alloc_request *,
|
||||
struct dev_stripe_state *, struct closure *);
|
||||
|
||||
int bch2_alloc_sectors_start_trans(struct btree_trans *,
|
||||
unsigned, unsigned,
|
||||
struct write_point_specifier,
|
||||
struct bch_devs_list *,
|
||||
unsigned, unsigned,
|
||||
enum bch_watermark,
|
||||
enum bch_write_flags,
|
||||
struct closure *,
|
||||
struct write_point **);
|
||||
|
||||
static inline struct bch_extent_ptr bch2_ob_ptr(struct bch_fs *c, struct open_bucket *ob)
|
||||
{
|
||||
struct bch_dev *ca = ob_dev(c, ob);
|
||||
|
||||
return (struct bch_extent_ptr) {
|
||||
.type = 1 << BCH_EXTENT_ENTRY_ptr,
|
||||
.gen = ob->gen,
|
||||
.dev = ob->dev,
|
||||
.offset = bucket_to_sector(ca, ob->bucket) +
|
||||
ca->mi.bucket_size -
|
||||
ob->sectors_free,
|
||||
};
|
||||
}
|
||||
|
||||
/*
|
||||
* Append pointers to the space we just allocated to @k, and mark @sectors space
|
||||
* as allocated out of @ob
|
||||
*/
|
||||
static inline void
|
||||
bch2_alloc_sectors_append_ptrs_inlined(struct bch_fs *c, struct write_point *wp,
|
||||
struct bkey_i *k, unsigned sectors,
|
||||
bool cached)
|
||||
{
|
||||
struct open_bucket *ob;
|
||||
unsigned i;
|
||||
|
||||
BUG_ON(sectors > wp->sectors_free);
|
||||
wp->sectors_free -= sectors;
|
||||
wp->sectors_allocated += sectors;
|
||||
|
||||
open_bucket_for_each(c, &wp->ptrs, ob, i) {
|
||||
struct bch_dev *ca = ob_dev(c, ob);
|
||||
struct bch_extent_ptr ptr = bch2_ob_ptr(c, ob);
|
||||
|
||||
ptr.cached = cached ||
|
||||
(!ca->mi.durability &&
|
||||
wp->data_type == BCH_DATA_user);
|
||||
|
||||
bch2_bkey_append_ptr(k, ptr);
|
||||
|
||||
BUG_ON(sectors > ob->sectors_free);
|
||||
ob->sectors_free -= sectors;
|
||||
}
|
||||
}
|
||||
|
||||
void bch2_alloc_sectors_append_ptrs(struct bch_fs *, struct write_point *,
|
||||
struct bkey_i *, unsigned, bool);
|
||||
void bch2_alloc_sectors_done(struct bch_fs *, struct write_point *);
|
||||
|
||||
void bch2_open_buckets_stop(struct bch_fs *c, struct bch_dev *, bool);
|
||||
|
||||
static inline struct write_point_specifier writepoint_hashed(unsigned long v)
|
||||
{
|
||||
return (struct write_point_specifier) { .v = v | 1 };
|
||||
}
|
||||
|
||||
static inline struct write_point_specifier writepoint_ptr(struct write_point *wp)
|
||||
{
|
||||
return (struct write_point_specifier) { .v = (unsigned long) wp };
|
||||
}
|
||||
|
||||
void bch2_fs_allocator_foreground_init(struct bch_fs *);
|
||||
|
||||
void bch2_open_bucket_to_text(struct printbuf *, struct bch_fs *, struct open_bucket *);
|
||||
void bch2_open_buckets_to_text(struct printbuf *, struct bch_fs *, struct bch_dev *);
|
||||
void bch2_open_buckets_partial_to_text(struct printbuf *, struct bch_fs *);
|
||||
|
||||
void bch2_write_points_to_text(struct printbuf *, struct bch_fs *);
|
||||
|
||||
void bch2_fs_alloc_debug_to_text(struct printbuf *, struct bch_fs *);
|
||||
void bch2_dev_alloc_debug_to_text(struct printbuf *, struct bch_dev *);
|
||||
|
||||
void __bch2_wait_on_allocator(struct bch_fs *, struct closure *);
|
||||
static inline void bch2_wait_on_allocator(struct bch_fs *c, struct closure *cl)
|
||||
{
|
||||
if (cl->closure_get_happened)
|
||||
__bch2_wait_on_allocator(c, cl);
|
||||
}
|
||||
|
||||
#endif /* _BCACHEFS_ALLOC_FOREGROUND_H */
|
||||
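bch2_open_buckets_reserved() above holds back a slice of the OPEN_BUCKETS_COUNT pool from ordinary writes so copygc, btree writes and journal reclaim can still allocate under pressure. Below is a standalone sketch of the implied admission check; the fractions are copied from the header, while the surrounding function names and the example numbers are assumptions.

/*
 * Sketch of the watermark admission check implied by
 * bch2_open_buckets_reserved() above: an allocation at a given watermark
 * may only dip into the pool while more than its reserve remains free.
 */
#include <stdbool.h>
#include <stdio.h>

#define OPEN_BUCKETS_COUNT	1024

enum watermark { W_stripe, W_normal, W_copygc, W_btree, W_btree_copygc, W_reclaim, W_interior };

static unsigned reserved(enum watermark w)
{
	switch (w) {
	case W_interior:	return 0;
	case W_reclaim:		return OPEN_BUCKETS_COUNT / 6;
	case W_btree:
	case W_btree_copygc:	return OPEN_BUCKETS_COUNT / 4;
	case W_copygc:		return OPEN_BUCKETS_COUNT / 3;
	default:		return OPEN_BUCKETS_COUNT / 2;
	}
}

static bool may_alloc(unsigned free, enum watermark w)
{
	return free > reserved(w);
}

int main(void)
{
	printf("normal write with 300 free: %d\n", may_alloc(300, W_normal));	/* 0: held back */
	printf("btree write with 300 free:  %d\n", may_alloc(300, W_btree));	/* 1: allowed   */
	return 0;
}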
@@ -1,121 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ALLOC_TYPES_H
#define _BCACHEFS_ALLOC_TYPES_H

#include <linux/mutex.h>
#include <linux/spinlock.h>

#include "clock_types.h"
#include "fifo.h"

#define BCH_WATERMARKS()		\
	x(stripe)			\
	x(normal)			\
	x(copygc)			\
	x(btree)			\
	x(btree_copygc)			\
	x(reclaim)			\
	x(interior_updates)

enum bch_watermark {
#define x(name)	BCH_WATERMARK_##name,
	BCH_WATERMARKS()
#undef x
	BCH_WATERMARK_NR,
};

#define BCH_WATERMARK_BITS	3
#define BCH_WATERMARK_MASK	~(~0U << BCH_WATERMARK_BITS)

#define OPEN_BUCKETS_COUNT	1024

#define WRITE_POINT_HASH_NR	32
#define WRITE_POINT_MAX		32

/*
 * 0 is never a valid open_bucket_idx_t:
 */
typedef u16			open_bucket_idx_t;

struct open_bucket {
	spinlock_t		lock;
	atomic_t		pin;
	open_bucket_idx_t	freelist;
	open_bucket_idx_t	hash;

	/*
	 * When an open bucket has an ec_stripe attached, this is the index of
	 * the block in the stripe this open_bucket corresponds to:
	 */
	u8			ec_idx;
	enum bch_data_type	data_type:6;
	unsigned		valid:1;
	unsigned		on_partial_list:1;

	u8			dev;
	u8			gen;
	u32			sectors_free;
	u64			bucket;
	struct ec_stripe_new	*ec;
};

#define OPEN_BUCKET_LIST_MAX	15

struct open_buckets {
	open_bucket_idx_t	nr;
	open_bucket_idx_t	v[OPEN_BUCKET_LIST_MAX];
};

struct dev_stripe_state {
	u64			next_alloc[BCH_SB_MEMBERS_MAX];
};

#define WRITE_POINT_STATES()	\
	x(stopped)		\
	x(waiting_io)		\
	x(waiting_work)		\
	x(runnable)		\
	x(running)

enum write_point_state {
#define x(n)	WRITE_POINT_##n,
	WRITE_POINT_STATES()
#undef x
	WRITE_POINT_STATE_NR
};

struct write_point {
	struct {
		struct hlist_node	node;
		struct mutex		lock;
		u64			last_used;
		unsigned long		write_point;
		enum bch_data_type	data_type;

		/* calculated based on how many pointers we're actually going to use: */
		unsigned		sectors_free;

		struct open_buckets	ptrs;
		struct dev_stripe_state	stripe;

		u64			sectors_allocated;
	} __aligned(SMP_CACHE_BYTES);

	struct {
		struct work_struct	index_update_work;

		struct list_head	writes;
		spinlock_t		writes_lock;

		enum write_point_state	state;
		u64			last_state_change;
		u64			time[WRITE_POINT_STATE_NR];
		u64			last_runtime;
	} __aligned(SMP_CACHE_BYTES);
};

struct write_point_specifier {
	unsigned long		v;
};

#endif /* _BCACHEFS_ALLOC_TYPES_H */
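BCH_WATERMARK_BITS/BCH_WATERMARK_MASK above size the watermark enum so it fits in the low three bits of a flags word. Below is a standalone sketch of packing and unpacking a watermark that way; the extra flag bit used alongside it is arbitrary.

/* Pack/unpack a bch_watermark-style value in the low 3 bits of a flags word. */
#include <assert.h>

#define WATERMARK_BITS	3
#define WATERMARK_MASK	~(~0U << WATERMARK_BITS)

static unsigned flags_set_watermark(unsigned flags, unsigned watermark)
{
	return (flags & ~WATERMARK_MASK) | (watermark & WATERMARK_MASK);
}

static unsigned flags_watermark(unsigned flags)
{
	return flags & WATERMARK_MASK;
}

int main(void)
{
	unsigned flags = 1U << 5;		/* some unrelated flag bit */

	flags = flags_set_watermark(flags, 4);	/* e.g. BCH_WATERMARK_btree_copygc */
	assert(flags_watermark(flags) == 4 && (flags & (1U << 5)));
	return 0;
}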
@@ -1,132 +0,0 @@
// SPDX-License-Identifier: GPL-2.0
/*
 * Async obj debugging: keep asynchronous objects on (very fast) lists, make
 * them visible in debugfs:
 */

#include "bcachefs.h"
#include "async_objs.h"
#include "btree_io.h"
#include "debug.h"
#include "io_read.h"
#include "io_write.h"

#include <linux/debugfs.h>

static void promote_obj_to_text(struct printbuf *out, void *obj)
{
	bch2_promote_op_to_text(out, obj);
}

static void rbio_obj_to_text(struct printbuf *out, void *obj)
{
	bch2_read_bio_to_text(out, obj);
}

static void write_op_obj_to_text(struct printbuf *out, void *obj)
{
	bch2_write_op_to_text(out, obj);
}

static void btree_read_bio_obj_to_text(struct printbuf *out, void *obj)
{
	struct btree_read_bio *rbio = obj;
	bch2_btree_read_bio_to_text(out, rbio);
}

static void btree_write_bio_obj_to_text(struct printbuf *out, void *obj)
{
	struct btree_write_bio *wbio = obj;
	bch2_bio_to_text(out, &wbio->wbio.bio);
}

static int bch2_async_obj_list_open(struct inode *inode, struct file *file)
{
	struct async_obj_list *list = inode->i_private;
	struct dump_iter *i;

	i = kzalloc(sizeof(struct dump_iter), GFP_KERNEL);
	if (!i)
		return -ENOMEM;

	file->private_data = i;
	i->from = POS_MIN;
	i->iter	= 0;
	i->c	= container_of(list, struct bch_fs, async_objs[list->idx]);
	i->list	= list;
	i->buf	= PRINTBUF;
	return 0;
}

static ssize_t bch2_async_obj_list_read(struct file *file, char __user *buf,
					size_t size, loff_t *ppos)
{
	struct dump_iter *i = file->private_data;
	struct async_obj_list *list = i->list;
	ssize_t ret = 0;

	i->ubuf = buf;
	i->size	= size;
	i->ret	= 0;

	struct genradix_iter iter;
	void *obj;
	fast_list_for_each_from(&list->list, iter, obj, i->iter) {
		ret = bch2_debugfs_flush_buf(i);
		if (ret)
			return ret;

		if (!i->size)
			break;

		list->obj_to_text(&i->buf, obj);
	}

	if (i->buf.allocation_failure)
		ret = -ENOMEM;
	else
		i->iter = iter.pos;

	if (!ret)
		ret = bch2_debugfs_flush_buf(i);

	return ret ?: i->ret;
}

static const struct file_operations async_obj_ops = {
	.owner		= THIS_MODULE,
	.open		= bch2_async_obj_list_open,
	.release	= bch2_dump_release,
	.read		= bch2_async_obj_list_read,
};

void bch2_fs_async_obj_debugfs_init(struct bch_fs *c)
{
	c->async_obj_dir = debugfs_create_dir("async_objs", c->fs_debug_dir);

#define x(n) debugfs_create_file(#n, 0400, c->async_obj_dir, \
				 &c->async_objs[BCH_ASYNC_OBJ_LIST_##n], &async_obj_ops);
	BCH_ASYNC_OBJ_LISTS()
#undef x
}

void bch2_fs_async_obj_exit(struct bch_fs *c)
{
	for (unsigned i = 0; i < ARRAY_SIZE(c->async_objs); i++)
		fast_list_exit(&c->async_objs[i].list);
}

int bch2_fs_async_obj_init(struct bch_fs *c)
{
	for (unsigned i = 0; i < ARRAY_SIZE(c->async_objs); i++) {
		if (fast_list_init(&c->async_objs[i].list))
			return -BCH_ERR_ENOMEM_async_obj_init;
		c->async_objs[i].idx = i;
	}

#define x(n) c->async_objs[BCH_ASYNC_OBJ_LIST_##n].obj_to_text = n##_obj_to_text;
	BCH_ASYNC_OBJ_LISTS()
#undef x

	return 0;
}
@@ -1,44 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ASYNC_OBJS_H
#define _BCACHEFS_ASYNC_OBJS_H

#ifdef CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS
static inline void __async_object_list_del(struct fast_list *head, unsigned idx)
{
	fast_list_remove(head, idx);
}

static inline int __async_object_list_add(struct fast_list *head, void *obj, unsigned *idx)
{
	int ret = fast_list_add(head, obj);
	*idx = ret > 0 ? ret : 0;
	return ret < 0 ? ret : 0;
}

#define async_object_list_del(_c, _list, idx)		\
	__async_object_list_del(&(_c)->async_objs[BCH_ASYNC_OBJ_LIST_##_list].list, idx)

#define async_object_list_add(_c, _list, obj, idx)	\
	__async_object_list_add(&(_c)->async_objs[BCH_ASYNC_OBJ_LIST_##_list].list, obj, idx)

void bch2_fs_async_obj_debugfs_init(struct bch_fs *);
void bch2_fs_async_obj_exit(struct bch_fs *);
int bch2_fs_async_obj_init(struct bch_fs *);

#else /* CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS */

#define async_object_list_del(_c, _n, idx)		do {} while (0)

static inline int __async_object_list_add(void)
{
	return 0;
}
#define async_object_list_add(_c, _n, obj, idx)		__async_object_list_add()

static inline void bch2_fs_async_obj_debugfs_init(struct bch_fs *c) {}
static inline void bch2_fs_async_obj_exit(struct bch_fs *c) {}
static inline int bch2_fs_async_obj_init(struct bch_fs *c) { return 0; }

#endif /* CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS */

#endif /* _BCACHEFS_ASYNC_OBJS_H */
@@ -1,25 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_ASYNC_OBJS_TYPES_H
#define _BCACHEFS_ASYNC_OBJS_TYPES_H

#define BCH_ASYNC_OBJ_LISTS()	\
	x(promote)		\
	x(rbio)			\
	x(write_op)		\
	x(btree_read_bio)	\
	x(btree_write_bio)

enum bch_async_obj_lists {
#define x(n)	BCH_ASYNC_OBJ_LIST_##n,
	BCH_ASYNC_OBJ_LISTS()
#undef x
	BCH_ASYNC_OBJ_NR
};

struct async_obj_list {
	struct fast_list	list;
	void			(*obj_to_text)(struct printbuf *, void *);
	unsigned		idx;
};

#endif /* _BCACHEFS_ASYNC_OBJS_TYPES_H */
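The async-object lists declared above exist so every in-flight object (promote op, rbio, write op, btree bio) is registered while it lives and can be dumped from debugfs. Below is a standalone toy of that register/dump/unregister lifecycle; the slot table and names are invented for the sketch and much simpler than the fast_list used by the removed code.

/*
 * Standalone toy of the async-object-list idea above: live objects register
 * in a table and can be dumped at any time.
 */
#include <stdio.h>

#define NR_SLOTS 8

static const char *slots[NR_SLOTS];

static int obj_list_add(const char *what)
{
	for (int i = 0; i < NR_SLOTS; i++)
		if (!slots[i]) {
			slots[i] = what;
			return i;
		}
	return -1;
}

static void obj_list_del(int idx)
{
	if (idx >= 0)
		slots[idx] = NULL;
}

static void obj_list_dump(void)
{
	for (int i = 0; i < NR_SLOTS; i++)
		if (slots[i])
			printf("slot %d: %s\n", i, slots[i]);
}

int main(void)
{
	int a = obj_list_add("read bio 0x1000");
	int b = obj_list_add("write op 0x2000");

	obj_list_dump();	/* both in flight */
	obj_list_del(a);
	obj_list_dump();	/* only the write op remains */
	obj_list_del(b);
	return 0;
}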
File diff suppressed because it is too large
@@ -1,200 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BACKPOINTERS_H
|
||||
#define _BCACHEFS_BACKPOINTERS_H
|
||||
|
||||
#include "btree_cache.h"
|
||||
#include "btree_iter.h"
|
||||
#include "btree_update.h"
|
||||
#include "buckets.h"
|
||||
#include "error.h"
|
||||
#include "super.h"
|
||||
|
||||
static inline u64 swab40(u64 x)
|
||||
{
|
||||
return (((x & 0x00000000ffULL) << 32)|
|
||||
((x & 0x000000ff00ULL) << 16)|
|
||||
((x & 0x0000ff0000ULL) >> 0)|
|
||||
((x & 0x00ff000000ULL) >> 16)|
|
||||
((x & 0xff00000000ULL) >> 32));
|
||||
}
|
||||
|
||||
int bch2_backpointer_validate(struct bch_fs *, struct bkey_s_c k,
|
||||
struct bkey_validate_context);
|
||||
void bch2_backpointer_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
|
||||
void bch2_backpointer_swab(struct bkey_s);
|
||||
|
||||
#define bch2_bkey_ops_backpointer ((struct bkey_ops) { \
|
||||
.key_validate = bch2_backpointer_validate, \
|
||||
.val_to_text = bch2_backpointer_to_text, \
|
||||
.swab = bch2_backpointer_swab, \
|
||||
.min_val_size = 32, \
|
||||
})
|
||||
|
||||
#define MAX_EXTENT_COMPRESS_RATIO_SHIFT 10
|
||||
|
||||
/*
|
||||
* Convert from pos in backpointer btree to pos of corresponding bucket in alloc
|
||||
* btree:
|
||||
*/
|
||||
static inline struct bpos bp_pos_to_bucket(const struct bch_dev *ca, struct bpos bp_pos)
|
||||
{
|
||||
u64 bucket_sector = bp_pos.offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT;
|
||||
|
||||
return POS(bp_pos.inode, sector_to_bucket(ca, bucket_sector));
|
||||
}
|
||||
|
||||
static inline struct bpos bp_pos_to_bucket_and_offset(const struct bch_dev *ca, struct bpos bp_pos,
|
||||
u32 *bucket_offset)
|
||||
{
|
||||
u64 bucket_sector = bp_pos.offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT;
|
||||
|
||||
return POS(bp_pos.inode, sector_to_bucket_and_offset(ca, bucket_sector, bucket_offset));
|
||||
}
|
||||
|
||||
static inline bool bp_pos_to_bucket_nodev_noerror(struct bch_fs *c, struct bpos bp_pos, struct bpos *bucket)
|
||||
{
|
||||
guard(rcu)();
|
||||
struct bch_dev *ca = bch2_dev_rcu_noerror(c, bp_pos.inode);
|
||||
if (ca)
|
||||
*bucket = bp_pos_to_bucket(ca, bp_pos);
|
||||
return ca != NULL;
|
||||
}
|
||||
|
||||
static inline struct bpos bucket_pos_to_bp_noerror(const struct bch_dev *ca,
|
||||
struct bpos bucket,
|
||||
u64 bucket_offset)
|
||||
{
|
||||
return POS(bucket.inode,
|
||||
(bucket_to_sector(ca, bucket.offset) <<
|
||||
MAX_EXTENT_COMPRESS_RATIO_SHIFT) + bucket_offset);
|
||||
}
|
||||
|
||||
/*
|
||||
* Convert from pos in alloc btree + bucket offset to pos in backpointer btree:
|
||||
*/
|
||||
static inline struct bpos bucket_pos_to_bp(const struct bch_dev *ca,
|
||||
struct bpos bucket,
|
||||
u64 bucket_offset)
|
||||
{
|
||||
struct bpos ret = bucket_pos_to_bp_noerror(ca, bucket, bucket_offset);
|
||||
EBUG_ON(!bkey_eq(bucket, bp_pos_to_bucket(ca, ret)));
|
||||
return ret;
|
||||
}
|
||||
|
||||
static inline struct bpos bucket_pos_to_bp_start(const struct bch_dev *ca, struct bpos bucket)
|
||||
{
|
||||
return bucket_pos_to_bp(ca, bucket, 0);
|
||||
}
|
||||
|
||||
static inline struct bpos bucket_pos_to_bp_end(const struct bch_dev *ca, struct bpos bucket)
|
||||
{
|
||||
return bpos_nosnap_predecessor(bucket_pos_to_bp(ca, bpos_nosnap_successor(bucket), 0));
|
||||
}
|
||||
|
||||
int bch2_bucket_backpointer_mod_nowritebuffer(struct btree_trans *,
|
||||
struct bkey_s_c,
|
||||
struct bkey_i_backpointer *,
|
||||
bool);
|
||||
|
||||
static inline int bch2_bucket_backpointer_mod(struct btree_trans *trans,
|
||||
struct bkey_s_c orig_k,
|
||||
struct bkey_i_backpointer *bp,
|
||||
bool insert)
|
||||
{
|
||||
if (static_branch_unlikely(&bch2_backpointers_no_use_write_buffer))
|
||||
return bch2_bucket_backpointer_mod_nowritebuffer(trans, orig_k, bp, insert);
|
||||
|
||||
if (!insert) {
|
||||
bp->k.type = KEY_TYPE_deleted;
|
||||
set_bkey_val_u64s(&bp->k, 0);
|
||||
}
|
||||
|
||||
return bch2_trans_update_buffered(trans, BTREE_ID_backpointers, &bp->k_i);
|
||||
}
|
||||
|
||||
static inline enum bch_data_type bch2_bkey_ptr_data_type(struct bkey_s_c k,
|
||||
struct extent_ptr_decoded p,
|
||||
const union bch_extent_entry *entry)
|
||||
{
|
||||
switch (k.k->type) {
|
||||
case KEY_TYPE_btree_ptr:
|
||||
case KEY_TYPE_btree_ptr_v2:
|
||||
return BCH_DATA_btree;
|
||||
case KEY_TYPE_extent:
|
||||
case KEY_TYPE_reflink_v:
|
||||
if (p.has_ec)
|
||||
return BCH_DATA_stripe;
|
||||
if (p.ptr.cached)
|
||||
return BCH_DATA_cached;
|
||||
else
|
||||
return BCH_DATA_user;
|
||||
case KEY_TYPE_stripe: {
|
||||
const struct bch_extent_ptr *ptr = &entry->ptr;
|
||||
struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
|
||||
|
||||
BUG_ON(ptr < s.v->ptrs ||
|
||||
ptr >= s.v->ptrs + s.v->nr_blocks);
|
||||
|
||||
return ptr >= s.v->ptrs + s.v->nr_blocks - s.v->nr_redundant
|
||||
? BCH_DATA_parity
|
||||
: BCH_DATA_user;
|
||||
}
|
||||
default:
|
||||
BUG();
|
||||
}
|
||||
}
|
||||
|
||||
static inline void bch2_extent_ptr_to_bp(struct bch_fs *c,
|
||||
enum btree_id btree_id, unsigned level,
|
||||
struct bkey_s_c k, struct extent_ptr_decoded p,
|
||||
const union bch_extent_entry *entry,
|
||||
struct bkey_i_backpointer *bp)
|
||||
{
|
||||
bkey_backpointer_init(&bp->k_i);
|
||||
bp->k.p.inode = p.ptr.dev;
|
||||
|
||||
if (k.k->type != KEY_TYPE_stripe)
|
||||
bp->k.p.offset = ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset;
|
||||
else {
|
||||
/*
|
||||
* Put stripe backpointers where they won't collide with the
|
||||
* extent backpointers within the stripe:
|
||||
*/
|
||||
struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
|
||||
bp->k.p.offset = ((u64) (p.ptr.offset + le16_to_cpu(s.v->sectors)) <<
|
||||
MAX_EXTENT_COMPRESS_RATIO_SHIFT) - 1;
|
||||
}
|
||||
|
||||
bp->v = (struct bch_backpointer) {
|
||||
.btree_id = btree_id,
|
||||
.level = level,
|
||||
.data_type = bch2_bkey_ptr_data_type(k, p, entry),
|
||||
.bucket_gen = p.ptr.gen,
|
||||
.bucket_len = ptr_disk_sectors(level ? btree_sectors(c) : k.k->size, p),
|
||||
.pos = k.k->p,
|
||||
};
|
||||
}
|
||||
|
||||
struct bkey_buf;
|
||||
struct bkey_s_c bch2_backpointer_get_key(struct btree_trans *, struct bkey_s_c_backpointer,
|
||||
struct btree_iter *, unsigned, struct bkey_buf *);
|
||||
struct btree *bch2_backpointer_get_node(struct btree_trans *, struct bkey_s_c_backpointer,
|
||||
struct btree_iter *, struct bkey_buf *);
|
||||
|
||||
int bch2_check_bucket_backpointer_mismatch(struct btree_trans *, struct bch_dev *, u64,
|
||||
bool, struct bkey_buf *);
|
||||
|
||||
int bch2_check_btree_backpointers(struct bch_fs *);
|
||||
int bch2_check_extents_to_backpointers(struct bch_fs *);
|
||||
int bch2_check_backpointers_to_extents(struct bch_fs *);
|
||||
|
||||
static inline bool bch2_bucket_bitmap_test(struct bucket_bitmap *b, u64 i)
|
||||
{
|
||||
unsigned long *bitmap = READ_ONCE(b->buckets);
|
||||
return bitmap && test_bit(i, bitmap);
|
||||
}
|
||||
|
||||
int bch2_bucket_bitmap_resize(struct bch_dev *, struct bucket_bitmap *, u64, u64);
|
||||
void bch2_bucket_bitmap_free(struct bucket_bitmap *);
|
||||
|
||||
#endif /* _BCACHEFS_BACKPOINTERS_BACKGROUND_H */
|
||||
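backpointers.h above maps between bucket positions and backpointer-btree positions by shifting bucket sectors left by MAX_EXTENT_COMPRESS_RATIO_SHIFT and adding an offset within the bucket. Below is a standalone round trip of that scaling; bucket_size is an example value rather than one read from a superblock, and the helper names are invented for the sketch.

/*
 * Standalone round-trip of the bucket <-> backpointer-btree position scaling
 * above: backpointer offsets are bucket sectors shifted left by
 * MAX_EXTENT_COMPRESS_RATIO_SHIFT, plus an offset within the bucket.
 */
#include <assert.h>
#include <stdint.h>

#define MAX_EXTENT_COMPRESS_RATIO_SHIFT	10

static const uint64_t bucket_size = 512;	/* sectors per bucket (example) */

static uint64_t bucket_to_bp_offset(uint64_t bucket, uint64_t bucket_offset)
{
	return ((bucket * bucket_size) << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + bucket_offset;
}

static uint64_t bp_offset_to_bucket(uint64_t bp_offset)
{
	return (bp_offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT) / bucket_size;
}

int main(void)
{
	uint64_t bp = bucket_to_bp_offset(1000, 37);

	assert(bp_offset_to_bucket(bp) == 1000);
	return 0;
}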
@@ -1,37 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BBPOS_H
#define _BCACHEFS_BBPOS_H

#include "bbpos_types.h"
#include "bkey_methods.h"
#include "btree_cache.h"

static inline int bbpos_cmp(struct bbpos l, struct bbpos r)
{
	return cmp_int(l.btree, r.btree) ?: bpos_cmp(l.pos, r.pos);
}

static inline struct bbpos bbpos_successor(struct bbpos pos)
{
	if (bpos_cmp(pos.pos, SPOS_MAX)) {
		pos.pos = bpos_successor(pos.pos);
		return pos;
	}

	if (pos.btree != BTREE_ID_NR) {
		pos.btree++;
		pos.pos = POS_MIN;
		return pos;
	}

	BUG();
}

static inline void bch2_bbpos_to_text(struct printbuf *out, struct bbpos pos)
{
	bch2_btree_id_to_text(out, pos.btree);
	prt_char(out, ':');
	bch2_bpos_to_text(out, pos.pos);
}

#endif /* _BCACHEFS_BBPOS_H */
@@ -1,18 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BBPOS_TYPES_H
#define _BCACHEFS_BBPOS_TYPES_H

struct bbpos {
	enum btree_id		btree;
	struct bpos		pos;
};

static inline struct bbpos BBPOS(enum btree_id btree, struct bpos pos)
{
	return (struct bbpos) { btree, pos };
}

#define BBPOS_MIN	BBPOS(0, POS_MIN)
#define BBPOS_MAX	BBPOS(BTREE_ID_NR - 1, SPOS_MAX)

#endif /* _BCACHEFS_BBPOS_TYPES_H */
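bbpos above orders positions first by btree id, then by bpos, and bbpos_successor() steps through that combined space. Below is a standalone toy walk over a tiny (btree, pos) domain showing the same advance-then-carry behaviour; the small fixed limits are only to keep the example printable.

/*
 * Toy walk over (btree, pos) pairs in the bbpos ordering above: advance pos
 * until it hits its maximum, then step to the next btree.
 */
#include <stdio.h>

#define NR_BTREES	3
#define POS_MAX		3u

struct bbpos { unsigned btree, pos; };

static struct bbpos bbpos_successor(struct bbpos p)
{
	if (p.pos < POS_MAX) {
		p.pos++;
		return p;
	}
	p.btree++;
	p.pos = 0;
	return p;
}

int main(void)
{
	for (struct bbpos p = { 0, 0 }; p.btree < NR_BTREES; p = bbpos_successor(p))
		printf("btree %u pos %u\n", p.btree, p.pos);
	return 0;
}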
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -1,473 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_IOCTL_H
|
||||
#define _BCACHEFS_IOCTL_H
|
||||
|
||||
#include <linux/uuid.h>
|
||||
#include <asm/ioctl.h>
|
||||
#include "bcachefs_format.h"
|
||||
#include "bkey_types.h"
|
||||
|
||||
/*
|
||||
* Flags common to multiple ioctls:
|
||||
*/
|
||||
#define BCH_FORCE_IF_DATA_LOST (1 << 0)
|
||||
#define BCH_FORCE_IF_METADATA_LOST (1 << 1)
|
||||
#define BCH_FORCE_IF_DATA_DEGRADED (1 << 2)
|
||||
#define BCH_FORCE_IF_METADATA_DEGRADED (1 << 3)
|
||||
|
||||
#define BCH_FORCE_IF_LOST \
|
||||
(BCH_FORCE_IF_DATA_LOST| \
|
||||
BCH_FORCE_IF_METADATA_LOST)
|
||||
#define BCH_FORCE_IF_DEGRADED \
|
||||
(BCH_FORCE_IF_DATA_DEGRADED| \
|
||||
BCH_FORCE_IF_METADATA_DEGRADED)
|
||||
|
||||
/*
|
||||
* If cleared, ioctl that refer to a device pass it as a pointer to a pathname
|
||||
* (e.g. /dev/sda1); if set, the dev field is the device's index within the
|
||||
* filesystem:
|
||||
*/
|
||||
#define BCH_BY_INDEX (1 << 4)
|
||||
|
||||
/*
|
||||
* For BCH_IOCTL_READ_SUPER: get superblock of a specific device, not filesystem
|
||||
* wide superblock:
|
||||
*/
|
||||
#define BCH_READ_DEV (1 << 5)
|
||||
|
||||
/* global control dev: */
|
||||
|
||||
/* These are currently broken, and probably unnecessary: */
|
||||
#if 0
|
||||
#define BCH_IOCTL_ASSEMBLE _IOW(0xbc, 1, struct bch_ioctl_assemble)
|
||||
#define BCH_IOCTL_INCREMENTAL _IOW(0xbc, 2, struct bch_ioctl_incremental)
|
||||
|
||||
struct bch_ioctl_assemble {
|
||||
__u32 flags;
|
||||
__u32 nr_devs;
|
||||
__u64 pad;
|
||||
__u64 devs[];
|
||||
};
|
||||
|
||||
struct bch_ioctl_incremental {
|
||||
__u32 flags;
|
||||
__u64 pad;
|
||||
__u64 dev;
|
||||
};
|
||||
#endif
|
||||
|
||||
/* filesystem ioctls: */
|
||||
|
||||
#define BCH_IOCTL_QUERY_UUID _IOR(0xbc, 1, struct bch_ioctl_query_uuid)
|
||||
|
||||
/* These only make sense when we also have incremental assembly */
|
||||
#if 0
|
||||
#define BCH_IOCTL_START _IOW(0xbc, 2, struct bch_ioctl_start)
|
||||
#define BCH_IOCTL_STOP _IO(0xbc, 3)
|
||||
#endif
|
||||
|
||||
#define BCH_IOCTL_DISK_ADD _IOW(0xbc, 4, struct bch_ioctl_disk)
|
||||
#define BCH_IOCTL_DISK_REMOVE _IOW(0xbc, 5, struct bch_ioctl_disk)
|
||||
#define BCH_IOCTL_DISK_ONLINE _IOW(0xbc, 6, struct bch_ioctl_disk)
|
||||
#define BCH_IOCTL_DISK_OFFLINE _IOW(0xbc, 7, struct bch_ioctl_disk)
|
||||
#define BCH_IOCTL_DISK_SET_STATE _IOW(0xbc, 8, struct bch_ioctl_disk_set_state)
|
||||
#define BCH_IOCTL_DATA _IOW(0xbc, 10, struct bch_ioctl_data)
|
||||
#define BCH_IOCTL_FS_USAGE _IOWR(0xbc, 11, struct bch_ioctl_fs_usage)
|
||||
#define BCH_IOCTL_DEV_USAGE _IOWR(0xbc, 11, struct bch_ioctl_dev_usage)
|
||||
#define BCH_IOCTL_READ_SUPER _IOW(0xbc, 12, struct bch_ioctl_read_super)
|
||||
#define BCH_IOCTL_DISK_GET_IDX _IOW(0xbc, 13, struct bch_ioctl_disk_get_idx)
|
||||
#define BCH_IOCTL_DISK_RESIZE _IOW(0xbc, 14, struct bch_ioctl_disk_resize)
|
||||
#define BCH_IOCTL_DISK_RESIZE_JOURNAL _IOW(0xbc,15, struct bch_ioctl_disk_resize_journal)
|
||||
|
||||
#define BCH_IOCTL_SUBVOLUME_CREATE _IOW(0xbc, 16, struct bch_ioctl_subvolume)
|
||||
#define BCH_IOCTL_SUBVOLUME_DESTROY _IOW(0xbc, 17, struct bch_ioctl_subvolume)
|
||||
|
||||
#define BCH_IOCTL_DEV_USAGE_V2 _IOWR(0xbc, 18, struct bch_ioctl_dev_usage_v2)
|
||||
|
||||
#define BCH_IOCTL_FSCK_OFFLINE _IOW(0xbc, 19, struct bch_ioctl_fsck_offline)
|
||||
#define BCH_IOCTL_FSCK_ONLINE _IOW(0xbc, 20, struct bch_ioctl_fsck_online)
|
||||
#define BCH_IOCTL_QUERY_ACCOUNTING _IOW(0xbc, 21, struct bch_ioctl_query_accounting)
|
||||
#define BCH_IOCTL_QUERY_COUNTERS _IOW(0xbc, 21, struct bch_ioctl_query_counters)
|
||||
|
||||
/* ioctl below act on a particular file, not the filesystem as a whole: */
|
||||
|
||||
#define BCHFS_IOC_REINHERIT_ATTRS _IOR(0xbc, 64, const char __user *)
|
||||
|
||||
/*
|
||||
* BCH_IOCTL_QUERY_UUID: get filesystem UUID
|
||||
*
|
||||
* Returns user visible UUID, not internal UUID (which may not ever be changed);
|
||||
* the filesystem's sysfs directory may be found under /sys/fs/bcachefs with
|
||||
* this UUID.
|
||||
*/
|
||||
struct bch_ioctl_query_uuid {
|
||||
__uuid_t uuid;
|
||||
};
|
||||
|
||||
#if 0
|
||||
struct bch_ioctl_start {
|
||||
__u32 flags;
|
||||
__u32 pad;
|
||||
};
|
||||
#endif
|
||||
|
||||
/*
|
||||
* BCH_IOCTL_DISK_ADD: add a new device to an existing filesystem
|
||||
*
|
||||
* The specified device must not be open or in use. On success, the new device
|
||||
* will be an online member of the filesystem just like any other member.
|
||||
*
|
||||
* The device must first be prepared by userspace by formatting with a bcachefs
|
||||
* superblock, which is only used for passing in superblock options/parameters
|
||||
* for that device (in struct bch_member). The new device's superblock should
|
||||
* not claim to be a member of any existing filesystem - UUIDs on it will be
|
||||
* ignored.
|
||||
*/
|
||||
|
||||
/*
|
||||
* BCH_IOCTL_DISK_REMOVE: permanently remove a member device from a filesystem
|
||||
*
|
||||
* Any data present on @dev will be permanently deleted, and @dev will be
|
||||
* removed from its slot in the filesystem's list of member devices. The device
|
||||
* may be either offline or offline.
|
||||
*
|
||||
* Will fail removing @dev would leave us with insufficient read write devices
|
||||
* or degraded/unavailable data, unless the approprate BCH_FORCE_IF_* flags are
|
||||
* set.
|
||||
*/
|
||||
|
||||
/*
|
||||
* BCH_IOCTL_DISK_ONLINE: given a disk that is already a member of a filesystem
|
||||
* but is not open (e.g. because we started in degraded mode), bring it online
|
||||
*
|
||||
* all existing data on @dev will be available once the device is online,
|
||||
* exactly as if @dev was present when the filesystem was first mounted
|
||||
*/
|
||||
|
||||
/*
|
||||
* BCH_IOCTL_DISK_OFFLINE: offline a disk, causing the kernel to close that
|
||||
* block device, without removing it from the filesystem (so it can be brought
|
||||
* back online later)
|
||||
*
|
||||
* Data present on @dev will be unavailable while @dev is offline (unless
|
||||
* replicated), but will still be intact and untouched if @dev is brought back
|
||||
* online
|
||||
*
|
||||
* Will fail (similarly to BCH_IOCTL_DISK_SET_STATE) if offlining @dev would
|
||||
* leave us with insufficient read write devices or degraded/unavailable data,
|
||||
* unless the approprate BCH_FORCE_IF_* flags are set.
|
||||
*/
|
||||
|
||||
struct bch_ioctl_disk {
|
||||
__u32 flags;
|
||||
__u32 pad;
|
||||
__u64 dev;
|
||||
};
|
||||
|
||||
/*
|
||||
* BCH_IOCTL_DISK_SET_STATE: modify state of a member device of a filesystem
|
||||
*
|
||||
* @new_state - one of the bch_member_state states (rw, ro, failed,
|
||||
* spare)
|
||||
*
|
||||
* Will refuse to change member state if we would then have insufficient devices
|
||||
* to write to, or if it would result in degraded data (when @new_state is
|
||||
* failed or spare) unless the appropriate BCH_FORCE_IF_* flags are set.
|
||||
*/
|
||||
struct bch_ioctl_disk_set_state {
|
||||
__u32 flags;
|
||||
__u8 new_state;
|
||||
__u8 pad[3];
|
||||
__u64 dev;
|
||||
};
|
||||
|
||||
#define BCH_DATA_OPS() \
|
||||
x(scrub, 0) \
|
||||
x(rereplicate, 1) \
|
||||
x(migrate, 2) \
|
||||
x(rewrite_old_nodes, 3) \
|
||||
x(drop_extra_replicas, 4)
|
||||
|
||||
enum bch_data_ops {
|
||||
#define x(t, n) BCH_DATA_OP_##t = n,
|
||||
BCH_DATA_OPS()
|
||||
#undef x
|
||||
BCH_DATA_OP_NR
|
||||
};
|
||||
|
||||
/*
|
||||
* BCH_IOCTL_DATA: operations that walk and manipulate filesystem data (e.g.
|
||||
* scrub, rereplicate, migrate).
|
||||
*
|
||||
* This ioctl kicks off a job in the background, and returns a file descriptor.
|
||||
* Reading from the file descriptor returns a struct bch_ioctl_data_event,
|
||||
* indicating current progress, and closing the file descriptor will stop the
|
||||
* job. The file descriptor is O_CLOEXEC.
|
||||
*/
|
||||
struct bch_ioctl_data {
|
||||
__u16 op;
|
||||
__u8 start_btree;
|
||||
__u8 end_btree;
|
||||
__u32 flags;
|
||||
|
||||
struct bpos start_pos;
|
||||
struct bpos end_pos;
|
||||
|
||||
union {
|
||||
struct {
|
||||
__u32 dev;
|
||||
__u32 data_types;
|
||||
} scrub;
|
||||
struct {
|
||||
__u32 dev;
|
||||
__u32 pad;
|
||||
} migrate;
|
||||
struct {
|
||||
__u64 pad[8];
|
||||
};
|
||||
};
|
||||
} __packed __aligned(8);
|
||||
|
||||
enum bch_data_event {
|
||||
BCH_DATA_EVENT_PROGRESS = 0,
|
||||
/* XXX: add an event for reporting errors */
|
||||
BCH_DATA_EVENT_NR = 1,
|
||||
};
|
||||
|
||||
enum data_progress_data_type_special {
|
||||
DATA_PROGRESS_DATA_TYPE_phys = 254,
|
||||
DATA_PROGRESS_DATA_TYPE_done = 255,
|
||||
};
|
||||
|
||||
struct bch_ioctl_data_progress {
|
||||
__u8 data_type;
|
||||
__u8 btree_id;
|
||||
__u8 pad[2];
|
||||
struct bpos pos;
|
||||
|
||||
__u64 sectors_done;
|
||||
__u64 sectors_total;
|
||||
__u64 sectors_error_corrected;
|
||||
__u64 sectors_error_uncorrected;
|
||||
} __packed __aligned(8);
|
||||
|
||||
enum bch_ioctl_data_event_ret {
|
||||
BCH_IOCTL_DATA_EVENT_RET_done = 1,
|
||||
BCH_IOCTL_DATA_EVENT_RET_device_offline = 2,
|
||||
};
|
||||
|
||||
struct bch_ioctl_data_event {
|
||||
__u8 type;
|
||||
__u8 ret;
|
||||
__u8 pad[6];
|
||||
union {
|
||||
struct bch_ioctl_data_progress p;
|
||||
__u64 pad2[15];
|
||||
};
|
||||
} __packed __aligned(8);
|
||||
|
||||
struct bch_replicas_usage {
|
||||
__u64 sectors;
|
||||
struct bch_replicas_entry_v1 r;
|
||||
} __packed;
|
||||
|
||||
static inline unsigned replicas_usage_bytes(struct bch_replicas_usage *u)
|
||||
{
|
||||
return offsetof(struct bch_replicas_usage, r) + replicas_entry_bytes(&u->r);
|
||||
}
|
||||
|
||||
static inline struct bch_replicas_usage *
|
||||
replicas_usage_next(struct bch_replicas_usage *u)
|
||||
{
|
||||
return (void *) u + replicas_usage_bytes(u);
|
||||
}
|
||||
|
||||
/* Obsolete */
|
||||
/*
|
||||
* BCH_IOCTL_FS_USAGE: query filesystem disk space usage
|
||||
*
|
||||
* Returns disk space usage broken out by data type, number of replicas, and
|
||||
* by component device
|
||||
*
|
||||
* @replica_entries_bytes - size, in bytes, allocated for replica usage entries
|
||||
*
|
||||
* On success, @replica_entries_bytes will be changed to indicate the number of
|
||||
* bytes actually used.
|
||||
*
|
||||
* Returns -ERANGE if @replica_entries_bytes was too small
|
||||
*/
|
||||
struct bch_ioctl_fs_usage {
|
||||
__u64 capacity;
|
||||
__u64 used;
|
||||
__u64 online_reserved;
|
||||
__u64 persistent_reserved[BCH_REPLICAS_MAX];
|
||||
|
||||
__u32 replica_entries_bytes;
|
||||
__u32 pad;
|
||||
|
||||
struct bch_replicas_usage replicas[];
|
||||
};
|
||||
|
||||
/* Obsolete */
|
||||
/*
|
||||
* BCH_IOCTL_DEV_USAGE: query device disk space usage
|
||||
*
|
||||
* Returns disk space usage broken out by data type - both by buckets and
|
||||
* sectors.
|
||||
*/
|
||||
struct bch_ioctl_dev_usage {
|
||||
__u64 dev;
|
||||
__u32 flags;
|
||||
__u8 state;
|
||||
__u8 pad[7];
|
||||
|
||||
__u32 bucket_size;
|
||||
__u64 nr_buckets;
|
||||
|
||||
__u64 buckets_ec;
|
||||
|
||||
struct bch_ioctl_dev_usage_type {
|
||||
__u64 buckets;
|
||||
__u64 sectors;
|
||||
__u64 fragmented;
|
||||
} d[10];
|
||||
};
|
||||
|
||||
/* Obsolete */
|
||||
struct bch_ioctl_dev_usage_v2 {
|
||||
__u64 dev;
|
||||
__u32 flags;
|
||||
__u8 state;
|
||||
__u8 nr_data_types;
|
||||
__u8 pad[6];
|
||||
|
||||
__u32 bucket_size;
|
||||
__u64 nr_buckets;
|
||||
|
||||
struct bch_ioctl_dev_usage_type d[];
|
||||
};
|
||||
|
||||
/*
|
||||
* BCH_IOCTL_READ_SUPER: read filesystem superblock
|
||||
*
|
||||
* Equivalent to reading the superblock directly from the block device, except
|
||||
* avoids racing with the kernel writing the superblock or having to figure out
|
||||
* which block device to read
|
||||
*
|
||||
* @sb - buffer to read into
|
||||
* @size - size of userspace allocated buffer
|
||||
* @dev - device to read superblock for, if BCH_READ_DEV flag is
|
||||
* specified
|
||||
*
|
||||
* Returns -ERANGE if buffer provided is too small
|
||||
*/
|
||||
struct bch_ioctl_read_super {
|
||||
__u32 flags;
|
||||
__u32 pad;
|
||||
__u64 dev;
|
||||
__u64 size;
|
||||
__u64 sb;
|
||||
};
|
||||
/*
 * BCH_IOCTL_DISK_GET_IDX: give a path to a block device, query filesystem to
 * determine if disk is a (online) member - if so, returns device's index
 *
 * Returns -ENOENT if not found
 */
struct bch_ioctl_disk_get_idx {
	__u64			dev;
};

/*
 * BCH_IOCTL_DISK_RESIZE: resize filesystem on a device
 *
 * @dev		- member to resize
 * @nbuckets	- new number of buckets
 */
struct bch_ioctl_disk_resize {
	__u32			flags;
	__u32			pad;
	__u64			dev;
	__u64			nbuckets;
};

/*
 * BCH_IOCTL_DISK_RESIZE_JOURNAL: resize journal on a device
 *
 * @dev		- member to resize
 * @nbuckets	- new number of buckets
 */
struct bch_ioctl_disk_resize_journal {
	__u32			flags;
	__u32			pad;
	__u64			dev;
	__u64			nbuckets;
};

struct bch_ioctl_subvolume {
	__u32			flags;
	__u32			dirfd;
	__u16			mode;
	__u16			pad[3];
	__u64			dst_ptr;
	__u64			src_ptr;
};

#define BCH_SUBVOL_SNAPSHOT_CREATE	(1U << 0)
#define BCH_SUBVOL_SNAPSHOT_RO		(1U << 1)

/*
 * BCH_IOCTL_FSCK_OFFLINE: run fsck from the 'bcachefs fsck' userspace command,
 * but with the kernel's implementation of fsck:
 */
struct bch_ioctl_fsck_offline {
	__u64			flags;
	__u64			opts;		/* string */
	__u64			nr_devs;
	__u64			devs[]		__counted_by(nr_devs);
};

/*
 * BCH_IOCTL_FSCK_ONLINE: run fsck from the 'bcachefs fsck' userspace command,
 * but with the kernel's implementation of fsck:
 */
struct bch_ioctl_fsck_online {
	__u64			flags;
	__u64			opts;		/* string */
};

/*
 * BCH_IOCTL_QUERY_ACCOUNTING: query filesystem disk accounting
 *
 * Returns disk space usage broken out by data type, number of replicas, and
 * by component device
 *
 * @replica_entries_bytes - size, in bytes, allocated for replica usage entries
 *
 * On success, @replica_entries_bytes will be changed to indicate the number of
 * bytes actually used.
 *
 * Returns -ERANGE if @replica_entries_bytes was too small
 */
struct bch_ioctl_query_accounting {
	__u64			capacity;
	__u64			used;
	__u64			online_reserved;

	__u32			accounting_u64s;	/* input parameter */
	__u32			accounting_types_mask;	/* input parameter */

	struct bkey_i_accounting accounting[];
};

#define BCH_IOCTL_QUERY_COUNTERS_MOUNT	(1 << 0)

struct bch_ioctl_query_counters {
	__u16			nr;
	__u16			flags;
	__u32			pad;
	__u64			d[];
};

#endif /* _BCACHEFS_IOCTL_H */
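A hedged usage sketch of the accounting query: the struct's input parameter is accounting_u64s (in units of u64s), even though the comment above still refers to an older replica_entries_bytes name, and the buffer-too-small/-ERANGE retry convention is the one documented above. BCH_IOCTL_QUERY_ACCOUNTING is assumed to be defined earlier in this header; the buffer sizing and loop below are inferred, not taken from bcachefs-tools.

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

static struct bch_ioctl_query_accounting *
query_accounting(int fs_fd, unsigned types_mask)
{
	unsigned u64s = 256;

	for (;;) {
		struct bch_ioctl_query_accounting *a =
			malloc(sizeof(*a) + u64s * sizeof(__u64));
		if (!a)
			return NULL;

		memset(a, 0, sizeof(*a));
		a->accounting_u64s	 = u64s;
		a->accounting_types_mask = types_mask;

		if (!ioctl(fs_fd, BCH_IOCTL_QUERY_ACCOUNTING, a))
			return a;	/* a->accounting[] now holds the entries */

		free(a);
		if (errno != ERANGE)
			return NULL;
		u64s *= 2;		/* buffer too small: retry with more room */
	}
}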
1112	fs/bcachefs/bkey.c
File diff suppressed because it is too large
@@ -1,605 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BKEY_H
|
||||
#define _BCACHEFS_BKEY_H
|
||||
|
||||
#include <linux/bug.h>
|
||||
#include "bcachefs_format.h"
|
||||
#include "bkey_types.h"
|
||||
#include "btree_types.h"
|
||||
#include "util.h"
|
||||
#include "vstructs.h"
|
||||
|
||||
#if 0
|
||||
|
||||
/*
|
||||
* compiled unpack functions are disabled, pending a new interface for
|
||||
* dynamically allocating executable memory:
|
||||
*/
|
||||
|
||||
#ifdef CONFIG_X86_64
|
||||
#define HAVE_BCACHEFS_COMPILED_UNPACK 1
|
||||
#endif
|
||||
#endif
|
||||
|
||||
void bch2_bkey_packed_to_binary_text(struct printbuf *,
|
||||
const struct bkey_format *,
|
||||
const struct bkey_packed *);
|
||||
|
||||
enum bkey_lr_packed {
|
||||
BKEY_PACKED_BOTH,
|
||||
BKEY_PACKED_RIGHT,
|
||||
BKEY_PACKED_LEFT,
|
||||
BKEY_PACKED_NONE,
|
||||
};
|
||||
|
||||
#define bkey_lr_packed(_l, _r) \
|
||||
((_l)->format + ((_r)->format << 1))
|
||||
|
||||
static inline void bkey_p_copy(struct bkey_packed *dst, const struct bkey_packed *src)
|
||||
{
|
||||
memcpy_u64s_small(dst, src, src->u64s);
|
||||
}
|
||||
|
||||
static inline void bkey_copy(struct bkey_i *dst, const struct bkey_i *src)
|
||||
{
|
||||
memcpy_u64s_small(dst, src, src->k.u64s);
|
||||
}
|
||||
|
||||
struct btree;
|
||||
|
||||
__pure
|
||||
unsigned bch2_bkey_greatest_differing_bit(const struct btree *,
|
||||
const struct bkey_packed *,
|
||||
const struct bkey_packed *);
|
||||
__pure
|
||||
unsigned bch2_bkey_ffs(const struct btree *, const struct bkey_packed *);
|
||||
|
||||
__pure
|
||||
int __bch2_bkey_cmp_packed_format_checked(const struct bkey_packed *,
|
||||
const struct bkey_packed *,
|
||||
const struct btree *);
|
||||
|
||||
__pure
|
||||
int __bch2_bkey_cmp_left_packed_format_checked(const struct btree *,
|
||||
const struct bkey_packed *,
|
||||
const struct bpos *);
|
||||
|
||||
__pure
|
||||
int bch2_bkey_cmp_packed(const struct btree *,
|
||||
const struct bkey_packed *,
|
||||
const struct bkey_packed *);
|
||||
|
||||
__pure
|
||||
int __bch2_bkey_cmp_left_packed(const struct btree *,
|
||||
const struct bkey_packed *,
|
||||
const struct bpos *);
|
||||
|
||||
static inline __pure
|
||||
int bkey_cmp_left_packed(const struct btree *b,
|
||||
const struct bkey_packed *l, const struct bpos *r)
|
||||
{
|
||||
return __bch2_bkey_cmp_left_packed(b, l, r);
|
||||
}
|
||||
|
||||
/*
|
||||
* The compiler generates better code when we pass bpos by ref, but it's often
|
||||
* enough terribly convenient to pass it by val... as much as I hate c++, const
|
||||
* ref would be nice here:
|
||||
*/
|
||||
__pure __flatten
|
||||
static inline int bkey_cmp_left_packed_byval(const struct btree *b,
|
||||
const struct bkey_packed *l,
|
||||
struct bpos r)
|
||||
{
|
||||
return bkey_cmp_left_packed(b, l, &r);
|
||||
}
|
||||
|
||||
static __always_inline bool bpos_eq(struct bpos l, struct bpos r)
|
||||
{
|
||||
return !((l.inode ^ r.inode) |
|
||||
(l.offset ^ r.offset) |
|
||||
(l.snapshot ^ r.snapshot));
|
||||
}
|
||||
|
||||
static __always_inline bool bpos_lt(struct bpos l, struct bpos r)
|
||||
{
|
||||
return l.inode != r.inode ? l.inode < r.inode :
|
||||
l.offset != r.offset ? l.offset < r.offset :
|
||||
l.snapshot != r.snapshot ? l.snapshot < r.snapshot : false;
|
||||
}
|
||||
|
||||
static __always_inline bool bpos_le(struct bpos l, struct bpos r)
|
||||
{
|
||||
return l.inode != r.inode ? l.inode < r.inode :
|
||||
l.offset != r.offset ? l.offset < r.offset :
|
||||
l.snapshot != r.snapshot ? l.snapshot < r.snapshot : true;
|
||||
}
|
||||
|
||||
static __always_inline bool bpos_gt(struct bpos l, struct bpos r)
|
||||
{
|
||||
return bpos_lt(r, l);
|
||||
}
|
||||
|
||||
static __always_inline bool bpos_ge(struct bpos l, struct bpos r)
|
||||
{
|
||||
return bpos_le(r, l);
|
||||
}
|
||||
|
||||
static __always_inline int bpos_cmp(struct bpos l, struct bpos r)
|
||||
{
|
||||
return cmp_int(l.inode, r.inode) ?:
|
||||
cmp_int(l.offset, r.offset) ?:
|
||||
cmp_int(l.snapshot, r.snapshot);
|
||||
}
|
||||
|
||||
static inline struct bpos bpos_min(struct bpos l, struct bpos r)
|
||||
{
|
||||
return bpos_lt(l, r) ? l : r;
|
||||
}
|
||||
|
||||
static inline struct bpos bpos_max(struct bpos l, struct bpos r)
|
||||
{
|
||||
return bpos_gt(l, r) ? l : r;
|
||||
}
|
||||
|
||||
static __always_inline bool bkey_eq(struct bpos l, struct bpos r)
|
||||
{
|
||||
return !((l.inode ^ r.inode) |
|
||||
(l.offset ^ r.offset));
|
||||
}
|
||||
|
||||
static __always_inline bool bkey_lt(struct bpos l, struct bpos r)
|
||||
{
|
||||
return l.inode != r.inode
|
||||
? l.inode < r.inode
|
||||
: l.offset < r.offset;
|
||||
}
|
||||
|
||||
static __always_inline bool bkey_le(struct bpos l, struct bpos r)
|
||||
{
|
||||
return l.inode != r.inode
|
||||
? l.inode < r.inode
|
||||
: l.offset <= r.offset;
|
||||
}
|
||||
|
||||
static __always_inline bool bkey_gt(struct bpos l, struct bpos r)
|
||||
{
|
||||
return bkey_lt(r, l);
|
||||
}
|
||||
|
||||
static __always_inline bool bkey_ge(struct bpos l, struct bpos r)
|
||||
{
|
||||
return bkey_le(r, l);
|
||||
}
|
||||
|
||||
static __always_inline int bkey_cmp(struct bpos l, struct bpos r)
|
||||
{
|
||||
return cmp_int(l.inode, r.inode) ?:
|
||||
cmp_int(l.offset, r.offset);
|
||||
}
|
||||
|
||||
static inline struct bpos bkey_min(struct bpos l, struct bpos r)
|
||||
{
|
||||
return bkey_lt(l, r) ? l : r;
|
||||
}
|
||||
|
||||
static inline struct bpos bkey_max(struct bpos l, struct bpos r)
|
||||
{
|
||||
return bkey_gt(l, r) ? l : r;
|
||||
}
|
||||
|
||||
static inline bool bkey_and_val_eq(struct bkey_s_c l, struct bkey_s_c r)
|
||||
{
|
||||
return bpos_eq(l.k->p, r.k->p) &&
|
||||
l.k->size == r.k->size &&
|
||||
bkey_bytes(l.k) == bkey_bytes(r.k) &&
|
||||
!memcmp(l.v, r.v, bkey_val_bytes(l.k));
|
||||
}
|
||||
|
||||
void bch2_bpos_swab(struct bpos *);
|
||||
void bch2_bkey_swab_key(const struct bkey_format *, struct bkey_packed *);
|
||||
|
||||
static __always_inline int bversion_cmp(struct bversion l, struct bversion r)
|
||||
{
|
||||
return cmp_int(l.hi, r.hi) ?:
|
||||
cmp_int(l.lo, r.lo);
|
||||
}
|
||||
|
||||
#define ZERO_VERSION ((struct bversion) { .hi = 0, .lo = 0 })
|
||||
#define MAX_VERSION ((struct bversion) { .hi = ~0, .lo = ~0ULL })
|
||||
|
||||
static __always_inline bool bversion_zero(struct bversion v)
|
||||
{
|
||||
return bversion_cmp(v, ZERO_VERSION) == 0;
|
||||
}
|
||||
|
||||
#ifdef CONFIG_BCACHEFS_DEBUG
|
||||
/* statement expressions confusing unlikely()? */
|
||||
#define bkey_packed(_k) \
|
||||
({ EBUG_ON((_k)->format > KEY_FORMAT_CURRENT); \
|
||||
(_k)->format != KEY_FORMAT_CURRENT; })
|
||||
#else
|
||||
#define bkey_packed(_k) ((_k)->format != KEY_FORMAT_CURRENT)
|
||||
#endif
|
||||
|
||||
/*
|
||||
* It's safe to treat an unpacked bkey as a packed one, but not the reverse
|
||||
*/
|
||||
static inline struct bkey_packed *bkey_to_packed(struct bkey_i *k)
|
||||
{
|
||||
return (struct bkey_packed *) k;
|
||||
}
|
||||
|
||||
static inline const struct bkey_packed *bkey_to_packed_c(const struct bkey_i *k)
|
||||
{
|
||||
return (const struct bkey_packed *) k;
|
||||
}
|
||||
|
||||
static inline struct bkey_i *packed_to_bkey(struct bkey_packed *k)
|
||||
{
|
||||
return bkey_packed(k) ? NULL : (struct bkey_i *) k;
|
||||
}
|
||||
|
||||
static inline const struct bkey *packed_to_bkey_c(const struct bkey_packed *k)
|
||||
{
|
||||
return bkey_packed(k) ? NULL : (const struct bkey *) k;
|
||||
}
|
||||
|
||||
static inline unsigned bkey_format_key_bits(const struct bkey_format *format)
|
||||
{
|
||||
return format->bits_per_field[BKEY_FIELD_INODE] +
|
||||
format->bits_per_field[BKEY_FIELD_OFFSET] +
|
||||
format->bits_per_field[BKEY_FIELD_SNAPSHOT];
|
||||
}
|
||||
|
||||
static inline struct bpos bpos_successor(struct bpos p)
|
||||
{
|
||||
if (!++p.snapshot &&
|
||||
!++p.offset &&
|
||||
!++p.inode)
|
||||
BUG();
|
||||
|
||||
return p;
|
||||
}
|
||||
|
||||
static inline struct bpos bpos_predecessor(struct bpos p)
|
||||
{
|
||||
if (!p.snapshot-- &&
|
||||
!p.offset-- &&
|
||||
!p.inode--)
|
||||
BUG();
|
||||
|
||||
return p;
|
||||
}
|
||||
|
||||
static inline struct bpos bpos_nosnap_successor(struct bpos p)
|
||||
{
|
||||
p.snapshot = 0;
|
||||
|
||||
if (!++p.offset &&
|
||||
!++p.inode)
|
||||
BUG();
|
||||
|
||||
return p;
|
||||
}
|
||||
|
||||
static inline struct bpos bpos_nosnap_predecessor(struct bpos p)
|
||||
{
|
||||
p.snapshot = 0;
|
||||
|
||||
if (!p.offset-- &&
|
||||
!p.inode--)
|
||||
BUG();
|
||||
|
||||
return p;
|
||||
}
|
||||
|
||||
static inline u64 bkey_start_offset(const struct bkey *k)
|
||||
{
|
||||
return k->p.offset - k->size;
|
||||
}
|
||||
|
||||
static inline struct bpos bkey_start_pos(const struct bkey *k)
|
||||
{
|
||||
return (struct bpos) {
|
||||
.inode = k->p.inode,
|
||||
.offset = bkey_start_offset(k),
|
||||
.snapshot = k->p.snapshot,
|
||||
};
|
||||
}
|
||||
|
||||
/* Packed helpers */
|
||||
|
||||
static inline unsigned bkeyp_key_u64s(const struct bkey_format *format,
|
||||
const struct bkey_packed *k)
|
||||
{
|
||||
return bkey_packed(k) ? format->key_u64s : BKEY_U64s;
|
||||
}
|
||||
|
||||
static inline bool bkeyp_u64s_valid(const struct bkey_format *f,
|
||||
const struct bkey_packed *k)
|
||||
{
|
||||
return ((unsigned) k->u64s - bkeyp_key_u64s(f, k) <= U8_MAX - BKEY_U64s);
|
||||
}
|
||||
|
||||
static inline unsigned bkeyp_key_bytes(const struct bkey_format *format,
|
||||
const struct bkey_packed *k)
|
||||
{
|
||||
return bkeyp_key_u64s(format, k) * sizeof(u64);
|
||||
}
|
||||
|
||||
static inline unsigned bkeyp_val_u64s(const struct bkey_format *format,
|
||||
const struct bkey_packed *k)
|
||||
{
|
||||
return k->u64s - bkeyp_key_u64s(format, k);
|
||||
}
|
||||
|
||||
static inline size_t bkeyp_val_bytes(const struct bkey_format *format,
|
||||
const struct bkey_packed *k)
|
||||
{
|
||||
return bkeyp_val_u64s(format, k) * sizeof(u64);
|
||||
}
|
||||
|
||||
static inline void set_bkeyp_val_u64s(const struct bkey_format *format,
|
||||
struct bkey_packed *k, unsigned val_u64s)
|
||||
{
|
||||
k->u64s = bkeyp_key_u64s(format, k) + val_u64s;
|
||||
}
|
||||
|
||||
#define bkeyp_val(_format, _k) \
|
||||
((struct bch_val *) ((u64 *) (_k)->_data + bkeyp_key_u64s(_format, _k)))
|
||||
|
||||
extern const struct bkey_format bch2_bkey_format_current;
|
||||
|
||||
bool bch2_bkey_transform(const struct bkey_format *,
|
||||
struct bkey_packed *,
|
||||
const struct bkey_format *,
|
||||
const struct bkey_packed *);
|
||||
|
||||
struct bkey __bch2_bkey_unpack_key(const struct bkey_format *,
|
||||
const struct bkey_packed *);
|
||||
|
||||
#ifndef HAVE_BCACHEFS_COMPILED_UNPACK
|
||||
struct bpos __bkey_unpack_pos(const struct bkey_format *,
|
||||
const struct bkey_packed *);
|
||||
#endif
|
||||
|
||||
bool bch2_bkey_pack_key(struct bkey_packed *, const struct bkey *,
|
||||
const struct bkey_format *);
|
||||
|
||||
enum bkey_pack_pos_ret {
|
||||
BKEY_PACK_POS_EXACT,
|
||||
BKEY_PACK_POS_SMALLER,
|
||||
BKEY_PACK_POS_FAIL,
|
||||
};
|
||||
|
||||
enum bkey_pack_pos_ret bch2_bkey_pack_pos_lossy(struct bkey_packed *, struct bpos,
|
||||
const struct btree *);
|
||||
|
||||
static inline bool bkey_pack_pos(struct bkey_packed *out, struct bpos in,
|
||||
const struct btree *b)
|
||||
{
|
||||
return bch2_bkey_pack_pos_lossy(out, in, b) == BKEY_PACK_POS_EXACT;
|
||||
}
|
||||
|
||||
void bch2_bkey_unpack(const struct btree *, struct bkey_i *,
|
||||
const struct bkey_packed *);
|
||||
bool bch2_bkey_pack(struct bkey_packed *, const struct bkey_i *,
|
||||
const struct bkey_format *);
|
||||
|
||||
typedef void (*compiled_unpack_fn)(struct bkey *, const struct bkey_packed *);
|
||||
|
||||
static inline void
|
||||
__bkey_unpack_key_format_checked(const struct btree *b,
|
||||
struct bkey *dst,
|
||||
const struct bkey_packed *src)
|
||||
{
|
||||
if (IS_ENABLED(HAVE_BCACHEFS_COMPILED_UNPACK)) {
|
||||
compiled_unpack_fn unpack_fn = b->aux_data;
|
||||
unpack_fn(dst, src);
|
||||
|
||||
if (static_branch_unlikely(&bch2_debug_check_bkey_unpack)) {
|
||||
struct bkey dst2 = __bch2_bkey_unpack_key(&b->format, src);
|
||||
|
||||
BUG_ON(memcmp(dst, &dst2, sizeof(*dst)));
|
||||
}
|
||||
} else {
|
||||
*dst = __bch2_bkey_unpack_key(&b->format, src);
|
||||
}
|
||||
}
|
||||
|
||||
static inline struct bkey
|
||||
bkey_unpack_key_format_checked(const struct btree *b,
|
||||
const struct bkey_packed *src)
|
||||
{
|
||||
struct bkey dst;
|
||||
|
||||
__bkey_unpack_key_format_checked(b, &dst, src);
|
||||
return dst;
|
||||
}
|
||||
|
||||
static inline void __bkey_unpack_key(const struct btree *b,
|
||||
struct bkey *dst,
|
||||
const struct bkey_packed *src)
|
||||
{
|
||||
if (likely(bkey_packed(src)))
|
||||
__bkey_unpack_key_format_checked(b, dst, src);
|
||||
else
|
||||
*dst = *packed_to_bkey_c(src);
|
||||
}
|
||||
|
||||
/**
|
||||
* bkey_unpack_key -- unpack just the key, not the value
|
||||
*/
|
||||
static inline struct bkey bkey_unpack_key(const struct btree *b,
|
||||
const struct bkey_packed *src)
|
||||
{
|
||||
return likely(bkey_packed(src))
|
||||
? bkey_unpack_key_format_checked(b, src)
|
||||
: *packed_to_bkey_c(src);
|
||||
}
|
||||
|
||||
static inline struct bpos
|
||||
bkey_unpack_pos_format_checked(const struct btree *b,
|
||||
const struct bkey_packed *src)
|
||||
{
|
||||
#ifdef HAVE_BCACHEFS_COMPILED_UNPACK
|
||||
return bkey_unpack_key_format_checked(b, src).p;
|
||||
#else
|
||||
return __bkey_unpack_pos(&b->format, src);
|
||||
#endif
|
||||
}
|
||||
|
||||
static inline struct bpos bkey_unpack_pos(const struct btree *b,
|
||||
const struct bkey_packed *src)
|
||||
{
|
||||
return likely(bkey_packed(src))
|
||||
? bkey_unpack_pos_format_checked(b, src)
|
||||
: packed_to_bkey_c(src)->p;
|
||||
}
|
||||
|
||||
/* Disassembled bkeys */
|
||||
|
||||
static inline struct bkey_s_c bkey_disassemble(const struct btree *b,
|
||||
const struct bkey_packed *k,
|
||||
struct bkey *u)
|
||||
{
|
||||
__bkey_unpack_key(b, u, k);
|
||||
|
||||
return (struct bkey_s_c) { u, bkeyp_val(&b->format, k), };
|
||||
}
|
||||
|
||||
/* non const version: */
|
||||
static inline struct bkey_s __bkey_disassemble(const struct btree *b,
|
||||
struct bkey_packed *k,
|
||||
struct bkey *u)
|
||||
{
|
||||
__bkey_unpack_key(b, u, k);
|
||||
|
||||
return (struct bkey_s) { .k = u, .v = bkeyp_val(&b->format, k), };
|
||||
}
|
||||
|
||||
static inline u64 bkey_field_max(const struct bkey_format *f,
|
||||
enum bch_bkey_fields nr)
|
||||
{
|
||||
return f->bits_per_field[nr] < 64
|
||||
? (le64_to_cpu(f->field_offset[nr]) +
|
||||
~(~0ULL << f->bits_per_field[nr]))
|
||||
: U64_MAX;
|
||||
}
|
||||
|
||||
#ifdef HAVE_BCACHEFS_COMPILED_UNPACK
|
||||
|
||||
int bch2_compile_bkey_format(const struct bkey_format *, void *);
|
||||
|
||||
#else
|
||||
|
||||
static inline int bch2_compile_bkey_format(const struct bkey_format *format,
|
||||
void *out) { return 0; }
|
||||
|
||||
#endif
|
||||
|
||||
static inline void bkey_reassemble(struct bkey_i *dst,
|
||||
struct bkey_s_c src)
|
||||
{
|
||||
dst->k = *src.k;
|
||||
memcpy_u64s_small(&dst->v, src.v, bkey_val_u64s(src.k));
|
||||
}
|
||||
|
||||
/* byte order helpers */
|
||||
|
||||
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
|
||||
|
||||
static inline unsigned high_word_offset(const struct bkey_format *f)
|
||||
{
|
||||
return f->key_u64s - 1;
|
||||
}
|
||||
|
||||
#define high_bit_offset 0
|
||||
#define nth_word(p, n) ((p) - (n))
|
||||
|
||||
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
|
||||
|
||||
static inline unsigned high_word_offset(const struct bkey_format *f)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
#define high_bit_offset KEY_PACKED_BITS_START
|
||||
#define nth_word(p, n) ((p) + (n))
|
||||
|
||||
#else
|
||||
#error edit for your odd byteorder.
|
||||
#endif
|
||||
|
||||
#define high_word(f, k) ((u64 *) (k)->_data + high_word_offset(f))
|
||||
#define next_word(p) nth_word(p, 1)
|
||||
#define prev_word(p) nth_word(p, -1)
|
||||
|
||||
#ifdef CONFIG_BCACHEFS_DEBUG
|
||||
void bch2_bkey_pack_test(void);
|
||||
#else
|
||||
static inline void bch2_bkey_pack_test(void) {}
|
||||
#endif
|
||||
|
||||
#define bkey_fields() \
|
||||
x(BKEY_FIELD_INODE, p.inode) \
|
||||
x(BKEY_FIELD_OFFSET, p.offset) \
|
||||
x(BKEY_FIELD_SNAPSHOT, p.snapshot) \
|
||||
x(BKEY_FIELD_SIZE, size) \
|
||||
x(BKEY_FIELD_VERSION_HI, bversion.hi) \
|
||||
x(BKEY_FIELD_VERSION_LO, bversion.lo)
|
||||
|
||||
struct bkey_format_state {
|
||||
u64 field_min[BKEY_NR_FIELDS];
|
||||
u64 field_max[BKEY_NR_FIELDS];
|
||||
};
|
||||
|
||||
void bch2_bkey_format_init(struct bkey_format_state *);
|
||||
|
||||
static inline void __bkey_format_add(struct bkey_format_state *s, unsigned field, u64 v)
|
||||
{
|
||||
s->field_min[field] = min(s->field_min[field], v);
|
||||
s->field_max[field] = max(s->field_max[field], v);
|
||||
}
|
||||
|
||||
/*
|
||||
* Changes @format so that @k can be successfully packed with @format
|
||||
*/
|
||||
static inline void bch2_bkey_format_add_key(struct bkey_format_state *s, const struct bkey *k)
|
||||
{
|
||||
#define x(id, field) __bkey_format_add(s, id, k->field);
|
||||
bkey_fields()
|
||||
#undef x
|
||||
}
|
||||
|
||||
void bch2_bkey_format_add_pos(struct bkey_format_state *, struct bpos);
|
||||
struct bkey_format bch2_bkey_format_done(struct bkey_format_state *);
|
||||
|
||||
static inline bool bch2_bkey_format_field_overflows(struct bkey_format *f, unsigned i)
|
||||
{
|
||||
unsigned f_bits = f->bits_per_field[i];
|
||||
unsigned unpacked_bits = bch2_bkey_format_current.bits_per_field[i];
|
||||
u64 unpacked_mask = ~((~0ULL << 1) << (unpacked_bits - 1));
|
||||
u64 field_offset = le64_to_cpu(f->field_offset[i]);
|
||||
|
||||
if (f_bits > unpacked_bits)
|
||||
return true;
|
||||
|
||||
if ((f_bits == unpacked_bits) && field_offset)
|
||||
return true;
|
||||
|
||||
u64 f_mask = f_bits
|
||||
? ~((~0ULL << (f_bits - 1)) << 1)
|
||||
: 0;
|
||||
|
||||
if (((field_offset + f_mask) & unpacked_mask) < field_offset)
|
||||
return true;
|
||||
return false;
|
||||
}
|
||||
|
||||
int bch2_bkey_format_invalid(struct bch_fs *, struct bkey_format *,
|
||||
enum bch_validate_flags, struct printbuf *);
|
||||
void bch2_bkey_format_to_text(struct printbuf *, const struct bkey_format *);
|
||||
|
||||
#endif /* _BCACHEFS_BKEY_H */
|
||||
@@ -1,61 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BKEY_BUF_H
#define _BCACHEFS_BKEY_BUF_H

#include "bcachefs.h"
#include "bkey.h"

struct bkey_buf {
	struct bkey_i	*k;
	u64		onstack[12];
};

static inline void bch2_bkey_buf_realloc(struct bkey_buf *s,
					 struct bch_fs *c, unsigned u64s)
{
	if (s->k == (void *) s->onstack &&
	    u64s > ARRAY_SIZE(s->onstack)) {
		s->k = mempool_alloc(&c->large_bkey_pool, GFP_NOFS);
		memcpy(s->k, s->onstack, sizeof(s->onstack));
	}
}

static inline void bch2_bkey_buf_reassemble(struct bkey_buf *s,
					    struct bch_fs *c,
					    struct bkey_s_c k)
{
	bch2_bkey_buf_realloc(s, c, k.k->u64s);
	bkey_reassemble(s->k, k);
}

static inline void bch2_bkey_buf_copy(struct bkey_buf *s,
				      struct bch_fs *c,
				      struct bkey_i *src)
{
	bch2_bkey_buf_realloc(s, c, src->k.u64s);
	bkey_copy(s->k, src);
}

static inline void bch2_bkey_buf_unpack(struct bkey_buf *s,
					struct bch_fs *c,
					struct btree *b,
					struct bkey_packed *src)
{
	bch2_bkey_buf_realloc(s, c, BKEY_U64s +
			      bkeyp_val_u64s(&b->format, src));
	bch2_bkey_unpack(b, s->k, src);
}

static inline void bch2_bkey_buf_init(struct bkey_buf *s)
{
	s->k = (void *) s->onstack;
}

static inline void bch2_bkey_buf_exit(struct bkey_buf *s, struct bch_fs *c)
{
	if (s->k != (void *) s->onstack)
		mempool_free(s->k, &c->large_bkey_pool);
	s->k = NULL;
}

#endif /* _BCACHEFS_BKEY_BUF_H */
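The header carries no prose documentation, so here is a brief, hypothetical kernel-side fragment showing the intended lifecycle, assuming a valid struct bch_fs *c and a bkey_s_c k from an iterator: the buffer starts on the stack and transparently falls back to the large-key mempool when the key does not fit.

	struct bkey_buf buf;

	bch2_bkey_buf_init(&buf);		/* buf.k points at onstack[] */
	bch2_bkey_buf_reassemble(&buf, c, k);	/* copies k, spilling to the
						 * large_bkey_pool mempool if
						 * it doesn't fit on the stack */

	/* ... use buf.k as a stable copy of the key ... */

	bch2_bkey_buf_exit(&buf, c);		/* frees the mempool element, if any */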
@@ -1,129 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BKEY_CMP_H
#define _BCACHEFS_BKEY_CMP_H

#include "bkey.h"

#ifdef CONFIG_X86_64
static inline int __bkey_cmp_bits(const u64 *l, const u64 *r,
				  unsigned nr_key_bits)
{
	long d0, d1, d2, d3;
	int cmp;

	/* we shouldn't need asm for this, but gcc is being retarded: */

	asm(".intel_syntax noprefix;"
	    "xor eax, eax;"
	    "xor edx, edx;"
	    "1:;"
	    "mov r8, [rdi];"
	    "mov r9, [rsi];"
	    "sub ecx, 64;"
	    "jl 2f;"

	    "cmp r8, r9;"
	    "jnz 3f;"

	    "lea rdi, [rdi - 8];"
	    "lea rsi, [rsi - 8];"
	    "jmp 1b;"

	    "2:;"
	    "not ecx;"
	    "shr r8, 1;"
	    "shr r9, 1;"
	    "shr r8, cl;"
	    "shr r9, cl;"
	    "cmp r8, r9;"

	    "3:\n"
	    "seta al;"
	    "setb dl;"
	    "sub eax, edx;"
	    ".att_syntax prefix;"
	    : "=&D" (d0), "=&S" (d1), "=&d" (d2), "=&c" (d3), "=&a" (cmp)
	    : "0" (l), "1" (r), "3" (nr_key_bits)
	    : "r8", "r9", "cc", "memory");

	return cmp;
}
#else
static inline int __bkey_cmp_bits(const u64 *l, const u64 *r,
				  unsigned nr_key_bits)
{
	u64 l_v, r_v;

	if (!nr_key_bits)
		return 0;

	/* for big endian, skip past header */
	nr_key_bits += high_bit_offset;
	l_v = *l & (~0ULL >> high_bit_offset);
	r_v = *r & (~0ULL >> high_bit_offset);

	while (1) {
		if (nr_key_bits < 64) {
			l_v >>= 64 - nr_key_bits;
			r_v >>= 64 - nr_key_bits;
			nr_key_bits = 0;
		} else {
			nr_key_bits -= 64;
		}

		if (!nr_key_bits || l_v != r_v)
			break;

		l = next_word(l);
		r = next_word(r);

		l_v = *l;
		r_v = *r;
	}

	return cmp_int(l_v, r_v);
}
#endif

static inline __pure __flatten
int __bch2_bkey_cmp_packed_format_checked_inlined(const struct bkey_packed *l,
						  const struct bkey_packed *r,
						  const struct btree *b)
{
	const struct bkey_format *f = &b->format;
	int ret;

	EBUG_ON(!bkey_packed(l) || !bkey_packed(r));
	EBUG_ON(b->nr_key_bits != bkey_format_key_bits(f));

	ret = __bkey_cmp_bits(high_word(f, l),
			      high_word(f, r),
			      b->nr_key_bits);

	EBUG_ON(ret != bpos_cmp(bkey_unpack_pos(b, l),
				bkey_unpack_pos(b, r)));
	return ret;
}

static inline __pure __flatten
int bch2_bkey_cmp_packed_inlined(const struct btree *b,
				 const struct bkey_packed *l,
				 const struct bkey_packed *r)
{
	struct bkey unpacked;

	if (likely(bkey_packed(l) && bkey_packed(r)))
		return __bch2_bkey_cmp_packed_format_checked_inlined(l, r, b);

	if (bkey_packed(l)) {
		__bkey_unpack_key_format_checked(b, &unpacked, l);
		l = (void *) &unpacked;
	} else if (bkey_packed(r)) {
		__bkey_unpack_key_format_checked(b, &unpacked, r);
		r = (void *) &unpacked;
	}

	return bpos_cmp(((struct bkey *) l)->p, ((struct bkey *) r)->p);
}

#endif /* _BCACHEFS_BKEY_CMP_H */
@@ -1,497 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "backpointers.h"
|
||||
#include "bkey_methods.h"
|
||||
#include "btree_cache.h"
|
||||
#include "btree_types.h"
|
||||
#include "alloc_background.h"
|
||||
#include "dirent.h"
|
||||
#include "disk_accounting.h"
|
||||
#include "ec.h"
|
||||
#include "error.h"
|
||||
#include "extents.h"
|
||||
#include "inode.h"
|
||||
#include "io_misc.h"
|
||||
#include "lru.h"
|
||||
#include "quota.h"
|
||||
#include "reflink.h"
|
||||
#include "snapshot.h"
|
||||
#include "subvolume.h"
|
||||
#include "xattr.h"
|
||||
|
||||
const char * const bch2_bkey_types[] = {
|
||||
#define x(name, nr, ...) #name,
|
||||
BCH_BKEY_TYPES()
|
||||
#undef x
|
||||
NULL
|
||||
};
|
||||
|
||||
static int deleted_key_validate(struct bch_fs *c, struct bkey_s_c k,
|
||||
struct bkey_validate_context from)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
#define bch2_bkey_ops_deleted ((struct bkey_ops) { \
|
||||
.key_validate = deleted_key_validate, \
|
||||
})
|
||||
|
||||
#define bch2_bkey_ops_whiteout ((struct bkey_ops) { \
|
||||
.key_validate = deleted_key_validate, \
|
||||
})
|
||||
|
||||
static int empty_val_key_validate(struct bch_fs *c, struct bkey_s_c k,
|
||||
struct bkey_validate_context from)
|
||||
{
|
||||
int ret = 0;
|
||||
|
||||
bkey_fsck_err_on(bkey_val_bytes(k.k),
|
||||
c, bkey_val_size_nonzero,
|
||||
"incorrect value size (%zu != 0)",
|
||||
bkey_val_bytes(k.k));
|
||||
fsck_err:
|
||||
return ret;
|
||||
}
|
||||
|
||||
#define bch2_bkey_ops_error ((struct bkey_ops) { \
|
||||
.key_validate = empty_val_key_validate, \
|
||||
})
|
||||
|
||||
static int key_type_cookie_validate(struct bch_fs *c, struct bkey_s_c k,
|
||||
struct bkey_validate_context from)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void key_type_cookie_to_text(struct printbuf *out, struct bch_fs *c,
|
||||
struct bkey_s_c k)
|
||||
{
|
||||
struct bkey_s_c_cookie ck = bkey_s_c_to_cookie(k);
|
||||
|
||||
prt_printf(out, "%llu", le64_to_cpu(ck.v->cookie));
|
||||
}
|
||||
|
||||
#define bch2_bkey_ops_cookie ((struct bkey_ops) { \
|
||||
.key_validate = key_type_cookie_validate, \
|
||||
.val_to_text = key_type_cookie_to_text, \
|
||||
.min_val_size = 8, \
|
||||
})
|
||||
|
||||
#define bch2_bkey_ops_hash_whiteout ((struct bkey_ops) {\
|
||||
.key_validate = empty_val_key_validate, \
|
||||
})
|
||||
|
||||
static int key_type_inline_data_validate(struct bch_fs *c, struct bkey_s_c k,
|
||||
struct bkey_validate_context from)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void key_type_inline_data_to_text(struct printbuf *out, struct bch_fs *c,
|
||||
struct bkey_s_c k)
|
||||
{
|
||||
struct bkey_s_c_inline_data d = bkey_s_c_to_inline_data(k);
|
||||
unsigned datalen = bkey_inline_data_bytes(k.k);
|
||||
|
||||
prt_printf(out, "datalen %u: %*phN",
|
||||
datalen, min(datalen, 32U), d.v->data);
|
||||
}
|
||||
|
||||
#define bch2_bkey_ops_inline_data ((struct bkey_ops) { \
|
||||
.key_validate = key_type_inline_data_validate, \
|
||||
.val_to_text = key_type_inline_data_to_text, \
|
||||
})
|
||||
|
||||
static bool key_type_set_merge(struct bch_fs *c, struct bkey_s l, struct bkey_s_c r)
|
||||
{
|
||||
bch2_key_resize(l.k, l.k->size + r.k->size);
|
||||
return true;
|
||||
}
|
||||
|
||||
#define bch2_bkey_ops_set ((struct bkey_ops) { \
|
||||
.key_validate = empty_val_key_validate, \
|
||||
.key_merge = key_type_set_merge, \
|
||||
})
|
||||
|
||||
const struct bkey_ops bch2_bkey_ops[] = {
|
||||
#define x(name, nr, ...) [KEY_TYPE_##name] = bch2_bkey_ops_##name,
|
||||
BCH_BKEY_TYPES()
|
||||
#undef x
|
||||
};
|
||||
|
||||
const struct bkey_ops bch2_bkey_null_ops = {
|
||||
};
|
||||
|
||||
int bch2_bkey_val_validate(struct bch_fs *c, struct bkey_s_c k,
|
||||
struct bkey_validate_context from)
|
||||
{
|
||||
if (test_bit(BCH_FS_no_invalid_checks, &c->flags))
|
||||
return 0;
|
||||
|
||||
const struct bkey_ops *ops = bch2_bkey_type_ops(k.k->type);
|
||||
int ret = 0;
|
||||
|
||||
bkey_fsck_err_on(bkey_val_bytes(k.k) < ops->min_val_size,
|
||||
c, bkey_val_size_too_small,
|
||||
"bad val size (%zu < %u)",
|
||||
bkey_val_bytes(k.k), ops->min_val_size);
|
||||
|
||||
if (!ops->key_validate)
|
||||
return 0;
|
||||
|
||||
ret = ops->key_validate(c, k, from);
|
||||
fsck_err:
|
||||
return ret;
|
||||
}
|
||||
|
||||
static u64 bch2_key_types_allowed[] = {
|
||||
[BKEY_TYPE_btree] =
|
||||
BIT_ULL(KEY_TYPE_deleted)|
|
||||
BIT_ULL(KEY_TYPE_btree_ptr)|
|
||||
BIT_ULL(KEY_TYPE_btree_ptr_v2),
|
||||
#define x(name, nr, flags, keys) [BKEY_TYPE_##name] = BIT_ULL(KEY_TYPE_deleted)|keys,
|
||||
BCH_BTREE_IDS()
|
||||
#undef x
|
||||
};
|
||||
|
||||
static const enum bch_bkey_type_flags bch2_bkey_type_flags[] = {
|
||||
#define x(name, nr, flags) [KEY_TYPE_##name] = flags,
|
||||
BCH_BKEY_TYPES()
|
||||
#undef x
|
||||
};
|
||||
|
||||
const char *bch2_btree_node_type_str(enum btree_node_type type)
|
||||
{
|
||||
return type == BKEY_TYPE_btree ? "internal btree node" : bch2_btree_id_str(type - 1);
|
||||
}
|
||||
|
||||
int __bch2_bkey_validate(struct bch_fs *c, struct bkey_s_c k,
|
||||
struct bkey_validate_context from)
|
||||
{
|
||||
enum btree_node_type type = __btree_node_type(from.level, from.btree);
|
||||
|
||||
if (test_bit(BCH_FS_no_invalid_checks, &c->flags))
|
||||
return 0;
|
||||
|
||||
int ret = 0;
|
||||
|
||||
bkey_fsck_err_on(k.k->u64s < BKEY_U64s,
|
||||
c, bkey_u64s_too_small,
|
||||
"u64s too small (%u < %zu)", k.k->u64s, BKEY_U64s);
|
||||
|
||||
if (type >= BKEY_TYPE_NR)
|
||||
return 0;
|
||||
|
||||
enum bch_bkey_type_flags bkey_flags = k.k->type < KEY_TYPE_MAX
|
||||
? bch2_bkey_type_flags[k.k->type]
|
||||
: 0;
|
||||
|
||||
bool strict_key_type_allowed =
|
||||
(from.flags & BCH_VALIDATE_commit) ||
|
||||
type == BKEY_TYPE_btree ||
|
||||
(from.btree < BTREE_ID_NR &&
|
||||
(bkey_flags & BKEY_TYPE_strict_btree_checks));
|
||||
|
||||
bkey_fsck_err_on(strict_key_type_allowed &&
|
||||
k.k->type < KEY_TYPE_MAX &&
|
||||
!(bch2_key_types_allowed[type] & BIT_ULL(k.k->type)),
|
||||
c, bkey_invalid_type_for_btree,
|
||||
"invalid key type for btree %s (%s)",
|
||||
bch2_btree_node_type_str(type),
|
||||
k.k->type < KEY_TYPE_MAX
|
||||
? bch2_bkey_types[k.k->type]
|
||||
: "(unknown)");
|
||||
|
||||
if (btree_node_type_is_extents(type) && !bkey_whiteout(k.k)) {
|
||||
bkey_fsck_err_on(k.k->size == 0,
|
||||
c, bkey_extent_size_zero,
|
||||
"size == 0");
|
||||
|
||||
bkey_fsck_err_on(k.k->size > k.k->p.offset,
|
||||
c, bkey_extent_size_greater_than_offset,
|
||||
"size greater than offset (%u > %llu)",
|
||||
k.k->size, k.k->p.offset);
|
||||
} else {
|
||||
bkey_fsck_err_on(k.k->size,
|
||||
c, bkey_size_nonzero,
|
||||
"size != 0");
|
||||
}
|
||||
|
||||
if (type != BKEY_TYPE_btree) {
|
||||
enum btree_id btree = type - 1;
|
||||
|
||||
if (btree_type_has_snapshots(btree)) {
|
||||
bkey_fsck_err_on(!k.k->p.snapshot,
|
||||
c, bkey_snapshot_zero,
|
||||
"snapshot == 0");
|
||||
} else if (!btree_type_has_snapshot_field(btree)) {
|
||||
bkey_fsck_err_on(k.k->p.snapshot,
|
||||
c, bkey_snapshot_nonzero,
|
||||
"nonzero snapshot");
|
||||
} else {
|
||||
/*
|
||||
* btree uses snapshot field but it's not required to be
|
||||
* nonzero
|
||||
*/
|
||||
}
|
||||
|
||||
bkey_fsck_err_on(bkey_eq(k.k->p, POS_MAX),
|
||||
c, bkey_at_pos_max,
|
||||
"key at POS_MAX");
|
||||
}
|
||||
fsck_err:
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_bkey_validate(struct bch_fs *c, struct bkey_s_c k,
|
||||
struct bkey_validate_context from)
|
||||
{
|
||||
return __bch2_bkey_validate(c, k, from) ?:
|
||||
bch2_bkey_val_validate(c, k, from);
|
||||
}
|
||||
|
||||
int bch2_bkey_in_btree_node(struct bch_fs *c, struct btree *b,
|
||||
struct bkey_s_c k,
|
||||
struct bkey_validate_context from)
|
||||
{
|
||||
int ret = 0;
|
||||
|
||||
bkey_fsck_err_on(bpos_lt(k.k->p, b->data->min_key),
|
||||
c, bkey_before_start_of_btree_node,
|
||||
"key before start of btree node");
|
||||
|
||||
bkey_fsck_err_on(bpos_gt(k.k->p, b->data->max_key),
|
||||
c, bkey_after_end_of_btree_node,
|
||||
"key past end of btree node");
|
||||
fsck_err:
|
||||
return ret;
|
||||
}
|
||||
|
||||
void bch2_bpos_to_text(struct printbuf *out, struct bpos pos)
|
||||
{
|
||||
if (bpos_eq(pos, POS_MIN))
|
||||
prt_printf(out, "POS_MIN");
|
||||
else if (bpos_eq(pos, POS_MAX))
|
||||
prt_printf(out, "POS_MAX");
|
||||
else if (bpos_eq(pos, SPOS_MAX))
|
||||
prt_printf(out, "SPOS_MAX");
|
||||
else {
|
||||
if (pos.inode == U64_MAX)
|
||||
prt_printf(out, "U64_MAX");
|
||||
else
|
||||
prt_printf(out, "%llu", pos.inode);
|
||||
prt_printf(out, ":");
|
||||
if (pos.offset == U64_MAX)
|
||||
prt_printf(out, "U64_MAX");
|
||||
else
|
||||
prt_printf(out, "%llu", pos.offset);
|
||||
prt_printf(out, ":");
|
||||
if (pos.snapshot == U32_MAX)
|
||||
prt_printf(out, "U32_MAX");
|
||||
else
|
||||
prt_printf(out, "%u", pos.snapshot);
|
||||
}
|
||||
}
|
||||
|
||||
void bch2_bkey_to_text(struct printbuf *out, const struct bkey *k)
|
||||
{
|
||||
if (k) {
|
||||
prt_printf(out, "u64s %u type ", k->u64s);
|
||||
|
||||
if (k->type < KEY_TYPE_MAX)
|
||||
prt_printf(out, "%s ", bch2_bkey_types[k->type]);
|
||||
else
|
||||
prt_printf(out, "%u ", k->type);
|
||||
|
||||
bch2_bpos_to_text(out, k->p);
|
||||
|
||||
prt_printf(out, " len %u ver %llu", k->size, k->bversion.lo);
|
||||
} else {
|
||||
prt_printf(out, "(null)");
|
||||
}
|
||||
}
|
||||
|
||||
void bch2_val_to_text(struct printbuf *out, struct bch_fs *c,
|
||||
struct bkey_s_c k)
|
||||
{
|
||||
const struct bkey_ops *ops = bch2_bkey_type_ops(k.k->type);
|
||||
|
||||
if (likely(ops->val_to_text))
|
||||
ops->val_to_text(out, c, k);
|
||||
}
|
||||
|
||||
void bch2_bkey_val_to_text(struct printbuf *out, struct bch_fs *c,
|
||||
struct bkey_s_c k)
|
||||
{
|
||||
bch2_bkey_to_text(out, k.k);
|
||||
|
||||
if (bkey_val_bytes(k.k)) {
|
||||
prt_printf(out, ": ");
|
||||
bch2_val_to_text(out, c, k);
|
||||
}
|
||||
}
|
||||
|
||||
void bch2_bkey_swab_val(struct bkey_s k)
|
||||
{
|
||||
const struct bkey_ops *ops = bch2_bkey_type_ops(k.k->type);
|
||||
|
||||
if (ops->swab)
|
||||
ops->swab(k);
|
||||
}
|
||||
|
||||
bool bch2_bkey_normalize(struct bch_fs *c, struct bkey_s k)
|
||||
{
|
||||
const struct bkey_ops *ops = bch2_bkey_type_ops(k.k->type);
|
||||
|
||||
return ops->key_normalize
|
||||
? ops->key_normalize(c, k)
|
||||
: false;
|
||||
}
|
||||
|
||||
bool bch2_bkey_merge(struct bch_fs *c, struct bkey_s l, struct bkey_s_c r)
|
||||
{
|
||||
const struct bkey_ops *ops = bch2_bkey_type_ops(l.k->type);
|
||||
|
||||
return ops->key_merge &&
|
||||
bch2_bkey_maybe_mergable(l.k, r.k) &&
|
||||
(u64) l.k->size + r.k->size <= KEY_SIZE_MAX &&
|
||||
!static_branch_unlikely(&bch2_key_merging_disabled) &&
|
||||
ops->key_merge(c, l, r);
|
||||
}
|
||||
|
||||
static const struct old_bkey_type {
|
||||
u8 btree_node_type;
|
||||
u8 old;
|
||||
u8 new;
|
||||
} bkey_renumber_table[] = {
|
||||
{BKEY_TYPE_btree, 128, KEY_TYPE_btree_ptr },
|
||||
{BKEY_TYPE_extents, 128, KEY_TYPE_extent },
|
||||
{BKEY_TYPE_extents, 129, KEY_TYPE_extent },
|
||||
{BKEY_TYPE_extents, 130, KEY_TYPE_reservation },
|
||||
{BKEY_TYPE_inodes, 128, KEY_TYPE_inode },
|
||||
{BKEY_TYPE_inodes, 130, KEY_TYPE_inode_generation },
|
||||
{BKEY_TYPE_dirents, 128, KEY_TYPE_dirent },
|
||||
{BKEY_TYPE_dirents, 129, KEY_TYPE_hash_whiteout },
|
||||
{BKEY_TYPE_xattrs, 128, KEY_TYPE_xattr },
|
||||
{BKEY_TYPE_xattrs, 129, KEY_TYPE_hash_whiteout },
|
||||
{BKEY_TYPE_alloc, 128, KEY_TYPE_alloc },
|
||||
{BKEY_TYPE_quotas, 128, KEY_TYPE_quota },
|
||||
};
|
||||
|
||||
void bch2_bkey_renumber(enum btree_node_type btree_node_type,
|
||||
struct bkey_packed *k,
|
||||
int write)
|
||||
{
|
||||
const struct old_bkey_type *i;
|
||||
|
||||
for (i = bkey_renumber_table;
|
||||
i < bkey_renumber_table + ARRAY_SIZE(bkey_renumber_table);
|
||||
i++)
|
||||
if (btree_node_type == i->btree_node_type &&
|
||||
k->type == (write ? i->new : i->old)) {
|
||||
k->type = write ? i->old : i->new;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
void __bch2_bkey_compat(unsigned level, enum btree_id btree_id,
|
||||
unsigned version, unsigned big_endian,
|
||||
int write,
|
||||
struct bkey_format *f,
|
||||
struct bkey_packed *k)
|
||||
{
|
||||
const struct bkey_ops *ops;
|
||||
struct bkey uk;
|
||||
unsigned nr_compat = 5;
|
||||
int i;
|
||||
|
||||
/*
|
||||
* Do these operations in reverse order in the write path:
|
||||
*/
|
||||
|
||||
for (i = 0; i < nr_compat; i++)
|
||||
switch (!write ? i : nr_compat - 1 - i) {
|
||||
case 0:
|
||||
if (big_endian != CPU_BIG_ENDIAN) {
|
||||
bch2_bkey_swab_key(f, k);
|
||||
} else if (IS_ENABLED(CONFIG_BCACHEFS_DEBUG)) {
|
||||
bch2_bkey_swab_key(f, k);
|
||||
bch2_bkey_swab_key(f, k);
|
||||
}
|
||||
break;
|
||||
case 1:
|
||||
if (version < bcachefs_metadata_version_bkey_renumber)
|
||||
bch2_bkey_renumber(__btree_node_type(level, btree_id), k, write);
|
||||
break;
|
||||
case 2:
|
||||
if (version < bcachefs_metadata_version_inode_btree_change &&
|
||||
btree_id == BTREE_ID_inodes) {
|
||||
if (!bkey_packed(k)) {
|
||||
struct bkey_i *u = packed_to_bkey(k);
|
||||
|
||||
swap(u->k.p.inode, u->k.p.offset);
|
||||
} else if (f->bits_per_field[BKEY_FIELD_INODE] &&
|
||||
f->bits_per_field[BKEY_FIELD_OFFSET]) {
|
||||
struct bkey_format tmp = *f, *in = f, *out = &tmp;
|
||||
|
||||
swap(tmp.bits_per_field[BKEY_FIELD_INODE],
|
||||
tmp.bits_per_field[BKEY_FIELD_OFFSET]);
|
||||
swap(tmp.field_offset[BKEY_FIELD_INODE],
|
||||
tmp.field_offset[BKEY_FIELD_OFFSET]);
|
||||
|
||||
if (!write)
|
||||
swap(in, out);
|
||||
|
||||
uk = __bch2_bkey_unpack_key(in, k);
|
||||
swap(uk.p.inode, uk.p.offset);
|
||||
BUG_ON(!bch2_bkey_pack_key(k, &uk, out));
|
||||
}
|
||||
}
|
||||
break;
|
||||
case 3:
|
||||
if (version < bcachefs_metadata_version_snapshot &&
|
||||
(level || btree_type_has_snapshots(btree_id))) {
|
||||
struct bkey_i *u = packed_to_bkey(k);
|
||||
|
||||
if (u) {
|
||||
u->k.p.snapshot = write
|
||||
? 0 : U32_MAX;
|
||||
} else {
|
||||
u64 min_packed = le64_to_cpu(f->field_offset[BKEY_FIELD_SNAPSHOT]);
|
||||
u64 max_packed = min_packed +
|
||||
~(~0ULL << f->bits_per_field[BKEY_FIELD_SNAPSHOT]);
|
||||
|
||||
uk = __bch2_bkey_unpack_key(f, k);
|
||||
uk.p.snapshot = write
|
||||
? min_packed : min_t(u64, U32_MAX, max_packed);
|
||||
|
||||
BUG_ON(!bch2_bkey_pack_key(k, &uk, f));
|
||||
}
|
||||
}
|
||||
|
||||
break;
|
||||
case 4: {
|
||||
struct bkey_s u;
|
||||
|
||||
if (!bkey_packed(k)) {
|
||||
u = bkey_i_to_s(packed_to_bkey(k));
|
||||
} else {
|
||||
uk = __bch2_bkey_unpack_key(f, k);
|
||||
u.k = &uk;
|
||||
u.v = bkeyp_val(f, k);
|
||||
}
|
||||
|
||||
if (big_endian != CPU_BIG_ENDIAN)
|
||||
bch2_bkey_swab_val(u);
|
||||
|
||||
ops = bch2_bkey_type_ops(k->type);
|
||||
|
||||
if (ops->compat)
|
||||
ops->compat(btree_id, version, big_endian, write, u);
|
||||
break;
|
||||
}
|
||||
default:
|
||||
BUG();
|
||||
}
|
||||
}
|
||||
@@ -1,139 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BKEY_METHODS_H
#define _BCACHEFS_BKEY_METHODS_H

#include "bkey.h"

struct bch_fs;
struct btree;
struct btree_trans;
struct bkey;
enum btree_node_type;

extern const char * const bch2_bkey_types[];
extern const struct bkey_ops bch2_bkey_null_ops;

/*
 * key_validate: checks validity of @k, returns 0 if good or -EINVAL if bad. If
 * invalid, entire key will be deleted.
 *
 * When invalid, error string is returned via @err. @rw indicates whether key is
 * being read or written; more aggressive checks can be enabled when rw == WRITE.
 */
struct bkey_ops {
	int		(*key_validate)(struct bch_fs *c, struct bkey_s_c k,
					struct bkey_validate_context from);
	void		(*val_to_text)(struct printbuf *, struct bch_fs *,
				       struct bkey_s_c);
	void		(*swab)(struct bkey_s);
	bool		(*key_normalize)(struct bch_fs *, struct bkey_s);
	bool		(*key_merge)(struct bch_fs *, struct bkey_s, struct bkey_s_c);
	int		(*trigger)(struct btree_trans *, enum btree_id, unsigned,
				   struct bkey_s_c, struct bkey_s,
				   enum btree_iter_update_trigger_flags);
	void		(*compat)(enum btree_id id, unsigned version,
				  unsigned big_endian, int write,
				  struct bkey_s);

	/* Size of value type when first created: */
	unsigned	min_val_size;
};

extern const struct bkey_ops bch2_bkey_ops[];

static inline const struct bkey_ops *bch2_bkey_type_ops(enum bch_bkey_type type)
{
	return likely(type < KEY_TYPE_MAX)
		? &bch2_bkey_ops[type]
		: &bch2_bkey_null_ops;
}

int bch2_bkey_val_validate(struct bch_fs *, struct bkey_s_c,
			   struct bkey_validate_context);
int __bch2_bkey_validate(struct bch_fs *, struct bkey_s_c,
			 struct bkey_validate_context);
int bch2_bkey_validate(struct bch_fs *, struct bkey_s_c,
		       struct bkey_validate_context);
int bch2_bkey_in_btree_node(struct bch_fs *, struct btree *, struct bkey_s_c,
			    struct bkey_validate_context from);

void bch2_bpos_to_text(struct printbuf *, struct bpos);
void bch2_bkey_to_text(struct printbuf *, const struct bkey *);
void bch2_val_to_text(struct printbuf *, struct bch_fs *,
		      struct bkey_s_c);
void bch2_bkey_val_to_text(struct printbuf *, struct bch_fs *,
			   struct bkey_s_c);

void bch2_bkey_swab_val(struct bkey_s);

bool bch2_bkey_normalize(struct bch_fs *, struct bkey_s);

static inline bool bch2_bkey_maybe_mergable(const struct bkey *l, const struct bkey *r)
{
	return l->type == r->type &&
		!bversion_cmp(l->bversion, r->bversion) &&
		bpos_eq(l->p, bkey_start_pos(r));
}

bool bch2_bkey_merge(struct bch_fs *, struct bkey_s, struct bkey_s_c);

static inline int bch2_key_trigger(struct btree_trans *trans,
				   enum btree_id btree, unsigned level,
				   struct bkey_s_c old, struct bkey_s new,
				   enum btree_iter_update_trigger_flags flags)
{
	const struct bkey_ops *ops = bch2_bkey_type_ops(old.k->type ?: new.k->type);

	return ops->trigger
		? ops->trigger(trans, btree, level, old, new, flags)
		: 0;
}

static inline int bch2_key_trigger_old(struct btree_trans *trans,
				       enum btree_id btree_id, unsigned level,
				       struct bkey_s_c old,
				       enum btree_iter_update_trigger_flags flags)
{
	struct bkey_i deleted;

	bkey_init(&deleted.k);
	deleted.k.p = old.k->p;

	return bch2_key_trigger(trans, btree_id, level, old, bkey_i_to_s(&deleted),
				BTREE_TRIGGER_overwrite|flags);
}

static inline int bch2_key_trigger_new(struct btree_trans *trans,
				       enum btree_id btree_id, unsigned level,
				       struct bkey_s new,
				       enum btree_iter_update_trigger_flags flags)
{
	struct bkey_i deleted;

	bkey_init(&deleted.k);
	deleted.k.p = new.k->p;

	return bch2_key_trigger(trans, btree_id, level, bkey_i_to_s_c(&deleted), new,
				BTREE_TRIGGER_insert|flags);
}

void bch2_bkey_renumber(enum btree_node_type, struct bkey_packed *, int);

void __bch2_bkey_compat(unsigned, enum btree_id, unsigned, unsigned,
			int, struct bkey_format *, struct bkey_packed *);

static inline void bch2_bkey_compat(unsigned level, enum btree_id btree_id,
				    unsigned version, unsigned big_endian,
				    int write,
				    struct bkey_format *f,
				    struct bkey_packed *k)
{
	if (version < bcachefs_metadata_version_current ||
	    big_endian != CPU_BIG_ENDIAN ||
	    IS_ENABLED(CONFIG_BCACHEFS_DEBUG))
		__bch2_bkey_compat(level, btree_id, version,
				   big_endian, write, f, k);
}

#endif /* _BCACHEFS_BKEY_METHODS_H */
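The ops table is easiest to see with a concrete entry. The following sketch shows how a simple key type would wire itself up, modeled on the bch2_bkey_ops_cookie definition that appears in bkey_methods.c further down; the "example" type and its helpers are illustrative stand-ins, not additional bcachefs code.

/* Hypothetical key type with an 8-byte value and a printable representation. */
static int example_key_validate(struct bch_fs *c, struct bkey_s_c k,
				struct bkey_validate_context from)
{
	return 0;	/* nothing beyond the generic min_val_size check */
}

static void example_val_to_text(struct printbuf *out, struct bch_fs *c,
				struct bkey_s_c k)
{
	prt_printf(out, "example value, %zu bytes", bkey_val_bytes(k.k));
}

#define bch2_bkey_ops_example ((struct bkey_ops) {	\
	.key_validate	= example_key_validate,		\
	.val_to_text	= example_val_to_text,		\
	.min_val_size	= 8,				\
})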
@@ -1,214 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
#include "bcachefs.h"
|
||||
#include "bkey_buf.h"
|
||||
#include "bkey_cmp.h"
|
||||
#include "bkey_sort.h"
|
||||
#include "bset.h"
|
||||
#include "extents.h"
|
||||
|
||||
typedef int (*sort_cmp_fn)(const struct btree *,
|
||||
const struct bkey_packed *,
|
||||
const struct bkey_packed *);
|
||||
|
||||
static inline bool sort_iter_end(struct sort_iter *iter)
|
||||
{
|
||||
return !iter->used;
|
||||
}
|
||||
|
||||
static inline void sort_iter_sift(struct sort_iter *iter, unsigned from,
|
||||
sort_cmp_fn cmp)
|
||||
{
|
||||
unsigned i;
|
||||
|
||||
for (i = from;
|
||||
i + 1 < iter->used &&
|
||||
cmp(iter->b, iter->data[i].k, iter->data[i + 1].k) > 0;
|
||||
i++)
|
||||
swap(iter->data[i], iter->data[i + 1]);
|
||||
}
|
||||
|
||||
static inline void sort_iter_sort(struct sort_iter *iter, sort_cmp_fn cmp)
|
||||
{
|
||||
unsigned i = iter->used;
|
||||
|
||||
while (i--)
|
||||
sort_iter_sift(iter, i, cmp);
|
||||
}
|
||||
|
||||
static inline struct bkey_packed *sort_iter_peek(struct sort_iter *iter)
|
||||
{
|
||||
return !sort_iter_end(iter) ? iter->data->k : NULL;
|
||||
}
|
||||
|
||||
static inline void sort_iter_advance(struct sort_iter *iter, sort_cmp_fn cmp)
|
||||
{
|
||||
struct sort_iter_set *i = iter->data;
|
||||
|
||||
BUG_ON(!iter->used);
|
||||
|
||||
i->k = bkey_p_next(i->k);
|
||||
|
||||
BUG_ON(i->k > i->end);
|
||||
|
||||
if (i->k == i->end)
|
||||
array_remove_item(iter->data, iter->used, 0);
|
||||
else
|
||||
sort_iter_sift(iter, 0, cmp);
|
||||
}
|
||||
|
||||
static inline struct bkey_packed *sort_iter_next(struct sort_iter *iter,
|
||||
sort_cmp_fn cmp)
|
||||
{
|
||||
struct bkey_packed *ret = sort_iter_peek(iter);
|
||||
|
||||
if (ret)
|
||||
sort_iter_advance(iter, cmp);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* If keys compare equal, compare by pointer order:
|
||||
*/
|
||||
static inline int key_sort_fix_overlapping_cmp(const struct btree *b,
|
||||
const struct bkey_packed *l,
|
||||
const struct bkey_packed *r)
|
||||
{
|
||||
return bch2_bkey_cmp_packed(b, l, r) ?:
|
||||
cmp_int((unsigned long) l, (unsigned long) r);
|
||||
}
|
||||
|
||||
static inline bool should_drop_next_key(struct sort_iter *iter)
|
||||
{
|
||||
/*
|
||||
* key_sort_cmp() ensures that when keys compare equal the older key
|
||||
* comes first; so if l->k compares equal to r->k then l->k is older
|
||||
* and should be dropped.
|
||||
*/
|
||||
return iter->used >= 2 &&
|
||||
!bch2_bkey_cmp_packed(iter->b,
|
||||
iter->data[0].k,
|
||||
iter->data[1].k);
|
||||
}
|
||||
|
||||
struct btree_nr_keys
|
||||
bch2_key_sort_fix_overlapping(struct bch_fs *c, struct bset *dst,
|
||||
struct sort_iter *iter)
|
||||
{
|
||||
struct bkey_packed *out = dst->start;
|
||||
struct bkey_packed *k;
|
||||
struct btree_nr_keys nr;
|
||||
|
||||
memset(&nr, 0, sizeof(nr));
|
||||
|
||||
sort_iter_sort(iter, key_sort_fix_overlapping_cmp);
|
||||
|
||||
while ((k = sort_iter_peek(iter))) {
|
||||
if (!bkey_deleted(k) &&
|
||||
!should_drop_next_key(iter)) {
|
||||
bkey_p_copy(out, k);
|
||||
btree_keys_account_key_add(&nr, 0, out);
|
||||
out = bkey_p_next(out);
|
||||
}
|
||||
|
||||
sort_iter_advance(iter, key_sort_fix_overlapping_cmp);
|
||||
}
|
||||
|
||||
dst->u64s = cpu_to_le16((u64 *) out - dst->_data);
|
||||
return nr;
|
||||
}
|
||||
|
||||
/* Sort + repack in a new format: */
|
||||
struct btree_nr_keys
|
||||
bch2_sort_repack(struct bset *dst, struct btree *src,
|
||||
struct btree_node_iter *src_iter,
|
||||
struct bkey_format *out_f,
|
||||
bool filter_whiteouts)
|
||||
{
|
||||
struct bkey_format *in_f = &src->format;
|
||||
struct bkey_packed *in, *out = vstruct_last(dst);
|
||||
struct btree_nr_keys nr;
|
||||
bool transform = memcmp(out_f, &src->format, sizeof(*out_f));
|
||||
|
||||
memset(&nr, 0, sizeof(nr));
|
||||
|
||||
while ((in = bch2_btree_node_iter_next_all(src_iter, src))) {
|
||||
if (filter_whiteouts && bkey_deleted(in))
|
||||
continue;
|
||||
|
||||
if (!transform)
|
||||
bkey_p_copy(out, in);
|
||||
else if (bch2_bkey_transform(out_f, out, bkey_packed(in)
|
||||
? in_f : &bch2_bkey_format_current, in))
|
||||
out->format = KEY_FORMAT_LOCAL_BTREE;
|
||||
else
|
||||
bch2_bkey_unpack(src, (void *) out, in);
|
||||
|
||||
out->needs_whiteout = false;
|
||||
|
||||
btree_keys_account_key_add(&nr, 0, out);
|
||||
out = bkey_p_next(out);
|
||||
}
|
||||
|
||||
dst->u64s = cpu_to_le16((u64 *) out - dst->_data);
|
||||
return nr;
|
||||
}
|
||||
|
||||
static inline int keep_unwritten_whiteouts_cmp(const struct btree *b,
|
||||
const struct bkey_packed *l,
|
||||
const struct bkey_packed *r)
|
||||
{
|
||||
return bch2_bkey_cmp_packed_inlined(b, l, r) ?:
|
||||
(int) bkey_deleted(r) - (int) bkey_deleted(l) ?:
|
||||
(long) l - (long) r;
|
||||
}
|
||||
|
||||
#include "btree_update_interior.h"
|
||||
|
||||
/*
|
||||
* For sorting in the btree node write path: whiteouts not in the unwritten
|
||||
* whiteouts area are dropped, whiteouts in the unwritten whiteouts area are
|
||||
* dropped if overwritten by real keys:
|
||||
*/
|
||||
unsigned bch2_sort_keys_keep_unwritten_whiteouts(struct bkey_packed *dst, struct sort_iter *iter)
|
||||
{
|
||||
struct bkey_packed *in, *next, *out = dst;
|
||||
|
||||
sort_iter_sort(iter, keep_unwritten_whiteouts_cmp);
|
||||
|
||||
while ((in = sort_iter_next(iter, keep_unwritten_whiteouts_cmp))) {
|
||||
if (bkey_deleted(in) && in < unwritten_whiteouts_start(iter->b))
|
||||
continue;
|
||||
|
||||
if ((next = sort_iter_peek(iter)) &&
|
||||
!bch2_bkey_cmp_packed_inlined(iter->b, in, next))
|
||||
continue;
|
||||
|
||||
bkey_p_copy(out, in);
|
||||
out = bkey_p_next(out);
|
||||
}
|
||||
|
||||
return (u64 *) out - (u64 *) dst;
|
||||
}
|
||||
|
||||
/*
|
||||
* Main sort routine for compacting a btree node in memory: we always drop
|
||||
* whiteouts because any whiteouts that need to be written are in the unwritten
|
||||
* whiteouts area:
|
||||
*/
|
||||
unsigned bch2_sort_keys(struct bkey_packed *dst, struct sort_iter *iter)
|
||||
{
|
||||
struct bkey_packed *in, *out = dst;
|
||||
|
||||
sort_iter_sort(iter, bch2_bkey_cmp_packed_inlined);
|
||||
|
||||
while ((in = sort_iter_next(iter, bch2_bkey_cmp_packed_inlined))) {
|
||||
if (bkey_deleted(in))
|
||||
continue;
|
||||
|
||||
bkey_p_copy(out, in);
|
||||
out = bkey_p_next(out);
|
||||
}
|
||||
|
||||
return (u64 *) out - (u64 *) dst;
|
||||
}
|
||||
@@ -1,54 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BKEY_SORT_H
|
||||
#define _BCACHEFS_BKEY_SORT_H
|
||||
|
||||
struct sort_iter {
|
||||
struct btree *b;
|
||||
unsigned used;
|
||||
unsigned size;
|
||||
|
||||
struct sort_iter_set {
|
||||
struct bkey_packed *k, *end;
|
||||
} data[];
|
||||
};
|
||||
|
||||
static inline void sort_iter_init(struct sort_iter *iter, struct btree *b, unsigned size)
|
||||
{
|
||||
iter->b = b;
|
||||
iter->used = 0;
|
||||
iter->size = size;
|
||||
}
|
||||
|
||||
struct sort_iter_stack {
|
||||
struct sort_iter iter;
|
||||
struct sort_iter_set sets[MAX_BSETS + 1];
|
||||
};
|
||||
|
||||
static inline void sort_iter_stack_init(struct sort_iter_stack *iter, struct btree *b)
|
||||
{
|
||||
sort_iter_init(&iter->iter, b, ARRAY_SIZE(iter->sets));
|
||||
}
|
||||
|
||||
static inline void sort_iter_add(struct sort_iter *iter,
|
||||
struct bkey_packed *k,
|
||||
struct bkey_packed *end)
|
||||
{
|
||||
BUG_ON(iter->used >= iter->size);
|
||||
|
||||
if (k != end)
|
||||
iter->data[iter->used++] = (struct sort_iter_set) { k, end };
|
||||
}
|
||||
|
||||
struct btree_nr_keys
|
||||
bch2_key_sort_fix_overlapping(struct bch_fs *, struct bset *,
|
||||
struct sort_iter *);
|
||||
|
||||
struct btree_nr_keys
|
||||
bch2_sort_repack(struct bset *, struct btree *,
|
||||
struct btree_node_iter *,
|
||||
struct bkey_format *, bool);
|
||||
|
||||
unsigned bch2_sort_keys_keep_unwritten_whiteouts(struct bkey_packed *, struct sort_iter *);
|
||||
unsigned bch2_sort_keys(struct bkey_packed *, struct sort_iter *);
|
||||
|
||||
#endif /* _BCACHEFS_BKEY_SORT_H */
|
||||
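A short, hypothetical kernel-side fragment of how a caller feeds bsets into this iterator and merges them during compaction. Only the declarations above are real; for_each_bset(), btree_bkey_first(), btree_bkey_last(), and the variables b, t and dst are assumed helpers/context from the bset code, named here for illustration.

	struct sort_iter_stack sort;
	struct bset_tree *t;
	unsigned u64s;

	sort_iter_stack_init(&sort, b);

	/* one (start, end) range per bset to be merged, oldest first */
	for_each_bset(b, t)
		sort_iter_add(&sort.iter,
			      btree_bkey_first(b, t),
			      btree_bkey_last(b, t));

	/* merge-sort into dst, dropping whiteouts; returns u64s written */
	u64s = bch2_sort_keys(dst, &sort.iter);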
@@ -1,241 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BKEY_TYPES_H
|
||||
#define _BCACHEFS_BKEY_TYPES_H
|
||||
|
||||
#include "bcachefs_format.h"
|
||||
|
||||
/*
|
||||
* bkey_i - bkey with inline value
|
||||
* bkey_s - bkey with split value
|
||||
* bkey_s_c - bkey with split value, const
|
||||
*/
|
||||
|
||||
#define bkey_p_next(_k) vstruct_next(_k)
|
||||
|
||||
static inline struct bkey_i *bkey_next(struct bkey_i *k)
|
||||
{
|
||||
return (struct bkey_i *) ((u64 *) k->_data + k->k.u64s);
|
||||
}
|
||||
|
||||
#define bkey_val_u64s(_k) ((_k)->u64s - BKEY_U64s)
|
||||
|
||||
static inline size_t bkey_val_bytes(const struct bkey *k)
|
||||
{
|
||||
return bkey_val_u64s(k) * sizeof(u64);
|
||||
}
|
||||
|
||||
static inline void set_bkey_val_u64s(struct bkey *k, unsigned val_u64s)
|
||||
{
|
||||
unsigned u64s = BKEY_U64s + val_u64s;
|
||||
|
||||
BUG_ON(u64s > U8_MAX);
|
||||
k->u64s = u64s;
|
||||
}
|
||||
|
||||
static inline void set_bkey_val_bytes(struct bkey *k, unsigned bytes)
|
||||
{
|
||||
set_bkey_val_u64s(k, DIV_ROUND_UP(bytes, sizeof(u64)));
|
||||
}
|
||||
|
||||
#define bkey_val_end(_k) ((void *) (((u64 *) (_k).v) + bkey_val_u64s((_k).k)))
|
||||
|
||||
#define bkey_deleted(_k) ((_k)->type == KEY_TYPE_deleted)
|
||||
|
||||
#define bkey_whiteout(_k) \
|
||||
((_k)->type == KEY_TYPE_deleted || (_k)->type == KEY_TYPE_whiteout)
|
||||
|
||||
/* bkey with split value, const */
|
||||
struct bkey_s_c {
|
||||
const struct bkey *k;
|
||||
const struct bch_val *v;
|
||||
};
|
||||
|
||||
/* bkey with split value */
|
||||
struct bkey_s {
|
||||
union {
|
||||
struct {
|
||||
struct bkey *k;
|
||||
struct bch_val *v;
|
||||
};
|
||||
struct bkey_s_c s_c;
|
||||
};
|
||||
};
|
||||
|
||||
#define bkey_s_null ((struct bkey_s) { .k = NULL })
|
||||
#define bkey_s_c_null ((struct bkey_s_c) { .k = NULL })
|
||||
|
||||
#define bkey_s_err(err) ((struct bkey_s) { .k = ERR_PTR(err) })
|
||||
#define bkey_s_c_err(err) ((struct bkey_s_c) { .k = ERR_PTR(err) })
|
||||
|
||||
static inline struct bkey_s bkey_to_s(struct bkey *k)
|
||||
{
|
||||
return (struct bkey_s) { .k = k, .v = NULL };
|
||||
}
|
||||
|
||||
static inline struct bkey_s_c bkey_to_s_c(const struct bkey *k)
|
||||
{
|
||||
return (struct bkey_s_c) { .k = k, .v = NULL };
|
||||
}
|
||||
|
||||
static inline struct bkey_s bkey_i_to_s(struct bkey_i *k)
|
||||
{
|
||||
return (struct bkey_s) { .k = &k->k, .v = &k->v };
|
||||
}
|
||||
|
||||
static inline struct bkey_s_c bkey_i_to_s_c(const struct bkey_i *k)
|
||||
{
|
||||
return (struct bkey_s_c) { .k = &k->k, .v = &k->v };
|
||||
}
|
||||
|
||||
/*
|
||||
* For a given type of value (e.g. struct bch_extent), generates the types for
|
||||
* bkey + bch_extent - inline, split, split const - and also all the conversion
|
||||
* functions, which also check that the value is of the correct type.
|
||||
*
|
||||
* We use anonymous unions for upcasting - e.g. converting from e.g. a
|
||||
* bkey_i_extent to a bkey_i - since that's always safe, instead of conversion
|
||||
* functions.
|
||||
*/
|
||||
#define x(name, ...) \
|
||||
struct bkey_i_##name { \
|
||||
union { \
|
||||
struct bkey k; \
|
||||
struct bkey_i k_i; \
|
||||
}; \
|
||||
struct bch_##name v; \
|
||||
}; \
|
||||
\
|
||||
struct bkey_s_c_##name { \
|
||||
union { \
|
||||
struct { \
|
||||
const struct bkey *k; \
|
||||
const struct bch_##name *v; \
|
||||
}; \
|
||||
struct bkey_s_c s_c; \
|
||||
}; \
|
||||
}; \
|
||||
\
|
||||
struct bkey_s_##name { \
|
||||
union { \
|
||||
struct { \
|
||||
struct bkey *k; \
|
||||
struct bch_##name *v; \
|
||||
}; \
|
||||
struct bkey_s_c_##name c; \
|
||||
struct bkey_s s; \
|
||||
struct bkey_s_c s_c; \
|
||||
}; \
|
||||
}; \
|
||||
\
|
||||
static inline struct bkey_i_##name *bkey_i_to_##name(struct bkey_i *k) \
|
||||
{ \
|
||||
EBUG_ON(!IS_ERR_OR_NULL(k) && k->k.type != KEY_TYPE_##name); \
|
||||
return container_of(&k->k, struct bkey_i_##name, k); \
|
||||
} \
|
||||
\
|
||||
static inline const struct bkey_i_##name * \
|
||||
bkey_i_to_##name##_c(const struct bkey_i *k) \
|
||||
{ \
|
||||
EBUG_ON(!IS_ERR_OR_NULL(k) && k->k.type != KEY_TYPE_##name); \
|
||||
return container_of(&k->k, struct bkey_i_##name, k); \
|
||||
} \
|
||||
\
|
||||
static inline struct bkey_s_##name bkey_s_to_##name(struct bkey_s k) \
|
||||
{ \
|
||||
EBUG_ON(!IS_ERR_OR_NULL(k.k) && k.k->type != KEY_TYPE_##name); \
|
||||
return (struct bkey_s_##name) { \
|
||||
.k = k.k, \
|
||||
.v = container_of(k.v, struct bch_##name, v), \
|
||||
}; \
|
||||
} \
|
||||
\
|
||||
static inline struct bkey_s_c_##name bkey_s_c_to_##name(struct bkey_s_c k)\
|
||||
{ \
|
||||
EBUG_ON(!IS_ERR_OR_NULL(k.k) && k.k->type != KEY_TYPE_##name); \
|
||||
return (struct bkey_s_c_##name) { \
|
||||
.k = k.k, \
|
||||
.v = container_of(k.v, struct bch_##name, v), \
|
||||
}; \
|
||||
} \
|
||||
\
|
||||
static inline struct bkey_s_##name name##_i_to_s(struct bkey_i_##name *k)\
|
||||
{ \
|
||||
return (struct bkey_s_##name) { \
|
||||
.k = &k->k, \
|
||||
.v = &k->v, \
|
||||
}; \
|
||||
} \
|
||||
\
|
||||
static inline struct bkey_s_c_##name \
|
||||
name##_i_to_s_c(const struct bkey_i_##name *k) \
|
||||
{ \
|
||||
return (struct bkey_s_c_##name) { \
|
||||
.k = &k->k, \
|
||||
.v = &k->v, \
|
||||
}; \
|
||||
} \
|
||||
\
|
||||
static inline struct bkey_s_##name bkey_i_to_s_##name(struct bkey_i *k) \
|
||||
{ \
|
||||
EBUG_ON(!IS_ERR_OR_NULL(k) && k->k.type != KEY_TYPE_##name); \
|
||||
return (struct bkey_s_##name) { \
|
||||
.k = &k->k, \
|
||||
.v = container_of(&k->v, struct bch_##name, v), \
|
||||
}; \
|
||||
} \
|
||||
\
|
||||
static inline struct bkey_s_c_##name \
|
||||
bkey_i_to_s_c_##name(const struct bkey_i *k) \
|
||||
{ \
|
||||
EBUG_ON(!IS_ERR_OR_NULL(k) && k->k.type != KEY_TYPE_##name); \
|
||||
return (struct bkey_s_c_##name) { \
|
||||
.k = &k->k, \
|
||||
.v = container_of(&k->v, struct bch_##name, v), \
|
||||
}; \
|
||||
} \
|
||||
\
|
||||
static inline struct bkey_i_##name *bkey_##name##_init(struct bkey_i *_k)\
|
||||
{ \
|
||||
struct bkey_i_##name *k = \
|
||||
container_of(&_k->k, struct bkey_i_##name, k); \
|
||||
\
|
||||
bkey_init(&k->k); \
|
||||
memset(&k->v, 0, sizeof(k->v)); \
|
||||
k->k.type = KEY_TYPE_##name; \
|
||||
set_bkey_val_bytes(&k->k, sizeof(k->v)); \
|
||||
\
|
||||
return k; \
|
||||
}
|
||||
|
||||
BCH_BKEY_TYPES();
|
||||
#undef x
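
The comment above notes that upcasting (e.g. a bkey_i_extent to a bkey_i) goes through anonymous unions rather than conversion functions. A minimal standalone model of that layout trick, using made-up toy types rather than the removed kernel definitions, might look like this:

/* Toy model of the anonymous-union upcast; not the removed kernel code. */
#include <stdio.h>

struct bkey    { int type; };
struct bkey_i  { struct bkey k; };
struct bch_foo { int count; };

/* Roughly what the x-macro would generate for a hypothetical "foo" key type: */
struct bkey_i_foo {
	union {
		struct bkey   k;	/* header-only view */
		struct bkey_i k_i;	/* generic key view */
	};
	struct bch_foo v;
};

static void print_type(const struct bkey_i *k)
{
	printf("type=%d\n", k->k.type);
}

int main(void)
{
	struct bkey_i_foo foo = { .k = { .type = 7 }, .v = { .count = 1 } };

	/* "Upcasting" is just taking the union member's address - always safe. */
	print_type(&foo.k_i);
	return 0;
}
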
|
||||
|
||||
enum bch_validate_flags {
|
||||
BCH_VALIDATE_write = BIT(0),
|
||||
BCH_VALIDATE_commit = BIT(1),
|
||||
BCH_VALIDATE_silent = BIT(2),
|
||||
};
|
||||
|
||||
#define BKEY_VALIDATE_CONTEXTS() \
|
||||
x(unknown) \
|
||||
x(superblock) \
|
||||
x(journal) \
|
||||
x(btree_root) \
|
||||
x(btree_node) \
|
||||
x(commit)
|
||||
|
||||
struct bkey_validate_context {
|
||||
enum {
|
||||
#define x(n) BKEY_VALIDATE_##n,
|
||||
BKEY_VALIDATE_CONTEXTS()
|
||||
#undef x
|
||||
} from:8;
|
||||
enum bch_validate_flags flags:8;
|
||||
u8 level;
|
||||
enum btree_id btree;
|
||||
bool root:1;
|
||||
unsigned journal_offset;
|
||||
u64 journal_seq;
|
||||
};
|
||||
|
||||
#endif /* _BCACHEFS_BKEY_TYPES_H */
|
||||
fs/bcachefs/bset.c (1576 lines; file diff suppressed because it is too large)
@@ -1,536 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BSET_H
|
||||
#define _BCACHEFS_BSET_H
|
||||
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/types.h>
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "bkey.h"
|
||||
#include "bkey_methods.h"
|
||||
#include "btree_types.h"
|
||||
#include "util.h" /* for time_stats */
|
||||
#include "vstructs.h"
|
||||
|
||||
/*
|
||||
* BKEYS:
|
||||
*
|
||||
* A bkey contains a key, a size field, a variable number of pointers, and some
|
||||
* ancillary flag bits.
|
||||
*
|
||||
* We use two different functions for validating bkeys, bkey_invalid and
|
||||
* bkey_deleted().
|
||||
*
|
||||
* The one exception to the rule that ptr_invalid() filters out invalid keys is
|
||||
* that it also filters out keys of size 0 - these are keys that have been
|
||||
* completely overwritten. It'd be safe to delete these in memory while leaving
|
||||
* them on disk, just unnecessary work - so we filter them out when resorting
|
||||
* instead.
|
||||
*
|
||||
* We can't filter out stale keys when we're resorting, because garbage
|
||||
* collection needs to find them to ensure bucket gens don't wrap around -
|
||||
* unless we're rewriting the btree node those stale keys still exist on disk.
|
||||
*
|
||||
* We also implement functions here for removing some number of sectors from the
|
||||
* front or the back of a bkey - this is mainly used for fixing overlapping
|
||||
* extents, by removing the overlapping sectors from the older key.
|
||||
*
|
||||
* BSETS:
|
||||
*
|
||||
* A bset is an array of bkeys laid out contiguously in memory in sorted order,
|
||||
* along with a header. A btree node is made up of a number of these, written at
|
||||
* different times.
|
||||
*
|
||||
* There could be many of them on disk, but we never allow there to be more than
|
||||
* 4 in memory - we lazily resort as needed.
|
||||
*
|
||||
* We implement code here for creating and maintaining auxiliary search trees
|
||||
* (described below) for searching an individual bset, and on top of that we
|
||||
* implement a btree iterator.
|
||||
*
|
||||
* BTREE ITERATOR:
|
||||
*
|
||||
* Most of the code in bcache doesn't care about an individual bset - it needs
|
||||
* to search entire btree nodes and iterate over them in sorted order.
|
||||
*
|
||||
* The btree iterator code serves both functions; it iterates through the keys
|
||||
* in a btree node in sorted order, starting from either keys after a specific
|
||||
* point (if you pass it a search key) or the start of the btree node.
|
||||
*
|
||||
* AUXILIARY SEARCH TREES:
|
||||
*
|
||||
* Since keys are variable length, we can't use a binary search on a bset - we
|
||||
* wouldn't be able to find the start of the next key. But binary searches are
|
||||
* slow anyways, due to terrible cache behaviour; bcache originally used binary
|
||||
* searches and that code topped out at under 50k lookups/second.
|
||||
*
|
||||
* So we need to construct some sort of lookup table. Since we only insert keys
|
||||
* into the last (unwritten) set, most of the keys within a given btree node are
|
||||
* usually in sets that are mostly constant. We use two different types of
|
||||
* lookup tables to take advantage of this.
|
||||
*
|
||||
* Both lookup tables share in common that they don't index every key in the
|
||||
* set; they index one key every BSET_CACHELINE bytes, and then a linear search
|
||||
* is used for the rest.
|
||||
*
|
||||
* For sets that have been written to disk and are no longer being inserted
|
||||
* into, we construct a binary search tree in an array - traversing a binary
|
||||
* search tree in an array gives excellent locality of reference and is very
|
||||
* fast, since both children of any node are adjacent to each other in memory
|
||||
* (and their grandchildren, and great grandchildren...) - this means
|
||||
* prefetching can be used to great effect.
|
||||
*
|
||||
* It's quite useful performance wise to keep these nodes small - not just
|
||||
* because they're more likely to be in L2, but also because we can prefetch
|
||||
* more nodes on a single cacheline and thus prefetch more iterations in advance
|
||||
* when traversing this tree.
|
||||
*
|
||||
* Nodes in the auxiliary search tree must contain both a key to compare against
|
||||
* (we don't want to fetch the key from the set, that would defeat the purpose),
|
||||
* and a pointer to the key. We use a few tricks to compress both of these.
|
||||
*
|
||||
* To compress the pointer, we take advantage of the fact that one node in the
|
||||
* search tree corresponds to precisely BSET_CACHELINE bytes in the set. We have
|
||||
* a function (to_inorder()) that takes the index of a node in a binary tree and
|
||||
* returns what its index would be in an inorder traversal, so we only have to
|
||||
* store the low bits of the offset.
|
||||
*
|
||||
* The key is 84 bits (KEY_DEV + key->key, the offset on the device). To
|
||||
* compress that, we take advantage of the fact that when we're traversing the
|
||||
* search tree at every iteration we know that both our search key and the key
|
||||
* we're looking for lie within some range - bounded by our previous
|
||||
* comparisons. (We special case the start of a search so that this is true even
|
||||
* at the root of the tree).
|
||||
*
|
||||
* So if we know the key we're looking for is between a and b, and a and b don't
* differ higher than bit 50, then we don't need to check anything higher than
* bit 50.
|
||||
*
|
||||
* We don't usually need the rest of the bits, either; we only need enough bits
|
||||
* to partition the key range we're currently checking. Consider key n - the
|
||||
* key our auxiliary search tree node corresponds to, and key p, the key
|
||||
* immediately preceding n. The lowest bit we need to store in the auxiliary
|
||||
* search tree is the highest bit that differs between n and p.
|
||||
*
|
||||
* Note that this could be bit 0 - we might sometimes need all 80 bits to do the
|
||||
* comparison. But we'd really like our nodes in the auxiliary search tree to be
|
||||
* of fixed size.
|
||||
*
|
||||
* The solution is to make them fixed size, and when we're constructing a node
|
||||
* check if p and n differed in the bits we needed them to. If they don't we
|
||||
* flag that node, and when doing lookups we fallback to comparing against the
|
||||
* real key. As long as this doesn't happen too often (and it seems to reliably
|
||||
* happen a bit less than 1% of the time), we win - even on failures, that key
|
||||
* is then more likely to be in cache than if we were doing binary searches all
|
||||
* the way, since we're touching so much less memory.
|
||||
*
|
||||
* The keys in the auxiliary search tree are stored in (software) floating
|
||||
* point, with an exponent and a mantissa. The exponent needs to be big enough
|
||||
* to address all the bits in the original key, but the number of bits in the
|
||||
* mantissa is somewhat arbitrary; more bits just gets us fewer failures.
|
||||
*
|
||||
* We need 7 bits for the exponent and 3 bits for the key's offset (since keys
|
||||
* are 8 byte aligned); using 22 bits for the mantissa means a node is 4 bytes.
|
||||
* We need one node per 128 bytes in the btree node, which means the auxiliary
|
||||
* search trees take up 3% as much memory as the btree itself.
|
||||
*
|
||||
* Constructing these auxiliary search trees is moderately expensive, and we
|
||||
* don't want to be constantly rebuilding the search tree for the last set
|
||||
* whenever we insert another key into it. For the unwritten set, we use a much
|
||||
* simpler lookup table - it's just a flat array, so index i in the lookup table
|
||||
* corresponds to the i'th range of BSET_CACHELINE bytes in the set. Indexing
|
||||
* within each byte range works the same as with the auxiliary search trees.
|
||||
*
|
||||
* These are much easier to keep up to date when we insert a key - we do it
|
||||
* somewhat lazily; when we shift a key up we usually just increment the pointer
|
||||
* to it, only when it would overflow do we go to the trouble of finding the
|
||||
* first key in that range of bytes again.
|
||||
*/
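
The comment above describes packing the lookup tree into an array in breadth-first order so that children (and grandchildren) sit next to each other in memory. As a rough standalone illustration of that idea - plain ints and the classic branchless walk, with assumed names; the removed code instead packs compressed bkey_float nodes and falls back to the real key on precision loss:

/* Standalone sketch of a BFS ("Eytzinger") array binary search; illustrative only. */
#include <stdio.h>
#include <strings.h>	/* ffs() */

/* 1-based BFS layout of the sorted values {1, 3, 5, 7, 9, 11, 13}: */
static const int tree[] = { 0, 7, 3, 11, 1, 5, 9, 13 };
#define TREE_NR 7

/* Returns the 1-based index of the first element >= key, or 0 if none. */
static unsigned eytzinger_lower_bound(int key)
{
	unsigned i = 1;

	while (i <= TREE_NR)
		i = 2 * i + (tree[i] < key);	/* children of i are 2i and 2i + 1 */

	return i >> ffs(~i);			/* undo the trailing "right" turns */
}

int main(void)
{
	unsigned i = eytzinger_lower_bound(6);

	if (i)
		printf("first element >= 6 is %d\n", tree[i]);	/* prints 7 */
	return 0;
}
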
|
||||
|
||||
enum bset_aux_tree_type {
|
||||
BSET_NO_AUX_TREE,
|
||||
BSET_RO_AUX_TREE,
|
||||
BSET_RW_AUX_TREE,
|
||||
};
|
||||
|
||||
#define BSET_TREE_NR_TYPES 3
|
||||
|
||||
#define BSET_NO_AUX_TREE_VAL (U16_MAX)
|
||||
#define BSET_RW_AUX_TREE_VAL (U16_MAX - 1)
|
||||
|
||||
static inline enum bset_aux_tree_type bset_aux_tree_type(const struct bset_tree *t)
|
||||
{
|
||||
switch (t->extra) {
|
||||
case BSET_NO_AUX_TREE_VAL:
|
||||
EBUG_ON(t->size);
|
||||
return BSET_NO_AUX_TREE;
|
||||
case BSET_RW_AUX_TREE_VAL:
|
||||
EBUG_ON(!t->size);
|
||||
return BSET_RW_AUX_TREE;
|
||||
default:
|
||||
EBUG_ON(!t->size);
|
||||
return BSET_RO_AUX_TREE;
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* BSET_CACHELINE was originally intended to match the hardware cacheline size -
|
||||
* it used to be 64, but I realized the lookup code would touch slightly less
|
||||
* memory if it was 128.
|
||||
*
|
||||
* It defines the number of bytes (in struct bset) per struct bkey_float in
|
||||
* the auxiliary search tree - when we're done searching the bset_float tree we
|
||||
* have this many bytes left that we do a linear search over.
|
||||
*
|
||||
* Since (after level 5) every level of the bset_tree is on a new cacheline,
|
||||
* we're touching one fewer cacheline in the bset tree in exchange for one more
|
||||
* cacheline in the linear search - but the linear search might stop before it
|
||||
* gets to the second cacheline.
|
||||
*/
|
||||
|
||||
#define BSET_CACHELINE 256
|
||||
|
||||
static inline size_t btree_keys_cachelines(const struct btree *b)
|
||||
{
|
||||
return (1U << b->byte_order) / BSET_CACHELINE;
|
||||
}
|
||||
|
||||
static inline size_t btree_aux_data_bytes(const struct btree *b)
|
||||
{
|
||||
return btree_keys_cachelines(b) * 8;
|
||||
}
|
||||
|
||||
static inline size_t btree_aux_data_u64s(const struct btree *b)
|
||||
{
|
||||
return btree_aux_data_bytes(b) / sizeof(u64);
|
||||
}
|
||||
|
||||
#define for_each_bset(_b, _t) \
|
||||
for (struct bset_tree *_t = (_b)->set; _t < (_b)->set + (_b)->nsets; _t++)
|
||||
|
||||
#define for_each_bset_c(_b, _t) \
|
||||
for (const struct bset_tree *_t = (_b)->set; _t < (_b)->set + (_b)->nsets; _t++)
|
||||
|
||||
#define bset_tree_for_each_key(_b, _t, _k) \
|
||||
for (_k = btree_bkey_first(_b, _t); \
|
||||
_k != btree_bkey_last(_b, _t); \
|
||||
_k = bkey_p_next(_k))
|
||||
|
||||
static inline bool bset_has_ro_aux_tree(const struct bset_tree *t)
|
||||
{
|
||||
return bset_aux_tree_type(t) == BSET_RO_AUX_TREE;
|
||||
}
|
||||
|
||||
static inline bool bset_has_rw_aux_tree(struct bset_tree *t)
|
||||
{
|
||||
return bset_aux_tree_type(t) == BSET_RW_AUX_TREE;
|
||||
}
|
||||
|
||||
static inline void bch2_bset_set_no_aux_tree(struct btree *b,
|
||||
struct bset_tree *t)
|
||||
{
|
||||
BUG_ON(t < b->set);
|
||||
|
||||
for (; t < b->set + ARRAY_SIZE(b->set); t++) {
|
||||
t->size = 0;
|
||||
t->extra = BSET_NO_AUX_TREE_VAL;
|
||||
t->aux_data_offset = U16_MAX;
|
||||
}
|
||||
}
|
||||
|
||||
static inline void btree_node_set_format(struct btree *b,
|
||||
struct bkey_format f)
|
||||
{
|
||||
int len;
|
||||
|
||||
b->format = f;
|
||||
b->nr_key_bits = bkey_format_key_bits(&f);
|
||||
|
||||
len = bch2_compile_bkey_format(&b->format, b->aux_data);
|
||||
BUG_ON(len < 0 || len > U8_MAX);
|
||||
|
||||
b->unpack_fn_len = len;
|
||||
|
||||
bch2_bset_set_no_aux_tree(b, b->set);
|
||||
}
|
||||
|
||||
static inline struct bset *bset_next_set(struct btree *b,
|
||||
unsigned block_bytes)
|
||||
{
|
||||
struct bset *i = btree_bset_last(b);
|
||||
|
||||
EBUG_ON(!is_power_of_2(block_bytes));
|
||||
|
||||
return ((void *) i) + round_up(vstruct_bytes(i), block_bytes);
|
||||
}
|
||||
|
||||
void bch2_btree_keys_init(struct btree *);
|
||||
|
||||
void bch2_bset_init_first(struct btree *, struct bset *);
|
||||
void bch2_bset_init_next(struct btree *, struct btree_node_entry *);
|
||||
void bch2_bset_build_aux_tree(struct btree *, struct bset_tree *, bool);
|
||||
|
||||
void bch2_bset_insert(struct btree *, struct bkey_packed *, struct bkey_i *,
|
||||
unsigned);
|
||||
void bch2_bset_delete(struct btree *, struct bkey_packed *, unsigned);
|
||||
|
||||
/* Bkey utility code */
|
||||
|
||||
/* packed or unpacked */
|
||||
static inline int bkey_cmp_p_or_unp(const struct btree *b,
|
||||
const struct bkey_packed *l,
|
||||
const struct bkey_packed *r_packed,
|
||||
const struct bpos *r)
|
||||
{
|
||||
EBUG_ON(r_packed && !bkey_packed(r_packed));
|
||||
|
||||
if (unlikely(!bkey_packed(l)))
|
||||
return bpos_cmp(packed_to_bkey_c(l)->p, *r);
|
||||
|
||||
if (likely(r_packed))
|
||||
return __bch2_bkey_cmp_packed_format_checked(l, r_packed, b);
|
||||
|
||||
return __bch2_bkey_cmp_left_packed_format_checked(b, l, r);
|
||||
}
|
||||
|
||||
static inline struct bset_tree *
|
||||
bch2_bkey_to_bset_inlined(struct btree *b, struct bkey_packed *k)
|
||||
{
|
||||
unsigned offset = __btree_node_key_to_offset(b, k);
|
||||
|
||||
for_each_bset(b, t)
|
||||
if (offset <= t->end_offset) {
|
||||
EBUG_ON(offset < btree_bkey_first_offset(t));
|
||||
return t;
|
||||
}
|
||||
|
||||
BUG();
|
||||
}
|
||||
|
||||
struct bset_tree *bch2_bkey_to_bset(struct btree *, struct bkey_packed *);
|
||||
|
||||
struct bkey_packed *bch2_bkey_prev_filter(struct btree *, struct bset_tree *,
|
||||
struct bkey_packed *, unsigned);
|
||||
|
||||
static inline struct bkey_packed *
|
||||
bch2_bkey_prev_all(struct btree *b, struct bset_tree *t, struct bkey_packed *k)
|
||||
{
|
||||
return bch2_bkey_prev_filter(b, t, k, 0);
|
||||
}
|
||||
|
||||
static inline struct bkey_packed *
|
||||
bch2_bkey_prev(struct btree *b, struct bset_tree *t, struct bkey_packed *k)
|
||||
{
|
||||
return bch2_bkey_prev_filter(b, t, k, 1);
|
||||
}
|
||||
|
||||
/* Btree key iteration */
|
||||
|
||||
void bch2_btree_node_iter_push(struct btree_node_iter *, struct btree *,
|
||||
const struct bkey_packed *,
|
||||
const struct bkey_packed *);
|
||||
void bch2_btree_node_iter_init(struct btree_node_iter *, struct btree *,
|
||||
struct bpos *);
|
||||
void bch2_btree_node_iter_init_from_start(struct btree_node_iter *,
|
||||
struct btree *);
|
||||
struct bkey_packed *bch2_btree_node_iter_bset_pos(struct btree_node_iter *,
|
||||
struct btree *,
|
||||
struct bset_tree *);
|
||||
|
||||
void bch2_btree_node_iter_sort(struct btree_node_iter *, struct btree *);
|
||||
void bch2_btree_node_iter_set_drop(struct btree_node_iter *,
|
||||
struct btree_node_iter_set *);
|
||||
void bch2_btree_node_iter_advance(struct btree_node_iter *, struct btree *);
|
||||
|
||||
#define btree_node_iter_for_each(_iter, _set) \
|
||||
for (_set = (_iter)->data; \
|
||||
_set < (_iter)->data + ARRAY_SIZE((_iter)->data) && \
|
||||
(_set)->k != (_set)->end; \
|
||||
_set++)
|
||||
|
||||
static inline bool __btree_node_iter_set_end(struct btree_node_iter *iter,
|
||||
unsigned i)
|
||||
{
|
||||
return iter->data[i].k == iter->data[i].end;
|
||||
}
|
||||
|
||||
static inline bool bch2_btree_node_iter_end(struct btree_node_iter *iter)
|
||||
{
|
||||
return __btree_node_iter_set_end(iter, 0);
|
||||
}
|
||||
|
||||
/*
|
||||
* When keys compare equal, deleted keys compare first:
|
||||
*
|
||||
* XXX: only need to compare pointers for keys that are both within a
|
||||
* btree_node_iterator - we need to break ties for prev() to work correctly
|
||||
*/
|
||||
static inline int bkey_iter_cmp(const struct btree *b,
|
||||
const struct bkey_packed *l,
|
||||
const struct bkey_packed *r)
|
||||
{
|
||||
return bch2_bkey_cmp_packed(b, l, r)
|
||||
?: (int) bkey_deleted(r) - (int) bkey_deleted(l)
|
||||
?: cmp_int(l, r);
|
||||
}
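
bkey_iter_cmp() composes three comparisons with the GNU `?:` extension: each later term only matters when the earlier ones tie. A small standalone illustration of the same chaining pattern, with toy types and assumed names:

/* Toy illustration of chained comparators via the GNU a ?: b extension. */
#include <stdio.h>

#define cmp_int(a, b)	(((a) > (b)) - ((a) < (b)))

struct item { int key; int deleted; int seq; };

static int item_cmp(const struct item *l, const struct item *r)
{
	/* Order by key; on ties, deleted items sort first; then by insertion seq. */
	return cmp_int(l->key, r->key)
		?: cmp_int(r->deleted, l->deleted)
		?: cmp_int(l->seq, r->seq);
}

int main(void)
{
	struct item a = { .key = 1, .deleted = 1, .seq = 2 };
	struct item b = { .key = 1, .deleted = 0, .seq = 1 };

	printf("%d\n", item_cmp(&a, &b));	/* negative: a (deleted) sorts first */
	return 0;
}
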
|
||||
|
||||
static inline int btree_node_iter_cmp(const struct btree *b,
|
||||
struct btree_node_iter_set l,
|
||||
struct btree_node_iter_set r)
|
||||
{
|
||||
return bkey_iter_cmp(b,
|
||||
__btree_node_offset_to_key(b, l.k),
|
||||
__btree_node_offset_to_key(b, r.k));
|
||||
}
|
||||
|
||||
/* These assume r (the search key) is not a deleted key: */
|
||||
static inline int bkey_iter_pos_cmp(const struct btree *b,
|
||||
const struct bkey_packed *l,
|
||||
const struct bpos *r)
|
||||
{
|
||||
return bkey_cmp_left_packed(b, l, r)
|
||||
?: -((int) bkey_deleted(l));
|
||||
}
|
||||
|
||||
static inline int bkey_iter_cmp_p_or_unp(const struct btree *b,
|
||||
const struct bkey_packed *l,
|
||||
const struct bkey_packed *r_packed,
|
||||
const struct bpos *r)
|
||||
{
|
||||
return bkey_cmp_p_or_unp(b, l, r_packed, r)
|
||||
?: -((int) bkey_deleted(l));
|
||||
}
|
||||
|
||||
static inline struct bkey_packed *
|
||||
__bch2_btree_node_iter_peek_all(struct btree_node_iter *iter,
|
||||
struct btree *b)
|
||||
{
|
||||
return __btree_node_offset_to_key(b, iter->data->k);
|
||||
}
|
||||
|
||||
static inline struct bkey_packed *
|
||||
bch2_btree_node_iter_peek_all(struct btree_node_iter *iter, struct btree *b)
|
||||
{
|
||||
return !bch2_btree_node_iter_end(iter)
|
||||
? __btree_node_offset_to_key(b, iter->data->k)
|
||||
: NULL;
|
||||
}
|
||||
|
||||
static inline struct bkey_packed *
|
||||
bch2_btree_node_iter_peek(struct btree_node_iter *iter, struct btree *b)
|
||||
{
|
||||
struct bkey_packed *k;
|
||||
|
||||
while ((k = bch2_btree_node_iter_peek_all(iter, b)) &&
|
||||
bkey_deleted(k))
|
||||
bch2_btree_node_iter_advance(iter, b);
|
||||
|
||||
return k;
|
||||
}
|
||||
|
||||
static inline struct bkey_packed *
|
||||
bch2_btree_node_iter_next_all(struct btree_node_iter *iter, struct btree *b)
|
||||
{
|
||||
struct bkey_packed *ret = bch2_btree_node_iter_peek_all(iter, b);
|
||||
|
||||
if (ret)
|
||||
bch2_btree_node_iter_advance(iter, b);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
struct bkey_packed *bch2_btree_node_iter_prev_all(struct btree_node_iter *,
|
||||
struct btree *);
|
||||
struct bkey_packed *bch2_btree_node_iter_prev(struct btree_node_iter *,
|
||||
struct btree *);
|
||||
|
||||
struct bkey_s_c bch2_btree_node_iter_peek_unpack(struct btree_node_iter *,
|
||||
struct btree *,
|
||||
struct bkey *);
|
||||
|
||||
#define for_each_btree_node_key(b, k, iter) \
|
||||
for (bch2_btree_node_iter_init_from_start((iter), (b)); \
|
||||
(k = bch2_btree_node_iter_peek((iter), (b))); \
|
||||
bch2_btree_node_iter_advance(iter, b))
|
||||
|
||||
#define for_each_btree_node_key_unpack(b, k, iter, unpacked) \
|
||||
for (bch2_btree_node_iter_init_from_start((iter), (b)); \
|
||||
(k = bch2_btree_node_iter_peek_unpack((iter), (b), (unpacked))).k;\
|
||||
bch2_btree_node_iter_advance(iter, b))
|
||||
|
||||
/* Accounting: */
|
||||
|
||||
struct btree_nr_keys bch2_btree_node_count_keys(struct btree *);
|
||||
|
||||
static inline void btree_keys_account_key(struct btree_nr_keys *n,
|
||||
unsigned bset,
|
||||
struct bkey_packed *k,
|
||||
int sign)
|
||||
{
|
||||
n->live_u64s += k->u64s * sign;
|
||||
n->bset_u64s[bset] += k->u64s * sign;
|
||||
|
||||
if (bkey_packed(k))
|
||||
n->packed_keys += sign;
|
||||
else
|
||||
n->unpacked_keys += sign;
|
||||
}
|
||||
|
||||
static inline void btree_keys_account_val_delta(struct btree *b,
|
||||
struct bkey_packed *k,
|
||||
int delta)
|
||||
{
|
||||
struct bset_tree *t = bch2_bkey_to_bset(b, k);
|
||||
|
||||
b->nr.live_u64s += delta;
|
||||
b->nr.bset_u64s[t - b->set] += delta;
|
||||
}
|
||||
|
||||
#define btree_keys_account_key_add(_nr, _bset_idx, _k) \
|
||||
btree_keys_account_key(_nr, _bset_idx, _k, 1)
|
||||
#define btree_keys_account_key_drop(_nr, _bset_idx, _k) \
|
||||
btree_keys_account_key(_nr, _bset_idx, _k, -1)
|
||||
|
||||
#define btree_account_key_add(_b, _k) \
|
||||
btree_keys_account_key(&(_b)->nr, \
|
||||
bch2_bkey_to_bset(_b, _k) - (_b)->set, _k, 1)
|
||||
#define btree_account_key_drop(_b, _k) \
|
||||
btree_keys_account_key(&(_b)->nr, \
|
||||
bch2_bkey_to_bset(_b, _k) - (_b)->set, _k, -1)
|
||||
|
||||
struct bset_stats {
|
||||
struct {
|
||||
size_t nr, bytes;
|
||||
} sets[BSET_TREE_NR_TYPES];
|
||||
|
||||
size_t floats;
|
||||
size_t failed;
|
||||
};
|
||||
|
||||
void bch2_btree_keys_stats(const struct btree *, struct bset_stats *);
|
||||
void bch2_bfloat_to_text(struct printbuf *, struct btree *,
|
||||
struct bkey_packed *);
|
||||
|
||||
/* Debug stuff */
|
||||
|
||||
void bch2_dump_bset(struct bch_fs *, struct btree *, struct bset *, unsigned);
|
||||
void bch2_dump_btree_node(struct bch_fs *, struct btree *);
|
||||
void bch2_dump_btree_node_iter(struct btree *, struct btree_node_iter *);
|
||||
|
||||
void __bch2_verify_btree_nr_keys(struct btree *);
|
||||
void __bch2_btree_node_iter_verify(struct btree_node_iter *, struct btree *);
|
||||
|
||||
static inline void bch2_btree_node_iter_verify(struct btree_node_iter *iter,
|
||||
struct btree *b)
|
||||
{
|
||||
if (static_branch_unlikely(&bch2_debug_check_bset_lookups))
|
||||
__bch2_btree_node_iter_verify(iter, b);
|
||||
}
|
||||
|
||||
static inline void bch2_verify_btree_nr_keys(struct btree *b)
|
||||
{
|
||||
if (static_branch_unlikely(&bch2_debug_check_btree_accounting))
|
||||
__bch2_verify_btree_nr_keys(b);
|
||||
}
|
||||
|
||||
#endif /* _BCACHEFS_BSET_H */
|
||||
(file diff suppressed because it is too large)
@@ -1,157 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_CACHE_H
|
||||
#define _BCACHEFS_BTREE_CACHE_H
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "btree_types.h"
|
||||
#include "bkey_methods.h"
|
||||
|
||||
extern const char * const bch2_btree_node_flags[];
|
||||
|
||||
struct btree_iter;
|
||||
|
||||
void bch2_recalc_btree_reserve(struct bch_fs *);
|
||||
|
||||
void bch2_btree_node_to_freelist(struct bch_fs *, struct btree *);
|
||||
|
||||
void __bch2_btree_node_hash_remove(struct btree_cache *, struct btree *);
|
||||
void bch2_btree_node_hash_remove(struct btree_cache *, struct btree *);
|
||||
|
||||
int __bch2_btree_node_hash_insert(struct btree_cache *, struct btree *);
|
||||
int bch2_btree_node_hash_insert(struct btree_cache *, struct btree *,
|
||||
unsigned, enum btree_id);
|
||||
|
||||
void bch2_node_pin(struct bch_fs *, struct btree *);
|
||||
void bch2_btree_cache_unpin(struct bch_fs *);
|
||||
|
||||
void bch2_btree_node_update_key_early(struct btree_trans *, enum btree_id, unsigned,
|
||||
struct bkey_s_c, struct bkey_i *);
|
||||
|
||||
void bch2_btree_cache_cannibalize_unlock(struct btree_trans *);
|
||||
int bch2_btree_cache_cannibalize_lock(struct btree_trans *, struct closure *);
|
||||
|
||||
void __btree_node_data_free(struct btree *);
|
||||
struct btree *__bch2_btree_node_mem_alloc(struct bch_fs *);
|
||||
struct btree *bch2_btree_node_mem_alloc(struct btree_trans *, bool);
|
||||
|
||||
struct btree *bch2_btree_node_get(struct btree_trans *, struct btree_path *,
|
||||
const struct bkey_i *, unsigned,
|
||||
enum six_lock_type, unsigned long);
|
||||
|
||||
struct btree *bch2_btree_node_get_noiter(struct btree_trans *, const struct bkey_i *,
|
||||
enum btree_id, unsigned, bool);
|
||||
|
||||
int bch2_btree_node_prefetch(struct btree_trans *, struct btree_path *,
|
||||
const struct bkey_i *, enum btree_id, unsigned);
|
||||
|
||||
void bch2_btree_node_evict(struct btree_trans *, const struct bkey_i *);
|
||||
|
||||
void bch2_fs_btree_cache_exit(struct bch_fs *);
|
||||
int bch2_fs_btree_cache_init(struct bch_fs *);
|
||||
void bch2_fs_btree_cache_init_early(struct btree_cache *);
|
||||
|
||||
static inline u64 btree_ptr_hash_val(const struct bkey_i *k)
|
||||
{
|
||||
switch (k->k.type) {
|
||||
case KEY_TYPE_btree_ptr:
|
||||
return *((u64 *) bkey_i_to_btree_ptr_c(k)->v.start);
|
||||
case KEY_TYPE_btree_ptr_v2:
|
||||
/*
|
||||
* The cast/deref is only necessary to avoid sparse endianness
|
||||
* warnings:
|
||||
*/
|
||||
return *((u64 *) &bkey_i_to_btree_ptr_v2_c(k)->v.seq);
|
||||
default:
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
static inline struct btree *btree_node_mem_ptr(const struct bkey_i *k)
|
||||
{
|
||||
return k->k.type == KEY_TYPE_btree_ptr_v2
|
||||
? (void *)(unsigned long)bkey_i_to_btree_ptr_v2_c(k)->v.mem_ptr
|
||||
: NULL;
|
||||
}
|
||||
|
||||
/* is btree node in hash table? */
|
||||
static inline bool btree_node_hashed(struct btree *b)
|
||||
{
|
||||
return b->hash_val != 0;
|
||||
}
|
||||
|
||||
#define for_each_cached_btree(_b, _c, _tbl, _iter, _pos) \
|
||||
for ((_tbl) = rht_dereference_rcu((_c)->btree_cache.table.tbl, \
|
||||
&(_c)->btree_cache.table), \
|
||||
_iter = 0; _iter < (_tbl)->size; _iter++) \
|
||||
rht_for_each_entry_rcu((_b), (_pos), _tbl, _iter, hash)
|
||||
|
||||
static inline size_t btree_buf_bytes(const struct btree *b)
|
||||
{
|
||||
return 1UL << b->byte_order;
|
||||
}
|
||||
|
||||
static inline size_t btree_buf_max_u64s(const struct btree *b)
|
||||
{
|
||||
return (btree_buf_bytes(b) - sizeof(struct btree_node)) / sizeof(u64);
|
||||
}
|
||||
|
||||
static inline size_t btree_max_u64s(const struct bch_fs *c)
|
||||
{
|
||||
return (c->opts.btree_node_size - sizeof(struct btree_node)) / sizeof(u64);
|
||||
}
|
||||
|
||||
static inline size_t btree_sectors(const struct bch_fs *c)
|
||||
{
|
||||
return c->opts.btree_node_size >> SECTOR_SHIFT;
|
||||
}
|
||||
|
||||
static inline unsigned btree_blocks(const struct bch_fs *c)
|
||||
{
|
||||
return btree_sectors(c) >> c->block_bits;
|
||||
}
|
||||
|
||||
#define BTREE_SPLIT_THRESHOLD(c) (btree_max_u64s(c) * 2 / 3)
|
||||
|
||||
#define BTREE_FOREGROUND_MERGE_THRESHOLD(c) (btree_max_u64s(c) * 1 / 3)
|
||||
#define BTREE_FOREGROUND_MERGE_HYSTERESIS(c) \
|
||||
(BTREE_FOREGROUND_MERGE_THRESHOLD(c) + \
|
||||
(BTREE_FOREGROUND_MERGE_THRESHOLD(c) >> 2))
|
||||
|
||||
static inline unsigned btree_id_nr_alive(struct bch_fs *c)
|
||||
{
|
||||
return BTREE_ID_NR + c->btree_roots_extra.nr;
|
||||
}
|
||||
|
||||
static inline struct btree_root *bch2_btree_id_root(struct bch_fs *c, unsigned id)
|
||||
{
|
||||
if (likely(id < BTREE_ID_NR)) {
|
||||
return &c->btree_roots_known[id];
|
||||
} else {
|
||||
unsigned idx = id - BTREE_ID_NR;
|
||||
|
||||
/* This can happen when we're called from btree_node_scan */
|
||||
if (idx >= c->btree_roots_extra.nr)
|
||||
return NULL;
|
||||
|
||||
return &c->btree_roots_extra.data[idx];
|
||||
}
|
||||
}
|
||||
|
||||
static inline struct btree *btree_node_root(struct bch_fs *c, struct btree *b)
|
||||
{
|
||||
struct btree_root *r = bch2_btree_id_root(c, b->c.btree_id);
|
||||
|
||||
return r ? r->b : NULL;
|
||||
}
|
||||
|
||||
const char *bch2_btree_id_str(enum btree_id); /* avoid */
|
||||
void bch2_btree_id_to_text(struct printbuf *, enum btree_id);
|
||||
void bch2_btree_id_level_to_text(struct printbuf *, enum btree_id, unsigned);
|
||||
|
||||
void __bch2_btree_pos_to_text(struct printbuf *, struct bch_fs *,
|
||||
enum btree_id, unsigned, struct bkey_s_c);
|
||||
void bch2_btree_pos_to_text(struct printbuf *, struct bch_fs *, const struct btree *);
|
||||
void bch2_btree_node_to_text(struct printbuf *, struct bch_fs *, const struct btree *);
|
||||
void bch2_btree_cache_to_text(struct printbuf *, const struct btree_cache *);
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_CACHE_H */
|
||||
(file diff suppressed because it is too large)
@@ -1,88 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_GC_H
|
||||
#define _BCACHEFS_BTREE_GC_H
|
||||
|
||||
#include "bkey.h"
|
||||
#include "btree_gc_types.h"
|
||||
#include "btree_types.h"
|
||||
|
||||
int bch2_check_topology(struct bch_fs *);
|
||||
int bch2_check_allocations(struct bch_fs *);
|
||||
|
||||
/*
|
||||
* For concurrent mark and sweep (with other index updates), we define a total
|
||||
* ordering of _all_ references GC walks:
|
||||
*
|
||||
* Note that some references will have the same GC position as others - e.g.
|
||||
* everything within the same btree node; in those cases we're relying on
|
||||
* whatever locking exists for where those references live, i.e. the write lock
|
||||
* on a btree node.
|
||||
*
|
||||
* That locking is also required to ensure GC doesn't pass the updater in
|
||||
* between the updater adding/removing the reference and updating the GC marks;
|
||||
* without that, we would at best double count sometimes.
|
||||
*
|
||||
* That part is important - whenever calling bch2_mark_pointers(), a lock _must_
|
||||
* be held that prevents GC from passing the position the updater is at.
|
||||
*
|
||||
* (What about the start of gc, when we're clearing all the marks? GC clears the
|
||||
* mark with the gc pos seqlock held, and bch_mark_bucket checks against the gc
|
||||
* position inside its cmpxchg loop, so crap magically works).
|
||||
*/
|
||||
|
||||
/* Position of (the start of) a gc phase: */
|
||||
static inline struct gc_pos gc_phase(enum gc_phase phase)
|
||||
{
|
||||
return (struct gc_pos) { .phase = phase, };
|
||||
}
|
||||
|
||||
static inline struct gc_pos gc_pos_btree(enum btree_id btree, unsigned level,
|
||||
struct bpos pos)
|
||||
{
|
||||
return (struct gc_pos) {
|
||||
.phase = GC_PHASE_btree,
|
||||
.btree = btree,
|
||||
.level = level,
|
||||
.pos = pos,
|
||||
};
|
||||
}
|
||||
|
||||
static inline int gc_btree_order(enum btree_id btree)
|
||||
{
|
||||
if (btree == BTREE_ID_alloc)
|
||||
return -2;
|
||||
if (btree == BTREE_ID_stripes)
|
||||
return -1;
|
||||
return btree;
|
||||
}
|
||||
|
||||
static inline int gc_pos_cmp(struct gc_pos l, struct gc_pos r)
|
||||
{
|
||||
return cmp_int(l.phase, r.phase) ?:
|
||||
cmp_int(gc_btree_order(l.btree),
|
||||
gc_btree_order(r.btree)) ?:
|
||||
cmp_int(l.level, r.level) ?:
|
||||
bpos_cmp(l.pos, r.pos);
|
||||
}
|
||||
|
||||
static inline bool gc_visited(struct bch_fs *c, struct gc_pos pos)
|
||||
{
|
||||
unsigned seq;
|
||||
bool ret;
|
||||
|
||||
do {
|
||||
seq = read_seqcount_begin(&c->gc_pos_lock);
|
||||
ret = gc_pos_cmp(pos, c->gc_pos) <= 0;
|
||||
} while (read_seqcount_retry(&c->gc_pos_lock, seq));
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
void bch2_gc_pos_to_text(struct printbuf *, struct gc_pos *);
|
||||
|
||||
int bch2_gc_gens(struct bch_fs *);
|
||||
void bch2_gc_gens_async(struct bch_fs *);
|
||||
|
||||
void bch2_fs_btree_gc_init_early(struct bch_fs *);
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_GC_H */
|
||||
@@ -1,34 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_GC_TYPES_H
|
||||
#define _BCACHEFS_BTREE_GC_TYPES_H
|
||||
|
||||
#include <linux/generic-radix-tree.h>
|
||||
|
||||
#define GC_PHASES() \
|
||||
x(not_running) \
|
||||
x(start) \
|
||||
x(sb) \
|
||||
x(btree)
|
||||
|
||||
enum gc_phase {
|
||||
#define x(n) GC_PHASE_##n,
|
||||
GC_PHASES()
|
||||
#undef x
|
||||
};
|
||||
|
||||
struct gc_pos {
|
||||
enum gc_phase phase:8;
|
||||
enum btree_id btree:8;
|
||||
u16 level;
|
||||
struct bpos pos;
|
||||
};
|
||||
|
||||
struct reflink_gc {
|
||||
u64 offset;
|
||||
u32 size;
|
||||
u32 refcount;
|
||||
};
|
||||
|
||||
typedef GENRADIX(struct reflink_gc) reflink_gc_table;
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_GC_TYPES_H */
|
||||
(file diff suppressed because it is too large)
@@ -1,239 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_IO_H
|
||||
#define _BCACHEFS_BTREE_IO_H
|
||||
|
||||
#include "bkey_methods.h"
|
||||
#include "bset.h"
|
||||
#include "btree_locking.h"
|
||||
#include "checksum.h"
|
||||
#include "extents.h"
|
||||
#include "io_write_types.h"
|
||||
|
||||
struct bch_fs;
|
||||
struct btree_write;
|
||||
struct btree;
|
||||
struct btree_iter;
|
||||
struct btree_node_read_all;
|
||||
|
||||
static inline void set_btree_node_dirty_acct(struct bch_fs *c, struct btree *b)
|
||||
{
|
||||
if (!test_and_set_bit(BTREE_NODE_dirty, &b->flags))
|
||||
atomic_long_inc(&c->btree_cache.nr_dirty);
|
||||
}
|
||||
|
||||
static inline void clear_btree_node_dirty_acct(struct bch_fs *c, struct btree *b)
|
||||
{
|
||||
if (test_and_clear_bit(BTREE_NODE_dirty, &b->flags))
|
||||
atomic_long_dec(&c->btree_cache.nr_dirty);
|
||||
}
|
||||
|
||||
static inline unsigned btree_ptr_sectors_written(struct bkey_s_c k)
|
||||
{
|
||||
return k.k->type == KEY_TYPE_btree_ptr_v2
|
||||
? le16_to_cpu(bkey_s_c_to_btree_ptr_v2(k).v->sectors_written)
|
||||
: 0;
|
||||
}
|
||||
|
||||
struct btree_read_bio {
|
||||
struct bch_fs *c;
|
||||
struct btree *b;
|
||||
struct btree_node_read_all *ra;
|
||||
u64 start_time;
|
||||
unsigned have_ioref:1;
|
||||
unsigned idx:7;
|
||||
#ifdef CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS
|
||||
unsigned list_idx;
|
||||
#endif
|
||||
struct extent_ptr_decoded pick;
|
||||
struct work_struct work;
|
||||
struct bio bio;
|
||||
};
|
||||
|
||||
struct btree_write_bio {
|
||||
struct work_struct work;
|
||||
__BKEY_PADDED(key, BKEY_BTREE_PTR_VAL_U64s_MAX);
|
||||
void *data;
|
||||
unsigned data_bytes;
|
||||
unsigned sector_offset;
|
||||
u64 start_time;
|
||||
#ifdef CONFIG_BCACHEFS_ASYNC_OBJECT_LISTS
|
||||
unsigned list_idx;
|
||||
#endif
|
||||
struct bch_write_bio wbio;
|
||||
};
|
||||
|
||||
void bch2_btree_node_io_unlock(struct btree *);
|
||||
void bch2_btree_node_io_lock(struct btree *);
|
||||
void __bch2_btree_node_wait_on_read(struct btree *);
|
||||
void __bch2_btree_node_wait_on_write(struct btree *);
|
||||
void bch2_btree_node_wait_on_read(struct btree *);
|
||||
void bch2_btree_node_wait_on_write(struct btree *);
|
||||
|
||||
enum compact_mode {
|
||||
COMPACT_LAZY,
|
||||
COMPACT_ALL,
|
||||
};
|
||||
|
||||
bool bch2_compact_whiteouts(struct bch_fs *, struct btree *,
|
||||
enum compact_mode);
|
||||
|
||||
static inline bool should_compact_bset_lazy(struct btree *b,
|
||||
struct bset_tree *t)
|
||||
{
|
||||
unsigned total_u64s = bset_u64s(t);
|
||||
unsigned dead_u64s = bset_dead_u64s(b, t);
|
||||
|
||||
return dead_u64s > 64 && dead_u64s * 3 > total_u64s;
|
||||
}
|
||||
|
||||
static inline bool bch2_maybe_compact_whiteouts(struct bch_fs *c, struct btree *b)
|
||||
{
|
||||
for_each_bset(b, t)
|
||||
if (should_compact_bset_lazy(b, t))
|
||||
return bch2_compact_whiteouts(c, b, COMPACT_LAZY);
|
||||
|
||||
return false;
|
||||
}
|
||||
|
||||
static inline struct nonce btree_nonce(struct bset *i, unsigned offset)
|
||||
{
|
||||
return (struct nonce) {{
|
||||
[0] = cpu_to_le32(offset),
|
||||
[1] = ((__le32 *) &i->seq)[0],
|
||||
[2] = ((__le32 *) &i->seq)[1],
|
||||
[3] = ((__le32 *) &i->journal_seq)[0]^BCH_NONCE_BTREE,
|
||||
}};
|
||||
}
|
||||
|
||||
static inline int bset_encrypt(struct bch_fs *c, struct bset *i, unsigned offset)
|
||||
{
|
||||
struct nonce nonce = btree_nonce(i, offset);
|
||||
int ret;
|
||||
|
||||
if (!offset) {
|
||||
struct btree_node *bn = container_of(i, struct btree_node, keys);
|
||||
unsigned bytes = (void *) &bn->keys - (void *) &bn->flags;
|
||||
|
||||
ret = bch2_encrypt(c, BSET_CSUM_TYPE(i), nonce,
|
||||
&bn->flags, bytes);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
nonce = nonce_add(nonce, round_up(bytes, CHACHA_BLOCK_SIZE));
|
||||
}
|
||||
|
||||
return bch2_encrypt(c, BSET_CSUM_TYPE(i), nonce, i->_data,
|
||||
vstruct_end(i) - (void *) i->_data);
|
||||
}
|
||||
|
||||
void bch2_btree_sort_into(struct bch_fs *, struct btree *, struct btree *);
|
||||
|
||||
void bch2_btree_node_drop_keys_outside_node(struct btree *);
|
||||
|
||||
void bch2_btree_build_aux_trees(struct btree *);
|
||||
void bch2_btree_init_next(struct btree_trans *, struct btree *);
|
||||
|
||||
int bch2_btree_node_read_done(struct bch_fs *, struct bch_dev *,
|
||||
struct btree *,
|
||||
struct bch_io_failures *,
|
||||
struct printbuf *);
|
||||
void bch2_btree_node_read(struct btree_trans *, struct btree *, bool);
|
||||
int bch2_btree_root_read(struct bch_fs *, enum btree_id,
|
||||
const struct bkey_i *, unsigned);
|
||||
|
||||
void bch2_btree_read_bio_to_text(struct printbuf *, struct btree_read_bio *);
|
||||
|
||||
int bch2_btree_node_scrub(struct btree_trans *, enum btree_id, unsigned,
|
||||
struct bkey_s_c, unsigned);
|
||||
|
||||
bool bch2_btree_post_write_cleanup(struct bch_fs *, struct btree *);
|
||||
|
||||
enum btree_write_flags {
|
||||
__BTREE_WRITE_ONLY_IF_NEED = BTREE_WRITE_TYPE_BITS,
|
||||
__BTREE_WRITE_ALREADY_STARTED,
|
||||
};
|
||||
#define BTREE_WRITE_ONLY_IF_NEED BIT(__BTREE_WRITE_ONLY_IF_NEED)
|
||||
#define BTREE_WRITE_ALREADY_STARTED BIT(__BTREE_WRITE_ALREADY_STARTED)
|
||||
|
||||
void __bch2_btree_node_write(struct bch_fs *, struct btree *, unsigned);
|
||||
void bch2_btree_node_write(struct bch_fs *, struct btree *,
|
||||
enum six_lock_type, unsigned);
|
||||
void bch2_btree_node_write_trans(struct btree_trans *, struct btree *,
|
||||
enum six_lock_type, unsigned);
|
||||
|
||||
static inline void btree_node_write_if_need(struct btree_trans *trans, struct btree *b,
|
||||
enum six_lock_type lock_held)
|
||||
{
|
||||
bch2_btree_node_write_trans(trans, b, lock_held, BTREE_WRITE_ONLY_IF_NEED);
|
||||
}
|
||||
|
||||
bool bch2_btree_flush_all_reads(struct bch_fs *);
|
||||
bool bch2_btree_flush_all_writes(struct bch_fs *);
|
||||
|
||||
static inline void compat_bformat(unsigned level, enum btree_id btree_id,
|
||||
unsigned version, unsigned big_endian,
|
||||
int write, struct bkey_format *f)
|
||||
{
|
||||
if (version < bcachefs_metadata_version_inode_btree_change &&
|
||||
btree_id == BTREE_ID_inodes) {
|
||||
swap(f->bits_per_field[BKEY_FIELD_INODE],
|
||||
f->bits_per_field[BKEY_FIELD_OFFSET]);
|
||||
swap(f->field_offset[BKEY_FIELD_INODE],
|
||||
f->field_offset[BKEY_FIELD_OFFSET]);
|
||||
}
|
||||
|
||||
if (version < bcachefs_metadata_version_snapshot &&
|
||||
(level || btree_type_has_snapshots(btree_id))) {
|
||||
u64 max_packed =
|
||||
~(~0ULL << f->bits_per_field[BKEY_FIELD_SNAPSHOT]);
|
||||
|
||||
f->field_offset[BKEY_FIELD_SNAPSHOT] = write
|
||||
? 0
|
||||
: cpu_to_le64(U32_MAX - max_packed);
|
||||
}
|
||||
}
|
||||
|
||||
static inline void compat_bpos(unsigned level, enum btree_id btree_id,
|
||||
unsigned version, unsigned big_endian,
|
||||
int write, struct bpos *p)
|
||||
{
|
||||
if (big_endian != CPU_BIG_ENDIAN)
|
||||
bch2_bpos_swab(p);
|
||||
|
||||
if (version < bcachefs_metadata_version_inode_btree_change &&
|
||||
btree_id == BTREE_ID_inodes)
|
||||
swap(p->inode, p->offset);
|
||||
}
|
||||
|
||||
static inline void compat_btree_node(unsigned level, enum btree_id btree_id,
|
||||
unsigned version, unsigned big_endian,
|
||||
int write,
|
||||
struct btree_node *bn)
|
||||
{
|
||||
if (version < bcachefs_metadata_version_inode_btree_change &&
|
||||
btree_id_is_extents(btree_id) &&
|
||||
!bpos_eq(bn->min_key, POS_MIN) &&
|
||||
write)
|
||||
bn->min_key = bpos_nosnap_predecessor(bn->min_key);
|
||||
|
||||
if (version < bcachefs_metadata_version_snapshot &&
|
||||
write)
|
||||
bn->max_key.snapshot = 0;
|
||||
|
||||
compat_bpos(level, btree_id, version, big_endian, write, &bn->min_key);
|
||||
compat_bpos(level, btree_id, version, big_endian, write, &bn->max_key);
|
||||
|
||||
if (version < bcachefs_metadata_version_snapshot &&
|
||||
!write)
|
||||
bn->max_key.snapshot = U32_MAX;
|
||||
|
||||
if (version < bcachefs_metadata_version_inode_btree_change &&
|
||||
btree_id_is_extents(btree_id) &&
|
||||
!bpos_eq(bn->min_key, POS_MIN) &&
|
||||
!write)
|
||||
bn->min_key = bpos_nosnap_successor(bn->min_key);
|
||||
}
|
||||
|
||||
void bch2_btree_write_stats_to_text(struct printbuf *, struct bch_fs *);
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_IO_H */
|
||||
(file diff suppressed because it is too large)
(file diff suppressed because it is too large)
@@ -1,830 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "bkey_buf.h"
|
||||
#include "bset.h"
|
||||
#include "btree_cache.h"
|
||||
#include "btree_journal_iter.h"
|
||||
#include "journal_io.h"
|
||||
|
||||
#include <linux/sort.h>
|
||||
|
||||
/*
|
||||
* For managing keys we read from the journal: until journal replay works, normal
|
||||
* btree lookups need to be able to find and return keys from the journal where
|
||||
* they overwrite what's in the btree, so we have a special iterator and
|
||||
* operations for the regular btree iter code to use:
|
||||
*/
|
||||
|
||||
static inline size_t pos_to_idx(struct journal_keys *keys, size_t pos)
|
||||
{
|
||||
size_t gap_size = keys->size - keys->nr;
|
||||
|
||||
BUG_ON(pos >= keys->gap && pos < keys->gap + gap_size);
|
||||
|
||||
if (pos >= keys->gap)
|
||||
pos -= gap_size;
|
||||
return pos;
|
||||
}
|
||||
|
||||
static inline size_t idx_to_pos(struct journal_keys *keys, size_t idx)
|
||||
{
|
||||
size_t gap_size = keys->size - keys->nr;
|
||||
|
||||
if (idx >= keys->gap)
|
||||
idx += gap_size;
|
||||
return idx;
|
||||
}
|
||||
|
||||
static inline struct journal_key *idx_to_key(struct journal_keys *keys, size_t idx)
|
||||
{
|
||||
return keys->data + idx_to_pos(keys, idx);
|
||||
}
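
pos_to_idx()/idx_to_pos() above implement the index mapping for a gap buffer: live entries occupy the array on either side of a movable gap of (size - nr) unused slots, so a dense logical index is translated to a physical position by skipping the gap. A standalone toy version of the same mapping, with assumed names rather than the removed code:

/* Toy gap-buffer index mapping mirroring the helpers above; illustrative only. */
#include <assert.h>
#include <stdio.h>

struct gapbuf {
	size_t nr;	/* live entries */
	size_t size;	/* allocated slots */
	size_t gap;	/* physical position where the gap starts */
};

static size_t gb_idx_to_pos(const struct gapbuf *b, size_t idx)
{
	size_t gap_size = b->size - b->nr;

	return idx >= b->gap ? idx + gap_size : idx;
}

static size_t gb_pos_to_idx(const struct gapbuf *b, size_t pos)
{
	size_t gap_size = b->size - b->nr;

	assert(pos < b->gap || pos >= b->gap + gap_size);	/* never inside the gap */
	return pos >= b->gap ? pos - gap_size : pos;
}

int main(void)
{
	struct gapbuf b = { .nr = 6, .size = 10, .gap = 4 };

	for (size_t i = 0; i < b.nr; i++)
		printf("idx %zu -> pos %zu\n", i, gb_idx_to_pos(&b, i));

	assert(gb_pos_to_idx(&b, gb_idx_to_pos(&b, 5)) == 5);
	return 0;
}
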
|
||||
|
||||
static size_t __bch2_journal_key_search(struct journal_keys *keys,
|
||||
enum btree_id id, unsigned level,
|
||||
struct bpos pos)
|
||||
{
|
||||
size_t l = 0, r = keys->nr, m;
|
||||
|
||||
while (l < r) {
|
||||
m = l + ((r - l) >> 1);
|
||||
if (__journal_key_cmp(id, level, pos, idx_to_key(keys, m)) > 0)
|
||||
l = m + 1;
|
||||
else
|
||||
r = m;
|
||||
}
|
||||
|
||||
BUG_ON(l < keys->nr &&
|
||||
__journal_key_cmp(id, level, pos, idx_to_key(keys, l)) > 0);
|
||||
|
||||
BUG_ON(l &&
|
||||
__journal_key_cmp(id, level, pos, idx_to_key(keys, l - 1)) <= 0);
|
||||
|
||||
return l;
|
||||
}
|
||||
|
||||
static size_t bch2_journal_key_search(struct journal_keys *keys,
|
||||
enum btree_id id, unsigned level,
|
||||
struct bpos pos)
|
||||
{
|
||||
return idx_to_pos(keys, __bch2_journal_key_search(keys, id, level, pos));
|
||||
}
|
||||
|
||||
/* Returns first non-overwritten key >= search key: */
|
||||
struct bkey_i *bch2_journal_keys_peek_max(struct bch_fs *c, enum btree_id btree_id,
|
||||
unsigned level, struct bpos pos,
|
||||
struct bpos end_pos, size_t *idx)
|
||||
{
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
unsigned iters = 0;
|
||||
struct journal_key *k;
|
||||
|
||||
BUG_ON(*idx > keys->nr);
|
||||
search:
|
||||
if (!*idx)
|
||||
*idx = __bch2_journal_key_search(keys, btree_id, level, pos);
|
||||
|
||||
while (*idx &&
|
||||
__journal_key_cmp(btree_id, level, end_pos, idx_to_key(keys, *idx - 1)) <= 0) {
|
||||
--(*idx);
|
||||
iters++;
|
||||
if (iters == 10) {
|
||||
*idx = 0;
|
||||
goto search;
|
||||
}
|
||||
}
|
||||
|
||||
struct bkey_i *ret = NULL;
|
||||
rcu_read_lock(); /* for overwritten_ranges */
|
||||
|
||||
while ((k = *idx < keys->nr ? idx_to_key(keys, *idx) : NULL)) {
|
||||
if (__journal_key_cmp(btree_id, level, end_pos, k) < 0)
|
||||
break;
|
||||
|
||||
if (k->overwritten) {
|
||||
if (k->overwritten_range)
|
||||
*idx = rcu_dereference(k->overwritten_range)->end;
|
||||
else
|
||||
*idx += 1;
|
||||
continue;
|
||||
}
|
||||
|
||||
if (__journal_key_cmp(btree_id, level, pos, k) <= 0) {
|
||||
ret = k->k;
|
||||
break;
|
||||
}
|
||||
|
||||
(*idx)++;
|
||||
iters++;
|
||||
if (iters == 10) {
|
||||
*idx = 0;
|
||||
rcu_read_unlock();
|
||||
goto search;
|
||||
}
|
||||
}
|
||||
|
||||
rcu_read_unlock();
|
||||
return ret;
|
||||
}
|
||||
|
||||
struct bkey_i *bch2_journal_keys_peek_prev_min(struct bch_fs *c, enum btree_id btree_id,
|
||||
unsigned level, struct bpos pos,
|
||||
struct bpos end_pos, size_t *idx)
|
||||
{
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
unsigned iters = 0;
|
||||
struct journal_key *k;
|
||||
|
||||
BUG_ON(*idx > keys->nr);
|
||||
|
||||
if (!keys->nr)
|
||||
return NULL;
|
||||
search:
|
||||
if (!*idx)
|
||||
*idx = __bch2_journal_key_search(keys, btree_id, level, pos);
|
||||
|
||||
while (*idx < keys->nr &&
|
||||
__journal_key_cmp(btree_id, level, end_pos, idx_to_key(keys, *idx)) >= 0) {
|
||||
(*idx)++;
|
||||
iters++;
|
||||
if (iters == 10) {
|
||||
*idx = 0;
|
||||
goto search;
|
||||
}
|
||||
}
|
||||
|
||||
if (*idx == keys->nr)
|
||||
--(*idx);
|
||||
|
||||
struct bkey_i *ret = NULL;
|
||||
rcu_read_lock(); /* for overwritten_ranges */
|
||||
|
||||
while (true) {
|
||||
k = idx_to_key(keys, *idx);
|
||||
if (__journal_key_cmp(btree_id, level, end_pos, k) > 0)
|
||||
break;
|
||||
|
||||
if (k->overwritten) {
|
||||
if (k->overwritten_range)
|
||||
*idx = rcu_dereference(k->overwritten_range)->start;
|
||||
if (!*idx)
|
||||
break;
|
||||
--(*idx);
|
||||
continue;
|
||||
}
|
||||
|
||||
if (__journal_key_cmp(btree_id, level, pos, k) >= 0) {
|
||||
ret = k->k;
|
||||
break;
|
||||
}
|
||||
|
||||
if (!*idx)
|
||||
break;
|
||||
--(*idx);
|
||||
iters++;
|
||||
if (iters == 10) {
|
||||
*idx = 0;
|
||||
goto search;
|
||||
}
|
||||
}
|
||||
|
||||
rcu_read_unlock();
|
||||
return ret;
|
||||
}
|
||||
|
||||
struct bkey_i *bch2_journal_keys_peek_slot(struct bch_fs *c, enum btree_id btree_id,
|
||||
unsigned level, struct bpos pos)
|
||||
{
|
||||
size_t idx = 0;
|
||||
|
||||
return bch2_journal_keys_peek_max(c, btree_id, level, pos, pos, &idx);
|
||||
}
|
||||
|
||||
static void journal_iter_verify(struct journal_iter *iter)
|
||||
{
|
||||
#ifdef CONFIG_BCACHEFS_DEBUG
|
||||
struct journal_keys *keys = iter->keys;
|
||||
size_t gap_size = keys->size - keys->nr;
|
||||
|
||||
BUG_ON(iter->idx >= keys->gap &&
|
||||
iter->idx < keys->gap + gap_size);
|
||||
|
||||
if (iter->idx < keys->size) {
|
||||
struct journal_key *k = keys->data + iter->idx;
|
||||
|
||||
int cmp = __journal_key_btree_cmp(iter->btree_id, iter->level, k);
|
||||
BUG_ON(cmp > 0);
|
||||
}
|
||||
#endif
|
||||
}
|
||||
|
||||
static void journal_iters_fix(struct bch_fs *c)
|
||||
{
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
/* The key we just inserted is immediately before the gap: */
|
||||
size_t gap_end = keys->gap + (keys->size - keys->nr);
|
||||
struct journal_key *new_key = &keys->data[keys->gap - 1];
|
||||
struct journal_iter *iter;
|
||||
|
||||
/*
|
||||
* If an iterator points one after the key we just inserted, decrement
|
||||
* the iterator so it points at the key we just inserted - if the
|
||||
* decrement was unnecessary, bch2_btree_and_journal_iter_peek() will
|
||||
* handle that:
|
||||
*/
|
||||
list_for_each_entry(iter, &c->journal_iters, list) {
|
||||
journal_iter_verify(iter);
|
||||
if (iter->idx == gap_end &&
|
||||
new_key->btree_id == iter->btree_id &&
|
||||
new_key->level == iter->level)
|
||||
iter->idx = keys->gap - 1;
|
||||
journal_iter_verify(iter);
|
||||
}
|
||||
}
|
||||
|
||||
static void journal_iters_move_gap(struct bch_fs *c, size_t old_gap, size_t new_gap)
|
||||
{
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
struct journal_iter *iter;
|
||||
size_t gap_size = keys->size - keys->nr;
|
||||
|
||||
list_for_each_entry(iter, &c->journal_iters, list) {
|
||||
if (iter->idx > old_gap)
|
||||
iter->idx -= gap_size;
|
||||
if (iter->idx >= new_gap)
|
||||
iter->idx += gap_size;
|
||||
}
|
||||
}
|
||||
|
||||
int bch2_journal_key_insert_take(struct bch_fs *c, enum btree_id id,
|
||||
unsigned level, struct bkey_i *k)
|
||||
{
|
||||
struct journal_key n = {
|
||||
.btree_id = id,
|
||||
.level = level,
|
||||
.k = k,
|
||||
.allocated = true,
|
||||
/*
|
||||
* Ensure these keys are done last by journal replay, to unblock
|
||||
* journal reclaim:
|
||||
*/
|
||||
.journal_seq = U64_MAX,
|
||||
};
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
size_t idx = bch2_journal_key_search(keys, id, level, k->k.p);
|
||||
|
||||
BUG_ON(test_bit(BCH_FS_rw, &c->flags));
|
||||
|
||||
if (idx < keys->size &&
|
||||
journal_key_cmp(&n, &keys->data[idx]) == 0) {
|
||||
if (keys->data[idx].allocated)
|
||||
kfree(keys->data[idx].k);
|
||||
keys->data[idx] = n;
|
||||
return 0;
|
||||
}
|
||||
|
||||
if (idx > keys->gap)
|
||||
idx -= keys->size - keys->nr;
|
||||
|
||||
size_t old_gap = keys->gap;
|
||||
|
||||
if (keys->nr == keys->size) {
|
||||
journal_iters_move_gap(c, old_gap, keys->size);
|
||||
old_gap = keys->size;
|
||||
|
||||
struct journal_keys new_keys = {
|
||||
.nr = keys->nr,
|
||||
.size = max_t(size_t, keys->size, 8) * 2,
|
||||
};
|
||||
|
||||
new_keys.data = bch2_kvmalloc(new_keys.size * sizeof(new_keys.data[0]), GFP_KERNEL);
|
||||
if (!new_keys.data) {
|
||||
bch_err(c, "%s: error allocating new key array (size %zu)",
|
||||
__func__, new_keys.size);
|
||||
return bch_err_throw(c, ENOMEM_journal_key_insert);
|
||||
}
|
||||
|
||||
/* Since @keys was full, there was no gap: */
|
||||
memcpy(new_keys.data, keys->data, sizeof(keys->data[0]) * keys->nr);
|
||||
kvfree(keys->data);
|
||||
keys->data = new_keys.data;
|
||||
keys->nr = new_keys.nr;
|
||||
keys->size = new_keys.size;
|
||||
|
||||
/* And now the gap is at the end: */
|
||||
keys->gap = keys->nr;
|
||||
}
|
||||
|
||||
journal_iters_move_gap(c, old_gap, idx);
|
||||
|
||||
move_gap(keys, idx);
|
||||
|
||||
keys->nr++;
|
||||
keys->data[keys->gap++] = n;
|
||||
|
||||
journal_iters_fix(c);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* Can only be used from the recovery thread while we're still RO - can't be
|
||||
* used once we've got RW, as journal_keys is at that point used by multiple
|
||||
* threads:
|
||||
*/
|
||||
int bch2_journal_key_insert(struct bch_fs *c, enum btree_id id,
|
||||
unsigned level, struct bkey_i *k)
|
||||
{
|
||||
struct bkey_i *n;
|
||||
int ret;
|
||||
|
||||
n = kmalloc(bkey_bytes(&k->k), GFP_KERNEL);
|
||||
if (!n)
|
||||
return bch_err_throw(c, ENOMEM_journal_key_insert);
|
||||
|
||||
bkey_copy(n, k);
|
||||
ret = bch2_journal_key_insert_take(c, id, level, n);
|
||||
if (ret)
|
||||
kfree(n);
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_journal_key_delete(struct bch_fs *c, enum btree_id id,
|
||||
unsigned level, struct bpos pos)
|
||||
{
|
||||
struct bkey_i whiteout;
|
||||
|
||||
bkey_init(&whiteout.k);
|
||||
whiteout.k.p = pos;
|
||||
|
||||
return bch2_journal_key_insert(c, id, level, &whiteout);
|
||||
}
|
||||
|
||||
bool bch2_key_deleted_in_journal(struct btree_trans *trans, enum btree_id btree,
|
||||
unsigned level, struct bpos pos)
|
||||
{
|
||||
struct journal_keys *keys = &trans->c->journal_keys;
|
||||
size_t idx = bch2_journal_key_search(keys, btree, level, pos);
|
||||
|
||||
if (!trans->journal_replay_not_finished)
|
||||
return false;
|
||||
|
||||
return (idx < keys->size &&
|
||||
keys->data[idx].btree_id == btree &&
|
||||
keys->data[idx].level == level &&
|
||||
bpos_eq(keys->data[idx].k->k.p, pos) &&
|
||||
bkey_deleted(&keys->data[idx].k->k));
|
||||
}
|
||||
|
||||
static void __bch2_journal_key_overwritten(struct journal_keys *keys, size_t pos)
|
||||
{
|
||||
struct journal_key *k = keys->data + pos;
|
||||
size_t idx = pos_to_idx(keys, pos);
|
||||
|
||||
k->overwritten = true;
|
||||
|
||||
struct journal_key *prev = idx > 0 ? keys->data + idx_to_pos(keys, idx - 1) : NULL;
|
||||
struct journal_key *next = idx + 1 < keys->nr ? keys->data + idx_to_pos(keys, idx + 1) : NULL;
|
||||
|
||||
bool prev_overwritten = prev && prev->overwritten;
|
||||
bool next_overwritten = next && next->overwritten;
|
||||
|
||||
struct journal_key_range_overwritten *prev_range =
|
||||
prev_overwritten ? prev->overwritten_range : NULL;
|
||||
struct journal_key_range_overwritten *next_range =
|
||||
next_overwritten ? next->overwritten_range : NULL;
|
||||
|
||||
BUG_ON(prev_range && prev_range->end != idx);
|
||||
BUG_ON(next_range && next_range->start != idx + 1);
|
||||
|
||||
if (prev_range && next_range) {
|
||||
prev_range->end = next_range->end;
|
||||
|
||||
keys->data[pos].overwritten_range = prev_range;
|
||||
for (size_t i = next_range->start; i < next_range->end; i++) {
|
||||
struct journal_key *ip = keys->data + idx_to_pos(keys, i);
|
||||
BUG_ON(ip->overwritten_range != next_range);
|
||||
ip->overwritten_range = prev_range;
|
||||
}
|
||||
|
||||
kfree_rcu_mightsleep(next_range);
|
||||
} else if (prev_range) {
|
||||
prev_range->end++;
|
||||
k->overwritten_range = prev_range;
|
||||
if (next_overwritten) {
|
||||
prev_range->end++;
|
||||
next->overwritten_range = prev_range;
|
||||
}
|
||||
} else if (next_range) {
|
||||
next_range->start--;
|
||||
k->overwritten_range = next_range;
|
||||
if (prev_overwritten) {
|
||||
next_range->start--;
|
||||
prev->overwritten_range = next_range;
|
||||
}
|
||||
} else if (prev_overwritten || next_overwritten) {
|
||||
struct journal_key_range_overwritten *r = kmalloc(sizeof(*r), GFP_KERNEL);
|
||||
if (!r)
|
||||
return;
|
||||
|
||||
r->start = idx - (size_t) prev_overwritten;
|
||||
r->end = idx + 1 + (size_t) next_overwritten;
|
||||
|
||||
rcu_assign_pointer(k->overwritten_range, r);
|
||||
if (prev_overwritten)
|
||||
prev->overwritten_range = r;
|
||||
if (next_overwritten)
|
||||
next->overwritten_range = r;
|
||||
}
|
||||
}
|
||||
|
||||
void bch2_journal_key_overwritten(struct bch_fs *c, enum btree_id btree,
|
||||
unsigned level, struct bpos pos)
|
||||
{
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
size_t idx = bch2_journal_key_search(keys, btree, level, pos);
|
||||
|
||||
if (idx < keys->size &&
|
||||
keys->data[idx].btree_id == btree &&
|
||||
keys->data[idx].level == level &&
|
||||
bpos_eq(keys->data[idx].k->k.p, pos) &&
|
||||
!keys->data[idx].overwritten) {
|
||||
mutex_lock(&keys->overwrite_lock);
|
||||
__bch2_journal_key_overwritten(keys, idx);
|
||||
mutex_unlock(&keys->overwrite_lock);
|
||||
}
|
||||
}
|
||||
|
||||
static void bch2_journal_iter_advance(struct journal_iter *iter)
|
||||
{
|
||||
if (iter->idx < iter->keys->size) {
|
||||
iter->idx++;
|
||||
if (iter->idx == iter->keys->gap)
|
||||
iter->idx += iter->keys->size - iter->keys->nr;
|
||||
}
|
||||
}
|
||||
|
||||
static struct bkey_s_c bch2_journal_iter_peek(struct journal_iter *iter)
|
||||
{
|
||||
journal_iter_verify(iter);
|
||||
|
||||
guard(rcu)();
|
||||
while (iter->idx < iter->keys->size) {
|
||||
struct journal_key *k = iter->keys->data + iter->idx;
|
||||
|
||||
int cmp = __journal_key_btree_cmp(iter->btree_id, iter->level, k);
|
||||
if (cmp < 0)
|
||||
break;
|
||||
BUG_ON(cmp);
|
||||
|
||||
if (!k->overwritten)
|
||||
return bkey_i_to_s_c(k->k);
|
||||
|
||||
if (k->overwritten_range)
|
||||
iter->idx = idx_to_pos(iter->keys, rcu_dereference(k->overwritten_range)->end);
|
||||
else
|
||||
bch2_journal_iter_advance(iter);
|
||||
}
|
||||
|
||||
return bkey_s_c_null;
|
||||
}
|
||||
|
||||
static void bch2_journal_iter_exit(struct journal_iter *iter)
|
||||
{
|
||||
list_del(&iter->list);
|
||||
}
|
||||
|
||||
static void bch2_journal_iter_init(struct bch_fs *c,
|
||||
struct journal_iter *iter,
|
||||
enum btree_id id, unsigned level,
|
||||
struct bpos pos)
|
||||
{
|
||||
iter->btree_id = id;
|
||||
iter->level = level;
|
||||
iter->keys = &c->journal_keys;
|
||||
iter->idx = bch2_journal_key_search(&c->journal_keys, id, level, pos);
|
||||
|
||||
journal_iter_verify(iter);
|
||||
}
|
||||
|
||||
static struct bkey_s_c bch2_journal_iter_peek_btree(struct btree_and_journal_iter *iter)
|
||||
{
|
||||
return bch2_btree_node_iter_peek_unpack(&iter->node_iter,
|
||||
iter->b, &iter->unpacked);
|
||||
}
|
||||
|
||||
static void bch2_journal_iter_advance_btree(struct btree_and_journal_iter *iter)
|
||||
{
|
||||
bch2_btree_node_iter_advance(&iter->node_iter, iter->b);
|
||||
}
|
||||
|
||||
void bch2_btree_and_journal_iter_advance(struct btree_and_journal_iter *iter)
|
||||
{
|
||||
if (bpos_eq(iter->pos, SPOS_MAX))
|
||||
iter->at_end = true;
|
||||
else
|
||||
iter->pos = bpos_successor(iter->pos);
|
||||
}
|
||||
|
||||
static void btree_and_journal_iter_prefetch(struct btree_and_journal_iter *_iter)
|
||||
{
|
||||
struct btree_and_journal_iter iter = *_iter;
|
||||
struct bch_fs *c = iter.trans->c;
|
||||
unsigned level = iter.journal.level;
|
||||
struct bkey_buf tmp;
|
||||
unsigned nr = test_bit(BCH_FS_started, &c->flags)
|
||||
? (level > 1 ? 0 : 2)
|
||||
: (level > 1 ? 1 : 16);
|
||||
|
||||
iter.prefetch = false;
|
||||
iter.fail_if_too_many_whiteouts = true;
|
||||
bch2_bkey_buf_init(&tmp);
|
||||
|
||||
while (nr--) {
|
||||
bch2_btree_and_journal_iter_advance(&iter);
|
||||
struct bkey_s_c k = bch2_btree_and_journal_iter_peek(&iter);
|
||||
if (!k.k)
|
||||
break;
|
||||
|
||||
bch2_bkey_buf_reassemble(&tmp, c, k);
|
||||
bch2_btree_node_prefetch(iter.trans, NULL, tmp.k, iter.journal.btree_id, level - 1);
|
||||
}
|
||||
|
||||
bch2_bkey_buf_exit(&tmp, c);
|
||||
}
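/*
 * Illustrative note, not part of the original file: the prefetch budget
 * chosen above works out to
 *
 *                            level == 1    level > 1
 *   before BCH_FS_started        16             1
 *   after  BCH_FS_started         2             0
 *
 * i.e. recovery and btree_gc, which walk whole btrees through this iterator
 * before the filesystem is started, read ahead aggressively at the level
 * just above the leaves, while normal runtime use keeps the read-ahead
 * small.  The rationale is an interpretation; the numbers come straight
 * from the expression above.
 */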
|
||||
|
||||
struct bkey_s_c bch2_btree_and_journal_iter_peek(struct btree_and_journal_iter *iter)
|
||||
{
|
||||
struct bkey_s_c btree_k, journal_k = bkey_s_c_null, ret;
|
||||
size_t iters = 0;
|
||||
|
||||
if (iter->prefetch && iter->journal.level)
|
||||
btree_and_journal_iter_prefetch(iter);
|
||||
again:
|
||||
if (iter->at_end)
|
||||
return bkey_s_c_null;
|
||||
|
||||
iters++;
|
||||
|
||||
if (iters > 20 && iter->fail_if_too_many_whiteouts)
|
||||
return bkey_s_c_null;
|
||||
|
||||
while ((btree_k = bch2_journal_iter_peek_btree(iter)).k &&
|
||||
bpos_lt(btree_k.k->p, iter->pos))
|
||||
bch2_journal_iter_advance_btree(iter);
|
||||
|
||||
if (iter->trans->journal_replay_not_finished)
|
||||
while ((journal_k = bch2_journal_iter_peek(&iter->journal)).k &&
|
||||
bpos_lt(journal_k.k->p, iter->pos))
|
||||
bch2_journal_iter_advance(&iter->journal);
|
||||
|
||||
ret = journal_k.k &&
|
||||
(!btree_k.k || bpos_le(journal_k.k->p, btree_k.k->p))
|
||||
? journal_k
|
||||
: btree_k;
|
||||
|
||||
if (ret.k && iter->b && bpos_gt(ret.k->p, iter->b->data->max_key))
|
||||
ret = bkey_s_c_null;
|
||||
|
||||
if (ret.k) {
|
||||
iter->pos = ret.k->p;
|
||||
if (bkey_deleted(ret.k)) {
|
||||
bch2_btree_and_journal_iter_advance(iter);
|
||||
goto again;
|
||||
}
|
||||
} else {
|
||||
iter->pos = SPOS_MAX;
|
||||
iter->at_end = true;
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
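/*
 * Illustrative sketch, not part of the original file: the merge rule used by
 * bch2_btree_and_journal_iter_peek() above, applied to two plain sorted
 * arrays.  On equal positions the journal (overlay) element wins, which is
 * how keys still sitting in the journal shadow what is written in the btree
 * node.  All names are hypothetical.
 */
#include <stddef.h>

struct demo_key { int pos, val; };

static const struct demo_key *
demo_merged_peek(const struct demo_key *btree, size_t nr_btree,
		 const struct demo_key *journal, size_t nr_journal,
		 int min_pos)
{
	const struct demo_key *b = NULL, *j = NULL;

	/* first element at or after min_pos in each sorted stream */
	for (size_t i = 0; i < nr_btree; i++)
		if (btree[i].pos >= min_pos) {
			b = &btree[i];
			break;
		}
	for (size_t i = 0; i < nr_journal; i++)
		if (journal[i].pos >= min_pos) {
			j = &journal[i];
			break;
		}

	/* journal wins ties; the caller advances min_pos past the returned key */
	return j && (!b || j->pos <= b->pos) ? j : b;
}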
|
||||
|
||||
void bch2_btree_and_journal_iter_exit(struct btree_and_journal_iter *iter)
|
||||
{
|
||||
bch2_journal_iter_exit(&iter->journal);
|
||||
}
|
||||
|
||||
void __bch2_btree_and_journal_iter_init_node_iter(struct btree_trans *trans,
|
||||
struct btree_and_journal_iter *iter,
|
||||
struct btree *b,
|
||||
struct btree_node_iter node_iter,
|
||||
struct bpos pos)
|
||||
{
|
||||
memset(iter, 0, sizeof(*iter));
|
||||
|
||||
iter->trans = trans;
|
||||
iter->b = b;
|
||||
iter->node_iter = node_iter;
|
||||
iter->pos = b->data->min_key;
|
||||
iter->at_end = false;
|
||||
INIT_LIST_HEAD(&iter->journal.list);
|
||||
|
||||
if (trans->journal_replay_not_finished) {
|
||||
bch2_journal_iter_init(trans->c, &iter->journal, b->c.btree_id, b->c.level, pos);
|
||||
if (!test_bit(BCH_FS_may_go_rw, &trans->c->flags))
|
||||
list_add(&iter->journal.list, &trans->c->journal_iters);
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* this version is used by btree_gc before filesystem has gone RW and
|
||||
* multithreaded, so uses the journal_iters list:
|
||||
*/
|
||||
void bch2_btree_and_journal_iter_init_node_iter(struct btree_trans *trans,
|
||||
struct btree_and_journal_iter *iter,
|
||||
struct btree *b)
|
||||
{
|
||||
struct btree_node_iter node_iter;
|
||||
|
||||
bch2_btree_node_iter_init_from_start(&node_iter, b);
|
||||
__bch2_btree_and_journal_iter_init_node_iter(trans, iter, b, node_iter, b->data->min_key);
|
||||
}
|
||||
|
||||
/* sort and dedup all keys in the journal: */
|
||||
|
||||
/*
|
||||
* When keys compare equal, oldest compares first:
|
||||
*/
|
||||
static int journal_sort_key_cmp(const void *_l, const void *_r)
|
||||
{
|
||||
const struct journal_key *l = _l;
|
||||
const struct journal_key *r = _r;
|
||||
int rewind = l->rewind && r->rewind ? -1 : 1;
|
||||
|
||||
return journal_key_cmp(l, r) ?:
|
||||
((cmp_int(l->journal_seq, r->journal_seq) ?:
|
||||
cmp_int(l->journal_offset, r->journal_offset)) * rewind);
|
||||
}
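/*
 * Illustrative sketch, not part of the original file: a reduced version of
 * the comparator above.  Entries equal on (level, btree_id, pos) sort
 * oldest-first by (journal_seq, journal_offset); when both entries carry the
 * rewind flag that tiebreak is reversed.  The dedup pass in
 * __journal_keys_sort() keeps only the last entry of an equal run, so
 * normally the newest version survives, and under journal_rewind the version
 * from before the rewind point survives instead.  Names are hypothetical.
 */
#include <stdbool.h>

struct demo_jkey {
	int		level, btree_id;
	long		pos;
	unsigned long	seq;
	unsigned	offset;
	bool		rewind;
};

static int demo_journal_sort_cmp(const struct demo_jkey *l, const struct demo_jkey *r)
{
	int sign = l->rewind && r->rewind ? -1 : 1;

	/* primary key: position in the btree, higher levels first */
	if (l->level != r->level)	return l->level > r->level ? -1 : 1;
	if (l->btree_id != r->btree_id)	return l->btree_id < r->btree_id ? -1 : 1;
	if (l->pos != r->pos)		return l->pos < r->pos ? -1 : 1;

	/* tiebreak: journal order, possibly reversed for rewound keys */
	if (l->seq != r->seq)		return (l->seq < r->seq ? -1 : 1) * sign;
	if (l->offset != r->offset)	return (l->offset < r->offset ? -1 : 1) * sign;
	return 0;
}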
|
||||
|
||||
void bch2_journal_keys_put(struct bch_fs *c)
|
||||
{
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
|
||||
BUG_ON(atomic_read(&keys->ref) <= 0);
|
||||
|
||||
if (!atomic_dec_and_test(&keys->ref))
|
||||
return;
|
||||
|
||||
move_gap(keys, keys->nr);
|
||||
|
||||
darray_for_each(*keys, i) {
|
||||
if (i->overwritten_range &&
|
||||
(i == &darray_last(*keys) ||
|
||||
i->overwritten_range != i[1].overwritten_range))
|
||||
kfree(i->overwritten_range);
|
||||
|
||||
if (i->allocated)
|
||||
kfree(i->k);
|
||||
}
|
||||
|
||||
kvfree(keys->data);
|
||||
keys->data = NULL;
|
||||
keys->nr = keys->gap = keys->size = 0;
|
||||
|
||||
struct journal_replay **i;
|
||||
struct genradix_iter iter;
|
||||
|
||||
genradix_for_each(&c->journal_entries, iter, i)
|
||||
kvfree(*i);
|
||||
genradix_free(&c->journal_entries);
|
||||
}
|
||||
|
||||
static void __journal_keys_sort(struct journal_keys *keys)
|
||||
{
|
||||
sort_nonatomic(keys->data, keys->nr, sizeof(keys->data[0]),
|
||||
journal_sort_key_cmp, NULL);
|
||||
|
||||
cond_resched();
|
||||
|
||||
struct journal_key *dst = keys->data;
|
||||
|
||||
darray_for_each(*keys, src) {
|
||||
/*
|
||||
* We don't accumulate accounting keys here because we have to
|
||||
* compare each individual accounting key against the version in
|
||||
* the btree during replay:
|
||||
*/
|
||||
if (src->k->k.type != KEY_TYPE_accounting &&
|
||||
src + 1 < &darray_top(*keys) &&
|
||||
!journal_key_cmp(src, src + 1))
|
||||
continue;
|
||||
|
||||
*dst++ = *src;
|
||||
}
|
||||
|
||||
keys->nr = dst - keys->data;
|
||||
}
|
||||
|
||||
int bch2_journal_keys_sort(struct bch_fs *c)
|
||||
{
|
||||
struct genradix_iter iter;
|
||||
struct journal_replay *i, **_i;
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
size_t nr_read = 0;
|
||||
|
||||
u64 rewind_seq = c->opts.journal_rewind ?: U64_MAX;
|
||||
|
||||
genradix_for_each(&c->journal_entries, iter, _i) {
|
||||
i = *_i;
|
||||
|
||||
if (journal_replay_ignore(i))
|
||||
continue;
|
||||
|
||||
cond_resched();
|
||||
|
||||
vstruct_for_each(&i->j, entry) {
|
||||
bool rewind = !entry->level &&
|
||||
!btree_id_is_alloc(entry->btree_id) &&
|
||||
le64_to_cpu(i->j.seq) >= rewind_seq;
|
||||
|
||||
if (entry->type != (rewind
|
||||
? BCH_JSET_ENTRY_overwrite
|
||||
: BCH_JSET_ENTRY_btree_keys))
|
||||
continue;
|
||||
|
||||
if (!rewind && le64_to_cpu(i->j.seq) < c->journal_replay_seq_start)
|
||||
continue;
|
||||
|
||||
jset_entry_for_each_key(entry, k) {
|
||||
struct journal_key n = (struct journal_key) {
|
||||
.btree_id = entry->btree_id,
|
||||
.level = entry->level,
|
||||
.rewind = rewind,
|
||||
.k = k,
|
||||
.journal_seq = le64_to_cpu(i->j.seq),
|
||||
.journal_offset = k->_data - i->j._data,
|
||||
};
|
||||
|
||||
if (darray_push(keys, n)) {
|
||||
__journal_keys_sort(keys);
|
||||
|
||||
if (keys->nr * 8 > keys->size * 7) {
|
||||
bch_err(c, "Too many journal keys for slowpath; have %zu compacted, buf size %zu, processed %zu keys at seq %llu",
|
||||
keys->nr, keys->size, nr_read, le64_to_cpu(i->j.seq));
|
||||
return bch_err_throw(c, ENOMEM_journal_keys_sort);
|
||||
}
|
||||
|
||||
BUG_ON(darray_push(keys, n));
|
||||
}
|
||||
|
||||
nr_read++;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
__journal_keys_sort(keys);
|
||||
keys->gap = keys->nr;
|
||||
|
||||
bch_verbose(c, "Journal keys: %zu read, %zu after sorting and compacting", nr_read, keys->nr);
|
||||
return 0;
|
||||
}
|
||||
|
||||
void bch2_shoot_down_journal_keys(struct bch_fs *c, enum btree_id btree,
|
||||
unsigned level_min, unsigned level_max,
|
||||
struct bpos start, struct bpos end)
|
||||
{
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
size_t dst = 0;
|
||||
|
||||
move_gap(keys, keys->nr);
|
||||
|
||||
darray_for_each(*keys, i)
|
||||
if (!(i->btree_id == btree &&
|
||||
i->level >= level_min &&
|
||||
i->level <= level_max &&
|
||||
bpos_ge(i->k->k.p, start) &&
|
||||
bpos_le(i->k->k.p, end)))
|
||||
keys->data[dst++] = *i;
|
||||
keys->nr = keys->gap = dst;
|
||||
}
|
||||
|
||||
void bch2_journal_keys_dump(struct bch_fs *c)
|
||||
{
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
struct printbuf buf = PRINTBUF;
|
||||
|
||||
pr_info("%zu keys:", keys->nr);
|
||||
|
||||
move_gap(keys, keys->nr);
|
||||
|
||||
darray_for_each(*keys, i) {
|
||||
printbuf_reset(&buf);
|
||||
prt_printf(&buf, "btree=");
|
||||
bch2_btree_id_to_text(&buf, i->btree_id);
|
||||
prt_printf(&buf, " l=%u ", i->level);
|
||||
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(i->k));
|
||||
pr_err("%s", buf.buf);
|
||||
}
|
||||
printbuf_exit(&buf);
|
||||
}
|
||||
|
||||
void bch2_fs_journal_keys_init(struct bch_fs *c)
|
||||
{
|
||||
struct journal_keys *keys = &c->journal_keys;
|
||||
|
||||
atomic_set(&keys->ref, 1);
|
||||
keys->initial_ref_held = true;
|
||||
mutex_init(&keys->overwrite_lock);
|
||||
}
|
||||
@@ -1,102 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_JOURNAL_ITER_H
|
||||
#define _BCACHEFS_BTREE_JOURNAL_ITER_H
|
||||
|
||||
#include "bkey.h"
|
||||
|
||||
struct journal_iter {
|
||||
struct list_head list;
|
||||
enum btree_id btree_id;
|
||||
unsigned level;
|
||||
size_t idx;
|
||||
struct journal_keys *keys;
|
||||
};
|
||||
|
||||
/*
|
||||
* Iterate over keys in the btree, with keys from the journal overlaid on top:
|
||||
*/
|
||||
|
||||
struct btree_and_journal_iter {
|
||||
struct btree_trans *trans;
|
||||
struct btree *b;
|
||||
struct btree_node_iter node_iter;
|
||||
struct bkey unpacked;
|
||||
|
||||
struct journal_iter journal;
|
||||
struct bpos pos;
|
||||
bool at_end;
|
||||
bool prefetch;
|
||||
bool fail_if_too_many_whiteouts;
|
||||
};
|
||||
|
||||
static inline int __journal_key_btree_cmp(enum btree_id l_btree_id,
|
||||
unsigned l_level,
|
||||
const struct journal_key *r)
|
||||
{
|
||||
return -cmp_int(l_level, r->level) ?:
|
||||
cmp_int(l_btree_id, r->btree_id);
|
||||
}
|
||||
|
||||
static inline int __journal_key_cmp(enum btree_id l_btree_id,
|
||||
unsigned l_level,
|
||||
struct bpos l_pos,
|
||||
const struct journal_key *r)
|
||||
{
|
||||
return __journal_key_btree_cmp(l_btree_id, l_level, r) ?:
|
||||
bpos_cmp(l_pos, r->k->k.p);
|
||||
}
|
||||
|
||||
static inline int journal_key_cmp(const struct journal_key *l, const struct journal_key *r)
|
||||
{
|
||||
return __journal_key_cmp(l->btree_id, l->level, l->k->k.p, r);
|
||||
}
|
||||
|
||||
struct bkey_i *bch2_journal_keys_peek_max(struct bch_fs *, enum btree_id,
|
||||
unsigned, struct bpos, struct bpos, size_t *);
|
||||
struct bkey_i *bch2_journal_keys_peek_prev_min(struct bch_fs *, enum btree_id,
|
||||
unsigned, struct bpos, struct bpos, size_t *);
|
||||
struct bkey_i *bch2_journal_keys_peek_slot(struct bch_fs *, enum btree_id,
|
||||
unsigned, struct bpos);
|
||||
|
||||
int bch2_btree_and_journal_iter_prefetch(struct btree_trans *, struct btree_path *,
|
||||
struct btree_and_journal_iter *);
|
||||
|
||||
int bch2_journal_key_insert_take(struct bch_fs *, enum btree_id,
|
||||
unsigned, struct bkey_i *);
|
||||
int bch2_journal_key_insert(struct bch_fs *, enum btree_id,
|
||||
unsigned, struct bkey_i *);
|
||||
int bch2_journal_key_delete(struct bch_fs *, enum btree_id,
|
||||
unsigned, struct bpos);
|
||||
bool bch2_key_deleted_in_journal(struct btree_trans *, enum btree_id, unsigned, struct bpos);
|
||||
void bch2_journal_key_overwritten(struct bch_fs *, enum btree_id, unsigned, struct bpos);
|
||||
|
||||
void bch2_btree_and_journal_iter_advance(struct btree_and_journal_iter *);
|
||||
struct bkey_s_c bch2_btree_and_journal_iter_peek(struct btree_and_journal_iter *);
|
||||
|
||||
void bch2_btree_and_journal_iter_exit(struct btree_and_journal_iter *);
|
||||
void __bch2_btree_and_journal_iter_init_node_iter(struct btree_trans *,
|
||||
struct btree_and_journal_iter *, struct btree *,
|
||||
struct btree_node_iter, struct bpos);
|
||||
void bch2_btree_and_journal_iter_init_node_iter(struct btree_trans *,
|
||||
struct btree_and_journal_iter *, struct btree *);
|
||||
|
||||
void bch2_journal_keys_put(struct bch_fs *);
|
||||
|
||||
static inline void bch2_journal_keys_put_initial(struct bch_fs *c)
|
||||
{
|
||||
if (c->journal_keys.initial_ref_held)
|
||||
bch2_journal_keys_put(c);
|
||||
c->journal_keys.initial_ref_held = false;
|
||||
}
|
||||
|
||||
int bch2_journal_keys_sort(struct bch_fs *);
|
||||
|
||||
void bch2_shoot_down_journal_keys(struct bch_fs *, enum btree_id,
|
||||
unsigned, unsigned,
|
||||
struct bpos, struct bpos);
|
||||
|
||||
void bch2_journal_keys_dump(struct bch_fs *);
|
||||
|
||||
void bch2_fs_journal_keys_init(struct bch_fs *);
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_JOURNAL_ITER_H */
|
||||
@@ -1,37 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_JOURNAL_ITER_TYPES_H
|
||||
#define _BCACHEFS_BTREE_JOURNAL_ITER_TYPES_H
|
||||
|
||||
struct journal_key_range_overwritten {
	size_t			start, end;
};

struct journal_key {
	u64			journal_seq;
	u32			journal_offset;
	enum btree_id		btree_id:8;
	unsigned		level:8;
	bool			allocated:1;
	bool			overwritten:1;
	bool			rewind:1;
	struct journal_key_range_overwritten __rcu *
				overwritten_range;
	struct bkey_i		*k;
};

struct journal_keys {
	/* must match layout in darray_types.h */
	size_t			nr, size;
	struct journal_key	*data;
	/*
	 * Gap buffer: instead of all the empty space in the array being at the
	 * end of the buffer - from @nr to @size - the empty space is at @gap.
	 * This means that sequential insertions are O(n) instead of O(n^2).
	 */
	size_t			gap;
	atomic_t		ref;
	bool			initial_ref_held;
	struct mutex		overwrite_lock;
};
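/*
 * Illustrative sketch, not part of the original file: the index mapping a
 * gap buffer like the one above needs.  Logical index i (0..nr-1) maps to an
 * array slot that skips over the (size - nr) free slots sitting at @gap.
 * The helper names mirror the idx_to_pos()/pos_to_idx() calls used by the
 * journal-iter code, but this is an assumed reconstruction, not the original
 * implementation.
 */
#include <stddef.h>

struct demo_gapbuf {
	size_t	nr, size, gap;
};

static size_t demo_idx_to_pos(const struct demo_gapbuf *b, size_t idx)
{
	size_t gap_size = b->size - b->nr;

	return idx < b->gap ? idx : idx + gap_size;
}

static size_t demo_pos_to_idx(const struct demo_gapbuf *b, size_t pos)
{
	size_t gap_size = b->size - b->nr;

	return pos >= b->gap + gap_size ? pos - gap_size : pos;
}

/*
 * Inserting at the gap is then a plain store at array slot @gap, and moving
 * the gap to a new insertion point only has to memmove() the elements
 * between the old and new gap positions - which is what makes a batch of
 * sequential insertions O(n) rather than O(n^2).
 */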
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_JOURNAL_ITER_TYPES_H */
|
||||
@@ -1,880 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "btree_cache.h"
|
||||
#include "btree_iter.h"
|
||||
#include "btree_key_cache.h"
|
||||
#include "btree_locking.h"
|
||||
#include "btree_update.h"
|
||||
#include "errcode.h"
|
||||
#include "error.h"
|
||||
#include "journal.h"
|
||||
#include "journal_reclaim.h"
|
||||
#include "trace.h"
|
||||
|
||||
#include <linux/sched/mm.h>
|
||||
|
||||
static inline bool btree_uses_pcpu_readers(enum btree_id id)
|
||||
{
|
||||
return id == BTREE_ID_subvolumes;
|
||||
}
|
||||
|
||||
static struct kmem_cache *bch2_key_cache;
|
||||
|
||||
static int bch2_btree_key_cache_cmp_fn(struct rhashtable_compare_arg *arg,
|
||||
const void *obj)
|
||||
{
|
||||
const struct bkey_cached *ck = obj;
|
||||
const struct bkey_cached_key *key = arg->key;
|
||||
|
||||
return ck->key.btree_id != key->btree_id ||
|
||||
!bpos_eq(ck->key.pos, key->pos);
|
||||
}
|
||||
|
||||
static const struct rhashtable_params bch2_btree_key_cache_params = {
|
||||
.head_offset = offsetof(struct bkey_cached, hash),
|
||||
.key_offset = offsetof(struct bkey_cached, key),
|
||||
.key_len = sizeof(struct bkey_cached_key),
|
||||
.obj_cmpfn = bch2_btree_key_cache_cmp_fn,
|
||||
.automatic_shrinking = true,
|
||||
};
|
||||
|
||||
static inline void btree_path_cached_set(struct btree_trans *trans, struct btree_path *path,
|
||||
struct bkey_cached *ck,
|
||||
enum btree_node_locked_type lock_held)
|
||||
{
|
||||
path->l[0].lock_seq = six_lock_seq(&ck->c.lock);
|
||||
path->l[0].b = (void *) ck;
|
||||
mark_btree_node_locked(trans, path, 0, lock_held);
|
||||
}
|
||||
|
||||
__flatten
|
||||
inline struct bkey_cached *
|
||||
bch2_btree_key_cache_find(struct bch_fs *c, enum btree_id btree_id, struct bpos pos)
|
||||
{
|
||||
struct bkey_cached_key key = {
|
||||
.btree_id = btree_id,
|
||||
.pos = pos,
|
||||
};
|
||||
|
||||
return rhashtable_lookup_fast(&c->btree_key_cache.table, &key,
|
||||
bch2_btree_key_cache_params);
|
||||
}
|
||||
|
||||
static bool bkey_cached_lock_for_evict(struct bkey_cached *ck)
|
||||
{
|
||||
if (!six_trylock_intent(&ck->c.lock))
|
||||
return false;
|
||||
|
||||
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
|
||||
six_unlock_intent(&ck->c.lock);
|
||||
return false;
|
||||
}
|
||||
|
||||
if (!six_trylock_write(&ck->c.lock)) {
|
||||
six_unlock_intent(&ck->c.lock);
|
||||
return false;
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
static bool bkey_cached_evict(struct btree_key_cache *c,
|
||||
struct bkey_cached *ck)
|
||||
{
|
||||
bool ret = !rhashtable_remove_fast(&c->table, &ck->hash,
|
||||
bch2_btree_key_cache_params);
|
||||
if (ret) {
|
||||
memset(&ck->key, ~0, sizeof(ck->key));
|
||||
atomic_long_dec(&c->nr_keys);
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static void __bkey_cached_free(struct rcu_pending *pending, struct rcu_head *rcu)
|
||||
{
|
||||
struct bch_fs *c = container_of(pending->srcu, struct bch_fs, btree_trans_barrier);
|
||||
struct bkey_cached *ck = container_of(rcu, struct bkey_cached, rcu);
|
||||
|
||||
this_cpu_dec(*c->btree_key_cache.nr_pending);
|
||||
kmem_cache_free(bch2_key_cache, ck);
|
||||
}
|
||||
|
||||
static inline void bkey_cached_free_noassert(struct btree_key_cache *bc,
|
||||
struct bkey_cached *ck)
|
||||
{
|
||||
kfree(ck->k);
|
||||
ck->k = NULL;
|
||||
ck->u64s = 0;
|
||||
|
||||
six_unlock_write(&ck->c.lock);
|
||||
six_unlock_intent(&ck->c.lock);
|
||||
|
||||
bool pcpu_readers = ck->c.lock.readers != NULL;
|
||||
rcu_pending_enqueue(&bc->pending[pcpu_readers], &ck->rcu);
|
||||
this_cpu_inc(*bc->nr_pending);
|
||||
}
|
||||
|
||||
static void bkey_cached_free(struct btree_trans *trans,
|
||||
struct btree_key_cache *bc,
|
||||
struct bkey_cached *ck)
|
||||
{
|
||||
/*
|
||||
* we'll hit strange issues in the SRCU code if we aren't holding an
|
||||
* SRCU read lock...
|
||||
*/
|
||||
EBUG_ON(!trans->srcu_held);
|
||||
|
||||
bkey_cached_free_noassert(bc, ck);
|
||||
}
|
||||
|
||||
static struct bkey_cached *__bkey_cached_alloc(unsigned key_u64s, gfp_t gfp)
|
||||
{
|
||||
gfp |= __GFP_ACCOUNT|__GFP_RECLAIMABLE;
|
||||
|
||||
struct bkey_cached *ck = kmem_cache_zalloc(bch2_key_cache, gfp);
|
||||
if (unlikely(!ck))
|
||||
return NULL;
|
||||
ck->k = kmalloc(key_u64s * sizeof(u64), gfp);
|
||||
if (unlikely(!ck->k)) {
|
||||
kmem_cache_free(bch2_key_cache, ck);
|
||||
return NULL;
|
||||
}
|
||||
ck->u64s = key_u64s;
|
||||
return ck;
|
||||
}
|
||||
|
||||
static struct bkey_cached *
|
||||
bkey_cached_alloc(struct btree_trans *trans, struct btree_path *path, unsigned key_u64s)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct btree_key_cache *bc = &c->btree_key_cache;
|
||||
bool pcpu_readers = btree_uses_pcpu_readers(path->btree_id);
|
||||
int ret;
|
||||
|
||||
struct bkey_cached *ck = container_of_or_null(
|
||||
rcu_pending_dequeue(&bc->pending[pcpu_readers]),
|
||||
struct bkey_cached, rcu);
|
||||
if (ck)
|
||||
goto lock;
|
||||
|
||||
ck = allocate_dropping_locks(trans, ret,
|
||||
__bkey_cached_alloc(key_u64s, _gfp));
|
||||
if (ret) {
|
||||
if (ck)
|
||||
kfree(ck->k);
|
||||
kmem_cache_free(bch2_key_cache, ck);
|
||||
return ERR_PTR(ret);
|
||||
}
|
||||
|
||||
if (ck) {
|
||||
bch2_btree_lock_init(&ck->c, pcpu_readers ? SIX_LOCK_INIT_PCPU : 0, GFP_KERNEL);
|
||||
ck->c.cached = true;
|
||||
goto lock;
|
||||
}
|
||||
|
||||
ck = container_of_or_null(rcu_pending_dequeue_from_all(&bc->pending[pcpu_readers]),
|
||||
struct bkey_cached, rcu);
|
||||
	if (ck)
		goto lock;

	return NULL;
lock:
|
||||
six_lock_intent(&ck->c.lock, NULL, NULL);
|
||||
six_lock_write(&ck->c.lock, NULL, NULL);
|
||||
return ck;
|
||||
}
|
||||
|
||||
static struct bkey_cached *
|
||||
bkey_cached_reuse(struct btree_key_cache *c)
|
||||
{
|
||||
|
||||
guard(rcu)();
|
||||
struct bucket_table *tbl = rht_dereference_rcu(c->table.tbl, &c->table);
|
||||
struct rhash_head *pos;
|
||||
struct bkey_cached *ck;
|
||||
|
||||
for (unsigned i = 0; i < tbl->size; i++)
|
||||
rht_for_each_entry_rcu(ck, pos, tbl, i, hash) {
|
||||
if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags) &&
|
||||
bkey_cached_lock_for_evict(ck)) {
|
||||
if (bkey_cached_evict(c, ck))
|
||||
return ck;
|
||||
six_unlock_write(&ck->c.lock);
|
||||
six_unlock_intent(&ck->c.lock);
|
||||
}
|
||||
}
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static int btree_key_cache_create(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
struct btree_path *ck_path,
|
||||
struct bkey_s_c k)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct btree_key_cache *bc = &c->btree_key_cache;
|
||||
|
||||
/*
|
||||
* bch2_varint_decode can read past the end of the buffer by at
|
||||
* most 7 bytes (it won't be used):
|
||||
*/
|
||||
unsigned key_u64s = k.k->u64s + 1;
|
||||
|
||||
/*
|
||||
* Allocate some extra space so that the transaction commit path is less
|
||||
* likely to have to reallocate, since that requires a transaction
|
||||
* restart:
|
||||
*/
|
||||
key_u64s = min(256U, (key_u64s * 3) / 2);
|
||||
key_u64s = roundup_pow_of_two(key_u64s);
|
||||
|
||||
struct bkey_cached *ck = bkey_cached_alloc(trans, ck_path, key_u64s);
|
||||
int ret = PTR_ERR_OR_ZERO(ck);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
if (unlikely(!ck)) {
|
||||
ck = bkey_cached_reuse(bc);
|
||||
if (unlikely(!ck)) {
|
||||
bch_err(c, "error allocating memory for key cache item, btree %s",
|
||||
bch2_btree_id_str(ck_path->btree_id));
|
||||
return bch_err_throw(c, ENOMEM_btree_key_cache_create);
|
||||
}
|
||||
}
|
||||
|
||||
ck->c.level = 0;
|
||||
ck->c.btree_id = ck_path->btree_id;
|
||||
ck->key.btree_id = ck_path->btree_id;
|
||||
ck->key.pos = ck_path->pos;
|
||||
ck->flags = 1U << BKEY_CACHED_ACCESSED;
|
||||
|
||||
if (unlikely(key_u64s > ck->u64s)) {
|
||||
mark_btree_node_locked_noreset(ck_path, 0, BTREE_NODE_UNLOCKED);
|
||||
|
||||
struct bkey_i *new_k = allocate_dropping_locks(trans, ret,
|
||||
kmalloc(key_u64s * sizeof(u64), _gfp));
|
||||
if (unlikely(!new_k)) {
|
||||
bch_err(trans->c, "error allocating memory for key cache key, btree %s u64s %u",
|
||||
bch2_btree_id_str(ck->key.btree_id), key_u64s);
|
||||
ret = bch_err_throw(c, ENOMEM_btree_key_cache_fill);
|
||||
} else if (ret) {
|
||||
kfree(new_k);
|
||||
goto err;
|
||||
}
|
||||
|
||||
kfree(ck->k);
|
||||
ck->k = new_k;
|
||||
ck->u64s = key_u64s;
|
||||
}
|
||||
|
||||
bkey_reassemble(ck->k, k);
|
||||
|
||||
ret = bch2_btree_node_lock_write(trans, path, &path_l(path)->b->c);
|
||||
if (unlikely(ret))
|
||||
goto err;
|
||||
|
||||
ret = rhashtable_lookup_insert_fast(&bc->table, &ck->hash, bch2_btree_key_cache_params);
|
||||
|
||||
bch2_btree_node_unlock_write(trans, path, path_l(path)->b);
|
||||
|
||||
if (unlikely(ret)) /* raced with another fill? */
|
||||
goto err;
|
||||
|
||||
atomic_long_inc(&bc->nr_keys);
|
||||
six_unlock_write(&ck->c.lock);
|
||||
|
||||
enum six_lock_type lock_want = __btree_lock_want(ck_path, 0);
|
||||
if (lock_want == SIX_LOCK_read)
|
||||
six_lock_downgrade(&ck->c.lock);
|
||||
btree_path_cached_set(trans, ck_path, ck, (enum btree_node_locked_type) lock_want);
|
||||
ck_path->uptodate = BTREE_ITER_UPTODATE;
|
||||
return 0;
|
||||
err:
|
||||
bkey_cached_free(trans, bc, ck);
|
||||
mark_btree_node_locked_noreset(ck_path, 0, BTREE_NODE_UNLOCKED);
|
||||
|
||||
return ret;
|
||||
}
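/*
 * Illustrative worked example, not part of the original file, of the key
 * buffer sizing at the top of btree_key_cache_create() above: one spare u64
 * for the possible bch2_varint_decode() overread, 50% headroom so a slightly
 * larger update doesn't force a reallocation (and transaction restart), a
 * 256-u64 cap, then rounding up to a power of two for the allocator:
 *
 *   k.k->u64s = 9:    9 + 1 = 10;  10 * 3 / 2 = 15;  min(256, 15)  = 15;
 *                     roundup_pow_of_two(15) = 16 u64s  (128 bytes)
 *   k.k->u64s = 200:  201;  201 * 3 / 2 = 301;       min(256, 301) = 256;
 *                     roundup_pow_of_two(256) = 256 u64s  (2048 bytes)
 */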
|
||||
|
||||
static noinline_for_stack void do_trace_key_cache_fill(struct btree_trans *trans,
|
||||
struct btree_path *ck_path,
|
||||
struct bkey_s_c k)
|
||||
{
|
||||
struct printbuf buf = PRINTBUF;
|
||||
|
||||
bch2_bpos_to_text(&buf, ck_path->pos);
|
||||
prt_char(&buf, ' ');
|
||||
bch2_bkey_val_to_text(&buf, trans->c, k);
|
||||
trace_key_cache_fill(trans, buf.buf);
|
||||
printbuf_exit(&buf);
|
||||
}
|
||||
|
||||
static noinline int btree_key_cache_fill(struct btree_trans *trans,
|
||||
btree_path_idx_t ck_path_idx,
|
||||
unsigned flags)
|
||||
{
|
||||
struct btree_path *ck_path = trans->paths + ck_path_idx;
|
||||
|
||||
if (flags & BTREE_ITER_cached_nofill) {
|
||||
ck_path->l[0].b = NULL;
|
||||
return 0;
|
||||
}
|
||||
|
||||
struct bch_fs *c = trans->c;
|
||||
struct btree_iter iter;
|
||||
struct bkey_s_c k;
|
||||
int ret;
|
||||
|
||||
bch2_trans_iter_init(trans, &iter, ck_path->btree_id, ck_path->pos,
|
||||
BTREE_ITER_intent|
|
||||
BTREE_ITER_key_cache_fill|
|
||||
BTREE_ITER_cached_nofill);
|
||||
iter.flags &= ~BTREE_ITER_with_journal;
|
||||
k = bch2_btree_iter_peek_slot(trans, &iter);
|
||||
ret = bkey_err(k);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
/* Recheck after btree lookup, before allocating: */
|
||||
ck_path = trans->paths + ck_path_idx;
|
||||
ret = bch2_btree_key_cache_find(c, ck_path->btree_id, ck_path->pos) ? -EEXIST : 0;
|
||||
if (unlikely(ret))
|
||||
goto out;
|
||||
|
||||
ret = btree_key_cache_create(trans, btree_iter_path(trans, &iter), ck_path, k);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
if (trace_key_cache_fill_enabled())
|
||||
do_trace_key_cache_fill(trans, ck_path, k);
|
||||
out:
|
||||
/* We're not likely to need this iterator again: */
|
||||
bch2_set_btree_iter_dontneed(trans, &iter);
|
||||
err:
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static inline int btree_path_traverse_cached_fast(struct btree_trans *trans,
|
||||
btree_path_idx_t path_idx)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct bkey_cached *ck;
|
||||
struct btree_path *path = trans->paths + path_idx;
|
||||
retry:
|
||||
ck = bch2_btree_key_cache_find(c, path->btree_id, path->pos);
|
||||
if (!ck)
|
||||
return -ENOENT;
|
||||
|
||||
enum six_lock_type lock_want = __btree_lock_want(path, 0);
|
||||
|
||||
int ret = btree_node_lock(trans, path, (void *) ck, 0, lock_want, _THIS_IP_);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
if (ck->key.btree_id != path->btree_id ||
|
||||
!bpos_eq(ck->key.pos, path->pos)) {
|
||||
six_unlock_type(&ck->c.lock, lock_want);
|
||||
goto retry;
|
||||
}
|
||||
|
||||
if (!test_bit(BKEY_CACHED_ACCESSED, &ck->flags))
|
||||
set_bit(BKEY_CACHED_ACCESSED, &ck->flags);
|
||||
|
||||
btree_path_cached_set(trans, path, ck, (enum btree_node_locked_type) lock_want);
|
||||
path->uptodate = BTREE_ITER_UPTODATE;
|
||||
return 0;
|
||||
}
|
||||
|
||||
int bch2_btree_path_traverse_cached(struct btree_trans *trans,
|
||||
btree_path_idx_t path_idx,
|
||||
unsigned flags)
|
||||
{
|
||||
EBUG_ON(trans->paths[path_idx].level);
|
||||
|
||||
int ret;
|
||||
do {
|
||||
ret = btree_path_traverse_cached_fast(trans, path_idx);
|
||||
if (unlikely(ret == -ENOENT))
|
||||
ret = btree_key_cache_fill(trans, path_idx, flags);
|
||||
} while (ret == -EEXIST);
|
||||
|
||||
struct btree_path *path = trans->paths + path_idx;
|
||||
|
||||
if (unlikely(ret)) {
|
||||
path->uptodate = BTREE_ITER_NEED_TRAVERSE;
|
||||
if (!bch2_err_matches(ret, BCH_ERR_transaction_restart)) {
|
||||
btree_node_unlock(trans, path, 0);
|
||||
path->l[0].b = ERR_PTR(ret);
|
||||
}
|
||||
} else {
|
||||
BUG_ON(path->uptodate);
|
||||
BUG_ON(!path->nodes_locked);
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int btree_key_cache_flush_pos(struct btree_trans *trans,
|
||||
struct bkey_cached_key key,
|
||||
u64 journal_seq,
|
||||
unsigned commit_flags,
|
||||
bool evict)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct journal *j = &c->journal;
|
||||
struct btree_iter c_iter, b_iter;
|
||||
struct bkey_cached *ck = NULL;
|
||||
int ret;
|
||||
|
||||
bch2_trans_iter_init(trans, &b_iter, key.btree_id, key.pos,
|
||||
BTREE_ITER_slots|
|
||||
BTREE_ITER_intent|
|
||||
BTREE_ITER_all_snapshots);
|
||||
bch2_trans_iter_init(trans, &c_iter, key.btree_id, key.pos,
|
||||
BTREE_ITER_cached|
|
||||
BTREE_ITER_intent);
|
||||
b_iter.flags &= ~BTREE_ITER_with_key_cache;
|
||||
|
||||
ret = bch2_btree_iter_traverse(trans, &c_iter);
|
||||
if (ret)
|
||||
goto out;
|
||||
|
||||
ck = (void *) btree_iter_path(trans, &c_iter)->l[0].b;
|
||||
if (!ck)
|
||||
goto out;
|
||||
|
||||
if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
|
||||
if (evict)
|
||||
goto evict;
|
||||
goto out;
|
||||
}
|
||||
|
||||
if (journal_seq && ck->journal.seq != journal_seq)
|
||||
goto out;
|
||||
|
||||
trans->journal_res.seq = ck->journal.seq;
|
||||
|
||||
/*
|
||||
* If we're at the end of the journal, we really want to free up space
|
||||
* in the journal right away - we don't want to pin that old journal
|
||||
* sequence number with a new btree node write, we want to re-journal
|
||||
* the update
|
||||
*/
|
||||
if (ck->journal.seq == journal_last_seq(j))
|
||||
commit_flags |= BCH_WATERMARK_reclaim;
|
||||
|
||||
if (ck->journal.seq != journal_last_seq(j) ||
|
||||
!test_bit(JOURNAL_space_low, &c->journal.flags))
|
||||
commit_flags |= BCH_TRANS_COMMIT_no_journal_res;
|
||||
|
||||
struct bkey_s_c btree_k = bch2_btree_iter_peek_slot(trans, &b_iter);
|
||||
ret = bkey_err(btree_k);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
	/*
	 * Check that we're not violating cache coherency rules:
	 */
|
||||
BUG_ON(bkey_deleted(btree_k.k));
|
||||
|
||||
ret = bch2_trans_update(trans, &b_iter, ck->k,
|
||||
BTREE_UPDATE_key_cache_reclaim|
|
||||
BTREE_UPDATE_internal_snapshot_node|
|
||||
BTREE_TRIGGER_norun) ?:
|
||||
bch2_trans_commit(trans, NULL, NULL,
|
||||
BCH_TRANS_COMMIT_no_check_rw|
|
||||
BCH_TRANS_COMMIT_no_enospc|
|
||||
commit_flags);
|
||||
err:
|
||||
bch2_fs_fatal_err_on(ret &&
|
||||
!bch2_err_matches(ret, BCH_ERR_transaction_restart) &&
|
||||
!bch2_err_matches(ret, BCH_ERR_journal_reclaim_would_deadlock) &&
|
||||
!bch2_journal_error(j), c,
|
||||
"flushing key cache: %s", bch2_err_str(ret));
|
||||
if (ret)
|
||||
goto out;
|
||||
|
||||
bch2_journal_pin_drop(j, &ck->journal);
|
||||
|
||||
struct btree_path *path = btree_iter_path(trans, &c_iter);
|
||||
BUG_ON(!btree_node_locked(path, 0));
|
||||
|
||||
if (!evict) {
|
||||
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
|
||||
clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
|
||||
atomic_long_dec(&c->btree_key_cache.nr_dirty);
|
||||
}
|
||||
} else {
|
||||
struct btree_path *path2;
|
||||
unsigned i;
|
||||
evict:
|
||||
trans_for_each_path(trans, path2, i)
|
||||
if (path2 != path)
|
||||
__bch2_btree_path_unlock(trans, path2);
|
||||
|
||||
bch2_btree_node_lock_write_nofail(trans, path, &ck->c);
|
||||
|
||||
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
|
||||
clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
|
||||
atomic_long_dec(&c->btree_key_cache.nr_dirty);
|
||||
}
|
||||
|
||||
mark_btree_node_locked_noreset(path, 0, BTREE_NODE_UNLOCKED);
|
||||
if (bkey_cached_evict(&c->btree_key_cache, ck)) {
|
||||
bkey_cached_free(trans, &c->btree_key_cache, ck);
|
||||
} else {
|
||||
six_unlock_write(&ck->c.lock);
|
||||
six_unlock_intent(&ck->c.lock);
|
||||
}
|
||||
}
|
||||
out:
|
||||
bch2_trans_iter_exit(trans, &b_iter);
|
||||
bch2_trans_iter_exit(trans, &c_iter);
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_btree_key_cache_journal_flush(struct journal *j,
|
||||
struct journal_entry_pin *pin, u64 seq)
|
||||
{
|
||||
struct bch_fs *c = container_of(j, struct bch_fs, journal);
|
||||
struct bkey_cached *ck =
|
||||
container_of(pin, struct bkey_cached, journal);
|
||||
struct bkey_cached_key key;
|
||||
struct btree_trans *trans = bch2_trans_get(c);
|
||||
int srcu_idx = srcu_read_lock(&c->btree_trans_barrier);
|
||||
int ret = 0;
|
||||
|
||||
btree_node_lock_nopath_nofail(trans, &ck->c, SIX_LOCK_read);
|
||||
key = ck->key;
|
||||
|
||||
if (ck->journal.seq != seq ||
|
||||
!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
|
||||
six_unlock_read(&ck->c.lock);
|
||||
goto unlock;
|
||||
}
|
||||
|
||||
if (ck->seq != seq) {
|
||||
bch2_journal_pin_update(&c->journal, ck->seq, &ck->journal,
|
||||
bch2_btree_key_cache_journal_flush);
|
||||
six_unlock_read(&ck->c.lock);
|
||||
goto unlock;
|
||||
}
|
||||
six_unlock_read(&ck->c.lock);
|
||||
|
||||
ret = lockrestart_do(trans,
|
||||
btree_key_cache_flush_pos(trans, key, seq,
|
||||
BCH_TRANS_COMMIT_journal_reclaim, false));
|
||||
unlock:
|
||||
srcu_read_unlock(&c->btree_trans_barrier, srcu_idx);
|
||||
|
||||
bch2_trans_put(trans);
|
||||
return ret;
|
||||
}
|
||||
|
||||
bool bch2_btree_insert_key_cached(struct btree_trans *trans,
|
||||
unsigned flags,
|
||||
struct btree_insert_entry *insert_entry)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct bkey_cached *ck = (void *) (trans->paths + insert_entry->path)->l[0].b;
|
||||
struct bkey_i *insert = insert_entry->k;
|
||||
bool kick_reclaim = false;
|
||||
|
||||
BUG_ON(insert->k.u64s > ck->u64s);
|
||||
|
||||
bkey_copy(ck->k, insert);
|
||||
|
||||
if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
|
||||
EBUG_ON(test_bit(BCH_FS_clean_shutdown, &c->flags));
|
||||
set_bit(BKEY_CACHED_DIRTY, &ck->flags);
|
||||
atomic_long_inc(&c->btree_key_cache.nr_dirty);
|
||||
|
||||
if (bch2_nr_btree_keys_need_flush(c))
|
||||
kick_reclaim = true;
|
||||
}
|
||||
|
||||
/*
|
||||
* To minimize lock contention, we only add the journal pin here and
|
||||
* defer pin updates to the flush callback via ->seq. Be careful not to
|
||||
* update ->seq on nojournal commits because we don't want to update the
|
||||
* pin to a seq that doesn't include journal updates on disk. Otherwise
|
||||
* we risk losing the update after a crash.
|
||||
*
|
||||
* The only exception is if the pin is not active in the first place. We
|
||||
* have to add the pin because journal reclaim drives key cache
|
||||
* flushing. The flush callback will not proceed unless ->seq matches
|
||||
* the latest pin, so make sure it starts with a consistent value.
|
||||
*/
|
||||
if (!(insert_entry->flags & BTREE_UPDATE_nojournal) ||
|
||||
!journal_pin_active(&ck->journal)) {
|
||||
ck->seq = trans->journal_res.seq;
|
||||
}
|
||||
bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
|
||||
&ck->journal, bch2_btree_key_cache_journal_flush);
|
||||
|
||||
if (kick_reclaim)
|
||||
journal_reclaim_kick(&c->journal);
|
||||
return true;
|
||||
}
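/*
 * Illustrative sketch, not part of the original file: the ->seq update rule
 * described in the comment above, pulled out as a plain predicate.  The name
 * is hypothetical; in the real code the check is open-coded in
 * bch2_btree_insert_key_cached().
 */
#include <stdbool.h>

static bool demo_should_update_seq(bool commit_is_journaled, bool pin_already_active)
{
	/*
	 * A nojournal commit must not move ->seq forward to a sequence number
	 * that has no corresponding journal entry on disk; the exception is
	 * when no pin is active yet, where ->seq needs some consistent
	 * starting value so the flush callback's ->seq comparison can work.
	 */
	return commit_is_journaled || !pin_already_active;
}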
|
||||
|
||||
void bch2_btree_key_cache_drop(struct btree_trans *trans,
|
||||
struct btree_path *path)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct btree_key_cache *bc = &c->btree_key_cache;
|
||||
struct bkey_cached *ck = (void *) path->l[0].b;
|
||||
|
||||
/*
|
||||
* We just did an update to the btree, bypassing the key cache: the key
|
||||
* cache key is now stale and must be dropped, even if dirty:
|
||||
*/
|
||||
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
|
||||
clear_bit(BKEY_CACHED_DIRTY, &ck->flags);
|
||||
atomic_long_dec(&c->btree_key_cache.nr_dirty);
|
||||
bch2_journal_pin_drop(&c->journal, &ck->journal);
|
||||
}
|
||||
|
||||
bkey_cached_evict(bc, ck);
|
||||
bkey_cached_free(trans, bc, ck);
|
||||
|
||||
mark_btree_node_locked(trans, path, 0, BTREE_NODE_UNLOCKED);
|
||||
|
||||
struct btree_path *path2;
|
||||
unsigned i;
|
||||
trans_for_each_path(trans, path2, i)
|
||||
if (path2->l[0].b == (void *) ck) {
|
||||
/*
|
||||
* It's safe to clear should_be_locked here because
|
||||
* we're evicting from the key cache, and we still have
|
||||
* the underlying btree locked: filling into the key
|
||||
* cache would require taking a write lock on the btree
|
||||
* node
|
||||
*/
|
||||
path2->should_be_locked = false;
|
||||
__bch2_btree_path_unlock(trans, path2);
|
||||
path2->l[0].b = ERR_PTR(-BCH_ERR_no_btree_node_drop);
|
||||
btree_path_set_dirty(trans, path2, BTREE_ITER_NEED_TRAVERSE);
|
||||
}
|
||||
|
||||
bch2_trans_verify_locks(trans);
|
||||
}
|
||||
|
||||
static unsigned long bch2_btree_key_cache_scan(struct shrinker *shrink,
|
||||
struct shrink_control *sc)
|
||||
{
|
||||
struct bch_fs *c = shrink->private_data;
|
||||
struct btree_key_cache *bc = &c->btree_key_cache;
|
||||
struct bucket_table *tbl;
|
||||
struct bkey_cached *ck;
|
||||
size_t scanned = 0, freed = 0, nr = sc->nr_to_scan;
|
||||
unsigned iter, start;
|
||||
int srcu_idx;
|
||||
|
||||
srcu_idx = srcu_read_lock(&c->btree_trans_barrier);
|
||||
rcu_read_lock();
|
||||
|
||||
tbl = rht_dereference_rcu(bc->table.tbl, &bc->table);
|
||||
|
||||
/*
|
||||
* Scanning is expensive while a rehash is in progress - most elements
|
||||
* will be on the new hashtable, if it's in progress
|
||||
*
|
||||
* A rehash could still start while we're scanning - that's ok, we'll
|
||||
* still see most elements.
|
||||
*/
|
||||
if (unlikely(tbl->nest)) {
|
||||
rcu_read_unlock();
|
||||
srcu_read_unlock(&c->btree_trans_barrier, srcu_idx);
|
||||
return SHRINK_STOP;
|
||||
}
|
||||
|
||||
iter = bc->shrink_iter;
|
||||
if (iter >= tbl->size)
|
||||
iter = 0;
|
||||
start = iter;
|
||||
|
||||
do {
|
||||
struct rhash_head *pos, *next;
|
||||
|
||||
pos = rht_ptr_rcu(&tbl->buckets[iter]);
|
||||
|
||||
while (!rht_is_a_nulls(pos)) {
|
||||
next = rht_dereference_bucket_rcu(pos->next, tbl, iter);
|
||||
ck = container_of(pos, struct bkey_cached, hash);
|
||||
|
||||
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
|
||||
bc->skipped_dirty++;
|
||||
} else if (test_bit(BKEY_CACHED_ACCESSED, &ck->flags)) {
|
||||
clear_bit(BKEY_CACHED_ACCESSED, &ck->flags);
|
||||
bc->skipped_accessed++;
|
||||
} else if (!bkey_cached_lock_for_evict(ck)) {
|
||||
bc->skipped_lock_fail++;
|
||||
} else if (bkey_cached_evict(bc, ck)) {
|
||||
bkey_cached_free_noassert(bc, ck);
|
||||
bc->freed++;
|
||||
freed++;
|
||||
} else {
|
||||
six_unlock_write(&ck->c.lock);
|
||||
six_unlock_intent(&ck->c.lock);
|
||||
}
|
||||
|
||||
scanned++;
|
||||
if (scanned >= nr)
|
||||
goto out;
|
||||
|
||||
pos = next;
|
||||
}
|
||||
|
||||
iter++;
|
||||
if (iter >= tbl->size)
|
||||
iter = 0;
|
||||
} while (scanned < nr && iter != start);
|
||||
out:
|
||||
bc->shrink_iter = iter;
|
||||
|
||||
rcu_read_unlock();
|
||||
srcu_read_unlock(&c->btree_trans_barrier, srcu_idx);
|
||||
|
||||
return freed;
|
||||
}
|
||||
|
||||
static unsigned long bch2_btree_key_cache_count(struct shrinker *shrink,
|
||||
struct shrink_control *sc)
|
||||
{
|
||||
struct bch_fs *c = shrink->private_data;
|
||||
struct btree_key_cache *bc = &c->btree_key_cache;
|
||||
long nr = atomic_long_read(&bc->nr_keys) -
|
||||
atomic_long_read(&bc->nr_dirty);
|
||||
|
||||
/*
|
||||
* Avoid hammering our shrinker too much if it's nearly empty - the
|
||||
* shrinker code doesn't take into account how big our cache is, if it's
|
||||
* mostly empty but the system is under memory pressure it causes nasty
|
||||
* lock contention:
|
||||
*/
|
||||
nr -= 128;
|
||||
|
||||
return max(0L, nr);
|
||||
}
|
||||
|
||||
void bch2_fs_btree_key_cache_exit(struct btree_key_cache *bc)
|
||||
{
|
||||
struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
|
||||
struct bucket_table *tbl;
|
||||
struct bkey_cached *ck;
|
||||
struct rhash_head *pos;
|
||||
LIST_HEAD(items);
|
||||
unsigned i;
|
||||
|
||||
shrinker_free(bc->shrink);
|
||||
|
||||
/*
|
||||
* The loop is needed to guard against racing with rehash:
|
||||
*/
|
||||
while (atomic_long_read(&bc->nr_keys)) {
|
||||
rcu_read_lock();
|
||||
tbl = rht_dereference_rcu(bc->table.tbl, &bc->table);
|
||||
if (tbl) {
|
||||
if (tbl->nest) {
|
||||
/* wait for in progress rehash */
|
||||
rcu_read_unlock();
|
||||
mutex_lock(&bc->table.mutex);
|
||||
mutex_unlock(&bc->table.mutex);
|
||||
continue;
|
||||
}
|
||||
for (i = 0; i < tbl->size; i++)
|
||||
while (pos = rht_ptr_rcu(&tbl->buckets[i]), !rht_is_a_nulls(pos)) {
|
||||
ck = container_of(pos, struct bkey_cached, hash);
|
||||
BUG_ON(!bkey_cached_evict(bc, ck));
|
||||
kfree(ck->k);
|
||||
kmem_cache_free(bch2_key_cache, ck);
|
||||
}
|
||||
}
|
||||
rcu_read_unlock();
|
||||
}
|
||||
|
||||
if (atomic_long_read(&bc->nr_dirty) &&
|
||||
!bch2_journal_error(&c->journal) &&
|
||||
test_bit(BCH_FS_was_rw, &c->flags))
|
||||
panic("btree key cache shutdown error: nr_dirty nonzero (%li)\n",
|
||||
atomic_long_read(&bc->nr_dirty));
|
||||
|
||||
if (atomic_long_read(&bc->nr_keys))
|
||||
panic("btree key cache shutdown error: nr_keys nonzero (%li)\n",
|
||||
atomic_long_read(&bc->nr_keys));
|
||||
|
||||
if (bc->table_init_done)
|
||||
rhashtable_destroy(&bc->table);
|
||||
|
||||
rcu_pending_exit(&bc->pending[0]);
|
||||
rcu_pending_exit(&bc->pending[1]);
|
||||
|
||||
free_percpu(bc->nr_pending);
|
||||
}
|
||||
|
||||
void bch2_fs_btree_key_cache_init_early(struct btree_key_cache *c)
|
||||
{
|
||||
}
|
||||
|
||||
int bch2_fs_btree_key_cache_init(struct btree_key_cache *bc)
|
||||
{
|
||||
struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
|
||||
struct shrinker *shrink;
|
||||
|
||||
bc->nr_pending = alloc_percpu(size_t);
|
||||
if (!bc->nr_pending)
|
||||
return bch_err_throw(c, ENOMEM_fs_btree_cache_init);
|
||||
|
||||
if (rcu_pending_init(&bc->pending[0], &c->btree_trans_barrier, __bkey_cached_free) ||
|
||||
rcu_pending_init(&bc->pending[1], &c->btree_trans_barrier, __bkey_cached_free))
|
||||
return bch_err_throw(c, ENOMEM_fs_btree_cache_init);
|
||||
|
||||
if (rhashtable_init(&bc->table, &bch2_btree_key_cache_params))
|
||||
return bch_err_throw(c, ENOMEM_fs_btree_cache_init);
|
||||
|
||||
bc->table_init_done = true;
|
||||
|
||||
shrink = shrinker_alloc(0, "%s-btree_key_cache", c->name);
|
||||
if (!shrink)
|
||||
return bch_err_throw(c, ENOMEM_fs_btree_cache_init);
|
||||
bc->shrink = shrink;
|
||||
shrink->count_objects = bch2_btree_key_cache_count;
|
||||
shrink->scan_objects = bch2_btree_key_cache_scan;
|
||||
shrink->batch = 1 << 14;
|
||||
shrink->seeks = 0;
|
||||
shrink->private_data = c;
|
||||
shrinker_register(shrink);
|
||||
return 0;
|
||||
}
|
||||
|
||||
void bch2_btree_key_cache_to_text(struct printbuf *out, struct btree_key_cache *bc)
|
||||
{
|
||||
printbuf_tabstop_push(out, 24);
|
||||
printbuf_tabstop_push(out, 12);
|
||||
|
||||
prt_printf(out, "keys:\t%lu\r\n", atomic_long_read(&bc->nr_keys));
|
||||
prt_printf(out, "dirty:\t%lu\r\n", atomic_long_read(&bc->nr_dirty));
|
||||
prt_printf(out, "table size:\t%u\r\n", bc->table.tbl->size);
|
||||
prt_newline(out);
|
||||
prt_printf(out, "shrinker:\n");
|
||||
prt_printf(out, "requested_to_free:\t%lu\r\n", bc->requested_to_free);
|
||||
prt_printf(out, "freed:\t%lu\r\n", bc->freed);
|
||||
prt_printf(out, "skipped_dirty:\t%lu\r\n", bc->skipped_dirty);
|
||||
prt_printf(out, "skipped_accessed:\t%lu\r\n", bc->skipped_accessed);
|
||||
prt_printf(out, "skipped_lock_fail:\t%lu\r\n", bc->skipped_lock_fail);
|
||||
prt_newline(out);
|
||||
prt_printf(out, "pending:\t%zu\r\n", per_cpu_sum(bc->nr_pending));
|
||||
}
|
||||
|
||||
void bch2_btree_key_cache_exit(void)
|
||||
{
|
||||
kmem_cache_destroy(bch2_key_cache);
|
||||
}
|
||||
|
||||
int __init bch2_btree_key_cache_init(void)
|
||||
{
|
||||
bch2_key_cache = KMEM_CACHE(bkey_cached, SLAB_RECLAIM_ACCOUNT);
|
||||
if (!bch2_key_cache)
|
||||
return -ENOMEM;
|
||||
|
||||
return 0;
|
||||
}
|
||||
@@ -1,59 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_KEY_CACHE_H
|
||||
#define _BCACHEFS_BTREE_KEY_CACHE_H
|
||||
|
||||
static inline size_t bch2_nr_btree_keys_need_flush(struct bch_fs *c)
|
||||
{
|
||||
size_t nr_dirty = atomic_long_read(&c->btree_key_cache.nr_dirty);
|
||||
size_t nr_keys = atomic_long_read(&c->btree_key_cache.nr_keys);
|
||||
size_t max_dirty = 1024 + nr_keys / 2;
|
||||
|
||||
return max_t(ssize_t, 0, nr_dirty - max_dirty);
|
||||
}
|
||||
|
||||
static inline ssize_t __bch2_btree_key_cache_must_wait(struct bch_fs *c)
|
||||
{
|
||||
size_t nr_dirty = atomic_long_read(&c->btree_key_cache.nr_dirty);
|
||||
size_t nr_keys = atomic_long_read(&c->btree_key_cache.nr_keys);
|
||||
size_t max_dirty = 4096 + (nr_keys * 3) / 4;
|
||||
|
||||
return nr_dirty - max_dirty;
|
||||
}
|
||||
|
||||
static inline bool bch2_btree_key_cache_must_wait(struct bch_fs *c)
|
||||
{
|
||||
return __bch2_btree_key_cache_must_wait(c) > 0;
|
||||
}
|
||||
|
||||
static inline bool bch2_btree_key_cache_wait_done(struct bch_fs *c)
|
||||
{
|
||||
size_t nr_dirty = atomic_long_read(&c->btree_key_cache.nr_dirty);
|
||||
size_t nr_keys = atomic_long_read(&c->btree_key_cache.nr_keys);
|
||||
size_t max_dirty = 2048 + (nr_keys * 5) / 8;
|
||||
|
||||
return nr_dirty <= max_dirty;
|
||||
}
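/*
 * Illustrative worked example, not part of the original file, of the three
 * thresholds above for a cache holding nr_keys = 80000:
 *
 *   bch2_nr_btree_keys_need_flush()  > 0 above  1024 + 80000/2    = 41024 dirty
 *   bch2_btree_key_cache_must_wait()     above  4096 + 80000*3/4  = 64096 dirty
 *   bch2_btree_key_cache_wait_done()     at     2048 + 80000*5/8  = 52048 dirty
 *
 * The gap between must_wait (64096) and wait_done (52048) gives throttled
 * writers hysteresis: once they start waiting they keep waiting until
 * reclaim has made real progress, instead of bouncing on a single threshold.
 */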
|
||||
|
||||
int bch2_btree_key_cache_journal_flush(struct journal *,
|
||||
struct journal_entry_pin *, u64);
|
||||
|
||||
struct bkey_cached *
|
||||
bch2_btree_key_cache_find(struct bch_fs *, enum btree_id, struct bpos);
|
||||
|
||||
int bch2_btree_path_traverse_cached(struct btree_trans *, btree_path_idx_t, unsigned);
|
||||
|
||||
bool bch2_btree_insert_key_cached(struct btree_trans *, unsigned,
|
||||
struct btree_insert_entry *);
|
||||
void bch2_btree_key_cache_drop(struct btree_trans *,
|
||||
struct btree_path *);
|
||||
|
||||
void bch2_fs_btree_key_cache_exit(struct btree_key_cache *);
|
||||
void bch2_fs_btree_key_cache_init_early(struct btree_key_cache *);
|
||||
int bch2_fs_btree_key_cache_init(struct btree_key_cache *);
|
||||
|
||||
void bch2_btree_key_cache_to_text(struct printbuf *, struct btree_key_cache *);
|
||||
|
||||
void bch2_btree_key_cache_exit(void);
|
||||
int __init bch2_btree_key_cache_init(void);
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_KEY_CACHE_H */
|
||||
@@ -1,34 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_KEY_CACHE_TYPES_H
|
||||
#define _BCACHEFS_BTREE_KEY_CACHE_TYPES_H
|
||||
|
||||
#include "rcu_pending.h"
|
||||
|
||||
struct btree_key_cache {
|
||||
struct rhashtable table;
|
||||
bool table_init_done;
|
||||
|
||||
struct shrinker *shrink;
|
||||
unsigned shrink_iter;
|
||||
|
||||
/* 0: non pcpu reader locks, 1: pcpu reader locks */
|
||||
struct rcu_pending pending[2];
|
||||
size_t __percpu *nr_pending;
|
||||
|
||||
atomic_long_t nr_keys;
|
||||
atomic_long_t nr_dirty;
|
||||
|
||||
/* shrinker stats */
|
||||
unsigned long requested_to_free;
|
||||
unsigned long freed;
|
||||
unsigned long skipped_dirty;
|
||||
unsigned long skipped_accessed;
|
||||
unsigned long skipped_lock_fail;
|
||||
};
|
||||
|
||||
struct bkey_cached_key {
|
||||
u32 btree_id;
|
||||
struct bpos pos;
|
||||
} __packed __aligned(4);
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_KEY_CACHE_TYPES_H */
|
||||
@@ -1,936 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "btree_cache.h"
|
||||
#include "btree_locking.h"
|
||||
#include "btree_types.h"
|
||||
|
||||
static struct lock_class_key bch2_btree_node_lock_key;
|
||||
|
||||
void bch2_btree_lock_init(struct btree_bkey_cached_common *b,
|
||||
enum six_lock_init_flags flags,
|
||||
gfp_t gfp)
|
||||
{
|
||||
__six_lock_init(&b->lock, "b->c.lock", &bch2_btree_node_lock_key, flags, gfp);
|
||||
lockdep_set_notrack_class(&b->lock);
|
||||
}
|
||||
|
||||
/* Btree node locking: */
|
||||
|
||||
struct six_lock_count bch2_btree_node_lock_counts(struct btree_trans *trans,
|
||||
struct btree_path *skip,
|
||||
struct btree_bkey_cached_common *b,
|
||||
unsigned level)
|
||||
{
|
||||
struct btree_path *path;
|
||||
struct six_lock_count ret;
|
||||
unsigned i;
|
||||
|
||||
memset(&ret, 0, sizeof(ret));
|
||||
|
||||
if (IS_ERR_OR_NULL(b))
|
||||
return ret;
|
||||
|
||||
trans_for_each_path(trans, path, i)
|
||||
if (path != skip && &path->l[level].b->c == b) {
|
||||
int t = btree_node_locked_type(path, level);
|
||||
|
||||
if (t != BTREE_NODE_UNLOCKED)
|
||||
ret.n[t]++;
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
/* unlock */
|
||||
|
||||
void bch2_btree_node_unlock_write(struct btree_trans *trans,
|
||||
struct btree_path *path, struct btree *b)
|
||||
{
|
||||
bch2_btree_node_unlock_write_inlined(trans, path, b);
|
||||
}
|
||||
|
||||
/* lock */
|
||||
|
||||
/*
|
||||
* @trans wants to lock @b with type @type
|
||||
*/
|
||||
struct trans_waiting_for_lock {
|
||||
struct btree_trans *trans;
|
||||
struct btree_bkey_cached_common *node_want;
|
||||
enum six_lock_type lock_want;
|
||||
|
||||
/* for iterating over held locks :*/
|
||||
u8 path_idx;
|
||||
u8 level;
|
||||
u64 lock_start_time;
|
||||
};
|
||||
|
||||
struct lock_graph {
|
||||
struct trans_waiting_for_lock g[8];
|
||||
unsigned nr;
|
||||
};
|
||||
|
||||
static noinline void print_cycle(struct printbuf *out, struct lock_graph *g)
|
||||
{
|
||||
struct trans_waiting_for_lock *i;
|
||||
|
||||
prt_printf(out, "Found lock cycle (%u entries):\n", g->nr);
|
||||
|
||||
for (i = g->g; i < g->g + g->nr; i++) {
|
||||
struct task_struct *task = READ_ONCE(i->trans->locking_wait.task);
|
||||
if (!task)
|
||||
continue;
|
||||
|
||||
bch2_btree_trans_to_text(out, i->trans);
|
||||
bch2_prt_task_backtrace(out, task, i == g->g ? 5 : 1, GFP_NOWAIT);
|
||||
}
|
||||
}
|
||||
|
||||
static noinline void print_chain(struct printbuf *out, struct lock_graph *g)
|
||||
{
|
||||
struct trans_waiting_for_lock *i;
|
||||
|
||||
for (i = g->g; i != g->g + g->nr; i++) {
|
||||
struct task_struct *task = READ_ONCE(i->trans->locking_wait.task);
|
||||
if (i != g->g)
|
||||
prt_str(out, "<- ");
|
||||
prt_printf(out, "%u ", task ? task->pid : 0);
|
||||
}
|
||||
prt_newline(out);
|
||||
}
|
||||
|
||||
static void lock_graph_up(struct lock_graph *g)
|
||||
{
|
||||
closure_put(&g->g[--g->nr].trans->ref);
|
||||
}
|
||||
|
||||
static noinline void lock_graph_pop_all(struct lock_graph *g)
|
||||
{
|
||||
while (g->nr)
|
||||
lock_graph_up(g);
|
||||
}
|
||||
|
||||
static noinline void lock_graph_pop_from(struct lock_graph *g, struct trans_waiting_for_lock *i)
|
||||
{
|
||||
while (g->g + g->nr > i)
|
||||
lock_graph_up(g);
|
||||
}
|
||||
|
||||
static void __lock_graph_down(struct lock_graph *g, struct btree_trans *trans)
|
||||
{
|
||||
g->g[g->nr++] = (struct trans_waiting_for_lock) {
|
||||
.trans = trans,
|
||||
.node_want = trans->locking,
|
||||
.lock_want = trans->locking_wait.lock_want,
|
||||
};
|
||||
}
|
||||
|
||||
static void lock_graph_down(struct lock_graph *g, struct btree_trans *trans)
|
||||
{
|
||||
closure_get(&trans->ref);
|
||||
__lock_graph_down(g, trans);
|
||||
}
|
||||
|
||||
static bool lock_graph_remove_non_waiters(struct lock_graph *g,
|
||||
struct trans_waiting_for_lock *from)
|
||||
{
|
||||
struct trans_waiting_for_lock *i;
|
||||
|
||||
if (from->trans->locking != from->node_want) {
|
||||
lock_graph_pop_from(g, from);
|
||||
return true;
|
||||
}
|
||||
|
||||
for (i = from + 1; i < g->g + g->nr; i++)
|
||||
if (i->trans->locking != i->node_want ||
|
||||
i->trans->locking_wait.start_time != i[-1].lock_start_time) {
|
||||
lock_graph_pop_from(g, i);
|
||||
return true;
|
||||
}
|
||||
|
||||
return false;
|
||||
}
|
||||
|
||||
static void trace_would_deadlock(struct lock_graph *g, struct btree_trans *trans)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
|
||||
count_event(c, trans_restart_would_deadlock);
|
||||
|
||||
if (trace_trans_restart_would_deadlock_enabled()) {
|
||||
struct printbuf buf = PRINTBUF;
|
||||
|
||||
buf.atomic++;
|
||||
print_cycle(&buf, g);
|
||||
|
||||
trace_trans_restart_would_deadlock(trans, buf.buf);
|
||||
printbuf_exit(&buf);
|
||||
}
|
||||
}
|
||||
|
||||
static int abort_lock(struct lock_graph *g, struct trans_waiting_for_lock *i)
|
||||
{
|
||||
if (i == g->g) {
|
||||
trace_would_deadlock(g, i->trans);
|
||||
return btree_trans_restart_foreign_task(i->trans,
|
||||
BCH_ERR_transaction_restart_would_deadlock,
|
||||
_THIS_IP_);
|
||||
} else {
|
||||
i->trans->lock_must_abort = true;
|
||||
wake_up_process(i->trans->locking_wait.task);
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
static int btree_trans_abort_preference(struct btree_trans *trans)
|
||||
{
|
||||
if (trans->lock_may_not_fail)
|
||||
return 0;
|
||||
if (trans->locking_wait.lock_want == SIX_LOCK_write)
|
||||
return 1;
|
||||
if (!trans->in_traverse_all)
|
||||
return 2;
|
||||
return 3;
|
||||
}
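/*
 * Illustrative note, not part of the original file: break_cycle() below picks
 * the transaction with the highest value returned here as the deadlock
 * victim, so this is effectively a ranking of how acceptable it is to
 * restart each transaction:
 *
 *   0  lock_may_not_fail set       - never chosen as the victim
 *   1  waiting for a write lock
 *   2  ordinary traversal
 *   3  currently in traverse_all   - most preferred victim
 *
 * The ordering comes from the code above; the reading of why each class
 * ranks where it does is an interpretation.
 */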
|
||||
|
||||
static noinline __noreturn void break_cycle_fail(struct lock_graph *g)
|
||||
{
|
||||
struct printbuf buf = PRINTBUF;
|
||||
buf.atomic++;
|
||||
|
||||
prt_printf(&buf, bch2_fmt(g->g->trans->c, "cycle of nofail locks"));
|
||||
|
||||
for (struct trans_waiting_for_lock *i = g->g; i < g->g + g->nr; i++) {
|
||||
struct btree_trans *trans = i->trans;
|
||||
|
||||
bch2_btree_trans_to_text(&buf, trans);
|
||||
|
||||
prt_printf(&buf, "backtrace:\n");
|
||||
printbuf_indent_add(&buf, 2);
|
||||
bch2_prt_task_backtrace(&buf, trans->locking_wait.task, 2, GFP_NOWAIT);
|
||||
printbuf_indent_sub(&buf, 2);
|
||||
prt_newline(&buf);
|
||||
}
|
||||
|
||||
bch2_print_str(g->g->trans->c, KERN_ERR, buf.buf);
|
||||
printbuf_exit(&buf);
|
||||
BUG();
|
||||
}
|
||||
|
||||
static noinline int break_cycle(struct lock_graph *g, struct printbuf *cycle,
|
||||
struct trans_waiting_for_lock *from)
|
||||
{
|
||||
struct trans_waiting_for_lock *i, *abort = NULL;
|
||||
unsigned best = 0, pref;
|
||||
int ret;
|
||||
|
||||
if (lock_graph_remove_non_waiters(g, from))
|
||||
return 0;
|
||||
|
||||
/* Only checking, for debugfs: */
|
||||
if (cycle) {
|
||||
print_cycle(cycle, g);
|
||||
ret = -1;
|
||||
goto out;
|
||||
}
|
||||
|
||||
for (i = from; i < g->g + g->nr; i++) {
|
||||
pref = btree_trans_abort_preference(i->trans);
|
||||
if (pref > best) {
|
||||
abort = i;
|
||||
best = pref;
|
||||
}
|
||||
}
|
||||
|
||||
if (unlikely(!best))
|
||||
break_cycle_fail(g);
|
||||
|
||||
ret = abort_lock(g, abort);
|
||||
out:
|
||||
if (ret)
|
||||
lock_graph_pop_all(g);
|
||||
else
|
||||
lock_graph_pop_from(g, abort);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int lock_graph_descend(struct lock_graph *g, struct btree_trans *trans,
|
||||
struct printbuf *cycle)
|
||||
{
|
||||
struct btree_trans *orig_trans = g->g->trans;
|
||||
|
||||
for (struct trans_waiting_for_lock *i = g->g; i < g->g + g->nr; i++)
|
||||
if (i->trans == trans) {
|
||||
closure_put(&trans->ref);
|
||||
return break_cycle(g, cycle, i);
|
||||
}
|
||||
|
||||
if (unlikely(g->nr == ARRAY_SIZE(g->g))) {
|
||||
closure_put(&trans->ref);
|
||||
|
||||
if (orig_trans->lock_may_not_fail)
|
||||
return 0;
|
||||
|
||||
lock_graph_pop_all(g);
|
||||
|
||||
if (cycle)
|
||||
return 0;
|
||||
|
||||
trace_and_count(trans->c, trans_restart_would_deadlock_recursion_limit, trans, _RET_IP_);
|
||||
return btree_trans_restart(orig_trans, BCH_ERR_transaction_restart_deadlock_recursion_limit);
|
||||
}
|
||||
|
||||
__lock_graph_down(g, trans);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static bool lock_type_conflicts(enum six_lock_type t1, enum six_lock_type t2)
{
	return t1 + t2 > 1;
}
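/*
 * Illustrative note, not part of the original file: with the six lock types
 * ordered read = 0, intent = 1, write = 2, the "t1 + t2 > 1" test above
 * encodes the whole conflict matrix:
 *
 *   read   + read   = 0  -> no conflict
 *   read   + intent = 1  -> no conflict (intent only excludes other intents)
 *   read   + write  = 2  -> conflict
 *   intent + intent = 2  -> conflict
 *   intent + write  = 3  -> conflict
 *   write  + write  = 4  -> conflict
 */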
|
||||
|
||||
int bch2_check_for_deadlock(struct btree_trans *trans, struct printbuf *cycle)
|
||||
{
|
||||
struct lock_graph g;
|
||||
struct trans_waiting_for_lock *top;
|
||||
struct btree_bkey_cached_common *b;
|
||||
btree_path_idx_t path_idx;
|
||||
int ret = 0;
|
||||
|
||||
g.nr = 0;
|
||||
|
||||
if (trans->lock_must_abort && !trans->lock_may_not_fail) {
|
||||
if (cycle)
|
||||
return -1;
|
||||
|
||||
trace_would_deadlock(&g, trans);
|
||||
return btree_trans_restart(trans, BCH_ERR_transaction_restart_would_deadlock);
|
||||
}
|
||||
|
||||
lock_graph_down(&g, trans);
|
||||
|
||||
/* trans->paths is rcu protected vs. freeing */
|
||||
guard(rcu)();
|
||||
if (cycle)
|
||||
cycle->atomic++;
|
||||
next:
|
||||
if (!g.nr)
|
||||
goto out;
|
||||
|
||||
top = &g.g[g.nr - 1];
|
||||
|
||||
struct btree_path *paths = rcu_dereference(top->trans->paths);
|
||||
if (!paths)
|
||||
goto up;
|
||||
|
||||
unsigned long *paths_allocated = trans_paths_allocated(paths);
|
||||
|
||||
trans_for_each_path_idx_from(paths_allocated, *trans_paths_nr(paths),
|
||||
path_idx, top->path_idx) {
|
||||
struct btree_path *path = paths + path_idx;
|
||||
if (!path->nodes_locked)
|
||||
continue;
|
||||
|
||||
if (path_idx != top->path_idx) {
|
||||
top->path_idx = path_idx;
|
||||
top->level = 0;
|
||||
top->lock_start_time = 0;
|
||||
}
|
||||
|
||||
for (;
|
||||
top->level < BTREE_MAX_DEPTH;
|
||||
top->level++, top->lock_start_time = 0) {
|
||||
int lock_held = btree_node_locked_type(path, top->level);
|
||||
|
||||
if (lock_held == BTREE_NODE_UNLOCKED)
|
||||
continue;
|
||||
|
||||
b = &READ_ONCE(path->l[top->level].b)->c;
|
||||
|
||||
if (IS_ERR_OR_NULL(b)) {
|
||||
/*
|
||||
* If we get here, it means we raced with the
|
||||
* other thread updating its btree_path
|
||||
* structures - which means it can't be blocked
|
||||
* waiting on a lock:
|
||||
*/
|
||||
if (!lock_graph_remove_non_waiters(&g, g.g)) {
|
||||
/*
|
||||
* If lock_graph_remove_non_waiters()
|
||||
* didn't do anything, it must be
|
||||
* because we're being called by debugfs
|
||||
* checking for lock cycles, which
|
||||
* invokes us on btree_transactions that
|
||||
* aren't actually waiting on anything.
|
||||
* Just bail out:
|
||||
*/
|
||||
lock_graph_pop_all(&g);
|
||||
}
|
||||
|
||||
goto next;
|
||||
}
|
||||
|
||||
if (list_empty_careful(&b->lock.wait_list))
|
||||
continue;
|
||||
|
||||
raw_spin_lock(&b->lock.wait_lock);
|
||||
list_for_each_entry(trans, &b->lock.wait_list, locking_wait.list) {
|
||||
BUG_ON(b != trans->locking);
|
||||
|
||||
if (top->lock_start_time &&
|
||||
time_after_eq64(top->lock_start_time, trans->locking_wait.start_time))
|
||||
continue;
|
||||
|
||||
top->lock_start_time = trans->locking_wait.start_time;
|
||||
|
||||
/* Don't check for self deadlock: */
|
||||
if (trans == top->trans ||
|
||||
!lock_type_conflicts(lock_held, trans->locking_wait.lock_want))
|
||||
continue;
|
||||
|
||||
closure_get(&trans->ref);
|
||||
raw_spin_unlock(&b->lock.wait_lock);
|
||||
|
||||
ret = lock_graph_descend(&g, trans, cycle);
|
||||
if (ret)
|
||||
goto out;
|
||||
goto next;
|
||||
|
||||
}
|
||||
raw_spin_unlock(&b->lock.wait_lock);
|
||||
}
|
||||
}
|
||||
up:
|
||||
if (g.nr > 1 && cycle)
|
||||
print_chain(cycle, &g);
|
||||
lock_graph_up(&g);
|
||||
goto next;
|
||||
out:
|
||||
if (cycle)
|
||||
--cycle->atomic;
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_six_check_for_deadlock(struct six_lock *lock, void *p)
|
||||
{
|
||||
struct btree_trans *trans = p;
|
||||
|
||||
return bch2_check_for_deadlock(trans, NULL);
|
||||
}
|
||||
|
||||
int __bch2_btree_node_lock_write(struct btree_trans *trans, struct btree_path *path,
|
||||
struct btree_bkey_cached_common *b,
|
||||
bool lock_may_not_fail)
|
||||
{
|
||||
int readers = bch2_btree_node_lock_counts(trans, NULL, b, b->level).n[SIX_LOCK_read];
|
||||
int ret;
|
||||
|
||||
/*
|
||||
* Must drop our read locks before calling six_lock_write() -
|
||||
* six_unlock() won't do wakeups until the reader count
|
||||
* goes to 0, and it's safe because we have the node intent
|
||||
* locked:
|
||||
*/
|
||||
six_lock_readers_add(&b->lock, -readers);
|
||||
ret = __btree_node_lock_nopath(trans, b, SIX_LOCK_write,
|
||||
lock_may_not_fail, _RET_IP_);
|
||||
six_lock_readers_add(&b->lock, readers);
|
||||
|
||||
if (ret)
|
||||
mark_btree_node_locked_noreset(path, b->level, BTREE_NODE_INTENT_LOCKED);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
void bch2_btree_node_lock_write_nofail(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
struct btree_bkey_cached_common *b)
|
||||
{
|
||||
int ret = __btree_node_lock_write(trans, path, b, true);
|
||||
BUG_ON(ret);
|
||||
}
|
||||
|
||||
/* relock */
|
||||
|
||||
static int btree_path_get_locks(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
bool upgrade,
|
||||
struct get_locks_fail *f,
|
||||
int restart_err)
|
||||
{
|
||||
unsigned l = path->level;
|
||||
|
||||
do {
|
||||
if (!btree_path_node(path, l))
|
||||
break;
|
||||
|
||||
if (!(upgrade
|
||||
? bch2_btree_node_upgrade(trans, path, l)
|
||||
: bch2_btree_node_relock(trans, path, l)))
|
||||
goto err;
|
||||
|
||||
l++;
|
||||
} while (l < path->locks_want);
|
||||
|
||||
if (path->uptodate == BTREE_ITER_NEED_RELOCK)
|
||||
path->uptodate = BTREE_ITER_UPTODATE;
|
||||
|
||||
return path->uptodate < BTREE_ITER_NEED_RELOCK ? 0 : -1;
|
||||
err:
|
||||
if (f) {
|
||||
f->l = l;
|
||||
f->b = path->l[l].b;
|
||||
}
|
||||
|
||||
/*
|
||||
* Do transaction restart before unlocking, so we don't pop
|
||||
* should_be_locked asserts
|
||||
*/
|
||||
if (restart_err) {
|
||||
btree_trans_restart(trans, restart_err);
|
||||
} else if (path->should_be_locked && !trans->restarted) {
|
||||
if (upgrade)
|
||||
path->locks_want = l;
|
||||
return -1;
|
||||
}
|
||||
|
||||
__bch2_btree_path_unlock(trans, path);
|
||||
btree_path_set_dirty(trans, path, BTREE_ITER_NEED_TRAVERSE);
|
||||
|
||||
/*
|
||||
* When we fail to get a lock, we have to ensure that any child nodes
|
||||
* can't be relocked so bch2_btree_path_traverse has to walk back up to
|
||||
* the node that we failed to relock:
|
||||
*/
|
||||
do {
|
||||
path->l[l].b = upgrade
|
||||
? ERR_PTR(-BCH_ERR_no_btree_node_upgrade)
|
||||
: ERR_PTR(-BCH_ERR_no_btree_node_relock);
|
||||
} while (l--);
|
||||
|
||||
return -restart_err ?: -1;
|
||||
}
|
||||
|
||||
bool __bch2_btree_node_relock(struct btree_trans *trans,
|
||||
struct btree_path *path, unsigned level,
|
||||
bool trace)
|
||||
{
|
||||
struct btree *b = btree_path_node(path, level);
|
||||
int want = __btree_lock_want(path, level);
|
||||
|
||||
if (race_fault())
|
||||
goto fail;
|
||||
|
||||
if (six_relock_type(&b->c.lock, want, path->l[level].lock_seq) ||
|
||||
(btree_node_lock_seq_matches(path, b, level) &&
|
||||
btree_node_lock_increment(trans, &b->c, level, want))) {
|
||||
mark_btree_node_locked(trans, path, level, want);
|
||||
return true;
|
||||
}
|
||||
fail:
|
||||
if (trace && !trans->notrace_relock_fail)
|
||||
trace_and_count(trans->c, btree_path_relock_fail, trans, _RET_IP_, path, level);
|
||||
return false;
|
||||
}
|
||||
|
||||
/* upgrade */
|
||||
|
||||
bool bch2_btree_node_upgrade(struct btree_trans *trans,
|
||||
struct btree_path *path, unsigned level)
|
||||
{
|
||||
struct btree *b = path->l[level].b;
|
||||
|
||||
if (!is_btree_node(path, level))
|
||||
return false;
|
||||
|
||||
switch (btree_lock_want(path, level)) {
|
||||
case BTREE_NODE_UNLOCKED:
|
||||
BUG_ON(btree_node_locked(path, level));
|
||||
return true;
|
||||
case BTREE_NODE_READ_LOCKED:
|
||||
BUG_ON(btree_node_intent_locked(path, level));
|
||||
return bch2_btree_node_relock(trans, path, level);
|
||||
case BTREE_NODE_INTENT_LOCKED:
|
||||
break;
|
||||
case BTREE_NODE_WRITE_LOCKED:
|
||||
BUG();
|
||||
}
|
||||
|
||||
if (btree_node_intent_locked(path, level))
|
||||
return true;
|
||||
|
||||
if (race_fault())
|
||||
return false;
|
||||
|
||||
if (btree_node_locked(path, level)
|
||||
? six_lock_tryupgrade(&b->c.lock)
|
||||
: six_relock_type(&b->c.lock, SIX_LOCK_intent, path->l[level].lock_seq))
|
||||
goto success;
|
||||
|
||||
if (btree_node_lock_seq_matches(path, b, level) &&
|
||||
btree_node_lock_increment(trans, &b->c, level, BTREE_NODE_INTENT_LOCKED)) {
|
||||
btree_node_unlock(trans, path, level);
|
||||
goto success;
|
||||
}
|
||||
|
||||
trace_and_count(trans->c, btree_path_upgrade_fail, trans, _RET_IP_, path, level);
|
||||
return false;
|
||||
success:
|
||||
mark_btree_node_locked_noreset(path, level, BTREE_NODE_INTENT_LOCKED);
|
||||
return true;
|
||||
}
|
||||
|
||||
/* Btree path locking: */
|
||||
|
||||
/*
|
||||
* Only for btree_cache.c - only relocks intent locks
|
||||
*/
|
||||
int bch2_btree_path_relock_intent(struct btree_trans *trans,
|
||||
struct btree_path *path)
|
||||
{
|
||||
unsigned l;
|
||||
|
||||
for (l = path->level;
|
||||
l < path->locks_want && btree_path_node(path, l);
|
||||
l++) {
|
||||
if (!bch2_btree_node_relock(trans, path, l)) {
|
||||
__bch2_btree_path_unlock(trans, path);
|
||||
btree_path_set_dirty(trans, path, BTREE_ITER_NEED_TRAVERSE);
|
||||
trace_and_count(trans->c, trans_restart_relock_path_intent, trans, _RET_IP_, path);
|
||||
return btree_trans_restart(trans, BCH_ERR_transaction_restart_relock_path_intent);
|
||||
}
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
__flatten
|
||||
bool bch2_btree_path_relock_norestart(struct btree_trans *trans, struct btree_path *path)
|
||||
{
|
||||
bool ret = !btree_path_get_locks(trans, path, false, NULL, 0);
|
||||
bch2_trans_verify_locks(trans);
|
||||
return ret;
|
||||
}
|
||||
|
||||
int __bch2_btree_path_relock(struct btree_trans *trans,
|
||||
struct btree_path *path, unsigned long trace_ip)
|
||||
{
|
||||
if (!bch2_btree_path_relock_norestart(trans, path)) {
|
||||
trace_and_count(trans->c, trans_restart_relock_path, trans, trace_ip, path);
|
||||
return btree_trans_restart(trans, BCH_ERR_transaction_restart_relock_path);
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
bool __bch2_btree_path_upgrade_norestart(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
unsigned new_locks_want)
|
||||
{
|
||||
path->locks_want = new_locks_want;
|
||||
|
||||
/*
|
||||
* If we need it locked, we can't touch it. Otherwise, we can return
|
||||
* success - bch2_path_get() will use this path, and it'll just be
|
||||
* retraversed:
|
||||
*/
|
||||
bool ret = !btree_path_get_locks(trans, path, true, NULL, 0) ||
|
||||
!path->should_be_locked;
|
||||
|
||||
bch2_btree_path_verify_locks(trans, path);
|
||||
return ret;
|
||||
}
|
||||
|
||||
int __bch2_btree_path_upgrade(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
unsigned new_locks_want)
|
||||
{
|
||||
unsigned old_locks = path->nodes_locked;
|
||||
unsigned old_locks_want = path->locks_want;
|
||||
|
||||
path->locks_want = max_t(unsigned, path->locks_want, new_locks_want);
|
||||
|
||||
struct get_locks_fail f = {};
|
||||
int ret = btree_path_get_locks(trans, path, true, &f,
|
||||
BCH_ERR_transaction_restart_upgrade);
|
||||
if (!ret)
|
||||
goto out;
|
||||
|
||||
/*
|
||||
* XXX: this is ugly - we'd prefer to not be mucking with other
|
||||
* iterators in the btree_trans here.
|
||||
*
|
||||
* On failure to upgrade the iterator, setting iter->locks_want and
|
||||
* calling get_locks() is sufficient to make bch2_btree_path_traverse()
|
||||
* get the locks we want on transaction restart.
|
||||
*
|
||||
* But if this iterator was a clone, on transaction restart what we did
|
||||
* to this iterator isn't going to be preserved.
|
||||
*
|
||||
* Possibly we could add an iterator field for the parent iterator when
|
||||
* an iterator is a copy - for now, we'll just upgrade any other
|
||||
* iterators with the same btree id.
|
||||
*
|
||||
* The code below used to be needed to ensure ancestor nodes get locked
|
||||
* before interior nodes - now that's handled by
|
||||
* bch2_btree_path_traverse_all().
|
||||
*/
|
||||
if (!path->cached && !trans->in_traverse_all) {
|
||||
struct btree_path *linked;
|
||||
unsigned i;
|
||||
|
||||
trans_for_each_path(trans, linked, i)
|
||||
if (linked != path &&
|
||||
linked->cached == path->cached &&
|
||||
linked->btree_id == path->btree_id &&
|
||||
linked->locks_want < new_locks_want) {
|
||||
linked->locks_want = new_locks_want;
|
||||
btree_path_get_locks(trans, linked, true, NULL, 0);
|
||||
}
|
||||
}
|
||||
|
||||
count_event(trans->c, trans_restart_upgrade);
|
||||
if (trace_trans_restart_upgrade_enabled()) {
|
||||
struct printbuf buf = PRINTBUF;
|
||||
|
||||
prt_printf(&buf, "%s %pS\n", trans->fn, (void *) _RET_IP_);
|
||||
prt_printf(&buf, "btree %s pos\n", bch2_btree_id_str(path->btree_id));
|
||||
bch2_bpos_to_text(&buf, path->pos);
|
||||
prt_printf(&buf, "locks want %u -> %u level %u\n",
|
||||
old_locks_want, new_locks_want, f.l);
|
||||
prt_printf(&buf, "nodes_locked %x -> %x\n",
|
||||
old_locks, path->nodes_locked);
|
||||
prt_printf(&buf, "node %s ", IS_ERR(f.b) ? bch2_err_str(PTR_ERR(f.b)) :
|
||||
!f.b ? "(null)" : "(node)");
|
||||
prt_printf(&buf, "path seq %u node seq %u\n",
|
||||
IS_ERR_OR_NULL(f.b) ? 0 : f.b->c.lock.seq,
|
||||
path->l[f.l].lock_seq);
|
||||
|
||||
trace_trans_restart_upgrade(trans->c, buf.buf);
|
||||
printbuf_exit(&buf);
|
||||
}
|
||||
out:
|
||||
bch2_trans_verify_locks(trans);
|
||||
return ret;
|
||||
}
|
||||
|
||||
void __bch2_btree_path_downgrade(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
unsigned new_locks_want)
|
||||
{
|
||||
unsigned l, old_locks_want = path->locks_want;
|
||||
|
||||
if (trans->restarted)
|
||||
return;
|
||||
|
||||
EBUG_ON(path->locks_want < new_locks_want);
|
||||
|
||||
path->locks_want = new_locks_want;
|
||||
|
||||
while (path->nodes_locked &&
|
||||
(l = btree_path_highest_level_locked(path)) >= path->locks_want) {
|
||||
if (l > path->level) {
|
||||
btree_node_unlock(trans, path, l);
|
||||
} else {
|
||||
if (btree_node_intent_locked(path, l)) {
|
||||
six_lock_downgrade(&path->l[l].b->c.lock);
|
||||
mark_btree_node_locked_noreset(path, l, BTREE_NODE_READ_LOCKED);
|
||||
}
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
bch2_btree_path_verify_locks(trans, path);
|
||||
|
||||
trace_path_downgrade(trans, _RET_IP_, path, old_locks_want);
|
||||
}
|
||||
|
||||
/* Btree transaction locking: */
|
||||
|
||||
void bch2_trans_downgrade(struct btree_trans *trans)
|
||||
{
|
||||
struct btree_path *path;
|
||||
unsigned i;
|
||||
|
||||
if (trans->restarted)
|
||||
return;
|
||||
|
||||
trans_for_each_path(trans, path, i)
|
||||
if (path->ref)
|
||||
bch2_btree_path_downgrade(trans, path);
|
||||
}
|
||||
|
||||
static inline void __bch2_trans_unlock(struct btree_trans *trans)
|
||||
{
|
||||
struct btree_path *path;
|
||||
unsigned i;
|
||||
|
||||
trans_for_each_path(trans, path, i)
|
||||
__bch2_btree_path_unlock(trans, path);
|
||||
}
|
||||
|
||||
static noinline __cold void bch2_trans_relock_fail(struct btree_trans *trans, struct btree_path *path,
|
||||
struct get_locks_fail *f, bool trace, ulong ip)
|
||||
{
|
||||
if (!trace)
|
||||
goto out;
|
||||
|
||||
if (trace_trans_restart_relock_enabled()) {
|
||||
struct printbuf buf = PRINTBUF;
|
||||
|
||||
bch2_bpos_to_text(&buf, path->pos);
|
||||
prt_printf(&buf, " %s l=%u seq=%u node seq=",
|
||||
bch2_btree_id_str(path->btree_id),
|
||||
f->l, path->l[f->l].lock_seq);
|
||||
if (IS_ERR_OR_NULL(f->b)) {
|
||||
prt_str(&buf, bch2_err_str(PTR_ERR(f->b)));
|
||||
} else {
|
||||
prt_printf(&buf, "%u", f->b->c.lock.seq);
|
||||
|
||||
struct six_lock_count c =
|
||||
bch2_btree_node_lock_counts(trans, NULL, &f->b->c, f->l);
|
||||
prt_printf(&buf, " self locked %u.%u.%u", c.n[0], c.n[1], c.n[2]);
|
||||
|
||||
c = six_lock_counts(&f->b->c.lock);
|
||||
prt_printf(&buf, " total locked %u.%u.%u", c.n[0], c.n[1], c.n[2]);
|
||||
}
|
||||
|
||||
trace_trans_restart_relock(trans, ip, buf.buf);
|
||||
printbuf_exit(&buf);
|
||||
}
|
||||
|
||||
count_event(trans->c, trans_restart_relock);
|
||||
out:
|
||||
__bch2_trans_unlock(trans);
|
||||
bch2_trans_verify_locks(trans);
|
||||
}
|
||||
|
||||
static inline int __bch2_trans_relock(struct btree_trans *trans, bool trace, ulong ip)
|
||||
{
|
||||
bch2_trans_verify_locks(trans);
|
||||
|
||||
if (unlikely(trans->restarted))
|
||||
return -((int) trans->restarted);
|
||||
if (unlikely(trans->locked))
|
||||
goto out;
|
||||
|
||||
struct btree_path *path;
|
||||
unsigned i;
|
||||
|
||||
trans_for_each_path(trans, path, i) {
|
||||
struct get_locks_fail f;
|
||||
int ret;
|
||||
|
||||
if (path->should_be_locked &&
|
||||
(ret = btree_path_get_locks(trans, path, false, &f,
|
||||
BCH_ERR_transaction_restart_relock))) {
|
||||
bch2_trans_relock_fail(trans, path, &f, trace, ip);
|
||||
return ret;
|
||||
}
|
||||
}
|
||||
|
||||
trans_set_locked(trans, true);
|
||||
out:
|
||||
bch2_trans_verify_locks(trans);
|
||||
return 0;
|
||||
}
|
||||
|
||||
int bch2_trans_relock(struct btree_trans *trans)
|
||||
{
|
||||
return __bch2_trans_relock(trans, true, _RET_IP_);
|
||||
}
|
||||
|
||||
int bch2_trans_relock_notrace(struct btree_trans *trans)
|
||||
{
|
||||
return __bch2_trans_relock(trans, false, _RET_IP_);
|
||||
}
|
||||
|
||||
void bch2_trans_unlock(struct btree_trans *trans)
|
||||
{
|
||||
trans_set_unlocked(trans);
|
||||
|
||||
__bch2_trans_unlock(trans);
|
||||
}
|
||||
|
||||
void bch2_trans_unlock_long(struct btree_trans *trans)
|
||||
{
|
||||
bch2_trans_unlock(trans);
|
||||
bch2_trans_srcu_unlock(trans);
|
||||
}
|
||||
|
||||
void bch2_trans_unlock_write(struct btree_trans *trans)
|
||||
{
|
||||
struct btree_path *path;
|
||||
unsigned i;
|
||||
|
||||
trans_for_each_path(trans, path, i)
|
||||
for (unsigned l = 0; l < BTREE_MAX_DEPTH; l++)
|
||||
if (btree_node_write_locked(path, l))
|
||||
bch2_btree_node_unlock_write(trans, path, path->l[l].b);
|
||||
}
|
||||
|
||||
int __bch2_trans_mutex_lock(struct btree_trans *trans,
|
||||
struct mutex *lock)
|
||||
{
|
||||
int ret = drop_locks_do(trans, (mutex_lock(lock), 0));
|
||||
|
||||
if (ret)
|
||||
mutex_unlock(lock);
|
||||
return ret;
|
||||
}
|
||||
|
||||
/* Debug */
|
||||
|
||||
void __bch2_btree_path_verify_locks(struct btree_trans *trans, struct btree_path *path)
|
||||
{
|
||||
if (!path->nodes_locked && btree_path_node(path, path->level)) {
|
||||
/*
|
||||
* A path may be uptodate and yet have nothing locked if and only if
|
||||
* there is no node at path->level, which generally means we were
|
||||
* iterating over all nodes and got to the end of the btree
|
||||
*/
|
||||
BUG_ON(path->uptodate == BTREE_ITER_UPTODATE);
|
||||
BUG_ON(path->should_be_locked && trans->locked && !trans->restarted);
|
||||
}
|
||||
|
||||
if (!path->nodes_locked)
|
||||
return;
|
||||
|
||||
for (unsigned l = 0; l < BTREE_MAX_DEPTH; l++) {
|
||||
int want = btree_lock_want(path, l);
|
||||
int have = btree_node_locked_type_nowrite(path, l);
|
||||
|
||||
BUG_ON(!is_btree_node(path, l) && have != BTREE_NODE_UNLOCKED);
|
||||
|
||||
BUG_ON(is_btree_node(path, l) && want != have);
|
||||
|
||||
BUG_ON(btree_node_locked(path, l) &&
|
||||
path->l[l].lock_seq != six_lock_seq(&path->l[l].b->c.lock));
|
||||
}
|
||||
}
|
||||
|
||||
static bool bch2_trans_locked(struct btree_trans *trans)
|
||||
{
|
||||
struct btree_path *path;
|
||||
unsigned i;
|
||||
|
||||
trans_for_each_path(trans, path, i)
|
||||
if (path->nodes_locked)
|
||||
return true;
|
||||
return false;
|
||||
}
|
||||
|
||||
void __bch2_trans_verify_locks(struct btree_trans *trans)
|
||||
{
|
||||
if (!trans->locked) {
|
||||
BUG_ON(bch2_trans_locked(trans));
|
||||
return;
|
||||
}
|
||||
|
||||
struct btree_path *path;
|
||||
unsigned i;
|
||||
|
||||
trans_for_each_path(trans, path, i)
|
||||
__bch2_btree_path_verify_locks(trans, path);
|
||||
}
|
||||
@@ -1,466 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_LOCKING_H
|
||||
#define _BCACHEFS_BTREE_LOCKING_H
|
||||
|
||||
/*
|
||||
* Only for internal btree use:
|
||||
*
|
||||
* The btree iterator tracks what locks it wants to take, and what locks it
|
||||
* currently has - here we have wrappers for locking/unlocking btree nodes and
|
||||
* updating the iterator state
|
||||
*/
|
||||
|
||||
#include "btree_iter.h"
|
||||
#include "six.h"
|
||||
|
||||
void bch2_btree_lock_init(struct btree_bkey_cached_common *, enum six_lock_init_flags, gfp_t gfp);
|
||||
|
||||
void bch2_trans_unlock_write(struct btree_trans *);
|
||||
|
||||
static inline bool is_btree_node(struct btree_path *path, unsigned l)
|
||||
{
|
||||
return l < BTREE_MAX_DEPTH && !IS_ERR_OR_NULL(path->l[l].b);
|
||||
}
|
||||
|
||||
static inline struct btree_transaction_stats *btree_trans_stats(struct btree_trans *trans)
|
||||
{
|
||||
return trans->fn_idx < ARRAY_SIZE(trans->c->btree_transaction_stats)
|
||||
? &trans->c->btree_transaction_stats[trans->fn_idx]
|
||||
: NULL;
|
||||
}
|
||||
|
||||
/* matches six lock types */
|
||||
enum btree_node_locked_type {
|
||||
BTREE_NODE_UNLOCKED = -1,
|
||||
BTREE_NODE_READ_LOCKED = SIX_LOCK_read,
|
||||
BTREE_NODE_INTENT_LOCKED = SIX_LOCK_intent,
|
||||
BTREE_NODE_WRITE_LOCKED = SIX_LOCK_write,
|
||||
};
|
||||
|
||||
static inline int btree_node_locked_type(struct btree_path *path,
|
||||
unsigned level)
|
||||
{
|
||||
return BTREE_NODE_UNLOCKED + ((path->nodes_locked >> (level << 1)) & 3);
|
||||
}
|
||||
|
||||
static inline int btree_node_locked_type_nowrite(struct btree_path *path,
|
||||
unsigned level)
|
||||
{
|
||||
int have = btree_node_locked_type(path, level);
|
||||
return have == BTREE_NODE_WRITE_LOCKED
|
||||
? BTREE_NODE_INTENT_LOCKED
|
||||
: have;
|
||||
}
|
||||
|
||||
static inline bool btree_node_write_locked(struct btree_path *path, unsigned l)
|
||||
{
|
||||
return btree_node_locked_type(path, l) == BTREE_NODE_WRITE_LOCKED;
|
||||
}
|
||||
|
||||
static inline bool btree_node_intent_locked(struct btree_path *path, unsigned l)
|
||||
{
|
||||
return btree_node_locked_type(path, l) == BTREE_NODE_INTENT_LOCKED;
|
||||
}
|
||||
|
||||
static inline bool btree_node_read_locked(struct btree_path *path, unsigned l)
|
||||
{
|
||||
return btree_node_locked_type(path, l) == BTREE_NODE_READ_LOCKED;
|
||||
}
|
||||
|
||||
static inline bool btree_node_locked(struct btree_path *path, unsigned level)
|
||||
{
|
||||
return btree_node_locked_type(path, level) != BTREE_NODE_UNLOCKED;
|
||||
}
|
||||
|
||||
static inline void mark_btree_node_locked_noreset(struct btree_path *path,
|
||||
unsigned level,
|
||||
enum btree_node_locked_type type)
|
||||
{
|
||||
/* relying on this to avoid a branch */
|
||||
BUILD_BUG_ON(SIX_LOCK_read != 0);
|
||||
BUILD_BUG_ON(SIX_LOCK_intent != 1);
|
||||
|
||||
path->nodes_locked &= ~(3U << (level << 1));
|
||||
path->nodes_locked |= (type + 1) << (level << 1);
|
||||
}
|
||||
|
||||
static inline void mark_btree_node_locked(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
unsigned level,
|
||||
enum btree_node_locked_type type)
|
||||
{
|
||||
mark_btree_node_locked_noreset(path, level, (enum btree_node_locked_type) type);
|
||||
#ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
|
||||
path->l[level].lock_taken_time = local_clock();
|
||||
#endif
|
||||
}
|
||||
|
||||
static inline enum six_lock_type __btree_lock_want(struct btree_path *path, int level)
|
||||
{
|
||||
return level < path->locks_want
|
||||
? SIX_LOCK_intent
|
||||
: SIX_LOCK_read;
|
||||
}
|
||||
|
||||
static inline enum btree_node_locked_type
|
||||
btree_lock_want(struct btree_path *path, int level)
|
||||
{
|
||||
if (level < path->level)
|
||||
return BTREE_NODE_UNLOCKED;
|
||||
if (level < path->locks_want)
|
||||
return BTREE_NODE_INTENT_LOCKED;
|
||||
if (level == path->level)
|
||||
return BTREE_NODE_READ_LOCKED;
|
||||
return BTREE_NODE_UNLOCKED;
|
||||
}
|
||||
|
||||
static void btree_trans_lock_hold_time_update(struct btree_trans *trans,
|
||||
struct btree_path *path, unsigned level)
|
||||
{
|
||||
#ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
|
||||
__bch2_time_stats_update(&btree_trans_stats(trans)->lock_hold_times,
|
||||
path->l[level].lock_taken_time,
|
||||
local_clock());
|
||||
#endif
|
||||
}
|
||||
|
||||
/* unlock: */
|
||||
|
||||
void bch2_btree_node_unlock_write(struct btree_trans *,
|
||||
struct btree_path *, struct btree *);
|
||||
|
||||
static inline void btree_node_unlock(struct btree_trans *trans,
|
||||
struct btree_path *path, unsigned level)
|
||||
{
|
||||
int lock_type = btree_node_locked_type(path, level);
|
||||
|
||||
EBUG_ON(level >= BTREE_MAX_DEPTH);
|
||||
|
||||
if (lock_type != BTREE_NODE_UNLOCKED) {
|
||||
if (unlikely(lock_type == BTREE_NODE_WRITE_LOCKED)) {
|
||||
bch2_btree_node_unlock_write(trans, path, path->l[level].b);
|
||||
lock_type = BTREE_NODE_INTENT_LOCKED;
|
||||
}
|
||||
six_unlock_type(&path->l[level].b->c.lock, lock_type);
|
||||
btree_trans_lock_hold_time_update(trans, path, level);
|
||||
mark_btree_node_locked_noreset(path, level, BTREE_NODE_UNLOCKED);
|
||||
}
|
||||
}
|
||||
|
||||
static inline int btree_path_lowest_level_locked(struct btree_path *path)
|
||||
{
|
||||
return __ffs(path->nodes_locked) >> 1;
|
||||
}
|
||||
|
||||
static inline int btree_path_highest_level_locked(struct btree_path *path)
|
||||
{
|
||||
return __fls(path->nodes_locked) >> 1;
|
||||
}
|
||||
|
||||
static inline void __bch2_btree_path_unlock(struct btree_trans *trans,
|
||||
struct btree_path *path)
|
||||
{
|
||||
btree_path_set_dirty(trans, path, BTREE_ITER_NEED_RELOCK);
|
||||
|
||||
while (path->nodes_locked)
|
||||
btree_node_unlock(trans, path, btree_path_lowest_level_locked(path));
|
||||
}
|
||||
|
||||
/*
|
||||
* Updates the saved lock sequence number, so that bch2_btree_node_relock() will
|
||||
* succeed:
|
||||
*/
|
||||
static inline void
|
||||
__bch2_btree_node_unlock_write(struct btree_trans *trans, struct btree *b)
|
||||
{
|
||||
if (!b->c.lock.write_lock_recurse) {
|
||||
struct btree_path *linked;
|
||||
unsigned i;
|
||||
|
||||
trans_for_each_path_with_node(trans, b, linked, i)
|
||||
linked->l[b->c.level].lock_seq++;
|
||||
}
|
||||
|
||||
six_unlock_write(&b->c.lock);
|
||||
}
|
||||
|
||||
static inline void
|
||||
bch2_btree_node_unlock_write_inlined(struct btree_trans *trans, struct btree_path *path,
|
||||
struct btree *b)
|
||||
{
|
||||
EBUG_ON(path->l[b->c.level].b != b);
|
||||
EBUG_ON(path->l[b->c.level].lock_seq != six_lock_seq(&b->c.lock));
|
||||
EBUG_ON(btree_node_locked_type(path, b->c.level) != SIX_LOCK_write);
|
||||
|
||||
mark_btree_node_locked_noreset(path, b->c.level, BTREE_NODE_INTENT_LOCKED);
|
||||
__bch2_btree_node_unlock_write(trans, b);
|
||||
}
|
||||
|
||||
int bch2_six_check_for_deadlock(struct six_lock *lock, void *p);
|
||||
|
||||
/* lock: */
|
||||
|
||||
static inline void trans_set_locked(struct btree_trans *trans, bool try)
|
||||
{
|
||||
if (!trans->locked) {
|
||||
lock_acquire_exclusive(&trans->dep_map, 0, try, NULL, _THIS_IP_);
|
||||
trans->locked = true;
|
||||
trans->last_unlock_ip = 0;
|
||||
|
||||
trans->pf_memalloc_nofs = (current->flags & PF_MEMALLOC_NOFS) != 0;
|
||||
current->flags |= PF_MEMALLOC_NOFS;
|
||||
}
|
||||
}
|
||||
|
||||
static inline void trans_set_unlocked(struct btree_trans *trans)
|
||||
{
|
||||
if (trans->locked) {
|
||||
lock_release(&trans->dep_map, _THIS_IP_);
|
||||
trans->locked = false;
|
||||
trans->last_unlock_ip = _RET_IP_;
|
||||
|
||||
if (!trans->pf_memalloc_nofs)
|
||||
current->flags &= ~PF_MEMALLOC_NOFS;
|
||||
}
|
||||
}
|
||||
|
||||
static inline int __btree_node_lock_nopath(struct btree_trans *trans,
|
||||
struct btree_bkey_cached_common *b,
|
||||
enum six_lock_type type,
|
||||
bool lock_may_not_fail,
|
||||
unsigned long ip)
|
||||
{
|
||||
trans->lock_may_not_fail = lock_may_not_fail;
|
||||
trans->lock_must_abort = false;
|
||||
trans->locking = b;
|
||||
|
||||
int ret = six_lock_ip_waiter(&b->lock, type, &trans->locking_wait,
|
||||
bch2_six_check_for_deadlock, trans, ip);
|
||||
WRITE_ONCE(trans->locking, NULL);
|
||||
WRITE_ONCE(trans->locking_wait.start_time, 0);
|
||||
|
||||
if (!ret)
|
||||
trace_btree_path_lock(trans, _THIS_IP_, b);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static inline int __must_check
|
||||
btree_node_lock_nopath(struct btree_trans *trans,
|
||||
struct btree_bkey_cached_common *b,
|
||||
enum six_lock_type type,
|
||||
unsigned long ip)
|
||||
{
|
||||
return __btree_node_lock_nopath(trans, b, type, false, ip);
|
||||
}
|
||||
|
||||
static inline void btree_node_lock_nopath_nofail(struct btree_trans *trans,
|
||||
struct btree_bkey_cached_common *b,
|
||||
enum six_lock_type type)
|
||||
{
|
||||
int ret = __btree_node_lock_nopath(trans, b, type, true, _THIS_IP_);
|
||||
|
||||
BUG_ON(ret);
|
||||
}
|
||||
|
||||
/*
|
||||
* Lock a btree node if we already have it locked on one of our linked
|
||||
* iterators:
|
||||
*/
|
||||
static inline bool btree_node_lock_increment(struct btree_trans *trans,
|
||||
struct btree_bkey_cached_common *b,
|
||||
unsigned level,
|
||||
enum btree_node_locked_type want)
|
||||
{
|
||||
struct btree_path *path;
|
||||
unsigned i;
|
||||
|
||||
trans_for_each_path(trans, path, i)
|
||||
if (&path->l[level].b->c == b &&
|
||||
btree_node_locked_type(path, level) >= want) {
|
||||
six_lock_increment(&b->lock, (enum six_lock_type) want);
|
||||
return true;
|
||||
}
|
||||
|
||||
return false;
|
||||
}
|
||||
|
||||
static inline int btree_node_lock(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
struct btree_bkey_cached_common *b,
|
||||
unsigned level,
|
||||
enum six_lock_type type,
|
||||
unsigned long ip)
|
||||
{
|
||||
int ret = 0;
|
||||
|
||||
EBUG_ON(level >= BTREE_MAX_DEPTH);
|
||||
bch2_trans_verify_not_unlocked_or_in_restart(trans);
|
||||
|
||||
if (likely(six_trylock_type(&b->lock, type)) ||
|
||||
btree_node_lock_increment(trans, b, level, (enum btree_node_locked_type) type) ||
|
||||
!(ret = btree_node_lock_nopath(trans, b, type, btree_path_ip_allocated(path)))) {
|
||||
#ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
|
||||
path->l[b->level].lock_taken_time = local_clock();
|
||||
#endif
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
int __bch2_btree_node_lock_write(struct btree_trans *, struct btree_path *,
|
||||
struct btree_bkey_cached_common *b, bool);
|
||||
|
||||
static inline int __btree_node_lock_write(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
struct btree_bkey_cached_common *b,
|
||||
bool lock_may_not_fail)
|
||||
{
|
||||
EBUG_ON(&path->l[b->level].b->c != b);
|
||||
EBUG_ON(path->l[b->level].lock_seq != six_lock_seq(&b->lock));
|
||||
EBUG_ON(!btree_node_intent_locked(path, b->level));
|
||||
|
||||
/*
|
||||
* six locks are unfair, and read locks block while a thread wants a
|
||||
* write lock: thus, we need to tell the cycle detector we have a write
|
||||
* lock _before_ taking the lock:
|
||||
*/
|
||||
mark_btree_node_locked_noreset(path, b->level, BTREE_NODE_WRITE_LOCKED);
|
||||
|
||||
return likely(six_trylock_write(&b->lock))
|
||||
? 0
|
||||
: __bch2_btree_node_lock_write(trans, path, b, lock_may_not_fail);
|
||||
}
|
||||
|
||||
static inline int __must_check
|
||||
bch2_btree_node_lock_write(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
struct btree_bkey_cached_common *b)
|
||||
{
|
||||
return __btree_node_lock_write(trans, path, b, false);
|
||||
}
|
||||
|
||||
void bch2_btree_node_lock_write_nofail(struct btree_trans *,
|
||||
struct btree_path *,
|
||||
struct btree_bkey_cached_common *);
|
||||
|
||||
/* relock: */
|
||||
|
||||
bool bch2_btree_path_relock_norestart(struct btree_trans *, struct btree_path *);
|
||||
int __bch2_btree_path_relock(struct btree_trans *,
|
||||
struct btree_path *, unsigned long);
|
||||
|
||||
static inline int bch2_btree_path_relock(struct btree_trans *trans,
|
||||
struct btree_path *path, unsigned long trace_ip)
|
||||
{
|
||||
return btree_node_locked(path, path->level)
|
||||
? 0
|
||||
: __bch2_btree_path_relock(trans, path, trace_ip);
|
||||
}
|
||||
|
||||
bool __bch2_btree_node_relock(struct btree_trans *, struct btree_path *, unsigned, bool trace);
|
||||
|
||||
static inline bool bch2_btree_node_relock(struct btree_trans *trans,
|
||||
struct btree_path *path, unsigned level)
|
||||
{
|
||||
EBUG_ON(btree_node_locked(path, level) &&
|
||||
!btree_node_write_locked(path, level) &&
|
||||
btree_node_locked_type(path, level) != __btree_lock_want(path, level));
|
||||
|
||||
return likely(btree_node_locked(path, level)) ||
|
||||
(!IS_ERR_OR_NULL(path->l[level].b) &&
|
||||
__bch2_btree_node_relock(trans, path, level, true));
|
||||
}
|
||||
|
||||
static inline bool bch2_btree_node_relock_notrace(struct btree_trans *trans,
|
||||
struct btree_path *path, unsigned level)
|
||||
{
|
||||
EBUG_ON(btree_node_locked(path, level) &&
|
||||
btree_node_locked_type_nowrite(path, level) !=
|
||||
__btree_lock_want(path, level));
|
||||
|
||||
return likely(btree_node_locked(path, level)) ||
|
||||
(!IS_ERR_OR_NULL(path->l[level].b) &&
|
||||
__bch2_btree_node_relock(trans, path, level, false));
|
||||
}
|
||||
|
||||
/* upgrade */
|
||||
|
||||
bool __bch2_btree_path_upgrade_norestart(struct btree_trans *, struct btree_path *, unsigned);
|
||||
|
||||
static inline bool bch2_btree_path_upgrade_norestart(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
unsigned new_locks_want)
|
||||
{
|
||||
return new_locks_want > path->locks_want
|
||||
? __bch2_btree_path_upgrade_norestart(trans, path, new_locks_want)
|
||||
: true;
|
||||
}
|
||||
|
||||
int __bch2_btree_path_upgrade(struct btree_trans *,
|
||||
struct btree_path *, unsigned);
|
||||
|
||||
static inline int bch2_btree_path_upgrade(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
unsigned new_locks_want)
|
||||
{
|
||||
new_locks_want = min(new_locks_want, BTREE_MAX_DEPTH);
|
||||
|
||||
return likely(path->locks_want >= new_locks_want && path->nodes_locked)
|
||||
? 0
|
||||
: __bch2_btree_path_upgrade(trans, path, new_locks_want);
|
||||
}
|
||||
|
||||
/* misc: */
|
||||
|
||||
static inline void btree_path_set_should_be_locked(struct btree_trans *trans, struct btree_path *path)
|
||||
{
|
||||
EBUG_ON(!btree_node_locked(path, path->level));
|
||||
EBUG_ON(path->uptodate);
|
||||
|
||||
if (!path->should_be_locked) {
|
||||
path->should_be_locked = true;
|
||||
trace_btree_path_should_be_locked(trans, path);
|
||||
}
|
||||
}
|
||||
|
||||
static inline void __btree_path_set_level_up(struct btree_trans *trans,
|
||||
struct btree_path *path,
|
||||
unsigned l)
|
||||
{
|
||||
btree_node_unlock(trans, path, l);
|
||||
path->l[l].b = ERR_PTR(-BCH_ERR_no_btree_node_up);
|
||||
}
|
||||
|
||||
static inline void btree_path_set_level_up(struct btree_trans *trans,
|
||||
struct btree_path *path)
|
||||
{
|
||||
__btree_path_set_level_up(trans, path, path->level++);
|
||||
btree_path_set_dirty(trans, path, BTREE_ITER_NEED_TRAVERSE);
|
||||
}
|
||||
|
||||
/* debug */
|
||||
|
||||
struct six_lock_count bch2_btree_node_lock_counts(struct btree_trans *,
|
||||
struct btree_path *,
|
||||
struct btree_bkey_cached_common *b,
|
||||
unsigned);
|
||||
|
||||
int bch2_check_for_deadlock(struct btree_trans *, struct printbuf *);
|
||||
|
||||
void __bch2_btree_path_verify_locks(struct btree_trans *, struct btree_path *);
|
||||
void __bch2_trans_verify_locks(struct btree_trans *);
|
||||
|
||||
static inline void bch2_btree_path_verify_locks(struct btree_trans *trans,
|
||||
struct btree_path *path)
|
||||
{
|
||||
if (static_branch_unlikely(&bch2_debug_check_btree_locking))
|
||||
__bch2_btree_path_verify_locks(trans, path);
|
||||
}
|
||||
|
||||
static inline void bch2_trans_verify_locks(struct btree_trans *trans)
|
||||
{
|
||||
if (static_branch_unlikely(&bch2_debug_check_btree_locking))
|
||||
__bch2_trans_verify_locks(trans);
|
||||
}
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_LOCKING_H */
|
||||
@@ -1,611 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "btree_cache.h"
|
||||
#include "btree_io.h"
|
||||
#include "btree_journal_iter.h"
|
||||
#include "btree_node_scan.h"
|
||||
#include "btree_update_interior.h"
|
||||
#include "buckets.h"
|
||||
#include "error.h"
|
||||
#include "journal_io.h"
|
||||
#include "recovery_passes.h"
|
||||
|
||||
#include <linux/kthread.h>
|
||||
#include <linux/min_heap.h>
|
||||
#include <linux/sched/sysctl.h>
|
||||
#include <linux/sort.h>
|
||||
|
||||
struct find_btree_nodes_worker {
|
||||
struct closure *cl;
|
||||
struct find_btree_nodes *f;
|
||||
struct bch_dev *ca;
|
||||
};
|
||||
|
||||
static void found_btree_node_to_text(struct printbuf *out, struct bch_fs *c, const struct found_btree_node *n)
|
||||
{
|
||||
bch2_btree_id_level_to_text(out, n->btree_id, n->level);
|
||||
prt_printf(out, " seq=%u journal_seq=%llu cookie=%llx ",
|
||||
n->seq, n->journal_seq, n->cookie);
|
||||
bch2_bpos_to_text(out, n->min_key);
|
||||
prt_str(out, "-");
|
||||
bch2_bpos_to_text(out, n->max_key);
|
||||
|
||||
if (n->range_updated)
|
||||
prt_str(out, " range updated");
|
||||
|
||||
for (unsigned i = 0; i < n->nr_ptrs; i++) {
|
||||
prt_char(out, ' ');
|
||||
bch2_extent_ptr_to_text(out, c, n->ptrs + i);
|
||||
}
|
||||
}
|
||||
|
||||
static void found_btree_nodes_to_text(struct printbuf *out, struct bch_fs *c, found_btree_nodes nodes)
|
||||
{
|
||||
printbuf_indent_add(out, 2);
|
||||
darray_for_each(nodes, i) {
|
||||
found_btree_node_to_text(out, c, i);
|
||||
prt_newline(out);
|
||||
}
|
||||
printbuf_indent_sub(out, 2);
|
||||
}
|
||||
|
||||
static void found_btree_node_to_key(struct bkey_i *k, const struct found_btree_node *f)
|
||||
{
|
||||
struct bkey_i_btree_ptr_v2 *bp = bkey_btree_ptr_v2_init(k);
|
||||
|
||||
set_bkey_val_u64s(&bp->k, sizeof(struct bch_btree_ptr_v2) / sizeof(u64) + f->nr_ptrs);
|
||||
bp->k.p = f->max_key;
|
||||
bp->v.seq = cpu_to_le64(f->cookie);
|
||||
bp->v.sectors_written = 0;
|
||||
bp->v.flags = 0;
|
||||
bp->v.sectors_written = cpu_to_le16(f->sectors_written);
|
||||
bp->v.min_key = f->min_key;
|
||||
SET_BTREE_PTR_RANGE_UPDATED(&bp->v, f->range_updated);
|
||||
memcpy(bp->v.start, f->ptrs, sizeof(struct bch_extent_ptr) * f->nr_ptrs);
|
||||
}
|
||||
|
||||
static inline u64 bkey_journal_seq(struct bkey_s_c k)
|
||||
{
|
||||
switch (k.k->type) {
|
||||
case KEY_TYPE_inode_v3:
|
||||
return le64_to_cpu(bkey_s_c_to_inode_v3(k).v->bi_journal_seq);
|
||||
default:
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
static int found_btree_node_cmp_cookie(const void *_l, const void *_r)
|
||||
{
|
||||
const struct found_btree_node *l = _l;
|
||||
const struct found_btree_node *r = _r;
|
||||
|
||||
return cmp_int(l->btree_id, r->btree_id) ?:
|
||||
cmp_int(l->level, r->level) ?:
|
||||
cmp_int(l->cookie, r->cookie);
|
||||
}
|
||||
|
||||
/*
|
||||
* Given two found btree nodes, if their sequence numbers are equal, take the
|
||||
* one that's readable:
|
||||
*/
|
||||
static int found_btree_node_cmp_time(const struct found_btree_node *l,
|
||||
const struct found_btree_node *r)
|
||||
{
|
||||
return cmp_int(l->seq, r->seq) ?:
|
||||
cmp_int(l->journal_seq, r->journal_seq);
|
||||
}
|
||||
|
||||
static int found_btree_node_cmp_pos(const void *_l, const void *_r)
|
||||
{
|
||||
const struct found_btree_node *l = _l;
|
||||
const struct found_btree_node *r = _r;
|
||||
|
||||
return cmp_int(l->btree_id, r->btree_id) ?:
|
||||
-cmp_int(l->level, r->level) ?:
|
||||
bpos_cmp(l->min_key, r->min_key) ?:
|
||||
-found_btree_node_cmp_time(l, r);
|
||||
}
|
||||
|
||||
static inline bool found_btree_node_cmp_pos_less(const void *l, const void *r, void *arg)
|
||||
{
|
||||
return found_btree_node_cmp_pos(l, r) < 0;
|
||||
}
|
||||
|
||||
static inline void found_btree_node_swap(void *_l, void *_r, void *arg)
|
||||
{
|
||||
struct found_btree_node *l = _l;
|
||||
struct found_btree_node *r = _r;
|
||||
|
||||
swap(*l, *r);
|
||||
}
|
||||
|
||||
static const struct min_heap_callbacks found_btree_node_heap_cbs = {
|
||||
.less = found_btree_node_cmp_pos_less,
|
||||
.swp = found_btree_node_swap,
|
||||
};
|
||||
|
||||
static void try_read_btree_node(struct find_btree_nodes *f, struct bch_dev *ca,
|
||||
struct btree *b, struct bio *bio, u64 offset)
|
||||
{
|
||||
struct bch_fs *c = container_of(f, struct bch_fs, found_btree_nodes);
|
||||
struct btree_node *bn = b->data;
|
||||
|
||||
bio_reset(bio, ca->disk_sb.bdev, REQ_OP_READ);
|
||||
bio->bi_iter.bi_sector = offset;
|
||||
bch2_bio_map(bio, b->data, c->opts.block_size);
|
||||
|
||||
u64 submit_time = local_clock();
|
||||
submit_bio_wait(bio);
|
||||
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, submit_time, !bio->bi_status);
|
||||
|
||||
if (bio->bi_status) {
|
||||
bch_err_dev_ratelimited(ca,
|
||||
"IO error in try_read_btree_node() at %llu: %s",
|
||||
offset, bch2_blk_status_to_str(bio->bi_status));
|
||||
return;
|
||||
}
|
||||
|
||||
if (le64_to_cpu(bn->magic) != bset_magic(c))
|
||||
return;
|
||||
|
||||
if (bch2_csum_type_is_encryption(BSET_CSUM_TYPE(&bn->keys))) {
|
||||
if (!c->chacha20_key_set)
|
||||
return;
|
||||
|
||||
struct nonce nonce = btree_nonce(&bn->keys, 0);
|
||||
unsigned bytes = (void *) &bn->keys - (void *) &bn->flags;
|
||||
|
||||
bch2_encrypt(c, BSET_CSUM_TYPE(&bn->keys), nonce, &bn->flags, bytes);
|
||||
}
|
||||
|
||||
if (btree_id_is_alloc(BTREE_NODE_ID(bn)))
|
||||
return;
|
||||
|
||||
if (BTREE_NODE_LEVEL(bn) >= BTREE_MAX_DEPTH)
|
||||
return;
|
||||
|
||||
if (BTREE_NODE_ID(bn) >= BTREE_ID_NR_MAX)
|
||||
return;
|
||||
|
||||
rcu_read_lock();
|
||||
struct found_btree_node n = {
|
||||
.btree_id = BTREE_NODE_ID(bn),
|
||||
.level = BTREE_NODE_LEVEL(bn),
|
||||
.seq = BTREE_NODE_SEQ(bn),
|
||||
.cookie = le64_to_cpu(bn->keys.seq),
|
||||
.min_key = bn->min_key,
|
||||
.max_key = bn->max_key,
|
||||
.nr_ptrs = 1,
|
||||
.ptrs[0].type = 1 << BCH_EXTENT_ENTRY_ptr,
|
||||
.ptrs[0].offset = offset,
|
||||
.ptrs[0].dev = ca->dev_idx,
|
||||
.ptrs[0].gen = bucket_gen_get(ca, sector_to_bucket(ca, offset)),
|
||||
};
|
||||
rcu_read_unlock();
|
||||
|
||||
bio_reset(bio, ca->disk_sb.bdev, REQ_OP_READ);
|
||||
bio->bi_iter.bi_sector = offset;
|
||||
bch2_bio_map(bio, b->data, c->opts.btree_node_size);
|
||||
|
||||
submit_time = local_clock();
|
||||
submit_bio_wait(bio);
|
||||
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, submit_time, !bio->bi_status);
|
||||
|
||||
found_btree_node_to_key(&b->key, &n);
|
||||
|
||||
CLASS(printbuf, buf)();
|
||||
if (!bch2_btree_node_read_done(c, ca, b, NULL, &buf)) {
|
||||
/* read_done will swap out b->data for another buffer */
|
||||
bn = b->data;
|
||||
/*
|
||||
* Grab journal_seq here because we want the max journal_seq of
|
||||
* any bset; read_done sorts down to a single set and picks the
|
||||
* max journal_seq
|
||||
*/
|
||||
n.journal_seq = le64_to_cpu(bn->keys.journal_seq),
|
||||
n.sectors_written = b->written;
|
||||
|
||||
mutex_lock(&f->lock);
|
||||
if (BSET_BIG_ENDIAN(&bn->keys) != CPU_BIG_ENDIAN) {
|
||||
bch_err(c, "try_read_btree_node() can't handle endian conversion");
|
||||
f->ret = -EINVAL;
|
||||
goto unlock;
|
||||
}
|
||||
|
||||
if (darray_push(&f->nodes, n))
|
||||
f->ret = -ENOMEM;
|
||||
unlock:
|
||||
mutex_unlock(&f->lock);
|
||||
}
|
||||
}
|
||||
|
||||
static int read_btree_nodes_worker(void *p)
|
||||
{
|
||||
struct find_btree_nodes_worker *w = p;
|
||||
struct bch_fs *c = container_of(w->f, struct bch_fs, found_btree_nodes);
|
||||
struct bch_dev *ca = w->ca;
|
||||
unsigned long last_print = jiffies;
|
||||
struct btree *b = NULL;
|
||||
struct bio *bio = NULL;
|
||||
|
||||
b = __bch2_btree_node_mem_alloc(c);
|
||||
if (!b) {
|
||||
bch_err(c, "read_btree_nodes_worker: error allocating buf");
|
||||
w->f->ret = -ENOMEM;
|
||||
goto err;
|
||||
}
|
||||
|
||||
bio = bio_alloc(NULL, buf_pages(b->data, c->opts.btree_node_size), 0, GFP_KERNEL);
|
||||
if (!bio) {
|
||||
bch_err(c, "read_btree_nodes_worker: error allocating bio");
|
||||
w->f->ret = -ENOMEM;
|
||||
goto err;
|
||||
}
|
||||
|
||||
for (u64 bucket = ca->mi.first_bucket; bucket < ca->mi.nbuckets; bucket++)
|
||||
for (unsigned bucket_offset = 0;
|
||||
bucket_offset + btree_sectors(c) <= ca->mi.bucket_size;
|
||||
bucket_offset += btree_sectors(c)) {
|
||||
if (time_after(jiffies, last_print + HZ * 30)) {
|
||||
u64 cur_sector = bucket * ca->mi.bucket_size + bucket_offset;
|
||||
u64 end_sector = ca->mi.nbuckets * ca->mi.bucket_size;
|
||||
|
||||
bch_info(ca, "%s: %2u%% done", __func__,
|
||||
(unsigned) div64_u64(cur_sector * 100, end_sector));
|
||||
last_print = jiffies;
|
||||
}
|
||||
|
||||
u64 sector = bucket * ca->mi.bucket_size + bucket_offset;
|
||||
|
||||
if (c->sb.version_upgrade_complete >= bcachefs_metadata_version_mi_btree_bitmap &&
|
||||
!bch2_dev_btree_bitmap_marked_sectors(ca, sector, btree_sectors(c)))
|
||||
continue;
|
||||
|
||||
try_read_btree_node(w->f, ca, b, bio, sector);
|
||||
}
|
||||
err:
|
||||
if (b)
|
||||
__btree_node_data_free(b);
|
||||
kfree(b);
|
||||
bio_put(bio);
|
||||
enumerated_ref_put(&ca->io_ref[READ], BCH_DEV_READ_REF_btree_node_scan);
|
||||
closure_put(w->cl);
|
||||
kfree(w);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int read_btree_nodes(struct find_btree_nodes *f)
|
||||
{
|
||||
struct bch_fs *c = container_of(f, struct bch_fs, found_btree_nodes);
|
||||
struct closure cl;
|
||||
int ret = 0;
|
||||
|
||||
closure_init_stack(&cl);
|
||||
|
||||
for_each_online_member(c, ca, BCH_DEV_READ_REF_btree_node_scan) {
|
||||
if (!(ca->mi.data_allowed & BIT(BCH_DATA_btree)))
|
||||
continue;
|
||||
|
||||
struct find_btree_nodes_worker *w = kmalloc(sizeof(*w), GFP_KERNEL);
|
||||
if (!w) {
|
||||
enumerated_ref_put(&ca->io_ref[READ], BCH_DEV_READ_REF_btree_node_scan);
|
||||
ret = -ENOMEM;
|
||||
goto err;
|
||||
}
|
||||
|
||||
w->cl = &cl;
|
||||
w->f = f;
|
||||
w->ca = ca;
|
||||
|
||||
struct task_struct *t = kthread_create(read_btree_nodes_worker, w, "read_btree_nodes/%s", ca->name);
|
||||
ret = PTR_ERR_OR_ZERO(t);
|
||||
if (ret) {
|
||||
enumerated_ref_put(&ca->io_ref[READ], BCH_DEV_READ_REF_btree_node_scan);
|
||||
kfree(w);
|
||||
bch_err_msg(c, ret, "starting kthread");
|
||||
break;
|
||||
}
|
||||
|
||||
closure_get(&cl);
|
||||
enumerated_ref_get(&ca->io_ref[READ], BCH_DEV_READ_REF_btree_node_scan);
|
||||
wake_up_process(t);
|
||||
}
|
||||
err:
|
||||
while (closure_sync_timeout(&cl, sysctl_hung_task_timeout_secs * HZ / 2))
|
||||
;
|
||||
return f->ret ?: ret;
|
||||
}
|
||||
|
||||
static bool nodes_overlap(const struct found_btree_node *l,
|
||||
const struct found_btree_node *r)
|
||||
{
|
||||
return (l->btree_id == r->btree_id &&
|
||||
l->level == r->level &&
|
||||
bpos_gt(l->max_key, r->min_key));
|
||||
}
|
||||
|
||||
static int handle_overwrites(struct bch_fs *c,
|
||||
struct found_btree_node *l,
|
||||
found_btree_nodes *nodes_heap)
|
||||
{
|
||||
struct found_btree_node *r;
|
||||
|
||||
while ((r = min_heap_peek(nodes_heap)) &&
|
||||
nodes_overlap(l, r)) {
|
||||
int cmp = found_btree_node_cmp_time(l, r);
|
||||
|
||||
if (cmp > 0) {
|
||||
if (bpos_cmp(l->max_key, r->max_key) >= 0)
|
||||
min_heap_pop(nodes_heap, &found_btree_node_heap_cbs, NULL);
|
||||
else {
|
||||
r->range_updated = true;
|
||||
r->min_key = bpos_successor(l->max_key);
|
||||
r->range_updated = true;
|
||||
min_heap_sift_down(nodes_heap, 0, &found_btree_node_heap_cbs, NULL);
|
||||
}
|
||||
} else if (cmp < 0) {
|
||||
BUG_ON(bpos_eq(l->min_key, r->min_key));
|
||||
|
||||
l->max_key = bpos_predecessor(r->min_key);
|
||||
l->range_updated = true;
|
||||
} else if (r->level) {
|
||||
min_heap_pop(nodes_heap, &found_btree_node_heap_cbs, NULL);
|
||||
} else {
|
||||
if (bpos_cmp(l->max_key, r->max_key) >= 0)
|
||||
min_heap_pop(nodes_heap, &found_btree_node_heap_cbs, NULL);
|
||||
else {
|
||||
r->range_updated = true;
|
||||
r->min_key = bpos_successor(l->max_key);
|
||||
r->range_updated = true;
|
||||
min_heap_sift_down(nodes_heap, 0, &found_btree_node_heap_cbs, NULL);
|
||||
}
|
||||
}
|
||||
|
||||
cond_resched();
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
int bch2_scan_for_btree_nodes(struct bch_fs *c)
|
||||
{
|
||||
struct find_btree_nodes *f = &c->found_btree_nodes;
|
||||
struct printbuf buf = PRINTBUF;
|
||||
found_btree_nodes nodes_heap = {};
|
||||
size_t dst;
|
||||
int ret = 0;
|
||||
|
||||
if (f->nodes.nr)
|
||||
return 0;
|
||||
|
||||
mutex_init(&f->lock);
|
||||
|
||||
ret = read_btree_nodes(f);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
if (!f->nodes.nr) {
|
||||
bch_err(c, "%s: no btree nodes found", __func__);
|
||||
ret = -EINVAL;
|
||||
goto err;
|
||||
}
|
||||
|
||||
if (0 && c->opts.verbose) {
|
||||
printbuf_reset(&buf);
|
||||
prt_printf(&buf, "%s: nodes found:\n", __func__);
|
||||
found_btree_nodes_to_text(&buf, c, f->nodes);
|
||||
bch2_print_str(c, KERN_INFO, buf.buf);
|
||||
}
|
||||
|
||||
sort_nonatomic(f->nodes.data, f->nodes.nr, sizeof(f->nodes.data[0]), found_btree_node_cmp_cookie, NULL);
|
||||
|
||||
dst = 0;
|
||||
darray_for_each(f->nodes, i) {
|
||||
struct found_btree_node *prev = dst ? f->nodes.data + dst - 1 : NULL;
|
||||
|
||||
if (prev &&
|
||||
prev->cookie == i->cookie) {
|
||||
if (prev->nr_ptrs == ARRAY_SIZE(prev->ptrs)) {
|
||||
bch_err(c, "%s: found too many replicas for btree node", __func__);
|
||||
ret = -EINVAL;
|
||||
goto err;
|
||||
}
|
||||
prev->ptrs[prev->nr_ptrs++] = i->ptrs[0];
|
||||
} else {
|
||||
f->nodes.data[dst++] = *i;
|
||||
}
|
||||
}
|
||||
f->nodes.nr = dst;
|
||||
|
||||
sort_nonatomic(f->nodes.data, f->nodes.nr, sizeof(f->nodes.data[0]), found_btree_node_cmp_pos, NULL);
|
||||
|
||||
if (0 && c->opts.verbose) {
|
||||
printbuf_reset(&buf);
|
||||
prt_printf(&buf, "%s: nodes after merging replicas:\n", __func__);
|
||||
found_btree_nodes_to_text(&buf, c, f->nodes);
|
||||
bch2_print_str(c, KERN_INFO, buf.buf);
|
||||
}
|
||||
|
||||
swap(nodes_heap, f->nodes);
|
||||
|
||||
{
|
||||
/* darray must have same layout as a heap */
|
||||
min_heap_char real_heap;
|
||||
BUILD_BUG_ON(sizeof(nodes_heap.nr) != sizeof(real_heap.nr));
|
||||
BUILD_BUG_ON(sizeof(nodes_heap.size) != sizeof(real_heap.size));
|
||||
BUILD_BUG_ON(offsetof(found_btree_nodes, nr) != offsetof(min_heap_char, nr));
|
||||
BUILD_BUG_ON(offsetof(found_btree_nodes, size) != offsetof(min_heap_char, size));
|
||||
}
|
||||
|
||||
min_heapify_all(&nodes_heap, &found_btree_node_heap_cbs, NULL);
|
||||
|
||||
if (nodes_heap.nr) {
|
||||
ret = darray_push(&f->nodes, *min_heap_peek(&nodes_heap));
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
min_heap_pop(&nodes_heap, &found_btree_node_heap_cbs, NULL);
|
||||
}
|
||||
|
||||
while (true) {
|
||||
ret = handle_overwrites(c, &darray_last(f->nodes), &nodes_heap);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
if (!nodes_heap.nr)
|
||||
break;
|
||||
|
||||
ret = darray_push(&f->nodes, *min_heap_peek(&nodes_heap));
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
min_heap_pop(&nodes_heap, &found_btree_node_heap_cbs, NULL);
|
||||
}
|
||||
|
||||
for (struct found_btree_node *n = f->nodes.data; n < &darray_last(f->nodes); n++)
|
||||
BUG_ON(nodes_overlap(n, n + 1));
|
||||
|
||||
if (0 && c->opts.verbose) {
|
||||
printbuf_reset(&buf);
|
||||
prt_printf(&buf, "%s: nodes found after overwrites:\n", __func__);
|
||||
found_btree_nodes_to_text(&buf, c, f->nodes);
|
||||
bch2_print_str(c, KERN_INFO, buf.buf);
|
||||
} else {
|
||||
bch_info(c, "btree node scan found %zu nodes after overwrites", f->nodes.nr);
|
||||
}
|
||||
|
||||
eytzinger0_sort(f->nodes.data, f->nodes.nr, sizeof(f->nodes.data[0]), found_btree_node_cmp_pos, NULL);
|
||||
err:
|
||||
darray_exit(&nodes_heap);
|
||||
printbuf_exit(&buf);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int found_btree_node_range_start_cmp(const void *_l, const void *_r)
|
||||
{
|
||||
const struct found_btree_node *l = _l;
|
||||
const struct found_btree_node *r = _r;
|
||||
|
||||
return cmp_int(l->btree_id, r->btree_id) ?:
|
||||
-cmp_int(l->level, r->level) ?:
|
||||
bpos_cmp(l->max_key, r->min_key);
|
||||
}
|
||||
|
||||
#define for_each_found_btree_node_in_range(_f, _search, _idx) \
|
||||
for (size_t _idx = eytzinger0_find_gt((_f)->nodes.data, (_f)->nodes.nr, \
|
||||
sizeof((_f)->nodes.data[0]), \
|
||||
found_btree_node_range_start_cmp, &search); \
|
||||
_idx < (_f)->nodes.nr && \
|
||||
(_f)->nodes.data[_idx].btree_id == _search.btree_id && \
|
||||
(_f)->nodes.data[_idx].level == _search.level && \
|
||||
bpos_lt((_f)->nodes.data[_idx].min_key, _search.max_key); \
|
||||
_idx = eytzinger0_next(_idx, (_f)->nodes.nr))
|
||||
|
||||
bool bch2_btree_node_is_stale(struct bch_fs *c, struct btree *b)
|
||||
{
|
||||
struct find_btree_nodes *f = &c->found_btree_nodes;
|
||||
|
||||
struct found_btree_node search = {
|
||||
.btree_id = b->c.btree_id,
|
||||
.level = b->c.level,
|
||||
.min_key = b->data->min_key,
|
||||
.max_key = b->key.k.p,
|
||||
};
|
||||
|
||||
for_each_found_btree_node_in_range(f, search, idx)
|
||||
if (f->nodes.data[idx].seq > BTREE_NODE_SEQ(b->data))
|
||||
return true;
|
||||
return false;
|
||||
}
|
||||
|
||||
int bch2_btree_has_scanned_nodes(struct bch_fs *c, enum btree_id btree)
|
||||
{
|
||||
int ret = bch2_run_print_explicit_recovery_pass(c, BCH_RECOVERY_PASS_scan_for_btree_nodes);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
struct found_btree_node search = {
|
||||
.btree_id = btree,
|
||||
.level = 0,
|
||||
.min_key = POS_MIN,
|
||||
.max_key = SPOS_MAX,
|
||||
};
|
||||
|
||||
for_each_found_btree_node_in_range(&c->found_btree_nodes, search, idx)
|
||||
return true;
|
||||
return false;
|
||||
}
|
||||
|
||||
int bch2_get_scanned_nodes(struct bch_fs *c, enum btree_id btree,
|
||||
unsigned level, struct bpos node_min, struct bpos node_max)
|
||||
{
|
||||
if (btree_id_is_alloc(btree))
|
||||
return 0;
|
||||
|
||||
struct find_btree_nodes *f = &c->found_btree_nodes;
|
||||
|
||||
int ret = bch2_run_print_explicit_recovery_pass(c, BCH_RECOVERY_PASS_scan_for_btree_nodes);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
if (c->opts.verbose) {
|
||||
struct printbuf buf = PRINTBUF;
|
||||
|
||||
prt_str(&buf, "recovery ");
|
||||
bch2_btree_id_level_to_text(&buf, btree, level);
|
||||
prt_str(&buf, " ");
|
||||
bch2_bpos_to_text(&buf, node_min);
|
||||
prt_str(&buf, " - ");
|
||||
bch2_bpos_to_text(&buf, node_max);
|
||||
|
||||
bch_info(c, "%s(): %s", __func__, buf.buf);
|
||||
printbuf_exit(&buf);
|
||||
}
|
||||
|
||||
struct found_btree_node search = {
|
||||
.btree_id = btree,
|
||||
.level = level,
|
||||
.min_key = node_min,
|
||||
.max_key = node_max,
|
||||
};
|
||||
|
||||
for_each_found_btree_node_in_range(f, search, idx) {
|
||||
struct found_btree_node n = f->nodes.data[idx];
|
||||
|
||||
n.range_updated |= bpos_lt(n.min_key, node_min);
|
||||
n.min_key = bpos_max(n.min_key, node_min);
|
||||
|
||||
n.range_updated |= bpos_gt(n.max_key, node_max);
|
||||
n.max_key = bpos_min(n.max_key, node_max);
|
||||
|
||||
struct { __BKEY_PADDED(k, BKEY_BTREE_PTR_VAL_U64s_MAX); } tmp;
|
||||
|
||||
found_btree_node_to_key(&tmp.k, &n);
|
||||
|
||||
if (c->opts.verbose) {
|
||||
struct printbuf buf = PRINTBUF;
|
||||
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&tmp.k));
|
||||
bch_verbose(c, "%s(): recovering %s", __func__, buf.buf);
|
||||
printbuf_exit(&buf);
|
||||
}
|
||||
|
||||
BUG_ON(bch2_bkey_validate(c, bkey_i_to_s_c(&tmp.k),
|
||||
(struct bkey_validate_context) {
|
||||
.from = BKEY_VALIDATE_btree_node,
|
||||
.level = level + 1,
|
||||
.btree = btree,
|
||||
}));
|
||||
|
||||
ret = bch2_journal_key_insert(c, btree, level + 1, &tmp.k);
|
||||
if (ret)
|
||||
return ret;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
void bch2_find_btree_nodes_exit(struct find_btree_nodes *f)
|
||||
{
|
||||
darray_exit(&f->nodes);
|
||||
}
|
||||
@@ -1,11 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_NODE_SCAN_H
|
||||
#define _BCACHEFS_BTREE_NODE_SCAN_H
|
||||
|
||||
int bch2_scan_for_btree_nodes(struct bch_fs *);
|
||||
bool bch2_btree_node_is_stale(struct bch_fs *, struct btree *);
|
||||
int bch2_btree_has_scanned_nodes(struct bch_fs *, enum btree_id);
|
||||
int bch2_get_scanned_nodes(struct bch_fs *, enum btree_id, unsigned, struct bpos, struct bpos);
|
||||
void bch2_find_btree_nodes_exit(struct find_btree_nodes *);
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_NODE_SCAN_H */
|
||||
@@ -1,31 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_NODE_SCAN_TYPES_H
|
||||
#define _BCACHEFS_BTREE_NODE_SCAN_TYPES_H
|
||||
|
||||
#include "darray.h"
|
||||
|
||||
struct found_btree_node {
|
||||
bool range_updated:1;
|
||||
u8 btree_id;
|
||||
u8 level;
|
||||
unsigned sectors_written;
|
||||
u32 seq;
|
||||
u64 journal_seq;
|
||||
u64 cookie;
|
||||
|
||||
struct bpos min_key;
|
||||
struct bpos max_key;
|
||||
|
||||
unsigned nr_ptrs;
|
||||
struct bch_extent_ptr ptrs[BCH_REPLICAS_MAX];
|
||||
};
|
||||
|
||||
typedef DARRAY(struct found_btree_node) found_btree_nodes;
|
||||
|
||||
struct find_btree_nodes {
|
||||
int ret;
|
||||
struct mutex lock;
|
||||
found_btree_nodes nodes;
|
||||
};
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_NODE_SCAN_TYPES_H */
|
||||
File diff suppressed because it is too large
Load Diff
@@ -1,937 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_TYPES_H
|
||||
#define _BCACHEFS_BTREE_TYPES_H
|
||||
|
||||
#include <linux/list.h>
|
||||
#include <linux/rhashtable.h>
|
||||
|
||||
#include "bbpos_types.h"
|
||||
#include "btree_key_cache_types.h"
|
||||
#include "buckets_types.h"
|
||||
#include "darray.h"
|
||||
#include "errcode.h"
|
||||
#include "journal_types.h"
|
||||
#include "replicas_types.h"
|
||||
#include "six.h"
|
||||
|
||||
struct open_bucket;
|
||||
struct btree_update;
|
||||
struct btree_trans;
|
||||
|
||||
#define MAX_BSETS 3U
|
||||
|
||||
struct btree_nr_keys {
|
||||
|
||||
/*
|
||||
* Amount of live metadata (i.e. size of node after a compaction) in
|
||||
* units of u64s
|
||||
*/
|
||||
u16 live_u64s;
|
||||
u16 bset_u64s[MAX_BSETS];
|
||||
|
||||
/* live keys only: */
|
||||
u16 packed_keys;
|
||||
u16 unpacked_keys;
|
||||
};
|
||||
|
||||
struct bset_tree {
|
||||
/*
|
||||
* We construct a binary tree in an array as if the array
|
||||
* started at 1, so that things line up on the same cachelines
|
||||
* better: see comments in bset.c at cacheline_to_bkey() for
|
||||
* details
|
||||
*/
|
||||
|
||||
/* size of the binary tree and prev array */
|
||||
u16 size;
|
||||
|
||||
/* function of size - precalculated for to_inorder() */
|
||||
u16 extra;
|
||||
|
||||
u16 data_offset;
|
||||
u16 aux_data_offset;
|
||||
u16 end_offset;
|
||||
};
|
||||
|
||||
struct btree_write {
|
||||
struct journal_entry_pin journal;
|
||||
};
|
||||
|
||||
struct btree_alloc {
|
||||
struct open_buckets ob;
|
||||
__BKEY_PADDED(k, BKEY_BTREE_PTR_VAL_U64s_MAX);
|
||||
};
|
||||
|
||||
struct btree_bkey_cached_common {
|
||||
struct six_lock lock;
|
||||
u8 level;
|
||||
u8 btree_id;
|
||||
bool cached;
|
||||
};
|
||||
|
||||
struct btree {
|
||||
struct btree_bkey_cached_common c;
|
||||
|
||||
struct rhash_head hash;
|
||||
u64 hash_val;
|
||||
|
||||
unsigned long flags;
|
||||
u16 written;
|
||||
u8 nsets;
|
||||
u8 nr_key_bits;
|
||||
u16 version_ondisk;
|
||||
|
||||
struct bkey_format format;
|
||||
|
||||
struct btree_node *data;
|
||||
void *aux_data;
|
||||
|
||||
/*
|
||||
* Sets of sorted keys - the real btree node - plus a binary search tree
|
||||
*
|
||||
* set[0] is special; set[0]->tree, set[0]->prev and set[0]->data point
|
||||
* to the memory we have allocated for this btree node. Additionally,
|
||||
* set[0]->data points to the entire btree node as it exists on disk.
|
||||
*/
|
||||
struct bset_tree set[MAX_BSETS];
|
||||
|
||||
struct btree_nr_keys nr;
|
||||
u16 sib_u64s[2];
|
||||
u16 whiteout_u64s;
|
||||
u8 byte_order;
|
||||
u8 unpack_fn_len;
|
||||
|
||||
struct btree_write writes[2];
|
||||
|
||||
/* Key/pointer for this btree node */
|
||||
__BKEY_PADDED(key, BKEY_BTREE_PTR_VAL_U64s_MAX);
|
||||
|
||||
/*
|
||||
* XXX: add a delete sequence number, so when bch2_btree_node_relock()
|
||||
* fails because the lock sequence number has changed - i.e. the
|
||||
* contents were modified - we can still relock the node if it's still
|
||||
* the one we want, without redoing the traversal
|
||||
*/
|
||||
|
||||
/*
|
||||
* For asynchronous splits/interior node updates:
|
||||
* When we do a split, we allocate new child nodes and update the parent
|
||||
* node to point to them: we update the parent in memory immediately,
|
||||
* but then we must wait until the children have been written out before
|
||||
* the update to the parent can be written - this is a list of the
|
||||
* btree_updates that are blocking this node from being
|
||||
* written:
|
||||
*/
|
||||
struct list_head write_blocked;
|
||||
|
||||
/*
|
||||
* Also for asynchronous splits/interior node updates:
|
||||
* If a btree node isn't reachable yet, we don't want to kick off
|
||||
* another write - because that write also won't yet be reachable and
|
||||
* marking it as completed before it's reachable would be incorrect:
|
||||
*/
|
||||
unsigned long will_make_reachable;
|
||||
|
||||
struct open_buckets ob;
|
||||
|
||||
/* lru list */
|
||||
struct list_head list;
|
||||
};
|
||||
|
||||
#define BCH_BTREE_CACHE_NOT_FREED_REASONS() \
|
||||
x(cache_reserve) \
|
||||
x(lock_intent) \
|
||||
x(lock_write) \
|
||||
x(dirty) \
|
||||
x(read_in_flight) \
|
||||
x(write_in_flight) \
|
||||
x(noevict) \
|
||||
x(write_blocked) \
|
||||
x(will_make_reachable) \
|
||||
x(access_bit)
|
||||
|
||||
enum bch_btree_cache_not_freed_reasons {
|
||||
#define x(n) BCH_BTREE_CACHE_NOT_FREED_##n,
|
||||
BCH_BTREE_CACHE_NOT_FREED_REASONS()
|
||||
#undef x
|
||||
BCH_BTREE_CACHE_NOT_FREED_REASONS_NR,
|
||||
};
|
||||
|
||||
struct btree_cache_list {
|
||||
unsigned idx;
|
||||
struct shrinker *shrink;
|
||||
struct list_head list;
|
||||
size_t nr;
|
||||
};
|
||||
|
||||
struct btree_cache {
|
||||
struct rhashtable table;
|
||||
bool table_init_done;
|
||||
/*
|
||||
* We never free a struct btree, except on shutdown - we just put it on
|
||||
* the btree_cache_freed list and reuse it later. This simplifies the
|
||||
* code, and it doesn't cost us much memory as the memory usage is
|
||||
* dominated by buffers that hold the actual btree node data and those
|
||||
* can be freed - and the number of struct btrees allocated is
|
||||
* effectively bounded.
|
||||
*
|
||||
* btree_cache_freeable effectively is a small cache - we use it because
|
||||
* high order page allocations can be rather expensive, and it's quite
|
||||
* common to delete and allocate btree nodes in quick succession. It
|
||||
* should never grow past ~2-3 nodes in practice.
|
||||
*/
|
||||
struct mutex lock;
|
||||
struct list_head freeable;
|
||||
struct list_head freed_pcpu;
|
||||
struct list_head freed_nonpcpu;
|
||||
struct btree_cache_list live[2];
|
||||
|
||||
size_t nr_freeable;
|
||||
size_t nr_reserve;
|
||||
size_t nr_by_btree[BTREE_ID_NR];
|
||||
atomic_long_t nr_dirty;
|
||||
|
||||
/* shrinker stats */
|
||||
size_t nr_freed;
|
||||
u64 not_freed[BCH_BTREE_CACHE_NOT_FREED_REASONS_NR];
|
||||
|
||||
/*
|
||||
* If we need to allocate memory for a new btree node and that
|
||||
* allocation fails, we can cannibalize another node in the btree cache
|
||||
* to satisfy the allocation - lock to guarantee only one thread does
|
||||
* this at a time:
|
||||
*/
|
||||
struct task_struct *alloc_lock;
|
||||
struct closure_waitlist alloc_wait;
|
||||
|
||||
struct bbpos pinned_nodes_start;
|
||||
struct bbpos pinned_nodes_end;
|
||||
/* btree id mask: 0 for leaves, 1 for interior */
|
||||
u64 pinned_nodes_mask[2];
|
||||
};
|
||||
|
||||
struct btree_node_iter {
|
||||
struct btree_node_iter_set {
|
||||
u16 k, end;
|
||||
} data[MAX_BSETS];
|
||||
};
|
||||
|
||||
#define BTREE_ITER_FLAGS() \
|
||||
x(slots) \
|
||||
x(intent) \
|
||||
x(prefetch) \
|
||||
x(is_extents) \
|
||||
x(not_extents) \
|
||||
x(cached) \
|
||||
x(with_key_cache) \
|
||||
x(with_updates) \
|
||||
x(with_journal) \
|
||||
x(snapshot_field) \
|
||||
x(all_snapshots) \
|
||||
x(filter_snapshots) \
|
||||
x(nopreserve) \
|
||||
x(cached_nofill) \
|
||||
x(key_cache_fill) \
|
||||
|
||||
#define STR_HASH_FLAGS() \
|
||||
x(must_create) \
|
||||
x(must_replace)
|
||||
|
||||
#define BTREE_UPDATE_FLAGS() \
|
||||
x(internal_snapshot_node) \
|
||||
x(nojournal) \
|
||||
x(key_cache_reclaim)
|
||||
|
||||
|
||||
/*
|
||||
* BTREE_TRIGGER_norun - don't run triggers at all
|
||||
*
|
||||
* BTREE_TRIGGER_transactional - we're running transactional triggers as part of
|
||||
* a transaction commit: triggers may generate new updates
|
||||
*
|
||||
* BTREE_TRIGGER_atomic - we're running atomic triggers during a transaction
|
||||
* commit: we have our journal reservation, we're holding btree node write
|
||||
* locks, and we know the transaction is going to commit (returning an error
|
||||
* here is a fatal error, causing us to go emergency read-only)
|
||||
*
|
||||
* BTREE_TRIGGER_gc - we're in gc/fsck: running triggers to recalculate e.g. disk usage
|
||||
*
|
||||
* BTREE_TRIGGER_insert - @new is entering the btree
|
||||
* BTREE_TRIGGER_overwrite - @old is leaving the btree
|
||||
*/
|
||||
#define BTREE_TRIGGER_FLAGS() \
|
||||
x(norun) \
|
||||
x(transactional) \
|
||||
x(atomic) \
|
||||
x(check_repair) \
|
||||
x(gc) \
|
||||
x(insert) \
|
||||
x(overwrite) \
|
||||
x(is_root)
|
||||
|
||||
enum {
|
||||
#define x(n) BTREE_ITER_FLAG_BIT_##n,
|
||||
BTREE_ITER_FLAGS()
|
||||
STR_HASH_FLAGS()
|
||||
BTREE_UPDATE_FLAGS()
|
||||
BTREE_TRIGGER_FLAGS()
|
||||
#undef x
|
||||
};
|
||||
|
||||
/* iter flags must fit in a u16: */
|
||||
//BUILD_BUG_ON(BTREE_ITER_FLAG_BIT_key_cache_fill > 15);
|
||||
|
||||
enum btree_iter_update_trigger_flags {
|
||||
#define x(n) BTREE_ITER_##n = 1U << BTREE_ITER_FLAG_BIT_##n,
|
||||
BTREE_ITER_FLAGS()
|
||||
#undef x
|
||||
#define x(n) STR_HASH_##n = 1U << BTREE_ITER_FLAG_BIT_##n,
|
||||
STR_HASH_FLAGS()
|
||||
#undef x
|
||||
#define x(n) BTREE_UPDATE_##n = 1U << BTREE_ITER_FLAG_BIT_##n,
|
||||
BTREE_UPDATE_FLAGS()
|
||||
#undef x
|
||||
#define x(n) BTREE_TRIGGER_##n = 1U << BTREE_ITER_FLAG_BIT_##n,
|
||||
BTREE_TRIGGER_FLAGS()
|
||||
#undef x
|
||||
};
|
||||
|
||||
enum btree_path_uptodate {
|
||||
BTREE_ITER_UPTODATE = 0,
|
||||
BTREE_ITER_NEED_RELOCK = 1,
|
||||
BTREE_ITER_NEED_TRAVERSE = 2,
|
||||
};
|
||||
|
||||
#if defined(CONFIG_BCACHEFS_LOCK_TIME_STATS) || defined(CONFIG_BCACHEFS_DEBUG)
|
||||
#define TRACK_PATH_ALLOCATED
|
||||
#endif
|
||||
|
||||
typedef u16 btree_path_idx_t;
|
||||
|
||||
struct btree_path {
|
||||
btree_path_idx_t sorted_idx;
|
||||
u8 ref;
|
||||
u8 intent_ref;
|
||||
|
||||
/* btree_iter_copy starts here: */
|
||||
struct bpos pos;
|
||||
|
||||
enum btree_id btree_id:5;
|
||||
bool cached:1;
|
||||
bool preserve:1;
|
||||
enum btree_path_uptodate uptodate:2;
|
||||
/*
|
||||
* When true, failing to relock this path will cause the transaction to
|
||||
* restart:
|
||||
*/
|
||||
bool should_be_locked:1;
|
||||
unsigned level:3,
|
||||
locks_want:3;
|
||||
u8 nodes_locked;
|
||||
|
||||
struct btree_path_level {
|
||||
struct btree *b;
|
||||
struct btree_node_iter iter;
|
||||
u32 lock_seq;
|
||||
#ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
|
||||
u64 lock_taken_time;
|
||||
#endif
|
||||
} l[BTREE_MAX_DEPTH];
|
||||
#ifdef TRACK_PATH_ALLOCATED
|
||||
unsigned long ip_allocated;
|
||||
#endif
|
||||
};
|
||||
|
||||
static inline struct btree_path_level *path_l(struct btree_path *path)
|
||||
{
|
||||
return path->l + path->level;
|
||||
}
|
||||
|
||||
static inline unsigned long btree_path_ip_allocated(struct btree_path *path)
|
||||
{
|
||||
#ifdef TRACK_PATH_ALLOCATED
|
||||
return path->ip_allocated;
|
||||
#else
|
||||
return _THIS_IP_;
|
||||
#endif
|
||||
}
|
||||
|
||||
/*
|
||||
* @pos - iterator's current position
|
||||
* @level - current btree depth
|
||||
* @locks_want - btree level below which we start taking intent locks
|
||||
* @nodes_locked - bitmask indicating which nodes in @nodes are locked
|
||||
* @nodes_intent_locked - bitmask indicating which locks are intent locks
|
||||
*/
|
||||
struct btree_iter {
|
||||
btree_path_idx_t path;
|
||||
btree_path_idx_t update_path;
|
||||
btree_path_idx_t key_cache_path;
|
||||
|
||||
enum btree_id btree_id:8;
|
||||
u8 min_depth;
|
||||
|
||||
/* btree_iter_copy starts here: */
|
||||
u16 flags;
|
||||
|
||||
/* When we're filtering by snapshot, the snapshot ID we're looking for: */
|
||||
unsigned snapshot;
|
||||
|
||||
struct bpos pos;
|
||||
/*
|
||||
* Current unpacked key - so that bch2_btree_iter_next()/
|
||||
* bch2_btree_iter_next_slot() can correctly advance pos.
|
||||
*/
|
||||
struct bkey k;
|
||||
|
||||
/* BTREE_ITER_with_journal: */
|
||||
size_t journal_idx;
|
||||
#ifdef TRACK_PATH_ALLOCATED
|
||||
unsigned long ip_allocated;
|
||||
#endif
|
||||
};
|
||||
|
||||
#define BKEY_CACHED_ACCESSED 0
|
||||
#define BKEY_CACHED_DIRTY 1
|
||||
|
||||
struct bkey_cached {
|
||||
struct btree_bkey_cached_common c;
|
||||
|
||||
unsigned long flags;
|
||||
u16 u64s;
|
||||
struct bkey_cached_key key;
|
||||
|
||||
struct rhash_head hash;
|
||||
|
||||
struct journal_entry_pin journal;
|
||||
u64 seq;
|
||||
|
||||
struct bkey_i *k;
|
||||
struct rcu_head rcu;
|
||||
};
|
||||
|
||||
static inline struct bpos btree_node_pos(struct btree_bkey_cached_common *b)
|
||||
{
|
||||
return !b->cached
|
||||
? container_of(b, struct btree, c)->key.k.p
|
||||
: container_of(b, struct bkey_cached, c)->key.pos;
|
||||
}
|
||||
|
||||
struct btree_insert_entry {
|
||||
unsigned flags;
|
||||
u8 sort_order;
|
||||
u8 bkey_type;
|
||||
enum btree_id btree_id:8;
|
||||
u8 level:4;
|
||||
bool cached:1;
|
||||
bool insert_trigger_run:1;
|
||||
bool overwrite_trigger_run:1;
|
||||
bool key_cache_already_flushed:1;
|
||||
/*
|
||||
* @old_k may be a key from the journal; @old_btree_u64s always refers
|
||||
* to the size of the key being overwritten in the btree:
|
||||
*/
|
||||
u8 old_btree_u64s;
|
||||
btree_path_idx_t path;
|
||||
struct bkey_i *k;
|
||||
/* key being overwritten: */
|
||||
struct bkey old_k;
|
||||
const struct bch_val *old_v;
|
||||
unsigned long ip_allocated;
|
||||
};
|
||||
|
||||
/* Number of btree paths we preallocate, usually enough */
#define BTREE_ITER_INITIAL		64
/*
 * Limit for btree_trans_too_many_iters(); this is enough that almost all code
 * paths should run inside this limit, and if they don't it usually indicates a
 * bug (leaking/duplicated btree paths).
 *
 * exception: some fsck paths
 *
 * bugs with excessive path usage seem to have possibly been eliminated now, so
 * we might consider eliminating this (and btree_trans_too_many_iter()) at some
 * point.
 */
#define BTREE_ITER_NORMAL_LIMIT		256
/* never exceed limit */
#define BTREE_ITER_MAX			(1U << 10)
||||
|
||||
struct btree_trans_commit_hook;
|
||||
typedef int (btree_trans_commit_hook_fn)(struct btree_trans *, struct btree_trans_commit_hook *);
|
||||
|
||||
struct btree_trans_commit_hook {
|
||||
btree_trans_commit_hook_fn *fn;
|
||||
struct btree_trans_commit_hook *next;
|
||||
};
|
||||
|
||||
#define BTREE_TRANS_MEM_MAX (1U << 16)
|
||||
|
||||
#define BTREE_TRANS_MAX_LOCK_HOLD_TIME_NS 10000
|
||||
|
||||
struct btree_trans_paths {
|
||||
unsigned long nr_paths;
|
||||
struct btree_path paths[];
|
||||
};
|
||||
|
||||
struct trans_kmalloc_trace {
|
||||
unsigned long ip;
|
||||
size_t bytes;
|
||||
};
|
||||
typedef DARRAY(struct trans_kmalloc_trace) darray_trans_kmalloc_trace;
|
||||
|
||||
struct btree_trans_subbuf {
	u16			base;
	u16			u64s;
	u16			size;
};
|
||||
|
||||
struct btree_trans {
|
||||
struct bch_fs *c;
|
||||
|
||||
unsigned long *paths_allocated;
|
||||
struct btree_path *paths;
|
||||
btree_path_idx_t *sorted;
|
||||
struct btree_insert_entry *updates;
|
||||
|
||||
void *mem;
|
||||
unsigned mem_top;
|
||||
unsigned mem_bytes;
|
||||
unsigned realloc_bytes_required;
|
||||
#ifdef CONFIG_BCACHEFS_TRANS_KMALLOC_TRACE
|
||||
darray_trans_kmalloc_trace trans_kmalloc_trace;
|
||||
#endif
|
||||
|
||||
btree_path_idx_t nr_sorted;
|
||||
btree_path_idx_t nr_paths;
|
||||
btree_path_idx_t nr_paths_max;
|
||||
btree_path_idx_t nr_updates;
|
||||
u8 fn_idx;
|
||||
u8 lock_must_abort;
|
||||
bool lock_may_not_fail:1;
|
||||
bool srcu_held:1;
|
||||
bool locked:1;
|
||||
bool pf_memalloc_nofs:1;
|
||||
bool write_locked:1;
|
||||
bool used_mempool:1;
|
||||
bool in_traverse_all:1;
|
||||
bool paths_sorted:1;
|
||||
bool memory_allocation_failure:1;
|
||||
bool journal_transaction_names:1;
|
||||
bool journal_replay_not_finished:1;
|
||||
bool notrace_relock_fail:1;
|
||||
enum bch_errcode restarted:16;
|
||||
u32 restart_count;
|
||||
#ifdef CONFIG_BCACHEFS_INJECT_TRANSACTION_RESTARTS
|
||||
u32 restart_count_this_trans;
|
||||
#endif
|
||||
|
||||
u64 last_begin_time;
|
||||
unsigned long last_begin_ip;
|
||||
unsigned long last_restarted_ip;
|
||||
#ifdef CONFIG_BCACHEFS_DEBUG
|
||||
bch_stacktrace last_restarted_trace;
|
||||
#endif
|
||||
unsigned long last_unlock_ip;
|
||||
unsigned long srcu_lock_time;
|
||||
|
||||
const char *fn;
|
||||
struct btree_bkey_cached_common *locking;
|
||||
struct six_lock_waiter locking_wait;
|
||||
int srcu_idx;
|
||||
|
||||
/* update path: */
|
||||
struct btree_trans_subbuf journal_entries;
|
||||
struct btree_trans_subbuf accounting;
|
||||
|
||||
struct btree_trans_commit_hook *hooks;
|
||||
struct journal_entry_pin *journal_pin;
|
||||
|
||||
struct journal_res journal_res;
|
||||
u64 *journal_seq;
|
||||
struct disk_reservation *disk_res;
|
||||
|
||||
struct bch_fs_usage_base fs_usage_delta;
|
||||
|
||||
unsigned journal_u64s;
|
||||
unsigned extra_disk_res; /* XXX kill */
|
||||
|
||||
__BKEY_PADDED(btree_path_down, BKEY_BTREE_PTR_VAL_U64s_MAX);
|
||||
|
||||
#ifdef CONFIG_DEBUG_LOCK_ALLOC
|
||||
struct lockdep_map dep_map;
|
||||
#endif
|
||||
/* Entries before this are zeroed out on every bch2_trans_get() call */
|
||||
|
||||
struct list_head list;
|
||||
struct closure ref;
|
||||
|
||||
unsigned long _paths_allocated[BITS_TO_LONGS(BTREE_ITER_INITIAL)];
|
||||
struct btree_trans_paths trans_paths;
|
||||
struct btree_path _paths[BTREE_ITER_INITIAL];
|
||||
btree_path_idx_t _sorted[BTREE_ITER_INITIAL + 4];
|
||||
struct btree_insert_entry _updates[BTREE_ITER_INITIAL];
|
||||
};
|
||||
|
||||
static inline struct btree_path *btree_iter_path(struct btree_trans *trans, struct btree_iter *iter)
|
||||
{
|
||||
return trans->paths + iter->path;
|
||||
}
|
||||
|
||||
static inline struct btree_path *btree_iter_key_cache_path(struct btree_trans *trans, struct btree_iter *iter)
|
||||
{
|
||||
return iter->key_cache_path
|
||||
? trans->paths + iter->key_cache_path
|
||||
: NULL;
|
||||
}
|
||||
|
||||
#define BCH_BTREE_WRITE_TYPES() \
|
||||
x(initial, 0) \
|
||||
x(init_next_bset, 1) \
|
||||
x(cache_reclaim, 2) \
|
||||
x(journal_reclaim, 3) \
|
||||
x(interior, 4)
|
||||
|
||||
enum btree_write_type {
|
||||
#define x(t, n) BTREE_WRITE_##t,
|
||||
BCH_BTREE_WRITE_TYPES()
|
||||
#undef x
|
||||
BTREE_WRITE_TYPE_NR,
|
||||
};
|
||||
|
||||
#define BTREE_WRITE_TYPE_MASK (roundup_pow_of_two(BTREE_WRITE_TYPE_NR) - 1)
|
||||
#define BTREE_WRITE_TYPE_BITS ilog2(roundup_pow_of_two(BTREE_WRITE_TYPE_NR))
|
||||
|
||||
#define BTREE_FLAGS() \
|
||||
x(read_in_flight) \
|
||||
x(read_error) \
|
||||
x(dirty) \
|
||||
x(need_write) \
|
||||
x(write_blocked) \
|
||||
x(will_make_reachable) \
|
||||
x(noevict) \
|
||||
x(write_idx) \
|
||||
x(accessed) \
|
||||
x(write_in_flight) \
|
||||
x(write_in_flight_inner) \
|
||||
x(just_written) \
|
||||
x(dying) \
|
||||
x(fake) \
|
||||
x(need_rewrite) \
|
||||
x(need_rewrite_error) \
|
||||
x(need_rewrite_degraded) \
|
||||
x(need_rewrite_ptr_written_zero) \
|
||||
x(never_write) \
|
||||
x(pinned)
|
||||
|
||||
enum btree_flags {
|
||||
/* First bits for btree node write type */
|
||||
BTREE_NODE_FLAGS_START = BTREE_WRITE_TYPE_BITS - 1,
|
||||
#define x(flag) BTREE_NODE_##flag,
|
||||
BTREE_FLAGS()
|
||||
#undef x
|
||||
};
|
||||
|
||||
#define x(flag) \
|
||||
static inline bool btree_node_ ## flag(struct btree *b) \
|
||||
{ return test_bit(BTREE_NODE_ ## flag, &b->flags); } \
|
||||
\
|
||||
static inline void set_btree_node_ ## flag(struct btree *b) \
|
||||
{ set_bit(BTREE_NODE_ ## flag, &b->flags); } \
|
||||
\
|
||||
static inline void clear_btree_node_ ## flag(struct btree *b) \
|
||||
{ clear_bit(BTREE_NODE_ ## flag, &b->flags); }
|
||||
|
||||
BTREE_FLAGS()
|
||||
#undef x
|
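Purely for illustration, and not part of the removed file: for a single entry, say `x(dirty)`, the generator above expands to three static inline accessors over `b->flags`:

```c
/* Illustrative preprocessor expansion of the x(dirty) entry above: */
static inline bool btree_node_dirty(struct btree *b)
{ return test_bit(BTREE_NODE_dirty, &b->flags); }

static inline void set_btree_node_dirty(struct btree *b)
{ set_bit(BTREE_NODE_dirty, &b->flags); }

static inline void clear_btree_node_dirty(struct btree *b)
{ clear_bit(BTREE_NODE_dirty, &b->flags); }
```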
||||
|
||||
#define BTREE_NODE_REWRITE_REASON() \
|
||||
x(none) \
|
||||
x(unknown) \
|
||||
x(error) \
|
||||
x(degraded) \
|
||||
x(ptr_written_zero)
|
||||
|
||||
enum btree_node_rewrite_reason {
|
||||
#define x(n) BTREE_NODE_REWRITE_##n,
|
||||
BTREE_NODE_REWRITE_REASON()
|
||||
#undef x
|
||||
};
|
||||
|
||||
static inline enum btree_node_rewrite_reason btree_node_rewrite_reason(struct btree *b)
|
||||
{
|
||||
if (btree_node_need_rewrite_ptr_written_zero(b))
|
||||
return BTREE_NODE_REWRITE_ptr_written_zero;
|
||||
if (btree_node_need_rewrite_degraded(b))
|
||||
return BTREE_NODE_REWRITE_degraded;
|
||||
if (btree_node_need_rewrite_error(b))
|
||||
return BTREE_NODE_REWRITE_error;
|
||||
if (btree_node_need_rewrite(b))
|
||||
return BTREE_NODE_REWRITE_unknown;
|
||||
return BTREE_NODE_REWRITE_none;
|
||||
}
|
||||
|
||||
static inline struct btree_write *btree_current_write(struct btree *b)
|
||||
{
|
||||
return b->writes + btree_node_write_idx(b);
|
||||
}
|
||||
|
||||
static inline struct btree_write *btree_prev_write(struct btree *b)
|
||||
{
|
||||
return b->writes + (btree_node_write_idx(b) ^ 1);
|
||||
}
|
||||
|
||||
static inline struct bset_tree *bset_tree_last(struct btree *b)
|
||||
{
|
||||
EBUG_ON(!b->nsets);
|
||||
return b->set + b->nsets - 1;
|
||||
}
|
||||
|
||||
static inline void *
|
||||
__btree_node_offset_to_ptr(const struct btree *b, u16 offset)
|
||||
{
|
||||
return (void *) ((u64 *) b->data + offset);
|
||||
}
|
||||
|
||||
static inline u16
|
||||
__btree_node_ptr_to_offset(const struct btree *b, const void *p)
|
||||
{
|
||||
u16 ret = (u64 *) p - (u64 *) b->data;
|
||||
|
||||
EBUG_ON(__btree_node_offset_to_ptr(b, ret) != p);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static inline struct bset *bset(const struct btree *b,
|
||||
const struct bset_tree *t)
|
||||
{
|
||||
return __btree_node_offset_to_ptr(b, t->data_offset);
|
||||
}
|
||||
|
||||
static inline void set_btree_bset_end(struct btree *b, struct bset_tree *t)
|
||||
{
|
||||
t->end_offset =
|
||||
__btree_node_ptr_to_offset(b, vstruct_last(bset(b, t)));
|
||||
}
|
||||
|
||||
static inline void set_btree_bset(struct btree *b, struct bset_tree *t,
|
||||
const struct bset *i)
|
||||
{
|
||||
t->data_offset = __btree_node_ptr_to_offset(b, i);
|
||||
set_btree_bset_end(b, t);
|
||||
}
|
||||
|
||||
static inline struct bset *btree_bset_first(struct btree *b)
|
||||
{
|
||||
return bset(b, b->set);
|
||||
}
|
||||
|
||||
static inline struct bset *btree_bset_last(struct btree *b)
|
||||
{
|
||||
return bset(b, bset_tree_last(b));
|
||||
}
|
||||
|
||||
static inline u16
|
||||
__btree_node_key_to_offset(const struct btree *b, const struct bkey_packed *k)
|
||||
{
|
||||
return __btree_node_ptr_to_offset(b, k);
|
||||
}
|
||||
|
||||
static inline struct bkey_packed *
|
||||
__btree_node_offset_to_key(const struct btree *b, u16 k)
|
||||
{
|
||||
return __btree_node_offset_to_ptr(b, k);
|
||||
}
|
||||
|
||||
static inline unsigned btree_bkey_first_offset(const struct bset_tree *t)
|
||||
{
|
||||
return t->data_offset + offsetof(struct bset, _data) / sizeof(u64);
|
||||
}
|
||||
|
||||
#define btree_bkey_first(_b, _t) \
|
||||
({ \
|
||||
EBUG_ON(bset(_b, _t)->start != \
|
||||
__btree_node_offset_to_key(_b, btree_bkey_first_offset(_t)));\
|
||||
\
|
||||
bset(_b, _t)->start; \
|
||||
})
|
||||
|
||||
#define btree_bkey_last(_b, _t) \
|
||||
({ \
|
||||
EBUG_ON(__btree_node_offset_to_key(_b, (_t)->end_offset) != \
|
||||
vstruct_last(bset(_b, _t))); \
|
||||
\
|
||||
__btree_node_offset_to_key(_b, (_t)->end_offset); \
|
||||
})
|
||||
|
||||
static inline unsigned bset_u64s(struct bset_tree *t)
|
||||
{
|
||||
return t->end_offset - t->data_offset -
|
||||
sizeof(struct bset) / sizeof(u64);
|
||||
}
|
||||
|
||||
static inline unsigned bset_dead_u64s(struct btree *b, struct bset_tree *t)
|
||||
{
|
||||
return bset_u64s(t) - b->nr.bset_u64s[t - b->set];
|
||||
}
|
||||
|
||||
static inline unsigned bset_byte_offset(struct btree *b, void *i)
|
||||
{
|
||||
return i - (void *) b->data;
|
||||
}
|
||||
|
||||
enum btree_node_type {
|
||||
BKEY_TYPE_btree,
|
||||
#define x(kwd, val, ...) BKEY_TYPE_##kwd = val + 1,
|
||||
BCH_BTREE_IDS()
|
||||
#undef x
|
||||
BKEY_TYPE_NR
|
||||
};
|
||||
|
||||
/* Type of a key in btree @id at level @level: */
|
||||
static inline enum btree_node_type __btree_node_type(unsigned level, enum btree_id id)
|
||||
{
|
||||
return level ? BKEY_TYPE_btree : (unsigned) id + 1;
|
||||
}
|
||||
|
||||
/* Type of keys @b contains: */
|
||||
static inline enum btree_node_type btree_node_type(struct btree *b)
|
||||
{
|
||||
return __btree_node_type(b->c.level, b->c.btree_id);
|
||||
}
|
||||
|
||||
const char *bch2_btree_node_type_str(enum btree_node_type);
|
||||
|
||||
#define BTREE_NODE_TYPE_HAS_TRANS_TRIGGERS \
|
||||
(BIT_ULL(BKEY_TYPE_extents)| \
|
||||
BIT_ULL(BKEY_TYPE_alloc)| \
|
||||
BIT_ULL(BKEY_TYPE_inodes)| \
|
||||
BIT_ULL(BKEY_TYPE_stripes)| \
|
||||
BIT_ULL(BKEY_TYPE_reflink)| \
|
||||
BIT_ULL(BKEY_TYPE_subvolumes)| \
|
||||
BIT_ULL(BKEY_TYPE_btree))
|
||||
|
||||
#define BTREE_NODE_TYPE_HAS_ATOMIC_TRIGGERS \
|
||||
(BIT_ULL(BKEY_TYPE_alloc)| \
|
||||
BIT_ULL(BKEY_TYPE_inodes)| \
|
||||
BIT_ULL(BKEY_TYPE_stripes)| \
|
||||
BIT_ULL(BKEY_TYPE_snapshots))
|
||||
|
||||
#define BTREE_NODE_TYPE_HAS_TRIGGERS \
|
||||
(BTREE_NODE_TYPE_HAS_TRANS_TRIGGERS| \
|
||||
BTREE_NODE_TYPE_HAS_ATOMIC_TRIGGERS)
|
||||
|
||||
static inline bool btree_node_type_has_trans_triggers(enum btree_node_type type)
|
||||
{
|
||||
return BIT_ULL(type) & BTREE_NODE_TYPE_HAS_TRANS_TRIGGERS;
|
||||
}
|
||||
|
||||
static inline bool btree_node_type_has_atomic_triggers(enum btree_node_type type)
|
||||
{
|
||||
return BIT_ULL(type) & BTREE_NODE_TYPE_HAS_ATOMIC_TRIGGERS;
|
||||
}
|
||||
|
||||
static inline bool btree_node_type_has_triggers(enum btree_node_type type)
|
||||
{
|
||||
return BIT_ULL(type) & BTREE_NODE_TYPE_HAS_TRIGGERS;
|
||||
}
|
||||
|
||||
static inline bool btree_id_is_extents(enum btree_id btree)
|
||||
{
|
||||
const u64 mask = 0
|
||||
#define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_extents)) << nr)
|
||||
BCH_BTREE_IDS()
|
||||
#undef x
|
||||
;
|
||||
|
||||
return BIT_ULL(btree) & mask;
|
||||
}
|
||||
|
||||
static inline bool btree_node_type_is_extents(enum btree_node_type type)
|
||||
{
|
||||
return type != BKEY_TYPE_btree && btree_id_is_extents(type - 1);
|
||||
}
|
||||
|
||||
static inline bool btree_type_has_snapshots(enum btree_id btree)
|
||||
{
|
||||
const u64 mask = 0
|
||||
#define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_snapshots)) << nr)
|
||||
BCH_BTREE_IDS()
|
||||
#undef x
|
||||
;
|
||||
|
||||
return BIT_ULL(btree) & mask;
|
||||
}
|
||||
|
||||
static inline bool btree_type_has_snapshot_field(enum btree_id btree)
|
||||
{
|
||||
const u64 mask = 0
|
||||
#define x(name, nr, flags, ...) |((!!((flags) & (BTREE_IS_snapshot_field|BTREE_IS_snapshots))) << nr)
|
||||
BCH_BTREE_IDS()
|
||||
#undef x
|
||||
;
|
||||
|
||||
return BIT_ULL(btree) & mask;
|
||||
}
|
||||
|
||||
static inline bool btree_type_has_ptrs(enum btree_id btree)
|
||||
{
|
||||
const u64 mask = 0
|
||||
#define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_data)) << nr)
|
||||
BCH_BTREE_IDS()
|
||||
#undef x
|
||||
;
|
||||
|
||||
return BIT_ULL(btree) & mask;
|
||||
}
|
||||
|
||||
static inline bool btree_type_uses_write_buffer(enum btree_id btree)
|
||||
{
|
||||
const u64 mask = 0
|
||||
#define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_write_buffer)) << nr)
|
||||
BCH_BTREE_IDS()
|
||||
#undef x
|
||||
;
|
||||
|
||||
return BIT_ULL(btree) & mask;
|
||||
}
|
||||
|
||||
static inline u8 btree_trigger_order(enum btree_id btree)
|
||||
{
|
||||
switch (btree) {
|
||||
case BTREE_ID_alloc:
|
||||
return U8_MAX;
|
||||
case BTREE_ID_stripes:
|
||||
return U8_MAX - 1;
|
||||
default:
|
||||
return btree;
|
||||
}
|
||||
}
|
||||
|
||||
struct btree_root {
|
||||
struct btree *b;
|
||||
|
||||
/* On disk root - see async splits: */
|
||||
__BKEY_PADDED(key, BKEY_BTREE_PTR_VAL_U64s_MAX);
|
||||
u8 level;
|
||||
u8 alive;
|
||||
s16 error;
|
||||
};
|
||||
|
||||
enum btree_gc_coalesce_fail_reason {
|
||||
BTREE_GC_COALESCE_FAIL_RESERVE_GET,
|
||||
BTREE_GC_COALESCE_FAIL_KEYLIST_REALLOC,
|
||||
BTREE_GC_COALESCE_FAIL_FORMAT_FITS,
|
||||
};
|
||||
|
||||
enum btree_node_sibling {
|
||||
btree_prev_sib,
|
||||
btree_next_sib,
|
||||
};
|
||||
|
||||
struct get_locks_fail {
|
||||
unsigned l;
|
||||
struct btree *b;
|
||||
};
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_TYPES_H */
|
||||
@@ -1,916 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "btree_update.h"
|
||||
#include "btree_iter.h"
|
||||
#include "btree_journal_iter.h"
|
||||
#include "btree_locking.h"
|
||||
#include "buckets.h"
|
||||
#include "debug.h"
|
||||
#include "errcode.h"
|
||||
#include "error.h"
|
||||
#include "extents.h"
|
||||
#include "keylist.h"
|
||||
#include "snapshot.h"
|
||||
#include "trace.h"
|
||||
|
||||
#include <linux/string_helpers.h>
|
||||
|
||||
static inline int btree_insert_entry_cmp(const struct btree_insert_entry *l,
|
||||
const struct btree_insert_entry *r)
|
||||
{
|
||||
return cmp_int(l->sort_order, r->sort_order) ?:
|
||||
cmp_int(l->cached, r->cached) ?:
|
||||
-cmp_int(l->level, r->level) ?:
|
||||
bpos_cmp(l->k->k.p, r->k->k.p);
|
||||
}
|
||||
|
||||
static int __must_check
|
||||
bch2_trans_update_by_path(struct btree_trans *, btree_path_idx_t,
|
||||
struct bkey_i *, enum btree_iter_update_trigger_flags,
|
||||
unsigned long ip);
|
||||
|
||||
static noinline int extent_front_merge(struct btree_trans *trans,
|
||||
struct btree_iter *iter,
|
||||
struct bkey_s_c k,
|
||||
struct bkey_i **insert,
|
||||
enum btree_iter_update_trigger_flags flags)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct bkey_i *update;
|
||||
int ret;
|
||||
|
||||
if (unlikely(trans->journal_replay_not_finished))
|
||||
return 0;
|
||||
|
||||
update = bch2_bkey_make_mut_noupdate(trans, k);
|
||||
ret = PTR_ERR_OR_ZERO(update);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
if (!bch2_bkey_merge(c, bkey_i_to_s(update), bkey_i_to_s_c(*insert)))
|
||||
return 0;
|
||||
|
||||
ret = bch2_key_has_snapshot_overwrites(trans, iter->btree_id, k.k->p) ?:
|
||||
bch2_key_has_snapshot_overwrites(trans, iter->btree_id, (*insert)->k.p);
|
||||
if (ret < 0)
|
||||
return ret;
|
||||
if (ret)
|
||||
return 0;
|
||||
|
||||
ret = bch2_btree_delete_at(trans, iter, flags);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
*insert = update;
|
||||
return 0;
|
||||
}
|
||||
|
||||
static noinline int extent_back_merge(struct btree_trans *trans,
|
||||
struct btree_iter *iter,
|
||||
struct bkey_i *insert,
|
||||
struct bkey_s_c k)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
int ret;
|
||||
|
||||
if (unlikely(trans->journal_replay_not_finished))
|
||||
return 0;
|
||||
|
||||
ret = bch2_key_has_snapshot_overwrites(trans, iter->btree_id, insert->k.p) ?:
|
||||
bch2_key_has_snapshot_overwrites(trans, iter->btree_id, k.k->p);
|
||||
if (ret < 0)
|
||||
return ret;
|
||||
if (ret)
|
||||
return 0;
|
||||
|
||||
bch2_bkey_merge(c, bkey_i_to_s(insert), k);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
|
||||
* When deleting, check if we need to emit a whiteout (because we're overwriting
|
||||
* something in an ancestor snapshot)
|
||||
*/
|
||||
static int need_whiteout_for_snapshot(struct btree_trans *trans,
|
||||
enum btree_id btree_id, struct bpos pos)
|
||||
{
|
||||
struct btree_iter iter;
|
||||
struct bkey_s_c k;
|
||||
u32 snapshot = pos.snapshot;
|
||||
int ret;
|
||||
|
||||
if (!bch2_snapshot_parent(trans->c, pos.snapshot))
|
||||
return 0;
|
||||
|
||||
pos.snapshot++;
|
||||
|
||||
for_each_btree_key_norestart(trans, iter, btree_id, pos,
|
||||
BTREE_ITER_all_snapshots|
|
||||
BTREE_ITER_nopreserve, k, ret) {
|
||||
if (!bkey_eq(k.k->p, pos))
|
||||
break;
|
||||
|
||||
if (bch2_snapshot_is_ancestor(trans->c, snapshot,
|
||||
k.k->p.snapshot)) {
|
||||
ret = !bkey_whiteout(k.k);
|
||||
break;
|
||||
}
|
||||
}
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
int __bch2_insert_snapshot_whiteouts(struct btree_trans *trans,
|
||||
enum btree_id btree, struct bpos pos,
|
||||
snapshot_id_list *s)
|
||||
{
|
||||
int ret = 0;
|
||||
|
||||
darray_for_each(*s, id) {
|
||||
pos.snapshot = *id;
|
||||
|
||||
struct btree_iter iter;
|
||||
struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, btree, pos,
|
||||
BTREE_ITER_not_extents|
|
||||
BTREE_ITER_intent);
|
||||
ret = bkey_err(k);
|
||||
if (ret)
|
||||
break;
|
||||
|
||||
if (k.k->type == KEY_TYPE_deleted) {
|
||||
struct bkey_i *update = bch2_trans_kmalloc(trans, sizeof(struct bkey_i));
|
||||
ret = PTR_ERR_OR_ZERO(update);
|
||||
if (ret) {
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
break;
|
||||
}
|
||||
|
||||
bkey_init(&update->k);
|
||||
update->k.p = pos;
|
||||
update->k.type = KEY_TYPE_whiteout;
|
||||
|
||||
ret = bch2_trans_update(trans, &iter, update,
|
||||
BTREE_UPDATE_internal_snapshot_node);
|
||||
}
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
|
||||
if (ret)
|
||||
break;
|
||||
}
|
||||
|
||||
darray_exit(s);
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_trans_update_extent_overwrite(struct btree_trans *trans,
|
||||
struct btree_iter *iter,
|
||||
enum btree_iter_update_trigger_flags flags,
|
||||
struct bkey_s_c old,
|
||||
struct bkey_s_c new)
|
||||
{
|
||||
enum btree_id btree_id = iter->btree_id;
|
||||
struct bkey_i *update;
|
||||
struct bpos new_start = bkey_start_pos(new.k);
|
||||
unsigned front_split = bkey_lt(bkey_start_pos(old.k), new_start);
|
||||
unsigned back_split = bkey_gt(old.k->p, new.k->p);
|
||||
unsigned middle_split = (front_split || back_split) &&
|
||||
old.k->p.snapshot != new.k->p.snapshot;
|
||||
unsigned nr_splits = front_split + back_split + middle_split;
|
||||
int ret = 0, compressed_sectors;
|
||||
|
||||
/*
|
||||
* If we're going to be splitting a compressed extent, note it
|
||||
* so that __bch2_trans_commit() can increase our disk
|
||||
* reservation:
|
||||
*/
|
||||
if (nr_splits > 1 &&
|
||||
(compressed_sectors = bch2_bkey_sectors_compressed(old)))
|
||||
trans->extra_disk_res += compressed_sectors * (nr_splits - 1);
|
||||
|
||||
if (front_split) {
|
||||
update = bch2_bkey_make_mut_noupdate(trans, old);
|
||||
if ((ret = PTR_ERR_OR_ZERO(update)))
|
||||
return ret;
|
||||
|
||||
bch2_cut_back(new_start, update);
|
||||
|
||||
ret = bch2_insert_snapshot_whiteouts(trans, btree_id,
|
||||
old.k->p, update->k.p) ?:
|
||||
bch2_btree_insert_nonextent(trans, btree_id, update,
|
||||
BTREE_UPDATE_internal_snapshot_node|flags);
|
||||
if (ret)
|
||||
return ret;
|
||||
}
|
||||
|
||||
/* If we're overwriting in a different snapshot - middle split: */
|
||||
if (middle_split) {
|
||||
update = bch2_bkey_make_mut_noupdate(trans, old);
|
||||
if ((ret = PTR_ERR_OR_ZERO(update)))
|
||||
return ret;
|
||||
|
||||
bch2_cut_front(new_start, update);
|
||||
bch2_cut_back(new.k->p, update);
|
||||
|
||||
ret = bch2_insert_snapshot_whiteouts(trans, btree_id,
|
||||
old.k->p, update->k.p) ?:
|
||||
bch2_btree_insert_nonextent(trans, btree_id, update,
|
||||
BTREE_UPDATE_internal_snapshot_node|flags);
|
||||
if (ret)
|
||||
return ret;
|
||||
}
|
||||
|
||||
if (bkey_le(old.k->p, new.k->p)) {
|
||||
update = bch2_trans_kmalloc(trans, sizeof(*update));
|
||||
if ((ret = PTR_ERR_OR_ZERO(update)))
|
||||
return ret;
|
||||
|
||||
bkey_init(&update->k);
|
||||
update->k.p = old.k->p;
|
||||
update->k.p.snapshot = new.k->p.snapshot;
|
||||
|
||||
if (new.k->p.snapshot != old.k->p.snapshot) {
|
||||
update->k.type = KEY_TYPE_whiteout;
|
||||
} else if (btree_type_has_snapshots(btree_id)) {
|
||||
ret = need_whiteout_for_snapshot(trans, btree_id, update->k.p);
|
||||
if (ret < 0)
|
||||
return ret;
|
||||
if (ret)
|
||||
update->k.type = KEY_TYPE_whiteout;
|
||||
}
|
||||
|
||||
ret = bch2_btree_insert_nonextent(trans, btree_id, update,
|
||||
BTREE_UPDATE_internal_snapshot_node|flags);
|
||||
if (ret)
|
||||
return ret;
|
||||
}
|
||||
|
||||
if (back_split) {
|
||||
update = bch2_bkey_make_mut_noupdate(trans, old);
|
||||
if ((ret = PTR_ERR_OR_ZERO(update)))
|
||||
return ret;
|
||||
|
||||
bch2_cut_front(new.k->p, update);
|
||||
|
||||
ret = bch2_trans_update_by_path(trans, iter->path, update,
|
||||
BTREE_UPDATE_internal_snapshot_node|
|
||||
flags, _RET_IP_);
|
||||
if (ret)
|
||||
return ret;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
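A worked example of the split accounting above, not taken from the removed source:

```c
/*
 * Illustrative case for bch2_trans_update_extent_overwrite(): the new extent
 * lands strictly inside the old one, in the same snapshot.
 *
 *   old:   |------------------ old ------------------|
 *   new:             |-------- new --------|
 *
 * front_split = 1 (old starts before new), back_split = 1 (old ends after
 * new), middle_split = 0 (same snapshot), so nr_splits = 2. If old is a
 * compressed extent, extra_disk_res grows by compressed_sectors * (2 - 1),
 * since splitting a compressed extent duplicates its on-disk data once.
 */
```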
||||
|
||||
static int bch2_trans_update_extent(struct btree_trans *trans,
|
||||
struct btree_iter *orig_iter,
|
||||
struct bkey_i *insert,
|
||||
enum btree_iter_update_trigger_flags flags)
|
||||
{
|
||||
struct btree_iter iter;
|
||||
struct bkey_s_c k;
|
||||
enum btree_id btree_id = orig_iter->btree_id;
|
||||
int ret = 0;
|
||||
|
||||
bch2_trans_iter_init(trans, &iter, btree_id, bkey_start_pos(&insert->k),
|
||||
BTREE_ITER_intent|
|
||||
BTREE_ITER_with_updates|
|
||||
BTREE_ITER_not_extents);
|
||||
k = bch2_btree_iter_peek_max(trans, &iter, POS(insert->k.p.inode, U64_MAX));
|
||||
if ((ret = bkey_err(k)))
|
||||
goto err;
|
||||
if (!k.k)
|
||||
goto out;
|
||||
|
||||
if (bkey_eq(k.k->p, bkey_start_pos(&insert->k))) {
|
||||
if (bch2_bkey_maybe_mergable(k.k, &insert->k)) {
|
||||
ret = extent_front_merge(trans, &iter, k, &insert, flags);
|
||||
if (ret)
|
||||
goto err;
|
||||
}
|
||||
|
||||
goto next;
|
||||
}
|
||||
|
||||
while (bkey_gt(insert->k.p, bkey_start_pos(k.k))) {
|
||||
bool done = bkey_lt(insert->k.p, k.k->p);
|
||||
|
||||
ret = bch2_trans_update_extent_overwrite(trans, &iter, flags, k, bkey_i_to_s_c(insert));
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
if (done)
|
||||
goto out;
|
||||
next:
|
||||
bch2_btree_iter_advance(trans, &iter);
|
||||
k = bch2_btree_iter_peek_max(trans, &iter, POS(insert->k.p.inode, U64_MAX));
|
||||
if ((ret = bkey_err(k)))
|
||||
goto err;
|
||||
if (!k.k)
|
||||
goto out;
|
||||
}
|
||||
|
||||
if (bch2_bkey_maybe_mergable(&insert->k, k.k)) {
|
||||
ret = extent_back_merge(trans, &iter, insert, k);
|
||||
if (ret)
|
||||
goto err;
|
||||
}
|
||||
out:
|
||||
if (!bkey_deleted(&insert->k))
|
||||
ret = bch2_btree_insert_nonextent(trans, btree_id, insert, flags);
|
||||
err:
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static noinline int flush_new_cached_update(struct btree_trans *trans,
|
||||
struct btree_insert_entry *i,
|
||||
enum btree_iter_update_trigger_flags flags,
|
||||
unsigned long ip)
|
||||
{
|
||||
struct bkey k;
|
||||
int ret;
|
||||
|
||||
btree_path_idx_t path_idx =
|
||||
bch2_path_get(trans, i->btree_id, i->old_k.p, 1, 0,
|
||||
BTREE_ITER_intent, _THIS_IP_);
|
||||
ret = bch2_btree_path_traverse(trans, path_idx, 0);
|
||||
if (ret)
|
||||
goto out;
|
||||
|
||||
struct btree_path *btree_path = trans->paths + path_idx;
|
||||
|
||||
/*
|
||||
* The old key in the insert entry might actually refer to an existing
|
||||
* key in the btree that has been deleted from cache and not yet
|
||||
* flushed. Check for this and skip the flush so we don't run triggers
|
||||
* against a stale key.
|
||||
*/
|
||||
bch2_btree_path_peek_slot_exact(btree_path, &k);
|
||||
if (!bkey_deleted(&k))
|
||||
goto out;
|
||||
|
||||
i->key_cache_already_flushed = true;
|
||||
i->flags |= BTREE_TRIGGER_norun;
|
||||
|
||||
btree_path_set_should_be_locked(trans, btree_path);
|
||||
ret = bch2_trans_update_by_path(trans, path_idx, i->k, flags, ip);
|
||||
out:
|
||||
bch2_path_put(trans, path_idx, true);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int __must_check
|
||||
bch2_trans_update_by_path(struct btree_trans *trans, btree_path_idx_t path_idx,
|
||||
struct bkey_i *k, enum btree_iter_update_trigger_flags flags,
|
||||
unsigned long ip)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct btree_insert_entry *i, n;
|
||||
int cmp;
|
||||
|
||||
struct btree_path *path = trans->paths + path_idx;
|
||||
EBUG_ON(!path->should_be_locked);
|
||||
EBUG_ON(trans->nr_updates >= trans->nr_paths);
|
||||
EBUG_ON(!bpos_eq(k->k.p, path->pos));
|
||||
|
||||
n = (struct btree_insert_entry) {
|
||||
.flags = flags,
|
||||
.sort_order = btree_trigger_order(path->btree_id),
|
||||
.bkey_type = __btree_node_type(path->level, path->btree_id),
|
||||
.btree_id = path->btree_id,
|
||||
.level = path->level,
|
||||
.cached = path->cached,
|
||||
.path = path_idx,
|
||||
.k = k,
|
||||
.ip_allocated = ip,
|
||||
};
|
||||
|
||||
#ifdef CONFIG_BCACHEFS_DEBUG
|
||||
trans_for_each_update(trans, i)
|
||||
BUG_ON(i != trans->updates &&
|
||||
btree_insert_entry_cmp(i - 1, i) >= 0);
|
||||
#endif
|
||||
|
||||
/*
|
||||
* Pending updates are kept sorted: first, find position of new update,
|
||||
* then delete/trim any updates the new update overwrites:
|
||||
*/
|
||||
for (i = trans->updates; i < trans->updates + trans->nr_updates; i++) {
|
||||
cmp = btree_insert_entry_cmp(&n, i);
|
||||
if (cmp <= 0)
|
||||
break;
|
||||
}
|
||||
|
||||
bool overwrite = !cmp && i < trans->updates + trans->nr_updates;
|
||||
|
||||
if (overwrite) {
|
||||
EBUG_ON(i->insert_trigger_run || i->overwrite_trigger_run);
|
||||
|
||||
bch2_path_put(trans, i->path, true);
|
||||
i->flags = n.flags;
|
||||
i->cached = n.cached;
|
||||
i->k = n.k;
|
||||
i->path = n.path;
|
||||
i->ip_allocated = n.ip_allocated;
|
||||
} else {
|
||||
array_insert_item(trans->updates, trans->nr_updates,
|
||||
i - trans->updates, n);
|
||||
|
||||
i->old_v = bch2_btree_path_peek_slot_exact(path, &i->old_k).v;
|
||||
i->old_btree_u64s = !bkey_deleted(&i->old_k) ? i->old_k.u64s : 0;
|
||||
|
||||
if (unlikely(trans->journal_replay_not_finished)) {
|
||||
struct bkey_i *j_k =
|
||||
bch2_journal_keys_peek_slot(c, n.btree_id, n.level, k->k.p);
|
||||
|
||||
if (j_k) {
|
||||
i->old_k = j_k->k;
|
||||
i->old_v = &j_k->v;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
__btree_path_get(trans, trans->paths + i->path, true);
|
||||
|
||||
trace_update_by_path(trans, path, i, overwrite);
|
||||
|
||||
/*
|
||||
* If a key is present in the key cache, it must also exist in the
|
||||
* btree - this is necessary for cache coherency. When iterating over
|
||||
* a btree that's cached in the key cache, the btree iter code checks
|
||||
* the key cache - but the key has to exist in the btree for that to
|
||||
* work:
|
||||
*/
|
||||
if (path->cached && !i->old_btree_u64s)
|
||||
return flush_new_cached_update(trans, i, flags, ip);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static noinline int bch2_trans_update_get_key_cache(struct btree_trans *trans,
|
||||
struct btree_iter *iter,
|
||||
struct btree_path *path)
|
||||
{
|
||||
struct btree_path *key_cache_path = btree_iter_key_cache_path(trans, iter);
|
||||
|
||||
if (!key_cache_path ||
|
||||
!key_cache_path->should_be_locked ||
|
||||
!bpos_eq(key_cache_path->pos, iter->pos)) {
|
||||
struct bkey_cached *ck;
|
||||
int ret;
|
||||
|
||||
if (!iter->key_cache_path)
|
||||
iter->key_cache_path =
|
||||
bch2_path_get(trans, path->btree_id, path->pos, 1, 0,
|
||||
BTREE_ITER_intent|
|
||||
BTREE_ITER_cached, _THIS_IP_);
|
||||
|
||||
iter->key_cache_path =
|
||||
bch2_btree_path_set_pos(trans, iter->key_cache_path, path->pos,
|
||||
iter->flags & BTREE_ITER_intent,
|
||||
_THIS_IP_);
|
||||
|
||||
ret = bch2_btree_path_traverse(trans, iter->key_cache_path, BTREE_ITER_cached);
|
||||
if (unlikely(ret))
|
||||
return ret;
|
||||
|
||||
ck = (void *) trans->paths[iter->key_cache_path].l[0].b;
|
||||
|
||||
if (test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
|
||||
trace_and_count(trans->c, trans_restart_key_cache_raced, trans, _RET_IP_);
|
||||
return btree_trans_restart(trans, BCH_ERR_transaction_restart_key_cache_raced);
|
||||
}
|
||||
|
||||
btree_path_set_should_be_locked(trans, trans->paths + iter->key_cache_path);
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
int __must_check bch2_trans_update_ip(struct btree_trans *trans, struct btree_iter *iter,
|
||||
struct bkey_i *k, enum btree_iter_update_trigger_flags flags,
|
||||
unsigned long ip)
|
||||
{
|
||||
kmsan_check_memory(k, bkey_bytes(&k->k));
|
||||
|
||||
btree_path_idx_t path_idx = iter->update_path ?: iter->path;
|
||||
int ret;
|
||||
|
||||
if (iter->flags & BTREE_ITER_is_extents)
|
||||
return bch2_trans_update_extent(trans, iter, k, flags);
|
||||
|
||||
if (bkey_deleted(&k->k) &&
|
||||
!(flags & BTREE_UPDATE_key_cache_reclaim) &&
|
||||
(iter->flags & BTREE_ITER_filter_snapshots)) {
|
||||
ret = need_whiteout_for_snapshot(trans, iter->btree_id, k->k.p);
|
||||
if (unlikely(ret < 0))
|
||||
return ret;
|
||||
|
||||
if (ret)
|
||||
k->k.type = KEY_TYPE_whiteout;
|
||||
}
|
||||
|
||||
/*
|
||||
* Ensure that updates to cached btrees go to the key cache:
|
||||
*/
|
||||
struct btree_path *path = trans->paths + path_idx;
|
||||
if (!(flags & BTREE_UPDATE_key_cache_reclaim) &&
|
||||
!path->cached &&
|
||||
!path->level &&
|
||||
btree_id_cached(trans->c, path->btree_id)) {
|
||||
ret = bch2_trans_update_get_key_cache(trans, iter, path);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
path_idx = iter->key_cache_path;
|
||||
}
|
||||
|
||||
return bch2_trans_update_by_path(trans, path_idx, k, flags, ip);
|
||||
}
|
||||
|
||||
int bch2_btree_insert_clone_trans(struct btree_trans *trans,
|
||||
enum btree_id btree,
|
||||
struct bkey_i *k)
|
||||
{
|
||||
struct bkey_i *n = bch2_trans_kmalloc(trans, bkey_bytes(&k->k));
|
||||
int ret = PTR_ERR_OR_ZERO(n);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
bkey_copy(n, k);
|
||||
return bch2_btree_insert_trans(trans, btree, n, 0);
|
||||
}
|
||||
|
||||
void *__bch2_trans_subbuf_alloc(struct btree_trans *trans,
|
||||
struct btree_trans_subbuf *buf,
|
||||
unsigned u64s)
|
||||
{
|
||||
unsigned new_top = buf->u64s + u64s;
|
||||
unsigned new_size = buf->size;
|
||||
|
||||
BUG_ON(roundup_pow_of_two(new_top) > U16_MAX);
|
||||
|
||||
if (new_top > new_size)
|
||||
new_size = roundup_pow_of_two(new_top);
|
||||
|
||||
void *n = bch2_trans_kmalloc_nomemzero(trans, new_size * sizeof(u64));
|
||||
if (IS_ERR(n))
|
||||
return n;
|
||||
|
||||
unsigned offset = (u64 *) n - (u64 *) trans->mem;
|
||||
BUG_ON(offset > U16_MAX);
|
||||
|
||||
if (buf->u64s)
|
||||
memcpy(n,
|
||||
btree_trans_subbuf_base(trans, buf),
|
||||
buf->size * sizeof(u64));
|
||||
buf->base = (u64 *) n - (u64 *) trans->mem;
|
||||
buf->size = new_size;
|
||||
|
||||
void *p = btree_trans_subbuf_top(trans, buf);
|
||||
buf->u64s = new_top;
|
||||
return p;
|
||||
}
|
||||
|
||||
int bch2_bkey_get_empty_slot(struct btree_trans *trans, struct btree_iter *iter,
|
||||
enum btree_id btree, struct bpos end)
|
||||
{
|
||||
bch2_trans_iter_init(trans, iter, btree, end, BTREE_ITER_intent);
|
||||
struct bkey_s_c k = bch2_btree_iter_peek_prev(trans, iter);
|
||||
int ret = bkey_err(k);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
bch2_btree_iter_advance(trans, iter);
|
||||
k = bch2_btree_iter_peek_slot(trans, iter);
|
||||
ret = bkey_err(k);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
BUG_ON(k.k->type != KEY_TYPE_deleted);
|
||||
|
||||
if (bkey_gt(k.k->p, end)) {
|
||||
ret = bch_err_throw(trans->c, ENOSPC_btree_slot);
|
||||
goto err;
|
||||
}
|
||||
|
||||
return 0;
|
||||
err:
|
||||
bch2_trans_iter_exit(trans, iter);
|
||||
return ret;
|
||||
}
|
||||
|
||||
void bch2_trans_commit_hook(struct btree_trans *trans,
|
||||
struct btree_trans_commit_hook *h)
|
||||
{
|
||||
h->next = trans->hooks;
|
||||
trans->hooks = h;
|
||||
}
|
||||
|
||||
int bch2_btree_insert_nonextent(struct btree_trans *trans,
|
||||
enum btree_id btree, struct bkey_i *k,
|
||||
enum btree_iter_update_trigger_flags flags)
|
||||
{
|
||||
struct btree_iter iter;
|
||||
int ret;
|
||||
|
||||
bch2_trans_iter_init(trans, &iter, btree, k->k.p,
|
||||
BTREE_ITER_cached|
|
||||
BTREE_ITER_not_extents|
|
||||
BTREE_ITER_intent);
|
||||
ret = bch2_btree_iter_traverse(trans, &iter) ?:
|
||||
bch2_trans_update(trans, &iter, k, flags);
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_btree_insert_trans(struct btree_trans *trans, enum btree_id id,
|
||||
struct bkey_i *k, enum btree_iter_update_trigger_flags flags)
|
||||
{
|
||||
struct btree_iter iter;
|
||||
bch2_trans_iter_init(trans, &iter, id, bkey_start_pos(&k->k),
|
||||
BTREE_ITER_intent|flags);
|
||||
int ret = bch2_btree_iter_traverse(trans, &iter) ?:
|
||||
bch2_trans_update(trans, &iter, k, flags);
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
return ret;
|
||||
}
|
||||
|
||||
/**
|
||||
* bch2_btree_insert - insert a key into the btree specified by @id
|
||||
* @c: pointer to struct bch_fs
|
||||
* @id: btree to insert into
|
||||
* @k: key to insert
|
||||
* @disk_res: must be non-NULL whenever inserting or potentially
|
||||
* splitting data extents
|
||||
* @flags: transaction commit flags
|
||||
* @iter_flags: btree iter update trigger flags
|
||||
*
|
||||
* Returns: 0 on success, error code on failure
|
||||
*/
|
||||
int bch2_btree_insert(struct bch_fs *c, enum btree_id id, struct bkey_i *k,
|
||||
struct disk_reservation *disk_res, int flags,
|
||||
enum btree_iter_update_trigger_flags iter_flags)
|
||||
{
|
||||
return bch2_trans_commit_do(c, disk_res, NULL, flags,
|
||||
bch2_btree_insert_trans(trans, id, k, iter_flags));
|
||||
}
|
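A hedged usage sketch, not from the removed source and modelled loosely on the in-tree test code: inserting a single cookie key outside any existing transaction. `BTREE_ID_xattrs`, `bkey_cookie_init()` and `POS()` are assumed to behave as they do elsewhere in bcachefs.

```c
/*
 * Hypothetical usage sketch: insert one KEY_TYPE_cookie key with no disk
 * reservation (a cookie is not a data extent, so @disk_res may be NULL).
 */
static int example_insert_cookie(struct bch_fs *c, u64 inum)
{
	struct bkey_i_cookie ck;

	bkey_cookie_init(&ck.k_i);
	ck.k.p		= POS(inum, 0);
	ck.k.p.snapshot	= U32_MAX;

	return bch2_btree_insert(c, BTREE_ID_xattrs, &ck.k_i, NULL, 0, 0);
}
```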
||||
|
||||
int bch2_btree_delete_at(struct btree_trans *trans,
|
||||
struct btree_iter *iter, unsigned update_flags)
|
||||
{
|
||||
struct bkey_i *k = bch2_trans_kmalloc(trans, sizeof(*k));
|
||||
int ret = PTR_ERR_OR_ZERO(k);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
bkey_init(&k->k);
|
||||
k->k.p = iter->pos;
|
||||
return bch2_trans_update(trans, iter, k, update_flags);
|
||||
}
|
||||
|
||||
int bch2_btree_delete(struct btree_trans *trans,
|
||||
enum btree_id btree, struct bpos pos,
|
||||
unsigned update_flags)
|
||||
{
|
||||
struct btree_iter iter;
|
||||
int ret;
|
||||
|
||||
bch2_trans_iter_init(trans, &iter, btree, pos,
|
||||
BTREE_ITER_cached|
|
||||
BTREE_ITER_intent);
|
||||
ret = bch2_btree_iter_traverse(trans, &iter) ?:
|
||||
bch2_btree_delete_at(trans, &iter, update_flags);
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_btree_delete_range_trans(struct btree_trans *trans, enum btree_id id,
|
||||
struct bpos start, struct bpos end,
|
||||
unsigned update_flags,
|
||||
u64 *journal_seq)
|
||||
{
|
||||
u32 restart_count = trans->restart_count;
|
||||
struct btree_iter iter;
|
||||
struct bkey_s_c k;
|
||||
int ret = 0;
|
||||
|
||||
bch2_trans_iter_init(trans, &iter, id, start, BTREE_ITER_intent);
|
||||
while ((k = bch2_btree_iter_peek_max(trans, &iter, end)).k) {
|
||||
struct disk_reservation disk_res =
|
||||
bch2_disk_reservation_init(trans->c, 0);
|
||||
struct bkey_i delete;
|
||||
|
||||
ret = bkey_err(k);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
bkey_init(&delete.k);
|
||||
|
||||
/*
|
||||
* This could probably be more efficient for extents:
|
||||
*/
|
||||
|
||||
/*
|
||||
* For extents, iter.pos won't necessarily be the same as
|
||||
* bkey_start_pos(k.k) (for non extents they always will be the
|
||||
* same). It's important that we delete starting from iter.pos
|
||||
* because the range we want to delete could start in the middle
|
||||
* of k.
|
||||
*
|
||||
* (bch2_btree_iter_peek() does guarantee that iter.pos >=
|
||||
* bkey_start_pos(k.k)).
|
||||
*/
|
||||
delete.k.p = iter.pos;
|
||||
|
||||
if (iter.flags & BTREE_ITER_is_extents)
|
||||
bch2_key_resize(&delete.k,
|
||||
bpos_min(end, k.k->p).offset -
|
||||
iter.pos.offset);
|
||||
|
||||
ret = bch2_trans_update(trans, &iter, &delete, update_flags) ?:
|
||||
bch2_trans_commit(trans, &disk_res, journal_seq,
|
||||
BCH_TRANS_COMMIT_no_enospc);
|
||||
bch2_disk_reservation_put(trans->c, &disk_res);
|
||||
err:
|
||||
/*
|
||||
* the bch2_trans_begin() call is in a weird place because we
|
||||
* need to call it after every transaction commit, to avoid path
|
||||
* overflow, but don't want to call it if the delete operation
|
||||
* is a no-op and we have no work to do:
|
||||
*/
|
||||
bch2_trans_begin(trans);
|
||||
|
||||
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
|
||||
ret = 0;
|
||||
if (ret)
|
||||
break;
|
||||
}
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
|
||||
return ret ?: trans_was_restarted(trans, restart_count);
|
||||
}
|
||||
|
||||
/*
|
||||
* bch2_btree_delete_range - delete everything within a given range
|
||||
*
|
||||
* Range is a half open interval - [start, end)
|
||||
*/
|
||||
int bch2_btree_delete_range(struct bch_fs *c, enum btree_id id,
|
||||
struct bpos start, struct bpos end,
|
||||
unsigned update_flags,
|
||||
u64 *journal_seq)
|
||||
{
|
||||
int ret = bch2_trans_run(c,
|
||||
bch2_btree_delete_range_trans(trans, id, start, end,
|
||||
update_flags, journal_seq));
|
||||
if (ret == -BCH_ERR_transaction_restart_nested)
|
||||
ret = 0;
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_btree_bit_mod_iter(struct btree_trans *trans, struct btree_iter *iter, bool set)
|
||||
{
|
||||
struct bkey_i *k = bch2_trans_kmalloc(trans, sizeof(*k));
|
||||
int ret = PTR_ERR_OR_ZERO(k);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
bkey_init(&k->k);
|
||||
k->k.type = set ? KEY_TYPE_set : KEY_TYPE_deleted;
|
||||
k->k.p = iter->pos;
|
||||
if (iter->flags & BTREE_ITER_is_extents)
|
||||
bch2_key_resize(&k->k, 1);
|
||||
|
||||
return bch2_trans_update(trans, iter, k, 0);
|
||||
}
|
||||
|
||||
int bch2_btree_bit_mod(struct btree_trans *trans, enum btree_id btree,
|
||||
struct bpos pos, bool set)
|
||||
{
|
||||
struct btree_iter iter;
|
||||
bch2_trans_iter_init(trans, &iter, btree, pos, BTREE_ITER_intent);
|
||||
|
||||
int ret = bch2_btree_iter_traverse(trans, &iter) ?:
|
||||
bch2_btree_bit_mod_iter(trans, &iter, set);
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_btree_bit_mod_buffered(struct btree_trans *trans, enum btree_id btree,
|
||||
struct bpos pos, bool set)
|
||||
{
|
||||
struct bkey_i k;
|
||||
|
||||
bkey_init(&k.k);
|
||||
k.k.type = set ? KEY_TYPE_set : KEY_TYPE_deleted;
|
||||
k.k.p = pos;
|
||||
|
||||
return bch2_trans_update_buffered(trans, btree, &k);
|
||||
}
|
||||
|
||||
static int __bch2_trans_log_str(struct btree_trans *trans, const char *str, unsigned len)
|
||||
{
|
||||
unsigned u64s = DIV_ROUND_UP(len, sizeof(u64));
|
||||
|
||||
struct jset_entry *e = bch2_trans_jset_entry_alloc(trans, jset_u64s(u64s));
|
||||
int ret = PTR_ERR_OR_ZERO(e);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
struct jset_entry_log *l = container_of(e, struct jset_entry_log, entry);
|
||||
journal_entry_init(e, BCH_JSET_ENTRY_log, 0, 1, u64s);
|
||||
memcpy_and_pad(l->d, u64s * sizeof(u64), str, len, 0);
|
||||
return 0;
|
||||
}
|
||||
|
||||
int bch2_trans_log_str(struct btree_trans *trans, const char *str)
|
||||
{
|
||||
return __bch2_trans_log_str(trans, str, strlen(str));
|
||||
}
|
||||
|
||||
int bch2_trans_log_msg(struct btree_trans *trans, struct printbuf *buf)
|
||||
{
|
||||
int ret = buf->allocation_failure ? -BCH_ERR_ENOMEM_trans_log_msg : 0;
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
return __bch2_trans_log_str(trans, buf->buf, buf->pos);
|
||||
}
|
||||
|
||||
int bch2_trans_log_bkey(struct btree_trans *trans, enum btree_id btree,
|
||||
unsigned level, struct bkey_i *k)
|
||||
{
|
||||
struct jset_entry *e = bch2_trans_jset_entry_alloc(trans, jset_u64s(k->k.u64s));
|
||||
int ret = PTR_ERR_OR_ZERO(e);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
journal_entry_init(e, BCH_JSET_ENTRY_log_bkey, btree, level, k->k.u64s);
|
||||
bkey_copy(e->start, k);
|
||||
return 0;
|
||||
}
|
||||
|
||||
__printf(3, 0)
|
||||
static int
|
||||
__bch2_fs_log_msg(struct bch_fs *c, unsigned commit_flags, const char *fmt,
|
||||
va_list args)
|
||||
{
|
||||
struct printbuf buf = PRINTBUF;
|
||||
prt_vprintf(&buf, fmt, args);
|
||||
|
||||
unsigned u64s = DIV_ROUND_UP(buf.pos, sizeof(u64));
|
||||
|
||||
int ret = buf.allocation_failure ? -BCH_ERR_ENOMEM_trans_log_msg : 0;
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
if (!test_bit(JOURNAL_running, &c->journal.flags)) {
|
||||
ret = darray_make_room(&c->journal.early_journal_entries, jset_u64s(u64s));
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
struct jset_entry_log *l = (void *) &darray_top(c->journal.early_journal_entries);
|
||||
journal_entry_init(&l->entry, BCH_JSET_ENTRY_log, 0, 1, u64s);
|
||||
memcpy_and_pad(l->d, u64s * sizeof(u64), buf.buf, buf.pos, 0);
|
||||
c->journal.early_journal_entries.nr += jset_u64s(u64s);
|
||||
} else {
|
||||
ret = bch2_trans_commit_do(c, NULL, NULL, commit_flags,
|
||||
bch2_trans_log_msg(trans, &buf));
|
||||
}
|
||||
err:
|
||||
printbuf_exit(&buf);
|
||||
return ret;
|
||||
}
|
||||
|
||||
__printf(2, 3)
|
||||
int bch2_fs_log_msg(struct bch_fs *c, const char *fmt, ...)
|
||||
{
|
||||
va_list args;
|
||||
int ret;
|
||||
|
||||
va_start(args, fmt);
|
||||
ret = __bch2_fs_log_msg(c, 0, fmt, args);
|
||||
va_end(args);
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* Use for logging messages during recovery to enable reserved space and avoid
|
||||
* blocking.
|
||||
*/
|
||||
__printf(2, 3)
|
||||
int bch2_journal_log_msg(struct bch_fs *c, const char *fmt, ...)
|
||||
{
|
||||
va_list args;
|
||||
int ret;
|
||||
|
||||
va_start(args, fmt);
|
||||
ret = __bch2_fs_log_msg(c, BCH_WATERMARK_reclaim, fmt, args);
|
||||
va_end(args);
|
||||
return ret;
|
||||
}
|
||||
@@ -1,429 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_UPDATE_H
#define _BCACHEFS_BTREE_UPDATE_H

#include "btree_iter.h"
#include "journal.h"
#include "snapshot.h"

struct bch_fs;
struct btree;

void bch2_btree_node_prep_for_write(struct btree_trans *,
				    struct btree_path *, struct btree *);
bool bch2_btree_bset_insert_key(struct btree_trans *, struct btree_path *,
				struct btree *, struct btree_node_iter *,
				struct bkey_i *);

int bch2_btree_node_flush0(struct journal *, struct journal_entry_pin *, u64);
int bch2_btree_node_flush1(struct journal *, struct journal_entry_pin *, u64);
void bch2_btree_add_journal_pin(struct bch_fs *, struct btree *, u64);

void bch2_btree_insert_key_leaf(struct btree_trans *, struct btree_path *,
				struct bkey_i *, u64);

#define BCH_TRANS_COMMIT_FLAGS()						\
	x(no_enospc,	"don't check for enospc")				\
	x(no_check_rw,	"don't attempt to take a ref on c->writes")		\
	x(no_journal_res, "don't take a journal reservation, instead "		\
			"pin journal entry referred to by trans->journal_res.seq") \
	x(journal_reclaim, "operation required for journal reclaim; may return error" \
			"instead of deadlocking if BCH_WATERMARK_reclaim not specified")\
	x(skip_accounting_apply, "we're in journal replay - accounting updates have already been applied")

enum __bch_trans_commit_flags {
	/* First bits for bch_watermark: */
	__BCH_TRANS_COMMIT_FLAGS_START = BCH_WATERMARK_BITS,
#define x(n, ...)	__BCH_TRANS_COMMIT_##n,
	BCH_TRANS_COMMIT_FLAGS()
#undef x
};

enum bch_trans_commit_flags {
#define x(n, ...)	BCH_TRANS_COMMIT_##n = BIT(__BCH_TRANS_COMMIT_##n),
	BCH_TRANS_COMMIT_FLAGS()
#undef x
};

void bch2_trans_commit_flags_to_text(struct printbuf *, enum bch_trans_commit_flags);

int bch2_btree_delete_at(struct btree_trans *, struct btree_iter *, unsigned);
int bch2_btree_delete(struct btree_trans *, enum btree_id, struct bpos, unsigned);

int bch2_btree_insert_nonextent(struct btree_trans *, enum btree_id,
				struct bkey_i *, enum btree_iter_update_trigger_flags);

int bch2_btree_insert_trans(struct btree_trans *, enum btree_id, struct bkey_i *,
			    enum btree_iter_update_trigger_flags);
int bch2_btree_insert(struct bch_fs *, enum btree_id, struct bkey_i *,
		      struct disk_reservation *, int flags,
		      enum btree_iter_update_trigger_flags iter_flags);

int bch2_btree_delete_range_trans(struct btree_trans *, enum btree_id,
				  struct bpos, struct bpos, unsigned, u64 *);
int bch2_btree_delete_range(struct bch_fs *, enum btree_id,
			    struct bpos, struct bpos, unsigned, u64 *);

int bch2_btree_bit_mod_iter(struct btree_trans *, struct btree_iter *, bool);
int bch2_btree_bit_mod(struct btree_trans *, enum btree_id, struct bpos, bool);
int bch2_btree_bit_mod_buffered(struct btree_trans *, enum btree_id, struct bpos, bool);

static inline int bch2_btree_delete_at_buffered(struct btree_trans *trans,
						enum btree_id btree, struct bpos pos)
{
	return bch2_btree_bit_mod_buffered(trans, btree, pos, false);
}

int __bch2_insert_snapshot_whiteouts(struct btree_trans *, enum btree_id,
				     struct bpos, snapshot_id_list *);

/*
 * For use when splitting extents in existing snapshots:
 *
 * If @old_pos is an interior snapshot node, iterate over descendent snapshot
 * nodes: for every descendent snapshot in which @old_pos is overwritten and
 * not visible, emit a whiteout at @new_pos.
 */
static inline int bch2_insert_snapshot_whiteouts(struct btree_trans *trans,
						 enum btree_id btree,
						 struct bpos old_pos,
						 struct bpos new_pos)
{
	BUG_ON(old_pos.snapshot != new_pos.snapshot);

	if (!btree_type_has_snapshots(btree) ||
	    bkey_eq(old_pos, new_pos))
		return 0;

	snapshot_id_list s;
	int ret = bch2_get_snapshot_overwrites(trans, btree, old_pos, &s);
	if (ret)
		return ret;

	return s.nr
		? __bch2_insert_snapshot_whiteouts(trans, btree, new_pos, &s)
		: 0;
}

int bch2_trans_update_extent_overwrite(struct btree_trans *, struct btree_iter *,
				       enum btree_iter_update_trigger_flags,
				       struct bkey_s_c, struct bkey_s_c);

int bch2_bkey_get_empty_slot(struct btree_trans *, struct btree_iter *,
			     enum btree_id, struct bpos);

int __must_check bch2_trans_update_ip(struct btree_trans *, struct btree_iter *,
				      struct bkey_i *, enum btree_iter_update_trigger_flags,
				      unsigned long);

static inline int __must_check
bch2_trans_update(struct btree_trans *trans, struct btree_iter *iter,
		  struct bkey_i *k, enum btree_iter_update_trigger_flags flags)
{
	return bch2_trans_update_ip(trans, iter, k, flags, _THIS_IP_);
}

static inline void *btree_trans_subbuf_base(struct btree_trans *trans,
					    struct btree_trans_subbuf *buf)
{
	return (u64 *) trans->mem + buf->base;
}

static inline void *btree_trans_subbuf_top(struct btree_trans *trans,
					   struct btree_trans_subbuf *buf)
{
	return (u64 *) trans->mem + buf->base + buf->u64s;
}

void *__bch2_trans_subbuf_alloc(struct btree_trans *,
				struct btree_trans_subbuf *,
				unsigned);

static inline void *
bch2_trans_subbuf_alloc(struct btree_trans *trans,
			struct btree_trans_subbuf *buf,
			unsigned u64s)
{
	if (buf->u64s + u64s > buf->size)
		return __bch2_trans_subbuf_alloc(trans, buf, u64s);

	void *p = btree_trans_subbuf_top(trans, buf);
	buf->u64s += u64s;
	return p;
}

static inline struct jset_entry *btree_trans_journal_entries_start(struct btree_trans *trans)
{
	return btree_trans_subbuf_base(trans, &trans->journal_entries);
}

static inline struct jset_entry *btree_trans_journal_entries_top(struct btree_trans *trans)
{
	return btree_trans_subbuf_top(trans, &trans->journal_entries);
}

static inline struct jset_entry *
bch2_trans_jset_entry_alloc(struct btree_trans *trans, unsigned u64s)
{
	return bch2_trans_subbuf_alloc(trans, &trans->journal_entries, u64s);
}

int bch2_btree_insert_clone_trans(struct btree_trans *, enum btree_id, struct bkey_i *);

int bch2_btree_write_buffer_insert_err(struct bch_fs *, enum btree_id, struct bkey_i *);

static inline int __must_check bch2_trans_update_buffered(struct btree_trans *trans,
							  enum btree_id btree,
							  struct bkey_i *k)
{
	kmsan_check_memory(k, bkey_bytes(&k->k));

	EBUG_ON(k->k.u64s > BTREE_WRITE_BUFERED_U64s_MAX);

	if (unlikely(!btree_type_uses_write_buffer(btree))) {
		int ret = bch2_btree_write_buffer_insert_err(trans->c, btree, k);
		dump_stack();
		return ret;
	}
	/*
	 * Most updates skip the btree write buffer until journal replay is
	 * finished because synchronization with journal replay relies on having
	 * a btree node locked - if we're overwriting a key in the journal that
	 * journal replay hasn't yet replayed, we have to mark it as
	 * overwritten.
	 *
	 * But accounting updates don't overwrite, they're deltas, and they have
	 * to be flushed to the btree strictly in order for journal replay to be
	 * able to tell which updates need to be applied:
	 */
	if (k->k.type != KEY_TYPE_accounting &&
	    unlikely(trans->journal_replay_not_finished))
		return bch2_btree_insert_clone_trans(trans, btree, k);

	struct jset_entry *e = bch2_trans_jset_entry_alloc(trans, jset_u64s(k->k.u64s));
	int ret = PTR_ERR_OR_ZERO(e);
	if (ret)
		return ret;

	journal_entry_init(e, BCH_JSET_ENTRY_write_buffer_keys, btree, 0, k->k.u64s);
	bkey_copy(e->start, k);
	return 0;
}

void bch2_trans_commit_hook(struct btree_trans *,
			    struct btree_trans_commit_hook *);
int __bch2_trans_commit(struct btree_trans *, unsigned);

int bch2_trans_log_str(struct btree_trans *, const char *);
int bch2_trans_log_msg(struct btree_trans *, struct printbuf *);
int bch2_trans_log_bkey(struct btree_trans *, enum btree_id, unsigned, struct bkey_i *);

__printf(2, 3) int bch2_fs_log_msg(struct bch_fs *, const char *, ...);
__printf(2, 3) int bch2_journal_log_msg(struct bch_fs *, const char *, ...);

/**
 * bch2_trans_commit - insert keys at given iterator positions
 *
 * This is the main entry point for btree updates.
 *
 * Return values:
 * -EROFS: filesystem read only
 * -EIO: journal or btree node IO error
 */
static inline int bch2_trans_commit(struct btree_trans *trans,
				    struct disk_reservation *disk_res,
				    u64 *journal_seq,
				    unsigned flags)
{
	trans->disk_res = disk_res;
	trans->journal_seq = journal_seq;

	return __bch2_trans_commit(trans, flags);
}

#define commit_do(_trans, _disk_res, _journal_seq, _flags, _do)	\
	lockrestart_do(_trans, _do ?: bch2_trans_commit(_trans, (_disk_res),\
					(_journal_seq), (_flags)))

#define nested_commit_do(_trans, _disk_res, _journal_seq, _flags, _do)	\
	nested_lockrestart_do(_trans, _do ?: bch2_trans_commit(_trans, (_disk_res),\
					(_journal_seq), (_flags)))

#define bch2_trans_commit_do(_c, _disk_res, _journal_seq, _flags, _do)	\
	bch2_trans_run(_c, commit_do(trans, _disk_res, _journal_seq, _flags, _do))

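commit_do() wraps an update expression and the commit in lockrestart_do(), so a transaction restart re-runs both; bch2_trans_commit_do() additionally creates and destroys the transaction via bch2_trans_run(). A hedged sketch of the idiom (the wrapper function below is illustrative; bch2_btree_delete() is declared earlier in this header):

	/* Sketch: delete a single key, retrying across transaction restarts. */
	static int delete_one_key(struct bch_fs *c, enum btree_id btree, struct bpos pos)
	{
		return bch2_trans_commit_do(c, NULL, NULL, 0,
					    bch2_btree_delete(trans, btree, pos, 0));
	}

The trans identifier inside the _do expression is provided by bch2_trans_run(), which is what lets callers reference the transaction without declaring it.
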
#define trans_for_each_update(_trans, _i)				\
	for (struct btree_insert_entry *_i = (_trans)->updates;	\
	     (_i) < (_trans)->updates + (_trans)->nr_updates;		\
	     (_i)++)

static inline void bch2_trans_reset_updates(struct btree_trans *trans)
{
	trans_for_each_update(trans, i)
		bch2_path_put(trans, i->path, true);

	trans->nr_updates		= 0;
	trans->journal_entries.u64s	= 0;
	trans->journal_entries.size	= 0;
	trans->accounting.u64s		= 0;
	trans->accounting.size		= 0;
	trans->hooks			= NULL;
	trans->extra_disk_res		= 0;
}

static __always_inline struct bkey_i *__bch2_bkey_make_mut_noupdate(struct btree_trans *trans, struct bkey_s_c k,
						  unsigned type, unsigned min_bytes)
{
	unsigned bytes = max_t(unsigned, min_bytes, bkey_bytes(k.k));
	struct bkey_i *mut;

	if (type && k.k->type != type)
		return ERR_PTR(-ENOENT);

	/* extra padding for varint_decode_fast... */
	mut = bch2_trans_kmalloc_nomemzero(trans, bytes + 8);
	if (!IS_ERR(mut)) {
		bkey_reassemble(mut, k);

		if (unlikely(bytes > bkey_bytes(k.k))) {
			memset((void *) mut + bkey_bytes(k.k), 0,
			       bytes - bkey_bytes(k.k));
			mut->k.u64s = DIV_ROUND_UP(bytes, sizeof(u64));
		}
	}
	return mut;
}

static __always_inline struct bkey_i *bch2_bkey_make_mut_noupdate(struct btree_trans *trans, struct bkey_s_c k)
{
	return __bch2_bkey_make_mut_noupdate(trans, k, 0, 0);
}

#define bch2_bkey_make_mut_noupdate_typed(_trans, _k, _type)		\
	bkey_i_to_##_type(__bch2_bkey_make_mut_noupdate(_trans, _k,	\
				KEY_TYPE_##_type, sizeof(struct bkey_i_##_type)))

static inline struct bkey_i *__bch2_bkey_make_mut(struct btree_trans *trans, struct btree_iter *iter,
					struct bkey_s_c *k,
					enum btree_iter_update_trigger_flags flags,
					unsigned type, unsigned min_bytes)
{
	struct bkey_i *mut = __bch2_bkey_make_mut_noupdate(trans, *k, type, min_bytes);
	int ret;

	if (IS_ERR(mut))
		return mut;

	ret = bch2_trans_update(trans, iter, mut, flags);
	if (ret)
		return ERR_PTR(ret);

	*k = bkey_i_to_s_c(mut);
	return mut;
}

static inline struct bkey_i *bch2_bkey_make_mut(struct btree_trans *trans,
					struct btree_iter *iter, struct bkey_s_c *k,
					enum btree_iter_update_trigger_flags flags)
{
	return __bch2_bkey_make_mut(trans, iter, k, flags, 0, 0);
}

#define bch2_bkey_make_mut_typed(_trans, _iter, _k, _flags, _type)	\
	bkey_i_to_##_type(__bch2_bkey_make_mut(_trans, _iter, _k, _flags,\
				KEY_TYPE_##_type, sizeof(struct bkey_i_##_type)))

static inline struct bkey_i *__bch2_bkey_get_mut_noupdate(struct btree_trans *trans,
					struct btree_iter *iter,
					unsigned btree_id, struct bpos pos,
					enum btree_iter_update_trigger_flags flags,
					unsigned type, unsigned min_bytes)
{
	struct bkey_s_c k = __bch2_bkey_get_iter(trans, iter,
				btree_id, pos, flags|BTREE_ITER_intent, type);
	struct bkey_i *ret = IS_ERR(k.k)
		? ERR_CAST(k.k)
		: __bch2_bkey_make_mut_noupdate(trans, k, 0, min_bytes);
	if (IS_ERR(ret))
		bch2_trans_iter_exit(trans, iter);
	return ret;
}

static inline struct bkey_i *bch2_bkey_get_mut_noupdate(struct btree_trans *trans,
					struct btree_iter *iter,
					unsigned btree_id, struct bpos pos,
					enum btree_iter_update_trigger_flags flags)
{
	return __bch2_bkey_get_mut_noupdate(trans, iter, btree_id, pos, flags, 0, 0);
}

static inline struct bkey_i *__bch2_bkey_get_mut(struct btree_trans *trans,
					struct btree_iter *iter,
					unsigned btree_id, struct bpos pos,
					enum btree_iter_update_trigger_flags flags,
					unsigned type, unsigned min_bytes)
{
	struct bkey_i *mut = __bch2_bkey_get_mut_noupdate(trans, iter,
			btree_id, pos, flags|BTREE_ITER_intent, type, min_bytes);
	int ret;

	if (IS_ERR(mut))
		return mut;

	ret = bch2_trans_update(trans, iter, mut, flags);
	if (ret) {
		bch2_trans_iter_exit(trans, iter);
		return ERR_PTR(ret);
	}

	return mut;
}

static inline struct bkey_i *bch2_bkey_get_mut_minsize(struct btree_trans *trans,
					struct btree_iter *iter,
					unsigned btree_id, struct bpos pos,
					enum btree_iter_update_trigger_flags flags,
					unsigned min_bytes)
{
	return __bch2_bkey_get_mut(trans, iter, btree_id, pos, flags, 0, min_bytes);
}

static inline struct bkey_i *bch2_bkey_get_mut(struct btree_trans *trans,
					struct btree_iter *iter,
					unsigned btree_id, struct bpos pos,
					enum btree_iter_update_trigger_flags flags)
{
	return __bch2_bkey_get_mut(trans, iter, btree_id, pos, flags, 0, 0);
}

#define bch2_bkey_get_mut_typed(_trans, _iter, _btree_id, _pos, _flags, _type)\
	bkey_i_to_##_type(__bch2_bkey_get_mut(_trans, _iter,		\
			_btree_id, _pos, _flags,			\
			KEY_TYPE_##_type, sizeof(struct bkey_i_##_type)))

static inline struct bkey_i *__bch2_bkey_alloc(struct btree_trans *trans, struct btree_iter *iter,
					enum btree_iter_update_trigger_flags flags,
					unsigned type, unsigned val_size)
{
	struct bkey_i *k = bch2_trans_kmalloc(trans, sizeof(*k) + val_size);
	int ret;

	if (IS_ERR(k))
		return k;

	bkey_init(&k->k);
	k->k.p = iter->pos;
	k->k.type = type;
	set_bkey_val_bytes(&k->k, val_size);

	ret = bch2_trans_update(trans, iter, k, flags);
	if (unlikely(ret))
		return ERR_PTR(ret);
	return k;
}

#define bch2_bkey_alloc(_trans, _iter, _flags, _type)			\
	bkey_i_to_##_type(__bch2_bkey_alloc(_trans, _iter, _flags,	\
				KEY_TYPE_##_type, sizeof(struct bch_##_type)))

#endif /* _BCACHEFS_BTREE_UPDATE_H */
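
Taken together, the typed helpers at the end of this header implement the usual read-modify-write pattern: position an intent iterator, take a mutable copy of the key (which also queues it as an update), edit it, and let the caller commit. A sketch under stated assumptions: the btree, key type and field below (BTREE_ID_alloc, alloc_v4, gen) are only illustrative of the macro's shape, not taken from this file.

	/* Sketch: bump the generation number of one alloc key. */
	static int bump_alloc_gen(struct btree_trans *trans, struct bpos pos)
	{
		struct btree_iter iter;
		struct bkey_i_alloc_v4 *a =
			bch2_bkey_get_mut_typed(trans, &iter, BTREE_ID_alloc, pos,
						BTREE_ITER_intent, alloc_v4);
		int ret = PTR_ERR_OR_ZERO(a);
		if (ret)
			return ret;

		a->v.gen++;	/* the update was already queued by get_mut */
		bch2_trans_iter_exit(trans, &iter);
		return 0;
	}

A caller would typically run this inside commit_do() so the edit and the commit retry together on transaction restart.
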
File diff suppressed because it is too large
@@ -1,364 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_BTREE_UPDATE_INTERIOR_H
|
||||
#define _BCACHEFS_BTREE_UPDATE_INTERIOR_H
|
||||
|
||||
#include "btree_cache.h"
|
||||
#include "btree_locking.h"
|
||||
#include "btree_update.h"
|
||||
|
||||
#define BTREE_UPDATE_NODES_MAX ((BTREE_MAX_DEPTH - 2) * 2 + GC_MERGE_NODES)
|
||||
|
||||
#define BTREE_UPDATE_JOURNAL_RES (BTREE_UPDATE_NODES_MAX * (BKEY_BTREE_PTR_U64s_MAX + 1))
|
||||
|
||||
int bch2_btree_node_check_topology(struct btree_trans *, struct btree *);
|
||||
|
||||
#define BTREE_UPDATE_MODES() \
|
||||
x(none) \
|
||||
x(node) \
|
||||
x(root) \
|
||||
x(update)
|
||||
|
||||
enum btree_update_mode {
|
||||
#define x(n) BTREE_UPDATE_##n,
|
||||
BTREE_UPDATE_MODES()
|
||||
#undef x
|
||||
};
|
||||
|
||||
/*
 * Tracks an in progress split/rewrite of a btree node and the update to the
 * parent node:
 *
 * When we split/rewrite a node, we do all the updates in memory without
 * waiting for any writes to complete - we allocate the new node(s) and update
 * the parent node, possibly recursively up to the root.
 *
 * The end result is that we have one or more new nodes being written -
 * possibly several, if there were multiple splits - and then a write (updating
 * an interior node) which will make all these new nodes visible.
 *
 * Additionally, as we split/rewrite nodes we free the old nodes - but the old
 * nodes can't be freed (their space on disk can't be reclaimed) until the
 * update to the interior node that makes the new node visible completes -
 * until then, the old nodes are still reachable on disk.
 */
|
||||
struct btree_update {
|
||||
struct closure cl;
|
||||
struct bch_fs *c;
|
||||
u64 start_time;
|
||||
unsigned long ip_started;
|
||||
|
||||
struct list_head list;
|
||||
struct list_head unwritten_list;
|
||||
|
||||
enum btree_update_mode mode;
|
||||
enum bch_trans_commit_flags flags;
|
||||
unsigned nodes_written:1;
|
||||
unsigned took_gc_lock:1;
|
||||
|
||||
enum btree_id btree_id;
|
||||
struct bpos node_start;
|
||||
struct bpos node_end;
|
||||
enum btree_node_rewrite_reason node_needed_rewrite;
|
||||
u16 node_written;
|
||||
u16 node_sectors;
|
||||
u16 node_remaining;
|
||||
|
||||
unsigned update_level_start;
|
||||
unsigned update_level_end;
|
||||
|
||||
struct disk_reservation disk_res;
|
||||
|
||||
/*
|
||||
* BTREE_UPDATE_node:
|
||||
* The update that made the new nodes visible was a regular update to an
|
||||
* existing interior node - @b. We can't write out the update to @b
|
||||
* until the new nodes we created are finished writing, so we block @b
|
||||
* from writing by putting this btree_interior update on the
|
||||
* @b->write_blocked list with @write_blocked_list:
|
||||
*/
|
||||
struct btree *b;
|
||||
struct list_head write_blocked_list;
|
||||
|
||||
/*
|
||||
* We may be freeing nodes that were dirty, and thus had journal entries
|
||||
* pinned: we need to transfer the oldest of those pins to the
|
||||
* btree_update operation, and release it when the new node(s)
|
||||
* are all persistent and reachable:
|
||||
*/
|
||||
struct journal_entry_pin journal;
|
||||
|
||||
/* Preallocated nodes we reserve when we start the update: */
|
||||
struct prealloc_nodes {
|
||||
struct btree *b[BTREE_UPDATE_NODES_MAX];
|
||||
unsigned nr;
|
||||
} prealloc_nodes[2];
|
||||
|
||||
/* Nodes being freed: */
|
||||
struct keylist old_keys;
|
||||
u64 _old_keys[BTREE_UPDATE_NODES_MAX *
|
||||
BKEY_BTREE_PTR_U64s_MAX];
|
||||
|
||||
/* Nodes being added: */
|
||||
struct keylist new_keys;
|
||||
u64 _new_keys[BTREE_UPDATE_NODES_MAX *
|
||||
BKEY_BTREE_PTR_U64s_MAX];
|
||||
|
||||
/* New nodes, that will be made reachable by this update: */
|
||||
struct btree *new_nodes[BTREE_UPDATE_NODES_MAX];
|
||||
unsigned nr_new_nodes;
|
||||
|
||||
struct btree *old_nodes[BTREE_UPDATE_NODES_MAX];
|
||||
__le64 old_nodes_seq[BTREE_UPDATE_NODES_MAX];
|
||||
unsigned nr_old_nodes;
|
||||
|
||||
open_bucket_idx_t open_buckets[BTREE_UPDATE_NODES_MAX *
|
||||
BCH_REPLICAS_MAX];
|
||||
open_bucket_idx_t nr_open_buckets;
|
||||
|
||||
unsigned journal_u64s;
|
||||
u64 journal_entries[BTREE_UPDATE_JOURNAL_RES];
|
||||
|
||||
/* Only here to reduce stack usage on recursive splits: */
|
||||
struct keylist parent_keys;
|
||||
	/*
	 * Enough room for btree_split's keys without realloc - btree node
	 * pointers never have crc/compression info, so we only need to account
	 * for the pointers for three keys
	 */
|
||||
u64 inline_keys[BKEY_BTREE_PTR_U64s_MAX * 3];
|
||||
};
|
||||
|
||||
struct btree *__bch2_btree_node_alloc_replacement(struct btree_update *,
|
||||
struct btree_trans *,
|
||||
struct btree *,
|
||||
struct bkey_format);
|
||||
|
||||
int bch2_btree_split_leaf(struct btree_trans *, btree_path_idx_t, unsigned);
|
||||
|
||||
int bch2_btree_increase_depth(struct btree_trans *, btree_path_idx_t, unsigned);
|
||||
|
||||
int __bch2_foreground_maybe_merge(struct btree_trans *, btree_path_idx_t,
|
||||
unsigned, unsigned, enum btree_node_sibling);
|
||||
|
||||
static inline int bch2_foreground_maybe_merge_sibling(struct btree_trans *trans,
|
||||
btree_path_idx_t path_idx,
|
||||
unsigned level, unsigned flags,
|
||||
enum btree_node_sibling sib)
|
||||
{
|
||||
struct btree_path *path = trans->paths + path_idx;
|
||||
struct btree *b;
|
||||
|
||||
EBUG_ON(!btree_node_locked(path, level));
|
||||
|
||||
if (static_branch_unlikely(&bch2_btree_node_merging_disabled))
|
||||
return 0;
|
||||
|
||||
b = path->l[level].b;
|
||||
if (b->sib_u64s[sib] > trans->c->btree_foreground_merge_threshold)
|
||||
return 0;
|
||||
|
||||
return __bch2_foreground_maybe_merge(trans, path_idx, level, flags, sib);
|
||||
}
|
||||
|
||||
static inline int bch2_foreground_maybe_merge(struct btree_trans *trans,
|
||||
btree_path_idx_t path,
|
||||
unsigned level,
|
||||
unsigned flags)
|
||||
{
|
||||
bch2_trans_verify_not_unlocked_or_in_restart(trans);
|
||||
|
||||
return bch2_foreground_maybe_merge_sibling(trans, path, level, flags,
|
||||
btree_prev_sib) ?:
|
||||
bch2_foreground_maybe_merge_sibling(trans, path, level, flags,
|
||||
btree_next_sib);
|
||||
}
|
||||
|
||||
int bch2_btree_node_rewrite(struct btree_trans *, struct btree_iter *,
|
||||
struct btree *, unsigned, unsigned);
|
||||
int bch2_btree_node_rewrite_key(struct btree_trans *,
|
||||
enum btree_id, unsigned,
|
||||
struct bkey_i *, unsigned);
|
||||
int bch2_btree_node_rewrite_pos(struct btree_trans *,
|
||||
enum btree_id, unsigned,
|
||||
struct bpos, unsigned, unsigned);
|
||||
int bch2_btree_node_rewrite_key_get_iter(struct btree_trans *,
|
||||
struct btree *, unsigned);
|
||||
|
||||
void bch2_btree_node_rewrite_async(struct bch_fs *, struct btree *);
|
||||
|
||||
int bch2_btree_node_update_key(struct btree_trans *, struct btree_iter *,
|
||||
struct btree *, struct bkey_i *,
|
||||
unsigned, bool);
|
||||
int bch2_btree_node_update_key_get_iter(struct btree_trans *, struct btree *,
|
||||
struct bkey_i *, unsigned, bool);
|
||||
|
||||
void bch2_btree_set_root_for_read(struct bch_fs *, struct btree *);
|
||||
|
||||
int bch2_btree_root_alloc_fake_trans(struct btree_trans *, enum btree_id, unsigned);
|
||||
void bch2_btree_root_alloc_fake(struct bch_fs *, enum btree_id, unsigned);
|
||||
|
||||
static inline unsigned btree_update_reserve_required(struct bch_fs *c,
|
||||
struct btree *b)
|
||||
{
|
||||
unsigned depth = btree_node_root(c, b)->c.level + 1;
|
||||
|
||||
/*
|
||||
* Number of nodes we might have to allocate in a worst case btree
|
||||
* split operation - we split all the way up to the root, then allocate
|
||||
* a new root, unless we're already at max depth:
|
||||
*/
|
||||
if (depth < BTREE_MAX_DEPTH)
|
||||
return (depth - b->c.level) * 2 + 1;
|
||||
else
|
||||
return (depth - b->c.level) * 2 - 1;
|
||||
}
|
||||
|
||||
static inline void btree_node_reset_sib_u64s(struct btree *b)
|
||||
{
|
||||
b->sib_u64s[0] = b->nr.live_u64s;
|
||||
b->sib_u64s[1] = b->nr.live_u64s;
|
||||
}
|
||||
|
||||
static inline void *btree_data_end(struct btree *b)
|
||||
{
|
||||
return (void *) b->data + btree_buf_bytes(b);
|
||||
}
|
||||
|
||||
static inline struct bkey_packed *unwritten_whiteouts_start(struct btree *b)
|
||||
{
|
||||
return (void *) ((u64 *) btree_data_end(b) - b->whiteout_u64s);
|
||||
}
|
||||
|
||||
static inline struct bkey_packed *unwritten_whiteouts_end(struct btree *b)
|
||||
{
|
||||
return btree_data_end(b);
|
||||
}
|
||||
|
||||
static inline void *write_block(struct btree *b)
|
||||
{
|
||||
return (void *) b->data + (b->written << 9);
|
||||
}
|
||||
|
||||
static inline bool __btree_addr_written(struct btree *b, void *p)
|
||||
{
|
||||
return p < write_block(b);
|
||||
}
|
||||
|
||||
static inline bool bset_written(struct btree *b, struct bset *i)
|
||||
{
|
||||
return __btree_addr_written(b, i);
|
||||
}
|
||||
|
||||
static inline bool bkey_written(struct btree *b, struct bkey_packed *k)
|
||||
{
|
||||
return __btree_addr_written(b, k);
|
||||
}
|
||||
|
||||
static inline ssize_t __bch2_btree_u64s_remaining(struct btree *b, void *end)
|
||||
{
|
||||
ssize_t used = bset_byte_offset(b, end) / sizeof(u64) +
|
||||
b->whiteout_u64s;
|
||||
ssize_t total = btree_buf_bytes(b) >> 3;
|
||||
|
||||
/* Always leave one extra u64 for bch2_varint_decode: */
|
||||
used++;
|
||||
|
||||
return total - used;
|
||||
}
|
||||
|
||||
static inline size_t bch2_btree_keys_u64s_remaining(struct btree *b)
|
||||
{
|
||||
ssize_t remaining = __bch2_btree_u64s_remaining(b,
|
||||
btree_bkey_last(b, bset_tree_last(b)));
|
||||
|
||||
BUG_ON(remaining < 0);
|
||||
|
||||
if (bset_written(b, btree_bset_last(b)))
|
||||
return 0;
|
||||
|
||||
return remaining;
|
||||
}
|
||||
|
||||
#define BTREE_WRITE_SET_U64s_BITS 9
|
||||
|
||||
static inline unsigned btree_write_set_buffer(struct btree *b)
|
||||
{
|
||||
/*
|
||||
* Could buffer up larger amounts of keys for btrees with larger keys,
|
||||
* pending benchmarking:
|
||||
*/
|
||||
return 8 << BTREE_WRITE_SET_U64s_BITS;
|
||||
}
|
||||
|
||||
static inline struct btree_node_entry *want_new_bset(struct bch_fs *c, struct btree *b)
|
||||
{
|
||||
struct bset_tree *t = bset_tree_last(b);
|
||||
struct btree_node_entry *bne = max(write_block(b),
|
||||
(void *) btree_bkey_last(b, t));
|
||||
ssize_t remaining_space =
|
||||
__bch2_btree_u64s_remaining(b, bne->keys.start);
|
||||
|
||||
if (unlikely(bset_written(b, bset(b, t)))) {
|
||||
if (b->written + block_sectors(c) <= btree_sectors(c))
|
||||
return bne;
|
||||
} else {
|
||||
if (unlikely(bset_u64s(t) * sizeof(u64) > btree_write_set_buffer(b)) &&
|
||||
remaining_space > (ssize_t) (btree_write_set_buffer(b) >> 3))
|
||||
return bne;
|
||||
}
|
||||
|
||||
return NULL;
|
||||
}
|
||||
|
||||
static inline void push_whiteout(struct btree *b, struct bpos pos)
|
||||
{
|
||||
struct bkey_packed k;
|
||||
|
||||
BUG_ON(bch2_btree_keys_u64s_remaining(b) < BKEY_U64s);
|
||||
EBUG_ON(btree_node_just_written(b));
|
||||
|
||||
if (!bkey_pack_pos(&k, pos, b)) {
|
||||
struct bkey *u = (void *) &k;
|
||||
|
||||
bkey_init(u);
|
||||
u->p = pos;
|
||||
}
|
||||
|
||||
k.needs_whiteout = true;
|
||||
|
||||
b->whiteout_u64s += k.u64s;
|
||||
bkey_p_copy(unwritten_whiteouts_start(b), &k);
|
||||
}
|
||||
|
||||
/*
|
||||
* write lock must be held on @b (else the dirty bset that we were going to
|
||||
* insert into could be written out from under us)
|
||||
*/
|
||||
static inline bool bch2_btree_node_insert_fits(struct btree *b, unsigned u64s)
|
||||
{
|
||||
if (unlikely(btree_node_need_rewrite(b)))
|
||||
return false;
|
||||
|
||||
return u64s <= bch2_btree_keys_u64s_remaining(b);
|
||||
}
|
||||
|
||||
void bch2_btree_updates_to_text(struct printbuf *, struct bch_fs *);
|
||||
|
||||
bool bch2_btree_interior_updates_flush(struct bch_fs *);
|
||||
|
||||
void bch2_journal_entry_to_btree_root(struct bch_fs *, struct jset_entry *);
|
||||
struct jset_entry *bch2_btree_roots_to_journal_entries(struct bch_fs *,
|
||||
struct jset_entry *, unsigned long);
|
||||
|
||||
void bch2_async_btree_node_rewrites_flush(struct bch_fs *);
|
||||
void bch2_do_pending_node_rewrites(struct bch_fs *);
|
||||
void bch2_free_pending_node_rewrites(struct bch_fs *);
|
||||
|
||||
void bch2_btree_reserve_cache_to_text(struct printbuf *, struct bch_fs *);
|
||||
|
||||
void bch2_fs_btree_interior_update_exit(struct bch_fs *);
|
||||
void bch2_fs_btree_interior_update_init_early(struct bch_fs *);
|
||||
int bch2_fs_btree_interior_update_init(struct bch_fs *);
|
||||
|
||||
#endif /* _BCACHEFS_BTREE_UPDATE_INTERIOR_H */
|
||||
@@ -1,893 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "bkey_buf.h"
|
||||
#include "btree_locking.h"
|
||||
#include "btree_update.h"
|
||||
#include "btree_update_interior.h"
|
||||
#include "btree_write_buffer.h"
|
||||
#include "disk_accounting.h"
|
||||
#include "enumerated_ref.h"
|
||||
#include "error.h"
|
||||
#include "extents.h"
|
||||
#include "journal.h"
|
||||
#include "journal_io.h"
|
||||
#include "journal_reclaim.h"
|
||||
|
||||
#include <linux/prefetch.h>
|
||||
#include <linux/sort.h>
|
||||
|
||||
static int bch2_btree_write_buffer_journal_flush(struct journal *,
|
||||
struct journal_entry_pin *, u64);
|
||||
|
||||
static inline bool __wb_key_ref_cmp(const struct wb_key_ref *l, const struct wb_key_ref *r)
|
||||
{
|
||||
return (cmp_int(l->hi, r->hi) ?:
|
||||
cmp_int(l->mi, r->mi) ?:
|
||||
cmp_int(l->lo, r->lo)) >= 0;
|
||||
}
|
||||
|
||||
static inline bool wb_key_ref_cmp(const struct wb_key_ref *l, const struct wb_key_ref *r)
|
||||
{
|
||||
#ifdef CONFIG_X86_64
|
||||
int cmp;
|
||||
|
||||
asm("mov (%[l]), %%rax;"
|
||||
"sub (%[r]), %%rax;"
|
||||
"mov 8(%[l]), %%rax;"
|
||||
"sbb 8(%[r]), %%rax;"
|
||||
"mov 16(%[l]), %%rax;"
|
||||
"sbb 16(%[r]), %%rax;"
|
||||
: "=@ccae" (cmp)
|
||||
: [l] "r" (l), [r] "r" (r)
|
||||
: "rax", "cc");
|
||||
|
||||
EBUG_ON(cmp != __wb_key_ref_cmp(l, r));
|
||||
return cmp;
|
||||
#else
|
||||
return __wb_key_ref_cmp(l, r);
|
||||
#endif
|
||||
}
|
||||
|
||||
static int wb_key_seq_cmp(const void *_l, const void *_r)
|
||||
{
|
||||
const struct btree_write_buffered_key *l = _l;
|
||||
const struct btree_write_buffered_key *r = _r;
|
||||
|
||||
return cmp_int(l->journal_seq, r->journal_seq);
|
||||
}
|
||||
|
||||
/* Compare excluding idx, the low 24 bits: */
|
||||
static inline bool wb_key_eq(const void *_l, const void *_r)
|
||||
{
|
||||
const struct wb_key_ref *l = _l;
|
||||
const struct wb_key_ref *r = _r;
|
||||
|
||||
return !((l->hi ^ r->hi)|
|
||||
(l->mi ^ r->mi)|
|
||||
((l->lo >> 24) ^ (r->lo >> 24)));
|
||||
}
|
||||
|
||||
static noinline void wb_sort(struct wb_key_ref *base, size_t num)
|
||||
{
|
||||
size_t n = num, a = num / 2;
|
||||
|
||||
if (!a) /* num < 2 || size == 0 */
|
||||
return;
|
||||
|
||||
for (;;) {
|
||||
size_t b, c, d;
|
||||
|
||||
if (a) /* Building heap: sift down --a */
|
||||
--a;
|
||||
else if (--n) /* Sorting: Extract root to --n */
|
||||
swap(base[0], base[n]);
|
||||
else /* Sort complete */
|
||||
break;
|
||||
|
||||
/*
|
||||
* Sift element at "a" down into heap. This is the
|
||||
* "bottom-up" variant, which significantly reduces
|
||||
* calls to cmp_func(): we find the sift-down path all
|
||||
* the way to the leaves (one compare per level), then
|
||||
* backtrack to find where to insert the target element.
|
||||
*
|
||||
* Because elements tend to sift down close to the leaves,
|
||||
* this uses fewer compares than doing two per level
|
||||
* on the way down. (A bit more than half as many on
|
||||
* average, 3/4 worst-case.)
|
||||
*/
|
||||
for (b = a; c = 2*b + 1, (d = c + 1) < n;)
|
||||
b = wb_key_ref_cmp(base + c, base + d) ? c : d;
|
||||
if (d == n) /* Special case last leaf with no sibling */
|
||||
b = c;
|
||||
|
||||
/* Now backtrack from "b" to the correct location for "a" */
|
||||
while (b != a && wb_key_ref_cmp(base + a, base + b))
|
||||
b = (b - 1) / 2;
|
||||
c = b; /* Where "a" belongs */
|
||||
while (b != a) { /* Shift it into place */
|
||||
b = (b - 1) / 2;
|
||||
swap(base[b], base[c]);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
static noinline int wb_flush_one_slowpath(struct btree_trans *trans,
|
||||
struct btree_iter *iter,
|
||||
struct btree_write_buffered_key *wb)
|
||||
{
|
||||
struct btree_path *path = btree_iter_path(trans, iter);
|
||||
|
||||
bch2_btree_node_unlock_write(trans, path, path->l[0].b);
|
||||
|
||||
trans->journal_res.seq = wb->journal_seq;
|
||||
|
||||
return bch2_trans_update(trans, iter, &wb->k,
|
||||
BTREE_UPDATE_internal_snapshot_node) ?:
|
||||
bch2_trans_commit(trans, NULL, NULL,
|
||||
BCH_TRANS_COMMIT_no_enospc|
|
||||
BCH_TRANS_COMMIT_no_check_rw|
|
||||
BCH_TRANS_COMMIT_no_journal_res|
|
||||
BCH_TRANS_COMMIT_journal_reclaim);
|
||||
}
|
||||
|
||||
static inline int wb_flush_one(struct btree_trans *trans, struct btree_iter *iter,
|
||||
struct btree_write_buffered_key *wb,
|
||||
bool *write_locked,
|
||||
bool *accounting_accumulated,
|
||||
size_t *fast)
|
||||
{
|
||||
struct btree_path *path;
|
||||
int ret;
|
||||
|
||||
EBUG_ON(!wb->journal_seq);
|
||||
EBUG_ON(!trans->c->btree_write_buffer.flushing.pin.seq);
|
||||
EBUG_ON(trans->c->btree_write_buffer.flushing.pin.seq > wb->journal_seq);
|
||||
|
||||
ret = bch2_btree_iter_traverse(trans, iter);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
|
||||
struct bkey u;
|
||||
struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
|
||||
|
||||
if (k.k->type == KEY_TYPE_accounting)
|
||||
bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
|
||||
bkey_s_c_to_accounting(k));
|
||||
}
|
||||
*accounting_accumulated = true;
|
||||
|
||||
/*
|
||||
* We can't clone a path that has write locks: unshare it now, before
|
||||
* set_pos and traverse():
|
||||
*/
|
||||
if (btree_iter_path(trans, iter)->ref > 1)
|
||||
iter->path = __bch2_btree_path_make_mut(trans, iter->path, true, _THIS_IP_);
|
||||
|
||||
path = btree_iter_path(trans, iter);
|
||||
|
||||
if (!*write_locked) {
|
||||
ret = bch2_btree_node_lock_write(trans, path, &path->l[0].b->c);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
bch2_btree_node_prep_for_write(trans, path, path->l[0].b);
|
||||
*write_locked = true;
|
||||
}
|
||||
|
||||
if (unlikely(!bch2_btree_node_insert_fits(path->l[0].b, wb->k.k.u64s))) {
|
||||
*write_locked = false;
|
||||
return wb_flush_one_slowpath(trans, iter, wb);
|
||||
}
|
||||
|
||||
EBUG_ON(!bpos_eq(wb->k.k.p, path->pos));
|
||||
|
||||
bch2_btree_insert_key_leaf(trans, path, &wb->k, wb->journal_seq);
|
||||
(*fast)++;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/*
 * Update a btree with a write buffered key using the journal seq of the
 * original write buffer insert.
 *
 * It is not safe to rejournal the key once it has been inserted into the write
 * buffer because that may break recovery ordering. For example, the key may
 * have already been modified in the active write buffer in a seq that comes
 * before the current transaction. If we were to journal this key again and
 * crash, recovery would process updates in the wrong order.
 */
|
||||
static int
|
||||
btree_write_buffered_insert(struct btree_trans *trans,
|
||||
struct btree_write_buffered_key *wb)
|
||||
{
|
||||
struct btree_iter iter;
|
||||
int ret;
|
||||
|
||||
bch2_trans_iter_init(trans, &iter, wb->btree, bkey_start_pos(&wb->k.k),
|
||||
BTREE_ITER_cached|BTREE_ITER_intent);
|
||||
|
||||
trans->journal_res.seq = wb->journal_seq;
|
||||
|
||||
ret = bch2_btree_iter_traverse(trans, &iter) ?:
|
||||
bch2_trans_update(trans, &iter, &wb->k,
|
||||
BTREE_UPDATE_internal_snapshot_node);
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static void move_keys_from_inc_to_flushing(struct btree_write_buffer *wb)
|
||||
{
|
||||
struct bch_fs *c = container_of(wb, struct bch_fs, btree_write_buffer);
|
||||
struct journal *j = &c->journal;
|
||||
|
||||
if (!wb->inc.keys.nr)
|
||||
return;
|
||||
|
||||
bch2_journal_pin_add(j, wb->inc.keys.data[0].journal_seq, &wb->flushing.pin,
|
||||
bch2_btree_write_buffer_journal_flush);
|
||||
|
||||
darray_resize(&wb->flushing.keys, min_t(size_t, 1U << 20, wb->flushing.keys.nr + wb->inc.keys.nr));
|
||||
darray_resize(&wb->sorted, wb->flushing.keys.size);
|
||||
|
||||
if (!wb->flushing.keys.nr && wb->sorted.size >= wb->inc.keys.nr) {
|
||||
swap(wb->flushing.keys, wb->inc.keys);
|
||||
goto out;
|
||||
}
|
||||
|
||||
size_t nr = min(darray_room(wb->flushing.keys),
|
||||
wb->sorted.size - wb->flushing.keys.nr);
|
||||
nr = min(nr, wb->inc.keys.nr);
|
||||
|
||||
memcpy(&darray_top(wb->flushing.keys),
|
||||
wb->inc.keys.data,
|
||||
sizeof(wb->inc.keys.data[0]) * nr);
|
||||
|
||||
memmove(wb->inc.keys.data,
|
||||
wb->inc.keys.data + nr,
|
||||
sizeof(wb->inc.keys.data[0]) * (wb->inc.keys.nr - nr));
|
||||
|
||||
wb->flushing.keys.nr += nr;
|
||||
wb->inc.keys.nr -= nr;
|
||||
out:
|
||||
if (!wb->inc.keys.nr)
|
||||
bch2_journal_pin_drop(j, &wb->inc.pin);
|
||||
else
|
||||
bch2_journal_pin_update(j, wb->inc.keys.data[0].journal_seq, &wb->inc.pin,
|
||||
bch2_btree_write_buffer_journal_flush);
|
||||
|
||||
if (j->watermark) {
|
||||
spin_lock(&j->lock);
|
||||
bch2_journal_set_watermark(j);
|
||||
spin_unlock(&j->lock);
|
||||
}
|
||||
|
||||
BUG_ON(wb->sorted.size < wb->flushing.keys.nr);
|
||||
}
|
||||
|
||||
int bch2_btree_write_buffer_insert_err(struct bch_fs *c,
|
||||
enum btree_id btree, struct bkey_i *k)
|
||||
{
|
||||
struct printbuf buf = PRINTBUF;
|
||||
|
||||
prt_printf(&buf, "attempting to do write buffer update on non wb btree=");
|
||||
bch2_btree_id_to_text(&buf, btree);
|
||||
prt_str(&buf, "\n");
|
||||
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(k));
|
||||
|
||||
bch2_fs_inconsistent(c, "%s", buf.buf);
|
||||
printbuf_exit(&buf);
|
||||
return -EROFS;
|
||||
}
|
||||
|
||||
static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct journal *j = &c->journal;
|
||||
struct btree_write_buffer *wb = &c->btree_write_buffer;
|
||||
struct btree_iter iter = {};
|
||||
size_t overwritten = 0, fast = 0, slowpath = 0, could_not_insert = 0;
|
||||
bool write_locked = false;
|
||||
bool accounting_replay_done = test_bit(BCH_FS_accounting_replay_done, &c->flags);
|
||||
int ret = 0;
|
||||
|
||||
ret = bch2_journal_error(&c->journal);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
bch2_trans_unlock(trans);
|
||||
bch2_trans_begin(trans);
|
||||
|
||||
mutex_lock(&wb->inc.lock);
|
||||
move_keys_from_inc_to_flushing(wb);
|
||||
mutex_unlock(&wb->inc.lock);
|
||||
|
||||
for (size_t i = 0; i < wb->flushing.keys.nr; i++) {
|
||||
wb->sorted.data[i].idx = i;
|
||||
wb->sorted.data[i].btree = wb->flushing.keys.data[i].btree;
|
||||
memcpy(&wb->sorted.data[i].pos, &wb->flushing.keys.data[i].k.k.p, sizeof(struct bpos));
|
||||
}
|
||||
wb->sorted.nr = wb->flushing.keys.nr;
|
||||
|
||||
	/*
	 * We first sort so that we can detect and skip redundant updates, and
	 * then we attempt to flush in sorted btree order, as this is most
	 * efficient.
	 *
	 * However, since we're not flushing in the order they appear in the
	 * journal we won't be able to drop our journal pin until everything is
	 * flushed - which means this could deadlock the journal if we weren't
	 * passing BCH_TRANS_COMMIT_journal_reclaim. This causes the update to fail
	 * if it would block taking a journal reservation.
	 *
	 * If that happens, simply skip the key so we can optimistically insert
	 * as many keys as possible in the fast path.
	 */
|
||||
wb_sort(wb->sorted.data, wb->sorted.nr);
|
||||
|
||||
darray_for_each(wb->sorted, i) {
|
||||
struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
|
||||
|
||||
if (unlikely(!btree_type_uses_write_buffer(k->btree))) {
|
||||
ret = bch2_btree_write_buffer_insert_err(trans->c, k->btree, &k->k);
|
||||
goto err;
|
||||
}
|
||||
|
||||
for (struct wb_key_ref *n = i + 1; n < min(i + 4, &darray_top(wb->sorted)); n++)
|
||||
prefetch(&wb->flushing.keys.data[n->idx]);
|
||||
|
||||
BUG_ON(!k->journal_seq);
|
||||
|
||||
if (!accounting_replay_done &&
|
||||
k->k.k.type == KEY_TYPE_accounting) {
|
||||
slowpath++;
|
||||
continue;
|
||||
}
|
||||
|
||||
if (i + 1 < &darray_top(wb->sorted) &&
|
||||
wb_key_eq(i, i + 1)) {
|
||||
struct btree_write_buffered_key *n = &wb->flushing.keys.data[i[1].idx];
|
||||
|
||||
if (k->k.k.type == KEY_TYPE_accounting &&
|
||||
n->k.k.type == KEY_TYPE_accounting)
|
||||
bch2_accounting_accumulate(bkey_i_to_accounting(&n->k),
|
||||
bkey_i_to_s_c_accounting(&k->k));
|
||||
|
||||
overwritten++;
|
||||
n->journal_seq = min_t(u64, n->journal_seq, k->journal_seq);
|
||||
k->journal_seq = 0;
|
||||
continue;
|
||||
}
|
||||
|
||||
if (write_locked) {
|
||||
struct btree_path *path = btree_iter_path(trans, &iter);
|
||||
|
||||
if (path->btree_id != i->btree ||
|
||||
bpos_gt(k->k.k.p, path->l[0].b->key.k.p)) {
|
||||
bch2_btree_node_unlock_write(trans, path, path->l[0].b);
|
||||
write_locked = false;
|
||||
|
||||
ret = lockrestart_do(trans,
|
||||
bch2_btree_iter_traverse(trans, &iter) ?:
|
||||
bch2_foreground_maybe_merge(trans, iter.path, 0,
|
||||
BCH_WATERMARK_reclaim|
|
||||
BCH_TRANS_COMMIT_journal_reclaim|
|
||||
BCH_TRANS_COMMIT_no_check_rw|
|
||||
BCH_TRANS_COMMIT_no_enospc));
|
||||
if (ret)
|
||||
goto err;
|
||||
}
|
||||
}
|
||||
|
||||
if (!iter.path || iter.btree_id != k->btree) {
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
bch2_trans_iter_init(trans, &iter, k->btree, k->k.k.p,
|
||||
BTREE_ITER_intent|BTREE_ITER_all_snapshots);
|
||||
}
|
||||
|
||||
bch2_btree_iter_set_pos(trans, &iter, k->k.k.p);
|
||||
btree_iter_path(trans, &iter)->preserve = false;
|
||||
|
||||
bool accounting_accumulated = false;
|
||||
do {
|
||||
if (race_fault()) {
|
||||
ret = bch_err_throw(c, journal_reclaim_would_deadlock);
|
||||
break;
|
||||
}
|
||||
|
||||
ret = wb_flush_one(trans, &iter, k, &write_locked,
|
||||
&accounting_accumulated, &fast);
|
||||
if (!write_locked)
|
||||
bch2_trans_begin(trans);
|
||||
} while (bch2_err_matches(ret, BCH_ERR_transaction_restart));
|
||||
|
||||
if (!ret) {
|
||||
k->journal_seq = 0;
|
||||
} else if (ret == -BCH_ERR_journal_reclaim_would_deadlock) {
|
||||
slowpath++;
|
||||
ret = 0;
|
||||
} else
|
||||
break;
|
||||
}
|
||||
|
||||
if (write_locked) {
|
||||
struct btree_path *path = btree_iter_path(trans, &iter);
|
||||
bch2_btree_node_unlock_write(trans, path, path->l[0].b);
|
||||
}
|
||||
bch2_trans_iter_exit(trans, &iter);
|
||||
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
if (slowpath) {
|
||||
		/*
		 * Flush in the order they were present in the journal, so that
		 * we can release journal pins:
		 * The fastpath zapped the seq of keys that were successfully flushed so
		 * we can skip those here.
		 */
|
||||
trace_and_count(c, write_buffer_flush_slowpath, trans, slowpath, wb->flushing.keys.nr);
|
||||
|
||||
sort_nonatomic(wb->flushing.keys.data,
|
||||
wb->flushing.keys.nr,
|
||||
sizeof(wb->flushing.keys.data[0]),
|
||||
wb_key_seq_cmp, NULL);
|
||||
|
||||
darray_for_each(wb->flushing.keys, i) {
|
||||
if (!i->journal_seq)
|
||||
continue;
|
||||
|
||||
if (!accounting_replay_done &&
|
||||
i->k.k.type == KEY_TYPE_accounting) {
|
||||
could_not_insert++;
|
||||
continue;
|
||||
}
|
||||
|
||||
if (!could_not_insert)
|
||||
bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
|
||||
bch2_btree_write_buffer_journal_flush);
|
||||
|
||||
bch2_trans_begin(trans);
|
||||
|
||||
ret = commit_do(trans, NULL, NULL,
|
||||
BCH_WATERMARK_reclaim|
|
||||
BCH_TRANS_COMMIT_journal_reclaim|
|
||||
BCH_TRANS_COMMIT_no_check_rw|
|
||||
BCH_TRANS_COMMIT_no_enospc|
|
||||
BCH_TRANS_COMMIT_no_journal_res ,
|
||||
btree_write_buffered_insert(trans, i));
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
i->journal_seq = 0;
|
||||
}
|
||||
|
||||
		/*
		 * If journal replay hasn't finished with accounting keys we
		 * can't flush accounting keys at all - condense them and leave
		 * them for next time.
		 *
		 * Q: Can the write buffer overflow?
		 * A: Shouldn't be any actual risk. It's just new accounting
		 * updates that the write buffer can't flush, and those are only
		 * going to be generated by interior btree node updates as
		 * journal replay has to split/rewrite nodes to make room for
		 * its updates.
		 *
		 * And for those new accounting updates, updates to the same
		 * counters get accumulated as they're flushed from the journal
		 * to the write buffer - see the patch for eytzinger tree
		 * accumulation. So we could only overflow if the number of
		 * distinct counters touched somehow was very large.
		 */
|
||||
if (could_not_insert) {
|
||||
struct btree_write_buffered_key *dst = wb->flushing.keys.data;
|
||||
|
||||
darray_for_each(wb->flushing.keys, i)
|
||||
if (i->journal_seq)
|
||||
*dst++ = *i;
|
||||
wb->flushing.keys.nr = dst - wb->flushing.keys.data;
|
||||
}
|
||||
}
|
||||
err:
|
||||
if (ret || !could_not_insert) {
|
||||
bch2_journal_pin_drop(j, &wb->flushing.pin);
|
||||
wb->flushing.keys.nr = 0;
|
||||
}
|
||||
|
||||
bch2_fs_fatal_err_on(ret, c, "%s", bch2_err_str(ret));
|
||||
trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int bch2_journal_keys_to_write_buffer(struct bch_fs *c, struct journal_buf *buf)
|
||||
{
|
||||
struct journal_keys_to_wb dst;
|
||||
int ret = 0;
|
||||
|
||||
bch2_journal_keys_to_write_buffer_start(c, &dst, le64_to_cpu(buf->data->seq));
|
||||
|
||||
for_each_jset_entry_type(entry, buf->data, BCH_JSET_ENTRY_write_buffer_keys) {
|
||||
jset_entry_for_each_key(entry, k) {
|
||||
ret = bch2_journal_key_to_wb(c, &dst, entry->btree_id, k);
|
||||
if (ret)
|
||||
goto out;
|
||||
}
|
||||
|
||||
entry->type = BCH_JSET_ENTRY_btree_keys;
|
||||
}
|
||||
out:
|
||||
ret = bch2_journal_keys_to_write_buffer_end(c, &dst) ?: ret;
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int fetch_wb_keys_from_journal(struct bch_fs *c, u64 max_seq)
|
||||
{
|
||||
struct journal *j = &c->journal;
|
||||
struct journal_buf *buf;
|
||||
bool blocked;
|
||||
int ret = 0;
|
||||
|
||||
while (!ret && (buf = bch2_next_write_buffer_flush_journal_buf(j, max_seq, &blocked))) {
|
||||
ret = bch2_journal_keys_to_write_buffer(c, buf);
|
||||
|
||||
if (!blocked && !ret) {
|
||||
spin_lock(&j->lock);
|
||||
buf->need_flush_to_write_buffer = false;
|
||||
spin_unlock(&j->lock);
|
||||
}
|
||||
|
||||
mutex_unlock(&j->buf_lock);
|
||||
|
||||
if (blocked) {
|
||||
bch2_journal_unblock(j);
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int btree_write_buffer_flush_seq(struct btree_trans *trans, u64 max_seq,
|
||||
bool *did_work)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct btree_write_buffer *wb = &c->btree_write_buffer;
|
||||
int ret = 0, fetch_from_journal_err;
|
||||
|
||||
do {
|
||||
bch2_trans_unlock(trans);
|
||||
|
||||
fetch_from_journal_err = fetch_wb_keys_from_journal(c, max_seq);
|
||||
|
||||
*did_work |= wb->inc.keys.nr || wb->flushing.keys.nr;
|
||||
|
||||
/*
|
||||
* On memory allocation failure, bch2_btree_write_buffer_flush_locked()
|
||||
* is not guaranteed to empty wb->inc:
|
||||
*/
|
||||
mutex_lock(&wb->flushing.lock);
|
||||
ret = bch2_btree_write_buffer_flush_locked(trans);
|
||||
mutex_unlock(&wb->flushing.lock);
|
||||
} while (!ret &&
|
||||
(fetch_from_journal_err ||
|
||||
(wb->inc.pin.seq && wb->inc.pin.seq <= max_seq) ||
|
||||
(wb->flushing.pin.seq && wb->flushing.pin.seq <= max_seq)));
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int bch2_btree_write_buffer_journal_flush(struct journal *j,
|
||||
struct journal_entry_pin *_pin, u64 seq)
|
||||
{
|
||||
struct bch_fs *c = container_of(j, struct bch_fs, journal);
|
||||
bool did_work = false;
|
||||
|
||||
return bch2_trans_run(c, btree_write_buffer_flush_seq(trans, seq, &did_work));
|
||||
}
|
||||
|
||||
int bch2_btree_write_buffer_flush_sync(struct btree_trans *trans)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
bool did_work = false;
|
||||
|
||||
trace_and_count(c, write_buffer_flush_sync, trans, _RET_IP_);
|
||||
|
||||
return btree_write_buffer_flush_seq(trans, journal_cur_seq(&c->journal), &did_work);
|
||||
}
|
||||
|
||||
/*
|
||||
* The write buffer requires flushing when going RO: keys in the journal for the
|
||||
* write buffer don't have a journal pin yet
|
||||
*/
|
||||
bool bch2_btree_write_buffer_flush_going_ro(struct bch_fs *c)
|
||||
{
|
||||
if (bch2_journal_error(&c->journal))
|
||||
return false;
|
||||
|
||||
bool did_work = false;
|
||||
bch2_trans_run(c, btree_write_buffer_flush_seq(trans,
|
||||
journal_cur_seq(&c->journal), &did_work));
|
||||
return did_work;
|
||||
}
|
||||
|
||||
int bch2_btree_write_buffer_flush_nocheck_rw(struct btree_trans *trans)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct btree_write_buffer *wb = &c->btree_write_buffer;
|
||||
int ret = 0;
|
||||
|
||||
if (mutex_trylock(&wb->flushing.lock)) {
|
||||
ret = bch2_btree_write_buffer_flush_locked(trans);
|
||||
mutex_unlock(&wb->flushing.lock);
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_btree_write_buffer_tryflush(struct btree_trans *trans)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
|
||||
if (!enumerated_ref_tryget(&c->writes, BCH_WRITE_REF_btree_write_buffer))
|
||||
return bch_err_throw(c, erofs_no_writes);
|
||||
|
||||
int ret = bch2_btree_write_buffer_flush_nocheck_rw(trans);
|
||||
enumerated_ref_put(&c->writes, BCH_WRITE_REF_btree_write_buffer);
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* In check and repair code, when checking references to write buffer btrees we
|
||||
* need to issue a flush before we have a definitive error: this issues a flush
|
||||
* if this is a key we haven't yet checked.
|
||||
*/
|
||||
int bch2_btree_write_buffer_maybe_flush(struct btree_trans *trans,
|
||||
struct bkey_s_c referring_k,
|
||||
struct bkey_buf *last_flushed)
|
||||
{
|
||||
struct bch_fs *c = trans->c;
|
||||
struct bkey_buf tmp;
|
||||
int ret = 0;
|
||||
|
||||
bch2_bkey_buf_init(&tmp);
|
||||
|
||||
if (!bkey_and_val_eq(referring_k, bkey_i_to_s_c(last_flushed->k))) {
|
||||
if (trace_write_buffer_maybe_flush_enabled()) {
|
||||
struct printbuf buf = PRINTBUF;
|
||||
|
||||
bch2_bkey_val_to_text(&buf, c, referring_k);
|
||||
trace_write_buffer_maybe_flush(trans, _RET_IP_, buf.buf);
|
||||
printbuf_exit(&buf);
|
||||
}
|
||||
|
||||
bch2_bkey_buf_reassemble(&tmp, c, referring_k);
|
||||
|
||||
if (bkey_is_btree_ptr(referring_k.k)) {
|
||||
bch2_trans_unlock(trans);
|
||||
bch2_btree_interior_updates_flush(c);
|
||||
}
|
||||
|
||||
ret = bch2_btree_write_buffer_flush_sync(trans);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
bch2_bkey_buf_copy(last_flushed, c, tmp.k);
|
||||
|
||||
/* can we avoid the unconditional restart? */
|
||||
trace_and_count(c, trans_restart_write_buffer_flush, trans, _RET_IP_);
|
||||
ret = bch_err_throw(c, transaction_restart_write_buffer_flush);
|
||||
}
|
||||
err:
|
||||
bch2_bkey_buf_exit(&tmp, c);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static void bch2_btree_write_buffer_flush_work(struct work_struct *work)
|
||||
{
|
||||
struct bch_fs *c = container_of(work, struct bch_fs, btree_write_buffer.flush_work);
|
||||
struct btree_write_buffer *wb = &c->btree_write_buffer;
|
||||
int ret;
|
||||
|
||||
mutex_lock(&wb->flushing.lock);
|
||||
do {
|
||||
ret = bch2_trans_run(c, bch2_btree_write_buffer_flush_locked(trans));
|
||||
} while (!ret && bch2_btree_write_buffer_should_flush(c));
|
||||
mutex_unlock(&wb->flushing.lock);
|
||||
|
||||
enumerated_ref_put(&c->writes, BCH_WRITE_REF_btree_write_buffer);
|
||||
}
|
||||
|
||||
static void wb_accounting_sort(struct btree_write_buffer *wb)
|
||||
{
|
||||
eytzinger0_sort(wb->accounting.data, wb->accounting.nr,
|
||||
sizeof(wb->accounting.data[0]),
|
||||
wb_key_cmp, NULL);
|
||||
}
|
||||
|
||||
int bch2_accounting_key_to_wb_slowpath(struct bch_fs *c, enum btree_id btree,
|
||||
struct bkey_i_accounting *k)
|
||||
{
|
||||
struct btree_write_buffer *wb = &c->btree_write_buffer;
|
||||
struct btree_write_buffered_key new = { .btree = btree };
|
||||
|
||||
bkey_copy(&new.k, &k->k_i);
|
||||
|
||||
int ret = darray_push(&wb->accounting, new);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
wb_accounting_sort(wb);
|
||||
return 0;
|
||||
}
|
||||
|
||||
int bch2_journal_key_to_wb_slowpath(struct bch_fs *c,
				    struct journal_keys_to_wb *dst,
				    enum btree_id btree, struct bkey_i *k)
{
	struct btree_write_buffer *wb = &c->btree_write_buffer;
	int ret;
retry:
	ret = darray_make_room_gfp(&dst->wb->keys, 1, GFP_KERNEL);
	if (!ret && dst->wb == &wb->flushing)
		ret = darray_resize(&wb->sorted, wb->flushing.keys.size);

	if (unlikely(ret)) {
		if (dst->wb == &c->btree_write_buffer.flushing) {
			mutex_unlock(&dst->wb->lock);
			dst->wb = &c->btree_write_buffer.inc;
			bch2_journal_pin_add(&c->journal, dst->seq, &dst->wb->pin,
					     bch2_btree_write_buffer_journal_flush);
			goto retry;
		}

		return ret;
	}

	dst->room = darray_room(dst->wb->keys);
	if (dst->wb == &wb->flushing)
		dst->room = min(dst->room, wb->sorted.size - wb->flushing.keys.nr);
	BUG_ON(!dst->room);
	BUG_ON(!dst->seq);

	struct btree_write_buffered_key *wb_k = &darray_top(dst->wb->keys);
	wb_k->journal_seq = dst->seq;
	wb_k->btree = btree;
	bkey_copy(&wb_k->k, k);
	dst->wb->keys.nr++;
	dst->room--;
	return 0;
}

void bch2_journal_keys_to_write_buffer_start(struct bch_fs *c, struct journal_keys_to_wb *dst, u64 seq)
{
	struct btree_write_buffer *wb = &c->btree_write_buffer;

	if (mutex_trylock(&wb->flushing.lock)) {
		mutex_lock(&wb->inc.lock);
		move_keys_from_inc_to_flushing(wb);

		/*
		 * Attempt to skip wb->inc, and add keys directly to
		 * wb->flushing, saving us a copy later:
		 */

		if (!wb->inc.keys.nr) {
			dst->wb = &wb->flushing;
		} else {
			mutex_unlock(&wb->flushing.lock);
			dst->wb = &wb->inc;
		}
	} else {
		mutex_lock(&wb->inc.lock);
		dst->wb = &wb->inc;
	}

	dst->room = darray_room(dst->wb->keys);
	if (dst->wb == &wb->flushing)
		dst->room = min(dst->room, wb->sorted.size - wb->flushing.keys.nr);
	dst->seq = seq;

	bch2_journal_pin_add(&c->journal, seq, &dst->wb->pin,
			     bch2_btree_write_buffer_journal_flush);

	darray_for_each(wb->accounting, i)
		memset(&i->k.v, 0, bkey_val_bytes(&i->k.k));
}

int bch2_journal_keys_to_write_buffer_end(struct bch_fs *c, struct journal_keys_to_wb *dst)
{
	struct btree_write_buffer *wb = &c->btree_write_buffer;
	unsigned live_accounting_keys = 0;
	int ret = 0;

	darray_for_each(wb->accounting, i)
		if (!bch2_accounting_key_is_zero(bkey_i_to_s_c_accounting(&i->k))) {
			i->journal_seq = dst->seq;
			live_accounting_keys++;
			ret = __bch2_journal_key_to_wb(c, dst, i->btree, &i->k);
			if (ret)
				break;
		}

	if (live_accounting_keys * 2 < wb->accounting.nr) {
		struct btree_write_buffered_key *dst = wb->accounting.data;

		darray_for_each(wb->accounting, src)
			if (!bch2_accounting_key_is_zero(bkey_i_to_s_c_accounting(&src->k)))
				*dst++ = *src;
		wb->accounting.nr = dst - wb->accounting.data;
		wb_accounting_sort(wb);
	}

	if (!dst->wb->keys.nr)
		bch2_journal_pin_drop(&c->journal, &dst->wb->pin);

	if (bch2_btree_write_buffer_should_flush(c) &&
	    __enumerated_ref_tryget(&c->writes, BCH_WRITE_REF_btree_write_buffer) &&
	    !queue_work(system_dfl_wq, &c->btree_write_buffer.flush_work))
		enumerated_ref_put(&c->writes, BCH_WRITE_REF_btree_write_buffer);

	if (dst->wb == &wb->flushing)
		mutex_unlock(&wb->flushing.lock);
	mutex_unlock(&wb->inc.lock);

	return ret;
}
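
/*
 * A minimal usage sketch of the API above (illustrative only, not a caller
 * from the original file): grab the buffer for a journal sequence number,
 * feed it a key, then release it.  'btree' must be a btree type that uses
 * the write buffer, otherwise bch2_journal_key_to_wb() errors out.
 */
static int example_copy_key_to_wb(struct bch_fs *c, u64 seq,
				  enum btree_id btree, struct bkey_i *k)
{
	struct journal_keys_to_wb dst;

	bch2_journal_keys_to_write_buffer_start(c, &dst, seq);
	int ret = bch2_journal_key_to_wb(c, &dst, btree, k);
	/* _end() must always run, to unlock and drop the journal pin: */
	return bch2_journal_keys_to_write_buffer_end(c, &dst) ?: ret;
}
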
static int wb_keys_resize(struct btree_write_buffer_keys *wb, size_t new_size)
{
	if (wb->keys.size >= new_size)
		return 0;

	if (!mutex_trylock(&wb->lock))
		return -EINTR;

	int ret = darray_resize(&wb->keys, new_size);
	mutex_unlock(&wb->lock);
	return ret;
}

int bch2_btree_write_buffer_resize(struct bch_fs *c, size_t new_size)
{
	struct btree_write_buffer *wb = &c->btree_write_buffer;

	return wb_keys_resize(&wb->flushing, new_size) ?:
		wb_keys_resize(&wb->inc, new_size);
}

void bch2_fs_btree_write_buffer_exit(struct bch_fs *c)
{
	struct btree_write_buffer *wb = &c->btree_write_buffer;

	BUG_ON((wb->inc.keys.nr || wb->flushing.keys.nr) &&
	       !bch2_journal_error(&c->journal));

	darray_exit(&wb->accounting);
	darray_exit(&wb->sorted);
	darray_exit(&wb->flushing.keys);
	darray_exit(&wb->inc.keys);
}

void bch2_fs_btree_write_buffer_init_early(struct bch_fs *c)
{
	struct btree_write_buffer *wb = &c->btree_write_buffer;

	mutex_init(&wb->inc.lock);
	mutex_init(&wb->flushing.lock);
	INIT_WORK(&wb->flush_work, bch2_btree_write_buffer_flush_work);
}

int bch2_fs_btree_write_buffer_init(struct bch_fs *c)
{
	struct btree_write_buffer *wb = &c->btree_write_buffer;

	/* Will be resized by journal as needed: */
	unsigned initial_size = 1 << 16;

	return  darray_make_room(&wb->inc.keys, initial_size) ?:
		darray_make_room(&wb->flushing.keys, initial_size) ?:
		darray_make_room(&wb->sorted, initial_size);
}
@@ -1,113 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_WRITE_BUFFER_H
#define _BCACHEFS_BTREE_WRITE_BUFFER_H

#include "bkey.h"
#include "disk_accounting.h"

static inline bool bch2_btree_write_buffer_should_flush(struct bch_fs *c)
{
	struct btree_write_buffer *wb = &c->btree_write_buffer;

	return wb->inc.keys.nr + wb->flushing.keys.nr > wb->inc.keys.size / 4;
}

static inline bool bch2_btree_write_buffer_must_wait(struct bch_fs *c)
{
	struct btree_write_buffer *wb = &c->btree_write_buffer;

	return wb->inc.keys.nr > wb->inc.keys.size * 3 / 4;
}

struct btree_trans;
int bch2_btree_write_buffer_flush_sync(struct btree_trans *);
bool bch2_btree_write_buffer_flush_going_ro(struct bch_fs *);
int bch2_btree_write_buffer_flush_nocheck_rw(struct btree_trans *);
int bch2_btree_write_buffer_tryflush(struct btree_trans *);

struct bkey_buf;
int bch2_btree_write_buffer_maybe_flush(struct btree_trans *, struct bkey_s_c, struct bkey_buf *);

struct journal_keys_to_wb {
	struct btree_write_buffer_keys	*wb;
	size_t				room;
	u64				seq;
};

static inline int wb_key_cmp(const void *_l, const void *_r)
{
	const struct btree_write_buffered_key *l = _l;
	const struct btree_write_buffered_key *r = _r;

	return cmp_int(l->btree, r->btree) ?: bpos_cmp(l->k.k.p, r->k.k.p);
}

int bch2_accounting_key_to_wb_slowpath(struct bch_fs *,
				       enum btree_id, struct bkey_i_accounting *);

static inline int bch2_accounting_key_to_wb(struct bch_fs *c,
					    enum btree_id btree, struct bkey_i_accounting *k)
{
	struct btree_write_buffer *wb = &c->btree_write_buffer;
	struct btree_write_buffered_key search;
	search.btree = btree;
	search.k.k.p = k->k.p;

	unsigned idx = eytzinger0_find(wb->accounting.data, wb->accounting.nr,
				       sizeof(wb->accounting.data[0]),
				       wb_key_cmp, &search);

	if (idx >= wb->accounting.nr)
		return bch2_accounting_key_to_wb_slowpath(c, btree, k);

	struct bkey_i_accounting *dst = bkey_i_to_accounting(&wb->accounting.data[idx].k);
	bch2_accounting_accumulate(dst, accounting_i_to_s_c(k));
	return 0;
}

int bch2_journal_key_to_wb_slowpath(struct bch_fs *,
				    struct journal_keys_to_wb *,
				    enum btree_id, struct bkey_i *);

static inline int __bch2_journal_key_to_wb(struct bch_fs *c,
					   struct journal_keys_to_wb *dst,
					   enum btree_id btree, struct bkey_i *k)
{
	if (unlikely(!dst->room))
		return bch2_journal_key_to_wb_slowpath(c, dst, btree, k);

	struct btree_write_buffered_key *wb_k = &darray_top(dst->wb->keys);
	wb_k->journal_seq = dst->seq;
	wb_k->btree = btree;
	bkey_copy(&wb_k->k, k);
	dst->wb->keys.nr++;
	dst->room--;
	return 0;
}

static inline int bch2_journal_key_to_wb(struct bch_fs *c,
					 struct journal_keys_to_wb *dst,
					 enum btree_id btree, struct bkey_i *k)
{
	if (unlikely(!btree_type_uses_write_buffer(btree))) {
		int ret = bch2_btree_write_buffer_insert_err(c, btree, k);
		dump_stack();
		return ret;
	}

	EBUG_ON(!dst->seq);

	return k->k.type == KEY_TYPE_accounting
		? bch2_accounting_key_to_wb(c, btree, bkey_i_to_accounting(k))
		: __bch2_journal_key_to_wb(c, dst, btree, k);
}

void bch2_journal_keys_to_write_buffer_start(struct bch_fs *, struct journal_keys_to_wb *, u64);
int bch2_journal_keys_to_write_buffer_end(struct bch_fs *, struct journal_keys_to_wb *);

int bch2_btree_write_buffer_resize(struct bch_fs *, size_t);
void bch2_fs_btree_write_buffer_exit(struct bch_fs *);
void bch2_fs_btree_write_buffer_init_early(struct bch_fs *);
int bch2_fs_btree_write_buffer_init(struct bch_fs *);

#endif /* _BCACHEFS_BTREE_WRITE_BUFFER_H */
@@ -1,59 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H
#define _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H

#include "darray.h"
#include "journal_types.h"

#define BTREE_WRITE_BUFERED_VAL_U64s_MAX	4
#define BTREE_WRITE_BUFERED_U64s_MAX		(BKEY_U64s + BTREE_WRITE_BUFERED_VAL_U64s_MAX)

struct wb_key_ref {
	union {
	struct {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
		unsigned	idx:24;
		u8		pos[sizeof(struct bpos)];
		enum btree_id	btree:8;
#else
		enum btree_id	btree:8;
		u8		pos[sizeof(struct bpos)];
		unsigned	idx:24;
#endif
	} __packed;
	struct {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
		u64	lo;
		u64	mi;
		u64	hi;
#else
		u64	hi;
		u64	mi;
		u64	lo;
#endif
	};
	};
};

struct btree_write_buffered_key {
	enum btree_id		btree:8;
	u64			journal_seq:56;
	__BKEY_PADDED(k, BTREE_WRITE_BUFERED_VAL_U64s_MAX);
};

struct btree_write_buffer_keys {
	DARRAY(struct btree_write_buffered_key) keys;
	struct journal_entry_pin	pin;
	struct mutex			lock;
};

struct btree_write_buffer {
	DARRAY(struct wb_key_ref)	sorted;
	struct btree_write_buffer_keys	inc;
	struct btree_write_buffer_keys	flushing;
	struct work_struct		flush_work;

	DARRAY(struct btree_write_buffered_key) accounting;
};

#endif /* _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H */
File diff suppressed because it is too large
@@ -1,369 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Code for manipulating bucket marks for garbage collection.
 *
 * Copyright 2014 Datera, Inc.
 */

#ifndef _BUCKETS_H
#define _BUCKETS_H

#include "buckets_types.h"
#include "extents.h"
#include "sb-members.h"

static inline u64 sector_to_bucket(const struct bch_dev *ca, sector_t s)
{
	return div_u64(s, ca->mi.bucket_size);
}

static inline sector_t bucket_to_sector(const struct bch_dev *ca, size_t b)
{
	return ((sector_t) b) * ca->mi.bucket_size;
}

static inline sector_t bucket_remainder(const struct bch_dev *ca, sector_t s)
{
	u32 remainder;

	div_u64_rem(s, ca->mi.bucket_size, &remainder);
	return remainder;
}

static inline u64 sector_to_bucket_and_offset(const struct bch_dev *ca, sector_t s, u32 *offset)
{
	return div_u64_rem(s, ca->mi.bucket_size, offset);
}

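/*
 * Quick worked example for the helpers above (the bucket size here is
 * hypothetical): with ca->mi.bucket_size = 2048 sectors, sector 5000 maps to
 * sector_to_bucket() = 5000 / 2048 = 2, with bucket_remainder() = 904, and
 * bucket_to_sector(ca, 2) = 4096 gives back the bucket's first sector.
 */
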
#define for_each_bucket(_b, _buckets) \
|
||||
for (_b = (_buckets)->b + (_buckets)->first_bucket; \
|
||||
_b < (_buckets)->b + (_buckets)->nbuckets; _b++)
|
||||
|
||||
static inline void bucket_unlock(struct bucket *b)
|
||||
{
|
||||
BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
|
||||
|
||||
clear_bit_unlock(BUCKET_LOCK_BITNR, (void *) &b->lock);
|
||||
smp_mb__after_atomic();
|
||||
wake_up_bit((void *) &b->lock, BUCKET_LOCK_BITNR);
|
||||
}
|
||||
|
||||
static inline void bucket_lock(struct bucket *b)
|
||||
{
|
||||
wait_on_bit_lock((void *) &b->lock, BUCKET_LOCK_BITNR,
|
||||
TASK_UNINTERRUPTIBLE);
|
||||
}
|
||||
|
||||
static inline struct bucket *gc_bucket(struct bch_dev *ca, size_t b)
|
||||
{
|
||||
return bucket_valid(ca, b)
|
||||
? genradix_ptr(&ca->buckets_gc, b)
|
||||
: NULL;
|
||||
}
|
||||
|
||||
static inline struct bucket_gens *bucket_gens(struct bch_dev *ca)
|
||||
{
|
||||
return rcu_dereference_check(ca->bucket_gens,
|
||||
lockdep_is_held(&ca->fs->state_lock));
|
||||
}
|
||||
|
||||
static inline u8 *bucket_gen(struct bch_dev *ca, size_t b)
|
||||
{
|
||||
struct bucket_gens *gens = bucket_gens(ca);
|
||||
|
||||
if (b - gens->first_bucket >= gens->nbuckets_minus_first)
|
||||
return NULL;
|
||||
return gens->b + b;
|
||||
}
|
||||
|
||||
static inline int bucket_gen_get_rcu(struct bch_dev *ca, size_t b)
|
||||
{
|
||||
u8 *gen = bucket_gen(ca, b);
|
||||
return gen ? *gen : -1;
|
||||
}
|
||||
|
||||
static inline int bucket_gen_get(struct bch_dev *ca, size_t b)
|
||||
{
|
||||
guard(rcu)();
|
||||
return bucket_gen_get_rcu(ca, b);
|
||||
}
|
||||
|
||||
static inline size_t PTR_BUCKET_NR(const struct bch_dev *ca,
|
||||
const struct bch_extent_ptr *ptr)
|
||||
{
|
||||
return sector_to_bucket(ca, ptr->offset);
|
||||
}
|
||||
|
||||
static inline struct bpos PTR_BUCKET_POS(const struct bch_dev *ca,
|
||||
const struct bch_extent_ptr *ptr)
|
||||
{
|
||||
return POS(ptr->dev, PTR_BUCKET_NR(ca, ptr));
|
||||
}
|
||||
|
||||
static inline struct bpos PTR_BUCKET_POS_OFFSET(const struct bch_dev *ca,
|
||||
const struct bch_extent_ptr *ptr,
|
||||
u32 *bucket_offset)
|
||||
{
|
||||
return POS(ptr->dev, sector_to_bucket_and_offset(ca, ptr->offset, bucket_offset));
|
||||
}
|
||||
|
||||
static inline struct bucket *PTR_GC_BUCKET(struct bch_dev *ca,
|
||||
const struct bch_extent_ptr *ptr)
|
||||
{
|
||||
return gc_bucket(ca, PTR_BUCKET_NR(ca, ptr));
|
||||
}
|
||||
|
||||
static inline enum bch_data_type ptr_data_type(const struct bkey *k,
|
||||
const struct bch_extent_ptr *ptr)
|
||||
{
|
||||
if (bkey_is_btree_ptr(k))
|
||||
return BCH_DATA_btree;
|
||||
|
||||
return ptr->cached ? BCH_DATA_cached : BCH_DATA_user;
|
||||
}
|
||||
|
||||
static inline s64 ptr_disk_sectors(s64 sectors, struct extent_ptr_decoded p)
|
||||
{
|
||||
EBUG_ON(sectors < 0);
|
||||
|
||||
return crc_is_compressed(p.crc)
|
||||
? DIV_ROUND_UP_ULL(sectors * p.crc.compressed_size,
|
||||
p.crc.uncompressed_size)
|
||||
: sectors;
|
||||
}
|
||||
|
||||
static inline int gen_cmp(u8 a, u8 b)
|
||||
{
|
||||
return (s8) (a - b);
|
||||
}
|
||||
|
||||
static inline int gen_after(u8 a, u8 b)
|
||||
{
|
||||
return max(0, gen_cmp(a, b));
|
||||
}
|
||||
|
||||
static inline int dev_ptr_stale_rcu(struct bch_dev *ca, const struct bch_extent_ptr *ptr)
|
||||
{
|
||||
int gen = bucket_gen_get_rcu(ca, PTR_BUCKET_NR(ca, ptr));
|
||||
return gen < 0 ? gen : gen_after(gen, ptr->gen);
|
||||
}
|
||||
|
||||
/**
|
||||
* dev_ptr_stale() - check if a pointer points into a bucket that has been
|
||||
* invalidated.
|
||||
*/
|
||||
static inline int dev_ptr_stale(struct bch_dev *ca, const struct bch_extent_ptr *ptr)
|
||||
{
|
||||
guard(rcu)();
|
||||
return dev_ptr_stale_rcu(ca, ptr);
|
||||
}
|
||||
|
||||
/* Device usage: */
|
||||
|
||||
void bch2_dev_usage_read_fast(struct bch_dev *, struct bch_dev_usage *);
|
||||
static inline struct bch_dev_usage bch2_dev_usage_read(struct bch_dev *ca)
|
||||
{
|
||||
struct bch_dev_usage ret;
|
||||
|
||||
bch2_dev_usage_read_fast(ca, &ret);
|
||||
return ret;
|
||||
}
|
||||
|
||||
void bch2_dev_usage_full_read_fast(struct bch_dev *, struct bch_dev_usage_full *);
|
||||
static inline struct bch_dev_usage_full bch2_dev_usage_full_read(struct bch_dev *ca)
|
||||
{
|
||||
struct bch_dev_usage_full ret;
|
||||
|
||||
bch2_dev_usage_full_read_fast(ca, &ret);
|
||||
return ret;
|
||||
}
|
||||
|
||||
void bch2_dev_usage_to_text(struct printbuf *, struct bch_dev *, struct bch_dev_usage_full *);
|
||||
|
||||
static inline u64 bch2_dev_buckets_reserved(struct bch_dev *ca, enum bch_watermark watermark)
|
||||
{
|
||||
s64 reserved = 0;
|
||||
|
||||
switch (watermark) {
|
||||
case BCH_WATERMARK_NR:
|
||||
BUG();
|
||||
case BCH_WATERMARK_stripe:
|
||||
reserved += ca->mi.nbuckets >> 6;
|
||||
fallthrough;
|
||||
case BCH_WATERMARK_normal:
|
||||
reserved += ca->mi.nbuckets >> 6;
|
||||
fallthrough;
|
||||
case BCH_WATERMARK_copygc:
|
||||
reserved += ca->nr_btree_reserve;
|
||||
fallthrough;
|
||||
case BCH_WATERMARK_btree:
|
||||
reserved += ca->nr_btree_reserve;
|
||||
fallthrough;
|
||||
case BCH_WATERMARK_btree_copygc:
|
||||
case BCH_WATERMARK_reclaim:
|
||||
case BCH_WATERMARK_interior_updates:
|
||||
break;
|
||||
}
|
||||
|
||||
return reserved;
|
||||
}
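
/*
 * Worked example for the fallthrough chain above (values hypothetical): with
 * ca->mi.nbuckets = 1 << 20 (so nbuckets >> 6 = 16384) and
 * ca->nr_btree_reserve = 512, the reserve seen by each watermark is
 *
 *	BCH_WATERMARK_stripe:	16384 + 16384 + 512 + 512 = 33792 buckets
 *	BCH_WATERMARK_normal:		16384 + 512 + 512 = 17408
 *	BCH_WATERMARK_copygc:			512 + 512 =  1024
 *	BCH_WATERMARK_btree:			      512 =   512
 *	BCH_WATERMARK_btree_copygc/reclaim/interior_updates:	0
 *
 * Watermarks earlier in the chain reserve more, so ordinary allocations run
 * out of "free" buckets first while reclaim and interior btree updates can
 * still dip into the remaining reserve.
 */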
|
||||
|
||||
static inline u64 dev_buckets_free(struct bch_dev *ca,
|
||||
struct bch_dev_usage usage,
|
||||
enum bch_watermark watermark)
|
||||
{
|
||||
return max_t(s64, 0,
|
||||
usage.buckets[BCH_DATA_free]-
|
||||
ca->nr_open_buckets -
|
||||
bch2_dev_buckets_reserved(ca, watermark));
|
||||
}
|
||||
|
||||
static inline u64 __dev_buckets_available(struct bch_dev *ca,
|
||||
struct bch_dev_usage usage,
|
||||
enum bch_watermark watermark)
|
||||
{
|
||||
return max_t(s64, 0,
|
||||
usage.buckets[BCH_DATA_free]
|
||||
+ usage.buckets[BCH_DATA_cached]
|
||||
+ usage.buckets[BCH_DATA_need_gc_gens]
|
||||
+ usage.buckets[BCH_DATA_need_discard]
|
||||
- ca->nr_open_buckets
|
||||
- bch2_dev_buckets_reserved(ca, watermark));
|
||||
}
|
||||
|
||||
static inline u64 dev_buckets_available(struct bch_dev *ca,
|
||||
enum bch_watermark watermark)
|
||||
{
|
||||
return __dev_buckets_available(ca, bch2_dev_usage_read(ca), watermark);
|
||||
}
|
||||
|
||||
/* Filesystem usage: */
|
||||
|
||||
struct bch_fs_usage_short
|
||||
bch2_fs_usage_read_short(struct bch_fs *);
|
||||
|
||||
int bch2_bucket_ref_update(struct btree_trans *, struct bch_dev *,
|
||||
struct bkey_s_c, const struct bch_extent_ptr *,
|
||||
s64, enum bch_data_type, u8, u8, u32 *);
|
||||
|
||||
int bch2_check_fix_ptrs(struct btree_trans *,
|
||||
enum btree_id, unsigned, struct bkey_s_c,
|
||||
enum btree_iter_update_trigger_flags);
|
||||
|
||||
int bch2_trigger_extent(struct btree_trans *, enum btree_id, unsigned,
|
||||
struct bkey_s_c, struct bkey_s,
|
||||
enum btree_iter_update_trigger_flags);
|
||||
int bch2_trigger_reservation(struct btree_trans *, enum btree_id, unsigned,
|
||||
struct bkey_s_c, struct bkey_s,
|
||||
enum btree_iter_update_trigger_flags);
|
||||
|
||||
#define trigger_run_overwrite_then_insert(_fn, _trans, _btree_id, _level, _old, _new, _flags)\
|
||||
({ \
|
||||
int ret = 0; \
|
||||
\
|
||||
if (_old.k->type) \
|
||||
ret = _fn(_trans, _btree_id, _level, _old, _flags & ~BTREE_TRIGGER_insert); \
|
||||
if (!ret && _new.k->type) \
|
||||
ret = _fn(_trans, _btree_id, _level, _new.s_c, _flags & ~BTREE_TRIGGER_overwrite);\
|
||||
ret; \
|
||||
})
|
||||
|
||||
void bch2_trans_account_disk_usage_change(struct btree_trans *);
|
||||
|
||||
int bch2_trans_mark_metadata_bucket(struct btree_trans *, struct bch_dev *, u64,
|
||||
enum bch_data_type, unsigned,
|
||||
enum btree_iter_update_trigger_flags);
|
||||
int bch2_trans_mark_dev_sb(struct bch_fs *, struct bch_dev *,
|
||||
enum btree_iter_update_trigger_flags);
|
||||
int bch2_trans_mark_dev_sbs_flags(struct bch_fs *,
|
||||
enum btree_iter_update_trigger_flags);
|
||||
int bch2_trans_mark_dev_sbs(struct bch_fs *);
|
||||
|
||||
bool bch2_is_superblock_bucket(struct bch_dev *, u64);
|
||||
|
||||
static inline const char *bch2_data_type_str(enum bch_data_type type)
|
||||
{
|
||||
return type < BCH_DATA_NR
|
||||
? __bch2_data_types[type]
|
||||
: "(invalid data type)";
|
||||
}
|
||||
|
||||
/* disk reservations: */
|
||||
|
||||
static inline void bch2_disk_reservation_put(struct bch_fs *c,
|
||||
struct disk_reservation *res)
|
||||
{
|
||||
if (res->sectors) {
|
||||
this_cpu_sub(*c->online_reserved, res->sectors);
|
||||
res->sectors = 0;
|
||||
}
|
||||
}
|
||||
|
||||
enum bch_reservation_flags {
|
||||
BCH_DISK_RESERVATION_NOFAIL = 1 << 0,
|
||||
BCH_DISK_RESERVATION_PARTIAL = 1 << 1,
|
||||
};
|
||||
|
||||
int __bch2_disk_reservation_add(struct bch_fs *, struct disk_reservation *,
|
||||
u64, enum bch_reservation_flags);
|
||||
|
||||
static inline int bch2_disk_reservation_add(struct bch_fs *c, struct disk_reservation *res,
|
||||
u64 sectors, enum bch_reservation_flags flags)
|
||||
{
|
||||
#ifdef __KERNEL__
|
||||
u64 old, new;
|
||||
|
||||
old = this_cpu_read(c->pcpu->sectors_available);
|
||||
do {
|
||||
if (sectors > old)
|
||||
return __bch2_disk_reservation_add(c, res, sectors, flags);
|
||||
|
||||
new = old - sectors;
|
||||
} while (!this_cpu_try_cmpxchg(c->pcpu->sectors_available, &old, new));
|
||||
|
||||
this_cpu_add(*c->online_reserved, sectors);
|
||||
res->sectors += sectors;
|
||||
return 0;
|
||||
#else
|
||||
return __bch2_disk_reservation_add(c, res, sectors, flags);
|
||||
#endif
|
||||
}
|
||||
|
||||
static inline struct disk_reservation
|
||||
bch2_disk_reservation_init(struct bch_fs *c, unsigned nr_replicas)
|
||||
{
|
||||
return (struct disk_reservation) {
|
||||
.sectors = 0,
|
||||
#if 0
|
||||
/* not used yet: */
|
||||
.gen = c->capacity_gen,
|
||||
#endif
|
||||
.nr_replicas = nr_replicas,
|
||||
};
|
||||
}
|
||||
|
||||
static inline int bch2_disk_reservation_get(struct bch_fs *c,
|
||||
struct disk_reservation *res,
|
||||
u64 sectors, unsigned nr_replicas,
|
||||
int flags)
|
||||
{
|
||||
*res = bch2_disk_reservation_init(c, nr_replicas);
|
||||
|
||||
return bch2_disk_reservation_add(c, res, sectors * nr_replicas, flags);
|
||||
}
|
||||
|
||||
#define RESERVE_FACTOR 6
|
||||
|
||||
static inline u64 avail_factor(u64 r)
|
||||
{
|
||||
return div_u64(r << RESERVE_FACTOR, (1 << RESERVE_FACTOR) + 1);
|
||||
}
|
||||
|
||||
void bch2_buckets_nouse_free(struct bch_fs *);
|
||||
int bch2_buckets_nouse_alloc(struct bch_fs *);
|
||||
|
||||
int bch2_dev_buckets_resize(struct bch_fs *, struct bch_dev *, u64);
|
||||
void bch2_dev_buckets_free(struct bch_dev *);
|
||||
int bch2_dev_buckets_alloc(struct bch_fs *, struct bch_dev *);
|
||||
|
||||
#endif /* _BUCKETS_H */
|
||||
@@ -1,100 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BUCKETS_TYPES_H
#define _BUCKETS_TYPES_H

#include "bcachefs_format.h"
#include "util.h"

#define BUCKET_JOURNAL_SEQ_BITS		16

/*
 * Ugly hack alert:
 *
 * We need to cram a spinlock in a single byte, because that's what we have left
 * in struct bucket, and we care about the size of these - during fsck, we need
 * in memory state for every single bucket on every device.
 *
 * We used to do
 *	while (xchg(&b->lock, 1))
 *		cpu_relax();
 * but, it turns out not all architectures support xchg on a single byte.
 *
 * So now we use bit_spin_lock(), with fun games since we can't burn a whole
 * ulong for this - we just need to make sure the lock bit always ends up in the
 * first byte.
 */

#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define BUCKET_LOCK_BITNR	0
#else
#define BUCKET_LOCK_BITNR	(BITS_PER_LONG - 1)
#endif

union ulong_byte_assert {
	ulong	ulong;
	u8	byte;
};

struct bucket {
	u8			lock;
	u8			gen_valid:1;
	u8			data_type:7;
	u8			gen;
	u8			stripe_redundancy;
	u32			stripe;
	u32			dirty_sectors;
	u32			cached_sectors;
	u32			stripe_sectors;
} __aligned(sizeof(long));
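
/*
 * A brief illustration of the one-byte lock above: bucket_lock() and
 * bucket_unlock() (defined in buckets.h) spin on bit BUCKET_LOCK_BITNR of the
 * containing word, and the #if above picks that bit so it always lands in the
 * first byte - i.e. in 'lock' - on both endiannesses.  The helper below is a
 * hypothetical usage sketch only, and assumes the buckets.h helpers:
 */
static inline void example_bucket_add_dirty_sectors(struct bucket *b, u32 sectors)
{
	bucket_lock(b);			/* spins on bit BUCKET_LOCK_BITNR */
	b->dirty_sectors += sectors;	/* counters are protected by the byte lock */
	bucket_unlock(b);		/* clears the bit and wakes waiters */
}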
|
||||

struct bucket_gens {
	struct rcu_head		rcu;
	u16			first_bucket;
	size_t			nbuckets;
	size_t			nbuckets_minus_first;
	u8			b[] __counted_by(nbuckets);
};

/* Only info on bucket counts: */
struct bch_dev_usage {
	u64			buckets[BCH_DATA_NR];
};

struct bch_dev_usage_full {
	struct bch_dev_usage_type {
		u64		buckets;
		u64		sectors; /* _compressed_ sectors: */
		/*
		 * XXX
		 * Why do we have this? Isn't it just buckets * bucket_size -
		 * sectors?
		 */
		u64		fragmented;
	} d[BCH_DATA_NR];
};

struct bch_fs_usage_base {
	u64			hidden;
	u64			btree;
	u64			data;
	u64			cached;
	u64			reserved;
	u64			nr_inodes;
};

struct bch_fs_usage_short {
	u64			capacity;
	u64			used;
	u64			free;
	u64			nr_inodes;
};

/*
 * A reservation for space on disk:
 */
struct disk_reservation {
	u64			sectors;
	u32			gen;
	unsigned		nr_replicas;
};

#endif /* _BUCKETS_TYPES_H */
|
||||
@@ -1,174 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "buckets_waiting_for_journal.h"
|
||||
#include <linux/hash.h>
|
||||
#include <linux/random.h>
|
||||
|
||||
static inline struct bucket_hashed *
|
||||
bucket_hash(struct buckets_waiting_for_journal_table *t,
|
||||
unsigned hash_seed_idx, u64 dev_bucket)
|
||||
{
|
||||
return t->d + hash_64(dev_bucket ^ t->hash_seeds[hash_seed_idx], t->bits);
|
||||
}
|
||||
|
||||
static void bucket_table_init(struct buckets_waiting_for_journal_table *t, size_t bits)
|
||||
{
|
||||
unsigned i;
|
||||
|
||||
t->bits = bits;
|
||||
for (i = 0; i < ARRAY_SIZE(t->hash_seeds); i++)
|
||||
get_random_bytes(&t->hash_seeds[i], sizeof(t->hash_seeds[i]));
|
||||
memset(t->d, 0, sizeof(t->d[0]) << t->bits);
|
||||
}
|
||||
|
||||
u64 bch2_bucket_journal_seq_ready(struct buckets_waiting_for_journal *b,
|
||||
unsigned dev, u64 bucket)
|
||||
{
|
||||
struct buckets_waiting_for_journal_table *t;
|
||||
u64 dev_bucket = (u64) dev << 56 | bucket;
|
||||
u64 ret = 0;
|
||||
|
||||
mutex_lock(&b->lock);
|
||||
t = b->t;
|
||||
|
||||
for (unsigned i = 0; i < ARRAY_SIZE(t->hash_seeds); i++) {
|
||||
struct bucket_hashed *h = bucket_hash(t, i, dev_bucket);
|
||||
|
||||
if (h->dev_bucket == dev_bucket) {
|
||||
ret = h->journal_seq;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
mutex_unlock(&b->lock);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static bool bucket_table_insert(struct buckets_waiting_for_journal_table *t,
|
||||
struct bucket_hashed *new,
|
||||
u64 flushed_seq)
|
||||
{
|
||||
struct bucket_hashed *last_evicted = NULL;
|
||||
unsigned tries, i;
|
||||
|
||||
for (tries = 0; tries < 10; tries++) {
|
||||
struct bucket_hashed *old, *victim = NULL;
|
||||
|
||||
for (i = 0; i < ARRAY_SIZE(t->hash_seeds); i++) {
|
||||
old = bucket_hash(t, i, new->dev_bucket);
|
||||
|
||||
if (old->dev_bucket == new->dev_bucket ||
|
||||
old->journal_seq <= flushed_seq) {
|
||||
*old = *new;
|
||||
return true;
|
||||
}
|
||||
|
||||
if (last_evicted != old)
|
||||
victim = old;
|
||||
}
|
||||
|
||||
/* hashed to same slot 3 times: */
|
||||
if (!victim)
|
||||
break;
|
||||
|
||||
/* Failed to find an empty slot: */
|
||||
swap(*new, *victim);
|
||||
last_evicted = victim;
|
||||
}
|
||||
|
||||
return false;
|
||||
}
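
/*
 * bucket_table_insert() above is essentially a small cuckoo-style hash
 * insert: each entry has ARRAY_SIZE(t->hash_seeds) (three) candidate slots,
 * stale entries (journal_seq already flushed) are overwritten in place, and
 * otherwise an occupant is evicted and reinserted, for up to 10 attempts.
 * On failure the caller allocates a larger table and rehashes the live
 * entries.
 */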
|
||||
|
||||
int bch2_set_bucket_needs_journal_commit(struct buckets_waiting_for_journal *b,
|
||||
u64 flushed_seq,
|
||||
unsigned dev, u64 bucket,
|
||||
u64 journal_seq)
|
||||
{
|
||||
struct buckets_waiting_for_journal_table *t, *n;
|
||||
struct bucket_hashed tmp, new = {
|
||||
.dev_bucket = (u64) dev << 56 | bucket,
|
||||
.journal_seq = journal_seq,
|
||||
};
|
||||
size_t i, size, new_bits, nr_elements = 1, nr_rehashes = 0, nr_rehashes_this_size = 0;
|
||||
int ret = 0;
|
||||
|
||||
mutex_lock(&b->lock);
|
||||
|
||||
if (likely(bucket_table_insert(b->t, &new, flushed_seq)))
|
||||
goto out;
|
||||
|
||||
t = b->t;
|
||||
size = 1UL << t->bits;
|
||||
for (i = 0; i < size; i++)
|
||||
nr_elements += t->d[i].journal_seq > flushed_seq;
|
||||
|
||||
new_bits = ilog2(roundup_pow_of_two(nr_elements * 3));
|
||||
realloc:
|
||||
n = kvmalloc(sizeof(*n) + (sizeof(n->d[0]) << new_bits), GFP_KERNEL);
|
||||
if (!n) {
|
||||
struct bch_fs *c = container_of(b, struct bch_fs, buckets_waiting_for_journal);
|
||||
ret = bch_err_throw(c, ENOMEM_buckets_waiting_for_journal_set);
|
||||
goto out;
|
||||
}
|
||||
|
||||
retry_rehash:
|
||||
if (nr_rehashes_this_size == 3) {
|
||||
new_bits++;
|
||||
nr_rehashes_this_size = 0;
|
||||
kvfree(n);
|
||||
goto realloc;
|
||||
}
|
||||
|
||||
nr_rehashes++;
|
||||
nr_rehashes_this_size++;
|
||||
|
||||
bucket_table_init(n, new_bits);
|
||||
|
||||
tmp = new;
|
||||
BUG_ON(!bucket_table_insert(n, &tmp, flushed_seq));
|
||||
|
||||
for (i = 0; i < 1UL << t->bits; i++) {
|
||||
if (t->d[i].journal_seq <= flushed_seq)
|
||||
continue;
|
||||
|
||||
tmp = t->d[i];
|
||||
if (!bucket_table_insert(n, &tmp, flushed_seq))
|
||||
goto retry_rehash;
|
||||
}
|
||||
|
||||
b->t = n;
|
||||
kvfree(t);
|
||||
|
||||
pr_debug("took %zu rehashes, table at %zu/%lu elements",
|
||||
nr_rehashes, nr_elements, 1UL << b->t->bits);
|
||||
out:
|
||||
mutex_unlock(&b->lock);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
void bch2_fs_buckets_waiting_for_journal_exit(struct bch_fs *c)
|
||||
{
|
||||
struct buckets_waiting_for_journal *b = &c->buckets_waiting_for_journal;
|
||||
|
||||
kvfree(b->t);
|
||||
}
|
||||
|
||||
#define INITIAL_TABLE_BITS 3
|
||||
|
||||
int bch2_fs_buckets_waiting_for_journal_init(struct bch_fs *c)
|
||||
{
|
||||
struct buckets_waiting_for_journal *b = &c->buckets_waiting_for_journal;
|
||||
|
||||
mutex_init(&b->lock);
|
||||
|
||||
b->t = kvmalloc(sizeof(*b->t) +
|
||||
(sizeof(b->t->d[0]) << INITIAL_TABLE_BITS), GFP_KERNEL);
|
||||
if (!b->t)
|
||||
return -BCH_ERR_ENOMEM_buckets_waiting_for_journal_init;
|
||||
|
||||
bucket_table_init(b->t, INITIAL_TABLE_BITS);
|
||||
return 0;
|
||||
}
|
||||
@@ -1,15 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BUCKETS_WAITING_FOR_JOURNAL_H
#define _BUCKETS_WAITING_FOR_JOURNAL_H

#include "buckets_waiting_for_journal_types.h"

u64 bch2_bucket_journal_seq_ready(struct buckets_waiting_for_journal *,
				  unsigned, u64);
int bch2_set_bucket_needs_journal_commit(struct buckets_waiting_for_journal *,
					 u64, unsigned, u64, u64);

void bch2_fs_buckets_waiting_for_journal_exit(struct bch_fs *);
int bch2_fs_buckets_waiting_for_journal_init(struct bch_fs *);

#endif /* _BUCKETS_WAITING_FOR_JOURNAL_H */
@@ -1,23 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BUCKETS_WAITING_FOR_JOURNAL_TYPES_H
#define _BUCKETS_WAITING_FOR_JOURNAL_TYPES_H

#include <linux/siphash.h>

struct bucket_hashed {
	u64			dev_bucket;
	u64			journal_seq;
};

struct buckets_waiting_for_journal_table {
	unsigned		bits;
	u64			hash_seeds[3];
	struct bucket_hashed	d[];
};

struct buckets_waiting_for_journal {
	struct mutex		lock;
	struct buckets_waiting_for_journal_table *t;
};

#endif /* _BUCKETS_WAITING_FOR_JOURNAL_TYPES_H */
@@ -1,843 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
#ifndef NO_BCACHEFS_CHARDEV
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "bcachefs_ioctl.h"
|
||||
#include "buckets.h"
|
||||
#include "chardev.h"
|
||||
#include "disk_accounting.h"
|
||||
#include "fsck.h"
|
||||
#include "journal.h"
|
||||
#include "move.h"
|
||||
#include "recovery_passes.h"
|
||||
#include "replicas.h"
|
||||
#include "sb-counters.h"
|
||||
#include "super-io.h"
|
||||
#include "thread_with_file.h"
|
||||
|
||||
#include <linux/cdev.h>
|
||||
#include <linux/device.h>
|
||||
#include <linux/fs.h>
|
||||
#include <linux/ioctl.h>
|
||||
#include <linux/major.h>
|
||||
#include <linux/sched/task.h>
|
||||
#include <linux/slab.h>
|
||||
#include <linux/uaccess.h>
|
||||
|
||||
/* returns with ref on ca->ref */
|
||||
static struct bch_dev *bch2_device_lookup(struct bch_fs *c, u64 dev,
|
||||
unsigned flags)
|
||||
{
|
||||
struct bch_dev *ca;
|
||||
|
||||
if (flags & BCH_BY_INDEX) {
|
||||
if (dev >= c->sb.nr_devices)
|
||||
return ERR_PTR(-EINVAL);
|
||||
|
||||
ca = bch2_dev_tryget_noerror(c, dev);
|
||||
if (!ca)
|
||||
return ERR_PTR(-EINVAL);
|
||||
} else {
|
||||
char *path;
|
||||
|
||||
path = strndup_user((const char __user *)
|
||||
(unsigned long) dev, PATH_MAX);
|
||||
if (IS_ERR(path))
|
||||
return ERR_CAST(path);
|
||||
|
||||
ca = bch2_dev_lookup(c, path);
|
||||
kfree(path);
|
||||
}
|
||||
|
||||
return ca;
|
||||
}
|
||||
|
||||
#if 0
|
||||
static long bch2_ioctl_assemble(struct bch_ioctl_assemble __user *user_arg)
|
||||
{
|
||||
struct bch_ioctl_assemble arg;
|
||||
struct bch_fs *c;
|
||||
u64 *user_devs = NULL;
|
||||
char **devs = NULL;
|
||||
unsigned i;
|
||||
int ret = -EFAULT;
|
||||
|
||||
if (copy_from_user(&arg, user_arg, sizeof(arg)))
|
||||
return -EFAULT;
|
||||
|
||||
if (arg.flags || arg.pad)
|
||||
return -EINVAL;
|
||||
|
||||
user_devs = kmalloc_array(arg.nr_devs, sizeof(u64), GFP_KERNEL);
|
||||
if (!user_devs)
|
||||
return -ENOMEM;
|
||||
|
||||
devs = kcalloc(arg.nr_devs, sizeof(char *), GFP_KERNEL);
|
||||
|
||||
if (copy_from_user(user_devs, user_arg->devs,
|
||||
sizeof(u64) * arg.nr_devs))
|
||||
goto err;
|
||||
|
||||
for (i = 0; i < arg.nr_devs; i++) {
|
||||
devs[i] = strndup_user((const char __user *)(unsigned long)
|
||||
user_devs[i],
|
||||
PATH_MAX);
|
||||
ret = PTR_ERR_OR_ZERO(devs[i]);
|
||||
if (ret)
|
||||
goto err;
|
||||
}
|
||||
|
||||
c = bch2_fs_open(devs, arg.nr_devs, bch2_opts_empty());
|
||||
ret = PTR_ERR_OR_ZERO(c);
|
||||
if (!ret)
|
||||
closure_put(&c->cl);
|
||||
err:
|
||||
if (devs)
|
||||
for (i = 0; i < arg.nr_devs; i++)
|
||||
kfree(devs[i]);
|
||||
kfree(devs);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static long bch2_ioctl_incremental(struct bch_ioctl_incremental __user *user_arg)
|
||||
{
|
||||
struct bch_ioctl_incremental arg;
|
||||
const char *err;
|
||||
char *path;
int ret;
|
||||
|
||||
if (copy_from_user(&arg, user_arg, sizeof(arg)))
|
||||
return -EFAULT;
|
||||
|
||||
if (arg.flags || arg.pad)
|
||||
return -EINVAL;
|
||||
|
||||
path = strndup_user((const char __user *)(unsigned long) arg.dev, PATH_MAX);
|
||||
ret = PTR_ERR_OR_ZERO(path);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
err = bch2_fs_open_incremental(path);
|
||||
kfree(path);
|
||||
|
||||
if (err) {
|
||||
pr_err("Could not register bcachefs devices: %s", err);
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
#endif
|
||||
|
||||
static long bch2_global_ioctl(unsigned cmd, void __user *arg)
|
||||
{
|
||||
long ret;
|
||||
|
||||
switch (cmd) {
|
||||
#if 0
|
||||
case BCH_IOCTL_ASSEMBLE:
|
||||
return bch2_ioctl_assemble(arg);
|
||||
case BCH_IOCTL_INCREMENTAL:
|
||||
return bch2_ioctl_incremental(arg);
|
||||
#endif
|
||||
case BCH_IOCTL_FSCK_OFFLINE: {
|
||||
ret = bch2_ioctl_fsck_offline(arg);
|
||||
break;
|
||||
}
|
||||
default:
|
||||
ret = -ENOTTY;
|
||||
break;
|
||||
}
|
||||
|
||||
if (ret < 0)
|
||||
ret = bch2_err_class(ret);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static long bch2_ioctl_query_uuid(struct bch_fs *c,
|
||||
struct bch_ioctl_query_uuid __user *user_arg)
|
||||
{
|
||||
return copy_to_user_errcode(&user_arg->uuid, &c->sb.user_uuid,
|
||||
sizeof(c->sb.user_uuid));
|
||||
}
|
||||
|
||||
#if 0
|
||||
static long bch2_ioctl_start(struct bch_fs *c, struct bch_ioctl_start arg)
|
||||
{
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
if (arg.flags || arg.pad)
|
||||
return -EINVAL;
|
||||
|
||||
return bch2_fs_start(c);
|
||||
}
|
||||
|
||||
static long bch2_ioctl_stop(struct bch_fs *c)
|
||||
{
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
bch2_fs_stop(c);
|
||||
return 0;
|
||||
}
|
||||
#endif
|
||||
|
||||
static long bch2_ioctl_disk_add(struct bch_fs *c, struct bch_ioctl_disk arg)
|
||||
{
|
||||
char *path;
|
||||
int ret;
|
||||
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
if (arg.flags || arg.pad)
|
||||
return -EINVAL;
|
||||
|
||||
path = strndup_user((const char __user *)(unsigned long) arg.dev, PATH_MAX);
|
||||
ret = PTR_ERR_OR_ZERO(path);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
ret = bch2_dev_add(c, path);
|
||||
if (!IS_ERR(path))
|
||||
kfree(path);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static long bch2_ioctl_disk_remove(struct bch_fs *c, struct bch_ioctl_disk arg)
|
||||
{
|
||||
struct bch_dev *ca;
|
||||
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
if ((arg.flags & ~(BCH_FORCE_IF_DATA_LOST|
|
||||
BCH_FORCE_IF_METADATA_LOST|
|
||||
BCH_FORCE_IF_DEGRADED|
|
||||
BCH_BY_INDEX)) ||
|
||||
arg.pad)
|
||||
return -EINVAL;
|
||||
|
||||
ca = bch2_device_lookup(c, arg.dev, arg.flags);
|
||||
if (IS_ERR(ca))
|
||||
return PTR_ERR(ca);
|
||||
|
||||
return bch2_dev_remove(c, ca, arg.flags);
|
||||
}
|
||||
|
||||
static long bch2_ioctl_disk_online(struct bch_fs *c, struct bch_ioctl_disk arg)
|
||||
{
|
||||
char *path;
|
||||
int ret;
|
||||
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
if (arg.flags || arg.pad)
|
||||
return -EINVAL;
|
||||
|
||||
path = strndup_user((const char __user *)(unsigned long) arg.dev, PATH_MAX);
|
||||
ret = PTR_ERR_OR_ZERO(path);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
||||
ret = bch2_dev_online(c, path);
|
||||
kfree(path);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static long bch2_ioctl_disk_offline(struct bch_fs *c, struct bch_ioctl_disk arg)
|
||||
{
|
||||
struct bch_dev *ca;
|
||||
int ret;
|
||||
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
if ((arg.flags & ~(BCH_FORCE_IF_DATA_LOST|
|
||||
BCH_FORCE_IF_METADATA_LOST|
|
||||
BCH_FORCE_IF_DEGRADED|
|
||||
BCH_BY_INDEX)) ||
|
||||
arg.pad)
|
||||
return -EINVAL;
|
||||
|
||||
ca = bch2_device_lookup(c, arg.dev, arg.flags);
|
||||
if (IS_ERR(ca))
|
||||
return PTR_ERR(ca);
|
||||
|
||||
ret = bch2_dev_offline(c, ca, arg.flags);
|
||||
bch2_dev_put(ca);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static long bch2_ioctl_disk_set_state(struct bch_fs *c,
|
||||
struct bch_ioctl_disk_set_state arg)
|
||||
{
|
||||
struct bch_dev *ca;
|
||||
int ret;
|
||||
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
if ((arg.flags & ~(BCH_FORCE_IF_DATA_LOST|
|
||||
BCH_FORCE_IF_METADATA_LOST|
|
||||
BCH_FORCE_IF_DEGRADED|
|
||||
BCH_BY_INDEX)) ||
|
||||
arg.pad[0] || arg.pad[1] || arg.pad[2] ||
|
||||
arg.new_state >= BCH_MEMBER_STATE_NR)
|
||||
return -EINVAL;
|
||||
|
||||
ca = bch2_device_lookup(c, arg.dev, arg.flags);
|
||||
if (IS_ERR(ca))
|
||||
return PTR_ERR(ca);
|
||||
|
||||
ret = bch2_dev_set_state(c, ca, arg.new_state, arg.flags);
|
||||
if (ret)
|
||||
bch_err(c, "Error setting device state: %s", bch2_err_str(ret));
|
||||
|
||||
bch2_dev_put(ca);
|
||||
return ret;
|
||||
}
|
||||
|
||||
struct bch_data_ctx {
|
||||
struct thread_with_file thr;
|
||||
|
||||
struct bch_fs *c;
|
||||
struct bch_ioctl_data arg;
|
||||
struct bch_move_stats stats;
|
||||
};
|
||||
|
||||
static int bch2_data_thread(void *arg)
|
||||
{
|
||||
struct bch_data_ctx *ctx = container_of(arg, struct bch_data_ctx, thr);
|
||||
|
||||
ctx->thr.ret = bch2_data_job(ctx->c, &ctx->stats, ctx->arg);
|
||||
if (ctx->thr.ret == -BCH_ERR_device_offline)
|
||||
ctx->stats.ret = BCH_IOCTL_DATA_EVENT_RET_device_offline;
|
||||
else {
|
||||
ctx->stats.ret = BCH_IOCTL_DATA_EVENT_RET_done;
|
||||
ctx->stats.data_type = (int) DATA_PROGRESS_DATA_TYPE_done;
|
||||
}
|
||||
enumerated_ref_put(&ctx->c->writes, BCH_WRITE_REF_ioctl_data);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int bch2_data_job_release(struct inode *inode, struct file *file)
|
||||
{
|
||||
struct bch_data_ctx *ctx = container_of(file->private_data, struct bch_data_ctx, thr);
|
||||
|
||||
bch2_thread_with_file_exit(&ctx->thr);
|
||||
kfree(ctx);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static ssize_t bch2_data_job_read(struct file *file, char __user *buf,
|
||||
size_t len, loff_t *ppos)
|
||||
{
|
||||
struct bch_data_ctx *ctx = container_of(file->private_data, struct bch_data_ctx, thr);
|
||||
struct bch_fs *c = ctx->c;
|
||||
struct bch_ioctl_data_event e = {
|
||||
.type = BCH_DATA_EVENT_PROGRESS,
|
||||
.ret = ctx->stats.ret,
|
||||
.p.data_type = ctx->stats.data_type,
|
||||
.p.btree_id = ctx->stats.pos.btree,
|
||||
.p.pos = ctx->stats.pos.pos,
|
||||
.p.sectors_done = atomic64_read(&ctx->stats.sectors_seen),
|
||||
.p.sectors_error_corrected = atomic64_read(&ctx->stats.sectors_error_corrected),
|
||||
.p.sectors_error_uncorrected = atomic64_read(&ctx->stats.sectors_error_uncorrected),
|
||||
};
|
||||
|
||||
if (ctx->arg.op == BCH_DATA_OP_scrub) {
|
||||
struct bch_dev *ca = bch2_dev_tryget(c, ctx->arg.scrub.dev);
|
||||
if (ca) {
|
||||
struct bch_dev_usage_full u;
|
||||
bch2_dev_usage_full_read_fast(ca, &u);
|
||||
for (unsigned i = BCH_DATA_btree; i < ARRAY_SIZE(u.d); i++)
|
||||
if (ctx->arg.scrub.data_types & BIT(i))
|
||||
e.p.sectors_total += u.d[i].sectors;
|
||||
bch2_dev_put(ca);
|
||||
}
|
||||
} else {
|
||||
e.p.sectors_total = bch2_fs_usage_read_short(c).used;
|
||||
}
|
||||
|
||||
if (len < sizeof(e))
|
||||
return -EINVAL;
|
||||
|
||||
return copy_to_user_errcode(buf, &e, sizeof(e)) ?: sizeof(e);
|
||||
}
|
||||
|
||||
static const struct file_operations bcachefs_data_ops = {
|
||||
.release = bch2_data_job_release,
|
||||
.read = bch2_data_job_read,
|
||||
};
|
||||
|
||||
static long bch2_ioctl_data(struct bch_fs *c,
|
||||
struct bch_ioctl_data arg)
|
||||
{
|
||||
struct bch_data_ctx *ctx;
|
||||
int ret;
|
||||
|
||||
if (!enumerated_ref_tryget(&c->writes, BCH_WRITE_REF_ioctl_data))
|
||||
return -EROFS;
|
||||
|
||||
if (!capable(CAP_SYS_ADMIN)) {
|
||||
ret = -EPERM;
|
||||
goto put_ref;
|
||||
}
|
||||
|
||||
if (arg.op >= BCH_DATA_OP_NR || arg.flags) {
|
||||
ret = -EINVAL;
|
||||
goto put_ref;
|
||||
}
|
||||
|
||||
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
|
||||
if (!ctx) {
|
||||
ret = -ENOMEM;
|
||||
goto put_ref;
|
||||
}
|
||||
|
||||
ctx->c = c;
|
||||
ctx->arg = arg;
|
||||
|
||||
ret = bch2_run_thread_with_file(&ctx->thr,
|
||||
&bcachefs_data_ops,
|
||||
bch2_data_thread);
|
||||
if (ret < 0)
|
||||
goto cleanup;
|
||||
return ret;
|
||||
cleanup:
|
||||
kfree(ctx);
|
||||
put_ref:
|
||||
enumerated_ref_put(&c->writes, BCH_WRITE_REF_ioctl_data);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static noinline_for_stack long bch2_ioctl_fs_usage(struct bch_fs *c,
|
||||
struct bch_ioctl_fs_usage __user *user_arg)
|
||||
{
|
||||
struct bch_ioctl_fs_usage arg = {};
|
||||
darray_char replicas = {};
|
||||
u32 replica_entries_bytes;
|
||||
int ret = 0;
|
||||
|
||||
if (!test_bit(BCH_FS_started, &c->flags))
|
||||
return -EINVAL;
|
||||
|
||||
if (get_user(replica_entries_bytes, &user_arg->replica_entries_bytes))
|
||||
return -EFAULT;
|
||||
|
||||
ret = bch2_fs_replicas_usage_read(c, &replicas) ?:
|
||||
(replica_entries_bytes < replicas.nr ? -ERANGE : 0) ?:
|
||||
copy_to_user_errcode(&user_arg->replicas, replicas.data, replicas.nr);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
struct bch_fs_usage_short u = bch2_fs_usage_read_short(c);
|
||||
arg.capacity = c->capacity;
|
||||
arg.used = u.used;
|
||||
arg.online_reserved = percpu_u64_get(c->online_reserved);
|
||||
arg.replica_entries_bytes = replicas.nr;
|
||||
|
||||
for (unsigned i = 0; i < BCH_REPLICAS_MAX; i++) {
|
||||
struct disk_accounting_pos k;
|
||||
disk_accounting_key_init(k, persistent_reserved, .nr_replicas = i);
|
||||
|
||||
bch2_accounting_mem_read(c,
|
||||
disk_accounting_pos_to_bpos(&k),
|
||||
&arg.persistent_reserved[i], 1);
|
||||
}
|
||||
|
||||
ret = copy_to_user_errcode(user_arg, &arg, sizeof(arg));
|
||||
err:
|
||||
darray_exit(&replicas);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static long bch2_ioctl_query_accounting(struct bch_fs *c,
|
||||
struct bch_ioctl_query_accounting __user *user_arg)
|
||||
{
|
||||
struct bch_ioctl_query_accounting arg;
|
||||
darray_char accounting = {};
|
||||
int ret = 0;
|
||||
|
||||
if (!test_bit(BCH_FS_started, &c->flags))
|
||||
return -EINVAL;
|
||||
|
||||
ret = copy_from_user_errcode(&arg, user_arg, sizeof(arg)) ?:
|
||||
bch2_fs_accounting_read(c, &accounting, arg.accounting_types_mask) ?:
|
||||
(arg.accounting_u64s * sizeof(u64) < accounting.nr ? -ERANGE : 0) ?:
|
||||
copy_to_user_errcode(&user_arg->accounting, accounting.data, accounting.nr);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
arg.capacity = c->capacity;
|
||||
arg.used = bch2_fs_usage_read_short(c).used;
|
||||
arg.online_reserved = percpu_u64_get(c->online_reserved);
|
||||
arg.accounting_u64s = accounting.nr / sizeof(u64);
|
||||
|
||||
ret = copy_to_user_errcode(user_arg, &arg, sizeof(arg));
|
||||
err:
|
||||
darray_exit(&accounting);
|
||||
return ret;
|
||||
}
|
||||
|
||||
/* obsolete, didn't allow for new data types: */
|
||||
static noinline_for_stack long bch2_ioctl_dev_usage(struct bch_fs *c,
|
||||
struct bch_ioctl_dev_usage __user *user_arg)
|
||||
{
|
||||
struct bch_ioctl_dev_usage arg;
|
||||
struct bch_dev_usage_full src;
|
||||
struct bch_dev *ca;
|
||||
unsigned i;
|
||||
|
||||
if (!test_bit(BCH_FS_started, &c->flags))
|
||||
return -EINVAL;
|
||||
|
||||
if (copy_from_user(&arg, user_arg, sizeof(arg)))
|
||||
return -EFAULT;
|
||||
|
||||
if ((arg.flags & ~BCH_BY_INDEX) ||
|
||||
arg.pad[0] ||
|
||||
arg.pad[1] ||
|
||||
arg.pad[2])
|
||||
return -EINVAL;
|
||||
|
||||
ca = bch2_device_lookup(c, arg.dev, arg.flags);
|
||||
if (IS_ERR(ca))
|
||||
return PTR_ERR(ca);
|
||||
|
||||
src = bch2_dev_usage_full_read(ca);
|
||||
|
||||
arg.state = ca->mi.state;
|
||||
arg.bucket_size = ca->mi.bucket_size;
|
||||
arg.nr_buckets = ca->mi.nbuckets - ca->mi.first_bucket;
|
||||
|
||||
for (i = 0; i < ARRAY_SIZE(arg.d); i++) {
|
||||
arg.d[i].buckets = src.d[i].buckets;
|
||||
arg.d[i].sectors = src.d[i].sectors;
|
||||
arg.d[i].fragmented = src.d[i].fragmented;
|
||||
}
|
||||
|
||||
bch2_dev_put(ca);
|
||||
|
||||
return copy_to_user_errcode(user_arg, &arg, sizeof(arg));
|
||||
}
|
||||
|
||||
static long bch2_ioctl_dev_usage_v2(struct bch_fs *c,
|
||||
struct bch_ioctl_dev_usage_v2 __user *user_arg)
|
||||
{
|
||||
struct bch_ioctl_dev_usage_v2 arg;
|
||||
struct bch_dev_usage_full src;
|
||||
struct bch_dev *ca;
|
||||
int ret = 0;
|
||||
|
||||
if (!test_bit(BCH_FS_started, &c->flags))
|
||||
return -EINVAL;
|
||||
|
||||
if (copy_from_user(&arg, user_arg, sizeof(arg)))
|
||||
return -EFAULT;
|
||||
|
||||
if ((arg.flags & ~BCH_BY_INDEX) ||
|
||||
arg.pad[0] ||
|
||||
arg.pad[1] ||
|
||||
arg.pad[2])
|
||||
return -EINVAL;
|
||||
|
||||
ca = bch2_device_lookup(c, arg.dev, arg.flags);
|
||||
if (IS_ERR(ca))
|
||||
return PTR_ERR(ca);
|
||||
|
||||
src = bch2_dev_usage_full_read(ca);
|
||||
|
||||
arg.state = ca->mi.state;
|
||||
arg.bucket_size = ca->mi.bucket_size;
|
||||
arg.nr_data_types = min(arg.nr_data_types, BCH_DATA_NR);
|
||||
arg.nr_buckets = ca->mi.nbuckets - ca->mi.first_bucket;
|
||||
|
||||
ret = copy_to_user_errcode(user_arg, &arg, sizeof(arg));
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
for (unsigned i = 0; i < arg.nr_data_types; i++) {
|
||||
struct bch_ioctl_dev_usage_type t = {
|
||||
.buckets = src.d[i].buckets,
|
||||
.sectors = src.d[i].sectors,
|
||||
.fragmented = src.d[i].fragmented,
|
||||
};
|
||||
|
||||
ret = copy_to_user_errcode(&user_arg->d[i], &t, sizeof(t));
|
||||
if (ret)
|
||||
goto err;
|
||||
}
|
||||
err:
|
||||
bch2_dev_put(ca);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static long bch2_ioctl_read_super(struct bch_fs *c,
|
||||
struct bch_ioctl_read_super arg)
|
||||
{
|
||||
struct bch_dev *ca = NULL;
|
||||
struct bch_sb *sb;
|
||||
int ret = 0;
|
||||
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
if ((arg.flags & ~(BCH_BY_INDEX|BCH_READ_DEV)) ||
|
||||
arg.pad)
|
||||
return -EINVAL;
|
||||
|
||||
mutex_lock(&c->sb_lock);
|
||||
|
||||
if (arg.flags & BCH_READ_DEV) {
|
||||
ca = bch2_device_lookup(c, arg.dev, arg.flags);
|
||||
ret = PTR_ERR_OR_ZERO(ca);
|
||||
if (ret)
|
||||
goto err_unlock;
|
||||
|
||||
sb = ca->disk_sb.sb;
|
||||
} else {
|
||||
sb = c->disk_sb.sb;
|
||||
}
|
||||
|
||||
if (vstruct_bytes(sb) > arg.size) {
|
||||
ret = -ERANGE;
|
||||
goto err;
|
||||
}
|
||||
|
||||
ret = copy_to_user_errcode((void __user *)(unsigned long)arg.sb, sb,
|
||||
vstruct_bytes(sb));
|
||||
err:
|
||||
bch2_dev_put(ca);
|
||||
err_unlock:
|
||||
mutex_unlock(&c->sb_lock);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static long bch2_ioctl_disk_get_idx(struct bch_fs *c,
|
||||
struct bch_ioctl_disk_get_idx arg)
|
||||
{
|
||||
dev_t dev = huge_decode_dev(arg.dev);
|
||||
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
if (!dev)
|
||||
return -EINVAL;
|
||||
|
||||
guard(rcu)();
|
||||
for_each_online_member_rcu(c, ca)
|
||||
if (ca->dev == dev)
|
||||
return ca->dev_idx;
|
||||
|
||||
return bch_err_throw(c, ENOENT_dev_idx_not_found);
|
||||
}
|
||||
|
||||
static long bch2_ioctl_disk_resize(struct bch_fs *c,
|
||||
struct bch_ioctl_disk_resize arg)
|
||||
{
|
||||
struct bch_dev *ca;
|
||||
int ret;
|
||||
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
if ((arg.flags & ~BCH_BY_INDEX) ||
|
||||
arg.pad)
|
||||
return -EINVAL;
|
||||
|
||||
ca = bch2_device_lookup(c, arg.dev, arg.flags);
|
||||
if (IS_ERR(ca))
|
||||
return PTR_ERR(ca);
|
||||
|
||||
ret = bch2_dev_resize(c, ca, arg.nbuckets);
|
||||
|
||||
bch2_dev_put(ca);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static long bch2_ioctl_disk_resize_journal(struct bch_fs *c,
|
||||
struct bch_ioctl_disk_resize_journal arg)
|
||||
{
|
||||
struct bch_dev *ca;
|
||||
int ret;
|
||||
|
||||
if (!capable(CAP_SYS_ADMIN))
|
||||
return -EPERM;
|
||||
|
||||
if ((arg.flags & ~BCH_BY_INDEX) ||
|
||||
arg.pad)
|
||||
return -EINVAL;
|
||||
|
||||
if (arg.nbuckets > U32_MAX)
|
||||
return -EINVAL;
|
||||
|
||||
ca = bch2_device_lookup(c, arg.dev, arg.flags);
|
||||
if (IS_ERR(ca))
|
||||
return PTR_ERR(ca);
|
||||
|
||||
ret = bch2_set_nr_journal_buckets(c, ca, arg.nbuckets);
|
||||
|
||||
bch2_dev_put(ca);
|
||||
return ret;
|
||||
}
|
||||
|
||||
#define BCH_IOCTL(_name, _argtype) \
|
||||
do { \
|
||||
_argtype i; \
|
||||
\
|
||||
if (copy_from_user(&i, arg, sizeof(i))) \
|
||||
return -EFAULT; \
|
||||
ret = bch2_ioctl_##_name(c, i); \
|
||||
goto out; \
|
||||
} while (0)
|
||||
|
||||
long bch2_fs_ioctl(struct bch_fs *c, unsigned cmd, void __user *arg)
|
||||
{
|
||||
long ret;
|
||||
|
||||
switch (cmd) {
|
||||
case BCH_IOCTL_QUERY_UUID:
|
||||
return bch2_ioctl_query_uuid(c, arg);
|
||||
case BCH_IOCTL_FS_USAGE:
|
||||
return bch2_ioctl_fs_usage(c, arg);
|
||||
case BCH_IOCTL_DEV_USAGE:
|
||||
return bch2_ioctl_dev_usage(c, arg);
|
||||
case BCH_IOCTL_DEV_USAGE_V2:
|
||||
return bch2_ioctl_dev_usage_v2(c, arg);
|
||||
#if 0
|
||||
case BCH_IOCTL_START:
|
||||
BCH_IOCTL(start, struct bch_ioctl_start);
|
||||
case BCH_IOCTL_STOP:
|
||||
return bch2_ioctl_stop(c);
|
||||
#endif
|
||||
case BCH_IOCTL_READ_SUPER:
|
||||
BCH_IOCTL(read_super, struct bch_ioctl_read_super);
|
||||
case BCH_IOCTL_DISK_GET_IDX:
|
||||
BCH_IOCTL(disk_get_idx, struct bch_ioctl_disk_get_idx);
|
||||
}
|
||||
|
||||
if (!test_bit(BCH_FS_started, &c->flags))
|
||||
return -EINVAL;
|
||||
|
||||
switch (cmd) {
|
||||
case BCH_IOCTL_DISK_ADD:
|
||||
BCH_IOCTL(disk_add, struct bch_ioctl_disk);
|
||||
case BCH_IOCTL_DISK_REMOVE:
|
||||
BCH_IOCTL(disk_remove, struct bch_ioctl_disk);
|
||||
case BCH_IOCTL_DISK_ONLINE:
|
||||
BCH_IOCTL(disk_online, struct bch_ioctl_disk);
|
||||
case BCH_IOCTL_DISK_OFFLINE:
|
||||
BCH_IOCTL(disk_offline, struct bch_ioctl_disk);
|
||||
case BCH_IOCTL_DISK_SET_STATE:
|
||||
BCH_IOCTL(disk_set_state, struct bch_ioctl_disk_set_state);
|
||||
case BCH_IOCTL_DATA:
|
||||
BCH_IOCTL(data, struct bch_ioctl_data);
|
||||
case BCH_IOCTL_DISK_RESIZE:
|
||||
BCH_IOCTL(disk_resize, struct bch_ioctl_disk_resize);
|
||||
case BCH_IOCTL_DISK_RESIZE_JOURNAL:
|
||||
BCH_IOCTL(disk_resize_journal, struct bch_ioctl_disk_resize_journal);
|
||||
case BCH_IOCTL_FSCK_ONLINE:
|
||||
BCH_IOCTL(fsck_online, struct bch_ioctl_fsck_online);
|
||||
case BCH_IOCTL_QUERY_ACCOUNTING:
|
||||
return bch2_ioctl_query_accounting(c, arg);
|
||||
case BCH_IOCTL_QUERY_COUNTERS:
|
||||
return bch2_ioctl_query_counters(c, arg);
|
||||
default:
|
||||
return -ENOTTY;
|
||||
}
|
||||
out:
|
||||
if (ret < 0)
|
||||
ret = bch2_err_class(ret);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static DEFINE_IDR(bch_chardev_minor);
|
||||
|
||||
static long bch2_chardev_ioctl(struct file *filp, unsigned cmd, unsigned long v)
|
||||
{
|
||||
unsigned minor = iminor(file_inode(filp));
|
||||
struct bch_fs *c = minor < U8_MAX ? idr_find(&bch_chardev_minor, minor) : NULL;
|
||||
void __user *arg = (void __user *) v;
|
||||
|
||||
return c
|
||||
? bch2_fs_ioctl(c, cmd, arg)
|
||||
: bch2_global_ioctl(cmd, arg);
|
||||
}
|
||||
|
||||
static const struct file_operations bch_chardev_fops = {
|
||||
.owner = THIS_MODULE,
|
||||
.unlocked_ioctl = bch2_chardev_ioctl,
|
||||
.open = nonseekable_open,
|
||||
};
|
||||
|
||||
static int bch_chardev_major;
|
||||
static const struct class bch_chardev_class = {
|
||||
.name = "bcachefs",
|
||||
};
|
||||
static struct device *bch_chardev;
|
||||
|
||||
void bch2_fs_chardev_exit(struct bch_fs *c)
|
||||
{
|
||||
if (!IS_ERR_OR_NULL(c->chardev))
|
||||
device_unregister(c->chardev);
|
||||
if (c->minor >= 0)
|
||||
idr_remove(&bch_chardev_minor, c->minor);
|
||||
}
|
||||
|
||||
int bch2_fs_chardev_init(struct bch_fs *c)
|
||||
{
|
||||
c->minor = idr_alloc(&bch_chardev_minor, c, 0, 0, GFP_KERNEL);
|
||||
if (c->minor < 0)
|
||||
return c->minor;
|
||||
|
||||
c->chardev = device_create(&bch_chardev_class, NULL,
|
||||
MKDEV(bch_chardev_major, c->minor), c,
|
||||
"bcachefs%u-ctl", c->minor);
|
||||
if (IS_ERR(c->chardev))
|
||||
return PTR_ERR(c->chardev);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
void bch2_chardev_exit(void)
|
||||
{
|
||||
device_destroy(&bch_chardev_class, MKDEV(bch_chardev_major, U8_MAX));
|
||||
class_unregister(&bch_chardev_class);
|
||||
if (bch_chardev_major > 0)
|
||||
unregister_chrdev(bch_chardev_major, "bcachefs");
|
||||
}
|
||||
|
||||
int __init bch2_chardev_init(void)
|
||||
{
|
||||
int ret;
|
||||
|
||||
bch_chardev_major = register_chrdev(0, "bcachefs-ctl", &bch_chardev_fops);
|
||||
if (bch_chardev_major < 0)
|
||||
return bch_chardev_major;
|
||||
|
||||
ret = class_register(&bch_chardev_class);
|
||||
if (ret)
|
||||
goto major_out;
|
||||
|
||||
bch_chardev = device_create(&bch_chardev_class, NULL,
|
||||
MKDEV(bch_chardev_major, U8_MAX),
|
||||
NULL, "bcachefs-ctl");
|
||||
if (IS_ERR(bch_chardev)) {
|
||||
ret = PTR_ERR(bch_chardev);
|
||||
goto class_out;
|
||||
}
|
||||
|
||||
return 0;
|
||||
|
||||
class_out:
|
||||
class_unregister(&bch_chardev_class);
|
||||
major_out:
|
||||
unregister_chrdev(bch_chardev_major, "bcachefs-ctl");
|
||||
return ret;
|
||||
}
|
||||
|
||||
#endif /* NO_BCACHEFS_CHARDEV */
|
||||
@@ -1,31 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_CHARDEV_H
|
||||
#define _BCACHEFS_CHARDEV_H
|
||||
|
||||
#ifndef NO_BCACHEFS_FS
|
||||
|
||||
long bch2_fs_ioctl(struct bch_fs *, unsigned, void __user *);
|
||||
|
||||
void bch2_fs_chardev_exit(struct bch_fs *);
|
||||
int bch2_fs_chardev_init(struct bch_fs *);
|
||||
|
||||
void bch2_chardev_exit(void);
|
||||
int __init bch2_chardev_init(void);
|
||||
|
||||
#else
|
||||
|
||||
static inline long bch2_fs_ioctl(struct bch_fs *c,
|
||||
unsigned cmd, void __user * arg)
|
||||
{
|
||||
return -ENOTTY;
|
||||
}
|
||||
|
||||
static inline void bch2_fs_chardev_exit(struct bch_fs *c) {}
|
||||
static inline int bch2_fs_chardev_init(struct bch_fs *c) { return 0; }
|
||||
|
||||
static inline void bch2_chardev_exit(void) {}
|
||||
static inline int __init bch2_chardev_init(void) { return 0; }
|
||||
|
||||
#endif /* NO_BCACHEFS_FS */
|
||||
|
||||
#endif /* _BCACHEFS_CHARDEV_H */
|
||||
@@ -1,698 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
#include "bcachefs.h"
|
||||
#include "checksum.h"
|
||||
#include "errcode.h"
|
||||
#include "error.h"
|
||||
#include "super.h"
|
||||
#include "super-io.h"
|
||||
|
||||
#include <linux/crc32c.h>
|
||||
#include <linux/xxhash.h>
|
||||
#include <linux/key.h>
|
||||
#include <linux/random.h>
|
||||
#include <linux/ratelimit.h>
|
||||
#include <crypto/chacha.h>
|
||||
#include <crypto/poly1305.h>
|
||||
#include <keys/user-type.h>
|
||||
|
||||
/*
 * bch2_checksum_state is an abstraction of the checksum state calculated over different pages.
 * It features page merging without having the checksum algorithm lose its state.
 * For native checksum algorithms (like crc), a default seed value will do.
 * For hash-like algorithms, a state needs to be stored.
 */

struct bch2_checksum_state {
	union {
		u64		seed;
		struct xxh64_state h64state;
	};
	unsigned int		type;
};

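/*
 * A small sketch of the property described above: feeding the state two
 * chunks gives the same result as one pass over the concatenated buffer.
 * Purely illustrative; it assumes the init/update/final helpers defined
 * below, and 'type' must be one of the non-encrypting checksum types.
 */
static inline u64 example_checksum_two_chunks(unsigned type,
					      const void *a, size_t a_len,
					      const void *b, size_t b_len)
{
	struct bch2_checksum_state state;

	state.type = type;
	bch2_checksum_init(&state);
	bch2_checksum_update(&state, a, a_len);	/* first page/chunk */
	bch2_checksum_update(&state, b, b_len);	/* merged without losing state */
	return bch2_checksum_final(&state);
}
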
static void bch2_checksum_init(struct bch2_checksum_state *state)
|
||||
{
|
||||
switch (state->type) {
|
||||
case BCH_CSUM_none:
|
||||
case BCH_CSUM_crc32c:
|
||||
case BCH_CSUM_crc64:
|
||||
state->seed = 0;
|
||||
break;
|
||||
case BCH_CSUM_crc32c_nonzero:
|
||||
state->seed = U32_MAX;
|
||||
break;
|
||||
case BCH_CSUM_crc64_nonzero:
|
||||
state->seed = U64_MAX;
|
||||
break;
|
||||
case BCH_CSUM_xxhash:
|
||||
xxh64_reset(&state->h64state, 0);
|
||||
break;
|
||||
default:
|
||||
BUG();
|
||||
}
|
||||
}
|
||||
|
||||
static u64 bch2_checksum_final(const struct bch2_checksum_state *state)
|
||||
{
|
||||
switch (state->type) {
|
||||
case BCH_CSUM_none:
|
||||
case BCH_CSUM_crc32c:
|
||||
case BCH_CSUM_crc64:
|
||||
return state->seed;
|
||||
case BCH_CSUM_crc32c_nonzero:
|
||||
return state->seed ^ U32_MAX;
|
||||
case BCH_CSUM_crc64_nonzero:
|
||||
return state->seed ^ U64_MAX;
|
||||
case BCH_CSUM_xxhash:
|
||||
return xxh64_digest(&state->h64state);
|
||||
default:
|
||||
BUG();
|
||||
}
|
||||
}
|
||||
|
||||
static void bch2_checksum_update(struct bch2_checksum_state *state, const void *data, size_t len)
|
||||
{
|
||||
switch (state->type) {
|
||||
case BCH_CSUM_none:
|
||||
return;
|
||||
case BCH_CSUM_crc32c_nonzero:
|
||||
case BCH_CSUM_crc32c:
|
||||
state->seed = crc32c(state->seed, data, len);
|
||||
break;
|
||||
case BCH_CSUM_crc64_nonzero:
|
||||
case BCH_CSUM_crc64:
|
||||
state->seed = crc64_be(state->seed, data, len);
|
||||
break;
|
||||
case BCH_CSUM_xxhash:
|
||||
xxh64_update(&state->h64state, data, len);
|
||||
break;
|
||||
default:
|
||||
BUG();
|
||||
}
|
||||
}
|
||||
|
||||
static void bch2_chacha20_init(struct chacha_state *state,
|
||||
const struct bch_key *key, struct nonce nonce)
|
||||
{
|
||||
u32 key_words[CHACHA_KEY_SIZE / sizeof(u32)];
|
||||
|
||||
BUILD_BUG_ON(sizeof(key_words) != sizeof(*key));
|
||||
memcpy(key_words, key, sizeof(key_words));
|
||||
le32_to_cpu_array(key_words, ARRAY_SIZE(key_words));
|
||||
|
||||
BUILD_BUG_ON(sizeof(nonce) != CHACHA_IV_SIZE);
|
||||
chacha_init(state, key_words, (const u8 *)nonce.d);
|
||||
|
||||
memzero_explicit(key_words, sizeof(key_words));
|
||||
}
|
||||
|
||||
void bch2_chacha20(const struct bch_key *key, struct nonce nonce,
|
||||
void *data, size_t len)
|
||||
{
|
||||
struct chacha_state state;
|
||||
|
||||
bch2_chacha20_init(&state, key, nonce);
|
||||
chacha20_crypt(&state, data, data, len);
|
||||
chacha_zeroize_state(&state);
|
||||
}
|
||||
|
||||
static void bch2_poly1305_init(struct poly1305_desc_ctx *desc,
|
||||
struct bch_fs *c, struct nonce nonce)
|
||||
{
|
||||
u8 key[POLY1305_KEY_SIZE] = { 0 };
|
||||
|
||||
nonce.d[3] ^= BCH_NONCE_POLY;
|
||||
|
||||
bch2_chacha20(&c->chacha20_key, nonce, key, sizeof(key));
|
||||
poly1305_init(desc, key);
|
||||
}
|
||||
|
||||
struct bch_csum bch2_checksum(struct bch_fs *c, unsigned type,
|
||||
struct nonce nonce, const void *data, size_t len)
|
||||
{
|
||||
switch (type) {
|
||||
case BCH_CSUM_none:
|
||||
case BCH_CSUM_crc32c_nonzero:
|
||||
case BCH_CSUM_crc64_nonzero:
|
||||
case BCH_CSUM_crc32c:
|
||||
case BCH_CSUM_xxhash:
|
||||
case BCH_CSUM_crc64: {
|
||||
struct bch2_checksum_state state;
|
||||
|
||||
state.type = type;
|
||||
|
||||
bch2_checksum_init(&state);
|
||||
bch2_checksum_update(&state, data, len);
|
||||
|
||||
return (struct bch_csum) { .lo = cpu_to_le64(bch2_checksum_final(&state)) };
|
||||
}
|
||||
|
||||
case BCH_CSUM_chacha20_poly1305_80:
|
||||
case BCH_CSUM_chacha20_poly1305_128: {
|
||||
struct poly1305_desc_ctx dctx;
|
||||
u8 digest[POLY1305_DIGEST_SIZE];
|
||||
struct bch_csum ret = { 0 };
|
||||
|
||||
bch2_poly1305_init(&dctx, c, nonce);
|
||||
poly1305_update(&dctx, data, len);
|
||||
poly1305_final(&dctx, digest);
|
||||
|
||||
memcpy(&ret, digest, bch_crc_bytes[type]);
|
||||
return ret;
|
||||
}
|
||||
default:
|
||||
return (struct bch_csum) {};
|
||||
}
|
||||
}
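
A minimal usage sketch of the function above (illustrative only; buf, len and expected are hypothetical, and the nonce argument is ignored by the plain checksum types, as the switch above shows):

	struct bch_csum got = bch2_checksum(c, BCH_CSUM_crc32c, null_nonce(), buf, len);

	if (bch2_crc_cmp(got, expected))
		return -EIO;	/* bch2_crc_cmp() returns true on mismatch */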
|
||||
|
||||
int bch2_encrypt(struct bch_fs *c, unsigned type,
|
||||
struct nonce nonce, void *data, size_t len)
|
||||
{
|
||||
if (!bch2_csum_type_is_encryption(type))
|
||||
return 0;
|
||||
|
||||
if (bch2_fs_inconsistent_on(!c->chacha20_key_set,
|
||||
c, "attempting to encrypt without encryption key"))
|
||||
return bch_err_throw(c, no_encryption_key);
|
||||
|
||||
bch2_chacha20(&c->chacha20_key, nonce, data, len);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static struct bch_csum __bch2_checksum_bio(struct bch_fs *c, unsigned type,
|
||||
struct nonce nonce, struct bio *bio,
|
||||
struct bvec_iter *iter)
|
||||
{
|
||||
struct bio_vec bv;
|
||||
|
||||
switch (type) {
|
||||
case BCH_CSUM_none:
|
||||
return (struct bch_csum) { 0 };
|
||||
case BCH_CSUM_crc32c_nonzero:
|
||||
case BCH_CSUM_crc64_nonzero:
|
||||
case BCH_CSUM_crc32c:
|
||||
case BCH_CSUM_xxhash:
|
||||
case BCH_CSUM_crc64: {
|
||||
struct bch2_checksum_state state;
|
||||
|
||||
state.type = type;
|
||||
bch2_checksum_init(&state);
|
||||
|
||||
#ifdef CONFIG_HIGHMEM
|
||||
__bio_for_each_segment(bv, bio, *iter, *iter) {
|
||||
void *p = kmap_local_page(bv.bv_page) + bv.bv_offset;
|
||||
|
||||
bch2_checksum_update(&state, p, bv.bv_len);
|
||||
kunmap_local(p);
|
||||
}
|
||||
#else
|
||||
__bio_for_each_bvec(bv, bio, *iter, *iter)
|
||||
bch2_checksum_update(&state, page_address(bv.bv_page) + bv.bv_offset,
|
||||
bv.bv_len);
|
||||
#endif
|
||||
return (struct bch_csum) { .lo = cpu_to_le64(bch2_checksum_final(&state)) };
|
||||
}
|
||||
|
||||
case BCH_CSUM_chacha20_poly1305_80:
|
||||
case BCH_CSUM_chacha20_poly1305_128: {
|
||||
struct poly1305_desc_ctx dctx;
|
||||
u8 digest[POLY1305_DIGEST_SIZE];
|
||||
struct bch_csum ret = { 0 };
|
||||
|
||||
bch2_poly1305_init(&dctx, c, nonce);
|
||||
|
||||
#ifdef CONFIG_HIGHMEM
|
||||
__bio_for_each_segment(bv, bio, *iter, *iter) {
|
||||
void *p = kmap_local_page(bv.bv_page) + bv.bv_offset;
|
||||
|
||||
poly1305_update(&dctx, p, bv.bv_len);
|
||||
kunmap_local(p);
|
||||
}
|
||||
#else
|
||||
__bio_for_each_bvec(bv, bio, *iter, *iter)
|
||||
poly1305_update(&dctx,
|
||||
page_address(bv.bv_page) + bv.bv_offset,
|
||||
bv.bv_len);
|
||||
#endif
|
||||
poly1305_final(&dctx, digest);
|
||||
|
||||
memcpy(&ret, digest, bch_crc_bytes[type]);
|
||||
return ret;
|
||||
}
|
||||
default:
|
||||
return (struct bch_csum) {};
|
||||
}
|
||||
}
|
||||
|
||||
struct bch_csum bch2_checksum_bio(struct bch_fs *c, unsigned type,
|
||||
struct nonce nonce, struct bio *bio)
|
||||
{
|
||||
struct bvec_iter iter = bio->bi_iter;
|
||||
|
||||
return __bch2_checksum_bio(c, type, nonce, bio, &iter);
|
||||
}
|
||||
|
||||
int __bch2_encrypt_bio(struct bch_fs *c, unsigned type,
|
||||
struct nonce nonce, struct bio *bio)
|
||||
{
|
||||
struct bio_vec bv;
|
||||
struct bvec_iter iter;
|
||||
struct chacha_state chacha_state;
|
||||
int ret = 0;
|
||||
|
||||
if (bch2_fs_inconsistent_on(!c->chacha20_key_set,
|
||||
c, "attempting to encrypt without encryption key"))
|
||||
return bch_err_throw(c, no_encryption_key);
|
||||
|
||||
bch2_chacha20_init(&chacha_state, &c->chacha20_key, nonce);
|
||||
|
||||
bio_for_each_segment(bv, bio, iter) {
|
||||
void *p;
|
||||
|
||||
/*
|
||||
* chacha_crypt() assumes that the length is a multiple of
|
||||
* CHACHA_BLOCK_SIZE on any non-final call.
|
||||
*/
|
||||
if (!IS_ALIGNED(bv.bv_len, CHACHA_BLOCK_SIZE)) {
|
||||
bch_err_ratelimited(c, "bio not aligned for encryption");
|
||||
ret = -EIO;
|
||||
break;
|
||||
}
|
||||
|
||||
p = bvec_kmap_local(&bv);
|
||||
chacha20_crypt(&chacha_state, p, p, bv.bv_len);
|
||||
kunmap_local(p);
|
||||
}
|
||||
chacha_zeroize_state(&chacha_state);
|
||||
return ret;
|
||||
}
|
||||
|
||||
struct bch_csum bch2_checksum_merge(unsigned type, struct bch_csum a,
|
||||
struct bch_csum b, size_t b_len)
|
||||
{
|
||||
struct bch2_checksum_state state;
|
||||
|
||||
state.type = type;
|
||||
bch2_checksum_init(&state);
|
||||
state.seed = le64_to_cpu(a.lo);
|
||||
|
||||
BUG_ON(!bch2_checksum_mergeable(type));
|
||||
|
||||
while (b_len) {
|
||||
unsigned page_len = min_t(unsigned, b_len, PAGE_SIZE);
|
||||
|
||||
bch2_checksum_update(&state,
|
||||
page_address(ZERO_PAGE(0)), page_len);
|
||||
b_len -= page_len;
|
||||
}
|
||||
a.lo = cpu_to_le64(bch2_checksum_final(&state));
|
||||
a.lo ^= b.lo;
|
||||
a.hi ^= b.hi;
|
||||
return a;
|
||||
}
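
A plausible reading of why the merge above is limited to the seed-0 variants that bch2_checksum_mergeable() allows: for a zero-seeded CRC, the checksum is linear under XOR and leading zero bytes leave the state unchanged, so

	crc(A + B) = crc(A + zeros(len(B))) ^ crc(B)	(+ denoting concatenation)

which is exactly what the loop computes - it extends a's state over len(B) zero bytes from ZERO_PAGE(0), then XORs in b.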
|
||||
|
||||
int bch2_rechecksum_bio(struct bch_fs *c, struct bio *bio,
|
||||
struct bversion version,
|
||||
struct bch_extent_crc_unpacked crc_old,
|
||||
struct bch_extent_crc_unpacked *crc_a,
|
||||
struct bch_extent_crc_unpacked *crc_b,
|
||||
unsigned len_a, unsigned len_b,
|
||||
unsigned new_csum_type)
|
||||
{
|
||||
struct bvec_iter iter = bio->bi_iter;
|
||||
struct nonce nonce = extent_nonce(version, crc_old);
|
||||
struct bch_csum merged = { 0 };
|
||||
struct crc_split {
|
||||
struct bch_extent_crc_unpacked *crc;
|
||||
unsigned len;
|
||||
unsigned csum_type;
|
||||
struct bch_csum csum;
|
||||
} splits[3] = {
|
||||
{ crc_a, len_a, new_csum_type, { 0 }},
|
||||
{ crc_b, len_b, new_csum_type, { 0 } },
|
||||
{ NULL, bio_sectors(bio) - len_a - len_b, new_csum_type, { 0 } },
|
||||
}, *i;
|
||||
bool mergeable = crc_old.csum_type == new_csum_type &&
|
||||
bch2_checksum_mergeable(new_csum_type);
|
||||
unsigned crc_nonce = crc_old.nonce;
|
||||
|
||||
BUG_ON(len_a + len_b > bio_sectors(bio));
|
||||
BUG_ON(crc_old.uncompressed_size != bio_sectors(bio));
|
||||
BUG_ON(crc_is_compressed(crc_old));
|
||||
BUG_ON(bch2_csum_type_is_encryption(crc_old.csum_type) !=
|
||||
bch2_csum_type_is_encryption(new_csum_type));
|
||||
|
||||
for (i = splits; i < splits + ARRAY_SIZE(splits); i++) {
|
||||
iter.bi_size = i->len << 9;
|
||||
if (mergeable || i->crc)
|
||||
i->csum = __bch2_checksum_bio(c, i->csum_type,
|
||||
nonce, bio, &iter);
|
||||
else
|
||||
bio_advance_iter(bio, &iter, i->len << 9);
|
||||
nonce = nonce_add(nonce, i->len << 9);
|
||||
}
|
||||
|
||||
if (mergeable)
|
||||
for (i = splits; i < splits + ARRAY_SIZE(splits); i++)
|
||||
merged = bch2_checksum_merge(new_csum_type, merged,
|
||||
i->csum, i->len << 9);
|
||||
else
|
||||
merged = bch2_checksum_bio(c, crc_old.csum_type,
|
||||
extent_nonce(version, crc_old), bio);
|
||||
|
||||
if (bch2_crc_cmp(merged, crc_old.csum) && !c->opts.no_data_io) {
|
||||
struct printbuf buf = PRINTBUF;
|
||||
prt_printf(&buf, "checksum error in %s() (memory corruption or bug?)\n"
|
||||
" expected %0llx:%0llx got %0llx:%0llx (old type ",
|
||||
__func__,
|
||||
crc_old.csum.hi,
|
||||
crc_old.csum.lo,
|
||||
merged.hi,
|
||||
merged.lo);
|
||||
bch2_prt_csum_type(&buf, crc_old.csum_type);
|
||||
prt_str(&buf, " new type ");
|
||||
bch2_prt_csum_type(&buf, new_csum_type);
|
||||
prt_str(&buf, ")");
|
||||
WARN_RATELIMIT(1, "%s", buf.buf);
|
||||
printbuf_exit(&buf);
|
||||
return bch_err_throw(c, recompute_checksum);
|
||||
}
|
||||
|
||||
for (i = splits; i < splits + ARRAY_SIZE(splits); i++) {
|
||||
if (i->crc)
|
||||
*i->crc = (struct bch_extent_crc_unpacked) {
|
||||
.csum_type = i->csum_type,
|
||||
.compression_type = crc_old.compression_type,
|
||||
.compressed_size = i->len,
|
||||
.uncompressed_size = i->len,
|
||||
.offset = 0,
|
||||
.live_size = i->len,
|
||||
.nonce = crc_nonce,
|
||||
.csum = i->csum,
|
||||
};
|
||||
|
||||
if (bch2_csum_type_is_encryption(new_csum_type))
|
||||
crc_nonce += i->len;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* BCH_SB_FIELD_crypt: */
|
||||
|
||||
static int bch2_sb_crypt_validate(struct bch_sb *sb, struct bch_sb_field *f,
|
||||
enum bch_validate_flags flags, struct printbuf *err)
|
||||
{
|
||||
struct bch_sb_field_crypt *crypt = field_to_type(f, crypt);
|
||||
|
||||
if (vstruct_bytes(&crypt->field) < sizeof(*crypt)) {
|
||||
prt_printf(err, "wrong size (got %zu should be %zu)",
|
||||
vstruct_bytes(&crypt->field), sizeof(*crypt));
|
||||
return -BCH_ERR_invalid_sb_crypt;
|
||||
}
|
||||
|
||||
if (BCH_CRYPT_KDF_TYPE(crypt)) {
|
||||
prt_printf(err, "bad kdf type %llu", BCH_CRYPT_KDF_TYPE(crypt));
|
||||
return -BCH_ERR_invalid_sb_crypt;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void bch2_sb_crypt_to_text(struct printbuf *out, struct bch_sb *sb,
|
||||
struct bch_sb_field *f)
|
||||
{
|
||||
struct bch_sb_field_crypt *crypt = field_to_type(f, crypt);
|
||||
|
||||
prt_printf(out, "KFD: %llu\n", BCH_CRYPT_KDF_TYPE(crypt));
|
||||
prt_printf(out, "scrypt n: %llu\n", BCH_KDF_SCRYPT_N(crypt));
|
||||
prt_printf(out, "scrypt r: %llu\n", BCH_KDF_SCRYPT_R(crypt));
|
||||
prt_printf(out, "scrypt p: %llu\n", BCH_KDF_SCRYPT_P(crypt));
|
||||
}
|
||||
|
||||
const struct bch_sb_field_ops bch_sb_field_ops_crypt = {
|
||||
.validate = bch2_sb_crypt_validate,
|
||||
.to_text = bch2_sb_crypt_to_text,
|
||||
};
|
||||
|
||||
#ifdef __KERNEL__
|
||||
static int __bch2_request_key(char *key_description, struct bch_key *key)
|
||||
{
|
||||
struct key *keyring_key;
|
||||
const struct user_key_payload *ukp;
|
||||
int ret;
|
||||
|
||||
keyring_key = request_key(&key_type_user, key_description, NULL);
|
||||
if (IS_ERR(keyring_key))
|
||||
return PTR_ERR(keyring_key);
|
||||
|
||||
down_read(&keyring_key->sem);
|
||||
ukp = dereference_key_locked(keyring_key);
|
||||
if (ukp->datalen == sizeof(*key)) {
|
||||
memcpy(key, ukp->data, ukp->datalen);
|
||||
ret = 0;
|
||||
} else {
|
||||
ret = -EINVAL;
|
||||
}
|
||||
up_read(&keyring_key->sem);
|
||||
key_put(keyring_key);
|
||||
|
||||
return ret;
|
||||
}
|
||||
#else
|
||||
#include <keyutils.h>
|
||||
|
||||
static int __bch2_request_key(char *key_description, struct bch_key *key)
|
||||
{
|
||||
key_serial_t key_id;
|
||||
|
||||
key_id = request_key("user", key_description, NULL,
|
||||
KEY_SPEC_SESSION_KEYRING);
|
||||
if (key_id >= 0)
|
||||
goto got_key;
|
||||
|
||||
key_id = request_key("user", key_description, NULL,
|
||||
KEY_SPEC_USER_KEYRING);
|
||||
if (key_id >= 0)
|
||||
goto got_key;
|
||||
|
||||
key_id = request_key("user", key_description, NULL,
|
||||
KEY_SPEC_USER_SESSION_KEYRING);
|
||||
if (key_id >= 0)
|
||||
goto got_key;
|
||||
|
||||
return -errno;
|
||||
got_key:
|
||||
|
||||
if (keyctl_read(key_id, (void *) key, sizeof(*key)) != sizeof(*key))
|
||||
return -1;
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
#include "crypto.h"
|
||||
#endif
|
||||
|
||||
int bch2_request_key(struct bch_sb *sb, struct bch_key *key)
|
||||
{
|
||||
struct printbuf key_description = PRINTBUF;
|
||||
int ret;
|
||||
|
||||
prt_printf(&key_description, "bcachefs:");
|
||||
pr_uuid(&key_description, sb->user_uuid.b);
|
||||
|
||||
ret = __bch2_request_key(key_description.buf, key);
|
||||
printbuf_exit(&key_description);
|
||||
|
||||
#ifndef __KERNEL__
|
||||
if (ret) {
|
||||
char *passphrase = read_passphrase("Enter passphrase: ");
|
||||
struct bch_encrypted_key sb_key;
|
||||
|
||||
bch2_passphrase_check(sb, passphrase,
|
||||
key, &sb_key);
|
||||
ret = 0;
|
||||
}
|
||||
#endif
|
||||
|
||||
/* stash with memfd, pass memfd fd to mount */
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
#ifndef __KERNEL__
|
||||
int bch2_revoke_key(struct bch_sb *sb)
|
||||
{
|
||||
key_serial_t key_id;
|
||||
struct printbuf key_description = PRINTBUF;
|
||||
|
||||
prt_printf(&key_description, "bcachefs:");
|
||||
pr_uuid(&key_description, sb->user_uuid.b);
|
||||
|
||||
key_id = request_key("user", key_description.buf, NULL, KEY_SPEC_USER_KEYRING);
|
||||
printbuf_exit(&key_description);
|
||||
if (key_id < 0)
|
||||
return errno;
|
||||
|
||||
keyctl_revoke(key_id);
|
||||
|
||||
return 0;
|
||||
}
|
||||
#endif
|
||||
|
||||
int bch2_decrypt_sb_key(struct bch_fs *c,
|
||||
struct bch_sb_field_crypt *crypt,
|
||||
struct bch_key *key)
|
||||
{
|
||||
struct bch_encrypted_key sb_key = crypt->key;
|
||||
struct bch_key user_key;
|
||||
int ret = 0;
|
||||
|
||||
/* is key encrypted? */
|
||||
if (!bch2_key_is_encrypted(&sb_key))
|
||||
goto out;
|
||||
|
||||
ret = bch2_request_key(c->disk_sb.sb, &user_key);
|
||||
if (ret) {
|
||||
bch_err(c, "error requesting encryption key: %s", bch2_err_str(ret));
|
||||
goto err;
|
||||
}
|
||||
|
||||
/* decrypt real key: */
|
||||
bch2_chacha20(&user_key, bch2_sb_key_nonce(c), &sb_key, sizeof(sb_key));
|
||||
|
||||
if (bch2_key_is_encrypted(&sb_key)) {
|
||||
bch_err(c, "incorrect encryption key");
|
||||
ret = -EINVAL;
|
||||
goto err;
|
||||
}
|
||||
out:
|
||||
*key = sb_key.key;
|
||||
err:
|
||||
memzero_explicit(&sb_key, sizeof(sb_key));
|
||||
memzero_explicit(&user_key, sizeof(user_key));
|
||||
return ret;
|
||||
}
|
||||
|
||||
#if 0
|
||||
|
||||
/*
|
||||
* This seems to be duplicating code in cmd_remove_passphrase() in
|
||||
* bcachefs-tools, but we might want to switch userspace to use this - and
|
||||
* perhaps add an ioctl for calling this at runtime, so we can take the
|
||||
* passphrase off of a mounted filesystem (which has come up).
|
||||
*/
|
||||
int bch2_disable_encryption(struct bch_fs *c)
|
||||
{
|
||||
struct bch_sb_field_crypt *crypt;
|
||||
struct bch_key key;
|
||||
int ret = -EINVAL;
|
||||
|
||||
mutex_lock(&c->sb_lock);
|
||||
|
||||
crypt = bch2_sb_field_get(c->disk_sb.sb, crypt);
|
||||
if (!crypt)
|
||||
goto out;
|
||||
|
||||
/* is key encrypted? */
|
||||
ret = 0;
|
||||
if (bch2_key_is_encrypted(&crypt->key))
|
||||
goto out;
|
||||
|
||||
ret = bch2_decrypt_sb_key(c, crypt, &key);
|
||||
if (ret)
|
||||
goto out;
|
||||
|
||||
crypt->key.magic = cpu_to_le64(BCH_KEY_MAGIC);
|
||||
crypt->key.key = key;
|
||||
|
||||
SET_BCH_SB_ENCRYPTION_TYPE(c->disk_sb.sb, 0);
|
||||
bch2_write_super(c);
|
||||
out:
|
||||
mutex_unlock(&c->sb_lock);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* For enabling encryption on an existing filesystem: not hooked up yet, but it
|
||||
* should be
|
||||
*/
|
||||
int bch2_enable_encryption(struct bch_fs *c, bool keyed)
|
||||
{
|
||||
struct bch_encrypted_key key;
|
||||
struct bch_key user_key;
|
||||
struct bch_sb_field_crypt *crypt;
|
||||
int ret = -EINVAL;
|
||||
|
||||
mutex_lock(&c->sb_lock);
|
||||
|
||||
/* Do we already have an encryption key? */
|
||||
if (bch2_sb_field_get(c->disk_sb.sb, crypt))
|
||||
goto err;
|
||||
|
||||
ret = bch2_alloc_ciphers(c);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
key.magic = cpu_to_le64(BCH_KEY_MAGIC);
|
||||
get_random_bytes(&key.key, sizeof(key.key));
|
||||
|
||||
if (keyed) {
|
||||
ret = bch2_request_key(c->disk_sb.sb, &user_key);
|
||||
if (ret) {
|
||||
bch_err(c, "error requesting encryption key: %s", bch2_err_str(ret));
|
||||
goto err;
|
||||
}
|
||||
|
||||
ret = bch2_chacha_encrypt_key(&user_key, bch2_sb_key_nonce(c),
|
||||
&key, sizeof(key));
|
||||
if (ret)
|
||||
goto err;
|
||||
}
|
||||
|
||||
ret = crypto_skcipher_setkey(&c->chacha20->base,
|
||||
(void *) &key.key, sizeof(key.key));
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
crypt = bch2_sb_field_resize(&c->disk_sb, crypt,
|
||||
sizeof(*crypt) / sizeof(u64));
|
||||
if (!crypt) {
|
||||
ret = bch_err_throw(c, ENOSPC_sb_crypt);
|
||||
goto err;
|
||||
}
|
||||
|
||||
crypt->key = key;
|
||||
|
||||
/* write superblock */
|
||||
SET_BCH_SB_ENCRYPTION_TYPE(c->disk_sb.sb, 1);
|
||||
bch2_write_super(c);
|
||||
err:
|
||||
mutex_unlock(&c->sb_lock);
|
||||
memzero_explicit(&user_key, sizeof(user_key));
|
||||
memzero_explicit(&key, sizeof(key));
|
||||
return ret;
|
||||
}
|
||||
#endif
|
||||
|
||||
void bch2_fs_encryption_exit(struct bch_fs *c)
|
||||
{
|
||||
memzero_explicit(&c->chacha20_key, sizeof(c->chacha20_key));
|
||||
}
|
||||
|
||||
int bch2_fs_encryption_init(struct bch_fs *c)
|
||||
{
|
||||
struct bch_sb_field_crypt *crypt;
|
||||
int ret;
|
||||
|
||||
crypt = bch2_sb_field_get(c->disk_sb.sb, crypt);
|
||||
if (!crypt)
|
||||
return 0;
|
||||
|
||||
ret = bch2_decrypt_sb_key(c, crypt, &c->chacha20_key);
|
||||
if (ret)
|
||||
return ret;
|
||||
c->chacha20_key_set = true;
|
||||
return 0;
|
||||
}
|
||||
@@ -1,240 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_CHECKSUM_H
|
||||
#define _BCACHEFS_CHECKSUM_H
|
||||
|
||||
#include "bcachefs.h"
|
||||
#include "extents_types.h"
|
||||
#include "super-io.h"
|
||||
|
||||
#include <linux/crc64.h>
|
||||
#include <crypto/chacha.h>
|
||||
|
||||
static inline bool bch2_checksum_mergeable(unsigned type)
|
||||
{
|
||||
|
||||
switch (type) {
|
||||
case BCH_CSUM_none:
|
||||
case BCH_CSUM_crc32c:
|
||||
case BCH_CSUM_crc64:
|
||||
return true;
|
||||
default:
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
struct bch_csum bch2_checksum_merge(unsigned, struct bch_csum,
|
||||
struct bch_csum, size_t);
|
||||
|
||||
#define BCH_NONCE_EXTENT cpu_to_le32(1 << 28)
|
||||
#define BCH_NONCE_BTREE cpu_to_le32(2 << 28)
|
||||
#define BCH_NONCE_JOURNAL cpu_to_le32(3 << 28)
|
||||
#define BCH_NONCE_PRIO cpu_to_le32(4 << 28)
|
||||
#define BCH_NONCE_POLY cpu_to_le32(1 << 31)
|
||||
|
||||
struct bch_csum bch2_checksum(struct bch_fs *, unsigned, struct nonce,
|
||||
const void *, size_t);
|
||||
|
||||
/*
|
||||
* This is used for various on disk data structures - bch_sb, prio_set, bset,
|
||||
* jset: The checksum is _always_ the first field of these structs
|
||||
*/
|
||||
#define csum_vstruct(_c, _type, _nonce, _i) \
|
||||
({ \
|
||||
const void *_start = ((const void *) (_i)) + sizeof((_i)->csum);\
|
||||
\
|
||||
bch2_checksum(_c, _type, _nonce, _start, vstruct_end(_i) - _start);\
|
||||
})
|
||||
|
||||
static inline void bch2_csum_to_text(struct printbuf *out,
|
||||
enum bch_csum_type type,
|
||||
struct bch_csum csum)
|
||||
{
|
||||
const u8 *p = (u8 *) &csum;
|
||||
unsigned bytes = type < BCH_CSUM_NR ? bch_crc_bytes[type] : 16;
|
||||
|
||||
for (unsigned i = 0; i < bytes; i++)
|
||||
prt_hex_byte(out, p[i]);
|
||||
}
|
||||
|
||||
static inline void bch2_csum_err_msg(struct printbuf *out,
|
||||
enum bch_csum_type type,
|
||||
struct bch_csum expected,
|
||||
struct bch_csum got)
|
||||
{
|
||||
prt_str(out, "checksum error, type ");
|
||||
bch2_prt_csum_type(out, type);
|
||||
prt_str(out, ": got ");
|
||||
bch2_csum_to_text(out, type, got);
|
||||
prt_str(out, " should be ");
|
||||
bch2_csum_to_text(out, type, expected);
|
||||
}
|
||||
|
||||
void bch2_chacha20(const struct bch_key *, struct nonce, void *, size_t);
|
||||
|
||||
int bch2_request_key(struct bch_sb *, struct bch_key *);
|
||||
#ifndef __KERNEL__
|
||||
int bch2_revoke_key(struct bch_sb *);
|
||||
#endif
|
||||
|
||||
int bch2_encrypt(struct bch_fs *, unsigned, struct nonce,
|
||||
void *data, size_t);
|
||||
|
||||
struct bch_csum bch2_checksum_bio(struct bch_fs *, unsigned,
|
||||
struct nonce, struct bio *);
|
||||
|
||||
int bch2_rechecksum_bio(struct bch_fs *, struct bio *, struct bversion,
|
||||
struct bch_extent_crc_unpacked,
|
||||
struct bch_extent_crc_unpacked *,
|
||||
struct bch_extent_crc_unpacked *,
|
||||
unsigned, unsigned, unsigned);
|
||||
|
||||
int __bch2_encrypt_bio(struct bch_fs *, unsigned,
|
||||
struct nonce, struct bio *);
|
||||
|
||||
static inline int bch2_encrypt_bio(struct bch_fs *c, unsigned type,
|
||||
struct nonce nonce, struct bio *bio)
|
||||
{
|
||||
return bch2_csum_type_is_encryption(type)
|
||||
? __bch2_encrypt_bio(c, type, nonce, bio)
|
||||
: 0;
|
||||
}
|
||||
|
||||
extern const struct bch_sb_field_ops bch_sb_field_ops_crypt;
|
||||
|
||||
int bch2_decrypt_sb_key(struct bch_fs *, struct bch_sb_field_crypt *,
|
||||
struct bch_key *);
|
||||
|
||||
#if 0
|
||||
int bch2_disable_encryption(struct bch_fs *);
|
||||
int bch2_enable_encryption(struct bch_fs *, bool);
|
||||
#endif
|
||||
|
||||
void bch2_fs_encryption_exit(struct bch_fs *);
|
||||
int bch2_fs_encryption_init(struct bch_fs *);
|
||||
|
||||
static inline enum bch_csum_type bch2_csum_opt_to_type(enum bch_csum_opt type,
|
||||
bool data)
|
||||
{
|
||||
switch (type) {
|
||||
case BCH_CSUM_OPT_none:
|
||||
return BCH_CSUM_none;
|
||||
case BCH_CSUM_OPT_crc32c:
|
||||
return data ? BCH_CSUM_crc32c : BCH_CSUM_crc32c_nonzero;
|
||||
case BCH_CSUM_OPT_crc64:
|
||||
return data ? BCH_CSUM_crc64 : BCH_CSUM_crc64_nonzero;
|
||||
case BCH_CSUM_OPT_xxhash:
|
||||
return BCH_CSUM_xxhash;
|
||||
default:
|
||||
BUG();
|
||||
}
|
||||
}
|
||||
|
||||
static inline enum bch_csum_type bch2_data_checksum_type(struct bch_fs *c,
|
||||
struct bch_io_opts opts)
|
||||
{
|
||||
if (opts.nocow)
|
||||
return 0;
|
||||
|
||||
if (c->sb.encryption_type)
|
||||
return c->opts.wide_macs
|
||||
? BCH_CSUM_chacha20_poly1305_128
|
||||
: BCH_CSUM_chacha20_poly1305_80;
|
||||
|
||||
return bch2_csum_opt_to_type(opts.data_checksum, true);
|
||||
}
|
||||
|
||||
static inline enum bch_csum_type bch2_meta_checksum_type(struct bch_fs *c)
|
||||
{
|
||||
if (c->sb.encryption_type)
|
||||
return BCH_CSUM_chacha20_poly1305_128;
|
||||
|
||||
return bch2_csum_opt_to_type(c->opts.metadata_checksum, false);
|
||||
}
|
||||
|
||||
static inline bool bch2_checksum_type_valid(const struct bch_fs *c,
|
||||
unsigned type)
|
||||
{
|
||||
if (type >= BCH_CSUM_NR)
|
||||
return false;
|
||||
|
||||
if (bch2_csum_type_is_encryption(type) && !c->chacha20_key_set)
|
||||
return false;
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
/* returns true if not equal */
|
||||
static inline bool bch2_crc_cmp(struct bch_csum l, struct bch_csum r)
|
||||
{
|
||||
/*
|
||||
* XXX: need some way of preventing the compiler from optimizing this
|
||||
* into a form that isn't constant time..
|
||||
*/
|
||||
return ((l.lo ^ r.lo) | (l.hi ^ r.hi)) != 0;
|
||||
}
|
||||
|
||||
/* for skipping ahead and encrypting/decrypting at an offset: */
|
||||
static inline struct nonce nonce_add(struct nonce nonce, unsigned offset)
|
||||
{
|
||||
EBUG_ON(offset & (CHACHA_BLOCK_SIZE - 1));
|
||||
|
||||
le32_add_cpu(&nonce.d[0], offset / CHACHA_BLOCK_SIZE);
|
||||
return nonce;
|
||||
}
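
A worked example for nonce_add() (hedged; CHACHA_BLOCK_SIZE is 64 bytes): skipping ahead 8192 bytes adds 8192 / 64 = 128 to nonce.d[0], which appears to seed the ChaCha20 block counter via bch2_chacha20_init(), so the keystream lines up with the requested offset instead of restarting from the beginning of the extent.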
|
||||
|
||||
static inline struct nonce null_nonce(void)
|
||||
{
|
||||
struct nonce ret;
|
||||
|
||||
memset(&ret, 0, sizeof(ret));
|
||||
return ret;
|
||||
}
|
||||
|
||||
static inline struct nonce extent_nonce(struct bversion version,
|
||||
struct bch_extent_crc_unpacked crc)
|
||||
{
|
||||
unsigned compression_type = crc_is_compressed(crc)
|
||||
? crc.compression_type
|
||||
: 0;
|
||||
unsigned size = compression_type ? crc.uncompressed_size : 0;
|
||||
struct nonce nonce = (struct nonce) {{
|
||||
[0] = cpu_to_le32(size << 22),
|
||||
[1] = cpu_to_le32(version.lo),
|
||||
[2] = cpu_to_le32(version.lo >> 32),
|
||||
[3] = cpu_to_le32(version.hi|
|
||||
(compression_type << 24))^BCH_NONCE_EXTENT,
|
||||
}};
|
||||
|
||||
return nonce_add(nonce, crc.nonce << 9);
|
||||
}
|
||||
|
||||
static inline bool bch2_key_is_encrypted(struct bch_encrypted_key *key)
|
||||
{
|
||||
return le64_to_cpu(key->magic) != BCH_KEY_MAGIC;
|
||||
}
|
||||
|
||||
static inline struct nonce __bch2_sb_key_nonce(struct bch_sb *sb)
|
||||
{
|
||||
__le64 magic = __bch2_sb_magic(sb);
|
||||
|
||||
return (struct nonce) {{
|
||||
[0] = 0,
|
||||
[1] = 0,
|
||||
[2] = ((__le32 *) &magic)[0],
|
||||
[3] = ((__le32 *) &magic)[1],
|
||||
}};
|
||||
}
|
||||
|
||||
static inline struct nonce bch2_sb_key_nonce(struct bch_fs *c)
|
||||
{
|
||||
__le64 magic = bch2_sb_magic(c);
|
||||
|
||||
return (struct nonce) {{
|
||||
[0] = 0,
|
||||
[1] = 0,
|
||||
[2] = ((__le32 *) &magic)[0],
|
||||
[3] = ((__le32 *) &magic)[1],
|
||||
}};
|
||||
}
|
||||
|
||||
#endif /* _BCACHEFS_CHECKSUM_H */
|
||||
@@ -1,181 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
#include "bcachefs.h"
|
||||
#include "clock.h"
|
||||
|
||||
#include <linux/freezer.h>
|
||||
#include <linux/kthread.h>
|
||||
#include <linux/preempt.h>
|
||||
|
||||
static inline bool io_timer_cmp(const void *l, const void *r, void __always_unused *args)
|
||||
{
|
||||
struct io_timer **_l = (struct io_timer **)l;
|
||||
struct io_timer **_r = (struct io_timer **)r;
|
||||
|
||||
return (*_l)->expire < (*_r)->expire;
|
||||
}
|
||||
|
||||
static const struct min_heap_callbacks callbacks = {
|
||||
.less = io_timer_cmp,
|
||||
.swp = NULL,
|
||||
};
|
||||
|
||||
void bch2_io_timer_add(struct io_clock *clock, struct io_timer *timer)
|
||||
{
|
||||
spin_lock(&clock->timer_lock);
|
||||
|
||||
if (time_after_eq64((u64) atomic64_read(&clock->now), timer->expire)) {
|
||||
spin_unlock(&clock->timer_lock);
|
||||
timer->fn(timer);
|
||||
return;
|
||||
}
|
||||
|
||||
for (size_t i = 0; i < clock->timers.nr; i++)
|
||||
if (clock->timers.data[i] == timer)
|
||||
goto out;
|
||||
|
||||
BUG_ON(!min_heap_push(&clock->timers, &timer, &callbacks, NULL));
|
||||
out:
|
||||
spin_unlock(&clock->timer_lock);
|
||||
}
|
||||
|
||||
void bch2_io_timer_del(struct io_clock *clock, struct io_timer *timer)
|
||||
{
|
||||
spin_lock(&clock->timer_lock);
|
||||
|
||||
for (size_t i = 0; i < clock->timers.nr; i++)
|
||||
if (clock->timers.data[i] == timer) {
|
||||
min_heap_del(&clock->timers, i, &callbacks, NULL);
|
||||
break;
|
||||
}
|
||||
|
||||
spin_unlock(&clock->timer_lock);
|
||||
}
|
||||
|
||||
struct io_clock_wait {
|
||||
struct io_timer io_timer;
|
||||
struct task_struct *task;
|
||||
int expired;
|
||||
};
|
||||
|
||||
static void io_clock_wait_fn(struct io_timer *timer)
|
||||
{
|
||||
struct io_clock_wait *wait = container_of(timer,
|
||||
struct io_clock_wait, io_timer);
|
||||
|
||||
wait->expired = 1;
|
||||
wake_up_process(wait->task);
|
||||
}
|
||||
|
||||
void bch2_io_clock_schedule_timeout(struct io_clock *clock, u64 until)
|
||||
{
|
||||
struct io_clock_wait wait = {
|
||||
.io_timer.expire = until,
|
||||
.io_timer.fn = io_clock_wait_fn,
|
||||
.io_timer.fn2 = (void *) _RET_IP_,
|
||||
.task = current,
|
||||
};
|
||||
|
||||
bch2_io_timer_add(clock, &wait.io_timer);
|
||||
schedule();
|
||||
bch2_io_timer_del(clock, &wait.io_timer);
|
||||
}
|
||||
|
||||
unsigned long bch2_kthread_io_clock_wait_once(struct io_clock *clock,
|
||||
u64 io_until, unsigned long cpu_timeout)
|
||||
{
|
||||
bool kthread = (current->flags & PF_KTHREAD) != 0;
|
||||
struct io_clock_wait wait = {
|
||||
.io_timer.expire = io_until,
|
||||
.io_timer.fn = io_clock_wait_fn,
|
||||
.io_timer.fn2 = (void *) _RET_IP_,
|
||||
.task = current,
|
||||
};
|
||||
|
||||
bch2_io_timer_add(clock, &wait.io_timer);
|
||||
|
||||
set_current_state(TASK_INTERRUPTIBLE);
|
||||
if (!(kthread && kthread_should_stop())) {
|
||||
cpu_timeout = schedule_timeout(cpu_timeout);
|
||||
try_to_freeze();
|
||||
}
|
||||
|
||||
__set_current_state(TASK_RUNNING);
|
||||
bch2_io_timer_del(clock, &wait.io_timer);
|
||||
return cpu_timeout;
|
||||
}
|
||||
|
||||
void bch2_kthread_io_clock_wait(struct io_clock *clock,
|
||||
u64 io_until, unsigned long cpu_timeout)
|
||||
{
|
||||
bool kthread = (current->flags & PF_KTHREAD) != 0;
|
||||
|
||||
while (!(kthread && kthread_should_stop()) &&
|
||||
cpu_timeout &&
|
||||
atomic64_read(&clock->now) < io_until)
|
||||
cpu_timeout = bch2_kthread_io_clock_wait_once(clock, io_until, cpu_timeout);
|
||||
}
|
||||
|
||||
static struct io_timer *get_expired_timer(struct io_clock *clock, u64 now)
|
||||
{
|
||||
struct io_timer *ret = NULL;
|
||||
|
||||
if (clock->timers.nr &&
|
||||
time_after_eq64(now, clock->timers.data[0]->expire)) {
|
||||
ret = *min_heap_peek(&clock->timers);
|
||||
min_heap_pop(&clock->timers, &callbacks, NULL);
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
void __bch2_increment_clock(struct io_clock *clock, u64 sectors)
|
||||
{
|
||||
struct io_timer *timer;
|
||||
u64 now = atomic64_add_return(sectors, &clock->now);
|
||||
|
||||
spin_lock(&clock->timer_lock);
|
||||
while ((timer = get_expired_timer(clock, now)))
|
||||
timer->fn(timer);
|
||||
spin_unlock(&clock->timer_lock);
|
||||
}
|
||||
|
||||
void bch2_io_timers_to_text(struct printbuf *out, struct io_clock *clock)
|
||||
{
|
||||
out->atomic++;
|
||||
spin_lock(&clock->timer_lock);
|
||||
u64 now = atomic64_read(&clock->now);
|
||||
|
||||
printbuf_tabstop_push(out, 40);
|
||||
prt_printf(out, "current time:\t%llu\n", now);
|
||||
|
||||
for (unsigned i = 0; i < clock->timers.nr; i++)
|
||||
prt_printf(out, "%ps %ps:\t%llu\n",
|
||||
clock->timers.data[i]->fn,
|
||||
clock->timers.data[i]->fn2,
|
||||
clock->timers.data[i]->expire);
|
||||
spin_unlock(&clock->timer_lock);
|
||||
--out->atomic;
|
||||
}
|
||||
|
||||
void bch2_io_clock_exit(struct io_clock *clock)
|
||||
{
|
||||
free_heap(&clock->timers);
|
||||
free_percpu(clock->pcpu_buf);
|
||||
}
|
||||
|
||||
int bch2_io_clock_init(struct io_clock *clock)
|
||||
{
|
||||
atomic64_set(&clock->now, 0);
|
||||
spin_lock_init(&clock->timer_lock);
|
||||
|
||||
clock->max_slop = IO_CLOCK_PCPU_SECTORS * num_possible_cpus();
|
||||
|
||||
clock->pcpu_buf = alloc_percpu(*clock->pcpu_buf);
|
||||
if (!clock->pcpu_buf)
|
||||
return -BCH_ERR_ENOMEM_io_clock_init;
|
||||
|
||||
if (!init_heap(&clock->timers, NR_IO_TIMERS, GFP_KERNEL))
|
||||
return -BCH_ERR_ENOMEM_io_clock_init;
|
||||
|
||||
return 0;
|
||||
}
|
||||
@@ -1,29 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_CLOCK_H
|
||||
#define _BCACHEFS_CLOCK_H
|
||||
|
||||
void bch2_io_timer_add(struct io_clock *, struct io_timer *);
|
||||
void bch2_io_timer_del(struct io_clock *, struct io_timer *);
|
||||
unsigned long bch2_kthread_io_clock_wait_once(struct io_clock *, u64, unsigned long);
|
||||
void bch2_kthread_io_clock_wait(struct io_clock *, u64, unsigned long);
|
||||
|
||||
void __bch2_increment_clock(struct io_clock *, u64);
|
||||
|
||||
static inline void bch2_increment_clock(struct bch_fs *c, u64 sectors,
|
||||
int rw)
|
||||
{
|
||||
struct io_clock *clock = &c->io_clock[rw];
|
||||
|
||||
if (unlikely(this_cpu_add_return(*clock->pcpu_buf, sectors) >=
|
||||
IO_CLOCK_PCPU_SECTORS))
|
||||
__bch2_increment_clock(clock, this_cpu_xchg(*clock->pcpu_buf, 0));
|
||||
}
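
Rough arithmetic for the percpu batching above: each CPU buffers up to IO_CLOCK_PCPU_SECTORS = 128 sectors before folding them into clock->now, so on a machine with 8 possible CPUs the clock can trail the true total by up to 128 * 8 = 1024 sectors (512 KiB) - the same bound bch2_io_clock_init() stores as max_slop.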
|
||||
|
||||
void bch2_io_clock_schedule_timeout(struct io_clock *, u64);
|
||||
|
||||
void bch2_io_timers_to_text(struct printbuf *, struct io_clock *);
|
||||
|
||||
void bch2_io_clock_exit(struct io_clock *);
|
||||
int bch2_io_clock_init(struct io_clock *);
|
||||
|
||||
#endif /* _BCACHEFS_CLOCK_H */
|
||||
@@ -1,38 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_CLOCK_TYPES_H
|
||||
#define _BCACHEFS_CLOCK_TYPES_H
|
||||
|
||||
#include "util.h"
|
||||
|
||||
#define NR_IO_TIMERS (BCH_SB_MEMBERS_MAX * 3)
|
||||
|
||||
/*
|
||||
* Clocks/timers in units of sectors of IO:
|
||||
*
|
||||
* Note - they use percpu batching, so they're only approximate.
|
||||
*/
|
||||
|
||||
struct io_timer;
|
||||
typedef void (*io_timer_fn)(struct io_timer *);
|
||||
|
||||
struct io_timer {
|
||||
io_timer_fn fn;
|
||||
void *fn2;
|
||||
u64 expire;
|
||||
};
|
||||
|
||||
/* Amount to buffer up on a percpu counter */
|
||||
#define IO_CLOCK_PCPU_SECTORS 128
|
||||
|
||||
typedef DEFINE_MIN_HEAP(struct io_timer *, io_timer_heap) io_timer_heap;
|
||||
|
||||
struct io_clock {
|
||||
atomic64_t now;
|
||||
u16 __percpu *pcpu_buf;
|
||||
unsigned max_slop;
|
||||
|
||||
spinlock_t timer_lock;
|
||||
io_timer_heap timers;
|
||||
};
|
||||
|
||||
#endif /* _BCACHEFS_CLOCK_TYPES_H */
|
||||
@@ -1,773 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
#include "bcachefs.h"
|
||||
#include "checksum.h"
|
||||
#include "compress.h"
|
||||
#include "error.h"
|
||||
#include "extents.h"
|
||||
#include "io_write.h"
|
||||
#include "opts.h"
|
||||
#include "super-io.h"
|
||||
|
||||
#include <linux/lz4.h>
|
||||
#include <linux/zlib.h>
|
||||
#include <linux/zstd.h>
|
||||
|
||||
static inline enum bch_compression_opts bch2_compression_type_to_opt(enum bch_compression_type type)
|
||||
{
|
||||
switch (type) {
|
||||
case BCH_COMPRESSION_TYPE_none:
|
||||
case BCH_COMPRESSION_TYPE_incompressible:
|
||||
return BCH_COMPRESSION_OPT_none;
|
||||
case BCH_COMPRESSION_TYPE_lz4_old:
|
||||
case BCH_COMPRESSION_TYPE_lz4:
|
||||
return BCH_COMPRESSION_OPT_lz4;
|
||||
case BCH_COMPRESSION_TYPE_gzip:
|
||||
return BCH_COMPRESSION_OPT_gzip;
|
||||
case BCH_COMPRESSION_TYPE_zstd:
|
||||
return BCH_COMPRESSION_OPT_zstd;
|
||||
default:
|
||||
BUG();
|
||||
}
|
||||
}
|
||||
|
||||
/* Bounce buffer: */
|
||||
struct bbuf {
|
||||
void *b;
|
||||
enum {
|
||||
BB_NONE,
|
||||
BB_VMAP,
|
||||
BB_KMALLOC,
|
||||
BB_MEMPOOL,
|
||||
} type;
|
||||
int rw;
|
||||
};
|
||||
|
||||
static struct bbuf __bounce_alloc(struct bch_fs *c, unsigned size, int rw)
|
||||
{
|
||||
void *b;
|
||||
|
||||
BUG_ON(size > c->opts.encoded_extent_max);
|
||||
|
||||
b = kmalloc(size, GFP_NOFS|__GFP_NOWARN);
|
||||
if (b)
|
||||
return (struct bbuf) { .b = b, .type = BB_KMALLOC, .rw = rw };
|
||||
|
||||
b = mempool_alloc(&c->compression_bounce[rw], GFP_NOFS);
|
||||
if (b)
|
||||
return (struct bbuf) { .b = b, .type = BB_MEMPOOL, .rw = rw };
|
||||
|
||||
BUG();
|
||||
}
|
||||
|
||||
static bool bio_phys_contig(struct bio *bio, struct bvec_iter start)
|
||||
{
|
||||
struct bio_vec bv;
|
||||
struct bvec_iter iter;
|
||||
void *expected_start = NULL;
|
||||
|
||||
__bio_for_each_bvec(bv, bio, iter, start) {
|
||||
if (expected_start &&
|
||||
expected_start != page_address(bv.bv_page) + bv.bv_offset)
|
||||
return false;
|
||||
|
||||
expected_start = page_address(bv.bv_page) +
|
||||
bv.bv_offset + bv.bv_len;
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
static struct bbuf __bio_map_or_bounce(struct bch_fs *c, struct bio *bio,
|
||||
struct bvec_iter start, int rw)
|
||||
{
|
||||
struct bbuf ret;
|
||||
struct bio_vec bv;
|
||||
struct bvec_iter iter;
|
||||
unsigned nr_pages = 0;
|
||||
struct page *stack_pages[16];
|
||||
struct page **pages = NULL;
|
||||
void *data;
|
||||
|
||||
BUG_ON(start.bi_size > c->opts.encoded_extent_max);
|
||||
|
||||
if (!PageHighMem(bio_iter_page(bio, start)) &&
|
||||
bio_phys_contig(bio, start))
|
||||
return (struct bbuf) {
|
||||
.b = page_address(bio_iter_page(bio, start)) +
|
||||
bio_iter_offset(bio, start),
|
||||
.type = BB_NONE, .rw = rw
|
||||
};
|
||||
|
||||
/* check if we can map the pages contiguously: */
|
||||
__bio_for_each_segment(bv, bio, iter, start) {
|
||||
if (iter.bi_size != start.bi_size &&
|
||||
bv.bv_offset)
|
||||
goto bounce;
|
||||
|
||||
if (bv.bv_len < iter.bi_size &&
|
||||
bv.bv_offset + bv.bv_len < PAGE_SIZE)
|
||||
goto bounce;
|
||||
|
||||
nr_pages++;
|
||||
}
|
||||
|
||||
BUG_ON(DIV_ROUND_UP(start.bi_size, PAGE_SIZE) > nr_pages);
|
||||
|
||||
pages = nr_pages > ARRAY_SIZE(stack_pages)
|
||||
? kmalloc_array(nr_pages, sizeof(struct page *), GFP_NOFS)
|
||||
: stack_pages;
|
||||
if (!pages)
|
||||
goto bounce;
|
||||
|
||||
nr_pages = 0;
|
||||
__bio_for_each_segment(bv, bio, iter, start)
|
||||
pages[nr_pages++] = bv.bv_page;
|
||||
|
||||
data = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
|
||||
if (pages != stack_pages)
|
||||
kfree(pages);
|
||||
|
||||
if (data)
|
||||
return (struct bbuf) {
|
||||
.b = data + bio_iter_offset(bio, start),
|
||||
.type = BB_VMAP, .rw = rw
|
||||
};
|
||||
bounce:
|
||||
ret = __bounce_alloc(c, start.bi_size, rw);
|
||||
|
||||
if (rw == READ)
|
||||
memcpy_from_bio(ret.b, bio, start);
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static struct bbuf bio_map_or_bounce(struct bch_fs *c, struct bio *bio, int rw)
|
||||
{
|
||||
return __bio_map_or_bounce(c, bio, bio->bi_iter, rw);
|
||||
}
|
||||
|
||||
static void bio_unmap_or_unbounce(struct bch_fs *c, struct bbuf buf)
|
||||
{
|
||||
switch (buf.type) {
|
||||
case BB_NONE:
|
||||
break;
|
||||
case BB_VMAP:
|
||||
vunmap((void *) ((unsigned long) buf.b & PAGE_MASK));
|
||||
break;
|
||||
case BB_KMALLOC:
|
||||
kfree(buf.b);
|
||||
break;
|
||||
case BB_MEMPOOL:
|
||||
mempool_free(buf.b, &c->compression_bounce[buf.rw]);
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
static inline void zlib_set_workspace(z_stream *strm, void *workspace)
|
||||
{
|
||||
#ifdef __KERNEL__
|
||||
strm->workspace = workspace;
|
||||
#endif
|
||||
}
|
||||
|
||||
static int __bio_uncompress(struct bch_fs *c, struct bio *src,
|
||||
void *dst_data, struct bch_extent_crc_unpacked crc)
|
||||
{
|
||||
struct bbuf src_data = { NULL };
|
||||
size_t src_len = src->bi_iter.bi_size;
|
||||
size_t dst_len = crc.uncompressed_size << 9;
|
||||
void *workspace;
|
||||
int ret = 0, ret2;
|
||||
|
||||
enum bch_compression_opts opt = bch2_compression_type_to_opt(crc.compression_type);
|
||||
mempool_t *workspace_pool = &c->compress_workspace[opt];
|
||||
if (unlikely(!mempool_initialized(workspace_pool))) {
|
||||
if (fsck_err(c, compression_type_not_marked_in_sb,
|
||||
"compression type %s set but not marked in superblock",
|
||||
__bch2_compression_types[crc.compression_type]))
|
||||
ret = bch2_check_set_has_compressed_data(c, opt);
|
||||
else
|
||||
ret = bch_err_throw(c, compression_workspace_not_initialized);
|
||||
if (ret)
|
||||
goto err;
|
||||
}
|
||||
|
||||
src_data = bio_map_or_bounce(c, src, READ);
|
||||
|
||||
switch (crc.compression_type) {
|
||||
case BCH_COMPRESSION_TYPE_lz4_old:
|
||||
case BCH_COMPRESSION_TYPE_lz4:
|
||||
ret2 = LZ4_decompress_safe_partial(src_data.b, dst_data,
|
||||
src_len, dst_len, dst_len);
|
||||
if (ret2 != dst_len)
|
||||
ret = bch_err_throw(c, decompress_lz4);
|
||||
break;
|
||||
case BCH_COMPRESSION_TYPE_gzip: {
|
||||
z_stream strm = {
|
||||
.next_in = src_data.b,
|
||||
.avail_in = src_len,
|
||||
.next_out = dst_data,
|
||||
.avail_out = dst_len,
|
||||
};
|
||||
|
||||
workspace = mempool_alloc(workspace_pool, GFP_NOFS);
|
||||
|
||||
zlib_set_workspace(&strm, workspace);
|
||||
zlib_inflateInit2(&strm, -MAX_WBITS);
|
||||
ret2 = zlib_inflate(&strm, Z_FINISH);
|
||||
|
||||
mempool_free(workspace, workspace_pool);
|
||||
|
||||
if (ret2 != Z_STREAM_END)
|
||||
ret = bch_err_throw(c, decompress_gzip);
|
||||
break;
|
||||
}
|
||||
case BCH_COMPRESSION_TYPE_zstd: {
|
||||
ZSTD_DCtx *ctx;
|
||||
size_t real_src_len = le32_to_cpup(src_data.b);
|
||||
|
||||
if (real_src_len > src_len - 4) {
|
||||
ret = bch_err_throw(c, decompress_zstd_src_len_bad);
|
||||
goto err;
|
||||
}
|
||||
|
||||
workspace = mempool_alloc(workspace_pool, GFP_NOFS);
|
||||
ctx = zstd_init_dctx(workspace, zstd_dctx_workspace_bound());
|
||||
|
||||
ret2 = zstd_decompress_dctx(ctx,
|
||||
dst_data, dst_len,
|
||||
src_data.b + 4, real_src_len);
|
||||
|
||||
mempool_free(workspace, workspace_pool);
|
||||
|
||||
if (ret2 != dst_len)
|
||||
ret = bch_err_throw(c, decompress_zstd);
|
||||
break;
|
||||
}
|
||||
default:
|
||||
BUG();
|
||||
}
|
||||
err:
|
||||
fsck_err:
|
||||
bio_unmap_or_unbounce(c, src_data);
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_bio_uncompress_inplace(struct bch_write_op *op,
|
||||
struct bio *bio)
|
||||
{
|
||||
struct bch_fs *c = op->c;
|
||||
struct bch_extent_crc_unpacked *crc = &op->crc;
|
||||
struct bbuf data = { NULL };
|
||||
size_t dst_len = crc->uncompressed_size << 9;
|
||||
int ret = 0;
|
||||
|
||||
/* bio must own its pages: */
|
||||
BUG_ON(!bio->bi_vcnt);
|
||||
BUG_ON(DIV_ROUND_UP(crc->live_size, PAGE_SECTORS) > bio->bi_max_vecs);
|
||||
|
||||
if (crc->uncompressed_size << 9 > c->opts.encoded_extent_max) {
|
||||
bch2_write_op_error(op, op->pos.offset,
|
||||
"extent too big to decompress (%u > %u)",
|
||||
crc->uncompressed_size << 9, c->opts.encoded_extent_max);
|
||||
return bch_err_throw(c, decompress_exceeded_max_encoded_extent);
|
||||
}
|
||||
|
||||
data = __bounce_alloc(c, dst_len, WRITE);
|
||||
|
||||
ret = __bio_uncompress(c, bio, data.b, *crc);
|
||||
|
||||
if (c->opts.no_data_io)
|
||||
ret = 0;
|
||||
|
||||
if (ret) {
|
||||
bch2_write_op_error(op, op->pos.offset, "%s", bch2_err_str(ret));
|
||||
goto err;
|
||||
}
|
||||
|
||||
/*
|
||||
* XXX: don't have a good way to assert that the bio was allocated with
|
||||
* enough space, we depend on bch2_move_extent doing the right thing
|
||||
*/
|
||||
bio->bi_iter.bi_size = crc->live_size << 9;
|
||||
|
||||
memcpy_to_bio(bio, bio->bi_iter, data.b + (crc->offset << 9));
|
||||
|
||||
crc->csum_type = 0;
|
||||
crc->compression_type = 0;
|
||||
crc->compressed_size = crc->live_size;
|
||||
crc->uncompressed_size = crc->live_size;
|
||||
crc->offset = 0;
|
||||
crc->csum = (struct bch_csum) { 0, 0 };
|
||||
err:
|
||||
bio_unmap_or_unbounce(c, data);
|
||||
return ret;
|
||||
}
|
||||
|
||||
int bch2_bio_uncompress(struct bch_fs *c, struct bio *src,
|
||||
struct bio *dst, struct bvec_iter dst_iter,
|
||||
struct bch_extent_crc_unpacked crc)
|
||||
{
|
||||
struct bbuf dst_data = { NULL };
|
||||
size_t dst_len = crc.uncompressed_size << 9;
|
||||
int ret;
|
||||
|
||||
if (crc.uncompressed_size << 9 > c->opts.encoded_extent_max ||
|
||||
crc.compressed_size << 9 > c->opts.encoded_extent_max)
|
||||
return bch_err_throw(c, decompress_exceeded_max_encoded_extent);
|
||||
|
||||
dst_data = dst_len == dst_iter.bi_size
|
||||
? __bio_map_or_bounce(c, dst, dst_iter, WRITE)
|
||||
: __bounce_alloc(c, dst_len, WRITE);
|
||||
|
||||
ret = __bio_uncompress(c, src, dst_data.b, crc);
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
if (dst_data.type != BB_NONE &&
|
||||
dst_data.type != BB_VMAP)
|
||||
memcpy_to_bio(dst, dst_iter, dst_data.b + (crc.offset << 9));
|
||||
err:
|
||||
bio_unmap_or_unbounce(c, dst_data);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int attempt_compress(struct bch_fs *c,
|
||||
void *workspace,
|
||||
void *dst, size_t dst_len,
|
||||
void *src, size_t src_len,
|
||||
struct bch_compression_opt compression)
|
||||
{
|
||||
enum bch_compression_type compression_type =
|
||||
__bch2_compression_opt_to_type[compression.type];
|
||||
|
||||
switch (compression_type) {
|
||||
case BCH_COMPRESSION_TYPE_lz4:
|
||||
if (compression.level < LZ4HC_MIN_CLEVEL) {
|
||||
int len = src_len;
|
||||
int ret = LZ4_compress_destSize(
|
||||
src, dst,
|
||||
&len, dst_len,
|
||||
workspace);
|
||||
if (len < src_len)
|
||||
return -len;
|
||||
|
||||
return ret;
|
||||
} else {
|
||||
int ret = LZ4_compress_HC(
|
||||
src, dst,
|
||||
src_len, dst_len,
|
||||
compression.level,
|
||||
workspace);
|
||||
|
||||
return ret ?: -1;
|
||||
}
|
||||
case BCH_COMPRESSION_TYPE_gzip: {
|
||||
z_stream strm = {
|
||||
.next_in = src,
|
||||
.avail_in = src_len,
|
||||
.next_out = dst,
|
||||
.avail_out = dst_len,
|
||||
};
|
||||
|
||||
zlib_set_workspace(&strm, workspace);
|
||||
if (zlib_deflateInit2(&strm,
|
||||
compression.level
|
||||
? clamp_t(unsigned, compression.level,
|
||||
Z_BEST_SPEED, Z_BEST_COMPRESSION)
|
||||
: Z_DEFAULT_COMPRESSION,
|
||||
Z_DEFLATED, -MAX_WBITS, DEF_MEM_LEVEL,
|
||||
Z_DEFAULT_STRATEGY) != Z_OK)
|
||||
return 0;
|
||||
|
||||
if (zlib_deflate(&strm, Z_FINISH) != Z_STREAM_END)
|
||||
return 0;
|
||||
|
||||
if (zlib_deflateEnd(&strm) != Z_OK)
|
||||
return 0;
|
||||
|
||||
return strm.total_out;
|
||||
}
|
||||
case BCH_COMPRESSION_TYPE_zstd: {
|
||||
/*
|
||||
* rescale:
|
||||
* zstd max compression level is 22, our max level is 15
|
||||
*/
|
||||
unsigned level = min((compression.level * 3) / 2, zstd_max_clevel());
|
||||
ZSTD_parameters params = zstd_get_params(level, c->opts.encoded_extent_max);
|
||||
ZSTD_CCtx *ctx = zstd_init_cctx(workspace, c->zstd_workspace_size);
|
||||
|
||||
/*
|
||||
* ZSTD requires that when we decompress we pass in the exact
|
||||
* compressed size - rounding it up to the nearest sector
|
||||
* doesn't work, so we use the first 4 bytes of the buffer for
|
||||
* that.
|
||||
*
|
||||
* Additionally, the ZSTD code seems to have a bug where it will
|
||||
* write just past the end of the buffer - so subtract a fudge
|
||||
* factor (7 bytes) from the dst buffer size to account for
|
||||
* that.
|
||||
*/
|
||||
size_t len = zstd_compress_cctx(ctx,
|
||||
dst + 4, dst_len - 4 - 7,
|
||||
src, src_len,
|
||||
¶ms);
|
||||
if (zstd_is_error(len))
|
||||
return 0;
|
||||
|
||||
*((__le32 *) dst) = cpu_to_le32(len);
|
||||
return len + 4;
|
||||
}
|
||||
default:
|
||||
BUG();
|
||||
}
|
||||
}
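
A sketch of the on-disk layout the zstd branch above produces (restating the in-code comment; offsets in bytes):

	[0..3]  cpu_to_le32(compressed length)
	[4.. ]  zstd frame, at most dst_len - 4 - 7 bytes

__bio_uncompress() reads the le32 back with le32_to_cpup() and passes src_data.b + 4 with that exact length to zstd_decompress_dctx(), since zstd needs the precise compressed size rather than a sector-rounded one.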
|
||||
|
||||
static unsigned __bio_compress(struct bch_fs *c,
|
||||
struct bio *dst, size_t *dst_len,
|
||||
struct bio *src, size_t *src_len,
|
||||
struct bch_compression_opt compression)
|
||||
{
|
||||
struct bbuf src_data = { NULL }, dst_data = { NULL };
|
||||
void *workspace;
|
||||
enum bch_compression_type compression_type =
|
||||
__bch2_compression_opt_to_type[compression.type];
|
||||
unsigned pad;
|
||||
int ret = 0;
|
||||
|
||||
/* bch2_compression_decode catches unknown compression types: */
|
||||
BUG_ON(compression.type >= BCH_COMPRESSION_OPT_NR);
|
||||
|
||||
mempool_t *workspace_pool = &c->compress_workspace[compression.type];
|
||||
if (unlikely(!mempool_initialized(workspace_pool))) {
|
||||
if (fsck_err(c, compression_opt_not_marked_in_sb,
|
||||
"compression opt %s set but not marked in superblock",
|
||||
bch2_compression_opts[compression.type])) {
|
||||
ret = bch2_check_set_has_compressed_data(c, compression.type);
|
||||
if (ret) /* memory allocation failure, don't compress */
|
||||
return 0;
|
||||
} else {
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
/* If it's only one block, don't bother trying to compress: */
|
||||
if (src->bi_iter.bi_size <= c->opts.block_size)
|
||||
return BCH_COMPRESSION_TYPE_incompressible;
|
||||
|
||||
dst_data = bio_map_or_bounce(c, dst, WRITE);
|
||||
src_data = bio_map_or_bounce(c, src, READ);
|
||||
|
||||
workspace = mempool_alloc(workspace_pool, GFP_NOFS);
|
||||
|
||||
*src_len = src->bi_iter.bi_size;
|
||||
*dst_len = dst->bi_iter.bi_size;
|
||||
|
||||
/*
|
||||
* XXX: this algorithm sucks when the compression code doesn't tell us
|
||||
* how much would fit, like LZ4 does:
|
||||
*/
|
||||
while (1) {
|
||||
if (*src_len <= block_bytes(c)) {
|
||||
ret = -1;
|
||||
break;
|
||||
}
|
||||
|
||||
ret = attempt_compress(c, workspace,
|
||||
dst_data.b, *dst_len,
|
||||
src_data.b, *src_len,
|
||||
compression);
|
||||
if (ret > 0) {
|
||||
*dst_len = ret;
|
||||
ret = 0;
|
||||
break;
|
||||
}
|
||||
|
||||
/* Didn't fit: should we retry with a smaller amount? */
|
||||
if (*src_len <= *dst_len) {
|
||||
ret = -1;
|
||||
break;
|
||||
}
|
||||
|
||||
/*
|
||||
* If ret is negative, it's a hint as to how much data would fit
|
||||
*/
|
||||
BUG_ON(-ret >= *src_len);
|
||||
|
||||
if (ret < 0)
|
||||
*src_len = -ret;
|
||||
else
|
||||
*src_len -= (*src_len - *dst_len) / 2;
|
||||
*src_len = round_down(*src_len, block_bytes(c));
|
||||
}
|
||||
|
||||
mempool_free(workspace, workspace_pool);
|
||||
|
||||
if (ret)
|
||||
goto err;
|
||||
|
||||
/* Didn't get smaller: */
|
||||
if (round_up(*dst_len, block_bytes(c)) >= *src_len)
|
||||
goto err;
|
||||
|
||||
pad = round_up(*dst_len, block_bytes(c)) - *dst_len;
|
||||
|
||||
memset(dst_data.b + *dst_len, 0, pad);
|
||||
*dst_len += pad;
|
||||
|
||||
if (dst_data.type != BB_NONE &&
|
||||
dst_data.type != BB_VMAP)
|
||||
memcpy_to_bio(dst, dst->bi_iter, dst_data.b);
|
||||
|
||||
BUG_ON(!*dst_len || *dst_len > dst->bi_iter.bi_size);
|
||||
BUG_ON(!*src_len || *src_len > src->bi_iter.bi_size);
|
||||
BUG_ON(*dst_len & (block_bytes(c) - 1));
|
||||
BUG_ON(*src_len & (block_bytes(c) - 1));
|
||||
ret = compression_type;
|
||||
out:
|
||||
bio_unmap_or_unbounce(c, src_data);
|
||||
bio_unmap_or_unbounce(c, dst_data);
|
||||
return ret;
|
||||
err:
|
||||
ret = BCH_COMPRESSION_TYPE_incompressible;
|
||||
goto out;
|
||||
fsck_err:
|
||||
ret = 0;
|
||||
goto out;
|
||||
}
|
||||
|
||||
unsigned bch2_bio_compress(struct bch_fs *c,
|
||||
struct bio *dst, size_t *dst_len,
|
||||
struct bio *src, size_t *src_len,
|
||||
unsigned compression_opt)
|
||||
{
|
||||
unsigned orig_dst = dst->bi_iter.bi_size;
|
||||
unsigned orig_src = src->bi_iter.bi_size;
|
||||
unsigned compression_type;
|
||||
|
||||
/* Don't consume more than BCH_ENCODED_EXTENT_MAX from @src: */
|
||||
src->bi_iter.bi_size = min_t(unsigned, src->bi_iter.bi_size,
|
||||
c->opts.encoded_extent_max);
|
||||
/* Don't generate a bigger output than input: */
|
||||
dst->bi_iter.bi_size = min(dst->bi_iter.bi_size, src->bi_iter.bi_size);
|
||||
|
||||
compression_type =
|
||||
__bio_compress(c, dst, dst_len, src, src_len,
|
||||
bch2_compression_decode(compression_opt));
|
||||
|
||||
dst->bi_iter.bi_size = orig_dst;
|
||||
src->bi_iter.bi_size = orig_src;
|
||||
return compression_type;
|
||||
}
|
||||
|
||||
static int __bch2_fs_compress_init(struct bch_fs *, u64);
|
||||
|
||||
#define BCH_FEATURE_none 0
|
||||
|
||||
static const unsigned bch2_compression_opt_to_feature[] = {
|
||||
#define x(t, n) [BCH_COMPRESSION_OPT_##t] = BCH_FEATURE_##t,
|
||||
BCH_COMPRESSION_OPTS()
|
||||
#undef x
|
||||
};
|
||||
|
||||
#undef BCH_FEATURE_none
|
||||
|
||||
static int __bch2_check_set_has_compressed_data(struct bch_fs *c, u64 f)
|
||||
{
|
||||
int ret = 0;
|
||||
|
||||
if ((c->sb.features & f) == f)
|
||||
return 0;
|
||||
|
||||
mutex_lock(&c->sb_lock);
|
||||
|
||||
if ((c->sb.features & f) == f) {
|
||||
mutex_unlock(&c->sb_lock);
|
||||
return 0;
|
||||
}
|
||||
|
||||
ret = __bch2_fs_compress_init(c, c->sb.features|f);
|
||||
if (ret) {
|
||||
mutex_unlock(&c->sb_lock);
|
||||
return ret;
|
||||
}
|
||||
|
||||
c->disk_sb.sb->features[0] |= cpu_to_le64(f);
|
||||
bch2_write_super(c);
|
||||
mutex_unlock(&c->sb_lock);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
int bch2_check_set_has_compressed_data(struct bch_fs *c,
|
||||
unsigned compression_opt)
|
||||
{
|
||||
unsigned compression_type = bch2_compression_decode(compression_opt).type;
|
||||
|
||||
BUG_ON(compression_type >= ARRAY_SIZE(bch2_compression_opt_to_feature));
|
||||
|
||||
return compression_type
|
||||
? __bch2_check_set_has_compressed_data(c,
|
||||
1ULL << bch2_compression_opt_to_feature[compression_type])
|
||||
: 0;
|
||||
}
|
||||
|
||||
void bch2_fs_compress_exit(struct bch_fs *c)
|
||||
{
|
||||
unsigned i;
|
||||
|
||||
for (i = 0; i < ARRAY_SIZE(c->compress_workspace); i++)
|
||||
mempool_exit(&c->compress_workspace[i]);
|
||||
mempool_exit(&c->compression_bounce[WRITE]);
|
||||
mempool_exit(&c->compression_bounce[READ]);
|
||||
}
|
||||
|
||||
static int __bch2_fs_compress_init(struct bch_fs *c, u64 features)
|
||||
{
|
||||
ZSTD_parameters params = zstd_get_params(zstd_max_clevel(),
|
||||
c->opts.encoded_extent_max);
|
||||
|
||||
c->zstd_workspace_size = zstd_cctx_workspace_bound(¶ms.cParams);
|
||||
|
||||
struct {
|
||||
unsigned feature;
|
||||
enum bch_compression_opts type;
|
||||
size_t compress_workspace;
|
||||
} compression_types[] = {
|
||||
{ BCH_FEATURE_lz4, BCH_COMPRESSION_OPT_lz4,
|
||||
max_t(size_t, LZ4_MEM_COMPRESS, LZ4HC_MEM_COMPRESS) },
|
||||
{ BCH_FEATURE_gzip, BCH_COMPRESSION_OPT_gzip,
|
||||
max(zlib_deflate_workspacesize(MAX_WBITS, DEF_MEM_LEVEL),
|
||||
zlib_inflate_workspacesize()) },
|
||||
{ BCH_FEATURE_zstd, BCH_COMPRESSION_OPT_zstd,
|
||||
max(c->zstd_workspace_size,
|
||||
zstd_dctx_workspace_bound()) },
|
||||
}, *i;
|
||||
bool have_compressed = false;
|
||||
|
||||
for (i = compression_types;
|
||||
i < compression_types + ARRAY_SIZE(compression_types);
|
||||
i++)
|
||||
have_compressed |= (features & (1 << i->feature)) != 0;
|
||||
|
||||
if (!have_compressed)
|
||||
return 0;
|
||||
|
||||
if (!mempool_initialized(&c->compression_bounce[READ]) &&
|
||||
mempool_init_kvmalloc_pool(&c->compression_bounce[READ],
|
||||
1, c->opts.encoded_extent_max))
|
||||
return bch_err_throw(c, ENOMEM_compression_bounce_read_init);
|
||||
|
||||
if (!mempool_initialized(&c->compression_bounce[WRITE]) &&
|
||||
mempool_init_kvmalloc_pool(&c->compression_bounce[WRITE],
|
||||
1, c->opts.encoded_extent_max))
|
||||
return bch_err_throw(c, ENOMEM_compression_bounce_write_init);
|
||||
|
||||
for (i = compression_types;
|
||||
i < compression_types + ARRAY_SIZE(compression_types);
|
||||
i++) {
|
||||
if (!(features & (1 << i->feature)))
|
||||
continue;
|
||||
|
||||
if (mempool_initialized(&c->compress_workspace[i->type]))
|
||||
continue;
|
||||
|
||||
if (mempool_init_kvmalloc_pool(
|
||||
&c->compress_workspace[i->type],
|
||||
1, i->compress_workspace))
|
||||
return bch_err_throw(c, ENOMEM_compression_workspace_init);
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static u64 compression_opt_to_feature(unsigned v)
|
||||
{
|
||||
unsigned type = bch2_compression_decode(v).type;
|
||||
|
||||
return BIT_ULL(bch2_compression_opt_to_feature[type]);
|
||||
}
|
||||
|
||||
int bch2_fs_compress_init(struct bch_fs *c)
|
||||
{
|
||||
u64 f = c->sb.features;
|
||||
|
||||
f |= compression_opt_to_feature(c->opts.compression);
|
||||
f |= compression_opt_to_feature(c->opts.background_compression);
|
||||
|
||||
return __bch2_fs_compress_init(c, f);
|
||||
}
|
||||
|
||||
int bch2_opt_compression_parse(struct bch_fs *c, const char *_val, u64 *res,
|
||||
struct printbuf *err)
|
||||
{
|
||||
char *val = kstrdup(_val, GFP_KERNEL);
|
||||
char *p = val, *type_str, *level_str;
|
||||
struct bch_compression_opt opt = { 0 };
|
||||
int ret;
|
||||
|
||||
if (!val)
|
||||
return -ENOMEM;
|
||||
|
||||
type_str = strsep(&p, ":");
|
||||
level_str = p;
|
||||
|
||||
ret = match_string(bch2_compression_opts, -1, type_str);
|
||||
if (ret < 0 && err)
|
||||
prt_printf(err, "invalid compression type\n");
|
||||
if (ret < 0)
|
||||
goto err;
|
||||
|
||||
opt.type = ret;
|
||||
|
||||
if (level_str) {
|
||||
unsigned level;
|
||||
|
||||
ret = kstrtouint(level_str, 10, &level);
|
||||
if (!ret && !opt.type && level)
|
||||
ret = -EINVAL;
|
||||
if (!ret && level > 15)
|
||||
ret = -EINVAL;
|
||||
if (ret < 0 && err)
|
||||
prt_printf(err, "invalid compression level\n");
|
||||
if (ret < 0)
|
||||
goto err;
|
||||
|
||||
opt.level = level;
|
||||
}
|
||||
|
||||
*res = bch2_compression_encode(opt);
|
||||
err:
|
||||
kfree(val);
|
||||
return ret;
|
||||
}
|
||||
|
||||
void bch2_compression_opt_to_text(struct printbuf *out, u64 v)
|
||||
{
|
||||
struct bch_compression_opt opt = bch2_compression_decode(v);
|
||||
|
||||
if (opt.type < BCH_COMPRESSION_OPT_NR)
|
||||
prt_str(out, bch2_compression_opts[opt.type]);
|
||||
else
|
||||
prt_printf(out, "(unknown compression opt %u)", opt.type);
|
||||
if (opt.level)
|
||||
prt_printf(out, ":%u", opt.level);
|
||||
}
|
||||
|
||||
void bch2_opt_compression_to_text(struct printbuf *out,
|
||||
struct bch_fs *c,
|
||||
struct bch_sb *sb,
|
||||
u64 v)
|
||||
{
|
||||
return bch2_compression_opt_to_text(out, v);
|
||||
}
|
||||
|
||||
int bch2_opt_compression_validate(u64 v, struct printbuf *err)
|
||||
{
|
||||
if (!bch2_compression_opt_valid(v)) {
|
||||
prt_printf(err, "invalid compression opt %llu", v);
|
||||
return -BCH_ERR_invalid_sb_opt_compression;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
@@ -1,73 +0,0 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
#ifndef _BCACHEFS_COMPRESS_H
|
||||
#define _BCACHEFS_COMPRESS_H
|
||||
|
||||
#include "extents_types.h"
|
||||
|
||||
static const unsigned __bch2_compression_opt_to_type[] = {
|
||||
#define x(t, n) [BCH_COMPRESSION_OPT_##t] = BCH_COMPRESSION_TYPE_##t,
|
||||
BCH_COMPRESSION_OPTS()
|
||||
#undef x
|
||||
};
|
||||
|
||||
struct bch_compression_opt {
|
||||
u8 type:4,
|
||||
level:4;
|
||||
};
|
||||
|
||||
static inline struct bch_compression_opt __bch2_compression_decode(unsigned v)
|
||||
{
|
||||
return (struct bch_compression_opt) {
|
||||
.type = v & 15,
|
||||
.level = v >> 4,
|
||||
};
|
||||
}
|
||||
|
||||
static inline bool bch2_compression_opt_valid(unsigned v)
|
||||
{
|
||||
struct bch_compression_opt opt = __bch2_compression_decode(v);
|
||||
|
||||
return opt.type < ARRAY_SIZE(__bch2_compression_opt_to_type) && !(!opt.type && opt.level);
|
||||
}
|
||||
|
||||
static inline struct bch_compression_opt bch2_compression_decode(unsigned v)
|
||||
{
|
||||
return bch2_compression_opt_valid(v)
|
||||
? __bch2_compression_decode(v)
|
||||
: (struct bch_compression_opt) { 0 };
|
||||
}
|
||||
|
||||
static inline unsigned bch2_compression_encode(struct bch_compression_opt opt)
|
||||
{
|
||||
return opt.type|(opt.level << 4);
|
||||
}
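
A small worked example of the 4+4 bit packing above (the numeric enum values are deliberately left out): an option with type t and level 5 encodes as t | (5 << 4); __bch2_compression_decode() recovers the pair as v & 15 and v >> 4, which is also why bch2_opt_compression_parse() in compress.c rejects levels above 15.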
|
||||
|
||||
static inline enum bch_compression_type bch2_compression_opt_to_type(unsigned v)
|
||||
{
|
||||
return __bch2_compression_opt_to_type[bch2_compression_decode(v).type];
|
||||
}
|
||||
|
||||
struct bch_write_op;
|
||||
int bch2_bio_uncompress_inplace(struct bch_write_op *, struct bio *);
|
||||
int bch2_bio_uncompress(struct bch_fs *, struct bio *, struct bio *,
|
||||
struct bvec_iter, struct bch_extent_crc_unpacked);
|
||||
unsigned bch2_bio_compress(struct bch_fs *, struct bio *, size_t *,
|
||||
struct bio *, size_t *, unsigned);
|
||||
|
||||
int bch2_check_set_has_compressed_data(struct bch_fs *, unsigned);
|
||||
void bch2_fs_compress_exit(struct bch_fs *);
|
||||
int bch2_fs_compress_init(struct bch_fs *);
|
||||
|
||||
void bch2_compression_opt_to_text(struct printbuf *, u64);
|
||||
|
||||
int bch2_opt_compression_parse(struct bch_fs *, const char *, u64 *, struct printbuf *);
|
||||
void bch2_opt_compression_to_text(struct printbuf *, struct bch_fs *, struct bch_sb *, u64);
|
||||
int bch2_opt_compression_validate(u64, struct printbuf *);
|
||||
|
||||
#define bch2_opt_compression (struct bch_opt_fn) { \
|
||||
.parse = bch2_opt_compression_parse, \
|
||||
.to_text = bch2_opt_compression_to_text, \
|
||||
.validate = bch2_opt_compression_validate, \
|
||||
}
|
||||
|
||||
#endif /* _BCACHEFS_COMPRESS_H */
|
||||
@@ -1,38 +0,0 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
#include <linux/log2.h>
|
||||
#include <linux/slab.h>
|
||||
#include <linux/vmalloc.h>
|
||||
#include "darray.h"
|
||||
|
||||
int __bch2_darray_resize_noprof(darray_char *d, size_t element_size, size_t new_size, gfp_t gfp)
|
||||
{
|
||||
if (new_size > d->size) {
|
||||
new_size = roundup_pow_of_two(new_size);
|
||||
|
||||
/*
|
||||
* This is a workaround: kvmalloc() doesn't support > INT_MAX
|
||||
* allocations, but vmalloc() does.
|
||||
* The limit needs to be lifted from kvmalloc, and when it does
|
||||
* we'll go back to just using that.
|
||||
*/
|
||||
size_t bytes;
|
||||
if (unlikely(check_mul_overflow(new_size, element_size, &bytes)))
|
||||
return -ENOMEM;
|
||||
|
||||
void *data = likely(bytes < INT_MAX)
|
||||
? kvmalloc_noprof(bytes, gfp)
|
||||
: vmalloc_noprof(bytes);
|
||||
if (!data)
|
||||
return -ENOMEM;
|
||||
|
||||
if (d->size)
|
||||
memcpy(data, d->data, d->size * element_size);
|
||||
if (d->data != d->preallocated)
|
||||
kvfree(d->data);
|
||||
d->data = data;
|
||||
d->size = new_size;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
Some files were not shown because too many files have changed in this diff