Documentation / bcache.txt




Based on kernel version 3.16. Page generated on 2014-08-06 21:36 EST.

1	Say you've got a big slow raid 6, and an X-25E or three. Wouldn't it be
2	nice if you could use them as cache... Hence bcache.
3	
4	Wiki and git repositories are at:
5	  http://bcache.evilpiepirate.org
6	  http://evilpiepirate.org/git/linux-bcache.git
7	  http://evilpiepirate.org/git/bcache-tools.git
8	
9	It's designed around the performance characteristics of SSDs - it only allocates
10	in erase block sized buckets, and it uses a hybrid btree/log to track cached
11	extents (which can be anywhere from a single sector to the bucket size). It's
12	designed to avoid random writes at all costs; it fills up an erase block
13	sequentially, then issues a discard before reusing it.
14	
15	Both writethrough and writeback caching are supported. Writeback defaults to
16	off, but can be switched on and off arbitrarily at runtime. Bcache goes to
17	great lengths to protect your data - it reliably handles unclean shutdown. (It
18	doesn't even have a notion of a clean shutdown; bcache simply doesn't return
19	writes as completed until they're on stable storage).
20	
21	Writeback caching can use most of the cache for buffering writes - writing
22	dirty data to the backing device is always done sequentially, scanning from the
23	start to the end of the index.
24	
25	Since random IO is what SSDs excel at, there generally won't be much benefit
26	to caching large sequential IO. Bcache detects sequential IO and skips it;
27	it also keeps a rolling average of the IO sizes per task, and as long as the
28	average is above the cutoff it will skip all IO from that task - instead of
29	caching the first 512k after every seek. Backups and large file copies should
30	thus entirely bypass the cache.
31	
32	In the event of a data IO error on the flash it will try to recover by reading
33	from disk or invalidating cache entries.  For unrecoverable errors (metadata
34	or dirty data), caching is automatically disabled; if dirty data was present
35	in the cache it first disables writeback caching and waits for all dirty data
36	to be flushed.
37	
38	Getting started:
39	You'll need make-bcache from the bcache-tools repository. Both the cache device
40	and backing device must be formatted before use.
41	  make-bcache -B /dev/sdb
42	  make-bcache -C /dev/sdc
43	
44	make-bcache has the ability to format multiple devices at the same time - if
45	you format your backing devices and cache device at the same time, you won't
46	have to manually attach:
47	  make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
48	
49	bcache-tools now ships udev rules, and bcache devices are known to the kernel
50	immediately.  Without udev, you can manually register devices like this:
51	
52	  echo /dev/sdb > /sys/fs/bcache/register
53	  echo /dev/sdc > /sys/fs/bcache/register
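
The manual registration above can be wrapped in a small helper when several
devices need registering. This is only a sketch: the function name
bcache_register and the BCACHE_REGISTER override variable are hypothetical,
introduced here so the helper can be pointed at a scratch file for a dry run.
Only the /sys/fs/bcache/register path comes from this document.

```shell
# Hypothetical helper: register each device named on the command line.
# BCACHE_REGISTER is an assumed override so the function can be tried
# against a scratch file; the default is the documented sysfs path.
bcache_register() {
    reg="${BCACHE_REGISTER:-/sys/fs/bcache/register}"
    for dev in "$@"; do
        # One device path per write, as with the manual echo commands.
        echo "$dev" > "$reg" || return 1
    done
}
```

As root, "bcache_register /dev/sdb /dev/sdc" would then be equivalent to the
two echo commands.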
54	
55	Registering the backing device makes the bcache device show up in /dev; you can
56	now format it and use it as normal. But the first time using a new bcache
57	device, it'll be running in passthrough mode until you attach it to a cache.
58	See the section on attaching.
59	
60	The devices show up as:
61	
62	  /dev/bcache<N>
63	
64	As well as (with udev):
65	
66	  /dev/bcache/by-uuid/<uuid>
67	  /dev/bcache/by-label/<label>
68	
69	To get started:
70	
71	  mkfs.ext4 /dev/bcache0
72	  mount /dev/bcache0 /mnt
73	
74	You can control bcache devices through sysfs at /sys/block/bcache<N>/bcache .
75	
76	Cache devices are managed as sets; multiple caches per set aren't supported yet
77	but will allow for mirroring of metadata and dirty data in the future. Your new
78	cache set shows up as /sys/fs/bcache/<UUID>
79	
80	ATTACHING:
81	
82	After your cache device and backing device are registered, the backing device
83	must be attached to your cache set to enable caching. Attaching a backing
84	device to a cache set is done thusly, with the UUID of the cache set in
85	/sys/fs/bcache:
86	
87	  echo <CSET-UUID> > /sys/block/bcache0/bcache/attach
88	
89	This only has to be done once. The next time you reboot, just reregister all
90	your bcache devices. If a backing device has data in a cache somewhere, the
91	/dev/bcache<N> device won't be created until the cache shows up - particularly
92	important if you have writeback caching turned on.
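
Looking up the cache set UUID and attaching can be scripted. The sketch below
assumes that cache sets appear under /sys/fs/bcache as directories named by
UUID (as described earlier) and that nothing else there matches a
dash-separated five-field name. The function names and the BCACHE_FS and
BCACHE_BLOCK override variables are hypothetical, added so the helpers can be
exercised against a scratch directory.

```shell
# Hypothetical helpers; BCACHE_FS and BCACHE_BLOCK are assumed overrides
# for testing and default to the sysfs locations named in this document.

# Print the UUID of every cache set currently registered.
bcache_list_csets() {
    root="${BCACHE_FS:-/sys/fs/bcache}"
    for d in "$root"/*-*-*-*-*; do
        # Only directories count; this skips plain files such as "register"
        # and the unexpanded pattern when nothing matches.
        [ -d "$d" ] && basename "$d"
    done
    return 0
}

# Attach a bcache device to a cache set.
# $1: cache set UUID, $2: bcache device name such as bcache0
bcache_attach() {
    echo "$1" > "${BCACHE_BLOCK:-/sys/block}/$2/bcache/attach"
}
```

With a single cache set registered, attaching would look like
bcache_attach "$(bcache_list_csets)" bcache0.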
93	
94	If you're booting up and your cache device is gone and never coming back, you
95	can force run the backing device:
96	
97	  echo 1 > /sys/block/sdb/bcache/running
98	
99	(You need to use /sys/block/sdb (or whatever your backing device is called), not
100	/sys/block/bcache0, because bcache0 doesn't exist yet. If you're using a
101	partition, the bcache directory would be at /sys/block/sdb/sdb2/bcache)
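
Since the sysfs path differs for whole disks and partitions, a script that
force-runs a backing device may want to build it in one place. A minimal
sketch follows; the function name bcache_backing_dir is hypothetical, and the
caller is assumed to know which disk a partition belongs to.

```shell
# Hypothetical helper: print the sysfs bcache directory of a backing device.
# $1: whole-disk name (e.g. sdb)
# $2: optional partition name (e.g. sdb2)
bcache_backing_dir() {
    if [ -n "${2:-}" ]; then
        echo "/sys/block/$1/$2/bcache"
    else
        echo "/sys/block/$1/bcache"
    fi
}
```

Force-running a backing device on a partition would then be
echo 1 > "$(bcache_backing_dir sdb sdb2)/running".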
102	
103	The backing device will still use that cache set if it shows up in the future,
104	but all the cached data will be invalidated. If there was dirty data in the
105	cache, don't expect the filesystem to be recoverable - you will have massive
106	filesystem corruption, though ext4's fsck does work miracles.
107	
108	ERROR HANDLING:
109	
110	Bcache tries to transparently handle IO errors to/from the cache device without
111	affecting normal operation; if it sees too many errors (the threshold is
112	configurable, and defaults to 0) it shuts down the cache device and switches all
113	the backing devices to passthrough mode.
114	
115	 - For reads from the cache, if they error we just retry the read from the
116	   backing device.
117	
118	 - For writethrough writes, if the write to the cache errors we just switch to
119	   invalidating the data at that lba in the cache (i.e. the same thing we do for
120	   a write that bypasses the cache)
121	
122	 - For writeback writes, we currently pass that error back up to the
123	   filesystem/userspace. This could be improved - we could retry it as a write
124	   that skips the cache so we don't have to error the write.
125	
126	 - When we detach, we first try to flush any dirty data (if we were running in
127	   writeback mode). It currently doesn't do anything intelligent if it fails to
128	   read some of the dirty data, though.
129	
130	TROUBLESHOOTING PERFORMANCE:
131	
132	Bcache has a bunch of config options and tunables. The defaults are intended to
133	be reasonable for typical desktop and server workloads, but they're not what you
134	want for getting the best possible numbers when benchmarking.
135	
136	 - Bad write performance
137	
138	   If write performance is not what you expected, you probably wanted to be
139	   running in writeback mode, which isn't the default (not due to a lack of
140	   maturity, but simply because in writeback mode you'll lose data if something
141	   happens to your SSD)
142	
143	   # echo writeback > /sys/block/bcache0/bcache/cache_mode
144	
145	 - Bad performance, or traffic not going to the SSD that you'd expect
146	
147	   By default, bcache doesn't cache everything. It tries to skip sequential IO -
148	   because you really want to be caching the random IO, and if you copy a 10
149	   gigabyte file you probably don't want that pushing 10 gigabytes of randomly
150	   accessed data out of your cache.
151	
152	   But if you want to benchmark reads from the cache and you start out with fio
153	   writing an 8 gigabyte test file, you'll want to disable that.
154	
155	   # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
156	
157	   To set it back to the default (4 MB), do
158	
159	   # echo 4M > /sys/block/bcache0/bcache/sequential_cutoff
160	
161	 - Traffic's still going to the spindle/still getting cache misses
162	
163	   In the real world, SSDs don't always keep up with disks - particularly with
164	   slower SSDs, many disks being cached by one SSD, or mostly sequential IO. So
165	   you want to avoid being bottlenecked by the SSD and having it slow everything
166	   down.
167	
168	   To avoid that bcache tracks latency to the cache device, and gradually
169	   throttles traffic if the latency exceeds a threshold (it does this by
170	   cranking down the sequential bypass).
171	
172	   You can disable this if you need to by setting the thresholds to 0:
173	
174	   # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
175	   # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
176	
177	   The defaults are 2000 us (2 ms) for reads and 20000 us (20 ms) for writes.
178	
179	 - Still getting cache misses, of the same data
180	
181	   One last issue that sometimes trips people up is actually an old bug, due to
182	   the way cache coherency is handled for cache misses. If a btree node is full,
183	   a cache miss won't be able to insert a key for the new data and the data
184	   won't be written to the cache.
185	
186	   In practice this isn't an issue because as soon as a write comes along it'll
187	   cause the btree node to be split, and you need almost no write traffic for
188	   this to not show up enough to be noticeable (especially since bcache's btree
189	   nodes are huge and index large regions of the device). But when you're
190	   benchmarking, if you're trying to warm the cache by reading a bunch of data
191	   and there's no other traffic - that can be a problem.
192	
193	   Solution: warm the cache by doing writes, or use the testing branch (there's
194	   a fix for the issue there).
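
When benchmarking, the congestion thresholds described above usually get
switched off and later restored. The helper below is a sketch under that
assumption; the function name is hypothetical, and the cache set directory is
passed in by the caller rather than discovered.

```shell
# Hypothetical helper: set both congestion thresholds (in microseconds).
# $1: cache set directory, e.g. /sys/fs/bcache/<cache set>
# $2: read threshold, $3: write threshold
bcache_set_congestion() {
    echo "$2" > "$1/congested_read_threshold_us" &&
    echo "$3" > "$1/congested_write_threshold_us"
}
```

Passing 0 0 disables the throttling; passing 2000 20000 restores the defaults
quoted above.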
195	
196	SYSFS - BACKING DEVICE:
197	
198	Available at /sys/block/<bdev>/bcache, /sys/block/bcache*/bcache and
199	(if attached) /sys/fs/bcache/<cset-uuid>/bdev*
200	
201	attach
202	  Echo the UUID of a cache set to this file to enable caching.
203	
204	cache_mode
205	  One of writethrough, writeback, writearound or none.
206	
207	clear_stats
208	  Writing to this file resets the running total stats (not the day/hour/5 minute
209	  decaying versions).
210	
211	detach
212	  Write to this file to detach from a cache set. If there is dirty data in the
213	  cache, it will be flushed first.
214	
215	dirty_data
216	  Amount of dirty data for this backing device in the cache. Continuously
217	  updated unlike the cache set's version, but may be slightly off.
218	
219	label
220	  Name of underlying device.
221	
222	readahead
223	  Size of readahead that should be performed.  Defaults to 0.  If set to e.g.
224	  1M, it will round cache miss reads up to that size, but without overlapping
225	  existing cache entries.
226	
227	running
228	  1 if bcache is running (i.e. the /dev/bcache device exists, regardless of
229	  whether it's in passthrough mode or caching).
230	
231	sequential_cutoff
232	  A sequential IO will bypass the cache once it passes this threshold; the
233	  most recent 128 IOs are tracked so sequential IO can be detected even when
234	  it isn't all done at once.
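
sequential_cutoff is written with a unit suffix, such as the 4M default shown
in the troubleshooting section. A benchmark script that wants to compare the
cutoff with its own IO sizes needs the value in bytes. The sketch below
assumes the k/M/G suffixes are binary multiples (1024, 1024^2, 1024^3), which
this document does not state; the function name size_to_bytes is
hypothetical.

```shell
# Convert a size such as 512k, 4M or 1G to bytes.
# Assumption: suffixes are powers of 1024. A bare number is taken as bytes.
size_to_bytes() {
    num="${1%[kKMG]}"    # strip one trailing unit letter, if any
    case "$1" in
        *[kK]) echo $(( $num * 1024 )) ;;
        *M)    echo $(( $num * 1024 * 1024 )) ;;
        *G)    echo $(( $num * 1024 * 1024 * 1024 )) ;;
        *)     echo $(( $num )) ;;
    esac
}
```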
235	
236	sequential_merge
237	  If non zero, bcache keeps a list of the last 128 requests submitted to compare
238	  against all new requests to determine which new requests are sequential
239	  continuations of previous requests for the purpose of determining sequential
240	  cutoff. This is necessary if the sequential cutoff value is greater than the
241	  maximum acceptable sequential size for any single request. 
242	
243	state
244	  The backing device can be in one of four different states:
245	
246	  no cache: Has never been attached to a cache set.
247	
248	  clean: Part of a cache set, and there is no cached dirty data.
249	
250	  dirty: Part of a cache set, and there is cached dirty data.
251	
252	  inconsistent: The backing device was forcibly run by the user when there was
253	  dirty data cached but the cache set was unavailable; whatever data was on the
254	  backing device has likely been corrupted.
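
A monitoring script might translate the four states into a short verdict
before, say, removing a cache device. The sketch below only restates the
descriptions above; the function name and the wording of the messages are
hypothetical.

```shell
# Hypothetical helper: summarise a backing device state string.
# $1: contents of the state file
bcache_state_check() {
    case "$1" in
        "no cache"|clean) echo "ok: no dirty data in a cache" ;;
        dirty)            echo "needs cache: dirty data not yet written back" ;;
        inconsistent)     echo "suspect: backing device data is likely corrupted" ;;
        *)                echo "unknown state: $1"; return 1 ;;
    esac
}
```

It would be called as
bcache_state_check "$(cat /sys/block/bcache0/bcache/state)".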
255	
256	stop
257	  Write to this file to shut down the bcache device and close the backing
258	  device.
259	
260	writeback_delay
261	  When dirty data is written to the cache and it previously did not contain
262	  any, waits some number of seconds before initiating writeback. Defaults to
263	  30.
264	
265	writeback_percent
266	  If nonzero, bcache tries to keep around this percentage of the cache dirty by
267	  throttling background writeback and using a PD controller to smoothly adjust
268	  the rate.
269	
270	writeback_rate
271	  Rate in sectors per second - if writeback_percent is nonzero, background
272	  writeback is throttled to this rate. Continuously adjusted by bcache but may
273	  also be set by the user.
274	
275	writeback_running
276	  If off, writeback of dirty data will not take place at all. Dirty data will
277	  still be added to the cache until it is mostly full; only meant for
278	  benchmarking. Defaults to on.
279	
280	SYSFS - BACKING DEVICE STATS:
281	
282	There are directories with these numbers for a running total, as well as
283	versions that decay over the past day, hour and 5 minutes; they're also
284	aggregated in the cache set directory.
285	
286	bypassed
287	  Amount of IO (both reads and writes) that has bypassed the cache
288	
289	cache_hits
290	cache_misses
291	cache_hit_ratio
292	  Hits and misses are counted per individual IO as bcache sees them; a
293	  partial hit is counted as a miss.
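
To recompute a hit ratio from the raw counters (for example over a custom
interval, by differencing two samples), something like the following can be
used. It assumes the ratio is hits as an integer percentage of hits plus
misses, which the text above does not spell out; the function name hit_ratio
is hypothetical.

```shell
# Integer hit percentage from hit and miss counts.
# Prints 0 when no IO has been counted, to avoid dividing by zero.
# $1: cache_hits, $2: cache_misses
hit_ratio() {
    total=$(( $1 + $2 ))
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $(( $1 * 100 / $total ))
    fi
}
```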
294	
295	cache_bypass_hits
296	cache_bypass_misses
297	  Hits and misses for IO that is intended to skip the cache are still counted,
298	  but broken out here.
299	
300	cache_miss_collisions
301	  Counts instances where data was going to be inserted into the cache from a
302	  cache miss, but raced with a write and data was already present (usually 0
303	  since the synchronization for cache misses was rewritten)
304	
305	cache_readaheads
306	  Count of times readahead occurred.
307	
308	SYSFS - CACHE SET:
309	
310	Available at /sys/fs/bcache/<cset-uuid>
311	
312	average_key_size
313	  Average data per key in the btree.
314	
315	bdev<0..n>
316	  Symlink to each of the attached backing devices.
317	
318	block_size
319	  Block size of the cache devices.
320	
321	btree_cache_size
322	  Amount of memory currently used by the btree cache
323	
324	bucket_size
325	  Size of buckets
326	
327	cache<0..n>
328	  Symlink to each of the cache devices comprising this cache set. 
329	
330	cache_available_percent
331	  Percentage of cache device which doesn't contain dirty data, and could
332	  potentially be used for writeback.  This doesn't mean this space isn't used
333	  for clean cached data; the unused statistic (in priority_stats) is typically
334	  much lower.
335	
336	clear_stats
337	  Clears the statistics associated with this cache
338	
339	dirty_data
340	  Amount of dirty data in the cache (updated when garbage collection runs).
341	
342	flash_vol_create
343	  Echoing a size to this file (in human readable units, k/M/G) creates a thinly
344	  provisioned volume backed by the cache set.
345	
346	io_error_halflife
347	io_error_limit
348	  These determine how many errors we accept before disabling the cache.
349	  Each error is decayed by the half life (in # ios).  If the decaying count
350	  reaches io_error_limit dirty data is written out and the cache is disabled.
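
To get a feel for how a half life measured in IOs interacts with the limit,
the idea can be modelled as plain exponential decay. This is an illustration
of the concept only, not bcache's internal arithmetic; the function name and
the treatment of a zero half life as "no decay" are assumptions.

```shell
# Illustration only: exponential half-life decay of an error count.
# $1: error count, $2: IOs elapsed since then, $3: half life in IOs
decayed_errors() {
    awk -v c="$1" -v n="$2" -v h="$3" 'BEGIN {
        # Assumed here: a half life of 0 means the count never decays.
        if (h == 0) { printf "%.2f\n", c; exit }
        # The count halves once for every h IOs that have elapsed.
        printf "%.2f\n", c * 0.5 ^ (n / h)
    }'
}
```

Under this model, 8 errors followed by one half life of error-free IOs count
as 4, and after two half lives as 2.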
351	
352	journal_delay_ms
353	  Journal writes will delay for up to this many milliseconds, unless a cache
354	  flush happens sooner. Defaults to 100.
355	
356	root_usage_percent
357	  Percentage of the root btree node in use.  If this gets too high the node
358	  will split, increasing the tree depth.
359	
360	stop
361	  Write to this file to shut down the cache set - waits until all attached
362	  backing devices have been shut down.
363	
364	tree_depth
365	  Depth of the btree (A single node btree has depth 0).
366	
367	unregister
368	  Detaches all backing devices and closes the cache devices; if dirty data is
369	  present it will disable writeback caching and wait for it to be flushed.
370	
371	SYSFS - CACHE SET INTERNAL:
372	
373	This directory also exposes timings for a number of internal operations, with
374	separate files for average duration, average frequency, last occurrence and max
375	duration: garbage collection, btree read, btree node sorts and btree splits.
376	
377	active_journal_entries
378	  Number of journal entries that are newer than the index.
379	
380	btree_nodes
381	  Total nodes in the btree.
382	
383	btree_used_percent
384	  Average fraction of btree in use.
385	
386	bset_tree_stats
387	  Statistics about the auxiliary search trees
388	
389	btree_cache_max_chain
390	  Longest chain in the btree node cache's hash table
391	
392	cache_read_races
393	  Counts instances where while data was being read from the cache, the bucket
394	  was reused and invalidated - i.e. where the pointer was stale after the read
395	  completed. When this occurs the data is reread from the backing device.
396	
397	trigger_gc
398	  Writing to this file forces garbage collection to run.
399	
400	SYSFS - CACHE DEVICE:
401	
402	Available at /sys/block/<cdev>/bcache
403	
404	block_size
405	  Minimum granularity of writes - should match hardware sector size.
406	
407	btree_written
408	  Sum of all btree writes, in (kilo/mega/giga) bytes
409	
410	bucket_size
411	  Size of buckets
412	
413	cache_replacement_policy
414	  One of lru, fifo or random.
415	
416	discard
417	  Boolean; if on a discard/TRIM will be issued to each bucket before it is
418	  reused. Defaults to off, since SATA TRIM is an unqueued command (and thus
419	  slow).
420	
421	freelist_percent
422	  Size of the freelist as a percentage of nbuckets. Writing a larger value
423	  increases the number of buckets kept on the freelist, which lets you
424	  artificially reduce the size of the cache at runtime. Mostly for testing
425	  purposes (e.g. testing how different size caches affect your hit rate), but
426	  since buckets are discarded when they move on to the freelist, it will also
427	  make the SSD's garbage collection easier by effectively giving it more reserved
428	  space.
429	
430	io_errors
431	  Number of errors that have occurred, decayed by io_error_halflife.
432	
433	metadata_written
434	  Sum of all non data writes (btree writes and all other metadata).
435	
436	nbuckets
437	  Total buckets in this cache
438	
439	priority_stats
440	  Statistics about how recently data in the cache has been accessed.
441	  This can reveal your working set size.  Unused is the percentage of
442	  the cache that doesn't contain any data.  Metadata is bcache's
443	  metadata overhead.  Average is the average priority of cache buckets.
444	  Next is a list of quantiles with the priority threshold of each.
445	
446	written
447	  Sum of all data that has been written to the cache; comparison with
448	  btree_written gives the amount of write inflation in bcache.

Information is copyright its respective author. All material is available from the Linux Kernel Source distributed under a GPL License. This page is provided as a free service by mjmwired.net.