Linux Kernel 3.7.1
drivers/md/raid5.h
#ifndef _RAID5_H
#define _RAID5_H

#include <linux/raid/xor.h>
#include <linux/dmaengine.h>

/*
 *
 * Each stripe contains one buffer per device.  Each buffer can be in
 * one of a number of states stored in "flags".  Changes between
 * these states happen *almost* exclusively under the protection of the
 * STRIPE_ACTIVE flag.  Some very specific changes can happen in bi_end_io, and
 * these are not protected by STRIPE_ACTIVE.
 *
 * The flag bits that are used to represent these states are:
 *   R5_UPTODATE and R5_LOCKED
 *
 * State Empty == !UPTODATE, !LOCK
 *        We have no data, and there is no active request
 * State Want == !UPTODATE, LOCK
 *        A read request is being submitted for this block
 * State Dirty == UPTODATE, LOCK
 *        Some new data is in this buffer, and it is being written out
 * State Clean == UPTODATE, !LOCK
 *        We have valid data which is the same as on disc
 *
 * The possible state transitions are:
 *
 *  Empty -> Want  - on read or write to get old data for parity calc
 *  Empty -> Dirty - on compute_parity to satisfy write/sync request.
 *  Empty -> Clean - on compute_block when computing a block for failed drive
 *  Want  -> Empty - on failed read
 *  Want  -> Clean - on successful completion of read request
 *  Dirty -> Clean - on successful completion of write request
 *  Dirty -> Clean - on failed write
 *  Clean -> Dirty - on compute_parity to satisfy write/sync (RECONSTRUCT or RMW)
 *
 * The Want->Empty, Want->Clean, Dirty->Clean transitions
 * all happen in b_end_io at interrupt time.
 * Each sets the Uptodate bit before releasing the Lock bit.
 * This leaves one multi-stage transition:
 *    Want->Dirty->Clean
 * This is safe because thinking that a Clean buffer is actually dirty
 * will at worst delay some action, and the stripe will be scheduled
 * for attention after the transition is complete.
 *
 * There is one possibility that is not covered by these states.  That
 * is if one drive has failed and there is a spare being rebuilt.  We
 * can't distinguish between a clean block that has been generated
 * from parity calculations, and a clean block that has been
 * successfully written to the spare (or to parity when resyncing).
 * To distinguish these states we have a stripe bit STRIPE_INSYNC that
 * is set whenever a write is scheduled to the spare, or to the parity
 * disc if there is no spare.  A sync request clears this bit, and
 * when we find it set with no buffers locked, we know the sync is
 * complete.
 *
 * Buffers for the md device that arrive via make_request are attached
 * to the appropriate stripe in one of two lists linked on b_reqnext.
 * One list (bh_read) for read requests, one (bh_write) for write.
 * There should never be more than one buffer on the two lists
 * together, but we are not guaranteed of that so we allow for more.
 *
 * If a buffer is on the read list when the associated cache buffer is
 * Uptodate, the data is copied into the read buffer and its b_end_io
 * routine is called.  This may happen in the end_request routine only
 * if the buffer has just successfully been read.  end_request should
 * remove the buffers from the list and then set the Uptodate bit on
 * the buffer.  Other threads may do this only if they first check
 * that the Uptodate bit is set.  Once they have checked that they may
 * take buffers off the read queue.
 *
 * When a buffer on the write list is committed for write it is copied
 * into the cache buffer, which is then marked dirty, and moved onto a
 * third list, the written list (bh_written).  Once both the parity
 * block and the cached buffer are successfully written, any buffer on
 * a written list can be returned with b_end_io.
 *
 * The write list and read list both act as fifos.  The read list,
 * write list and written list are protected by the device_lock.
 * The device_lock is only for list manipulations and will only be
 * held for a very short time.  It can be claimed from interrupts.
 *
 *
 * Stripes in the stripe cache can be on one of two lists (or on
 * neither).  The "inactive_list" contains stripes which are not
 * currently being used for any request.  They can freely be reused
 * for another stripe.  The "handle_list" contains stripes that need
 * to be handled in some way.  Both of these are fifo queues.  Each
 * stripe is also (potentially) linked to a hash bucket in the hash
 * table so that it can be found by sector number.  Stripes that are
 * not hashed must be on the inactive_list, and will normally be at
 * the front.  All stripes start life this way.
 *
 * The inactive_list, handle_list and hash bucket lists are all protected by the
 * device_lock.
 *  - stripes have a reference counter. If count==0, they are on a list.
 *  - If a stripe might need handling, STRIPE_HANDLE is set.
 *  - When refcount reaches zero, then if STRIPE_HANDLE it is put on
 *    handle_list else inactive_list
 *
 * This, combined with the fact that STRIPE_HANDLE is only ever
 * cleared while a stripe has a non-zero count, means that if the
 * refcount is 0 and STRIPE_HANDLE is set, then it is on the
 * handle_list, and if refcount is 0 and STRIPE_HANDLE is not set, then
 * the stripe is on inactive_list.
 *
 * The possible transitions are:
 *  activate an unhashed/inactive stripe (get_active_stripe())
 *     lockdev check-hash unlink-stripe cnt++ clean-stripe hash-stripe unlockdev
 *  activate a hashed, possibly active stripe (get_active_stripe())
 *     lockdev check-hash if(!cnt++)unlink-stripe unlockdev
 *  attach a request to an active stripe (add_stripe_bh())
 *     lockdev attach-buffer unlockdev
 *  handle a stripe (handle_stripe())
 *     setSTRIPE_ACTIVE, clrSTRIPE_HANDLE ...
 *		(lockdev check-buffers unlockdev) ..
 *		change-state ..
 *		record io/ops needed clearSTRIPE_ACTIVE schedule io/ops
 *  release an active stripe (release_stripe())
 *     lockdev if (!--cnt) { if STRIPE_HANDLE, add to handle_list else add to inactive-list } unlockdev
 *
 * The refcount counts each thread that has activated the stripe,
 * plus raid5d if it is handling it, plus one for each active request
 * on a cached buffer, and plus one if the stripe is undergoing stripe
 * operations.
 *
 * The stripe operations are:
 * -copying data between the stripe cache and user application buffers
 * -computing blocks to save a disk access, or to recover a missing block
 * -updating the parity on a write operation (reconstruct write and
 *  read-modify-write)
 * -checking parity correctness
 * -running i/o to disk
 * These operations are carried out by raid5_run_ops which uses the async_tx
 * api to (optionally) offload operations to dedicated hardware engines.
 * When requesting an operation handle_stripe sets the pending bit for the
 * operation and increments the count.  raid5_run_ops is then run whenever
 * the count is non-zero.
 * There are some critical dependencies between the operations that prevent some
 * from being requested while another is in flight.
 * 1/ Parity check operations destroy the in-cache version of the parity block,
 *    so we prevent parity dependent operations like writes and compute_blocks
 *    from starting while a check is in progress.  Some dma engines can perform
 *    the check without damaging the parity block, in these cases the parity
 *    block is re-marked up to date (assuming the check was successful) and is
 *    not re-read from disk.
 * 2/ When a write operation is requested we immediately lock the affected
 *    blocks, and mark them as not up to date.  This causes new read requests
 *    to be held off, as well as parity checks and compute block operations.
 * 3/ Once a compute block operation has been requested handle_stripe treats
 *    that block as if it is up to date.  raid5_run_ops guarantees that any
 *    operation that is dependent on the compute block result is initiated after
 *    the compute block completes.
 */

/*
 * Operations state - intermediate states that are visible outside of
 *   STRIPE_ACTIVE.
 * In general _idle indicates nothing is running, _run indicates a data
 * processing operation is active, and _result means the data processing result
 * is stable and can be acted upon.  For simple operations like biofill and
 * compute that only have an _idle and _run state they are indicated with
 * sh->state flags (STRIPE_BIOFILL_RUN and STRIPE_COMPUTE_RUN)
 */
enum check_states {
	check_state_idle = 0,
	check_state_run,		/* xor parity check */
	check_state_run_q,		/* q-parity check */
	check_state_run_pq,		/* pq dual parity check */
	check_state_check_result,
	check_state_compute_run,	/* parity repair */
	check_state_compute_result,
};

enum reconstruct_states {
	reconstruct_state_idle = 0,
	reconstruct_state_prexor_drain_run,	/* prexor-write */
	reconstruct_state_drain_run,		/* write */
	reconstruct_state_run,			/* expand */
	reconstruct_state_prexor_drain_result,
	reconstruct_state_drain_result,
	reconstruct_state_result,
};

struct stripe_head {
	struct hlist_node	hash;
	struct list_head	lru;		/* inactive_list or handle_list */
	struct r5conf		*raid_conf;
	short			generation;	/* increments with every
						 * reshape */
	sector_t		sector;		/* sector of this row */
	short			pd_idx;		/* parity disk index */
	short			qd_idx;		/* 'Q' disk index for raid6 */
	short			ddf_layout;	/* use DDF ordering to calculate Q */
	unsigned long		state;		/* state flags */
	atomic_t		count;		/* nr of active thread/requests */
	int			bm_seq;		/* sequence number for bitmap flushes */
	int			disks;		/* disks in stripe */
	enum check_states	check_state;
	enum reconstruct_states reconstruct_state;
	spinlock_t		stripe_lock;
	struct stripe_operations {
		int		     target, target2;
		enum sum_check_flags zero_sum_result;
		#ifdef CONFIG_MULTICORE_RAID456
		unsigned long	     request;
		wait_queue_head_t    wait_for_ops;
		#endif
	} ops;
	struct r5dev {
		/* rreq and rvec are used for the replacement device when
		 * writing data to both devices.
		 */
		struct bio	req, rreq;
		struct bio_vec	vec, rvec;
		struct page	*page;
		struct bio	*toread, *read, *towrite, *written;
		sector_t	sector;		/* sector of this page */
		unsigned long	flags;
	} dev[1]; /* allocated with extra space depending on RAID geometry */
};

/* stripe_head_state - collects and tracks the dynamic state of a stripe_head
 *     for handle_stripe.
 */
struct stripe_head_state {
	/* 'syncing' means that we need to read all devices, either
	 * to check/correct parity, or to reconstruct a missing device.
	 * 'replacing' means we are replacing one or more drives and
	 * the source is valid at this point so we don't need to
	 * read all devices, just the replacement targets.
	 */
	int syncing, expanding, expanded, replacing;
	int locked, uptodate, to_read, to_write, failed, written;
	int to_fill, compute, req_compute, non_overwrite;
	int failed_num[2];
	int p_failed, q_failed;
	int dec_preread_active;
	unsigned long ops_request;

	struct bio *return_bi;
	struct md_rdev *blocked_rdev;
	int handle_bad_blocks;
};

/* Flags for struct r5dev.flags */
enum r5dev_flags {
	R5_UPTODATE,	/* page contains current data */
	R5_LOCKED,	/* IO has been submitted on "req" */
	R5_DOUBLE_LOCKED,/* Cannot clear R5_LOCKED until 2 writes complete */
	R5_OVERWRITE,	/* towrite covers whole page */
/* and some that are internal to handle_stripe */
	R5_Insync,	/* rdev && rdev->in_sync at start */
	R5_Wantread,	/* want to schedule a read */
	R5_Wantwrite,
	R5_Overlap,	/* There is a pending overlapping request
			 * on this block */
	R5_ReadNoMerge,	/* prevent bio from merging in block-layer */
	R5_ReadError,	/* seen a read error here recently */
	R5_ReWrite,	/* have tried to over-write the readerror */

	R5_Expanded,	/* This block now has post-expand data */
	R5_Wantcompute,	/* compute_block in progress treat as
			 * uptodate
			 */
	R5_Wantfill,	/* dev->toread contains a bio that needs
			 * filling
			 */
	R5_Wantdrain,	/* dev->towrite needs to be drained */
	R5_WantFUA,	/* Write should be FUA */
	R5_SyncIO,	/* The IO is sync */
	R5_WriteError,	/* got a write error - need to record it */
	R5_MadeGood,	/* A bad block has been fixed by writing to it */
	R5_ReadRepl,	/* Will/did read from replacement rather than orig */
	R5_MadeGoodRepl,/* A bad block on the replacement device has been
			 * fixed by writing to it */
	R5_NeedReplace,	/* This device has a replacement which is not
			 * up-to-date at this stripe. */
	R5_WantReplace,	/* We need to update the replacement, we have read
			 * data in, and now is a good time to write it out.
			 */
	R5_Discard,	/* Discard the stripe */
};
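
/*
 * Illustrative sketch, not part of the original header: how the
 * Empty/Want/Dirty/Clean buffer states described at the top of this
 * file map onto the R5_UPTODATE and R5_LOCKED bits in r5dev->flags.
 * The helper names are invented for the example; test_bit() is assumed
 * to be available via the usual kernel includes.
 */
static inline int r5dev_buffer_clean(struct r5dev *dev)
{
	/* Clean == UPTODATE, !LOCK: cached data matches the disc */
	return test_bit(R5_UPTODATE, &dev->flags) &&
	       !test_bit(R5_LOCKED, &dev->flags);
}

static inline int r5dev_buffer_dirty(struct r5dev *dev)
{
	/* Dirty == UPTODATE, LOCK: new data is being written out */
	return test_bit(R5_UPTODATE, &dev->flags) &&
	       test_bit(R5_LOCKED, &dev->flags);
}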

/*
 * Stripe state
 */
enum {
	STRIPE_ACTIVE,
	STRIPE_HANDLE,
	STRIPE_SYNC_REQUESTED,
	STRIPE_SYNCING,
	STRIPE_INSYNC,
	STRIPE_PREREAD_ACTIVE,
	STRIPE_DELAYED,
	STRIPE_DEGRADED,
	STRIPE_BIT_DELAY,
	STRIPE_EXPANDING,
	STRIPE_EXPAND_SOURCE,
	STRIPE_EXPAND_READY,
	STRIPE_IO_STARTED,	/* do not count towards 'bypass_count' */
	STRIPE_FULL_WRITE,	/* all blocks are set to be overwritten */
	STRIPE_BIOFILL_RUN,
	STRIPE_COMPUTE_RUN,
	STRIPE_OPS_REQ_PENDING,
	STRIPE_ON_UNPLUG_LIST,
};
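
/*
 * Illustrative sketch, not part of the original header: the list
 * invariant described in the comment at the top of this file.  With a
 * zero refcount a stripe sits on handle_list when STRIPE_HANDLE is set
 * and on inactive_list otherwise.  The helper name is invented for the
 * example; atomic_read() and test_bit() come from the usual includes.
 */
static inline int stripe_on_handle_list(struct stripe_head *sh)
{
	return atomic_read(&sh->count) == 0 &&
	       test_bit(STRIPE_HANDLE, &sh->state);
}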

/*
 * Operation request flags
 */
enum {
	STRIPE_OP_BIOFILL,
	STRIPE_OP_COMPUTE_BLK,
	STRIPE_OP_PREXOR,
	STRIPE_OP_BIODRAIN,
	STRIPE_OP_RECONSTRUCT,
	STRIPE_OP_CHECK,
};
/*
 * Plugging:
 *
 * To improve write throughput, we need to delay the handling of some
 * stripes until there has been a chance that several write requests
 * for the one stripe have all been collected.
 * In particular, any write request that would require pre-reading
 * is put on a "delayed" queue until there are no stripes currently
 * in a pre-read phase.  Further, if the "delayed" queue is empty when
 * a stripe is put on it then we "plug" the queue and do not process it
 * until an unplug call is made (the unplug_io_fn() is called).
 *
 * When preread is initiated on a stripe, we set PREREAD_ACTIVE and add
 * it to the count of prereading stripes.
 * When write is initiated, or the stripe refcnt == 0 (just in case), we
 * clear the PREREAD_ACTIVE flag and decrement the count.
 * Whenever the 'handle' queue is empty and the device is not plugged, we
 * move any stripes from delayed to handle and clear the DELAYED flag and set
 * PREREAD_ACTIVE.
 * In stripe_handle, if we find pre-reading is necessary, we do it if
 * PREREAD_ACTIVE is set, else we set DELAYED which will send it to the delayed queue.
 * HANDLE gets cleared if stripe_handle leaves nothing locked.
 */
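
/*
 * Illustrative sketch, not part of the original header: the delay
 * decision described above.  A stripe whose writes would need a
 * pre-read is handled immediately only while STRIPE_PREREAD_ACTIVE is
 * set; otherwise it is marked STRIPE_DELAYED so it lands on the
 * delayed queue.  The helper name is invented for the example.
 */
static inline void stripe_defer_preread(struct stripe_head *sh)
{
	/* pre-read not currently allowed: park the stripe on the delayed queue */
	if (!test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
		set_bit(STRIPE_DELAYED, &sh->state);
}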

struct disk_info {
	struct md_rdev	*rdev, *replacement;
};

struct r5conf {
	struct hlist_head	*stripe_hashtbl;
	struct mddev		*mddev;
	int			chunk_sectors;
	int			level, algorithm;
	int			max_degraded;
	int			raid_disks;
	int			max_nr_stripes;

	/* reshape_progress is the leading edge of a 'reshape'.
	 * It has value MaxSector when no reshape is happening.
	 * If delta_disks < 0, it is the last sector we started work on,
	 * else it is the next sector to work on.
	 */
	sector_t		reshape_progress;
	/* reshape_safe is the trailing edge of a reshape.  We know that
	 * before (or after) this address, all reshape has completed.
	 */
	sector_t		reshape_safe;
	int			previous_raid_disks;
	int			prev_chunk_sectors;
	int			prev_algo;
	short			generation; /* increments with every reshape */
	unsigned long		reshape_checkpoint; /* Time we last updated
						     * metadata */
	long long		min_offset_diff; /* minimum difference between
						  * data_offset and
						  * new_data_offset across all
						  * devices.  May be negative,
						  * but is closest to zero.
						  */

	struct list_head	handle_list; /* stripes needing handling */
	struct list_head	hold_list; /* preread ready stripes */
	struct list_head	delayed_list; /* stripes that have plugged requests */
	struct list_head	bitmap_list; /* stripes delayed awaiting bitmap update */
	struct bio		*retry_read_aligned; /* currently retrying aligned bios */
	struct bio		*retry_read_aligned_list; /* aligned bios retry list */
	atomic_t		preread_active_stripes; /* stripes with scheduled io */
	atomic_t		active_aligned_reads;
	atomic_t		pending_full_writes; /* full write backlog */
	int			bypass_count; /* bypassed prereads */
	int			bypass_threshold; /* preread nice */
	struct list_head	*last_hold; /* detect hold_list promotions */

	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
	/* unfortunately we need two cache names as we temporarily have
	 * two caches.
	 */
	int			active_name;
	char			cache_name[2][32];
	struct kmem_cache	*slab_cache; /* for allocating stripes */

	int			seq_flush, seq_write;
	int			quiesce;

	int			fullsync;  /* set to 1 if a full sync is needed,
					    * (fresh device added).
					    * Cleared when a sync completes.
					    */
	int			recovery_disabled;
	/* per cpu variables */
	struct raid5_percpu {
		struct page	*spare_page; /* Used when checking P/Q in raid6 */
		void		*scribble;   /* space for constructing buffer
					      * lists and performing address
					      * conversions
					      */
	} __percpu *percpu;
	size_t			scribble_len; /* size of scribble region must be
					       * associated with conf to handle
					       * cpu hotplug while reshaping
					       */
#ifdef CONFIG_HOTPLUG_CPU
	struct notifier_block	cpu_notify;
#endif

	/*
	 * Free stripes pool
	 */
	atomic_t		active_stripes;
	struct list_head	inactive_list;
	wait_queue_head_t	wait_for_stripe;
	wait_queue_head_t	wait_for_overlap;
	int			inactive_blocked;	/* release of inactive stripes blocked,
							 * waiting for 25% to be free
							 */
	int			pool_size; /* number of disks in stripeheads in pool */
	spinlock_t		device_lock;
	struct disk_info	*disks;

	/* When taking over an array from a different personality, we store
	 * the new thread here until we fully activate the array.
	 */
	struct md_thread	*thread;
};

/*
 * Our supported algorithms
 */
#define ALGORITHM_LEFT_ASYMMETRIC	0 /* Rotating Parity N with Data Restart */
#define ALGORITHM_RIGHT_ASYMMETRIC	1 /* Rotating Parity 0 with Data Restart */
#define ALGORITHM_LEFT_SYMMETRIC	2 /* Rotating Parity N with Data Continuation */
#define ALGORITHM_RIGHT_SYMMETRIC	3 /* Rotating Parity 0 with Data Continuation */

/* Define non-rotating (raid4) algorithms.  These allow
 * conversion of raid4 to raid5.
 */
#define ALGORITHM_PARITY_0		4 /* P or P,Q are initial devices */
#define ALGORITHM_PARITY_N		5 /* P or P,Q are final devices. */

/* DDF RAID6 layouts differ from md/raid6 layouts in two ways.
 * Firstly, the exact positioning of the parity block is slightly
 * different between the 'LEFT_*' modes of md and the "_N_*" modes
 * of DDF.
 * Secondly, the order of data blocks over which the Q syndrome is computed
 * is different.
 * Consequently we have different layouts for DDF/raid6 than md/raid6.
 * These layouts are from the DDFv1.2 spec.
 * Interestingly DDFv1.2-Errata-A does not specify N_CONTINUE but
 * leaves RLQ=3 as 'Vendor Specific'
 */

#define ALGORITHM_ROTATING_ZERO_RESTART	8  /* DDF PRL=6 RLQ=1 */
#define ALGORITHM_ROTATING_N_RESTART	9  /* DDF PRL=6 RLQ=2 */
#define ALGORITHM_ROTATING_N_CONTINUE	10 /* DDF PRL=6 RLQ=3 */

/* For every RAID5 algorithm we define a RAID6 algorithm
 * with exactly the same layout for data and parity, and
 * with the Q block always on the last device (N-1).
 * This allows trivial conversion from RAID5 to RAID6.
 */
#define ALGORITHM_LEFT_ASYMMETRIC_6	16
#define ALGORITHM_RIGHT_ASYMMETRIC_6	17
#define ALGORITHM_LEFT_SYMMETRIC_6	18
#define ALGORITHM_RIGHT_SYMMETRIC_6	19
#define ALGORITHM_PARITY_0_6		20
#define ALGORITHM_PARITY_N_6		ALGORITHM_PARITY_N

static inline int algorithm_valid_raid5(int layout)
{
	return (layout >= 0) &&
	       (layout <= 5);
}
static inline int algorithm_valid_raid6(int layout)
{
	return (layout >= 0 && layout <= 5)
		||
		(layout >= 8 && layout <= 10)
		||
		(layout >= 16 && layout <= 20);
}

static inline int algorithm_is_DDF(int layout)
{
	return layout >= 8 && layout <= 10;
}
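
/*
 * Illustrative sketch, not part of the original header: validating a
 * requested layout with the helpers above.  The function name and the
 * -EINVAL convention are assumptions for the example; level 6 arrays
 * use the raid6 check, lower levels the raid5 check.
 */
static inline int check_layout_for_level(int level, int layout)
{
	if (level == 6)
		return algorithm_valid_raid6(layout) ? 0 : -EINVAL;
	return algorithm_valid_raid5(layout) ? 0 : -EINVAL;
}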

extern int md_raid5_congested(struct mddev *mddev, int bits);
extern void md_raid5_kick_device(struct r5conf *conf);
extern int raid5_set_cache_size(struct mddev *mddev, int size);

#endif