
RAID-1 recovery (automatic) that isn't working

Luca Coianiz luca at coianiz.it
Tue 14 Sep 2004 13:00:13 UTC
I'll try to keep this short ;)
(most of this message is taken up by the logs)

 Last Sunday (12 Sep) I got back from holiday and found the server off
(seized fan = "blown" power supply): just a blinking LED (so the PC is,
all in all, "fine"). The shutdown seems to have happened on the 9th, at
01:09:14, preceded by a few "resets" (again due to the power supply):

(/var/log/messages)
---8<---
Sep  8 22:33:51 home syslogd 1.4.1: restart (remote reception).
Sep  8 22:33:52 home logtmstamp: 20040908.223352: STARTUP TIMESTAMP.
Sep  8 22:33:56 home kernel: klogd 1.4.1, log source = /proc/kmsg started.
---8<---
Sep  8 22:37:22 home syslogd 1.4.1: restart (remote reception).
Sep  8 22:37:23 home logtmstamp: 20040908.223723: STARTUP TIMESTAMP.
Sep  8 22:37:27 home kernel: klogd 1.4.1, log source = /proc/kmsg started.
---8<---
Sep  8 22:59:33 home kernel: md: md0: sync done.
---8<---
Sep  9 01:09:09 home syslogd 1.4.1: restart (remote reception).
Sep  9 01:09:11 home logtmstamp: 20040909.010911: STARTUP TIMESTAMP.
Sep  9 01:09:14 home kernel: klogd 1.4.1, log source = /proc/kmsg started.
---8<---
Sep  9 01:09:41 home /usr/sbin/cron[1883]: (CRON) STARTUP (fork ok)
---8<---

 Oh well.
 Yesterday I bought the new power supply and brought everything back up.
 It does come back up but... the RAID-1 doesn't (auto-)recover. :(

---8<---
Sep 13 20:31:08 home syslogd 1.4.1: restart (remote reception).
Sep 13 20:31:09 home logtmstamp: 20040913.203109: STARTUP TIMESTAMP.
Sep 13 20:31:13 home kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 13 20:31:13 home kernel: hda: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Sep 13 20:31:13 home kernel: hda: dma_intr: error=0x40 { UncorrectableError
}, LBAsect=656093, sector=61568
Sep 13 20:31:13 home kernel: end_request: I/O error, dev 03:03 (hda), sector
61568
Sep 13 20:31:13 home kernel: raid1: Disk failure on hda3, disabling device.
Sep 13 20:31:13 home kernel:    Operation continuing on 1 devices
Sep 13 20:31:13 home kernel: raid1: mirror resync was not fully finished,
restarting next time.
Sep 13 20:31:13 home kernel: md: recovery thread got woken up ...
Sep 13 20:31:13 home kernel: md: updating md0 RAID superblock on device
Sep 13 20:31:13 home kernel: md: (skipping faulty hda3 )
Sep 13 20:31:13 home kernel: md: hdb3 [events: 0000019c]<6>(write) hdb3's sb
offset: 39905344
Sep 13 20:31:13 home kernel: hda: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Sep 13 20:31:13 home kernel: hda: dma_intr: error=0x40 { UncorrectableError
}, LBAsect=656093, sector=61576
Sep 13 20:31:13 home kernel: end_request: I/O error, dev 03:03 (hda), sector
61576
Sep 13 20:31:13 home kernel: md0: no spare disk to reconstruct array! --
continuing in degraded mode
Sep 13 20:31:13 home kernel: md: recovery thread finished ...
Sep 13 20:31:13 home kernel: md: md_do_sync() got signal ... exiting
---8<---

 It boots from /dev/hda1 (so I think hda is damaged only logically, apart
from the I/O error reported above, which it gave me only once), but then
it puts only /dev/hdb3 into /dev/md0 and, since I don't have a spare, it
runs in degraded mode.
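
 To rule out real hardware trouble on hda I'm also thinking of a SMART
query; just a sketch, assuming smartmontools is even installed on this
SuSE 8.0 box (I haven't checked yet):

home:~ # smartctl -a /dev/hda    # health status, attributes and error log

 The reallocated/pending-sector counters should say whether the disk is
physically on its way out or whether that error was a one-off.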

 Config at a glance:

- distro: SuSE 8.0
- software RAID-1
- 2 x 40GB IDE disks: /dev/hda and /dev/hdb
- partitions (/etc/fstab):
  /dev/hda1   /boot    (outside RAID)   Ext3   ok
  /dev/hdb1   /boot1   (outside RAID)   Ext3   ok
  /dev/hda2   swap     (outside RAID)          ok
  /dev/hdb2   swap     (outside RAID)          ok
  /dev/hda3   /        RAID-1           Ext3   "fault"
  /dev/hdb3   /        RAID-1           Ext3   ok

(from /var/log/boot.msg)
---8<---
<4>    ide0: BM-DMA at 0xa800-0xa807, BIOS settings: hda:DMA, hdb:DMA
<4>    ide1: BM-DMA at 0xa808-0xa80f, BIOS settings: hdc:DMA, hdd:pio
<4>hda: IC35L040AVVA07-0, ATA DISK drive
<4>hdb: IC35L040AVVA07-0, ATA DISK drive
---8<---
<6>Partition check:
<6> hda: hda1 hda2 hda3
<6> hdb: hdb1 hdb2 hdb3
---8<---
<6>md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
<6>md: Autodetecting RAID arrays.
<6> [events: 0000019b]
<6> [events: 0000019d]
<6>md: autorun ...
<6>md: considering hdb3 ...
<6>md:  adding hdb3 ...
<6>md:  adding hda3 ...
<6>md: created md0
<6>md: bind<hda3,1>
<6>md: bind<hdb3,2>
<6>md: running: <hdb3><hda3>
<6>md: hdb3's event counter: 0000019d
<6>md: hda3's event counter: 0000019b
<3>md: superblock update time inconsistency -- using the most recent one
<6>md: freshest: hdb3
<4>md: kicking non-fresh hda3 from array!
<6>md: unbind<hda3,1>
<6>md: export_rdev(hda3)
<4>md0: removing former faulty hda3!
<3>md: md0: raid array is not clean -- starting background reconstruction
<6>md: RAID level 1 does not need chunksize! Continuing anyway.
<3>request_module[md-personality-3]: Root fs not mounted
<3>md: personality 3 is not loaded!
<4>md :do_md_run() returned -22
<6>md: md0 stopped.
<6>md: unbind<hdb3,0>
<6>md: export_rdev(hdb3)
<6>md: ... autorun DONE.
---8<---
<6>Journalled Block Device driver loaded
<6>md: raid1 personality registered as nr 3
<6>md: Autodetecting RAID arrays.
<6> [events: 0000019b]
<6> [events: 0000019d]
<6>md: autorun ...
<6>md: considering hdb3 ...
<6>md:  adding hdb3 ...
<6>md:  adding hda3 ...
<6>md: created md0
<6>md: bind<hda3,1>
<6>md: bind<hdb3,2>
<6>md: running: <hdb3><hda3>
<6>md: hdb3's event counter: 0000019d
<6>md: hda3's event counter: 0000019b
<3>md: superblock update time inconsistency -- using the most recent one
<6>md: freshest: hdb3
<4>md: kicking non-fresh hda3 from array!
<6>md: unbind<hda3,1>
<6>md: export_rdev(hda3)
<4>md0: removing former faulty hda3!
<3>md: md0: raid array is not clean -- starting background reconstruction
<6>md: RAID level 1 does not need chunksize! Continuing anyway.
<6>md0: max total readahead window set to 124k
<6>md0: 1 data-disks, max readahead per data-disk: 124k
<6>raid1: device hdb3 operational as mirror 1
<1>raid1: md0, not all disks are operational -- trying to recover array
<6>raid1: raid set md0 active with 1 out of 2 mirrors
<6>md: updating md0 RAID superblock on device
<6>md: hdb3 [events: 0000019e]<6>(write) hdb3's sb offset: 39905344
<6>md: recovery thread got woken up ...
<3>md0: no spare disk to reconstruct array! -- continuing in degraded mode
<6>md: recovery thread finished ...
<6>md: ... autorun DONE.
---8<---
<6>md: Autodetecting RAID arrays.
<6> [events: 0000019b]
<6>md: autorun ...
<6>md: considering hda3 ...
<6>md:  adding hda3 ...
<4>md: md0 already running, cannot run hda3
<6>md: export_rdev(hda3)
<6>md: (hda3 was pending)
<6>md: ... autorun DONE.
<6>Adding Swap: 265064k swap-space (priority 42)
<6>Adding Swap: 265064k swap-space (priority 42)
<6>EXT3 FS 2.4-0.9.17, 10 Jan 2002 on md(9,0), internal journal
<6>kjournald starting.  Commit interval 5 seconds
<6>EXT3 FS 2.4-0.9.17, 10 Jan 2002 on ide0(3,1), internal journal
<6>EXT3-fs: mounted filesystem with ordered data mode.
<6>kjournald starting.  Commit interval 5 seconds
<6>EXT3 FS 2.4-0.9.17, 10 Jan 2002 on ide0(3,65), internal journal
<6>EXT3-fs: mounted filesystem with ordered data mode.
---8<---

home:/var/log # cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hdb3[1]
      39905344 blocks [2/1] [_U]

unused devices: <none>

 Since hda3 is not in use, I also tried an fsck:

home:/etc # e2fsck -cc -v /dev/hda3
e2fsck 1.26 (3-Feb-2002)
Checking for bad blocks (non-destructive read-write test): done
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/hda3: ***** FILE SYSTEM WAS MODIFIED *****

  196741 inodes used (3%)
   10295 non-contiguous inodes (5.2%)
 # of inodes with ind/dind/tind blocks: 17246/276/0
 5776652 blocks used (57%)
       0 bad blocks
       0 large files

  172746 regular files
   13303 directories
    1828 character device files
    6690 block device files
       6 fifos
     527 links
    2152 symbolic links (2132 fast symbolic links)
       7 sockets
--------
  197259 files

 ...and it doesn't seem to find any problems.
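
 Another check I'd like to try is simply reading back the area around the
sector that gave the UncorrectableError, to see whether it still errors
out. A read-only sketch, assuming the "sector=61568" in the log is
relative to hda3:

home:~ # # read 512 sectors starting just before the one reported failing (61568)
home:~ # dd if=/dev/hda3 of=/dev/null bs=512 skip=61440 count=512

 If dd gets through cleanly, I'd take it as one more hint that the damage
is only "logical".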

 Unfortunately, as far as "tools" go, I don't have much:

home:/var/log # l /sbin/raid*
 /sbin/raid0run -> mkraid*
 /sbin/raidautorun*
 /sbin/raidhotadd -> raidstart*
 /sbin/raidhotgenerateerror -> raidstart*
 /sbin/raidhotremove -> raidstart*
 /sbin/raidstart*
 /sbin/raidstop -> raidstart*
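
 Among these, I suppose the one to re-insert hda3 into the running array
would be raidhotadd, i.e. something like:

home:~ # raidhotadd /dev/md0 /dev/hda3    # hot-add hda3 back into md0 and let the resync start

but before pointing it at the only good copy of my data I'd rather get a
sanity check from the list (and I guess it needs a sensible /etc/raidtab
anyway, which brings me to the next point).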

 In practice the system seems to use only raidautorun, so much so that
/etc/raidtab doesn't even exist: when I tried to use raidstart it
immediately complained that it was missing:

home:/etc # raidstart /dev/md0
Couldn't open /etc/raidtab -- No such file or directory

 Starting from a sample I created a raidtab (a sketch of it is further
below), but re-running raidstart didn't really change anything:

home:/etc # raidstart /dev/md0
/dev/md0: File exists

(and in /var/log/messages)
---8<---
Sep 14 14:11:02 home kernel: md: array md0 already exists!
---8<---

(of course it exists: I'm working on top of it) :|
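
 For reference, the raidtab I put together looks more or less like this
(I'm re-typing it from the partition layout above, so treat it as a
sketch; I'm not sure all the parameters are right):

---8<---
# /etc/raidtab -- mirror md0 = hda3 + hdb3, no spares
raiddev /dev/md0
        raid-level              1
        nr-raid-disks           2
        nr-spare-disks          0
        persistent-superblock   1
        device                  /dev/hda3
        raid-disk               0
        device                  /dev/hdb3
        raid-disk               1
---8<---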

 I can't stop the RAID, since (obviously) it's busy.
 Re-running raidautorun I get:

(/var/log/messages)
---8<---
Sep 14 14:13:54 home kernel: md: Autodetecting RAID arrays.
Sep 14 14:13:54 home kernel:  [events: 0000019b]
Sep 14 14:13:54 home kernel: md: autorun ...
Sep 14 14:13:54 home kernel: md: considering hda3 ...
Sep 14 14:13:54 home kernel: md:  adding hda3 ...
Sep 14 14:13:54 home kernel: md: md0 already running, cannot run hda3
Sep 14 14:13:54 home kernel: md: export_rdev(hda3)
Sep 14 14:13:54 home kernel: md: (hda3 was pending)
Sep 14 14:13:54 home kernel: md: ... autorun DONE.
---8<---

 The only thing I haven't done yet is an mkfs: I'm not very comfortable
with the devices and I'm too afraid of wrecking the working half of the
array. :|
(in the meantime I'm looking after a backup to another 30GB IDE disk)
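
 The rough plan for that backup (assuming the 30GB disk ends up as
/dev/hdc with a single ext3 partition; device and mount point are just
placeholders on my part):

home:~ # mkdir -p /mnt/backup
home:~ # mount /dev/hdc1 /mnt/backup
home:~ # # archive the live root fs (the degraded md0) without crossing into /boot, /proc, etc.
home:~ # tar -C / --one-file-system -czf /mnt/backup/root-20040914.tar.gz .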

 My gut feeling is that this is a bit of a stalemate: I can't go back
(stopping the RAID), I can't go forward (the mirror rebuild doesn't
happen, I don't understand why and I can't force it), and by standing
still nothing changes (this isn't Windows, where after a couple of
reboots "something starts working again" ;)). :|

 Any advice?

	LC




