Channel: User st-h - Server Fault
Viewing all articles
Browse latest Browse all 6

Linux: Rebuilding software Raid 1 fails when adding partition


I had an issue yesterday with a software RAID where one disk had to be replaced. I removed the partitions from the array using:

mdadm /dev/mdx -r /dev/sdbx

After the failed drive had been replaced by the hosting center, I copied the partition table from the healthy disk to the new disk (sdb was the bad device):

sgdisk -R /dev/sdb /dev/sda 

Then I gave the new disk fresh random GUIDs:

sgdisk -G /dev/sdb

Then I added all the partitions again using:

mdadm /dev/mdx -a /dev/sdbx
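
The steps above can be summarised as a dry-run sketch that only prints the commands instead of running them. It assumes the add step uses mdadm's -a (--add) option, and it takes the array-to-partition pairing (md0 with sdb1 through md5 with sdb6) from the /proc/mdstat output in this post. The function name plan_rebuild and its ARRAY:NUMBER argument format are made up for illustration.

```shell
#!/bin/sh
# Print (do not execute) the commands for re-mirroring onto a replaced disk.
# usage: plan_rebuild GOOD_DISK NEW_DISK ARRAY:PARTNUM...
plan_rebuild() {
  good=$1
  new=$2
  shift 2
  # Copy the partition table from the healthy disk to the new one.
  echo "sgdisk -R /dev/$new /dev/$good"
  # Randomize the GUIDs on the new disk so they do not clash with the source.
  echo "sgdisk -G /dev/$new"
  # Add each new partition back to its array.
  for pair in "$@"; do
    md=${pair%%:*}
    num=${pair##*:}
    echo "mdadm /dev/$md -a /dev/$new$num"
  done
}

plan_rebuild sda sdb md0:1 md1:2 md2:3 md3:4 md4:5 md5:6
```

Printing the plan first makes it easy to check that the source and target disks are the right way round before anything is written.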

This went well for all partitions except one, where the rebuild aborts after a few hours at about 60%. This is the current state of the RAID:

cat /proc/mdstat
Personalities : [raid1]
md5 : active raid1 sda6[0] sdb6[2](S)
      2633910528 blocks super 1.2 [2/1] [U_]
md4 : active raid1 sda5[0] sdb5[2]
      16768896 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sda4[0] sdb4[2]
      2096064 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sda3[0] sdb3[2]
      268304192 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[2]
      523968 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[2]
      8384448 blocks super 1.2 [2/2] [UU]
unused devices: <none>
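
To pick out the affected array from this kind of output, a small filter can look for an underscore in the member-status field ([UU] versus [U_]). This is only a sketch: the function name degraded_arrays is made up, and the sample input is two arrays copied from the output in this post.

```shell
#!/bin/sh
# List md arrays whose member-status field (e.g. [U_]) contains "_",
# meaning at least one member is missing or not in sync.
degraded_arrays() {
  awk '
    /^md[0-9]+ :/ { name = $1 }            # remember the current array name
    match($0, /\[[U_]+\]/) {               # status field such as [UU] or [U_]
      if (index(substr($0, RSTART, RLENGTH), "_")) print name
    }
  '
}

# On a live system: degraded_arrays < /proc/mdstat
degraded_arrays <<'EOF'
md5 : active raid1 sda6[0] sdb6[2](S)
      2633910528 blocks super 1.2 [2/1] [U_]
md4 : active raid1 sda5[0] sdb5[2]
      16768896 blocks super 1.2 [2/2] [UU]
EOF
```

For the sample input this prints md5, the array where sdb6 is only attached as a spare (S).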

In syslog I can see messages like:

Jan 23 14:24:04 rescue kernel: [11163.329021] ata1.00: exception Emask 0x0 SAct 0xf00000 SErr 0x0 action 0x0
Jan 23 14:24:04 rescue kernel: [11163.376449] ata1.00: configured for UDMA/133
Jan 23 14:24:04 rescue kernel: [11163.376475] sd 0:0:0:0: [sda] Unhandled sense code
Jan 23 14:24:04 rescue kernel: [11163.376477] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376479] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 23 14:24:04 rescue kernel: [11163.376481] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376483] Sense Key : Medium Error [current] [descriptor]
Jan 23 14:24:04 rescue kernel: [11163.376486] Descriptor sense data with sense descriptors (in hex):
Jan 23 14:24:04 rescue kernel: [11163.376487]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jan 23 14:24:04 rescue kernel: [11163.376495]         ce 1f 0d 58
Jan 23 14:24:04 rescue kernel: [11163.376498] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376501] Add. Sense: Unrecovered read error - auto reallocate failed
Jan 23 14:24:04 rescue kernel: [11163.376503] sd 0:0:0:0: [sda] CDB:
Jan 23 14:24:04 rescue kernel: [11163.376504] Read(16): 88 00 00 00 00 00 ce 1f 0b 80 00 00 04 00 00 00
Jan 23 14:24:04 rescue kernel: [11163.376513] end_request: I/O error, dev sda, sector 3458141528
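
As a cross-check of the log above, the hex bytes can be converted by hand. The sketch below assumes the four bytes "ce 1f 0d 58" at the end of the sense data are the failing LBA and that the Read(16) CDB carries the start LBA (ce 1f 0b 80) and the transfer length (00 00 04 00). The variable names are only illustrative.

```shell
#!/bin/sh
# Values copied from the kernel log above.
sense_lba=$((0xce1f0d58))    # trailing bytes "ce 1f 0d 58" of the sense data
read_start=$((0xce1f0b80))   # starting LBA in the Read(16) CDB
read_len=$((0x400))          # transfer length in the CDB, in sectors

# How far into the read request the unreadable sector lies.
offset=$((sense_lba - read_start))

echo "failing sector:          $sense_lba"
echo "offset into the request: $offset of $read_len sectors"
```

The first value comes out as 3458141528, the same sector the end_request line reports for sda, and the offset is 472 of 1024 sectors.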

and

Jan 23 14:35:22 rescue kernel: [11840.396206] ata1.00: configured for UDMA/133
Jan 23 14:35:22 rescue kernel: [11840.396212] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396216] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396220] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396223] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396230] ata1: EH complete
Jan 23 14:35:52 rescue kernel: [11870.888343] ata1.00: exception Emask 0x0 SAct 0x40000007 SErr 0x0 action 0x6 frozen
Jan 23 14:35:52 rescue kernel: [11870.945207] ata1.00: cmd 60/00:08:80:c3:58/04:00:ce:00:00/40 tag 1 ncq 524288 in
Jan 23 14:35:52 rescue kernel: [11870.945207]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:52 rescue kernel: [11870.982487] ata1.00: cmd 60/80:10:00:c0:58/03:00:ce:00:00/40 tag 2 ncq 458752 in
Jan 23 14:35:52 rescue kernel: [11870.982487]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:53 rescue kernel: [11871.019291] ata1.00: cmd 60/00:f0:80:cb:58/04:00:ce:00:00/40 tag 30 ncq 524288 in
Jan 23 14:35:53 rescue kernel: [11871.019291]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:53 rescue kernel: [11871.055486] ata1: hard resetting link
Jan 23 14:35:53 rescue kernel: [11871.707811] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 23 14:35:53 rescue kernel: [11871.708270] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359)
Jan 23 14:35:53 rescue kernel: [11871.708279] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88041d869a88), AE_NOT_FOUND (20131218/psparse-536)
Jan 23 14:35:53 rescue kernel: [11871.709174] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359)
Jan 23 14:35:53 rescue kernel: [11871.709182] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88041d869a88), AE_NOT_FOUND (20131218/psparse-536)

I am able to mount /dev/md5 and list the files. However, I cannot add the new partition to the array.

Is there any way I can fix this without losing the data on the partition?

If not, is it possible to just format that single partition and then add the new drive again? I should have an up-to-date backup of that partition, so that would not be an issue. If possible, I would like to avoid having to erase all partitions.

smartctl output:

/dev/sda:

smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.14.27] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM001-1CH166
Serial Number:    Z1F1XJHC
LU WWN Device Id: 5 000c50 04f3fc2c7
Firmware Version: CC24
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Jan 23 16:16:32 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Error SMART Values Read failed: scsi error aborted command
Smartctl: SMART Read Values failed.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.

SMART Error Log Version: 1
ATA Error Count: 107 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 107 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:49.931  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:48.680  READ DMA EXT
  ef 10 02 00 00 00 a0 00      15:56:48.644  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      15:56:48.644  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      15:56:48.644  IDENTIFY DEVICE

Error 106 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:45.363  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:44.071  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:42.789  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:42.755  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:42.722  READ DMA EXT

Error 105 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:15.716  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:12.832  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:11.540  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:10.290  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:09.448  READ DMA EXT

Error 104 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:02.563  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:59.655  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:58.319  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:58.069  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:57.838  READ DMA EXT

Error 103 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 80 ff ff ff ef 00      15:55:51.995  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:50.735  READ DMA EXT
  ef 10 02 00 00 00 a0 00      15:55:50.700  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      15:55:50.700  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      15:55:50.699  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4561         -
# 2  Extended offline    Completed without error       00%      2977         -
# 3  Extended offline    Completed without error       00%         5         -

Device does not support Selective Self Tests/Logging

/dev/sdb:

smartctl -a /dev/sdb
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.14.27] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST33000650NS
Serial Number:    Z295TK0G
LU WWN Device Id: 5 000c50 04f891ded
Firmware Version: 0004
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Jan 23 16:15:30 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline data collection:        (  600) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 255) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x10bd) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   053   044    Pre-fail  Always       -       70825960
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       791126750
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7155
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       11
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   090   090   000    Old_age   Always       -       10
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   043   045    Old_age   Always   In_the_past 34 (5 173 37 27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       11
194 Temperature_Celsius     0x0022   034   057   000    Old_age   Always       -       34 (0 24 0 0)
195 Hardware_ECC_Recovered  0x001a   018   007   000    Old_age   Always       -       70825960
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 18 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 18 ff ff ff 4f 00  26d+03:52:28.560  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  26d+03:52:28.560  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:52:28.559  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:52:28.559  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:52:28.559  READ FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  26d+03:52:13.471  READ FPDMA QUEUED
  60 00 58 d0 57 44 43 00  26d+03:52:13.471  READ FPDMA QUEUED
  61 00 02 08 90 6d 49 00  26d+03:52:13.471  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 a0 00  26d+03:52:13.470  FLUSH CACHE EXT
  60 00 00 e0 42 20 4e 00  26d+03:52:13.422  READ FPDMA QUEUED

Error 16 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  26d+03:51:56.176  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:56.176  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:56.175  READ FPDMA QUEUED
  60 00 00 e0 0d 20 4e 00  26d+03:51:56.116  READ FPDMA QUEUED
  60 00 00 e0 0c 20 4e 00  26d+03:51:56.114  READ FPDMA QUEUED

Error 15 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 50 59 cb 43 00  26d+03:51:24.077  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:24.077  READ FPDMA QUEUED
  60 00 00 e0 c5 1c 4e 00  26d+03:51:24.076  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 00  26d+03:51:24.071  FLUSH CACHE EXT
  60 00 08 28 46 c1 43 00  26d+03:51:22.717  READ FPDMA QUEUED

Error 14 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  26d+03:51:02.317  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  26d+03:51:02.317  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 a0 00  26d+03:51:02.316  FLUSH CACHE EXT
  60 00 08 ff ff ff 4f 00  26d+03:51:02.303  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:02.300  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      7071         -
# 2  Extended offline    Completed without error       00%      7060         -
# 3  Extended offline    Completed without error       00%      5600         -
# 4  Short offline       Completed without error       00%      2489         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
