I had an issue yesterday with a software RAID, where one disk had to be replaced. I removed its partitions from the arrays using:
mdadm /dev/mdx -r /dev/sdbx
After the failed drive had been replaced by the hosting center, I copied the partition table from sda to the new disk (sdb was the bad device):
sgdisk -R /dev/sdb /dev/sda
I then gave the new disk fresh GUIDs:
sgdisk -G /dev/sdb
Then I added all the partitions again using:
mdadm /dev/mdx -a /dev/sdbx
This went well for all partitions except one, which bails out after a few hours at about 60%.
This is the current state of the RAID:
cat /proc/mdstat
Personalities : [raid1]
md5 : active raid1 sda6[0] sdb6[2](S)
      2633910528 blocks super 1.2 [2/1] [U_]

md4 : active raid1 sda5[0] sdb5[2]
      16768896 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sda4[0] sdb4[2]
      2096064 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[2]
      268304192 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[2]
      523968 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[2]
      8384448 blocks super 1.2 [2/2] [UU]

unused devices: <none>
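To pick out the affected array without reading the whole listing, something like the following sketch could be used. It is not part of the commands I actually ran; the function name degraded_arrays is made up for illustration, and it assumes the usual /proc/mdstat layout in which the status field ([UU], [U_]) sits on the line containing "blocks".

```shell
#!/bin/sh
# Print the name of every md array whose member status contains "_",
# i.e. at least one member is missing. Reads mdstat-style text on stdin.
# (degraded_arrays is a hypothetical helper name.)
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }              # header line: remember the array name
        /blocks/ && match($0, /\[[U_]+\]/) {     # status line: find [UU], [U_], ...
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    '
}

# Demo with two of the arrays from the listing above.
degraded_arrays <<'EOF'
md5 : active raid1 sda6[0] sdb6[2](S)
      2633910528 blocks super 1.2 [2/1] [U_]
md4 : active raid1 sda5[0] sdb5[2]
      16768896 blocks super 1.2 [2/2] [UU]
EOF
```

On a live system it would be called as `degraded_arrays < /proc/mdstat`; for the state shown here it reports only md5.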
In syslog I can see messages like:
Jan 23 14:24:04 rescue kernel: [11163.329021] ata1.00: exception Emask 0x0 SAct 0xf00000 SErr 0x0 action 0x0
Jan 23 14:24:04 rescue kernel: [11163.376449] ata1.00: configured for UDMA/133
Jan 23 14:24:04 rescue kernel: [11163.376475] sd 0:0:0:0: [sda] Unhandled sense code
Jan 23 14:24:04 rescue kernel: [11163.376477] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376479] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 23 14:24:04 rescue kernel: [11163.376481] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376483] Sense Key : Medium Error [current] [descriptor]
Jan 23 14:24:04 rescue kernel: [11163.376486] Descriptor sense data with sense descriptors (in hex):
Jan 23 14:24:04 rescue kernel: [11163.376487]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jan 23 14:24:04 rescue kernel: [11163.376495]         ce 1f 0d 58
Jan 23 14:24:04 rescue kernel: [11163.376498] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376501] Add. Sense: Unrecovered read error - auto reallocate failed
Jan 23 14:24:04 rescue kernel: [11163.376503] sd 0:0:0:0: [sda] CDB:
Jan 23 14:24:04 rescue kernel: [11163.376504] Read(16): 88 00 00 00 00 00 ce 1f 0b 80 00 00 04 00 00 00
Jan 23 14:24:04 rescue kernel: [11163.376513] end_request: I/O error, dev sda, sector 3458141528
and
Jan 23 14:35:22 rescue kernel: [11840.396206] ata1.00: configured for UDMA/133
Jan 23 14:35:22 rescue kernel: [11840.396212] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396216] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396220] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396223] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396230] ata1: EH complete
Jan 23 14:35:52 rescue kernel: [11870.888343] ata1.00: exception Emask 0x0 SAct 0x40000007 SErr 0x0 action 0x6 frozen
Jan 23 14:35:52 rescue kernel: [11870.945207] ata1.00: cmd 60/00:08:80:c3:58/04:00:ce:00:00/40 tag 1 ncq 524288 in
Jan 23 14:35:52 rescue kernel: [11870.945207]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:52 rescue kernel: [11870.982487] ata1.00: cmd 60/80:10:00:c0:58/03:00:ce:00:00/40 tag 2 ncq 458752 in
Jan 23 14:35:52 rescue kernel: [11870.982487]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:53 rescue kernel: [11871.019291] ata1.00: cmd 60/00:f0:80:cb:58/04:00:ce:00:00/40 tag 30 ncq 524288 in
Jan 23 14:35:53 rescue kernel: [11871.019291]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:53 rescue kernel: [11871.055486] ata1: hard resetting link
Jan 23 14:35:53 rescue kernel: [11871.707811] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 23 14:35:53 rescue kernel: [11871.708270] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359)
Jan 23 14:35:53 rescue kernel: [11871.708279] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88041d869a88), AE_NOT_FOUND (20131218/psparse-536)
Jan 23 14:35:53 rescue kernel: [11871.709174] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359)
Jan 23 14:35:53 rescue kernel: [11871.709182] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88041d869a88), AE_NOT_FOUND (20131218/psparse-536)
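For reference, the hex bytes in the first syslog excerpt can be decoded to check that the sense data and the end_request line refer to the same sector. This is only a sketch of the arithmetic; hex_bytes_to_dec is a helper name I made up, and it relies on the shell's printf accepting a 0x-prefixed argument for %d.

```shell
#!/bin/sh
# Convert space-separated hex bytes (most significant first) to decimal.
# (hex_bytes_to_dec is a hypothetical helper name.)
hex_bytes_to_dec() {
    printf '%d\n' "0x$(printf '%s' "$*" | tr -d ' ')"
}

failed=$(hex_bytes_to_dec ce 1f 0d 58)   # last four bytes of the sense data
start=$(hex_bytes_to_dec ce 1f 0b 80)    # LBA field of the Read(16) CDB
count=$(hex_bytes_to_dec 00 00 04 00)    # transfer length field of the CDB

echo "failing sector: $failed"
echo "request start:  $start"
echo "request length: $count sectors"
echo "offset of failure within the request: $((failed - start)) sectors"
```

The failing sector comes out as 3458141528, the same value the kernel reports for sda, and it lies 472 sectors into a 1024-sector read that starts at sector 3458141056.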
I am able to mount /dev/md5 and list the files. However, I cannot add the new partition to the array.
Is there any way I can fix this without losing the data on the partition?
If not, is it possible to just format that single partition and then add the new drive again? I should have an up-to-date backup of that partition, so that would not be an issue. If possible, I would like to avoid having to erase all partitions.
smartctl output:
/dev/sda:
smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.14.27] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM001-1CH166
Serial Number:    Z1F1XJHC
LU WWN Device Id: 5 000c50 04f3fc2c7
Firmware Version: CC24
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Jan 23 16:16:32 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Error SMART Values Read failed: scsi error aborted command
Smartctl: SMART Read Values failed.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.

SMART Error Log Version: 1
ATA Error Count: 107 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 107 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:49.931  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:48.680  READ DMA EXT
  ef 10 02 00 00 00 a0 00      15:56:48.644  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      15:56:48.644  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      15:56:48.644  IDENTIFY DEVICE

Error 106 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:45.363  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:44.071  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:42.789  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:42.755  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:42.722  READ DMA EXT

Error 105 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:15.716  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:12.832  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:11.540  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:10.290  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:56:09.448  READ DMA EXT

Error 104 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      15:56:02.563  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:59.655  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:58.319  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:58.069  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:57.838  READ DMA EXT

Error 103 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 80 ff ff ff ef 00      15:55:51.995  READ DMA EXT
  25 00 08 ff ff ff ef 00      15:55:50.735  READ DMA EXT
  ef 10 02 00 00 00 a0 00      15:55:50.700  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00      15:55:50.700  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      15:55:50.699  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error         00%             4561  -
# 2  Extended offline    Completed without error         00%             2977  -
# 3  Extended offline    Completed without error         00%                5  -

Device does not support Selective Self Tests/Logging
/dev/sdb:
smartctl -a /dev/sdb
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.14.27] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST33000650NS
Serial Number:    Z295TK0G
LU WWN Device Id: 5 000c50 04f891ded
Firmware Version: 0004
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Jan 23 16:15:30 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   053   044    Pre-fail  Always       -       70825960
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       791126750
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7155
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       11
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   090   090   000    Old_age   Always       -       10
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   066   043   045    Old_age   Always   In_the_past 34 (5 173 37 27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       11
194 Temperature_Celsius     0x0022   034   057   000    Old_age   Always       -       34 (0 24 0 0)
195 Hardware_ECC_Recovered  0x001a   018   007   000    Old_age   Always       -       70825960
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 18 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 18 ff ff ff 4f 00  26d+03:52:28.560  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  26d+03:52:28.560  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:52:28.559  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:52:28.559  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:52:28.559  READ FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  26d+03:52:13.471  READ FPDMA QUEUED
  60 00 58 d0 57 44 43 00  26d+03:52:13.471  READ FPDMA QUEUED
  61 00 02 08 90 6d 49 00  26d+03:52:13.471  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 a0 00  26d+03:52:13.470  FLUSH CACHE EXT
  60 00 00 e0 42 20 4e 00  26d+03:52:13.422  READ FPDMA QUEUED

Error 16 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  26d+03:51:56.176  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:56.176  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:56.175  READ FPDMA QUEUED
  60 00 00 e0 0d 20 4e 00  26d+03:51:56.116  READ FPDMA QUEUED
  60 00 00 e0 0c 20 4e 00  26d+03:51:56.114  READ FPDMA QUEUED

Error 15 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 50 59 cb 43 00  26d+03:51:24.077  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:24.077  READ FPDMA QUEUED
  60 00 00 e0 c5 1c 4e 00  26d+03:51:24.076  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 00  26d+03:51:24.071  FLUSH CACHE EXT
  60 00 08 28 46 c1 43 00  26d+03:51:22.717  READ FPDMA QUEUED

Error 14 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  26d+03:51:02.317  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  26d+03:51:02.317  WRITE FPDMA QUEUED
  ea 00 00 00 00 00 a0 00  26d+03:51:02.316  FLUSH CACHE EXT
  60 00 08 ff ff ff 4f 00  26d+03:51:02.303  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  26d+03:51:02.300  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error         00%             7071  -
# 2  Extended offline    Completed without error         00%             7060  -
# 3  Extended offline    Completed without error         00%             5600  -
# 4  Short offline       Completed without error         00%             2489  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.