Reallocated Sector / Event Count 1 1 5, disk degraded: is the hard drive still usable? Will files get corrupted?

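1 Purpose
In today's big-data environments, disk performance and stability are critical to the business. On Linux, smartctl is one of the most commonly used disk-checking tools.
This document is based on smartctl on Linux. It explains how to use the tool and analyzes SMART (Self-Monitoring, Analysis and Reporting Technology).
2 Terms, definitions, and abbreviations
2.1 Terms and definitions
The terms and definitions used in this document are listed in Table 2.1.
SMART    Self-Monitoring, Analysis and Reporting Technology
2.2 Abbreviations
The abbreviations used in this document are listed in Table 2.2.
SMART    Self-Monitoring, Analysis and Reporting Technology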
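3 smartctl
smartctl is a command-line tool shipped in the smartmontools-5.38-2.el5 RPM. It performs SMART tasks: printing the SMART self-test and error logs, enabling or disabling SMART automatic testing, and starting disk self-tests.
smartctl [options] device
"/dev/hd[a-t]"    IDE/ATA disks
"/dev/sd[a-z]"    SCSI disks. Note that SATA disks are accessed through the libata library, so add the option "-d ata" for them.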
3.1 [options]
Options are grouped by type.
3.1.1 Information options:
-h    Show help
-V    Show version information
-i    Print basic information (device model, serial number, firmware version...)
-a    Print all SMART information for the disk
3.1.2 Run-time behavior options:
-q TYPE    Set the quiet mode for output.
           TYPE can be one of three values:
             errorsonly    Print only errors.
             silent        Print nothing.
             noserial      Do not print the serial number.
-d TYPE    Specify the disk type. If omitted, smartctl guesses the type from the device name.
-T TYPE    Set how tolerant smartctl is of errors, that is, whether it keeps running when one occurs.
           TYPE can be one of four values:
             conservative      Exit on any error.
             normal            Exit if a mandatory SMART command fails.
             permissive        Ignore one failure of a mandatory SMART command.
             verypermissive    Ignore all failures of mandatory SMART commands.
-b TYPE    Set what smartctl does when a checksum error occurs.
           TYPE can be one of three values:
             warn      Print a warning and continue.
             exit      Exit smartctl.
             ignore    Continue without a warning.
-r TYPE    Intended for smartmontools developers.
-n POWERMODE    Set whether smartctl still checks a disk that is in a low-power mode. By default the power mode is ignored and the disk is checked.
           POWERMODE can be one of four values:
             never      Always check.
             sleep      Check unless the disk is in sleep mode.
             standby    Check unless the disk is in sleep or standby mode.
             idle       Check unless the disk is in sleep, standby, or idle mode.
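3.1.3 SMART feature switches:
-s on/off    Enable or disable SMART on the disk.
-o on/off    Enable or disable SMART automatic offline testing, which scans the disk for defects every 4 hours.
-S on/off    Enable or disable autosave of vendor-specific attributes.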
3.1.4 SMART read and display options
-H    Report whether the disk is healthy. A failing report means the disk has already failed or is predicted to fail within 24 hours.
-c    Show the generic SMART capabilities the disk supports and their current status.
-A    Show the vendor-specific SMART attributes the disk supports. Attributes are numbered from 1 to 253 and have specific names.
-l TYPE    Select the log to display.
           TYPE can be one of four values:
             error        Show only the error log.
             selftest     Show only the self-test log.
             selective    Show only the selective self-test log.
             directory    Show only the Log Directory.
-v N,OPTION    Use a vendor-specific display format when showing vendor-specific SMART attribute N.
-F TYPE    Set how smartctl behaves when it meets known but still unfixed hardware or software bugs.
-P TYPE    Set whether smartctl applies the presets stored in its drive database to the disk.
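3.1.5 SMART offline test and self-test options
-t TEST    Start a test immediately. Can be combined with -C.
           TEST can be one of the following:
             offline       Offline test. Can be used on a disk with mounted file systems.
             short         Short test. Can be used on a disk with mounted file systems.
             long          Long test. Can be used on a disk with mounted file systems.
             conveyance    [ATA only] Conveyance self-test. Can be used on a disk with mounted file systems.
             select,N-M
             select,N+SIZE [ATA only] Selective test of part of the disk's LBAs. N is the first LBA, M is the last LBA, and SIZE is the number of LBAs to test.
-C    Run the test in captive mode.
      Notes: (1) -C must be used together with -t, but it has no effect with -t offline.
             (2) -C keeps the disk very busy, so it is best used on a disk with no mounted file systems.
-X    Abort a test running in non-captive mode.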
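The two selective-test forms describe the same kind of LBA span, so one can be converted into the other. The sketch below assumes that SIZE counts LBAs starting at N, so that the span ends at N+SIZE-1; the helper names select_span_end and select_arg_range are hypothetical and are not part of smartctl.

```shell
# Hypothetical helpers (not smartctl features): convert "select,N+SIZE"
# into the equivalent "select,N-M" form.
select_span_end() {
    # SIZE LBAs starting at N cover N .. N+SIZE-1 inclusive.
    echo $(( $1 + $2 - 1 ))
}

select_arg_range() {
    echo "select,$1-$(select_span_end "$1" "$2")"
}

# 11 LBAs starting at LBA 10 end at LBA 20.
select_arg_range 10 11     # prints: select,10-20
# A real run would then look like: smartctl -t select,10-20 /dev/sda
```

Only the arithmetic is exercised here; no test is started on any disk.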
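3.2 Common examples
3.2.1 Check the overall health status
Check the current overall health of /dev/sda. PASSED means the disk is healthy; anything else means the disk has already failed or will fail soon.
smartctl -H /dev/sda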
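For scripted checks, the PASSED verdict can be picked out of the command output. This is a minimal sketch that assumes the ATA-style result line "SMART overall-health self-assessment test result: PASSED"; other device types may word the line differently. health_verdict is a hypothetical helper name, and the input lines are illustrative rather than captured from a real disk.

```shell
# Hypothetical helper: reads `smartctl -H` output on stdin and prints
# "healthy" only when the result line reports PASSED.
health_verdict() {
    if grep -q 'test result: PASSED'; then
        echo "healthy"
    else
        echo "failing-or-unknown"
    fi
}

# Illustrative input lines (assumed ATA wording).
echo 'SMART overall-health self-assessment test result: PASSED' | health_verdict
echo 'SMART overall-health self-assessment test result: FAILED!' | health_verdict
# On a live system: smartctl -H /dev/sda | health_verdict
```

Treating everything other than PASSED as "failing-or-unknown" matches the rule stated above: only an explicit PASSED counts as healthy.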
3.2.2 Show all information
Print all SMART information for /dev/sda.
smartctl -a /dev/sda
For an ATA disk this is equivalent to running, in order:
smartctl -H /dev/sda
smartctl -i /dev/sda
smartctl -c /dev/sda
smartctl -A /dev/sda
smartctl -l error /dev/sda
smartctl -l selftest /dev/sda
smartctl -l selective /dev/sda
3.2.3 Enable/disable SMART
Enable or disable SMART on /dev/sda.
smartctl -s on /dev/sda
smartctl -s off /dev/sda
To see whether SMART is currently enabled, use the -i option.
smartctl -i /dev/sda
3.2.4 Offline test
Run an offline test on /dev/sda. Its result is mainly used to update the SMART attributes.
smartctl -t offline /dev/sda
3.2.5 Short test
Run a short test on /dev/sda.
smartctl -t short /dev/sda
3.2.5.1 Watch the test progress
The -c option shows how far the test has progressed:
# smartctl -c /dev/sda
Self-test execution status:      ( 242) Self-test routine in progress...
                                        20% of test remaining.
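When polling a running test from a script, the remaining percentage can be extracted from this output. The sketch below assumes the "NN% of test remaining." wording shown above; remaining_percent is a hypothetical helper name.

```shell
# Hypothetical helper: reads `smartctl -c` output on stdin and prints the
# number in front of "% of test remaining", if such a line is present.
remaining_percent() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining.*/\1/p'
}

# Sample text copied from the output shown above.
sample='Self-test execution status:      ( 242) Self-test routine in progress...
                                        20% of test remaining.'
printf '%s\n' "$sample" | remaining_percent    # prints: 20
# On a live system: smartctl -c /dev/sda | remaining_percent
```

The helper prints nothing when no test is in progress, so an empty result can be used as the signal to stop polling.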
3.2.5.2 Check the test results
The -l selftest option shows the recorded test results for /dev/sda:
In entry "# 1", "Completed without error" means the test finished with no errors.
In entry "# 2", "Aborted by host" means the user aborted the test with 90% still remaining.
# smartctl -l selftest /dev/sda
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9535         -
# 2  Extended offline    Aborted by host               90%      9534         -
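The self-test log can be filtered so that only entries needing attention remain. The sketch below assumes log entries start with "# <number>" as in the listing above; incomplete_selftests is a hypothetical helper name.

```shell
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and
# prints every log entry whose status is not "Completed without error".
incomplete_selftests() {
    # Keep only entry lines ("# 1 ...", "# 2 ...") and drop the clean ones.
    grep -E '^# *[0-9]+' | grep -v 'Completed without error'
}

# Sample rows taken from the listing above.
sample='Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9535         -
# 2  Extended offline    Aborted by host               90%      9534         -'
printf '%s\n' "$sample" | incomplete_selftests
# On a live system: smartctl -l selftest /dev/sda | incomplete_selftests
```

With the sample rows, only the aborted extended test is printed. An aborted test is not a disk fault in itself, but it does mean the test has to be run again.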
3.2.6 Show the SMART attribute values
The -A option shows the SMART attribute values of /dev/sda.
smartctl -A /dev/sda
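3.4 SMART attributes
Running smartctl -A /dev/sda shows many SMART attributes of the disk, which indicate whether the disk is healthy.
The list below explains each attribute. Each entry gives the attribute name followed by its description: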
Read Error Rate
(Vendor specific raw value.) Stores data related to the
rate of hardware read errors that occurred when reading data from a disk
surface. The raw value has different structure for different vendors and is
often not meaningful as a decimal number.
Throughput Performance
Overall (general) throughput performance of a hard disk
drive. If the value of this attribute is decreasing there is a high
probability that there is a problem with the disk.
Spin-Up Time
Average time of spindle spin up (from zero RPM to fully
operational [millisecs]).
Start/Stop Count
A tally of spindle start/stop cycles. The spindle turns
on, and hence the count is increased, both when the hard disk is turned on
after having before been turned entirely off (disconnected from power source)
and when the hard disk returns from having previously been put to sleep mode.
Reallocated Sectors Count
Count of reallocated sectors. When the hard drive finds a
read/write/verification error, it marks that sector as
“reallocated” and transfers data to a special reserved area (spare
area). This process is also known as remapping, and reallocated sectors are
called “remaps”. The raw value normally represents a count of the
bad sectors that have been found and remapped. Thus, the higher the attribute
value, the more sectors the drive has had to reallocate. This allows a drive
with bad sectors to continue operation; however, a drive which has had any
reallocations at all is significantly more likely to fail in the near future.
While primarily used as a metric of the life expectancy of the drive, this
number also affects performance. As the count of reallocated sectors
increases, the read/write speed tends to become worse because the drive head
is forced to seek to the reserved area whenever a remap is accessed. A
workaround which will preserve drive speed at the expense of capacity is to
create a disk partition over the region which contains remaps and instruct
the operating system to not use that partition.
Read Channel Margin
Margin of a channel while reading data. The function of
this attribute is not specified.
Seek Error Rate
(Vendor specific raw value.) Rate of seek errors of the
magnetic heads. If there is a partial failure in the mechanical positioning
system, then seek errors will arise. Such a failure may be due to numerous
factors, such as damage to a servo, or thermal widening of the hard disk. The
raw value has different structure for different vendors and is often not
meaningful as a decimal number.
Seek Time Performance
Average performance of seek operations of the magnetic
heads. If this attribute is decreasing, it is a sign of problems in the
mechanical subsystem.
Power-On Hours
Count of hours in power-on state. The raw value of this
attribute shows total count of hours (or minutes, or seconds, depending on
manufacturer) in power-on state.
Spin Retry Count
Count of retry of spin start attempts. This attribute
stores a total count of the spin start attempts to reach the fully
operational speed (under the condition that the first attempt was
unsuccessful). An increase of this attribute value is a sign of problems in
the hard disk mechanical subsystem.
Recalibration Retries or Calibration Retry Count
This attribute indicates the count that recalibration was
requested (under the condition that the first attempt was unsuccessful). An
increase of this attribute value is a sign of problems in the hard disk
mechanical subsystem.
Power Cycle Count
This attribute indicates the count of full hard disk power
on/off cycles.
Soft Read Error Rate
Uncorrected read errors reported to the operating system.
Unused Reserved Block Count Total
“Pre-Fail” Attribute used at least in HP
SATA Downshift Error Count
Western Digital and Samsung attribute.
End-to-End error / IOEDC
This attribute is a part of Hewlett-Packard's
SMART IV technology, as well as part of other vendors’ IO Error Detection and
Correction schemas, and it contains a count of parity errors which occur in
the data path to the media via the drive’s cache RAM.
Head Stability
Western Digital attribute.
Induced Op-Vibration Detection
Western Digital attribute.
Reported Uncorrectable Errors
The count of errors that could not be recovered using
hardware ECC.
Command Timeout
The count of aborted operations due to HDD timeout.
Normally this attribute value should be equal to zero and if the value is far
above zero, then most likely there will be some serious problems with power
supply or an oxidized data cable.
High Fly Writes
HDD producers implement a Fly Height Monitor that attempts to provide additional
protections for write operations by detecting when a recording head is flying
outside its normal operating range. If an unsafe fly height condition is
encountered, the write process is stopped, and the information is rewritten
or reallocated to a safe region of the hard drive. This attribute indicates
the count of these errors detected over the lifetime of the drive.
This feature is implemented in most modern Seagate drives and some of Western Digital’s drives, beginning with the WD Enterprise
WDE18300 and WDE9180 Ultra2 SCSI hard drives, and will be included on all
future WD Enterprise products.
Airflow Temperature (WDC) resp. Airflow Temperature Celsius (HP)
Airflow temperature on Western Digital HDs (Same as temp.
[C2], but current value is 50 less for some models. Marked as obsolete.)
G-sense Error Rate
The count of errors resulting from externally-induced
shock & vibration.
Power-off Retract Count or Emergency Retract Cycle Count (Fujitsu)
Count of times the heads are loaded off the media. Heads
can be unloaded without actually powering off.
Load Cycle Count or Load/Unload Cycle Count (Fujitsu)
Count of load/unload cycles into head landing zone position.
The typical lifetime rating for laptop (2.5-in) hard
drives is 300,000 to 600,000 load cycles. Some
laptop drives are programmed to unload the heads whenever there has not been
any activity for about five seconds. Many Linux installations write to the
file system a few times a minute in the background. As a result, there may be 100 or
more load cycles per hour, and the load cycle rating may be exceeded in less
than a year.
Temperature resp. Temperature Celsius
Current internal temperature.
Hardware ECC Recovered
(Vendor specific raw value.) The raw value has different
structure for different vendors and is often not meaningful as a decimal
number.
Reallocation Event Count
Count of remap operations. The raw value of this attribute
shows the total count of attempts to transfer data from reallocated sectors
to a spare area. Both successful & unsuccessful attempts are counted.
Current Pending Sector Count
Count of “unstable” sectors (waiting to be
remapped, because of read errors). If an unstable sector is subsequently read
successfully, this value is decreased and the sector is not remapped. Read
errors on a sector will not remap the sector (since it might be readable
later); instead, the drive firmware remembers that the sector needs to be
remapped, and remaps it the next time it’s written.
Uncorrectable Sector Count or Offline Uncorrectable or Off-Line Scan Uncorrectable Sector Count
The total count of uncorrectable errors when
reading/writing a sector. A rise in the value of this attribute indicates
defects of the disk surface and/or problems in the mechanical subsystem.
UltraDMA CRC Error Count
The count of errors in data transfer via the interface
cable as determined by ICRC (Interface Cyclic Redundancy Check).
Multi-Zone Error Rate
The count of errors found when writing a sector. The
higher the value, the worse the disk’s mechanical condition is.
Write Error Rate (Fujitsu)
The total count of errors when writing a sector.
Soft Read Error Rate or TA Counter Detected
Count of off-track errors.
Data Address Mark errors or TA Counter Increased
Count of Data Address Mark errors (or vendor-specific).
Run Out Cancel
Count of ECC errors
Soft ECC Correction
Count of errors corrected by software ECC
Thermal Asperity Rate (TAR)
Count of errors due to high temperature.
Flying Height
Height of heads above the disk surface. A flying height
that’s too low increases the chances of a head crash while a flying height
that’s too high increases the chances of a read/write error.
Spin High Current
Amount of surge current used to spin up the drive.
Spin Buzz
Count of buzz routines needed to spin up the drive due to
insufficient power.
Offline Seek Performance
Drive’s seek performance during its internal tests.
(found in a Maxtor 6B200M0 200GB and Maxtor 2R015H1 15GB
Vibration During Write
Vibration During Write
Shock During Write
Shock During Write
Disk Shift
Distance the disk has shifted relative to the spindle
(usually due to shock or temperature). Unit of measure is unknown.
Loaded Hours
Time spent operating under data load (movement of magnetic
head armature)
Load/Unload Retry Count
Count of times head changes position.
Load Friction
Resistance caused by friction in mechanical parts while
operating.
Load/Unload Cycle Count
Total count of load cycles
Load ‘In’-time
Total time of loading on the magnetic heads actuator (time
not spent in parking area).
Torque Amplification Count
Count of attempts to compensate for platter speed
variations
Power-Off Retract Cycle
The count of times the magnetic armature was retracted
automatically as a result of cutting power.
GMR Head Amplitude
Amplitude of “thrashing” (distance of repetitive
forward/reverse head motion)
Temperature
Drive Temperature
Endurance Remaining
Number of physical erase cycles completed on the drive as
a percentage of the maximum physical erase cycles the drive is designed to
Available Reserved Space
Intel SSD reports the number of available reserved space
as a percentage of reserved space in a brand new SSD.
Power-On Hours
Number of hours elapsed in the power-on state.
Media Wearout Indicator
Intel SSD reports a normalized value of 100 (when the SSD
is new) and declines to a minimum value of 1. It decreases while the NAND
erase cycles increase from 0 to the maximum-rated cycles.
Head Flying Hours
Time while head is positioning
Transfer Error Rate (Fujitsu)
Count of times the link is reset during a data transfer.
Total LBAs Written
Total count of LBAs written
Total LBAs Read
Total count of LBAs read.
Some S.M.A.R.T. utilities will report a negative
number for the raw value since in reality it has 48 bits rather than 32.
Read Error Retry Rate
Count of errors while reading from a disk
Free Fall Protection
Count of “Free Fall Events” detected.
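The load-cycle figures in the Load Cycle Count entry can be checked with simple arithmetic. The sketch below assumes a constant 100 load cycles per hour and a disk that runs continuously; hours_to_rating is a hypothetical helper name.

```shell
# Hypothetical helper: hours until a load-cycle rating is reached at a
# constant number of load cycles per hour (integer division).
hours_to_rating() {
    echo $(( $1 / $2 ))
}

# Figures from the Load Cycle Count entry: a rating of 300,000 to 600,000
# cycles and about 100 load cycles per hour.
for rating in 300000 600000; do
    hours=$(hours_to_rating "$rating" 100)
    echo "$rating cycles: $hours hours, about $(( hours / 24 )) days of continuous use"
done
```

At that rate the lower rating is reached after 3000 hours (about 125 days) and the upper one after 6000 hours (about 250 days), which is consistent with the "less than a year" estimate.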
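3.5 SMART self-test
smartctl -t offline/short/long starts a disk self-test.
This is the default self-test.
The short self-test is meant to quickly determine whether the disk is faulty.
The test consists of many items, all defined by the disk vendor, for example:
a) Electrical tests, which check the disk's internal circuitry. The details are specified by the disk vendor, for example:
   A) Buffer (cache) test.
   B) Read/write circuitry test.
   C) Read/write head test.
b) Seek and servo tests, which check the disk's ability to seek to and track data tracks.
c) Read/verify tests, which check the disk's ability to read part or all of the disk.
The long test is also called the extended self-test. It runs the same items as the short test but takes much longer.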
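FLAG is the attribute flag. The normalized value (VALUE) should stay above the threshold (THRESH); an attribute has failed when VALUE is less than or equal to THRESH. WHEN_FAILED records failure information: an empty WHEN_FAILED column (shown as "-") means the disk has not failed. If WHEN_FAILED shows a value, the disk may have a serious number of bad sectors.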
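read error rate: cumulative count of data read errors; a non-zero value means the disk has, or may soon have, bad sectors;
throughput performance: average throughput performance (usually populated only after a manual offline S.M.A.R.T. test has been run);
spin-up time: time for the spindle motor to reach the required speed (milliseconds/seconds);
start/stop count: number of motor start/stop cycles (can be read as the number of power-ons and power-offs; resuming from sleep also adds one. A brand-new disk should be below 10);
reallocated sectors count: some sectors are set aside as reserved when the disk is manufactured. When an ordinary sector has a read/write/verify error, it is remapped to a reserved sector, the faulty sector is retired, and the count increases. As the count grows, I/O performance drops sharply. A non-zero value means the disk's health needs close watching; a steadily rising value means the disk is already damaged; once the reallocated sectors exceed the reserved sectors, the disk can no longer be repaired;
seek error rate: the count increases by one for each head positioning error. A steadily rising value may mean the mechanical parts are about to fail;
seek time performance: time needed for seeks. The shorter it is, the faster data is read; if it increases, the mechanical parts may be about to fail;
power-on time: cumulative time the disk has been powered on (unit: days/hours/minutes/seconds. Sleep/suspend time is possibly not counted. A newly bought disk should be below 100 hours);
spin-up retry count: cumulative number of failed attempts to spin the motor up to the specified speed. Failures may indicate a fault in the drive mechanism;
power cycle count: increases by one on every power-on; a new disk should be below 10;
g-sensor error rate: count of abnormal acceleration events (for example a drop or a throw). The heads return to the landing zone immediately and the count increases by one;
power-off retract count: number of abnormal power losses, that is, times the heads had not fully returned to the landing zone before power was cut. Each abnormal power loss adds one;
load/unload cycle count: number of times the heads return to the landing zone during operation. (P.S.: rumor has it that a certain Linux distribution, which shall remain unnamed, keeps forcing the heads to park while running on battery. Because the maximum is about 600k load cycles, some conclude that Linux damages disks. That is not actually the case);
temperature: the disk temperature, which in theory is only a few degrees above the ambient temperature. (sudo hddtemp /dev/sda)
reallocation event count: number of remap operations for the reallocated sectors described above; both successful and failed attempts are counted. Successes suggest the disk may still be salvageable; failures suggest it may be close to scrap;
current pending sector count: number of abnormal sectors waiting to be remapped. If such a sector is later read or written successfully, the count decreases and the sector is not remapped. A read error does not trigger remapping; only a write error does;
uncorrectable sector count: count of all read/write errors. A non-zero value proves there are bad sectors and the disk should be scrapped;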
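The VALUE, THRESH, and WHEN_FAILED rule can be applied mechanically to the attribute table. The sketch below assumes the usual smartctl -A column order (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); flag_attributes is a hypothetical helper name, and the sample rows are made up for illustration rather than taken from a real disk.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# attributes that have a WHEN_FAILED entry or whose VALUE is at or below THRESH.
flag_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value = $4 + 0; thresh = $6 + 0
        if ($9 != "-" || value <= thresh)
            print $2, "VALUE=" $4, "THRESH=" $6, "WHEN_FAILED=" $9
    }'
}

# Made-up sample rows in the assumed column layout (not real measurements).
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   005   005   036    Pre-fail  Always   FAILING_NOW 3912
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9535
194 Temperature_Celsius     0x0022   035   045   000    Old_age   Always       -       35'
printf '%s\n' "$sample" | flag_attributes
# On a live system: smartctl -A /dev/sda | flag_attributes
```

With the sample rows only Reallocated_Sector_Ct is reported, because its VALUE (005) is below THRESH (036) and WHEN_FAILED is set.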
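Explanation of the additional Attributes reported by SSDs:
We mainly watch the following four:
1. Media_Wearout_Indicator: wear level; 100 means no wear. It reflects how much of the NAND erase-cycle budget has been used. It starts at 100 and decreases linearly as erase cycles accumulate, in proportion to the erase count between 0 and the rated maximum. Once it reaches 1 it stops decreasing, which means some NAND on the SSD has reached its maximum erase count. At that point, back up the data and replace the SSD.
The machine above reads 099, so out of 100 points only 1 has been used so far.
2. Reallocated_Sector_Ct: number of bad blocks that have appeared since the drive left the factory. The initial value is 100; when bad blocks appear, it increases starting from 1, by 1 for every 4 bad blocks.
The offer machines here have no bad blocks yet.
3. Host_Writes_32MiB: amount written, in units of 32 MiB. The raw value increases by 1 for every 65536 sectors written. A sector here is still a counting unit of 512 bytes.
For example, this disk has written 1284966 * 65536 * 512 bytes.
Note that every machine has one disk with far fewer writes; that disk is the hot spare.
Each machine has 7 SSDs: 6 are in a RAID 5 array and the 7th is the hot spare.
4. Available_Reservd_Space: remaining reserved space on the SSD. The initial value is 100, meaning 100%. The threshold is 10; a value of 10 means the reserved space cannot shrink any further.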
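The Host_Writes_32MiB conversion in item 3 can be worked through as plain integer arithmetic. The sketch below uses the raw value 1284966 quoted in the text and assumes a shell with 64-bit arithmetic; host_writes_bytes is a hypothetical helper name.

```shell
# Hypothetical helper: converts a Host_Writes_32MiB raw value to bytes
# (65536 sectors of 512 bytes per unit, which is 32 MiB).
host_writes_bytes() {
    echo $(( $1 * 65536 * 512 ))
}

raw=1284966                                   # raw value quoted in the text
bytes=$(host_writes_bytes "$raw")
echo "$bytes bytes"                           # prints: 43116304269312 bytes
echo "$(( bytes / 1024 / 1024 / 1024 )) GiB"  # prints: 40155 GiB (integer division)
```

That is roughly 39 TiB written by the host to this one disk.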