Based on kernel version 4.16.1. Page generated on 2018-04-09 11:52 EST.

1	.. include:: <isonum.txt>
2	
3	============================================
4	Reliability, Availability and Serviceability
5	============================================
6	
7	RAS concepts
8	************
9	
10	Reliability, Availability and Serviceability (RAS) is a concept used on
11	servers meant to measure their robustness.
12	
13	Reliability
14	  is the probability that a system will produce correct outputs.
15	
16	  * Generally measured as Mean Time Between Failures (MTBF)
17	  * Enhanced by features that help to avoid, detect and repair hardware faults
18	
19	Availability
20	  is the probability that a system is operational at a given time
21	
22	  * Generally measured as a percentage of downtime over a period of time
23	  * Often uses mechanisms to detect and correct hardware faults at
24	    runtime
25	
26	Serviceability (or maintainability)
27	  is the simplicity and speed with which a system can be repaired or
28	  maintained
29	
30	  * Generally measured as Mean Time Between Repair (MTBR)
31	
32	Improving RAS
33	-------------
34	
35	To reduce system downtime, a system should be capable of detecting
36	hardware errors and, when possible, correcting them at runtime. It should
37	also provide mechanisms to detect hardware degradation, so that
38	the system administrator can be warned to replace a component before
39	it causes data loss or system downtime.
40	
41	The most common monitoring measures include:
42	
43	* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
44	* Memory – add error correction logic (ECC) to detect and correct errors;
45	* I/O – add CRC checksums for transferred data;
46	* Storage – RAID, journal file systems, checksums,
47	  Self-Monitoring, Analysis and Reporting Technology (SMART).
48	
49	By monitoring the number of detected errors, it is possible
50	to identify whether the probability of hardware errors is increasing and, in
51	that case, to do preventive maintenance and replace a degraded component while
52	those errors are still correctable.
53	
54	Types of errors
55	---------------
56	
57	Most mechanisms used on modern systems use technologies like Hamming
58	Codes that allow error correction when the number of errors in a bit packet
59	is below a threshold. If the number of errors is above that threshold, those
60	mechanisms can indicate with a high degree of confidence that an error
61	happened, but they can't correct it.
62	
63	Also, sometimes an error occurs on a component that is not in use. For
64	example, on a part of the memory that is not currently allocated.
65	
66	That defines some categories of errors:
67	
68	* **Correctable Error (CE)** - the error detection mechanism detected and
69	  corrected the error. Such errors are usually not fatal, although some
70	  Kernel mechanisms allow the system administrator to consider them as fatal.
71	
72	* **Uncorrected Error (UE)** - the number of errors exceeded the error
73	  correction threshold, and the system was unable to auto-correct.
74	
75	* **Fatal Error** - when a UE happens on a critical component of the
76	  system (for example, a piece of the Kernel got corrupted by a UE), the
77	  only reliable way to avoid data corruption is to hang or reboot the machine.
78	
79	* **Non-fatal Error** - when a UE happens on an unused component,
80	  like a CPU in power down state or an unused memory bank, the system may
81	  still run, eventually replacing the affected hardware by a hot spare,
82	  if available.
83	
84	  Also, when an error happens on a userspace process, it is possible to
85	  kill such process and let userspace restart it.
86	
87	The mechanism for handling non-fatal errors is usually complex and may
88	require the help of some userspace application, in order to apply the
89	policy desired by the system administrator.
90	
91	Identifying a bad hardware component
92	------------------------------------
93	
94	Just detecting a hardware flaw is usually not enough, as the system needs
95	to pinpoint the minimal replaceable unit (MRU) that should be exchanged
96	to make the hardware reliable again.
97	
98	So, it requires not only error logging facilities, but also mechanisms that
99	will translate the error message to the silkscreen or component label for
100	the MRU.
101	
102	Typically, this is very complex for memory, as modern CPUs interleave memory
103	from different memory modules in order to provide better performance. The
104	DMI BIOS usually has a list of memory module labels, which can be obtained
105	using the ``dmidecode`` tool. For example, on a desktop machine, it shows::
106	
107		Memory Device
108			Total Width: 64 bits
109			Data Width: 64 bits
110			Size: 16384 MB
111			Form Factor: SODIMM
112			Set: None
113			Locator: ChannelA-DIMM0
114			Bank Locator: BANK 0
115			Type: DDR4
116			Type Detail: Synchronous
117			Speed: 2133 MHz
118			Rank: 2
119			Configured Clock Speed: 2133 MHz
120	
121	In the above example, a DDR4 SO-DIMM memory module is located at the
122	system's memory slot labeled as "BANK 0", as given by the *bank locator* field.
123	Please notice that, on such system, the *total width* is equal to the
124	*data width*. It means that such memory module doesn't have error
125	detection/correction mechanisms.
126	
127	Unfortunately, not all systems use the same field to specify the memory
128	bank. In this example, from an older server, ``dmidecode`` shows::
129	
130		Memory Device
131			Array Handle: 0x1000
132			Error Information Handle: Not Provided
133			Total Width: 72 bits
134			Data Width: 64 bits
135			Size: 8192 MB
136			Form Factor: DIMM
137			Set: 1
138			Locator: DIMM_A1
139			Bank Locator: Not Specified
140			Type: DDR3
141			Type Detail: Synchronous Registered (Buffered)
142			Speed: 1600 MHz
143			Rank: 2
144			Configured Clock Speed: 1600 MHz
145	
146	There, the DDR3 RDIMM memory module is located at the system's memory labeled
147	as "DIMM_A1", as given by the *locator* field. Please notice that this
148	memory module has 64 bits of *data width* and 72 bits of *total width*. So,
149	it has 8 extra bits to be used by error detection and correction mechanisms.
150	Such kind of memory is called Error-correcting code memory (ECC memory).
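The width comparison described above can be automated. The following is a minimal sketch, assuming the text layout printed by ``dmidecode`` in the two examples above; the function name ``ecc_bits_per_device`` and the embedded sample text are illustrative only and not part of any existing tool:

```python
import re

def ecc_bits_per_device(dmidecode_text):
    """Map each memory device's Locator to (total width - data width).

    Assumes one "Memory Device" block per module, with "Total Width",
    "Data Width" and "Locator" fields formatted as in the examples above.
    """
    result = {}
    # Split on the block header; the first chunk precedes any device.
    for block in dmidecode_text.split("Memory Device")[1:]:
        pairs = re.findall(r"^\s*([^:\n]+):[ \t]*(.*)$", block, re.M)
        fields = {key.strip(): value.strip() for key, value in pairs}
        try:
            total = int(fields["Total Width"].split()[0])
            data = int(fields["Data Width"].split()[0])
        except (KeyError, ValueError, IndexError):
            continue  # width unknown or field missing: skip this device
        result[fields.get("Locator", "unknown")] = total - data
    return result

# Hypothetical sample, trimmed from the two listings above.
sample = """
Memory Device
    Total Width: 72 bits
    Data Width: 64 bits
    Locator: DIMM_A1
    Bank Locator: Not Specified
Memory Device
    Total Width: 64 bits
    Data Width: 64 bits
    Locator: ChannelA-DIMM0
    Bank Locator: BANK 0
"""

for locator, extra in ecc_bits_per_device(sample).items():
    print(locator, "ECC" if extra > 0 else "non-ECC", extra)
```

A non-zero difference only indicates that the module carries extra bits; whether ECC is actually enabled depends on the memory controller and BIOS configuration.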
151	
152	To make things even worse, it is not uncommon for systems with different
153	labels on their system boards to use exactly the same BIOS, meaning that
154	the labels provided by the BIOS won't match the real ones.
155	
156	ECC memory
157	----------
158	
159	As mentioned in the previous section, ECC memory has extra bits to be
160	used for error correction. So, on 64 bit systems, a memory module
161	has 64 bits of *data width*, and 72 bits of *total width*. So, there are
162	8 extra bits to be used for the error detection and correction
163	mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_.
164	
165	So, when the CPU requests the memory controller to write a word with
166	*data width*, the memory controller calculates the *syndrome* in real time,
167	using Hamming code, or some other error correction code, like SECDED+,
168	producing a code with *total width* size. Such code is then written
169	to the memory modules.
170	
171	At read, the *total width* bits code is converted back, using the same
172	ECC code used on write, producing a word with *data width* and a *syndrome*.
173	The word with *data width* is sent to the CPU, even when errors happen.
174	
175	The memory controller also looks at the *syndrome* in order to check if
176	there was an error, and if the ECC code was able to fix such error.
177	If the error was corrected, a Corrected Error (CE) happened. If not, an
178	Uncorrected Error (UE) happened.
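The write/read cycle above can be illustrated with a scaled-down code. The sketch below uses an extended Hamming code with 4 data bits and 4 check bits (single error correction, double error detection) instead of the 64 + 8 bit layout of a real memory module; it is an illustration of the principle only, not the algorithm of any specific memory controller, and the names ``encode`` and ``decode`` are hypothetical:

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) word."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Positions 1..7 form a Hamming(7,4) code; check bits sit at 1, 2 and 4.
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1s even
    return code

def decode(code):
    """Return (nibble, status), where status is "OK", "CE" or "UE"."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions holding a 1
    overall = sum(code) % 2             # 0 when the total parity is even
    if syndrome == 0 and overall == 0:
        status = "OK"
    elif overall == 1:
        # Odd number of flips: assume a single flip and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even parity: two bits flipped, no repair.
        status = "UE"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status

word = encode(0b1011)
word[5] ^= 1                            # one flipped bit
print(decode(word))                     # (11, 'CE')
word[6] ^= 1                            # a second flipped bit
print(decode(word)[1])                  # UE
```

As in the description above, a data word is always returned, even for a UE; it is the status derived from the check bits that tells whether that word can be trusted.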
179	
180	The information about the CE/UE errors is stored in some special registers
181	at the memory controller and can be accessed by reading such registers,
182	either by the BIOS, by some special CPUs or by the Linux EDAC driver. On x86 64
183	bit CPUs, such errors can also be retrieved via the Machine Check
184	Architecture (MCA)\ [#f3]_.
185	
186	.. [#f1] Please notice that several memory controllers allow operation in a
187	  mode called "Lock-Step", where it groups two memory modules together,
188	  doing 128-bit reads/writes. That gives 16 bits for error correction, which
189	  significantly improves the error correction mechanism, at the expense
190	  that, when an error happens, there's no way to know which memory module is
191	  to blame. So, it has to blame both memory modules.
192	
193	.. [#f2] Some memory controllers also allow using memory in mirror mode.
194	  In such mode, the same data is written to two memory modules. At read,
195	  the system checks both memory modules, in order to check if both provide
196	  identical data. In such configuration, when an error happens, there's no
197	  way to know which memory module is to blame. So, it has to blame both
198	  memory modules (or 4 memory modules, if the system is also in Lock-step
199	  mode).
200	
201	.. [#f3] For more details about the Machine Check Architecture (MCA),
202	  please read Documentation/x86/x86_64/machinecheck at the Kernel tree.
203	
204	EDAC - Error Detection And Correction
205	*************************************
206	
207	.. note::
208	
209	   "bluesmoke" was the name for this device driver subsystem when it
210	   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
211	   That site is mostly archaic now and can be used only for historical
212	   purposes.
213	
214	   When the subsystem was pushed upstream for the first time, in
215	   Kernel 2.6.16, it was renamed to ``EDAC``.
216	
217	Purpose
218	-------
219	
220	The ``edac`` kernel module's goal is to detect and report hardware errors
221	that occur within the computer system running under Linux.
222	
223	Memory
224	------
225	
226	Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
227	primary errors being harvested. These types of errors are harvested by
228	the ``edac_mc`` device.
229	
230	Detecting CE events, then harvesting those events and reporting them,
231	**can** be, but is not necessarily, a predictor of future UE events. With
232	CE events only, the system can and will continue to operate as no data
233	has been damaged yet.
234	
235	However, preventive maintenance and proactive part replacement of memory
236	modules exhibiting CEs can reduce the likelihood of the dreaded UE events
237	and system panics.
238	
239	Other hardware elements
240	-----------------------
241	
242	A new feature for EDAC, the ``edac_device`` class of device, was added in
243	the 2.6.23 version of the kernel.
244	
245	This new device type allows for non-memory type of ECC hardware detectors
246	to have their states harvested and presented to userspace via the sysfs
247	interface.
248	
249	Some architectures have ECC detectors for L1, L2 and L3 caches,
250	along with DMA engines, fabric switches, main data path switches,
251	interconnections, and various other hardware data paths. If the hardware
252	reports it, then an ``edac_device`` device probably can be constructed to
253	harvest and present that to userspace.
254	
255	
256	PCI bus scanning
257	----------------
258	
259	In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
260	in order to determine if errors are occurring during data transfers.
261	
262	The presence of PCI Parity errors must be examined with a grain of salt.
263	There are several add-in adapters that do **not** follow the PCI specification
264	with regards to Parity generation and reporting. The specification says
265	the vendor should tie the parity status bits to 0 if they do not intend
266	to generate parity.  Some vendors do not do this, and thus the parity bit
267	can "float" giving false positives.
268	
269	There is a PCI device attribute located in sysfs that is checked by
270	the EDAC PCI scanning code. If that attribute is set, PCI parity/error
271	scanning is skipped for that device. The attribute is::
272	
273		broken_parity_status
274	
275	and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
276	PCI devices.
277	
278	
279	Versioning
280	----------
281	
282	EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
283	Controller (MC) driver modules. On a given system, the CORE is loaded
284	and one MC driver will be loaded. Both the CORE and the MC driver (or
285	``edac_device`` driver) have individual versions that reflect current
286	release level of their respective modules.
287	
288	Thus, to "report" on what version a system is running, one must report
289	both the CORE's and the MC driver's versions.
290	
291	
292	Loading
293	-------
294	
295	If ``edac`` was statically linked with the kernel then no loading
296	is necessary. If ``edac`` was built as modules then simply modprobe
297	the ``edac`` pieces that you need. You should be able to modprobe
298	hardware-specific modules and have the dependencies load the necessary
299	core modules.
300	
301	Example::
302	
303		$ modprobe amd76x_edac
304	
305	loads both the ``amd76x_edac.ko`` memory controller module and the
306	``edac_core.ko`` core module.
307	
308	
309	Sysfs interface
310	---------------
311	
312	EDAC presents a ``sysfs`` interface for control and reporting purposes. It
313	lives in the /sys/devices/system/edac directory.
314	
315	Within this directory there currently reside 2 components:
316	
317		======= ==============================
318		mc	memory controller(s) system
319		pci	PCI control and status system
320		======= ==============================
321	
322	
323	
324	Memory Controller (mc) Model
325	----------------------------
326	
327	Each ``mc`` device controls a set of memory modules [#f4]_. These modules
328	are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
329	There can be multiple csrows and multiple channels.
330	
331	.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
332	  used to refer to a memory module, although there are other memory
333	  packaging alternatives, like SO-DIMM, SIMM, etc. Along this document,
334	  and inside the EDAC system, the term "dimm" is used for all memory
335	  modules, even when they use a different kind of packaging.
336	
337	Memory controllers allow for several csrows, with 8 csrows being a
338	typical value. Yet, the actual number of csrows depends on the layout of
339	a given motherboard, memory controller and memory module characteristics.
340	
341	Dual channels allow for dual data length (e. g. 128 bits, on 64 bit systems)
342	data transfers to/from the CPU from/to memory. Some newer chipsets allow
343	for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
344	controllers. The following example will assume 2 channels:
345	
346		+------------+-----------------------+
347		| CS Rows    |       Channels        |
348		+------------+-----------+-----------+
349		|            |  ``ch0``  |  ``ch1``  |
350		+============+===========+===========+
351		| ``csrow0`` |  DIMM_A0  |  DIMM_B0  |
352		+------------+           |           |
353		| ``csrow1`` |           |           |
354		+------------+-----------+-----------+
355		| ``csrow2`` |  DIMM_A1  |  DIMM_B1  |
356		+------------+           |           |
357		| ``csrow3`` |           |           |
358		+------------+-----------+-----------+
359	
360	In the above example, there are 4 physical slots on the motherboard
361	for memory DIMMs:
362	
363		+---------+---------+
364		| DIMM_A0 | DIMM_B0 |
365		+---------+---------+
366		| DIMM_A1 | DIMM_B1 |
367		+---------+---------+
368	
369	Labels for these slots are usually silk-screened on the motherboard.
370	Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
371	channel 1. Notice that there are two csrows possible on a physical DIMM.
372	These csrows are allocated their csrow assignment based on the slot into
373	which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
374	Channel, the csrows cross both DIMMs.
375	
376	Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
377	Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
378	will have just one csrow (csrow0). csrow1 will be empty. On the other
379	hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0
380	and csrow1 will be populated. The pattern repeats itself for csrow2 and
381	csrow3.
382	
383	The representation of the above is reflected in the directory
384	tree in EDAC's sysfs interface. Starting in directory
385	``/sys/devices/system/edac/mc``, each memory controller will be
386	represented by its own ``mcX`` directory, where ``X`` is the
387	index of the MC::
388	
389		..../edac/mc/
390			   |
391			   |->mc0
392			   |->mc1
393			   |->mc2
394			   ....
395	
396	Under each ``mcX`` directory each ``csrowX`` is again represented by a
397	``csrowX``, where ``X`` is the csrow index::
398	
399		.../mc/mc0/
400			|
401			|->csrow0
402			|->csrow2
403			|->csrow3
404			....
405	
406	Notice that there is no csrow1, which indicates that csrow0 is composed
407	of single ranked DIMMs. This should also apply in both Channels, in
408	order to have dual-channel mode be operational. Since both csrow2 and
409	csrow3 are populated, this indicates a dual ranked set of DIMMs for
410	channels 0 and 1.
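The relation between DIMM ranks and populated csrows can be sketched as follows. This is a simplified model of the 2-channel example above, assuming two csrows per slot row and identical DIMMs in both channels; the function name ``populated_csrows`` is hypothetical and real memory controllers may map ranks differently:

```python
def populated_csrows(ranks_per_slot, csrows_per_slot=2):
    """List the csrow directories expected for each row of slots.

    ranks_per_slot[i] is the number of ranks (1 or 2) of the DIMMs
    placed in slot row i (for example DIMM_A0/DIMM_B0 for i == 0).
    A value of 0 means that the slot row is empty.
    """
    rows = []
    for slot, ranks in enumerate(ranks_per_slot):
        for rank in range(ranks):
            # Each populated rank becomes one csrow.
            rows.append("csrow%d" % (slot * csrows_per_slot + rank))
    return rows

# Single ranked DIMMs in DIMM_A0/DIMM_B0, dual ranked in DIMM_A1/DIMM_B1.
print(populated_csrows([1, 2]))  # ['csrow0', 'csrow2', 'csrow3']
```

The printed list matches the directory listing shown above, where ``csrow1`` is absent.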
411	
412	Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
413	control and attribute files.
414	
415	``mcX`` directories
416	-------------------
417	
418	In ``mcX`` directories are EDAC control and attribute files for
419	this ``X`` instance of the memory controllers.
420	
421	For a description of the sysfs API, please see:
422	
423		Documentation/ABI/testing/sysfs-devices-edac
424	
425	
426	``dimmX`` or ``rankX`` directories
427	----------------------------------
428	
429	The recommended way to use the EDAC subsystem is to look at the information
430	provided by the ``dimmX`` or ``rankX`` directories [#f5]_.
431	
432	A typical EDAC system has the following structure under
433	``/sys/devices/system/edac/``\ [#f6]_::
434	
435		/sys/devices/system/edac/
436		├── mc
437		│   ├── mc0
438		│   │   ├── ce_count
439		│   │   ├── ce_noinfo_count
440		│   │   ├── dimm0
441		│   │   │   ├── dimm_ce_count
442		│   │   │   ├── dimm_dev_type
443		│   │   │   ├── dimm_edac_mode
444		│   │   │   ├── dimm_label
445		│   │   │   ├── dimm_location
446		│   │   │   ├── dimm_mem_type
447		│   │   │   ├── dimm_ue_count
448		│   │   │   ├── size
449		│   │   │   └── uevent
450		│   │   ├── max_location
451		│   │   ├── mc_name
452		│   │   ├── reset_counters
453		│   │   ├── seconds_since_reset
454		│   │   ├── size_mb
455		│   │   ├── ue_count
456		│   │   ├── ue_noinfo_count
457		│   │   └── uevent
458		│   ├── mc1
459		│   │   ├── ce_count
460		│   │   ├── ce_noinfo_count
461		│   │   ├── dimm0
462		│   │   │   ├── dimm_ce_count
463		│   │   │   ├── dimm_dev_type
464		│   │   │   ├── dimm_edac_mode
465		│   │   │   ├── dimm_label
466		│   │   │   ├── dimm_location
467		│   │   │   ├── dimm_mem_type
468		│   │   │   ├── dimm_ue_count
469		│   │   │   ├── size
470		│   │   │   └── uevent
471		│   │   ├── max_location
472		│   │   ├── mc_name
473		│   │   ├── reset_counters
474		│   │   ├── seconds_since_reset
475		│   │   ├── size_mb
476		│   │   ├── ue_count
477		│   │   ├── ue_noinfo_count
478		│   │   └── uevent
479		│   └── uevent
480		└── uevent
481	
482	In the ``dimmX`` directories are EDAC control and attribute files for
483	this ``X`` memory module:
484	
485	- ``size`` - Total memory managed by this DIMM attribute file
486	
487		This attribute file displays, in count of megabytes, the memory
488		that this DIMM contains.
489	
490	- ``dimm_ue_count`` - Uncorrectable Errors count attribute file
491	
492		This attribute file displays the total count of uncorrectable
493		errors that have occurred on this DIMM. If panic_on_ue is set
494		this counter will not have a chance to increment, since EDAC
495		will panic the system.
496	
497	- ``dimm_ce_count`` - Correctable Errors count attribute file
498	
499		This attribute file displays the total count of correctable
500		errors that have occurred on this DIMM. This count is very
501		important to examine. CEs provide early indications that a
502		DIMM is beginning to fail. This count field should be
503		monitored for non-zero values, and such information reported
504		to the system administrator.
505	
506	- ``dimm_dev_type``  - Device type attribute file
507	
508		This attribute file will display what type of DRAM device is
509		being utilized on this DIMM.
510		Examples:
511	
512			- x1
513			- x2
514			- x4
515			- x8
516	
517	- ``dimm_edac_mode`` - EDAC Mode of operation attribute file
518	
519		This attribute file will display what type of Error detection
520		and correction is being utilized.
521	
522	- ``dimm_label`` - memory module label control file
523	
524		This control file allows this DIMM to have a label assigned
525		to it. With this label in the module, when errors occur
526		the output can provide the DIMM label in the system log.
527		This becomes vital for panic events to isolate the
528		cause of the UE event.
529	
530		DIMM Labels must be assigned after booting, with information
531		that correctly identifies the physical slot with its
532		silk screen label. This information is currently very
533		motherboard specific and determination of this information
534		must occur in userland at this time.
535	
536	- ``dimm_location`` - location of the memory module
537	
538		The location can have up to 3 levels, and describe how the
539		memory controller identifies the location of a memory module.
540		Depending on the type of memory and memory controller, it
541		can be:
542	
543			- *csrow* and *channel* - used when the memory controller
544			  doesn't identify a single DIMM - e. g. in ``rankX`` dir;
545			- *branch*, *channel*, *slot* - typically used on FB-DIMM memory
546			  controllers;
547			- *channel*, *slot* - used on Nehalem and newer Intel drivers.
548	
549	- ``dimm_mem_type`` - Memory Type attribute file
550	
551		This attribute file will display what type of memory is currently
552		on this DIMM. Normally, either buffered or unbuffered memory.
553		Examples:
554	
555			- Registered-DDR
556			- Unbuffered-DDR
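The attribute files listed above can be collected with a few lines of code. The following is a minimal sketch, assuming the directory layout shown in the tree above; the function name ``read_dimm_counters`` and its ``edac_root`` parameter (which allows pointing at a copy of the tree) are hypothetical, and files that are missing on a given system are reported as ``None``:

```python
import os

def _read(path):
    """Return the stripped content of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

def _read_int(path):
    text = _read(path)
    return int(text) if text and text.isdigit() else None

def read_dimm_counters(edac_root="/sys/devices/system/edac"):
    """Return {(mc, dimm): {"label": ..., "ce": ..., "ue": ...}}."""
    counters = {}
    mc_root = os.path.join(edac_root, "mc")
    if not os.path.isdir(mc_root):
        return counters  # EDAC not available or no MC driver loaded
    for mc in sorted(os.listdir(mc_root)):
        mc_dir = os.path.join(mc_root, mc)
        if not (mc.startswith("mc") and os.path.isdir(mc_dir)):
            continue
        for entry in sorted(os.listdir(mc_dir)):
            dimm_dir = os.path.join(mc_dir, entry)
            # Either dimmX or rankX, depending on the memory controller.
            if not entry.startswith(("dimm", "rank")):
                continue
            if not os.path.isdir(dimm_dir):
                continue
            counters[(mc, entry)] = {
                "label": _read(os.path.join(dimm_dir, "dimm_label")),
                "ce": _read_int(os.path.join(dimm_dir, "dimm_ce_count")),
                "ue": _read_int(os.path.join(dimm_dir, "dimm_ue_count")),
            }
    return counters

for (mc, dimm), info in read_dimm_counters().items():
    print(mc, dimm, info["label"], "CE:", info["ce"], "UE:", info["ue"])
```

A monitoring script could run this periodically and warn the system administrator when a ``ce`` value grows between two runs.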
557	
558	.. [#f5] On some systems, the memory controller doesn't have any logic to
559	  identify the memory module. There, the directory is called ``rankX`` and
560	  works in a similar way to the ``csrowX`` directories. Modern Intel memory
561	  controllers identify the modules directly, so it is called ``dimmX``.
562	
563	.. [#f6] There are also some ``power`` directories and ``subsystem``
564	  symlinks inside the sysfs mapping that are automatically created by
565	  the sysfs subsystem. Currently, they serve no purpose.
566	
567	``csrowX`` directories
568	----------------------
569	
570	When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
571	directories. As this API doesn't work properly for Rambus, FB-DIMMs and
572	modern Intel Memory Controllers, this is being deprecated in favor of
573	``dimmX`` directories.
574	
575	In the ``csrowX`` directories are EDAC control and attribute files for
576	this ``X`` instance of csrow:
577	
578	
579	- ``ue_count`` - Total Uncorrectable Errors count attribute file
580	
581		This attribute file displays the total count of uncorrectable
582		errors that have occurred on this csrow. If panic_on_ue is set
583		this counter will not have a chance to increment, since EDAC
584		will panic the system.
585	
586	
587	- ``ce_count`` - Total Correctable Errors count attribute file
588	
589		This attribute file displays the total count of correctable
590		errors that have occurred on this csrow. This count is very
591		important to examine. CEs provide early indications that a
592		DIMM is beginning to fail. This count field should be
593		monitored for non-zero values, and such information reported
594		to the system administrator.
595	
596	
597	- ``size_mb`` - Total memory managed by this csrow attribute file
598	
599		This attribute file displays, in count of megabytes, the memory
600		that this csrow contains.
601	
602	
603	- ``mem_type`` - Memory Type attribute file
604	
605		This attribute file will display what type of memory is currently
606		on this csrow. Normally, either buffered or unbuffered memory.
607		Examples:
608	
609			- Registered-DDR
610			- Unbuffered-DDR
611	
612	
613	- ``edac_mode`` - EDAC Mode of operation attribute file
614	
615		This attribute file will display what type of Error detection
616		and correction is being utilized.
617	
618	
619	- ``dev_type`` - Device type attribute file
620	
621		This attribute file will display what type of DRAM device is
622		being utilized on this DIMM.
623		Examples:
624	
625			- x1
626			- x2
627			- x4
628			- x8
629	
630	
631	- ``ch0_ce_count`` - Channel 0 CE Count attribute file
632	
633		This attribute file will display the count of CEs on this
634		DIMM located in channel 0.
635	
636	
637	- ``ch0_ue_count`` - Channel 0 UE Count attribute file
638	
639		This attribute file will display the count of UEs on this
640		DIMM located in channel 0.
641	
642	
643	- ``ch0_dimm_label`` - Channel 0 DIMM Label control file
644	
645	
646		This control file allows this DIMM to have a label assigned
647		to it. With this label in the module, when errors occur
648		the output can provide the DIMM label in the system log.
649		This becomes vital for panic events to isolate the
650		cause of the UE event.
651	
652		DIMM Labels must be assigned after booting, with information
653		that correctly identifies the physical slot with its
654		silk screen label. This information is currently very
655		motherboard specific and determination of this information
656		must occur in userland at this time.
657	
658	
659	- ``ch1_ce_count`` - Channel 1 CE Count attribute file
660	
661	
662		This attribute file will display the count of CEs on this
663		DIMM located in channel 1.
664	
665	
666	- ``ch1_ue_count`` - Channel 1 UE Count attribute file
667	
668	
669		This attribute file will display the count of UEs on this
670		DIMM located in channel 1.
671	
672	
673	- ``ch1_dimm_label`` - Channel 1 DIMM Label control file
674	
675		This control file allows this DIMM to have a label assigned
676		to it. With this label in the module, when errors occur
677		the output can provide the DIMM label in the system log.
678		This becomes vital for panic events to isolate the
679		cause of the UE event.
680	
681		DIMM Labels must be assigned after booting, with information
682		that correctly identifies the physical slot with its
683		silk screen label. This information is currently very
684		motherboard specific and determination of this information
685		must occur in userland at this time.
686	
687	
688	System Logging
689	--------------
690	
691	If logging for UEs and CEs is enabled, then system logs will contain
692	information indicating that errors have been detected::
693	
694	  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
695	  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac
696	
697	
698	The structure of the message is:
699	
700		+---------------------------------------+-------------+
701		| Content                               | Example     |
702		+=======================================+=============+
703		| The memory controller                 | MC0         |
704		+---------------------------------------+-------------+
705		| Error type                            | CE          |
706		+---------------------------------------+-------------+
707		| Memory page                           | 0x283       |
708		+---------------------------------------+-------------+
709		| Offset in the page                    | 0xce0       |
710		+---------------------------------------+-------------+
711		| The byte granularity                  | grain 8     |
712		| or resolution of the error            |             |
713		+---------------------------------------+-------------+
714		| The error syndrome                    | 0x6ec3      |
715		+---------------------------------------+-------------+
716		| Memory row                            | row 0       |
717		+---------------------------------------+-------------+
718		| Memory channel                        | channel 1   |
719		+---------------------------------------+-------------+
720		| DIMM label, if set prior              | DIMM_B1     |
721		+---------------------------------------+-------------+
722		| And then an optional, driver-specific |             |
723		| message that may have additional      |             |
724		| information.                          |             |
725		+---------------------------------------+-------------+
726	
727	Both UEs and CEs with no info will lack all but memory controller, error
728	type, a notice of "no info" and then an optional, driver-specific error
729	message.
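The message structure above lends itself to automated parsing. The following is a minimal sketch, assuming exactly the message layout of the two sample lines above (other kernel versions and drivers may format the message differently); the names ``EDAC_MC_LINE`` and ``parse_edac_mc_line`` are hypothetical:

```python
import re

# One named group per row of the table above.
EDAC_MC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<type>CE|UE) "
    r"page (?P<page>0x[0-9a-fA-F]+), offset (?P<offset>0x[0-9a-fA-F]+), "
    r"grain (?P<grain>\d+), syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
    r'(?: "(?P<label>[^"]*)")?(?:: (?P<driver_msg>.*))?$'
)

def parse_edac_mc_line(line):
    """Return a dict describing the error, or None if the line doesn't match."""
    match = EDAC_MC_LINE.search(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    for key in ("mc", "grain", "row", "channel"):
        record[key] = int(record[key])
    for key in ("page", "offset", "syndrome"):
        record[key] = int(record[key], 16)
    return record

line = ('EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, '
        'row 0, channel 1 "DIMM_B1": amd76x_edac')
print(parse_edac_mc_line(line))
```

Messages with no info don't carry the page, offset and the other location fields, so this sketch returns ``None`` for them.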
730	
731	
732	PCI Bus Parity Detection
733	------------------------
734	
735	On Header Type 00 devices, the primary status is looked at for any
736	parity error regardless of whether parity is enabled on the device or
737	not. (The spec indicates parity is generated in some cases). On Header
738	Type 01 bridges, the secondary status register is also looked at to see
739	if a parity error occurred on the bus on the other side of the bridge.
740	
741	
742	Sysfs configuration
743	-------------------
744	
745	Under ``/sys/devices/system/edac/pci`` are control and attribute files as
746	follows:
747	
748	
749	- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
750	
751		This control file enables or disables the PCI Bus Parity scanning
752		operation. Writing a 1 to this file enables the scanning. Writing
753		a 0 to this file disables the scanning.
754	
755		Enable::
756	
757			echo "1" >/sys/devices/system/edac/pci/check_pci_parity
758	
759		Disable::
760	
761			echo "0" >/sys/devices/system/edac/pci/check_pci_parity
762	
763	
764	- ``pci_parity_count`` - Parity Count
765	
766		This attribute file will display the number of parity errors that
767		have been detected.
768	
769	
770	Module parameters
771	-----------------
772	
773	- ``edac_mc_panic_on_ue`` - Panic on UE control file
774	
775		An uncorrectable error will cause a machine panic.  This is usually
776		desirable.  It is a bad idea to continue when an uncorrectable error
777		occurs - it is indeterminate what was uncorrected and the operating
778		system context might be so mangled that continuing will lead to further
779		corruption. If the kernel has MCE configured, then EDAC will never
780		notice the UE.
781	
782		LOAD TIME::
783	
784			module/kernel parameter: edac_mc_panic_on_ue=[0|1]
785	
786		RUN TIME::
787	
788			echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
789	
790	
791	- ``edac_mc_log_ue`` - Log UE control file
792	
793	
794		Generate kernel messages describing uncorrectable errors.  These errors
795		are reported through the system message log system.  UE statistics
796		will be accumulated even when UE logging is disabled.
797	
798		LOAD TIME::
799	
800			module/kernel parameter: edac_mc_log_ue=[0|1]
801	
802		RUN TIME::
803	
804			echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
805	
806	
807	- ``edac_mc_log_ce`` - Log CE control file
808	
809	
810		Generate kernel messages describing correctable errors.  These
811		errors are reported through the system message log system.
812		CE statistics will be accumulated even when CE logging is disabled.
813	
814		LOAD TIME::
815	
816			module/kernel parameter: edac_mc_log_ce=[0|1]
817	
818		RUN TIME::
819	
820			echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
821	
822	
823	- ``edac_mc_poll_msec`` - Polling period control file
824	
825	
	The time period, in milliseconds, for polling for error information.
	Too small a value wastes resources.  Too large a value might delay
	necessary handling of errors and might lose valuable information for
	locating the error.  1000 milliseconds (once each second) is the current
	default.  Systems that require all the bandwidth they can get may
	increase this.
832	
833		LOAD TIME::
834	
		module/kernel parameter: edac_mc_poll_msec=<milliseconds>
836	
837		RUN TIME::
838	
839			echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
840	
841	
842	- ``panic_on_pci_parity`` - Panic on PCI PARITY Error
843	
844	
845		This control file enables or disables panicking when a parity
846		error has been detected.
847	
848	
849		module/kernel parameter::
850	
851				edac_panic_on_pci_pe=[0|1]
852	
853		Enable::
854	
855			echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
856	
857		Disable::
858	
859			echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
860	
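To review the run-time state of all the parameters listed above in one go,
something along the following lines could be used. This is only a sketch:
the function names and the ``EDAC_PARAM_DIR`` override variable are
hypothetical, and the parameter list is simply the one documented in this
section.

```shell
#!/bin/sh
# Sketch: list the edac_core parameters documented above and adjust the
# polling period. EDAC_PARAM_DIR is a hypothetical override for
# illustration; the default is the sysfs path used in the text.
EDAC_PARAM_DIR="${EDAC_PARAM_DIR:-/sys/module/edac_core/parameters}"

# show_edac_params : print name=value for each documented parameter.
show_edac_params() {
    for p in edac_mc_panic_on_ue edac_mc_log_ue edac_mc_log_ce \
             edac_mc_poll_msec edac_panic_on_pci_pe; do
        if [ -r "$EDAC_PARAM_DIR/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$EDAC_PARAM_DIR/$p")"
        else
            printf '%s=(not present)\n' "$p"
        fi
    done
}

# set_edac_poll_msec MS : change the polling period at run time.
set_edac_poll_msec() {
    case "$1" in
        ''|*[!0-9]*)
            echo "polling period must be a number of milliseconds" >&2
            return 2 ;;
    esac
    if [ ! -w "$EDAC_PARAM_DIR/edac_mc_poll_msec" ]; then
        echo "edac_mc_poll_msec is not writable (edac_core loaded? root?)" >&2
        return 1
    fi
    echo "$1" > "$EDAC_PARAM_DIR/edac_mc_poll_msec"
}

show_edac_params
```

For example, ``set_edac_poll_msec 2000`` would halve the polling rate
relative to the 1000 ms default mentioned above.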
861	
862	
863	EDAC device type
864	----------------
865	
866	In the header file, edac_pci.h, there is a series of edac_device structures
867	and APIs for the EDAC_DEVICE.
868	
869	User space access to an edac_device is through the sysfs interface.
870	
871	At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
872	will appear.
873	
There is a three-level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
website) installs itself as::
877	
878		/sys/devices/system/edac/test-instance
879	
In this directory are various controls, a symlink and one or more ``instance``
directories.
882	
883	The standard default controls are:
884	
885		==============	=======================================================
886		log_ce		boolean to log CE events
887		log_ue		boolean to log UE events
888		panic_on_ue	boolean to ``panic`` the system if an UE is encountered
889				(default off, can be set true via startup script)
890		poll_msec	time period between POLL cycles for events
891		==============	=======================================================
892	
The test_device_edac device adds at least one custom control of its own:
894	
895		==============	==================================================
896		test_bits	which in the current test driver does nothing but
897				show how it is installed. A ported driver can
898				add one or more such controls and/or attributes
899				for specific uses.
900				One out-of-tree driver uses controls here to allow
901				for ERROR INJECTION operations to hardware
902				injection registers
903		==============	==================================================
904	
905	The symlink points to the 'struct dev' that is registered for this edac_device.
906	
907	Instances
908	---------
909	
910	One or more instance directories are present. For the ``test_device_edac``
911	case:
912	
913		+----------------+
914		| test-instance0 |
915		+----------------+
916	
917	
In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.
920	
921		==============	====================================
922		ce_count	total of CE events of subdirectories
923		ue_count	total of UE events of subdirectories
924		==============	====================================
925	
926	Blocks
927	------
928	
929	At the lowest directory level is the ``block`` directory. There can be 0, 1
930	or more blocks specified in each instance:
931	
932		+-------------+
933		| test-block0 |
934		+-------------+
935	
936	In this directory the default attributes are:
937	
	==============	================================================
	ce_count	counter of CE events for this ``block``
			of hardware being monitored
	ue_count	counter of UE events for this ``block``
			of hardware being monitored
	==============	================================================
944	
945	
946	The ``test_device_edac`` device adds 4 attributes and 1 control:
947	
948		================== ====================================================
949		test-block-bits-0	for every POLL cycle this counter
950					is incremented
951		test-block-bits-1	every 10 cycles, this counter is bumped once,
952					and test-block-bits-0 is set to 0
953		test-block-bits-2	every 100 cycles, this counter is bumped once,
954					and test-block-bits-1 is set to 0
955		test-block-bits-3	every 1000 cycles, this counter is bumped once,
956					and test-block-bits-2 is set to 0
957		================== ====================================================
958	
959	
	================== ====================================================
	reset-counters		writing anything to this control will
				reset all the above counters.
	================== ====================================================
964	
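The device/instance/block layout described above can be walked from a
script. The sketch below sums the per-block counters and prints them next to
the instance totals, which should agree if the instance counters really are
totals of the block counters. The function names and the ``EDAC_ROOT``
override variable are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch: compare instance-level totals with the sum of block counters.
# EDAC_ROOT is a hypothetical override for illustration; the default is
# the sysfs location given in the text.
EDAC_ROOT="${EDAC_ROOT:-/sys/devices/system/edac}"

# sum_block_counts INSTANCE_DIR ce_count|ue_count
# Adds up the named counter over every block directory of one instance.
sum_block_counts() {
    total=0
    for f in "$1"/*/"$2"; do
        [ -r "$f" ] || continue          # skip non-block entries
        n=$(cat "$f")
        total=$((total + n))
    done
    echo "$total"
}

# report_device DEVICE_NAME
# Prints one line per instance directory found under the device.
report_device() {
    dev="$EDAC_ROOT/$1"
    for inst in "$dev"/*/; do
        inst=${inst%/}
        [ -r "$inst/ce_count" ] || continue   # not an instance directory
        printf '%s: ce_count=%s (blocks: %s) ue_count=%s (blocks: %s)\n' \
            "${inst##*/}" \
            "$(cat "$inst/ce_count")" "$(sum_block_counts "$inst" ce_count)" \
            "$(cat "$inst/ue_count")" "$(sum_block_counts "$inst" ue_count)"
    done
}
```

With the test driver loaded, ``report_device test-instance`` would be
expected to print one line for ``test-instance0``.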
965	
Use of the ``test_device_edac`` driver should enable others to create their own
unique drivers for their hardware systems.
968	
969	The ``test_device_edac`` sample driver is located at the
970	http://bluesmoke.sourceforge.net project site for EDAC.
971	
972	
973	Usage of EDAC APIs on Nehalem and newer Intel CPUs
974	--------------------------------------------------
975	
976	On older Intel architectures, the memory controller was part of the North
977	Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Sky Lake and
978	newer Intel architectures integrated an enhanced version of the memory
979	controller (MC) inside the CPUs.
980	
This section covers the differences of the enhanced memory controllers
found on newer Intel CPUs, which are handled by drivers such as
``i7core_edac``, ``sb_edac`` and ``sbx_edac``.
984	
985	.. note::
986	
987	   The Xeon E7 processor families use a separate chip for the memory
988	   controller, called Intel Scalable Memory Buffer. This section doesn't
989	   apply for such families.
990	
1) There is one Memory Controller per Quick Path Interconnect
   (QPI). In the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees them as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API uses the csrow as its smallest unit, the driver
   sequentially maps each channel/DIMM pair into a different csrow.
1002	
1003	   For example, supposing the following layout::
1004	
1005		Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
1006		  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
1007		  dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
1008	        Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
1009		  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
1010		Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
1011		  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
1012	
1013	   The driver will map it as::
1014	
1015		csrow0: channel 0, dimm0
1016		csrow1: channel 0, dimm1
1017		csrow2: channel 1, dimm0
1018		csrow3: channel 2, dimm0
1019	
   That is, the driver exports one DIMM per csrow.
1021	
1022	   Each QPI is exported as a different memory controller.
1023	
1024	2) The MC has the ability to inject errors to test drivers. The drivers
1025	   implement this functionality via some error injection nodes:
1026	
1027	   For injecting a memory error, there are some sysfs nodes, under
1028	   ``/sys/devices/system/edac/mc/mc?/``:
1029	
1030	   - ``inject_addrmatch/*``:
1031	      Controls the error injection mask register. It is possible to specify
1032	      several characteristics of the address to match an error code::
1033	
1034	         dimm = the affected dimm. Numbers are relative to a channel;
1035	         rank = the memory rank;
1036	         channel = the channel that will generate an error;
1037	         bank = the affected bank;
1038	         page = the page address;
1039	         column (or col) = the address column.
1040	
      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to "any".
1044	
1045	      For example, to generate an error at rank 1 of dimm 2, for any channel,
1046	      any bank, any page, any column::
1047	
1048			echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
1049			echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
1050	
1051		To return to the default behaviour of matching any, you can do::
1052	
1053			echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
1054			echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
1055	
1056	   - ``inject_eccmask``:
          specifies which bits will be affected by the error.
1058	
1059	   - ``inject_section``:
1060	       specifies what ECC cache section will get the error::
1061	
1062			3 for both
1063			2 for the highest
1064			1 for the lowest
1065	
1066	   - ``inject_type``:
1067	       specifies the type of error, being a combination of the following bits::
1068	
1069			bit 0 - repeat
1070			bit 1 - ecc
1071			bit 2 - parity
1072	
   - ``inject_enable``:
       starts the error generation when a nonzero value is written.

   All injection variables can be read. Root permission is needed to write
   to them.
1077	
   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.
1081	
1082	   For example, the following code will generate an error for any write access
1083	   at socket 0, on any DIMM/address on channel 2::
1084	
1085		echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
1086		echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
1087		echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
1088		echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
1089		echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
1090		dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
1091	
   For socket 1, replace "mc0" with "mc1" in the above commands.
1094	
1095	   The generated error message will look like::
1096	
1097		EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
1098	
1099	3) Corrected Error memory register counters
1100	
   These newer MCs have some registers to count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMMs. As the chipset
   offers some counters that also work with UDIMMs (but with a worse level of
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.
1109	
1110	   They can be read by looking at the contents of ``all_channel_counts/``::
1111	
1112	     $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
1113		/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
1114		0
1115		/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
1116		0
1117		/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
1118		0
1119	
   Errors on different csrows but at the same DIMM number increment the
   same counter.
   So, in this memory mapping::
1123	
1124		csrow0: channel 0, dimm0
1125		csrow1: channel 0, dimm1
1126		csrow2: channel 1, dimm0
1127		csrow3: channel 2, dimm0
1128	
1129	   The hardware will increment udimm0 for an error at the first dimm at either
1130	   csrow0, csrow2  or csrow3;
1131	
1132	   The hardware will increment udimm1 for an error at the second dimm at either
1133	   csrow0, csrow2  or csrow3;
1134	
1135	   The hardware will increment udimm2 for an error at the third dimm at either
1136	   csrow0, csrow2  or csrow3;
1137	
1138	4) Standard error counters
1139	
   The standard error counters are updated when an mcelog error is received
   by the driver. Since, with UDIMMs, this is counted by software, it is
   possible that some errors could be lost. With RDIMMs, they display the
   contents of the registers.
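
The error injection sequence from item 2 above can be collected into a pair
of helpers so that "mc0" does not have to be edited by hand for each socket.
This is a sketch only: the function names and the ``MC_ROOT`` override
variable are hypothetical, the values written are the ones used in the
example above, and disarming by writing 0 to ``inject_enable`` is an
assumption based on the statement that only a nonzero value starts error
generation.

```shell
#!/bin/sh
# Sketch: arm and reset memory error injection for one memory controller.
# MC_ROOT is a hypothetical override for illustration; the default is the
# sysfs location given in the text. Nothing is injected when this file is
# loaded: the functions must be called explicitly, as root.
MC_ROOT="${MC_ROOT:-/sys/devices/system/edac/mc}"

# inject_on_channel MC CHANNEL
# Replays the example sequence for memory controller mcMC and the channel.
inject_on_channel() {
    mc="$MC_ROOT/mc$1"
    if [ ! -d "$mc/inject_addrmatch" ]; then
        echo "mc$1: no injection nodes found" >&2
        return 1
    fi
    echo "$2" > "$mc/inject_addrmatch/channel"  # match this channel only
    echo 2    > "$mc/inject_type"               # bit 1 set: ecc
    echo 64   > "$mc/inject_eccmask"            # mask used in the example
    echo 3    > "$mc/inject_section"            # 3: both sections
    echo 1    > "$mc/inject_enable"             # nonzero: start injection
}

# inject_reset MC
# Assumption: writing 0 to inject_enable stops error generation.
# Also restores the "any" default on every inject_addrmatch node.
inject_reset() {
    mc="$MC_ROOT/mc$1"
    [ -d "$mc/inject_addrmatch" ] || return 1
    echo 0 > "$mc/inject_enable"
    for f in "$mc"/inject_addrmatch/*; do
        if [ -w "$f" ]; then
            echo any > "$f"
        fi
    done
}
```

For socket 1 and channel 2, this would be called as ``inject_on_channel 1 2``,
followed by a memory access to trigger the error and then ``inject_reset 1``.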
1144	
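The sequential channel/DIMM to csrow mapping from item 1, and the way the
UDIMM counters from item 3 are shared between csrows, can be illustrated
with a few lines of shell. This is a toy model of the behaviour described in
the text, not driver code: the ``map_csrows`` function and its input format
("channel dimm" pairs, one per line) are hypothetical.

```shell
#!/bin/sh
# Sketch: model the csrow numbering and UDIMM counter sharing described
# above. Input: one "channel dimm" pair per line, in the order in which
# the DIMMs are found. Each pair gets the next csrow number; the shared
# counter is selected by the DIMM number alone, regardless of channel.
map_csrows() {
    n=0
    while read -r ch dimm; do
        [ -n "$ch" ] || continue     # ignore empty lines
        printf 'csrow%d: channel %s, dimm%s -> udimm%s\n' \
            "$n" "$ch" "$dimm" "$dimm"
        n=$((n + 1))
    done
}

# The layout used as the example in the text.
printf '0 0\n0 1\n1 0\n2 0\n' | map_csrows
```

For the example layout this prints csrow0 through csrow3, with csrow0,
csrow2 and csrow3 all pointing at ``udimm0`` and csrow1 at ``udimm1``.
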
1145	Reference documents used on ``amd64_edac``
1146	------------------------------------------
1147	
The ``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):
1150	
1151	1. :Title:  BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
1152		   Opteron Processors
1153	   :AMD publication #: 26094
1154	   :Revision: 3.26
1155	   :Link: http://support.amd.com/TechDocs/26094.PDF
1156	
1157	2. :Title:  BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
1158		   Processors
1159	   :AMD publication #: 32559
1160	   :Revision: 3.00
1161	   :Issue Date: May 2006
1162	   :Link: http://support.amd.com/TechDocs/32559.pdf
1163	
1164	3. :Title:  BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
1165		   Processors
1166	   :AMD publication #: 31116
1167	   :Revision: 3.00
1168	   :Issue Date: September 07, 2007
1169	   :Link: http://support.amd.com/TechDocs/31116.pdf
1170	
1171	4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
1172		  Models 30h-3Fh Processors
1173	   :AMD publication #: 49125
1174	   :Revision: 3.06
1175	   :Issue Date: 2/12/2015 (latest release)
1176	   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
1177	
1178	5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
1179		  Models 60h-6Fh Processors
1180	   :AMD publication #: 50742
1181	   :Revision: 3.01
1182	   :Issue Date: 7/23/2015 (latest release)
1183	   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
1184	
1185	6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
1186		  Models 00h-0Fh Processors
1187	   :AMD publication #: 48751
1188	   :Revision: 3.03
1189	   :Issue Date: 2/23/2015 (latest release)
1190	   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf
1191	
1192	Credits
1193	=======
1194	
1195	* Written by Doug Thompson <dougthompson@xmission.com>
1196	
1197	  - 7 Dec 2005
1198	  - 17 Jul 2007	Updated
1199	
1200	* |copy| Mauro Carvalho Chehab
1201	
1202	  - 05 Aug 2009	Nehalem interface
1203	  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section
1204	
1205	* EDAC authors/maintainers:
1206	
1207	  - Doug Thompson, Dave Jiang, Dave Peterson et al,
1208	  - Mauro Carvalho Chehab
1209	  - Borislav Petkov
1210	  - original author: Thayne Harbaugh