
Documentation / edac.txt





Based on kernel version 3.13. Page generated on 2014-01-20 22:02 EST.

1	
2	
3	EDAC - Error Detection And Correction
4	
5	Written by Doug Thompson <dougthompson@xmission.com>
6	7 Dec 2005
7	17 Jul 2007	Updated
8	
9	(c) Mauro Carvalho Chehab <mchehab@redhat.com>
10	05 Aug 2009	Nehalem interface
11	
12	EDAC is maintained and written by:
13	
14		Doug Thompson, Dave Jiang, Dave Peterson et al,
15		original author: Thayne Harbaugh,
16	
17	Contact:
18		website:	bluesmoke.sourceforge.net
19		mailing list:	bluesmoke-devel@lists.sourceforge.net
20	
21	"bluesmoke" was the name for this device driver when it was "out-of-tree"
22	and maintained at sourceforge.net.  When it was pushed into 2.6.16 for the
23	first time, it was renamed to 'EDAC'.
24	
25	The bluesmoke project at sourceforge.net is now utilized as a 'staging area'
26	for EDAC development, before it is sent upstream to kernel.org
27	
28	The bluesmoke/EDAC project site hosts a series of quilt patches against
29	recent kernels, stored in an SVN repository. For easier downloading, there
30	is also a tarball snapshot available.
31	
32	============================================================================
33	EDAC PURPOSE
34	
35	The goal of the 'edac' kernel module is to detect and report errors that
36	occur within a computer system running under Linux.
37	
38	MEMORY
39	
40	In the initial release, memory Correctable Errors (CE) and Uncorrectable
41	Errors (UE) are the primary errors being harvested. These types of errors
42	are harvested by the 'edac_mc' class of device.
43	
44	Detecting CE events, then harvesting those events and reporting them,
45	CAN be a predictor of future UE events.  With CE events, the system can
46	continue to operate, but with less safety. Preventive maintenance and
47	proactive part replacement of memory DIMMs exhibiting CEs can reduce
48	the likelihood of the dreaded UE events and system 'panics'.
49	
50	NON-MEMORY
51	
52	A new feature for EDAC, the edac_device class of device, was added in
53	the 2.6.23 version of the kernel.
54	
55	This new device type allows non-memory types of ECC hardware detectors
56	to have their states harvested and presented to userspace via the sysfs
57	interface.
58	
59	Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA
60	engines, fabric switches, main data path switches, interconnections,
61	and various other hardware data paths. If the hardware reports it, then
62	an edac_device device probably can be constructed to harvest and present
63	that to userspace.
64	
65	
66	PCI BUS SCANNING
67	
68	In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
69	in order to determine if errors are occurring on data transfers.
70	
71	The presence of PCI Parity errors must be examined with a grain of salt.
72	There are several add-in adapters that do NOT follow the PCI specification
73	with regards to Parity generation and reporting. The specification says
74	the vendor should tie the parity status bits to 0 if they do not intend
75	to generate parity.  Some vendors do not do this, and thus the parity bit
76	can "float" giving false positives.
77	
78	In the kernel there is a PCI device attribute located in sysfs that is
79	checked by the EDAC PCI scanning code. If that attribute is set,
80	PCI parity/error scanning is skipped for that device. The attribute
81	is:
82	
83		broken_parity_status
84	
85	and is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directories for
86	PCI devices.
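
As a rough sketch of how one might find the devices that the EDAC PCI
scanning code will skip, the shell function below walks a sysfs tree and
prints every device directory whose broken_parity_status attribute reads 1.
The function name list_broken_parity and its optional root argument are
illustrative assumptions, not part of EDAC; the default root of /sys/devices
follows the path given above.

```shell
# Hypothetical helper: list PCI device directories whose
# broken_parity_status attribute is set to 1.
# $1 is the sysfs subtree to search (assumed default: /sys/devices).
list_broken_parity() (
    root=${1:-/sys/devices}
    find "$root" -name broken_parity_status 2>/dev/null | sort |
    while read -r attr; do
        # Unreadable attributes are treated as "not set".
        if [ "$(cat "$attr" 2>/dev/null)" = "1" ]; then
            dirname "$attr"
        fi
    done
)
```

Calling list_broken_parity with no argument searches /sys/devices; passing
a directory restricts the search to that subtree.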
87	
88	FUTURE HARDWARE SCANNING
89	
90	In the future, the following error detection mechanisms will be
91	integrated with EDAC or added to it:
92	
93		MCE	Machine Check Exception
94		MCA	Machine Check Architecture
95		NMI	NMI notification of ECC errors
96		MSRs	Model Specific Register error cases
97		and other mechanisms.
98	
99	These errors are usually bus errors, ECC errors, thermal throttling
100	and the like.
101	
102	
103	============================================================================
104	EDAC VERSIONING
105	
106	EDAC is composed of a "core" module (edac_core.ko) and several Memory
107	Controller (MC) driver modules. On a given system, the CORE
108	is loaded and one MC driver will be loaded. Both the CORE and
109	the MC driver (or edac_device driver) have individual versions that reflect
110	current release level of their respective modules.
111	
112	Thus, to "report" on what version a system is running, one must report both
113	the CORE's and the MC driver's versions.
114	
115	
116	LOADING
117	
118	If 'edac' was statically linked with the kernel then no loading is
119	necessary.  If 'edac' was built as modules then simply modprobe the
120	'edac' pieces that you need.  You should be able to modprobe
121	hardware-specific modules and have the dependencies load the necessary core
122	modules.
123	
124	Example:
125	
126	$> modprobe amd76x_edac
127	
128	loads both the amd76x_edac.ko memory controller module and the edac_core.ko
129	core module.
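
To see which EDAC pieces actually got loaded, one option is to filter the
kernel's module list. The sketch below assumes the standard /proc/modules
format, in which the module name is the first field; the function name
loaded_edac_modules and its optional file argument are hypothetical.

```shell
# Hypothetical helper: print the names of loaded modules whose name
# contains "edac". $1 is a file in /proc/modules format
# (assumed default: /proc/modules).
loaded_edac_modules() {
    awk '$1 ~ /edac/ { print $1 }' "${1:-/proc/modules}"
}
```

On a system where the example above succeeded, the output would be expected
to include both the core module and amd76x_edac.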
130	
131	
132	============================================================================
133	EDAC sysfs INTERFACE
134	
135	EDAC presents a 'sysfs' interface for control, reporting and attribute
136	reporting purposes.
137	
138	EDAC lives in the /sys/devices/system/edac directory.
139	
140	Within this directory there currently reside 2 'edac' components:
141	
142		mc	memory controller(s) system
143		pci	PCI control and status system
144	
145	
146	============================================================================
147	Memory Controller (mc) Model
148	
149	First a background on the memory controller's model abstracted in EDAC.
150	Each 'mc' device controls a set of DIMM memory modules. These modules are
151	laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
152	be multiple csrows and multiple channels.
153	
154	Memory controllers allow for several csrows, with 8 csrows being a typical value.
155	Yet, the actual number of csrows depends on the electrical "loading"
156	of a given motherboard, memory controller and DIMM characteristics.
157	
158	Dual channels allow for 128 bit data transfers to the CPU from memory.
159	Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs
160	(FB-DIMMs). The following example will assume 2 channels:
161	
162	
163			Channel 0	Channel 1
164		===================================
165		csrow0	| DIMM_A0	| DIMM_B0 |
166		csrow1	| DIMM_A0	| DIMM_B0 |
167		===================================
168	
169		===================================
170		csrow2	| DIMM_A1	| DIMM_B1 |
171		csrow3	| DIMM_A1	| DIMM_B1 |
172		===================================
173	
174	In the above example table there are 4 physical slots on the motherboard
175	for memory DIMMs:
176	
177		DIMM_A0
178		DIMM_B0
179		DIMM_A1
180		DIMM_B1
181	
182	Labels for these slots are usually silk screened on the motherboard. Slots
183	labeled 'A' are channel 0 in this example. Slots labeled 'B'
184	are channel 1. Notice that there are two csrows possible on a
185	physical DIMM. These csrows are allocated their csrow assignment
186	based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
187	is placed in each Channel, the csrows cross both DIMMs.
188	
189	Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
190	Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
191	will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
192	when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
193	csrow1 will be populated. The pattern repeats itself for csrow2 and
194	csrow3.
195	
196	The representation of the above is reflected in the directory tree
197	in EDAC's sysfs interface. Starting in directory
198	/sys/devices/system/edac/mc each memory controller will be represented
199	by its own 'mcX' directory, where 'X' is the index of the MC.
200	
201	
202		..../edac/mc/
203			   |
204			   |->mc0
205			   |->mc1
206			   |->mc2
207			   ....
208	
209	Under each 'mcX' directory, each csrow is in turn represented by a
210	'csrowX' directory, where 'X' is the csrow index:
211	
212	
213		.../mc/mc0/
214			|
215			|->csrow0
216			|->csrow2
217			|->csrow3
218			....
219	
220	Notice that there is no csrow1, which indicates that csrow0 is
221	composed of single ranked DIMMs. This should also apply in both
222	Channels, in order to have dual-channel mode be operational. Since
223	both csrow2 and csrow3 are populated, this indicates a dual ranked
224	set of DIMMs for channels 0 and 1.
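
The directory layout described above can be inspected with a small shell
function. This is only a sketch: the name list_csrows and the optional
directory argument are assumptions made for illustration, and the csrowX
directories are only expected to exist where the legacy sysfs layout is
available.

```shell
# Hypothetical helper: print "mcX: csrowA csrowB ..." for each memory
# controller found under the EDAC mc directory.
# $1 is the mc directory (assumed default: /sys/devices/system/edac/mc).
list_csrows() (
    mc_root=${1:-/sys/devices/system/edac/mc}
    for mc in "$mc_root"/mc[0-9]*; do
        [ -d "$mc" ] || continue
        rows=""
        for row in "$mc"/csrow[0-9]*; do
            [ -d "$row" ] || continue
            rows="$rows ${row##*/}"
        done
        echo "${mc##*/}:$rows"
    done
)
```

A gap in the printed csrow numbers corresponds to an unpopulated csrow, as
with the missing csrow1 in the example tree.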
225	
226	
227	Within each of the 'mcX' and 'csrowX' directories are several
228	EDAC control and attribute files.
229	
230	============================================================================
231	'mcX' DIRECTORIES
232	
233	
234	In 'mcX' directories are EDAC control and attribute files for
235	this 'X' instance of the memory controllers.
236	
237	For a description of the sysfs API, please see:
238		Documentation/ABI/testing/sysfs-devices-edac
239	
240	
241	============================================================================
242	'csrowX' DIRECTORIES
243	
244	When CONFIG_EDAC_LEGACY_SYSFS is enabled, the sysfs will contain the
245	csrowX directories. As this API doesn't work properly for Rambus, FB-DIMMs
246	and modern Intel Memory Controllers, this is being deprecated in favor
247	of dimmX directories.
248	
249	In the 'csrowX' directories are EDAC control and attribute files for
250	this 'X' instance of csrow:
251	
252	
253	Total Uncorrectable Errors count attribute file:
254	
255		'ue_count'
256	
257		This attribute file displays the total count of uncorrectable
258		errors that have occurred on this csrow. If panic_on_ue is set
259		this counter will not have a chance to increment, since EDAC
260		will panic the system.
261	
262	
263	Total Correctable Errors count attribute file:
264	
265		'ce_count'
266	
267		This attribute file displays the total count of correctable
268		errors that have occurred on this csrow. This
269		count is very important to examine. CEs provide early
270		indications that a DIMM is beginning to fail. This count
271		field should be monitored for non-zero values, and any such
272		values should be reported to the system administrator.
273	
274	
275	Total memory managed by this csrow attribute file:
276	
277		'size_mb'
278	
279		This attribute file displays the amount of memory, in megabytes,
280		that this csrow contains.
281	
282	
283	Memory Type attribute file:
284	
285		'mem_type'
286	
287		This attribute file will display what type of memory is currently
288		on this csrow. Normally, either buffered or unbuffered memory.
289		Examples:
290			Registered-DDR
291			Unbuffered-DDR
292	
293	
294	EDAC Mode of operation attribute file:
295	
296		'edac_mode'
297	
298		This attribute file will display what type of Error detection
299		and correction is being utilized.
300	
301	
302	Device type attribute file:
303	
304		'dev_type'
305	
306		This attribute file will display what type of DRAM device is
307		being utilized on this DIMM.
308		Examples:
309			x1
310			x2
311			x4
312			x8
313	
314	
315	Channel 0 CE Count attribute file:
316	
317		'ch0_ce_count'
318	
319		This attribute file will display the count of CEs on this
320		DIMM located in channel 0.
321	
322	
323	Channel 0 UE Count attribute file:
324	
325		'ch0_ue_count'
326	
327		This attribute file will display the count of UEs on this
328		DIMM located in channel 0.
329	
330	
331	Channel 0 DIMM Label control file:
332	
333		'ch0_dimm_label'
334	
335		This control file allows this DIMM to have a label assigned
336		to it. With this label in the module, when errors occur
337		the output can provide the DIMM label in the system log.
338		This becomes vital for panic events to isolate the
339		cause of the UE event.
340	
341		DIMM Labels must be assigned after booting, with information
342		that correctly identifies the physical slot with its
343		silk screen label. This information is currently very
344		motherboard specific and determination of this information
345		must occur in userland at this time.
346	
347	
348	Channel 1 CE Count attribute file:
349	
350		'ch1_ce_count'
351	
352		This attribute file will display the count of CEs on this
353		DIMM located in channel 1.
354	
355	
356	Channel 1 UE Count attribute file:
357	
358		'ch1_ue_count'
359	
360		This attribute file will display the count of UEs on this
361		DIMM located in channel 1.
362	
363	
364	Channel 1 DIMM Label control file:
365	
366		'ch1_dimm_label'
367	
368		This control file allows this DIMM to have a label assigned
369		to it. With this label in the module, when errors occur
370		the output can provide the DIMM label in the system log.
371		This becomes vital for panic events to isolate the
372		cause of the UE event.
373	
374		DIMM Labels must be assigned after booting, with information
375		that correctly identifies the physical slot with its
376		silk screen label. This information is currently very
377		motherboard specific and determination of this information
378		must occur in userland at this time.
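
Since the CE counters are meant to be monitored, a minimal monitoring
sketch may help. It scans the per-channel chN_ce_count files described
above and prints one line for each non-zero counter, together with the
matching chN_dimm_label if one has been set. The function name report_ce,
the output format and the optional directory argument are assumptions made
for illustration.

```shell
# Hypothetical helper: report every per-channel CE counter that is
# non-zero, together with the DIMM label if one has been assigned.
# $1 is the mc directory (assumed default: /sys/devices/system/edac/mc).
report_ce() (
    mc_root=${1:-/sys/devices/system/edac/mc}
    for f in "$mc_root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue
        count=$(cat "$f")
        # Skip zero counters and anything that is not a number.
        [ "$count" -gt 0 ] 2>/dev/null || continue
        row_dir=${f%/*}                  # .../mcX/csrowY
        mc_dir=${row_dir%/*}             # .../mcX
        ch=${f##*/}; ch=${ch%_ce_count}  # chN
        label=$(cat "$row_dir/${ch}_dimm_label" 2>/dev/null)
        echo "${mc_dir##*/} ${row_dir##*/} $ch ce_count=$count label=${label:-unknown}"
    done
)
```

Run periodically (for example from cron), any output from report_ce would
indicate a DIMM worth reporting to the system administrator.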
379	
380	============================================================================
381	SYSTEM LOGGING
382	
383	If logging for UEs and CEs is enabled then system logs will have
384	error notices indicating errors that have been detected:
385	
386	EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
387	channel 1 "DIMM_B1": amd76x_edac
388	
389	EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
390	channel 1 "DIMM_B1": amd76x_edac
391	
392	
393	The structure of the message is:
394		the memory controller			(MC0)
395		Error type				(CE)
396		memory page				(0x283)
397		offset in the page			(0xce0)
398		the byte granularity 			(grain 8)
399			or resolution of the error
400		the error syndrome			(0x6ec3)
401		memory row				(row 0)
402		memory channel				(channel 1)
403		DIMM label, if set prior		(DIMM_B1)
404		and then an optional, driver-specific message that may
405			have additional information.
406	
407	Both UEs and CEs with no info will lack all but memory controller,
408	error type, a notice of "no info" and then an optional,
409	driver-specific error message.
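
The message structure above lends itself to simple parsing. The sketch
below uses sed to split one such message into named fields. It assumes the
message arrives on a single line in exactly the format of the examples
above, including a quoted DIMM label; the function name parse_edac_line and
the key=value output format are hypothetical, and messages in any other
format (such as "no info" messages) produce no output.

```shell
# Hypothetical helper: split one EDAC CE/UE log message into fields.
# $1 is the complete message on a single line.
parse_edac_line() {
    printf '%s\n' "$1" | sed -n 's/^.*EDAC MC\([0-9]*\): \([CU]E\) page \(0x[0-9a-fA-F]*\), offset \(0x[0-9a-fA-F]*\), grain \([0-9]*\), syndrome \(0x[0-9a-fA-F]*\), row \([0-9]*\), channel \([0-9]*\) "\([^"]*\)".*$/mc=\1 type=\2 page=\3 offset=\4 grain=\5 syndrome=\6 row=\7 channel=\8 label=\9/p'
}
```

The trailing driver-specific part of the message is ignored by this
sketch.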
410	
411	
412	============================================================================
413	PCI Bus Parity Detection
414	
415	
416	On Header Type 00 devices the primary status is looked at
417	for any parity error regardless of whether Parity is enabled on the
418	device.  (The spec indicates parity is generated in some cases).
419	On Header Type 01 bridges, the secondary status register is also
420	looked at to see if a parity error occurred on the bus on the other
421	side of the bridge.
422	
423	
424	SYSFS CONFIGURATION
425	
426	Under /sys/devices/system/edac/pci are control and attribute files as follows:
427	
428	
429	Enable/Disable PCI Parity checking control file:
430	
431		'check_pci_parity'
432	
433	
434		This control file enables or disables the PCI Bus Parity scanning
435		operation. Writing a 1 to this file enables the scanning. Writing
436		a 0 to this file disables the scanning.
437	
438		Enable:
439		echo "1" >/sys/devices/system/edac/pci/check_pci_parity
440	
441		Disable:
442		echo "0" >/sys/devices/system/edac/pci/check_pci_parity
443	
444	
445	Parity Count:
446	
447		'pci_parity_count'
448	
449		This attribute file will display the number of parity errors that
450		have been detected.
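
Both files can be read together to get a quick view of the PCI parity
scanning state. The following is a sketch only; the function name
pci_parity_status, the output format and the optional directory argument
are assumptions made for illustration.

```shell
# Hypothetical helper: show whether PCI parity scanning is enabled and
# how many parity errors have been counted so far.
# $1 is the pci directory (assumed default: /sys/devices/system/edac/pci).
pci_parity_status() (
    pci_dir=${1:-/sys/devices/system/edac/pci}
    check=$(cat "$pci_dir/check_pci_parity" 2>/dev/null)
    count=$(cat "$pci_dir/pci_parity_count" 2>/dev/null)
    echo "check_pci_parity=${check:-unavailable} pci_parity_count=${count:-unavailable}"
)
```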
451	
452	
453	============================================================================
454	MODULE PARAMETERS
455	
456	Panic on UE control file:
457	
458		'edac_mc_panic_on_ue'
459	
460		An uncorrectable error will cause a machine panic.  This is usually
461		desirable.  It is a bad idea to continue when an uncorrectable error
462		occurs - it is indeterminate what was uncorrected and the operating
463		system context might be so mangled that continuing will lead to further
464		corruption. If the kernel has MCE configured, then EDAC will never
465		notice the UE.
466	
467		LOAD TIME: module/kernel parameter: edac_mc_panic_on_ue=[0|1]
468	
469		RUN TIME:  echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
470	
471	
472	Log UE control file:
473	
474		'edac_mc_log_ue'
475	
476		Generate kernel messages describing uncorrectable errors.  These errors
477		are reported through the system message log system.  UE statistics
478		will be accumulated even when UE logging is disabled.
479	
480		LOAD TIME: module/kernel parameter: edac_mc_log_ue=[0|1]
481	
482		RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
483	
484	
485	Log CE control file:
486	
487		'edac_mc_log_ce'
488	
489		Generate kernel messages describing correctable errors.  These
490		errors are reported through the system message log system.
491		CE statistics will be accumulated even when CE logging is disabled.
492	
493		LOAD TIME: module/kernel parameter: edac_mc_log_ce=[0|1]
494	
495		RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
496	
497	
498	Polling period control file:
499	
500		'edac_mc_poll_msec'
501	
502		The time period, in milliseconds, for polling for error information.
503		Too small a value wastes resources.  Too large a value might delay
504		necessary handling of errors and might lose valuable information for
505		locating the error.  1000 milliseconds (once each second) is the current
506		default. Systems which require all the bandwidth they can get may
507		increase this.
508	
509		LOAD TIME: module/kernel parameter: edac_mc_poll_msec=<milliseconds>
510	
511		RUN TIME: echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
512	
513	
514	Panic on PCI PARITY Error:
515	
516		'panic_on_pci_parity'
517	
518	
519		This control file enables or disables panicking when a parity
520		error has been detected.
521	
522	
523		module/kernel parameter: edac_panic_on_pci_pe=[0|1]
524	
525		Enable:
526		echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
527	
528		Disable:
529		echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
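
To review the current run-time settings in one go, the parameter files can
simply be listed with their values. This sketch assumes the parameters
live under /sys/module/edac_core/parameters, as in the RUN TIME examples
above; the function name show_edac_params and the optional directory
argument are hypothetical.

```shell
# Hypothetical helper: print "name=value" for every readable parameter.
# $1 is the parameters directory
# (assumed default: /sys/module/edac_core/parameters).
show_edac_params() (
    param_dir=${1:-/sys/module/edac_core/parameters}
    for p in "$param_dir"/*; do
        [ -f "$p" ] && [ -r "$p" ] || continue
        echo "${p##*/}=$(cat "$p")"
    done
)
```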
530	
531	
532	
533	=======================================================================
534	
535	
536	EDAC_DEVICE type of device
537	
538	In the header file, edac_core.h, there is a series of edac_device structures
539	and APIs for the EDAC_DEVICE.
540	
541	User space access to an edac_device is through the sysfs interface.
542	
543	At the location /sys/devices/system/edac (sysfs) new edac_device devices will
544	appear.
545	
546	There is a three level tree beneath the above 'edac' directory. For example,
547	the 'test_device_edac' device (found at the bluesmoke.sourceforge.net website)
548	installs itself as:
549	
550		/sys/devices/system/edac/test-instance
551	
552	In this directory are various controls, a symlink and one or more 'instance'
553	directories.
554	
555	The standard default controls are:
556	
557		log_ce		boolean to log CE events
558		log_ue		boolean to log UE events
559		panic_on_ue	boolean to 'panic' the system if a UE is encountered
560				(default off, can be set true via startup script)
561		poll_msec	time period between POLL cycles for events
562	
563	The test_device_edac device adds at least one custom control of its own:
564	
565		test_bits	which in the current test driver does nothing but
566				show how it is installed. A ported driver can
567				add one or more such controls and/or attributes
568				for specific uses.
569				One out-of-tree driver uses controls here to allow
570				for ERROR INJECTION operations to hardware
571				injection registers
572	
573	The symlink points to the 'struct dev' that is registered for this edac_device.
574	
575	INSTANCES
576	
577	One or more instance directories are present. For the 'test_device_edac' case:
578	
579		test-instance0
580	
581	
582	In this directory there are two default counter attributes, which are totals of
583	the counters in deeper subdirectories.
584	
585		ce_count	total of CE events of subdirectories
586		ue_count	total of UE events of subdirectories
587	
588	BLOCKS
589	
590	At the lowest directory level is the 'block' directory. There can be 0, 1
591	or more blocks specified in each instance.
592	
593		test-block0
594	
595	
596	In this directory the default attributes are:
597	
598		ce_count	which is the counter of CE events for this 'block'
599				of hardware being monitored
600		ue_count	which is the counter of UE events for this 'block'
601				of hardware being monitored
602	
603	
604	The 'test_device_edac' device adds 4 attributes and 1 control:
605	
606		test-block-bits-0	for every POLL cycle this counter
607					is incremented
608		test-block-bits-1	every 10 cycles, this counter is bumped once,
609					and test-block-bits-0 is set to 0
610		test-block-bits-2	every 100 cycles, this counter is bumped once,
611					and test-block-bits-1 is set to 0
612		test-block-bits-3	every 1000 cycles, this counter is bumped once,
613					and test-block-bits-2 is set to 0
614	
615	
616		reset-counters		writing anything to this control will
617					reset all the above counters.
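
The three level tree (device, instance, block) can be walked with nested
loops. The sketch below prints the ce_count and ue_count values of each
instance and of each block beneath it. The function name
edac_device_counts and the output format are assumptions; a directory is
treated as an instance or block only if it contains a ce_count file, which
is intended to skip the symlink and any other entries.

```shell
# Hypothetical helper: print CE/UE counters for each instance and block
# of one edac_device.
# $1 is the device directory, e.g. /sys/devices/system/edac/test-instance
edac_device_counts() (
    dev_dir=$1
    for inst in "$dev_dir"/*/; do
        inst=${inst%/}
        [ -f "$inst/ce_count" ] || continue
        echo "${inst##*/}: ce=$(cat "$inst/ce_count") ue=$(cat "$inst/ue_count")"
        for blk in "$inst"/*/; do
            blk=${blk%/}
            [ -f "$blk/ce_count" ] || continue
            echo "  ${blk##*/}: ce=$(cat "$blk/ce_count") ue=$(cat "$blk/ue_count")"
        done
    done
)
```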
618	
619	
620	Use of the 'test_device_edac' driver should enable others to create their own
621	unique drivers for their hardware systems.
622	
623	The 'test_device_edac' sample driver is located at the
624	bluesmoke.sourceforge.net project site for EDAC.
625	
626	=======================================================================
627	NEHALEM USAGE OF EDAC APIs
628	
629	This chapter documents some EXPERIMENTAL mappings of the EDAC API used by the
630	Nehalem EDAC driver. They will likely change in future versions
631	of the driver.
632	
633	Due to the way Nehalem exports Memory Controller data, some adjustments
634	were made in the i7core_edac driver. This chapter covers those differences.
635	
636	1) On Nehalem, there is one Memory Controller per QuickPath Interconnect
637	   (QPI). In the driver, the term "socket" means one QPI. This is
638	   associated with a physical CPU socket.
639	
640	   Each MC has 3 physical read channels, 3 physical write channels and
641	   3 logical channels. The driver currently sees them as just 3 channels.
642	   Each channel can have up to 3 DIMMs.
643	
644	   The smallest known unit is the DIMM. There is no information about csrows.
645	   As the EDAC API treats the csrow as the smallest unit, the driver
646	   sequentially maps each channel/DIMM pair into a different csrow.
647	
648	   For example, supposing the following layout:
649		Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
650		  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
651		  dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
652		Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
653		  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
654		Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
655		  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
656	   The driver will map it as:
657		csrow0: channel 0, dimm0
658		csrow1: channel 0, dimm1
659		csrow2: channel 1, dimm0
660		csrow3: channel 2, dimm0
661	
662	   The driver exports one
663	   DIMM per csrow.
664	
665	   Each QPI is exported as a different memory controller.
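
The sequential mapping of channel/DIMM pairs to csrows can be illustrated
with a one-line awk filter. This is not the driver's code, only a sketch
of the numbering scheme described in item 1; the function name
map_dimms_to_csrows and its input format (one "channel dimm" pair per
line, in discovery order) are assumptions.

```shell
# Hypothetical illustration: read "channel dimm" pairs, one per line, on
# standard input and assign csrow numbers in the order the pairs arrive.
map_dimms_to_csrows() {
    awk '{ printf "csrow%d: channel %s, dimm%s\n", NR - 1, $1, $2 }'
}
```

Feeding it the four pairs from the example layout (0 0, 0 1, 1 0, 2 0)
yields the same csrow0 to csrow3 assignment listed in item 1.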
666	
667	2) The Nehalem MC has the ability to generate errors. The driver implements this
668	   functionality via some error injection nodes.
669	
670	   For injecting a memory error, there are some sysfs nodes, under
671	   /sys/devices/system/edac/mc/mc?/:
672	
673	   inject_addrmatch/*:
674	      Controls the error injection mask register. It is possible to specify
675	      several characteristics of the address to match an error code:
676	         dimm = the affected dimm. Numbers are relative to a channel;
677	         rank = the memory rank;
678	         channel = the channel that will generate an error;
679	         bank = the affected bank;
680	         page = the page address;
681	         column (or col) = the address column.
682	      Each of the above values can be set to "any" to match any valid value.
683	
684	      At driver init, all values are set to any.
685	
686	      For example, to generate an error at rank 1 of dimm 2, for any channel,
687	      any bank, any page, any column:
688			echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
689			echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
690	
691	      To return to the default behaviour of matching any, you can do:
692			echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
693			echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
694	
695	   inject_eccmask:
696	       specifies which bits will be affected by the error.
697	
698	   inject_section:
699	       specifies what ECC cache section will get the error:
700			3 for both
701			2 for the highest
702			1 for the lowest
703	
704	   inject_type:
705	       specifies the type of error, being a combination of the following bits:
706			bit 0 - repeat
707			bit 1 - ecc
708			bit 2 - parity
709	
710	   inject_enable:
711	       starts the error generation when a value other than 0 is written.
712	
713	   All inject vars can be read. Root permission is needed to write them.
714	
715	   The datasheet states that the error will only be generated after a write to an
716	   address that matches inject_addrmatch. It seems, however, that reading will
717	   also produce an error.
718	
719	   For example, the following code will generate an error for any write access
720	   at socket 0, on any DIMM/address on channel 2:
721	
722	   echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
723	   echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
724	   echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
725	   echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
726	   echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
727	   dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
728	
729	   For socket 1, replace "mc0" with "mc1" in the above
730	   commands.
731	
732	   The generated error message will look like:
733	
734	   EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
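
Two small helpers can make the injection nodes of item 2 easier to drive
from scripts. The first computes an inject_type value from flag names,
using the bit assignments listed for inject_type (repeat = bit 0, ecc =
bit 1, parity = bit 2). The second writes "any" to every file in
inject_addrmatch. Both function names are hypothetical, and this is a
sketch under those assumptions rather than a tool shipped with the driver.

```shell
# Hypothetical helper: combine flag names into an inject_type value.
# Bit assignments: repeat = bit 0, ecc = bit 1, parity = bit 2.
inject_type_value() (
    value=0
    for flag in "$@"; do
        case $flag in
            repeat) value=$((value | 1)) ;;
            ecc)    value=$((value | 2)) ;;
            parity) value=$((value | 4)) ;;
            *) echo "unknown flag: $flag" >&2; exit 1 ;;
        esac
    done
    echo "$value"
)

# Hypothetical helper: set every inject_addrmatch field back to "any".
# $1 is a memory controller directory such as .../edac/mc/mc0.
reset_addrmatch() (
    for f in "$1"/inject_addrmatch/*; do
        [ -f "$f" ] || continue
        echo any > "$f"
    done
)
```

For instance, inject_type_value ecc prints 2, the value written to
inject_type in the example of item 2.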
735	
736	3) Nehalem specific Corrected Error memory counters
737	
738	   Nehalem has some registers to count memory errors. The driver uses those
739	   registers to report Corrected Errors on devices with Registered DIMMs.
740	
741	   However, those counters don't work with Unregistered DIMMs. As the chipset
742	   offers some counters that also work with UDIMMs (but with a worse level of
743	   granularity than the default ones), the driver exposes those registers for
744	   UDIMM memories.
745	
746	   They can be read by looking at the contents of all_channel_counts/:
747	
748	   $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
749		/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
750		0
751		/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
752		0
753		/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
754		0
755	
756	   What happens here is that errors on different csrows, but at the same
757	   dimm number will increment the same counter.
758	   So, in this memory mapping:
759		csrow0: channel 0, dimm0
760		csrow1: channel 0, dimm1
761		csrow2: channel 1, dimm0
762		csrow3: channel 2, dimm0
763	   The hardware will increment udimm0 for an error at the first dimm of a
764		channel, which here is csrow0, csrow2 or csrow3;
765	   The hardware will increment udimm1 for an error at the second dimm of a
766		channel, which here is csrow1;
767	   The hardware will increment udimm2 for an error at the third dimm of a
768		channel, which is not populated in this mapping.
769	
770	4) Standard error counters
771	
772	   The standard error counters are generated when an mcelog error is received
773	   by the driver. Since, with UDIMMs, this is counted by software, it is
774	   possible that some errors could be lost. With RDIMMs, they display the
775	   contents of the registers.

Information is copyright its respective author. All material is available from the Linux Kernel Source distributed under a GPL License. This page is provided as a free service by mjmwired.net.