A fatal hardware error has occurred (Event ID 18)
Edit: I've rewritten this post, both to update it and to focus it on the primary issue.

I've been dealing with an issue lately (Edit: there were originally two issues; one has been solved). This started shortly after I changed my video card (GTX 1060 to RX 7800 XT).

--------------------------------------------------

Summary of the issue:


The display sometimes goes black (the display backlight stays on, though), and then the PC restarts on its own (most of the time) or stays that way (one or two times). The time between the screen turning black and the PC restarting varies.

These are not BSODs. I'm not seeing one (I have automatic restart on BSOD disabled) nor does anything show up insofar as minidumps, memory dumps, or event logs indicating such. What I am getting is a log for Event ID 18 every time (which is a machine check exception of "a fatal hardware error has occurred"), and nothing else (besides the expected Event ID 41 and Event ID 6008, which are merely byproducts of the unexpected shutdown). Details about this issue are below.

--------------------------------------------------

PC Specifications:


https://valid.x86.fr/306yr6

PSU: EVGA SuperNova G5 750W
CPU: Ryzen 7 5800X3D
CPU cooling: Be Quiet Dark Rock Pro 4
Motherboard: MSI MAG X570S Tomahawk Max WiFi (7D54v17 BIOS)
RAM: 64 GB (4x 16 GB) G.Skill Ripjaws V 3,600 MHz 16-19-19-39 1.35V
GPU: Sapphire Nitro RX 7800 XT (23.10.2 drivers)
SSD(s): 2x Western Digital Black SN850X 2 TB (latest firmware on both)
HDD(s): 1x Western Digital Black 5 TB
2x Western Digital Blue 8 TB
Display: Dell U2410 24" 1920 x 1200/60 Hz (connected via DisplayPort and HDMI)

Everything is at stock, with the exception of the "XMP" RAM profile speed being enabled.

--------------------------------------------------

Detailed description of the issue:


For a week or so after I added the video card, things were fine.

I tried undervolting my CPU (all core offset of -30, and then -20), and both offsets passed some initial stress tests, but failed in real world scenarios. The screen would go black, and the PC would restart, and this was the first time the issue occurred. Both times I was met with Event ID 18 in the Event Viewer.

"A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 14

The details view of this entry contains further information."


The APIC number varies each time I get this issue, as this corresponds to the logical processor that threw the exception.

I chalked it up to instability, and set the CPU back to stock.

Not long after, it happened again... and it irked me, but I thought maybe it was a one off and waited to see if it would continue.

For another week or so, things were again stable.

I then got another one (this was the point I initially made this thread). And then another. And then another. And... I'm losing track. And they seem to be escalating.

--------------------------------------------------

Troubleshooting I've attempted:


1. I've updated the motherboard BIOS. V1.5, V1.7, and V1.8.

2. Windows 10 is up to date.

3. AMD chipset drivers are up to date. Audio drivers are up to date. Ethernet drivers are up to date. Bluetooth and WiFi drivers are up to date. Etc.

4. I've updated video card drivers as new ones have become available. Both issues have persisted on all drivers I've tried, including 23.9.1, 23.9.3, 23.10.1, 23.10.2, and 23.11.1. I'm not even getting any event logs about the drivers crashing and recovering.

5. I've used DDU to uninstall and reinstall the video drivers. Yes, I used safe mode. Yes, I disconnected the internet.

6. I've reset the BIOS or otherwise done things (see number 13 below) that leave me with a reset BIOS who knows how many times.

7. I've disabled XMP (seems like it may have made it worse, but that might just be coincidental), and I've also enabled XMP but scaled back the RAM frequency/IF clock a bit to 3,200 MHz/1,600 MHz respectively. So it doesn't matter whether RAM/IF is set to 2,133 MHz (JEDEC default)/1,066 MHz, 3,200 MHz/1,600 MHz, or 3,600 MHz/1,800 MHz respectively; they all have the issue. This seems to rule out RAM or Infinity Fabric instability?

8. I've run stress tests galore. Windows memory diagnostic (might not be very conclusive on its own but I did it), MemTest86+, Prime 95, BurnInTest, and the majority of the OCCT suite. All passed, with the exception of the "GPU variable" test in OCCT, which immediately caused the crash the first time I attempted it, but then succeeded on a subsequent attempt (at first I was happy, ironically, that I may have found a reproducible cause, but it seems I didn't).

9. I've tried connecting the DP cable to both output ports on the video card (mine has two DP and two HDMI instead of three DP and one HDMI).

10. I've tried HDMI.

11. I've adjusted the ASPM setting (PCI Express > Link State Power Management > Off).

12. I've completely reinstalled Windows 10!

13. I've completely, and I mean completely, taken my PC apart down to the individual parts, cleaned it (though it was already rather clean), and reassembled it. This was to rule out a bad connection anywhere. I even swapped RAM around, and the CPU was also reseated.

14. The video card is a Sapphire Nitro+ RX 7800 XT which has a BIOS switch with three positions (one performance BIOS, one silent BIOS, and the other is just a mode that lets you change it on the fly with the Sapphire Trixxx software). I've tried both BIOS/all three positions.

15. I've used "Driver Verifier" which is something Windows includes and followed the instructions here[answers.microsoft.com] to stress test the drivers. This was inconclusive, but not useless. Since the issue doesn't yet have a known reproducible, on demand cause, I have to wait, but this tends to cause it to occur sooner. Unfortunately, the Driver Verifier does not catch anything and give me a notice of any violations it detected. Maybe because the drivers are fine and the issue isn't drivers but hardware itself. I'm reading machine check exceptions are, as a rule, almost always hardware and not software.

16. I've found some people saying they suspect the issue is the card boosting above where it should. I've tried limiting the boost to 2,500 MHz (default maximum is 2,565 MHz) but it doesn't seem to truly respect this. Nonetheless, it made no difference. Along with this, I tried disabling "Zero Fan" (this stops the fan when the temperature is below a certain threshold) as someone suggested, and this also made no difference.

17. I've tried disabling ULPS.

18. I've tried disabling MPO.

19. I've tried a 3700X in place of the 5800X3D. It happens on both.

None of these troubleshooting steps have resolved the issue.
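
For reference, the RAM/IF pairs listed in item 7 follow simple arithmetic: the quoted DDR4 figure is the transfer rate, and the matching Infinity Fabric clock is half of it. Here is a minimal Python sketch of that, assuming the fabric runs 1:1 with the memory clock (the function name is only illustrative):

```python
def matched_fclk_mhz(ddr_rate: int) -> int:
    """Return the Infinity Fabric clock that pairs 1:1 with a DDR4 rate.

    Assumption: 1:1 coupling, so FCLK equals the real memory clock,
    which is half the quoted DDR rate (two transfers per clock).
    Integer division matches how the 2,133 setting is shown as 1,066.
    """
    if ddr_rate <= 0:
        raise ValueError("DDR rate must be positive")
    return ddr_rate // 2


# The three RAM settings tried in item 7.
for rate in (2133, 3200, 3600):
    print(f"RAM {rate} -> IF {matched_fclk_mhz(rate)} MHz")
```

All three pairs tested above are matched this way, so none of them ran the fabric out of step with the memory.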

--------------------------------------------------

Troubleshooting I'm needing to do since the above failed, and I think I want to try both first before deciding to proceed down any RMA path:


1. Try my old video card to see if the issue indeed goes away.
Last edited by Illusion of Progress; 8 Nov 2023 at 13:08
Showing 16-30 of 149 comments
pasa 13 Oct 2023 at 9:27
It's definitely not "wattage" related as in lack of overall supply. But millisecond spikes in draw from somewhere like the GPU may result in a short brownout elsewhere. Alt-tab and switches between modes are exactly the kind of things that may cause the graphics system to drop its state and rebuild from scratch. (Btw, it may even go in the other direction: sharply dropping power draw needs a reaction from the supply and creates a wave from the regulation.) The point is the fluctuation, not the level of draw, be it high or low.

Another idea I had was picking up external interference like from a mobile phone. Though that is not consistent with relation to F11.

My primary suspect would be the MOBO here, but guess it's not possible to test the rest of the HW with a different one.
Agent 13 Oct 2023 at 9:32
Originally posted by pasa:
It's definitely not "wattage" related as in lack of overall supply. But millisecond spikes in draw from somewhere like the GPU may result in a short brownout elsewhere. Alt-tab and switches between modes are exactly the kind of things that may cause the graphics system to drop its state and rebuild from scratch. (Btw, it may even go in the other direction: sharply dropping power draw needs a reaction from the supply and creates a wave from the regulation.) The point is the fluctuation, not the level of draw, be it high or low.

Another idea I had was picking up external interference like from a mobile phone. Though that is not consistent with relation to F11.

My primary suspect would be the MOBO here, but guess it's not possible to test the rest of the HW with a different one.
Yes I'd use something like OCCT which has benchmarks that can put the system under load then back to idle very quickly to test how it handles sudden power changes.
Last edited by Agent; 13 Oct 2023 at 9:33
A&A 13 Oct 2023 at 9:54
I continue to think that the voltage is wrong, because at -30 millivolts this problem occurred more often than at -20. It is at such a level where you literally cannot say where it starts being a lot of undervolting and where it stops. That's exactly why I say even if it's at a lower frequency, it shouldn't be a huge deal. Even if it is only necessary to test whether the processor is stable, there is no need to touch the multiplier; just turn off the boost clock.
Chouchers 13 Oct 2023 at 14:06
RMA your video card, it's likely bad. The PSU is not the problem, as it only needs 700 W.
PopinFRESH 13 Oct 2023 at 14:10
Step 1 Run CPUz validator and post the link :)

GPU / Driver Issue
When you said you DDU'd the drivers can you elaborate on specifically the full process / everything that was done in doing the DDU before you installed the new card?

Did you reboot into safe mode to perform the DDU?
Did you enable the option to disable the Windows automatic driver install via Windows Update?
Did you disconnect from the internet when performing the DDU?

Also, the loss of display signal is almost certainly a resultant byproduct of this same issue where you experience a recoverable driver crash and the driver resets causing the display to blank and then gets signal again when the driver recovers. It is possibly other issues but given the circumstance I'd highly suspect it is directly related to the GPU change and possible driver issues.

WHEA & MCE Error
Advanced Programmable Interrupt Controller (APIC) is the interrupt controller tied to each logical processor during boot. With SMT2 it should be enumerated such that Core0 will be assigned APIC ID 0 and 1, Core1 will be assigned APIC ID 2 and 3, Core2 will be assigned APIC ID 4 and 5, etc. It is a bit more complex than that, but for the most part you should see that enumeration behavior.

Not directly applicable but here is an Intel document on how APIC enumeration should occur[cdrdv2-public.intel.com] on IA-64 and IA-32 CPUs. I don't know of a similar AMD whitepaper so just posting this for conceptual purposes since there was the related discussion regarding APIC.

The APIC ID in those errors indicates the logical core where the interrupt was occurring when the error happened. In other words, it tells you which logical core the thread that resulted in the crash was running on when the crash occurred. Don't get too hung up on that ID unless you are seeing consistent crashes all from the same specific logical core, while bearing in mind that the enumeration occurs with each boot, so the ID <-> physical core mapping may change after a crash and reboot.
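
To make that enumeration concrete, here is a rough Python sketch of the simplified mapping described above (two logical processors per core, IDs assigned in order). It deliberately ignores the "a bit more complex" parts, so treat it as an illustration of the idea rather than how the firmware actually assigns IDs; the function name is made up for this example.

```python
def apic_to_core_thread(apic_id: int, threads_per_core: int = 2) -> tuple[int, int]:
    """Map an APIC ID to (core index, SMT thread index).

    Assumption: the simplified scheme from the post, where Core0 gets
    APIC IDs 0 and 1, Core1 gets 2 and 3, and so on.
    """
    if apic_id < 0:
        raise ValueError("APIC ID cannot be negative")
    return divmod(apic_id, threads_per_core)


# APIC ID 14 is the one quoted in the Event ID 18 log in the first post.
core, thread = apic_to_core_thread(14)
print(f"APIC ID 14 -> core {core}, thread {thread}")  # core 7, thread 0
```

If the logged IDs keep landing on the same core index across many crashes, that would be the "consistent" case worth paying attention to.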

With 3rd gen Zen / Ryzen 5000, the errors you've noted I would lean toward an unstable Infinity Fabric. This is likely set to "Auto" in your UEFI/BIOS by default and may change when applying the XMP profile and not change when disabling the XMP profile. (CPUz should show what your FCLK is currently running at)

Try:
NOTE: If you are using secure boot with fTPM and CSM disabled, you will need to ensure you reconfigure all of those settings in order to restore booting functionality after resetting UEFI/BIOS as described below

  1. Reset BIOS by booting into UEFI and selecting to load defaults; then save and exit
  2. Power off, disconnect power from the PSU, remove the CMOS battery, and short the Clear CMOS jumper; wait 30 seconds, then reinstall the CMOS battery and supply power to the PSU again
  3. You will likely see when attempting to first power on after this that the system will seem to "attempt" to power on and power off multiple times while the board completes memory training. Just let it do its thing until you have the option to boot back into UEFI/BIOS
  4. Leave all settings at defaults other than the above noted settings required for UEFI boot to your Windows OS; e.g. disable CSM, enable secure boot (don't reload keys), enable fTPM, etc.
  5. Check the setting for FCLK and whether it is set to auto (it should be).
  6. Set your SSD with Windows installed on it as your UEFI boot device and then save & exit to reboot and let it try to boot back into Windows normally.

You should be running at the default memory timings and the SPD defined memory speed for your memory kit at this point. Run the CPUz burn-in for about 45min and keep an eye on HWMONITOR for temps and voltages of the CPU packages. If things appear stable try to reproduce the issue you were encountering by doing the things you've been doing when you've encountered the issue previously.

If that is stable, then try to manually tune your memory to 3200MT/s and manually set the FCLK to 1600MHz, but we can cross that bridge when we get there. Also keep in mind that these errors could be a knock-on effect from the GPU change, so try to resolve that first.
Last edited by PopinFRESH; 13 Oct 2023 at 14:18
emoticorpse 13 Oct 2023 at 14:26
Originally posted by PopinFRESH:
Step 1 Run CPUz validator and post the link :)
...

His link to the validation is already up there :lunar2019coolpig:
GOD RAYS ON ULTRA™ 13 Oct 2023 at 14:32
Well, if you run out of things to try I read about this option in AMD driver suite, Zero RPM. You can try disabling that. Apparently it makes some AMD systems freeze, lose signal or crash.

https://community.amd.com/t5/drivers-software/setting-link-state-power-management-to-off-fixed-my-crashes/td-p/294814
emoticorpse 13 Oct 2023 at 14:37
Might have something to do with the rare (I don't want to call it rare, but I think most people don't have that much RAM, especially using 4 sticks) memory setup. Maybe for troubleshooting purposes take two sticks out to see if it gets stable? I know even if it did stabilize, you'd want to put the RAM back in, but I'm just saying.
Last edited by emoticorpse; 13 Oct 2023 at 16:22
PopinFRESH 13 Oct 2023 at 14:48
Originally posted by emoticorpse:
Originally posted by PopinFRESH:
Step 1 Run CPUz validator and post the link :)
...

His link to the validation is already up there :lunar2019coolpig:

Lol yeah I'm just blind... and I looked twice for it since I'd expect him to post it with how detailed he tends to be. We are all human I guess :)
emoticorpse 13 Oct 2023 at 15:32
Originally posted by PopinFRESH:
Originally posted by emoticorpse:

His link to the validation is already up there :lunar2019coolpig:

Lol yeah I'm just blind... and I looked twice for it since I'd expect him to post it with how detailed he tends to be. We are all human I guess :)

Yeah, I miss things too. With me it's laziness and rushing it I think.
Last edited by emoticorpse; 13 Oct 2023 at 15:39
Rod 13 Oct 2023 at 15:43
You didn't format after installing the card, did you? Could it be that? Sounds stupid, but why else would this all happen from just changing the GPU? It's AMD after all!
Illusion of Progress 13 Oct 2023 at 21:40
I won't quote every reply because there's too much, but I'll try and cover the important things. If something important was asked of/from me and I miss it, please ask again.

Yes, this "started" after the change of graphics card. I quote started because I recently noticed I had WHEA warning level logs (Event ID 19) going back to around the time I changed my motherboard (AM4 to AM4), but the error level logs (Event ID 18) resulting in crashing started with the video card change.

For those unaware, Event ID 19 is "a corrected hardware error has occurred", and event ID 18 is "a fatal hardware error has occurred" (and while fatal hardware errors can occur on any platform, Event ID 18 seems to show up only with AMD, and more specifically with some generations of CPUs like mine). Same thing, of sorts, but different severity. One is so bad there's no coming back from it, so it crashes.
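
For anyone wanting to pull the same Event ID 18/19 entries out of their own System log, here is a small Python sketch that builds a `wevtutil` query for them. The provider name `Microsoft-Windows-WHEA-Logger` and the helper names are assumptions for this example; the query is only executed when the script is run on Windows.

```python
import subprocess
import sys

# Assumption: WHEA events are logged by this provider in the System log.
WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"


def build_whea_query(event_ids):
    """Build an XPath filter matching the given WHEA event IDs."""
    id_filter = " or ".join(f"EventID={i}" for i in event_ids)
    return f"*[System[Provider[@Name='{WHEA_PROVIDER}'] and ({id_filter})]]"


def build_command(event_ids, count=10):
    """Return a wevtutil command listing the newest matching events as text."""
    return [
        "wevtutil", "qe", "System",
        f"/q:{build_whea_query(event_ids)}",
        f"/c:{count}",   # maximum number of events to return
        "/rd:true",      # newest first
        "/f:text",       # plain-text output
    ]


if __name__ == "__main__":
    cmd = build_command([18, 19])
    print(" ".join(cmd))
    if sys.platform == "win32":
        subprocess.run(cmd, check=False)
```

Listing 18 and 19 together makes it easier to see whether the corrected-error warnings and the fatal errors cluster around the same dates.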

This is basically where I'm at now. I'm having Event ID 18 crashes and I have a trail of Event ID 19 warnings from the last year. So it seems I was already close to having issues before, and the GPU change pushed that over?

Or coincidental timing. It's also possible they are two different things and I'm still apt to having the warnings, and these recent error crash ones are different.

Therefore, it's hard for me to say where the real issue is (system side or GPU side).

But my logical thinking is "new behavior since graphics card change equals start there" and then maybe look into the remaining warning level issue later, instead of conflating both issues and making it harder on myself?

I did test running at default BIOS/stock/no XMP/no high Infinity Fabric speeds/etc. My first rule of "I have an issue" is to try stock/default settings. As it is, I pretty much already run at default everything anyway, the only change being that I use the XMP profile (which will result in the Infinity Fabric being set to match). Even at stock RAM/Infinity Fabric speeds (2133 MHz/1066 MHz respectively), I not only still had issues, but they showed up sooner and in situations I didn't have them before! League of Legends crashed now for goodness sake, and not even a day after running that way. At least before I was going days/weeks without issue and then it'd only crash when, like, running Minecraft for hours, making a recording, and then pressing F11 to switch states and start another task.

The person who advised OCCT, I did that as one of the initial stress tests while undervolting to test said undervolt. When I changed the graphics card, it was fine for a week or so, and so I decided to try and undervolt the CPU (purely to bring temperatures down some because now my GPU was actually running cooler than it). I even did the "core cycler" method to shift a single core load across random threads. To those who don't know, this method is to help catch instability when switching load states, since a lot of the time you can be stable at idle or full load, but when switching one way or the other it might not be. Result? OCCT passed. Prime95 passed. Real world was crashing, and fast. I figured it wasn't stable so I stopped undervolting and set it back to stock. But then I had another of such crashes a few days later and then it was again fine for a week. Then another issue. That's when I knew it wasn't a one off.

When I used DDU when changing from nVidia to AMD, yes, I did it in safe mode and with the internet disconnected. I didn't touch anything with Windows update, but the OS didn't install any drivers on its own, and the drivers I tried installing went fine (I've read of a certain issue with AMD driver installation itself being tricky because Windows does something to conflict with it, and I didn't seem to have any such issues).

Besides these issues, which I know is ironic to say given what's going on, the GPU drivers and behavior/performance in games has been better than expected, but I've also not gotten very far in testing multiple things nor had it very long.

When my display loses signal/enters power save mode in a loop, the drivers are not crashing during this time. There's nothing (and I mean nothing) in Event Viewer during these times, nothing stating the drivers crashed and recovered either. Speaking of which, I have not had this lesser issue of the display losing signal since I updated BIOS/drivers, but... it's only been two days, so I'll keep you updated on whether that occurs again. The crashes remain, though.

Again, if you said something and want me to respond to it specifically, please ask again. I'm not trying to ignore anything but a lot was said.
Last edited by Illusion of Progress; 13 Oct 2023 at 21:43
pasa 14 Oct 2023 at 1:33
Is your JEDEC ram setting using lower voltage than the XMP? If so you can bump just the RAM voltage and see whether it has effect.
PopinFRESH 14 Oct 2023 at 4:55
These are in no particular order; as always for troubleshooting do one incremental change at a time and re-test to note any changes.

Do you still have the 1060? Are you able to perform a DDU as you've noted previously (but in the advanced options check the box to disable automatic driver install), then shut down and reinstall the 1060, and install the latest GeForce drivers to test if you are still having the issues (besides the WHEA warnings you found post-facto had been occurring prior to the GPU change)?

Do you have another temporary disk you can use to disconnect all of your other disks, and do a clean install of Windows 10, and do a clean install for the motherboard drivers and GPU drivers to try to rule out software being an issue?

Do you have the MSI Center installed? and if so have you tried removing it and retesting?

Do you have any RGB control software installed? and if so have you tried removing it and retesting?

Do you have the Ryzen Master software installed? Do you have PBO enabled?

Double check in BIOS/UEFI that MSI's "Gameboost" or "Creator Genie" isn't enabled/active by default.

Try manually configuring memory and FCLK. (For post-change burn-in-testing outside of your installed OS, here is a link to the PassMark Burn-In-Test WinPE Builder Guide[www.passmark.com])
  1. In BIOS/UEFI press F6 to reset to defaults; then save and exit to reboot and boot back into UEFI
  2. Apply the A-XMP profile for your memory; then save and exit to reboot and boot back into UEFI
  3. Make sure the CPU Loadline Calibration Control is set to Auto
  4. Re-enable fTPM by setting the option to AMD CPU fTPM
  5. Set boot mode to UEFI only (rather than UEFI + Legacy)
  6. Disable both the LPT and COM ports in the Super IO Configuration section (so they are not being assigned IRQs)
  7. In advanced, leave the voltages, and timings for the memory to what was set by the XMP profile. Change the frequency/transfers to 3200MT/s
  8. Change the FCLK from auto to 1600MHz
  9. Save and exit UEFI and boot to Burn-in-test USB and run a burn-in-test on the CPU, Memory, and GPU (both 2D and 3D) for 30 minutes


For hardware just to fully understand the state of things:

Do you have both the 8pin EPS and the 4pin P4 CPU_PWR1 and CPU_PWR2 connections fed from your PSU?

Are there any other PCIe Add-in-cards installed or just the GPU?
Illusion of Progress 14 Oct 2023 at 9:38
Originally posted by pasa:
Is your JEDEC ram setting using lower voltage than the XMP? If so you can bump just the RAM voltage and see whether it has effect.
The RAM seems to use 1.2V at 2,133 MHz speeds, instead of 1.35V when using the XMP profile.

I also noticed the RAM has two XMP profiles, and I can't see much difference between them as both are listed as 3,600 MHz at 16-19-19-39 timings, but trying to use the second profile just results in it running at 2,133 MHz instead. Not too important though.
Originally posted by PopinFRESH:
Do you still have the 1060? Are you able to perform a DDU as you've noted previously (but in the advanced options check the box to disable automatic driver install), then shut down and reinstall the 1060, and install the latest GeForce drivers to test if you are still having the issues (besides the WHEA warnings you found post-facto had been occurring prior to the GPU change)?
I thought of mentioning this earlier but I've already been including lots of information so I've been trying to keep it strictly to what's relevant.

I still have the GTX 1060, yes. I have the feeling putting it back into the system will result in the issue going away, but that's merely a feeling so I guess there's no place for those here. Unfortunately, given I can go up to a week or more without the issue, it's... hard to troubleshoot. And other than the crash in League of Legends at stock JEDEC RAM settings, it only seems to happen circumstantially with Minecraft. Namely (and this is referring to behavior that occurred even on the previous GTX 1060), I've noticed if I record for too long, and then stop, and then press F11 to switch to Window mode, a couple of seconds later I would get a game crash instead of a system crash, with an exit code (-1073740791[bugs.mojang.com]) that referred to nVidia 36x.xx era drivers (way old ones) as the cause, despite me having that crash code up to nVidia's latest drivers, and then the resulting recording (despite stopping before the crash) was unreadable by any video player. I thought maybe 700 GB+ videos was just... too much or whatever, and I should just avoid those situations, but maybe I'm digressing now. Point is, my first thought when I had this crash was just "it's just a situational Minecraft thing rather than a system issue, but I don't record long videos often, and nVidia just crashed more 'gracefully' in that specific situation" so I didn't think much of it... until it started happening rather often, just when doing "light" play, and now since making this thread, in another game entirely. So it's not just a Minecraft thing, clearly.

Anyway, back to the relevant stuff...

I also have a number of other things that might prove useful here if necessary. Those are my previous motherboard (Asus ROG Strix B550-F Gaming), my previous AM4 CPU (3700X), and a SATA SSD. I do not have another PSU nor DDR4 RAM for any testing.

I'm hesitant to bring the other motherboard in particular into the equation though, since I had these weird "random" restart issues with the combination of it (both the 3700X and 5800X3D) and my RAM. It was BIOS dependent to a point because a certain BIOS version in particular caused it, and on that version, it was like playing a game of Russian roulette on startup on if I'd have a spontaneous restart about 30 seconds after loading into the Windows 10 desktop (and only then; the BIOS was fine to sit in). If it passed that point, it never restarted... until it finally started doing that on even a later BIOS. Around that same time, I ultimately RMA'd it when I bought M2 drives and started using them, to find out the bottom M2 was faulty outright (this was also when I bought the new motherboard, partly to avoid downtime and partly to get X570 to utilize PCI Express 4 speeds for the second M2), so the spare motherboard I have is an RMA and while I have tested it for initial functionality, I don't know its level of operation beyond that, so to speak. Since changing to the new motherboard, none of those random restarts (or near the end, the crashes with the DRAM light on) have been a thing.

Hm, writing that out has me questioning the RAM now. Especially since that above post asked if the voltage is the same at JEDEC speeds, and no it's lower (which makes sense as it should need less) and it crashed sooner there.

I have tested RAM and it supposedly passes, though if you have suspicions/suggestions here, I'd be open to them.
Originally posted by PopinFRESH:
Do you have another temporary disk you can use to disconnect all of your other disks, and do a clean install of Windows 10, and do a clean install for the motherboard drivers and GPU drivers to try to rule out software being an issue?
I do.

I have an old 1 TB SATA SSD I could use, or I could even clear one of my 2 TB M2 SSDs temporarily.

Funny thing about this is I was considering moving to Windows 11 soon as well. I was using it not long after it launched, but I think the issues AMD platforms were having with it for a short time made me retreat to my Windows 10 install for the time, and I simply never moved back to it yet out of procrastination, which is funny considering I liked the look and layout of Windows 11 better than Windows 10. Didn't like the right click changes or extra mess of file associations though (not sure of the state of those today though).
Originally posted by PopinFRESH:
Do you have the MSI Center installed? and if so have you tried removing it and retesting?

Do you have any RGB control software installed? and if so have you tried removing it and retesting?

Do you have the Ryzen Master software installed? Do you have PBO enabled?
No to most/all of these, unless Sapphire Trixxx or whatever it's called counts, which I needed to disable the RGB on the video card. But as far as I can tell, that doesn't "run" since it seems like it may have set the RGB state on a firmware level (?) since the RGB never comes on at all anymore, not even in the BIOS, and there's no sign of Trixxx ever running.

For PBO I'm not sure. I use a 5800X3D and the motherboard BIOS just says "auto" for it.

I used to use Ryzen Master, but it would randomly throw up a command prompt/terminal (or whatever you want to call it) window when checking for updates, which would steal focus from whatever I was doing, and I got annoyed with that and thus uninstalled it. I wasn't using it for anything really important.
Originally posted by PopinFRESH:
Double check in BIOS/UEFI that MSI's "Gameboost" or "Creator Genie" isn't enabled/active by default.
I can confirm this is off, with the exception of when XMP is enabled, since it highlights the RAM spot for this when it is. But it's off for the CPU.
Originally posted by PopinFRESH:
Try manually configuring memory and FCLK. (For post-change burn-in-testing outside of your installed OS, here is a link to the PassMark Burn-In-Test WinPE Builder Guide[www.passmark.com])
  1. In BIOS/UEFI press F6 to reset to defaults; then save and exit to reboot and boot back into UEFI
  2. Apply the A-XMP profile for your memory; then save and exit to reboot and boot back into UEFI
  3. Make sure the CPU Loadline Calibration Control is set to Auto
  4. Re-enable fTPM by setting the option to AMD CPU fTPM
  5. Set boot mode to UEFI only (rather than UEFI + Legacy)
  6. Disable both the LPT and COM ports in the Super IO Configuration section (so they are not being assigned IRQs)
  7. In advanced, leave the voltages, and timings for the memory to what was set by the XMP profile. Change the frequency/transfers to 3200MT/s
  8. Change the FCLK from auto to 1600MHz
  9. Save and exit UEFI and boot to Burn-in-test USB and run a burn-in-test on the CPU, Memory, and GPU (both 2D and 3D) for 30 minutes
I'll try this and report back. Also, thanks so much for the exhaustive list of things to check/try.
Originally posted by PopinFRESH:
For hardware just to fully understand the state of things:

Do you have both the 8pin EPS and the 4pin P4 CPU_PWR1 and CPU_PWR2 connections fed from your PSU?
Yes, all spots for connectors in the top left of the motherboard are supplied with cables from the PSU.
Originally posted by PopinFRESH:
Are there any other PCIe Add-in-cards installed or just the GPU?
No, only the graphics card.

Date posted: 12 Oct 2023 at 19:40
Posts: 149