xShadowPro Sep 1, 2024 @ 5:11pm
nvlddmkm.sys BSOD during Benchmark post GPU upgrade Gigabyte 3070
Description

I've just moved from a 1080ti to a 3070 and I'm experiencing an issue ONLY during benchmarks. Game-wise, everything is smooth as butter; I have yet to find a game where it crashes. I tested the 3070 in my wife's PC, and the benchmarks finish without issue. I've also re-tested my 1080ti in my PC, and the benchmark runs fine.

I’ve checked the mini dumps and found that the crash is caused by an nvlddmkm.sys failure every time. Typical driver issue from what I have read. I’ve run DDU clean and restarted in safe mode. I freshly installed the latest Nvidia driver and even tried the previous version for good measure—no joy.

I've tried several other troubleshooting steps, but I’m lost as to whether this is a faulty card, conflicting software, or a dodgy driver. I’m hoping the community here might have some insights or suggestions for the root cause. Below is a detailed breakdown of the scenario, what I've tried, and all the relevant specs.

Specs

Main PC:
- CPU: Ryzen 7 5800X
- GPU: Gigabyte RTX 3070 Gaming OC 8GB GDDR6 (rev 2.0)
- RAM: Corsair Dominator Platinum RGB 32 GB (4 x 8 GB) DDR4 3600 MHz
- Motherboard: MPG X570S CARBON MAX WIFI
- PSU: Seasonic Focus GX 850W
- OS: Windows 11 Pro

Test PC:
- CPU: Ryzen 5 5600X
- GPU: MSI GTX 1080 Ti GAMING X 11GB
- RAM: Corsair Vengeance 2x8GB 3200MHz - White
- Motherboard: ROG STRIX B550-A Gaming
- PSU: RM650 80 Plus Gold
- OS: Windows 11 Pro

Crashing Scenario

Running 3DMark Demo Steel Nomad Test
* Soon after the benchmark starts, the screen goes black as shown in this video[imgur.com]. This is followed by an auto-reboot after 2 minutes. The minidump contains nvlddmkm.sys.

Running Benchmark Test on Unigine Heaven
* This can vary, but it seems to black screen around scene 20, followed by a reboot after 2 minutes. (Max GPU temp 80 degrees, CPU 70)

Though I've only had the GPU for 24 hours, I've played Teardown on modded maps, a few other games, and COD to intensify the GPU load, but with no crashes so far.

What I've Tried

* Ran DDU clean and restarted in safe mode, clean installed the latest Nvidia driver, and also tried the previous version (with WIFI OFF).
* Checked the BIOS; PCIE is set to Auto, manually set to Gen4 for testing.
* Disabled Nvidia Container LS, enabled Low Latency Mode.
* Yet to flash Windows 11 on a drive to test.
* Confirmed it works on another machine, and the former GPU can run a benchmark as well.
* No overclocking.

Mini toolkit output[pastebin.com]

Conclusion

Given the steps I’ve taken and the comparison between different PCs, I’m leaning towards this being a firmware/software issue. However, I’m not entirely sure how to prove or resolve this.

I would appreciate any suggestions. If anyone has experienced something similar or has any advice, it would be greatly appreciated!

Thanks in advance for your help!
Last edited by xShadowPro; Sep 1, 2024 @ 5:13pm

Showing 1-9 of 9 comments
Bad 💀 Motha Sep 1, 2024 @ 7:11pm 
If you haven't already, update Win11 to 24H2 (you might need to force it).
Then install any further Windows Updates once the reboot and update process is all done.

DDU wipe the Drivers & GFE in Safe Mode.
Reboot normally when cleaning done.
Download and install the latest AMD Ryzen Chipset Driver and NVIDIA GPU DCH Drivers from their official websites. Do not install GFE.

When you swap CPUs, reset the BIOS fully and then boot back into BIOS and make your needed changes; such as changes to RAM with regards to XMP/EXPO.

If you still have GPU-related problems, change these in NVIDIA Control Panel >
> Power Management = Prefer Max Performance
> Shader Cache = Unlimited

Since you want to run benchmarks, ensure VSync & GSync are both disabled.
Re-enable them later or set them on a game-by-game basis. GSync, however, is something you must set globally; either on or off.

On pretty much every build ever since Win10 came out (and it has not changed) I also use TDR Manipulator and raise all of the TDR timeouts because Win10/11 has those set too low by default.
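
For anyone who wants to see what raising the TDR timeouts amounts to: as far as I know it comes down to the TdrDelay and TdrDdiDelay values under the GraphicsDrivers registry key, which Microsoft documents as defaulting to 2 and 5 seconds. Below is a minimal Python sketch that only builds the .reg text and does not touch the registry. The 10-second values and the function name build_tdr_reg are illustrative assumptions, not a recommendation, and I can't confirm this is exactly what TDR Manipulator writes.

```python
# Sketch: build .reg file text that raises the TDR timeouts.
# TdrDelay and TdrDdiDelay are documented registry values; the 10-second
# figures are an arbitrary illustration, not a tested recommendation.

REG_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def build_tdr_reg(tdr_delay_s: int = 10, tdr_ddi_delay_s: int = 10) -> str:
    """Hypothetical helper: return .reg text setting both timeouts (seconds)."""
    for value in (tdr_delay_s, tdr_ddi_delay_s):
        if not 0 < value <= 0xFFFFFFFF:
            raise ValueError("timeout must be a positive 32-bit integer")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{REG_KEY}]",
        f'"TdrDelay"=dword:{tdr_delay_s:08x}',
        f'"TdrDdiDelay"=dword:{tdr_ddi_delay_s:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Print the text so it can be reviewed before saving it as a .reg file.
    print(build_tdr_reg())
```

Review the output before importing it, and back up the key first so the change is easy to undo.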
Last edited by Bad 💀 Motha; Sep 1, 2024 @ 7:16pm
r.linder Sep 1, 2024 @ 7:11pm 
Make sure that Windows is fully up to date and that you rule out issues caused by storage or RAM by running chkdsk and memory diagnostics
_I_ Sep 1, 2024 @ 7:44pm 
post a cpuz validation link
http://www.cpuid.com/softwares/cpu-z.html
cpuz -> validate button -> submit button
it will open a browser, copy the url (address) and paste it here

make sure ram is set to the xmp profile
Set RAM to default speed (2,133 MHz, not profile speed) and try with two DIMMs. Ensure you use slots 2 and 4 (not 1 and 3).

Sometimes, the issue isn't with a specific part but a combination of them. You have a different motherboard, RAM configuration, and PSU from the other PC it was tested to work in. Start there. The vast drive setup difference might also be a variable.

And if you swap PSUs to test, do not swap the PSU itself without also swapping the cables. Yes, that matters. People have fried parts doing that.

If you're feeling adventurous, you can underclock (not undervolt) the GPU and/or VRAM in increments of 50 MHz or so and see if stability is ever gained. You know the graphics card works in another system though, so it's up to you if you want to spend time on that side of things, or if you want to invest troubleshooting efforts into the other variables in your system.
FAILURE_BUCKET_ID: 0x133_ISR_nvlddmkm!unknown_function
System errors:
=============
Error: (09/01/2024 12:54:11 AM) (Source: Microsoft-Windows-WER-SystemErrorReporting) (EventID: 1001) (User: NT AUTHORITY)
Description: 0x00000133 (0x0000000000000001, 0x0000000000001e00, 0xfffff80350f1c340, 0x0000000000000000) C:\Windows\Minidump\090124-11468-01.dmp 041a36d4-23ad-4c84-8198-a33895ac8417

Error: (09/01/2024 12:54:03 AM) (Source: volmgr) (EventID: 162) (User: )
Description: Dump file generation succeded.

Error: (09/01/2024 12:54:12 AM) (Source: EventLog) (EventID: 6008) (User: )
Description: The previous system shutdown at 00:35:56 on 01/09/2024 was unexpected.

Error: (09/01/2024 12:51:34 AM) (Source: nvlddmkm) (EventID: 153) (User: )
Description: Event-ID 153

Error: (09/01/2024 12:51:34 AM) (Source: nvlddmkm) (EventID: 14) (User: )
Description: Event-ID 14

Error: (09/01/2024 12:35:56 AM) (Source: EventLog) (EventID: 6008) (User: )
Description: The previous system shutdown at 00:29:07 on 01/09/2024 was unexpected.

Error: (09/01/2024 12:21:21 AM) (Source: DCOM) (EventID: 10005) (User: Shadow)
Description: Event-ID 10005

Error: (09/01/2024 12:21:16 AM) (Source: DCOM) (EventID: 10005) (User: Shadow)
Description: Event-ID 10005

Error: (09/01/2024 12:21:16 AM) (Source: DCOM) (EventID: 10005) (User: Shadow)
Description: Event-ID 10005

Error: (09/01/2024 12:21:15 AM) (Source: DCOM) (EventID: 10005) (User: Shadow)
Description: Event-ID 10005
Per the 0x133 bug check in the log above...

https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0x133-dpc-watchdog-violation

...As is common, many things people call "driver issues" may well be frequent occurrences, but they're not necessarily problems with the drivers themselves. Hardware faults or other stability issues often cascade into a particular driver or software process failing.
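
To make the 0x133 line in the log easier to read, here is a rough Python sketch that splits it into the bug check code and its four parameters. The parameter 1 meanings are paraphrased from the Microsoft page linked above; the function name decode_bugcheck is made up for illustration.

```python
import re

# Parameter 1 meanings for bug check 0x133 (DPC_WATCHDOG_VIOLATION),
# paraphrased from the Microsoft documentation page.
PARAM1_MEANING = {
    0: "a single DPC or ISR exceeded its time allotment",
    1: "cumulative time at IRQL DISPATCH_LEVEL or above was too long",
}

# Matches "<code> (<p1>, <p2>, <p3>, <p4>)" as it appears in the event text.
BUGCHECK_RE = re.compile(r"(0x[0-9a-fA-F]+)\s*\(([^)]*)\)")


def decode_bugcheck(description: str) -> dict:
    """Hypothetical helper: split a bug check line into code and parameters."""
    match = BUGCHECK_RE.search(description)
    if match is None:
        raise ValueError("no bug check code found in text")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    meaning = None
    if code == 0x133 and params:
        meaning = PARAM1_MEANING.get(params[0], "unknown parameter 1 value")
    return {"code": code, "params": params, "meaning": meaning}


if __name__ == "__main__":
    line = ("0x00000133 (0x0000000000000001, 0x0000000000001e00, "
            "0xfffff80350f1c340, 0x0000000000000000)")
    info = decode_bugcheck(line)
    print(hex(info["code"]), info["meaning"])
    print("parameter 2 (ticks):", info["params"][1])
```

For the line in this thread, parameter 1 is 1 (the cumulative case rather than a single long DPC or ISR) and parameter 2 is 0x1e00, which the documentation describes as the watchdog period in ticks (7680 here).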

Look in these directories...

Windows/LiveKernelReports/WHEA
Windows/LiveKernelReports/WATCHDOG


And look for the presence of dump files that correspond to the time of these issues. WinDbg can be used to open and analyze them.
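
If it helps, matching dump files to a crash time can be scripted. This is a small Python sketch, assuming the dumps end in .dmp and that each file's modified time is close to the crash. The name dumps_near and the 10-minute window are my own illustrative choices, and reading these folders may need an elevated prompt.

```python
from datetime import datetime, timedelta
from pathlib import Path


def dumps_near(root: Path, when: datetime,
               window: timedelta = timedelta(minutes=10)) -> list[Path]:
    """Hypothetical helper: list .dmp files under root modified near `when`."""
    hits = []
    for path in root.rglob("*.dmp"):
        # Compare the file's local modified time with the crash time.
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if abs(modified - when) <= window:
            hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Time of the nvlddmkm events in the log above (09/01/2024 12:51 AM).
    crash = datetime(2024, 9, 1, 0, 51)
    for folder in ("WHEA", "WATCHDOG"):
        root = Path(r"C:\Windows\LiveKernelReports") / folder
        if not root.is_dir():
            print(f"{root}: not found")
            continue
        for dump in dumps_near(root, crash):
            print(dump)
```

Any files it prints are the ones worth opening in WinDbg first.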
Last edited by Illusion of Progress; Sep 2, 2024 @ 2:27am
Bad 💀 Motha Sep 2, 2024 @ 2:40am 
Nothing to do with RAM; just do the DDU method first
Originally posted by Bad 💀 Motha:
Nothing to do with RAM; just do the DDU method first
Well sorry my king for giving multiple ideas to try instead of just the particular one you think is the answer...

While I would agree that the issue sounds like it lies with the graphics card somehow (hardware or software) given the symptoms, plus the fact that it showed up when the graphics card was changed, this is complicated by the fact that the card was tested in a second system and seems to work fine there, and that it fails with multiple driver versions on the first system. This suggests the card itself may be fine and the drivers may also be fine, so it's worth looking at other variables in the first system too.

Besides, you didn't read very well, did you?
Originally posted by xShadowPro:
I’ve run DDU clean and restarted in safe mode. I freshly installed the latest Nvidia driver and even tried the previous version for good measure—no joy.
Originally posted by xShadowPro:
What I've Tried

* Ran DDU clean and restarted in safe mode, clean installed the latest Nvidia driver, and also tried the previous version (with WIFI OFF).
I get missing stuff here and there, but... it was mentioned not once, but twice. You're also talking about what to do when swapping CPUs in your first reply and no CPU was ever swapped? Those were different PCs it was tested in.

It's like some of you just skim and don't even read, and then you're so fast to repeat the "just DDU it" or "just reinstall Windows" lines out the gate that you don't even see when it was already attempted...
Bad 💀 Motha Sep 2, 2024 @ 4:50am 
Why OP is even wasting time with Win11 is also just beyond me. As if there is any need or benefit to using that.

First off, use Group Policy Editor to disable Windows' ability to update your drivers; otherwise that might lead to problems during testing while the PC is online.

And I wouldn't use the latest NVIDIA Driver unless you have RTX 40 series. Try 555.99 or older.

And again, WinOS is plagued by that low TDR timeout issue; raise those values.

Not to mention the evidence that suggests many suffer from performance issues whenever using 23H2 at all. So I would highly suggest Win10 22H2, or go get Win11 24H2.

Disable Fast Startup + Hibernation as well.
Last edited by Bad 💀 Motha; Sep 2, 2024 @ 4:54am
The error is generic and should be excused as it is merely a clock error where the GPU shuts itself down before it breaks - it could also be that the GPU is too far out of clock range from the factory or a switched Normal and OC running profile within the GPU software itself. :csd2smile:

Just run the benchmark with the GPU set to "Debug" mode within the NVIDIA CP Help menu and check if the benchmark crashes while using the standard model clocks. :chirp:
Last edited by Phénomènes Mystiques; Sep 2, 2024 @ 5:55am
Bad 💀 Motha Sep 2, 2024 @ 3:15pm 
In my experience, if the GPU driver is running into a generic clock or timeout issue, raising the TDR timers in WinOS often solves it.

Date Posted: Sep 1, 2024 @ 5:11pm
Posts: 9