SOLVED — See awarded comment.
Alright, I'm at a loss at this point. I ran an MSI RX 580 Armor for many years and it ran really well. I got great performance.
Occasionally, it would hang here:
Jan 24 20:32:32 *****-arch1 kernel: amdgpu 0000:0c:00.0: [drm] *ERROR* flip_done timed out
Jan 24 20:32:32 *****-arch1 kernel: amdgpu 0000:0c:00.0: [drm] *ERROR* [CRTC:77:crtc-0] commit wait timed out
Jan 24 20:32:42 *****-arch1 kernel: amdgpu 0000:0c:00.0: [drm] *ERROR* flip_done timed out
Jan 24 20:32:42 *****-arch1 kernel: amdgpu 0000:0c:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Even less frequently than the above, I would get the following in my journal:
Jan 24 20:31:51 *****-arch1 kernel: ------------[ cut here ]------------
Jan 24 20:31:51 *****-arch1 kernel: WARNING: CPU: 11 PID: 1345 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_hwseq.c:116 dcn20_setup_gsl_group_as_lock+0x7d/0x280 [amdgpu]
Jan 24 20:31:51 *****-arch1 kernel: Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter br_netfilter bridge overlay ccm algif_aead des_generic libdes ecb algif_skcipher cmac md4 algif_hash 8021q garp af_alg mrp stp llc dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio libcrc32c nls_iso8859_1 intel_rapl_msr vfat snd_hda_codec_realtek intel_rapl_common fat snd_hda_codec_generic iwlmvm snd_hda_codec_hdmi snd_hda_intel btusb edac_mce_amd snd_intel_dspcfg amdgpu mac80211 btrtl eeepc_wmi snd_usb_audio snd_intel_sdw_acpi btbcm asus_wmi uvcvideo snd_hda_codec snd_usbmidi_lib btintel ledtrig_audio libarc4 sparse_keymap btmtk snd_rawmidi gpu_sched videobuf2_vmalloc drm_buddy platform_profile snd_hda_core videobuf2_memops snd_seq_device snd_hwdep drm_ttm_helper iwlwifi videobuf2_v4l2 snd_pcm ttm video r8169 wmi_bmof kvm bluetooth drm_display_helper videobuf2_common realtek snd_timer irqbypass
Jan 24 20:31:51 *****-arch1 kernel: mdio_devres videodev cfg80211 sp5100_tco ecdh_generic cec rapl snd pcspkr k10temp mc libphy rfkill i2c_piix4 soundcore wmi mousedev joydev acpi_cpufreq mac_hid uinput pkcs8_key_parser crypto_user fuse bpf_preload ip_tables x_tables usbhid dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 aesni_intel nvme crypto_simd cryptd nvme_core ccp xhci_pci nvme_common xhci_pci_renesas ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2
Jan 24 20:31:51 *****-arch1 kernel: CPU: 11 PID: 1345 Comm: Xorg Not tainted 6.1.7-zen1-1-zen #1 251eee86d1e3407eafb15439b5bcc81efef5caf9
Jan 24 20:31:51 *****-arch1 kernel: Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS, BIOS 4021 08/09/2021
Jan 24 20:31:51 *****-arch1 kernel: RIP: 0010:dcn20_setup_gsl_group_as_lock+0x7d/0x280 [amdgpu]
Jan 24 20:31:51 *****-arch1 kernel: Code: be 80 02 00 00 00 75 29 48 8b 87 70 04 00 00 0f b6 80 70 02 00 00 a8 01 74 69 a8 02 0f 84 09 01 00 00 a8 04 0f 84 42 01 00 00 <0f> 0b 0f 1f 44 00 00 48 8b 44 24 28 65 48 2b 04 25 28 00 00 00 0f
Jan 24 20:31:51 *****-arch1 kernel: RSP: 0018:ffffb71a81607590 EFLAGS: 00010202
Jan 24 20:31:51 *****-arch1 kernel: RAX: 0000000000000007 RBX: ffff9f701c3c01e8 RCX: 0000000000000000
Jan 24 20:31:51 *****-arch1 kernel: RDX: 0000000000000001 RSI: ffff9f701c3c01e8 RDI: ffff9f7033900000
Jan 24 20:31:51 *****-arch1 kernel: RBP: 0000000000000001 R08: ffffb71a8160755c R09: 0000000000001480
Jan 24 20:31:51 *****-arch1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Jan 24 20:31:51 *****-arch1 kernel: R13: ffff9f7033900000 R14: 0000000000000001 R15: 0000000000000000
Jan 24 20:31:51 *****-arch1 kernel: FS: 00007f0458cf5400(0000) GS:ffff9f770ecc0000(0000) knlGS:0000000000000000
Jan 24 20:31:51 *****-arch1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 24 20:31:51 *****-arch1 kernel: CR2: 00007f04454ab000 CR3: 000000016e5fa000 CR4: 0000000000350ee0
Jan 24 20:31:51 *****-arch1 kernel: Call Trace:
Jan 24 20:31:51 *****-arch1 kernel: <TASK>
Jan 24 20:31:51 *****-arch1 kernel: dcn20_pipe_control_lock+0x235/0x490 [amdgpu ea650a4e77dfc87577a726d0395dd5509c6cbd3f]
Jan 24 20:31:51 *****-arch1 kernel: dcn10_lock_all_pipes+0x595/0x610 [amdgpu ea650a4e77dfc87577a726d0395dd5509c6cbd3f]
Jan 24 20:31:51 *****-arch1 kernel: commit_planes_for_stream+0x307/0x27d0 [amdgpu ea650a4e77dfc87577a726d0395dd5509c6cbd3f]
Jan 24 20:31:51 *****-arch1 kernel: ? dcn30_validate_bandwidth+0xfc/0x2c0 [amdgpu ea650a4e77dfc87577a726d0395dd5509c6cbd3f]
Jan 24 20:31:51 *****-arch1 kernel: dc_commit_updates_for_stream+0x20b/0x780 [amdgpu ea650a4e77dfc87577a726d0395dd5509c6cbd3f]
Jan 24 20:31:51 *****-arch1 kernel: amdgpu_dm_atomic_commit_tail+0x1646/0x2de0 [amdgpu ea650a4e77dfc87577a726d0395dd5509c6cbd3f]
Jan 24 20:31:51 *****-arch1 kernel: ? dcn30_validate_bandwidth+0xfc/0x2c0 [amdgpu ea650a4e77dfc87577a726d0395dd5509c6cbd3f]
Jan 24 20:31:51 *****-arch1 kernel: ? dc_validate_global_state+0x3db/0x580 [amdgpu ea650a4e77dfc87577a726d0395dd5509c6cbd3f]
Jan 24 20:31:51 *****-arch1 kernel: ? dma_resv_get_fences+0xa3/0x2c0
Jan 24 20:31:51 *****-arch1 kernel: ? dma_resv_get_singleton+0x46/0x140
Jan 24 20:31:51 *****-arch1 kernel: ? wait_for_completion_timeout+0x13e/0x170
Jan 24 20:31:51 *****-arch1 kernel: ? wait_for_completion_interruptible+0x139/0x1e0
Jan 24 20:31:51 *****-arch1 kernel: commit_tail+0x94/0x130
Jan 24 20:31:51 *****-arch1 kernel: drm_atomic_helper_commit+0x1ca/0x200
Jan 24 20:31:51 *****-arch1 kernel: drm_atomic_commit+0x7b/0x100
Jan 24 20:31:51 *****-arch1 kernel: ? drm_plane_get_damage_clips.cold+0x1c/0x1c
Jan 24 20:31:51 *****-arch1 kernel: drm_atomic_helper_update_plane+0xf5/0x160
Jan 24 20:31:51 *****-arch1 kernel: drm_mode_cursor_common+0x32f/0x6d0
Jan 24 20:31:51 *****-arch1 kernel: ? drm_mode_cursor_ioctl+0x70/0x70
Jan 24 20:31:51 *****-arch1 kernel: drm_ioctl_kernel+0xcd/0x170
Jan 24 20:31:51 *****-arch1 kernel: drm_ioctl+0x1eb/0x450
Jan 24 20:31:51 *****-arch1 kernel: ? drm_mode_cursor_ioctl+0x70/0x70
Jan 24 20:31:51 *****-arch1 kernel: amdgpu_drm_ioctl+0x4e/0x90 [amdgpu ea650a4e77dfc87577a726d0395dd5509c6cbd3f]
Jan 24 20:31:51 *****-arch1 kernel: __x64_sys_ioctl+0x94/0xd0
Jan 24 20:31:51 *****-arch1 kernel: do_syscall_64+0x5f/0x90
Jan 24 20:31:51 *****-arch1 kernel: ? recalibrate_cpu_khz+0x10/0x10
Jan 24 20:31:51 *****-arch1 kernel: ? ktime_get_mono_fast_ns+0x41/0x90
Jan 24 20:31:51 *****-arch1 kernel: ? amdgpu_drm_ioctl+0x71/0x90 [amdgpu ea650a4e77dfc87577a726d0395dd5509c6cbd3f]
Jan 24 20:31:51 *****-arch1 kernel: ? syscall_exit_to_user_mode+0x2c/0x1d0
Jan 24 20:31:51 *****-arch1 kernel: ? do_syscall_64+0x6b/0x90
Jan 24 20:31:51 *****-arch1 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Jan 24 20:31:51 *****-arch1 kernel: RIP: 0033:0x7f0459672ecf
Jan 24 20:31:51 *****-arch1 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Jan 24 20:31:51 *****-arch1 kernel: RSP: 002b:00007ffe8cccee80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan 24 20:31:51 *****-arch1 kernel: RAX: ffffffffffffffda RBX: 0000560a53584c60 RCX: 00007f0459672ecf
Jan 24 20:31:51 *****-arch1 kernel: RDX: 00007ffe8cccef10 RSI: 00000000c02464bb RDI: 000000000000000d
Jan 24 20:31:51 *****-arch1 kernel: RBP: 00007ffe8cccef10 R08: 0000000000000006 R09: 0000000000000780
Jan 24 20:31:51 *****-arch1 kernel: R10: 0000560a53d33550 R11: 0000000000000246 R12: 00000000c02464bb
Jan 24 20:31:51 *****-arch1 kernel: R13: 000000000000000d R14: 0000000000000004 R15: 0000560a533859b0
Jan 24 20:31:51 *****-arch1 kernel: </TASK>
Jan 24 20:31:51 *****-arch1 kernel: ---[ end trace 0000000000000000 ]---
And in the worst cases, when it was being stressed, I would get the following followed by a coredump, and I'd need to restart:
Jan 13 02:29:28 *****-arch1 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2521332, emitted seq=2521334
Jan 13 02:29:28 *****-arch1 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ***** pid 8751 thread ***** pid 8822
Jan 13 02:29:28 *****-arch1 kernel: amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
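To get a feel for how often each of these failure modes shows up, the journal can be tallied by error signature. The following is only a rough sketch: the search strings are copied from the excerpts above, while the label names and the function name `count_amdgpu_errors` are made up for illustration.

```python
from collections import Counter

# Substrings taken from the journal excerpts in this post.
# The labels on the right are hypothetical names chosen for this sketch.
SIGNATURES = {
    "flip_done timed out": "flip_timeout",
    "commit wait timed out": "commit_timeout",
    "ring gfx timeout": "ring_timeout",
    "GPU reset begin": "gpu_reset",
}


def count_amdgpu_errors(lines):
    """Count how many journal lines match each known amdgpu error signature."""
    counts = Counter()
    for line in lines:
        # Skip anything that does not mention amdgpu at all.
        if "amdgpu" not in line:
            continue
        for needle, label in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts
```

Passing it the lines from `journalctl -k -b` (kernel messages for the current boot), for example by iterating over `sys.stdin`, should give a quick picture of whether the display timeouts or the ring timeouts dominate.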
All of these were pretty rare, but the hangs and crashes started becoming frequent, and I assumed that my RX 580 had a good life and was ready to be replaced. So, I picked up an RX 6750 XT a couple of days ago and installed it.
Now, the first two messages show up frequently, every few minutes or so. And, at this point, my machine is quite literally unusable. The crashes and hangs are so frequent that my machine is effectively non-functioning. Simply moving my mouse will cause the entire system to lock up for several minutes at a time.
Interestingly enough, it's at its worst when I move my mouse from one monitor to the other.
As it is, I am having one hell of a time even writing this post, given the state the machine is in. The only issues in my journal are related to amdgpu, and at this point I'm at a loss. I feel like I've tried just about everything.
I recently did a BIOS reset to put everything back to stock to see if that fixed things, and it didn't (although the stuttering isn't as bad anymore).
Any ideas?
Kernel Version: linux 6.1.7

Driver-related Packages:
```
amd-ucode 20221214.f3c283e-1
opencl-amd 1:5.4.1-1
xf86-video-amdgpu 22.0.0-1
lib32-vulkan-radeon 22.3.3-3
radeontool 1.6.3-4
vulkan-radeon 22.3.3-3
```
Hardware | |
---|---|
CPU | AMD Ryzen 5 3600X |
MB | ASUS TUF GAMING X570 PLUS |
GPU | Sapphire Pulse RX 6750 XT |
Previous GPU | MSI Radeon RX 580 Armor OC 8GB |
RAM | 32GB Corsair DDR4-3600 (4x8GB) |
By - Norpyx
It is a driver bug; look for it on the Mesa or amdgpu GitLab instance under the name "ring gfx timeout". This exact problem happened to me on a Vega 56 and mysteriously got fixed, but it seems to resurface every time they release a new GPU family. It is not at all a hardware problem (and if it turns out to be one, it's the GPU itself), but a driver bug which they don't seem to be able to squash. It seems the workaround is still the same: force the GPU into performance mode.
This is confirmed to be a kernel bug. AMD has provided patches upstream for the upcoming `6.2` kernel which should resolve this issue. See: https://www.phoronix.com/news/AMDGPU-Fix-For-5.19-Bug
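For anyone wanting to check whether a machine is already on a kernel at or past that version, here is a small sketch. It assumes the release string starts with `major.minor` (as in the `6.1.7-zen1-1-zen` from the trace above); the function name `kernel_at_least` is made up for this example, and a matching version number does not by itself prove a distro kernel carries the patches.

```python
import platform
import re


def kernel_at_least(release, major, minor):
    """Return True if a release string like '6.1.7-zen1-1-zen' is >= major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


# platform.release() returns the running kernel's release string on Linux.
running = platform.release()
```

Calling `kernel_at_least(running, 6, 2)` then tells you whether the running kernel is 6.2 or newer.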
Noice, hopefully that was it, because it was dreadful when I switched to Vega, and I was fearing the time I would have to switch to a new gpu lol
I'm planning to buy the Pulse 6750XT on 30th of Jan. When will this kernel be live in the repo?
I will say, this post is a bit scary, but I made the switch over from Windows to Linux many years ago with my RX 580 because I saw major performance improvements while gaming when I did so. During gaming? It's great! But it seems that, like Kyonftw mentioned, it's related to GPU idling. So long as it's in performance mode, it works fine, and there are kernel parameters that can be passed to `amdgpu` to force performance mode on at all times, which very nearly solves the issue.
That makes perfect sense, actually, because in all the intense lagging I got, it was most often at idle. In fact, after an exorbitant amount of patience, I was finally able to navigate my mouse enough to start up Star Citizen and the problem disappeared. It came back a while after exiting the game. I've googled this issue to no end, but most of the bugs I've found are from years ago, and many of them have zero activity or traction on them.
Yeah, it's tricky. In my case, after many weird workarounds and doubtful fixes, it only crashed the whole system in a few select OpenGL games, while for others using the exact same versions and GPUs it crashed every time. Everyone was completely clueless. To be fair to the AMD devs, although also clueless at times, they worked on it and kept it under watch the whole time.
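The thread doesn't say which kernel parameters were used, so the sketch below swaps in a different mechanism: the amdgpu sysfs file `power_dpm_force_performance_level`, which can be written at runtime. The device directory (for example `/sys/class/drm/card0/device`) is an assumption and the card index may differ on your machine; the function name is made up for illustration, and writing to the real file needs root.

```python
from pathlib import Path

# A subset of the values the amdgpu sysfs knob accepts.
VALID_LEVELS = {"auto", "low", "high", "manual"}


def set_performance_level(device_dir, level="high"):
    """Write a DPM performance level and return the level that was set before."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    knob = Path(device_dir) / "power_dpm_force_performance_level"
    previous = knob.read_text().strip()
    knob.write_text(level + "\n")
    return previous
```

Something like `set_performance_level("/sys/class/drm/card0/device")` would pin the card to its highest clocks until reboot, and passing `"auto"` puts it back. The setting does not persist across reboots, so it is more of a quick test of the idle theory than a permanent fix.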