ROCm is AMD's priority, executive says (eetimes.com)
293 points by mindcrime on Sept 26, 2023 | hide | past | favorite | 157 comments


The Debian ROCm Team [1] has made quite a bit of progress in getting the ROCm stack into the official Archive.

Most components are already packaged, the next big target is adding support to the PyTorch package.

Many of the packages are older versions; this is because getting broad coverage was prioritized. The other big target currently being worked on is full ROCm 5.7 support.

I fully expect Debian 13 (trixie) to come with full ROCm support out-of-the-box and, as a consequence, for derivatives (Ubuntu above all) to inherit that support. In fact, there will almost certainly be backports of ROCm 5.7 to Debian 12 (bookworm) within the next few months, so one will be able to just

  $ sudo apt-get install pytorch-rocm
One current obstacle is infrastructure: the Debian build and CI infrastructures (both hardware and software) were not designed with GPUs in mind. This is also being worked on.

Edit: forgot to say that the CI infra that the Team is setting up here tests all of these packages on consumer cards, too. So while there may not be official support for most of these, upstream tests passing on the cards within the infra should be a good indication for practical support.

[1] https://salsa.debian.org/rocm-team/


> One current obstacle is infrastructure: the Debian build and CI infrastructures (both hardware and software) were not designed with GPUs in mind. This is also being worked on.

To be more direct, one thing we lack is funding. AMD has provided RDNA 2 and RDNA 3 GPUs for the Debian CI, but to fill out the rest of the architecture matrix I have been personally buying GPUs. That's been sufficient for covering most architectures, but we will need a sponsor if we are to acquire CDNA 2 and CDNA 3 hardware.

Our goal is to cover every modern discrete AMD GPU architecture on the CI. At the moment, that would be Navi 33, Navi 32, Navi 31, Navi 24, Navi 23, Navi 22, Navi 21, Navi 14, Navi 12, Navi 10, Aldebaran, Arcturus, Vega 20, Vega 10, and (maybe) Polaris. I have been very successful at bringing the AMD GPU libraries to architectures that are not officially supported upstream. Unfortunately, I can't afford to keep buying systems out of my personal funds. I have personally spent ~7k USD on hardware for the CI and I have been offered reimbursement from the Debian project for my next ~5k USD in spending. That has given us a good foundation, but we could do more to improve hardware support if we had more funding available.

Please consider donating to the Debian Project [1] if you wish to support their efforts.

[1]: https://www.debian.org/donations


> but to fill out the rest of the architecture matrix I have been personally buying GPUs. That's been sufficient for covering most architectures, but we will need a sponsor if we are to acquire CDNA 2 and CDNA 3 hardware.

This seems like the kind of thing that AMD should be providing (or at least sponsoring) as a matter of principle, regardless of whether it can be funded in other ways. I.e., if anyone at AMD cared, this problem would be solved trivially. The fact that you are funding it out of pocket seriously calls into question AMD's commitment. What am I missing here?


I think Debian is one of the first places AMD should be looking to fund for ROCm.

Especially if there's already motivated capable volunteer labor, and all they need is equipment, and maybe a devrel point of contact.

The cost seems like a few peanuts dug out of the sofa cushions, on a strategic push like this.


> Unfortunately, I can't afford to keep buying systems out of my personal funds. I have personally spent ~7k USD on hardware for the CI and I have been offered reimbursement from the Debian project for my next ~5k USD in spending.

This is ridiculous. There is absolutely ZERO reason why this hardware is not being given or donated by AMD, at the minimum. I am sure some AMD folks are on HN. Please link this comment to Dr. Lisa Su and get this sorted out.


> I am sure some AMD folks are on HN

The person you're responding to is an AMD employee.


Then that says even worse things about priorities, if someone working there can't find anyone to provide one of each GPU.


"You can talk the talk, but can you walk the walk?"

It pretty much looks like AMD really cannot put their money where their mouth is, and can't bring itself to actually do what is necessary to compete with the green team.

As a personal anecdote from ~4/5 years back, we contacted an AMD sales rep (AMD Germany) about a product we were developing that was absolutely flying on NVIDIA consumer hardware. We wanted to know if there was a possibility to explore how it would run on AMD hardware, with maybe a bit of support. They didn't even bother to reply...


If that's true, in what world is ROCm a priority for AMD at all? They can't even throw a few old cards on the project?!?


Thank you for your service, breaking the NVIDIA monopoly on AI will only be possible from the efforts of people such as yourself.

May I ask, why isn't AMD providing GPUs beyond RDNA 2/3 for you? Is it just because that is considered the priority as those are the newer cards?

I have an RX 580 8GB at home I would be happy to give you free of charge if you don't have access to that card (Polaris 20).


> I have an RX 580 8GB at home I would be happy to give you free of charge if you don't have access to that card (Polaris 20).

I appreciate your generosity, but the costs for the older architectures are dominated by the supporting infrastructure (servers, rack space, networking, power). It's not the GPUs themselves that are the bottleneck. I have sufficient GPUs to test Polaris, but we're short on servers and hosting.


I'd also like to point out that ROCm has been packaged for Arch Linux since the beginning of 2023, with efforts dating back to March 2020 [1].

Currently on Arch Linux you can run the following successfully:

  $ sudo pacman -S python-pytorch-rocm

Arch Linux even has ROCm support with blender.

[1] https://github.com/rocm-arch


Hope you don't mind, but I have a rant I need to get out. I decided to give this another try now that you've mentioned it.

Let's get things started the way the arch wiki suggests:

    $ sudo pacman -S rocm-hip-sdk
    $ /opt/rocm/bin/clinfo
    ERROR: clGetPlatformIDs(-1001)
    $ sudo /opt/rocm/bin/clinfo
    ...
      Board name:     AMD Radeon RX 6600 XT
    ...
Ok, I wonder what's wrong. Maybe it's this? https://stackoverflow.com/questions/4959621/error-1001-in-cl...

Nope. Anything about this on the Arch wiki? Nope.

This bug report[2] from 2021? Maybe I need to update my groups.

[2]: https://github.com/RadeonOpenCompute/ROCm/issues/1411

    $ ls -la /dev/kfd
    crw-rw-rw- 1 root render 237, 0 Sep 26 20:33 /dev/kfd
    $ sudo usermod -aG render $(whoami)
    $ # relogin
    $ /opt/rocm/bin/clinfo
    ERROR: clGetPlatformIDs(-1001)
Ok, I'm a pretty advanced linux user, I'll just jump right in:

    $ strace /opt/rocm/bin/clinfo
    ...
    openat(AT_FDCWD, "rusticl.icd", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
Apparently I have some leftover environment variables (OCL_ICD_VENDORS) from last time I spent half a day trying to get this to work. I can fix that. After all, it'd be entirely unreasonable to expect rocm to give me a better error, like "Could not open opencl icd `rusticl.icd`".
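
(For anyone else hitting the -1001: that code is CL_PLATFORM_NOT_FOUND_KHR, i.e. the ICD loader found no vendor platforms. A rough sketch of checking where the loader is actually looking, assuming the usual ocl-icd default of /etc/OpenCL/vendors:)

```shell
# ocl-icd reads vendor .icd files from /etc/OpenCL/vendors unless
# OCL_ICD_VENDORS points somewhere else; a stale override makes
# clGetPlatformIDs return -1001 (no platforms found).
vendors_dir="${OCL_ICD_VENDORS:-/etc/OpenCL/vendors}"
echo "ICD loader will look in: $vendors_dir"
ls "$vendors_dir" 2>/dev/null || echo "(no vendor files found there)"
unset OCL_ICD_VENDORS   # drop any stale override before re-running clinfo
```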

Success:

    $ /opt/rocm/bin/clinfo
    Number of platforms:    1
    ...
      Board name:     AMD Radeon RX 6600 XT
Well, let's run some apps!

    $ darktable -d opencl
    ...
    [dt_opencl_device_init]
       DEVICE:                   0: 'gfx1032'
       PLATFORM NAME & VENDOR:   AMD Accelerated Parallel Processing, Advanced Micro Devices, Inc.
    ...
    PHI node has multiple entries for the same basic block with different incoming values!
      %967 = phi float [ %largephi.extractslice0, %sw.default ], [ %largephi.extractslice055, %sw.bb667 ], [ %largephi.extractslice059, %sw.bb663 ], [ %largephi.extractslice063, %sw.bb659 ], [ %largephi.extractslice067, %sw.bb655 ], [ %largephi.extractslice071, %sw.bb646 ], [ %largephi.extractslice075, %_Z4fmodff.exit16 ], [ %largephi.extractslice079, %_Z4fmodff.exit13 ], [ %largephi.extractslice083, %_Z4fmodff.exit ], [ %largephi.extractslice087, %sw.bb562 ], [ %largephi.extractslice091, %sw.bb555 ], [ %largephi.extractslice095, %sw.bb533 ], [ %largephi.extractslice099, %if.then502 ], [ %largephi.extractslice0103, %if.else517 ], [ %largephi.extractslice0107, %if.then456 ], [ %largephi.extractslice0111, %if.else471 ], [ %largephi.extractslice0115, %if.then393 ], [ %largephi.extractslice0119, %if.else408 ], [ %largephi.extractslice0123, %if.then338 ], [ %largephi.extractslice0127, %if.else353 ], [ %largephi.extractslice0131, %if.then283 ], [ %largephi.extractslice0135, %if.else298 ], [ %largephi.extractslice0139, %if.then224 ], [ %largephi.extractslice0143, %if.else241 ], [ %largephi.extractslice0147, %sw.bb193 ], [ %largephi.extractslice0151, %sw.bb180 ], [ %largephi.extractslice0155, %sw.bb168 ], [ %largephi.extractslice0159, %sw.bb158 ], [ %largephi.extractslice0163, %sw.bb147 ], [ %largephi.extractslice0167, %if.then116 ], [ %largephi.extractslice0171, %if.else131 ], [ %largephi.extractslice0175, %sw.bb71 ], [ %largephi.extractslice0179, %sw.bb ], [ %largephi.extractslice0183, %if.end ], [ %largephi.extractslice0187, %if.end ], [ %largephi.extractslice0191, %if.end ], [ %largephi.extractslice0195, %if.end ], [ %largephi.extractslice0199, %if.end ]
    label %if.end
      %largephi.extractslice0183 = extractelement <4 x float> %div, i64 0
      %largephi.extractslice0191 = extractelement <4 x float> %div, i64 0
    in function blendop_Lab
    LLVM ERROR: Broken function found, compilation aborted!
    [1]    27586 IOT instruction (core dumped)  darktable -d opencl
uh that's great. Maybe blender?

It worked! Not too bad for 2 minutes render: https://i.imgur.com/FD1SsQG.png

What about pytorch? It prompted this whole thing anyway:

    $ sudo pacman -S python-pytorch-rocm python-torchvision
    $ python neural_style/neural_style.py eval --content-image ../../2min.png --model ./saved_models/mosaic.pth --output-image out.png --cuda 1
    [1]    32471 segmentation fault (core dumped)  python neural_style/neural_style.py eval --content-image ../../2min.png
    $ sudo dmesg --follow
    [ 2467.536713] python[33309]: segfault at 68 ip 00007f12c5504d5d sp 00007ffc8f539c20 error 4 in libamdhip64.so.5.6.31062[7f12c541e000+357000] likely on CPU 14 (core 7, socket 0)
    [ 2467.536727] Code: ec 78 48 89 bd 78 ff ff ff 64 48 8b 04 25 28 00 00 00 48 89 45 c8 31 c0 85 f6 0f 88 09 03 00 00 48 8b 85 78 ff ff ff 48 63 de <48> 8b 50 68 48 8b 40 70 48 89 85 70 ff ff ff 48 29 d0 48 c1 f8 03
uh oh. Maybe I can crack some passwords?

    $ hashcat -m 0 -a 0 -o cracked.txt target_hashes.txt /usr/share/dict/american-english
    ...
    hiprtcCompileProgram(): HIPRTC_ERROR_COMPILATION

    error: unknown argument: '-flegacy-pass-manager'
    1 error generated when compiling for gfx1032.

    * Device #1: Kernel /usr/share/hashcat/OpenCL/shared.cl build failed.
Well, so much for that.

The best I can get to work with ROCm is 1 of 4 apps.


This is why Christian and I have invested so much effort into the CI system for Debian. There needs to be a clear accounting of what works and what doesn't for every library on every architecture.


It's too late to edit, but I should add that the RX 6600 XT is not officially supported by the upstream ROCm project. It's not clear to me that the experience would be better on any other distro. That's where having public test logs would be valuable.


IIRC, nothing below the 6800 is supported by ROCm... so the lion's share of their installed base of the 6000 series is excluded from 'official' support. Nvidia's compute drivers support all of their devices, and have across multiple generations; AMD's support only the low-volume devices, and drop support for older generations seemingly almost as fast as they are released.


One of your problems might be that gfx1032 is not supported by AMD's ROCm packages, which has a laughably short list of supported hardware: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h...

The normal workaround is to assign the closest supported architecture, e.g. gfx1030, so `HSA_OVERRIDE_GFX_VERSION=10.3.0` might help.
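
The mapping from gfx target to override value follows the digits of the name: gfx1032 is architecture 10.3, stepping 2, and the override pins it to the supported base chip's stepping 0. A rough sketch for RDNA-style names (not an official AMD tool, and it doesn't handle older 3-digit targets like gfx906):

```shell
# Derive an HSA_OVERRIDE_GFX_VERSION value from a 4-digit gfx target,
# e.g. gfx1032 -> 10.3.0 (stepping forced to 0, the base chip).
gfx="gfx1032"
ver="${gfx#gfx}"                          # -> 1032
major=$(printf '%s' "$ver" | cut -c1-2)   # -> 10
minor=$(printf '%s' "$ver" | cut -c3)     # -> 3
override="${major}.${minor}.0"
echo "HSA_OVERRIDE_GFX_VERSION=$override"
# Then run the workload with it set, e.g.:
# HSA_OVERRIDE_GFX_VERSION=10.3.0 python your_script.py
```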

Also, it looks like some of your tested projects are OpenCL? For me, I do something like: `yay -S rocm-hip-sdk rocm-ml-sdk rocm-opencl-sdk` to cover all the bases.

My recent interest has been LLMs and this is my general step by step for those (llama.cpp, exllama) for those interested: https://llm-tracker.info/books/howto-guides/page/amd-gpus

I didn't port the docs back in, but also here's a step-by-step w/ my adventures getting TVM/MLC working w/ an APU: https://github.com/mlc-ai/mlc-llm/issues/787

From my experience, ROCm is improving, but there's a good reason that Nvidia has 90% market share even at big price premiums.

EDIT: apparently Darktable and Blender have OpenCL issues that are fixed in the just released 5.7: https://github.com/ROCm-Developer-Tools/clr/issues/3


I can totally understand your frustrations, considering the rocm-arch team/community has been seeing these (and trying to fix them) for years now.

I urge you to post any problems you face on the discussions page [1] for the rocm-arch community. Just to get more visibility and to add to the corpus for others to see (or even just to complain and have a voice heard, lol).

[1] https://github.com/orgs/rocm-arch/discussions

So for integrating ROCm support into packages: typically this is done by specifying ROCm as a build flag. Thus, even if the project supports ROCm, a package that hasn't been built for ROCm targets won't work on ROCm platforms.

For blender and python-pytorch, contributions were made to the Arch Linux build recipes so that they have rocm support, I'm not sure about darktable. For python-torchvision, see [2] to use a rocm build of it. Maybe that helps?

[2] https://aur.archlinux.org/packages/python-torchvision-rocm
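
As a sketch of what that build flag looks like in practice, a ROCm-enabled recipe usually just toggles a CMake option and lists the gfx targets to compile for. The flag names below (USE_ROCM, AMDGPU_TARGETS) are illustrative; they vary per project:

```shell
# Hypothetical PKGBUILD build() excerpt: ROCm support is a compile-time
# switch, so the package only works on ROCm if it was built with the flag
# and for your gfx target.
build() {
  cmake -B build -S "$pkgname-$pkgver" \
    -DCMAKE_BUILD_TYPE=Release \
    -DUSE_ROCM=ON \
    -DAMDGPU_TARGETS="gfx1030;gfx1100"
  cmake --build build
}
```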

Edit: this doesn't seem to be the case for darktable. Maybe wait for rocm 5.7? idk [3].

[3] https://github.com/ROCm-Developer-Tools/clr/issues/3#issueco...

Feel free to request rocm builds of packages on https://github.com/orgs/rocm-arch/discussions.

Others have discussed other issues, such as gfx1032 not being officially supported, and the fact that we are packaging the source from AMD's repos, so the experience may not be different than on other platforms. I will say, though, that just having an independent team aside from AMD to build and ship ROCm is definitely great for the ROCm community. Getting the product out to a wider audience generates more real-world feedback to provide back to the ROCm project and make it better. The rocm-arch folks have made several upstream contributions to ROCm.

Definitely excited about the progress of the Debian team, and we've been keeping an eye on each other's progress. https://github.com/orgs/rocm-arch/discussions/674


I could get hashcat to work with poor performance but then the computer was unusable.


It's absolutely mindboggling to me that AMD is still struggling so badly on this.

There is an absolutely enormous market for AMD GPUs for this, but they seem to be completely stuck on how to build a developer ecosystem.

Why isn't AMD throwing as many developers as possible at submitting PRs to add ROCm support to the open source LLM efforts, for example?

It would give AMD real world insights to the problems with their drivers and SDKs as well, which are incredibly numerous.

People would be willing to overlook a huge amount of jank for cheap(er) cards with large VRAM configurations. I don't think they even need to be particularly fast, just have the VRAM needed, which I'm sure AMD could put specialist cards together for.


Historically they believed that "the community" would address broader ML software support. I think the idea was they could assign dedicated engineers for bigger customers and together that was a sort of Pareto-goodish solution given their constraints as a company. Even in retrospect I'm not sure if that was a good call or not.


I mean, they would be right if all their cards, both consumer and enterprises, supported the same programming interface.

You cannot trust the community to do the work for you and then only make the software available for $Xk cards.


That's not necessary or sufficient. Going back to 2017 or so when I was working in the area their OpenCL support was good enough, the missing parts were an equivalent to cuDNN and upstreamed support in TensorFlow etc. That work does not subdivide in a way amenable to being a community effort and it's way too big for a hobby project. Today the technical landscape is different but from what I can tell the basic problems are the same.


In 2017, when I got Vega, OpenCL didn't work yet.

Today, in 2023, Vega is already not supported.

Meanwhile, during this period, ROCm was unbuildable by mere mortal distribution maintainers; you either used the binaries thrown over the wall by AMD (for specific versions of RHEL/CentOS, SUSE, and Ubuntu only), or didn't run anything at all.

Also in 2023, Fedora managed to package basic ROCm packages for Fedora 38; I can finally run darktable and blender (but it is crashing!) on Vega. Woohoo!


We didn't use ROCm, the non-ROCm OpenCL path worked fine for us on Polaris and Vega. None of this is a major reason AMD cards are vastly inferior for ML dev and research purposes. At the time they made a decision not to invest heavily in ML framework and workflow support, and so they never had a product really usable for those applications.

I find it annoying they didn't do more but I'm not sure they were wrong. AMD managed to tread water in GPUs, integrate Xilinx, and go from meh to a very strong position on CPUs, and all of that with a relatively small company.


Support for Gentoo existed for a long time in https://github.com/justxi/rocm before being merged in the main Portage tree.


By existed you mean maybe built and sometimes worked.

That's not support. That's throwing things over the wall. More importantly, even with Vega it had numerous crashes. Anything else, like Polaris or RDNA? Forget about it. Even the AMD docker setups weren't quite good enough at times.


No, I know, but you don't need to pretend it was so hard to build that it might as well have been closed source in order to shit on it.


It is hard to build.

AMD uses cpack deb/rpm generators, and the build process requires random things in the path (some built from git checkouts of other projects). If you want to create a standard deb or rpm build script for building in distro build infra, or inside mock for rpm-based distributions, the cmake build actively takes steps to make it as difficult as possible.

There used to be a talk titled something like "How to make distribution maintainers hate you" (I cannot find a link to it now); it seems that ROCm developers have seen it, took it to their hearts, and then wrote several new chapters themselves.

That's building. Do not even think about testing.

There's a reason why it took distributions years to package it (still not done completely). The upstream project was like a student project, thrown over the wall once the students stopped working on it, including the build "system".


>People would be willing to overlook a huge amount of jank for cheap(er) cards with large VRAM configurations.

The older I get, the more intolerable I find jank to be because my time only keeps becoming ever more valuable.


Intel has managed to get their drivers onto Ubuntu 23.04 with no additional packages needing to be installed for their Arc dGPU offerings.


Now if only they would offer some bigger Arc GPUs...

I would have picked up a 32GB+ Arc over my 3090 in a heartbeat. Maybe even a 24GB card.


16 GB is a really nice offering at that price point for AI workloads. I'm keeping my fingers crossed for a higher end Battlemage offering and some real competition for Nvidia.


They also lag on the data-plane side, do they not? AFAIR Nvidia bought the main (remaining?) InfiniBand supplier and seamlessly integrated it with all their data center offerings. Cue Jensen Huang: "the data center is the computer"?


They only care about selling data center cards for GPGPU.

The thing is, why would anyone buy them if CUDA just works?


Relevant: we deployed Lamini on hundreds of MI200 GPUs.

Lisa tweet: https://x.com/LisaSu/status/1706707561809105331?s=20

Lamini tweet: https://x.com/realSharonZhou/status/1706701693684154766?s=20

Blog: https://www.lamini.ai/blog/lamini-amd-paving-the-road-to-gpu...

Register: https://www.theregister.com/2023/09/26/amd_instinct_ai_lamin... CRN: https://www.crn.com/news/components-peripherals/llm-startup-...

The hard part about using any AI Chips other than NVIDIA has been software. ROCm is finally at the point where it can train and deploy LLMs like Llama 2 in production.

If you want to try this out, one big issue is that software support is hugely different on Instinct vs Radeon. I think AMD will fix this eventually, but today you need to use Instinct.

We will post more information explaining how this works in the next few weeks.

The middle section of the blog post above includes some details including GEMM/memcpy performance, and some of the software layers that we needed to write to run on AMD.


It's nice to hear that there are actual results to show, since AMD execs simply saying that ROCm is a priority isn't really convincing anymore given their track record on claims regarding support on the consumer side.


The difference this time is that the executive is from Xilinx. Xilinx has had an AI software development team for a while in the FPGA space.

AMD has had poor management in the GPU computing space since Raja Koduri's time (he put the best engineering resources on VR during his tenure and ignored deep learning). Subsequent directors have not had a long term vision and left within a few years.

Looks like Lisa Su has corrected this now - they seem to have moved AMD software engineers en masse to work under Xilinx management on AI. Remains to be seen if this new management hierarchy will have a better vision and customer focus.


> If you want to try this out, one big issue is that software support is hugely different on Instinct vs Radeon. I think AMD will fix this eventually, but today you need to use Instinct.

I'm really really worried about AMD, and whether they're going to care about anyone else. They might just care about Instinct, where margins are so high, and ignore consumer cards or making more friction and segmentation for consumer cards.

Part of what made CUDA so successful was that the low hardware barrier to entry created such a popular offering. Everyone used it. I really hope AMD realizes that, and really hope AMD invests in consumer card software too. Just making it work on the high end doesn't seem enough to get the kind of mass-movement ecosystem success AMD really needs. I'm afraid they might go for a smaller win, try to compete only at the top.


I completely agree. I wasted a lot of time just assuming that ROCm would work on Radeon cards, the way CUDA works on consumer GeForce cards.


I would really hope you could get decent utilization on ops as fundamental as GEMM/memcpy on a single device. Translating that to MFU is a completely different story.


We get good utilization at scale as well. Typically 30-40% of peak at the full application level for training and inference.

Perf isn't the biggest problem though, many AI chips can do this or a bit better on benchmarks, if you invest the engineering time to tune the benchmark.

The really hard part is getting a complete software stack running.

It took us over 3 years because many of the layers just didn't exist, e.g. scale out LLM inference service that supports multiple requests with fine-grained batching across models distributed over multiple GPUs.

On Instinct, ROCm gets you the ability to run most pytorch models on one GPU assuming you get the right drivers, compilers, framework builds, etc.

That's a good start, but you need more to serve a real application.
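
The "fine-grained batching" layer mentioned above can be sketched in miniature: requests join and leave the GPU batch between decode steps, instead of the whole batch waiting for its slowest member. This is an illustrative toy, not Lamini's actual code:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy fine-grained batching scheduler.
    requests: list of (request_id, tokens_to_generate).
    New requests are admitted into free batch slots between decode steps.
    Returns the order in which requests complete."""
    queue, active, done = deque(requests), [], []
    while queue or active:
        while queue and len(active) < max_batch:   # fill free slots mid-flight
            rid, remaining = queue.popleft()
            active.append([rid, remaining])
        for req in active:                         # one decode step per request
            req[1] -= 1
        for req in [r for r in active if r[1] == 0]:
            active.remove(req)                     # slot frees up immediately
            done.append(req[0])
    return done

print(continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]))
# -> ['c', 'a', 'd', 'e', 'b']: short requests finish without waiting for long ones
```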


People have been using their GPGPUs for decades on a variety of scientific applications, and there are all kinds of hybrid and multi-device frameworks that exist (often supporting multiple backends).

The difference is that it didn't get a lot of love as part of the overhyped python LLM movement.


Completely agree, I'd love to see some of the innovations from HPC move over into their LLM stack.

We are working on it, but it takes time.

Contributions to foundational layers like ROCBlas, pytorch, slurm, Tensile, huggingface, etc would help.


What's the cost benefit vs. Nvidia? Is it cheaper?


The classic economic benefits of competition:

* Drives down price

* Enhances product features (I see them competing on VRAM first)

* Helps insulate buyers from supply issues

Nvidia has kneecapped their consumer grade hardware to ensure the gaming market still has scraps to buy in spite of crypto mining and the AI gold rush. All AMD would have to do to eat into Nvidia marketshare is remove the hardware locks in low-end cards and ship one with 64GB+ of VRAM.

This of course would only work if they have comparable/usable software support. Any improvements to ROCm will be a boon for any company that doesn't already have or can't afford huge farms of high-end Nvidia chips.


Available in orders of up to 10,000 GPUs today - no shortage

More than 10x cheaper than allocating machines on a tier 1 cloud - AWS, Azure, GCP, Oracle, etc

More memory - 128GB HBM per GPU - means bigger models fit for training/inference without the nightmare of model parallelism over MPI/infiniband/etc

Longer term - finetuning optimizations


Ah! The memory sounds interesting. How would that compare to similar Nvidia hardware w.r.t cost assuming the hardware was available?

Does AMD provide something similar to nvlink, and even libraries like cudnn?

Also, last I checked none of the public clouds offered any of the latest gens MI GPUs, so I wasn't aware that it had good availability! Azure had a preview but I'll look more into it now.

Thank you for your answer btw!


Yeah getting around the no public cloud thing was really annoying. We had to build our own datacenter.

On the plus side, it was drastically cheaper and now we can just slot in machines.

I would prefer that a tier 1 cloud made MI GPUs available though. It would make it so much more accessible.


Are you releasing your software stack to the public?


It's available now with an enterprise license, because we validate that it is running correctly on a system we help configure.

We will open source pieces of it over time. Our strategy is open source functional core. Eventually we will have an open source dev environment that runs on a personal scale computer. We already have this for some configurations, but we don't do enough testing to ensure perf/functionality on many different systems.

We are mainly bottlenecked by resources as a 12-person startup.

We have released some open source SDKs here:

https://github.com/orgs/lamini-ai

This class has some training recipe code:

https://www.deeplearning.ai/short-courses/finetuning-large-l...

One thing I'd like to push back to open source is the scale out AMD SLURM support.


See the memory size comparison (GB) in this table: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...


It blows my mind that A100 and H100 are each safely below 1000W power draw.


You simply cannot buy nvidia GPUs at scale at the moment. We're getting quotes that are many months out, sometimes even a year+ out.


We kept hearing 52 weeks for new shipments.


Oh man, this is exactly what I want to see on HN frontpage!

I commented on another article, about an AMD chip that had no OpenCL support, that this made it dead in the water for me, and was downvoted; surely everyone understands how important CUDA is, and everyone should understand how important open standards are (e.g. FreeSync vs Nvidia's G-Sync), so I can't understand why more people don't share my zeal for OpenCL.

I've shipped two commercial products based on it which still work perfectly today on all 3 desktop platforms from all GPU vendors... what's not to love?


For a long time, AMD promoted OpenCL as viable without it actually being viable. This leaves scars and resentment. Mine come from about 10 years ago. They run deep.

I'm glad to hear your experience was better, but I'm fresh out of trust. This time, I need to see major projects in my application areas working on AMD before I buy, because AMD has taught me that "trust us" and "just around the corner" can mean "10 years later and it still hasn't happened." I'm pretty sure that this time is different, but the green tax is dirt cheap compared to learning this lesson the hard way, so I'm letting others jump first this time.


> I've shipped two commercial products based on it which still works perfectly today on all 3 desktop platforms from all GPU vendors... what's not to love?

In my experience, if commercial products involved any sort of hand-optimized, proprietary OpenCL, one would be shocked by the lack of documentation and zero consistency across AMD's GPUs. Intel has SPIRV and Nvidia has PTX and this works pretty well. But some AMD cards support SPIR or SPIRV, and some don't and this support matrix keeps changing over time without a single source of truth.

Throw in random segfaults inside AMD's OpenCL implementation and you have a fun day debugging!

Dockerizing OpenCL on AMD is another nightmare I don't want to get into. Intel is literally installing the compute runtime and mapping `/dev/dri` inside the container. On paper, AMD has the same process but in reality I had to run `LD_DEBUG=binding` so many times just to figure out why AMD runtime breaks inside docker.
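
For reference, the device pass-through that usually has to happen for ROCm in a container looks roughly like this (based on AMD's general Docker guidance; the exact groups and image are assumptions, not a verified recipe):

```shell
# Rough sketch: expose the kernel compute node (/dev/kfd) and render nodes
# (/dev/dri) to the container, and join the groups that own those devices.
docker run --rm -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add render \
  rocm/rocm-terminal \
  /opt/rocm/bin/clinfo
```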

There may be great upsides to AMD's hardware in other domains though


OpenCL isn't very useful now that we have Vulkan. Its biggest advantage is that there exist C++ compilers for its kernels. But AMD's OpenCL runtime inserts excessive memory barriers not required by the spec (they won't fix this due to Hyrum's Law) and Vulkan gives you more control over the memory allocation and synchronization anyways. If we had better Vulkan shader compilers, OpenCL would serve basically no purpose, at least for AMD hardware.


It's not that they're supporting buggy code; they just downgraded the quality of their implementation significantly. They made the compiler a lot worse when they switched to ROCm.

https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/iss... is the tracking issue for it, filed a year ago, which appears to be wontfix, largely because it's a lot of work.

OpenCL still unfortunately supports quite a few things that vulkan doesn't, which makes swapping away very difficult for some use cases


Yeah, that's a big if. In theory there's nothing preventing good compilation to Vulkan compute shaders, in practice people just aren't doing it, as CUDA actually works today.

I also agree that Vulkan is more promising than OpenCL. With recent extensions, it has real pointers (buffer device address), cooperative matrix multiplication (also known as tensor cores or WMMA), scalar types other than 32 bits, proper barrier (including device-scoped, needed for single pass scan), and other important features.


AI libs could use it and we'd break the bonds of CUDA. Also, Rust might get an implementation, which would give it an opening to overtake C++.


No it wouldn't, until it provides the same polyglot support and graphical tooling as CUDA.

At least Intel is trying with oneAPI into that direction.


> I can't understand why more people don't share my zeal for OpenCL.

When I last worked with it, it was difficult, unstable, and performed poorly. CUDA, on the other hand, has been nothing but good (at least). Well, nvidia pricing aside ;)

OpenCL might be a lot better now, but for a lot of us, we remember when it was actively a bad choice.


But is this just more BS from AMD?

https://www.bit-tech.net/reviews/tech/cpus/amd-betting-every... AMD Betting Everything on OpenCL (2011)


I'm pretty sure the NVDA pump finally convinced the AMD board / C-Suite to prioritize this, but it takes time to steer a big ship. I'm hopeful, but there are still bad incentives to jump the gun on announcements so I'll let others take the plunge first.


If they can make a 288 GB $4.4-6.8k prosumer, home-computer-friendly graphics card, I will be extremely happy. Might be a pipe dream (today at least, lol, and standard in like...what, 5 years?), but if they can pull that off, then I think things would really change a lot.

I don't care if it's slow, bottom-of-the-barrel GDDR6, or whatever; just being able to enter the high-end model finetuning & training regime for ML models on a budget _without_ dilly-dallying with multiple graphics cards (a monstrous pain-in-the-neck from a software, engineering, & experimentation perspective) would enable so much large-scale development work to happen.

The compute is extremely important, and in most day-to-day use cases, the memory bandwidth even more so, but boy oh boy would I love to enter the world offered by a large unified card architecture.

(Basically, in my experience, parallelizing a model across multiple GPUs is like compiling from code to a binary -- technically you can 'edit' it, but it's like directly hex editing strings in a binary blob, extremely limited. Hence why I try to stick with models that take only a few seconds (minutes at most) to train on highly-representative tasks, distill first principles, and then expand and exploit that to other modalities from there).


ROCm and AMD drive me nuts. The lack of support for consumer cards and the hassle of getting basic things in PyTorch to just work was too much.

I was burned by support that never came for my 6800xt. Recently went back to NVIDIA with a 4070 for pytorch.

I hope amd gets their act together with rocm but I'm not going to buy an AMD GPU until they do fix it rather than just vaguely promise to add support some day ...


Exactly. I recently started a NN side project. The process for setting up PyTorch was to run `pacman -S cuda` and `pip install torch`. I was using a GTX 1060. If it was a project with a bigger budget, I could have rented servers from AWS with all the software preinstalled in no time. I don't even know if it would have been possible for me to do it with AMD, even if I owned an AMD graphics card.

People like me are small potatoes to AMD, but surely it's hard to make significant inroads when it's impossible for anyone to learn or do small projects on ROCm, and big projects can't rely on ROCm just working.


People like you are small potatoes until you have some measure of success and then suddenly you're burning up GPU hours by the truckload and whatever you're used to you will continue using.


I'm building a major open source stack on top of NVidia because of how bad my experience with AMD was.

- I bought a ROCm-supported card. Said so on the box. Paid out-of-pocket. An NVidia vendor had sent me a free card, for comparison.

- It never worked well, and a bit more than a month after I bought it, AMD dropped support. Money down the drain.

- AMD itself was a black hole for any sort of contact or support.

I'm pretty sure this was a legal violation, as the card wasn't fit for the advertised purpose, but no one took responsibility, and small claims isn't worth it.

I'm very supportive of open, but there's enough wrong at AMD that I'm not hitching myself to that wagon, probably ever.


Depending what country you're in, small claims might be surprisingly straightforward. I filed a claim in the UK a couple of years back and while the webapp was very early-2000s it all worked perfectly and didn't take much work.


The "senior VP of the AI group at AMD" said at an "AI Hardware Summit" that "my area is AMD's No. 1 priority".

Tell me when the rest of the company aligns with you and has started to show any results in providing a good experience for people to do machine learning with AMD. As it stands right now, there is so much tooling missing, and the tooling that's there is severely lacking.

But I have faith. They've reinvented themselves with CPUs, multiple times, so why not with GPUs, again?


> Tell me when the rest of the company aligns with you

More or less the same message has been promulgated[1][2] by no less than Lisa Su[3], FWIW.

[1]: https://www.phoronix.com/news/Lisa-Su-ROCm-Commitment

[2]: https://www.forbes.com/sites/iainmartin/2023/05/31/lisa-su-s...

[3]: https://en.wikipedia.org/wiki/Lisa_Su


If this turns around it will be amazing, but ROCm isn't the only issue. The entire driver stack is important. If they came out with virtualization support for their GPUs (even if everyone paid a 10% perf hit), they'd take over the cheap hosted GPU space, which is a huge market.


Getting proper (and official) ROCm support across their consumer GPU line will be big as well. Hobbyists aren't buying MI300s and their ilk. And surely AMD is better off if a would-be hobbyist (or low-budget academic/industrial researcher) chooses a Radeon card over something from NVIDIA!

I'm about to buy a high-end Radeon card myself, gambling that AMD is serious about this and will get it right, and that it won't be a wasted purchase. So yeah, if I seem like an AMD fan-boy (I am, somewhat) at least I'm putting my money where my mouth is. :-)

> AMD’s software stacks for each class of product are separate: ROCm (short for Radeon Open Compute platform) targets its Instinct data center GPU lines (and, soon, its Radeon consumer GPUs)

They've been saying this for a while, and I'm encouraged by reports that people "out there" in the wild have actually gotten this to work with some cards, even in advance of the official support shipping. So here's hoping they are really serious about this point and make this real.


Yeah, don't. Buy an Nvidia and get shit done.


For some people, it's not just about getting results or "getting shit done" but about the journey and learning on the way there. Also, AMD's approach to openness tends to be a bit better than NVIDIA's, so there's that too. And since we're on Hacker News after all, an AMD GPU for the hacker betting on the future seems pretty fitting.


For someone using Linux, an AMD card may be even better suited for 'getting things done'

Wayland and many things outside of GPGPU are much better; e.g., power control/gating/monitoring are all available over sysfs. You can over/underclock a fleet of systems with traditional config management.

GPGPU surely deserves some weight given the context of the thread, but let's not ignore the warts Nvidia shows elsewhere.
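For the curious, those sysfs knobs are plain files under `/sys/class/drm/cardN/device/`. A minimal sketch of reading one; the node name below is from memory of the amdgpu driver docs, so verify it on your kernel:

```python
from pathlib import Path

def read_sysfs(path):
    """Return the stripped contents of a sysfs node, or None if absent/unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

# amdgpu's DPM control node reads/accepts values like
# "auto", "low", "high", "manual" (writing requires root).
level = read_sysfs("/sys/class/drm/card0/device/power_dpm_force_performance_level")
print("DPM level:", level)  # None on systems without an amdgpu card
```

Because these are ordinary files, the same pattern scales to fleet-wide config management with any tool that can write files as root.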


> For someone using Linux, an AMD card may be even better suited for 'getting things done'

It seems like that on paper, but in practice I've been getting constant GPU crashes and freezes on both my personal and work PCs. No one seems to know what this is about, and it may be multiple issues, but it's been like this for a long time now.

https://gitlab.freedesktop.org/drm/amd/-/issues/1974#note_21...


I'm sorry to hear about the troubles you've seen. I did hedge slightly with 'may' :p

I've had the exact opposite experience; from way back since the 4870 series was common to now with RX6000, AMD has been great for me with Linux. More systems than I can really count, Intel/AMD have been great - while Nvidia, not so much.

Most recently I've not used the 'auto' method of DPM (mentioned in that issue).

I've deliberately set this to 'manual' since at least picking up RX6000 for undervolting/overclocking. Perhaps this is part of why I've been so pleased.

I'm curious on the software levels you run - what distributions do you tend to prefer?


Agreed, AMD and Intel are much easier to rely on. I’ve never had it nicer on Linux than I do now with a primary AMD GPU and a secondary NVIDIA that I can use for games or CUDA, or pass to a VM.

It feels great finally having bleeding edge kernels and Wayland compositors, with the guarantee of a Linux or Windows VM’s stable driver if something breaks for the NVIDIA blob, and my desktop stays operational regardless.


That setup is really nice, I miss doing VFIO. The demarcation point is truly a delight, and with hugepages/CPU pinning, the performance cost is negligible.


In principle I'm all for openness, but it doesn't mean anything if the thing doesn't work. I just haven't found AMD drivers to be reliable enough to use, on any platform, whereas with NVidia I install the proprietary drivers and then it just works, on both Linux and FreeBSD.


That's a shame. Do you tend towards the mobile side, by chance?

The vast majority of my experience has been with discrete (desktop) cards and very new kernels/mesa. It's been great, here - on a number of hardware configs.


Mostly laptops, but generally the chunky "gaming" kind with discrete GPUs, so IDK.


Ah, yea those 'dual GPU' systems have been truly awful for me; discrete + integrated.

I gave Linux/the ecosystem at large a chance with a couple of those and was generally disappointed.

No good way to be sure which card was used... the control mechanism was a bunch of glue/tape.


Nvidia is still much more reliable than Radeon on Linux.


That hasn't been my experience, but like with choices - experiences vary. In my case... this has mostly been with desktop/discrete GPUs.

I've been burned by enough laptops with mobile cards that I just stick with integrated; Linux does/did so poorly with Optimus or whatever dual high/low power GPU tech that I never bought another.

I'm a little doubtful, largely because AMD contributes to the kernel/mesa far more than Nvidia. There's no Linux monolith to support this; not all distributions are equally current.

I've had discrete cards from all of the major vendors for the last few generations for VFIO testing on Linux on mainline kernels.

Intel/AMD have generally been more reliable (for me) and quicker to adopt standards.

If you run an LTS or something with generally older software, Nvidia is probably fine and dandy.

It's a regular routine to have to wait for them to support new kernels. Yes, I know about DKMS, no it isn't always sufficient.


AMD's debuggers and profilers let you disassemble kernel/shader machine code and introspect registers and instruction latency. That's something at least that Nvidia doesn't do with Nsight tools.


I get where you're coming from, and in fact I am planning to also build an NVIDIA based ML box as well. But I pointedly want to support AMD here for a variety of reasons, including an ideological bias towards Open Source Software, and a historical affinity for AMD that dates back to the mid 90's.


Oh, if you can afford it, of course, go for it. I was just afraid you'd spend money on a high-end card and then be disappointed.


Having come from Nvidia before recently switching to AMD, this is a naive take on it. Their compute software might be better but their Linux driver is abysmal to manage and takes the fun out of owning a PC. Never again. I'd take AMD over them even if the card burned my house down each time I used it.


A bit harsh but I agree in that I only believe it when I see it. Have been burned by empty promises by AMD before.


Easier said than done, at least for H100.


They're talking about consumer cards, which is the point. You can learn CUDA off any consumer nvidia card and have it translate to the fancier gear, that's part of why nvidia has so much mindshare.

Eg I can write my cuda code with my 3090s, my boss can test it on his laptop's discrete graphics, and then after that we can take the time to bring it to our V100s and A100s and nothing really has to change.


Apologies for the snark, but maybe it's better that so far AMD has had terrible consumer card support. What little hardware they have targeted seems to be barely stable & barely work for the very limited workloads that are supported. If regular consumers were told their GPUs would work for GPGPU, they might be rotten pissed when they found out what the real state of affairs is.

But if AMD really wants a market impact - which is what this submission is about - getting good support across a decent range of consumer GPUs is absolutely required. They cannot win this ecosystem battle with only datacenter mindshare.


Good luck man! Its your money to waste.


Virtualization is such a key ability. I really really lament that it's been tucked away, in a couple specific products (The last MxGPU is, what, half a decade old? More? Oh I guess they finally spun off a new one, an RDNA2 V620!).

I keep close & cherish a small hope that for some use-cases we might get a soft virtualization-alike that just works. I don't know enough to say how likely this is to adequately work, but in automotive & some other places there are nested Waylands, designed to share hardware. You still need a shared OS layer, a shared kernel, and a compositor that manages all the subdesktops - this isn't full virtualization - but hypothetically you get something very similar to virtualized/VDI gpus, if you can handle the constraints.

This is really a huge huge huge shift that Wayland has potentially enabled, by actually using kernel resources like DMA-BUFs and what not, where apps can just allocate whatever & pass the compositor filehandles to the bufs. Wayland is ground up, unlike X's top down. So it's just a matter of writing compositors smart enough to push what data from whom needs to get rendered and sent out where.

I would love to know more about what hardware virtualization really buys, know more about the limitations of what VDI is possible in software. But my hope is, in not too long, there's good enough VDI infrastructure that it's basically moot whether a gpu has hardware support. There will be some use cases where yes every users needs to run their own kernel & OS, and that won't be supported (albeit virtio might workaround even that quite effectively), but for 95% of use cases the more modern software stack might make this a non-issue. And at that point, these companies might stop having such expensive-ass product segmentation, charging 3x as much to have a couple hardware virtual devices, since in fact it costs them essentially nothing & the software virtualization is so competitive.


I've concluded they're just allergic to money.

Even after it became very clear that this is going to be big they're still slow off the block as if they're not even trying.

e.g. Why not make a list of the top 500 people in AI field and send them cards no strings attached plus as good of low level documentation as you can muster. Insignificant cost to AMD but could move the mindshare needle if even 20 of the 500 experiment and make some noise about it in their circles.

The Icewhale guys did exactly that best as I can tell. 350k USD hardware kickstarter so really lean. Yet all the youtubers even vaguely in their niche seem to have one of their boards. It's a good board don't get me wrong, but there is no way that was organic. Some sharp marketeer made sure the right people have the gear to influence mindshare.

https://www.youtube.com/results?search_query=zimaboard


I suspect it's because they don't want to pay for software engineers as hardware engineers are much cheaper. I was contacted by their recruiter last year and it turned out the principal engineer salary was at the level of entry FAANG salary, so I suspect they can't really source the best people.


How much was the salary for principal? Because I know it can be 400k TC, and I'm not sure entry-level FAANG is at that level.


My suspicion is that the GPGPU hardware in shipped cards has known problems / severe limitations due to neglect of that side of the architecture for the last ~10 years. Shipping a bunch of cards only to burn the next generation of AMD compute fans as badly as they burned the last generation of AMD compute fans would not be wise. It's painful to wait, but it may well be for the best.


The Radeon MI series seems to perform fine if you follow their software stack happy path. Same for using modified versions of ROCm on APUs, it's just no one has been willing to invest in paying a few developers to work on broader hardware support full-time, thus any bugs outside enterprise Linux distros on Radeon MI series cards do not get triaged.


Instinct has much better SW support today than Radeon, so you would need to send MI210s, etc.

I think it's at the point where if you are comfortable with GEMM kernels, setting up SLURM, etc it is usable. But if you want to stay at the huggingface layer or higher, you will run into issues.

Many AI researchers are higher level than that these days, but some of us are still willing to go lower level.


ROCm on Vega only works on certain motherboards because the card lacks a synchronization clock over the PCI bus. They added it on some later cards. It’s absurd how much is lacking and inconsistent.


Yeah, this. I tried to do some computing with AMD server-grade cards 2 years ago and found all of the API so out of date and the documentation equally out of date... Went CUDA and didn't look back. Sad, 'cause I'm an AMD fanboy of old.


It seems like Hotz and co are able to move pretty well on it, so maybe there's some low-level stuff they're using (or maybe they're forced to for a few reasons) w.r.t. the tinybox, but it is impressive how much they've been able to do so far I think. :3 <3 :')))) :')


> e.g. Why not...

A key part of progress is choosing the direction to progress in. Flashy knee-jerk moves like that sound good, but they aren't the fastest way to move forward. The first step (which I think they've taken) is for the executives to align on what the market wants. The second is to work out how to achieve it, the third to do it. Handing out freebies would probably help, but it'll take a sustained long-term strategy for AMD to make money.

AMD's problem isn't low-level developer interest. The George Hotz video rant on AMD was enlightening - the interest is there and the official drivers just don't work. A few years ago I made an effort to get into reinforcement learning as a hobby and was blocked by AMD crashes. At the time I assumed I'd done something wrong. I still believe that, but I'm less certain now. It is possible that the reason AMD is doing so poorly is just that their BLAS code is buggy.

People get very excited about CUDA and maybe everything there is necessary, but on AMD the problem seems to be that the card can't reliably multiply matrices together. I got some early nights using Stable Diffusion because everything worked great for an hour and then the kernel panicked. I didn't give AMD any feedback because I run an unsupported card and OS - effectively all cards and OSs are unsupported - but if that is widespread behaviour it would be a grave blocker.

I think they are serious now, though. The ROCm documentation dropped a lot of infuriating corporate waffle recently, and that is a sign that good people are involved. Still going to wait and see before getting too hopeful that it works out well.


> Flashy knee-jerk moves like that sound good, but they aren't the fastest way to move forward.

NVidia:

- Games -> we're on it

- Machine learning -> we're on it

- Crypto -> we're on it

- LLM / AI -> we're on it

Compare the growth rate of NVidia vs AMD and you get the picture. Flashy knee-jerk moves are bad, identifying growth segments in your industry and running with them is excellent strategy.

People get excited about CUDA because it works, and AMD could have had a very large slice of that pie.

> on AMD the problem seems to be that the card can't reliably multiply matrices together. I got some early nights using Stable Diffusion because everything worked great for an hour then the kernel panicked. I didn't give AMD any feedback because I run an unsupported card and OS - effectively all cards and OSs are unsupported - but if that is widespread behaviour[sic] it would be a grave blocker.

Exactly. And with NVIDIA you'd be working on your problem instead. And that's what makes the difference. AMD should do exactly what the OP wrote: gain mindshare by getting at least some researchers on board with their product, assuming they haven't burned their brand completely by now.


NVIDIA is focused on graphic cards. AMD has the tough CPU market to worry about.


That's AMD's problem to solve; they made that choice.

NV doesn't have to worry about resource allocation, branding, etc. AMD could copy that by spinning out its GPU division. Note that 'graphics cards' is no longer a proper identifier either; they just happen to have display connectors on them (and not even all of them). They're more like co-processors that you may also use to generate graphics. But I'm not even sure if that's the bulk of the applications.


Never half ass two things when you can whole ass one thing.


ROCm makes me sad, as it reminds me of how much better GPUs could be than they are today.

I've lately been exploring the idea of a "Good Parallel Computer," which combines most of the agility of a CPU with the efficient parallel throughput of a GPU. The central concept is that the decision to launch a workgroup is made by a programmable controller, rather than just being a cube of (x, y, z) or downstream of triangles. A particular workload it would likely excel at is sparse matrix multiplication, including multiple quantization levels like SpQR[1]. I'm hopeful that it could be an advance in execution model, but also a simplification, as I believe a lot of the complexity of the current GPU model is because of lots of workarounds for the weak execution model.

I'm not optimistic about this being built any time soon, as it requires rethinking the software stack. But it's fun to think about. I might blog about it at some point, but I'm also interested in connecting with people who have been thinking along similar lines.

[1]: https://arxiv.org/abs/2306.03078
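To make the execution-model idea above concrete, here's a toy CPU model of the "programmable controller": workgroup launches are driven by a queue of ready items rather than a fixed (x, y, z) grid, and a stage may enqueue further work as its outputs become available. This is purely illustrative (Python threads standing in for workgroups), not a hardware proposal:

```python
import queue
import threading

def run_controller(initial_items, num_workers=4):
    """Toy 'programmable controller': items are (stage_fn, payload) workgroup
    descriptors. A stage may emit more descriptors (data-driven launches)
    or a terminal (None, result)."""
    q = queue.Queue()
    results, lock = [], threading.Lock()
    for item in initial_items:
        q.put(item)

    def worker():  # one worker ~ one workgroup slot on the machine
        while True:
            item = q.get()
            if item is None:
                break
            stage_fn, payload = item
            for nxt in stage_fn(payload):
                if nxt[0] is None:
                    with lock:
                        results.append(nxt[1])
                else:
                    q.put(nxt)  # launch decided by data, not a fixed grid
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    q.join()            # wait for all work, including follow-on launches
    for _ in threads:
        q.put(None)     # shut the workers down
    for t in threads:
        t.join()
    return results

# Two-stage pipeline connected by the queue: double, then increment.
def add_one(x):
    return [(None, x + 1)]

def double(x):
    return [(add_one, x * 2)]

print(sorted(run_controller([(double, i) for i in range(4)])))  # → [1, 3, 5, 7]
```

The point of the sketch is that `double` decides, per item, what gets launched next; a sparse matrix multiply would similarly enqueue only the workgroups whose nonzero blocks exist.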


A workgroup/kernel can launch other ones without talking to the host. Like cuda's dynamic thing except with no nested lifetime restrictions. This is somewhat documented under the name HSA.

Involves getting a pointer to a HSA queue and writing a dispatch packet to it. Same interface the host has for launching kernels - easier in some ways (you've got the kernel descriptor as a symbol, not as a name to dlsym) and harder in others (dynamic memory allocation is a pain).
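For reference, the dispatch packet being written there has a fixed 64-byte layout defined by the HSA spec (`hsa_kernel_dispatch_packet_t`). A sketch of that layout from memory - the field names and packet-type value should be checked against the actual spec:

```python
import struct

# Field layout of an HSA AQL kernel dispatch packet, little-endian, no padding.
# Reproduced from memory of hsa_kernel_dispatch_packet_t -- verify against the
# HSA runtime spec before relying on it.
DISPATCH_PACKET = struct.Struct(
    "<"
    "H"   # header (packet type, acquire/release fences, barrier bit)
    "H"   # setup (number of dispatch dimensions)
    "3H"  # workgroup_size_x, _y, _z
    "H"   # reserved0
    "3I"  # grid_size_x, _y, _z
    "I"   # private_segment_size
    "I"   # group_segment_size
    "Q"   # kernel_object (address of the kernel descriptor)
    "Q"   # kernarg_address (pointer to kernel arguments)
    "Q"   # reserved2
    "Q"   # completion_signal handle
)
assert DISPATCH_PACKET.size == 64  # AQL packets are always 64 bytes

# Packing a 256-thread, 4096-item 1D dispatch (assuming type 2 = kernel dispatch;
# the kernel_object/kernarg addresses here are placeholders).
pkt = DISPATCH_PACKET.pack(2, 1, 256, 1, 1, 0, 4096, 1, 1, 0, 0,
                           0xDEAD000, 0xBEEF000, 0, 0)
assert len(pkt) == 64
```

Device-side enqueue is then "just" writing one of these into the queue's ring buffer and bumping the write index, which is why the same interface works from host or GPU.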


Yeah, dynamic memory allocation from GPU space seems to be the real sticking point. I'll look into HSA queues, that looks very interesting, thanks.


That's solved too. But as usual there's elements of DIY. The host runtime can allocate memory that is read/write by the host and by GPUs in atomic operation fashion. If you're on pci-e that means load/store/cas/swap/fetch-add. Mutable shared memory is sufficient for arbitrary exchange of information, e.g. a GPU kernel asking the host to allocate some GPU memory and give it the corresponding pointer.

Implementing robust cross device function calls on that was fairly tough going, but these days you could rip the code with 'rpc' in the file name out of the llvm libc implementation where it underpins the GPU equivalent of syscall.

Non-cuda style programming models on GPUs is a pet interest of mine, feel free to email if you want to talk offline.
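The shared-memory exchange described above amounts to a mailbox protocol. A toy model with two threads - Python events stand in for the atomic flags a real implementation would use over PCIe:

```python
import threading

class Mailbox:
    """Toy GPU->host RPC over shared memory: the 'device' writes a request and
    raises a flag; the host polls, services it, and raises the response flag.
    Real implementations do this with PCIe atomics on fine-grained memory."""
    def __init__(self):
        self.request = None
        self.response = None
        self.req_ready = threading.Event()   # stands in for an atomic flag
        self.resp_ready = threading.Event()

    def device_call(self, op, arg):          # runs in "kernel" code
        self.request = (op, arg)
        self.req_ready.set()
        self.resp_ready.wait()
        self.resp_ready.clear()
        return self.response

    def host_serve_one(self):                # runs in the host runtime
        self.req_ready.wait()
        self.req_ready.clear()
        op, arg = self.request
        if op == "malloc":                   # e.g. "allocate GPU-visible memory"
            self.response = bytearray(arg)
        self.resp_ready.set()

mb = Mailbox()
host = threading.Thread(target=mb.host_serve_one)
host.start()
buf = mb.device_call("malloc", 256)  # the "kernel" asks the host for memory
host.join()
assert len(buf) == 256
```

The hard parts the comment alludes to (many concurrent callers, forward progress across wavefronts) are exactly what the mailbox hides here; the llvm-libc RPC code mentioned above handles them properly.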


I heard Unreal Nanite built a job queue system on compute threads (https://www.youtube.com/watch?v=eviSykqSUUw&t=1611s), would that help with your use case or not?


How does this differ from CUDA’s dynamic parallelism, which lets you launch kernels from within a kernel?


There are a lot of similarities, but the granularity is finer. The idea is that you make a decision to launch one workgroup (typically 1024 threads) when the input is available, which would typically be driven by queues, and potentially with joins as well, which is something the new work graph stuff can't quite do. Otherwise the idea of stages running in parallel, connected by queues, is similar. But I did an analysis of work graphs and came to the conclusion that it wouldn't help with the Vello (2d vector graphics) workload at all.



The first step is admitting there's a problem. So... that's nice.


Exactly. People might trust AMD if they continue to invest in this for the next 10 years.

It's clear it wasn't a corporate priority. Convince people it is via sustained action and investment, and eventually they might change their minds.


With all due respect, this is an insult to those of us who have loyally purchased AMD for numerous years, trying our very best to do compute with days, nay weeks, of attempts.

Now, 5 years too late, we get told it's suddenly their number one priority.

Too late. Not only has all goodwill gone, but it's in deep negative territory. Even 50% lower performance stacks like Intel / Apple are much more appealing than AMD will ever be at this stage.


AMD has a history of providing sub-par software, and their strategy of (partially) opening up their specifications and have other people write it for free didn't work either.

Nvidia has huge software teams, and so does Intel.


I don't know if they'll ultimately succeed or not, but they at least seem to be putting genuine effort into this. ROCm releases are coming out at a relatively nice clip[1], including a new release just a week or two ago[2].

[1]: https://github.com/RadeonOpenCompute/ROCm/releases

[2]: https://www.phoronix.com/news/AMD-ROCm-5.7-Released


Yeah, AMD is doing more with ROCm. But are they catching up to Nvidia, or just not falling behind as fast as before? Only time will tell.


It's a fair question. And I agree, all we can do is wait and see how things play out. I am definitely rooting for AMD here though, for multiple reasons.


Not only sub-par software, but sub-par software that they drop support for after a couple of years. People can work around the problems with sub-par software if they believe that it will benefit them long term. They will absolutely not put in the effort if they fear it will be completely useless in 2 years time.


Only 16 years after Nvidia released CUDA


I remember chatting with some Nvidia rep at CES 2008. He showed me how cuda could be used to accelerate video upscale and encoding. I was 19 at the time and just a hobbyist. I thought that was the coolest thing in the world.

(And yes I "snuck" in to CES using a fake business card to get my badge)


Back in the day, using CUDA was really hard. It got better as more people built on it and it got battle tested.


It's still not exactly easy, and the API has not changed much since the aughts except to become richer and more complicated. But almost nobody writes raw CUDA anymore. It's abstracted away beneath many layers of libraries, e.g. Flax -> Jax -> lax -> XLA -> CUDA.


[flagged]


What a useless comment. It is you that drives the fire, I would be more than happy with a bit more competition. The sad reality is that right now if you want to focus on your job and not on the intermediary layers that NV is pretty much the only game in town. The 'Team Green' bs came out of the gaming world where people with zero qualifications were facing off with other people with zero qualifications about whose HW was 'the best' when 'the best' meant: I can play games. But this is entirely different, it is about long and deep support of a complex hardware/software combo where whole empires are built upon that support. Those are not decisions made lightly and unfortunately AMD has done very poorly so far. This announcement is great but the proof of the pudding will be in the eating, so let's see how many engineers they dedicate to delivering top notch software.


The hilarious thing is I'm actually an AMD fanboy, I've made a point to only get their GPUs (and CPUs) for the last decade or so. But I'm still annoyed and frustrated that it's taken them so long to get their act together on this.


I think AMD needs to do something BIG in the enterprise space. Nvidia seems to have the lion's share of the market, but Intel has been making good strides there with its DC GPUs.

The software stack is the key here. If the drivers aren't there it doesn't matter what paper capabilities your product has if you can't use it.

AMD have on paper done well with performance in recent generations of consumer cards but their drivers universally seem to be the let down to making the most of their architecture.


They have! At one of their keynotes this summer, they announced a direct competitor to Nvidia's AI chips for enterprises: the MI300X.

https://www.anandtech.com/show/18915/amd-expands-mi300-famil...

The software stack is crucial, of course, but if you buy this kind of chip (meaning you have a lot of money), you can probably also optimize your stack for it for some extra bucks so as not to rely on Nvidia's supply.


With all this hype about CUDA, I have recently started looking into programming CUDA as a job as I love that kind of challenge, but to my dismay I found that these tasks are very niche. So it is not even that people are routinely writing new CUDA code. It's just that the current corpus is too big and comprehensive for alternatives to compete with.


That and a massive amount of experience already out there on how to optimize for that particular architecture. NVidia has done well for itself on the back of four sequential very good bets coupled with dedication unmatched by any other vendor, both on the hardware and on the software side. It also was one of the few times that I didn't care if I ran the vendor supplied closed source stuff because it seemed to work just fine and I never had the feeling they would suddenly drop support for my platform.


Specialized skills can have a fairly small job market sometimes. I think a lot of CUDA code ends up being foundational as part of popular libraries, supporting tons of applications that never need to write a single line of CUDA themselves.


Really? Then you might explain why this list is so pitifully short: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... and will get even shorter: https://rocm.docs.amd.com/en/latest/CHANGELOG.html#amd-insti... I'm so for AMD, but in terms of easily-accessible GPU computing, ROCm is way behind CUDA.


Maybe I'll believe them when a consumer on Windows and Linux can download a binary from something like Meshlab or Automatic1111 and it just works on their gaming computer. If all they're interested in is selling CDNA to data centers I don't think they'll get enough mind share to be a realistic option.

Also, is it really a good idea for various projects to add another proprietary platform? We should move away from CUDA and ROCm and towards open standards like SYCL. I don't want to have to care about who made my GPU, just as I don't have to care about who made my CPU.


They did just start porting ROCm to Windows a few months ago (more specifically, with ROCm 5.5.1). And yeah, ROCm for Windows specifically supports RDNA 2 and RDNA 3, instead of CDNA like ROCm for Linux. So at least the title isn't a total lie. But ROCm for Windows still has a few components missing. Will they finish the port? Who knows? You may try to guess.


The inevitable fight here is between CUDA and ROCm, which may have hundreds of AMD engineers working on it and related verticals (at best, without significant changes at the company), plus whatever contributions they can muster from the community.

As at least a headcount check: CUDA has had thousands of engineers working on it and related verticals.

I know there's a philosophy that states, eventually, open source eats everything, however, this one seems like there is so much catch up that AMD will need to spend big and fast to get off the ground competitively.


What's stopping AMD from implementing CUDA?

Just like Wine implemented Windows APIs


That is effectively what HIP is supposed to be (while sidestepping some copyright gray areas). It is a very close copy of the CUDA API, and it can compile either for AMD GPUs or map each call onto the corresponding CUDA call for NVIDIA.
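The correspondence is close enough that AMD's hipify tools largely just rename identifiers (`cudaMalloc` → `hipMalloc`, and so on). A rough sketch of that renaming idea in Python, illustrative only; the real tools carry much larger mapping tables and handle headers, types, and libraries too:

```python
import re

# Illustrative subset of the CUDA -> HIP renames performed by
# hipify-style tools; the real mapping tables are far larger.
RENAMES = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def hipify(source: str) -> str:
    # Match whole identifiers only (longest names first), so e.g.
    # "cudaMallocHost" is not mangled by the "cudaMalloc" rule.
    names = sorted(RENAMES, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(names) + r")\b")
    return pattern.sub(lambda m: RENAMES[m.group(1)], source)

cuda_src = "cudaMalloc(&d_buf, n); cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);"
print(hipify(cuda_src))
# -> hipMalloc(&d_buf, n); hipMemcpy(d_buf, h_buf, n, hipMemcpyHostToDevice);
```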


Nothing; HIP is essentially API-compatible. But that gets you little, because code optimized for NVIDIA hardware will perform quite abysmally on a Radeon/Instinct.

And furthermore, NVIDIA has a bunch of proprietary libraries that AMD has not cloned either.

Normal people use TensorFlow, Keras or PyTorch anyway, not raw CUDA or even its libraries. The one stronghold of raw CUDA is molecular dynamics simulations, because that code was written ages ago by researchers who had never heard of TensorFlow etc. It typically uses cuBLAS and/or cuFFT, for which the AMD replacements are a joke with incompatible APIs. The situation there is finally, slowly improving with Magma.


Why is this not upvoted more? Very good question.


As far as I understand it, AMD basically has to do this because games are going to increasingly rely on LLMs & generative AI operating simultaneously with the graphics pipeline.


It has nothing to do with games. The market outside of games for compute is much bigger at the moment with the AI hype, and AMD is positioned to take a good slice of it, if they get their software stack in order.


You've missed the point of their message. I think they're saying: Sure, the market is bigger. They could choose to continue to focus on gaming despite that. Except it doesn't seem like even that is an option.


If they were serious, they would start something like drm/mesa but for compute and it would just work out of the box with a stock Linux kernel.


The amdkfd driver is in stock Linux kernels. ROCm is mostly userspace; if you don't install the kernel module that comes with it, code still runs.


Rusticl is the latest attempt at developing an OpenCL implementation for mesa and that is exactly the goal.


Words versus actions.

People don't really care about what the executive says.

Especially when the same executive is also quoted with patently dishonest bullshit:

> If you think about the product portfolio that AMD has, it’s arguably the broadest in the industry in terms of AI compute

What AMD does is what people will pay attention to.


Not particularly relevant, but the name "ROCm" is kind of terrible. It's hard to pronounce and doesn't look good (the caps followed by lowercase is quite jarring). Minor details, but I feel like these things do have a bit of downstream impact.


s/OpenCL/ROCm/g



