Behind the scenes with ATI - Eric Demers interview

R5xx

Eric Demers, 3D Architect and Hardware Design Manager at ATI, was kind enough to answer some of our questions about the inner workings of the R520 architecture. Fasten your seat belts, because this isn't the usual lots-of-talk-but-not-much-info type of interview!


Eric Demers - source: www.driverheaven.com

PH!: Why do the top ATI GPU codenames always end with “20” instead of “00”, as in R420, R520? Is the “00” ending reserved for something special?

ED: You know, I've been confused by our part numbers and have ceased to really try to understand the numbering scheme. Yes, the first number indicates architecture, but even that can be partially wrong. It's a number that we sometimes try to make mean something (i.e. engineering-wise), and sometimes it's meant to mean something else (i.e. marketing-wise). I would not attach too much to those numbers. Even within engineering, we use codenames, since numbers change and aren't always meaningful. And yes, sometimes they will end in '20' and sometimes in '00', but it's more random than most things – I don't remember why we picked R520, for example. Perhaps it's done to confuse the enemy :) We need cheat sheets to remember them all :)

PH!: We suspect that the architecture first shown with the R520 has lots of reserves. The former “big one”, the R300, doubled in pipeline count and clock speed over its lifespan. We assume the R520 will do the same… will we ever see a 32-pipeline (or rather, 8-quad) R520 running around 1 GHz, or will the unified shader architecture wash it away beforehand? Does it make sense to speak about unified shaders at all when we have Ultra-Threading?

ED: Well, I won't comment on unannounced products, but there's a lot invested into a new generation. About 110 man-years for the R5xx generation. So, trying to maximize the number of parts we can get from it is important, to justify all the investment. The R5xx series was designed to be more flexible than previous architectures, since the metrics of yesterday have become less meaningful. Two years ago, it mattered more how many “pipelines” you had, perhaps with some notion of the number of Z's or textures per pipe, but that was the basic metric. Today, we have moved away from that paradigm. Applications don't use fixed-function pipelines anymore, but create powerful shader programs to execute on the HW. It's not “how many pixels can you pop out per second?”, it's “what is the throughput of your shader?”. Our R5xx architecture has moved away from simply scaling pipelines to scaling in terms of ALU operations, texture operations, flow control, Z operations as well as more traditional raster operations (all of this wrapped in a design that can maximize the work done by each part). So will there be a 1 GHz 32-pipeline R5xx part? Well, we've ceased to measure things that way, so it won't be so easy to describe. But, yes, we will have more parts from this generation :)


PH!: What top clock speeds are you expecting for the X1800 XT with air cooling and with extreme cooling? How far can the X1800 go without a die shrink?

ED: Well, with air cooling in the 90GT, we've seen core graphics clocks hit up to 800 MHz. With more extreme cooling, we've surpassed 1 GHz!! So there's a lot of headroom in these chips. On the memory side, I think we've gone well above 800 MHz, but the X1800 XT now ships with 800 MHz memory, so the limiting factor often ends up being the DRAM speed. Board design also contributes to the memory clock limitation. A few changes on the current board, with faster memory, could yield significant improvements. Perhaps a future product :)


Radeon X1800 XT extreme cooled - source: www.muropaketti.com

But one thing to note is that our memory architecture was designed for GDDR4 up to 1.5 GHz in speed, so it's got lots of headroom. I could see MCLK even hitting 1.6 GHz with more aggressive cooling – assuming the memory is there and the board is designed to work at such speeds!
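
A quick back-of-the-envelope check of what those clocks mean in bandwidth terms. This is our own arithmetic, not something from the interview: it assumes the X1800 XT's 256-bit bus and treats the quoted figures as base clocks of double-data-rate memory.

```python
# Peak memory bandwidth = bytes per transfer x transfers per second.
# Assumptions (not from the interview): 256-bit bus, double data rate,
# and the quoted MHz figures being base clocks.

def bandwidth_gbs(base_clock_mhz, bus_width_bits=256, data_rate=2):
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_sec = base_clock_mhz * 1e6 * data_rate
    return bytes_per_transfer * transfers_per_sec / 1e9

print(bandwidth_gbs(800))    # ~51.2 GB/s at the shipping 800 MHz
print(bandwidth_gbs(1500))   # ~96.0 GB/s at the 1.5 GHz GDDR4 design target
```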

Register array, Ring Bus

PH!: For every thread of the R520 there are 32 registers, each 128 bits wide. This means a 256 KB register array for 512 threads, which seems like quite a huge number. Is it really “this big”, or did we make a mistake in our calculations? How does this number compare to former ATI or NVIDIA architectures? Is this 2 Mbit really that big, and if so, how much space does it take up on the ASIC? How fast are these registers compared to the L1 and L2 caches of today's CPUs?

ED: Yes, that's the theoretical number. The actual number can be less: if each of your threads does more work before being put to sleep, you can reduce the number of threads without affecting your latency-hiding capabilities. How much less depends on your architecture, your ALU/texture ratio and many other factors.
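
For reference, here is the arithmetic behind the "theoretical number" – just a sanity check of the figures quoted in the question above.

```python
# Register array implied by the figures in the question:
# 512 threads x 32 registers per thread x 128 bits per register.
threads = 512
regs_per_thread = 32
bits_per_reg = 128

total_bits = threads * regs_per_thread * bits_per_reg
print(total_bits)                     # 2097152 bits = 2 Mbit
print(total_bits // 8 // 1024, "KB")  # 256 KB
```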

Compared to previous parts from ATI, it's significantly more. The shader core is getting to be one of the bigger parts of the design now. Compared to NV? Well, since I don't know how they've done things, it's hard for me to compare :)

As far as L1/L2 caches go, well, these registers are more like the immediate registers of a CPU. If you check the GPGPU results, you'll see that our performance is independent of the number of GPRs in use, so it's exactly the same as a CPU's registers. I know that's not true of other architectures, but it's an important aspect of ours. That's also one of the prime reasons why we are so well suited for GPGPU work: as the shaders get more complex and longer, our performance is perfectly predictable. No need for fancy driver shader games, or falling back to partial precision.

PH!: The Ring Bus memory controller is the most elegant part of the R520 in our opinion. We are looking forward to hearing about (and testing) new tweaks. We saw the OpenGL optimization that boosted frame rates by 30%, which is a huge number. How does the fine tuning work? What can the controller logic do by itself, and what can the coders do through the driver? What clock speeds can be reached by this Ring Bus architecture? Is it possible to extend it to 2x512 bits?

ED: The memory controller was designed for the fastest GDDR4, or at least 1.5 GHz. It's easy to scale the design down, or scale it up. It's all linear. It was designed to be flexible in those ways, since we have so many different products. The fine tuning has some elements of trial and error, but we use annealing and genetic ("DNA") algorithms to reduce the search space and zero in on settings. Applications are complex, and getting to the “perfect” MC settings is difficult. We have an application team in place whose job it is to improve performance, both through MC tuning as well as driver changes. We have an amazing amount of MC tuning potential at our disposal – so much that going through all the possibilities would take longer than the life of the universe so far. However, with some educated guesses and clever reduction algorithms, we can manage the task. You should expect to see improvements across the board and in specific games, over time. We won't always tell you if the performance improvements came from MC tuning or some other driver enhancement. But it's a powerful new tool for us to improve performance.
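
To illustrate the kind of search Eric describes, here is a minimal simulated-annealing sketch over a made-up settings space. It is not ATI's tuning tool; the knob names and the benchmark() scoring function are placeholders for the real memory-controller parameters and game traces.

```python
import math, random

SETTINGS = {f"knob_{i}": list(range(8)) for i in range(16)}   # hypothetical MC knobs

def benchmark(config):
    # Placeholder: in reality this would replay a game trace and return its FPS.
    return -sum((v - 4) ** 2 for v in config.values())

def anneal(steps=10_000, temp=10.0, cooling=0.999):
    current = {k: random.choice(v) for k, v in SETTINGS.items()}
    score = benchmark(current)
    best, best_score = dict(current), score
    for _ in range(steps):
        candidate = dict(current)
        knob = random.choice(list(SETTINGS))          # perturb one knob at a time
        candidate[knob] = random.choice(SETTINGS[knob])
        cand_score = benchmark(candidate)
        # Always accept improvements; accept regressions with a temperature-dependent chance.
        if cand_score > score or random.random() < math.exp((cand_score - score) / temp):
            current, score = candidate, cand_score
            if score > best_score:
                best, best_score = dict(current), score
        temp *= cooling
    return best, best_score

print(anneal()[1])   # approaches 0, the best possible score for this toy objective
```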

PH!: As far as we know, the early production problems only affected products with the Ring Bus memory controller. Is there any connection between the issue and the memory controller itself?

ED: No, the early production problems had nothing to do with any specific part of the design. It wasn't the MC or graphics or any other particular place – it was all over the place. There was a design flaw in a circuit that did not show up in any of the checks we do in the process of producing ASICs. It was internal to a non-ATI design. Once we found the problem, it was trivial to fix, but it delayed our products by many months.

RV530

PH!: The setup of the RV530 looks a bit strange to us. What's the explanation for the low number of texture units? Will this architecture be able to keep up with other 12-pipeline solutions like the X800 GTO, GeForce 6800 GS or a future GeForce 7600 GT, which have more texture and ROP units?

ED: I explained earlier that the “number of pipes” sort of became meaningless with the R5xx series. We came up with a new architecture where all the elements of the pipes are now scalable, and refocused on the shaders. We found that the newest apps have shifted toward higher ALU-to-texture ratios, but still require decent occlusion (Z) checks. As well, they are moving to wide pixel formats for HDR.

Now, for a high-end part like the X1800, you can have everything maxed out and get the best performance in all ways. But with a limited area budget (i.e. target cost) for the X1300 and X1600, you have to be more careful in your selections and target the “sweet spot” of next-generation gaming. Given this, for the X1600 we set ourselves up to get 12 shader ALU operations per cycle, 4 texture lookups, 16 Z checks (in AA, 8 in non-AA) and 4 pixels out. This set of ratios is targeted at the latest games and the next generation of games. It might not do as well in older games, but it really shines in the best and newest of what's out there, or coming up.
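
For clarity, the per-clock rates Eric quotes for the X1600 work out to the following ratios – simple arithmetic on the numbers above, nothing more.

```python
# Per-clock rates quoted above for the X1600, and the resulting ratios.
alu_ops, tex_lookups, z_checks_aa, pixels_out = 12, 4, 16, 4

print("ALU : TEX  =", alu_ops / tex_lookups)        # 3.0, i.e. a 3:1 ratio
print("Z   : pixel =", z_checks_aa / pixels_out)    # 4.0 with AA (2.0 without)
```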

PH!: While we haven't reviewed it yet, most other hardware sites weren't thoroughly enthusiastic about the X1600 Pro. Are you planning to launch an “X1700 series” soon, or will it be the smaller brother of something like the “X1900”?

ED: There will be new products, of course. I can't comment on any specifics. I think that the MSRP changes make the X1600 a very attractive product. Given an MSRP of $149 and a street price that's lower, I think that performance-wise it trashes other products, and feature-wise, it's not even funny. As well, I feel you have a much more future-proof product that will scale better than competitive products on newer titles. Check the next game that uses dynamic branching, and I think you'll see how much better it can be :)

CrossFire and PhysX

PH!: The Xilinx FPGA based picture composing engine seems to be a very “individual” circuit that could mix any output (DVI) signals. Would it be possible to combine the signals of independent video cards with similar FPGA based circuitry?

ED: Yes, it would be possible, as long as the cards are close to working on the same frame (some buffering is available, but it's not infinite).

PH!: Are there any plans for a motherboard like the Gigabyte GA-8N-SLI Quad that can run four ATI cards working in CrossFire? NVIDIA has quite a few dual GPU products – will we see a similar solution (or a dual-core GPU) from ATI? It would be great to see a dual X1800 XT pushing Adaptive AA for HDTV 1080i :)

ED: Our designs were made to be very scalable. Evans & Sutherland ships a 16-chip configuration of our chips using super tiling as its default rendering mode. We designed it to scale up to 256 chips. So, there's nothing that would prevent it from working technically, but power, heat and cost are the core issues.
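
As an illustration of the super-tiling idea mentioned above – not ATI's actual scheme – here is a minimal sketch in which the screen is carved into small tiles that are dealt out to chips in a repeating pattern, so each chip gets a roughly even share of any scene. The tile size and assignment formula are just for illustration.

```python
# Sketch of super tiling: assign small screen tiles to chips in a repeating,
# checkerboard-like pattern. Tile size and formula are illustrative only.

def tile_owner(x, y, num_chips, tile_size=32):
    tx, ty = x // tile_size, y // tile_size
    return (tx + ty) % num_chips      # round-robin over tile coordinates

# Which of 4 chips would render the pixel at (100, 300)?
print(tile_owner(100, 300, num_chips=4))   # tile (3, 9) -> (3 + 9) % 4 = 0
```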


source: beyond3d.com

PH!: We haven't tried it ourselves, but word from Canada says that two ordinary X1300s can work in CrossFire – what's more, with all the features like SuperTile, Scissor and AFR modes. It's clear that the communication in this case is through the PCI Express bus, but what combines the pictures? Is it the CPU? If two simple X1300s can “CrossFire”, can't any two PCI Express Radeons?

ED: Yes, the communication occurs over PCIe. One of the two cards acts as a virtual “master” card and displays the image. The slave card can render to the master peer-to-peer (without system memory) or through system memory. There's no technical reason that we can't run all our X1000-series cards in this mode. However, the higher-performing cards tend to saturate the PCIe bus, so for them it makes more sense to switch to the compositing chip, which removes the PCIe bandwidth being used and gives some great new features such as SuperAA. I believe that Cat 5.11 enabled CrossFire on the X1300s.
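
As a rough mental model of the AFR case described here – purely a conceptual sketch, not ATI driver code – the master and slave alternate frames, and a slave frame is copied to the master (peer-to-peer or via system memory) before display. The DummyGpu class and its methods are hypothetical stand-ins so the sketch runs.

```python
# Conceptual AFR dispatch between a master and a slave card over PCIe.
# DummyGpu is a placeholder, not a real driver API.

class DummyGpu:
    def __init__(self, name):
        self.name = name
    def render(self, frame):
        return f"{self.name} rendered frame {frame}"
    def copy_to_master(self, image, master):
        return image + f" (copied to {master.name} over PCIe)"
    def display(self, image):
        print("display:", image)

def render_afr(frames, master_gpu, slave_gpu):
    for i, frame in enumerate(frames):
        gpu = master_gpu if i % 2 == 0 else slave_gpu   # alternate frames between cards
        image = gpu.render(frame)
        if gpu is slave_gpu:
            # Slave output travels over PCIe: peer-to-peer into the master's
            # memory, or staged through system memory, as described above.
            image = gpu.copy_to_master(image, master_gpu)
        master_gpu.display(image)

render_afr(range(4), DummyGpu("master"), DummyGpu("slave"))
```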

PH!: We hear more and more about Ageia's PhysX PPU; it has even been paired, in theory, with the Sony PS3. Are physics computations suitable for a GPU? Is it possible that a future DirectX (or WGF) will have something like a “DirectPhysX” extension as a standard library alongside Direct3D? Is ATI planning to produce a physics accelerator, or a driver that lets a GPU (maybe the GPU of a second, PCI Express based ATI video card) do physics?

ED: One of the things we've been pushing and working on is the general concept of GPGPU. Basically, take non-graphics problems that perform well on parallel computers, and use the GPU to perform those operations. In some of the protein folding and signal processing fields, we see 2x to 7x increases in performance relative to the fastest single-core CPUs. We are working with the GPGPU field to further develop these types of applications. However, it's still too early to say how things will develop. But under the banner of GPGPU, we can include physics computations that lend themselves to parallelism. In general, those are the types of physics computations done in a co-processor, and that justify the co-processor. I believe a lot of that could be done on a GPU as well. As for API extensions, you'd need to talk to MS :)
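
As a toy illustration of the kind of data-parallel physics that maps well onto a GPU – not ATI GPGPU code, just NumPy standing in for a shader that applies the same instructions to every element of a large array:

```python
import numpy as np

def step_particles(pos, vel, dt=0.01, gravity=np.array([0.0, -9.81, 0.0])):
    """One explicit Euler integration step applied to all particles at once."""
    vel = vel + gravity * dt    # the same math runs independently for every particle
    pos = pos + vel * dt
    return pos, vel

pos = np.zeros((100_000, 3))                    # 100k particles at the origin
vel = np.random.standard_normal((100_000, 3))   # random initial velocities
pos, vel = step_particles(pos, vel)
print(pos.shape, vel.mean(axis=0))
```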

Adaptive AA, Quality AF

PH!: There are lots of forum posts about enabling Adaptive AA on older Radeons. We made the feature work on an R350 based card, and the latest version of ATITool supports the feature too. Is Adaptive AA going to be available for older Radeons in future Catalyst versions? What's the essence of Adaptive AA? Isn't it just the activation of super sampling when a texture has transparent parts?

ED: Some form of adaptive AA is going to be made available on previous architectures. Our fundamental AA hardware has always been very flexible and gives the ability to alter sampling locations and methods. In the X1k family, we've done more things to improve performance when doing Adaptive AA, so that the performance hit there will be much less than on previous architectures. However, the fundamental feature should be available to earlier products. I'm not sure on the timeline, as we've focused on X1k QA for this feature up to now. Beta testing is ongoing, and users can use registry keys to enable the feature for older products, for now.
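
A conceptual model of the idea the question describes – supersample only where an alpha-tested (transparent) texture is involved, multisample everywhere else. This is an illustration, not ATI's hardware logic.

```python
# Conceptual per-primitive decision for adaptive AA. Not hardware logic.

def choose_aa_mode(primitive, samples=6):
    if primitive.get("alpha_tested", False):
        # Shade every sample so the holes punched by the alpha test get smoothed too.
        return ("supersample", samples)
    # Ordinary geometry: shade once per pixel, replicate to covered samples.
    return ("multisample", samples)

print(choose_aa_mode({"alpha_tested": True}))    # ('supersample', 6)
print(choose_aa_mode({"alpha_tested": False}))   # ('multisample', 6)
```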

PH!: Quality AF looks impressive. It finally does what AF is supposed to do – congratulations. In our tests, Quality AF was almost as fast as the “angle-dependent” Performance AF. Did we make a mistake? If not, why is the regular AF needed at all, when we have Quality AF at basically the same speed?

ED: There are some cases where area-based AF does perform slower than our previous AF. Our previous AF filtering algorithm used reasonable compromises to achieve excellent quality and performance. It was superior to simple bilinear/trilinear implementations. In fact, it was superior, I feel, to today's offerings from competing vendors. However, we did listen to the criticisms, and while not always agreeing, we decided that our customers' requirements are our top priority. As well, as HW improves, it should never give worse quality. So we improved many items in the pipeline to increase precision (subpixel, Z, color and iteration, etc…) as well as improved the LOD and AF algorithms. I think it's obvious that we now offer the very best quality filtering, bar none. To me, I'd rather have high quality at a good frame rate than terrible quality at breakneck speeds.
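
For readers wondering what "area-based" filtering reacts to, here is a textbook approximation of how a filter can estimate the degree of anisotropy from screen-space texture-coordinate derivatives. It is a generic illustration, not ATI's algorithm; roughly speaking, angle-dependent implementations reduce the sample count for some footprint orientations, while an area-based approach does not.

```python
import math

# Estimate the anisotropy of a pixel's footprint in texture space from the
# derivatives of the texture coordinates (u, v) with respect to screen x and y.
# Generic textbook math, not ATI's filtering algorithm.

def anisotropy_ratio(dudx, dvdx, dudy, dvdy, max_aniso=16):
    len_x = math.hypot(dudx, dvdx)    # footprint of a one-pixel step in screen x
    len_y = math.hypot(dudy, dvdy)    # footprint of a one-pixel step in screen y
    longer = max(len_x, len_y)
    shorter = max(min(len_x, len_y), 1e-9)
    return min(longer / shorter, max_aniso)   # how many extra samples to take

print(anisotropy_ratio(0.5, 0.0, 0.0, 0.03))  # ~16: a steeply tilted surface
```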

PH!: To test the X1800 at 1600x1200 we had to warm up our good old CRT, because our good new LCDs could only handle 1280x1024. ATI and NVIDIA suggest testing at 1600x1200 or even higher. Does that make sense when most users (even power users) use LCDs with a native 1280x1024? Is HDTV 1080i going to be popular among PC gamers? We think 1280x1024/Quality AF/Adaptive AA with HDR is the best setup today, and it can even make the X1800 XT sweat :)

ED: Well, even the cheaper monitors used to support 16x12 without too much trouble. The popularity of low-cost LCDs has actually caused a bit of a resolution stall. But newer, larger LCDs certainly support 16x12 and even higher. As well, I think that there will be a plateau of standardization at 1920x1080, as this is the standard for broadcast and an excellent resolution for monitors. But for the others, their focus should be on quality. Myself, I use a 19” LCD at 1280x1024. So I up the AF setting to the highest and turn on our 6xAA – it gives the highest quality picture available.

HDR, GITG

PH!: Speaking of HDR: the 64-bit mode with AA seems to be very demanding on memory bandwidth. What will (could) be the most common HDR+AA mode used in games? In which cases is it recommended to use the Int10 HDR?

ED: There are many, many HDR formats. These include 10b, FP16, RGBE, etc… Probably over two dozen formats. All of these formats have advantages and disadvantages. FP16 (the common 64b format) is great for dynamic range, but loses out as it doesn't have much more precision and usually comes at a hefty performance loss. 10b gives equivalent precision, but at 2x the speed. ISVs will have to pick and choose the right format for the job. I think that 10/10/10, FP16 and RGBE will be popular formats. There might be others. ISVs will have to balance performance vs. the surface format required, regardless of the AA mode selected. And they will have to use HDR when it makes sense to maximize the quality of the rendering. I could imagine cases where some passes on a scene are HDR, while others are not, to maximize performance.

But in general, when using HDR surfaces, you do want AA, as the contrast is even higher now which means that silhouettes have even more aliasing artifacts. Our X1k is the only HW that can support that today.
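
The arithmetic behind the "2x the speed" remark above is straightforward: an FP16 RGBA render target costs 64 bits per pixel, while a 10-bit integer format (assuming the usual 10/10/10/2 packing) fits in 32, so it needs half the storage and write bandwidth per pixel.

```python
# Bits per pixel for the two HDR render-target formats discussed above.
fp16_bpp = 4 * 16              # FP16 RGBA: 64 bits per pixel
int10_bpp = 10 + 10 + 10 + 2   # assuming the usual 10/10/10/2 packing: 32 bits

pixels = 1600 * 1200
print(fp16_bpp, int10_bpp)                               # 64 32
print(pixels * fp16_bpp / 8 / 2**20, "MiB per buffer")   # ~14.6 MiB
print(pixels * int10_bpp / 8 / 2**20, "MiB per buffer")  # ~7.3 MiB
```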

PH!: The Parthenon and Toy Shop demos showed lots of interesting effects which would be great to see in games. Are there going to be any GITG (or other) titles using POM, water droplet simulation or the Progressive Buffer geometry?

ED: Yes, there will be. As well, we work with ISVs to let them know of all the things we have discovered doing these demos, so that they can use them in their own applications. There have already been at least one or two ATI developer days where we presented all of our techniques for these demos to ISVs. You'll hopefully see this in games soon!

PH!: Lots of our readers think the TWIMTBP program is something like NVIDIA “buying” the preference of game developers. How does the GITG program work? Who can participate, and what kind of support do they get? Do they get any financial help?

ED: Well, I must admit to not being an expert on all the “goings on” with respect to ATI's developer programs. I poked at one of the leads to get their view on this question:

Tony: “The GITG sub site is not to be viewed as "buying" the preference of game developers. The purpose of the site from a marketing perspective is to have a central place for developers, gamers, and tech enthusiasts to go to get the latest information on games where ATI has a technical and marketing relationship. It also offers viewers the opportunity to view the latest ISV promotions with other partners (e.g. we had a bundle running with Alienware not too long ago), and also to promote and allow requests for support for LAN events. When someone wants ATI to support an event with a LAN kit the requests come in through this site.”

PH!: Thanks a lot guys and good luck for all future projects!

Rudolf Mezes
