Kurnal

A Brief Discussion of the Dimensity 9000

Preface

The D9000 has reached the end of its life cycle; after all, it has been a year. Rehashing old content may not be very meaningful, but it's finally time to talk about it.

Previously written, with errors, just make do with it

The Dimensity 9000 was released on December 16, 2021, manufactured on TSMC N4; the first phone to ship with it was the Find X5 Pro Dimensity Edition. This piece can be considered a layout debut of sorts. Although a tech outlet had released a dieshot early on, accessing it required funding (how could a poor student have money?), so it was only recently that I found time to produce an annotated drawing. I had already sketched it when the dieshot first appeared, but some parts were unclear to me back then, so it got postponed until now. It still counts as a debut.

So here is the debut dieshot.

IMG_2589

As you can see, it's very blurry, so I added some annotations, increased the contrast, and upscaled the resolution.

IMG_2590_enlarged

Since it's black and white anyway, the contrast can be pushed up a bit for more comfortable viewing and better discernibility.

MediaTek D9000

In this image, the GPU cluster is plainly visible in the upper right corner: a Mali G710 MC10. The configuration theoretically scales up to 16 cores; note that the MC (multi-core) and the older MP (multi-processor) suffixes mean the same thing here, a core count, with no difference.

G710 Coreshot

It is clear that this Mali G710 is laid out as two ALU cluster groups and three groups of GPU cores (the orange part is the GPU cluster IO), in a 4+4+2 arrangement, though I don't know why it was designed this way. I briefly checked the dieshot of the D81, released around the same time, and it doesn't have this layout, which is quite strange.

The lower left corner shows the GPU cache (yellow), which should be 3 MiB.

It's worth noting that, compared with the previous-generation G78, although both belong to the Valhall architecture, each G710 shader core now contains two execution engines.

G78MP1

This doubles the shader count per core, and the configuration scales to a theoretical maximum of MC16.

In terms of GPU architecture, the G710 is an iteration of the G78, the G510 of the G57, and the G310 of the G31. All are based on Valhall, the architecture in use since the G77. Compared with the previous Bifrost generation, the Valhall core features a new superscalar engine (improving IPC and performance per watt), a simplified ISA that is more compiler-friendly, dynamic instruction scheduling, and new data structures for APIs such as Vulkan.

For example, the preceding Bifrost architecture was 4-wide/8-wide: the G72's execution units were 4-wide scalar SIMD with a warp size of 4, and the G76 moved to two 4-wide units with a warp size of 8. Such narrow warps leave execution lanes poorly filled during scheduling, while Valhall widens the warp to 16, improving ALU utilization.
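As a toy illustration of the scheduling argument above (the issue rate and lane count are my assumptions, not measured Mali figures): if the scheduler can issue one warp per cycle into a 16-lane ALU, narrow warps leave most lanes idle.

```python
def alu_utilization(warp_width: int, alu_lanes: int = 16, issues_per_cycle: int = 1) -> float:
    """Fraction of ALU lanes kept busy when only whole warps can be issued."""
    fed = min(warp_width * issues_per_cycle, alu_lanes)  # lanes fed per cycle
    return fed / alu_lanes

print(alu_utilization(4))   # 0.25 -- G72-style 4-wide warp
print(alu_utilization(8))   # 0.5  -- G76-style 8-wide warp
print(alu_utilization(16))  # 1.0  -- Valhall 16-wide warp fills the ALU
```

Under this (deliberately simplified) model, widening the warp to match the ALU width is exactly what recovers full utilization.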

With Valhall, the three execution engines merged into one large engine, though the actual ALU still consists of two halves, as 2x16-wide FMA units.

In Bifrost, each execution engine had its own datapath control logic, scheduler, and instruction cache, which wasted resources. In the G710, each shader core instead contains two execution engines, effectively doubling the shader count.

Each engine still consists of two processing units, with slight changes: with warp width and integer throughput unchanged, the G710's processing units are divided 4x4-wide. Each engine has dedicated resources, doubling FMA throughput per core per cycle, and the new TMU can process 8 bilinear texels per cycle.
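A quick back-of-the-envelope on what doubling FMA per core per cycle means. Assuming 2 engines x 2 processing units x 16 lanes per core, and an illustrative 850 MHz clock (my assumption, not a confirmed D9000 GPU frequency):

```python
def fp32_gflops(cores: int, fma_per_core_cycle: int, freq_ghz: float) -> float:
    """Peak FP32 throughput; each FMA counts as 2 FLOPs."""
    return cores * fma_per_core_cycle * 2 * freq_ghz

# G710: 2 engines x 2 processing units x 16-wide = 64 FP32 FMAs per core per cycle
fma = 2 * 2 * 16
print(fma)                                   # 64
print(round(fp32_gflops(10, fma, 0.85), 1))  # ~1088.0 GFLOPS for MC10 at 850 MHz
```

Halve the per-core FMA count and the same formula recovers the G78-class figure, which is the "doubling" the text refers to.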

The G710 replaces Mali's original Job Manager with the so-called CSF (Command Stream Frontend), responsible for scheduling and draw calls.

The G610 is essentially a G710 with fewer than 7 cores.

Going by the headline numbers, the G510 and G310 are indeed impressive (100%, indeed), but then the G31 hadn't been updated in ages.

The G510's shader core gains an additional execution engine, and each execution engine can optionally be equipped with 2 clusters of processing units, similar to the G710. However, one of the G510's engines may carry only one processing unit, for optional FMA throughput of 48-64 per cycle. The texture units can be configured for 4 or 8 texels per cycle, core counts range from 2 to 6, and the L2 size is configurable. Memory bandwidth is 60 GB/s (4 x 16-bit x 3750 MHz x 2 ÷ 8), with a 6 MiB SLC.
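The bandwidth figure in parentheses can be reproduced directly; here is the same arithmetic as a small helper (the function name is mine):

```python
def lpddr_bandwidth_gbps(channels: int, bits_per_channel: int, clock_mhz: float,
                         ddr_factor: int = 2) -> float:
    """Peak bandwidth in GB/s: channels x width x clock x 2 (DDR), bits -> bytes."""
    bits_per_second = channels * bits_per_channel * clock_mhz * 1e6 * ddr_factor
    return bits_per_second / 8 / 1e9

# 4 x 16-bit LPDDR5X at a 3750 MHz clock (7500 MT/s): matches the 60 GB/s above
print(lpddr_bandwidth_gbps(4, 16, 3750))  # 60.0
```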

Although the G510 shows a large improvement, it is an update measured against an unchanged G57; the apparent gain exists because the line went un-updated for ages and was then suddenly handed to Cambridge for a redo. The same goes for the G310: the G31 really hadn't changed in ages. On paper this update shows a 100% performance increase, but set against contemporaneous, or even older, competing cores, you will find the improvement rather lonely. Such is ARM's cleverness.

Now let's talk about the CPU. The image above shows the CPU cluster of the D9K, located in the lower part of the SoC.

This processor uses the ARMv9 instruction set architecture. ARMv9 is an update built on v8.5 and includes the entire v8.5 feature set.

It introduces the SVE2 extension, essentially an extension of SVE. SVE, however, was aimed purely at HPC, while SVE2 is compatible with the NEON instruction set; in a sense, SVE2 is positioned as NEON's successor, allowing more flexible data access. Although SVE itself dates back to v8.2, in practice it has only been used in server IP such as the Neoverse V1. GEMM and BF16 support arrived in v8.6, so they are only available from v9.1 onwards. NEON's width is fixed at 128 bits, while SVE/SVE2 starts at 128 bits and scales up to 2048 bits. NEON's fixed-width approach is not unworkable, but it is cumbersome, especially for scalability, since the program must decide "where to place the data" itself rather than letting the SVE register size handle it.

By contrast, SVE's approach offers scalability and ease of deployment. SVE2 matches SVE's HPC performance, while for DSP and multimedia workloads it delivers roughly 1.2x at 128 bits, 2x at 256 bits, and 3.5x at 512 bits.
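A minimal sketch of the vector-length-agnostic idea behind SVE/SVE2, simulated in plain Python rather than real intrinsics: the loop mimics a whilelt-style predicate for the tail, so the same code produces identical results whatever the hardware vector width happens to be.

```python
def vla_sum(data: list, vector_bits: int) -> int:
    """Sum 32-bit elements in vector-sized chunks, masking the final partial chunk."""
    lanes = vector_bits // 32          # 32-bit elements per hardware vector
    total = 0
    i = 0
    while i < len(data):
        active = min(lanes, len(data) - i)  # predicate: active lanes this iteration
        total += sum(data[i:i + active])
        i += lanes
    return total

data = list(range(100))
# identical results on 128-bit and 2048-bit "hardware" -- the program never
# hard-codes the vector length, which is the point of SVE's design
assert vla_sum(data, 128) == vla_sum(data, 2048) == sum(data)
print("VLA result:", vla_sum(data, 512))
```

With fixed-width NEON, by contrast, the chunk size would be baked into the binary, and exploiting a wider unit would require rewriting the loop.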

The CPU is a 1+3+4 tri-cluster. From early aSMP, through big.LITTLE, we have arrived at DynamIQ. DynamIQ allows up to 32 clusters, each with up to 8 cores running on different voltage curves, though that is server territory. MTK's engineers use an HP+BP+HE arrangement, the tri-cluster, where the HP core runs at 3.4 GHz to meet high-performance demands. It's worth noting that current domestic software still leans primarily on single-core performance.

The BP cores balance power and performance, while the HE cores are positioned as ultra-low-power, ultra-low-voltage standby cores. ARMv9 brings the DSU-110, which allows workloads to migrate across the clusters for optimal power efficiency.

Inside the DSU-110, in the upper left corner, sits the ARM X2, and the core architecture has changed. Compared with the previous X1, the X2's pipeline has been shortened by one stage; on the D9K it was set at 3.4 GHz. Roughly estimated, that would perform like an X1 at 3.75 GHz (but 3.4 GHz was the engineering-sample frequency; the shipping frequency is 3.05 GHz). Meanwhile, L3 gets 8 MiB (the theoretical maximum is 16 MiB, which ARM's slides pit against the 1135G7), while the neighboring SDM provides at most 4 MiB.

Although the X2 looks more energy-efficient, it actually draws more power than the X1 at the same frequency. ARM's slides avoid the common perf/W metric in favor of a performance/power curve, meaning efficiency is assessed at iso-performance: for the same performance, the X2 can run at a lower frequency and therefore a lower voltage. With a 16% IPC lead (measured with different cache configurations), it achieves roughly a 30% efficiency gain at the lower frequency (±3). At equal frequency, of course, the X2's power is higher (a fact tucked away in a corner of the slide).
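The iso-performance argument can be sketched numerically. Assuming a crude DVFS model where voltage rises linearly with frequency (the voltage coefficients below are invented for illustration) and the quoted 16% IPC lead, matching the X1's performance at a lower clock saves power on the order of 25-30%:

```python
def dynamic_power(freq_ghz: float, v_min: float = 0.6, v_per_ghz: float = 0.2) -> float:
    """Relative dynamic power P ~ C * V^2 * f, with V rising linearly with f (assumed)."""
    v = v_min + v_per_ghz * freq_ghz
    return v * v * freq_ghz  # capacitance folded into the units

ipc_gain = 1.16
f_x1 = 3.0
f_x2 = f_x1 / ipc_gain          # X2 matches X1's performance at a lower clock
saving = 1 - dynamic_power(f_x2) / dynamic_power(f_x1)
print(round(saving, 2))          # fraction of power saved at iso-performance
```

At equal frequency the same model gives the X2 no advantage at all, which is exactly the caveat hidden in the corner of the slide.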

Meanwhile, the A710's design language is a sufficiently balanced PPAC. In design, the A710 is a minor revision of the A78: the front end changed less than the X2's, but still gained enough. The branch-prediction window has doubled, the TLB has grown (32 to 48 entries), the uop cache has been trimmed, dispatch has narrowed from 6-wide to 5-wide, and the pipeline is one stage shorter.

Per the slides, with 8 MiB of L3 against the A78's 4 MiB, performance rises 10%, or power drops 30% at iso-performance. The A710 suits high frequencies, whereas the slides imply the A78 does not.

The A510 is ARM's most significant little-core change in years. First, decode width has grown from 2 to 3, with improved branch prediction. The A510 can be built as a dual-core complex or a single-core complex. The L2 TLB and the VPU are shared, and the VPU datapath can be 2x64-bit or 2x128-bit (presumably 128-bit in the single-core complex). AArch32 has been dropped; had the little core kept AArch32, power would have risen, so ARM cautiously retained AArch32 only on the A710.

In terms of energy efficiency, at low frequencies it may not even match the A55; only at higher frequencies does it pull ahead. But who runs a little core at high frequency? The A510's most distinctive feature is the complex option: a dual-core complex shares the L2 cache, L2 TLB, and VPU, while a single-core A510 has exclusive access to its own.

In the back end, the integer side has: 3 ALUs, 1 complex MAC, 1 integer divide unit, and 1 branch port, plus the LSU and a dedicated store unit. The vector side has a PALU and the shared VPU (crypto unit, VALU, VMAC, VMC; in the 128-bit configuration: 1 crypto, 1 VALU, 1 VMAC).

For memory access, there are 2 load / 1 store pipelines, 2x128 bits wide. Decode is 3-wide and in-order, with branch prediction; the 128-bit fetch pipeline can fetch 4 instructions per cycle. The VPU datapath can be 2x64-bit or 2x128-bit, and the L1d is decoupled from the MMU.

In the upper left corner is the modem part.

There isn’t much to say about the modem part, but compared to others, it’s quite impressive.

The comparison images are all, ah, cropped from the original shots. The CPU-to-modem ratio within each die is comparable, but a CPU-to-CPU comparison across them is not, because each figure is the area ratio of CPU to modem within its own die. Quite remarkable nonetheless. Note, though, that the SDM modem has to support millimeter wave while the MTK M80 apparently does not, which is a significant area consideration. Best just take a glance; a detailed interpretation requires funding, and I'm no professional, so make do with it.

Speaking of the X65, it has a large modem cache, reportedly shared with the ISP, though that is unconfirmed.

The lower left corner is the video (more accurately, streaming-media) decode block. Decoding here also goes big.LITTLE, with 2 large and 2 small decoder cores. I don't know what this buys, but it looks very, ah, professional. Video decoding is usually handled by dedicated blocks alongside the CPU/GPU, but this time an APU joins in for, ah, auxiliary computation.

This is the APU, in the lower right area. It is really an NPU, but MTK calls it an APU. You can see it occupies quite a large area. Per the slides, the APU offloads extra computation for the GPU, ISP, CPU, and the decode block. The yellow part is the 3.5-4 MiB APU cache. The design is 4 large plus 2 small NPU cores: 4 performance cores and 2 flexible cores. It appears to involve some RISC-V elements, though ARM is also possible.

Above the streaming-media decode block are some less identifiable units, among them the Imagiq 790, which is the ISP. It supports simultaneous processing of three cameras at 32 MP each with 18-bit HDR video, a throughput of up to 9 gigapixels per second, and sensors of up to 320 MP. Some of its data processing is handed to the APU. The ISP cores are directly visible in the red part.

Below, the MiraVision 790 is the display processing unit, which I cannot analyze in detail. The upper right corner has some unknown units, which I am not skilled enough to identify.

The lower right corner has the usual USB buffer, along with two more USB IOs for a total of three, one of them near the GPU. The memory interface is 4x16-bit LPDDR5X at up to 7500 Mbps, and that's it.

Finally, let's talk about TSMC N4. Ah, a single N4 wafer runs about $26,000-30,000; N5, after all, rose from $26,000 to $28,000. N4 is essentially a minor iteration of N5. TSMC's position is that it counts as a node, but it clearly isn't; then again, as long as TSMC changes the CPP/MMP even a little, it counts. TSMC's 5nm is a new node, and the family has five sub-nodes: N5, N5P, N4, N4P, and N4X.

N5: TSMC offers three libraries for N5: a 6T UHD library, a 7.5T HD library, and a 9T HP library. The 6T UHD (density) library has a cell height of 180 nm (6 x 30) at 137.6 Mtr/mm²; the 7.5T HD (performance) library is 225 nm (7.5 x 30) at 92.3 Mtr/mm²; the 9T HP is 270 nm (9 x 30). CPP is 48 nm, and MMP is 30 nm. The theoretical maximum density of N5 LPE, specifically the 6T UHD library, can reach 137.6 Mtr/mm² (predicted). The actual figure is...

The biggest issue with N5 is thermal density: at 1.8x the density, power only drops to 0.7x, which is very unfavorable for high-performance scenarios.

N5 introduces seven Vt options (SVTLL, SVT, LVTLL, LVT, uLVTLL, uLVT, eLVT). eLVT trades roughly 10% extra power for speed; thanks to via pillars and back-end metal optimization, the overall improvement reaches 35% (N5 HPC's uLVT versus N7's uLVT: 25% higher frequency at unchanged power).

N5P mainly reduces power, uses the same design rules, and is fully compatible with N5; the power reduction comes from FEOL and MOL optimizations.

N4: compared with N5, density barely changes; its roughly 7 Mtr/mm² gain comes just from shortening the MMP... For example, the X2 in the D9000 uses a 210 nm-tall library, while in the D9K all cores of the CPU subsystem are on N5's 210 nm uLVT.

N4 comes in three tiers, N4, N4P, and N4X, with the same three libraries: 6T UHD, 7.5T HD, and 9T HPC; the N4 HP library is shared with N5. N4/N6 is suited to low-frequency operation. The 6T UHD library has a cell height of 168 nm (6 x 28) at 146 Mtr/mm²; the 7.5T is 210 nm (7.5 x 28) at 97.8 Mtr/mm². CPP is 48 nm, and MMP is 28 nm, a 2 nm reduction... Going from 30 nm to 28 nm relies on tightening the metal-layer spacing and gate spacing, and on the metal-layer count...
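All the cell heights quoted above follow from one relation, height = track count x minimum metal pitch, which a few lines of arithmetic confirm:

```python
def cell_height_nm(tracks: float, mmp_nm: int) -> float:
    """Standard-cell height = library track count x minimum metal pitch."""
    return tracks * mmp_nm

# N5 libraries (MMP 30 nm) vs N4 libraries (MMP 28 nm), matching the text
assert cell_height_nm(6, 30) == 180    # N5 6T UHD
assert cell_height_nm(7.5, 30) == 225  # N5 7.5T HD
assert cell_height_nm(9, 30) == 270    # N5 9T HP
assert cell_height_nm(6, 28) == 168    # N4 6T UHD
assert cell_height_nm(7.5, 28) == 210  # N4 7.5T HD
print("all cell heights consistent")
```

This is why a mere 2 nm MMP shrink still tightens every library: the saving multiplies through the track count.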

The theoretical maximum density of N4 LPE, specifically the 6T UHD library, can reach 146 Mtr/mm².

N4P is just a transitional product, mainly intended to reduce mask count by using more EUV layers; six more layers, if I remember correctly.

N4X is the ultimate 5nm product, delivering high performance under high voltage while remaining compatible with the 5nm design kit... That said, one cannot judge a node by Mtr/mm² alone, since transistor-density definitions vary between manufacturers.

Moreover, raising voltage drives power up steeply, and at high fixed frequencies dynamic power dominates (a core behaves differently on different processes, and at the target frequency, the designed frequency will always differ from the realized one; I'll write about this in detail in a few days).
