Controleer NVLink® in Linux
Installeer Nvidia-stuurprogramma's door onze gids te volgen Installeer Nvidia-driver in Linux, alvorens de NVLink-ondersteuning in het besturingssysteem te controleren. Daarnaast moet je de CUDA-toolkit installeren om voorbeeldtoepassingen te compileren. In deze kleine gids hebben we enkele handige commando's verzameld die je kunt gebruiken.
Basis commando's
Controleer de fysieke topologie van je systeem. Dit commando toont alle GPU's en hun onderlinge verbinding:
nvidia-smi topo -mAls je de staat van de links wilt weergeven, voer dan het volgende commando uit:
nvidia-smi nvlink -sHet commando toont de snelheid van elke link of 
nvidia-smi nvlink -i 0 -cZonder deze optie, zal informatie over alle GPU-verbindingen worden weergegeven:
nvidia-smi nvlink -cInstalleer CUDA-voorbeelden
Een goede manier om de doorvoer te testen, is door app-voorbeelden van NVIDIA® te gebruiken. De broncode van deze voorbeelden staat op GitHub en is voor iedereen toegankelijk. Ga door met het klonen van de repository naar de server:
git clone https://github.com/NVIDIA/cuda-samples.gitVerander directory naar de gedownloade repository:
cd cuda-samplesSelecteer de juiste branch via tag volgens de geïnstalleerde CUDA-versie. Bijvoorbeeld, als je CUDA® 12.2 hebt:
git checkout tags/v12.2Installeer enkele vereisten die zullen worden gebruikt in het compileerproces:
sudo apt -y install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-devNu kun je elk voorbeeld compileren. Ga naar de Voorbeelden directory:
cd VoorbeeldenKorte blik op de inhoud:
ls -la
total 40
    drwxrwxr-x 10 usergpu usergpu 4096 Sep 13 14:54 .
    drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 ..
    drwxrwxr-x 55 usergpu usergpu 4096 Sep 13 14:54 0_Introduction
    drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 1_Utilities
    drwxrwxr-x 36 usergpu usergpu 4096 Sep 13 14:54 2_Concepts_and_Techniques
    drwxrwxr-x 25 usergpu usergpu 4096 Sep 13 14:54 3_CUDA_Features
    drwxrwxr-x 41 usergpu usergpu 4096 Sep 13 14:54 4_CUDA_Libraries
    drwxrwxr-x 52 usergpu usergpu 4096 Sep 13 14:54 5_Domain_Specific
    drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 6_Performance
    drwxrwxr-x 11 usergpu usergpu 4096 Sep 13 14:54 7_libNVVM
Laten we de GPU-bandbreedte testen. Verander de directory:
cd 1_Utilities/bandwidthTestCompileer de app:
makeTests uitvoeren
Begin met testen door de app uit te voeren met behulp van de naam:
./bandwidthTestDe uitvoer kan er zo uitzien:
[CUDA Bandwidth Test] - Starting...
    Running on...
    
     Device 0: NVIDIA RTX A6000
     Quick Mode
    
     Host to Device Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)        Bandwidth(GB/s)
       32000000                     6.0
    
     Device to Host Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)        Bandwidth(GB/s)
       32000000                     6.6
    
     Device to Device Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)        Bandwidth(GB/s)
       32000000                     569.2
    
    Result = PASS
    
    NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Als alternatief kunt u de p2pBandwidthLatencyTest compileren en starten:
cd 5_Domain_Specific/p2pBandwidthLatencyTestmake./p2pBandwidthLatencyTestDeze app zal je gedetailleerde informatie tonen over de bandbreedte van je GPU in P2P-modus. Voorbeeld uitvoer:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
    Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
    Device: 1, NVIDIA RTX A6000, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
    Device=0 CAN Access Peer Device=1
    Device=1 CAN Access Peer Device=0
    
    ***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
    So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
    
    P2P Connectivity Matrix
         D\D     0     1
         0       1     1
         1       1     1
    Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
       D\D     0      1 
         0 590.51   6.04 
         1   6.02 590.51 
    Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
       D\D     0      1 
         0 589.40  52.75 
         1  52.88 592.53 
    Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
       D\D     0      1 
         0 593.88   8.55 
         1   8.55 595.32 
    Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
       D\D     0      1 
         0 595.69 101.68 
         1 101.97 595.69 
    P2P=Disabled Latency Matrix (us)
       GPU     0      1 
         0   1.61  28.66 
         1  18.49   1.53 
    
       CPU     0      1 
         0   2.27   6.06 
         1   6.12   2.23 
    P2P=Enabled Latency (P2P Writes) Matrix (us)
       GPU     0      1 
         0   1.62   1.27 
         1   1.17   1.55 
    
       CPU     0      1 
         0   2.27   1.91 
         1   1.90   2.34 
    
    NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
In het geval van een configuratie met meerdere GPU's, kan het er zo uitzien:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
    Device: 0, NVIDIA H100 PCIe, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
    Device: 1, NVIDIA H100 PCIe, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
    Device: 2, NVIDIA H100 PCIe, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
    Device: 3, NVIDIA H100 PCIe, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
    Device: 4, NVIDIA H100 PCIe, pciBusID: b0, pciDeviceID: 0, pciDomainID:0
    Device: 5, NVIDIA H100 PCIe, pciBusID: b1, pciDeviceID: 0, pciDomainID:0
    Device: 6, NVIDIA H100 PCIe, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
    Device: 7, NVIDIA H100 PCIe, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
    Device=0 CAN Access Peer Device=1
    Device=0 CAN Access Peer Device=2
    Device=0 CAN Access Peer Device=3
    Device=0 CAN Access Peer Device=4
    Device=0 CAN Access Peer Device=5
    Device=0 CAN Access Peer Device=6
    Device=0 CAN Access Peer Device=7
    Device=1 CAN Access Peer Device=0
    Device=1 CAN Access Peer Device=2
    Device=1 CAN Access Peer Device=3
    Device=1 CAN Access Peer Device=4
    Device=1 CAN Access Peer Device=5
    Device=1 CAN Access Peer Device=6
    Device=1 CAN Access Peer Device=7
    Device=2 CAN Access Peer Device=0
    Device=2 CAN Access Peer Device=1
    Device=2 CAN Access Peer Device=3
    Device=2 CAN Access Peer Device=4
    Device=2 CAN Access Peer Device=5
    Device=2 CAN Access Peer Device=6
    Device=2 CAN Access Peer Device=7
    Device=3 CAN Access Peer Device=0
    Device=3 CAN Access Peer Device=1
    Device=3 CAN Access Peer Device=2
    Device=3 CAN Access Peer Device=4
    Device=3 CAN Access Peer Device=5
    Device=3 CAN Access Peer Device=6
    Device=3 CAN Access Peer Device=7
    Device=4 CAN Access Peer Device=0
    Device=4 CAN Access Peer Device=1
    Device=4 CAN Access Peer Device=2
    Device=4 CAN Access Peer Device=3
    Device=4 CAN Access Peer Device=5
    Device=4 CAN Access Peer Device=6
    Device=4 CAN Access Peer Device=7
    Device=5 CAN Access Peer Device=0
    Device=5 CAN Access Peer Device=1
    Device=5 CAN Access Peer Device=2
    Device=5 CAN Access Peer Device=3
    Device=5 CAN Access Peer Device=4
    Device=5 CAN Access Peer Device=6
    Device=5 CAN Access Peer Device=7
    Device=6 CAN Access Peer Device=0
    Device=6 CAN Access Peer Device=1
    Device=6 CAN Access Peer Device=2
    Device=6 CAN Access Peer Device=3
    Device=6 CAN Access Peer Device=4
    Device=6 CAN Access Peer Device=5
    Device=6 CAN Access Peer Device=7
    Device=7 CAN Access Peer Device=0
    Device=7 CAN Access Peer Device=1
    Device=7 CAN Access Peer Device=2
    Device=7 CAN Access Peer Device=3
    Device=7 CAN Access Peer Device=4
    Device=7 CAN Access Peer Device=5
    Device=7 CAN Access Peer Device=6
    
    ***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
    So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
    
    P2P Connectivity Matrix
         D\D     0     1     2     3     4     5     6     7
         0       1     1     1     1     1     1     1     1
         1       1     1     1     1     1     1     1     1
         2       1     1     1     1     1     1     1     1
         3       1     1     1     1     1     1     1     1
         4       1     1     1     1     1     1     1     1
         5       1     1     1     1     1     1     1     1
         6       1     1     1     1     1     1     1     1
         7       1     1     1     1     1     1     1     1
    Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
       D\D     0      1      2      3      4      5      6      7
         0 1629.83  38.43  38.39  37.66  38.51  38.19  38.09  37.92
         1  38.22 1637.04  35.52  35.59  38.15  38.38  38.08  37.55
         2  37.76  35.62 1635.32  35.45  38.59  38.21  38.77  37.94
         3  37.88  35.50  35.60 1639.45  38.49  37.43  38.72  38.49
         4  36.87  37.03  37.00  36.90 1635.86  34.48  38.06  37.22
         5  37.27  37.06  36.92  37.06  34.51 1636.18  37.80  37.50
         6  37.05  36.95  37.45  37.15  37.51  37.96 1630.79  34.94
         7  36.98  36.91  36.95  36.87  37.83  38.02  34.73 1633.35
    
    Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
       D\D     0      1      2      3      4      5      6      7
         0 1635.22  34.42  33.84 256.54  27.74  28.68  28.00  28.41
         1  34.66 1636.93 256.16  17.97  71.58  71.64  71.65  71.61
         2  34.78 256.81 1655.79  30.29  70.34  70.42  70.37  70.33
         3 256.65  30.65  70.67 1654.53  70.66  70.69  70.70  70.73
         4  28.26  30.80  69.99  70.04 1630.36 256.45  69.97  70.02
         5  28.10  31.08  71.60  71.59 256.47 1654.31  71.62  71.54
         6  28.37  30.96  70.99  70.93  70.91  70.96 1632.12 257.11
         7  27.66  30.87  70.30  70.40  70.30  70.39 256.72 1649.57
    
    Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
       D\D     0      1      2      3      4      5      6      7
         0 1673.16  51.88  51.95  51.76  51.61  51.44  52.07  51.30
         1  52.04 1676.28  39.06  39.21  51.62  51.62  51.98  51.36
         2  52.11  39.27 1674.62  39.16  51.42  51.21  51.72  51.71
         3  51.74  39.70  39.22 1672.77  51.50  51.27  51.70  51.24
         4  52.14  52.41  51.38  52.14 1671.54  38.81  46.76  45.72
         5  51.82  52.65  52.30  51.67  38.57 1676.33  46.90  45.96
         6  52.92  52.66  53.02  52.68  46.23  46.31 1672.74  38.91
         7  52.61  52.74  52.79  52.64  45.90  46.35  39.07 1673.16
    
    Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
       D\D     0      1      2      3      4      5      6      7
         0 1670.31  52.41 140.69 508.68 139.85 141.88 141.71 140.55
         1 141.69 1673.30 509.23 141.22 139.91 143.28 141.71 140.61
         2 140.64 508.90 1669.67 140.68 139.93 140.61 140.67 140.50
         3 509.14 141.36 140.61 1682.65 139.93 141.45 141.45 140.67
         4 140.01 140.03 140.07 139.94 1670.68 508.37 140.01 139.90
         5 141.92 143.17 140.50 141.19 508.92 1670.73 141.72 140.52
         6 141.72 141.72 140.60 141.31 139.66 141.85 1671.51 510.03
         7 140.62 140.71 140.66 140.63 140.02 140.72 509.77 1668.28
    
    P2P=Disabled Latency Matrix (us)
       GPU     0      1      2      3      4      5      6      7
         0   2.35  17.23  17.13  13.38  12.86  21.15  21.39  21.12
         1  17.54   2.32  12.95  13.78  21.05  21.23  21.31  21.37
         2  16.85  14.83   2.35  16.07  12.71  12.80  21.23  12.79
         3  14.98  16.06  14.64   2.41  13.35  12.81  13.60  21.36
         4  21.31  21.31  20.49  21.32   2.62  12.33  12.66  12.98
         5  20.36  21.22  20.17  12.79  16.74   2.58  12.41  12.93
         6  17.51  12.84  12.79  12.70  17.63  18.78   2.36  13.69
         7  21.23  12.71  19.41  21.09  14.69  13.79  15.52   2.59
    
    CPU     0      1      2      3      4      5      6      7
         0   1.73   4.99   4.88   4.85   5.17   5.18   5.18   5.33
         1   5.04   1.71   4.74   4.82   5.04   5.14   5.10   5.19
         2   4.86   4.75   1.66   4.78   5.08   5.09   5.11   5.17
         3   4.80   4.72   4.73   1.63   5.09   5.11   5.06   5.10
         4   5.07   5.00   5.03   4.96   1.77   5.33   5.34   5.38
         5   5.12   4.94   5.00   4.96   5.31   1.77   5.38   5.41
         6   5.09   4.97   5.09   5.01   5.35   5.39   1.80   5.42
         7   5.18   5.09   5.02   5.00   5.39   5.40   5.40   1.76
    
    P2P=Enabled Latency (P2P Writes) Matrix (us)
       GPU     0      1      2      3      4      5      6      7
         0   2.33   2.15   2.11   2.76   2.07   2.11   2.07   2.12
         1   2.07   2.30   2.77   2.07   2.12   2.06   2.06   2.10
         2   2.09   2.75   2.34   2.12   2.09   2.08   2.08   2.12
         3   2.78   2.10   2.13   2.40   2.13   2.14   2.14   2.13
         4   2.18   2.23   2.23   2.17   2.59   2.82   2.15   2.16
         5   2.15   2.17   2.15   2.20   2.82   2.56   2.17   2.16
         6   2.13   2.18   2.21   2.17   2.15   2.17   2.36   2.85
         7   2.19   2.21   2.19   2.22   2.19   2.19   2.86   2.61
    
       CPU     0      1      2      3      4      5      6      7
         0   1.78   1.32   1.29   1.40   1.33   1.34   1.34   1.33
         1   1.32   1.69   1.34   1.35   1.35   1.34   1.40   1.33
         2   1.38   1.37   1.73   1.36   1.36   1.35   1.35   1.34
         3   1.34   1.42   1.35   1.66   1.34   1.34   1.35   1.33
         4   1.53   1.41   1.40   1.40   1.77   1.43   1.48   1.47
         5   1.46   1.43   1.43   1.42   1.47   1.84   1.51   1.56
         6   1.53   1.45   1.45   1.45   1.45   1.44   1.85   1.47
         7   1.54   1.47   1.47   1.47   1.45   1.44   1.50   1.84
    
    NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Zie ook:
Bijgewerkt: 28.03.2025
Gepubliceerd: 06.05.2024