
High-Performance Computing Repository

This repository contains implementations of parallel sorting algorithms and matrix multiplication optimizations using various high-performance computing techniques, including CUDA and OpenMP.

Repository Structure

.
├── sort algorithm/                # Sorting algorithm implementations
│   ├── bitonic Sort/              # Bitonic sort implementation
│   │   └── CUDA/                  # CUDA-based bitonic sort
│   ├── quicksort/                 # Quicksort implementations
│   │   └── OpenMP/               # OpenMP-based quicksort
│   ├── dataset/                   # Datasets with different sizes
│   ├── generate_data              # Executable for data generation
│   ├── generate_data.c            # Source for dataset generation
│   └── Makefile                   # Build configuration
│
└── matrix multiplication/         # Matrix multiplication implementations
    ├── CUDA/                      # CUDA-based matrix multiplication
    ├── OpenMP/                    # OpenMP-based matrix multiplication
    ├── dataset/                   # Matrix datasets
    ├── generate_matrix            # Executable for matrix generation
    ├── generate_matrix.c          # Source for matrix generation
    └── makefile                   # Build configuration

Sorting Algorithms

The repository implements two different sorting algorithms optimized for parallel execution:

Bitonic Sort (CUDA)

Bitonic sort is a comparison-based sorting algorithm that is well-suited for parallel implementation. The CUDA implementation uses GPU parallelism to achieve high performance, especially for large datasets.

Key features:

  • GPU-accelerated implementation using CUDA
  • Efficient for large datasets (up to hundreds of millions of elements)
  • In-place sorting to minimize memory overhead
  • Scales well with available GPU resources

Quicksort (OpenMP)

The repository includes a parallel implementation of the classic quicksort algorithm using OpenMP.

Key features:

  • CPU-based parallelism using OpenMP tasks
  • Hybrid approach with sequential sorting for small partitions
  • Configurable threshold (THRESHOLD) to switch between parallel and sequential execution
  • Good performance scaling with multiple CPU cores

Sorting Dataset Generation

The repository includes a data generator (generate_data.c) that creates datasets of various sizes for testing the sorting algorithms. The generated datasets consist of random floating-point numbers.

Features:

  • Supports a range of dataset sizes (mostly powers of two; see the list below)
  • Creates directory structure for organized data management
  • Generates reproducible random data

Available dataset sizes (number of elements):

  • 1024 (1K)
  • 32768 (32K)
  • 1048576 (1M)
  • 16777216 (16M)
  • 67108864 (64M)
  • 134217728 (128M)
  • 268435456 (256M)
  • 402653184 (384M)

Matrix Multiplication

The repository provides optimized implementations of matrix multiplication using parallel computing techniques:

CUDA Matrix Multiplication

The CUDA implementation leverages GPU cores to perform matrix multiplication in parallel.

Key features:

  • 2D thread organization for efficient computation
  • Each thread computes one element of the result matrix
  • Uses 16x16 thread blocks for optimal performance
  • Handles large matrix sizes (tested with 8192x8192)

OpenMP Matrix Multiplication

The OpenMP implementation uses multiple CPU cores to parallelize the matrix multiplication operation.

Key features:

  • Configurable number of threads (NUM_THREADS)
  • Uses OpenMP directives to parallelize the outermost loop
  • Simple and effective parallelization approach
  • Support for different matrix sizes

Matrix Data Generation

The repository includes a tool (generate_matrix.c) for creating random matrices for testing purposes.

Features:

  • Generates two random matrices (A and B)
  • Supports different matrix sizes (configurable via the N macro)
  • Saves matrices in binary format for efficient loading

Building and Running

Sorting Algorithms

  1. Generate the dataset:

    cd "sort algorithm"
    make generate_data
    ./generate_data [size]
    

    Where [size] is one of the supported element counts (e.g., 1024 or 32768)

  2. Build and run the CUDA bitonic sort:

    cd "sort algorithm/bitonic Sort/CUDA"
    make
    ./bitonic_sort
    
  3. Build and run the OpenMP quicksort:

    cd "sort algorithm/quicksort/OpenMP"
    make
    ./quicksort
    

Matrix Multiplication

  1. Generate the matrices:

    cd "matrix multiplication"
    make -f makefile
    ./generate_matrix
    
  2. Build and run the CUDA matrix multiplication:

    cd "matrix multiplication/CUDA"
    make -f makefile
    ./matrix_mul
    
  3. Build and run the OpenMP matrix multiplication:

    cd "matrix multiplication/OpenMP"
    make
    ./matrix_mul
    

Performance Considerations

Sorting Performance

  • The CUDA bitonic sort performs best for very large datasets where the GPU parallelism can be fully utilized.
  • The OpenMP quicksort performs well on multi-core CPUs and is more efficient for smaller datasets.
  • The THRESHOLD parameter in the OpenMP quicksort can be tuned based on the hardware to optimize performance.

Matrix Multiplication Performance

  • The CUDA implementation performs best for large matrices where the GPU parallelism can be fully utilized.
  • The OpenMP implementation works well on multi-core CPUs but may not scale as well as the GPU version for very large matrices.
  • The block size in the CUDA implementation (16x16) is chosen for good performance on most GPUs but can be tuned for specific hardware.

Requirements

Hardware

  • For CUDA implementations: NVIDIA GPU with CUDA support
  • For OpenMP implementations: Multi-core CPU

Software

  • CUDA Toolkit (for CUDA implementations)
  • GCC with OpenMP support (for OpenMP implementations)
  • Make

Notes

  • Matrix sizes and array lengths are generally defined as constants in the source files. Modify these constants to experiment with different sizes.
  • Both implementations read input data from files in the dataset directory.
  • The CUDA implementations require a compatible NVIDIA GPU.
  • The OpenMP implementations will work on any multi-core CPU.