# High-Performance Computing Repository
This repository contains implementations of parallel sorting algorithms and matrix multiplication optimizations using various high-performance computing techniques, including CUDA and OpenMP.
## Repository Structure
```
.
├── sort algorithm/            # Sorting algorithm implementations
│   ├── bitonic Sort/          # Bitonic sort implementation
│   │   └── CUDA/              # CUDA-based bitonic sort
│   ├── quicksort/             # Quicksort implementations
│   │   └── OpenMP/            # OpenMP-based quicksort
│   ├── dataset/               # Datasets with different sizes
│   ├── generate_data          # Executable for data generation
│   ├── generate_data.c        # Source for dataset generation
│   └── Makefile               # Build configuration
└── matrix multiplication/     # Matrix multiplication implementations
    ├── CUDA/                  # CUDA-based matrix multiplication
    ├── OpenMP/                # OpenMP-based matrix multiplication
    ├── dataset/               # Matrix datasets
    ├── generate_matrix        # Executable for matrix generation
    ├── generate_matrix.c      # Source for matrix generation
    └── makefile               # Build configuration
```
## Sorting Algorithms
The repository implements two different sorting algorithms optimized for parallel execution:
### Bitonic Sort (CUDA)
Bitonic sort is a comparison-based sorting algorithm that is well-suited for parallel implementation. The CUDA implementation uses GPU parallelism to achieve high performance, especially for large datasets.
Key features:
- GPU-accelerated implementation using CUDA
- Efficient for large datasets (up to hundreds of millions of elements)
- In-place sorting to minimize memory overhead
- Scales well with available GPU resources
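
The compare-exchange network behind this design can be sketched in plain C (a sequential host-side sketch for illustration, not the repository's CUDA kernel). Because every compare-exchange in the innermost loop is data-independent, the CUDA version can assign one GPU thread to each pair:

```c
#include <stddef.h>

/* Full bitonic sorting network over n elements (n must be a power of 2).
   Each inner iteration is an independent compare-exchange, which is what
   makes the algorithm map so well onto GPU threads. */
static void bitonic_sort(float *a, size_t n) {
    for (size_t k = 2; k <= n; k <<= 1) {          /* bitonic sequence size */
        for (size_t j = k >> 1; j > 0; j >>= 1) {  /* compare-exchange distance */
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner > i) {
                    int ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending) {
                        float t = a[i];
                        a[i] = a[partner];
                        a[partner] = t;
                    }
                }
            }
        }
    }
}
```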
### Quicksort (OpenMP)
The repository includes a parallel implementation of the classic quicksort algorithm using OpenMP.
Key features:
- CPU-based parallelism using OpenMP tasks
- Hybrid approach with sequential sorting for small partitions
- Configurable threshold (THRESHOLD) to switch between parallel and sequential execution
- Good performance scaling with multiple CPU cores
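
The task-based pattern above can be sketched roughly as follows (illustrative only; the actual partitioning code and THRESHOLD value live in the OpenMP source):

```c
#include <stddef.h>

#define THRESHOLD 16  /* below this size, sort sequentially (tune per machine) */

/* Sequential fallback used for small partitions. */
static void insertion_sort(float *a, long n) {
    for (long i = 1; i < n; i++) {
        float key = a[i];
        long j = i;
        while (j > 0 && a[j - 1] > key) { a[j] = a[j - 1]; j--; }
        a[j] = key;
    }
}

static void quicksort_tasks(float *a, long n) {
    if (n <= THRESHOLD) { insertion_sort(a, n); return; }
    /* Hoare-style partition around the middle element. */
    float pivot = a[n / 2];
    long i = 0, j = n - 1;
    while (i <= j) {
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) {
            float t = a[i]; a[i] = a[j]; a[j] = t;
            i++; j--;
        }
    }
    /* Each recursive call becomes an independent OpenMP task. */
    #pragma omp task shared(a)
    quicksort_tasks(a, j + 1);
    #pragma omp task shared(a)
    quicksort_tasks(a + i, n - i);
    #pragma omp taskwait
}

static void parallel_quicksort(float *a, long n) {
    #pragma omp parallel
    #pragma omp single nowait
    quicksort_tasks(a, n);
}
```

Compiled without `-fopenmp`, the pragmas are ignored and the same code runs sequentially, which is a convenient way to check correctness before benchmarking.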
### Sorting Dataset Generation
The repository includes a data generator (`generate_data.c`) that creates datasets of various sizes for testing the sorting algorithms. The generated datasets consist of random floating-point numbers.
Features:
- Supports dataset sizes from 1K up to 384M elements
- Creates directory structure for organized data management
- Generates reproducible random data
Available dataset sizes (number of elements):
- 1024 (1K)
- 32768 (32K)
- 1048576 (1M)
- 16777216 (16M)
- 67108864 (64M)
- 134217728 (128M)
- 268435456 (256M)
- 402653184 (384M)
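
A minimal stand-in for the generator's core idea might look like this (the real on-disk format and directory layout are defined in `generate_data.c`; the function name, file path, and seed here are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes `count` random floats in [0, 1] to a binary file. A fixed seed
   makes the generated data reproducible across runs. */
static int write_random_floats(const char *path, size_t count) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    srand(42);  /* fixed seed => reproducible data */
    for (size_t i = 0; i < count; i++) {
        float v = (float)rand() / (float)RAND_MAX;
        fwrite(&v, sizeof v, 1, f);
    }
    fclose(f);
    return 0;
}
```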
## Matrix Multiplication
The repository provides optimized implementations of matrix multiplication using parallel computing techniques:
### CUDA Matrix Multiplication
The CUDA implementation leverages GPU cores to perform matrix multiplication in parallel.
Key features:
- 2D thread organization for efficient computation
- Each thread computes one element of the result matrix
- Uses 16x16 thread blocks for optimal performance
- Handles large matrix sizes (tested with 8192x8192)
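
The one-thread-one-element mapping can be mirrored in plain C (a hypothetical sketch, not the repository's kernel; on the GPU, `row` and `col` come from the block and thread indices as noted in the comment):

```c
#define BLOCK 16  /* matches the 16x16 thread blocks described above */

/* What a single GPU thread computes: one element C[row][col] as the dot
   product of row `row` of A and column `col` of B (row-major n x n matrices).
   On the device, row = blockIdx.y * BLOCK + threadIdx.y and
   col = blockIdx.x * BLOCK + threadIdx.x. */
static float matmul_element(const float *A, const float *B, int n,
                            int row, int col) {
    float sum = 0.0f;
    for (int k = 0; k < n; k++)
        sum += A[row * n + k] * B[k * n + col];
    return sum;
}
```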
### OpenMP Matrix Multiplication
The OpenMP implementation uses multiple CPU cores to parallelize the matrix multiplication operation.
Key features:
- Configurable number of threads (NUM_THREADS)
- Uses OpenMP directives to parallelize the outermost loop
- Simple and effective parallelization approach
- Support for different matrix sizes
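
In outline (a sketch under the assumptions above, not the repository's exact source; the real NUM_THREADS value is defined in the OpenMP code):

```c
#define NUM_THREADS 4  /* illustrative; tune to the available cores */

/* Parallelizes the outermost loop: each thread computes a contiguous slice
   of rows of C (row-major n x n matrices). Compiled without OpenMP, the
   pragma is ignored and the loop runs sequentially. */
static void matmul_omp(const float *A, const float *B, float *C, int n) {
    #pragma omp parallel for num_threads(NUM_THREADS)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```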
### Matrix Data Generation
The repository includes a tool (`generate_matrix.c`) for creating random matrices for testing purposes.
Features:
- Generates two random matrices (A and B)
- Supports different matrix sizes (configurable via the N macro)
- Saves matrices in binary format for efficient loading
## Building and Running
### Sorting Algorithms
1. Generate the dataset:
```
cd "sort algorithm"
make generate_data
./generate_data [size]
```
Where `[size]` is one of the supported element counts listed above (e.g., 1024, 32768)
2. Build and run the CUDA bitonic sort:
```
cd "sort algorithm/bitonic Sort/CUDA"
make
./bitonic_sort
```
3. Build and run the OpenMP quicksort:
```
cd "sort algorithm/quicksort/OpenMP"
make
./quicksort
```
### Matrix Multiplication
1. Generate the matrices:
```
cd "matrix multiplication"
make
./generate_matrix
```
2. Build and run the CUDA matrix multiplication:
```
cd "matrix multiplication/CUDA"
make
./matrix_mul
```
3. Build and run the OpenMP matrix multiplication:
```
cd "matrix multiplication/OpenMP"
make
./matrix_mul
```
## Performance Considerations
### Sorting Performance
- The CUDA bitonic sort performs best for very large datasets where the GPU parallelism can be fully utilized.
- The OpenMP quicksort performs well on multi-core CPUs and is more efficient for smaller datasets.
- The THRESHOLD parameter in the OpenMP quicksort can be tuned based on the hardware to optimize performance.
### Matrix Multiplication Performance
- The CUDA implementation performs best for large matrices where the GPU parallelism can be fully utilized.
- The OpenMP implementation works well on multi-core CPUs but may not scale as well as the GPU version for very large matrices.
- The block size in the CUDA implementation (16x16) is chosen for good performance on most GPUs but can be tuned for specific hardware.
## Requirements
### Hardware
- For CUDA implementations: NVIDIA GPU with CUDA support
- For OpenMP implementations: Multi-core CPU
### Software
- CUDA Toolkit (for CUDA implementations)
- GCC with OpenMP support (for OpenMP implementations)
- Make
## Notes
- Matrix sizes and array lengths are generally defined as constants in the source files. Modify these constants to experiment with different sizes.
- Both implementations read input data from files in the dataset directory.
- The CUDA implementations require a compatible NVIDIA GPU.
- The OpenMP implementations will work on any multi-core CPU.