# High-Performance Computing Repository
This repository contains implementations of parallel sorting algorithms and matrix multiplication optimizations using various high-performance computing techniques, including CUDA and OpenMP.
## Repository Structure

```
.
├── sort algorithm/              # Sorting algorithm implementations
│   ├── bitonic Sort/            # Bitonic sort implementation
│   │   └── CUDA/                # CUDA-based bitonic sort
│   ├── quicksort/               # Quicksort implementations
│   │   └── OpenMP/              # OpenMP-based quicksort
│   ├── dataset/                 # Datasets with different sizes
│   ├── generate_data            # Executable for data generation
│   ├── generate_data.c          # Source for dataset generation
│   └── Makefile                 # Build configuration
│
└── matrix multiplication/       # Matrix multiplication implementations
    ├── CUDA/                    # CUDA-based matrix multiplication
    ├── OpenMP/                  # OpenMP-based matrix multiplication
    ├── dataset/                 # Matrix datasets
    ├── generate_matrix          # Executable for matrix generation
    ├── generate_matrix.c        # Source for matrix generation
    └── makefile                 # Build configuration
```
## Sorting Algorithms
The repository implements two different sorting algorithms optimized for parallel execution:
### Bitonic Sort (CUDA)
Bitonic sort is a comparison-based sorting algorithm that is well-suited for parallel implementation. The CUDA implementation uses GPU parallelism to achieve high performance, especially for large datasets.
Key features:
- GPU-accelerated implementation using CUDA
- Efficient for large datasets (up to hundreds of millions of elements)
- In-place sorting to minimize memory overhead
- Scales well with available GPU resources
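
For reference, the sketch below shows the compare-exchange kernel and host-side stage loop that a CUDA bitonic sort of this kind typically uses. The kernel name, block size, and driver function are illustrative assumptions, not the repository's actual code.

```cuda
// Illustrative compare-exchange kernel for one bitonic stage.
// j is the distance between compared elements; k is the size of the
// bitonic subsequence currently being merged.
__global__ void bitonic_step(float *data, int j, int k)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int partner = i ^ j;            // element to compare against

    if (partner > i) {                       // each pair is handled exactly once
        bool ascending = ((i & k) == 0);     // sort direction for this subsequence
        if ((ascending && data[i] > data[partner]) ||
            (!ascending && data[i] < data[partner])) {
            float tmp = data[i];             // in-place swap, no extra buffer
            data[i] = data[partner];
            data[partner] = tmp;
        }
    }
}

// Host-side driver (sketch): n must be a power of two and data already on the GPU.
void bitonic_sort_device(float *d_data, int n)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    for (int k = 2; k <= n; k <<= 1)          // size of subsequences being merged
        for (int j = k >> 1; j > 0; j >>= 1)  // comparison distance within each merge
            bitonic_step<<<grid, block>>>(d_data, j, k);
    cudaDeviceSynchronize();
}
```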
### Quicksort (OpenMP)
The repository includes a parallel implementation of the classic quicksort algorithm using OpenMP.
Key features:
- CPU-based parallelism using OpenMP tasks
- Hybrid approach with sequential sorting for small partitions
- Configurable threshold (`THRESHOLD`) to switch between parallel and sequential execution
- Good performance scaling with multiple CPU cores
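
A minimal sketch of this task-based pattern is shown below. The partition scheme, the comparator, and the `THRESHOLD` value are illustrative assumptions rather than a copy of the repository's quicksort.

```c
#include <omp.h>
#include <stdlib.h>

#define THRESHOLD 1000   /* assumed cutoff: below this size, sort sequentially */

static int cmp_float(const void *a, const void *b)
{
    float x = *(const float *)a, y = *(const float *)b;
    return (x > y) - (x < y);
}

static void quicksort_tasks(float *a, int lo, int hi)
{
    if (hi - lo + 1 <= THRESHOLD) {
        /* small partition: a sequential sort avoids task overhead */
        qsort(a + lo, (size_t)(hi - lo + 1), sizeof(float), cmp_float);
        return;
    }

    /* Lomuto partition around the last element (illustrative choice) */
    float pivot = a[hi];
    int i = lo - 1;
    for (int j = lo; j < hi; j++)
        if (a[j] <= pivot) {
            i++;
            float t = a[i]; a[i] = a[j]; a[j] = t;
        }
    float t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
    int p = i + 1;

    /* each half becomes an independent OpenMP task */
    #pragma omp task shared(a)
    quicksort_tasks(a, lo, p - 1);
    #pragma omp task shared(a)
    quicksort_tasks(a, p + 1, hi);
    #pragma omp taskwait
}

void parallel_quicksort(float *a, int n)
{
    /* one thread seeds the recursion; all threads in the pool execute tasks */
    #pragma omp parallel
    #pragma omp single nowait
    quicksort_tasks(a, 0, n - 1);
}
```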
### Sorting Dataset Generation
The repository includes a data generator (`generate_data.c`) that creates datasets of various sizes for testing the sorting algorithms. The generated datasets consist of random floating-point numbers.
Features:
- Supports various dataset sizes (powers of 2)
- Creates directory structure for organized data management
- Generates reproducible random data
Available dataset sizes (number of elements):
- 1024 (1K)
- 32768 (32K)
- 1048576 (1M)
- 16777216 (16M)
- 67108864 (64M)
- 134217728 (128M)
- 268435456 (256M)
- 402653184 (384M)
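
As a rough illustration of how such a generator can work, the sketch below writes a requested number of uniformly distributed random floats to a binary file, using a fixed seed so the data is reproducible. The output path, file format, and command-line handling are assumptions, not a copy of `generate_data.c`.

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes `count` random floats in [0, 1) to a binary file (illustrative). */
static int write_random_floats(const char *path, size_t count, unsigned int seed)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;

    srand(seed);                                  /* fixed seed -> reproducible data */
    for (size_t i = 0; i < count; i++) {
        float v = (float)rand() / ((float)RAND_MAX + 1.0f);
        fwrite(&v, sizeof(float), 1, f);
    }
    fclose(f);
    return 0;
}

int main(int argc, char **argv)
{
    /* dataset size taken from the command line, defaulting to 1024 */
    size_t n = (argc > 1) ? strtoull(argv[1], NULL, 10) : 1024;
    return write_random_floats("dataset/data.bin", n, 42) ? 1 : 0;   /* path is a placeholder */
}
```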
## Matrix Multiplication
The repository provides optimized implementations of matrix multiplication using parallel computing techniques:
### CUDA Matrix Multiplication
The CUDA implementation leverages GPU cores to perform matrix multiplication in parallel.
Key features:
- 2D thread organization for efficient computation
- Each thread computes one element of the result matrix
- Uses 16x16 thread blocks for optimal performance
- Handles large matrix sizes (tested with 8192x8192)
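
The kernel below is a minimal sketch of the one-thread-per-output-element scheme with 16x16 thread blocks described above, assuming square row-major matrices; the function names are illustrative, not the repository's exact code.

```cuda
#define TILE 16   /* matches the 16x16 thread-block size described above */

// Illustrative kernel: each thread computes one element of C = A * B
// for square n x n matrices stored in row-major order.
__global__ void matmul_naive(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; k++)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

// Launch configuration for the 16x16 blocks; the grid covers the whole matrix.
void launch_matmul(const float *dA, const float *dB, float *dC, int n)
{
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_naive<<<grid, block>>>(dA, dB, dC, n);
}
```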
### OpenMP Matrix Multiplication
The OpenMP implementation uses multiple CPU cores to parallelize the matrix multiplication operation.
Key features:
- Configurable number of threads (`NUM_THREADS`)
- Uses OpenMP directives to parallelize the outermost loop
- Simple and effective parallelization approach
- Support for different matrix sizes
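
A minimal sketch of parallelizing the outermost loop with OpenMP, under the same row-major assumption as above; `NUM_THREADS` here is an illustrative placeholder for the value set in the source.

```c
#include <omp.h>

#define NUM_THREADS 8   /* illustrative; the actual value is defined in the source file */

/* Each thread gets a block of rows of C; the inner loops stay sequential. */
void matmul_omp(const float *A, const float *B, float *C, int n)
{
    #pragma omp parallel for num_threads(NUM_THREADS)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}
```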
### Matrix Data Generation
The repository includes a tool (`generate_matrix.c`) for creating random matrices for testing purposes.
Features:
- Generates two random matrices (A and B)
- Supports different matrix sizes (configurable via the `N` macro)
- Saves matrices in binary format for efficient loading
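
A rough sketch of such a generator, assuming N x N float matrices written as raw binary; the `N` value, file names, and seed are placeholders rather than the repository's actual settings.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1024   /* illustrative size; the real generator defines its own N macro */

/* Fills an N x N matrix with random floats and writes it as raw binary. */
static int write_matrix(const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    for (size_t i = 0; i < (size_t)N * N; i++) {
        float v = (float)rand() / ((float)RAND_MAX + 1.0f);
        fwrite(&v, sizeof(float), 1, f);
    }
    fclose(f);
    return 0;
}

int main(void)
{
    srand(42);                                            /* fixed seed for reproducibility */
    if (write_matrix("dataset/matrix_A.bin")) return 1;   /* file names are placeholders */
    if (write_matrix("dataset/matrix_B.bin")) return 1;
    return 0;
}
```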
## Building and Running

### Sorting Algorithms
- Generate the dataset:

  ```bash
  cd "sort algorithm"
  make generate_data
  ./generate_data [size]
  ```

  Where `[size]` is a power of 2 (e.g., 1024, 32768, etc.).

- Build and run the CUDA bitonic sort:

  ```bash
  cd "sort algorithm/bitonic Sort/CUDA"
  make
  ./bitonic_sort
  ```

- Build and run the OpenMP quicksort:

  ```bash
  cd "sort algorithm/quicksort/OpenMP"
  make
  ./quicksort
  ```
### Matrix Multiplication
- Generate the matrices:

  ```bash
  cd "matrix multiplication"
  make -f makefile
  ./generate_matrix
  ```

- Build and run the CUDA matrix multiplication:

  ```bash
  cd "matrix multiplication/CUDA"
  make -f makefile
  ./matrix_mul
  ```

- Build and run the OpenMP matrix multiplication:

  ```bash
  cd "matrix multiplication/OpenMP"
  make
  ./matrix_mul
  ```
## Performance Considerations

### Sorting Performance
- The CUDA bitonic sort performs best for very large datasets where the GPU parallelism can be fully utilized.
- The OpenMP quicksort performs well on multi-core CPUs and is more efficient for smaller datasets.
- The `THRESHOLD` parameter in the OpenMP quicksort can be tuned to the target hardware to optimize performance.
### Matrix Multiplication Performance
- The CUDA implementation performs best for large matrices where the GPU parallelism can be fully utilized.
- The OpenMP implementation works well on multi-core CPUs but may not scale as well as the GPU version for very large matrices.
- The block size in the CUDA implementation (16x16) is chosen for good performance on most GPUs but can be tuned for specific hardware.
## Requirements

### Hardware
- For CUDA implementations: NVIDIA GPU with CUDA support
- For OpenMP implementations: Multi-core CPU
### Software
- CUDA Toolkit (for CUDA implementations)
- GCC with OpenMP support (for OpenMP implementations)
- Make
## Notes
- Matrix sizes and array lengths are generally defined as constants in the source files. Modify these constants to experiment with different sizes.
- The sorting and matrix multiplication programs read their input from files in the corresponding `dataset/` directory.
- The CUDA implementations require a compatible NVIDIA GPU.
- The OpenMP implementations will work on any multi-core CPU.