feat: update readme

This commit is contained in:
Yorukot 2025-02-28 14:41:26 +08:00
parent c80d74448d
commit e48907084e
3 changed files with 183 additions and 50 deletions

README.md Normal file

@@ -0,0 +1,183 @@
# High-Performance Computing Repository
This repository contains implementations of parallel sorting algorithms and matrix multiplication optimizations using various high-performance computing techniques, including CUDA and OpenMP.
## Repository Structure
```
.
├── sort algorithm/ # Sorting algorithm implementations
│ ├── bitonic Sort/ # Bitonic sort implementation
│ │ └── CUDA/ # CUDA-based bitonic sort
│ ├── quicksort/ # Quicksort implementations
│ │ └── OpenMP/ # OpenMP-based quicksort
│ ├── dataset/ # Datasets with different sizes
│ ├── generate_data # Executable for data generation
│ ├── generate_data.c # Source for dataset generation
│ └── Makefile # Build configuration
└── matrix multiplication/ # Matrix multiplication implementations
├── CUDA/ # CUDA-based matrix multiplication
├── OpenMP/ # OpenMP-based matrix multiplication
├── dataset/ # Matrix datasets
├── generate_matrix # Executable for matrix generation
├── generate_matrix.c # Source for matrix generation
└── makefile # Build configuration
```
## Sorting Algorithms
The repository implements two different sorting algorithms optimized for parallel execution:
### Bitonic Sort (CUDA)
Bitonic sort is a comparison-based sorting algorithm that is well-suited for parallel implementation. The CUDA implementation uses GPU parallelism to achieve high performance, especially for large datasets.
Key features:
- GPU-accelerated implementation using CUDA
- Efficient for large datasets (up to hundreds of millions of elements)
- In-place sorting to minimize memory overhead
- Scales well with available GPU resources
### Quicksort (OpenMP)
The repository includes a parallel implementation of the classic quicksort algorithm using OpenMP.
Key features:
- CPU-based parallelism using OpenMP tasks
- Hybrid approach with sequential sorting for small partitions
- Configurable threshold (THRESHOLD) to switch between parallel and sequential execution
- Good performance scaling with multiple CPU cores
### Sorting Dataset Generation
The repository includes a data generator (`generate_data.c`) that creates datasets of various sizes for testing the sorting algorithms. The generated datasets consist of random floating-point numbers.
Features:
- Supports a range of dataset sizes (powers of two, plus one 384M size)
- Creates directory structure for organized data management
- Generates reproducible random data
Available dataset sizes (number of elements):
- 1024 (1K)
- 32768 (32K)
- 1048576 (1M)
- 16777216 (16M)
- 67108864 (64M)
- 134217728 (128M)
- 268435456 (256M)
- 402653184 (384M)
## Matrix Multiplication
The repository provides optimized implementations of matrix multiplication using parallel computing techniques:
### CUDA Matrix Multiplication
The CUDA implementation leverages GPU cores to perform matrix multiplication in parallel.
Key features:
- 2D thread organization for efficient computation
- Each thread computes one element of the result matrix
- Uses 16x16 thread blocks for optimal performance
- Handles large matrix sizes (tested with 8192x8192)
### OpenMP Matrix Multiplication
The OpenMP implementation uses multiple CPU cores to parallelize the matrix multiplication operation.
Key features:
- Configurable number of threads (NUM_THREADS)
- Uses OpenMP directives to parallelize the outermost loop
- Simple and effective parallelization approach
- Support for different matrix sizes
### Matrix Data Generation
The repository includes a tool (`generate_matrix.c`) for creating random matrices for testing purposes.
Features:
- Generates two random matrices (A and B)
- Supports different matrix sizes (configurable via the N macro)
- Saves matrices in binary format for efficient loading
## Building and Running
### Sorting Algorithms
1. Generate the dataset:
```
cd "sort algorithm"
make generate_data
./generate_data [size]
```
Where `[size]` is one of the supported element counts (e.g., 1024, 32768, etc.)
2. Build and run the CUDA bitonic sort:
```
cd "sort algorithm/bitonic Sort/CUDA"
make
./bitonic_sort
```
3. Build and run the OpenMP quicksort:
```
cd "sort algorithm/quicksort/OpenMP"
make
./quicksort
```
### Matrix Multiplication
1. Generate the matrices:
```
cd "matrix multiplication"
make -f makefile
./generate_matrix
```
2. Build and run the CUDA matrix multiplication:
```
cd "matrix multiplication/CUDA"
make -f makefile
./matrix_mul
```
3. Build and run the OpenMP matrix multiplication:
```
cd "matrix multiplication/OpenMP"
make
./matrix_mul
```
## Performance Considerations
### Sorting Performance
- The CUDA bitonic sort performs best for very large datasets where the GPU parallelism can be fully utilized.
- The OpenMP quicksort performs well on multi-core CPUs and is more efficient for smaller datasets.
- The THRESHOLD parameter in the OpenMP quicksort can be tuned based on the hardware to optimize performance.
### Matrix Multiplication Performance
- The CUDA implementation performs best for large matrices where the GPU parallelism can be fully utilized.
- The OpenMP implementation works well on multi-core CPUs but may not scale as well as the GPU version for very large matrices.
- The block size in the CUDA implementation (16x16) is chosen for good performance on most GPUs but can be tuned for specific hardware.
## Requirements
### Hardware
- For CUDA implementations: NVIDIA GPU with CUDA support
- For OpenMP implementations: Multi-core CPU
### Software
- CUDA Toolkit (for CUDA implementations)
- GCC with OpenMP support (for OpenMP implementations)
- Make
## Notes
- Matrix sizes and array lengths are generally defined as constants in the source files. Modify these constants to experiment with different sizes.
- Both implementations read input data from files in the dataset directory.
- The CUDA implementations require a compatible NVIDIA GPU.
- The OpenMP implementations will work on any multi-core CPU.


@@ -1,49 +0,0 @@
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1024
#define NUM_THREADS 8  // Set desired number of threads

void load_matrix(const char *filename, double *matrix) {
    FILE *file = fopen(filename, "rb");
    if (!file) {
        perror("Failed to open file");
        exit(EXIT_FAILURE);
    }
    if (fread(matrix, sizeof(double), N * N, file) != (size_t)(N * N)) {
        fprintf(stderr, "Failed to read a full %dx%d matrix from %s\n", N, N, filename);
        exit(EXIT_FAILURE);
    }
    fclose(file);
}

int main() {
    char matrix_a_path[256], matrix_b_path[256];
    sprintf(matrix_a_path, "../dataset/%d/matrix_A.bin", N);
    sprintf(matrix_b_path, "../dataset/%d/matrix_B.bin", N);

    double *A = (double *)malloc(N * N * sizeof(double));
    double *B = (double *)malloc(N * N * sizeof(double));
    double *C = (double *)calloc(N * N, sizeof(double));
    if (!A || !B || !C) {
        perror("Memory allocation failed");
        exit(EXIT_FAILURE);
    }

    load_matrix(matrix_a_path, A);
    load_matrix(matrix_b_path, B);

    omp_set_num_threads(NUM_THREADS);  // Set thread limit

    // Parallelize over rows of C; each thread owns disjoint rows, so no races.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                C[i * N + j] += A[i * N + k] * B[k * N + j];
            }
        }
    }

    free(A);
    free(B);
    free(C);
    return 0;
}


@@ -1 +0,0 @@
1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 16777216