feat: update readme
This commit is contained in:
parent
c80d74448d
commit
e48907084e
3 changed files with 183 additions and 50 deletions
183
README.md
Normal file
@@ -0,0 +1,183 @@
# High-Performance Computing Repository

This repository contains implementations of parallel sorting algorithms and matrix multiplication optimizations using various high-performance computing techniques, including CUDA and OpenMP.
## Repository Structure

```
.
├── sort algorithm/            # Sorting algorithm implementations
│   ├── bitonic Sort/          # Bitonic sort implementation
│   │   └── CUDA/              # CUDA-based bitonic sort
│   ├── quicksort/             # Quicksort implementations
│   │   └── OpenMP/            # OpenMP-based quicksort
│   ├── dataset/               # Datasets of different sizes
│   ├── generate_data          # Executable for data generation
│   ├── generate_data.c        # Source for dataset generation
│   └── Makefile               # Build configuration
│
└── matrix multiplication/     # Matrix multiplication implementations
    ├── CUDA/                  # CUDA-based matrix multiplication
    ├── OpenMP/                # OpenMP-based matrix multiplication
    ├── dataset/               # Matrix datasets
    ├── generate_matrix        # Executable for matrix generation
    ├── generate_matrix.c      # Source for matrix generation
    └── makefile               # Build configuration
```
## Sorting Algorithms

The repository implements two sorting algorithms optimized for parallel execution:
### Bitonic Sort (CUDA)

Bitonic sort is a comparison-based sorting algorithm that is well suited to parallel implementation. The CUDA implementation uses GPU parallelism to achieve high performance, especially on large datasets.

Key features:
- GPU-accelerated implementation using CUDA
- Efficient for large datasets (up to hundreds of millions of elements)
- In-place sorting to minimize memory overhead
- Scales well with available GPU resources
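The compare-exchange structure that makes bitonic sort GPU-friendly can be sketched in plain C. This is an illustrative serial version, not the repository's CUDA kernel: each iteration of the innermost loop is an independent compare-exchange, which is exactly what the CUDA version maps onto GPU threads.

```c
/* Serial bitonic sort for n = 2^m elements (illustrative sketch).
 * The inner loop's compare-exchanges are mutually independent,
 * so a CUDA kernel can execute them all in parallel per pass. */
void bitonic_sort(float *a, int n) {
    for (int k = 2; k <= n; k <<= 1) {          /* size of bitonic sequences */
        for (int j = k >> 1; j > 0; j >>= 1) {  /* compare-exchange distance */
            for (int i = 0; i < n; i++) {
                int ixj = i ^ j;                /* partner index */
                if (ixj > i) {
                    int ascending = ((i & k) == 0);
                    if ((ascending && a[i] > a[ixj]) ||
                        (!ascending && a[i] < a[ixj])) {
                        float t = a[i];
                        a[i] = a[ixj];
                        a[ixj] = t;
                    }
                }
            }
        }
    }
}
```

Note the algorithm requires the element count to be a power of two, which matches the power-of-2 dataset sizes described below.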
### Quicksort (OpenMP)

The repository includes a parallel implementation of the classic quicksort algorithm using OpenMP.

Key features:
- CPU-based parallelism using OpenMP tasks
- Hybrid approach with sequential sorting for small partitions
- Configurable threshold (THRESHOLD) to switch between parallel and sequential execution
- Good performance scaling with multiple CPU cores
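The task-based hybrid pattern described above can be sketched as follows. This is a minimal illustration, not the repository's source; the actual THRESHOLD value and partitioning scheme may differ.

```c
#define THRESHOLD 1000  /* partition size below which recursion stays sequential (tunable) */

/* Hoare-partition quicksort; large partitions spawn OpenMP tasks.
 * Compile with -fopenmp; without it the pragmas are ignored and the
 * sort runs sequentially but still correctly. */
static void qsort_task(float *a, int lo, int hi) {
    if (lo >= hi) return;
    float pivot = a[(lo + hi) / 2];
    int i = lo, j = hi;
    while (i <= j) {
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) {
            float t = a[i]; a[i] = a[j]; a[j] = t;
            i++; j--;
        }
    }
    if (hi - lo < THRESHOLD) {      /* small partition: recurse sequentially */
        qsort_task(a, lo, j);
        qsort_task(a, i, hi);
    } else {                        /* large partition: spawn parallel tasks */
        #pragma omp task shared(a)
        qsort_task(a, lo, j);
        #pragma omp task shared(a)
        qsort_task(a, i, hi);
        #pragma omp taskwait
    }
}

void parallel_quicksort(float *a, int n) {
    #pragma omp parallel
    #pragma omp single          /* one thread seeds the task tree */
    qsort_task(a, 0, n - 1);
}
```

The threshold avoids task-creation overhead dominating on small partitions, which is the hybrid behavior the feature list refers to.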
### Sorting Dataset Generation

The repository includes a data generator (`generate_data.c`) that creates datasets of various sizes for testing the sorting algorithms. The generated datasets consist of random floating-point numbers.

Features:
- Supports various dataset sizes (powers of 2)
- Creates a directory structure for organized data management
- Generates reproducible random data

Available dataset sizes (number of elements):
- 1024 (1K)
- 32768 (32K)
- 1048576 (1M)
- 16777216 (16M)
- 67108864 (64M)
- 134217728 (128M)
- 268435456 (256M)
- 402653184 (384M)
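The core of such a generator can be sketched in a few lines of C. The function name and file layout here are hypothetical; the repository's `generate_data.c` may organize output differently. A fixed seed gives the reproducibility mentioned above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Write `count` random floats in [0, 1) to `path` in binary form.
 * Passing the same seed reproduces the same dataset. */
int generate_dataset(const char *path, long count, unsigned seed) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    srand(seed);
    for (long i = 0; i < count; i++) {
        float x = (float)rand() / (float)RAND_MAX;
        fwrite(&x, sizeof(float), 1, f);
    }
    fclose(f);
    return 0;
}
```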
## Matrix Multiplication

The repository provides optimized implementations of matrix multiplication using parallel computing techniques:
### CUDA Matrix Multiplication

The CUDA implementation leverages GPU cores to perform matrix multiplication in parallel.

Key features:
- 2D thread organization for efficient computation
- Each thread computes one element of the result matrix
- Uses 16x16 thread blocks for good performance
- Handles large matrix sizes (tested with 8192x8192)
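In the one-thread-per-element scheme, each thread derives its global (row, col) from its block and thread indices and accumulates a single dot product. The work each thread performs can be illustrated in plain C (a sketch of the computation, not the repository's kernel):

```c
/* Compute element (row, col) of C = A * B for N x N row-major matrices.
 * In the CUDA kernel, row and col would come from blockIdx, blockDim
 * (16x16 blocks), and threadIdx, e.g.:
 *   row = blockIdx.y * blockDim.y + threadIdx.y
 *   col = blockIdx.x * blockDim.x + threadIdx.x */
double matmul_element(const double *A, const double *B,
                      int N, int row, int col) {
    double sum = 0.0;
    for (int k = 0; k < N; k++)
        sum += A[row * N + k] * B[k * N + col];
    return sum;
}
```

Since every output element is independent, launching one thread per element requires no synchronization between threads.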
### OpenMP Matrix Multiplication

The OpenMP implementation uses multiple CPU cores to parallelize the matrix multiplication operation.

Key features:
- Configurable number of threads (NUM_THREADS)
- Uses OpenMP directives to parallelize the outermost loop
- Simple and effective parallelization approach
- Support for different matrix sizes
### Matrix Data Generation

The repository includes a tool (`generate_matrix.c`) for creating random matrices for testing purposes.

Features:
- Generates two random matrices (A and B)
- Supports different matrix sizes (configurable via the N macro)
- Saves matrices in binary format for efficient loading
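A generator along these lines can be sketched as follows (hypothetical function name; the repository's `generate_matrix.c` may differ in naming and output paths). The binary format lets the consumer load a whole matrix with a single `fread`, as the OpenMP implementation does.

```c
#include <stdio.h>
#include <stdlib.h>

/* Fill an n x n matrix with random doubles in [0, 1) and save it in
 * raw binary format so it can be reloaded with one fread() call. */
int save_random_matrix(const char *path, int n, unsigned seed) {
    double *m = malloc((size_t)n * n * sizeof(double));
    if (!m) return -1;
    srand(seed);
    for (long i = 0; i < (long)n * n; i++)
        m[i] = (double)rand() / ((double)RAND_MAX + 1.0);
    FILE *f = fopen(path, "wb");
    if (!f) { free(m); return -1; }
    fwrite(m, sizeof(double), (size_t)n * n, f);
    fclose(f);
    free(m);
    return 0;
}
```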
## Building and Running
### Sorting Algorithms

1. Generate the dataset:
   ```
   cd "sort algorithm"
   make generate_data
   ./generate_data [size]
   ```
   where `[size]` is a power of 2 (e.g., 1024 or 32768).

2. Build and run the CUDA bitonic sort:
   ```
   cd "sort algorithm/bitonic Sort/CUDA"
   make
   ./bitonic_sort
   ```

3. Build and run the OpenMP quicksort:
   ```
   cd "sort algorithm/quicksort/OpenMP"
   make
   ./quicksort
   ```
### Matrix Multiplication

1. Generate the matrices:
   ```
   cd "matrix multiplication"
   make -f makefile
   ./generate_matrix
   ```

2. Build and run the CUDA matrix multiplication:
   ```
   cd "matrix multiplication/CUDA"
   make -f makefile
   ./matrix_mul
   ```

3. Build and run the OpenMP matrix multiplication:
   ```
   cd "matrix multiplication/OpenMP"
   make
   ./matrix_mul
   ```
## Performance Considerations

### Sorting Performance

- The CUDA bitonic sort performs best for very large datasets, where the GPU parallelism can be fully utilized.
- The OpenMP quicksort performs well on multi-core CPUs and is more efficient for smaller datasets.
- The THRESHOLD parameter in the OpenMP quicksort can be tuned to the hardware to optimize performance.
### Matrix Multiplication Performance

- The CUDA implementation performs best for large matrices, where the GPU parallelism can be fully utilized.
- The OpenMP implementation works well on multi-core CPUs but may not scale as well as the GPU version for very large matrices.
- The block size in the CUDA implementation (16x16) is chosen for good performance on most GPUs but can be tuned for specific hardware.
## Requirements

### Hardware
- For CUDA implementations: NVIDIA GPU with CUDA support
- For OpenMP implementations: multi-core CPU

### Software
- CUDA Toolkit (for CUDA implementations)
- GCC with OpenMP support (for OpenMP implementations)
- Make
## Notes

- Matrix sizes and array lengths are generally defined as constants in the source files. Modify these constants to experiment with different sizes.
- Both implementations read input data from files in the dataset directory.
- The CUDA implementations require a compatible NVIDIA GPU.
- The OpenMP implementations will work on any multi-core CPU.
@ -1,49 +0,0 @@
|
|||
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1024
#define NUM_THREADS 8  // Set desired number of threads

void load_matrix(const char *filename, double *matrix) {
    FILE *file = fopen(filename, "rb");
    if (!file) {
        perror("Failed to open file");
        exit(EXIT_FAILURE);
    }
    fread(matrix, sizeof(double), N * N, file);
    fclose(file);
}

int main() {
    char matrix_a_path[256], matrix_b_path[256];
    sprintf(matrix_a_path, "../dataset/%d/matrix_A.bin", N);
    sprintf(matrix_b_path, "../dataset/%d/matrix_B.bin", N);

    double *A = (double *)malloc(N * N * sizeof(double));
    double *B = (double *)malloc(N * N * sizeof(double));
    double *C = (double *)calloc(N * N, sizeof(double));

    if (!A || !B || !C) {
        perror("Memory allocation failed");
        exit(EXIT_FAILURE);
    }

    load_matrix(matrix_a_path, A);
    load_matrix(matrix_b_path, B);

    omp_set_num_threads(NUM_THREADS);  // Set thread limit
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                C[i * N + j] += A[i * N + k] * B[k * N + j];
            }
        }
    }

    free(A);
    free(B);
    free(C);
    return 0;
}
@@ -1 +0,0 @@
1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 16777216