2025/01/01
A few months ago the user "quirino" posted an interesting shallenge to Hacker News (HN): find the lowest SHA256 hash value for a string with a certain format:
Submit a string in the format “{username}/{nonce}”, where:
- username: 1-32 characters from a-zA-Z0-9_-
- nonce: 1-64 characters from Base64 (a-zA-Z0-9+/)
When this was posted, I wanted to participate, compete with other users, get my code to run as fast as possible and see if I could get lucky finding a low hash value.
In this blog post I will describe the different attempts I made to hash as fast as possible, how I went from 1.5 Megahashes/s to 18.8 Gigahashes/s, and share my final solution.
My first attempt at this was using my CPU. To be honest I didn't even think about using a graphics card (GPU), as I had never done that before and, well, it's called a graphics card, not a hash card.
For some context the CPU I’m using in this part is an Intel i5-8250U (4c, 8t), a mobile CPU.
I started by writing a first implementation in Go. It's very simple: generate a random string, calculate its hash value using the crypto package, and check if it's lower than the stored hash value.
Running this got me about 1.5 Megahashes/s (MHs). The longer the random string was, the lower the hash rate got. At this point I thought there wasn't much to improve in the code, since it was pretty straightforward.
I went back to the HN post to check the comments and see if someone had posted their solution, so I could check how close mine was to theirs, but very few users had shared their code, and most were claiming way higher hash rates than what I could do. At the same time it was hard to compare different CPUs, especially since I was using an older mobile CPU while HN users were using newer desktop CPUs.
Reading the comments I got the idea of creating a multi-threaded solution, which should improve the hash rate. Though I hadn't written a multi-threaded program before and didn't want to bother with modifying the code, I thought of a hack which was much simpler: launch multiple instances of the program.
$ for i in {1..4}; do ./shallenge >> hash.log & done
[2] 12275
[3] 12276
[4] 12277
[5] 12282
In theory the hash rate would now be 4x (6 MHs). I tested it to see if it had improved four times like I thought it would, but it only improved a little: I was getting 2.1 MHs combined. It looks like there was a bottleneck somewhere, though I'm not really sure if it had to do with the random generator function or something else in the system.
This felt like a hack rather than code actually written for multithreading, so I went ahead and tried to write a multi-threaded version to see if it improved the hash rate.
The code for the multi-threaded solution is here on Codeberg (git). I don't want to copy and paste a big wall of code, so I'll just reference parts of it on this blog.
Since I hadn't tried multi-threading before I thought it was going to be more complicated, but luckily goroutines make it way easier. At first I did need a bit of help and used Large Language Models (LLMs) to get started, but once I understood how goroutines worked I was able to modify the code myself.
Before you comment anything about goroutines being concurrent rather than truly parallel: I'm aware of this about goroutines.
“A goroutine has a simple model: it is a function executing concurrently with other goroutines in the same address space … Goroutines are multiplexed onto multiple OS threads so if one should block, such as while waiting for I/O, others continue to run. Their design hides many of the complexities of thread creation and management.”
Again, nothing special about the code; the only difference is that instead of printing when it finds a lower hash value, it sends it over a channel to be printed with a timestamp. The changes to the code were mostly around handling goroutines.
Even if it's only concurrent, I thought I'd get a higher hash rate, maybe something close to what I had calculated in theory, but I was disappointed when I benchmarked it and saw that it was doing at best 4.1 MHs. I noticed that 3 goroutines gave me the highest hash rate; using a higher or lower value made the hash rate drop.
Again, even if I did manage to truly parallelize the hashing I wasn’t going to get a much better hash rate than what other users were doing, mainly because of my CPU, so I kinda gave up on optimizing and left it running for a few hours to see if I could get a good enough result.
I was able to find a hash value that started with 9 zeros for the string “0x00cl/i5+8250U/Hi/HN/OinJeJj/eam3VRRI9d” (00000000 03b98cfc f91b9b3e 63735491 6d9cb960 63837375 8d810723 49bcb778).
Looking at the leaderboard you could see that most of the users at the top were using GPUs to get their hashes, and users in the HN post noticed this too. At the time I didn't have much interest in CUDA, especially because I was probably busy with other things, and although I do have a CUDA-enabled graphics card (a GTX 650 Ti BOOST, 768 CUDA cores), I didn't even consider it because it is an old card where, in the past, I had tried to do things just to find out they weren't compatible and needed a more modern graphics card: for example GPU passthrough, so I could run a Linux distro and game in a Windows VM, wasn't compatible, and pretty much any software that used CUDA required a more modern graphics card.
After a few months of not touching or running the project, I remembered the shallenge and wanted to give it another try, as I had moved a few positions down the leaderboard. This development was done on Windows, using the same graphics card that drives my display.
The documentation is bad. It was my first time using CUDA and of course I went ahead and downloaded the latest version of CUDA from their website (12.6). I installed it and tried compiling with nvcc, but got an error saying cl.exe (MSVC) was missing from PATH, which was my bad because the documentation did mention that MSVC was a requirement. So I installed Visual Studio 2022 Community Edition, downloaded the C++ tools and compiled this simple program:
#include <stdio.h>

// Minimal kernel that prints from the GPU
__global__ void hello_kernel() {
    printf("Hello, GPU!\n");
}

int main() {
    hello_kernel<<<1, 1>>>();
    // Wait for the kernel to finish so its printf output is flushed
    cudaDeviceSynchronize();
    return 0;
}
It compiled without errors and generated an executable, but for some reason when I ran the program it didn't print anything. I tried other simple code, but it would sometimes throw a SIGSEGV (segmentation fault).
I tried different things but none of them worked; for a moment I thought my GPU simply had some sort of defect that didn't allow CUDA programs to run. I kept looking for what the problem could be. I looked at the nvcc options for debugging and one option caught my attention: -arch. It had a "native" value that would detect your GPU and generate code for it, so I tried compiling with -arch=native, but it threw an error saying that the compute capability (CC) of my GPU wasn't supported.
That was the problem: by default it compiles for graphics cards with CC 5.2, and at runtime it didn't show any error. I had assumed that the CUDA Toolkit would support any CUDA-enabled graphics card, but each version of the CUDA Toolkit only supports a range of CCs. In the case of 12.6 it supports CC 5.0 to 9.0, which isn't mentioned explicitly anywhere in the documentation, nor as a requirement, and the only way to know whether it supports your CC is by digging through the documentation and checking tables that mention multiple CCs, such as the arithmetic instructions table.
Because my card's CC is 3.0, it isn't supported by CUDA Toolkit 12.6, so I had to look at past versions' documentation to find one that mentions supporting CC 3.0; the latest CUDA Toolkit version that supports my card's CC is 10.2. So I downloaded and installed that version, compiled the program, and it threw an error: the version of Visual Studio wasn't supported (2022 was too new), only versions between 2015 and 2019 were supported (sigh). So I went to download VS 2019, but Microsoft had put it behind an MSDN subscription. Luckily I found this StackOverflow answer where Microsoft still has an unlisted link to download VS 2019 from their servers. So I installed it, compiled and ran the CUDA program, and was finally able to print something.
In my first attempt I really wasn't sure what I was doing. I just saw the GPU as a CPU: instead of having 4 cores, I now had 768 cores.
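Conceptually, that first attempt boiled down to something like the following sketch (my reconstruction, not the original code): the per-thread xorshift generator and the sha256_first_word() stub are placeholders I made up so the sketch compiles on its own; a real version would compute the full SHA256 of the "{username}/{nonce}" string.

#include <cstdio>
#include <cstdint>

// Stand-in for a real device-side SHA256; it just mixes bytes so the
// sketch is self-contained. It is NOT SHA256.
__device__ uint32_t sha256_first_word(const char *s, int len) {
    uint32_t h = 0x6a09e667u;
    for (int i = 0; i < len; ++i) h = h * 31u + (uint8_t)s[i];
    return h;
}

__global__ void naive_hash_kernel(uint64_t seed, int iterations) {
    const char charset[65] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    // Simple per-thread xorshift state to generate "random" nonce characters
    uint64_t state = seed ^ (threadIdx.x + 1);
    char nonce[13];
    nonce[12] = '\0';
    // The original attempt used while(true) here, which froze the screen
    // until the display watchdog reset the driver; a bounded loop avoids that.
    for (int it = 0; it < iterations; ++it) {
        for (int i = 0; i < 12; ++i) {
            state ^= state << 13; state ^= state >> 7; state ^= state << 17;
            nonce[i] = charset[state & 63];
        }
        uint32_t first = sha256_first_word(nonce, 12);
        if (first == 0)  // stand-in for "is this hash lower than the best so far?"
            printf("thread %d found a candidate: %s\n", threadIdx.x, nonce);
    }
}

int main() {
    // One block of 768 threads: this only keeps a single SM busy
    naive_hash_kernel<<<1, 768>>>(123456789ULL, 1000);
    cudaDeviceSynchronize();
    return 0;
}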
This solution has several problems:
- Launching a single block of 768 threads (<<<1, 768>>>) is not the way to go. As far as I understand (unless the GPU scheduler does some magic behind the scenes), each block is assigned to one Streaming Multiprocessor (SM). My GPU has 4 SMs with 192 cores each, for a total of 768 CUDA cores, which meant that only 1 SM (192 cores) was being assigned this one block of 768 threads, not taking advantage of all the cores, as I had assumed it was one big block of cores.
- The screen kept freezing and the display driver restarted. I ended up "solving" this by replacing the infinite loop inside the kernel function with a loop of 1000 iterations, which was short enough that the watchdog timer wouldn't try to restart the display driver.
Though I don't have the exact code, I did end up recreating what I did and benchmarking it, and got about 7.2 MHs with 1 block and 768 threads.
At the time I didn't know how blocks and threads worked, but because I had to go back to my first solution in order to write about it, I wanted to test whether using more blocks would have improved the performance. I tried using 4 blocks of 768 threads, since my GPU has 4 SMs, so each block should be assigned to one SM, and I definitely saw an improvement in the hash rate, getting about 27 MHs.
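In launch-configuration terms the only difference is the grid size (the kernel name refers to the sketch above):

// 1 block x 768 threads: only one of the four SMs gets work (~7.2 MHs)
naive_hash_kernel<<<1, 768>>>(123456789ULL, 1000);

// 4 blocks x 768 threads: roughly one block per SM on a 4-SM GPU (~27 MHs)
naive_hash_kernel<<<4, 768>>>(123456789ULL, 1000);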
After reading up and understanding GPU architecture a bit better, particularly NVIDIA cards, I thought of copying data in and out of the device instead of generating the random strings on the GPU and checking the hash value there. This pattern appeared a lot in the examples of CUDA applications I found.
My idea was to generate an array of random characters on my computer (the host) as long as the total number of threads plus 64 (though I only used 57 random characters, oops). Each thread has its idx, which is unique to it and goes from 0 to the total number of threads, and I would use that idx as an offset into the array of random characters to generate the string.
So let's say I use 5-character-long strings for the random part and my array of random characters is ['4', 'F', 'J', 'J', '9', 'W', '7', 'd', 'S', 'i', '+', 'V', 'x', 'A']; the resulting random string for each thread would be:
- idx = 0: 4FJJ9
- idx = 1: FJJ9W
- idx = 2: JJ9W7
- idx = 3: J9W7d
- idx = 9: i+VxA
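A minimal sketch of that kernel (my illustration, not the code from the repository; NONCE_LEN, the block/grid sizes and the buffer-size arithmetic are simplifications): the host fills a buffer with random characters, copies it to the device, and each thread slices its own candidate out of it starting at its global index.

#include <cstdio>
#include <cstdlib>

#define NONCE_LEN 5     // length of the random part in this example
#define THREADS   768
#define BLOCKS    4

// Each thread reads NONCE_LEN characters starting at its own global index.
// Hashing and comparing against the best value are omitted here.
__global__ void build_candidates(const char *random_chars) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    char nonce[NONCE_LEN + 1];
    for (int i = 0; i < NONCE_LEN; ++i)
        nonce[i] = random_chars[idx + i];
    nonce[NONCE_LEN] = '\0';
    // ... hash "{username}/{nonce}" here and compare against the stored hash ...
}

int main() {
    const int total = BLOCKS * THREADS + NONCE_LEN;  // buffer must cover the last thread
    char *h_buf = (char *)malloc(total);
    const char charset[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    for (int i = 0; i < total; ++i)
        h_buf[i] = charset[rand() & 63];             // host-side random characters

    char *d_buf;
    cudaMalloc(&d_buf, total);
    // This host-to-device copy on every batch is what the profiler later flagged
    cudaMemcpy(d_buf, h_buf, total, cudaMemcpyHostToDevice);

    build_candidates<<<BLOCKS, THREADS>>>(d_buf);
    cudaDeviceSynchronize();

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}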
I used nvprof to profile and analyze the program, and noticed that it was spending a lot of its time copying data from the host to the device, which isn't ideal.
I also benchmarked this solution and got 27.7 MHs, which is better than my first attempt, with the difference that the screen didn't freeze every few seconds. Maybe I could have improved the hash rate by also using a loop inside the kernel function, but because I was copying data in and out that would have required allocating more memory, so at the time I didn't try it.
I still wasn't happy with the results. I had no idea whether 27.7 MHs was a good result or not, as I had nothing to compare against, but after profiling it still didn't feel right to spend "a lot" of time copying data to and from the device.
So I started thinking of ways to avoid cudaMemcpy(). That's when I realized the string doesn't need to be random: I could go for a sequential approach and try all possible combinations, going from "AAAAAA", "AAAAAB", "AAAAAC", ..., "aaaaaA", "aaaaaB", ..., "//////", and all of this could be done inside the GPU without having to copy external data to it.
This kind of solved part of the problem, which was avoiding generating random characters and having to copy them to the device, but now I had a new problem: how would I do it sequentially if each thread is supposed to get a unique string sequence? One value that is unique to each thread is idx = blockIdx.x * blockDim.x + threadIdx.x, and I could use that value to select characters from the character set, so with a little bit of math we can get a unique sequence for each thread.
const char charset[65] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
char string[4];

string[3] = charset[ idx % 64];
string[2] = charset[(idx / 64) % 64];
string[1] = charset[(idx / (64*64)) % 64];
string[0] = charset[(idx / (64*64*64)) % 64];
idx 0 = "AAAA"
string[3] = charset[ 0 % 64] -> charset[0] = "A";
string[2] = charset[ 0 % 64] -> charset[0] = "A";
string[1] = charset[ 0 % 64] -> charset[0] = "A";
string[0] = charset[ 0 % 64] -> charset[0] = "A";
idx 1 = "AAAB"
string[3] = charset[ 1 % 64] -> charset[1] = "B";
string[2] = charset[ 0 % 64] -> charset[0] = "A";
string[1] = charset[ 0 % 64] -> charset[0] = "A";
string[0] = charset[ 0 % 64] -> charset[0] = "A";
---
idx 64 = "AABA"
string[3] = charset[64 % 64] -> charset[0] = "A";
string[2] = charset[ 1 % 64] -> charset[1] = "B";
string[1] = charset[ 0 % 64] -> charset[0] = "A";
string[0] = charset[ 0 % 64] -> charset[0] = "A";
idx 65 = "AABB"
string[3] = charset[65 % 64] -> charset[1] = "B";
string[2] = charset[ 1 % 64] -> charset[1] = "B";
string[1] = charset[ 0 % 64] -> charset[0] = "A";
string[0] = charset[ 0 % 64] -> charset[0] = "A";
---
idx 16777215 (64^4 - 1) = "////"
string[3] = charset[16777215 % 64] -> charset[63] = "/";
string[2] = charset[ 262143 % 64] -> charset[63] = "/";
string[1] = charset[ 4095 % 64] -> charset[63] = "/";
string[0] = charset[ 63 % 64] -> charset[63] = "/";
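The same mapping can be written as a small loop. Here it is as a stand-alone host program so the examples above are easy to verify (the 4-character length is just for this example; the real nonce is longer):

#include <stdio.h>

// Turn a linear index into a base-64 string over the charset,
// most significant "digit" first (same mapping as the table above).
void index_to_string(unsigned long long idx, char *out, int len) {
    const char charset[65] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    for (int i = len - 1; i >= 0; --i) {
        out[i] = charset[idx % 64];
        idx /= 64;
    }
    out[len] = '\0';
}

int main(void) {
    char s[5];
    unsigned long long samples[] = {0, 1, 64, 65, 16777215ULL};
    for (int i = 0; i < 5; ++i) {
        index_to_string(samples[i], s, 4);
        printf("idx %llu -> %s\n", samples[i], s);  // AAAA, AAAB, AABA, AABB, ////
    }
    return 0;
}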
Though this did pose another problem: the string I'm using as a nonce is 12 characters long, meaning the total number of possible combinations is 64^12 (4,722,366,482,869,645,213,696), which is 2^72. That would overflow even an unsigned 64-bit integer (2^64), and although NVIDIA added __int128 support in CUDA 11.6, my graphics card didn't support it, and because it's software-emulated I'd assume it's not the best option for performance anyway.
So instead I split the string sequence into 2 parts, so each part has 64^6 (68,719,476,736), which is 2^36 combinations and fits in an unsigned 64-bit integer (2^64). Because I split the string sequence, I decided to use 2-dimensional blocks and threads when launching the kernel. This lets me track the first 6 characters of the sequence and the last 6 characters in 2 different values that form a combination unique to each thread: idx and idy.
In that part of the kernel I'm building the sequence of characters for the nonce. It is exactly the same as I explained in the previous block of code, with the difference that there are 2 variables, idx_offset and idy_offset, which are the IDs of each thread plus an offset, and that instead of arithmetic operators such as division (/) and modulo (%) I'm using bitwise operators, because according to the documentation they have higher throughput than some arithmetic operations; their "Best Practices" guide also recommends using bitwise operations. Luckily for us, because the character set has length 64, which is a power of 2 (2^6), we can easily use bitwise operations: idx_offset >>= 6 shifts the bits right by 6, which equals dividing by 64, and instead of modulo (%) you can use a bitwise AND, since idx_offset % 64 == idx_offset & 63. Mind you, this only works for powers of 2. A sketch of this indexing is shown below.
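Here is a minimal sketch of that indexing (my illustration of the idea, not the repository code; the grid/block dimensions and the s1/s2 offsets are assumptions): each thread derives a unique (idx_offset, idy_offset) pair from the 2D launch configuration and expands each value into 6 base-64 characters using shifts and masks.

#include <cstdio>
#include <cstdint>

__constant__ char charset[65] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Each thread builds a 12-character nonce: the first 6 characters come from
// idx_offset, the last 6 from idy_offset, 6 bits per character.
__global__ void build_nonce(uint64_t s1, uint64_t s2) {
    uint64_t idx_offset = s1 + blockIdx.x * blockDim.x + threadIdx.x;
    uint64_t idy_offset = s2 + blockIdx.y * blockDim.y + threadIdx.y;

    char nonce[13];
    for (int i = 5; i >= 0; --i) {
        nonce[i]     = charset[idx_offset & 63];  // same as idx_offset % 64
        nonce[i + 6] = charset[idy_offset & 63];
        idx_offset >>= 6;                         // same as dividing by 64
        idy_offset >>= 6;
    }
    nonce[12] = '\0';
    // ... splice the nonce into the padded message and hash it ...
    if (blockIdx.x == 0 && blockIdx.y == 0 && threadIdx.x == 0 && threadIdx.y == 0)
        printf("thread (0,0) nonce: %s\n", nonce); // prints "AAAAAAAAAAAA"
}

int main() {
    dim3 grid(16, 16), block(32, 32);             // assumed launch configuration
    build_nonce<<<grid, block>>>(0, 0);
    cudaDeviceSynchronize();
    return 0;
}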
Once I had that solved I wanted to hard-code some values, especially in the calculation of the hash, but it's hard to get the most out of hard-coding values when you don't know how SHA256 works. There is, however, a website that explains and shows step by step how the values change in the SHA256 algorithm, and it definitely helped me understand better how it's done.
After understanding better how SHA256 works, I started hard-coding values: I pre-calculated the first 10 values of m. These values depend on your string, but because I'm only changing the last 12 characters of my 55-character string, these numbers remain the same.
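For reference, this is roughly how those 10 constant message words can be computed (a stand-alone host-side sketch, not the code from the repository; it just packs the fixed 43-character prefix into big-endian 32-bit words):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    // Fixed prefix of the 55-character string; only the last 12 characters change,
    // so message words m[0]..m[9] (the first 40 bytes) never change.
    const unsigned char *prefix =
        (const unsigned char *)"0x00cl/GTX+650+Ti+BOOST/https+0x00+cl/2025/";

    uint32_t m[10];
    for (int i = 0; i < 10; ++i) {
        const unsigned char *p = prefix + 4 * i;
        // SHA256 message words are big-endian 32-bit values
        m[i] = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
        printf("m[%d] = 0x%08x\n", i, m[i]);
    }
    return 0;
}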
The same goes for the initial state values (a, b, c, d, e, f, g and h) of the variables used in the 64 rounds.
Another change I made was, instead of copying the results back to the host, to simply check and print the results on the device. I also ended up adding a loop inside the kernel, which avoids some of the overhead of synchronizing threads, launching the kernel, and copying the complete string to hash into a variable each time.
This is a summary of how my solution works.
- Calculate idx and idy plus the s1 and s2 offsets (idx_offset and idy_offset). In the first iteration, these values go from 0 to 511.
- Copy data, which already includes the needed padding: "0x00cl/GTX+650+Ti+BOOST/https+0x00+cl/2025/AAAAAAAAAAAA\x80\x00\x00\x00\x00\x00\x00\x01\xb8"
- Modify data to include the sequence of 12 characters by using idx_offset and idy_offset, dividing by 64 for each position and choosing the character from the charset array.
- Calculate the hash (cuda_sha256_transform(data)). At the end, each of the state values (a, b, c, d, e, f, g and h) is supposed to be added to an initial state value to get the full hash of the string, but because we are looking only for the first 11 zeros (the current top user on the leaderboard found a hash with 15 zeros), and each state value accounts for 8 hex digits, we only need a and b to get the first 16 hex digits and check whether they start with zeros. If so, print the string and the first 16 hex digits of the hash (a sketch of this check follows the list).
- Increment idx_offset and idy_offset on each loop.
- Update s2 = s2 + 16*32*16*32.
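As a sketch of that last check (my illustration; 0x6a09e667 and 0xbb67ae85 are the standard SHA256 initial values for the first two state words, and a and b stand for the state after the 64 rounds): a covers the first 8 hex digits and the top 3 hex digits of b bring the total to 11 leading zeros.

#include <stdint.h>
#include <stdio.h>

// Check only the first 16 hex digits of the digest: word 0 must be zero and
// word 1 must start with at least 3 zero hex digits, i.e. 11 leading zeros.
static int has_11_leading_zeros(uint32_t a, uint32_t b) {
    uint32_t h0 = a + 0x6a09e667u;   // add the SHA256 initial state values
    uint32_t h1 = b + 0xbb67ae85u;
    return h0 == 0u && (h1 >> 20) == 0u;
}

int main(void) {
    // Hypothetical post-round state values, chosen so the resulting digest
    // words are 0x00000000 0x000fffff, just to exercise the check.
    uint32_t a = 0u - 0x6a09e667u;
    uint32_t b = 0x000fffffu - 0xbb67ae85u;
    printf("match: %d\n", has_11_leading_zeros(a, b));
    return 0;
}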
With all that done I benchmarked my solution on my GTX 650 Ti BOOST and got about 211 MHs, which is a big improvement compared to my previous attempt (27.7 MHs). At this point it felt like there was no more room for optimization in the current code, or at least I couldn't think of anything else.
I also went ahead and tested it on an RTX 4090 to see how well it performed and check whether it could reach the 21 GHs that the user at the top of the leaderboard claimed. My solution managed 18.8 GHs, which is still a lot and close to 21 GHs, but not quite there, so clearly there was room for improvement.
seletskiy, the user in the top spot claiming 21 GHs, also commented and shared his solution. From what I could understand of his code he was doing something similar, except that he base64-encodes the characters for the nonce, while I use a charset variable that contains all the characters. But because there isn't much explanation of the code, it's hard for me to tell exactly what shortcuts he took to achieve 21 GHs.
This was definitely a fun shallenge that I enjoyed working on. I spent quite a lot of time understanding CUDA and NVIDIA GPUs just so I could write a performant program and get the most out of a GPU, analyzing and thinking about which approach to take, making small changes and checking whether they improved anything. For me this was about the journey rather than actually finding a low hash value. After all, this shallenge is a mix of knowledge, luck, and having the money/hardware to do the calculations. Even if your program is not completely optimized for performance, on an RTX 4090 it will probably run faster than on some older GPUs, increasing your chances of finding a lower hash value.
Anyway, after all of this, I can definitely see that the CPU solution's performance could be improved: instead of generating random characters, use sequential ones; write my own SHA256 implementation with some hard-coded values; and instead of goroutines, use something like C to actually parallelize the work and use all the cores/threads of the CPU.
On the other hand, my CUDA solution could maybe be improved by tuning the number of blocks it runs for more powerful graphics cards, increasing the amount of time spent inside the kernel doing calculations (bigger loops), and finally improving the calculations that generate the string sequence, where I'm dividing integers, which is a bit slower than floating-point arithmetic on the GPU, and where I'm probably missing something about creating the different string sequences.
In the git repository I have included a file called shallengeBase.cu, whose code is the same as my solution except that a few values aren't hard-coded and pre-calculated; you can simply modify the string at the top of the file (USERNAME_NONCE), compile it and run it. This file is meant as a starting point if you want to start hard-coding some values yourself to make it run faster, so you can find a low hash value too.
Source code for my CUDA solution: https://codeberg.org/0x00cl/cuda-shallenge