Author Topic: Optimized scrypt kernel files for 7950/7970/7990/R9 280x  (Read 104025 times)

Offline lbr

  • Sr. Member
  • ****
  • Posts: 1309
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #255 on: December 19, 2013, 11:58:29 PM »
LTC: LgVc7KdedPGZyDXHXEH9G7z6AoTmTvDdWb
Cleanup 7xGPU driver mod Force monitor setx

Offline ig0rb

  • Newbie
  • *
  • Posts: 10
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #256 on: December 20, 2013, 12:02:37 AM »
Quote
Spot on explanation, Will!  Newbies should learn with this.  You can imagine how many scrypt files I have and countless restarts to test every bit of changes. 
If someone wants to really push their rigs to the limit, let them create their own .bin files.

Your next take would be to the understand the cgminer code, you can still squeeze more hash and shares if you know how it works.
... that
I think that some work can be shared between CPU and GPU the bottleneck it's PCI bus that's why profiling it's very important...

I don't have too much spare time... but if some one can implement a profiling it can be a good way to start the optimization....


Offline ig0rb

  • Newbie
  • *
  • Posts: 10
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #257 on: December 20, 2013, 12:12:15 AM »
Quote
Only assembler only hardcore!!!! > : )

Assembler it's not hardcore :)
assemble it's how a button on/off did his work ;)

it's paradigm of the elephant in small pieces (when i have a complex scenario, I can decompose a complex situation in a lot of simple problems and resolve every simple problem to reach to definitive solution)

Offline lbr

  • Sr. Member
  • ****
  • Posts: 1309
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #258 on: December 20, 2013, 12:17:32 AM »
Quote
Only assembler only hardcore!!!! > : )

Assembler it's not hardcore :)
assemble it's how a button on/off did his work ;)

it's paradigm of the elephant in small pieces (when i have a complex scenario, I can decompose a complex situation in a lot of simple problems and resolve every simple problem to reach to definitive solution)


right..
how many lines of code in asm have u written exactly?
LTC: LgVc7KdedPGZyDXHXEH9G7z6AoTmTvDdWb
Cleanup 7xGPU driver mod Force monitor setx

Offline ig0rb

  • Newbie
  • *
  • Posts: 10
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #259 on: December 20, 2013, 12:22:20 AM »
I don't, imo AMD does.
http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
have u tried this?

not yet as I told i don't have too much free time :)
I have 10 kwh of 'free energy', already paid... my target it's to optimize that...
the life it's a 'trade off'  between free time, money, brain storm, relax....

If some one want to contributeto the code I will be glad to give  my effort to comment and discuss about this...

For sure I will not delete my posts....

Anyway.... I was using bamt on a machine but with 2.6 kernel and i had trouble with onboards enthernet cards and with the latest amd radeon driver... i'm now testing latest xubuntu.... did some one want to manage a mining usb distro?

Offline ig0rb

  • Newbie
  • *
  • Posts: 10
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #260 on: December 20, 2013, 12:27:41 AM »
Quote
how many lines of code in asm have u written exactly?
as you can see zero in my scrypt code....

but in my life i don't know.... on embedded devices assembler it's what happen... on gpu / cpu / modern devices it's very hard to become smarter than the compiler 3 level of cache... a lot of instructions releted's to the hardware, ecc....

Offline yellowz06

  • Jr. Member
  • **
  • Posts: 70
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #261 on: December 21, 2013, 12:13:59 AM »
Use:

Code: [Select]
salsa(X);
salsa(X);

instead of

Code: [Select]
for (uint=2; i--;)
   salsa(X);

It's executed two times anyway, right? Assuming you are using a lookup-gap of 2.

I've tried this countless times with other code variations, in all cases it hurt performance.  This is exactly the type of thing I was referring to when I couldn't find any logic as to what caused speed ups and what causes slow downs :(

Thanks for all the other info, I'll have a good read through it all over Christmas :)

So I'm the curious type:)

I tried your example above and indeed in did hurt performance.

So i did a bit of reading and figured out why this is the case (side note, I do software development for a living;) ).

This code is faster (because for loops are executed in parallelism) (both salsa(X) as executed simultaneous)
Code: [Select]
for (uint=2; i--;)
   salsa(X);

this code is slower (because it cannot be executed in parallel, it has to executed serially) (the first salsa(X) has to finish executing before the second salsa(X) can be executed)
Code: [Select]
salsa(X);
salsa(X);

So at the end of the day, there is some logic on why it works that way :)

Offline Walrusbonzo

  • Jr. Member
  • **
  • Posts: 23
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #262 on: December 21, 2013, 01:00:10 AM »
Not all for loops can be in parallel, what if loop 2 relies on the results of loop 1?

Would I be right thinking the compiler knows to check for this and optimise accordingly?

Offline yellowz06

  • Jr. Member
  • **
  • Posts: 70
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #263 on: December 21, 2013, 01:29:14 AM »
Not all for loops can be in parallel, what if loop 2 relies on the results of loop 1?

Would I be right thinking the compiler knows to check for this and optimise accordingly?

you are absolutely correct, that's why loops that are executed in parallelism need to make sure that results for loop 2 does not rely on results from loop 1 as you mentioned.

http://dhruba.name/2012/10/06/opencl-cookbook-parallelise-your-host-loops-using-opencl/

So to your point, are all for loops executed in parallel or does the compiler optimize depending on the code being executed?  I'm still trying to research this.

Offline dexX7

  • Jr. Member
  • **
  • Posts: 82
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #264 on: December 21, 2013, 03:58:23 PM »
So at the end of the day, there is some logic on why it works that way :)

Very interesting. Thanks for the link.

I took a look at the ISA code and I confirmed two things:

 - Without additional parameters the compiler uses heavy optimization.
 - Of course the compiler also considers the definitions and parameters. This means you might get a slightly different result for different thread concurrencies for example.

First sighting: some loops are unrolled even without #pragma unroll, divisions are done via shifts, (x mod y) == (x and 2^n-1) simplification , where applicable.


Here is an example:

Code: [Select]
__kernel void optimizationtest(volatile __global uint* restrict output)
{
  // CONCURRENT_THREADS = 16384
  uint a = get_global_id(0);
  size_t b = a;
  int c = 1 + 3;
  b = a + c - 4;
  size_t x = b % CONCURRENT_THREADS;
  output[x] = 0;
}

Is simplified and results in the same as:

Code: [Select]
__kernel void optimizationtest(volatile __global uint* restrict output)
{
  size_t x = get_global_id(0) & 16383;
  output[x] = 0;
}

Edit: for the reference:

Code: [Select]
ShaderType = IL_SHADER_COMPUTE
TargetChip = t
; ------------- SC_SRCSHADER Dump ------------------
SC_SHADERSTATE: u32NumIntVSConst = 0
SC_SHADERSTATE: u32NumIntPSConst = 0
SC_SHADERSTATE: u32NumIntGSConst = 0
SC_SHADERSTATE: u32NumBoolVSConst = 0
SC_SHADERSTATE: u32NumBoolPSConst = 0
SC_SHADERSTATE: u32NumBoolGSConst = 0
SC_SHADERSTATE: u32NumFloatVSConst = 0
SC_SHADERSTATE: u32NumFloatPSConst = 0
SC_SHADERSTATE: u32NumFloatGSConst = 0
fConstantsAvailable = 0
iConstantsAvailable = 0
bConstantsAvailable = 0
u32SCOptions[0] = 0x00680000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC
u32SCOptions[1] = 0x40000000 SCOption_R800_UAV_NONARRAY_FIXUP
u32SCOptions[2] = 0x08200000 SCOption_R1000_BYTE_SHORT_WRITE_WORKAROUND_BUG317611 SCOption_R1000_READLANE_SMRD_WORKAROUND_BUG343479
u32SCOptions[3] = 0x00000204 SCOption_R1000_BARRIER_WORKAROUND_BUG405404 SCOption_R1000R1100_VCCZ_CLOBBER_WORKAROUND_BUG457939
; -------- Disassembly --------------------
shader main
  asic(SI)
  type(CS)

  s_buffer_load_dword  s0, s[8:11], 0x04                    // 00000000: C2000904
  s_buffer_load_dword  s1, s[8:11], 0x18                    // 00000004: C2008918
  s_buffer_load_dword  s2, s[12:15], 0x00                   // 00000008: C2010D00
  s_waitcnt     lgkmcnt(0)                                  // 0000000C: BF8C007F
  s_min_u32     s0, s0, 0x0000ffff                          // 00000010: 8380FF00 0000FFFF
  v_mov_b32     v1, s0                                      // 00000018: 7E020200
  v_mul_i32_i24  v1, s16, v1                                // 0000001C: 12020210
  v_add_i32     v0, vcc, v0, v1                             // 00000020: 4A000300
  v_add_i32     v0, vcc, s1, v0                             // 00000024: 4A000001
  v_and_b32     v0, 0x00003fff, v0                          // 00000028: 360000FF 00003FFF
  v_lshlrev_b32  v0, 2, v0                                  // 00000030: 34000082
  v_add_i32     v0, vcc, s2, v0                             // 00000034: 4A000002
  v_mov_b32     v1, 0                                       // 00000038: 7E020280
  tbuffer_store_format_x  v1, v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] // 0000003C: EBA41000 80010100
  s_waitcnt     vmcnt(0) & expcnt(0)                        // 00000044: BF8C1F00
  s_endpgm                                                  // 00000048: BF810000
end

; ----------------- CS Data ------------------------

codeLenInByte        = 76;Bytes

userElementCount     = 3;
;  userElements[0]      = IMM_UAV 10, s[4:7]
;  userElements[1]      = IMM_CONST_BUFFER 0, s[8:11]
;  userElements[2]      = IMM_CONST_BUFFER 1, s[12:15]
extUserElementCount  = 0;
NumVgprs             = 3;
NumSgprs             = 18;
FloatMode            = 192;
IeeeMode             = 0;
ScratchSize          = 0 dwords/thread;
LDSByteSize          = 0 bytes/workgroup (compile time only);
; uavResourceUsage[0]  = 0x00000400
; texSamplerUsage         = 0x00000000
; constBufUsage           = 0x00000003

;COMPUTE_PGM_RSRC2       = 0x000000A0
COMPUTE_PGM_RSRC2:USER_SGPR      = 16
COMPUTE_PGM_RSRC2:TGID_X_EN      = 1
NumThreadX                       = 256


Edit:

Just to clarify: salsa(X); salsa(X); needs to be executed one after the other, because the second one relies on the first, but seeing more dimensional it still might be that for (uint=2; i--;) salsa(X); contributes to more parallelity.
« Last Edit: December 21, 2013, 05:17:33 PM by dexX7 »

Offline Joe-Dirt

  • Jr. Member
  • **
  • Posts: 37
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #265 on: December 22, 2013, 12:33:41 AM »
So far ig0rb's has been great for me. I'll test anyone else's as they post and share my results. Running a Gigabyte Rev2 280x @ 1065 x 1500 @ 1.076v with cgminer settings of:

setx GPU_MAX_ALLOC_PERCENT 100
setx GPU_USE_SYNC_OBJECTS 1
cgminer.exe --scrypt -u xxxxx -p xxxxx -o http://xxxxx:3333 --gpu-platform 0 -d 0 -I 13 -g 2 --shaders 2048 --thread-concurrency 11200 -w 256 --lookup-gap 2
If I helped you out, maybe help me out: LLNSvxudA5icF4XG3ztR6U37GtXy2F4Anf

Free LTC: http://ltc4you.com/?r=29062

Offline onscreen

  • Newbie
  • *
  • Posts: 2
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #266 on: December 23, 2013, 11:33:36 PM »
Can anyone please re upload the original 24000 binary? Can't find it anywhere.

Offline Luddist

  • Newbie
  • *
  • Posts: 9
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #267 on: December 27, 2013, 07:45:59 PM »
Here are all of the 7950 binaries: https://www.mediafire.com/?2dpl2131gf69l8p

Donate to Lantis:
LTC : LackB1YdYLxqRpiYqvWfAzxViDe5WFjpMs
BTC : 162vF41ycBjZDWkZv1eouiaki4zxZ8PEmt


I never grabbed the 7970/280X binaries, could someone reupload those?

Offline ntrader

  • Newbie
  • *
  • Posts: 2
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #268 on: December 29, 2013, 09:41:20 PM »
I never grabbed the 7970/280X binaries, could someone reupload those?
http://www28.zippyshare.com/v/75198323/file.html

Offline DarkKnight

  • Jr. Member
  • **
  • Posts: 26
Re: Optimized scrypt kernel files for 7950/7970/7990/R9 280x
« Reply #269 on: December 29, 2013, 10:56:55 PM »
I never grabbed the 7970/280X binaries, could someone reupload those?
http://www28.zippyshare.com/v/75198323/file.html

Thank you.

Also, thanks to Lantis for the hard work.  ;)
BTC: 1LDYFTZhjHcxSWrgdzAwUgBT1T3b63iMf1
LTC: LcEAuGD8f7T7DRJqdRyTWhJKtGSAXDKQL3
DOGE: D5tEoH3JvXm5HRYtkbAvzun3RQTFg932T8