callq retq

I think one of your return paths doesn't pop rbp. Just leave out the

    pushq   %rbp
    movq    %rsp, %rbp

    pop     %rbp

altogether. gcc defaults to -fomit-frame-pointer when optimizing, so compiler-generated code normally doesn't have them either.

Or fix your non-return-zero path to also pop rbp.

Actually, you're screwed because your function appears to be designed to put stuff on the stack and never take it off. If you want to invent your own ABI where space below the stack pointer can be used to return arrays, that's interesting, but you'll have to keep track of how big they are so you can adjust rsp back to pointing at the return address before ret.

I recommend against loading the return address into a register and replacing a later ret with jmp *%rdx or something. That would throw off the call/return address prediction logic in modern CPUs, and cause a stall the same as a branch mispredict. (See http://agner.org/optimize/). CPUs hate mismatched call/ret. I can't find a specific page to link about that right now.

See https://stackoverflow.com/tags/x86/info for other useful resources, including ABI documentation on how functions normally take args.

You could copy the return address down to below the array you just pushed, and then run ret, to return with %rsp modified. But unless you need to call a long function from multiple call sites, it's probably better to just inline it in one or two call-sites.

If it's too big to inline at that many call sites, your best bet, instead of using call and copying the return address down to the new location, would be to emulate call and ret. The caller does:

    # put args in some registers
    lea   .ret_location(%rip), %rbx
    jmp   my_weird_helper_function
.ret_location:  # in NASM/YASM, labels starting with . are local labels, and don't show up in the object file.
         # GNU assembler might only treat symbols starting with .L that way.
    # ... execution continues here after the helper jumps back


my_weird_helper_function:
    # use args, potentially modifying the stack
    jmp *%rbx   # return
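
If it helps to see the control flow without the assembly details, here's a rough C model of the same idea. It leans on the GNU C labels-as-values extension (&&label and goto *ptr), so it's gcc/clang only. The function name call_helper_twice and the helper's arithmetic are made up for illustration; the ret variable stands in for %rbx.

```c
/* Sketch only: models "return address in a register" with GNU C computed goto.
 * call_helper_twice and the helper body are hypothetical, not from the question. */
int call_helper_twice(int x)
{
    void *ret;      /* stands in for %rbx: where the helper should jump back to */
    int arg = x;    /* stands in for an argument register */

    ret = &&ret_location_1;   /* lea .ret_location(%rip), %rbx */
    goto helper;              /* jmp my_weird_helper_function  */
ret_location_1:
    ret = &&ret_location_2;   /* second call site, different return location */
    goto helper;
ret_location_2:
    return arg;

helper:
    arg = arg * 2 + 1;        /* the helper's work (arbitrary for this sketch) */
    goto *ret;                /* jmp *%rbx: "return" to whichever site jumped here */
}
```

Each "call site" loads a different label into ret before jumping, which is the whole trick: the helper never touches a return address on the stack. A compiler is free to turn this into something quite different from the hand-written asm, so treat it as a model of the control flow only.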

You need a really good reason to use something like this. And you'll have to justify / explain it with a lot of comments, because it's not what readers will be expecting. First of all, what are you going to do with this array that you pushed onto the stack? Are you going to find its length by subtracting rsp from rbp or something?

Interestingly, even though push has to modify rsp as well as do a store, it has one-per-clock throughput on all recent CPUs. Intel CPUs have a stack engine so that stack ops don't have to wait for rsp to be computed in the out-of-order engine when it's only been changed by push/pop/call/ret. (Mixing push/pop with mov 4(%rsp), %rax or whatever results in extra uops being inserted to sync the OOO engine's rsp with the stack engine's offset.) Intel/AMD CPUs can only do one store per clock anyway, but Intel SnB and later can pop twice per clock.

So push/pop is actually not a terrible way to implement a stack data structure, esp. on Intel.

Also, your code is structured weirdly. main() is split across r8_digits_to_stack. That's fine, but you never take advantage of falling through from one block into the next, so it just costs you an extra jmp in main for no benefit and a huge readability downside.

Let's pretend your loop is part of main, since I already talked about how it's super-weird to have a function return with %rsp modified.

Your loop could be simpler, too. Structure things with a jcc back to the top, when possible.

There's a small benefit to avoiding the upper 8 registers (r8-r15): 32bit insns with the classic registers don't need a REX prefix byte. So let's pretend we just have our starting value in %rax.

digits_to_stack:
# put each bit of %rax into its own 8 byte element on the stack for maximum space-inefficiency

    movq   %rax, %rdx  # save a copy

    xor    %ecx, %ecx  # setcc is only available for byte operands, so zero %rcx

    # need a test at the top after transforming while() into do{}while
    test   %rax, %rax  # fewer insn bytes to test for zero this way
    jz  .Lend

    # Another option can be to jmp to the test at the end of the loop, to begin the first iteration there.

.align 16
.Lpush_loop:
    shr   $1, %rax   # shift the low bit into CF, set ZF based on the result
    setc  %cl       # set %cl to 0 or 1, based on the carry flag
    # movzbl %cl, %ecx  # zero-extend
    pushq %rcx
      #.Lfirst_iter_entry
      # test %rax, %rax   # not needed, flags still set from shr
    jnz  .Lpush_loop
.Lend:
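
For reference, here's roughly what that loop computes, written as C. This is only a model under my own assumptions: the out array stands in for the stack, and digits_to_array is a made-up name, not something from your code.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: one 8-byte element per bit of value, low bit first.
 * out[] stands in for the stack; the return value is how many elements were "pushed". */
size_t digits_to_array(uint64_t value, uint64_t out[64])
{
    size_t n = 0;

    if (value == 0)                 /* test %rax, %rax ; jz .Lend */
        return 0;

    do {
        uint64_t bit = value & 1;   /* the bit that shr moves into CF */
        value >>= 1;                /* shr $1, %rax (also sets ZF from the result) */
        out[n++] = bit;             /* setc %cl ; pushq %rcx */
    } while (value != 0);           /* jnz .Lpush_loop */

    return n;
}
```

The low bit is stored first, so in the real push version it ends up at the highest address, and the most significant set bit is what rsp points at when the loop exits. Unlike the asm, the C version hands back the element count, which is exactly the bookkeeping the push loop leaves you to do yourself.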

This version still kinda sucks, because on Intel P6 / SnB CPU families, using a wider register after writing a smaller part leads to a slowdown (a stall on pre-SnB, or an extra uop on SnB and later). Others, including AMD and Silvermont, don't track partial registers separately, so writing to %cl has a dependency on the previous value of %rcx. (Writing to a 32bit reg zeros the upper 32, which avoids the partial-reg dependency problem.) A movzx to zero-extend from byte to long would do what Sandybridge does implicitly, and give a speedup on older CPUs.

This won't quite run in a single cycle per iteration on Intel, but might on AMD. Extracting the bit with mov / and $1 instead is not bad, but the and instruction affects the flags, making it harder to loop just based on shr setting the flags.

Note that your old version's sarq %rax shifts in copies of the sign bit, not necessarily zeros, so with negative input your old version would be an infinite loop, and would segfault once you ran out of stack space (push would try to write to an unmapped page).
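
To see the difference without burning through your stack, here's a small C sketch that counts how many shifts it takes to reach zero, with an iteration cap so the sar case terminates. The function names and the limit parameter are mine. sar is modeled explicitly by copying the sign bit back in, since >> on a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* Count shifts until v becomes zero, like shr $1 in a loop; gives up at limit. */
int shr_steps(uint64_t v, int limit)
{
    int steps = 0;
    while (v != 0 && steps < limit) {
        v >>= 1;                    /* logical shift: zeros come in at the top */
        steps++;
    }
    return steps;
}

/* Same loop, but modeling sar $1: the old sign bit is copied back into bit 63. */
int sar_steps(uint64_t v, int limit)
{
    const uint64_t sign = UINT64_C(1) << 63;
    int steps = 0;
    while (v != 0 && steps < limit) {
        v = (v >> 1) | (v & sign);  /* arithmetic shift: the sign bit is preserved */
        steps++;
    }
    return steps;
}
```

With a non-negative input both return the same count. With the top bit set, shr_steps finishes after 64 shifts, while sar_steps runs into the cap because the value gets stuck at all-ones (-1) and never reaches zero.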
