From e46c6a72bc152280293337c63dc15d772451a808 Mon Sep 17 00:00:00 2001 From: Luke Street Date: Mon, 22 Apr 2024 23:17:09 -0600 Subject: [PATCH] Split up README a bit --- README.md | 146 +-------------------------------------- docs/other_approaches.md | 74 ++++++++++++++++++++ docs/readme.md | 1 + docs/terminology.md | 67 ++++++++++++++++++ 4 files changed, 144 insertions(+), 144 deletions(-) create mode 100644 docs/other_approaches.md create mode 100644 docs/readme.md create mode 100644 docs/terminology.md diff --git a/README.md b/README.md index 04034f5..e9683f3 100644 --- a/README.md +++ b/README.md @@ -15,9 +15,9 @@ project structure and build system that uses decomp-toolkit under the hood. - [Goals](#goals) - [Background](#background) -- [Other approaches](#other-approaches) -- [Terminology](#terminology) - [Analyzer features](#analyzer-features) +- [Other approaches](docs/other_approaches.md) +- [Terminology](docs/terminology.md) - [Commands](#commands) - [ar create](#ar-create) - [ar extract](#ar-extract) @@ -79,148 +79,7 @@ binary that is byte-for-byte identical to the original, then we know that the de decomp-toolkit provides tooling for analyzing and splitting the original binary into relocatable objects, as well as generating the linker script and other files needed to link the decompiled code. -## Other approaches -### Manual assembly - -With existing GameCube/Wii decompilation tooling, the setup process is very tedious and error-prone. -The general process is: - -- Begin by disassembling the original binary with a tool like - [doldisasm.py](https://gist.github.com/camthesaxman/a36f610dbf4cc53a874322ef146c4123). This produces one giant - assembly file per section. -- Manually comb through the assembly files and fix many issues, like incorrect or missing relocations, incorrect or - missing symbols, and more. -- Manually find-and-replace the auto-generated symbol names based on other sources, like other decompilation projects - or a map file. 
(If you're lucky enough to have one) -- Manually determine data types and sizes, and convert them accordingly. (For example, `.4byte` -> `.float`, strings, - etc) -- Manually split the assembly files into individual objects. This is a very tedious process, as it requires identifying - the boundaries of each function, determining whether adjacent functions are related, finding associated - data from each data section, and cut-and-pasting all of this into a new file. - -Other downsides of this approach: - -- Manually editing the assembly means that the result is not reproducible. You can't run the script again to - make any updates, because your changes will be overwritten. This also means that the assembly files must be - stored in version control, which is not ideal. -- Incorrectly splitting objects is very easy to do, and can be difficult to detect. For example, a `.ctors` entry _must_ - be located in the same object as the function it references, otherwise the linker will not generate the correct - `.ctors` entry. `extab` and `extabindex` entries _must also_ be located in the same object as the function they - reference, have a label and have the correct size, and have a direct relocation rather than a section-relative - relocation. Otherwise, the linker will crash with a cryptic error message. -- Relying on assembly means that you need an assembler. For GameCube/Wii, this means devkitPro, which is a - large dependency and an obstacle for new contributors. The assembler also has some quirks that don't interact well - with `mwldeppc`, which means that the object files must be manually post-processed to fix these issues. (See the - [elf fixup](#elf-fixup) command) - -With decomp-toolkit: - -- Many analysis steps are automated and highly accurate. Many DOL files can be analyzed and split into re-linkable - objects with no configuration. -- Signature analysis automatically labels common functions and objects, and allows for more accurate relocation - rebuilding. 
-- Any manual adjustments are stored in configuration files, which are stored in version control. -- Splitting is simplified by updating a configuration file. The analyzer will check for common issues, like - incorrectly split `.ctors`/`.dtors`/`extab`/`extabindex` entries. If the user hasn't configured a split for these, - the analyzer will automatically split them along with their associated functions to ensure that the linker will - generate everything correctly. This means that matching code can be written without worrying about splitting all - sections up front. -- The splitter generates object files directly, with no assembler required. This means that we can avoid the devkitPro - requirement. (Although we can still generate assembly files for viewing, editing, and compatibility with other tools) - -### dadosod - -[dadosod](https://github.com/InusualZ/dadosod) is a newer replacement for `doldisasm.py`. It has more accurate function -and relocation analysis than `doldisasm.py`, as well as support for renaming symbols based on a map file. However, since -it operates as a one-shot assembly generator, it still suffers from many of the same issues described above. - -### ppcdis - -[ppcdis](https://github.com/SeekyCt/ppcdis) is one of the tools that inspired decomp-toolkit. It has more accurate -analysis than doldisasm.py, and has similar goals to decomp-toolkit. It's been used successfully in several -decompilation projects. - -However, decomp-toolkit has a few advantages: - -- Faster and more accurate analysis. (See [Analyzer features](#analyzer-features)) -- Emits object files directly, with no assembler required. -- More robust handling of features like common BSS, `.ctors`/`.dtors`/`extab`/`extabindex`, and more. -- Requires very little configuration to start. -- Automatically labels common functions and objects with signature analysis. - -### Honorable mentions - -[splat](https://github.com/ethteck/splat) is a binary splitting tool for N64 and PSX. 
Some ideas from splat inspired -decomp-toolkit, like the symbol configuration format. - -## Terminology - -### DOL - -A [DOL file](https://wiki.tockdom.com/wiki/DOL_(File_Format)) is the executable format used by GameCube and Wii games. -It's essentially a raw binary with a header that contains information about the code and data sections, as well as the -entry point. - -### ELF - -An [ELF file](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) is the executable format used by most -Unix-like operating systems. There are two common types of ELF files: **relocatable** and **executable**. - -A relocatable ELF (`.o`, also called "object file") contains machine code and relocation information, and is used as -input to the linker. Each object file is compiled from a single source file (`.c`, `.cpp`). - -An executable ELF (`.elf`) contains the final machine code that can be loaded and executed. It *can* include -information about symbols, debug information (DWARF), and sometimes information about the original relocations, but it -is often missing some or all of these (referred to as "stripped"). - -### Symbol - -A symbol is a name that is assigned to a memory address. Symbols can be functions, variables, or other data. - -**Local** symbols are only visible within the object file they are defined in. -These are usually defined as `static` in C/C++ or are compiler-generated. - -**Global** symbols are visible to all object files, and their names must be unique. - -**Weak** symbols are similar to global symbols, but can be replaced by a global symbol with the same name. -For example: the SDK defines a weak `OSReport` function, which can be replaced by a game-specific implementation. -Weak symbols are also used for functions generated by the compiler or as a result of C++ features, since they can exist -in multiple object files. The linker will deduplicate these functions, keeping only the first copy. 
- -### Relocation - -A relocation is essentially a pointer to a symbol. At compile time, the final address of a symbol is -not known yet, therefore a relocation is needed. -At link time, each symbol is assigned a final address, and the linker will use the relocations to update the machine -code with the final addresses of the symbol. - -Before: - -```asm -# Unrelocated, instructions point to address 0 (unknown) -lis r3, 0 -ori r3, r3, 0 -``` - -After: - -```asm -# Relocated, instructions point to 0x80001234 -lis r3, 0x8000 -ori r3, r3, 0x1234 -``` - -Once the linker performs the relocation with the final address, the relocation is no longer needed. Still, sometimes the -final ELF will still contain the relocation information, but the conversion to DOL will **always** remove it. - -When we analyze a file, we attempt to rebuild the relocations. This is useful for several reasons: - -- It allows us to split the file into relocatable objects. Each object can then be replaced with a decompiled version, - as matching code is written. -- It allows us to modify or add code and data to the game and have all machine code still to point to the correct - symbols, which may now be in a different location. -- It allows us to view the machine code in a disassembler and show symbol names instead of raw addresses. ## Analyzer features @@ -261,7 +120,6 @@ Generates `ldscript.lcf` for `mwldeppc.exe`. - Support RSO files - Add more signatures -- Rework CodeWarrior map parsing ## Commands diff --git a/docs/other_approaches.md b/docs/other_approaches.md new file mode 100644 index 0000000..abe4d84 --- /dev/null +++ b/docs/other_approaches.md @@ -0,0 +1,74 @@ +# Other approaches + +## Manual assembly + +With existing GameCube/Wii decompilation tooling, the setup process is very tedious and error-prone. +The general process is: + +- Begin by disassembling the original binary with a tool like + [doldisasm.py](https://gist.github.com/camthesaxman/a36f610dbf4cc53a874322ef146c4123). 
This produces one giant + assembly file per section. +- Manually comb through the assembly files and fix many issues, like incorrect or missing relocations, incorrect or + missing symbols, and more. +- Manually find-and-replace the auto-generated symbol names based on other sources, like other decompilation projects + or a map file. (If you're lucky enough to have one) +- Manually determine data types and sizes, and convert them accordingly. (For example, `.4byte` -> `.float`, strings, + etc) +- Manually split the assembly files into individual objects. This is a very tedious process, as it requires identifying + the boundaries of each function, determining whether adjacent functions are related, finding associated + data from each data section, and cut-and-pasting all of this into a new file. + +Other downsides of this approach: + +- Manually editing the assembly means that the result is not reproducible. You can't run the script again to + make any updates, because your changes will be overwritten. This also means that the assembly files must be + stored in version control, which is not ideal. +- Incorrectly splitting objects is very easy to do, and can be difficult to detect. For example, a `.ctors` entry _must_ + be located in the same object as the function it references, otherwise the linker will not generate the correct + `.ctors` entry. `extab` and `extabindex` entries _must also_ be located in the same object as the function they + reference, have a label and have the correct size, and have a direct relocation rather than a section-relative + relocation. Otherwise, the linker will crash with a cryptic error message. +- Relying on assembly means that you need an assembler. For GameCube/Wii, this means devkitPro, which is a + large dependency and an obstacle for new contributors. The assembler also has some quirks that don't interact well + with `mwldeppc`, which means that the object files must be manually post-processed to fix these issues. 
(See the + [elf fixup](/README.md#elf-fixup) command) + +With decomp-toolkit: + +- Many analysis steps are automated and highly accurate. Many DOL files can be analyzed and split into re-linkable + objects with no configuration. +- Signature analysis automatically labels common functions and objects, and allows for more accurate relocation + rebuilding. +- Any manual adjustments are stored in configuration files, which are stored in version control. +- Splitting is simplified by updating a configuration file. The analyzer will check for common issues, like + incorrectly split `.ctors`/`.dtors`/`extab`/`extabindex` entries. If the user hasn't configured a split for these, + the analyzer will automatically split them along with their associated functions to ensure that the linker will + generate everything correctly. This means that matching code can be written without worrying about splitting all + sections up front. +- The splitter generates object files directly, with no assembler required. This means that we can avoid the devkitPro + requirement. (Although we can still generate assembly files for viewing, editing, and compatibility with other tools) + +## dadosod + +[dadosod](https://github.com/InusualZ/dadosod) is a newer replacement for `doldisasm.py`. It has more accurate function +and relocation analysis than `doldisasm.py`, as well as support for renaming symbols based on a map file. However, since +it operates as a one-shot assembly generator, it still suffers from many of the same issues described above. + +## ppcdis + +[ppcdis](https://github.com/SeekyCt/ppcdis) is one of the tools that inspired decomp-toolkit. It has more accurate +analysis than doldisasm.py, and has similar goals to decomp-toolkit. It's been used successfully in several +decompilation projects. + +However, decomp-toolkit has a few advantages: + +- Faster and more accurate analysis. 
(See [Analyzer features](/README.md#analyzer-features)) +- Emits object files directly, with no assembler required. +- More robust handling of features like common BSS, `.ctors`/`.dtors`/`extab`/`extabindex`, and more. +- Requires very little configuration to start. +- Automatically labels common functions and objects with signature analysis. + +## Honorable mentions + +[splat](https://github.com/ethteck/splat) is a binary splitting tool for N64 and PSX. Some ideas from splat inspired +decomp-toolkit, like the symbol configuration format. diff --git a/docs/readme.md b/docs/readme.md new file mode 100644 index 0000000..b1da1e2 --- /dev/null +++ b/docs/readme.md @@ -0,0 +1 @@ +### Visit the [dtk-template](https://github.com/encounter/dtk-template) repository for additional documentation, including a guide. diff --git a/docs/terminology.md b/docs/terminology.md new file mode 100644 index 0000000..91a9829 --- /dev/null +++ b/docs/terminology.md @@ -0,0 +1,67 @@ +# Terminology + +## DOL + +A [DOL file](https://wiki.tockdom.com/wiki/DOL_(File_Format)) is the executable format used by GameCube and Wii games. +It's essentially a raw binary with a header that contains information about the code and data sections, as well as the +entry point. + +## ELF + +An [ELF file](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) is the executable format used by most +Unix-like operating systems. There are two common types of ELF files: **relocatable** and **executable**. + +A relocatable ELF (`.o`, also called "object file") contains machine code and relocation information, and is used as +input to the linker. Each object file is compiled from a single source file (`.c`, `.cpp`). + +An executable ELF (`.elf`) contains the final machine code that can be loaded and executed. It *can* include +information about symbols, debug information (DWARF), and sometimes information about the original relocations, but it +is often missing some or all of these (referred to as "stripped"). 
+
+## Symbol
+
+A symbol is a name that is assigned to a memory address. Symbols can be functions, variables, or other data.
+
+**Local** symbols are only visible within the object file they are defined in.
+These are usually defined as `static` in C/C++ or are compiler-generated.
+
+**Global** symbols are visible to all object files, and their names must be unique.
+
+**Weak** symbols are similar to global symbols, but can be replaced by a global symbol with the same name.
+For example, the SDK defines a weak `OSReport` function, which can be replaced by a game-specific implementation.
+Weak symbols are also used for functions generated by the compiler or as a result of C++ features, since they can exist
+in multiple object files. The linker will deduplicate these functions, keeping only the first copy.
+
+## Relocation
+
+A relocation is essentially a pointer to a symbol. At compile time, the final address of a symbol is
+not yet known, so a relocation is needed.
+At link time, each symbol is assigned a final address, and the linker will use the relocations to update the machine
+code with the final address of each symbol.
+
+Before:
+
+```asm
+# Unrelocated, instructions point to address 0 (unknown)
+lis r3, 0
+ori r3, r3, 0
+```
+
+After:
+
+```asm
+# Relocated, instructions point to 0x80001234
+lis r3, 0x8000
+ori r3, r3, 0x1234
+```
+
+Once the linker performs the relocation with the final address, the relocation is no longer needed. Sometimes the
+final ELF still contains the relocation information, but the conversion to DOL will **always** remove it.
+
+When we analyze a file, we attempt to rebuild the relocations. This is useful for several reasons:
+
+- It allows us to split the file into relocatable objects. Each object can then be replaced with a decompiled version,
+  as matching code is written.
+- It allows us to modify or add code and data to the game and have all machine code still point to the correct
+  symbols, which may now be in a different location.
+- It allows us to view the machine code in a disassembler and show symbol names instead of raw addresses.