IR and AST Optimizers in Decompilers

The following is a small guide that will help users writing decompiler plugins decide whether they need to work at the IR (Intermediate Representation) level or at the AST (Abstract Syntax Tree) level. The recommendations apply to both JEB decompiler engines, dexdec (for Android Dex/Dalvik) and gendec (generic decompiler engine.

Decompilation Pipeline

A method undergoing decompilation goes through the following simplified pipeline:

  1. The low-level native code (machine code or bytecode) is converted to low-level IR
  2. Some augmentation take place, including SSA transformation and typing
  3. IR processors lift and clean the low-level IR
  4. The final high-level IR is converted to an AST
  5. AST processors clean and beautify the code
  6. The final AST is rendered as pseudo-code

The steps 3 (IR processing) and 5 (AST processing) are customizable by the user through JEB’s API. Indeed, custom plugins are sometimes necessary to perform work not done by JEB’s built-in optimizers.

IR vs AST

The following comparison between IR and AST will help you decide which plugin is better suited to perform some type of work.

  • The number of IR elements to deal with is substantially smaller than the AST counterpart. As such, it may be easier to learn at first. The AST being more abstract and closer to final pseudo code, there are necessarily more types of elements (e.g. a Break element, representing a break; statement, does not exist at the IR level). However, modifying IR statements requires more care than modifying the AST tree.
  • The IR of a method is a flat sequence of instructions, organized into basic blocks. The flow of execution between the blocks is clear and concise. On the other hand, the AST being a tree, its navigation is not as straight-forward as a flat IR listing. While the concept of blocks exists, they are not necessarily basic blocks, and the flow of execution in the AST is not trivial to determine.
  • A consequence of the above is that data analysis is easier done at the IR level than at the AST level. The IR framework provides Data Flow Analysis objects with easy-to-use ways to determine where and by what variables are being accessed. This is a fundamental prerequisite for many non-trivial optimizers whose goal is code cleaning or restructuring (e.g. constant and variable propagation, dead code elimination, etc.).
  • Continuing the above, the IR framework generally offers more facility and helpers to perform advanced optimization, such as deobfuscation. Examples: dexdec offers an emulator and sandbox engine at the IR level, something unavailable at the AST level; gendec offers pattern matching facility making the development of complex IR rewriting rules easy.
  • The AST is closer to the final generated pseudo-code. As such, it is a place of choice to perform final beautification or clean-up passes. High-level clean-up, requiring the insertion of AST elements with no IR equivalents, can only be done at the AST level.

Generally, working at the AST level will seem more approachable and an easiest entry-point to writing decompiler plugins. However, in most cases, IR processors will be better suited to perform non-trivial optimizations and deobfuscation.

Development

For dexdec, IR and AST plugins can be developed as compiled jar, or plugin scripts (Java or Python). Plugin scripts are extremely convenient for quick prototyping. See example code in your JEB coreplugins/scripts/ folder.

For gendec, IR and AST plugins can be developed as compiled jar only. Support for plugin scripts will come soon.

Resources

This blog contains several tutorials on how to get started with writing IR and AST plugins for both dexdec and gendec.

You will also find examples in this GitHub repository.

API Reference: dexdec IR, dexdec AST, gendec IR, gendec AST