A note about 2020 Q1 updates (versions 3.10 to 3.16) regarding the DEX/Dalvik decompiler modules:
Generic String Decryption
Lambda Recovery
Unreflecting Code
Decompiling Java Bytecode
Auto-Rename All
Generic String Decryption
JEB ships with a generic deobfuscator that can perform on-the-fly string decryption and other complex optimizations. Although this optimizer performs safe (i.e., guaranteed) optimizations in most cases, it is unsafe in the general case case and therefore, may be disabled in the options. Refer to the Engines options .parsers.dcmp_dex.EnableDeobfuscators and .parsers.dcmp_dex.EmulationSupport.
Many code protectors offer options to replace immediate string constants by method invocations that perform on-the-fly decryption.
A variety of techniques exist, ranging from simple one-off trivial decryptor methods, to complex schemes involving object(s) creation, complicated decryptors injected in third-party packages, non-trivial logic, junk code meant to slow down analyzers, use of opaque predicates, etc. They are implemented in an infinite number of ways. JEB’s generic deobfuscator can perform quick, safe emulation of the intermediate representation to provide a replacement. It may sometimes fail or bail out due to several reasons, such as performance or pitfalls like anti-emulation and anti-sandboxing techniques.
Example 1
The above code (blue box) ends up being deobfuscated to:
Example 2:
The above code is deobfuscated to:
Below, a decryptor that had been injected into the com.google.gson.Gson() class:
Example 3:
One last example, which was involuntarily – yet, quite timely! – provided by a user:
Decrypting all strings: The decryptor kicks in when decompiling methods only. At the moment, if a string happens to be successfully decrypted, the optimizer does not attempt to recover all similarly encrypted strings in the code, although it is most certainly an addition that will make it in a future software update.
Rendering: You may quickly identify decrypted strings in the client as they are rendered using a special color associated with the itemId STRING_GENERATED, by default rendered in a flashy pink color in light and dark themes. Hovering over such items will bring up a pop-up with additional origin information, like the underlying code that would have generated that string:
API: – From a DEX perspective: Generated strings are artificial. Therefore, IDexString.isArtificial() would return true. – From a Java/AST perspective: IJavaConstant objects that embed origin information do so using the “origin” tag. Use IJavaConstant.getTags().get("origin") to retrieve it.
Lambda Recovery
JEB attempts to perform Java 8 style lambda recovery and reconstruction.
Desugared Lambdas
Recovery and reconstruction does not rely on any type of metadata 1, such as special prefixes -$$Lambda$ for classes and methods implementing desugared lambdas in dex 37-.
You may therefore see constructs like this:
Options: Lambda reconstruction can be disabled in the options (Edit, Options, Engines, …). Lambda rendering can also be disabled in the options, as well as on-demand by right-clicking a decompiled view, Rendering Options….
API Note: In the above cases, the underlying Java AST may be a IJavaNew or IJavaStaticField node. This is not the case for real (not desugared) lambdas, which map to an IJavaCall node – see below.
API Note: Such lambdas map to an IJavaCall node for which isCustomCall() will return true.
Unreflecting Code
Many code protectors make heavy use of reflection – combined with string encryption, as we’ll see below – to obfuscate code. In practice, reflection is limited to method invocation (static and virtual), static and non-static field setting and getting, and new instance creation. A few examples:
v = Class.forName("java.lang.Integer").getMethod("valueOf",
String.class).invoke(null, str);
// instead of
v = Integer.valueOf(str);
Class.forName("SomeClassName").getField("b").setInt(x, 4);
// instead of
x.b = 4;
Class.forName("java.lang.String").getConstructor(byte[].class)
.newInstance(val);
// instead of
new String(arg6);
Such code is generally protected by a catch-all handler that forwards the cause of any exception raised by a reflection issue:
By default, JEB will attempt to unreflect code. This deobfuscator is potentially unsafe and may be disabled in the options. Note that you always have the ability to choose, for a particular decompilation, whether some options should be temporarily enabled or disabled, by pressing CTRL+TAB (or COMMAND+TAB on macOS) to decompile (same as menu Action, Decompile with options…).
So, in a nutshell, code normally decompiled to:
will be decompiled to:
Technical Note: This optimizer works on the Intermediate Representation manipulated by the decompiler, not to be confused with the AST rendered as its output. (The AST cleaner that was described in an older post is more limited than this IR optimizer.)
Last-step failures: Successfully unreflecting code eventually depends on being able to find the intended target method or field matching the provided description (method parameter types or field type). Failure to do so will generate a log like "A candidate field/method/constructor for unreflection was not found".
Decompiling Java Bytecode
JEB supports JLS bytecode decompilation for *.class files and jar-like archives (jar, war, ear, etc.). The Java bytecode is converted to Dalvik using Android’s dx by default. Users may choose to use d8 (not recommended for now) instead by selecting so in the Options.
The resulting DEX file(s) are processed as usual.
You may use this to decompile Android Library files (*.aar files) in JEB.
Auto-Rename All
JEB 3.13 introduced a new generic action, Auto-Rename All. Its implementation is at the discretion of code plugins. The DEX plugin implements it, therefore users may execute Action, Auto-Rename All… at any time (generally after processing an obfuscated file) in order to rename code items such as field, method, or class names, to something more easily processable for our -limited- human brains.
Look at this horrendous obfuscation scheme below. It’s using right-to-left unicode characters to seriously mess up rendering:
Let’s run Action, Auto-Rename All… on this file:
As usual, feel free to join us on Slack, message us on Twitter, or email us privately at support@pnfsoftware.com.
Until next time!
–
Relying on metadata leads to false negatives in the best case – e.g., when the code has been minified by something like ProGuard; it leads to false positives in the worst case – e.g. forged metadata to incite the decompiler to generate inaccurate or wrong code. ↩
One of our users recently reported an Android resources.arsc file seemingly unprocessed by JEB. Upon closer inspection, it turned out this file was not a regular binary resources file, but instead, a compressed resources container serving as a generator for localized resources.arsc. Older versions of Google Play (eg, com.android.vending 11.6.18) and other official Google applications have been using this type of file, which is stored as a raw asset and sometimes named metaresources.arsc.
I decided to have a quick look. However, for better or worse, what was planned as a superficial exploration turned into a deep-dive into the rabbit-hole that was the “meta-arsc” parsing code.
Those files, as said above, are used to generate localized (non-English) resources.arsc files. That means that the client application can generate lightweight resources files on the fly. And presumably, APKs as well. Since this mechanism seems to be primarily used by the Play Store app, a reasonable use case could be Dynamic Delivery.
A brief description of the file format can be found below
The fully annotated JDB2 is here (as well as the source apk) if you’d like to write your own implementation of a parser and localized arsc generator. The parser and generator have been thoroughly deobfuscated and commented out where need be. Package: com.google.d.a.a.a.a. Client code: FilteredResourceHelper. Drop both files (jdb2, apk) in a folder and open the JDB2 file in JEB
What does it look like when metaresources.arsc is processed in JEB?
Binary Format
Disclaimer: Specification is a work-in-progress. Refer to the JDB2 annotation and code to fill in the gaps.
metaresources.arsc=
BE_UINT32 cnt count of languages
BE_UINT16[cnt] langs 2-char language codes
MetaEntry[cnt] metaentries meta entries matching the language codes
CompressedResourceTableChunk restab a compressed resource table (code: 0x1002)
EOF - not necessarily the EOF, but all metares files examined contained a single resource
chunk, which is a compressed resource tab
MetaEntry=
BE_UINT32 magic the value 'META'
BE_UINT32 entrysize complete entry size (including the above magic) in bytes
BE_UINT16 lang language code
VAR_INT32 cnt1 .
VAR_INT32[cnt1] offsets1 a custom serialization of java.util.BitSet (refer to JDB2 for details) holding positions for strings and string styles stored in string pool chunks
VAR_INT32 cnt2 .
VAR_INT32[cnt2] offsets2 offsets to Table Package chunk entries (types, typespecs)
=> Compressed entries:
- 5 types exist, basically non-XML chunk types
- Their type code is the same as arsc's with the 0x1000 bit set
- List of chunks:
StringPool= refer to JDB2, class CompressedStringPoolChunk
ResourceTable= refer to JDB2, class CompressedResourceTableChunk
Package= refer to JDB2, class CompressedPackageChunk
Type= refer to JDB2, class CompressedTypeChunk
TypeSpec= refer to JDB2, class CompressedTypeSpecChunk
Update (May 2024): Debugging oddities with API levels 34+, native code debugging issues with API level 34. Refer to the end of this post.
Update (March 2020): The original post describing the impossibility to read locals that do not have associated DebugLocalInfo on Android 9 (P) and 10 (Q) was fixed in Android 11 (R).
[Original Post] Issues related to reading local vars
Lower-level components of the Dalvik debugging stack, namely JDWP, JVM TI, and JVM DI implementations, were upgraded in Android Pie. It is something we indirectly noticed after installing P.beta-1 in the Spring of 2018. For lack of time, and because our recommendation is to debug apps (non-debuggable and debuggable alike) using API levels 21 (Lollipop) to 27 (Oreo), reversers could easily avoid road blocks which manifested in JEB as the following:
An empty local variable panel (with the exception of this for non-static methods)
Type 35 JDWP errors reported in the console, indicating that an invalid slot was being accessed
A type 35 error in this context means an invalid local slot is being accessed. In the example shown above, it would mean accessing a slot outside of [0, 10] (per the .registers directive) since the method declares a frame of 11 registers.
The second type of noticeable errors (not visible in the screenshot) were mix-ups between variable indices. Normally, and up to the JDWP implementation used in Android Pie, indices used to access slots were Java-style parameter indices (represented in Dalvik as pX), instead of Dalvik-style indices (vX). Converting from one to the other is trivial assuming the method staticity and prototype is known. It is a matter of generating pX so that they end up at the bottom of the frame. In the case above:
v0 p2
v1 p3
v2 p4
..
v9
v10 p0
v11 p1
When issuing a JDWP request (16,1) to read frame slots, we would normally use pX indices. It is no longer the case with Android P and Q: vX indices are to be used.
Open up JEB, start debugging any APK, switch to the Terminal view, and type ‘info’. In the context of JDWP, this JEB Debugger command issues Info and Sizes requests. Notice the differences:
=> On Android Oreo (API 27):
VM> info
Debuggee is running on ?
VM information: JDWP:"Android Runtime 2.1.0" v1.6 (VM:Dalvik v1.6.0)
VM identifier sizes: f=8,m=8,o=8,rt=8,fr=8
=> On Android Pie (API 28) and Android Q:
VM> info
Debuggee is running on ?
VM information: JDWP:"Java Debug Wire Protocol (Reference Implementation) version 1.8
JVM Debug Interface version 1.2
JVM version 0 (Dalvik, )" v1.8 (VM:Dalvik v0)
VM identifier sizes: f=4,m=4,o=8,rt=8,fr=8
Notice the reported version 1.2 for JVM DI, previously unspecified, and reported version 1.8 for JDWP, likely the cause of the breakage. Also note ID encoding size updates. JDWP had been reported a 1.6 version number, as well as field and method IDs encoded on 8 bytes, for as long as I can remember.
The vX/pX index issue was easily solved. It took a little while to crack the second issue. A superficial browsing of AOSP did not show anything fruitful, but after digging around, it seemed clear that this updated implementation of JDWP used CodeItem variables’ debug information to determine which variables are worth checking, and using what type.
See below: at address 2, v0 has been declared and is rendered. v1 has not been declared yet, the debugger cannot read it (we’ll get error 35).
Single-step: at address 4, v1 is declared. Although it is uninitialized, the debugger can successfully read the var:
So – Up until P, this metadata information, when present (almost all Release-type builds of legitimate and malware files alike discard it), had been considered indicative. Now, the debugger takes it literally. There are multiple candidate reasons as of why, but an obvious one is Safety. JDWP has been known to have the potential to crash the VM when receiving reading requests for frame variables using a bad type. E.g., requesting to read an integer-holding slot as a reference would most likely crash the target VM. Using type information providing in metadata, a debugger server can now ensure that a debugger requesting to read a slot as type T is indeed a valid request – assuming the metadata is legitimate, and since the primary use case is to debug applications inside IDE, which hold source information used to generate valid debug metadata, the assumption is fair.
Validating access to local vars has the interesting side-effect to act as an anti-debugging feature. While debugging the app remains possible, not being able to easily read some locals (parameters pX are always readable though), can be quite an annoyance.
In the future, how could we work around the JDWP limitation? Well, aside from the obvious cop out “use Oreo or below”, an idea would be to extend JEB’s –makeapkdebug option (that generates a debuggable version of a non-debuggable APK) to insert DEX metadata information specifying that all variables of a frame are used and of a given type. That may not work depending on the type of validation performed by the DEX verifier, but it’s something worth exploring. Maybe more simply, an alternative could be a custom AOSP build that disabled that feature. Or better yet, finding if a system property exists to disable/enable that JDWP functionality.
A final note: debugging non-debuggable APKs on Android Pie or above also proved more difficult, if not practically impossible, than on Oreo and below. Assuming your phone is rooted, here’s a solution (found when browsing around AOSP commits). On a rooted phone:
The command set -57 is a custom JDWP command set for Android, used for DDMS (Dalvik Debug Monitor System). JEB does not support those requests. Starting with JEB 5.13, those target-issued requests will be made more explicit (specifically saying they are DDM messages), but also less verbose (below INFO level) to avoid cluttering the logger.
[May 2024] Problems debugging native code on Android 14
JEB has difficulties debugging native code with Android 14 / API level 34. The native threads may suspend over an invalid memory access (SIGSEGV, signal 11) received when simply interacting with the app:
The issue is only present with API 34. It is not present with API levels <=33 and seems to have disappeared with the current preview of API 35 (upcoming Vanilla Ice Cream release).
We do not have a workaround in place at this time, and can only recommend avoiding debugging of native code with Android 14 / API level 34. We will update this post if/when a workaround is found.
As the latest update makes its way to all users (changelog), it is a good time to quickly recap additions related to Android analysis that made it into JEB versions 3.1.4, 3.1.5, and 3.2.
Dalvik Decompiler Updates
The newest releases of JEB contain several improvements to the Dalvik decompiler. I will highlight only a couple that users may find interesting. 1
Enumerations
Compiled Java enumerations can be complicated beasts. JEB attempts to re-sugar them to the best of its ability. On failure, regular classes extending java.lang.Enum will be rendered.
Obfuscation sometimes destroy important synthetic fields and structures that allow recovery heuristics to work. However, support should function reasonably well, even on enumeration data that was intentionally shuffled to generate decompilation errors. Moreover, and to keep with the spirit of interactivity in JEB, enumerated fields can be renamed – and it is done consistently over the code base, including over reconstructed switches making use of such enums.
Custom enumerated constants are also properly reconstructed, including:
Field annotations
Custom initializers (see below)
Additional methods and method overrides
Switches
Support was recently added for switch-on-enum and switch-on-string (partial support for the latter, to be continued in the next software update).
Inner classes, Anonymous classes
We improved rendering support for named- and anonymous-inner classes. Properly rendering anonymous classes in particular is made difficult by the fact that some of its arguments are captured from the outer classes. Properly rendering anonymous constructors, with exact argument types and position, is also challenging.
Lately, a user sent us a sample making use of an anonymous class initializer to hide string decryption code. See below:
The anonymous class extends Android’s OnActivityResultListener, instantiates the object, and tosses it immediately.
Decryption code takes place in the initializer. Note the captured arguments from the outer container method __m: i, _b. Access to other private class fields is made via synthetic accessor calls that were re-sugared into seemingly direct field access (BA._b).
Plugin options
Remember that some decompiler properties are publicly available in the options: (menu: Edit, Options, Advanced, Engines)
All Dalvik decompilation options: see the .parsers.dcmp_dex.* namespace
All Java rendering options of decompiled code: see the .parsers.dcmp_dex.text.* namespace
1)Rendering options are real-time options that can be changed after the fact to customize the output. Right-click on a decompiled class output, and select Rendering Options:
2) Decompilation options are used to guide and customize the decompilation. They can be changed in the Engines options, or more simply, when performing a decompilation itself, by invoking “Decompile with Options…” instead of “Decompile”.
Keyword for “Decompile with Options”: CTRL+TAB (Windows, Linux) or COMMAND+TAB (macOS)
API additions
Essential updates to:
IJavaSwitch: additional methods to access switch-on-enum and switch-on-string data
IJavaForEach: additional type introduced to manipulate for-each statements: for(Type var: iterator_or_array) { … }
Other changes, What next
JEB 3.2 contains other improvements, such as:
Better auto-naming, including default usage of debug data, if present (can be disabled in the options)
Improved typing and type propagation
Additional IR and AST optimizations
Better exceptional flow processing
Rendering of try-catch, synchronized blocks, etc.
Decompilation of invoke-polymorphic (invoke-custom is not supported, see below the part on lambdas on method handles)
We have more planned for the coming releases, including:
Improved support for switch-on-string. As said earlier, some of those switches, when properly detected, are re-sugared into legal Java-8 switch-on-string. However, the nature of those high-level constructs (they are implemented as double-conditionals, sometimes double-switches) makes it quite hard in some cases to provide proper reconstruction. It is something that will be improved in the future.
Support for generics. We had decided to not implement Java 5-style type generic since the information, when provided, is stored as pure metadata and should not be trusted. However, in practice, it turns out to be helpful when auditing legitimate, non-obfuscated compiled apps. We will add optional support for that in a coming release.
Support for try-with-resources.try(resource)/catch/finally are difficult very-high-level idioms to reconstruct. Optimizing compilers generate a substantial amount of additional, highly optimized code to implicitly catch exceptions and auto-close resources, making it extra difficult to reconstruct in the general case. We will likely introduce partial support before the summer.
Lambdas. It is a planned addition. We will soon be re-sugaring Android implementations of Java 8+ lambdas into proper lambda functions. Same goes for method handles (::). That’s quite exciting and may pave the way for a hypothetical Kotlin decompiler, since that language implicitly and explicitly rely on lambdas extensively.
Debuggable APK Generation
For several reasons, it is easier to debug Android applications explicitly marked debuggable in their Manifest.
Debugging non-debuggable APK requires root access to the operating system. Which means rooting a production phone, using an emulator 2 image built as userdebug, or building a custom userdebug image from AOSP.
Any of the above solutions have shortcomings: rooted production builds and userdebug builds expose features that non-rooted production builds do not have, and can be fingerprinted as such; Debugging native code of applications on non-rooted devices requires replacing system-level utilities; the API level and OS features also play a role, eg, SE-Android needs to be disabled on recent OS in order for debugging to work.
In many cases, rebuilding a release app into a debug-mode app (with <application android:debuggable=”true” …>) is a viable solution, and one that does not require using root, obviously. Many users are implementing this solution via apktool. However it is frequent for the tool to fail decoding complex APKs, let alone rebuild them with different settings.
We have introduced a feature in JEB that makes rebuilding non-debuggable APK to debuggable APK easy and fast:
$ jeb_wincon.bat -c --makeapkdebug -- file.apk
Upon success, file_debuggable.apk will be generated. Sign it (Android SDK’s apksigner), install it on your device, and start debugging. Remember that this solution has its shortcomings as well! Anti-debugging code may check at runtime that the app is not debuggable, as would be expected. More elaborate solutions implement certificate pinning-style checks, where the code verifies that it is signed using a specific certificate. Be careful when debugging rebuilt APK.
Keyboard Shortcuts for Script
Bind your JEB Python scripts to keyboard shortcut by adding a line at the top of your script:
#?shortcut=xxx
where xxx is your keyboard shortcut, eg: Ctrl+Shift+T
Permitted keyboard modifiers are Ctrl, Shift, Alt, as well as the generic Mod1, mapping to macOS’s Command (Apple) key, or Control on Windows/Linux.
Sublime Text 3 Extension
Are you writing Python scripts to automate your JEB reversing tasks? If so, give a try to using the “JEB Script Development Helper” package available on Sublime Text’s Package Control.
In part 1 of this series, we gave an overview of the Intermediate Representation used by JEB’s Native Analysis Pipeline, as well as a simple Python script demonstrating how to use the API to access and print out IR-CFG of decompiled routines.
In part 2, we continue our exploration of JEB IR. We will show how to write a custom IR optimizer plugin to clean-up a custom obfuscation used in a piece of code. The resulting decompiled C code will end up very readable as well.
Before you proceed, make sure to update JEB Pro to version 3.1.1+.
Obfuscated Crypto-stealer Code
The sample we are going to look at monitors Windows clipboards for cryptocurrency-looking wallet addresses, and replaces them with a desired target address. The sample is specifically targeting Ethereum wallet addresses. It is a neutered final stage payload – the recipient address has been scrambled to render the code ineffective.
Although the payload is unpacked, what is interesting is that one of its key routines is obfuscated: custom garbage code was inserted.
The garbage code is easy to go through: a bit of manual analysis shows that junk instructions are assigning pseudo-random values to an array whose bytes are never used. Two types of assembly patterns are present:
1- mov dword ptr [edi + offset], junk_value ; edi previously init. to ; junkarray address 2- push junk_value pop dword ptr [junkarray_address + offset]
If we decompile that code and look at the final IR (as shown below), we can see that those instructions ended up being converted and optimized to the following type of assignment:
Assign(Mem(mem_address), Imm(junk_value))
Currently, the decompiled code looks like the following, hard-to-digest blob:
Although quite painful to read, we can follow the program’s logic by abstracting away the junk assignments. (Essentially, win32 functions’ OpenClipboard, GetClipboardData, and SetClipboardData are used to retrieve, check, and replace copy-pasted Ascii and Unicode text, if they match the following pattern “/0x(..){20}/”. The replacement string target wallet address, previously decrypted by sub_401000.)
Cleaning the Intermediate Representation
Recall that the native analysis pipeline can be simplifed as the workflow below:
CodeObject (*) -> Reconstructed Routines & Data -> Conversion to IR (low-level, non-optimized) -> IR Optimizations<--- this is where we'll work -> Final IR (higher-level, optimized, typed) -> Generation of AST -> AST Optimizations -> Final AST (final, cleaned) -> High-level output (eg, C variant)
Our custom IR optimizer will look for junk assignments and remove them. The important criteria are: What is the junk array start and end addresses? Is it common to all routines in the binary, or is there one array per routine? Those questions may be hard to answer in the general case. However, for our specific sample file, we can assert with a high-degree of certainty that the junk array: – starts at address 0x415882 – is at most 256 bytes long – is used solely by sub_401171, the routine we want to analyze
Because of the above restrictions, the IR optimizer we are going to write should be qualified as a custom or ad-hoc IR optimizer. Chances are, we won’t be able to reuse it as-is in other programs without some amount of tweaking.
Let’s get started, we will: – create an Eclipse project with scaffold code for a Java back-end plugin – write and test a custom IR optimizer with a headless client – deploy the plugin and make it usable and accessible from the UI desktop client
Creating a Plugin Project
Before we proceed, make sure to:
Define an environment variable JEB_HOME, that points to your JEB installation folder
Open Eclipse and import the newly-created project into your Workspace (File, Import, Existing Projects into the Workspace, select the cloned repository folder, proceed)
Debugging the Obfuscation
Now that your project is imported in Eclipse, you should be able to see two source files in src’s default package:
Tester.java
EOptExample1.java
EOptExample1 is the IR optimizer plugin we will be working on. (Note that several classes of plugins exist, this one is a native IR optimizer, and therefore inherits from AbstractEOptimizer or one its subclasses.)
Tester creates a headless JEB instance that loads the plugin EOptExample1.
Then, create a JEB project and load the artifact file samples/1.exe (IMPORTANT: unzip 1.zip to 1.exe first – password: password)
Analyze the artifact
Retrieve a handle on the native decompiler
Retrieve a handle on the to-be-analyzed routine sub_401171
Perform a full decompilation of that routine
Let’s have a preliminary look at EOptExample1: This IR optimizer type is set to STANDARD, which is not ideal when you use custom optimizers tailored for specific code. A better IR optimizer class for those is ON_DEMAND: those optimizers are to be manually invoked, e.g. from JEB UI (menu: File, Advanced Unit Options). However, during development, since we are focusing on a particular file and routine, STANDARD type may be fine. Standard optimizers are called during regular IR optimization phases of the decompilation pipeline.
public class EOptExample1 extends AbstractEOptimizer {
public EOptExample1() {
super(DataChainsUpdatePolicy.UPDATE_IF_OPTIMIZED);
getPluginInformation().setName("Sample IR Optimizer #1");
getPluginInformation().setDescription("Remove IR-statements reduced to \"*(&garbage + delta) = xxx\"");
getPluginInformation().setVersion(Version.create(1, 0, 0));
// Standard optimizers are normally run, as part of the IR optimization stages in the decompilation pipeline
setType(OptimizerType.STANDARD);
}
// replace all IR statements previously reduced to EMem ("[junk_address] = xxx") to ENop
@Override
public int perform(boolean updateDFA) {
logger.info("IR-CFG before running custom optimizer \"%s\":\n%s", getName(),
DecompilerUtil.formatIRCFGWithContext(2, cfg, ectx));
// ...
// optimizer code
}
}
Note the plugin’s data-chains update policy, set to UPDATE_IF_OPTIMIZED. Optimizations that specify this flag tell their runner, aka the master optimizer that orchestrate them, that identifiers may be modified – hence, if optimizations occurred, a data flow analysis (DFA) pass needs to take place again. DFA update policies are a topic for another article.
Lines 3-5 are plugin metadata information, such as name and description, authorship, version numbers (including minimum/maximum JEB back-end versions), etc.
Before we deep-dive into perform(), let’s first set a breakpoint on line 15, where logger.info(…) is called. Then, start a debugging session for Tester: menu Run, command Debug (hotkey: F11.)
After a few seconds of analysis, your breakpoint should be hit; it corresponds to the first-time invocation of your custom optimizer. The logger prints out the IR-CFG that’s about to be optimized. Let’s have a look at it:
IR-CFG before running custom optimizer "Sample IR Optimizer #1":
>> IN(@0): ecx={@D} esp={@0} ebp={@1} ss={@1,@C,@18,@1D,@21,@24,@25,@27,@30,@35,@38,@3B,@3E,@3F,@41,@43,@46,@4F,@51,@54,@56,@59,@5C,@5D,@5F,@6B,@77,@81,@84,@9B,@9E,@A0,@AC,@B8,@BA,@BD,@BF,@C3,@C5,@C7,@CB,@CD,@D1,@D2,@D4,@E0,@E9,@EF,@F1,@F5,@F7,@FB,@FC,@FE,@100,@103,@106,@107,@109,@10C,@10E,@112,@114,@116,@11A,@11C,@11F,@122,@123,@12E,@131,@133,@137,@139,@13D,@13E,@140,@143,@145,@149,@14B,@14F,@150,@152,@15D,@173,@176,@179,@17C,@17D,@17F,@181,@18A,@18C,@18F,@191,@194,@196,@19A,@19D,@19E,@1A0,@1B3,@1BF,@1C2,@1D9,@1DC,@1E0,@1EC,@1EF,@1F1,@1F3,@1F7,@1F9,@1FC,@1FF,@202,@203,@205,@211,@21D,@220,@222,@226,@227,@229,@22B,@22E,@231,@232,@234,@237,@23A,@23C,@23E,@242,@244,@246,@24A,@24C,@250,@251,@25C,@25F,@262,@263,@265,@268,@26A,@26E,@271,@272,@27A,@27F,@295,@298,@29A,@29D,@2A4,@2A9,@2AD,@2B0,@2B2,@2B6,@2B7,@2BA,@2BC,@2C0,@2C2,@2C6,@2C7,@2CA,@2CD,@2D2,@2DE,@2E4,@2E7,@2E8,@2EA} ds={@F,@11,@19,@1E,@22,@28,@31,@36,@39,@3C,@40,@44,@4C,@4E,@52,@55,@57,@5A,@60,@6C,@74,@78,@7B,@82,@85,@86,@87,@8E,@90,@92,@9C,@A1,@A3,@A4,@AD,@B5,@B7,@BB,@C0,@C4,@C8,@CE,@D5,@E1,@EA,@ED,@F2,@F8,@FD,@FF,@101,@104,@10A,@10F,@113,@117,@11B,@11D,@120,@124,@12F,@134,@13A,@13F,@141,@146,@14C,@153,@15A,@15C,@163,@165,@166,@167,@170,@171,@174,@177,@17A,@17E,@180,@187,@189,@18D,@192,@197,@19B,@1A1,@1AB,@1B4,@1B7,@1B9,@1C0,@1C3,@1C4,@1C5,@1CC,@1CE,@1D0,@1DA,@1DD,@1DE,@1E1,@1E9,@1ED,@1F0,@1F4,@1FA,@1FD,@200,@206,@212,@219,@21B,@21E,@223,@228,@22A,@22C,@22F,@235,@238,@23B,@23F,@243,@247,@24D,@252,@25B,@25D,@260,@264,@266,@26B,@26F,@273,@27B,@27E,@285,@287,@288,@289,@292,@293,@296,@29B,@2A5,@2AA,@2AE,@2B3,@2B8,@2BD,@2C3,@2C8,@2CE,@2D0,@2D3,@2DF,@2E1,@2E2,@2E5,@2EB} OpenClipboard={@25} GetClipboardData={@3F,@17D} GlobalAlloc={@FC,@227} GlobalLock={@107,@232} GlobalUnlock={@13E,@263} SetClipboardData={@150,@272,@2B7,@2C7} CloseClipboard={@2CB} Sleep={@2E8} sub_401000={@D} sub_405010={@5D} sub_404F80={@D2} sub_4024E0={@123,@251} sub_404E54={@19E} sub_404E14={@203}
0000/1> s32:_esp = (s32:_esp - i32:00000004h) DU: esp={@1,@2,@B} | UD: esp={}
0001/1: 32<s16:_ss>[s32:_esp] = s32:_ebp DU: | UD: esp={@0} ebp={} ss={}
0002/9: s32:_ebp = s32:_esp DU: ebp={@38,@41,@46,@4F,@54,@56,@84,@9E,@B8,@BD,@C5,@FE,@100,@10C,@114,@11C,@131,@140,@15D,@176,@17F,@181,@18A,@18F,@194,@1C2,@1DC,@1EF,@1F1,@1FC,@229,@22B,@237,@23C,@244,@25C,@265,@27F,@298,@29D} | UD: esp={@0}
000B/1: s32:_esp = (s32:_esp - i32:0000002Ch) DU: esp={@C,@D,@17} | UD: esp={@0}
000C/1: 32<s16:_ss>[s32:_esp] = i32:0040117Ch DU: | UD: esp={@B} ss={}
000D/1: call s32:_sub_401000(s32:_ecx)->(s32:_eax){32[s32:_esp]} DU: eax={} | UD: ecx={} esp={@B} sub_401000={}
000E/1+ s32:_edi = i32:00415882h DU: edi={} | UD:
000F/1: 32<s16:_ds>[i32:00415944h] = i32:E2E60682h DU: | UD: ds={}
0010/1: s32:_eax = i32:00000001h DU: eax={} | UD:
0011/6: 32<s16:_ds>[i32:00415904h] = i32:7C64C0E4h DU: | UD: ds={}
0017/1: s32:_esp = (s32:_esp - i32:00000004h) DU: esp={@18,@1A} | UD: esp={@B,@2EC}
0018/1: 32<s16:_ss>[s32:_esp] = i32:E87A1612h DU: | UD: esp={@17} ss={}
0019/1: 32<s16:_ds>[i32:004158DDh] = i32:E87A1612h DU: | UD: ds={}
001A/1: s32:_esp = (s32:_esp + i32:00000004h) DU: esp={@1C} | UD: esp={@17}
001B/1: nop DU: | UD:
001C/1+ s32:_esp = (s32:_esp - i32:00000004h) DU: esp={@1D,@20} | UD: esp={@1A}
001D/1: 32<s16:_ss>[s32:_esp] = i32:CCA4A4A0h DU: | UD: esp={@1C} ss={}
001E/2: 32<s16:_ds>[i32:004158CAh] = i32:CCA4A4A0h DU: | UD: ds={}
0020/1: s32:_esp = s32:_esp DU: esp={@21,@23} | UD: esp={@1C}
0021/1: 32<s16:_ss>[s32:_esp] = i32:00000000h DU: | UD: esp={@20} ss={}
0022/1: 32<s16:_ds>[i32:00415951h] = i32:249E4228h DU: | UD: ds={}
0023/1: s32:_esp = (s32:_esp - i32:00000004h) DU: esp={@24,@25,@26} | UD: esp={@20}
0024/1: 32<s16:_ss>[s32:_esp] = i32:004011CAh DU: | UD: esp={@23} ss={}
0025/1: call s32:_OpenClipboard(32<s16:_ss>[(s32:_esp + i32:00000004h)])->(s32:_eax){32[s32:_esp]} DU: eax={@33} | UD: esp={@23} ss={} OpenClipboard={}
...
... (trimmed)
...
The above IR listing is a human-friendly representation of IR statements. The general format of this listing is:
- offset: IR statement offset - length: IR statement length (generally, 1) - C: indicates whether the instruction is - the entry-point instruction (>) - the first of a basic-block (+) - any other instruction (:) - insn: IR statement instruction (refer to Part 1 of this blog series) - DU/UD: routine def-use and use-def chains - IN: live input variables at the entry-point - OUT: reaching output variables at a given exit point
The IR listing is relatively readable, although quite verbose at this early stage of optimization (roughly, the first pass in tier 1 of the analysis pipeline). The important idioms to look at here are:
a/ The first one is an Assign(Mem(Imm), Imm), which corresponds to optimized “mov [edi + offset], value”, where the value of edi was determined, propagated further, and the addition folded and converted to an immediate address.
b/ The second one is a partially optimized “push value / pop [address]”. Later optimizations phases will find and remove esp updates or esp-based operations, as was shown in the pseudo-code earlier. What we need to focus on here is the Assign(Mem(Imm), Imm), like the one in a/.
Those are the bits we will look for and modify: Assuming those assignments are useless, we will simply replace them by Nop statements.
Writing the Optimizer
At this point, our preliminary understanding of the obfuscation is enough to start writing the clean-up optimizer. Its code is extremely simple, for two main reasons: – The obfuscation scheme itself is relatively trivial – Other built-in JEB optimizers are giving us clean IR assignments to work on
Let’s look at the code of proceed():
@Override
public int perform(boolean updateDFA) {
final long garbageStart = 0x415882;
final long garbageEnd = garbageStart + 0x100;
int cnt = 0;
for(int iblk = 0; iblk < cfg.size(); iblk++) {
BasicBlock<IEStatement> b = cfg.get(iblk);
for(int i = 0; i < b.size(); i++) {
IEStatement stm = b.get(i);
if(!(stm instanceof IEAssign)) {
continue;
}
IEAssign asg = (IEAssign)stm;
if(!(asg.getLeftOperand() instanceof IEMem)) {
continue;
}
IEMem target = (IEMem)asg.getLeftOperand();
if(!(target.getReference() instanceof IEImm)) {
continue;
};
IEImm wraddr = (IEImm)target.getReference();
if(!wraddr.canReadAsAddress()) {
continue;
}
long addr = wraddr.getValueAsAddress();
if(addr < garbageStart || addr >= garbageEnd) {
continue;
}
b.set(i, ectx.createNop(stm));
cnt++;
}
}
return postPerform(updateDFA, cnt);
}
This optimizer inherits from AbstractEOptimizer. Therefore, the perform() method works on an IR-CFG. (Not all optimizers may choose to do so; it is sometimes easier to work directly on statements or expressions.)
process() goes through all statements or every basic block of the IR-CFG. Using the instanceof operator, we check that the statement is an assignment such as: Mem(address) = Imm. The address is retrieved, and we make sure that it falls within the junk array. If those checks succeed, we replace the assignment by a Nop.
And that is it. Clean and simple – although, not quite portable, since the junk array address and size are hard-coded into the code! But that is not the point of this blog, and neither is portability a first-class goal when writing optimizers for custom code.
Next up, let’s see how to use the plugin in an interactive session using the desktop client.
Building, Deploying, Interactive Use
In order to use the optimizer within the JEB desktop client, we either:
Register the plugin as a development plugin;
Or build the plugin as a Jar and drop it in JEB’s coreplugins/ folder.
Development Plugin
This is the easiest option. You may consider it as an intermediate step between prototyping with the headless client, as demonstrated above, and a full-blown, deployed Jar plugin.
Open the Options panel, Development tab, tick the option “Development Mode”, add the bin/ folder of your plugin’s project to the classpath, and add the classname of your plugin entry-point:
Press OK and restart JEB. Your plugin will be loaded and ready to use. You may now skip to the section “Using the IR optimizer plugin”.
Building a Jar plugin
The alternative is to run build.cmd (on Windows) or build.sh (on Linux/macOs), which calls an Ant script in the scripts/ folder, therefore, make sure to have Ant installed on your system first. You may also customize the plugin name and version before building.
The resulting Jar plugin file will be generated in your project’s out/ folder. Copy it to your JEB coreplugins/ folder and start the JEB client. Your plugin will be automatically loaded, along with the other plugins.
Using the IR Optimizer Plugin
If your plugin has the type STANDARD (default), then, as explained earlier, it will be invoked by the optimizations’ orchestrator automatically, at various times during the decompilation pipeline. If that’s the mode you’d like to choose, make sure that your plugin is generic enough to handle all types of input routines, else you’re in for some strange surprises if you ever forget to remove it from your coreplugins/ folder.
An alternative is to convert it to an on-demand plugin:
public EOptExample1() {
super(DataChainsUpdatePolicy.UPDATE_IF_OPTIMIZED);
getPluginInformation().setName("Sample IR Optimizer #1");
getPluginInformation().setDescription("Remove IR-statements reduced to \"*(&garbage + delta) = xxx\"");
getPluginInformation().setVersion(Version.create(1, 0, 0));
// Standard optimizers are normally run, as part of the IR optimization stages in the decompilation pipeline
//setType(OptimizerType.STANDARD);
// alternative (better for production / in UI use):
setType(OptimizerType.ON_DEMAND);
setPreferredExecutionStage(-NativeDecompilationStage.LIFTING_COMPLETED.getId());
setPostProcessingActionFlags(PPA_OPTIMIZATION_PASS_FULL);
}
– Line 11 makes the optimizer on-demand. Users must manually activate it, on specific code. – Line 12 is recommended for on-demand optimizers: we specify at which point in in the pipeline the plugin should be called. – Finally, we set some post-processing flags, specifying that a full round of standard optimizations must be performed after our custom optimizer has run: this will allow cleaning up code remnants, and optimize our IR-CFG further – something made possible after running an optimization pass like this one.
On-demand optimizer plugins show up in the File, Advanced Unit Options dialog box, that you may bring up when a decompiled routine has the focus:
Tick the optimizer box, press OK. The routine will be re-decompiled.
Clean Code
Regardless which method you choose, once cleaned up, the IR will allow for better downstream pipeline phases, including typing, AST generation, AST optimizations, etc.
The pseudo-C code has become quite readable:
Conclusion
That is it for part 2. We scratched the surface of IR optimizers (which themselves are a relatively small – albeit important – part of the overall decompilation pipeline 2) but it’s a good start. I strongly encourage you to experiment and ask your questions on our Slack channel. One ongoing effort right now is to bring the API documentation up to speed in terms of contents and sample code.
In part 3, we will continue exploring IR optimizers. Later on in the series, we will show how to write AST optimizers 3, how to write decompilation modules, and show how existing decompilers can be cutomized further. Stay tuned!
JEB must have been previously run, at least once: EULA accepted, license key generated, etc. ↩
The decompilation pipeline is one component of the native analysis pipeline, which is one module, among tens, of the JEB back-end: the public API is worth exploring if you’re into advanced use cases. ↩
AST generation is one of the very final decompilation phases – working on the syntax tree serves different purposes than working on the IR ↩
We are happy to announce that JEB3 is finally available for download! The Beta period spanned from June last year to early January this year, and we thank users who actively participated in it by providing feedback and reporting issues. Our continuous effort to add features – big and small – and scrap bugs is ongoing, as always.
If you are a registered user, you should have received an email letting you know that you can download and install JEB 3.1.0. (Users that were previously using JEB 2.3.x must install JEB3 in a separate location. You may also use both JEB2 and JEB3 concurrently, if you ever need to.) If you haven’t received an email (eg, you are not the primary licensee of a multi-user license), please reach out.
Below is a very high-level summary of the additions that went into JEB3:
Major upgrades to the native analysis pipeline. The decompilation pipeline is accessible and customizable at different stages, which we will detail in coming blogs. (We published part 1 of a series on writing custom IR optimizers and AST optimizers.)
New decompilers for Ethereum smart contracts (evm) and WebAssembly modules (wasm). As of JEB 3.1, JEB ships with 8 decompilers: dex/dalvik, x86, x86-64, arm, arm64 (aarch64), mips, wasm, and evm. A large chunk of our effort in 2019 will be focused on continuing our work on the native analysis and decompilation (eg, advanced optimization modules, release of the C++ reconstruction plugin, open-sourcing of advanced optimizers –1, 2-, etc.).
Type libraries for Windows, Linux, and Android-Linux sub-systems for common architectures (x86, x86-64, arm, aarch64, mips). Power users can also generate their own typelibs (eg, for custom SDKs).
Windows malware analysis and Android SO native files is enjoyable and practical with JEB. Combined with powerful, custom IR optimizers, the analysis of complex code is also possible.
Interactive global graphs. The desktop client provides this experimental feature, whose goal is to provide global, smart views of a program. More to come, including API to access the CFG graphs, callgraphs, and create custom graphs.
The release of JEB 3.1 also marks the addition of a new type of licence, JEB Home Edition x86. While JEB Pro and JEB Android are subscription based license types for professional and corporate use, the Home Edition is designed for individuals such as hobbyists, students, or freelancers, who wish to legally acquire a professional reverse engineering tool for a reasonable price: $99, perpetual license, with updates for one year.
JEB Home Edition x86 has everything needed to perform analysis of x86 and x86-64 binaries, for most platforms. Here are the features and modules shipping with this license:
Support for all code objects, including ELF files, EXE binaries, DLL libraries, SYS drivers, headless firmware, etc.
Augmented disassembly, including resolution of dynamic callsites, candidate values determination for registers, dynamic cross-references, etc.
Decompilation of x86 and x86-64 to C-like source code. The decompiler includes advanced optimization passes to thwart protected or obfuscated code.
Win32 type libraries & WDK type libraries for efficient Windows file analysis. Power-users can generate their own typelibs as well (details)
Signature libraries for common SDK, including all versions of Microsoft Visual Studio.
Interactive layer for refactoring: type definition, stackframe building, renaming/commenting/cross-referencing, etc.
Client-side API access for scripting and automating tasks in Python.
JEB native code analysis components make use of a custom intermediate representation (IR) to perform code analysis.
Some background: after analysis of a code object, the native assembly of a reconstructed routine is converted to an intermediate representation. 1 That IR subsequently goes through a series of transformation passes, including massages and optimizations. Final stages include the generation of high-level C-like code. Most stages in this pipeline can be customized by users via the use of plugins. A high-level, simplified view of the pipeline could be as follows:
CodeObject (*) -> Reconstructed Routines & Data -> Conversion to IR (low-level, non-optimized) -> IR Optimizations -> Final IR (higher-level, optimized, typed) -> Generation of AST -> AST Optimizations -> Final AST (final, cleaned) -> High-level output (eg, C variant)
(*) Examples of code objects: a Windows PE file with x86-code, an ELF library with with MIPS code, a headless ARM firmware, a Wasm binary file, an Ethereum smart contract, etc.
Two important JEB API components to hook into and customize the native analysis pipeline are: – The IR classes – The AST classes We will start looking at IR components through the rest of this part 1.
IR Description
JEB IR can be seen as a low-level, imperative assembly language, made of expressions. Highest-level expressions are statements. Statements contain expressions. Generally, expressions can contain expressions. IR can be accessed via interfaces in the JEB API. The top-level interface for all IR expressions is IEGeneric. All IR elements start with IExxx. 2
The diagram below shows the current hierarchy of IR expression interfaces:
Note that IEGeneric sits at the top. All other IRE’s (short for IR Expressions from now on) derive from it. Let’s go through those interfaces:
IEImm: Integer immediate of arbitrary length. Eg, Imm(0x1122, 64) would represent the 64-bit integer value 0x1122.
IEVar: Generic IRE to represent variables. Variables can represent underlying physical registers, virtual registers, local function variables, global program variables, etc.
IEMem: Piece of memory of arbitrary length. The memory address itself is an IRE; the accessed bitsize is not.
IECond: A ternary expression “c ? a: b”, where a, b and c are IRE’s.
IERange: A fixed integer range, commonly used with Slice
IESlice: A chunk (contents range) of an existing IR. Eg, Slice(Imm(0x11223344, 32), 16, 24)) can be simplified to Imm(0x22, 8)
IECompose: The concatenation of two or more IRE’s (IR0, IR1, …), resulting in an IR of size SUM(i=0->n, bitsize(IRi))
IEOperation: A generic operation expression, with IRE operands and an operator. Eg, Operation(ADD,Imm(0x10,8),Mem(Imm(0x10000,32),8)). Most standard operators are supported, as well as less standard operators such as the Parity function or Carry function.)
IEStatement: the super-interface for IR statements; we will detail them below.
An IR translation unit, resulting from the conversion of a native routine, consists of a sequential list of IEStatement objects. An IR statement has a size (generally, but not necessarily, 1) and an address (generally, a 0-based offset relative to its position in the translation unit).
As of JEB 3.0.8, IR statements can be:
IEAssign: The most common of all statements: an assignment from a right-side source to a left-side destination. While the source can be virtually anything, the destination IRE is restricted to a subset of expressions.
IENop: This statement does nothing but consumes virtual size in the translation unit.
IEJump: An unconditional or conditional jump within the translation unit, expressed using IR offsets.
IEJumpFar: An unconditional or conditional far jump (can be outside the translation unit), expressed using native addresses.
IECall: Represent a well-formed static or dynamic dispatch to another IR translation unit. The dispatch expression can be any IRE (eg, an Imm for a static dispatch; a Var or Mem for a dynamic dispatch).
IEReturn: A high-level expression used to denote a return-to-caller from a translation unit representing a routine. This IRE is always introduced by later optimization passes.
IEUntranslatedInstruction: This powerful statement can be used to express anything. It is generally used to represent native instructions that cannot be readily translated using other IR expressions. (Users may see it as an IECall on steroid, using native addresses. In that sense, it is to IECall what IEJumpFar is to IEJump.)
Now, let’s look at a few examples of conversions.
IR Examples
Let’s assume the following EVars were previously defined by an Intel x86 (or x86-64) converter: tmp (a 32-bit EVar representing a virtual placeholder register); eax (an EVar representing the physical register %eax); ?f (1-bit EVars representing standard x86 flags).
x86: mov eax, 1
s32:_eax = s32:00000001h
Translating this mov instruction is straight-forward, and can be done with a single Assign IR statement.
x86-64: not r9d
s64:_r9 = C(~(s64:_r9[0:32[), i32:00000000h)
Translating a not-32-bit-register on an x86-64 platform is slightly more complex, as the upper 32-bit of the register are zeroed out. Here, the converter is making use of three nested IREs: (IECompose(IEOperation(NOT, Slice(r9, 0, 32))))
Reading IR.IECompose are pretty-printed as C(lo, …, hi), IESlice as Expr[m:n[
One side-effect of arithmetic operations on x86 is the modification of flag registers. A converter explicits those side effects. Consequently, translating the exclusive-or above resulted in several Assign IR statements to represent register and flags updates. 3
Reading IR. IEMem are pretty-printed as bitsize<SegmentIR>[AddressIR]
The translation of add makes use of the temporary, virtual EVar tmp. It holds the original value of %eax, before the addition was done. That value is necessary for some flag update computations (eg, the overflow flag.) Also take note of the use of special operators Parity and Carry in the converted stub.
Note that a native address is written to the RIP-IEVar (or any EVar representing the Program Counter – PC). PC-assignments like those can later be optimized to IEJump, making use of IR Offsets instead of Native Addresses.
Also note that the Control Flow Graph (CFG) of the native instruction in the examples thus far are isomorphic to their IR-CFG translated counterparts. That is not always the case, as seen in the example below.
Reading IR. conditional IEJump are pretty-printed “if (cond) goto IROffset”. Unconditional IEJump are rendered as simple “goto IROffset”.
This IR-CFG is not isomorphic to the native CFG. Additional edges (per the presence of 2x IEJump) are used to represent the compare “[esi+xxx] to [edi+xxx]” loop.
Accessing IR
The JEB back-end API allows full access to several IR-CFG’s, from low-level, raw IR to partially optimized IR, to fully lifted IR just before AST generation phases.
Navigating the IR in the GUI
The UI client currently provides access to the most optimized IR of routines. Those IR-CFG’s can be examined in the apt-named fragment right next to the source fragment showing decompiled code. Here is an example of a side-by-side assemblies (x86, IR). The next screenshot shows the decompiled source.
IR via API
The API is the preferred method when it comes to power-users wanting to manipulate the IR for specific needs, such as writing a custom optimizer, as we will see in the next blog in this series.
Reminder: JEB back-end plugins can be written in Java (preferably) or Python. JEB front-end scripts can be written in Python, and can run both in headless clients (eg, using the built-in command line client) or the UI client.
For now, let’s see how to write a Python script to:
Retrieve a decompiled routine
Get the generated Intermediate Representations
Print it out
The following script does retrieve the first internal routine of a Native unit, decompiles it, retrieve the default (latest) IR, and prints out its CFG. The full scripts is available on GitHub.
# retrieve `unit`, the code unit
# GlobalAnalysis is assumed to be on (default)
decomp = DecompilerHelper.getDecompiler(unit)
if not decomp:
print('No decompiler unit found')
return
# retrieve a handle on the method we wish to examine
method = unit.getInternalMethods().get(0)#('sub_1001929')
src = decomp.decompile(method.getName(True))
if not src:
print('Routine was not decompiled')
return
print(src)
decompTargets = src.getDecompilationTargets()
print(decompTargets)
decompTarget = decompTargets.get(0)
ircfg = decompTarget.getContext().getCfg()
# CFG object reference
# see package com.pnfsoftware.jeb.core.units.code.asm.cfg
print("+++ IR-CFG for %s +++" % method)
print(ircfg.formatSimple())
Running on Desktop Client. Run this script in the UI client via File, Scripts, Run… (hotkey: F3). Remember to open a binary file first, with a version of JEB that ships with the decompiler for that file’s architecture.
Running on the command-line. You may also decide to run it on the command-line. Example, on Windows:
That is it for part 1. In part 2, we will continue our exploration of the IR and see how we can hook into the decompilation pipeline to write our custom optimizers to clean packer-specific obfuscation, as well as make use of the data flow analysis components available with the IR-CFG. Stay tuned!
Working on IR presents several advantages, two of which being: a/ the reduction of coupling between the analysis pipeline and the input native architecture; b/ and offering a side-effect free representation of a program. ↩
The design choices of JEB IR are out-of-scope for this blog. They may be the subject of a separate document. ↩
When decompiling routines, IR optimization passes will iteratively refactor and clean-up unnecessary operations. In practice, most flag assignments will end up being removed or consolidated. ↩
JEB 3.0.7 ships with our internal type library generation tool. In this post, we will show how to use native types with the client and API, and how power-users can generate custom type libraries.
Type libraries (typelibs)
Type libraries are *.typelib files stored in the JEB’s typelibs/ folder. They contain type information for a given component (eg, an OS or an SDK), such as:
Types (aliases, structures, enumerations, etc.) and prototypes (~function pointers)
Publicly exported routines
Constants
JEB ships with typelibs for major sub-systems (such as Windows win32 (user-mode), Windows Driver Kit (kernel), Linux GNU, Linux Android, etc.) running on the most popular architectures (x86, x86-64, arm, aarch64, mips).
Let’s see how types can be used to ease your reverse-engineering tasks.
Using native types with the UI client
Applying types
Using types with JEB is straightforward. If your file’s target environment was identified (or partially identified), then, matching typelibs will be loaded and their types be made available to the user.
The file shown below is an x86 file compiled for Windows 32-bit:
As such, win32 typelibs were loaded. You can verify that by clicking File, Engines, Type Libraries…:
Let’s define the bytes at address 0x403000 as belonging to a FILETIME structure. You may right-click and select Edit Type (Y):
and input the exact type name: (the type must exist)
Alternatively, it is easier to select a type using Select Type (T). A list of available types is displayed. Filter on “FILETIME”:
And apply it.
The resulting updated disassembly listing will be:
Type editor
JEB features a powerful native type editor, that allows the modification of existing “complex” types (that is, structure and derivative) and the definition of new types. Open it with Ctrl+Alt+T (macOS: Cmd+Alt+T).
Below, we are selecting an existing well-known Windows type, IMAGE_DOS_HEADER.
Let’s create a new type.
To create a structure type, click Create, and input a name, such as MyStruc1. The type editor will display your empty structure:
You may then add or remove fields, using the following hotkeys:
Here, we define MyStruc1 to be as such: a structure containing primitives, a nested structure, and arrays.
As seen earlier, we can apply our type MyStruc1 anywhere on bytes, eg at offset 0x403027:
Constants
Typelib files also bundle well-known constants, generally defined in header files with #DEFINE pre-processor commands. You may use them to replace immediate values in your assembly or decompiler views.
Here is an example, again, coming from a Windows win32 file. The following decompiled method makes use of SendMessage routine:
Note that the second parameter is the message id. The MSDN provides a long list of well-known ids; Most of them are bundled with Windows typelibs shipping with JEB.
Right click on the immediate value (176), and select Replace to see what is offered:
Click OK to perform the replacement:
More readable, isn’t it?
Custom typelibs
There exist scenarios where users will want to create their own typelibs, generally when many custom types would have to be created and/or may need to be reused later. Examples:
Analysis of a Windows kernel component making use of Driver Kit headers whose types were not added to JEB’s pre-built WDK typelibs (our own wdk10-<arch>.typelib files do not contain all WDK components, although they do contain the most important ones).
The types of platform X were not compiled for a given architecture (eg, JEB does not ship with Linux types specific to Atmel AVR microcontrollers).
The binary to be analyzed makes use of a third-party SDK and the program is dynamically linked to that SDK. In that scenario, a user may want to generate typelibs for the SDK for the platform of their choosing.
Creating custom typelibs
Creating a custom typelib file is a fairly simple process: the generator is called by executing your JEB startup script (eg, jeb_wincon.bat) with the following flags:
$ jeb - c --typelibgen=<typelib_configuration_file>
JEB ships with a sample typelib cfg file: typelibs/custom/sample-typelib.cfg. This key-value file is mostly self-explanatory, please refer to it for reference. (Below, we focus solely on the two most important entries, hdrsrc and cstsrc.)
You may want to copy the sample configuration file and adjust it to match your requirements.
The input files can be either or both of the following:
An aggregated, preprocessed header file: it should contain C types and exported methods
A constant file containing a list of named constants
Types and public routines
The aggregated header can be generated by pre-processing a simple C file including your target header file(s).
Example: let’s say we want to generate types for stdio.h, on Windows ARM64 platform. We can use Microsoft Compiler’s /P flag to pre-process a sample file, 1.c including the target headers:
// 1.c
#include "stdio.h"
int main(void) {return 0;}
The resulting file will be quite large – and is likely to contain much more than just stdio.h type information (all headers recursively-included by stdio.h would be processed as well).
We can rename that file as hdr.h and feed it to JEB’s Typelib Generator. (entry: hdrsrc)
Our C parser is C11 based, and supports most standard C declarations, as well as common MSVC and GCC extensions. Two important caveats to remember:
anonymous structure bitfields are not supported: things like “int :4” will need to be massaged to, eg, “int _:4”
anonymous aliased parameter for single-parameter methods are not supported: things like “void foo(X)” will need to be massaged to, eg, “void foo(X _)”
Predefined constants
As seen earlier, typelib files can also contain list of named constants – generally, they will be those constants that are #DEFINE’d in header files.
They can be scraped from C/C++ header files. JEB ships with a handy Python script that will help you do that quickly: see typelibs/custom/collectDefines.py (other tools exist, such as GCC’s dM flag, but they may not generate all constants, only those that are preprocessed with a given set of precompilation parameters).
We can save that file as, eg cst.txt, and feed it to JEB’s Typelib Generator. (entry: cstsrc)
Loading custom typelibs
If your typelib configuration matches your input files (most notably, the groupid and processor fields), then JEB will load it automatically during analysis of your input file.
Example, with the sample typelib shipping with JEB (groupid=GROUPID_TYPELIB_WIN32, processor=X86):
Obviously, you may decide to force-load a type lib by ticking the “Loaded” checkbox.
Programmatic access with JEB API
Native types, like any other component of JEB, can be accessed with the API. Scripts and plugins can use the API to programmatically retrieve, define, apply types, as well as manipulate type libraries.
The two single most important classes are:
ITypeManager: manager of native types for a given INativeCodeUnit
JEB3 is still in Beta, for a few more weeks. General availability should be expected during the first or second week of January. If you haven’t done so, feel free to ask for a Beta build right away.
Once again, thank you to all our users, we are very grateful for your feedback and support. Finally, a special thank you note to our user “Andy P.” who pushed JEB’s boundaries relatively far (!) and allowed us to uncover interesting corner cases when working with large firmware binaries.
Update: March 8, 2022: – The most up-to-date version of this document can be found in the Manual Update: Dec 8, 2021: – Reference section with list of special translations for EVM opcodes
Update: Jan 2, 2019:
– The full EVM decompiler shipped with JEB 3.0-beta.8
– Download a sample JEB Python script showing how to use the API Update: Nov 20, 2018: – We uploaded the decompiled code of an interested contract, the second part of the PolySwarm challenge (a good write-up can be found here)
We’re excited to announce that the pre-release of our Ethereum smart contract decompiler is available. We hope that it will become a tool of choice for security auditors, vulnerability researchers, and reverse engineers examining opaque smart contracts running on Ethereum platforms.
Keep on reading to learn about the current features of the decompiler; how to use it and understand its output; its current limitations, and planned additions.
Overall decompiler features
The decompiler modules provide the following specific capabilities:
The EVM decompiler takes compiled smart contract EVM 1 code as input, and decompiles them to Solidity-like source code.
The initial EVM code analysis passes determine contract’s public and private methods, including implementations of public methods synthetically generated by compilers.
Code analysis attempts to determine method and event names and prototypes, without access to an ABI.
The decompiler also attempts to recover various high-level constructs, including:
Implementations of well-known interfaces, such as ERC20 for standard tokens, ERC721 for non-fungible tokens, MultiSigWallet contracts, etc.
Storage variables and types
High-level Solidity artifacts and idioms, including:
Function mutability attributes
Function payability state
Event emission, including event name
Invocations of address.send() or address.transfer()
Precompiled contracts invocations
On top of the above, the JEB back-end and client platform provide the following standard functionality:
The decompiler uses JEB’s optimizations pipeline to produce high-level clean code.
It uses JEB code analysis core features, and therefore permits: code refactoring (eg, consistently renaming methods or fields), commenting and annotating, navigating (eg, cross references), typing, graphing, etc.
Users have access to the intermediate-level IR representation as well as high-level AST representations though the JEB API.
More generally, the API allows power-users to write extensions, ranging from simple scripts in Python to complex plugins in Java.
Our Ethereum modules were tested on thousands of smart contracts active on Ethereum mainnet and testnets.
Basic usage
Open a contract via the “File, Download Ethereum Contract…” menu entry.
You will be offered two options:
Open a binary file already stored on disk
Download 2 and open a contract from one of the principal Ethereum networks: mainnet, rinkeby, ropsten, or kovan:
Select the network
Provide the contract 20-byte address
Click Download and select a file destination
Note that to be recognized as EVM code, a file must:
either have a “.evm-bytecode” extension: in this case, the file may contain binary or hex-encoded code;
or have be a “.runtime” or “.bin-runtime” extension (as generated by the solc Solidity compiler), and contain hex-encoded Solidity generated code.
If you are opening raw files, we recommend you append the “.evm-extension” to them in order to guarantee that they will be processed as EVM contract code.
Contract Processing
JEB will process your contract file and generate a DecompiledContract class item to represent it:
To switch to the decompiled view, select the “Decompiled Contract” node in the Hierarchy view, and press TAB (or right-click, Decompile).
The decompiled contract is rendered in Solidity-like code: it is mostly Solidity code, but not entirely; constructs that are illegal in Solidity are used throughout the code to represent instructions that the decompiler could not represent otherwise. Examples include: low-level statements representing some low-level EVM instructions, memory accesses, or very rarely, goto statements. Do not expect a DecompiledContract to be easily recompiled.
Code views
You may adjust the View panels to have side-by-side views if you wish to navigate the assembly and high-level code at the same time.
In the assembly view, within a routine, press Space to visualize its control flow graph.
To navigate from assembly to source, and back, press the TAB key. The caret will be positioned on the closest matching instruction.
Contract information
In the Project Explorer panel, double click the contract node (the node with the official Ethereum Foundation logo), and then select the Description tab in the opened view to see interesting information about the processed contract, such as:
The detected compiler and/or its version (currently supported are variants of Solidity and Vyper compilers).
The list of detected routines (private and public, with their hashes).
The Swarm hash of the metadata file, if any.
Commands
The usual commands can be used to refactor and annotate the assembly or decompiled code. You will find the exhaustive list in the Action and Native menus. Here are basic commands:
Rename items (methods, variables, globals, …) using the N key
Navigate the code by examining cross-references, using the X key (eg, find all callers of a method and jump to one of them)
Comment using the Slash key
As said earlier, the TAB key is useful to navigate back and forth from the low-level EVM code to high-level decompiled code
We recommend you to browser the general user manual to get up to speed on how to use JEB.
Remember that you can change immediate number bases and rendering by using the B key. In the example below, you can see a couple of strings present in the bad Fomo3D contract, initially rendered in Hex:
Understanding decompiled contracts
This section highlights idioms you will encounter throughout decompiled pseudo-Solidity code. The examples below show the JEB UI Client with an assembly on the left side, and high level decompiled code on the right side. The contracts used as examples are live contracts currently active Ethereum mainnet.
We also highlight current limitations and planned additions.
Dispatcher and public functions
The entry-point function of a contract, at address 0, is generally its dispatcher. It is named start() by JEB, and in most cases will consist in an if-statement comparing the input calldata hash (the first 4 bytes) to pre-calculated hashes, to determine which routine is to be executed.
JEB attempts to determine public method names by using a hash dictionary (currently containing more than 140,000 entries).
Contracts compiled by Solidity generally use synthetic (compiler generated) methods as bridges between public routines, that use the public Ethereum ABI, and internal routines, using a compiler-specific ABI. Those routines are identified as well and, if their corresponding public method was named, will be assigned a similar name __impl_{PUBLIC_NAME}.
NOTE/PLANNED ADDITION: currently, JEB does not attempt to process input data of public routines and massage it back into an explicit prototype with regular variables. Therefore, you will see low-level access to CALLDATA bytes within public methods.
Below, see the public method collectToken(), which is retrieving its first parameter – a 20 byte address – from the calldata.
Interface discovery
At the time of writing, implementation of the following interfaces can be detected: ERC20, ERC165, ERC721, ERC721TokenReceiver, ERC721Metadata, ERC721Enumerable, ERC820, ERC223, ERC777, TokenFallback used by ERC223/ERC777 interfaces, as well as the common MultiSigWallet interface.
Eg, the contract below was identified as an ERC20 token implementation:
Function attributes
JEB does its best to retrieve:
low-level state mutability attributes (pure, read-only, read-write)
the high-level Solidity ‘payable’ attribute, reserved for public methods
Explicitly non-payable functions have lower-level synthetic stubs that verify that no Ether is being received. They REVERT if it is is the case. If JEB decides to remove this stub, the function will always have an inline comment /* non payable */ to avoid any ambiguity.
The contract below shows two public methods, one has a default mutability state (non-payable); the other one is payable. (Note that the hash 0xFF03AD56 was not resolved, therefore the name of the method is unknown and was set to sub_AF; you may also see a call to the collect()’s bridge function __impl_collect(), as was mentioned in the previous section).
Storage variables
The pre-release decompiler ships with a limited storage reconstructor module.
Accesses to primitives (int8 to int256, uint8 to uint256) is reconstructed in most cases
Packed small primitives in storage words are extracted (eg, a 256-bit storage word containing 2x uint8 and 1x int32, and accessed as such throughout the code, will yield 3 contract variables, as one would expect to see in a Solidity contract
However, currently, accesses to complex storage variables, such as mappings, mappings of mappings, mappings of structures, etc. are not simplified. This limitation will be addressed in the full release.
When a storage variable is not resolved, you will see simple “storage[…]” assignments, such as:
Due to how storage on Ethereum is designed (a key-value store of uint256 to uint256), Solidity internally uses a two-or-more indirection level for computing actual storage keys. Those low-level storage keys depend on the position of the high level storage variables. The KECCAK256 opcode is used to calculate intermediate and final keys. We will detail this mechanism in detail in a future blog post.
Precompiled contracts
Ethereum defines four pre-compiled contracts at addresses 1, 2, 3, 4. (Other addresses (5-8) are being reserved for additional pre-compiled contracts, but this is still at the ERC stage.)
JEB identifies CALLs that will eventually lead to pre-compiled code execution, and marks them as such in decompiled code: call_{specific}.
The example below shows the __impl_Receive (named recovered) method of the 34C3 CTF contract, which calls into address #2, a pre-compiled contract providing a fast implementation of SHA-256.
Ether send()
Solidity’s send can be translated into a lower-level call with a standard gas stipend and zero parameters. It is essentially used to send Ether to a contract through the target contract fallback function.
NOTE: Currently, JEB renders them as send(address, amount) instead of address.send(amount)
The contract below is live on mainnet. It is a simple forwarder, that does not store ether: it forwards the received amount to another contract.
Ether transfer()
Solidity’s transfer is an even higher-level variant of send that checks and REVERTs with data if CALL failed. JEB identifies those calls as well.
NOTE: Currently, JEB renders them as transfer(address, amount) instead of address.transfer(amount)
Event emission
JEB attempts to partially reconstruct LOGx (x in 1..4) opcodes back into high-level Solidity “emit Event(…)”. The event name is resolved by reversing the Event method prototype hash. At the time of writing, our dictionary contains more than 20,000 entries.
If JEB cannot reverse a LOGx instruction, or if LOG0 is used, then a lower-level log(…) call will be used.
NOTE: currently, the event parameters are not processed; therefore, the emit construct used in the decompiled code has the following form: emit Event(memory, size[, topic2[, topic3[, topic4]]]). topic1 is always used to store the event prototype hash.
API
JEB API allows automation of complex or repetitive tasks. Back-end plugins or complex scripts can be written in Python or Java. The API update that ship with JEB 3.0-beta.6 allow users to query decompiled contract code:
access to the intermediate representation (IR)
access to the final Solidity-like representation (AST)
API use is out-of-scope here. We will provide examples either in a subsequent blog post or on our public GitHub repository.
Additional References
List of EVM opcodes that receive special translation: link (on GitHub)
Conclusion
As said in the introduction, if you are reverse engineering opaque contracts (that is, most contracts on Ethereum’s mainnet), we believe you will find JEB useful.
You may give a try to the pre-release by downloading the demo here. Please let us know your feedback: we are planning a full release before the end of the year.
As always, thank you to all our users and supporters. -Nicolas
We published a paper deep-diving into WebAssembly from a reverse engineer point of view (wasm format, bytecode, execution environment, implementation details, etc.).
The paper annex details how JEB can be used to analyze and decompiler WebAssembly modules.