A novel class of vulnerabilities could be leveraged by threat actors to inject visually deceptive malware in a way that’s semantically permissible but alters the logic defined by the source code, effectively opening the door to more first-party and supply chain risks.
Dubbed “Trojan Source attacks,” the technique “exploits subtleties in text-encoding standards such as Unicode to produce source code whose tokens are logically encoded in a different order from the one in which they are displayed, leading to vulnerabilities that cannot be perceived directly by human code reviewers,” Cambridge University researchers Nicholas Boucher and Ross Anderson said in a newly published paper.
The vulnerabilities, tracked as CVE-2021-42574 and CVE-2021-42694, affect compilers and interpreters for popular programming languages such as C, C++, C#, JavaScript, Java, Rust, Go, and Python.
Compilers are programs that translate high-level, human-readable source code into lower-level representations such as assembly language, object code, or machine code that can then be executed by the operating system.
At its core, the issue concerns Unicode’s bidirectional (or Bidi) algorithm, which enables support for both left-to-right (e.g., English) and right-to-left (e.g., Arabic or Hebrew) languages. The algorithm also features what are called bidirectional overrides, which allow writing left-to-right words inside a right-to-left sentence, or vice versa, thereby making it possible to embed text of a different reading direction inside larger blocks of text.
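As a rough illustration of the directional properties the Bidi algorithm works with, the following Python sketch asks the standard library's `unicodedata` module for the bidirectional class of a few characters. The sample characters and the helper name `bidi_classes` are chosen here for illustration and do not come from the paper.

```python
import unicodedata

# Sample characters picked for illustration (an assumption, not from the paper).
samples = {
    "Latin letter a": "a",
    "Hebrew letter alef": "\u05d0",
    "Arabic letter alef": "\u0627",
    "right-to-left override": "\u202e",
    "left-to-right isolate": "\u2066",
}

def bidi_classes(chars):
    """Map each label to the character's Unicode bidirectional class."""
    return {label: unicodedata.bidirectional(ch) for label, ch in chars.items()}

for label, bidi_class in bidi_classes(samples).items():
    print(f"{label}: {bidi_class}")
```

Ordinary letters carry a strong direction (`L`, `R`, or `AL`), while the control characters report classes of their own (`RLO`, `LRI`), which is what lets them change how the surrounding text is laid out without being visible themselves.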
While a compiler’s output is expected to correctly implement the source code supplied to it, discrepancies created by inserting Unicode Bidi override characters into comments and strings can enable a scenario that yields syntactically-valid source code in which the display order of characters presents logic that diverges from the actual logic.
Put differently, rather than deliberately introducing logical bugs, the attack targets the encoding of source code files. Bidi control characters visually reorder the tokens so that the code looks perfectly acceptable on screen, while the compiler processes the characters in their underlying logical order. The result can be a drastically different program flow from the one a reviewer sees (e.g., making a comment appear as if it were code).
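One plausible defensive check, sketched below in Python, is to scan source text for Bidi control characters before it is reviewed or merged. This is a minimal illustration under stated assumptions, not the researchers' tooling: the function name `find_bidi_controls` and the sample snippet are hypothetical, and the character list covers the embedding, override, and isolate controls defined by the Unicode Bidi algorithm.

```python
import unicodedata

# Bidi control characters: embeddings, overrides, isolates, and their terminators.
BIDI_CONTROLS = {
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}

def find_bidi_controls(source):
    """Return (line, column, character name) for each Bidi control in the text."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits

# Hypothetical snippet with controls hidden inside a string literal.
sample = (
    'access = "user"\n'
    'if access != "admin\u202e \u2066# check\u2069 \u2066":\n'
    '    pass\n'
)

for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {name}")
```

A check like this could run as a pre-commit hook or a CI step. Projects that legitimately contain right-to-left text in strings or comments would need an allowlist rather than a blanket rejection.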
“In effect, we anagram program A into program B,” the researchers summarized. “If the change in logic is subtle enough to go undetected in subsequent testing, an adversary could introduce targeted vulnerabilities without being detected.”
Such adversarial encodings can have a serious impact on the supply chain, the researchers warn, when invisible software vulnerabilities injected into open-source software make their way downstream, potentially affecting all users of the software. Even worse, the Trojan Source attacks can become more severe should an attacker use homoglyphs to redefine pre-existing functions in an upstream package and invoke them from a victim program.
By replacing Latin letters with lookalike characters from other Unicode scripts (e.g., changing “H” to Cyrillic “Н”), a threat actor can create a homoglyph function that appears identical to the original function but actually contains malicious code, which could then be added to an open-source project without attracting much scrutiny. An attack of this kind could be disastrous when applied against a common function that’s available via an imported dependency or library, the paper noted.
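The homoglyph substitution described above can be made visible with a simple heuristic: flag any identifier that mixes letters from more than one script. The Python sketch below approximates a character's script from the first word of its Unicode name, which is a simplification assumed here for brevity (Python's standard library does not expose the Unicode Script property directly). The function names and the identifiers being compared are illustrative.

```python
import unicodedata

def scripts_in(identifier):
    """Approximate the set of scripts used by the letters of an identifier."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # First word of the character name ("LATIN", "CYRILLIC", ...)
            # serves as a rough script label.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(identifier):
    """Return True when an identifier draws letters from more than one script."""
    return len(scripts_in(identifier)) > 1

latin_name = "sayHello"           # every letter is Latin
lookalike_name = "say\u041dello"  # U+041D is CYRILLIC CAPITAL LETTER EN

for name in (latin_name, lookalike_name):
    print(name, sorted(scripts_in(name)), is_mixed_script(name))
```

The two names render almost identically in many fonts, yet they compare as different strings, so a compiler treats them as two distinct functions. A mixed-script check would catch this particular substitution, though it would not catch an identifier written entirely in a lookalike script.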
“The fact that the Trojan Source vulnerability affects almost all computer languages makes it a rare opportunity for a system-wide and ecologically valid cross-platform and cross-vendor comparison of responses,” the researchers noted. “As powerful supply-chain attacks can be launched easily using these techniques, it is essential for organizations that participate in a software supply chain to implement defenses.”