Paper Title

SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings

Authors

Ruijin Sun, Shize Guo, Jinhong Guo, Wei Li, Dazhi Zhan, Meng Sun, Zhisong Pan

Abstract

Function-level binary code similarity detection is a crucial aspect of cybersecurity. It enables the detection of bugs and patent infringements in released software and plays a pivotal role in preventing supply chain attacks. A practical embedding learning framework relies on the robustness of the assembly code representation and the accuracy of function-pair annotation, and such frameworks have traditionally been built on supervised learning. However, annotating different function pairs with accurate labels poses considerable challenges, and supervised methods are easily overtrained and suffer from poor representation robustness. To address these challenges, we propose SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings. We take an unsupervised learning approach and formulate binary code similarity detection as instance discrimination. SimCLF operates directly on disassembled binary functions and can be implemented with any encoder. It requires no manually annotated information, only augmented data, which is generated using compiler optimization options and code obfuscation techniques. The experimental results demonstrate that SimCLF surpasses the state of the art in accuracy and has a significant advantage in few-shot settings.
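To make the instance-discrimination formulation concrete, the following is a minimal sketch and not the authors' implementation. It assumes an InfoNCE-style objective with in-batch negatives, which the abstract does not specify. The function name info_nce_loss, the temperature value, and the random vectors standing in for encoder outputs are all hypothetical. Row i of each input stands for the embedding of one variant of function i, for example the same source function compiled at two optimization levels.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives (illustrative assumption).

    anchors[i] and positives[i] are embeddings of two augmented variants of
    the same function; every other row in the batch acts as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature              # shape (N, N)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Random vectors stand in for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
views = base + 0.01 * rng.normal(size=(8, 16))    # slightly perturbed "augmented" views

aligned_loss = info_nce_loss(base, views)
shuffled_loss = info_nce_loss(base, np.roll(views, 1, axis=0))
print(aligned_loss, shuffled_loss)
```

When each row is paired with its own augmented view, the loss is low; pairing each row with a different function's view raises it. That gap is the signal an encoder would be trained on under this assumed objective.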
