A neural-based binary analysis tool


Introduction

This directory contains a demo of a neural-based binary analysis tool. We test the framework on multiple binary analysis tasks: (i) vulnerability detection, (ii) code similarity measures, (iii) decompilation, and (iv) malware analysis (coming later).

Requirements

  • Python 3.7.6
  • Python packages
    • dgl 0.6.0
    • numpy 1.18.1
    • pandas 1.2.0
    • scipy 1.4.1
    • sklearn 0.0
    • tensorboard 2.2.1
    • torch 1.5.0
    • torchtext 0.2.0
    • tqdm 4.42.1
    • wget 3.2
  • C++14 compatible compiler
  • Clang++ 3.7.1
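
The pins above are specific, so it can help to check what is actually installed. Below is a convenience sketch (not part of the repo); it uses pkg_resources because importlib.metadata only ships with Python 3.8+, and it only reports what it finds.

```python
# Convenience sketch: report installed versions of the pinned packages above.
# Not part of the repo; it only prints what it finds.
import pkg_resources

pinned = {
    "dgl": "0.6.0", "numpy": "1.18.1", "pandas": "1.2.0", "scipy": "1.4.1",
    "sklearn": "0.0", "tensorboard": "2.2.1", "torch": "1.5.0",
    "torchtext": "0.2.0", "tqdm": "4.42.1", "wget": "3.2",
}

for pkg, want in pinned.items():
    try:
        have = pkg_resources.get_distribution(pkg).version
        note = "" if have == want else f"  (README pins {want})"
        print(f"{pkg:12s} {have}{note}")
    except pkg_resources.DistributionNotFound:
        print(f"{pkg:12s} MISSING (README pins {want})")
```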

Tasks and Dataset preparation

Binary code similarity measures

  1. Download dataset
    • Download POJ-104 datasets from here and extract them into data/.
  2. Compile and preprocess
    • Run python extract_obj.py -a data/obj (clang++-3.7.1 required)
    • Run python preprocess/split_dataset.py -i data/obj -m p -o data/split.pkl to split the dataset into train/valid/test sets (a quick sanity check of the resulting split file is sketched after this list).
    • Run python preprocess/sim_preprocess.py to compile the binary code into graph data.
    • (Part of the preprocessing code is adapted from [1].)
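
After the steps above, the split file can be sanity-checked with a short script like the one below. This is a sketch; the internal layout of data/split.pkl is an assumption, so the code only reports what it finds rather than relying on specific keys.

```python
# Sketch: inspect the train/valid/test split produced by split_dataset.py.
import pickle

with open("data/split.pkl", "rb") as f:
    split = pickle.load(f)

print(type(split))
if isinstance(split, dict):
    for key, value in split.items():
        # Print the size of each split if it has one, otherwise the value itself.
        size = len(value) if hasattr(value, "__len__") else value
        print(f"{key}: {size}")
```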

Binary vulnerability detection

  1. Cramming the binary dataset
    • The dataset is built on top of Devign [2]. We compile the entire library based on the commit id and dump the binary code of the vulnerable functions. The cramming code is provided in preprocess/cram_vul_dataset (a conceptual sketch of this step follows this list).
  2. Download preprocessed data
    • Run ./preprocess.sh (clang++-3.7.1 required), or
    • You can directly download the preprocessed datasets from here and extract them into data/.
    • Run python preprocess/vul_preprocess.py to compile the binary code into graph data.
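
For reference, here is a conceptual sketch of the cramming step from item 1 above: pin the target project to the commit referenced by a Devign label, rebuild it, and keep the resulting object files so the vulnerable functions' binary code can be dumped. The paths and build command are assumptions; the actual pipeline is in preprocess/cram_vul_dataset.

```python
# Conceptual sketch only -- the real cramming code lives in preprocess/cram_vul_dataset.
import shutil
import subprocess
from pathlib import Path

def cram_one(repo_dir: str, commit: str, out_dir: str = "data/vul_obj",
             build_cmd=("make", "-j4")) -> None:
    out = Path(out_dir) / commit[:12]
    out.mkdir(parents=True, exist_ok=True)
    # Check out the exact commit the vulnerability label refers to.
    subprocess.run(["git", "-C", repo_dir, "checkout", "--force", commit], check=True)
    # Rebuild the whole library at that commit (build_cmd is an assumption).
    subprocess.run(list(build_cmd), cwd=repo_dir, check=True)
    # Collect the object files; the vulnerable functions are dumped from these.
    for obj in Path(repo_dir).rglob("*.o"):
        shutil.copy2(obj, out / obj.name)

# Example (hypothetical repository and commit id):
# cram_one("third_party/ffmpeg", "<commit id from the Devign label>")
```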

Binary decompilation [N-Bref]

  1. Download dataset
    • Download the demo datasets (raw and preprocessed data) from here and extract them into data/. (More datasets to come.)
    • There is no need to compile the code into graphs again, as the data has already been preprocessed.

Training and Evaluation

Binary code similarity measures

  • Run cd baseline_model && python run_similarity_check.py
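
For context, POJ-104 code-similarity benchmarks are commonly scored with MAP@R (the metric reported in MISIM [1]); whether run_similarity_check.py reports exactly this metric is an assumption. Below is a minimal NumPy sketch of MAP@R over program embeddings.

```python
# Generic MAP@R sketch for embedding-based code similarity (not the repo's code).
import numpy as np

def map_at_r(embeddings: np.ndarray, labels: np.ndarray) -> float:
    # Cosine similarity between all pairs of program embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # never retrieve the query itself

    scores = []
    for i in range(len(labels)):
        r = int((labels == labels[i]).sum()) - 1   # number of true matches
        if r == 0:
            continue
        ranked = np.argsort(-sims[i])[:r]          # top-R neighbours
        hits = (labels[ranked] == labels[i]).astype(float)
        precision_at_k = np.cumsum(hits) / (np.arange(r) + 1)
        scores.append((precision_at_k * hits).sum() / r)
    return float(np.mean(scores))

# Example with random data:
# emb = np.random.randn(50, 128); lab = np.random.randint(0, 5, size=50)
# print(map_at_r(emb, lab))
```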

Binary vulnerability detection

  • Run cd baseline_model && python run_vulnerability_detection.py

Binary decompilation [N-Bref]

  1. Dump the trace of tree expansion:
    • To accelerate the online processing of the tree output, we dump the trace of the tree data by running python -m preprocess.dump_trace.
  2. Training scripts:
    • First, cd baseline_model.
    • To train the model using torch parallel, run python run_tree_transformer.py.
    • To train it on multiple GPUs using distributed PyTorch, run python run_tree_transformer_multi_gpu.py (a generic sketch of the distributed setup follows this list).
    • To evaluate, run python run_tree_transformer.py --eval.
    • To evaluate a multi-GPU trained model, run python run_tree_transformer_multi_gpu.py --eval.
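
The multi-GPU script presumably handles the torch.distributed setup internally. For readers unfamiliar with that pattern, here is a generic, self-contained sketch (not the repo's code); it uses the gloo backend so it also runs on a CPU-only machine.

```python
# Generic DistributedDataParallel sketch, illustrating the launch pattern only.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # Use "nccl" on multi-GPU machines; "gloo" keeps this sketch runnable on CPU.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(16, 16))   # placeholder for the tree transformer
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()
    loss.backward()                        # gradients are all-reduced across ranks
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```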

References

[1] Ye, Fangke, et al. "MISIM: An End-to-End Neural Code Similarity System." arXiv preprint arXiv:2006.05265 (2020).

[2] Zhou, Yaqin, et al. "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks." Advances in Neural Information Processing Systems. 2019.

[3] Shi, Zhan, et al. "Learning Execution through Neural Code Fusion." ICLR (2020).

License

This repo is CC-BY-NC licensed, as found in the LICENSE file.
