AI-based, context-driven network device ranking

Related tags

Deep Learningbatea
Overview

Python package

logo

Batea

A batea is a large shallow pan of wood or iron traditionally used by gold prospectors for washing sand and gravel to recover gold nuggets.

Batea is a context-driven network device ranking framework based on the anomaly detection family of machine learning algorithms. The goal of Batea is to allow security teams to automatically filter interesting network assets in large networks using nmap scan reports. We call those Gold Nuggets.

For more information about Gold Nuggeting and the science behind Batea, check out our whitepaper here

You can try Batea on your nmap scan data without downloading the software, using Batea Live: https://batea.delvesecurity.com/

How it works

Batea works by constructing a numerical representation (numpy) of all devices from your nmap reports (XML) and then applying anomaly detection methods to uncover the gold nuggets. It is easily extendable by adding specific features, or interesting characteristics, to the numerical representation of the network elements.

The numerical representation of the network is constructed using features, which are inspired by the expertise of the security community. The features act as elements of intuition, and the unsupervised anomaly detection methods allow the context of the network asset, or the total description of the network, to be used as the central building block of the ranking algorithm. The exact algorithm used is Isolation Forest (https://en.wikipedia.org/wiki/Isolation_forest)

Machine learning models are the heart of Batea. Models are algorithms trained on the whole dataset and used to predict a score on the same (and other) data points (network devices). Batea also allows for model persistence. That is, you can re-use pretrained models and export models trained on large datasets for further use.

Usage

# Complete info
$ sudo nmap -A 192.168.0.0/16 -oX output.xml

# Partial info
$ sudo nmap -O -sV 192.168.0.0/16 -oX output.xml


$ batea -v output.xml

Installation

$ git clone [email protected]:delvelabs/batea.git
$ cd batea
$ python3 setup.py sdist
$ pip3 install -r requirements.txt
$ pip3 install -e .

Developers Installation

$ git clone [email protected]:delvelabs/batea.git
$ cd batea
$ python3 -m venv batea/
$ source batea/bin/activate
$ python3 setup.py sdist
$ pip3 install -r requirements-dev.txt
$ pip3 install -e .
$ pytest

Example usage

# simple use (output top 5 gold nuggets with default format)
$ batea nmap_report.xml

# Output top 3
$ batea -n 3 nmap_report.xml

# Output all assets
$ batea -A nmap_report.xml

# Using multiple input files
$ batea -A nmap_report1.xml nmap_report2.xml

# Using wildcards (default xsl)
$ batea ./nmap*.xml
$ batea -f csv ./assets*.csv

# You can use batea on pretrained models and export trained models.

# Training, output and dumping model for persistence
$ batea -D mymodel.batea nmap_report.xml

# Using pretrained model
$ batea -L mymodel.batea nmap_report.xml

# Using preformatted csv along with xml files
$ batea -x nmap_report.xml -c portscan_data.csv

# Adjust verbosity
$ batea -vv nmap_report.xml

How to add a feature

Batea works by assigning numerical features to every host in the report (or series of report). Hosts are python objects derived from the nmap report. They consist of the following list of attributes: [ipv4, hostname, os_info, ports] where ports is a list of ports objects. Each port has the following list of attributes : [port, protocol, state, service, software, version, cpe, scripts], all defaulting to None.

Features are objects inherited from the FeatureBase class that instantiate a specific _transform method. This method always takes the list of all hosts as input and returns a lambda function that maps each host to a numpy column of numeric values (host order is conserved). The column is then appended to the matrix representation of the report. Features must output correct numerical values (floats or integers) and nothing else.

Most feature transformations are implemented using a simple lambda function. Just make sure to default a numeric value to every host for model compatibility.

Ex:

class CustomInterestingPorts(FeatureBase):
    def __init__(self):
        super().__init__(name="some_custom_interesting_ports")

    def _transform(self, hosts):
      """This method takes a list of hosts and returns a function that counts the number
      of host ports member from a predefined list of "interesting" ports, defaulting to 0.

      Parameters
      ----------
      hosts : list
          The list of all hosts

      Returns
      -------
      f : lambda function
          Counts the number of ports in the defined list.
      """
        member_ports = [21, 22, 25, 8080, 8081, 1234]
        f = lambda host: len([port for port in host.ports if port.port in member_ports])
        return f

You can then add the feature to the report by using the NmapReport.add_feature method in batea/__init__.py

from .features.basic_features import CustomInterestingPorts

def build_report():
    report = NmapReport()
    #[...]
    report.add_feature(CustomInterestingPorts())

    return report

Using precomputed tabular data (CSV)

It is possible to use preprocessed data to train the model or for prediction. The data has to be indexed by (ipv4, port) with one unique combination per row. The type of data should be close to what you expect from the XML version of an nmap report. A column has to use one of the following names, but you don't have to use all of them. The parser defaults to null values if a column is absent.

  'ipv4',
  'hostname',
  'os_name',
  'port',
  'state',
  'protocol',
  'service',
  'software_banner',
  'version',
  'cpe',
  'other_info'

Example:

ipv4,hostname,os_name,port,state,protocol,service,software_banner
10.251.53.100,internal.delvesecurity.com,Linux,110,open,tcp,rpcbind,"program version   port/proto  service100000  2,3,4        111/tcp  rpcbind100000  2,3,4    "
10.251.53.100,internal.delvesecurity.com,Linux,111,open,tcp,rpcbind,
10.251.53.188,serious.delvesecurity.com,Linux,6000,open,tcp,X11,"X11Probe: CentOS"

Outputing numerical representation

For the data scientist in you, or just for fun and profit, you can output the numerical matrix along with the score column instead of the regular output. This can be useful for further data analysis and debug purpose.

$ batea -oM network_matrix nmap_report.xml
Owner
Secureworks Taegis VDR
Automatically identify and prioritize vulnerabilities for intelligent remediation.
Secureworks Taegis VDR
moving object detection for satellite videos.

DSFNet: Dynamic and Static Fusion Network for Moving Object Detection in Satellite Videos Algorithm Introduction DSFNet: Dynamic and Static Fusion Net

xiaochao 39 Dec 16, 2022
Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.

Jittor: a Just-in-time(JIT) deep learning framework Quickstart | Install | Tutorial | Chinese Jittor is a high-performance deep learning framework bas

2.7k Jan 03, 2023
Bridging Composite and Real: Towards End-to-end Deep Image Matting

Bridging Composite and Real: Towards End-to-end Deep Image Matting Please note that the official repository of the paper Bridging Composite and Real:

Jizhizi_Li 30 Oct 31, 2022
Official code for "Eigenlanes: Data-Driven Lane Descriptors for Structurally Diverse Lanes", CVPR2022

[CVPR 2022] Eigenlanes: Data-Driven Lane Descriptors for Structurally Diverse Lanes Dongkwon Jin, Wonhui Park, Seong-Gyun Jeong, Heeyeon Kwon, and Cha

Dongkwon Jin 106 Dec 29, 2022
Neural Tangent Generalization Attacks (NTGA)

Neural Tangent Generalization Attacks (NTGA) ICML 2021 Video | Paper | Quickstart | Results | Unlearnable Datasets | Competitions | Citation Overview

Chia-Hung Yuan 34 Nov 25, 2022
This code provides various models combining dilated convolutions with residual networks

Overview This code provides various models combining dilated convolutions with residual networks. Our models can achieve better performance with less

Fisher Yu 1.1k Dec 30, 2022
A PyTorch implementation: "LASAFT-Net-v2: Listen, Attend and Separate by Attentively aggregating Frequency Transformation"

LASAFT-Net-v2 Listen, Attend and Separate by Attentively aggregating Frequency Transformation Woosung Choi, Yeong-Seok Jeong, Jinsung Kim, Jaehwa Chun

Woosung Choi 29 Jun 04, 2022
[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

TubeDETR: Spatio-Temporal Video Grounding with Transformers Website • STVG Demo • Paper This repository provides the code for our paper. This includes

Antoine Yang 108 Dec 27, 2022
Zen-NAS: A Zero-Shot NAS for High-Performance Deep Image Recognition

Zen-NAS: A Zero-Shot NAS for High-Performance Deep Image Recognition How Fast Compare to Other Zero-Shot NAS Proxies on CIFAR-10/100 Pre-trained Model

190 Dec 29, 2022
An exploration of log domain "alternative floating point" for hardware ML/AI accelerators.

This repository contains the SystemVerilog RTL, C++, HLS (Intel FPGA OpenCL to wrap RTL code) and Python needed to reproduce the numerical results in

Facebook Research 373 Dec 31, 2022
DLWP: Deep Learning Weather Prediction

DLWP: Deep Learning Weather Prediction DLWP is a Python project containing data-

Kushal Shingote 3 Aug 14, 2022
Pytorch and Torch testing code of CartoonGAN

CartoonGAN-Test-Pytorch-Torch Pytorch and Torch testing code of CartoonGAN [Chen et al., CVPR18]. With the released pretrained models by the authors,

Yijun Li 642 Dec 27, 2022
Leaf: Multiple-Choice Question Generation

Leaf: Multiple-Choice Question Generation Easy to use and understand multiple-choice question generation algorithm using T5 Transformers. The applicat

Kristiyan Vachev 62 Dec 20, 2022
This is a Deep Leaning API for classifying emotions from human face and human audios.

Emotion AI This is a Deep Leaning API for classifying emotions from human face and human audios. Starting the server To start the server first you nee

crispengari 5 Oct 02, 2022
Crawl & visualize ICLR papers and reviews

Crawl and Visualize ICLR 2022 OpenReview Data Descriptions This Jupyter Notebook contains the data crawled from ICLR 2022 OpenReview webpages and thei

Federico Berto 75 Dec 05, 2022
A PyTorch Implementation of "SINE: Scalable Incomplete Network Embedding" (ICDM 2018).

Scalable Incomplete Network Embedding ⠀⠀ A PyTorch implementation of Scalable Incomplete Network Embedding (ICDM 2018). Abstract Attributed network em

Benedek Rozemberczki 69 Sep 22, 2022
Model Serving Made Easy

The easiest way to build Machine Learning APIs BentoML makes moving trained ML models to production easy: Package models trained with any ML framework

BentoML 4.4k Jan 08, 2023
Denoising Diffusion Probabilistic Models

Denoising Diffusion Probabilistic Models This repo contains code for DDPM training. Based on Denoising Diffusion Probabilistic Models, Improved Denois

Alexander Markov 7 Dec 15, 2022
Spearmint Bayesian optimization codebase

Spearmint Spearmint is a software package to perform Bayesian optimization. The Software is designed to automatically run experiments (thus the code n

Formerly: Harvard Intelligent Probabilistic Systems Group -- Now at Princeton 1.5k Dec 29, 2022
Object detection evaluation metrics using Python.

Object detection evaluation metrics using Python.

Louis Facun 2 Sep 06, 2022