Weird Sort-and-Compress Thing

A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by some name I don't know yet). There's a lot still to improve about this algorithm, so be careful where you use it.

How it works

Here's an example for the following list:

l = [1, 2, 2, 2, 3]

The algorithm starts with counting sort, creating a dictionary with each unique number as key and the number of occurences of it in the list as the value:

d = {1: 1, 2: 3, 3: 1}

To decrease the space needed to store the numbers in memory, we'll only store the first number and then the difference between each of the next numbers and the previous one:

d2 = [(1, 1), (1, 3), (1, 1))

Now, the minimum amount of memory we need to store every key that's in d2 is 1 bit, since 1 is the maximum difference between any subsequent elements. The same applies to the values, except that to store any value here we need 2 bits of memory, since the maximum value is 3(11 in binary). So we know that we can store this list as a sequence of 3 bits elements, like this:

d2_bin = ["101", "111", 101"]

We can now return the list as a single number, along with a pair of integers containing the number of bits in each key and the number of bits in each value, allowing the value to be decompressed.

Memory efficiency

Here's a list with the sum of the number of bits of all numbers in a list with 100 elements, generated with random values in the range 0 to 50 and generated 20 times, vs. the number of bits in the resulting compressed integer(taking as a premise that all numbers in the array are all actually stored in continuous memory, including duplicates):

And 1000 numbers from 0 to 50, also 20 times:

4724 => 358
4827 => 309
4818 => 308
4801 => 309
4763 => 309
4763 => 309
4801 => 359
4757 => 359
4766 => 309
4794 => 309
4769 => 309
4789 => 359
4887 => 359
4787 => 309
4761 => 309
4749 => 309
4844 => 308
4798 => 359
4799 => 308
4763 => 359

Weird Sort-and-Compress Thing

Related tags

Overview

Weird Sort-and-Compress Thing

How it works

Memory efficiency

Owner

Douglas

LCG T-TEST USING EUCLIDEAN METHOD

A curated list of efficient attention modules

Calibre recipe to convert latest issue of Analyse & Kritik into an ebook

CCF BDCI BERT系统调优赛题baseline（Pytorch版本）

Conversational-AI-ChatBot - Intelligent ChatBot built with Microsoft's DialoGPT transformer to make conversations with human users!

pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation

Code for "Generating Disentangled Arguments with Prompts: a Simple Event Extraction Framework that Works"

Final Project Bootcamp Zero

SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Simple, hackable offline speech to text - using the VOSK-API.

An Explainable Leaderboard for NLP

💛 Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

spaCy plugin for Transformers , Udify, ELmo, etc.

Quick insights from Zoom meeting transcripts using Graph + NLP

Code for PED: DETR For (Crowd) Pedestrian Detection

A method to generate speech across multiple speakers

An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

code for "AttentiveNAS Improving Neural Architecture Search via Attentive Sampling"

Under the hood working of transformers, fine-tuning GPT-3 models, DeBERTa, vision models, and the start of Metaverse, using a variety of NLP platforms: Hugging Face, OpenAI API, Trax, and AllenNLP

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions