Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

Last update: Jan 09, 2022

Overview

MOF-Water-Affinity-Prediction-

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge Structural Database and the CoRE_MOF 2019 dataset.

Prediction Model

The prediction model determines whether a given MOF is hydrophobic or hydrophilic. It uses a Random Forest model from the XGBoost library through its scikit-learn interface. The model reads a .csv file of training data and then predicts the water affinity of a user-supplied MOF. The user can specify which input parameters the model uses.

Overfitting/Underfitting

This script investigates how the prediction model’s accuracy and precision vary with the number and combination of inputs. It lets a user compare how different combinations of inputs affect the score and the standard deviation of the model’s results.

It reads a .csv file of training data containing 13 input parameters, then generates a list of all possible combinations of input parameters for the lengths specified by the user. For example, if the user requests combinations of lengths 3, 4, and 10, the program generates every combination of those lengths and uses each one as input for the model. Each combination goes through the same process as in the prediction model above, and its results are appended to a .csv file for later analysis. Finally, a plot with filters is created for visualization.
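The combination step can be sketched with the standard library's `itertools.combinations`. The function name and the placeholder feature names `x1` to `x13` are assumptions for illustration, not the names used in the script.

```python
from itertools import combinations


def feature_combinations(features, lengths):
    """Yield every combination of `features` for each requested length."""
    for k in sorted(set(lengths)):
        if not 1 <= k <= len(features):
            raise ValueError(f"length {k} is outside 1..{len(features)}")
        yield from combinations(features, k)


# 13 placeholder names standing in for the 13 input parameters.
features = [f"x{i}" for i in range(1, 14)]

combos = list(feature_combinations(features, [3, 4, 10]))
print(len(combos))  # 286 + 715 + 286 = 1287
```

With 13 parameters, lengths 3, 4, and 10 already give 1287 model runs, so the choice of lengths largely determines the script's running time.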

.cif to .csv Converter

To create a training set for the prediction model, a .csv file must be created containing all the available data points: the MOFs and their crystallographic data. The data is collected from three sources: WebCSD, the CoRE_MOF 2019 dataset, and the MOFs’ .cif files. Additional values must also be calculated from the information in the .cif files.

The code reads a .txt file, a folder, or both, containing the refcodes assigned to the MOFs by the Cambridge Structural Database and the corresponding .cif files. It then searches for these refcodes in the CoRE_MOF 2019 dataset and retrieves the crystallographic data attached to them. Additionally, it uses the .cif files of the MOFs to calculate the atomic mass percentage of the metals contained in each MOF. These calculations are stored in columns 14-17, but are treated as one input parameter in the models in an attempt to relate them to each other. It also labels the MOFs in the training set as hydrophobic or hydrophilic based on previously collected information from the literature. Finally, it produces a .csv file ready for use in the prediction model.
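The metal mass percentage calculation can be illustrated as follows. This sketch assumes the element counts have already been extracted from a .cif file; the function name, the small atomic mass table, and the set of metals are hypothetical and cover only this example, not the converter's actual implementation.

```python
# Standard atomic masses in g/mol for the elements used in this example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
               "Cu": 63.546, "Zn": 65.38, "Zr": 91.224}
METALS = {"Cu", "Zn", "Zr"}  # illustrative subset only


def metal_mass_percentages(element_counts):
    """Return {metal: percentage of the total mass} for one structure."""
    total = sum(ATOMIC_MASS[el] * n for el, n in element_counts.items())
    return {
        el: 100.0 * ATOMIC_MASS[el] * n / total
        for el, n in element_counts.items()
        if el in METALS
    }


# Illustrative formula unit Zn4O(C8H4O4)3, i.e. Zn4 O13 C24 H12.
counts = {"Zn": 4, "O": 13, "C": 24, "H": 12}
pct = metal_mass_percentages(counts)
print({el: round(value, 2) for el, value in pct.items()})
```

Because the percentages are ratios, it does not matter whether the counts describe one formula unit or a whole unit cell, as long as they are consistent.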

.cif folders

Three different folders are used to store .cif files.

cif: hydrophobic MOFs received from Dr. Z. Qiao.
manual hydrophobic: hydrophobic MOFs collected from the literature.
manual hydrophilic: hydrophilic MOFs collected from the literature.

To add additional .cif files:

Place the new .cif files in either the manual hydrophobic folder or the manual hydrophilic folder. Make sure the file names match the CCDC refcodes (with or without the CoRE_MOF 2019 name extensions). Finally, add these refcodes to the .txt file available in each folder so that the .cif files can be read by the cif Reader program.
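Before running the converter, it can help to check that a folder and its .txt file agree. The sketch below is a hypothetical helper, not part of the repository: it assumes the list file is called `refcodes.txt` and lists one name per line exactly as the .cif file names are written (without the extension). The refcodes used are placeholders.

```python
from pathlib import Path
import tempfile


def missing_refcodes(folder, list_name="refcodes.txt"):
    """Compare .cif file names in `folder` with the names in its .txt list.

    Returns (unlisted, no_file): .cif files absent from the list, and
    listed names that have no matching .cif file.
    """
    folder = Path(folder)
    lines = (folder / list_name).read_text().splitlines()
    listed = {line.strip() for line in lines if line.strip()}
    on_disk = {path.stem for path in folder.glob("*.cif")}
    return sorted(on_disk - listed), sorted(listed - on_disk)


# Demonstration in a temporary folder with placeholder refcodes.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "ABCDEF.cif").write_text("")
    (tmp / "GHIJKL.cif").write_text("")
    (tmp / "refcodes.txt").write_text("ABCDEF\nMNOPQR\n")
    unlisted, no_file = missing_refcodes(tmp)

print("not in the .txt file:", unlisted)
print("listed but no .cif:", no_file)
```

Both lists should be empty for a folder that is ready to be read.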

This project is licensed under the terms of the GNU General Public License v3.0
