goeckslab/Galaxy-ML

Galaxy-ML

Galaxy-ML is a web-based framework for building end-to-end machine learning pipelines, with special support for biomedical data. Cutting-edge machine learning libraries are combined under a unified scikit-learn API to provide thousands of different pipelines suitable for various needs. Delivered as Galaxy tools, Galaxy-ML provides scalable, reproducible and transparent machine learning computations.

Key features

  • simple web UI
  • little or no coding required
  • fast model deployment and model selection, with a focus on hyperparameter tuning using GridSearchCV
  • high level of parallel and automated computation
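
The hyperparameter tuning mentioned in the list above builds on scikit-learn's GridSearchCV. The following is a minimal sketch in plain scikit-learn, not Galaxy-ML's own tool code. The synthetic dataset, the SVC estimator and the grid values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data (an assumption for illustration only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate hyperparameter values; every combination is cross-validated.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

# cv=5 uses 5-fold cross-validation; n_jobs=2 evaluates candidates in parallel.
search = GridSearchCV(SVC(), param_grid=param_grid, scoring="roc_auc",
                      cv=5, n_jobs=2)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refitted on the full data with the best parameter combination.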

Supported modules

A typical machine learning pipeline is composed of a main estimator/model and optional preprocessing component(s).

Model
  • scikit-learn

    • sklearn.ensemble
    • sklearn.linear_model
    • sklearn.naive_bayes
    • sklearn.neighbors
    • sklearn.svm
    • sklearn.tree
  • xgboost

    • XGBClassifier
    • XGBRegressor
  • mlxtend

    • StackingCVClassifier
    • StackingClassifier
    • StackingCVRegressor
    • StackingRegressor
  • Keras (deep learning models are re-implemented to fully support sklearn APIs; parameters, including layer sub-parameters, can be swapped or searched; callbacks are supported)

    • KerasGClassifier
    • KerasGRegressor
    • KerasGBatchClassifier (works best with online data generators, processing images, genomic sequences and so on)
  • BinarizeTargetClassifier/BinarizeTargetRegressor

  • IRAPSClassifier

Preprocessor
  • scikit-learn
    • sklearn.preprocessing
    • sklearn.feature_selection
    • sklearn.decomposition
    • sklearn.kernel_approximation
    • sklearn.cluster
  • imbalanced-learn
    • imblearn.under_sampling
    • imblearn.over_sampling
    • imblearn.combine
  • skrebate
    • ReliefF
    • SURF
    • SURFstar
    • MultiSURF
    • MultiSURFstar
  • TDMScaler
  • DyRFE/DyRFECV
  • Z_RandomOverSampler
  • GenomeOneHotEncoder
  • ProteinOneHotEncoder
  • FastaDNABatchGenerator
  • FastaRNABatchGenerator
  • FastaProteinBatchGenerator
  • GenomicIntervalBatchGenerator
  • GenomicVariantBatchGenerator
  • ImageDataFrameBatchGenerator
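
As a rough illustration of how a main estimator and optional preprocessing components fit together, here is a sketch that uses only scikit-learn modules named in the lists above. The synthetic data, the step names and the choice of components are assumptions. The Galaxy-ML tools assemble pipelines through the web UI rather than through this exact code.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset (illustrative assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Two preprocessing steps followed by the main estimator.
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),                          # sklearn.preprocessing
    ("selector", SelectKBest(score_func=f_classif, k=5)),  # sklearn.feature_selection
    ("model", LogisticRegression(max_iter=1000)),          # sklearn.linear_model
])

# The whole pipeline is refitted inside each cross-validation fold.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```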

Installation

APIs for models, preprocessors and utils implemented in Galaxy-ML can be installed separately.

Installing using anaconda (recommended)
conda install -c bioconda -c conda-forge Galaxy-ML
Installing using pip
pip install -U Galaxy-ML
Installing from source
python setup.py install
Using source code in place
pip install -e .

To install Galaxy-ML tools in Galaxy, please refer to https://galaxyproject.org/admin/tools/add-tool-from-toolshed-tutorial/.

Running the tests

Before running the tests, run the following commands:

conda create --name galaxy_ml python=3.9
conda activate galaxy_ml
pip install -e .
pip install nose nose-htmloutput pytest
cd galaxy_ml

To run all tests and generate an HTML report:

nosetests ./tests --with-html --html-file=./report.html

To run the tests in a specific file (e.g., test_keras_galaxy.py) and generate an HTML report:

nosetests ./tests/test_keras_galaxy.py --with-html --html-file=./report.html

To run a specific test in a specific file (e.g., test_multi_dimensional_output in test_keras_galaxy.py) and generate an HTML report:

nosetests ./tests/test_keras_galaxy.py:test_multi_dimensional_output --with-html --html-file=./report.html

Examples for using Galaxy-ML custom models

# handle imports
from keras.models import Sequential
from keras.layers import Dense, Activation
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV
from galaxy_ml.keras_galaxy_models import KerasGClassifier


# build a DNN classifier
model = Sequential()
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(1, activation='sigmoid'))
config = model.get_config()

classifier = KerasGClassifier(config, random_state=42)


# clone a classifier
clf = clone(classifier)


# Get parameters
params = clf.get_params()


# Set parameters
new_params = dict(
    epochs=60,
    lr=0.01,
    layers_1_Dense__config__kernel_initializer__config__seed=999,
    layers_0_Dense__config__kernel_initializer__config__seed=999
)
clf.set_params(**new_params)


# model evaluation using GridSearchCV
# X, y: feature matrix and binary labels, assumed to be defined by the caller
grid = GridSearchCV(clf, param_grid={}, scoring='roc_auc', cv=5, n_jobs=2)
grid.fit(X, y)
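
The clone, get_params and set_params calls above follow scikit-learn's estimator interface, where a double underscore addresses a nested parameter. The sketch below shows the same mechanism on a plain scikit-learn pipeline rather than a Galaxy-ML Keras model. The pipeline and the values set are illustrative assumptions.

```python
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(RobustScaler(), KNeighborsClassifier())

# clone() returns an unfitted copy with identical parameters.
pipe_copy = clone(pipe)

# Nested parameters are addressed as <step name>__<parameter name>.
pipe_copy.set_params(kneighborsclassifier__n_neighbors=100,
                     robustscaler__with_centering=False)

params = pipe_copy.get_params()
print(params["kneighborsclassifier__n_neighbors"])
# The original pipeline is unaffected by changes to the clone.
print(pipe.get_params()["kneighborsclassifier__n_neighbors"])
```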

Example for using Galaxy-ML to persist a sklearn/keras model

from galaxy_ml.model_persist import (dump_model_to_h5,
                                     load_model_from_h5)
                 
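# dump model to hdf5 (save_path is a placeholder for the output file path)
dump_model_to_h5(model, save_path,
                 store_hyperparameter=True)

# load model from hdf5 (path_to_hdf5 is a placeholder for an existing file)
model = load_model_from_h5(path_to_hdf5)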

Performance comparison

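Galaxy-ML's HDF5 saving utilities are faster than pickle for large, array-rich models. In the sample run below, dumping took about 3.0 s with HDF5 versus 3.7 s with pickle, and loading took about 0.64 s versus 1.25 s, with nearly identical file sizes.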

Loading model using pickle...
(1.2471628189086914 s)

Dumping model using pickle...
(3.6942389011383057 s)
File size: 930712861

Dumping model to hdf5...
(3.006715774536133 s)
File size: 930729696

Loading model from hdf5...
(0.6420958042144775 s)

Pipeline(memory=None,
         steps=[('robustscaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('kneighborsclassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=1, n_neighbors=100, p=2,
                                      weights='uniform'))],
         verbose=False)
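
Timings like those above could be collected with a small harness along the following lines. This is a sketch under stated assumptions, not the script that produced the numbers: the helpers `timed`, `pickle_dump` and `pickle_load` are hypothetical names, the model is a small synthetic one, and only the pickle side is measured. The `dump_model_to_h5` and `load_model_from_h5` functions shown earlier could be passed to `timed` in the same way.

```python
import os
import pickle
import tempfile
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def pickle_dump(model, path):
    with open(path, "wb") as handle:
        pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)


def pickle_load(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A fitted KNN model keeps its training data, so it is array-rich.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))
y = rng.integers(0, 2, size=2000)
model = KNeighborsClassifier(n_neighbors=100).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    _, dump_seconds = timed(pickle_dump, model, path)
    file_size = os.path.getsize(path)
    restored, load_seconds = timed(pickle_load, path)

print(f"Dumping model using pickle... ({dump_seconds} s)")
print(f"File size: {file_size}")
print(f"Loading model using pickle... ({load_seconds} s)")
```

Single-run timings vary with disk caching and model size, so repeating each measurement several times gives a fairer comparison.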

Publication

Gu Q, Kumar A, Bray S, Creason A, Khanteymoori A, Jalili V, et al. (2021) Galaxy-ML: An accessible, reproducible, and scalable machine learning toolkit for biomedicine. PLoS Comput Biol 17(6): e1009014. https://doi.org/10.1371/journal.pcbi.1009014