Title: | Develop Clinical Prediction Models Using the Common Data Model |
---|---|
Description: | A user friendly way to create patient level prediction models using the Observational Medical Outcomes Partnership Common Data Model. Given a cohort of interest and an outcome of interest, the package can use data in the Common Data Model to build a large set of features. These features can then be used to fit a predictive model with a number of machine learning algorithms. This is further described in Reps (2017) <doi:10.1093/jamia/ocy032>. |
Authors: | Egill Fridgeirsson [aut, cre], Jenna Reps [aut], Martijn Schuemie [aut], Marc Suchard [aut], Patrick Ryan [aut], Peter Rijnbeek [aut], Observational Health Data Science and Informatics [cph] |
Maintainer: | Egill Fridgeirsson <[email protected]> |
License: | Apache License 2.0 |
Version: | 6.4.0 |
Built: | 2025-02-12 05:25:00 UTC |
Source: | https://github.com/ohdsi/patientlevelprediction |
Calculate the average precision
averagePrecision(prediction)
prediction |
A prediction object |
Calculates the average precision from a prediction object
The average precision value
prediction <- data.frame(
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  outcomeCount = c(0, 1, 0, 1, 1)
)
averagePrecision(prediction)
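For intuition, the average precision can be computed by hand: rank patients by predicted value and average the precision evaluated at each observed outcome. A minimal base-R sketch of this standard definition (not the package implementation, which may handle ties differently):

```r
# average precision by hand: sort by predicted risk (descending), then
# average the precision at each position with an observed outcome
prediction <- data.frame(
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  outcomeCount = c(0, 1, 0, 1, 1)
)
labels <- prediction$outcomeCount[order(prediction$value, decreasing = TRUE)]
hits <- which(labels == 1)
averagePrecisionManual <- mean(cumsum(labels)[hits] / hits)
averagePrecisionManual # (1/1 + 2/2 + 3/4) / 3 = 0.9166667
```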
brierScore
brierScore(prediction)
prediction |
A prediction dataframe |
Calculates the Brier score from a prediction object
A list containing the Brier score and the scaled Brier score
prediction <- data.frame(
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  outcomeCount = c(0, 1, 0, 1, 1)
)
brierScore(prediction)
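The Brier score itself is the mean squared difference between the predicted probability and the observed outcome; the scaled version is commonly defined relative to the score of a model that always predicts the outcome prevalence. A base-R sketch of these definitions (the package function may differ in details):

```r
prediction <- data.frame(
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  outcomeCount = c(0, 1, 0, 1, 1)
)
# Brier score: mean squared error of the predicted probabilities
brier <- mean((prediction$value - prediction$outcomeCount)^2)
brier # 0.27
# scaled Brier score: 1 - brier / brierMax, where brierMax is the score of
# always predicting the observed prevalence
prevalence <- mean(prediction$outcomeCount)
brierMax <- prevalence * (1 - prevalence)
brierScaled <- 1 - brier / brierMax
```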
calibrationLine
calibrationLine(prediction, numberOfStrata = 10)
prediction |
A prediction object |
numberOfStrata |
The number of groups to split the prediction into |
A list containing the calibrationLine coefficients, the aggregate data used to fit the line and the Hosmer-Lemeshow goodness of fit test
prediction <- data.frame(
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  outcomeCount = c(0, 1, 0, 1, 1)
)
calibrationLine(prediction, numberOfStrata = 1)
Compute the area under the ROC curve
computeAuc(prediction, confidenceInterval = FALSE)
prediction |
A prediction object |
confidenceInterval |
Should 95 percent confidence intervals be computed? |
Computes the area under the ROC curve for the predicted probabilities, given the true observed outcomes.
A data.frame containing the AUC and optionally the 95% confidence interval
prediction <- data.frame(
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  outcomeCount = c(0, 1, 0, 1, 1)
)
computeAuc(prediction)
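For a small dataset the AUC can be verified by hand: it equals the probability that a randomly chosen case is ranked above a randomly chosen non-case. A base-R sketch of this pairwise definition (an illustration, not the package implementation):

```r
prediction <- data.frame(
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  outcomeCount = c(0, 1, 0, 1, 1)
)
cases <- prediction$value[prediction$outcomeCount == 1]
nonCases <- prediction$value[prediction$outcomeCount == 0]
# fraction of case/non-case pairs where the case is ranked higher
# (ties count as half)
pairs <- outer(cases, nonCases, function(x, y) (x > y) + 0.5 * (x == y))
aucManual <- mean(pairs)
aucManual # 5 of 6 pairs are correctly ordered, so 0.8333
```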
Computes grid performance with a specified performance function
computeGridPerformance(prediction, param, performanceFunct = "computeAuc")
prediction |
A dataframe with predictions and outcomeCount per rowId |
param |
A list of hyperparameters |
performanceFunct |
A string specifying which performance function to use. Default is "computeAuc" |
A list with an overview of the performance
prediction <- data.frame(
  rowId = c(1, 2, 3, 4, 5),
  outcomeCount = c(0, 1, 0, 1, 0),
  value = c(0.1, 0.9, 0.2, 0.8, 0.3),
  index = c(1, 1, 1, 1, 1)
)
param <- list(hyperParam1 = 5, hyperParam2 = 100)
computeGridPerformance(prediction, param, performanceFunct = "computeAuc")
Sets up a python environment to use for PLP (can be conda or venv)
configurePython(envname = "PLP", envtype = NULL, condaPythonVersion = "3.11")
envname |
A string for the name of the virtual environment (default is 'PLP') |
envtype |
An option for specifying the environment as 'conda' or 'python'. If NULL then the default is 'conda' for Windows users and 'python' for non-Windows users |
condaPythonVersion |
String, Python version to use when creating a conda environment |
This function creates a python environment that can be used by PatientLevelPrediction and installs all the required package dependencies.
The location of the created conda or virtual Python environment
## Not run:
configurePython(envname = "PLP", envtype = "conda")
## End(Not run)
Summarises the covariateData to calculate the mean and standard deviation per covariate. If the labels are given, it also stratifies these values by class label, and if the trainRowIds and testRowIds specifying the patients in the train/test sets respectively are input, the values are also stratified by train and test set
covariateSummary(
  covariateData,
  cohort,
  labels = NULL,
  strata = NULL,
  variableImportance = NULL,
  featureEngineering = NULL
)
covariateData |
The covariateData part of the plpData that is extracted using getPlpData |
cohort |
The patient cohort to calculate the summary |
labels |
A data.frame with the columns rowId and outcomeCount |
strata |
A data.frame containing the columns rowId, strataName |
variableImportance |
A data.frame with the columns covariateId and value (the variable importance value) |
featureEngineering |
(currently not used) A function or list of functions specifying any feature engineering to create covariates before summarising |
The function calculates summary statistics (count, mean, standard deviation) for each covariate, optionally stratified by the inputs described above
A data.frame containing: CovariateCount, CovariateMean and CovariateStDev for any specified stratification
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 100)
covariateSummary <- covariateSummary(plpData$covariateData, plpData$cohorts)
head(covariateSummary)
Extracts covariates based on cohorts
createCohortCovariateSettings(
  cohortName,
  settingId,
  cohortDatabaseSchema = NULL,
  cohortTable = NULL,
  cohortId,
  startDay = -30,
  endDay = 0,
  count = FALSE,
  ageInteraction = FALSE,
  lnAgeInteraction = FALSE,
  analysisId = 456
)
cohortName |
Name for the cohort |
settingId |
A unique id for the covariate time and |
cohortDatabaseSchema |
The schema of the database with the cohort. If nothing is specified then the cohortDatabaseSchema from databaseDetails at runtime is used. |
cohortTable |
The table name that contains the covariate cohort. If nothing is specified then the cohortTable from databaseDetails at runtime is used. |
cohortId |
cohort id for the covariate cohort |
startDay |
The number of days prior to index to start observing the cohort |
endDay |
The number of days prior to index to stop observing the cohort |
count |
If FALSE the covariate value is binary (1 means cohort occurred between index+startDay and index+endDay, 0 means it did not) If TRUE then the covariate value is the number of unique cohort_start_dates between index+startDay and index+endDay |
ageInteraction |
If TRUE, the covariate value is multiplied by the patient's age in years |
lnAgeInteraction |
If TRUE, the covariate value is multiplied by the log of the patient's age in years |
analysisId |
The analysisId for the covariate |
The user specifies a cohort and a time period; a covariate is then constructed indicating whether the patient is in that cohort during the time period relative to the target population cohort index
An object of class covariateSettings specifying how to create the cohort covariate, with covariateId = cohortId x 100000 + settingId x 1000 + analysisId
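The covariateId construction is simple arithmetic; for instance, with cohortId = 1, settingId = 1 and the default analysisId = 456 (the values used in the example below), the resulting covariateId is 101456:

```r
# covariateId = cohortId * 100000 + settingId * 1000 + analysisId
cohortId <- 1
settingId <- 1
analysisId <- 456
covariateId <- cohortId * 100000 + settingId * 1000 + analysisId
covariateId # 101456
```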
createCohortCovariateSettings(
  cohortName = "testCohort",
  settingId = 1,
  cohortId = 1,
  cohortDatabaseSchema = "cohorts",
  cohortTable = "cohort_table"
)
Create a setting that holds the details about the cdmDatabase connection for data extraction
createDatabaseDetails(
  connectionDetails,
  cdmDatabaseSchema,
  cdmDatabaseName,
  cdmDatabaseId,
  tempEmulationSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = cdmDatabaseSchema,
  cohortTable = "cohort",
  outcomeDatabaseSchema = cdmDatabaseSchema,
  outcomeTable = cohortTable,
  targetId = NULL,
  outcomeIds = NULL,
  cdmVersion = 5,
  cohortId = NULL
)
connectionDetails |
An R object of type connectionDetails created using the function createConnectionDetails in the DatabaseConnector package |
cdmDatabaseSchema |
The name of the database schema that contains the OMOP CDM instance. Requires read permissions to this database. On SQL Server, this should specify both the database and the schema, so for example 'cdm_instance.dbo'. |
cdmDatabaseName |
A string with the name of the database - this is used in the shiny app and when externally validating models to name the result list and to specify the folder name when saving validation results (defaults to cdmDatabaseSchema if not specified) |
cdmDatabaseId |
A string with a unique identifier for the database and version - this is stored in the plp object for future reference and used by the shiny app (defaults to cdmDatabaseSchema if not specified) |
tempEmulationSchema |
For DBMSs like Oracle only: the name of the database schema where you want all temporary tables to be managed. Requires create/insert permissions to this database. |
cohortDatabaseSchema |
The name of the database schema that is the location where the target cohorts are available. Requires read permissions to this database. |
cohortTable |
The tablename that contains the target cohorts. Expectation is cohortTable has format of COHORT table: COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE. |
outcomeDatabaseSchema |
The name of the database schema that is the location where the data used to define the outcome cohorts is available. Requires read permissions to this database. |
outcomeTable |
The tablename that contains the outcome cohorts. Expectation is outcomeTable has format of COHORT table: COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE. |
targetId |
An integer specifying the cohort id for the target cohort |
outcomeIds |
A single integer or vector of integers specifying the cohort ids for the outcome cohorts |
cdmVersion |
Define the OMOP CDM version used: currently supports "4" and "5". |
cohortId |
(deprecated: use targetId) old input for the target cohort id |
This function simply stores the settings for communicating with the cdmDatabase when extracting the target cohort and outcomes
A list with the database-specific settings:
connectionDetails
: An R object of type connectionDetails created using the function createConnectionDetails in the DatabaseConnector package.
cdmDatabaseSchema
: The name of the database schema that contains the OMOP CDM instance.
cdmDatabaseName
: A string with the name of the database - this is used in the shiny app and when externally validating models to name the result list and to specify the folder name when saving validation results (defaults to cdmDatabaseSchema if not specified).
cdmDatabaseId
: A string with a unique identifier for the database and version - this is stored in the plp object for future reference and used by the shiny app (defaults to cdmDatabaseSchema if not specified).
tempEmulationSchema
: The name of a database schema where you want all temporary tables to be managed. Requires create/insert permissions to this database.
cohortDatabaseSchema
: The name of the database schema that is the location where the target cohorts are available. Requires read permissions to this schema.
cohortTable
: The tablename that contains the target cohorts. Expectation is cohortTable has format of COHORT table: COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE.
outcomeDatabaseSchema
: The name of the database schema that is the location where the data used to define the outcome cohorts is available. Requires read permissions to this database.
outcomeTable
: The tablename that contains the outcome cohorts. Expectation is outcomeTable has format of COHORT table: COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE.
targetId
: An integer specifying the cohort id for the target cohort
outcomeIds
: A single integer or vector of integers specifying the cohort ids for the outcome cohorts
cdmVersion
: Define the OMOP CDM version used: currently supports "4" and "5".
# create the database details for the Eunomia example database
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
createDatabaseDetails(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  cdmDatabaseName = "main",
  cohortDatabaseSchema = "main",
  cohortTable = "cohort",
  outcomeDatabaseSchema = "main",
  outcomeTable = "cohort",
  targetId = 1, # users of celecoxib
  outcomeIds = 3, # GIbleed
  cdmVersion = 5
)
This function specifies where the results schema is and lets you pick a different schema for the cohorts and databases
createDatabaseSchemaSettings(
  resultSchema = "main",
  tablePrefix = "",
  targetDialect = "sqlite",
  tempEmulationSchema = getOption("sqlRenderTempEmulationSchema"),
  cohortDefinitionSchema = resultSchema,
  tablePrefixCohortDefinitionTables = tablePrefix,
  databaseDefinitionSchema = resultSchema,
  tablePrefixDatabaseDefinitionTables = tablePrefix
)
resultSchema |
(string) The name of the database schema with the result tables. |
tablePrefix |
(string) A string appended to the PatientLevelPrediction result table names |
targetDialect |
(string) The database management system being used |
tempEmulationSchema |
(string) The temp schema used when the database management system is Oracle |
cohortDefinitionSchema |
(string) The name of the database schema with the cohort definition tables (defaults to resultSchema). |
tablePrefixCohortDefinitionTables |
(string) A string appended to the cohort definition table names |
databaseDefinitionSchema |
(string) The name of the database schema with the database definition tables (defaults to resultSchema). |
tablePrefixDatabaseDefinitionTables |
(string) A string appended to the database definition table names |
This function can be used to specify the database settings used to upload PatientLevelPrediction results into a database
Returns a list of class 'plpDatabaseResultSchema' with all the database settings
createDatabaseSchemaSettings(resultSchema = "cdm", tablePrefix = "plp_")
Creates default list of settings specifying what parts of runPlp to execute
createDefaultExecuteSettings()
runs split, preprocess, model development and covariate summary
list with TRUE for split, preprocess, model development and covariate summary
createDefaultExecuteSettings()
Create the settings for defining how the plpData are split into test/validation/train sets using default splitting functions (either random stratified by outcome, time or subject splitting)
createDefaultSplitSetting(
  testFraction = 0.25,
  trainFraction = 0.75,
  splitSeed = sample(1e+05, 1),
  nfold = 3,
  type = "stratified"
)
testFraction |
(numeric) A real number between 0 and 1 indicating the test set fraction of the data |
trainFraction |
(numeric) A real number between 0 and 1 indicating the train set fraction of the data. If not set train is equal to 1 - test |
splitSeed |
(numeric) A seed to use when splitting the data for reproducibility (if not set a random number will be generated) |
nfold |
(numeric) An integer > 1 specifying the number of folds used in cross validation |
type |
(character) Choice of: 'stratified', 'time' or 'subject' |
Returns an object of class splitSettings that specifies the splitting function that will be called and the settings
An object of class splitSettings
createDefaultSplitSetting(
  testFraction = 0.25,
  trainFraction = 0.75,
  nfold = 3,
  splitSeed = 42
)
Creates list of settings specifying what parts of runPlp to execute
createExecuteSettings(
  runSplitData = FALSE,
  runSampleData = FALSE,
  runFeatureEngineering = FALSE,
  runPreprocessData = FALSE,
  runModelDevelopment = FALSE,
  runCovariateSummary = FALSE
)
runSplitData |
TRUE or FALSE whether to split data into train/test |
runSampleData |
TRUE or FALSE whether to over or under sample |
runFeatureEngineering |
TRUE or FALSE whether to do feature engineering |
runPreprocessData |
TRUE or FALSE whether to do preprocessing |
runModelDevelopment |
TRUE or FALSE whether to develop the model |
runCovariateSummary |
TRUE or FALSE whether to create covariate summary |
Defines what parts of runPlp to execute
list with TRUE/FALSE for each part of runPlp
# create settings with only split and model development
createExecuteSettings(runSplitData = TRUE, runModelDevelopment = TRUE)
Create the settings for defining how the plpData are split into test/validation/train sets using an existing split - good to use for reproducing results from a different run
createExistingSplitSettings(splitIds)
splitIds |
(data.frame) A data frame with rowId and index columns of type integer/numeric. Index is -1 for test set, positive integer for train set folds |
An object of class splitSettings
# rowId 1 is in fold 1, rowId 2 is in fold 2, rowId 3 is in the test set
# rowId 4 is in fold 1, rowId 5 is in fold 2
createExistingSplitSettings(
  splitIds = data.frame(
    rowId = c(1, 2, 3, 4, 5),
    index = c(1, 2, -1, 1, 2)
  )
)
Create the settings for defining any feature engineering that will be done
createFeatureEngineeringSettings(type = "none")
type |
(character) Choice of: 'none' |
Returns an object of class featureEngineeringSettings that specifies the feature engineering function that will be called and the settings
An object of class featureEngineeringSettings
createFeatureEngineeringSettings(type = "none")
Create a generalized linear model that can be used in the PatientLevelPrediction package.
createGlmModel(coefficients, intercept = 0, mapping = "logistic")
coefficients |
A dataframe containing two columns, coefficient and covariateId, both of type numeric. The covariateId column must contain valid covariateIds that match those used in the |
intercept |
A numeric value representing the intercept of the model. |
mapping |
A string representing the mapping from the linear predictors to outcome probabilities. For generalized linear models this is the inverse of the link function. The only value supported at the moment is "logistic", for logistic regression models. |
A model object containing the model (Coefficients and intercept) and the prediction function.
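The "logistic" mapping is the inverse logit: predicted risk = 1 / (1 + exp(-(intercept + sum of coefficient * covariate value))). A minimal sketch using the intercept and coefficient from the example below with a hypothetical covariate value of 10:

```r
# inverse logit mapping from linear predictor to probability
intercept <- -2.5
coefficient <- 0.05
covariateValue <- 10 # hypothetical value for covariateId 1002
linearPredictor <- intercept + coefficient * covariateValue
risk <- 1 / (1 + exp(-linearPredictor))
risk # about 0.119
```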
coefficients <- data.frame(covariateId = c(1002), coefficient = c(0.05))
model <- createGlmModel(coefficients, intercept = -2.5)
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 50)
prediction <- predictPlp(model, plpData, plpData$cohorts)
# see the predicted risk values
prediction$value
This function creates the settings for an iterative imputer, which first removes features with more than missingThreshold missing values and then imputes the missing values iteratively using chained equations
createIterativeImputer(
  missingThreshold = 0.3,
  method = "pmm",
  methodSettings = list(pmm = list(k = 5, iterations = 5))
)
missingThreshold |
The threshold for missing values to remove a feature |
method |
The method to use for imputation, currently only "pmm" is supported |
methodSettings |
A list of settings for the imputation method to use. Currently only "pmm" is supported, with settings k (the number of donors) and iterations (the number of iterations) |
The settings for the iterative imputer of class featureEngineeringSettings
# create imputer to impute values with missingness less than 30% using
# predictive mean matching in 5 iterations with 5 donors
createIterativeImputer(
  missingThreshold = 0.3,
  method = "pmm",
  methodSettings = list(pmm = list(k = 5, iterations = 5))
)
Creates a learning curve object, which can be plotted using the plotLearningCurve() function.
createLearningCurve(
  plpData,
  outcomeId,
  parallel = TRUE,
  cores = 4,
  modelSettings,
  saveDirectory = NULL,
  analysisId = "learningCurve",
  populationSettings = createStudyPopulationSettings(),
  splitSettings = createDefaultSplitSetting(),
  trainFractions = c(0.25, 0.5, 0.75),
  trainEvents = NULL,
  sampleSettings = createSampleSettings(),
  featureEngineeringSettings = createFeatureEngineeringSettings(),
  preprocessSettings = createPreprocessSettings(minFraction = 0.001, normalize = TRUE),
  logSettings = createLogSettings(),
  executeSettings = createExecuteSettings(
    runSplitData = TRUE,
    runSampleData = FALSE,
    runFeatureEngineering = FALSE,
    runPreprocessData = TRUE,
    runModelDevelopment = TRUE,
    runCovariateSummary = FALSE
  )
)
plpData |
An object of type plpData |
outcomeId |
(integer) The ID of the outcome. |
parallel |
Whether to run the code in parallel |
cores |
The number of computer cores to use if running in parallel |
modelSettings |
An object of class modelSettings created using one of the model-setting functions, e.g. setLassoLogisticRegression |
saveDirectory |
The path to the directory where the results will be saved (if NULL uses working directory) |
analysisId |
Identifier for the analysis. It is used to create, e.g., the result folder. Default is "learningCurve". |
populationSettings |
An object of type populationSettings created using createStudyPopulationSettings |
splitSettings |
An object of type splitSettings created using createDefaultSplitSetting |
trainFractions |
A list of training fractions to create models for |
trainEvents |
Events have been shown to be a determinant of model performance. Therefore, it is recommended to provide |
sampleSettings |
An object of type sampleSettings created using createSampleSettings |
featureEngineeringSettings |
An object of class featureEngineeringSettings created using createFeatureEngineeringSettings |
preprocessSettings |
An object of class preprocessSettings created using createPreprocessSettings |
logSettings |
An object of class logSettings created using createLogSettings |
executeSettings |
An object of class executeSettings created using createExecuteSettings |
A learning curve object containing the various performance measures obtained by the model for each training set fraction. It can be plotted using plotLearningCurve.
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 1000)
outcomeId <- 3
modelSettings <- setLassoLogisticRegression(seed = 42)
learningCurve <- createLearningCurve(
  plpData,
  outcomeId,
  modelSettings = modelSettings,
  saveDirectory = file.path(tempdir(), "learningCurve"),
  cores = 2
)
# clean up
unlink(file.path(tempdir(), "learningCurve"), recursive = TRUE)
Create the settings for logging the progression of the analysis
createLogSettings(
  verbosity = "DEBUG",
  timeStamp = TRUE,
  logName = "runPlp Log"
)
verbosity |
Sets the level of the verbosity. If the log level is at or higher in priority than the logger threshold, a message will print. The levels are TRACE, DEBUG, INFO, WARN, ERROR and FATAL |
timeStamp |
If TRUE a timestamp will be added to each logging statement. Automatically switched on for TRACE level. |
logName |
A string reference for the logger |
Returns an object of class logSettings that specifies the logger settings
An object of class logSettings containing the settings for the logger
# create a log settings object with DEBUG verbosity, timestamp and log name
# "runPlp Log". This needs to be passed to `runPlp`.
createLogSettings(verbosity = "DEBUG", timeStamp = TRUE, logName = "runPlp Log")
Specify settings for developing a single model
createModelDesign(
  targetId = NULL,
  outcomeId = NULL,
  restrictPlpDataSettings = createRestrictPlpDataSettings(),
  populationSettings = createStudyPopulationSettings(),
  covariateSettings = FeatureExtraction::createDefaultCovariateSettings(),
  featureEngineeringSettings = NULL,
  sampleSettings = NULL,
  preprocessSettings = NULL,
  modelSettings = NULL,
  splitSettings = createDefaultSplitSetting(),
  runCovariateSummary = TRUE
)
targetId |
The id of the target cohort that will be used for data extraction (e.g., the ATLAS id) |
outcomeId |
The id of the outcome that will be used for data extraction (e.g., the ATLAS id) |
restrictPlpDataSettings |
The settings specifying the extra restriction settings when extracting the data, created using createRestrictPlpDataSettings |
populationSettings |
The population settings specified by createStudyPopulationSettings |
covariateSettings |
The covariate settings; this can be a list or a single object of class covariateSettings created using FeatureExtraction (e.g., FeatureExtraction::createDefaultCovariateSettings()) |
featureEngineeringSettings |
Either NULL or an object of class featureEngineeringSettings |
sampleSettings |
Either NULL or an object of class sampleSettings |
preprocessSettings |
Either NULL or an object of class preprocessSettings |
modelSettings |
The model settings such as setLassoLogisticRegression() |
splitSettings |
The train/validation/test splitting used by all analyses, created using createDefaultSplitSetting |
runCovariateSummary |
Whether to run the covariateSummary |
This specifies a single analysis for developing a single model
A list with analysis settings used to develop a single prediction model
# L1 logistic regression model to predict the outcomeId 2 using the targetId 1
# with default population, restrictPlp, split, and covariate settings
createModelDesign(
  targetId = 1,
  outcomeId = 2,
  modelSettings = setLassoLogisticRegression(seed = 42),
  populationSettings = createStudyPopulationSettings(),
  restrictPlpDataSettings = createRestrictPlpDataSettings(),
  covariateSettings = FeatureExtraction::createDefaultCovariateSettings(),
  splitSettings = createDefaultSplitSetting(splitSeed = 42),
  runCovariateSummary = TRUE
)
Create the settings for normalizing the data
createNormalizer(type = "minmax", settings = list())
type |
The type of normalization to use, either "minmax" or "robust" |
settings |
A list of settings for the normalization. For robust normalization, the settings list can contain a boolean value for clip, which clips the values to be between -3 and 3 after normalization. See https://arxiv.org/abs/2407.04491 |
An object of class featureEngineeringSettings
# create a minmax normalizer that normalizes the data between 0 and 1
normalizer <- createNormalizer(type = "minmax")
# create a robust normalizer that normalizes the data by the interquartile range
# and squeezes the values to be between -3 and 3
normalizer <- createNormalizer(type = "robust", settings = list(clip = TRUE))
This function executes a large set of SQL statements to create tables that can store models and results
createPlpResultTables( connectionDetails, targetDialect = "postgresql", resultSchema, deleteTables = TRUE, createTables = TRUE, tablePrefix = "", tempEmulationSchema = getOption("sqlRenderTempEmulationSchema"), testFile = NULL )
connectionDetails |
The database connection details |
targetDialect |
The database management system being used |
resultSchema |
The name of the database schema in which the result tables will be created. |
deleteTables |
If TRUE, any existing tables matching the PatientLevelPrediction result table names will be deleted |
createTables |
If TRUE, the PatientLevelPrediction result tables will be created |
tablePrefix |
A string prefix added to the PatientLevelPrediction result table names |
tempEmulationSchema |
The temp schema used when the database management system is oracle |
testFile |
(used for testing) The location of an sql file with the table creation code |
This function can be used to create (or delete) PatientLevelPrediction result tables
Returns NULL but creates or deletes the required tables in the specified database schema(s).
# create a sqlite database with the PatientLevelPrediction result tables connectionDetails <- DatabaseConnector::createConnectionDetails( dbms = "sqlite", server = file.path(tempdir(), "test.sqlite")) createPlpResultTables(connectionDetails = connectionDetails, targetDialect = "sqlite", resultSchema = "main", tablePrefix = "plp_") # delete the tables createPlpResultTables(connectionDetails = connectionDetails, targetDialect = "sqlite", resultSchema = "main", deleteTables = TRUE, createTables = FALSE, tablePrefix = "plp_") # clean up the database file unlink(file.path(tempdir(), "test.sqlite"))
Create the settings for preprocessing the trainData.
createPreprocessSettings( minFraction = 0.001, normalize = TRUE, removeRedundancy = TRUE )
minFraction |
The minimum fraction of the target population that must have a covariate for it to be included in model training |
normalize |
Whether to normalize the covariates before training (Default: TRUE) |
removeRedundancy |
Whether to remove redundant features (Default: TRUE) Redundant features are features that within an analysisId together cover all observations. For example with ageGroups, if you have ageGroup 0-18 and 18-100 and all patients are in one of these groups, then one of these groups is redundant. |
Returns an object of class preprocessingSettings
that specifies how to
preprocess the training data
An object of class preprocessingSettings
# Create the settings for preprocessing, remove no features, normalise the data createPreprocessSettings(minFraction = 0.0, normalize = TRUE, removeRedundancy = FALSE)
Create the settings for random forest based feature selection
createRandomForestFeatureSelection(ntrees = 2000, maxDepth = 17)
ntrees |
The number of trees in the forest |
maxDepth |
Max depth of each tree |
Returns an object of class featureEngineeringSettings
that specifies the feature selection function that will be called and the settings
An object of class featureEngineeringSettings
## Not run: featureSelector <- createRandomForestFeatureSelection(ntrees = 2000, maxDepth = 10) ## End(Not run)
Create the settings for removing rare features
createRareFeatureRemover(threshold = 0.001)
threshold |
The minimum fraction of the training data that must have a feature for it to be included |
An object of class featureEngineeringSettings
# create a rare feature remover that removes features that are present in less # than 1% of the population rareFeatureRemover <- createRareFeatureRemover(threshold = 0.01) plpData <- getEunomiaPlpData() analysisId <- "rareFeatureRemover" saveLocation <- file.path(tempdir(), analysisId) results <- runPlp( plpData = plpData, featureEngineeringSettings = rareFeatureRemover, outcomeId = 3, executeSettings = createExecuteSettings( runModelDevelopment = TRUE, runSplitData = TRUE, runFeatureEngineering = TRUE), saveDirectory = saveLocation, analysisId = analysisId) # clean up unlink(saveLocation, recursive = TRUE)
This function creates the settings used to restrict the target cohort when calling getPlpData
createRestrictPlpDataSettings( studyStartDate = "", studyEndDate = "", firstExposureOnly = FALSE, washoutPeriod = 0, sampleSize = NULL )
studyStartDate |
A calendar date specifying the minimum date that a cohort index date can appear. Date format is 'yyyymmdd'. |
studyEndDate |
A calendar date specifying the maximum date that a cohort index date can appear. Date format is 'yyyymmdd'. Important: the study end date is also used to truncate risk windows, meaning no outcomes beyond the study end date will be considered. |
firstExposureOnly |
Should only the first exposure per subject be included? Note that
this is typically done in the |
washoutPeriod |
The minimum required continuous observation time prior to index
date for a person to be included in the at risk cohort. Note that
this is typically done in the |
sampleSize |
If not NULL, the number of people to sample from the target cohort |
Users need to specify the extra restrictions to apply when downloading the target cohort
A setting object of class restrictPlpDataSettings
containing a list of
the settings:
studyStartDate
: A calendar date specifying the minimum date that a cohort index date can appear
studyEndDate
: A calendar date specifying the maximum date that a cohort index date can appear
firstExposureOnly
: Should only the first exposure per subject be included
washoutPeriod
: The minimum required continuous observation time prior to index date for a person to be included in the at risk cohort
sampleSize
: If not NULL, the number of people to sample from the target cohort
# restrict to 2010, first exposure only, require washout period of 365 days # and sample 1000 people createRestrictPlpDataSettings(studyStartDate = "20100101", studyEndDate = "20101231", firstExposureOnly = TRUE, washoutPeriod = 365, sampleSize = 1000)
Create the settings for defining how the trainData from splitData are sampled using default sample functions.
createSampleSettings( type = "none", numberOutcomestoNonOutcomes = 1, sampleSeed = sample(10000, 1) )
type |
(character) Choice of:
|
numberOutcomestoNonOutcomes |
(numeric) The required ratio of outcomes to non-outcomes |
sampleSeed |
(numeric) A seed to use when sampling the data for reproducibility (if not set a random number will be generated) |
Returns an object of class sampleSettings
that specifies the sampling function that will be called and the settings
An object of class sampleSettings
# sample even rate of outcomes to non-outcomes sampleSetting <- createSampleSettings(type = "underSample", numberOutcomestoNonOutcomes = 1, sampleSeed = 42)
This function creates the settings for a simple imputer which imputes missing values with the mean or median
createSimpleImputer(method = "mean", missingThreshold = 0.3)
method |
The method to use for imputation, either "mean" or "median" |
missingThreshold |
The threshold for missing values to be imputed vs removed |
The settings for the single imputer of class featureEngineeringSettings
# create imputer to impute values with missingness less than 10% using the median # of observed values createSimpleImputer(method = "median", missingThreshold = 0.10)
Plug an existing scikit learn python model into the PLP framework
createSklearnModel( modelLocation = "/model", covariateMap = data.frame(columnId = 1:2, covariateId = c(1, 2)), covariateSettings, populationSettings, isPickle = TRUE )
modelLocation |
The location of the folder that contains the model as model.pkl |
covariateMap |
A data.frame with the columns: columnId and covariateId.
|
covariateSettings |
The settings for the standardized covariates |
populationSettings |
The settings for the population, this includes the time-at-risk settings and inclusion criteria. |
isPickle |
Set this to TRUE if the model is saved as a pickle; set it to FALSE if the model is saved as json. |
This function lets users add an existing scikit learn model that is saved as model.pkl into PLP format. covariateMap is a mapping between standard covariateIds and the model columns. The user also needs to specify the covariate settings and population settings as these are used to determine the standard PLP model design.
An object of class plpModel, this is a list that contains: model (the location of the model.pkl), preprocessing (settings for mapping the covariateIds to the model column names), modelDesign (specification of the model design), trainDetails (information about the model fitting) and covariateImportance.
You can use the output as an input in PatientLevelPrediction::predictPlp to apply the model and calculate the risk for patients.
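A minimal sketch (not run) of wrapping an existing scikit-learn model: the folder path, the covariateId values in the map, and the settings below are illustrative assumptions, not values from a real model.

```r
## Not run:
# Hypothetical sketch: wrap a scikit-learn model saved as model.pkl.
# The folder path and the covariateId values below are assumptions.
existingModel <- createSklearnModel(
  modelLocation = "/path/to/model/folder", # folder must contain model.pkl
  covariateMap = data.frame(
    columnId = 1:2,                 # column order expected by the python model
    covariateId = c(8532001, 1002)  # standard covariateIds these columns map to
  ),
  covariateSettings = FeatureExtraction::createDefaultCovariateSettings(),
  populationSettings = createStudyPopulationSettings(),
  isPickle = TRUE
)
# the wrapped model can then be applied to new data with predictPlp()
## End(Not run)
```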
Create the settings for adding a spline for continuous variables
createSplineSettings(continousCovariateId, knots, analysisId = 683)
continousCovariateId |
The covariateId to apply splines to |
knots |
Either the number of knots or a vector of split values |
analysisId |
The analysisId to use for the spline covariates |
Returns an object of class featureEngineeringSettings
that specifies the feature engineering function that will be called and the settings
An object of class featureEngineeringSettings
# create splines for age (1002) with 5 knots createSplineSettings(continousCovariateId = 1002, knots = 5, analysisId = 683)
Create the settings for using stratified imputation.
createStratifiedImputationSettings(covariateId, ageSplits = NULL)
covariateId |
The covariateId for which to impute missing values |
ageSplits |
A vector of age splits in years to create age groups |
Returns an object of class featureEngineeringSettings
that specifies
how to do stratified imputation. This function splits the covariate into
age groups and fits splines to the covariate within each age group. The spline
values are then used to impute missing values.
An object of class featureEngineeringSettings
# create a stratified imputation settings for covariate 1050 with age splits # at 50 and 70 stratifiedImputationSettings <- createStratifiedImputationSettings(covariateId = 1050, ageSplits = c(50, 70))
Create a study population
createStudyPopulation( plpData, outcomeId = plpData$metaData$databaseDetails$outcomeIds[1], populationSettings = createStudyPopulationSettings(), population = NULL )
plpData |
An object of type |
outcomeId |
The ID of the outcome. |
populationSettings |
An object of class populationSettings created using |
population |
If specified, this population will be used as the starting point instead of the
cohorts in the |
Create a study population by enforcing certain inclusion and exclusion criteria, defining a risk window, and determining which outcomes fall inside the risk window.
A data frame specifying the study population. This data frame will have the following columns:
A unique identifier for an exposure
The person ID of the subject
The index date
The number of outcomes observed during the risk window
The number of days in the risk window
The number of days until either the outcome or the end of the risk window
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 100) # Create study population, require time at risk of 30 days. The risk window is 1 to 90 days. populationSettings <- createStudyPopulationSettings(requireTimeAtRisk = TRUE, minTimeAtRisk = 30, riskWindowStart = 1, riskWindowEnd = 90) population <- createStudyPopulation(plpData, outcomeId = 3, populationSettings)
create the study population settings
createStudyPopulationSettings( binary = TRUE, includeAllOutcomes = TRUE, firstExposureOnly = FALSE, washoutPeriod = 0, removeSubjectsWithPriorOutcome = TRUE, priorOutcomeLookback = 99999, requireTimeAtRisk = TRUE, minTimeAtRisk = 364, riskWindowStart = 1, startAnchor = "cohort start", riskWindowEnd = 365, endAnchor = "cohort start", restrictTarToCohortEnd = FALSE )
binary |
Forces the outcomeCount to be 0 or 1 (use for binary prediction problems) |
includeAllOutcomes |
(binary) indicating whether to include people with outcomes who are not observed for the whole at risk period |
firstExposureOnly |
Should only the first exposure per subject be included? Note that
this is typically done in the |
washoutPeriod |
The minimum required continuous observation time prior to index date for a person to be included in the cohort. |
removeSubjectsWithPriorOutcome |
Remove subjects that have the outcome prior to the risk window start? |
priorOutcomeLookback |
How many days should we look back when identifying prior outcomes? |
requireTimeAtRisk |
Should subjects without time at risk be removed? |
minTimeAtRisk |
The minimum number of days at risk required to be included |
riskWindowStart |
The start of the risk window (in days) relative to the index date (+
days of exposure if the |
startAnchor |
The anchor point for the start of the risk window. Can be "cohort start" or "cohort end". |
riskWindowEnd |
The end of the risk window (in days) relative to the index date (+
days of exposure if the |
endAnchor |
The anchor point for the end of the risk window. Can be "cohort start" or "cohort end". |
restrictTarToCohortEnd |
If using a survival model and you want the time-at-risk to end at the cohort end date set this to TRUE |
An object of type populationSettings containing all the settings required for creating the study population
# Create study population settings with a washout period of 30 days and a # risk window of 1 to 90 days populationSettings <- createStudyPopulationSettings(washoutPeriod = 30, riskWindowStart = 1, riskWindowEnd = 90)
Create a temporary model location
createTempModelLoc()
A string for the location of the temporary model location
modelLoc <- createTempModelLoc() dir.exists(modelLoc) # clean up unlink(modelLoc, recursive = TRUE)
Create the settings for defining any feature selection that will be done
createUnivariateFeatureSelection(k = 100)
k |
The number of features to select; the K features most associated (univariately) with the outcome are kept |
Returns an object of class featureEngineeringSettings
that specifies
the function that will be called and the settings. Uses the scikit-learn
SelectKBest function with chi2 for univariate feature selection.
An object of class featureEngineeringSettings
## Not run: # create a feature selection that selects the 100 most associated features featureSelector <- createUnivariateFeatureSelection(k = 100) ## End(Not run)
createValidationDesign - Define the validation design for external validation
createValidationDesign( targetId, outcomeId, populationSettings = NULL, restrictPlpDataSettings = NULL, plpModelList, recalibrate = NULL, runCovariateSummary = TRUE )
targetId |
The targetId of the target cohort to validate on |
outcomeId |
The outcomeId of the outcome cohort to validate on |
populationSettings |
A list of population restriction settings created
by |
restrictPlpDataSettings |
A list of plpData restriction settings
created by |
plpModelList |
A list of plpModels objects created by |
recalibrate |
A vector of characters specifying the recalibration method to apply, |
runCovariateSummary |
whether to run the covariate summary for the validation data |
A validation design object of class validationDesign
or a list of such objects
# create a validation design for targetId 1 and outcomeId 2 one l1 model and # one gradient boosting model createValidationDesign(1, 2, plpModelList = list( "pathToL1model", "PathToGBMModel"))
This function creates the settings required by externalValidatePlp
createValidationSettings(recalibrate = NULL, runCovariateSummary = TRUE)
recalibrate |
A vector of characters specifying the recalibration method to apply |
runCovariateSummary |
Whether to run the covariate summary for the validation data |
Users need to specify whether they want to sample or recalibrate when performing external validation
A setting object of class validationSettings
containing a list of settings for externalValidatePlp
# do weak recalibration and don't run covariate summary createValidationSettings(recalibrate = "weakRecalibration", runCovariateSummary = FALSE)
Run a list of predictions diagnoses
diagnoseMultiplePlp( databaseDetails = createDatabaseDetails(), modelDesignList = list(createModelDesign(targetId = 1, outcomeId = 2, modelSettings = setLassoLogisticRegression()), createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression())), cohortDefinitions = NULL, logSettings = createLogSettings(verbosity = "DEBUG", timeStamp = TRUE, logName = "diagnosePlp Log"), saveDirectory = NULL )
databaseDetails |
The database settings created using |
modelDesignList |
A list of model designs created using |
cohortDefinitions |
A list of cohort definitions for the target and outcome cohorts |
logSettings |
The settings specifying the logging for the analyses created using |
saveDirectory |
Name of the folder where all the outputs will be written. |
This function will run all specified prediction design diagnoses.
A data frame with the following columns:
analysisId |
The unique identifier for a set of analysis choices. |
targetId |
The ID of the target cohort populations. |
outcomeId |
The ID of the outcome cohort. |
dataLocation |
The location where the plpData was saved |
the settings ids |
The ids for all other settings used for model development. |
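A minimal sketch (not run) of calling diagnoseMultiplePlp, assuming a databaseDetails object has already been created with createDatabaseDetails() and that target/outcome cohort ids 1, 2 and 3 exist in that database; all ids here are illustrative assumptions.

```r
## Not run:
# Hypothetical sketch: diagnose two lasso logistic regression designs
# against one database; the ids and databaseDetails are assumptions.
modelDesignList <- list(
  createModelDesign(targetId = 1, outcomeId = 2,
                    modelSettings = setLassoLogisticRegression()),
  createModelDesign(targetId = 1, outcomeId = 3,
                    modelSettings = setLassoLogisticRegression())
)
diagnoseMultiplePlp(
  databaseDetails = databaseDetails, # from createDatabaseDetails()
  modelDesignList = modelDesignList,
  saveDirectory = file.path(tempdir(), "diagnoseMultiplePlp")
)
## End(Not run)
```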
This function runs a set of prediction diagnoses to help pick a suitable T, O, TAR and determine whether the prediction problem is worth executing.
diagnosePlp( plpData = NULL, outcomeId, analysisId, populationSettings, splitSettings = createDefaultSplitSetting(), sampleSettings = createSampleSettings(), saveDirectory = NULL, featureEngineeringSettings = createFeatureEngineeringSettings(), modelSettings = setLassoLogisticRegression(), logSettings = createLogSettings(verbosity = "DEBUG", timeStamp = TRUE, logName = "diagnosePlp Log"), preprocessSettings = createPreprocessSettings() )
plpData |
An object of type |
outcomeId |
(integer) The ID of the outcome. |
analysisId |
(integer) Identifier for the analysis. It is used to create, e.g., the result folder. Default is a timestamp. |
populationSettings |
An object of type |
splitSettings |
An object of type |
sampleSettings |
An object of type |
saveDirectory |
The path to the directory where the results will be saved (if NULL uses working directory) |
featureEngineeringSettings |
An object of |
modelSettings |
An object of class
|
logSettings |
An object of |
preprocessSettings |
An object of |
Users can define a set of Ts, Os, databases and population settings. A list of data.frames containing details such as follow-up time distribution, time-to-event information, characterization details, time from last prior event, and observation time distribution.
An object containing the model or location where the model is saved, the data selection settings, the preprocessing and training settings as well as various performance measures obtained by the model.
distribution
: List for each O of a data.frame containing: i) Time to observation end distribution, ii) Time from observation start distribution, iii) Time to event distribution and iv) Time from last prior event to index distribution (only for patients in T who have O before index)
incident
: List for each O of incidence of O in T during TAR
characterization
: List for each O of Characterization of T, TnO, Tn~O
# load the data plpData <- getEunomiaPlpData() populationSettings <- createStudyPopulationSettings(minTimeAtRisk = 1) saveDirectory <- file.path(tempdir(), "diagnosePlp") diagnosis <- diagnosePlp(plpData = plpData, outcomeId = 3, analysisId = 1, populationSettings = populationSettings, saveDirectory = saveDirectory) # clean up unlink(saveDirectory, recursive = TRUE)
Evaluates the performance of the patient level prediction model
evaluatePlp(prediction, typeColumn = "evaluationType")
prediction |
The patient level prediction model's prediction |
typeColumn |
The column name in the prediction object that is used to stratify the evaluation |
The function calculates various metrics to measure the performance of the model
An object of class plpEvaluation containing the following components
evaluationStatistics: A data frame containing the evaluation statistics
thresholdSummary: A data frame containing the threshold summary
demographicSummary: A data frame containing the demographic summary
calibrationSummary: A data frame containing the calibration summary
predictionDistribution: A data frame containing the prediction distribution
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n= 1500) population <- createStudyPopulation(plpData, outcomeId = 3, populationSettings = createStudyPopulationSettings()) data <- splitData(plpData, population, splitSettings=createDefaultSplitSetting(splitSeed=42)) data$Train$covariateData <- preprocessData(data$Train$covariateData, createPreprocessSettings()) path <- file.path(tempdir(), "plp") model <- fitPlp(data$Train, modelSettings=setLassoLogisticRegression(seed=42), analysisId=1, analysisPath = path) evaluatePlp(model$prediction) # Train and CV metrics
This function extracts data using a user-specified connection and cdm_schema, applies the model, and then calculates the performance
externalValidateDbPlp( plpModel, validationDatabaseDetails = createDatabaseDetails(), validationRestrictPlpDataSettings = createRestrictPlpDataSettings(), settings = createValidationSettings(recalibrate = "weakRecalibration"), logSettings = createLogSettings(verbosity = "INFO", logName = "validatePLP"), outputFolder = NULL )
plpModel |
The model object returned by runPlp() containing the trained model |
validationDatabaseDetails |
A list of objects of class |
validationRestrictPlpDataSettings |
A list of population restriction settings created by |
settings |
A settings object of class |
logSettings |
An object of |
outputFolder |
The directory to save the validation results to (subfolders are created per database in validationDatabaseDetails) |
Users need to input a trained model (the output of runPlp()) and new database connections. The function will return a list of length equal to the number of cdm_schemas input with the performance on the new data
An externalValidatePlp object containing the following components
model: The model object
executionSummary: A list of execution details
prediction: A dataframe containing the predictions
performanceEvaluation: A dataframe containing the performance metrics
covariateSummary: A dataframe containing the covariate summary
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) # first fit a model on some data, default is a L1 logistic regression saveLoc <- file.path(tempdir(), "development") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc, populationSettings = createStudyPopulationSettings(requireTimeAtRisk=FALSE) ) connectionDetails <- Eunomia::getEunomiaConnectionDetails() Eunomia::createCohorts(connectionDetails) # now validate the model on Eunomia validationDatabaseDetails <- createDatabaseDetails( connectionDetails = connectionDetails, cdmDatabaseSchema = "main", cdmDatabaseName = "main", cohortDatabaseSchema = "main", cohortTable = "cohort", outcomeDatabaseSchema = "main", outcomeTable = "cohort", targetId = 1, # users of celecoxib outcomeIds = 3, # GIbleed cdmVersion = 5) path <- file.path(tempdir(), "validation") externalValidateDbPlp(results$model, validationDatabaseDetails, outputFolder = path) # clean up unlink(saveLoc, recursive = TRUE) unlink(path, recursive = TRUE)
Exports all the results from a database into csv files
extractDatabaseToCsv(
  conn = NULL,
  connectionDetails,
  databaseSchemaSettings = createDatabaseSchemaSettings(resultSchema = "main"),
  csvFolder,
  minCellCount = 5,
  sensitiveColumns = getPlpSensitiveColumns(),
  fileAppend = NULL
)
conn |
The connection to the database with the results |
connectionDetails |
The connectionDetails for the result database |
databaseSchemaSettings |
The result database schema settings |
csvFolder |
Location to save the csv files |
minCellCount |
The min value to show in cells that are sensitive (values less than this value will be replaced with -1) |
sensitiveColumns |
A named list (named by the table the columns belong to), where each element is a list of columns to apply the minCellCount to. |
fileAppend |
If set to a string, it will be prepended to the csv file names |
Extracts the results from a database into a set of csv files
The directory path where the results were saved
# develop a simple model on simulated data
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 500)
saveLoc <- file.path(tempdir(), "extractDatabaseToCsv")
results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc)
# now upload the results to a sqlite database
databasePath <- insertResultsToSqlite(saveLoc)
# now extract the results to csv
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "sqlite",
  server = databasePath
)
extractDatabaseToCsv(
  connectionDetails = connectionDetails,
  csvFolder = file.path(saveLoc, "csv")
)
# show csv files
list.files(file.path(saveLoc, "csv"))
# clean up
unlink(saveLoc, recursive = TRUE)
Train various models using a default parameter grid search or user specified parameters
fitPlp(trainData, modelSettings, search = "grid", analysisId, analysisPath)
trainData |
An object of type |
modelSettings |
An object of class |
search |
The search strategy for the hyper-parameter selection (currently not used) |
analysisId |
The id of the analysis |
analysisPath |
The path of the analysis |
The user can define the machine learning model to train
An object of class plpModel containing:
model |
The trained prediction model |
preprocessing |
The preprocessing required when applying the model |
prediction |
The cohort data.frame with the predicted risk column added |
modelDesign |
A list specifying the modelDesign settings used to fit the model |
trainDetails |
The model meta data |
covariateImportance |
The covariate importance for the model |
# simulate data
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 1000)
# create study population, split into train/test and preprocess with default settings
population <- createStudyPopulation(plpData, outcomeId = 3)
data <- splitData(plpData, population, createDefaultSplitSetting())
data$Train$covariateData <- preprocessData(data$Train$covariateData)
saveLoc <- file.path(tempdir(), "fitPlp")
# fit a lasso logistic regression model using the training data
plpModel <- fitPlp(data$Train,
  modelSettings = setLassoLogisticRegression(seed = 42),
  analysisId = 1,
  analysisPath = saveLoc
)
# show evaluationSummary for model
evaluatePlp(plpModel$prediction)$evaluationSummary
# clean up
unlink(saveLoc, recursive = TRUE)
Get a sparse summary of the calibration
getCalibrationSummary(
  prediction,
  predictionType,
  typeColumn = "evaluation",
  numberOfStrata = 10,
  truncateFraction = 0.05
)
prediction |
A prediction object as generated using the
|
predictionType |
The type of prediction (binary or survival) |
typeColumn |
A column that is used to stratify the results |
numberOfStrata |
The number of strata in the plot. |
truncateFraction |
This fraction of probability values will be ignored when plotting, to avoid the x-axis scale being dominated by a few outliers. |
Generates a sparse summary showing the predicted probabilities and the observed fractions. Predictions are stratified into equally sized bins of predicted probabilities.
A dataframe with the calibration summary
# simulate data
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 500)
# create study population, split into train/test and preprocess with default settings
population <- createStudyPopulation(plpData, outcomeId = 3)
data <- splitData(plpData, population, createDefaultSplitSetting())
data$Train$covariateData <- preprocessData(data$Train$covariateData)
saveLoc <- file.path(tempdir(), "calibrationSummary")
# fit a lasso logistic regression model using the training data
plpModel <- fitPlp(data$Train,
  modelSettings = setLassoLogisticRegression(seed = 42),
  analysisId = 1,
  analysisPath = saveLoc
)
calibrationSummary <- getCalibrationSummary(plpModel$prediction,
  "binary",
  numberOfStrata = 10,
  typeColumn = "evaluationType"
)
calibrationSummary
# clean up
unlink(saveLoc, recursive = TRUE)
Extracts covariates based on cohorts
getCohortCovariateData(
  connection,
  tempEmulationSchema = NULL,
  oracleTempSchema = NULL,
  cdmDatabaseSchema,
  cdmVersion = "5",
  cohortTable = "#cohort_person",
  rowIdField = "row_id",
  aggregated,
  cohortIds,
  covariateSettings,
  ...
)
connection |
The database connection |
tempEmulationSchema |
The schema to use for temp tables |
oracleTempSchema |
DEPRECATED The temp schema if using oracle |
cdmDatabaseSchema |
The schema of the OMOP CDM data |
cdmVersion |
version of the OMOP CDM data |
cohortTable |
the table name that contains the target population cohort |
rowIdField |
string representing the unique identifier in the target population cohort |
aggregated |
whether the covariate should be aggregated |
cohortIds |
cohort id for the target cohort |
covariateSettings |
settings for the covariate cohorts and time periods |
... |
additional arguments from FeatureExtraction |
The user specifies a cohort and a time period; a covariate is then constructed indicating whether each person is in that cohort during the time period relative to the target population cohort index
CovariateData object with covariates, covariateRef, and analysisRef tables
library(DatabaseConnector)
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
# create some cohort of people born in 1969, index date is their date of birth
con <- connect(connectionDetails)
executeSql(con, "INSERT INTO main.cohort
  SELECT 1969 as COHORT_DEFINITION_ID,
         PERSON_ID as SUBJECT_ID,
         BIRTH_DATETIME as COHORT_START_DATE,
         BIRTH_DATETIME as COHORT_END_DATE
  FROM main.person WHERE YEAR_OF_BIRTH = 1969")
covariateData <- getCohortCovariateData(
  connection = con,
  cdmDatabaseSchema = "main",
  aggregated = FALSE,
  rowIdField = "SUBJECT_ID",
  cohortTable = "cohort",
  covariateSettings = createCohortCovariateSettings(
    cohortName = "summerOfLove",
    cohortId = 1969,
    settingId = 1,
    cohortDatabaseSchema = "main",
    cohortTable = "cohort"
  )
)
covariateData$covariateRef
covariateData$covariates
Get a demographic summary
getDemographicSummary(prediction, predictionType, typeColumn = "evaluation")
prediction |
A prediction object |
predictionType |
The type of prediction (binary or survival) |
typeColumn |
A column that is used to stratify the results |
Generates a data.frame with a prediction summary for each 5-year age group and gender group
A dataframe with the demographic summary
# simulate data
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 500)
# create study population, split into train/test and preprocess with default settings
population <- createStudyPopulation(plpData, outcomeId = 3)
data <- splitData(plpData, population, createDefaultSplitSetting())
data$Train$covariateData <- preprocessData(data$Train$covariateData)
saveLoc <- file.path(tempdir(), "demographicSummary")
# fit a lasso logistic regression model using the training data
plpModel <- fitPlp(data$Train,
  modelSettings = setLassoLogisticRegression(seed = 42),
  analysisId = 1,
  analysisPath = saveLoc
)
demographicSummary <- getDemographicSummary(plpModel$prediction,
  "binary",
  typeColumn = "evaluationType"
)
# show the demographic summary dataframe
str(demographicSummary)
# clean up
unlink(saveLoc, recursive = TRUE)
This function creates a plpData object from the Eunomia database. It gets the connection details, creates the cohorts, and extracts the data. The cohort is predicting GIbleed in new users of celecoxib.
getEunomiaPlpData(covariateSettings = NULL)
covariateSettings |
A list of covariateSettings objects created using the
|
An object of type plpData, containing information on the cohorts, their outcomes, and baseline covariates. Information about multiple outcomes can be captured at once for efficiency reasons. This object is a list with the following components:
A data frame listing the outcomes per person, including the time to event, and the outcome id
A data frame listing the persons in each cohort, listing their exposure status as well as the time to the end of the observation period and time to the end of the cohort
An Andromeda object created with the FeatureExtraction package. This object contains the following items:
An Andromeda table listing the covariates per person in the two cohorts. This is done using a sparse representation: covariates with a value of 0 are omitted to save space. Usually has three columns: rowId, covariateId, and covariateValue.
An Andromeda table describing the covariates that have been extracted.
An Andromeda table with information about which analysisIds from 'FeatureExtraction' were used.
covariateSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsAge = TRUE,
  useDemographicsGender = TRUE,
  useConditionOccurrenceAnyTimePrior = TRUE
)
plpData <- getEunomiaPlpData(covariateSettings = covariateSettings)
This function executes a large set of SQL statements against the database in OMOP CDM format to extract the data needed to perform the analysis.
getPlpData(databaseDetails, covariateSettings, restrictPlpDataSettings = NULL)
databaseDetails |
The cdm database details created using |
covariateSettings |
An object of type |
restrictPlpDataSettings |
Extra settings to apply to the target population while extracting data.
Created using |
Based on the arguments, the at risk cohort data is retrieved, as well as outcomes
occurring in these subjects. The at risk cohort is identified through
user-defined cohorts in a cohort table either inside the CDM instance or in a separate schema.
Similarly, outcomes are identified
through user-defined cohorts in a cohort table either inside the CDM instance or in a separate
schema. Covariates are automatically extracted from the appropriate tables within the CDM.
If you wish to exclude concepts from covariates you will need to manually add the concept_ids and their descendants to the excludedCovariateConceptIds of the covariateSettings argument.
An object of type plpData, containing information on the cohorts, their outcomes, and baseline covariates.
# use Eunomia database
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
Eunomia::createCohorts(connectionDetails)
outcomeId <- 3 # GIbleed
databaseDetails <- createDatabaseDetails(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  cdmDatabaseName = "main",
  cohortDatabaseSchema = "main",
  cohortTable = "cohort",
  outcomeDatabaseSchema = "main",
  outcomeTable = "cohort",
  targetId = 1,
  outcomeIds = outcomeId,
  cdmVersion = 5
)
covariateSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsAge = TRUE,
  useDemographicsGender = TRUE,
  useConditionOccurrenceAnyTimePrior = TRUE
)
plpData <- getPlpData(
  databaseDetails = databaseDetails,
  covariateSettings = covariateSettings,
  restrictPlpDataSettings = createRestrictPlpDataSettings()
)
Calculates the prediction distribution
getPredictionDistribution(
  prediction,
  predictionType = "binary",
  typeColumn = "evaluation"
)
prediction |
A prediction object |
predictionType |
The type of prediction (binary or survival) |
typeColumn |
A column that is used to stratify the results |
Calculates the quantiles from a prediction object
The 0.00, 0.10, 0.25, 0.50, 0.75, 0.90, and 1.00 quantiles of the prediction, plus the mean and standard deviation per class
prediction <- data.frame(
  rowId = 1:100,
  outcomeCount = stats::rbinom(100, 1, prob = 0.5),
  value = runif(100),
  evaluation = rep("Train", 100)
)
getPredictionDistribution(prediction)
Calculate all measures for sparse ROC
getThresholdSummary(
  prediction,
  predictionType = "binary",
  typeColumn = "evaluation"
)
prediction |
A prediction object |
predictionType |
The type of prediction (binary or survival) |
typeColumn |
A column that is used to stratify the results |
Calculates the TP, FP, TN, FN, TPR, FPR, accuracy, PPV, FOR and F-measure from a prediction object
A data.frame with TP, FP, TN, FN, TPR, FPR, accuracy, PPV, FOR and F-measure
prediction <- data.frame(
  rowId = 1:100,
  outcomeCount = stats::rbinom(100, 1, prob = 0.5),
  value = runif(100),
  evaluation = rep("Train", 100)
)
summary <- getThresholdSummary(prediction)
str(summary)
Calculate the Integrated Calibration Index from Austin and Steyerberg https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8281
ici(prediction)
prediction |
the prediction object found in the plpResult object |
Calculate the Integrated Calibration Index
Integrated Calibration Index value or NULL if the calculation fails
prediction <- data.frame(
  rowId = 1:100,
  outcomeCount = stats::rbinom(100, 1, prob = 0.5),
  value = runif(100),
  evaluation = rep("Train", 100)
)
ici(prediction)
This function converts a folder with csv results into plp objects and loads them into a plp result database
insertCsvToDatabase(
  csvFolder,
  connectionDetails,
  databaseSchemaSettings,
  modelSaveLocation,
  csvTableAppend = ""
)
csvFolder |
The location to the csv folder with the plp results |
connectionDetails |
A connection details for the plp results database that the csv results will be inserted into |
databaseSchemaSettings |
A object created by |
modelSaveLocation |
The location to save any models from the csv folder - this should be the same location you picked when inserting other models into the database |
csvTableAppend |
A string that appends the csv file names |
The user needs to have plp csv results in a single folder and an existing plp result database
Returns a data.frame indicating whether the results were imported into the database
# develop a simple model on simulated data
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 1000)
saveLoc <- file.path(tempdir(), "extractDatabaseToCsv")
results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc)
# now upload the results to a sqlite database
databasePath <- insertResultsToSqlite(saveLoc)
# now extract the results to csv
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "sqlite",
  server = databasePath
)
extractDatabaseToCsv(
  connectionDetails = connectionDetails,
  csvFolder = file.path(saveLoc, "csv")
)
# show csv files
list.files(file.path(saveLoc, "csv"))
# now insert the csv results into a database
newDatabasePath <- file.path(tempdir(), "newDatabase.sqlite")
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "sqlite",
  server = newDatabasePath
)
insertCsvToDatabase(
  csvFolder = file.path(saveLoc, "csv"),
  connectionDetails = connectionDetails,
  databaseSchemaSettings = createDatabaseSchemaSettings(),
  modelSaveLocation = file.path(saveLoc, "models")
)
# clean up
unlink(saveLoc, recursive = TRUE)
This function creates an sqlite database with the PLP result schema and inserts all results
insertResultsToSqlite(
  resultLocation,
  cohortDefinitions = NULL,
  databaseList = NULL,
  sqliteLocation = file.path(resultLocation, "sqlite")
)
resultLocation |
(string) location of directory where the main package results were saved |
cohortDefinitions |
A set of one or more cohorts extracted using ROhdsiWebApi::exportCohortDefinitionSet() |
databaseList |
A list created by |
sqliteLocation |
(string) location of directory where the sqlite database will be saved |
This function can be used to upload PatientLevelPrediction results into an sqlite database
Returns the location of the sqlite database file
plpData <- getEunomiaPlpData()
saveLoc <- file.path(tempdir(), "insertResultsToSqlite")
results <- runPlp(plpData, outcomeId = 3, analysisId = 1, saveDirectory = saveLoc)
databaseFile <- insertResultsToSqlite(saveLoc,
  cohortDefinitions = NULL,
  sqliteLocation = file.path(saveLoc, "sqlite")
)
# check there is some data in the database
library(DatabaseConnector)
connectionDetails <- createConnectionDetails(
  dbms = "sqlite",
  server = databaseFile
)
conn <- connect(connectionDetails)
# all tables should be created
getTableNames(conn, databaseSchema = "main")
# there is data in the tables
querySql(conn, "SELECT * FROM main.model_designs limit 10")
# clean up
unlink(saveLoc, recursive = TRUE)
join two lists
listAppend(a, b)
a |
A list |
b |
Another list |
This function joins two lists
the joined list
a <- list(a = 1, b = 2)
b <- list(c = 3, d = 4)
listAppend(a, b)
Computes the Cartesian product of all the combinations of elements in a list
listCartesian(allList)
allList |
a list of lists |
A list with all possible combinations from the input list of lists
listCartesian(list(list(1, 2), list(3, 4)))
Load the multiple prediction json settings from a file
loadPlpAnalysesJson(jsonFileLocation)
jsonFileLocation |
The location of the file 'predictionAnalysisList.json' with the modelDesignList |
This function interprets a json with the multiple prediction settings and creates a list that can be combined with connection settings to run a multiple prediction study
A list with the modelDesignList and cohortDefinitions
modelDesign <- createModelDesign(
  targetId = 1,
  outcomeId = 2,
  modelSettings = setLassoLogisticRegression()
)
saveLoc <- file.path(tempdir(), "loadPlpAnalysesJson")
savePlpAnalysesJson(modelDesignList = modelDesign, saveDirectory = saveLoc)
loadPlpAnalysesJson(file.path(saveLoc, "predictionAnalysisList.json"))
# clean up
unlink(saveLoc, recursive = TRUE)
loadPlpData
Loads an object of type plpData from a folder in the file system.
loadPlpData(file, readOnly = TRUE)
file |
The name of the folder containing the data. |
readOnly |
If true, the data is opened read only. |
The data is read from a set of files in the folder specified by the user.
An object of class plpData.
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 500)
saveLoc <- file.path(tempdir(), "loadPlpData")
savePlpData(plpData, saveLoc)
dir(saveLoc)
loadedPlpData <- loadPlpData(saveLoc)
# clean up
unlink(saveLoc, recursive = TRUE)
loads the plp model
loadPlpModel(dirPath)
dirPath |
The location of the model |
Loads a plp model that was saved using savePlpModel()
The plpModel object
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 1000)
saveLoc <- file.path(tempdir(), "loadPlpModel")
plpResult <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc)
savePlpModel(plpResult$model, file.path(saveLoc, "savedModel"))
loadedModel <- loadPlpModel(file.path(saveLoc, "savedModel"))
# show design of loaded model
str(loadedModel$modelDesign)
# clean up
unlink(saveLoc, recursive = TRUE)
Loads the evaluation dataframe
loadPlpResult(dirPath)
dirPath |
The directory where the evaluation was saved |
Loads the evaluation
The runPlp object
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 1000)
saveLoc <- file.path(tempdir(), "loadPlpResult")
results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc)
savePlpResult(results, saveLoc)
loadedResults <- loadPlpResult(saveLoc)
# clean up
unlink(saveLoc, recursive = TRUE)
Loads the prediction dataframe from a json file
loadPrediction(fileLocation)
fileLocation |
The location with the saved prediction |
Loads the prediction json file
The prediction data.frame
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 1000)
saveLoc <- file.path(tempdir(), "loadPrediction")
results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc)
savePrediction(results$prediction, saveLoc)
dir(saveLoc)
loadedPrediction <- loadPrediction(file.path(saveLoc, "prediction.json"))
This function takes covariate data and a cohort/population, remaps the covariate and row ids, restricts the data to the population, and creates (or applies) a mapping
MapIds(covariateData, cohort = NULL, mapping = NULL)
covariateData |
a covariateData object |
cohort |
if specified rowIds restricted to the ones in cohort |
mapping |
A pre defined mapping to use |
A new covariateData object with remapped covariate and row ids
covariateData <- Andromeda::andromeda(
  covariates = data.frame(
    rowId = c(1, 3, 5, 7, 9),
    covariateId = c(10, 20, 10, 10, 20),
    covariateValue = c(1, 1, 1, 1, 1)
  ),
  covariateRef = data.frame(
    covariateId = c(10, 20),
    covariateNames = c("covariateA", "covariateB"),
    analysisId = c(1, 1)
  )
)
mappedData <- MapIds(covariateData)
# columnId and rowId now start from 1 and are consecutive
mappedData$covariates
Migrate data from current state to next state
It is strongly advised that you back up all data before migrating: the sqlite files, the result database (if you are using a postgres backend), or the csv/zip files kept from your data generation.
migrateDataModel(connectionDetails, databaseSchema, tablePrefix = "")
connectionDetails |
DatabaseConnector connection details object |
databaseSchema |
String schema where database schema lives |
tablePrefix |
(Optional) Use if a table prefix is used before table names (e.g. "cd_") |
Nothing. Called for its side effect of migrating the data model in the database
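As a sketch, a migration call might look like the following for a local sqlite results database (the file path, schema name, and table prefix here are illustrative assumptions, not package defaults):

```r
# connection details for a local sqlite results database (illustrative path)
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "sqlite",
  server = file.path(tempdir(), "plpResults.sqlite")
)
# migrate the data model in the "main" schema; tables assumed prefixed with "plp_"
migrateDataModel(
  connectionDetails = connectionDetails,
  databaseSchema = "main",
  tablePrefix = "plp_"
)
```

Back up the sqlite file before running this, as advised above.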
Calculate the model-based concordance, which is the expected discrimination performance of a model under the assumption that the model predicts the true outcome, as detailed in van Klaveren et al. https://pubmed.ncbi.nlm.nih.gov/27251001/
modelBasedConcordance(prediction)
prediction |
the prediction object found in the plpResult object |
Calculate the model-based concordance
The model-based concordance value
prediction <- data.frame(value = runif(100)) modelBasedConcordance(prediction)
Plot the outcome incidence over time
outcomeSurvivalPlot( plpData, outcomeId, populationSettings = createStudyPopulationSettings(binary = TRUE, includeAllOutcomes = TRUE, firstExposureOnly = FALSE, washoutPeriod = 0, removeSubjectsWithPriorOutcome = TRUE, priorOutcomeLookback = 99999, requireTimeAtRisk = FALSE, riskWindowStart = 1, startAnchor = "cohort start", riskWindowEnd = 3650, endAnchor = "cohort start"), riskTable = TRUE, confInt = TRUE, yLabel = "Fraction of those who are outcome free in target population" )
plpData |
The plpData object returned by running getPlpData() |
outcomeId |
The cohort id corresponding to the outcome |
populationSettings |
The population settings created using createStudyPopulationSettings() |
riskTable |
(binary) Whether to include a table at the bottom of the plot showing the number of people at risk over time |
confInt |
(binary) Whether to include a confidence interval |
yLabel |
(string) The label for the y-axis |
This creates a survival plot that can be used to pick a suitable time-at-risk period
A ggsurvplot
object
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) plotObject <- outcomeSurvivalPlot(plpData, outcomeId = 3) print(plotObject)
Calculate the permutation feature importance (pfi) for a PLP model.
pfi( plpResult, population, plpData, repeats = 1, covariates = NULL, cores = NULL, log = NULL, logthreshold = "INFO" )
plpResult |
The result of running runPlp() |
population |
The population created using createStudyPopulation() who will have their risks predicted |
plpData |
An object of type plpData |
repeats |
The number of times to permute each covariate |
covariates |
A vector of covariates to calculate the pfi for. If NULL it uses all covariates included in the model. |
cores |
Number of cores to use when running this (it runs in parallel) |
log |
A location to save the log for running pfi |
logthreshold |
The log threshold (e.g., INFO, TRACE, ...) |
The function permutes each covariate/feature repeats times and
calculates the mean AUC change caused by the permutation.
A dataframe with the covariateIds and the pfi (change in AUC caused by permuting the covariate) value
library(dplyr) # simulate some data data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) # now fit a model saveLoc <- file.path(tempdir(), "pfi") plpResult <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) population <- createStudyPopulation(plpData, outcomeId = 3) pfi(plpResult, population, plpData, repeats = 1, cores = 1) # compare to model coefficients plpResult$model$covariateImportance %>% dplyr::filter(.data$covariateValue != 0) # clean up unlink(saveLoc, recursive = TRUE)
Plot the observed vs. expected incidence, by age and gender
plotDemographicSummary( plpResult, typeColumn = "evaluation", saveLocation = NULL, fileName = "roc.png" )
plpResult |
A plp result object as generated using the runPlp() function |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the observed vs. expected incidence, by age and gender
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotDemographicSummary") plpResult <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotDemographicSummary(plpResult) # clean up unlink(saveLoc, recursive = TRUE)
Plot the F1 measure efficiency frontier using the sparse thresholdSummary data frame
plotF1Measure( plpResult, typeColumn = "evaluation", saveLocation = NULL, fileName = "roc.png" )
plpResult |
A plp result object as generated using the runPlp() function |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the F1 measure efficiency frontier using the sparse thresholdSummary data frame
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotF1Measure") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotF1Measure(results) # clean up unlink(saveLoc, recursive = TRUE)
Plot the train/test generalizability diagnostic
plotGeneralizability( covariateSummary, saveLocation = NULL, fileName = "Generalizability.png" )
covariateSummary |
A covariateSummary object as generated using the
covariateSummary() function |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the train/test generalizability diagnostic
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) population <- createStudyPopulation(plpData, outcomeId = 3) data <- splitData(plpData, population = population) strata <- data.frame( rowId = c(data$Train$labels$rowId, data$Test$labels$rowId), strataName = c(rep("Train", nrow(data$Train$labels)), rep("Test", nrow(data$Test$labels)))) covariateSummary <- covariateSummary(plpData$covariateData, cohort = dplyr::select(population, "rowId"), strata = strata, labels = population) plotGeneralizability(covariateSummary)
Create a plot of the learning curve using the object returned
from createLearningCurve
.
plotLearningCurve( learningCurve, metric = "AUROC", abscissa = "events", plotTitle = "Learning Curve", plotSubtitle = NULL, fileName = NULL )
learningCurve |
An object returned by createLearningCurve() |
metric |
Specifies the metric to be plotted:
|
abscissa |
Specify the abscissa metric to be plotted:
|
plotTitle |
Title of the learning curve plot. |
plotSubtitle |
Subtitle of the learning curve plot. |
fileName |
Filename of plot to be saved, for example |
A ggplot object. Use the ggsave
function to save to
file in a different format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 1000) outcomeId <- 3 modelSettings <- setLassoLogisticRegression(seed=42) learningCurve <- createLearningCurve(plpData, outcomeId, modelSettings = modelSettings, saveDirectory = file.path(tempdir(), "learningCurve"), cores = 2) plotLearningCurve(learningCurve)
Plot the net benefit
plotNetBenefit( plpResult, typeColumn = "evaluation", saveLocation = NULL, fileName = "netBenefit.png", evalType = NULL, ylim = NULL, xlim = NULL )
plpResult |
A plp result object as generated using the runPlp() function |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example 'plot.png'. See the ggsave function for supported file formats. |
evalType |
Which evaluation type to plot for. For example |
ylim |
The y limits for the plot, if NULL the limits are calculated from the data |
xlim |
The x limits for the plot, if NULL the limits are calculated from the data |
A list of ggplot objects or a single ggplot object if only one evaluation type is plotted
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotNetBenefit") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotNetBenefit(results) # clean up unlink(saveLoc, recursive = TRUE)
Plot all the PatientLevelPrediction plots
plotPlp(plpResult, saveLocation = NULL, typeColumn = "evaluation")
plpResult |
Object returned by the runPlp() function |
saveLocation |
Name of the directory where the plots should be saved (NULL means no saving) |
typeColumn |
The name of the column specifying the evaluation type (to stratify the plots) |
Create a directory with all the plots
TRUE if it ran, plots are saved in the specified directory
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotPlp") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotPlp(results) # clean up unlink(saveLoc, recursive = TRUE)
Plot the precision-recall curve using the sparse thresholdSummary data frame
plotPrecisionRecall( plpResult, typeColumn = "evaluation", saveLocation = NULL, fileName = "roc.png" )
plpResult |
A plp result object as generated using the runPlp() function |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the precision-recall curve using the sparse thresholdSummary data frame
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotPrecisionRecall") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotPrecisionRecall(results) # clean up unlink(saveLoc, recursive = TRUE)
Plot the Predicted probability density function, showing prediction overlap between true and false cases
plotPredictedPDF( plpResult, typeColumn = "evaluation", saveLocation = NULL, fileName = "PredictedPDF.png" )
plpResult |
A plp result object as generated using the runPlp() function |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the predicted probability density function, showing prediction overlap between true and false cases
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotPredictedPDF") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotPredictedPDF(results) # clean up unlink(saveLoc, recursive = TRUE)
Plot the side-by-side boxplots of prediction distribution, by class
plotPredictionDistribution( plpResult, typeColumn = "evaluation", saveLocation = NULL, fileName = "PredictionDistribution.png" )
plpResult |
A plp result object as generated using the runPlp() function |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the side-by-side boxplots of prediction distribution, by class
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotPredictionDistribution") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotPredictionDistribution(results) # clean up unlink(saveLoc, recursive = TRUE)
Plot the preference score probability density function, showing prediction overlap between true and false cases
plotPreferencePDF( plpResult, typeColumn = "evaluation", saveLocation = NULL, fileName = "plotPreferencePDF.png" )
plpResult |
A plp result object as generated using the runPlp() function |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the preference score probability density function, showing prediction overlap between true and false cases
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotPreferencePDF") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotPreferencePDF(results) # clean up unlink(saveLoc, recursive = TRUE)
Plot the smooth calibration as detailed in Van Calster et al. "A calibration hierarchy for risk models was defined: from utopia to empirical data" (2016)
plotSmoothCalibration( plpResult, smooth = "loess", span = 0.75, nKnots = 5, scatter = FALSE, bins = 20, sample = TRUE, typeColumn = "evaluation", saveLocation = NULL, fileName = "smoothCalibration.pdf" )
plpResult |
The result of running runPlp() |
smooth |
options: 'loess' or 'rcs' |
span |
This specifies the width of span used for loess. This will allow for faster computing and lower memory usage. |
nKnots |
The number of knots to be used by the rcs evaluation. Default is 5 |
scatter |
Plot the decile calibrations as points on the graph. Default is FALSE |
bins |
The number of bins for the histogram. Default is 20. |
sample |
If using loess then by default 20,000 patients will be sampled to save time |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the smoothed calibration
A ggplot object.
# generate prediction dataframe with 1000 patients predictedRisk <- stats::runif(1000) # overconfident for high risk patients actualRisk <- ifelse(predictedRisk < 0.5, predictedRisk, 0.5 + 0.5 * (predictedRisk - 0.5)) outcomeCount <- stats::rbinom(1000, 1, actualRisk) # mock data frame prediction <- data.frame(rowId = 1:1000, value = predictedRisk, outcomeCount = outcomeCount, evaluationType = "Test") attr(prediction, "modelType") <- "binary" calibrationSummary <- getCalibrationSummary(prediction, "binary", numberOfStrata = 10, typeColumn = "evaluationType") plpResults <- list() plpResults$performanceEvaluation$calibrationSummary <- calibrationSummary plpResults$prediction <- prediction plotSmoothCalibration(plpResults)
Plot the calibration
plotSparseCalibration( plpResult, typeColumn = "evaluation", saveLocation = NULL, fileName = "roc.png" )
plpResult |
A plp result object as generated using the runPlp() function |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the calibration
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotSparseCalibration") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotSparseCalibration(results) # clean up unlink(saveLoc, recursive = TRUE)
Plot the conventional calibration
plotSparseCalibration2( plpResult, typeColumn = "evaluation", saveLocation = NULL, fileName = "roc.png" )
plpResult |
A plp result object as generated using the runPlp() function |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the calibration
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotSparseCalibration2") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotSparseCalibration2(results) # clean up unlink(saveLoc, recursive = TRUE)
Plot the ROC curve using the sparse thresholdSummary data frame
plotSparseRoc( plpResult, typeColumn = "evaluation", saveLocation = NULL, fileName = "roc.png" )
plpResult |
A plp result object as generated using the runPlp() function |
typeColumn |
The name of the column specifying the evaluation type |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the Receiver Operator Characteristics (ROC) curve.
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotSparseRoc") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotSparseRoc(results) # clean up unlink(saveLoc, recursive = TRUE)
Plot the variable importance scatterplot
plotVariableScatterplot( covariateSummary, saveLocation = NULL, fileName = "VariableScatterplot.png" )
covariateSummary |
A covariateSummary object as generated using the
runPlp() function |
saveLocation |
Directory to save plot (if NULL plot is not saved) |
fileName |
Name of the file to save the plot to, for example
'plot.png'. See the ggsave function for supported file formats. |
Create a plot showing the variable importance scatterplot
A ggplot object. Use the ggsave
function to save to file in a different
format.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) saveLoc <- file.path(tempdir(), "plotVariableScatterplot") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) plotVariableScatterplot(results$covariateSummary) # clean up
Create predictive probabilities
predictCyclops(plpModel, data, cohort)
plpModel |
An object of type plpModel |
data |
The new plpData containing the covariateData for the new population |
cohort |
The cohort to calculate the prediction for |
Generates predictions for the population specified in plpData given the model.
The value column in the result data.frame contains: for logistic models, the probability of the outcome; for poisson models, the Poisson rate (per day) of the outcome; for survival models, the hazard rate (per day) of the outcome.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 1000) population <- createStudyPopulation(plpData, outcomeId = 3) data <- splitData(plpData, population) plpModel <- fitPlp(data$Train, modelSettings = setLassoLogisticRegression(), analysisId = "test", analysisPath = NULL) prediction <- predictCyclops(plpModel, data$Test, data$Test$labels) # view prediction dataframe head(prediction)
Predict risk with a given plpModel containing a generalized linear model.
predictGlm(plpModel, data, cohort)
plpModel |
An object of type plpModel |
data |
An object of type plpData |
cohort |
The population dataframe created using
createStudyPopulation() |
A dataframe containing the prediction for each person in the population
coefficients <- data.frame( covariateId = c(1002), coefficient = c(0.05)) model <- createGlmModel(coefficients, intercept = -2.5) data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=50) prediction <- predictGlm(model, plpData, plpData$cohorts) # see the predicted risk values head(prediction)
Predict the risk of the outcome using the input plpModel for the input plpData
predictPlp(plpModel, plpData, population, timepoint)
plpModel |
An object of type plpModel |
plpData |
An object of type plpData |
population |
The population created using createStudyPopulation() who will have their risks predicted or a cohort without the outcome known |
timepoint |
The timepoint to predict risk (survival models only) |
The function applies the trained model to the plpData to make predictions
A data frame containing the predicted risk values
coefficients <- data.frame( covariateId = c(1002), coefficient = c(0.05) ) model <- createGlmModel(coefficients, intercept = -2.5) data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 50) prediction <- predictPlp(model, plpData, plpData$cohorts) # see the predicted risk values head(prediction)
A function that wraps around FeatureExtraction::tidyCovariateData to normalise the data and remove rare or redundant features
preprocessData(covariateData, preprocessSettings = createPreprocessSettings())
covariateData |
The covariate part of the training data created by splitData() |
preprocessSettings |
The settings for the preprocessing created by createPreprocessSettings() |
Returns an object of class covariateData
that has been processed.
This includes normalising the data and removing rare or redundant features.
Redundant features are features that, within an analysisId, together cover
all observations.
The covariateData object with the processed covariates
library(dplyr) data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) preProcessedData <- preprocessData(plpData$covariateData, createPreprocessSettings()) # check age is normalized by max value preProcessedData$covariates %>% dplyr::filter(.data$covariateId == 1002)
Print a plpData object
## S3 method for class 'plpData' print(x, ...)
x |
The plpData object to print |
... |
Additional arguments |
A message describing the object
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=10) print(plpData)
Print a summary.plpData object
## S3 method for class 'summary.plpData' print(x, ...)
x |
The summary.plpData object to print |
... |
Additional arguments |
A message describing the object
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=10) summary <- summary(plpData) print(summary)
Recalibrate a model using the recalibrationInTheLarge or weakRecalibration method
recalibratePlp( prediction, analysisId, typeColumn = "evaluationType", method = c("recalibrationInTheLarge", "weakRecalibration") )
prediction |
A prediction dataframe |
analysisId |
The model analysisId |
typeColumn |
The column name where the strata types are specified |
method |
Method used to recalibrate ('recalibrationInTheLarge' or 'weakRecalibration' ) |
'recalibrationInTheLarge' calculates a single correction factor for the average predicted risks to match the average observed risks. 'weakRecalibration' fits a glm model to the logit of the predicted risks, also known as Platt scaling/logistic recalibration.
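The two approaches can be illustrated with a small sketch (for illustration only; the helper function names below are hypothetical and not the package's internals):

```r
# hypothetical helpers, sketching the two recalibration methods described above
logit <- function(p) log(p / (1 - p))
inverseLogit <- function(x) 1 / (1 + exp(-x))

# recalibration in the large: shift all predictions on the logit scale by a
# single correction so the mean predicted risk matches the observed outcome rate
recalibrateInTheLargeSketch <- function(value, outcomeCount) {
  correction <- logit(mean(outcomeCount)) - logit(mean(value))
  inverseLogit(logit(value) + correction)
}

# weak recalibration: refit intercept and slope on the logit of the predicted
# risks (logistic recalibration / Platt scaling)
weakRecalibrationSketch <- function(value, outcomeCount) {
  fit <- stats::glm(outcomeCount ~ logit(value), family = "binomial")
  stats::predict(fit, type = "response")
}
```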
A prediction dataframe with the recalibrated predictions added
prediction <- data.frame(rowId = 1:100, value = runif(100), outcomeCount = stats::rbinom(100, 1, 0.1), evaluationType = rep("validation", 100)) attr(prediction, "metaData") <- list(modelType = "binary") # since value is uniformly distributed but the outcome probability is 0.1, # the predictions are mis-calibrated outcomeRate <- mean(prediction$outcomeCount) predictedRisk <- mean(prediction$value) message("outcome rate is: ", outcomeRate) message("mean predicted risk is: ", predictedRisk) # let's recalibrate the predictions prediction <- recalibratePlp(prediction, analysisId = "recalibration", method = "recalibrationInTheLarge") recalibratedRisk <- mean(prediction$value) message("recalibrated risk with recalibration in the large is: ", recalibratedRisk) prediction <- recalibratePlp(prediction, analysisId = "recalibration", method = "weakRecalibration") recalibratedRisk <- mean(prediction$value) message("recalibrated risk with weak recalibration is: ", recalibratedRisk)
prediction <- data.frame(rowId = 1:100, value = runif(100), outcomeCount = stats::rbinom(100, 1, 0.1), evaluationType = rep("validation", 100)) attr(prediction, "metaData") <- list(modelType = "binary") # since value is uniformly distributed but the outcome probability is 0.1, # the predictions are mis-calibrated outcomeRate <- mean(prediction$outcomeCount) predictedRisk <- mean(prediction$value) message("outcome rate is: ", outcomeRate) message("mean predicted risk is: ", predictedRisk) # let's recalibrate the predictions prediction <- recalibratePlp(prediction, analysisId = "recalibration", method = "recalibrationInTheLarge") recalibratedRisk <- mean(prediction$value) message("recalibrated risk with recalibration in the large is: ", recalibratedRisk) prediction <- recalibratePlp(prediction, analysisId = "recalibration", method = "weakRecalibration") recalibratedRisk <- mean(prediction$value) message("recalibrated risk with weak recalibration is: ", recalibratedRisk)
Recalibrating a model by refitting it
recalibratePlpRefit(plpModel, newPopulation, newData, returnModel = FALSE)
recalibratePlpRefit(plpModel, newPopulation, newData, returnModel = FALSE)
plpModel |
The trained plpModel (runPlp$model) |
newPopulation |
The population, created using createStudyPopulation(), whose risks will be predicted |
newData |
An object of type |
returnModel |
Logical: return the refitted model |
A prediction dataframe with the predictions of the recalibrated model added
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 1000) saveLoc <- file.path(tempdir(), "recalibratePlpRefit") plpResults <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) newData <- simulatePlpData(simulationProfile, n = 1000) newPopulation <- createStudyPopulation(newData, outcomeId = 3) predictions <- recalibratePlpRefit(plpModel = plpResults$model, newPopulation = newPopulation, newData = newData) # clean up unlink(saveLoc, recursive = TRUE)
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 1000) saveLoc <- file.path(tempdir(), "recalibratePlpRefit") plpResults <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) newData <- simulatePlpData(simulationProfile, n = 1000) newPopulation <- createStudyPopulation(newData, outcomeId = 3) predictions <- recalibratePlpRefit(plpModel = plpResults$model, newPopulation = newPopulation, newData = newData) # clean up unlink(saveLoc, recursive = TRUE)
Run a list of predictions analyses
runMultiplePlp( databaseDetails = createDatabaseDetails(), modelDesignList = list(createModelDesign(targetId = 1, outcomeId = 2, modelSettings = setLassoLogisticRegression()), createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression())), onlyFetchData = FALSE, cohortDefinitions = NULL, logSettings = createLogSettings(verbosity = "DEBUG", timeStamp = TRUE, logName = "runPlp Log"), saveDirectory = NULL, sqliteLocation = file.path(saveDirectory, "sqlite") )
runMultiplePlp( databaseDetails = createDatabaseDetails(), modelDesignList = list(createModelDesign(targetId = 1, outcomeId = 2, modelSettings = setLassoLogisticRegression()), createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression())), onlyFetchData = FALSE, cohortDefinitions = NULL, logSettings = createLogSettings(verbosity = "DEBUG", timeStamp = TRUE, logName = "runPlp Log"), saveDirectory = NULL, sqliteLocation = file.path(saveDirectory, "sqlite") )
databaseDetails |
The database settings created using |
modelDesignList |
A list of model designs created using |
onlyFetchData |
Only fetches and saves the data object to the output folder without running the analysis. |
cohortDefinitions |
A list of cohort definitions for the target and outcome cohorts |
logSettings |
The setting specifying the logging for the analyses created using |
saveDirectory |
Name of the folder where all the outputs will be written to. |
sqliteLocation |
(optional) The location of the sqlite database with the results |
This function will run all the prediction analyses specified in the modelDesignList.
A data frame with the following columns:
analysisId |
The unique identifier for a set of analysis choices. |
targetId |
The ID of the target cohort populations. |
outcomeId |
The ID of the outcome cohort. |
dataLocation |
The location where the plpData was saved |
the settings ids |
The ids for all other settings used for model development. |
connectionDetails <- Eunomia::getEunomiaConnectionDetails() databaseDetails <- createDatabaseDetails(connectionDetails = connectionDetails, cdmDatabaseSchema = "main", cohortDatabaseSchema = "main", cohortTable = "cohort", outcomeDatabaseSchema = "main", outcomeTable = "cohort", targetId = 1, outcomeIds = 2) Eunomia::createCohorts(connectionDetails = connectionDetails) covariateSettings <- FeatureExtraction::createCovariateSettings(useDemographicsGender = TRUE, useDemographicsAge = TRUE, useConditionOccurrenceLongTerm = TRUE) # GI Bleed in users of celecoxib modelDesign <- createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression(seed = 42), populationSettings = createStudyPopulationSettings(), restrictPlpDataSettings = createRestrictPlpDataSettings(), covariateSettings = covariateSettings, splitSettings = createDefaultSplitSetting(splitSeed = 42), preprocessSettings = createPreprocessSettings()) # GI Bleed in users of NSAIDs modelDesign2 <- createModelDesign(targetId = 4, outcomeId = 3, modelSettings = setLassoLogisticRegression(seed = 42), populationSettings = createStudyPopulationSettings(), restrictPlpDataSettings = createRestrictPlpDataSettings(), covariateSettings = covariateSettings, splitSettings = createDefaultSplitSetting(splitSeed = 42), preprocessSettings = createPreprocessSettings()) saveLoc <- file.path(tempdir(), "runMultiplePlp") multipleResults <- runMultiplePlp(databaseDetails = databaseDetails, modelDesignList = list(modelDesign, modelDesign2), saveDirectory = saveLoc) # You should see results for two developed models in the output.
# The output is also uploaded to a sqlite database in the saveLoc/sqlite folder. dir(saveLoc) # The dir output should show two Analysis_ folders with the results, # two targetId_ folders with the extracted data, and a sqlite folder with the database # The results can be explored in the shiny app by calling viewMultiplePlp(saveLoc) # clean up (viewing the results in the shiny app won't work after this) unlink(saveLoc, recursive = TRUE)
connectionDetails <- Eunomia::getEunomiaConnectionDetails() databaseDetails <- createDatabaseDetails(connectionDetails = connectionDetails, cdmDatabaseSchema = "main", cohortDatabaseSchema = "main", cohortTable = "cohort", outcomeDatabaseSchema = "main", outcomeTable = "cohort", targetId = 1, outcomeIds = 2) Eunomia::createCohorts(connectionDetails = connectionDetails) covariateSettings <- FeatureExtraction::createCovariateSettings(useDemographicsGender = TRUE, useDemographicsAge = TRUE, useConditionOccurrenceLongTerm = TRUE) # GI Bleed in users of celecoxib modelDesign <- createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression(seed = 42), populationSettings = createStudyPopulationSettings(), restrictPlpDataSettings = createRestrictPlpDataSettings(), covariateSettings = covariateSettings, splitSettings = createDefaultSplitSetting(splitSeed = 42), preprocessSettings = createPreprocessSettings()) # GI Bleed in users of NSAIDs modelDesign2 <- createModelDesign(targetId = 4, outcomeId = 3, modelSettings = setLassoLogisticRegression(seed = 42), populationSettings = createStudyPopulationSettings(), restrictPlpDataSettings = createRestrictPlpDataSettings(), covariateSettings = covariateSettings, splitSettings = createDefaultSplitSetting(splitSeed = 42), preprocessSettings = createPreprocessSettings()) saveLoc <- file.path(tempdir(), "runMultiplePlp") multipleResults <- runMultiplePlp(databaseDetails = databaseDetails, modelDesignList = list(modelDesign, modelDesign2), saveDirectory = saveLoc) # You should see results for two developed models in the output.
# The output is also uploaded to a sqlite database in the saveLoc/sqlite folder. dir(saveLoc) # The dir output should show two Analysis_ folders with the results, # two targetId_ folders with the extracted data, and a sqlite folder with the database # The results can be explored in the shiny app by calling viewMultiplePlp(saveLoc) # clean up (viewing the results in the shiny app won't work after this) unlink(saveLoc, recursive = TRUE)
This provides a general framework for training patient-level prediction models. The user can select various default feature selection methods or incorporate their own, and can likewise choose from a range of default classifiers or incorporate their own. There are three ways to split the data for evaluation: by patient (randomly splits people into train/validation sets), by year (splits data into train/validation sets based on index year, with older data in training and newer data in validation), or both (same as year splitting, but any patients appearing in both the training and validation sets are removed from the validation set).
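As a sketch, the splitting behaviour is controlled through the splitSettings argument; the 'time' and 'subject' type values below are assumptions shown for illustration (only 'stratified' appears in the usage above):

```r
# person-level stratified split (default): outcome rate preserved across folds
splitByPerson <- createDefaultSplitSetting(type = "stratified",
                                           testFraction = 0.25,
                                           nfold = 3, splitSeed = 42)
# temporal split: older index dates in training, newer in test
splitByTime <- createDefaultSplitSetting(type = "time",
                                         testFraction = 0.25,
                                         nfold = 3, splitSeed = 42)
# subject split: ensures no person appears in both train and test
splitBySubject <- createDefaultSplitSetting(type = "subject",
                                            testFraction = 0.25,
                                            nfold = 3, splitSeed = 42)
```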
runPlp( plpData, outcomeId = plpData$metaData$databaseDetails$outcomeIds[1], analysisId = paste(Sys.Date(), outcomeId, sep = "-"), analysisName = "Study details", populationSettings = createStudyPopulationSettings(), splitSettings = createDefaultSplitSetting(type = "stratified", testFraction = 0.25, trainFraction = 0.75, splitSeed = 123, nfold = 3), sampleSettings = createSampleSettings(type = "none"), featureEngineeringSettings = createFeatureEngineeringSettings(type = "none"), preprocessSettings = createPreprocessSettings(minFraction = 0.001, normalize = TRUE), modelSettings = setLassoLogisticRegression(), logSettings = createLogSettings(verbosity = "DEBUG", timeStamp = TRUE, logName = "runPlp Log"), executeSettings = createDefaultExecuteSettings(), saveDirectory = NULL )
runPlp( plpData, outcomeId = plpData$metaData$databaseDetails$outcomeIds[1], analysisId = paste(Sys.Date(), outcomeId, sep = "-"), analysisName = "Study details", populationSettings = createStudyPopulationSettings(), splitSettings = createDefaultSplitSetting(type = "stratified", testFraction = 0.25, trainFraction = 0.75, splitSeed = 123, nfold = 3), sampleSettings = createSampleSettings(type = "none"), featureEngineeringSettings = createFeatureEngineeringSettings(type = "none"), preprocessSettings = createPreprocessSettings(minFraction = 0.001, normalize = TRUE), modelSettings = setLassoLogisticRegression(), logSettings = createLogSettings(verbosity = "DEBUG", timeStamp = TRUE, logName = "runPlp Log"), executeSettings = createDefaultExecuteSettings(), saveDirectory = NULL )
plpData |
An object of type |
outcomeId |
(integer) The ID of the outcome. |
analysisId |
(character) Identifier for the analysis. It is used to create, e.g., the result folder. Default is a timestamp. |
analysisName |
(character) Name for the analysis |
populationSettings |
An object of type |
splitSettings |
An object of type |
sampleSettings |
An object of type |
featureEngineeringSettings |
An object of |
preprocessSettings |
An object of |
modelSettings |
An object of class
|
logSettings |
An object of |
executeSettings |
An object of |
saveDirectory |
The path to the directory where the results will be saved (if NULL uses working directory) |
This function takes as input the plpData extracted from an OMOP CDM database and follows the specified settings to develop and internally validate a model for the specified outcomeId.
A plpResults object containing the following:
model The developed model of class plpModel
executionSummary A list containing the hardware details, R package details and execution time
performanceEvaluation Various internal performance metrics in sparse format
prediction The plpData cohort table with the predicted risks added as a column (named value)
covariateSummary A characterization of the features for patients with and without the outcome during the time at risk
analysisRef A list with details about the analysis
# simulate some data data('simulationProfile') plpData <- simulatePlpData(simulationProfile, n = 1000) # develop a model with the default settings saveLoc <- file.path(tempdir(), "runPlp") results <- runPlp(plpData = plpData, outcomeId = 3, analysisId = 1, saveDirectory = saveLoc) # to check the results you can view the log file at saveLoc/1/plpLog.txt # or view with shiny app using viewPlp(results) # clean up unlink(saveLoc, recursive = TRUE)
# simulate some data data('simulationProfile') plpData <- simulatePlpData(simulationProfile, n = 1000) # develop a model with the default settings saveLoc <- file.path(tempdir(), "runPlp") results <- runPlp(plpData = plpData, outcomeId = 3, analysisId = 1, saveDirectory = saveLoc) # to check the results you can view the log file at saveLoc/1/plpLog.txt # or view with shiny app using viewPlp(results) # clean up unlink(saveLoc, recursive = TRUE)
Save the modelDesignList to a json file
savePlpAnalysesJson( modelDesignList = list(createModelDesign(targetId = 1, outcomeId = 2, modelSettings = setLassoLogisticRegression()), createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression())), cohortDefinitions = NULL, saveDirectory = NULL )
savePlpAnalysesJson( modelDesignList = list(createModelDesign(targetId = 1, outcomeId = 2, modelSettings = setLassoLogisticRegression()), createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression())), cohortDefinitions = NULL, saveDirectory = NULL )
modelDesignList |
A list of modelDesigns created using |
cohortDefinitions |
A list of the cohortDefinitions (generally extracted from ATLAS) |
saveDirectory |
The directory to save the modelDesignList settings |
This function creates a json file with the modelDesignList saved
The json string of the ModelDesignList
modelDesign <- createModelDesign(targetId = 1, outcomeId = 2, modelSettings = setLassoLogisticRegression()) saveLoc <- file.path(tempdir(), "loadPlpAnalysesJson") jsonFile <- savePlpAnalysesJson(modelDesignList = modelDesign, saveDirectory = saveLoc) # clean up unlink(saveLoc, recursive = TRUE)
modelDesign <- createModelDesign(targetId = 1, outcomeId = 2, modelSettings = setLassoLogisticRegression()) saveLoc <- file.path(tempdir(), "loadPlpAnalysesJson") jsonFile <- savePlpAnalysesJson(modelDesignList = modelDesign, saveDirectory = saveLoc) # clean up unlink(saveLoc, recursive = TRUE)
savePlpData
Saves an object of type plpData to a folder.
savePlpData(plpData, file, envir = NULL, overwrite = FALSE)
savePlpData(plpData, file, envir = NULL, overwrite = FALSE)
plpData |
An object of type |
file |
The name of the folder where the data will be written. The folder should not yet exist. |
envir |
The environment in which to evaluate variables when saving |
overwrite |
Whether to force overwrite an existing file |
Called for its side effect, the data will be written to a set of files in the folder specified by the user.
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 500) saveLoc <- file.path(tempdir(), "savePlpData") savePlpData(plpData, saveLoc) dir(saveLoc, full.names = TRUE) # clean up unlink(saveLoc, recursive = TRUE)
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 500) saveLoc <- file.path(tempdir(), "savePlpData") savePlpData(plpData, saveLoc) dir(saveLoc, full.names = TRUE) # clean up unlink(saveLoc, recursive = TRUE)
Saves the plp model
savePlpModel(plpModel, dirPath)
savePlpModel(plpModel, dirPath)
plpModel |
A trained classifier returned by running |
dirPath |
A location to save the model to |
Saves the plp model to a user-specified folder
The directory path where the model was saved
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 1000) saveLoc <- file.path(tempdir(), "savePlpModel") plpResult <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) path <- savePlpModel(plpResult$model, file.path(saveLoc, "savedModel")) # show the saved model dir(path, full.names = TRUE) # clean up unlink(saveLoc, recursive = TRUE)
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 1000) saveLoc <- file.path(tempdir(), "savePlpModel") plpResult <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) path <- savePlpModel(plpResult$model, file.path(saveLoc, "savedModel")) # show the saved model dir(path, full.names = TRUE) # clean up unlink(saveLoc, recursive = TRUE)
Saves the result from runPlp into the location directory
savePlpResult(result, dirPath)
savePlpResult(result, dirPath)
result |
The result of running runPlp() |
dirPath |
The directory to save the result to |
Saves the result from runPlp into the location directory
The directory path where the results were saved
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 1000) saveLoc <- file.path(tempdir(), "savePlpResult") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) # save the results newSaveLoc <- file.path(tempdir(), "savePlpResult", "saved") savePlpResult(results, newSaveLoc) # show the saved results dir(newSaveLoc, recursive = TRUE, full.names = TRUE) # clean up unlink(saveLoc, recursive = TRUE) unlink(newSaveLoc, recursive = TRUE)
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 1000) saveLoc <- file.path(tempdir(), "savePlpResult") results <- runPlp(plpData, outcomeId = 3, saveDirectory = saveLoc) # save the results newSaveLoc <- file.path(tempdir(), "savePlpResult", "saved") savePlpResult(results, newSaveLoc) # show the saved results dir(newSaveLoc, recursive = TRUE, full.names = TRUE) # clean up unlink(saveLoc, recursive = TRUE) unlink(newSaveLoc, recursive = TRUE)
Saves the prediction dataframe to a json file
savePrediction(prediction, dirPath, fileName = "prediction.json")
savePrediction(prediction, dirPath, fileName = "prediction.json")
prediction |
The prediction data.frame |
dirPath |
The directory to save the prediction json |
fileName |
The name of the json file that will be saved |
Saves the prediction data frame returned by predict.R to a json file and returns the fileLocation where the prediction is saved
The file location where the prediction was saved
prediction <- data.frame( rowIds = c(1, 2, 3), outcomeCount = c(0, 1, 0), value = c(0.1, 0.9, 0.2) ) saveLoc <- file.path(tempdir()) savePrediction(prediction, saveLoc) dir(saveLoc) # clean up unlink(file.path(saveLoc, "prediction.json"))
prediction <- data.frame( rowIds = c(1, 2, 3), outcomeCount = c(0, 1, 0), value = c(0.1, 0.9, 0.2) ) saveLoc <- file.path(tempdir()) savePrediction(prediction, saveLoc) dir(saveLoc) # clean up unlink(file.path(saveLoc, "prediction.json"))
Create setting for AdaBoost with python DecisionTreeClassifier base estimator
setAdaBoost( nEstimators = list(10, 50, 200), learningRate = list(1, 0.5, 0.1), algorithm = list("SAMME"), seed = sample(1e+06, 1) )
setAdaBoost( nEstimators = list(10, 50, 200), learningRate = list(1, 0.5, 0.1), algorithm = list("SAMME"), seed = sample(1e+06, 1) )
nEstimators |
(list) The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early. |
learningRate |
(list) Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier. There is a trade-off between the learningRate and nEstimators parameters. |
algorithm |
Only ‘SAMME’ can be provided. The 'algorithm' argument will be deprecated in scikit-learn 1.8. |
seed |
A seed for the model |
a modelSettings object
## Not run: model <- setAdaBoost(nEstimators = list(10), learningRate = list(0.1), seed = 42) ## End(Not run)
## Not run: model <- setAdaBoost(nEstimators = list(10), learningRate = list(0.1), seed = 42) ## End(Not run)
Create setting for lasso Cox model
setCoxModel( variance = 0.01, seed = NULL, includeCovariateIds = c(), noShrinkage = c(), threads = -1, upperLimit = 20, lowerLimit = 0.01, tolerance = 2e-07, maxIterations = 3000 )
setCoxModel( variance = 0.01, seed = NULL, includeCovariateIds = c(), noShrinkage = c(), threads = -1, upperLimit = 20, lowerLimit = 0.01, tolerance = 2e-07, maxIterations = 3000 )
variance |
Numeric: prior distribution starting variance |
seed |
An option to add a seed when training the model |
includeCovariateIds |
a set of covariate IDS to limit the analysis to |
noShrinkage |
a set of covariates which are forced to be included in the final model. Default is the intercept |
threads |
An option to set number of threads when training model |
upperLimit |
Numeric: Upper prior variance limit for grid-search |
lowerLimit |
Numeric: Lower prior variance limit for grid-search |
tolerance |
Numeric: maximum relative change in convergence criterion from successive iterations to achieve convergence |
maxIterations |
Integer: maximum iterations of Cyclops to attempt before returning a failed-to-converge error |
modelSettings
object
coxL1 <- setCoxModel()
coxL1 <- setCoxModel()
Create setting for the scikit-learn DecisionTree with python
setDecisionTree( criterion = list("gini"), splitter = list("best"), maxDepth = list(as.integer(4), as.integer(10), NULL), minSamplesSplit = list(2, 10), minSamplesLeaf = list(10, 50), minWeightFractionLeaf = list(0), maxFeatures = list(100, "sqrt", NULL), maxLeafNodes = list(NULL), minImpurityDecrease = list(10^-7), classWeight = list(NULL), seed = sample(1e+06, 1) )
setDecisionTree( criterion = list("gini"), splitter = list("best"), maxDepth = list(as.integer(4), as.integer(10), NULL), minSamplesSplit = list(2, 10), minSamplesLeaf = list(10, 50), minWeightFractionLeaf = list(0), maxFeatures = list(100, "sqrt", NULL), maxLeafNodes = list(NULL), minImpurityDecrease = list(10^-7), classWeight = list(NULL), seed = sample(1e+06, 1) )
criterion |
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. |
splitter |
The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split. |
maxDepth |
(list) The maximum depth of the tree. If NULL, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. |
minSamplesSplit |
The minimum number of samples required to split an internal node |
minSamplesLeaf |
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least minSamplesLeaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. |
minWeightFractionLeaf |
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sampleWeight is not provided. |
maxFeatures |
(list) The number of features to consider when looking for the best split (int/'sqrt'/NULL) |
maxLeafNodes |
(list) Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. (int/NULL) |
minImpurityDecrease |
Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf. |
classWeight |
(list) Weights associated with classes: 'balanced' or NULL |
seed |
The random state seed |
a modelSettings object
## Not run: model <- setDecisionTree(criterion = list("gini"), maxDepth = list(4), minSamplesSplit = list(2), minSamplesLeaf = list(10), seed = 42) ## End(Not run)
## Not run: model <- setDecisionTree(criterion = list("gini"), maxDepth = list(4), minSamplesSplit = list(2), minSamplesLeaf = list(10), seed = 42) ## End(Not run)
Create setting for gradient boosting machine model using gbm_xgboost implementation
setGradientBoostingMachine( ntrees = c(100, 300), nthread = 20, earlyStopRound = 25, maxDepth = c(4, 6, 8), minChildWeight = 1, learnRate = c(0.05, 0.1, 0.3), scalePosWeight = 1, lambda = 1, alpha = 0, seed = sample(1e+07, 1) )
setGradientBoostingMachine( ntrees = c(100, 300), nthread = 20, earlyStopRound = 25, maxDepth = c(4, 6, 8), minChildWeight = 1, learnRate = c(0.05, 0.1, 0.3), scalePosWeight = 1, lambda = 1, alpha = 0, seed = sample(1e+07, 1) )
ntrees |
The number of trees to build |
nthread |
The number of computer threads to use (how many cores do you have?) |
earlyStopRound |
If the performance does not increase over earlyStopRound number of trees then training stops (this prevents overfitting) |
maxDepth |
Maximum depth of each tree - a large value will lead to slow model training |
minChildWeight |
Minimum sum of instance weight in a child node - larger values are more conservative |
learnRate |
The boosting learn rate |
scalePosWeight |
Controls weight of positive class in loss - useful for imbalanced classes |
lambda |
L2 regularization on weights - larger is more conservative |
alpha |
L1 regularization on weights - larger is more conservative |
seed |
An option to add a seed when training the final model |
A modelSettings object that can be used to fit the model
modelGbm <- setGradientBoostingMachine( ntrees = c(10, 100), nthread = 20, maxDepth = c(4, 6), learnRate = c(0.1, 0.3) )
modelGbm <- setGradientBoostingMachine( ntrees = c(10, 100), nthread = 20, maxDepth = c(4, 6), learnRate = c(0.1, 0.3) )
Create setting for Iterative Hard Thresholding model
setIterativeHardThresholding( K = 10, penalty = "bic", seed = sample(1e+05, 1), exclude = c(), forceIntercept = FALSE, fitBestSubset = FALSE, initialRidgeVariance = 0.1, tolerance = 1e-08, maxIterations = 10000, threshold = 1e-06, delta = 0 )
setIterativeHardThresholding( K = 10, penalty = "bic", seed = sample(1e+05, 1), exclude = c(), forceIntercept = FALSE, fitBestSubset = FALSE, initialRidgeVariance = 0.1, tolerance = 1e-08, maxIterations = 10000, threshold = 1e-06, delta = 0 )
K |
The maximum number of non-zero predictors |
penalty |
Specifies the IHT penalty; possible values are |
seed |
An option to add a seed when training the model |
exclude |
A vector of numbers or covariateId names to exclude from prior |
forceIntercept |
Logical: Force intercept coefficient into regularization |
fitBestSubset |
Logical: Fit final subset with no regularization |
initialRidgeVariance |
integer |
tolerance |
numeric |
maxIterations |
integer |
threshold |
numeric |
delta |
numeric |
modelSettings
object
modelIht <- setIterativeHardThresholding(K = 5, seed = 42)
modelIht <- setIterativeHardThresholding(K = 5, seed = 42)
Create modelSettings for lasso logistic regression
setLassoLogisticRegression( variance = 0.01, seed = NULL, includeCovariateIds = c(), noShrinkage = c(0), threads = -1, forceIntercept = FALSE, upperLimit = 20, lowerLimit = 0.01, tolerance = 2e-06, maxIterations = 3000, priorCoefs = NULL )
setLassoLogisticRegression( variance = 0.01, seed = NULL, includeCovariateIds = c(), noShrinkage = c(0), threads = -1, forceIntercept = FALSE, upperLimit = 20, lowerLimit = 0.01, tolerance = 2e-06, maxIterations = 3000, priorCoefs = NULL )
variance |
Numeric: prior distribution starting variance |
seed |
An option to add a seed when training the model |
includeCovariateIds |
a set of covariateIds to limit the analysis to |
noShrinkage |
a set of covariates which are forced to be included in the final model. Default is the intercept |
threads |
An option to set number of threads when training model. |
forceIntercept |
Logical: Force intercept coefficient into prior |
upperLimit |
Numeric: Upper prior variance limit for grid-search |
lowerLimit |
Numeric: Lower prior variance limit for grid-search |
tolerance |
Numeric: maximum relative change in convergence criterion from successive iterations to achieve convergence |
maxIterations |
Integer: maximum iterations of Cyclops to attempt before returning a failed-to-converge error |
priorCoefs |
Use coefficients from a previous model as starting points for model fit (transfer learning) |
modelSettings
object
modelLasso <- setLassoLogisticRegression(seed=42)
modelLasso <- setLassoLogisticRegression(seed=42)
Create setting for gradient boosting machine model using lightGBM (https://github.com/microsoft/LightGBM/tree/master/R-package).
setLightGBM( nthread = 20, earlyStopRound = 25, numIterations = c(100), numLeaves = c(31), maxDepth = c(5, 10), minDataInLeaf = c(20), learningRate = c(0.05, 0.1, 0.3), lambdaL1 = c(0), lambdaL2 = c(0), scalePosWeight = 1, isUnbalance = FALSE, seed = sample(1e+07, 1) )
setLightGBM( nthread = 20, earlyStopRound = 25, numIterations = c(100), numLeaves = c(31), maxDepth = c(5, 10), minDataInLeaf = c(20), learningRate = c(0.05, 0.1, 0.3), lambdaL1 = c(0), lambdaL2 = c(0), scalePosWeight = 1, isUnbalance = FALSE, seed = sample(1e+07, 1) )
nthread |
The number of computer threads to use (how many cores do you have?) |
earlyStopRound |
If the performance does not increase over earlyStopRound number of trees then training stops (this prevents overfitting) |
numIterations |
Number of boosting iterations. |
numLeaves |
This hyperparameter sets the maximum number of leaves. Increasing this parameter can lead to higher model complexity and potential overfitting. |
maxDepth |
This hyperparameter sets the maximum depth of the tree (-1 means no limit). Increasing this parameter can also lead to higher model complexity and potential overfitting. |
minDataInLeaf |
This hyperparameter sets the minimum number of data points that must be present in a leaf node. Increasing this parameter can help to reduce overfitting |
learningRate |
This hyperparameter controls the step size at each iteration of the gradient descent algorithm. Lower values can lead to slower convergence but may result in better performance. |
lambdaL1 |
This hyperparameter controls L1 regularization, which can help to reduce overfitting by encouraging sparse models. |
lambdaL2 |
This hyperparameter controls L2 regularization, which can also help to reduce overfitting by discouraging large weights in the model. |
scalePosWeight |
Controls weight of positive class in loss - useful for imbalanced classes |
isUnbalance |
This parameter cannot be used at the same time as scalePosWeight; choose only one of them. While enabling this should improve the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities. |
seed |
An option to add a seed when training the final model |
A list of settings that can be used to train a model with runPlp
modelLightGbm <- setLightGBM( numLeaves = c(20, 31, 50), maxDepth = c(-1, 5, 10), minDataInLeaf = c(10, 20, 30), learningRate = c(0.05, 0.1, 0.3) )
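As a sketch of the scalePosWeight/isUnbalance guidance above: for an imbalanced outcome you would set one (and only one) of the two. The parameter values below are illustrative, not recommendations.

```r
# Sketch: handling class imbalance with setLightGBM.
# Assumes a plpData object and runPlp() as in the other examples.
imbalancedSettings <- setLightGBM(
  scalePosWeight = 10,  # up-weight the positive class in the loss
  isUnbalance = FALSE,  # must stay FALSE when scalePosWeight != 1
  seed = 42
)
# results <- runPlp(plpData, modelSettings = imbalancedSettings)
```

If calibrated probabilities matter for your use case, scalePosWeight is likely the safer choice, since isUnbalance degrades the individual class probability estimates as noted above.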
Create setting for neural network model with python's scikit-learn. For bigger models, consider using the
DeepPatientLevelPrediction
package.
setMLP( hiddenLayerSizes = list(c(100), c(20)), activation = list("relu"), solver = list("adam"), alpha = list(0.3, 0.01, 1e-04, 1e-06), batchSize = list("auto"), learningRate = list("constant"), learningRateInit = list(0.001), powerT = list(0.5), maxIter = list(200, 100), shuffle = list(TRUE), tol = list(1e-04), warmStart = list(TRUE), momentum = list(0.9), nesterovsMomentum = list(TRUE), earlyStopping = list(FALSE), validationFraction = list(0.1), beta1 = list(0.9), beta2 = list(0.999), epsilon = list(1e-08), nIterNoChange = list(10), seed = sample(1e+05, 1) )
hiddenLayerSizes |
(list of vectors) The ith element represents the number of neurons in the ith hidden layer. |
activation |
(list) Activation function for the hidden layer: 'identity', 'logistic', 'tanh' or 'relu'. |
solver |
(list) The solver for weight optimization. (‘lbfgs’, ‘sgd’, ‘adam’) |
alpha |
(list) L2 penalty (regularization term) parameter. |
batchSize |
(list) Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatch. When set to “auto”, batchSize=min(200, n_samples). |
learningRate |
(list) Only used when solver='sgd'. Learning rate schedule for weight updates: 'constant', 'invscaling' or 'adaptive', default='constant' |
learningRateInit |
(list) Only used when solver=’sgd’ or ‘adam’. The initial learning rate used. It controls the step-size in updating the weights. |
powerT |
(list) Only used when solver=’sgd’. The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’. |
maxIter |
(list) Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations. For stochastic solvers (‘sgd’, ‘adam’), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps. |
shuffle |
(list) boolean: Whether to shuffle samples in each iteration. Only used when solver=’sgd’ or ‘adam’. |
tol |
(list) Tolerance for the optimization. When the loss or score is not improving by at least tol for nIterNoChange consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence is considered to be reached and training stops. |
warmStart |
(list) When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. |
momentum |
(list) Momentum for gradient descent update. Should be between 0 and 1. Only used when solver=’sgd’. |
nesterovsMomentum |
(list) Whether to use Nesterov’s momentum. Only used when solver=’sgd’ and momentum > 0. |
earlyStopping |
(list) boolean Whether to use early stopping to terminate training when validation score is not improving. If set to true, it will automatically set aside 10 percent of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs. |
validationFraction |
(list) The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if earlyStopping is True. |
beta1 |
(list) Exponential decay rate for estimates of first moment vector in adam, should be in 0 to 1. |
beta2 |
(list) Exponential decay rate for estimates of second moment vector in adam, should be in 0 to 1. |
epsilon |
(list) Value for numerical stability in adam. |
nIterNoChange |
(list) Maximum number of epochs to not meet tol improvement. Only effective when solver=’sgd’ or ‘adam’. |
seed |
A seed for the model |
a modelSettings object
## Not run: model <- setMLP(hiddenLayerSizes = list(c(20)), alpha=list(3e-4), seed = 42) ## End(Not run)
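The earlyStopping/validationFraction/tol interaction described above can be sketched as follows; the parameter values are illustrative only.

```r
# Sketch: MLP with early stopping. 10% of the training data is held out
# as a validation set and training stops once the validation score fails
# to improve by at least tol for nIterNoChange consecutive epochs.
mlpSettings <- setMLP(
  hiddenLayerSizes = list(c(50)),
  earlyStopping = list(TRUE),
  validationFraction = list(0.1),
  tol = list(1e-4),
  nIterNoChange = list(10),
  seed = 42
)
# results <- runPlp(plpData, modelSettings = mlpSettings)
```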
Create setting for naive bayes model with python
setNaiveBayes()
a modelSettings object
## Not run: plpData <- getEunomiaPlpData() model <- setNaiveBayes() analysisId <- "naiveBayes" saveLocation <- file.path(tempdir(), analysisId) results <- runPlp(plpData, modelSettings = model, saveDirectory = saveLocation, analysisId = analysisId) # clean up unlink(saveLocation, recursive = TRUE) ## End(Not run)
Use the python environment created using configurePython()
setPythonEnvironment(envname = "PLP", envtype = NULL)
envname |
A string for the name of the virtual environment (default is 'PLP') |
envtype |
An option for specifying the environment as 'conda' or 'python'. If NULL then the default is 'conda' for windows users and 'python' for non-windows users |
This function sets PatientLevelPrediction to use a python environment
A string indicating which python environment will be used
## Not run: # create a conda environment named PLP configurePython(envname="PLP", envtype="conda") ## End(Not run)
Create setting for random forest model using sklearn
setRandomForest( ntrees = list(100, 500), criterion = list("gini"), maxDepth = list(4, 10, 17), minSamplesSplit = list(2, 5), minSamplesLeaf = list(1, 10), minWeightFractionLeaf = list(0), mtries = list("sqrt", "log2"), maxLeafNodes = list(NULL), minImpurityDecrease = list(0), bootstrap = list(TRUE), maxSamples = list(NULL, 0.9), oobScore = list(FALSE), nJobs = list(NULL), classWeight = list(NULL), seed = sample(1e+05, 1) )
ntrees |
(list) The number of trees to build |
criterion |
(list) The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific. |
maxDepth |
(list) The maximum depth of the tree. If NULL, then nodes are expanded until all leaves are pure or until all leaves contain less than minSamplesSplit samples. |
minSamplesSplit |
(list) The minimum number of samples required to split an internal node |
minSamplesLeaf |
(list) The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least minSamplesLeaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. |
minWeightFractionLeaf |
(list) The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sampleWeight is not provided. |
mtries |
(list) The number of features to consider when looking for the best split, e.g. 'sqrt' or 'log2'. |
maxLeafNodes |
(list) Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. |
minImpurityDecrease |
(list) A node will be split if this split induces a decrease of the impurity greater than or equal to this value. |
bootstrap |
(list) Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. |
maxSamples |
(list) If bootstrap is True, the number of samples to draw from X to train each base estimator. |
oobScore |
(list) Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True. |
nJobs |
The number of jobs to run in parallel. |
classWeight |
(list) Weights associated with classes. If not given, all classes are supposed to have weight one. NULL, “balanced”, “balanced_subsample” |
seed |
A seed when training the final model |
a modelSettings object
## Not run: plpData <- getEunomiaPlpData() model <- setRandomForest(ntrees = list(100), maxDepth = list(4), minSamplesSplit = list(2), minSamplesLeaf = list(10), maxSamples = list(0.9), seed = 42) saveLoc <- file.path(tempdir(), "randomForest") results <- runPlp(plpData, modelSettings = model, saveDirectory = saveLoc) # clean up unlink(saveLoc, recursive = TRUE) ## End(Not run)
Create setting for the python sklearn SVM (SVC function)
setSVM( C = list(1, 0.9, 2, 0.1), kernel = list("rbf"), degree = list(1, 3, 5), gamma = list("scale", 1e-04, 3e-05, 0.001, 0.01, 0.25), coef0 = list(0), shrinking = list(TRUE), tol = list(0.001), classWeight = list(NULL), cacheSize = 500, seed = sample(1e+05, 1) )
C |
(list) Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty. |
kernel |
(list) Specifies the kernel type to be used in the algorithm. one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’. If none is given ‘rbf’ will be used. |
degree |
(list) Degree of the polynomial kernel function ('poly'); ignored by all other kernels. |
gamma |
(list) Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. 'scale', 'auto' or a float, default='scale' |
coef0 |
(list) independent term in kernel function. It is only significant in poly/sigmoid. |
shrinking |
(list) whether to use the shrinking heuristic. |
tol |
(list) Tolerance for stopping criterion. |
classWeight |
(list) Class weight based on imbalance either 'balanced' or NULL |
cacheSize |
Specify the size of the kernel cache (in MB). |
seed |
A seed for the model |
a modelSettings object
## Not run: plpData <- getEunomiaPlpData() model <- setSVM(C = list(1), gamma = list("scale"), seed = 42) saveLoc <- file.path(tempdir(), "svm") results <- runPlp(plpData, modelSettings = model, saveDirectory = saveLoc) # clean up unlink(saveLoc, recursive = TRUE) ## End(Not run)
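As noted for the classWeight parameter above, an imbalanced outcome can be handled by letting the SVM reweight the classes. A minimal sketch with illustrative values:

```r
# Sketch: SVM with balanced class weights. 'balanced' weights each class
# inversely proportional to its frequency in the training data.
svmSettings <- setSVM(
  C = list(1),
  kernel = list("rbf"),
  classWeight = list("balanced"),
  seed = 42
)
# results <- runPlp(plpData, modelSettings = svmSettings)
```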
simulatePlpData
creates a plpData object with simulated data.
simulatePlpData(plpDataSimulationProfile, n = 10000)
plpDataSimulationProfile |
An object of type plpDataSimulationProfile |
n |
The size of the population to be generated. |
This function generates simulated data that is in many ways similar to the original data on which the simulation profile is based.
An object of type plpData.
# first load the simulation profile to use data("simulationProfile") # then generate the simulated data plpData <- simulatePlpData(simulationProfile, n = 100) nrow(plpData$cohorts)
A simulation profile for generating synthetic patient level prediction data
data(simulationProfile)
A data frame containing the following elements:
prevalence of all covariates
regression model parameters to simulate outcomes
settings used to simulate the profile
covariateIds and covariateNames
time window
prevalence of exclusion of covariates
Loads sklearn python model from json
sklearnFromJson(path)
path |
path to the model json file |
a sklearn python model object
## Not run: plpData <- getEunomiaPlpData() modelSettings <- setDecisionTree(maxDepth = list(3), minSamplesSplit = list(2), minSamplesLeaf = list(1), maxFeatures = list(100)) saveLocation <- file.path(tempdir(), "sklearnFromJson") results <- runPlp(plpData, modelSettings = modelSettings, saveDirectory = saveLocation) # view save model dir(results$model$model, full.names = TRUE) # load into a sklearn object model <- sklearnFromJson(file.path(results$model$model, "model.json")) # max depth is 3 as we set in beginning model$max_depth # clean up unlink(saveLocation, recursive = TRUE) ## End(Not run)
Saves sklearn python model object to json in path
sklearnToJson(model, path)
model |
a fitted sklearn python model object |
path |
path to the saved model file |
nothing, saves the model to the path as json
## Not run: sklearn <- reticulate::import("sklearn", convert = FALSE) model <- sklearn$tree$DecisionTreeClassifier() model$fit(sklearn$datasets$load_iris()$data, sklearn$datasets$load_iris()$target) saveLoc <- file.path(tempdir(), "model.json") sklearnToJson(model, saveLoc) # the model.json is saved in the tempdir dir(tempdir()) # clean up unlink(saveLoc) ## End(Not run)
splitSettings
Split the plpData into test/train sets using split settings of class
splitSettings
splitData( plpData = plpData, population = population, splitSettings = createDefaultSplitSetting(splitSeed = 42) )
plpData |
An object of type plpData |
population |
The population created using createStudyPopulation() |
splitSettings |
An object of type splitSettings created using createDefaultSplitSetting() |
Returns a list containing the training data (Train) and optionally the test data (Test). Train is an Andromeda object containing
covariates: a table (rowId, covariateId, covariateValue) containing the covariates for each data point in the train data
covariateRef: a table with the covariate information
labels: a table (rowId, outcomeCount, ...) for each data point in the train data (outcomeCount is the class label)
folds: a table (rowId, index) specifying which training fold each data point is in.
Test is an Andromeda object containing
covariates: a table (rowId, covariateId, covariateValue) containing the covariates for each data point in the test data
covariateRef: a table with the covariate information
labels: a table (rowId, outcomeCount, ...) for each data point in the test data (outcomeCount is the class label)
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n = 1000) population <- createStudyPopulation(plpData) splitSettings <- createDefaultSplitSetting(testFraction = 0.50, trainFraction = 0.50, nfold = 5) data <- splitData(plpData, population, splitSettings) # test data should be ~500 rows (changes because of study population) nrow(data$Test$labels) # train data should be ~500 rows nrow(data$Train$labels) # there should be five folds in the train data length(unique(data$Train$folds$index))
Summarize a plpData object
## S3 method for class 'plpData' summary(object, ...)
object |
The plpData object to summarize |
... |
Additional arguments |
A summary of the object containing the number of people, outcomes and covariates
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=10) summary(plpData)
Converts the standard plpData to a sparse matrix
toSparseM(plpData, cohort = NULL, map = NULL)
plpData |
An object of type |
cohort |
If specified the plpData is restricted to the rowIds in the cohort (otherwise plpData$labels is used) |
map |
A covariate map (telling us the column number for covariates) |
This function converts the covariates Andromeda
table in COO format into a sparse matrix from
the package Matrix
Returns a list containing the data as a sparse matrix, the plpData covariateRef, and a data.frame named map that indicates which covariate corresponds to each column. This object is a list with the following components:
A sparse matrix with the rows corresponding to each person in the plpData and the columns corresponding to the covariates.
The plpData covariateRef.
A data.frame containing the data column ids and the corresponding covariateId from covariateRef.
library(dplyr) data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=100) # how many covariates are there before we convert to sparse matrix plpData$covariateData$covariates %>% dplyr::group_by(.data$covariateId) %>% dplyr::summarise(n = n()) %>% dplyr::collect() %>% nrow() sparseData <- toSparseM(plpData, cohort=plpData$cohorts) # how many covariates are there after we convert to sparse matrix sparseData$dataMatrix@Dim[2]
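To relate a column of the sparse matrix back to its covariate, the returned map can be joined with covariateRef. A sketch assuming the list components are named dataMatrix, map and covariateRef as in the Value section, and that map holds a column id alongside each covariateId (the column name below is an assumption):

```r
# Sketch: look up which covariate corresponds to the first column of the
# sparse matrix. Component and column names are assumptions; check
# names(sparseData) and names(sparseData$map) for your package version.
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 100)
sparseData <- toSparseM(plpData, cohort = plpData$cohorts)
firstColumn <- sparseData$map[1, ]
merge(firstColumn, as.data.frame(sparseData$covariateRef), by = "covariateId")
```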
validateExternal - Validate model performance on new data
validateExternal( validationDesignList, databaseDetails, logSettings = createLogSettings(verbosity = "INFO", logName = "validatePLP"), outputFolder )
validationDesignList |
A list of objects created with createValidationDesign() |
databaseDetails |
A list of objects of class databaseDetails created using createDatabaseDetails() |
logSettings |
An object of type logSettings created using createLogSettings() |
outputFolder |
The directory to save the validation results to (subfolders are created per database in validationDatabaseDetails) |
A list of results
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n=1000) # first fit a model on some data, default is a L1 logistic regression saveLoc <- file.path(tempdir(), "development") results <- runPlp(plpData, saveDirectory = saveLoc) # then create my validation design validationDesign <- createValidationDesign(1, 3, plpModelList = list(results$model)) # I will validate on Eunomia example database connectionDetails <- Eunomia::getEunomiaConnectionDetails() Eunomia::createCohorts(connectionDetails) databaseDetails <- createDatabaseDetails(connectionDetails = connectionDetails, cdmDatabaseSchema = "main", cdmDatabaseName = "Eunomia", cdmDatabaseId = 1, targetId = 1, outcomeIds = 3) path <- file.path(tempdir(), "validation") validateExternal(validationDesign, databaseDetails, outputFolder = path) # see generated result files dir(path, recursive = TRUE) # clean up unlink(saveLoc, recursive = TRUE) unlink(path, recursive = TRUE)
This function loads all the models in a multiple plp analysis folder and validates the models on new data
validateMultiplePlp( analysesLocation, validationDatabaseDetails, validationRestrictPlpDataSettings = createRestrictPlpDataSettings(), recalibrate = NULL, cohortDefinitions = NULL, saveDirectory = NULL )
analysesLocation |
The location where the multiple plp analyses are |
validationDatabaseDetails |
A single or list of validation database settings created using createDatabaseDetails() |
validationRestrictPlpDataSettings |
The settings specifying the extra restriction settings when extracting the data, created using createRestrictPlpDataSettings() |
recalibrate |
A vector of recalibration methods (currently supports 'RecalibrationintheLarge' and/or 'weakRecalibration') |
cohortDefinitions |
A list of cohortDefinitions |
saveDirectory |
The location to save the validation results to |
Users need to input a location where the results of the multiple plp analyses are found and the connection and database settings for the new data
Nothing. The results are saved to the saveDirectory
# first develop a model using runMultiplePlp connectionDetails <- Eunomia::getEunomiaConnectionDetails() Eunomia::createCohorts(connectionDetails = connectionDetails) databaseDetails <- createDatabaseDetails(connectionDetails = connectionDetails, cdmDatabaseId = "1", cdmDatabaseName = "Eunomia", cdmDatabaseSchema = "main", targetId = 1, outcomeIds = 3) covariateSettings <- FeatureExtraction::createCovariateSettings(useDemographicsGender = TRUE, useDemographicsAge = TRUE, useConditionOccurrenceLongTerm = TRUE) modelDesign <- createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression(seed = 42), covariateSettings = covariateSettings) saveLoc <- file.path(tempdir(), "validateMultiplePlp", "development") results <- runMultiplePlp(databaseDetails = databaseDetails, modelDesignList = list(modelDesign), saveDirectory = saveLoc) # now validate the model on Eunomia but with a different target analysesLocation <- saveLoc validationDatabaseDetails <- createDatabaseDetails(connectionDetails = connectionDetails, cdmDatabaseId = "2", cdmDatabaseName = "EunomiaNew", cdmDatabaseSchema = "main", targetId = 4, outcomeIds = 3) newSaveLoc <- file.path(tempdir(), "validateMultiplePlp", "validation") validateMultiplePlp(analysesLocation = analysesLocation, validationDatabaseDetails = validationDatabaseDetails, saveDirectory = newSaveLoc) # the results could now be viewed in the shiny app with viewMultiplePlp(newSaveLoc)
open a local shiny app for viewing the result of a PLP analyses from a database
viewDatabaseResultPlp( mySchema, myServer, myUser, myPassword, myDbms, myPort = NULL, myTableAppend )
mySchema |
Database result schema containing the result tables |
myServer |
server with the result database |
myUser |
Username for the connection to the result database |
myPassword |
Password for the connection to the result database |
myDbms |
database management system for the result database |
myPort |
Port for the connection to the result database |
myTableAppend |
A string appended to the results tables (optional) |
Opens a shiny app for viewing the results of the models from a database
Opens a shiny app for interactively viewing the results
connectionDetails <- Eunomia::getEunomiaConnectionDetails() Eunomia::createCohorts(connectionDetails) databaseDetails <- createDatabaseDetails(connectionDetails = connectionDetails, cdmDatabaseSchema = "main", cdmDatabaseName = "Eunomia", cdmDatabaseId = "1", targetId = 1, outcomeIds = 3) modelDesign <- createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression()) saveLoc <- file.path(tempdir(), "viewDatabaseResultPlp", "development") runMultiplePlp(databaseDetails = databaseDetails, modelDesignList = list(modelDesign), saveDirectory = saveLoc) # view result files dir(saveLoc, recursive = TRUE) viewDatabaseResultPlp(myDbms = "sqlite", mySchema = "main", myServer = file.path(saveLoc, "sqlite", "databaseFile.sqlite"), myUser = NULL, myPassword = NULL, myTableAppend = "") # clean up, shiny app can't be opened after the following has been run unlink(saveLoc, recursive = TRUE)
open a local shiny app for viewing the result of a multiple PLP analyses
viewMultiplePlp(analysesLocation)
analysesLocation |
The directory containing the results (with the analysis_x folders) |
Opens a shiny app for viewing the results of the models across the different target, outcome, time-at-risk and model settings.
Opens a shiny app for interactively viewing the results
connectionDetails <- Eunomia::getEunomiaConnectionDetails() Eunomia::createCohorts(connectionDetails) databaseDetails <- createDatabaseDetails(connectionDetails = connectionDetails, cdmDatabaseSchema = "main", cdmDatabaseName = "Eunomia", cdmDatabaseId = "1", targetId = 1, outcomeIds = 3) modelDesign <- createModelDesign(targetId = 1, outcomeId = 3, modelSettings = setLassoLogisticRegression()) saveLoc <- file.path(tempdir(), "viewMultiplePlp", "development") runMultiplePlp(databaseDetails = databaseDetails, modelDesignList = list(modelDesign), saveDirectory = saveLoc) # view result files dir(saveLoc, recursive = TRUE) # open shiny app viewMultiplePlp(analysesLocation = saveLoc) # clean up, shiny app can't be opened after the following has been run unlink(saveLoc, recursive = TRUE)
This is a shiny app for viewing interactive plots of the performance and the settings
viewPlp(runPlp, validatePlp = NULL, diagnosePlp = NULL)
runPlp |
The output of runPlp() (an object of class 'runPlp') |
validatePlp |
The output of externalValidatePlp (on object of class 'validatePlp') |
diagnosePlp |
The output of diagnosePlp() |
Pass the result of runPlp(), and optionally external validation and diagnostic results, to view the plots interactively.
Opens a shiny app for interactively viewing the results
data("simulationProfile") plpData <- simulatePlpData(simulationProfile, n= 1000) saveLoc <- file.path(tempdir(), "viewPlp", "development") results <- runPlp(plpData, saveDirectory = saveLoc) # view result files dir(saveLoc, recursive = TRUE) # open shiny app viewPlp(results) # clean up, shiny app can't be opened after the following has been run unlink(saveLoc, recursive = TRUE)