This guide aims to describe the process of cohort subsetting using
CohortGenerator
. The purpose of Cohort subsetting
operations is to allow the creation of common operations that can be
applied to generated cohorts in order to subset to different operations
in a consistent manner.
Subset definitions are named sets of operations that can be applied to a set of one or more cohorts. The current operations that you can apply to cohorts are:
Operations can be sequentially chained within subset definitions and all outputs are considered full cohorts that can be passed into to other packages as if they are cohorts designed in packages
This subsetting process allows you to capture the age, race/ethnicity gender within a cohort as subgroups. For example, “subset cohorts to subjects that are male between the ages 1 and 5 years old”.
This type of operation allows you to subset a cohort to only those subjects included in one or more other cohorts
First get a Cohort definition set:
cohortDefinitionSet <- getCohortDefinitionSet(
settingsFileName = "testdata/name/Cohorts.csv",
jsonFolder = "testdata/name/cohorts",
sqlFolder = "testdata/name/sql/sql_server",
cohortFileNameFormat = "%s",
cohortFileNameValue = c("cohortName"),
packageName = "CohortGenerator",
verbose = FALSE
)
cohortDefinitionSet$cohortId <- cohortDefinitionSet$cohortId + 1778210 # Match the cohort Ids taken from Atlas
cohortIds <- cohortDefinitionSet$cohortId
cohortDefinitionSet$atlasId <- cohortDefinitionSet$cohortId
cohortDefinitionSet$logicDescription <- ""
A definition can include different subset operations - these are applied strictly in order:
# Example, we want to have a HTN cohort that starts any time prior to the index start
# and the HTN cohort ends any time after the index start
subsetDef <- createCohortSubsetDefinition(
name = "Patients in cohort cohort 1778213 with 365 days prior observation",
definitionId = 1,
subsetOperators = list(
# here we are saying 'first subset to only those patients in cohort 1778213'
createCohortSubset(
name = "Subset to patients in cohort 1778213",
# Note that this can be set to any id - if the
# cohort is empty or doesn't exist this will not error
cohortIds = 1778213,
cohortCombinationOperator = "any",
negate = FALSE,
startWindow = createSubsetCohortWindow(
startDay = -9999,
endDay = 0,
targetAnchor = "cohortStart"
),
endWindow = createSubsetCohortWindow(
startDay = 0,
endDay = 9999,
targetAnchor = "cohortStart"
)
),
# Next, subset to only those with 365 days of prior observation
createLimitSubset(
name = "Observation of at least 365 days prior",
priorTime = 365,
followUpTime = 0,
limitTo = "all"
)
)
)
Next we create a similar definition that also subsetOperators the specified cohorts to require patients with specific demographic criteria. We can do that by copying the subset operations from our first definition and modifying them.
subsetOperations2 <- subsetDef$subsetOperators
# subset to those between aged 18 an 64
subsetOperations2[[3]] <-
createDemographicSubset(
name = "18 - 65",
ageMin = 18,
ageMax = 64
)
subsetDef2 <- createCohortSubsetDefinition(
name = "Patients in cohort 1778213 with 365 days prior obs, aged 18 - 64",
definitionId = 2,
subsetOperators = subsetOperations2
)
Next we need to add the subset definitions to the base cohort set. This will automatically add identifiers and OHDSI SQL for the subset cohorts as well as storing references for saving definition sets for re-use.
cohortDefinitionSet <- cohortDefinitionSet |>
addCohortSubsetDefinition(subsetDef)
knitr::kable(cohortDefinitionSet[, names(cohortDefinitionSet)[which(!names(cohortDefinitionSet) %in% c("json", "sql"))]])
cohortId | cohortName | atlasId | logicDescription | subsetParent | isSubset | subsetDefinitionId |
---|---|---|---|---|---|---|
1778211 | celecoxib | 1778211 | 1778211 | FALSE | NA | |
1778212 | celecoxibAge40 | 1778212 | 1778212 | FALSE | NA | |
1778213 | celecoxibAge40Male | 1778213 | 1778213 | FALSE | NA | |
1778214 | celecoxibCensored | 1778214 | 1778214 | FALSE | NA | |
1778211001 | celecoxib - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior | NA | NA | 1778211 | TRUE | 1 |
1778212001 | celecoxibAge40 - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior | NA | NA | 1778212 | TRUE | 1 |
1778213001 | celecoxibAge40Male - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior | NA | NA | 1778213 | TRUE | 1 |
1778214001 | celecoxibCensored - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior | NA | NA | 1778214 | TRUE | 1 |
We can also apply a subset definition to only a limited number of target cohorts as follows
cohortDefinitionSet <- cohortDefinitionSet |>
addCohortSubsetDefinition(subsetDef2, targetCohortIds = 1778212)
knitr::kable(cohortDefinitionSet[, names(cohortDefinitionSet)[which(!names(cohortDefinitionSet) %in% c("json", "sql"))]])
cohortId | cohortName | atlasId | logicDescription | subsetParent | isSubset | subsetDefinitionId |
---|---|---|---|---|---|---|
1778211 | celecoxib | 1778211 | 1778211 | FALSE | NA | |
1778212 | celecoxibAge40 | 1778212 | 1778212 | FALSE | NA | |
1778213 | celecoxibAge40Male | 1778213 | 1778213 | FALSE | NA | |
1778214 | celecoxibCensored | 1778214 | 1778214 | FALSE | NA | |
1778211001 | celecoxib - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior | NA | NA | 1778211 | TRUE | 1 |
1778212001 | celecoxibAge40 - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior | NA | NA | 1778212 | TRUE | 1 |
1778213001 | celecoxibAge40Male - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior | NA | NA | 1778213 | TRUE | 1 |
1778214001 | celecoxibCensored - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior | NA | NA | 1778214 | TRUE | 1 |
1778212002 | celecoxibAge40 - Patients in cohort 1778213 with 365 days prior obs, aged 18 - 64 Subset to patients in cohort 1778213, Observation of at least 365 days prior, 18 - 65 | NA | NA | 1778212 | TRUE | 2 |
The cohortDefinitionSet
data.frame now has some
additional columns:
subsetParent, isSubset, subsetDefinitionId
subsetParent
indicates the parent cohort. For standard
cohorts this will be their own ID. For out newly defined subsets, this
will be the base cohort.
subsetDefinitionId
displays the id of the subset applied
to the cohort.
In addition, the name of the cohort displayed in this table is automatically generated from the base cohort name, the subset name and the names defined for the subset operations applied in the subset definition. As the number of resulting subsets can become very large, it is crucial to choose human interpretable naming conventions. For example, see the name of our first cohort and the resulting name of a child subset:
writeLines(c(
paste("Cohort Id:", cohortDefinitionSet$cohortId[1]),
paste("Name", cohortDefinitionSet$cohortName[1])
))
#> Cohort Id: 1778211
#> Name celecoxib
writeLines(c(
paste("Cohort Id:", cohortDefinitionSet$cohortId[4]),
paste("Subset Parent Id:", cohortDefinitionSet$subsetParent[4]),
paste("Name", cohortDefinitionSet$cohortName[4])
))
#> Cohort Id: 1778214
#> Subset Parent Id: 1778214
#> Name celecoxibCensored
Note that when adding a subset definition to a cohort definition set,
the target cohort ids e.g (1778211, 1778212) must exist in the
cohortDefinitionSet
and the output ids
(1778211002
, 1778212003
) must be unique. As
with all cohorts, any cohorts with these ids will be deleted prior to
execution to prevent collisions. Note that the default expression for
output cohort ids is targetId * 1000 + definitionId
this
may cause collisions that will cause
addCohortSubsetDefinition
to error. This can be modified by
changing the identifierExpression
parameter to
createSubsetDefinition
. This expression should be defined
to guarantee uniqueness or adding the definition to a cohort definition
set will fail.
Executing CohortGenerator, we can now include the subset operations when our cohorts are generated:
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
createCohortTables(
connectionDetails = connectionDetails,
cohortDatabaseSchema = "main",
cohortTableNames = getCohortTableNames("my_cohort")
)
# ### As subsets are a big side effect we need to be clear what was generated and have good naming conventions
generatedCohorts <- generateCohortSet(
connectionDetails = connectionDetails,
cdmDatabaseSchema = "main",
cohortDatabaseSchema = "main",
cohortTableNames = getCohortTableNames("my_cohort"),
cohortDefinitionSet = cohortDefinitionSet,
incremental = TRUE,
incrementalFolder = file.path(someFolder, "RecordKeeping")
)
Cohort subset definitions can be run incrementally. In fact, if the
base cohort definition changes for any reason, any subsets will
automatically be re-executed when calling
generateCohortSet
.
Saving applied subsets can automatically be added to a project using
saveCohortDefinitionSet
loading is also achieved with getCohortDefinitionSet
cohortDefinitionSet <- getCohortDefinitionSet(
subsetJsonFolder = "<path_to_my_subset_definition>"
)
Any subset definitions should automatically be loaded and applied to the cohort definition set.
Subset definitions can be converted to JSON objects as follows:
For the purpose of writing to disk we recommend the use of
ParallelLogger
for consistency.