Introduction to CohortSymmetry

CohortSymmetry provides tools to perform Sequence Symmetry Analysis (SSA). Before applying the package to a study question, it is highly recommended to test the method against well-known positive and negative controls. Details of SSA and the relevant controls can be found in Pratt et al. (2015).

The functions you will interact with are:

  1. generateSequenceCohortSet(): this function will create a cohort of the individuals present in both the index and the marker cohorts.

  2. summariseSequenceRatios(): this function will calculate sequence ratios.

  3. tableSequenceRatios() and plotSequenceRatios(): these functions will help us to visualise the sequence ratio results.

  4. summariseTemporalSymmetry(): this function will produce aggregated results based on the time difference between two cohort start dates.

  5. plotTemporalSymmetry(): this function will help us to visualise the results from summariseTemporalSymmetry().

Below, you will find an example analysis that offers a brief overview of the package’s main functionalities. More context and further examples for each of these functions are provided in later vignettes.

First, let’s load the relevant libraries.

library(CDMConnector)
library(dplyr)
library(DBI)
library(omock)
library(CohortSymmetry)
library(duckdb)

The CohortSymmetry package works with data mapped to the OMOP CDM. Hence, the initial step involves connecting to a database. As an example, we will use the omock package to generate a mock database with two mock cohorts: the index_cohort and the marker_cohort.

# Build a mock CDM with 1000 people and two single-cohort tables
cdm <- emptyCdmReference(cdmName = "mock") |>
  mockPerson(nPerson = 1000) |>
  mockObservationPeriod() |>
  mockCohort(
    name = "index_cohort",
    numberCohorts = 1,
    cohortName = c("index_cohort"),
    seed = 1
  ) |>
  mockCohort(
    name = "marker_cohort",
    numberCohorts = 1,
    cohortName = c("marker_cohort"),
    seed = 2
  )

# Copy the mock CDM into an in-memory DuckDB database
con <- dbConnect(duckdb::duckdb())
cdm <- copyCdmTo(con = con, cdm = cdm, schema = "main", overwrite = TRUE)

cdm$index_cohort |> 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB v1.1.2 [unknown@Linux 6.5.0-1025-azure:R 4.4.2/:memory:]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id           <int> 1, 3, 3, 5, 5, 6, 7, 8, 11, 13, 13, 15, 16, 16, 1…
#> $ cohort_start_date    <date> 2013-11-16, 1993-03-12, 1993-05-09, 2016-12-12, …
#> $ cohort_end_date      <date> 2014-03-29, 1993-05-08, 1993-06-14, 2017-02-01, …

cdm$marker_cohort |> 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB v1.1.2 [unknown@Linux 6.5.0-1025-azure:R 4.4.2/:memory:]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id           <int> 1, 3, 4, 7, 8, 8, 9, 9, 9, 9, 10, 11, 12, 13, 13,…
#> $ cohort_start_date    <date> 2013-11-04, 1992-09-30, 1970-09-13, 1975-06-01, …
#> $ cohort_end_date      <date> 2014-04-02, 1994-09-17, 1987-03-12, 1987-11-11, …

Once the connection to the database is established, we can use the generateSequenceCohortSet() function to find the intersection of the two cohorts. The function keeps the individuals who appear in both cohorts and stores them as a new cohort table, named intersect, in the cdm reference.

cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "index_cohort",
  markerTable = "marker_cohort",
  name = "intersect",
  combinationWindow = c(0, Inf)
)

See below that the generated cohort follows the format of an OMOP CDM cohort with the addition of two extra columns: index_date and marker_date. These columns correspond to the cohort_start_date in the index_cohort and the marker_cohort, respectively.

cdm$intersect |> 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 6
#> Database: DuckDB v1.1.2 [unknown@Linux 6.5.0-1025-azure:R 4.4.2/:memory:]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id           <int> 139, 150, 167, 173, 200, 207, 357, 410, 466, 471,…
#> $ cohort_start_date    <date> 1981-02-09, 2017-10-28, 2005-07-27, 2006-01-13, …
#> $ cohort_end_date      <date> 1982-04-05, 2017-11-08, 2005-12-06, 2006-06-17, …
#> $ index_date           <date> 1982-04-05, 2017-11-08, 2005-07-27, 2006-06-17, …
#> $ marker_date          <date> 1981-02-09, 2017-10-28, 2005-12-06, 2006-01-13, …
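To make the relationship between these columns concrete, the following base R sketch rebuilds the first four rows shown above from index_date and marker_date alone. In these rows, cohort_start_date is the earlier of the two dates and cohort_end_date is the later one. The data frame first_rows and the columns first_event and gap_days are illustrative names, not package output. The sketch only mirrors the displayed rows and is not the package's implementation.

```r
# First four rows of index_date and marker_date, copied from the output above
first_rows <- data.frame(
  index_date  = as.Date(c("1982-04-05", "2017-11-08", "2005-07-27", "2006-06-17")),
  marker_date = as.Date(c("1981-02-09", "2017-10-28", "2005-12-06", "2006-01-13"))
)

# In the displayed rows, the cohort starts at the earlier date and ends at the later one
first_rows$cohort_start_date <- pmin(first_rows$index_date, first_rows$marker_date)
first_rows$cohort_end_date   <- pmax(first_rows$index_date, first_rows$marker_date)

# Illustrative helper columns: which event came first, and the gap in days
# (marker minus index; this sign convention is an assumption made for this sketch)
first_rows$first_event <- ifelse(first_rows$index_date < first_rows$marker_date,
                                 "index", "marker")
first_rows$gap_days <- as.integer(first_rows$marker_date - first_rows$index_date)

first_rows
```

The order of the two dates is what the sequence ratio is built on, so this is a useful column pair to inspect before summarising.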

Once we have the intersect cohort, we can explore the temporal symmetry using summariseTemporalSymmetry() and plotTemporalSymmetry():

result <- summariseTemporalSymmetry(cohort = cdm$intersect, 
                                    timescale = "year")
result |> dplyr::glimpse()
#> Rows: 19
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name       <chr> "index_name &&& marker_name", "index_name &&& marker_…
#> $ group_level      <chr> "index_cohort &&& marker_cohort", "index_cohort &&& m…
#> $ strata_name      <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level     <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name    <chr> "temporal_symmetry", "temporal_symmetry", "temporal_s…
#> $ variable_level   <chr> "-7", "-4", "13", "6", "0", "8", "-9", "-6", "7", "-2…
#> $ estimate_name    <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type    <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value   <chr> NA, "5", NA, "5", "136", NA, NA, NA, NA, "26", NA, NA…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…

plotTemporalSymmetry(result = result)
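The summarised result stores both the time gap (variable_level) and the count (estimate_value) as character strings, so it can help to convert them before inspecting the numbers by hand. The sketch below does this in base R for the first ten rows displayed above. The objects shown and tidy are illustrative names. The NA values most likely reflect small-count suppression, but that is an assumption here rather than something shown in the output.

```r
# First ten rows of variable_level and estimate_value, copied from the output above
shown <- data.frame(
  variable_level = c("-7", "-4", "13", "6", "0", "8", "-9", "-6", "7", "-2"),
  estimate_value = c(NA, "5", NA, "5", "136", NA, NA, NA, NA, "26"),
  stringsAsFactors = FALSE
)

# Convert the character columns to integers
shown$time_gap <- as.integer(shown$variable_level)
shown$count    <- as.integer(shown$estimate_value)

# Keep the rows with a reported count and order them by time gap
tidy <- shown[!is.na(shown$count), c("time_gap", "count")]
tidy <- tidy[order(tidy$time_gap), ]
tidy
```

Applied to the full result, the same conversion gives a compact table of counts per time gap that can be compared with the plot.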

Next, we will use the summariseSequenceRatios() function to get the crude sequence ratios, adjusted sequence ratios, and the corresponding confidence intervals.

result <- summariseSequenceRatios(cohort = cdm$intersect)

result |> dplyr::glimpse()
#> Rows: 10
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
#> $ cdm_name         <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name       <chr> "index_cohort_name &&& marker_cohort_name", "index_co…
#> $ group_level      <chr> "index_cohort &&& marker_cohort", "index_cohort &&& m…
#> $ strata_name      <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level     <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name    <chr> "crude", "adjusted", "crude", "crude", "adjusted", "a…
#> $ variable_level   <chr> "sequence_ratio", "sequence_ratio", "sequence_ratio",…
#> $ estimate_name    <chr> "point_estimate", "point_estimate", "lower_CI", "uppe…
#> $ estimate_type    <chr> "numeric", "numeric", "numeric", "numeric", "numeric"…
#> $ estimate_value   <chr> "1.17222222222222", "2231.13882352941", "0.9611606018…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…

Finally, we can visualise the results using tableSequenceRatios():

tableSequenceRatios(result)
| Database name | Index | Marker | Study population | Index first, N (%) | Marker first, N (%) | CSR (95% CI) | ASR (95% CI) |
|---|---|---|---|---|---|---|---|
| mock | Index cohort | Marker cohort | 391 | 211 (54.0 %) | 180 (46.0 %) | 1.17 (0.96 - 1.43) | 2,231.14 (1,829.42 - 2,723.17) |

Or create a plot with the adjusted sequence ratios:

plotSequenceRatios(result = result,
                   onlyaSR = TRUE,
                   colours = "black")
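As a quick sanity check on the table above, the crude sequence ratio (CSR) can be reproduced by hand from the two counts it reports. The sketch below assumes the usual SSA definition: the number of people who entered the index cohort first divided by the number who entered the marker cohort first. The variable names are illustrative. It does not attempt to reproduce the confidence interval or the adjusted sequence ratio.

```r
# Counts taken from the table above (mock data)
index_first  <- 211  # index cohort entry precedes marker cohort entry
marker_first <- 180  # marker cohort entry precedes index cohort entry

study_population <- index_first + marker_first
crude_sr <- index_first / marker_first

study_population                                 # 391
round(crude_sr, 2)                               # 1.17
round(100 * index_first / study_population, 1)   # 54 (shown as 54.0 % in the table)
round(100 * marker_first / study_population, 1)  # 46 (shown as 46.0 % in the table)
```

The value matches the crude point estimate reported by summariseSequenceRatios() for this mock data.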

As a diagram

Diagrammatically, the workflow using CohortSymmetry resembles the following flowchart: