Provides haplotype matching capabilities on the GBWT. Implements long match and set maximal match query on the GBWT.
Related paper: "Haplotype Matching with GBWT for Pangenome Graphs," by Ahsan Sanaullah, Seba Villalobos, Degui Zhi, and Shaojie Zhang. DOI:
Contact Author: Ahsan Sanaullah (ahsan.sanaullah@ucf.edu)
-
FastLCP: Augmentation of FastLocate of that adds the locatePrev and LCP computation capabilities. -
lf_gbwt::GBWT: Barebones GBWT that allows computation of LF in efficient worst case time complexity. -
CompText: Compressed Text Data structure. Implementation of the$O\left(r \log\left(\frac{n}{r}\right)\right)$ space text access data structure by Gagie et al. ("Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space," https://doi.org/10.1145/3375890). Allows$O\left(\log\left(\frac{n}{r}\right) + l\right)$ time queries of$l$ consecutive characters of the text. This implementation is modified to use less space and allows point queries in$O\left(\log\left(\frac{n}{r}\right)\right)$ time.
-
Set Maximal Match Query: Given a query path-string
$Q$ , outputs all matches to a path-string$p$ in the GBWT,$(i,j,k)$ , s.t.$Q[i,i+k) = p[j,j+k)$ and there does not exist any path-string$p'$ and index$j'$ in the GBWT s.t.$Q[i-1,i+k) = p'[j'-1, j'+k)$ or$Q[i,i+k+1) = p'[j', j'+k+1)$ . -
Long Match Query: Given a query path-string
$Q$ , outputs all matches to a path-string$p$ in the GBWT,$(i,j,k)$ , s.t.$k \geq L$ ,$Q[i,i+k) = p[j,j+k)$ , and$Q[i-1,i+k) \neq p[j-1, j+k)$ and$Q[i,i+k+1) \neq p[j,j+k+1)$ .
Each haplotype matching algorithm has various version implementations. Each version uses different indexes as input. Set Maximal Match Query has 6 numbered versions, Long Match Query has 4 (specifically, 2, 3, 4, and 2_4).The GBWT should contain no document array samples except for in version 0. There are also brute force versions for long and set maximal match query. Finally set maximal match query has a naive
- 0:
GBWT. Document array samples should be contained in the GBWT, otherwise this implementation run very slowly. - 1:
GBWTandFastLocate. - 2:
GBWT,FastLocate, andFastLCP. - 3:
lf_gbwt::GBWT,FastLocate, andFastLCP. - 4:
lf_gbwt::GBWT,FastLocate,FastLCP, andCompText. - 2_4:
GBWT,FastLocate,FastLCP, andCompText.
This repository has two primary purposes. First, it is provided for readers of the paper who wish to view the implementation, and reproduce the results of the paper. Second, the classes/indexes/queries implemented here may be reused in other code.
The code used to perform the experiments in the paper are provided in /test/.
-
generateIndices.cpp takes as input a base name and an integer
$N$ . It reads theGBWTwith the base name, then removes$N$ random haplotypes. Finally, it builds theGBWT, theFastLocate, and all indexes implemented here for the trimmed dataset. Finally, it writes all removed paths and built indexes to the disk. -
doTest.cpp takes as input data generated by generateIndices.cpp and a
GBWTequivalent to the trimmedGBWTbut with document array samples. For each path removed from the originalGBWT, it performs each numbered version of set maximal and long match query. Each query is timed and times are written to cout. Whether the outputs match or not is also written. - Running
make 1000GGeneratewill generate indices needed for reproduction. It will take the 1000G Chromosome 21 dataset (stored as a bidirectional GBWT) in chr21.gbwt and remove 100 random haplotypes. The haplotypes will be stored to disk and various structures are built and written to disk. - Running
make 1000GTestwill run the experiments as run in the paper. The indices built by 1000GGenerate will be loaded and then queried with varying versions of set maximal match and long match query. - Any work that uses the 1000G dataset should cite their paper: https://doi.org/10.1038/nature15393.
Finally, the code used to verify the correctness of the queries and structures is provided in verifyStructuresAndQueries.cpp. This code takes as input a GBWT, its associated FastLocate structure, and an integer FastLCP, lf_gbwt::GBWT, and CompText indexes on the provided data. Then it prints info from the built structures. Then it performs set maximal match and long match queries. Finally, it performs the following
- A random path-string is generated.
- All the indexes (
GBWT,FastLocate,FastLCP,lf_gbwt::GBWT,CompText) are rebuilt on the dataset with this path-string added to it. - The indexes implemented here are verified (
FastLCP,lf_gbwt::GBWT,CompText). (Note that only suffix values ofFastLCPare verified.) - A new random path-string is generated.
- All set maximal match query versions (numbered and unnumbered) are run with the new random path-string and new indexes as input. Their outputs are compared (should be equal).
- All long match query versions (numbered and unnumbered) are run with the new random path-string and new indexes as input. Their outputs are compared (should be equal).
The header files here may be reused.
The following header files contain the implementation of indexes.
- fast_lcp.h:
FastLCP - lf_gbwt.h:
lf_gbwt::GBWT - compText.h:
CompText
The following header files contain general utilities.
- testing.h: Functions for testing validity of queries and classes.
- ioHelp.h: Functions for printing information related to queries and classes.
The following header files contain implementation of the queries.
- querySupport.h: General purpose support for both set maximal and long match queries.
- setMaximalMatchQuery.h: Set maximal match query versions.
- longMatchQuery.h: Long match query versions.
Compilation of code including the header files provided in this repository requires the use of the GBWT library (https://github.com/jltsiren/gbwt). The specific version this code was built on is available at https://github.com/jltsiren/gbwt/blob/0bfeb0723bdc71db075aacf99a77704769d56a55. Follow the instructions in the GBWT readme to compile the GBWT library. The GBWT library (and its dependency, vgteam's fork of sdsl-lite) must be linked in order to compile code that uses header files from this repository. Finally note, the requirements are the same as that of the GBWT library: (C++14, OpenMP).
Sample datasets are provided in /toyData/indexes. The raw paths for these datasets are in /toyData/raw/.
Running make in the /test/ directory will compile binaries for each code program described in Archival. Running make test will run basic tests by verifying structures and performing the paper experiment for the small datasets provided in /toyData/. Note, SDSL_DIR in the Makefile should be set to the SDSL directory before compilation. gbwtDir in the Makefile should be set to the directory that contains the build_gbwt binary for generateIndices to work correctly.
- In an empty directory
mkdir GBWT_AND_QUERY; cd GBWT_AND_QUERY - Clone all repos.
git clone https://github.com/jltsiren/gbwtgit clone https://github.com/ucfcbb/GBWT-Querygit clone https://github.com/vgteam/sdsl-lite
- Installation of prereqs locally
cd sdsl-lite./install.sh ../cd ../gbwt./install.sh ../
- Run 1000G Experiment
cd ../GBWT-Query/testmake 1000GGenerate 1000GTest