Unpaired OAS Help

About OAS

The Observed Antibody Space (OAS) database collates over a billion unique antibody sequences from more than 75 studies. The data are available for bulk download, or you can filter the sequences by metadata parameters using our search form, which can be found here.

OAS can be filtered by attributes such as species, isotype and chain type. The fields are independent of one another, so it is possible to choose a combination that does not exist in our database (for instance, specifying an isotype together with light chain, which is impossible because the isotype is determined by the heavy chain).

Data and Downloads

Data in the OAS database is organized into studies, which are in turn sub-divided into data-units. A single data-unit is a set of sequences uniquely identified by its metadata. The metadata parameters are:

  • Author: First author and date of publication.
  • Link: Link to the publication describing the study.
  • Run: Run ID the sequence is derived from.
  • Subject: Indicates whether the B-cells can be traced back to a particular individual.
  • Species: Organism of the B-cell donor.
  • Chain: Heavy/light chain annotation.
  • Isotype: Identified isotype information.
  • Age: Information on the age of the human B-cell donors.
  • Disease: Indicates whether the donor was sick at the time of B-cell extraction.
  • Vaccine: Indicates whether the B-cell donor was purposely immunized prior to B-cell extraction.
  • B-cell subset: Indicates whether a particular B-cell subset was sorted for Ig-seq.
  • B-cell source: The organ/tissue the B-cells were extracted from.
  • Longitudinal: If the study was conducted over a period of time, indicates the particular timepoint at which the B-cells were sourced.
  • Total sequences: Number of sequences in the data-unit, including duplicates.
  • Unique sequences: Number of non-redundant sequences in the data-unit.
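
As an illustration of how metadata identifies a data-unit, the sketch below selects the data-units whose metadata match every requested value. The dictionary keys and example values are hypothetical stand-ins for the parameters listed above, not necessarily the exact keys used in the OAS files. Note that a combination that does not exist simply matches nothing.

```python
def select_data_units(data_units, **criteria):
    """Return the data-units whose metadata equal every requested value."""
    return [
        unit for unit in data_units
        if all(unit.get(field) == wanted for field, wanted in criteria.items())
    ]


# Hypothetical metadata records; keys and values are illustrative only.
example_units = [
    {"Species": "human", "Chain": "Heavy", "Isotype": "IGHG"},
    {"Species": "human", "Chain": "Light", "Isotype": "Bulk"},
    {"Species": "mouse", "Chain": "Heavy", "Isotype": "IGHM"},
]

human_heavy = select_data_units(example_units, Species="human", Chain="Heavy")
# No light-chain record carries a heavy-chain isotype, so this list is empty.
impossible = select_data_units(example_units, Chain="Light", Isotype="IGHG")
```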

After searching, you can download all data-units that match your criteria. The download takes the form of a shell script containing consecutive wget commands. To download all the data, run the following commands. Because the total size can exceed 500 GB, you might want to download sets of data-units at a time:

chmod u+rx bulk_download.sh
./bulk_download.sh
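
Since the full download can be very large, one way to fetch sets of data-units at a time is to split the script into smaller batches. The sketch below assumes that each download is a single line beginning with wget, as described above; the helper names and the batch size are illustrative and not part of OAS.

```python
import shlex
import subprocess


def wget_batches(script_text, batch_size):
    """Group the wget lines of a download script into lists of batch_size."""
    commands = [
        line.strip() for line in script_text.splitlines()
        if line.strip().startswith("wget ")
    ]
    return [commands[i:i + batch_size] for i in range(0, len(commands), batch_size)]


def run_batch(commands):
    """Run one batch of wget commands, stopping at the first failure."""
    for command in commands:
        subprocess.run(shlex.split(command), check=True)


# Usage (downloads only the first 50 data-units):
#   with open("bulk_download.sh") as handle:
#       batches = wget_batches(handle.read(), batch_size=50)
#   run_batch(batches[0])
```

Running one batch at a time makes it easier to check the available disk space between batches and to resume after a failed download.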

A series of .csv.gz files will then be downloaded to your current directory. In each .csv.gz file, the first line contains the metadata for the data-unit, and the following lines contain the sequences and their annotations.

The contents of each data-unit file can look as below:

Column                    Example 1               Example 2
sequence                  ACGGAGGTTTCT...         TGAAAACAACCT...
locus                     IGH                     IGH
stop_codon                F                       F
vj_in_frame               T                       T
productive                T                       T
rev_comp                  F                       F
v_call                    IGHV9-3*01              IGHV1-81*01
d_call                    IGHD2-4*01              IGHD1-1*01
j_call                    IGHJ3*01                IGHJ1*03
sequence_alignment        CAGATCCAGTTGG...        CAGGTTCAGCTGC...
germline_alignment        CAGATCCAGTTG...         CAGGTTCAGCT...
sequence_alignment_aa     QIQLVQSGPELKKPG...      QVQLQQSGAELARP...
germline_alignment_aa     QIQLVQSGPELKKPG...      QVQLQQSGAELARP...
v_alignment_start         1.0                     1.0
v_alignment_end           290.0                   294.0
d_alignment_start         295.0                   305.0
d_alignment_end           303.0                   318.0
j_alignment_start         306.0                   322.0
j_alignment_end           349.0                   373.0
v_sequence_alignment      CAGATCCAGTTGGT...       CAGGTTCAGCTGC...
v_sequence_alignment_aa   QIQLVQSGPELKKPGETV...   QVQLQQSGAELARPGAS...
v_germline_alignment      CAGATCCAGTTGGT...       CAGGTTCAGCTGCA...
v_germline_alignment_aa   QIQLVQSGPELKKPGE...     QVQLQQSGAELARPGA...
d_sequence_alignment      GATTACGAC               TTTATTACTAC...
d_sequence_alignment_aa   DYD                     YYYG
d_germline_alignment      GATTACGAC               TTTATTACTACGGT
d_germline_alignment_aa   DYD                     YYYG
j_sequence_alignment      GTTTGCTTACTGGG...       TACTGGTACTTCGAT...
j_sequence_alignment_aa   FAYWGQGTLVTVSA          YWYFDVWGTGTTVTVSS
j_germline_alignment      GTTTGCTTACTGGG...       TACTGGTACTTCGA...
j_germline_alignment_aa   FAYWGQGTLVTVSA          YWYFDVWGTGTTVTVSS
fwr1                      CAGATCCAG...            CAGGTTCAG...
fwr1_aa                   QIQLVQ...               QVQLQ...
cdr1                      GGGTATAC...             GGCTACAC...
cdr1_aa                   GYTFTTYG                GYTFTSYG
fwr2                      ATGAGCT...              ATAAGCT...
fwr2_aa                   MSWVKQ...               ISWVKQR...
cdr2                      ATAAAC...               ATTTATC...
cdr2_aa                   INTYSGVP                IYLRSGNT
fwr3                      ACATATG...              TACTACA...
fwr3_aa                   TYVDDFK...              YYNEKFK...
cdr3                      GCCCCC...               GCAAGA...
cdr3_aa                   APDYDEFAY               ARWERFY...
junction                  TGTGCCC...              TGTGCAA...
junction_length           33.0                    57.0
junction_aa               CAPDYDE...              CARWERFYY...
junction_aa_length        11.0                    19.0
v_score                   450.572                 456.805
d_score                   17.992                  27.605
j_score                   85.286                  100.667
v_cigar                   309S290M150S4N          115S294M170S
d_cigar                   603S8N9M137S            419S14M146S9N
j_cigar                   614S4N44M91S            436S1N52M91S
v_support                 2.040000e-128           2.073000e-130
d_support                 0.839300                0.000825
j_support                 1.019000e-20            1.897000e-25
v_identity                99.655                  99.660
d_identity                100.0                   100.0
j_identity                100.0                   100.0
v_sequence_start          310.0                   116.0
v_sequence_end            599.0                   409.0
v_germline_start          1.0                     1.0
v_germline_end            290.0                   294.0
d_sequence_start          604.0                   420.0
d_sequence_end            612.0                   433.0
d_germline_start          9.0                     1.0
d_germline_end            17.0                    14.0
j_sequence_start          615.0                   437.0
j_sequence_end            658.0                   488.0
j_germline_start          5.0                     2.0
j_germline_end            48.0                    53.0
fwr1_start                310.0                   116.0
fwr1_end                  384.0                   190.0
cdr1_start                385.0                   191.0
cdr1_end                  408.0                   214.0
fwr2_start                409.0                   215.0
fwr2_end                  459.0                   265.0
cdr2_start                460.0                   266.0
cdr2_end                  483.0                   289.0
fwr3_start                484.0                   290.0
fwr3_end                  597.0                   403.0
cdr3_start                598.0                   404.0
cdr3_end                  624.0                   454.0
np1                       CCCC                    TGGGAG...
np1_length                4.0                     10.0
np2                       GA                      TCT
np2_length                2.0                     3.0
c_region                  GAGCGCGCG               GAGCGCGCG
Redundancy                7                       2223

ANARCI_numbering
  Example 1: {'fwh1': {'2': 'I', '3': 'Q' ...}, 'cdrh1': {'27': 'G', '28': 'Y',...} ...'cdrh3': {'105': 'A', '106': 'P', ...},
  Example 2: {'fwh1': {'1 ': 'E', '2 ': 'V', '3 ': 'Q', '4': ... }, 'cdrh1': {'27': 'G', '28': 'Y', '29': ...} ... cdrh3': {'105': 'A', '106': 'R', '107': ...}

ANARCI_status
  Example 1: ||||Shorter than IMGT defined: fw1, fw4|
  Example 2: ||||Shorter than IMGT defined: fw4|
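
The ANARCI_numbering column stores the numbered residues as a nested dictionary, keyed first by region and then by position. A minimal sketch of turning such a value back into an amino-acid string is shown below. It assumes that the cell holds a complete Python-style dictionary literal (the excerpts above are truncated) and that regions and positions are stored in sequence order; the function name and the toy value are illustrative.

```python
import ast


def numbering_to_sequence(anarci_numbering):
    """Concatenate the residues of an ANARCI_numbering cell, region by region."""
    regions = ast.literal_eval(anarci_numbering)
    # Dictionaries keep insertion order, so residues come out as stored.
    return "".join(
        residue
        for positions in regions.values()
        for residue in positions.values()
    )


# Toy value in the same shape as the column; not a real OAS record.
toy = "{'fwh1': {'1 ': 'E', '2 ': 'V'}, 'cdrh1': {'27': 'G', '28': 'Y'}}"
toy_sequence = numbering_to_sequence(toy)  # "EVGY"
```

ast.literal_eval is used rather than eval because it only accepts literals, which is safer for data read from a file.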

A convenient way to parse the .csv.gz files is to extract the metadata as a JSON object and the sequences as a pandas DataFrame.
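
A minimal sketch of such a parser is given below. It assumes that the first line holds the metadata as JSON (possibly wrapped in CSV quoting), that the second line holds the column names, and that pandas is installed; the function name read_data_unit is illustrative.

```python
import csv
import gzip
import json

import pandas as pd


def read_data_unit(path):
    """Return (metadata dict, sequences DataFrame) for one .csv.gz data-unit."""
    # Read only the first line and undo any CSV quoting around the JSON text.
    with gzip.open(path, "rt", newline="") as handle:
        first_row = next(csv.reader(handle))
    metadata = json.loads(",".join(first_row))

    # header=1 skips the metadata line and takes the second line as column names.
    sequences = pd.read_csv(path, header=1, compression="gzip")
    return metadata, sequences


# Usage, with a hypothetical file name:
#   metadata, sequences = read_data_unit("data_unit.csv.gz")
#   productive = sequences[sequences["productive"] == "T"]
```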


Downloading using Safari: In some cases Safari automatically unzips the .csv.gz files when downloading. This automatic unzipping can cause problems with the downloaded file and it is therefore recommended to either turn off automatic unzipping or use a different browser (e.g. Firefox or Chrome).
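
If a browser has already unzipped a download, the file may keep its .csv.gz name while no longer being gzip-compressed. One way to cope, sketched below, is to check for the two-byte gzip signature (0x1f 0x8b) before opening the file. The helper name open_data_unit is illustrative, and this does not repair a file that was damaged during download.

```python
import gzip


def open_data_unit(path):
    """Open a data-unit as text, whether or not it is still gzip-compressed."""
    # Every gzip file starts with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")
```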

Contact

If you would like to contact us about anything related to OAS, please drop an email to oas_opig@stats.ox.ac.uk.

Updated OAS paper: Olsen, T.H., Boyles, F., and Deane, C.M. (2021). Protein Science. [link]

The current OAS is an update of the database described in the previous paper: Kovaltsuk, A., Leem, J., et al. (2018). J. Immunol. [link]