The Observed Antibody Space (OAS) database collates over a billion unique antibody sequences from over 80 different studies. The data are available for bulk download, or you can filter the sequences by metadata parameters using our search form.
OAS can be filtered by attributes such as species, isotype and chain type. The fields are not mutually constrained, so it is possible to choose a combination that does not exist in our database (for instance, specifying an isotype together with the light chain, which is impossible).
Data in the OAS database are organized into studies, which are in turn subdivided into data-units. A single data-unit is a set of sequences uniquely identified by its metadata. The meta-parameters are:
After searching, there is an option to download all data-units that match your criteria. The download takes the form of a shell script containing consecutive wget commands. To download all the data, run the following commands (you might want to download the data-units in sets, as the total size can exceed 700 GB):
chmod u+rx bulk_download.sh
./bulk_download.sh
A series of .csv.gz files will then be downloaded to your current directory. The first line of each .csv.gz file contains the metadata for the data-unit, and the following lines contain the sequences and their annotations.
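Because the download script is a list of consecutive wget commands, it can be split into smaller sets before running. The following is a minimal sketch of that idea, not part of OAS itself: the function name `batch_commands` is hypothetical, and it assumes that every download command sits on its own line beginning with `wget`.

```python
def batch_commands(script_text, batch_size):
    """Group the wget lines of a download script into batches.

    Assumption: each download is a single line that starts with "wget".
    Any other line (shebang, comments, blank lines) is ignored.
    """
    commands = [
        line.strip()
        for line in script_text.splitlines()
        if line.strip().startswith("wget")
    ]
    # Slice the command list into consecutive chunks of batch_size.
    return [
        commands[start:start + batch_size]
        for start in range(0, len(commands), batch_size)
    ]
```

Each returned batch is a list of command strings. It can be written to its own shell script and run separately, which keeps the disk usage of any one run bounded.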
An example of the contents of a data-unit file is shown below:
sequence | locus | stop_codon | vj_in_frame | productive | rev_comp | v_call | d_call | j_call | sequence_alignment | germline_alignment | sequence_alignment_aa | germline_alignment_aa | v_alignment_start | v_alignment_end | d_alignment_start | d_alignment_end | j_alignment_start | j_alignment_end | v_sequence_alignment | v_sequence_alignment_aa | v_germline_alignment | v_germline_alignment_aa | d_sequence_alignment | d_sequence_alignment_aa | d_germline_alignment | d_germline_alignment_aa | j_sequence_alignment | j_sequence_alignment_aa | j_germline_alignment | j_germline_alignment_aa | fwr1 | fwr1_aa | cdr1 | cdr1_aa | fwr2 | fwr2_aa | cdr2 | cdr2_aa | fwr3 | fwr3_aa | cdr3 | cdr3_aa | junction | junction_length | junction_aa | junction_aa_length | v_score | d_score | j_score | v_cigar | d_cigar | j_cigar | v_support | d_support | j_support | v_identity | d_identity | j_identity | v_sequence_start | v_sequence_end | v_germline_start | v_germline_end | d_sequence_start | d_sequence_end | d_germline_start | d_germline_end | j_sequence_start | j_sequence_end | j_germline_start | j_germline_end | fwr1_start | fwr1_end | cdr1_start | cdr1_end | fwr2_start | fwr2_end | cdr2_start | cdr2_end | fwr3_start | fwr3_end | cdr3_start | cdr3_end | np1 | np1_length | np2 | np2_length | c_region | Redundancy | ANARCI_numbering | ANARCI_status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ACGGAGGTTTCT... | IGH | F | T | T | F | IGHV9-3*01 | IGHD2-4*01 | IGHJ3*01 | CAGATCCAGTTGG... | CAGATCCAGTTG... | QIQLVQSGPELKKPG... | QIQLVQSGPELKKPG... | 1.0 | 290.0 | 295.0 | 303.0 | 306.0 | 349.0 | CAGATCCAGTTGGT... | QIQLVQSGPELKKPGETV... | CAGATCCAGTTGGT... | QIQLVQSGPELKKPGE... | GATTACGAC | DYD | GATTACGAC | DYD | GTTTGCTTACTGGG... | FAYWGQGTLVTVSA | GTTTGCTTACTGGG... | FAYWGQGTLVTVSA | CAGATCCAG... | QIQLVQ... | GGGTATAC... | GYTFTTYG | ATGAGCT... | MSWVKQ... | ATAAAC... | INTYSGVP | ACATATG... | TYVDDFK... | GCCCCC... | APDYDEFAY | TGTGCCC... | 33.0 | CAPDYDE... | 11.0 | 450.572 | 17.992 | 85.286 | 309S290M150S4N | 603S8N9M137S | 614S4N44M91S | 2.040000e-128 | 0.839300 | 1.019000e-20 | 99.655 | 100.0 | 100.0 | 310.0 | 599.0 | 1.0 | 290.0 | 604.0 | 612.0 | 9.0 | 17.0 | 615.0 | 658.0 | 5.0 | 48.0 | 310.0 | 384.0 | 385.0 | 408.0 | 409.0 | 459.0 | 460.0 | 483.0 | 484.0 | 597.0 | 598.0 | 624.0 | CCCC | 4.0 | GA | 2.0 | GAGCGCGCG | 7 | {'fwh1': {'2': 'I', '3': 'Q' ...}, 'cdrh1': {'27': 'G', '28': 'Y',...} ...'cdrh3': {'105': 'A', '106': 'P', ...}, | \|\|\|\|Shorter than IMGT defined: fw1, fw4\| |
TGAAAACAACCT... | IGH | F | T | T | F | IGHV1-81*01 | IGHD1-1*01 | IGHJ1*03 | CAGGTTCAGCTGC... | CAGGTTCAGCT... | QVQLQQSGAELARP... | QVQLQQSGAELARP... | 1.0 | 294.0 | 305.0 | 318.0 | 322.0 | 373.0 | CAGGTTCAGCTGC... | QVQLQQSGAELARPGAS... | CAGGTTCAGCTGCA... | QVQLQQSGAELARPGA... | TTTATTACTAC... | YYYG | TTTATTACTACGGT | YYYG | TACTGGTACTTCGAT... | YWYFDVWGTGTTVTVSS | TACTGGTACTTCGA... | YWYFDVWGTGTTVTVSS | CAGGTTCAG... | QVQLQ... | GGCTACAC... | GYTFTSYG | ATAAGCT... | ISWVKQR... | ATTTATC... | IYLRSGNT | TACTACA... | YYNEKFK... | GCAAGA... | ARWERFY... | TGTGCAA... | 57.0 | CARWERFYY... | 19.0 | 456.805 | 27.605 | 100.667 | 115S294M170S | 419S14M146S9N | 436S1N52M91S | 2.073000e-130 | 0.000825 | 1.897000e-25 | 99.660 | 100.0 | 100.0 | 116.0 | 409.0 | 1.0 | 294.0 | 420.0 | 433.0 | 1.0 | 14.0 | 437.0 | 488.0 | 2.0 | 53.0 | 116.0 | 190.0 | 191.0 | 214.0 | 215.0 | 265.0 | 266.0 | 289.0 | 290.0 | 403.0 | 404.0 | 454.0 | TGGGAG... | 10.0 | TCT | 3.0 | GAGCGCGCG | 2223 | {'fwh1': {'1 ': 'E', '2 ': 'V', '3 ': 'Q', '4': ... }, 'cdrh1': {'27': 'G', '28': 'Y', '29': ...} ... cdrh3': {'105': 'A', '106': 'R', '107': ...} | \|\|\|\|Shorter than IMGT defined: fw4\| |
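The ANARCI_numbering column in the example above holds a nested dictionary written as text, keyed first by region (for example `cdrh3`) and then by position. As a rough sketch, assuming the full, unabbreviated value is a valid Python dictionary literal and that positions are stored in sequence order, the residues of one region can be recovered as follows. The function name `region_sequence` is hypothetical.

```python
import ast


def region_sequence(numbering_text, region):
    """Return the amino acids of one region from an ANARCI_numbering value.

    Assumption: numbering_text is a Python dict literal of the form
    {region: {position: residue, ...}, ...}, as in the example rows.
    """
    # literal_eval parses the text safely, without executing any code.
    numbering = ast.literal_eval(numbering_text)
    # Dictionaries keep insertion order, so the residues come out
    # in the order in which the positions were written.
    return "".join(numbering[region].values())
```

For instance, calling it with the text `{'cdrh3': {'105': 'A', '106': 'P'}}` and the region `cdrh3` gives `AP`.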
A convenient way to parse the .csv.gz files is to extract the metadata as a JSON object and the sequences as a pandas DataFrame.
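One way to do this is sketched below. It is an illustration rather than official OAS code, and it makes two assumptions about the file layout: the first line is the metadata encoded as JSON (possibly wrapped in CSV quoting), and the second line holds the column names shown in the example table. The function name `read_data_unit` is hypothetical.

```python
import csv
import gzip
import json

import pandas as pd


def read_data_unit(path):
    """Return (metadata, sequences) for one data-unit .csv.gz file."""
    # Read the first line as a CSV row so that any CSV quoting is undone,
    # then rejoin the fields in case unquoted JSON was split on commas.
    with gzip.open(path, "rt", newline="") as handle:
        first_row = next(csv.reader(handle))
    metadata = json.loads(",".join(first_row))

    # header=1 tells pandas that the second line holds the column names,
    # which skips the metadata line. The .gz extension lets pandas
    # decompress the file automatically.
    sequences = pd.read_csv(path, header=1)
    return metadata, sequences
```

The metadata comes back as an ordinary dictionary and the sequences as a DataFrame with one row per sequence, so columns such as `cdr3_aa` or `Redundancy` can be selected directly.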
Downloading using Safari: In some cases Safari automatically unzips the .csv.gz files when downloading. This automatic unzipping can cause problems with the downloaded file and it is therefore recommended to either turn off automatic unzipping or use a different browser (e.g. Firefox or Chrome).
If you would like to contact us about anything related to OAS, please drop an email to oas_opig@stats.ox.ac.uk.