XStringSet-class {Biostrings} | R Documentation |
XStringSet objects
Description
The BStringSet class is a container for storing a set of BString objects and for making its manipulation easy and efficient.
Similarly, the DNAStringSet (or RNAStringSet, or AAStringSet) class is a container for storing a set of DNAString (or RNAString, or AAString) objects.
All those containers derive directly (and with no additional slots) from the XStringSet virtual class.
Usage
## Constructors:
BStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
DNAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
RNAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
AAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
## Accessor-like methods:
## S4 method for signature 'character'
width(x)
## S4 method for signature 'XStringSet'
nchar(x, type="chars", allowNA=FALSE)
## ... and more (see below)
Arguments
x |
Either a character vector (with no NAs), or an XString, XStringSet or XStringViews object. |
start, end, width |
Either NA, a single integer, or an integer vector. |
use.names |
TRUE or FALSE. Should names be preserved? |
type, allowNA |
Ignored. |
Details
The BStringSet, DNAStringSet, RNAStringSet and AAStringSet functions are constructors that can be used to turn input x into an XStringSet object of the desired base type.
They also allow the user to "narrow" the sequences contained in x via proper use of the start, end and/or width arguments. In this context, "narrowing" means dropping a prefix and/or a suffix of each sequence in x.
The "narrowing" capabilities of these constructors can be illustrated by the following property: if x is a character vector (with no NAs), or an XStringSet (or XStringViews) object, then the 3 following transformations are equivalent:
- BStringSet(x, start=mystart, end=myend, width=mywidth)
- subseq(BStringSet(x), start=mystart, end=myend, width=mywidth)
- BStringSet(subseq(x, start=mystart, end=myend, width=mywidth))
Note that, besides being more convenient, the first form is also more efficient on character vectors.
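The equivalence stated above can be checked directly. The following is a minimal sketch, assuming the Biostrings package is installed; the input vector and the names form1, form2 and form3 are made up for illustration.

```r
library(Biostrings)

# Made-up input: an unnamed character vector with no NAs.
x <- c("ACGTACGTAC", "TTGACCA", "GGGCATTT")

# Keep letters 3 through the next-to-last one of every sequence,
# using each of the 3 forms listed above (with DNAStringSet as the
# constructor instead of BStringSet).
form1 <- DNAStringSet(x, start = 3, end = -2)
form2 <- subseq(DNAStringSet(x), start = 3, end = -2)
form3 <- DNAStringSet(subseq(x, start = 3, end = -2))

as.character(form1)

# The 3 results hold the same sequences.
identical(as.character(form1), as.character(form2))
identical(as.character(form2), as.character(form3))
```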
Accessor-like methods
In the code snippets below, x is an XStringSet object.
- length(x): The number of sequences in x.
- width(x): A vector of non-negative integers containing the number of letters for each element in x. Note that width(x) is also defined for a character vector with no NAs and is equivalent to nchar(x, type="bytes").
- names(x): NULL or a character vector of the same length as x containing a short user-provided description or comment for each element in x. These are the only data in an XStringSet object that can safely be changed by the user. All the other data are immutable! As a general recommendation, the user should never try to modify an object by accessing its slots directly.
- alphabet(x): Return NULL, DNA_ALPHABET, RNA_ALPHABET or AA_ALPHABET depending on whether x is a BStringSet, DNAStringSet, RNAStringSet or AAStringSet object.
- nchar(x): The same as width(x).
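A short sketch of the accessors listed above, assuming the Biostrings package is installed. The sequences and the names probeA, probeB and probeC are invented for the example.

```r
library(Biostrings)

# Made-up named input; the names become names(x).
x <- DNAStringSet(c(probeA = "ACGTT", probeB = "GGA", probeC = "TTGACCAT"))

length(x)     # number of sequences
width(x)      # number of letters in each sequence
nchar(x)      # same values as width(x)
names(x)

# Names are the one part of the object meant to be changed by the user.
names(x)[2] <- "renamed"
names(x)

# For a DNAStringSet, alphabet() returns DNA_ALPHABET.
identical(alphabet(x), DNA_ALPHABET)

# width() is also defined for a plain character vector with no NAs.
width(c("abc", "de"))
```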
Subsequence extraction and related transformations
In the code snippets below, x is a character vector (with no NAs), or an XStringSet (or XStringViews) object.
- subseq(x, start=NA, end=NA, width=NA): Applies subseq on each element in x. See ?subseq for the details. Note that this is similar to what substr does on a character vector. However there are some noticeable differences: (1) the arguments are start and stop for substr; (2) the SEW (start/end/width) interface of subseq is richer (e.g. support for negative start or end values); and (3) subseq checks that the specified start/end/width values are valid, i.e., unlike substr, it throws an error if they define "out of limits" subsequences or subsequences with a negative width.
- narrow(x, start=NA, end=NA, width=NA, use.names=TRUE): Same as subseq. The only differences are: (1) narrow has a use.names argument; and (2) the kinds of objects they work on (IRanges, XStringSet or XStringViews objects for narrow, XVector or XStringSet objects for subseq). But they both work and do the same thing on an XStringSet object.
- threebands(x, start=NA, end=NA, width=NA): Like the method for IRanges objects, the threebands methods for character vectors and XStringSet objects extend the capability of narrow by returning the 3 sets of subsequences (the left, middle and right subsequences) associated with the narrowing operation. See ?threebands in the IRanges package for the details.
- subseq(x, start=NA, end=NA, width=NA) <- value: A vectorized version of the subseq<- method for XVector objects. See ?`subseq<-` for the details.
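The differences between subseq and substr described above can be illustrated with a small sketch. It assumes the Biostrings package is installed; the two sequences and the variable names (last3, bad, y) are made up.

```r
library(Biostrings)

x <- DNAStringSet(c("ACGTACGTAC", "TTGACCAGGT"))

# subseq() and narrow() do the same thing on an XStringSet object.
s <- subseq(x, start = 2, width = 4)
n <- narrow(x, start = 2, width = 4)
identical(as.character(s), as.character(n))

# A negative start counts from the end: keep the last 3 letters.
last3 <- subseq(x, start = -3)
as.character(last3)

# Unlike substr(), subseq() refuses "out of limits" subsequences.
substr("ACGTACGTAC", 5, 20)          # silently truncated by substr()
bad <- tryCatch(subseq(x, start = 5, end = 20), error = function(e) e)
inherits(bad, "error")

# Vectorized replacement: overwrite the first 2 letters of every sequence.
y <- x
subseq(y, start = 1, end = 2) <- "NN"
as.character(y)
```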
Subsetting and appending
In the code snippets below, x and values are XStringSet objects, and i should be an index specifying the elements to extract.
- x[i]: Return a new XStringSet object made of the selected elements.
- x[[i]]: Extract the i-th XString object from x.
- append(x, values, after=length(x)): Add sequences in values to x.
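A brief sketch of subsetting and appending, assuming the Biostrings package is installed. The sequences and the names s1 to s4 are invented for illustration.

```r
library(Biostrings)

x <- DNAStringSet(c(s1 = "ACGT", s2 = "TTGA", s3 = "GGCC"))
values <- DNAStringSet(c(s4 = "AAAA"))

# Single brackets return an XStringSet of the same class.
sub <- x[c(1, 3)]
sub

# Double brackets extract one XString object (here a DNAString).
second <- x[[2]]
second

# append() adds at the end by default; 'after' picks the insertion point.
at_end <- append(x, values)
after_first <- append(x, values, after = 1)
names(after_first)
```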
Set operations
In the code snippets below, x and y are XStringSet objects.
- union(x, y): Union of x and y.
- intersect(x, y): Intersection of x and y.
- setdiff(x, y): Asymmetric set difference of x and y.
- setequal(x, y): Set equality of x to y.
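The set operations can be tried on two small overlapping sets. This is a sketch under the assumption that the Biostrings package is installed; the sequences are made up, and the order of the elements in each result is not relied upon.

```r
library(Biostrings)

x <- DNAStringSet(c("ACGT", "TTGA", "GGCC"))
y <- DNAStringSet(c("GGCC", "ACGT", "CATG"))

u <- union(x, y)       # sequences found in x or y
i <- intersect(x, y)   # sequences found in both
d <- setdiff(x, y)     # sequences found in x but not in y

sort(as.character(u))
sort(as.character(i))
as.character(d)

setequal(x, y)         # the two sets differ
setequal(x, rev(x))    # element order does not matter
```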
Other methods
In the code snippets below, x is an XStringSet object.
- unlist(x): Turns x into an XString object by combining the sequences in x together. Fast equivalent to do.call(c, as.list(x)).
- as.character(x, use.names=TRUE): Converts x to a character vector of the same length as x. The use.names argument controls whether or not names(x) should be propagated to the names of the returned vector.
- as.factor(x): Converts x to a factor, via as.character(x).
- as.matrix(x, use.names=TRUE): Returns a character matrix containing the "exploded" representation of the strings. Can only be used on an XStringSet object with equal-width strings. The use.names argument controls whether or not names(x) should be propagated to the row names of the returned matrix.
- toString(x): Equivalent to toString(as.character(x)).
- show(x): By default the show method displays 5 head and 5 tail lines. The number of lines can be altered by setting the global options showHeadLines and showTailLines. If the object length is less than the sum of the options, the full object is displayed. These options affect GRanges, GAlignments, IRanges, and XStringSet objects.
Display
The letters in a DNAStringSet or RNAStringSet object are colored when displayed by the show() method. Set global option Biostrings.coloring to FALSE to turn off this coloring.
Author(s)
H. Pagès
See Also
- readDNAStringSet and writeXStringSet for reading/writing a DNAStringSet object (or other XStringSet derivative) from/to a FASTA or FASTQ file.
- XString objects.
- XStringViews objects.
- XStringSetList objects.
- XVectorList objects.
Examples
## ---------------------------------------------------------------------
## A. USING THE XStringSet CONSTRUCTORS ON A CHARACTER VECTOR OR FACTOR
## ---------------------------------------------------------------------
## Note that there is no XStringSet() constructor, but an XStringSet
## family of constructors: BStringSet(), DNAStringSet(), RNAStringSet(),
## etc...
x0 <- c("#CTC-NACCAGTAT", "#TTGA", "TACCTAGAG")
width(x0)
x1 <- BStringSet(x0)
x1
## 3 equivalent ways to obtain the same BStringSet object:
BStringSet(x0, start=4, end=-3)
subseq(x1, start=4, end=-3)
BStringSet(subseq(x0, start=4, end=-3))
dna0 <- DNAStringSet(x0, start=4, end=-3)
dna0 # 'options(Biostrings.coloring=FALSE)' to turn off coloring
names(dna0)
names(dna0)[2] <- "seqB"
dna0
## When the input vector contains a lot of duplicates, turning it into
## a factor first before passing it to the constructor will produce an
## XStringSet object that is more compact in memory:
library(hgu95av2probe)
x2 <- sample(hgu95av2probe$sequence, 999000, replace=TRUE)
dna2a <- DNAStringSet(x2)
dna2b <- DNAStringSet(factor(x2)) # slower but result is more compact
object.size(dna2a)
object.size(dna2b)
## ---------------------------------------------------------------------
## B. USING THE XStringSet CONSTRUCTORS ON A SINGLE SEQUENCE (XString
## OBJECT OR CHARACTER STRING)
## ---------------------------------------------------------------------
x3 <- "abcdefghij"
BStringSet(x3, start=2, end=6:2) # behaves like 'substring(x3, 2, 6:2)'
BStringSet(x3, start=-(1:6))
x4 <- BString(x3)
BStringSet(x4, end=-(1:6), width=3)
## Randomly extract 1 million 40-mers from C. elegans chrI:
extractRandomReads <- function(subject, nread, readlength)
{
    if (!is.integer(readlength))
        readlength <- as.integer(readlength)
    start <- sample(length(subject) - readlength + 1L, nread,
                    replace=TRUE)
    DNAStringSet(subject, start=start, width=readlength)
}
library(BSgenome.Celegans.UCSC.ce2)
rndreads <- extractRandomReads(Celegans$chrI, 1000000, 40)
## Notes:
## - This takes only 2 or 3 seconds versus several hours for a solution
##   using substring() on a standard character string.
## - The short sequences in 'rndreads' can be seen as the result of a
##   simulated high-throughput sequencing experiment. A non-realistic
##   one though because:
##     (a) It assumes that the underlying technology is perfect (the
##         generated reads have no technology induced errors).
##     (b) It assumes that the sequenced genome is exactly the same as the
##         reference genome.
##     (c) The simulated reads can contain IUPAC ambiguity letters only
##         because the reference genome contains them. In a real
##         high-throughput sequencing experiment, the sequenced genome
##         of course doesn't contain those letters, but the sequencer
##         can introduce them in the generated reads to indicate ambiguous
##         base-calling.
##     (d) The simulated reads come from the plus strand only of a single
##         chromosome.
## - See the getSeq() function in the BSgenome package for how to
##   circumvent (d) i.e. how to generate reads that come from the whole
##   genome (plus and minus strands of all chromosomes).
## ---------------------------------------------------------------------
## C. USING THE XStringSet CONSTRUCTORS ON AN XStringSet OBJECT
## ---------------------------------------------------------------------
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
probes
RNAStringSet(probes, start=2, end=-5) # does NOT copy the sequence data!
## ---------------------------------------------------------------------
## D. USING THE XStringSet CONSTRUCTORS ON AN ORDINARY list OF XString
## OBJECTS
## ---------------------------------------------------------------------
probes10 <- head(probes, n=10)
set.seed(33)
shuffled_nucleotides <- lapply(probes10, sample)
shuffled_nucleotides
DNAStringSet(shuffled_nucleotides) # does NOT copy the sequence data!
## Note that the same result can be obtained in a more compact way with
## just:
set.seed(33)
endoapply(probes10, sample)
## ---------------------------------------------------------------------
## E. USING subseq() ON AN XStringSet OBJECT
## ---------------------------------------------------------------------
subseq(probes, start=2, end=-5)
subseq(probes, start=13, end=13) <- "N"
probes
## Add/remove a prefix:
subseq(probes, start=1, end=0) <- "--"
probes
subseq(probes, end=2) <- ""
probes
## Do more complicated things:
subseq(probes, start=4:7, end=7) <- c("YYYY", "YYY", "YY", "Y")
subseq(probes, start=4, end=6) <- subseq(probes, start=-2:-5)
probes
## ---------------------------------------------------------------------
## F. UNLISTING AN XStringSet OBJECT
## ---------------------------------------------------------------------
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
unlist(probes)
## ---------------------------------------------------------------------
## G. COMPACTING AN XStringSet OBJECT
## ---------------------------------------------------------------------
## As a particular type of XVectorList objects, XStringSet objects can
## optionally be compacted. Compacting is done typically before
## serialization. See ?compact for more information.
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
y <- subseq(probes[1:12], start=5)
probes@pool
y@pool
object.size(probes)
object.size(y)
y0 <- compact(y)
y0@pool
object.size(y0)