gperc Data¶
Data Reader¶
This module contains the data readers for gperc. The default data reader is gperc.BinaryConsumer, which reads the raw bytes of every file it is given and splits them into samples based largely on the size of each file. You can pass extra metadata to the data readers as a dictionary or a list. For more information read below.
Though there is some code added below, it does not work yet. I have added it here because it shows the general direction this is supposed to progress towards. The general idea is that the trainer should be able to select the kind of data it wants to use, which means there needs to be a structured way to represent and fetch the information. This is done as follows:
The input data F can be loaded in 4 different styles (F0 to F3), as given in the documentation below.
The fetching I can happen in 7 different styles (I0 to I6), as given in the documentation below.
I am following the same system that I have implemented in nbox.Parsers. Here is a quick brief on primitives P and structures S:
P are the basic data types that are used in the data. This is the actual data you want your model to process.
S are the structures that are used to represent the data. This is how data is organised.
In our day to day life, what we call data is nothing but an interplay of one P and many S.
Raw Bytes Tokenization¶
I am choosing to read raw bytes instead of tokenizing text, which is closer to how computer programs themselves handle files.
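As an illustration of reading bytes as tokens, here is a minimal sketch that groups raw bytes into integer tokens of n_bytes each (n_bytes is described just below). The helper name bytes_to_tokens, the big-endian byte order and the zero padding of the tail are assumptions made for this example, not something taken from gperc itself.

```python
def bytes_to_tokens(raw: bytes, n_bytes: int = 2) -> list:
    """Group raw bytes into integer tokens made of n_bytes each (hypothetical helper)."""
    # Assumption: pad the tail with zero bytes so the length is a multiple of n_bytes.
    remainder = len(raw) % n_bytes
    if remainder:
        raw = raw + b"\x00" * (n_bytes - remainder)
    # Assumption: big-endian byte order inside each token.
    return [
        int.from_bytes(raw[i:i + n_bytes], "big")
        for i in range(0, len(raw), n_bytes)
    ]


tokens = bytes_to_tokens(b"gperc", n_bytes=2)
```

With n_bytes = 2 every token is an integer below 2 ** 16, whatever the file type is.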
A good way to measure the amount of information processed per sample is bytes_processed = n_bytes * seqlen, e.g. 2 * 8192 = 16KB.
n_bytes defines the total number of tokens as n_tokens = 2 ** (n_bytes * 8) => (256, 65536, 16777216, ...), which is the number of distinct values that n_bytes * 8 bits can represent.
This should not be confused with the memory footprint, which is going to be larger since each int is 64 bits (i64 in Rust).
In the above sample with batch_size = 20, we have processed 320KB, the same as the combined L1 cache on Apple M1, which has 192 KB of L1 instruction cache and 128 KB of L1 data cache.
Total tokens processed would be 20 * 8192 = 163840 in each batch, and with i64 that means a memory footprint of 163840 * 64 bits = 163840 * 8 bytes ~ 1.25MB.
Wrapping up, that means we are processing 320KB of data in a 1.25MB memory footprint (which is a 4x memory requirement).
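The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name footprint and the keys of the returned dict are made up for this example, and int_bits = 64 mirrors the i64 assumption in the text.

```python
def footprint(seqlen: int, n_bytes: int, batch_size: int, int_bits: int = 64) -> dict:
    """Reproduce the back-of-the-envelope numbers (hypothetical helper)."""
    bytes_per_sample = n_bytes * seqlen                 # raw data covered by one sample
    bytes_per_batch = bytes_per_sample * batch_size     # raw data covered by one batch
    tokens_per_batch = seqlen * batch_size              # one integer per token
    memory_bytes = tokens_per_batch * int_bits // 8     # each token stored as a 64-bit int
    return {
        "n_tokens": 2 ** (n_bytes * 8),                 # vocabulary size
        "bytes_per_sample": bytes_per_sample,
        "bytes_per_batch": bytes_per_batch,
        "tokens_per_batch": tokens_per_batch,
        "memory_bytes": memory_bytes,
        "overhead": memory_bytes / bytes_per_batch,
    }


stats = footprint(seqlen=8192, n_bytes=2, batch_size=20)
```

The overhead is simply 8 / n_bytes when tokens are stored as 64-bit integers, so it would grow to 8x with n_bytes = 1.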
Internal Representation¶
This is what we have to do: full_meta is not a good way to access individual elements in the batch, so we need to convert it to a more convenient internal representation. Consider full_meta as a table, which would look like this:
class | filepath | size (in bytes) |
---|---|---|
cat | f1 | 137 |
cat | f2 | 417 |
cat | f3 | 139 |
dog | f4 | 123 |
dog | f5 | 52 |
dog | f6 | 390 |
The batches with seqlen = 128 and n_bytes = 1 would look like a flat array with items like this:
batches = [
([f1, 0, 128],),
([f1, 128, 137],
[f2, 0, 119],),
([f2, 119, 247],),
([f2, 247, 375],),
([f2, 375, 417],
[f3, 0, 86],),
...
]
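A minimal sketch of how such spans could be computed from the table is given below. It is not the gperc implementation: the name pack_spans is hypothetical, class boundaries are ignored, and no extra position is reserved for the <EOF> tag, so each sample covers exactly seqlen bytes (except possibly the last one).

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (filepath, start_byte, end_byte)


def pack_spans(files: List[Tuple[str, int]], seqlen: int) -> List[Tuple[Span, ...]]:
    """Greedily pack (filepath, size) pairs into samples of seqlen bytes."""
    samples = []
    current = []      # spans collected for the sample being built
    room = seqlen     # bytes still free in the current sample
    for path, size in files:
        start = 0
        while start < size:
            take = min(room, size - start)
            current.append((path, start, start + take))
            start += take
            room -= take
            if room == 0:  # sample is full, start a new one
                samples.append(tuple(current))
                current, room = [], seqlen
    if current:            # keep the final, possibly shorter, sample
        samples.append(tuple(current))
    return samples


files = [("f1", 137), ("f2", 417), ("f3", 139), ("f4", 123), ("f5", 52), ("f6", 390)]
batches = pack_spans(files, seqlen=128)
```

Storing only (filepath, start, end) keeps this index small, since no file content is read until a sample is actually requested.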
System Footprint¶
gperc.Consumer
is a powerful data ingestion system, and as a side effect will create new files
on your system, in order to avoid
Documentation¶
- gperc.data.binary_sample_generator(meta, seqlen, n_bytes)[source]¶
This function takes in the filesystem metadata generated by the main Consumer class, along with seqlen and n_bytes, and returns all the samples in the dataset. Each sample looks like the entries of batches shown in the Internal Representation section: a tuple of [filepath, start, end] spans.
The logic of the code is as follows: for each class, go over its files. For each file, check the total number of bytes in the file and keep adding the above tuple while the total number of bytes < seqlen. At the end of each file we increment the current buffer by 1 to account for the “<EOF>” tag.
- gperc.data.diff_sample_generator(meta)[source]¶
This function takes in the filesystem metadata generated by the main Consumer class, seqlen, n_bytes and returns all the samples in the dataset. Each sample looks like this:
The logic of the code is as follows: for each class, go over its files. For each file, check the total number of bytes in the file and keep adding the above tuple while the total number of bytes < seqlen. At the end of each file we increment the current buffer by 1 to account for the “<EOF>” tag.
- class gperc.data.Consumer(fps, style='diff', n_bytes=2, seqlen='auto', verbose=False, class_to_id=None, _unittesting=False)[source]¶
Bases:
object
Consumer takes in a list of files along with its metadata and becomes a callable generator. When calling it, you can tell it what kind of data you want. It is a full-fledged data engine in itself. This will sit in nbox one day and thus has to be engineered in such a way that it is production grade with good documentation. In the nbox hierarchy it sits parallel to nbox.Model and thus has to carry the same traits as nbox.Parsers:
primitive that tells the actual fetching instruction
structure should be same as the source meta data
This Consumer object will convert any input to the F3 format as its internal representation. Moreover, for each file we will extract the token sequence. The target token sequence looks like this:
sequence = [tokens,from,meta,data] + [tokens,from,actual,file] + [EOF-tag]
This will be the input to the model and this is the final version. It provides sufficient context to the model for the given input, just like how much information the OS has about any given file. The metadata is obtained using the file command on POSIX systems (man page).
Procedure
From the IR we extract the internal metadata that helps guide any kind of batching process and it looks like this:
metadata = {
    "cat1": {
        "extensions": [".jpg", ".png", ...],
        "filepath": [
            "/path/to/file/1.jpg",
            "/path/to/file/2.png",
        ],
        "st_size": [780, 782, 779],
    }
}
Note that in the above case st_size will also include the metadata tokens.
- Parameters
fps (Any) –
The file paths have to be the primary index inside the lists and so filepaths “fps” can look like these:
(F0) list of strings:
["file1.txt", "file2.txt", ...]
(F1) list of dicts:
[{"file1.txt": "cat1"}, {"file2.txt": "cat2"}, ...]
(F2) dict of strings:
{"file1.txt": "cat1", "file2.txt": "cat2", ...}
(F3) dict of categories (IR):
{"cat1": ["file1.txt", "file2.txt", ...], "cat2": ["file3.txt", "file4.txt", ...]}
style (str, optional) –
The style of merging the data should be one of the following:
concat: the data is treated like one very long sequence of bytes, in this case bytes are split by <EOF>
diff: each file is treated as an independent sequence of bytes, in this case bytes are padded by <EOF>
n_bytes (int, optional) – number of bytes that make one token, 2 is a good number.
seqlen (list, optional) – the total number of tokens for each sample
verbose (bool, optional) – if True, prints out the progress of the data
class_to_id (dict, optional) – if not None, this is a dictionary that maps the class names to the integer ids.
_unittesting (bool) – This is a private variable that is used to test the data reader. Keep at False
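The four fps layouts (F0 to F3) listed above can all be brought into the F3 shape with a small normalisation step. The sketch below is an assumption about how that could be done, not the code Consumer runs. The function name to_f3 and the placeholder class name "null" for unlabelled files (F0) are invented for this example.

```python
from collections import defaultdict


def to_f3(fps, default_class="null"):
    """Normalise F0/F1/F2/F3 style inputs into {class: [filepaths]} (hypothetical helper)."""
    out = defaultdict(list)
    if isinstance(fps, dict):
        values = list(fps.values())
        if all(isinstance(v, (list, tuple)) for v in values):    # F3: already class -> files
            for cat, paths in fps.items():
                out[cat].extend(paths)
        elif all(isinstance(v, str) for v in values):            # F2: file -> class
            for path, cat in fps.items():
                out[cat].append(path)
        else:
            raise ValueError("dict values must be all lists (F3) or all strings (F2)")
    elif isinstance(fps, (list, tuple)):
        if all(isinstance(item, str) for item in fps):           # F0: plain list of files
            out[default_class].extend(fps)
        elif all(isinstance(item, dict) for item in fps):        # F1: list of {file: class}
            for item in fps:
                for path, cat in item.items():
                    out[cat].append(path)
        else:
            raise ValueError("list items must be all strings (F0) or all dicts (F1)")
    else:
        raise TypeError(f"unsupported fps type: {type(fps).__name__}")
    return dict(out)


f3 = to_f3([{"file1.txt": "cat1"}, {"file2.txt": "cat2"}, {"file3.txt": "cat1"}])
```

Keeping a single internal shape means the batching code only ever has to handle one layout.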
- set_unsupervised_mode(mask_frequency=0.15, add_cls=False)[source]¶
set variables required for unsupervised query mode
- Parameters
mask_frequency (float) – frequency of masking of input tensor
add_cls (bool) – whether to prefix the
<CLS>
token to data
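How the masking is applied is not spelled out here, so the following is only a sketch of one plausible reading: each token is independently replaced by a mask token with probability mask_frequency, and a <CLS> token is optionally prefixed. The function mask_tokens and the ids mask_id and cls_id are hypothetical and do not come from gperc.

```python
import random


def mask_tokens(tokens, mask_id, mask_frequency=0.15, cls_id=None, seed=None):
    """Return (masked_input, target) for a masked-token objective (hypothetical helper)."""
    rng = random.Random(seed)
    # Replace each token independently with probability mask_frequency.
    masked = [mask_id if rng.random() < mask_frequency else t for t in tokens]
    target = list(tokens)  # the target is the unmasked sequence
    if cls_id is not None:
        # Assumption: the <CLS> token is prefixed to both input and target.
        masked = [cls_id] + masked
        target = [cls_id] + target
    return masked, target


masked, target = mask_tokens(list(range(100)), mask_id=65536, cls_id=65537, seed=4)
```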
- set_supervised_mode()[source]¶
set variables required for supervised query mode. Currently takes nothing.
- __getitem__(x=None)[source]¶
This is the heart of this code: it takes in user requests and returns data accordingly. This is slightly technical, so we will explain it in detail. I find similarities between CRUD databases and datasets for machine learning. CRUD systems have amazing performance and interaction tools like SQL. Datasets in ML are more like a collection of data, and they are not designed to be used in a friendly way. Everyone is writing their own thing there, but good UX requires satisfying the user with some kind of formula and then letting them be.
Any SQL query has the following grammar: SELECT [columns] FROM [table] WHERE [condition]. This is something everyone understands, and it is very simple. In our case [table] == self, i.e. the table is the dataset itself, since this is no RDBMS. The condition is clearly described in the documentation of x. But [columns] (here called query) is harder, i.e. the user needs something in a particular format, and with arbitrary user logic it is hard to give guarantees. I will come back to this later.
The condition has two parts, the primitive and the structure. In this version of the code, the structure and primitive are implemented in a pythonic way. Read the documentation of x for more details. After getting the data we convert it to an intermediate format, which is a list of tuples, each tuple being a sample. The intermediate format can be one of the following:
dict like this:
{
    'data': [
        ('some/file/1', seek_location, end_bytes),
        # >= 1 sample of the above tuple
    ],
    'class': 'tinker'
}
list with dict in it, in which case the samples are batched together.
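To make the intermediate format concrete, here is a sketch of how one such dict could be turned back into raw bytes by seeking into each file. The helper read_sample is hypothetical, and it assumes the third element of each tuple is an absolute end offset, as the name end_bytes suggests.

```python
import os
import tempfile


def read_sample(sample: dict):
    """Read the bytes described by an intermediate-format dict (hypothetical helper)."""
    chunks = []
    for path, start, end in sample["data"]:
        with open(path, "rb") as f:
            f.seek(start)                      # jump to the start of the span
            chunks.append(f.read(end - start)) # read only the bytes of this span
    return b"".join(chunks), sample["class"]


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file1.bin")
    with open(path, "wb") as f:
        f.write(b"0123456789")
    demo_bytes, demo_class = read_sample(
        {"data": [(path, 2, 6), (path, 8, 10)], "class": "tinker"}
    )
```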
The intermediate format is then converted to the desired format, i.e. query. Currently I have added functionality that can return one of the following formats:
supervised, in which input is the input tensor and output is the class tensor, from the self.class_to_id dict.
unsupervised, in which input is the input tensor and output is a clone of it.
- Parameters
x (Any) –
There is only one input since this is a special method. We take in this input item and process it based on the following rules:
(I0) None: when x is None we have an internal idx that is incremented and the next batch is returned
(I1) int: when x is an int we return the batch at that index
(I2) slice: when x is a slice we return the batches in the slice
(I3) list: when x is a list we return the batches at the indices (int) in the list
(I4) dict -> ints: when values of x are ints we return that many random samples from each class
(I5) dict -> list: when values of x are lists we return the batches at the indices (list) given for each class
(I6) tuple: Read below.
x_tuple (Tuple) –
When x is a tuple you can use it like a function, meaning it can run certain hardcoded logic. It should have condition as above and query. This is not a real input; it is added separately for documentation convenience. The object query can be one of the following:
None: returns just the {"input_tensor": tensor} dict
'supervised': {"input_tensor": tensor, "class": tensor}, this will fail if self.class_to_id is incorrect
'unsupervised': {"input_tensor": tensor, "output_tensor": tensor}
Using this is very simple.
from gperc.data import Consumer

# define the consumer object
my_kewl_dataset = Consumer(
    fps = {
        "cat": ["img0.png", "/tmp/ssg3hng.png", ...],
        "dog": ["img1.png", "/tmp/uo35523.png", ...],
    },
)

# output in all cases is a batched tensor of desired shape
out = my_kewl_dataset[None]  # get whatever is the next batch
out = my_kewl_dataset[0]     # get the data at index 0
out = my_kewl_dataset[5:10]  # get the data at indices 5 to 10

# return random batches of 10 samples from class cat and 4 samples from class dog
out = my_kewl_dataset[{"cat": 10, "dog": 4}]

# return the batches at indices [0...4] and [5...9] from class cat and class dog respectively
out = my_kewl_dataset[{"cat": [0, 1, 2, 3, 4], "dog": [5, 6, 7, 8, 9]}]

# in all cases above out is a dict with key "input_array" because we have not provided a query
# if you want to force this behaviour
out = my_kewl_dataset[5:10, None]

# when you want supervised
set(my_kewl_dataset[5:10, "supervised"].keys()) == {"input_array", "class"}

# when you want unsupervised
set(my_kewl_dataset[5:10, "unsupervised"].keys()) == {"input_array", "output_tensor"}
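The dispatch rules (I0) to (I6) can be sketched with a toy class that works on in-memory samples instead of files. ToyConsumer is a hypothetical stand-in, not gperc.Consumer: it returns the selected samples and the query untouched rather than tensors, and it draws the random samples for (I4) with replacement, which is an assumption.

```python
import random


class ToyConsumer:
    """Toy illustration of type-based dispatch in __getitem__ (hypothetical)."""

    def __init__(self, samples_by_class, seed=0):
        self.by_class = samples_by_class
        self.flat = [s for samples in samples_by_class.values() for s in samples]
        self._idx = 0
        self._rng = random.Random(seed)

    def __getitem__(self, x=None):
        query = None
        if isinstance(x, tuple):                  # (I6) tuple: (condition, query)
            x, query = x
        if x is None:                             # (I0) next item, internal counter
            picked = [self.flat[self._idx % len(self.flat)]]
            self._idx += 1
        elif isinstance(x, int):                  # (I1) single index
            picked = [self.flat[x]]
        elif isinstance(x, slice):                # (I2) slice of indices
            picked = self.flat[x]
        elif isinstance(x, list):                 # (I3) explicit list of indices
            picked = [self.flat[i] for i in x]
        elif isinstance(x, dict):
            picked = []
            for cls, value in x.items():
                pool = self.by_class[cls]
                if isinstance(value, int):        # (I4) that many random samples per class
                    picked += self._rng.choices(pool, k=value)
                else:                             # (I5) explicit indices per class
                    picked += [pool[i] for i in value]
        else:
            raise TypeError(f"unsupported index type: {type(x).__name__}")
        return {"samples": picked, "query": query}


toy = ToyConsumer({"cat": ["c0", "c1", "c2"], "dog": ["d0", "d1"]})
first = toy[None]
sliced = toy[1:3, "supervised"]
```

Python passes toy[1:3, "supervised"] to __getitem__ as the tuple (slice(1, 3), "supervised"), which is what makes the (I6) form possible without any extra syntax.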