gperc Data

Data Reader

This file has the datareaders for the gperc program. The datareaders work as follows:

  1. The default datareader is gperc.BinaryConsumer and it reads all the files provided to it. It reads the raw bytes of each file and slices the data based on the size of the file.

  2. You can provide extra metadata to the datareaders. This is done by providing a dictionary or list. For more information read below.

Though some code has been added below, it does not work yet. It is included here because it captures the general direction this module is supposed to progress towards. The general idea is that the trainer should be able to select the kind of data that it wants to use. This means that there needs to be a structured way to represent and fetch the information. This is done as follows:

  1. The input data F can be loaded in 4 different styles as given in the documentation below.

  2. The fetching I can happen in 6 different styles as given in the documentation below.

I am following the same system that I have implemented in nbox.Parsers. Here is a quick brief on primitives P and structures S:

  1. P are the basic data types that are used in the data. This is the actual data you want your model to process.

  2. S are the structures that are used to represent the data. This is how data is organised.

In our day to day life, what we call data is nothing but an interplay of one P and many S.

Raw Bytes Tokenization

I am choosing to read raw binary instead of tokenizing text; this is similar to how computer programs like to work.

  1. A good way to measure the amount of information processed per sample is bytes_processed = n_bytes * seqlen, e.g. 2 * 8192 = 16KB. n_bytes also defines the vocabulary size as n_tokens = 2 ** (n_bytes * 8) => (256, 65536, 16777216, ...), i.e. the total number of values representable with n_bytes * 8 bits (see the arithmetic sketch after this list).

  2. This should not be confused with the memory footprint, which is going to be larger since each token is stored as a 64-bit int (i64 in Rust).

  3. In the above sample with batch_size = 20, we have processed 320KB, the same as the combined L1 cache of an Apple M1 core, which has 192 KB of L1 instruction cache and 128 KB of L1 data cache.

  4. The total number of tokens processed in each batch would be 20 * 8192 = 163840, and with i64 (8 bytes per token) that means a memory footprint of 163840 * 8 bytes ~ 1.25MB.

  5. Wrapping up, that means we are processing 320KB of data in a 1.25MB memory footprint (a 4x memory requirement).
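The arithmetic above can be sanity-checked with a few lines of Python; the numbers simply restate the example (n_bytes = 2, seqlen = 8192, batch_size = 20):

n_bytes, seqlen, batch_size = 2, 8192, 20

vocab_size = 2 ** (n_bytes * 8)                   # 65536 possible tokens
bytes_per_sample = n_bytes * seqlen               # 16384 bytes = 16 KB of raw data
bytes_per_batch = bytes_per_sample * batch_size   # 327680 bytes = 320 KB

tokens_per_batch = seqlen * batch_size            # 163840 tokens per batch
footprint_bytes = tokens_per_batch * 8            # 8 bytes per i64 token ~ 1.25 MB

print(vocab_size, bytes_per_batch // 1024, footprint_bytes / 2 ** 20)  # 65536 320 1.25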

Internal Representation

This is what we have to do: full_meta is not a good way to access individual elements in the batch, so we need to convert it to a more convenient internal representation. Consider full_meta as a table; this is what it would look like:

Full Meta as a Table

class   filepath   size (in bytes)
cat     f1         137
cat     f2         417
cat     f3         139
dog     f4         123
dog     f5         52
dog     f6         390

The batches with seqlen = 128 and n_bytes=1 would look like a flat array with items like this:

batches = [
    ([f1,   0, 128],),
    ([f1, 128, 137],
     [f2,   0, 119],),
    ([f2, 119, 347],),
    ([f2, 347, 417],
     [f3,   0,  58],),
    ...
]

System Footprint

gperc.Consumer is a powerful data ingestion system and, as a side effect, it will create new files on your system.

Documentation

gperc.data.get_vocab(n_bytes)[source]
gperc.data.get_time()[source]
gperc.data.binary_sample_generator(meta, seqlen, n_bytes)[source]

This function takes in the filesystem metadata generated by the main Consumer class, along with seqlen and n_bytes, and returns all the samples in the dataset. Each sample is a collection of (filepath, seek, end) triplets like the batches shown in the Internal Representation section above.

The logic of the code is as follows: for each class, go over its files; for each file, check the total number of bytes in the file and keep adding (filepath, seek, end) triplets to the current sample while the total number of bytes is < seqlen. At the end of each file the current buffer is incremented by 1 to account for the "<EOF>" tag. A sketch of this logic is given below.
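A minimal sketch of that packing loop is given below, assuming the internal metadata layout shown later in the Procedure section ({class: {"filepath": [...], "st_size": [...]}}); pack_samples is an illustrative name and the exact "<EOF>" accounting in the real code may differ slightly:

def pack_samples(meta, seqlen, n_bytes):
    # walk every class and every file, cutting (filepath, seek, end) windows so that
    # each sample holds at most seqlen tokens (one token = n_bytes bytes)
    samples, current, filled = [], [], 0
    budget = seqlen * n_bytes                 # bytes available per sample
    for _cls, info in meta.items():
        for fp, size in zip(info["filepath"], info["st_size"]):
            seek = 0
            while seek < size:
                take = min(size - seek, budget - filled)
                current.append((fp, seek, seek + take))
                seek, filled = seek + take, filled + take
                if filled == budget:          # sample is full, start a new one
                    samples.append(tuple(current))
                    current, filled = [], 0
            filled += n_bytes                 # reserve one token for the "<EOF>" tag
            if filled >= budget:
                samples.append(tuple(current))
                current, filled = [], 0
    if current:                               # flush the last, partially filled sample
        samples.append(tuple(current))
    return samples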

gperc.data.diff_sample_generator(meta)[source]

This function takes in the filesystem metadata generated by the main Consumer class and returns all the samples in the dataset, one sample per file (the "diff" style): each file is treated as an independent sequence of bytes and padded with the "<EOF>" tag.

gperc.data.decode_ids(ids, vocab)[source]

Decode the ids to the corresponding characters.

gperc.data.convert_to_gperc_consumer_ir(fps)[source]
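The behaviour of this conversion can be sketched roughly as below, assuming the four input styles F0-F3 documented under fps of the Consumer class; to_f3_ir and the "null" placeholder class are illustrative names, not the actual implementation:

def to_f3_ir(fps):
    # normalise any of the F0/F1/F2 inputs into the F3 dict-of-categories IR
    if isinstance(fps, dict):
        first = next(iter(fps.values()), None)
        if isinstance(first, list):
            return fps                                    # F3: already the IR
        ir = {}                                           # F2: {"file1.txt": "cat1", ...}
        for filepath, cls in fps.items():
            ir.setdefault(cls, []).append(filepath)
        return ir
    ir = {}
    for item in fps:
        if isinstance(item, dict):                        # F1: [{"file1.txt": "cat1"}, ...]
            for filepath, cls in item.items():
                ir.setdefault(cls, []).append(filepath)
        else:                                             # F0: ["file1.txt", ...] has no classes,
            ir.setdefault("null", []).append(item)        # so a placeholder class is assumed here
    return ir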
class gperc.data.Consumer(fps, style='diff', n_bytes=2, seqlen='auto', verbose=False, class_to_id=None, _unittesting=False)[source]

Bases: object

Consumer takes in a list of files along with their metadata and becomes a callable generator. When calling it you can tell it what kind of data you want. It is a full-fledged data engine in itself. This will sit in nbox one day and thus has to be engineered in such a way that it is production grade with good documentation. In the nbox hierarchy it sits parallel to nbox.Model and thus has to carry the same traits as nbox.Parsers:

  1. primitive, which tells the actual fetching instruction

  2. structure, which should be the same as the source metadata

This Consumer object will convert any input to the F3 format as its internal representation. Moreover, for each file we extract a token sequence; the target token sequence looks like this:

sequence = [tokens,from,meta,data] + [tokens,from,actual,file] + [EOF-tag]

This will be the input to the model and this is the final version; it provides sufficient context to the model for the given input, comparable to how much information the OS has about any given file. The metadata is obtained using the file command on POSIX systems (see its man page).
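A minimal sketch of how such a sequence could be assembled for n_bytes = 1, assuming a reserved <EOF> token id; build_sequence and eof_id are illustrative names and the actual implementation may differ:

import subprocess

def build_sequence(filepath, eof_id=256):
    # metadata tokens: brief output of the POSIX `file` command for this path
    meta = subprocess.run(["file", "-b", filepath], capture_output=True).stdout
    # file tokens: the raw bytes of the file itself (n_bytes = 1, so one byte = one token)
    with open(filepath, "rb") as f:
        body = f.read()
    # sequence = [tokens,from,meta,data] + [tokens,from,actual,file] + [EOF-tag]
    return list(meta) + list(body) + [eof_id]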

Procedure

From the IR we extract the internal metadata that helps guide any kind of batching process and it looks like this:

metadata = {
    "cat1": {
        "extensions": [".jpg", ".png", ...],
        "filepath": [
            "/path/to/file/1.jpg",
            "/path/to/file/2.png",
        ],
        "st_size": [780, 782, 779],
    }
}

Note that in the above case st_size also includes the metadata tokens.

Parameters
  • fps (Any) –

    The file paths have to be the primary index inside the lists, and so filepaths "fps" can look like any of these (see the examples after this parameter list):

    1. (F0) list of strings: ["file1.txt", "file2.txt", ...]

    2. (F1) list of dicts: [{"file1.txt": "cat1"}, {"file2.txt": "cat2"}, ...]

    3. (F2) dict of strings: {"file1.txt": "cat1", "file2.txt": "cat2", ...}

    4. (F3) dict of categories (IR): {"cat1": ["file1.txt", "file2.txt", ...], "cat2": ["file3.txt", "file4.txt", ...]}

  • style (str, optional) –

    The style of the merging the data should be one of the following:

    1. concat: the data is treated like a very long sequence of bytes, in this case the bytes are split by <EOF>

    2. diff: each file is treated as an independent sequence of bytes, in this case bytes are padded by <EOF>

  • n_bytes (int, optional) – number of bytes that make one token, 2 is a good number.

  • seqlen (int or 'auto', optional) – the total number of tokens in each sample

  • verbose (bool, optional) – if True, prints out the progress of the data

  • class_to_id (dict, optional) – if not None, this is a dictionary that maps the class names to the integer ids.

  • _unittesting (bool) – This is a private variable that is used to test the data reader; keep it False.
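For illustration, here is the set of six files from the Full Meta table above expressed in each of the four fps styles; the literal values are only illustrative:

fps_f0 = ["f1", "f2", "f3", "f4", "f5", "f6"]                      # F0: list of strings
fps_f1 = [{"f1": "cat"}, {"f2": "cat"}, {"f3": "cat"},
          {"f4": "dog"}, {"f5": "dog"}, {"f6": "dog"}]             # F1: list of dicts
fps_f2 = {"f1": "cat", "f2": "cat", "f3": "cat",
          "f4": "dog", "f5": "dog", "f6": "dog"}                   # F2: dict of strings
fps_f3 = {"cat": ["f1", "f2", "f3"],
          "dog": ["f4", "f5", "f6"]}                               # F3: dict of categories (the IR)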

get_dict()[source]
to_json(fp=None)[source]
set_unsupervised_mode(mask_frequency=0.15, add_cls=False)[source]

set variables required for unsupervised query mode

Parameters
  • mask_frequency (float) – frequency of masking of input tensor

  • add_cls (bool) – whether to prefix the <CLS> token to data
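A minimal sketch of what the unsupervised masking could look like, assuming PyTorch tensors of shape (batch, seqlen) and illustrative mask/CLS token ids; this is not the library's actual implementation:

import torch

def mask_for_unsupervised(input_tensor, mask_frequency=0.15, add_cls=False,
                          mask_id=0, cls_id=1):
    # input_tensor: (batch, seqlen) integer token ids; mask_id / cls_id are assumed ids
    out = input_tensor.clone()
    mask = torch.rand(out.shape) < mask_frequency      # ~mask_frequency of positions
    out[mask] = mask_id
    if add_cls:
        # prefix the <CLS> token to every sequence in the batch
        cls_col = torch.full((out.shape[0], 1), cls_id, dtype=out.dtype)
        out = torch.cat([cls_col, out], dim=1)
    return out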

set_supervised_mode()[source]

set variables required for supervised query mode. Currently takes nothing.

__getitem__(x=None)[source]

This is the heart of this code: it takes in user requests and returns the data accordingly. This is slightly technical and so we will explain it in detail. I find similarities between CRUD databases and datasets for machine learning: CRUD has amazing performance and interaction tools like SQL. Datasets in ML are more like a plain collection of data and are not designed to be used in a friendly way. Everyone writes their own thing there, but good UX requires satisfying the user with some kind of formula and then letting them be.

Any SQL query has the following grammar: SELECT [columns] FROM [table] WHERE [condition]. This is something everyone understands; it's very simple. In our case [table] == self, i.e. the table is the dataset itself (this is no RDBMS). The condition is clearly described in the documentation of x. But [columns] (here called query) is the hard part, i.e. the user needs something in a particular format, and with arbitrary user logic it is hard to give guarantees. I will come back to this later.

The condition has two parts, the primitive and the structure. With this version of the code, the structure and primitive are implemented in a pythonic way; read the documentation of x for more details. After getting the data we convert it to an intermediate format, which is a list of tuples where each tuple is a sample. The intermediate format can be one of the following:

  1. dict like this:

{
    'data': [
        ('some/file/1', seek_location, end_bytes),
        # >= 1 sample of the above tuple
    ],
    'class': 'tinker'
}
  2. a list with such dicts in it, in which case the samples are batched together.

The intermediate format is then converted to the desired format, i.e. the query. Currently I have added functionality that can return one of the following formats:

  1. supervised, in which the input is the input tensor and the output is the class tensor, built from the self.class_to_id dict.

  2. unsupervised, in which the input is the input tensor and the output is a clone of it.

Parameters
  • x (Any) –

    There is only one input since this is a special method. We take in this input item and process it according to the following rules:

    1. (I0) None: when x is None we have an internal idx that is incremented and the next batch is returned

    2. (I1) int: when x is an int we return the batch at that index

    3. (I2) slice: when x is a slice we return the batches in the slice

    4. (I3) list: when x is a list of ints we return the batches at those indices

    5. (I4) dict -> ints: when the values of x are ints we return random batches of that many samples from each class (key)

    6. (I5) dict -> list: when the values of x are lists we return the batches at those indices from each class (key)

    7. (I6) tuple: Read below.

  • x_tuple (Tuple) –

    When x is a tuple you can use it like a function, meaning it can run certain hardcoded logic. It should contain the condition as above and the query. This is not a real argument; it is documented separately for convenience. The query object can be one of the following:

    1. None: returns just {"input_tensor": tensor} dict

    2. 'supervised': {"input_tensor": tensor, "class": tensor}; this will fail if self.class_to_id is incorrect

    3. 'unsupervised': {"input_tensor": tensor, "output_tensor": tensor}

Using this is very simple.

# define the consumer object
my_kewl_dataset = Consumer(
    fps = {
        "cat": ["img0.png", "/tmp/ssg3hng.png", ...],
        "dog": ["img1.png", "/tmp/uo35523.png", ...],
    },
    seed = 4
)

# output in all cases is a batched tensor of desired shape
out = my_kewl_dataset[None] # get whatever is the next batch
out = my_kewl_dataset[0]    # get the data at index 0
out = my_kewl_dataset[5:10] # get the data at indices 5 to 10
out = my_kewl_dataset[{
    "cat": 10,
    "dog": 4
}] # return random batches of 10 samples from class cat and 4 samples from class dog
out = my_kewl_dataset[{
    "cat": [0, 1, 2, 3, 4],
    "dog": [5, 6, 7, 8, 9]
}] # return the batches at indices [0...4] and [5...9] from class cat and class dog respectively

# in all cases above out is a dict with key "input_tensor" because we have not provided a query
# if you want to force this behaviour
out = my_kewl_dataset[5:10, None]

# when you want supervised
set(my_kewl_dataset[5:10, "supervised"].keys()) == {"input_tensor", "class"}

# when you want unsupervised
set(my_kewl_dataset[5:10, "unsupervised"].keys()) == {"input_tensor", "output_tensor"}

create_batches(batch_size, drop_last=False, seed=4)[source]
get_next_batch(query=None)[source]
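
A small usage sketch of these two helpers, continuing the my_kewl_dataset example above (the batch_size value is arbitrary):

# build fixed batches once, then iterate over them
my_kewl_dataset.create_batches(batch_size=8, drop_last=False, seed=4)
for _ in range(10):
    batch = my_kewl_dataset.get_next_batch(query="supervised")
    # batch is a dict like {"input_tensor": ..., "class": ...}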