DATA

Tools

DATA.Tools.aparquet(pf, func, args=[], kwargs={}, index=None, sample=1000000)

apply a function to a parquet file, one sample at a time.

Parameters:
  • pf (pq.ParquetFile) – the parquet file we want to iterate through

  • func (python function) – function that takes a pandas DataFrame as input, along with *args and **kwargs

  • args (list or tuple, optional) – arguments of func

  • kwargs (dict, optional) – keyword arguments of func

  • index (list(str), optional) – the index columns we want to keep from pf

  • sample (int, optional) – size of the sample

Returns:

list of the outputs of func(dtf, *args, **kwargs), one per sample

Return type:

list
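
A minimal usage sketch, assuming the DATA package is importable; the file name "data.parquet" and the "price" column are placeholders:

   import pyarrow.parquet as pq
   from DATA import Tools

   def mean_of(df, column):
       # func receives one pandas DataFrame per sample, plus *args / **kwargs
       return df[column].mean()

   pf = pq.ParquetFile("data.parquet")
   # one result per sample of at most 500 000 rows; aparquet returns a list
   means = Tools.aparquet(pf, mean_of, args=["price"], sample=500_000)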

DATA.Tools.csv2parquet(pname, name, index=None, sample=1000000)

Convert a csv file into a parquet file

Parameters:
  • pname (str) – the path of the parquet file

  • name (str) – the name of the csv file

  • index (list(str), optional) – the index columns we want to keep

  • sample (int, optional) – size of the sample
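
A minimal sketch, assuming the argument order follows the parameter list above (parquet path first, then the csv file); both file names are placeholders:

   from DATA import Tools

   # convert data.csv into data.parquet, reading 1 000 000 rows per sample
   Tools.csv2parquet("data.parquet", "data.csv", sample=1_000_000)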

DATA.Tools.getrand(pf, num, index=None, sample=1000000)

Get random rows from the parquet file

Parameters:
  • pf (pq.ParquetFile) – the parquet file we want to draw values from

  • num (int) – number of elements we want to get

  • index (list(str), optional) – the index columns we want to keep from pf

  • sample (int, optional) – size of the sample

Returns:

the selected values

Return type:

pd.DataFrame
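
A minimal sketch, with a placeholder file name:

   import pyarrow.parquet as pq
   from DATA import Tools

   pf = pq.ParquetFile("data.parquet")
   # draw 1 000 random rows; the result is a pandas DataFrame
   rows = Tools.getrand(pf, 1_000, sample=500_000)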

DATA.Tools.parquet2csv(name, pf, index=None, sample=1000000)

Convert a parquet file to a csv file

Parameters:
  • name (str) – the name of the csv file

  • pf (pq.ParquetFile or str) – the parquet file or path

  • index (list(str), optional) – the index columns we want to keep from pf

  • sample (int, optional) – size of the sample
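
A minimal sketch, with placeholder file names (pf may also be an opened pq.ParquetFile):

   from DATA import Tools

   # write data.parquet back out as data.csv, sample by sample
   Tools.parquet2csv("data.csv", "data.parquet", sample=1_000_000)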

DATA.Tools.sparq(pf, goto, name='out', index=None, sample=1000000)

Split a parquet file into several sub-parquet files

Parameters:
  • pf (pq.ParquetFile) – the parquetfile we want to split

  • goto (python function) – function that returns a list assigning each row to a label; if it returns -1 for a row, no file is created and the row is discarded.

  • name (str, optional) – base name of the output sub-parquet files

  • index (list(str), optional) – the index columns we want to keep from pf

  • sample (int, optional) – size of the sample
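
A minimal sketch, assuming goto receives each sample as a pandas DataFrame and returns one label per row; the "label" column and file name are placeholders:

   import pyarrow.parquet as pq
   from DATA import Tools

   def goto(df):
       # keep labels 0 and 1 in their own files, discard every other row
       return [lab if lab in (0, 1) else -1 for lab in df["label"]]

   pf = pq.ParquetFile("data.parquet")
   Tools.sparq(pf, goto, name="out", sample=500_000)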

DATA.Tools.stats(pf, index=None, sample=1000000)

Gather statistical data over the parquet file

Parameters:
  • pf (pq.ParquetFile or str) – the parquet file or path we want to iterate through.

  • index (list(str), optional) – the index columns we want to keep from pf

  • sample (int, optional) – size of the sample

Returns:

a class with the following public variables:

mean, variation, min, max, N

Return type:

upstats
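
A minimal sketch, with a placeholder path:

   from DATA import Tools

   st = Tools.stats("data.parquet", sample=1_000_000)
   # st is an upstats instance
   print(st.mean, st.variation, st.min, st.max, st.N)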

class DATA.Tools.upstats

A class to compute stats through batches

update(df)

update the statistics of the whole set, batch by batch

Parameters:

df (pd.DataFrame) – the new batch to update the stats with
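
A minimal sketch, assuming upstats can be instantiated without arguments:

   import pandas as pd
   from DATA import Tools

   st = Tools.upstats()
   for batch in (pd.DataFrame({"x": [1.0, 2.0]}),
                 pd.DataFrame({"x": [3.0, 4.0]})):
       st.update(batch)   # running stats over all batches seen so far
   print(st.mean, st.N)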

models

class DATA.models.Bicephale(Densize, activation='relu')

Bicephale model

the geometry of the model is the following:

               Input
             /       \
     Dense Block   Dense Block
        Sig            Lin
             \       /
            dot product

The branch ending in Sig is the sig part (the classifier), the branch ending in Lin is the lin part (the estimator), and their outputs are combined by a dot product.

Parameters:
  • Densize (list(int) or (list(int), list(int))) – the sizes of the layers in the Dense Blocks; if it is a tuple, the first element defines the sig part (the classifier) and the second defines the lin part (the estimator).

  • activation (str or (str, str), optional) – the name of the activation function in the Dense Blocks; if it is a tuple, the first element defines the sig part (the classifier) and the second defines the lin part (the estimator).
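
A minimal construction sketch; the layer sizes and activations are illustrative, and compiling/fitting the model is not shown:

   from DATA import models

   # separate sizes for the sig branch (classifier) and the lin branch (estimator)
   model = models.Bicephale(([64, 32], [64, 16]), activation=("relu", "relu"))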

class DATA.models.DenseBlock(hidden_units, activation='relu')

Dense hidden layers block

Parameters:
  • hidden_units (list(int)) – the size of the different layers in the block

  • activation (str, optional) – the name of the activation function
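
A minimal sketch with illustrative sizes:

   from DATA import models

   # three dense hidden layers of 128, 64 and 32 units
   block = models.DenseBlock([128, 64, 32], activation="relu")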

DATA.models.cephalise(model)

Return the two sub models of the Bicephale model

Parameters:

model (keras.Model) – should be a Bicephale model
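
A minimal sketch, assuming the two sub-models come back as a pair; the names classifier and estimator are illustrative:

   from DATA import models

   model = models.Bicephale(([64, 32], [64, 16]))
   classifier, estimator = models.cephalise(model)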

train

DATA.train.getdta(files, sizes, sample)

get the data for training

Parameters:
  • files (list(pq.ParquetFile)) – should be the same length as sizes; lists the parquet files the training set is composed from

  • sizes (list(int)) – the number of values we want to get from each file

  • sample (int) – size of the samples

Returns:

the concatenated data from the different files

Return type:

pd.DataFrame
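
A minimal sketch; the file names and sizes are placeholders:

   import pyarrow.parquet as pq
   from DATA import train

   files = [pq.ParquetFile("pos.parquet"), pq.ParquetFile("neg.parquet")]
   # 10 000 rows from each file, concatenated into one pandas DataFrame
   dta = train.getdta(files, [10_000, 10_000], sample=500_000)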

DATA.train.train(model, xy, files, prop, trainsize, sample, Nepoch, Nsets, nfile=None)

train the model on a dataset composed from several files in given proportions

Parameters:
  • model (keras.Model) – the model we want to train

  • xy (python function, pd.DataFrame -> (x, y)) – this function transforms a pandas DataFrame into an (input, output) tuple for training

  • files (list(pq.ParquetFile)) – should be the same length as prop; lists the parquet files the training set is composed from

  • prop (list(float) or equivalent) – the proportion to take from each file (normalized internally)

  • trainsize (int) – size of the training set

  • sample (int) – size of the sample used for file operations (see Tools)

  • Nepoch (int) – number of epochs for the fit function

  • Nsets (int) – number of training sets before stopping the training

  • nfile (str, optional) – the name of the csv file the data will be concatenated into
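
A minimal sketch; the column names ("x1", "x2", "y"), file names, and every numeric setting are illustrative, and the model is assumed to be compiled before training:

   import pyarrow.parquet as pq
   from DATA import models, train

   def xy(df):
       # turn a batch DataFrame into an (input, output) pair for fit
       return df[["x1", "x2"]].values, df["y"].values

   model = models.Bicephale(([32, 16], [32, 8]))   # compile before training
   files = [pq.ParquetFile("pos.parquet"), pq.ParquetFile("neg.parquet")]

   train.train(model, xy, files, prop=[0.7, 0.3],
               trainsize=100_000, sample=500_000,
               Nepoch=5, Nsets=10, nfile="train_data.csv")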