DATA

Tools

DATA.Tools.aparquet(pf, func, args=[], kwargs={}, index=None, sample=1000000)

apply a function to a parquet file, one sample at a time.

Parameters:
  • pf (pq.ParquetFile) – the parquet file we want to iterate through

  • func (python function) – function that takes a pandas DataFrame as input, along with *args and **kwargs

  • args (list or tuple, optional) – arguments of func

  • kwargs (dict, optional) – keyword arguments of func

  • index (list(str), optional) – the index columns we want to keep from pf

  • sample (int, optional) – size of the sample

Returns:

list of the outputs of func(dtf, *args, **kwargs), one per sample

Return type:

list
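
A minimal usage sketch, assuming the DATA package is importable; the file name "data.parquet" and the "price" column are placeholders:

   import pyarrow.parquet as pq
   from DATA import Tools

   def mean_of(df, column):
       # func receives one pandas DataFrame per sample, plus *args / **kwargs
       return df[column].mean()

   pf = pq.ParquetFile("data.parquet")
   # one result per sample of at most 500 000 rows; aparquet returns a list
   means = Tools.aparquet(pf, mean_of, args=["price"], sample=500_000)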

DATA.Tools.csv2parquet(pname, name, index=None, sample=1000000)

Convert a csv file into a parquet file

Parameters:
  • pname (str) – the path of the parquet file

  • name (str) – the name of the csv file

  • index (list(str), optional) – the index columns we want to keep

  • sample (int, optional) – size of the sample
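
A minimal sketch, assuming the argument order follows the parameter list above (parquet path first, then the csv file); both file names are placeholders:

   from DATA import Tools

   # convert data.csv into data.parquet, reading 1 000 000 rows per sample
   Tools.csv2parquet("data.parquet", "data.csv", sample=1_000_000)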

DATA.Tools.getrand(pf, num, index=None, sample=1000000)

Get random rows from the parquet file

Parameters:
  • pf (pq.ParquetFile) – the parquet file we want to draw values from

  • num (int) – number of elements we want to get

  • index (list(str), optional) – the index columns we want to keep from pf

  • sample (int, optional) – size of the sample

Returns:

the selected values

Return type:

pd.DataFrame
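
A minimal sketch, with a placeholder file name:

   import pyarrow.parquet as pq
   from DATA import Tools

   pf = pq.ParquetFile("data.parquet")
   # draw 1 000 random rows; the result is a pandas DataFrame
   rows = Tools.getrand(pf, 1_000, sample=500_000)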

DATA.Tools.parquet2csv(name, pf, index=None, sample=1000000)

Convert a parquet file to a csv file

Parameters:
  • name (str) – the name of the csv file

  • pf (pq.ParquetFile or str) – the parquet file or path

  • index (list(str), optional) – the index columns we want to keep from pf

  • sample (int, optional) – size of the sample
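
A minimal sketch, with placeholder file names (pf may also be an opened pq.ParquetFile):

   from DATA import Tools

   # write data.parquet back out as data.csv, sample by sample
   Tools.parquet2csv("data.csv", "data.parquet", sample=1_000_000)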

DATA.Tools.sparq(pf, goto, name='out', index=None, sample=1000000)

Split a parquet file into several sub-parquet files

Parameters:
  • pf (pq.ParquetFile) – the parquetfile we want to split

  • goto (python function) – function that returns a list assigning each row to a label; if it returns -1 for a row, no file is created and the row is discarded.

  • name (str, optional) – base name of the output sub-parquet files

  • index (list(str), optional) – the index columns we want to keep from pf

  • sample (int, optional) – size of the sample
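
A minimal sketch, assuming goto receives each sample as a pandas DataFrame and returns one label per row; the "label" column and file name are placeholders:

   import pyarrow.parquet as pq
   from DATA import Tools

   def goto(df):
       # keep labels 0 and 1 in their own files, discard every other row
       return [lab if lab in (0, 1) else -1 for lab in df["label"]]

   pf = pq.ParquetFile("data.parquet")
   Tools.sparq(pf, goto, name="out", sample=500_000)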

DATA.Tools.stats(pf, index=None, sample=1000000)

Gather statistical data over the parquet file

Parameters:
  • pf (pq.ParquetFile or str) – the parquet file or path we want to iterate through.

  • index (list(str), optional) – the index columns we want to keep from pf

  • sample (int, optional) – size of the sample

Returns:

a class with the following public variables:

mean, variation, min, max, N

Return type:

upstats
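
A minimal sketch, with a placeholder path:

   from DATA import Tools

   st = Tools.stats("data.parquet", sample=1_000_000)
   # st is an upstats instance
   print(st.mean, st.variation, st.min, st.max, st.N)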

class DATA.Tools.upstats

A class to compute stats through batches

update(df)

update the statistics of the whole set, batch by batch

Parameters:

df (pd.DataFrame) – the new batch to update the stats with
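
A minimal sketch, assuming upstats can be instantiated without arguments:

   import pandas as pd
   from DATA import Tools

   st = Tools.upstats()
   for batch in (pd.DataFrame({"x": [1.0, 2.0]}),
                 pd.DataFrame({"x": [3.0, 4.0]})):
       st.update(batch)   # running stats over all batches seen so far
   print(st.mean, st.N)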

models

class DATA.models.Bicephale(Densize, activation='relu')

Bicephale model

the geometry of the model is the following:

               Input
             /       \
     Dense Block   Dense Block
        Sig            Lin
             \       /
            dot product

The branch ending in Sig is the sig part (the classifier), the branch ending in Lin is the lin part (the estimator), and their outputs are combined by a dot product.

Parameters:
  • Densize (list(int) or (list(int), list(int))) – the sizes of the layers in the Dense Blocks; if it is a tuple, the first element defines the sig part (the classifier) and the second defines the lin part (the estimator).

  • activation (str or (str, str), optional) – the name of the activation function in the Dense Blocks; if it is a tuple, the first element defines the sig part (the classifier) and the second defines the lin part (the estimator).
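
A minimal construction sketch; the layer sizes and activations are illustrative, and compiling/fitting the model is not shown:

   from DATA import models

   # separate sizes for the sig branch (classifier) and the lin branch (estimator)
   model = models.Bicephale(([64, 32], [64, 16]), activation=("relu", "relu"))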

class DATA.models.DenseBlock(hidden_units, activation='relu')

Dense hidden layers block

Parameters:
  • hidden_units (list(int)) – the size of the different layers in the block

  • activation (str, optional) – the name of the activation function
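
A minimal sketch with illustrative sizes:

   from DATA import models

   # three dense hidden layers of 128, 64 and 32 units
   block = models.DenseBlock([128, 64, 32], activation="relu")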

DATA.models.cephalise(model)

Return the two sub models of the Bicephale model

Parameters:

model (keras.Model) – should be a Bicephale model
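
A minimal sketch, assuming the two sub-models come back as a pair; the names classifier and estimator are illustrative:

   from DATA import models

   model = models.Bicephale(([64, 32], [64, 16]))
   classifier, estimator = models.cephalise(model)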

train

DATA.train.getdta(files, sizes, sample)

get the data for training

Parameters:
  • files (list(pq.ParquetFile)) – should be the same length as sizes; lists the parquet files the training set is composed from

  • sizes (list(int)) – the number of values we want to get from each file

  • sample (int) – size of the samples

Returns:

the concatenated data from the different files

Return type:

pd.DataFrame
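
A minimal sketch; the file names and sizes are placeholders:

   import pyarrow.parquet as pq
   from DATA import train

   files = [pq.ParquetFile("pos.parquet"), pq.ParquetFile("neg.parquet")]
   # 10 000 rows from each file, concatenated into one pandas DataFrame
   dta = train.getdta(files, [10_000, 10_000], sample=500_000)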

DATA.train.train(model, xy, files, prop, trainsize, sample, Nepoch, Nsets, nfile=None)

train the model on a dataset composed from several files in given proportions

Parameters:
  • model (keras.Model) – the model we want to train

  • xy (python function, pd.DataFrame -> (x, y)) – this function transforms a pandas DataFrame into an (input, output) tuple for training

  • files (list(pq.ParquetFile)) – should be the same length as prop; lists the parquet files the training set is composed from

  • prop (list(float) or equivalent) – the proportion to take from each file (normalized internally)

  • trainsize (int) – size of the training set

  • sample (int) – size of the sample used for file operations (see Tools)

  • Nepoch (int) – number of epochs for the fit function

  • Nsets (int) – number of training sets before stopping the training

  • nfile (str, optional) – the name of the csv file the data will be concatenated into
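
A minimal sketch; the column names ("x1", "x2", "y"), file names, and every numeric setting are illustrative, and the model is assumed to be compiled before training:

   import pyarrow.parquet as pq
   from DATA import models, train

   def xy(df):
       # turn a batch DataFrame into an (input, output) pair for fit
       return df[["x1", "x2"]].values, df["y"].values

   model = models.Bicephale(([32, 16], [32, 8]))   # compile before training
   files = [pq.ParquetFile("pos.parquet"), pq.ParquetFile("neg.parquet")]

   train.train(model, xy, files, prop=[0.7, 0.3],
               trainsize=100_000, sample=500_000,
               Nepoch=5, Nsets=10, nfile="train_data.csv")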