DATA¶
Tools¶
- DATA.Tools.aparquet(pf, func, args=[], kwargs={}, index=None, sample=1000000)¶
Apply a function to a parquet file, one sample (chunk of rows) at a time.
- Parameters:
pf (pq.ParquetFile) – the parquet file we want to iterate through
func (python function) – a function that takes a pandas DataFrame as its first argument, followed by *args and **kwargs
args (list or tuple, optional) – positional arguments of func
kwargs (dict, optional) – keyword arguments of func
index (list(str), optional) – the index columns we want to keep from pf
sample (int, optional) – size of each sample (number of rows read at a time)
- Returns:
the list of the outputs of func(dtf, *args, **kwargs), one per sample
- Return type:
list
- DATA.Tools.csv2parquet(pname, name, index=None, sample=1000000)¶
Convert a CSV file into a parquet file.
- Parameters:
pname (str) – the path of the output parquet file
name (str) – the name of the CSV file to convert
index (list(str), optional) – the index columns we want to keep
sample (int, optional) – size of each sample (rows converted at a time)
- DATA.Tools.getrand(pf, num, index=None, sample=1000000)¶
Get random values from the parquet file.
- Parameters:
pf (pq.ParquetFile) – the parquet file we want to sample from
num (int) – the number of elements we want to get
index (list(str), optional) – the index columns we want to keep from pf
sample (int, optional) – size of each sample
- Returns:
the values selected
- Return type:
pd.DataFrame
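The docs do not say how getrand picks its rows; reservoir sampling is one standard way to draw num rows uniformly from chunked data, sketched here with in-memory DataFrames standing in for the parquet samples:

```python
import random
import pandas as pd

def sample_rows(batches, num, seed=0):
    # Reservoir sampling over DataFrame chunks: keeps `num` rows chosen
    # uniformly at random without loading the whole file at once.
    rng = random.Random(seed)
    reservoir, seen = [], 0
    for dtf in batches:
        for row in dtf.itertuples(index=False):
            seen += 1
            if len(reservoir) < num:
                reservoir.append(row)
            else:
                j = rng.randrange(seen)
                if j < num:
                    reservoir[j] = row
    return pd.DataFrame(reservoir)

batches = [pd.DataFrame({"x": range(i, i + 5)}) for i in (0, 5)]
picked = sample_rows(batches, 3)
print(len(picked))  # 3
```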
- DATA.Tools.parquet2csv(name, pf, index=None, sample=1000000)¶
Convert a parquet file to a CSV file.
- Parameters:
name (str) – the name of the CSV file
pf (pq.ParquetFile or str) – the parquet file or its path
index (list(str), optional) – the index columns we want to keep from pf
sample (int, optional) – size of each sample
- DATA.Tools.sparq(pf, goto, name='out', index=None, sample=1000000)¶
Split a parquet file into several sub parquet files.
- Parameters:
pf (pq.ParquetFile) – the parquet file we want to split
goto (python function) – a function that returns a list assigning each row to a label; when it returns -1 for a row, no file is created and the row is discarded
index (list(str), optional) – the index columns we want to keep from pf
sample (int, optional) – size of each sample
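A sketch of the routing logic goto drives, using in-memory DataFrames; the real sparq writes each bucket to its own parquet file, which is elided here:

```python
import pandas as pd

def split_by_label(batches, goto):
    # `goto` labels every row of a chunk; label -1 means "discard",
    # any other label selects a bucket (one output file in sparq).
    parts = {}
    for dtf in batches:
        dtf = dtf.copy()
        dtf["_label"] = goto(dtf)
        for label, group in dtf.groupby("_label"):
            if label == -1:
                continue
            parts.setdefault(label, []).append(group.drop(columns="_label"))
    return {k: pd.concat(v, ignore_index=True) for k, v in parts.items()}

goto = lambda dtf: [x % 2 if x < 4 else -1 for x in dtf["x"]]
out = split_by_label([pd.DataFrame({"x": range(6)})], goto)
print(sorted(out), len(out[0]), len(out[1]))  # [0, 1] 2 2
```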
- DATA.Tools.stats(pf, index=None, sample=1000000)¶
Gather statistical data over the parquet file.
- Parameters:
pf (pq.ParquetFile or str) – the parquet file, or its path, that we want to iterate through
index (list(str), optional) – the index columns we want to keep from pf
sample (int, optional) – size of each sample
- Returns:
a class with the following public variables: mean, variation, min, max, N
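The single-pass accumulation such a stats pass typically performs can be sketched without any parquet machinery (plain lists stand in for the samples, and the dict keys mirror the public variables above):

```python
def streaming_stats(chunks):
    # Accumulate count, sum and sum of squares chunk by chunk, so
    # mean and variance never need the whole file in memory.
    n = s = s2 = 0
    mn, mx = float("inf"), float("-inf")
    for xs in chunks:
        n += len(xs)
        s += sum(xs)
        s2 += sum(x * x for x in xs)
        mn, mx = min(mn, min(xs)), max(mx, max(xs))
    mean = s / n
    return {"mean": mean, "variance": s2 / n - mean ** 2,
            "min": mn, "max": mx, "N": n}

print(streaming_stats([[1, 2], [3, 4]]))
# {'mean': 2.5, 'variance': 1.25, 'min': 1, 'max': 4, 'N': 4}
```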
models¶
- class DATA.models.Bicephale(Densize, activation='relu')¶
Bicephale model
the geometry of the model is the following:
Input ─→ DenseBlock ─→ Sig ─┐
                            ├─→ dot product
Input ─→ DenseBlock ─→ Lin ─┘
- Parameters:
Densize (list(int) or (list(int), list(int))) – the sizes of the layers in the DenseBlock; if it is a tuple, the first element defines the Sig part (the classifier) and the second defines the Lin part (the estimator).
activation (str or (str, str), optional) – the name of the activation function in the DenseBlock; if it is a tuple, the first element defines the Sig part (the classifier) and the second defines the Lin part (the estimator).
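A hypothetical reconstruction of this geometry with the Keras functional API; the input width and layer sizes below are illustrative, not taken from the library:

```python
from tensorflow import keras

inp = keras.Input(shape=(8,))
sig = keras.layers.Dense(4, activation="relu")(inp)      # Sig dense block
sig = keras.layers.Dense(1, activation="sigmoid")(sig)   # classifier head
lin = keras.layers.Dense(4, activation="relu")(inp)      # Lin dense block
lin = keras.layers.Dense(1, activation="linear")(lin)    # estimator head
out = keras.layers.Dot(axes=1)([sig, lin])               # dot product
model = keras.Model(inp, out)
print(model.output_shape)  # (None, 1)
```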
- class DATA.models.DenseBlock(hidden_units, activation='relu')¶
Dense hidden layers block
- Parameters:
hidden_units (list(int)) – the sizes of the successive layers in the block
activation (str, optional) – the name of the activation function
- DATA.models.cephalise(model)¶
Return the two sub-models of the Bicephale model.
- Parameters:
model (keras.Model) – should be a Bicephale model
train¶
- DATA.train.getdta(files, sizes, sample)¶
Get the data for the training.
- Parameters:
files (list(pq.ParquetFile)) – should be the same length as sizes; lists the parquet files the training set is composed from
sizes (list(int)) – the number of values we want to get from each file
sample (int) – size of each sample
- Returns:
the concatenated data of the different files
- Return type:
pd.DataFrame
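A sketch of the composition step, with DataFrames standing in for the parquet files and a plain .sample() standing in for the library's random draw:

```python
import pandas as pd

def compose_training_set(files, sizes):
    # Take sizes[i] random rows from files[i] and concatenate the
    # pieces into one training DataFrame.
    parts = [dtf.sample(n=n, random_state=0) for dtf, n in zip(files, sizes)]
    return pd.concat(parts, ignore_index=True)

files = [pd.DataFrame({"x": range(10)}), pd.DataFrame({"x": range(10, 20)})]
train_set = compose_training_set(files, [3, 2])
print(len(train_set))  # 5
```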
- DATA.train.train(model, xy, files, prop, trainsize, sample, Nepoch, Nsets, nfile=None)¶
Train the model on a dataset composed of given proportions of several files.
- Parameters:
model (keras.Model) – the model we want to train
xy (python function, pd.DataFrame -> (x, y)) – transforms a pandas DataFrame into an (input, output) tuple for the training
files (list(pq.ParquetFile)) – should be the same length as prop; lists the parquet files the training set is composed from
prop (list(float) or equivalent) – the proportion to take from each file (it is normalized)
trainsize (int) – size of the training set
sample (int) – size of the sample used for file operations (see Tools)
Nepoch (int) – number of epochs for the fit function
Nsets (int) – number of training sets before stopping the training
nfile (str, optional) – the name of the CSV file the data will be concatenated into
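The docs say prop is normalized; a plausible sketch of how the proportions turn into per-file draw sizes (the helper name is ours):

```python
def draw_sizes(prop, trainsize):
    # Normalize the proportions, then scale by the training-set size
    # to get the number of rows drawn from each file per set.
    total = sum(prop)
    return [round(p / total * trainsize) for p in prop]

print(draw_sizes([1, 3], 1000))   # [250, 750]
print(draw_sizes([2, 2, 4], 80))  # [20, 20, 40]
```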