arviz_base.dataset_to_dataframe

arviz_base.dataset_to_dataframe(ds, sample_dims=None, labeller=None, multiindex=False, new_dim='label')

Convert a Dataset to a DataFrame via a stacked DataArray, using a labeller.
Parameters:

Returns:
pandas.DataFrame
Examples
The dimensions in sample_dims are stacked into the index (rows) of the DataFrame, and the variables, labelled with their remaining coordinate values, become the columns. When the columns end up much more numerous than the rows (as in the summary example at the end) we might want to transpose the output:
from arviz_base import load_arviz_data, dataset_to_dataframe

idata = load_arviz_data("centered_eight")
dataset_to_dataframe(idata.posterior.dataset)
                 mu  theta[Choate]  theta[Deerfield]  theta[Phillips Andover]  theta[Phillips Exeter]  theta[Hotchkiss]  theta[Lawrenceville]  theta[St. Paul's]  theta[Mt. Hermon]       tau
(0, 0)     1.715723       2.317391          1.450174                 2.085550                2.227076          3.071507              2.712972           3.083764           1.460448  0.877494
(0, 1)     1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 2)     1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 3)     1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 4)     2.017497       1.109120          0.818893                 2.750620                1.928670          1.983162              1.029620           3.662744           2.167574  0.767934
...             ...            ...               ...                      ...                     ...               ...                   ...                ...                ...       ...
(3, 495)   7.750625      11.477589          5.578327                 9.321531                5.812095          5.437099              3.096142           9.731409           7.948321  3.020477
(3, 496)   6.922368       2.710763          8.646136                 3.807844                7.543669          6.788881              6.595036           4.003042           5.275016  2.704639
(3, 497)   5.408836      11.406390          4.446937                 9.210775                6.331074          4.150778              4.812302           9.693257           4.914656  2.236486
(3, 498)   7.721440       7.086139         12.311889                 6.584301               10.286093         10.050167             11.859938           7.952268           9.754468  2.989656
(3, 499)  10.237157      10.464390         13.714306                10.261666               15.180098         10.916030             15.070900          14.923210          14.023129  3.051559

2000 rows × 10 columns
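The result is an ordinary pandas.DataFrame, so the usual pandas machinery applies directly to it. A minimal sketch (the df name is only illustrative):

df = dataset_to_dataframe(idata.posterior.dataset)
df.describe()                    # per-column pandas summary statistics
df["mu"].quantile([0.25, 0.75])  # quartiles of a single labelled column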
The default is to return a single index, with the labels or tuples of coordinate values of the stacked dimensions. To keep all the coordinate values as a MultiIndex use multiindex=True:

dataset_to_dataframe(idata.posterior.dataset, multiindex=True)
label                        mu  theta[Choate]  theta[Deerfield]  theta[Phillips Andover]  theta[Phillips Exeter]  theta[Hotchkiss]  theta[Lawrenceville]  theta[St. Paul's]  theta[Mt. Hermon]       tau
variable                     mu          theta             theta                    theta                   theta             theta                 theta              theta              theta       tau
school                      nan         Choate         Deerfield         Phillips Andover          Phillips Exeter         Hotchkiss         Lawrenceville         St. Paul's         Mt. Hermon       nan
sample   chain draw
(0, 0)   0     0       1.715723       2.317391          1.450174                 2.085550                2.227076          3.071507              2.712972           3.083764           1.460448  0.877494
(0, 1)   0     1       1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 2)   0     2       1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 3)   0     3       1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 4)   0     4       2.017497       1.109120          0.818893                 2.750620                1.928670          1.983162              1.029620           3.662744           2.167574  0.767934
...      ...   ...          ...            ...               ...                      ...                     ...               ...                   ...                ...                ...       ...
(3, 495) 3     495     7.750625      11.477589          5.578327                 9.321531                5.812095          5.437099              3.096142           9.731409           7.948321  3.020477
(3, 496) 3     496     6.922368       2.710763          8.646136                 3.807844                7.543669          6.788881              6.595036           4.003042           5.275016  2.704639
(3, 497) 3     497     5.408836      11.406390          4.446937                 9.210775                6.331074          4.150778              4.812302           9.693257           4.914656  2.236486
(3, 498) 3     498     7.721440       7.086139         12.311889                 6.584301               10.286093         10.050167             11.859938           7.952268           9.754468  2.989656
(3, 499) 3     499    10.237157      10.464390         13.714306                10.261666               15.180098         10.916030             15.070900          14.923210          14.023129  3.051559

2000 rows × 10 columns
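With multiindex=True the coordinate values become regular pandas index levels, so standard pandas selection works on them directly. A minimal sketch, assuming the level names displayed above (chain and draw on the rows, variable on the columns); the df_mi name is only illustrative:

df_mi = dataset_to_dataframe(idata.posterior.dataset, multiindex=True)
df_mi.xs(0, level="chain")                    # all draws from chain 0
df_mi.xs("theta", axis=1, level="variable")   # only the theta columns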
The only restriction on sample_dims is that its dimensions are present in all variables of the dataset. Consequently, we can compute statistical summaries and concatenate the results into a single dataset along a newly created dimension:
import xarray as xr

dims = ["chain", "draw"]
post = idata.posterior.dataset
summaries = xr.concat(
    (
        post.mean(dims).expand_dims(summary=["mean"]),
        post.median(dims).expand_dims(summary=["median"]),
        post.quantile([.25, .75], dim=dims).rename(
            quantile="summary"
        ).assign_coords(summary=["1st quartile", "3rd quartile"])
    ),
    dim="summary"
)
summaries
<xarray.Dataset> Size: 864B
Dimensions:  (summary: 4, school: 8)
Coordinates:
  * summary  (summary) object 32B 'mean' 'median' '1st quartile' '3rd quartile'
  * school   (school) <U16 512B 'Choate' 'Deerfield' ... 'Mt. Hermon'
Data variables:
    mu       (summary) float64 32B 4.171 4.063 1.997 6.536
    theta    (summary, school) float64 256B 6.42 4.954 3.423 ... 9.237 7.939
    tau      (summary) float64 32B 4.321 3.511 2.191 5.669
Attributes:
    created_at:                 2025-01-19T14:32:33.071271+00:00
    arviz_version:              0.20.0
    inference_library:          pymc
    inference_library_version:  5.20.0
    sampling_time:              3.159093141555786
    tuning_steps:               1000

Then convert the result into a DataFrame for ease of viewing.
dataset_to_dataframe(summaries, sample_dims=["summary"]).T
                             mean     median  1st quartile  3rd quartile
mu                       4.171372   4.063302      1.996927      6.535536
theta[Choate]            6.420443   5.795054      2.601156      9.379063
theta[Deerfield]         4.954497   5.015449      1.697938      8.023527
theta[Phillips Andover]  3.422932   3.744714      0.271904      6.907635
theta[Phillips Exeter]   4.753565   4.690572      1.424864      7.960607
theta[Hotchkiss]         3.453035   3.618865      0.461903      6.645211
theta[Lawrenceville]     3.662959   3.904880      0.562143      7.143478
theta[St. Paul's]        6.505227   6.090589      3.059898      9.237491
theta[Mt. Hermon]        4.819780   4.645244      1.337334      7.938904
tau                      4.321166   3.511275      2.190964      5.668695

Note that if all summaries were scalar, it would not be necessary to use expand_dims or to rename dimensions; using assign_coords on the concatenated result to label the newly created dimension would be enough. With the approach above, however, we already generate a dimension with coordinate values and can also combine non-scalar summaries.
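A minimal sketch of that scalar-only alternative (the scalar_summaries name is illustrative, and the quantile summary is left out because it is not scalar); assign_coords on the concatenated result is enough:

scalar_summaries = xr.concat(
    (post.mean(dims), post.median(dims)),    # both reduce chain and draw entirely
    dim="summary"                            # concat creates the new dimension
).assign_coords(summary=["mean", "median"])  # label it, no expand_dims or rename needed
dataset_to_dataframe(scalar_summaries, sample_dims=["summary"]).T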