arviz_base.dataset_to_dataframe

arviz_base.dataset_to_dataframe(ds, sample_dims=None, labeller=None, multiindex=False, new_dim='label')

Convert a Dataset to a DataFrame via a stacked DataArray, using a labeller.
Parameters:

Returns:
pandas.DataFrame
Examples
The dimensions in sample_dims are stacked into the index (rows) of the DataFrame, and the variables, labelled with their remaining coordinate values, become the columns. When the columns end up much more numerous than the rows (as in the summary example at the end) we might want to transpose the output:
from arviz_base import load_arviz_data, dataset_to_dataframe

idata = load_arviz_data("centered_eight")
dataset_to_dataframe(idata.posterior.dataset)
                 mu  theta[Choate]  theta[Deerfield]  theta[Phillips Andover]  theta[Phillips Exeter]  theta[Hotchkiss]  theta[Lawrenceville]  theta[St. Paul's]  theta[Mt. Hermon]       tau
(0, 0)     1.715723       2.317391          1.450174                 2.085550                2.227076          3.071507              2.712972           3.083764           1.460448  0.877494
(0, 1)     1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 2)     1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 3)     1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 4)     2.017497       1.109120          0.818893                 2.750620                1.928670          1.983162              1.029620           3.662744           2.167574  0.767934
...             ...            ...               ...                      ...                     ...               ...                   ...                ...                ...       ...
(3, 495)   7.750625      11.477589          5.578327                 9.321531                5.812095          5.437099              3.096142           9.731409           7.948321  3.020477
(3, 496)   6.922368       2.710763          8.646136                 3.807844                7.543669          6.788881              6.595036           4.003042           5.275016  2.704639
(3, 497)   5.408836      11.406390          4.446937                 9.210775                6.331074          4.150778              4.812302           9.693257           4.914656  2.236486
(3, 498)   7.721440       7.086139         12.311889                 6.584301               10.286093         10.050167             11.859938           7.952268           9.754468  2.989656
(3, 499)  10.237157      10.464390         13.714306                10.261666               15.180098         10.916030             15.070900          14.923210          14.023129  3.051559

2000 rows × 10 columns
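The result is an ordinary pandas.DataFrame, so the usual pandas machinery applies directly to it. A minimal sketch (the df name is only illustrative):

df = dataset_to_dataframe(idata.posterior.dataset)
df.describe()                    # per-column pandas summary statistics
df["mu"].quantile([0.25, 0.75])  # quartiles of a single labelled column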
The default is to return a single index, with the labels or tuples of coordinate values of the stacked dimensions. To keep all the coordinate values as a MultiIndex use multiindex=True:

dataset_to_dataframe(idata.posterior.dataset, multiindex=True)
label                        mu  theta[Choate]  theta[Deerfield]  theta[Phillips Andover]  theta[Phillips Exeter]  theta[Hotchkiss]  theta[Lawrenceville]  theta[St. Paul's]  theta[Mt. Hermon]       tau
variable                     mu          theta             theta                    theta                   theta             theta                 theta              theta              theta       tau
school                      nan         Choate         Deerfield         Phillips Andover          Phillips Exeter         Hotchkiss         Lawrenceville         St. Paul's         Mt. Hermon       nan
sample   chain draw
(0, 0)   0     0       1.715723       2.317391          1.450174                 2.085550                2.227076          3.071507              2.712972           3.083764           1.460448  0.877494
(0, 1)   0     1       1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 2)   0     2       1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 3)   0     3       1.903481       0.889170          0.742949                 3.125869                2.779524          2.834705              1.558939           2.487503           1.984379  0.802714
(0, 4)   0     4       2.017497       1.109120          0.818893                 2.750620                1.928670          1.983162              1.029620           3.662744           2.167574  0.767934
...      ...   ...          ...            ...               ...                      ...                     ...               ...                   ...                ...                ...       ...
(3, 495) 3     495     7.750625      11.477589          5.578327                 9.321531                5.812095          5.437099              3.096142           9.731409           7.948321  3.020477
(3, 496) 3     496     6.922368       2.710763          8.646136                 3.807844                7.543669          6.788881              6.595036           4.003042           5.275016  2.704639
(3, 497) 3     497     5.408836      11.406390          4.446937                 9.210775                6.331074          4.150778              4.812302           9.693257           4.914656  2.236486
(3, 498) 3     498     7.721440       7.086139         12.311889                 6.584301               10.286093         10.050167             11.859938           7.952268           9.754468  2.989656
(3, 499) 3     499    10.237157      10.464390         13.714306                10.261666               15.180098         10.916030             15.070900          14.923210          14.023129  3.051559

2000 rows × 10 columns
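With multiindex=True the coordinate values become regular pandas index levels, so standard pandas selection works on them directly. A minimal sketch, assuming the level names displayed above (chain and draw on the rows, variable on the columns); the df_mi name is only illustrative:

df_mi = dataset_to_dataframe(idata.posterior.dataset, multiindex=True)
df_mi.xs(0, level="chain")                    # all draws from chain 0
df_mi.xs("theta", axis=1, level="variable")   # only the theta columns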
The only restriction on sample_dims is that its dimensions are present in all variables of the dataset. Consequently, we can compute statistical summaries and concatenate the results into a single dataset along a newly created dimension:
import xarray as xr

dims = ["chain", "draw"]
post = idata.posterior.dataset
summaries = xr.concat(
    (
        post.mean(dims).expand_dims(summary=["mean"]),
        post.median(dims).expand_dims(summary=["median"]),
        post.quantile([.25, .75], dim=dims).rename(
            quantile="summary"
        ).assign_coords(summary=["1st quartile", "3rd quartile"])
    ),
    dim="summary"
)
summaries
<xarray.Dataset> Size: 864B
Dimensions:  (summary: 4, school: 8)
Coordinates:
  * summary  (summary) object 32B 'mean' 'median' '1st quartile' '3rd quartile'
  * school   (school) <U16 512B 'Choate' 'Deerfield' ... 'Mt. Hermon'
Data variables:
    mu       (summary) float64 32B 4.171 4.063 1.997 6.536
    theta    (summary, school) float64 256B 6.42 4.954 3.423 ... 9.237 7.939
    tau      (summary) float64 32B 4.321 3.511 2.191 5.669
Attributes:
    created_at:                 2025-01-19T14:32:33.071271+00:00
    arviz_version:              0.20.0
    inference_library:          pymc
    inference_library_version:  5.20.0
    sampling_time:              3.159093141555786
    tuning_steps:               1000

Then convert the result into a DataFrame for ease of viewing.
dataset_to_dataframe(summaries, sample_dims=["summary"]).T
                             mean     median  1st quartile  3rd quartile
mu                       4.171372   4.063302      1.996927      6.535536
theta[Choate]            6.420443   5.795054      2.601156      9.379063
theta[Deerfield]         4.954497   5.015449      1.697938      8.023527
theta[Phillips Andover]  3.422932   3.744714      0.271904      6.907635
theta[Phillips Exeter]   4.753565   4.690572      1.424864      7.960607
theta[Hotchkiss]         3.453035   3.618865      0.461903      6.645211
theta[Lawrenceville]     3.662959   3.904880      0.562143      7.143478
theta[St. Paul's]        6.505227   6.090589      3.059898      9.237491
theta[Mt. Hermon]        4.819780   4.645244      1.337334      7.938904
tau                      4.321166   3.511275      2.190964      5.668695

Note that if all summaries were scalar, it would not be necessary to use expand_dims or to rename dimensions; using assign_coords on the concatenated result to label the newly created dimension would be enough. With the approach above, however, we already generate a dimension with coordinate values and can also combine non-scalar summaries.
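A minimal sketch of that scalar-only alternative (the scalar_summaries name is illustrative, and the quantile summary is left out because it is not scalar); assign_coords on the concatenated result is enough:

scalar_summaries = xr.concat(
    (post.mean(dims), post.median(dims)),    # both reduce chain and draw entirely
    dim="summary"                            # concat creates the new dimension
).assign_coords(summary=["mean", "median"])  # label it, no expand_dims or rename needed
dataset_to_dataframe(scalar_summaries, sample_dims=["summary"]).T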