Retrievals: Dealing with multiple datasets#

Written by Evert Nasedkin. Please cite pRT’s retrieval package (Nasedkin et al. 2024) in addition to pRT (Mollière et al. 2019) if you make use of the retrieval package for your work.

This advanced tutorial will use JWST observations of WASP39 b to demonstrate how to incorporate multiple datasets into the pRT retrieval framework. This is the nominal Radtrans workflow. For the SpectralModel workflow, see the SpectralModel retrieval notebook.

Getting started#

Please make sure to have worked through the“Basic Retrieval Tutorial”before looking at the material below.

In this tutorial, we will outline the process of setting up a RetrievalConfig object, which is the class used to set up a pRT retrieval. The basic process is always to set up the configuration, and then pass it to the Retrieval class to run the retrieval using, for example, pyMultiNest. Like mentioned in the “Basic Retrieval Tutorial” several standard plotting outputs will also be produced by the retrieval class. Most of the classes and functions used in this tutorial have more advanced features than what will be explained here, so it’s highly recommended to take a look at the code and API documentation. There should be enough flexibility built in to cover most typical retrieval studies, but if you have feature requests please get in touch, or open an issue on gitlab.

[1]:
# Let's start by importing everything we need
import os

import matplotlib.pyplot as plt
import numpy as np

from petitRADTRANS import physical_constants as cst
from petitRADTRANS.radtrans import Radtrans

# Import the class used to set up the retrieval.
from petitRADTRANS.retrieval import Retrieval,RetrievalConfig

# Import Prior functions, if necessary.
from petitRADTRANS.retrieval.utils import gaussian_prior

# Import atmospheric model function
from petitRADTRANS.retrieval.models import guillot_transmission

Lets start out by setting up a simple run definition. We’ll add the data after we define the model function below Full details of the parameters can be found in the API documentation.

[2]:
# Lets start out by setting up a simple run definition
# Full details of the parameters can be found in retrieval_config.py

# Since our retrieval has not run yet, we'll set the mode to 'retrieve'.
# If it has already run, we can set it to 'evaluate', so we can make some plots.
# In general, the evaluate functions will also be called after the 'retrieve' mode has finished.

retrieval_config = RetrievalConfig(
    retrieval_name="WASP39b_Guillot_FreeChem_PatchyGreyHaze",
    run_mode="retrieve", # this must be 'retrieve' to run PyMultiNest, 'evaluate' if looking at an existing run
    pressures=np.logspace(-8,3,100), # Extend up to 10^-8 bar
    amr=False, # We won't be using adaptive mesh refinement for the pressure grid
    scattering_in_emission=False
) # This would turn on scattering when calculating emission spectra.
# Scattering is automatically included for transmission spectra.

Data#

Let’s start with reading in the data. The data must be a 1D spectrum with error bars or a covariance matrix.

As in the basic tutorial, we’re reading in text files, but this time we also include the column that describes the wavelength bins:

# Wavelength [micron], Bins [micron], Flux [W/m2/micron or (Rp/Rstar)^2], Flux Error [W/m2/micron or (Rp/Rstar)^2]

As mentioned, the Data class is arguably the most important part of setting up the retrieval. Not only do you input your data here, but you also choose your model function and resolution. This means that you can design a retrieval around different datatypes and retrieve simultaneously on both - for example, if you want the day and nightside of a planet, or want to combine the eastward and westward limbs of a transmission spectrum with different models. New in version 3.1: you can now set the data_resolution as an array. For many instruments, such as those on JWST, the spectral resolution varies with wavelength. In order to more accurately capture this, you can now pass an array with the spectral resolution at each wavelength, which determines the width of the Gaussian kernel convolved with the model.

You can also set a distance to your object, which will allow you to automatically scale the flux and error of your data using the scale_to_distance() method - useful if you have data normalized to 10pc! Finally, there’s also a arguments scale, scale_err and offset_bool, which tells the retrieval that the flux or uncertaines should be scaled by an arbitrary multiplicative factor or have an additive offset, both which is set up as a normal retrieval parameter using the RetrivalConfig.add_parameter() method. The name must be of the format DATANAME_scale_factor or DATANAME_offset. This is useful if two datasets are incompatible in absolute photometry, but you still want to use the spectral shape to inform the retrieval.

In this retrieval we’re going to include several datasets from different JWST instruments, starting with NIRISS SOSS orders 1 and 2. To include both of them , we simply add more than one dataset to our RetrievalConfig object. Notice that we’re also telling the retrieval that we want the data of Order 2 to have an additive offset: this will allow the data to float relative to Order 1, which remains fixed. This can be used to compensate for differences in transit depth between different instruments.

We’re also using the built-in guillot_transmission model for this retrieval, rather than writing our own model function from scratch.

[3]:
# Here we import petitRADTRANS to find the path of the example files on your machine.
# In general this is not required, you just put the files in the folder that you are running
# Your script in, for example.
import petitRADTRANS # need to get the name for the example data

path_to_data = "./"

transmission_directory = "retrievals/transmission/observations/"
retrieval_config.add_data(
    'JWST/NIRISSSOSS/O1',
    f"{path_to_data}{transmission_directory}JWST/WASP39b_niriss_soss1.txt",
    data_resolution=700,
    model_resolution=300,
    model_generating_function=guillot_transmission,
    external_radtrans_reference=None
)
retrieval_config.add_data(
    'JWST/NIRISSSOSS/O2',
    f"{path_to_data}{transmission_directory}JWST/WASP39b_niriss_soss2.txt",
    data_resolution=700,
    model_resolution=300,
    offset_bool=True,
    model_generating_function=guillot_transmission,
    external_radtrans_reference=None
)

External references#

Sometimes the datasets will include regions that overlap in wavelength, or where one dataset falls entirely within the wavelength range of another dataset. In that case we can use an “external reference”: for the dataset (let’s call it “short”) that falls within the wavelength range of the other (let’s call it “long”) we will not compute a model spectrum. Instead, we will use the model spectrum calculated for the “long” dataset. This way we only need to compute the spectrum once in the same wavelength range, rather than having the retrieval package initialise two Radtrans objects and calculating a spectrum for each. This saves both memory and time. However, be careful here: the model resolution in the reference object should be high enough to properly sample any datasets that reference it! If you have multiple data sets that do not overlap, but lie back-to-back to each other in wavelength space, with at most small gaps, it makes also sense to use external references, since it reduced computational overheads.

In this example, the NIRSpec PRISM data covers the entire NIRISS SOSS wavelength range, so we can use it as a reference for both NIRISS SOSS orders.

Note that here we appear to do something odd: the data resolution is more than twice larger than the model resolution for NIRISS SOSS. In the special case here the SOSS data files are binned to R = 100. Thus, while the spectrum was indeed recorded at R=700, it is OK to use a lower pRT model resolution for the retrievals.

[4]:
retrieval_config.data = {} # Remove the previous data that was added above to start with a clean slate.
retrieval_config.add_data(
    'JWST/NIRSPEC/PRISM',
    f"{path_to_data}{transmission_directory}JWST/WASP39b_nirspec_prism.txt",
    data_resolution=100,
    model_resolution=300,
    offset_bool=True,
    model_generating_function=guillot_transmission
)
retrieval_config.add_data(
    'JWST/NIRISSSOSS/O1',
    f"{path_to_data}{transmission_directory}JWST/WASP39b_niriss_soss1.txt",
    data_resolution=700,
    model_resolution=300,
    model_generating_function = guillot_transmission,
    external_radtrans_reference='JWST/NIRSPEC/PRISM'  # here we set the external pRT reference to PRISM
)
retrieval_config.add_data(
    'JWST/NIRISSSOSS/O2',
    f"{path_to_data}{transmission_directory}JWST/WASP39b_niriss_soss2.txt",
    data_resolution=700,
    model_resolution=300,
    offset_bool=True,
    model_generating_function = guillot_transmission,
    external_radtrans_reference = 'JWST/NIRSPEC/PRISM'  # here we set the external pRT reference to PRISM
)

Model parameters#

Here we’re using a more complicated atmospheric model to fit the JWST data. The temperature profile is taken from Guillot 2010, and includes four parameters to describe the shape. We’re freely retrieving the chemical abundances, and include both patchy grey clouds and an enhanced power law slope as a proxy for hazes. The cloud coverage fraction is set with the cloud_fraction (formerly patchiness, which will still work) parameter.

If we’re using real cloud opacities, rather than grey ones, we, can set the patchiness to only apply to a subset of the cloud species. To do this, we just need to create a parameter called complete_coverage_clouds, and set the value to a list of the cloud species that are NOT patchy.

[5]:
# WASP 39 parameters
retrieval_config.add_parameter(
    name='stellar_radius',
    free=False,
    value=0.9324 * cst.r_sun
)

# Fix the reference pressure in bar
retrieval_config.add_parameter(
    'reference_pressure',
    False,
    value=0.01
)

# Choose two of log_g, radius and mass priors
retrieval_config.add_parameter(
    'log_g',
    True,
    transform_prior_cube_coordinate=lambda x: 2.0 + 3.5 * x
)
retrieval_config.add_parameter(
    'planet_radius',
    True,
    transform_prior_cube_coordinate=lambda x: 0.8 * cst.r_jup_mean + (x * 0.8 * cst.r_jup_mean)
)

# Priors for Guillot 2010 Temperature Profile
retrieval_config.add_parameter(
    "T_int",
    True,
    transform_prior_cube_coordinate=lambda x: 100 + 3500 * x
)
retrieval_config.add_parameter(
    "T_equ",
    True,
    transform_prior_cube_coordinate=lambda x: 100 + 3500 * x
)
retrieval_config.add_parameter(
    "gamma",
    True,
    transform_prior_cube_coordinate=lambda x:  10 ** (-(x / 2) ** 2 / 2)
)
retrieval_config.add_parameter(
    "log_kappa_IR",
    True,
    transform_prior_cube_coordinate=lambda x: -4.0 + 6.0 * x
)

# Grey cloud top pressure
retrieval_config.add_parameter(
    'log_Pcloud',
    True,
    transform_prior_cube_coordinate=lambda x: -8 + 11 * x
)

# Enhanced haze scattering slope
# kappa
retrieval_config.add_parameter(
    'haze_factor',
    True,
    transform_prior_cube_coordinate=lambda x: -4 + 14 * x
)
# gamma
retrieval_config.add_parameter(
    'power_law_opacity_350nm',
    True,
    transform_prior_cube_coordinate=lambda x: 10**(-20 + 40 * x)
)
retrieval_config.add_parameter(
    'power_law_opacity_coefficient',
    True,
    transform_prior_cube_coordinate=lambda x: -20 + 22 * x
)
# Cloud fraction
retrieval_config.add_parameter(
    'cloud_fraction',
    True,
    transform_prior_cube_coordinate=lambda x: x
)

# Data offsets
retrieval_config.add_parameter(
    'JWST/NIRSPEC/PRISM_offset',
    True,
    transform_prior_cube_coordinate=lambda x : gaussian_prior(x, 0, 1e-4)
)
retrieval_config.add_parameter(
    'JWST/NIRISSSOSS/O2_offset',
    True,
    transform_prior_cube_coordinate=lambda x : gaussian_prior(x, 0, 1e-4)
)

Opacities#

[6]:
retrieval_config.set_rayleigh_species(['H2', 'He'])
retrieval_config.set_continuum_opacities(['H2-H2', 'H2-He'])

# Here we setup the line species for a free retrieval,
# setting the prior bounds of the log10 abundance with the abund_lim parameter
# So the retrieved value is the log mass fraction.
retrieval_config.set_line_species(["H2O__POKAZATEL", "CO-NatAbund", "CO2", "CH4", "SO2"], eq=False, abund_lim = (-8.0,0.0))

Plotting#

Please see “Basic Retrieval Tutorial” if you do not know what’s happening here.

[7]:
# Define what to put into corner plot if run_mode == 'evaluate'
for key, value in retrieval_config.parameters.items():
    value.plot_in_corner = True
    value.corner_label = key.replace("_"," ")

# Define axis properties of spectral plot if run_mode == 'evaluate'
retrieval_config.plot_kwargs["spec_xlabel"] = 'Wavelength [micron]'
retrieval_config.plot_kwargs["spec_ylabel"] = r'$(R_{\rm P}/R_*)^2$ [ppm]'
retrieval_config.plot_kwargs["y_axis_scaling"] = 1e6 # so we have units of ppm
retrieval_config.plot_kwargs["xscale"] = 'linear'
retrieval_config.plot_kwargs["yscale"] = 'linear'

# Use at least ~100 samples to plot 3 sigma curves
retrieval_config.plot_kwargs["nsample"] = 10

# Define from which observation object to take P-T
# in evaluation mode (if run_mode == 'evaluate'),
# add PT-envelope plotting options
retrieval_config.plot_kwargs["take_PTs_from"] = 'JWST/NIRSPEC/PRISM'
retrieval_config.plot_kwargs["temp_limits"] = [150, 3000]
retrieval_config.plot_kwargs["press_limits"] = [1e1, 1e-7]

# If in doubt, define all of the plot_kwargs used here.

Running the retrieval#

Like in “Basic Retrieval Tutorial” we can now run the retrieval. Most of the various parameters used to control pyMultiNest or Ultranest can be set in the retrieval.run() function, see the API documentation.

Once the retrieval is complete, we can use the plot_all function to generate plots of the best fit spectrum, the pressure-temperature profile and the corner plots, also see “Basic Retrieval Tutorial”.

[8]:
output_dir = f"{path_to_data}results"

retrieval = Retrieval(
    retrieval_config,
    output_directory=output_dir,
    evaluate_sample_spectra=False,  # output the spectrum from nsample random samples.
    use_prt_plot_style=True,  # we think that our plots look nice.
    ultranest=False  # let's use pyMultiNest rather than Ultranest
)
Setting up Radtrans object for data 'JWST/NIRSPEC/PRISM'...
Loading Radtrans opacities...
 Loading line opacities of species 'H2O__POKAZATEL.R300' from file '/home/dblain/petitRADTRANS/input_data/opacities/lines/correlated_k/H2O/1H2-16O/1H2-16O__POKAZATEL.R300_0.3-50mu.ktable.petitRADTRANS.h5'... Done.
 Loading line opacities of species 'CO-NatAbund.R300' from file '/home/dblain/petitRADTRANS/input_data/opacities/lines/correlated_k/CO/C-O-NatAbund/C-O-NatAbund__HITEMP.R300_0.1-250mu.ktable.petitRADTRANS.h5'... Done.
 Loading line opacities of species 'CO2.R300' from file '/home/dblain/petitRADTRANS/input_data/opacities/lines/correlated_k/CO2/12C-16O2/12C-16O2__UCL-4000.R300_0.3-50mu.ktable.petitRADTRANS.h5'... Done.
 Loading line opacities of species 'CH4.R300' from file '/home/dblain/petitRADTRANS/input_data/opacities/lines/correlated_k/CH4/12C-1H4/12C-1H4__YT34to10.R300_0.3-50mu.ktable.petitRADTRANS.h5'... Done.
 Loading line opacities of species 'SO2.R300' from file '/home/dblain/petitRADTRANS/input_data/opacities/lines/correlated_k/SO2/32S-16O2/32S-16O2__ExoAmes.R300_0.3-50mu.ktable.petitRADTRANS.h5'... Done.
 Successfully loaded all line opacities
 Loading CIA opacities for H2-H2 from file '/home/dblain/petitRADTRANS/input_data/opacities/continuum/collision_induced_absorptions/H2--H2/H2--H2-NatAbund/H2--H2-NatAbund__BoRi.R831_0.6-250mu.ciatable.petitRADTRANS.h5'... Done.
 Loading CIA opacities for H2-He from file '/home/dblain/petitRADTRANS/input_data/opacities/continuum/collision_induced_absorptions/H2--He/H2--He-NatAbund/H2--He-NatAbund__BoRi.DeltaWavenumber2_0.5-500mu.ciatable.petitRADTRANS.h5'... Done.
 Successfully loaded all CIA opacities
Successfully loaded all opacities

This retrieval is very complex and thus takes several days to run, with hundred of cores on a cluster.

To try to run the retrieval anyway, set run_retrieval below to True, then execute the cells below.

[9]:
%%time
run_retrieval = False

if run_retrieval:
    retrieval.run(
        sampling_efficiency=0.8,
        const_efficiency_mode=False,
        n_live_points=400
        # See API or other tutorials for additional PyMultiNest or Ultranest parameters
    )
CPU times: user 1e+03 ns, sys: 1 µs, total: 2 µs
Wall time: 5.01 µs
[10]:
# Automatically generate all of the standard output plots. The contribution
# argument means that the PT profile and abundance profile plots will display
# the contribution function. The mode argument means we'll be plotting a model
# based on the median retrieved parameter values, rather than the minimum likelihood
# model.
if run_retrieval:
    retrieval.plot_all(contribution=True, mode='median')

Contact

If you need any additional help, don’t hesitate to contact Evert Nasedkin.