This repository contains the code necessary to replicate the harmonized water quality dataset from:
E. Krasovich, P. Lau, J. Tseng, J. Longmate, K. Bell, and S. Hsiang. "Harmonized nitrogen and phosphorus concentrations in the Mississippi/Atchafalaya River Basin from 1980 to 2018," Scientific Data, 2022.
Data inputs and outputs are available on our HydroShare Repository.
All scripts are written in R. Throughout this README, paths to code and data assume you execute scripts from the top-level SNAPD/ folder using R or RStudio, matching the folder structure in the HydroShare repository.
Run Code/_install_us_wq_packages.R to install all required R packages. The master workflow also installs them when it runs (see Section 3).
Before running any code, download the following two shapefiles and save them to Data/_A_workflow/:
- Mississippi River Basin boundary — Schwartz, M. (2015). USGS Mississippi River Basin. Retrieved from: https://www.sciencebase.gov/catalog/item/55de04d5e4b0518e354dfcf8
- US State boundaries — U.S. Census Bureau (2017). TIGER/Line Shapefile, 2017, nation, U.S., Current State and Equivalent National. Retrieved from: http://www2.census.gov/geo/tiger/TIGER2017/STATE/tl_2017_us_state.zip
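The State boundaries archive is a direct download, so that part of the step can be scripted. The following is a minimal sketch and is not part of the repository: the helper name `fetch_state_boundaries` is hypothetical, and it assumes the pipeline reads the unzipped shapefile components directly from Data/_A_workflow/ (check the scripts that load the shapefiles if unsure). The Mississippi River Basin boundary is linked through a ScienceBase catalog page, so download that one manually.

```r
# Hypothetical helper (not in the repository): download and unzip the
# 2017 TIGER/Line State boundaries into Data/_A_workflow/.
state_zip_url <- "http://www2.census.gov/geo/tiger/TIGER2017/STATE/tl_2017_us_state.zip"

fetch_state_boundaries <- function(url = state_zip_url,
                                   dest_dir = file.path("Data", "_A_workflow")) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  zip_path <- file.path(dest_dir, basename(url))
  # Skip the download if the archive is already present.
  if (!file.exists(zip_path)) {
    utils::download.file(url, destfile = zip_path, mode = "wb")
  }
  # Assumption: the scripts read the shapefile from dest_dir itself.
  utils::unzip(zip_path, exdir = dest_dir)
  invisible(zip_path)
}
```

Call `fetch_state_boundaries()` from the top-level SNAPD/ folder so that the relative path resolves correctly.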
SNAPD/
├── README.md
├── LICENSE
├── Code/
│ ├── _install_us_wq_packages.R # Install required R packages
│ ├── _master_workflow_and_setup.R # Master script: runs full pipeline
│ ├── A00_us_raw_wqd_retrieval_workflow.R
│ ├── A01download_wq_sites_from_WQP.R
│ ├── A02create_and_clean_WQP_site_df.R
│ ├── A03download_wqd_by_nutrient.R
│ ├── A04merge_wqd_w_site_data_by_download.R
│ ├── A05crop_wqp_sites_to_mrb.R
│ ├── B00_us_wqd_processing_workflow.R
│ ├── B01standardize_wq_org_names.R
│ ├── B02recover_state_and_make_unique_sites.R
│ ├── B03flag_sample_level_metadata.R
│ ├── B04flag_raw_obs_w_unknown_chemical_form.R
│ ├── B05flag_result_level_metadata.R
│ ├── B06flag_and_convert_wqd_units.R
│ ├── B07merge_nutrient_compounds_and_rename_RSFs.R
│ ├── B08get_upper_DLs_and_merge_w_wqd.R
│ ├── B09impute_non_detects.R
│ ├── B10flag_potential_outliers.R
│ ├── B11flag_duplicate_types.R
│ ├── B12create_full_flagged_dataset.R
│ ├── B13harmonize_duplicates.R
│ ├── B14combine_parameters.R
│ ├── B15final_cleaning.R
│ ├── C00_us_wq_data_figures_and_tables_workflow.R
│ ├── C01create_raw_wqd_summary_table.R
│ ├── C02create_harmonization_process_table.R
│ ├── C03create_final_wqd_summary_table.R
│ ├── C04create_technical_validation_histograms.R
│ └── C05make_sankey_plots.R
└── Data/ # Not tracked — see HydroShare for all data
Data files are not tracked in this repository. Static copies of all inputs and outputs are available on HydroShare.
The data harmonization pipeline has three stages, each corresponding to a lettered workflow (A, B, and C).
The entire pipeline can be run from the master workflow:
```r
source("Code/_master_workflow_and_setup.R")
```

The master workflow installs packages, loads libraries, creates directories, sets file paths, and sources each stage in sequence. Set your working directory to the top-level SNAPD/ folder before running.
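Because every path is relative to SNAPD/, a wrong working directory is an easy mistake to make. As a sketch only, a small wrapper could check the location before sourcing the master script. The function name `run_snapd_pipeline` is hypothetical and not part of the repository.

```r
# Hypothetical wrapper (not in the repository): run the master workflow
# only if `root` looks like the top-level SNAPD/ folder.
run_snapd_pipeline <- function(root = getwd()) {
  master <- file.path("Code", "_master_workflow_and_setup.R")
  if (!file.exists(file.path(root, master))) {
    stop("Cannot find ", master, " under ", root,
         "; set `root` to the top-level SNAPD/ folder.")
  }
  old_wd <- setwd(root)
  on.exit(setwd(old_wd), add = TRUE)  # restore the caller's working directory
  source(master)
}
```

For example, `run_snapd_pipeline("path/to/SNAPD")` would stop with an error if the folder does not contain the master script, rather than failing partway through the pipeline.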
Entry point: Code/A00_us_raw_wqd_retrieval_workflow.R
Downloads raw water quality site and sample data from the Water Quality Portal and performs minimal cleaning. Output is saved to Data/_A_workflow/all_raw_wqd_and_sites.fst.
Note: The Water Quality Portal is updated frequently, so running Stage 1 may retrieve data that differ from what we used. Unless you need new or updated data, we recommend skipping this stage and using our archived output on HydroShare. Running Stage 1 on new data may require adjustments to downstream code.
| Script | Description |
|---|---|
| A01 | Download WQ sites from WQP |
| A02 | Create and clean WQP site dataframe |
| A03 | Download WQ data by nutrient |
| A04 | Merge WQ data with site data |
| A05 | Crop WQP sites to Mississippi/Atchafalaya River Basin |
Entry point: Code/B00_us_wqd_processing_workflow.R
Performs the cleaning and harmonization steps described in Table 2 of Krasovich et al. (2022). Outputs are saved to Data/_B_workflow/, including:
- WQP_to_SNAPD_flagged.fst: intermediate flagged dataset
- SNAPD.fst: final harmonized dataset
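To inspect the final dataset, the .fst file can be read into a data frame. The sketch below assumes the files are in the format written by the `fst` R package (suggested by the extension, not confirmed here); the helper name `load_snapd` and its `n_max` argument are hypothetical.

```r
# Hypothetical helper (not in the repository): read SNAPD.fst, optionally
# limiting the number of rows for a quick preview.
snapd_path <- file.path("Data", "_B_workflow", "SNAPD.fst")

load_snapd <- function(path = snapd_path, n_max = NULL) {
  if (!file.exists(path)) {
    stop("File not found: ", path,
         ". Run Stage 2 or download the file from HydroShare.")
  }
  if (!requireNamespace("fst", quietly = TRUE)) {
    stop("The 'fst' package is required to read ", path)
  }
  # `to = NULL` reads all rows; otherwise rows 1..n_max are read.
  fst::read_fst(path, from = 1, to = n_max)
}
```

For example, `str(load_snapd(n_max = 100))` would show the column names and types without loading the full dataset.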
| Script | Description |
|---|---|
| B01 | Standardize WQ organization names |
| B02 | Recover state codes and make unique sites |
| B03 | Flag sample-level metadata |
| B04 | Flag observations with unknown chemical form |
| B05 | Flag result-level metadata |
| B06 | Flag and convert WQ data units |
| B07 | Merge nutrient compounds and rename result sample fractions |
| B08 | Get upper detection limits and merge with WQ data |
| B09 | Impute non-detects |
| B10 | Flag potential outliers |
| B11 | Flag duplicate types |
| B12 | Create full flagged dataset |
| B13 | Harmonize duplicates |
| B14 | Combine parameters |
| B15 | Final cleaning |
Entry point: Code/C00_us_wq_data_figures_and_tables_workflow.R
Creates figures and tables used in Krasovich et al. (2022). Requires Stages 1 and 2 to be completed first. Outputs are saved to Data/_C_workflow/.
Stage 3 produces Table 1, Table 2, Table 5, Figure 4, Figure 5 (A and B), Figure 6 (A and B), and SNAPD_final_wqd_sites.csv (used to make Figures 1 and 3 in QGIS). Final figure edits are made in Adobe Illustrator after export from R.
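Because Stage 3 depends on the outputs of Stages 1 and 2, it can help to confirm those files exist before sourcing the C00 workflow. The sketch below is not part of the repository: the helper name `missing_stage_outputs` is hypothetical, and the assumption that Stage 3 needs all three files named in this README is ours.

```r
# Hypothetical helper (not in the repository): list Stage 1 and Stage 2
# outputs that are missing under `root`.
missing_stage_outputs <- function(root = getwd()) {
  # Assumption: these are the files Stage 3 needs; the README names them
  # only as the outputs of Stages 1 and 2.
  required <- c(
    file.path("Data", "_A_workflow", "all_raw_wqd_and_sites.fst"),
    file.path("Data", "_B_workflow", "WQP_to_SNAPD_flagged.fst"),
    file.path("Data", "_B_workflow", "SNAPD.fst")
  )
  required[!file.exists(file.path(root, required))]
}
```

An empty result from `missing_stage_outputs()` means all three files are present. Otherwise the returned paths show which files to generate or download from HydroShare.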
| Script | Description |
|---|---|
| C01 | Create raw WQ data summary table (Table 1) |
| C02 | Create harmonization process table (Table 2) |
| C03 | Create final WQ data summary table (Table 5) |
| C04 | Create technical validation histograms (Figures 4 and 5) |
| C05 | Make Sankey plots (Figure 6) |
Static copies of all data inputs and outputs are archived on HydroShare:
HydroShare Repository: http://www.hydroshare.org/resource/9547035cf37940eb9b500b7994a378a1
Variable definitions for all datasets are in the Data Records section of Krasovich et al. (2022).
Please cite the dataset as:
Krasovich, E., P. Lau, J. Tseng, J. Longmate, K. Bell, S. Hsiang (2022). Standardized Nitrogen and Phosphorus Dataset (SNAPD), HydroShare. http://www.hydroshare.org/resource/9547035cf37940eb9b500b7994a378a1
And the associated paper as:
Krasovich, E., Lau, P., Tseng, J., Longmate, J., Bell, K., & Hsiang, S. (2022). Harmonized nitrogen and phosphorus concentrations in the Mississippi/Atchafalaya River Basin from 1980 to 2018. Scientific Data, 9, 556. https://doi.org/10.1038/s41597-022-01650-6