Technical details

Data sources

Japanese Red List

The Japanese government publishes two red lists: a marine one and a terrestrial one. The data from both lists are available as csv files on the "ikilog" website: https://ikilog.biodic.go.jp/Rdb/booklist. The latest (2020) terrestrial list is found under the "レッドリスト2020について（令和2年）" title. Confusingly, in Japanese it is called just "Red List", without the "terrestrial" part, so it's easy to mistake it for the entire Japanese Red List. However, the separate Marine Red List (latest edition from 2017) is found on the same page under the "海洋生物レッドリストについて（平成29（2017）年）" title. We use all 6,244 unique entries from both of these lists. We show scientific names exactly as they are presented in the original csv files, so that entries can be linked automatically between the csv files and our data.

IUCN Red List

The IUCN (International Union for Conservation of Nature) Red List of Threatened Species is a project documenting the conservation status of species worldwide. Unfortunately, their website (https://www.iucnredlist.org/) does not offer a downloadable list of species. Therefore, we use version 2022-2 of this list, downloaded from GBIF. It can be found at this DOI: https://doi.org/10.15468/0qnb58, or by using the following procedure:

1. Open the GBIF website: https://www.gbif.org/.
2. Open the "Get data" menu, then "Datasets".
3. Choose "CHECKLIST" at the top.
4. Find the dataset called "The IUCN Red List of Threatened Species", possibly navigating multiple pages using the page selector at the bottom of the page.
5. Click on the "DOWNLOAD" link at the top, then choose "Source archive" in the menu.

The link we used: https://hosted-datasets.gbif.org/datasets/iucn/iucn-2022-1.zip. The GBIF page states that it has 150,490 accepted names and 104,093 synonyms. We count 150,489 accepted names and 221,895 synonyms in this data.

NCBI Taxonomy

The NCBI Taxonomy Database (https://www.ncbi.nlm.nih.gov/taxonomy) provides current data at this URL: https://ftp.ncbi.nih.gov/pub/taxonomy/, reachable using the "Taxonomy FTP" link from the main site. We download the latest taxdmp.zip every time we update our site. The current version of our website uses data from taxdmp.zip downloaded on 2024-09-02. It contains 2,608,077 nodes.

iNat Taxonomy

iNaturalist (https://www.inaturalist.org/) provides a monthly updated taxonomy snapshot. It can be found by using the "developers" link at the bottom of the page, then using the "iNaturalist Taxonomy DarwinCore Archive" link, which points to this url: https://www.inaturalist.org/taxa/inaturalist-taxonomy.dwca.zip. We download the latest snapshot every time we update this page. For the current update we used the snapshot downloaded on 2024-09-02. It includes 1,322,935 nodes.

GBIF Taxonomy

We use a snapshot of the GBIF (Global Biodiversity Information Facility) Backbone Taxonomy, available with this DOI: https://doi.org/10.15468/39omei. How to find it:

1. Open the GBIF website: https://www.gbif.org/.
2. Open the "Get data" menu, then "Datasets".
3. Choose "CHECKLIST" at the top.
4. Choose "GBIF Backbone Taxonomy" from the list.
5. Click on the "DOWNLOAD" link at the top, then choose "Source archive" in the menu.

The link we used: https://hosted-datasets.gbif.org/datasets/backbone/current/backbone.zip. Note that the link has no date or version and just includes "current", so it may point to different data in the future. We use all accepted names and synonyms from this taxonomy, except those with "doubtful" taxonomic status. The "backbone.zip" file that we use has a "publication date" of 2023-08-28, and we downloaded it on 2024-01-26. After preprocessing, we count 4,152,023 nodes in this data.

COL Taxonomy

The Catalogue of Life Checklist provides monthly updated taxonomy snapshots. To find them:

1. Open the COL website: https://www.catalogueoflife.org/.
2. In the "DATA" menu at the top, choose "DOWNLOAD".
3. Scroll to the end of the page, and click on the "Monthly Checklist Archive" link.
4. Scroll to the bottom and download the latest dated archive.

For this update, we downloaded the file "2024-08-29_coldp.zip". We use all names that have a taxonomic status of "accepted", "provisionally accepted" or "synonym", and we ignore names with "ambiguous synonym" and "misapplied" statuses. We count 2,606,273 accepted or provisionally accepted names and 2,425,545 synonyms in this dataset.
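
The status filter described above can be illustrated with a minimal sketch. It assumes the names have already been loaded into dictionaries with a "status" key; that key and the sample rows are hypothetical and do not reflect the actual column layout of the archive.

```python
# Statuses that are kept, as listed in the text above.
KEPT_STATUSES = {"accepted", "provisionally accepted", "synonym"}

def filter_names(rows):
    """Keep only rows whose taxonomic status is in KEPT_STATUSES."""
    return [r for r in rows if r["status"].strip().lower() in KEPT_STATUSES]

# Hypothetical sample rows (placeholder names; the "status" key is an assumption).
sample = [
    {"name": "Aus bus", "status": "accepted"},
    {"name": "Aus cus", "status": "ambiguous synonym"},
    {"name": "Aus dus", "status": "synonym"},
    {"name": "Aus eus", "status": "misapplied"},
    {"name": "Aus fus", "status": "provisionally accepted"},
]
kept = filter_names(sample)
```

An exact match against a set is used here, rather than a substring test, so that "ambiguous synonym" is not accidentally accepted as "synonym".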

NCBI Datasets

We use the NCBI "datasets" command line utility, obtained from this location: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/. Using this tool, for each update of our website, we download genome summaries for all family-level taxa relevant to the Red List. Currently there are 1,005 such taxa. We automatically run the command "datasets summary genome taxon TAXID" once for each of those taxonomic ids.
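
As an illustration only, the per-taxon loop could be scripted roughly as follows. The command itself is the one quoted above; the function names, the injectable `runner` parameter and the choice to keep raw standard output are assumptions of this sketch, not a description of our actual scripts.

```python
import subprocess

def build_command(taxid):
    """Return the argument list for one genome summary request."""
    return ["datasets", "summary", "genome", "taxon", str(taxid)]

def fetch_summaries(taxids, runner=subprocess.run):
    """Run the command once per taxid and map taxid -> raw stdout text.

    `runner` is a hypothetical hook so that the loop can be exercised
    without the "datasets" tool being installed.
    """
    results = {}
    for taxid in taxids:
        proc = runner(build_command(taxid), capture_output=True, text=True, check=True)
        results[taxid] = proc.stdout
    return results
```

Calling `fetch_summaries(list_of_family_taxids)` would require the "datasets" executable to be on the PATH. With `check=True`, a failing call raises an exception instead of silently storing empty output.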

Preprocessing

The species lists from www.biodic.go.jp come in 18 separate csv files. The first issue is that 8 of these files are encoded in the archaic Shift-JIS encoding. The problematic files are redlist2020_kairui.csv, redlist2020_invertebrate.csv, and redlist2020_sorui.csv from the terrestrial red list, as well as all 5 files from the marine red list. We converted them into UTF-8, to unify the encoding with the rest of the data. Our csv files, converted to UTF-8 and renamed to more convenient names, are available here: Japanese-Red-List-2020-csv-files-UTF-8.zip.
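
A conversion of this kind can be sketched in a few lines of Python. This is a minimal example under assumptions: the function name and file names are made up, and the "shift_jis" codec is assumed to be sufficient (files produced on Windows may need "cp932" instead).

```python
import tempfile
from pathlib import Path

def convert_to_utf8(src, dst, src_encoding="shift_jis"):
    """Re-encode a text file to UTF-8, leaving line endings untouched."""
    raw = Path(src).read_bytes()
    Path(dst).write_bytes(raw.decode(src_encoding).encode("utf-8"))

# Round-trip demonstration on a temporary file with a hypothetical header line.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample_sjis.csv"
    dst = Path(tmp) / "sample_utf8.csv"
    src.write_bytes("和名,学名\n".encode("shift_jis"))
    convert_to_utf8(src, dst)
    round_trip = dst.read_text(encoding="utf-8")
```

Working on bytes rather than in text mode avoids any platform-dependent newline translation during the conversion.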

We use all unique names from these files. (Some entries with the LP conservation status are listed several times, once for each endangered population. We show only one entry for each of such cases).
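
Collapsing repeated entries can be sketched as below. The assumption here is that duplicates are recognized by an identical scientific name; the "scientific_name" and "status" keys, the placeholder names and the population labels are hypothetical.

```python
def unique_entries(rows):
    """Keep the first row for each scientific name, preserving file order."""
    seen = set()
    result = []
    for row in rows:
        name = row["scientific_name"]
        if name not in seen:
            seen.add(name)
            result.append(row)
    return result

# Hypothetical rows: one taxon listed once per endangered local population (LP).
rows = [
    {"scientific_name": "Aus bus", "status": "LP", "population": "A"},
    {"scientific_name": "Aus bus", "status": "LP", "population": "B"},
    {"scientific_name": "Cus dus", "status": "VU", "population": ""},
]
deduplicated = unique_entries(rows)
```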

We noticed some issues with scientific names, stored in the 4th column of the csv files:

Some names contain double spaces: Anguilla bicolor pacifica.
Some names contain a fullwidth space character (U+3000): Sorex minutissimus hawkeri　.
Some names contain a fullwidth full stop character (U+FF0E): Borniopsis sp．.
Some names use diacritics: Elaphocordyceps jezoënsis (S.Imai) G.H.Sung, J.M.Sung & Spatafora.
Some names have a comma, which means they are enclosed in double quotation marks in the csv (= comma-separated values) format. This is OK, but some lines have a dangling space after the closing double quotation. E.g.: "Ophiocordyceps asyuënsis (Kobayasi & Shimizu) G.H.Sung, J.M.Sung, Hywel-Jones & Spatafora" .

We tried to correct these issues when comparing the names to taxonomy, but we may have missed others. (However, we still show all names unchanged on the page).
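
The kind of cleanup involved can be sketched as follows. This is an illustrative function rather than our exact code; in particular, stripping diacritics is shown as an optional step because how such names are best matched is an assumption of this sketch.

```python
import re
import unicodedata

def normalize_name(raw, strip_diacritics=False):
    """Clean up a scientific name for matching (the displayed name stays unchanged)."""
    name = raw.replace("\u3000", " ")         # fullwidth space -> ASCII space
    name = name.replace("\uff0e", ".")        # fullwidth full stop -> ASCII full stop
    name = re.sub(r"\s+", " ", name).strip()  # collapse runs of spaces, trim both ends
    if strip_diacritics:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", name)
        name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return name

cleaned = [
    normalize_name("Anguilla  bicolor pacifica"),
    normalize_name("Sorex minutissimus hawkeri\u3000"),
    normalize_name("Borniopsis sp\uff0e"),
    normalize_name("Elaphocordyceps jezoënsis", strip_diacritics=True),
]
```

The final `strip()` also takes care of a dangling space left after a closing quotation mark, provided the csv reader keeps that space as part of the field.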

Connecting names to taxonomy

We try to connect all names to the NCBI Taxonomy Database. For each update, we download the latest dump file taxdmp.zip from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/. We only use the "nodes.dmp" and "names.dmp" files from the dump.
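
Both files use the same simple layout: fields are separated by a tab, a vertical bar and a tab, and each line ends with a tab and a vertical bar. A minimal reader could look like the sketch below; the function names and the choice of dictionaries are ours for illustration, and the sample lines are shortened to the leading fields.

```python
def parse_dmp_line(line):
    """Split one taxdump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]          # drop the end-of-row marker
    return line.split("\t|\t")

def load_nodes(lines):
    """Map tax_id -> (parent_tax_id, rank) from nodes.dmp lines."""
    nodes = {}
    for line in lines:
        fields = parse_dmp_line(line)
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes

def load_names(lines):
    """Map name -> list of (tax_id, name_class) from names.dmp lines."""
    names = {}
    for line in lines:
        fields = parse_dmp_line(line)
        names.setdefault(fields[1], []).append((int(fields[0]), fields[3]))
    return names

# Shortened sample lines in the taxdump layout.
nodes = load_nodes(["9606\t|\t9605\t|\tspecies\t|\n"])
names = load_names(["9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"])
```

Storing a list of taxonomy ids per name matters because one name can belong to several nodes.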

For each entry of the Red List, we take its entire preprocessed name, and first check if it is recorded as a "scientific name" in taxonomy. If yes, we show this as a blue check mark (✓) in the table, linked to the corresponding taxonomy page. If the name is not found among scientific names, we check whether it is registered as an alternative name for any taxon. If yes, we show a yellow check mark (✓). If the name is not found in taxonomy at all, we check our own custom list of synonyms that are still missing in taxonomy (synonyms.txt). A name found in this list receives a red check mark (✓), still linked to the taxonomy page. A name not found by any of these methods gets a dash (—) in the taxonomy column.
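
The three-step lookup can be summarized in a short sketch. The dictionaries, the returned labels and the sample ids are hypothetical stand-ins; the real lookup tables are built from names.dmp and synonyms.txt.

```python
def match_name(name, scientific, alternative, custom):
    """Return (mark, tax_id), trying the three sources in order of preference."""
    if name in scientific:
        return ("blue", scientific[name])      # recorded as a scientific name
    if name in alternative:
        return ("yellow", alternative[name])   # registered as an alternative name
    if name in custom:
        return ("red", custom[name])           # found only in the custom synonym list
    return ("dash", None)                      # no link to taxonomy

# Hypothetical lookup tables mapping name -> taxonomy id.
scientific = {"Aus bus": 101}
alternative = {"Aus buss": 101}
custom = {"Aus busi": 101}

marks = [match_name(n, scientific, alternative, custom)
         for n in ("Aus bus", "Aus buss", "Aus busi", "Xus yus")]
```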

Next we need to find species, genus and family of the entry. For entries that are already connected to taxonomy, we traverse the taxonomic tree towards the root, locating nodes with ranks of "species", "genus" and "family". For the remaining entries (those without any link to taxonomy), we don't give up just yet. We extract the first two words of the complete organism name (first three words, if the second word is "sp.") and try to connect it to taxonomy using the procedure described above. If this works, we traverse the tree to locate genus and family nodes. Otherwise, we extract just the first word to use as a genus name, and try to find this genus in taxonomy.
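
The tree traversal and the fallback name extraction can be sketched as follows. The tree here is a toy example with made-up ids, and both function names are hypothetical; `nodes` maps each id to its parent id and rank.

```python
def find_ranks(tax_id, nodes, wanted=("species", "genus", "family")):
    """Walk towards the root, recording the first node found for each wanted rank."""
    found = {}
    seen = set()
    while tax_id in nodes and tax_id not in seen:   # the root is its own parent
        seen.add(tax_id)
        parent, rank = nodes[tax_id]
        if rank in wanted and rank not in found:
            found[rank] = tax_id
        tax_id = parent
    return found

def fallback_names(full_name):
    """Return shorter candidate names: the species-level part, then the genus."""
    words = full_name.split()
    if not words:
        return []
    count = 3 if len(words) > 1 and words[1] == "sp." else 2
    candidates = [" ".join(words[:count]), words[0]]
    return list(dict.fromkeys(candidates))          # drop a repeated candidate

# Toy tree: root <- family <- genus <- species <- subspecies.
toy_nodes = {
    1: (1, "no rank"),
    10: (1, "family"),
    20: (10, "genus"),
    30: (20, "species"),
    40: (30, "subspecies"),
}
ranks = find_ranks(40, toy_nodes)
```

Each candidate returned by `fallback_names` would then go through the same name lookup as the complete name.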

One additional difficulty is the fact that some names correspond to multiple taxonomy nodes. For example, there is a plant genus named "Digitaria", and there is a mollusc genus also named "Digitaria". Now, let's say we are trying to connect Red List entry "Digitaria mollicoma" to the taxonomy. First we check the complete name and find it missing. But we still hope to establish any link to taxonomy for this entry. So we switch to the genus level, and try to locate "Digitaria" in the taxonomy, to find that it is listed twice, for plants and for molluscs. In such cases we take a clue from which section of the Red List we are processing now. In this instance, "Digitaria mollicoma" is from the plant section (file "redlist2020_ikansoku.csv"), which allows us to connect it to the plant genus "Digitaria".
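
One way to implement such a clue is to require that the chosen node descends from a taxon expected for the current section of the Red List. The sketch below assumes a mapping from section to an expected ancestor; all ids in the toy tree are made up, and the function names are hypothetical.

```python
def lineage(tax_id, nodes):
    """Return the set of ids on the path from tax_id up to the root."""
    path = set()
    while tax_id in nodes and tax_id not in path:
        path.add(tax_id)
        tax_id = nodes[tax_id][0]
    return path

def pick_by_section(candidates, nodes, section_root):
    """Pick the single candidate that descends from section_root, if any."""
    matches = [t for t in candidates if section_root in lineage(t, nodes)]
    return matches[0] if len(matches) == 1 else None

# Toy tree with made-up ids: two genera share one name in different groups.
toy_nodes = {
    1: (1, "no rank"),
    100: (1, "kingdom"),    # stands in for plants
    200: (1, "phylum"),     # stands in for molluscs
    110: (100, "genus"),    # plant genus "Digitaria"
    210: (200, "genus"),    # mollusc genus "Digitaria"
}
homonyms = [110, 210]
chosen = pick_by_section(homonyms, toy_nodes, section_root=100)
```

Returning `None` when zero or several candidates fit keeps an ambiguous entry unlinked rather than guessing.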

Eventually some entries connect to taxonomy at the subspecies level, some at species, genus, or family level. And some entries end up without any link to taxonomy.

Locating genomes

We use NCBI Datasets command line tools to download genome summaries for all taxa related to entries in the list. Only entries that could be connected to taxonomy are used for this.

Processing script

In the interest of transparency, we are sharing the scripts used for processing the Red List data and constructing this website.

Japanese-Red-List-Genomes-processing-scripts.zip - v.0.2.0, 2024-07-03, 65 kB

The scripts are available under the zlib/libpng license.