
Plant material
The 20 barley pan-genome genotypes8 were used to prepare a barley pan-transcriptome. These are Akashinriki, Barke, ZDM02064, ZDM01467, B1K-04-12, Golden Promise, Hockett, HOR10350, HOR13821, HOR13942, HOR21599, HOR3081, HOR3365, HOR7552, HOR8148, HOR9043, Igri, Morex, OUN333 and RGT Planet. Five different tissue/organ types, including embryo, mesocotyl and seminal roots (together referred to throughout as embryonic tissue), seedling shoot, seedling root, inflorescence and caryopsis, were collected in three biological replicates, amounting to 300 samples. Tissue samples for each genotype were collected at the same growth stage (rather than time after planting), which varied between genotypes.
Forty-five seeds from each genotype, representing three plants for each tissue and each biological replicate, were germinated on water-soaked filter papers in Petri dishes, covered with foil and left at room temperature. Four to eight days postgermination, embryonic tissue at a similar developmental stage was collected by carefully removing the seed coat, leaving the embryo, mesocotyl and seminal roots. The remainder of the germinated seeds were planted in cereal mix compost in seed trays and grown in a glasshouse at nominally 18 °C—16 h light, 400 µmol m−2 s−1 and 14 °C—8 h dark, 0 µmol m−2 s−1. Young shoot and root tissues were collected between 14 and 17 days. Plants were then vernalized for 2 weeks at 4 °C—2 h light, 400 µmol m−2 s−1 and 4 °C—12 h dark, 0 µmol m−2 s−1, then transplanted into six-inch pots containing cereal mix compost and transferred to a glasshouse and grown under standard barley growing conditions (nominally 18 °C—16 h light, supplemented when appropriate with 400 µmol m−2 s−1 light from sodium lamps, and 14 °C—8 h ambient dark, 0 µmol m−2 s−1). The 1–1.5 cm immature inflorescences (Waddington W6–W7) were dissected from the developing spike, and whole developing caryopses were collected at 15–20 days postanthesis (Fig. 1).
RNA extraction and sequencing
Total RNA extraction was carried out using the Macherey-Nagel NucleoSpin RNA Plant Mini Kit using the manufacturer’s instructions. All extractions included the DNAse step. Seedling root and caryopsis tissues included a plant isolation aid (Invitrogen; equal volume of plant isolation aid per gram of tissue) and centrifugation before RNA isolation. Total RNA was quality checked using a Bioanalyzer 2100 (Agilent) and showed an RNA integrity number >8 except for two samples (HOR10350 caryopsis rep 3 and HOR7552 root rep 2), which were not replaced. Hor8148_In3 was identified as mislabeled, thus also being excluded. Strand-specific dUTP libraries and Illumina paired-end 150 bp sequencing were completed by Novogene (UK) Company.
Single-molecule real-time sequencing (SMRT) was performed at IPK Gatersleben. Equal amounts of total RNA from one biological replicate from each of the five sample tissues were pooled, and this was repeated for each of the remaining barley genotypes to produce 20 pooled samples for long-read sequencing. Samples were size fractioned to ~2 kbp, and 300 ng of total RNA per genotype was used to prepare libraries according to the Iso-Seq Express 2.0 workflow using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences) following the manufacturer’s instructions. Each library was sequenced through an individual SMRT cell (one genotype/SMRT cell) on a PacBio Sequel II.
Assembly of GsRTDs from RNA-seq and Iso-seq reads
We used RNA-seq short-read and Iso-seq long-read data derived from the five tissues to construct GsRTDs for 20 barley genotypes. Short-read RTDs and long-read RTDs were separately assembled using different approaches and then merged to produce comprehensive GsRTDs.
RNA-seq reads were processed using Fastp (v0.20.1)46 to remove adapters and filter reads with a quality score of less than 20 and a length of less than 30 bases. STAR (v2.7.8a)47 was applied to map the trimmed reads to the reference genomes8. We applied the two-pass method and used SJs detected in the first mapping pass to improve sensitivity in the second mapping pass. We used Stringtie (v2.1.5)48 and Scallop (v0.10.5)49 to construct the transcript assemblies. We then used RTDmaker (https://github.com/anonconda/RTDmaker) to merge the assemblies across samples and filter out low-quality transcripts, including those with noncanonical SJs, SJs supported by fewer than ten reads in fewer than two samples, fragmentary transcripts whose lengths are less than 70% of the gene length, low-abundance transcripts (less than one TPM in fewer than two samples) and mono-exonic antisense fragment transcripts located on the opposite strand of the gene with a length less than 50% of the gene length.
The raw Iso-seq long-read data were processed into full-length nonconcatemer (FLNC) reads using the IsoSeqv3 pipeline v3.4.0 (https://github.com/PacificBiosciences/IsoSeq). We mapped the FLNC reads to the reference genome using Minimap2 (v2.24) with maximum intron size ‘-G 15,000’ (ref. 50) and collapsed the mapped reads using TAMA (24 December 2022)51 to generate transcript models. As described previously52, we only retained transcripts with high-confidence (HC) SJs and transcript starts and ends. We then conducted a redundancy evaluation to merge transcripts with small variances of 50 bp at the 5′ and 3′ UTR regions using TAMA merge (-a 50 and –z 50).
Finally, we integrated the short-read assemblies into the long-read assemblies by only including transcripts containing new SJs or contributing to new gene loci, resulting in GsRTDs of all 20 genotypes.
Construction of pan-transcriptome
We used PSVCP (v1.0.1)17 with default parameters to construct a linear pan-genome using Morex V3 as the reference. New sequences of the remaining 19 genomes were integrated iteratively (Supplementary Data 2). To construct the barley pan-transcriptome (PanBaRT20), the transcripts from 20 GsRTDs were mapped to the PSVCP pan-genome using Minimap2 (v2.24; ref. 50) with the maximum intron size restricted to 15,000 (ref. 50). The unmapped transcripts and transcripts with secondary and supplementary alignments were eliminated. The mapped transcripts were grouped into representative transcript models using the following approach (Extended Data Fig. 1): (1) multiple-exon transcripts with identical intron combinations were merged, and the furthest position of transcript start site/transcript end site of that gene was considered as the start and end of the representative transcript. (2) Overlapped mono-exon transcripts were merged into a single representative transcript by taking their union set. (3) A set of overlapped transcripts located entirely within the intronic regions of other transcripts was treated as a separate gene model. A gene and transcript ID look-up table between PanBaRT20 and GsRTDs was generated based on the transcript grouping processes.
The core genes of PanBaRT20 were defined as those genes represented in all 20 genotypes, the shell genes as those represented in 2–19 genotypes and the cloud genes denoted genes that were unique to a single genotype. These categories were further distinguished based on whether a PanBaRT20 gene matched single or multiple genomic loci within a single genotype. We used Transuite (v0.2.2)53 to identify the longest ORFs and derived protein sequences. We used InterProScan (v5.59-91.0)54 to gain insight into the transcript functions from multiple protein databases.
Transcript quantification and expression analysis
We conducted transcript quantifications using Salmon33 on 327 RNA-seq samples using PanBaRT20, 20 GsRTDs and BaRTv2 (ref. 16; please note this included RNA-seq data from two genotypes (30 samples) not reported here). We used the 3D RNA-seq (v2.0.1)55 pipeline to perform differential expression analysis to identify the genes with significant expression changes among five tissues across 20 genotypes. In each tissue, Morex samples were treated as the reference, and all the remaining genotypes were compared to Morex. F test was used to generate a P value across 19 contrast groups. To be considered significant, a gene must have a Benjamini–Hochberg (BH)-adjusted P value < 0.01 and an absolute log2(FC) ≥1 in at least one of the contrast groups.
Evaluation of expression quantification using PanBaRT20, GsRTDs and GsRTDMorex
We simulated RNA-seq data (150 bp paired-end reads) to assess the performance and accuracy of expression quantification using different transcriptomes. Ground truth was set during the simulation using Polyester (v1.29.1)56 as the transcript read counts obtained through GsRTD quantifications of five genotypes (Akashinriki, Barke, Golden Promise, Morex and OUN333). Quantification was carried out based on the simulated reads using PanBaRT20 and the corresponding GsRTDs and GsRTDMorex. Error assessment was based on relative percentage difference \({\mathrm{RPD}}=2\times \frac{\left|x-{x}_{0}\right|}{\left|x+{x}_{0}\right|}\) to measure the errors between quantification x and ground truth x0. We used Pearson correlation to evaluate the consistency of expression changes across samples between quantification of the RTDs and ground truth.
Identification of tandem duplicated genes
In total, 79,580 PanBaRT20 gene sequences were clustered using CD-HIT (v4.8.1) with the following parameter settings: -c 0.95 -r 0 -g 1 -s 0.5. Initial clusters with gene number ≥2 were further filtered as follows: (1) removing clusters with overlapping genes, (2) removing clusters spanning more than 5 Mbp and (3) excluding clusters if any gene copy is from another chromosome. All filtering was done in R.
Association between CNV and gene expression
The total PanBaRT20 gene number for all 20 accessions was collected from the GsRTD and PanBaRT20 match table (Supplementary Data 15). Clusters with a median expression below ten TPM were excluded, and for the remaining clusters, the Pearson correlation coefficient was calculated on a per-tissue basis. The CDS for CBF2 (chr5H52041) and CBF4 (chr5H52037) was used to confirm the CNV in the genomes and GsRTDs with BLAST 2.15.0 (ref. 57), setting minimum percent identity and coverage to 95% (Supplementary Data 6).
Inversion analysis
In total, 20 different inversions over 10 Mb in size in relation to the cultivar Morex8 were aligned on a per-chromosome basis using minimap2 (-2 -I 6G -K 5G -f 0.005 -x asm5 -c –eqx) followed by variant identification using SyRI (v1.6.3)58 and plotting using plotsr (v0.5.4)59 with default parameters. Differential gene expression analyses using RNA-seq reads from PRJEB49069 (ref. 21) were as described above. PCR primers as described8 were used to confirm genotypes with the 141 Mb 7H inversion (Supplementary Data 7). Contrast groups were set up that contained either genotypes with the inversion or without it. Genes with an adjusted BH P value of below 10−10 and a log FC above 0.5 were considered significant.
TFBS analysis
Using the barley pan-genome8, we first selected sets of HC genes consisting of CDS of high quality (Automatic assignment of Human Readable Descriptions score = 3; see https://github.com/groupschoof/AHRD) without potential association to transposable elements (TE; TE-related = 0) and further filtered for the absence of any ‘N’ bases by scanning each CDS. This selection resulted in 22,000–23,000 filtered CDS for each accession.
We defined a set of genes in Morex V2 that had a bona fide counterpart in Morex V3 by BLAST Best Reciprocal Hits, setting the minimum percent identity and coverage to 95% and the expected value to 10−40 using the high-quality 22,541 Morex_V2 proteins as the query and 70,355 Morex_V3 proteins (including isoforms) as the database. Close correspondence between Morex_V2 and V3 was found for 15,001 genes, and whenever necessary, these were used to trace the counterparts of Morex V3 genes in the pan-genome V1 genotypes.
For each of the 15,001 V2_V3 corresponding genes, we conducted a BLAST Best Reciprocal Hits analysis of Morex versus each genotype filtering for CDS identity ≥99% and no alignment gaps (100% coverage for query and participants). This defined 19 sets of close orthologous pairs ranging from 12,071 to 10,330 pairs. Then, for each pairwise comparison, 2 kb upstream sequences from CDS start (based on coordinates as available in general feature format (gffs)) were retrieved and split into four 500 bp bins. Furthermore, a 500 bp downstream region was also captured and compared to prove substantial identity in the sequences of the selected genes.
Thirty TFBS consensus sequences (cis-elements, in direct orientation) have been selected based on recent data29 and further literature30,31 to account for the major plant TF classes of the most well-known cis-elements. The following TFBS consensus sequences were considered: (1) AP2-DREB-RCCGAC, (2) AP2-ERF—CGCCGCC, (3) AP2-TOE2—MTCGTA, (4) B3-AuxRE2—TGTCGG, (5) B3-AuxRE—TGTCTC, (6) B3-LEC-AB3—CATGCA, (7) B3-RAV—CACCTG, (8) bZIP-A-box—TACGTA, (9) bZIP-C-box—GACGTC, (10) bZIP-G-box—CACGTG, (11) bZIP-GCN4—TGASTCA, (12) C2C2-ZF-DOF—AAAGY, (13) C2C2-ZF-GATA—WGATAR, (14) C2C2-ZF-YABBY—WATNATW, (15) C2H2-ZF—CAGCT, (16) GARE—TAACARA, (17) HD-WOX13—CAATCA, (18) HD-ZIP—AATNATT, (19) LOB—TCCGGA, (20) MRECHS—AACCTA, (21) MYB-GARP-B-ARR—AGATWCG, (22) MYB-R2R3-MBSI—CNGTTR, (23) MYB-R2R3-MBSIIG—GKTWGGTR, (24) MYB-R2R3-MBSII—GKTWGTTR, (25) MYB-related—AGATAT, (26) SPL—CGTAC, (27) STY1—CCTAGG, (28) TCP—GGNCCC, (29) trihelix—GRWAAW and (30) WRKY—TTGACY. This list corresponds to the positions from 1 to 30 reported on the x axis in Extended Data Fig. 4a.
Degenerate International Union of Pure and Applied Chemistry bases were set according to the appropriate position weight matrix to account for ambiguities. By scanning for occurrences of the 30 TFBSs in the four 500 bp bins, we obtained (via SeqPattern (v1.32) and OmicsBox (v3.1.2)) overall statistics by comparing the location and occurrences in all pairwise comparisons among Morex and the other 19 genotypes. These data were used to evaluate the expression coherence for ortholog pairs defined as the percent of genes whose TPMs are in the range of plus or minus 30% relative to their Morex orthologs. ‘Expression coherence’ was computed on these orthologous gene pairs.
The entire TPM dataset was used to compute the CV, resulting in minimum, median, mean and maximum values of 0.16, 0.71, 0.99 and 15.07, respectively, splitting the genes into five classes based on CV ranges (0–0.4, 0.4–0.58, 0.58–0.88 and 0.88–20 with the first and last values set to 0 and 20, respectively) and close to 3,000 members in each class.
Network analysis
Of the 13,680 single-copy genes, 13,652 genes passed a filter for lowly expressed genes (TPM >0.5 in minimum of two biological replicates) in all accessions. From the original 297 samples, PCA revealed an outlier (ZDM01467_In2) that was removed from further analysis. R package DESeq2 (v1.34.0)60 was used to perform variance-stabilizing transformation for normalization, applying a design ‘~Tissue’ with the argument ‘blind = FALSE’. Co-expression networks were built for all accessions’ single-copy orthologs separately using the WGCNA package (v1.69). The soft power threshold used for scale-free topology was determined automatically using ‘pickSoftThreshold’. Unsigned networks were calculated with Pearson correlation to obtain the adjacency matrix that accounted for both positive and negative correlations among gene pairs by applying methods dynamicTreeCut and TOMtype with a minimum module size = 30, ‘unsigned’ network type and otherwise default settings. This resulted in 20 networks, with different module numbers for each accession. This approach was followed by merging closely clustered modules by a cut height of 0.1, which resulted in the final set of modules for each accession. A total of 738 modules across all accessions were identified. In total, 91 genes with insufficient variation have been filtered out from each of the genotype-specific networks using the ‘goodSamplesGenes’ function, leaving 13,561 single-copy genes for the following analysis.
Module eigengenes were calculated using the ‘moduleEigengenes’. Pearson correlation with Fisher’s exact significance test was calculated among a binarized tissue table and the eigengenes (Fig. 3c and Extended Data Fig. 5). Mercator4 (v5.0)61 functional annotation was performed, and enrichment with over-representation analysis (ORA) of sets of module-genes was done using the R package clusterProfiler (v4.6)62 with BH false discovery rate (FDR) correction P value cutoff 0.05 (Fig. 3c and Extended Data Figs. 5 and 6). ORA results are shown only for those modules where module size enabled the enrichment analysis (Extended Data Figs. 5 and 6).
To compare the shared number of orthologs across accession-specific network modules, cosine similarity was calculated for each module pair (Extended Data Fig. 6) using Python (v3.10.12). Across all accessions, a meta-network was established by computing edges using the cosine similarity values and nodes as modules using NetworkX (v3.1)63 and clustered further into six distinct communities via Louvain community detection64. The final network layout was calculated using the Netgraph (v4.12.11)65 package with the edge bundle option (Fig. 3a). Ortholog detection among Oryza sativa66, Arabidopsis thaliana67 and selected barley accessions Morex, Akashinriki and B1K representative protein sequences was performed using OrthoFinder (v2.5.5)68 with default parameters.
Creation and analysis of barley ‘MorexGeneAtlas’
All public cv. Morex datasets with ≥3 replicates were retrieved from the European Nucleotide Archive (ENA). Discovery of datasets using the Representational State Transfer Application Programming Interface is available at https://www.ebi.ac.uk/ena/portal/api/search. Samples came from the following study accessions: PRJEB29972, PRJEB52944, PRJNA315041, PRJNA326683, PRJNA382490, PRJNA639318, PRJNA661163, PRJEB14349, PRJNA910827, PRJNA975859 and PRJNA589222 (Supplementary Data 11). We also included the 15 samples from the Morex pan-transcriptome dataset listed under the study accession PRJEB64639. Raw RNA-seq reads were trimmed using fastp (v0.20.0)46 and mapped to the Barley cv. Morex V3 (2021) genome69 using STAR (v2.7.8a)47. Transcripts were assembled using Stringtie (v2.1.5)48 and Scallop (v0.10.5)49. The resulting annotation was filtered to remove redundant and fragmented transcripts, transcripts with poorly supported SJs and low-expressed transcripts. The resulting assemblies were merged and filtered using the RTDmaker (v0.1.5; https://github.com/anonconda/RTDmaker), forming Mx-RTD. Quantification of expression against both Mx-RTD and PanBaRT20 was carried out using Salmon (v0.14.1) in quasi-mapping mode33.
The Morex Gene Atlas and database were prepared using the LAMP configuration (Linux, Apache, mySQL and Perl) and created as described previously34. Significant differential gene expression and differential alternative splicing analysis were performed using the 3D RNA-seq App (https://3drnaseq.hutton.ac.uk/app_direct/3DRNAseq/)55. All scripts used in the construction and analysis of the Morex Atlas have been made available at https://github.com/cropgeeks/barleyPantranscriptome.
GA metabolic pathway analysis
Gene sequences of the GA metabolic pathway genes (based on IBSC2017 cv. Morex assembly) as reported previously36 were used to BLAST against the genome assemblies of the 20 barley accessions used in the updated pan-transcriptome study9. Identified gene sequences were BLASTed against PanBaRT20, and individual gene expression values were extracted.
FIND-IT screening
Barley FIND-IT variants ga2ox3w106* (ID CB-FINDit-Hv-034) and ga2ox7w107* (ID CB-FINDit-Hv-035) were identified and isolated in the RGT Planet mutant library as described70.
Barley field trials and yield performance
FIND-IT variants and controls were grown in field trial plots for agronomic evaluation (7.5 m2 plots) in Denmark. Grain from field plots was harvested and threshed using a Wintersteiger Elite Plot Combine (2019 and 2020) and a Wintersteiger Classic Plus Combine (2021 and 2022; Wintersteiger AG), and grains were sorted by size (threshold, 2.5 mm) using a Pfeuffer SLN3 sample cleaner (Pfeuffer GmbH).
Agronomic performance evaluation
Mature dry grains from field-grown plants were analyzed to determine TGW using a MARVIN digital seed analyzer (GTA Sensorik GmbH), and grain quality (starch, protein and water content) was measured using near infrared technology (Foss Infratec 1241 analyzer).
Weather data
Weather data for the field trial region of the respective years were extracted from the weather archive of the Danish Meteorological Institute (https://www.dmi.dk/vejrarkiv/).
Micromalting and hydrolytic enzyme activity analysis
Nondormant barley samples of RGT Planet and the FIND-IT variants ga2ox3w106* and ga2ox7w107* were placed in individual containers, each holding 100 g of seeds, and submitted into 16 °C fresh water to reach 33% moisture content on day 1 and 43% moisture content on day 2. The actual water uptake of individual samples was determined as the weight difference between initial water content, measured with the Foss 1241 NIT instrument (Foss A/S), and the sample weight after surface water removal. Following the last step, the barley samples were maintained at a degree of steeping of 43% for 4 days. After each 24 h, samples were checked for moisture content and sprayed with additional water to overcome possible respiration loss. After the germination process, the barley samples were kiln dried in a Curio kiln using a two-step ramping profile. The first ramping step started at a set point of 27 °C and a linear ramping at 2 °C h−1 to the breakpoint at 55 °C using 100% fresh air. The second linear ramping was at 4 °C h−1 reaching a maximum at 85 °C. This temperature was kept constant for 90 min using 50% fresh air. The kiln dried samples were deculmed using a manual root removal system from Wissenschaftliche Station für Brauerei. Before enzyme activity analysis, the germinated grain samples were milled using a standard IKA Tube Mill 100. All measurements of enzyme activity in germinated barley grains were made within 48 h after milling of the sample. The enzymatic activity of α-amylase, β-amylase and free limit dextrinase was measured using the Megazyme ‘Ceralpha’, ‘Betamyl-3’ and ‘PullG6’ methods, respectively, that were modified for the Gallery Plus Enzyme Master (Thermo Fisher Scientific).
Gene and transcript naming
Multiple historical genomic and transcriptomic experiments have led to a range of barley gene name annotations for the same gene. MorexGeneAtlas provides a barley gene name ‘look-up’ function (https://ics.hutton.ac.uk/morexgeneatlas/relator.html). Submitting a legacy barley gene identification name returns a table of matching transcript identification names from MLOCs, JLOCs, HORVUv1, HORVUv3, BARTv1 and BARTv2 datasets. It also provides transcript identification names from the GsRTDMorex, HvMxRTD and PanBaRT20 datasets from the current paper.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.