Transcriptome profiling revealed multiple genes and ECM-receptor interaction pathways that may be associated with breast cancer

Background Exploration of the genes with abnormal expression during the development of breast cancer is essential to provide a deeper understanding of the mechanisms involved. Transcriptome sequencing and bioinformatics analysis of invasive ductal carcinoma and paracancerous tissues from the same patient were performed to identify the key genes and signaling pathways related to breast cancer development. Methods Samples of breast tumor tissue and paracancerous breast tissue were obtained from 6 patients. Sequencing used the Illumina HiSeq platform. All. Only perfectly matched clean reads were mapped to the reference genome database, further analyzed and annotated based on the reference genome information. Differentially expressed genes (DEGs) were identified using the DESeq R package (1.10.1) and DEGSeq R package (1.12.0). Using KOBAS software to execute the KEGG bioinformatics analyses, enriched signaling pathways of DEGs involved in the occurrence of breast cancer were determined. Subsequently, quantitative real time PCR was used to verify the accuracy of the expression profile of key DEGs from the RNA-seq result and to explore the expression patterns of novel cancer-related genes on 8 different clinical individuals. Results The transcriptomic sequencing results showed 937 DEGs, including 487 upregulated and 450 downregulated genes in the breast cancer specimens. Further quantitative gene expression analysis was performed and captured 252 DEGs (201 downregulated and 51 upregulated) that showed the same differential expression pattern in all libraries. Finally, 6 upregulated DEGs (CST2, DRP2, CLEC5A, SCD, KIAA1211, DTL) and 6 downregulated DEGs (STAC2, BTNL9, CA4, CD300LG, GPIHBP1 and PIGR), were confirmed in a quantitative real time PCR comparison of breast cancer and paracancerous breast tissues from 8 clinical specimens. KEGG analysis revealed various pathway changes, including 20 upregulated and 21 downregulated gene enrichment pathways. The extracellular matrix–receptor (ECM-receptor) interaction pathway was the most enriched pathway: all genes in this pathway were DEGs, including the THBS family, collagen and fibronectin. These DEGs and the ECM-receptor interaction pathway may perform important roles in breast cancer. Conclusion Several potential breast cancer-related genes and pathways were captured, including 7 novel upregulated genes and 76 novel downregulated genes that were not found in other studies. These genes are related to cell proliferation, movement and adhesion. They may be important for research into breast cancer mechanisms, particularly CST2 and CA4. A key signaling pathway, the ECM-receptor interaction signal pathway, was also identified as possibly involved in the development of breast cancer. Electronic supplementary material The online version of this article (10.1186/s11658-019-0162-0) contains supplementary material, which is available to authorized users.

paracancerous breast tissues). Changed genes in any cancer, including breast cancer, reflect the biological diversity of cellular phenotype and physiological function and could become important research areas for elucidating molecular mechanisms. Already, many studies have found genes or proteins strongly associated with the progress and prognosis of breast cancer, including enhancer of zeste 2 (EZH2) polycomb repressive complex 2 subunit [20] and Jab1/COPS5 [21]. In addition, the nuclear receptor-interacting protein 1 (NRIP1) and the MAPK signaling pathway can regulate the development of breast cancer cells [22].

Study patients, tissue sample preparation and collection
Histopathological breast cancer (invasive ductal carcinoma, tumor tissue) and adjacent normal tissue (paracancerous tissues, normal tissue) samples were obtained from 14 patients with pathologically confirmed breast cancer. Six of the cases were randomly selected for transcriptome sequencing, while the other eight were selected for expression pattern studies of novel breast cancer-related genes. The samples were taken in 2016 and 2017 at the Department of Pathology of the Affiliated Hospital of Inner Mongolia Medical University in Hohhot, China. This study was approved by the Ethics Committee of Inner Mongolia Medical University. After surgery, each specimen was cut into 3-8 mm 2 pieces. The cut tissue was immediately placed in an RNA protectant (RNAlater, Sigma Aldrich). After being infiltrated by RNAlater for 12 h at 4°C, all samples were then quickly placed in liquid nitrogen and stored at − 80°C until needed for further processing and sequencing.

Isolation of total RNA from samples
Total RNA extraction was performed with TRIzol (Takara) following the manufacturer's protocol, and isolated total RNA was stored at − 80°C until the next step. RNA degradation and contamination was monitored using 1% agarose gel electrophoresis. A Nano-Photometer spectrophotometer (Implen) and Qubit RNA Assay Kit with a Qubit 2.0 Fluorometer (Life Technologies) were respectively used to detect the purity and concentration of total RNA. An RNA Nano 6000 Assay Kit and a Bioanalyzer 2100 system (Agilent Technologies) were used to assess the integrity of total RNA.

Library preparation for transcriptome sequencing
At least 3 μg of total RNA per sample was used. Following the manufacturer's instructions, the NEBNext Ultra RNA Library Prep Kit (Illumina) was used to generate the 6 pairs of sequencing libraries (6 for normal tissues and 6 for tumor tissues). Random hexamer primer, M-MuLV Reverse Transcriptase (RNase H-) and DNA Polymerase I followed by RNase H were respectively used to synthesize first-and double-stranded cDNA. Any remaining overhangs were converted into blunt ends with exonuclease and polymerase. The AMPure XP system (Beckman Coulter) was used to select cDNA fragments, preferentially those that were~150-200 bp in length. Three microliters of USER Enzyme was used with size-selected, adaptor-ligated cDNA at 37°C for 15 min followed by 5 min at 95°C, and then PCR was performed. Two pairs of cDNA libraries were constructed: one from the cDNA libraries from 6 normal tissue samples (named N1 to N6) and the other from the cDNA libraries from 6 tumor tissue samples (named T1 to T6). The Illumina HiSeq platform (Beijing Novo Gene Biological) was used for transcriptome sequencing according to the manufacturer's instructions.

Bioinformatics analysis
Raw (sequenced) reads were obtained first. After raw read filtering, sequencing error rate check (Q20 and Q30) and GC content profiling, the high-quality clean paired-end reads from each sample were aligned to the reference genome using TopHat version 2.0.9. The mapped genes from the reference genome were queried in databases such as UniProtKB/ SwissProt and the Non-Redundant Protein Database (NRPD) with the help of BLASTX program (E-value cut-off of 1e − 5 ). The mapped read numbers in the exon and intron regions (exonic and intronic rates) and total mapping rate were analyzed independently using HTSeq version 0.6.1. The total number of mapped reads was determined and the RPKMs (reads per kilobase of exon model per million mapped reads) were calculated for each annotated gene. The software R package of DESeq was employed to capture the DEGs (differentially expressed genes) between same pair transcriptome data from the same individual and calculate the fold changes for each gene. Genes with fold changes > 2, q values < 0.01 and FDRs < 0.01 were defined as DEGs and captured for further analysis. All DEGs were enriched to the KEGG signal pathway based on a q value < 0.01 and FDR < 0.01.
The results for DEGs obtained in this study were compared to corresponding breast cancer research transcriptome information from the GEO database (especially the latest research GSE33447 and GSE109169).

Validation and clinical experiments with quantitative real time PCR
The validation experiment was performed with the same 6 pairs of breast cancer tissue and adjacent normal tissue used for the transcriptome sequencing. The following 12 genes were selected to perform the quantitative real time PCR: pituitary tumortransforming 1 (PTTG1), TTK protein kinase (TTK), COL10A1, CYCS, eukaryotic translation elongation factor 1 alpha 2 (EEF1A), BUB1B, CCNB1, CDC20, karyopherin alpha 2 (RAG cohort 1 importin alpha 1; KPNA2), tetraspanin 1 (TSPAN1), tetraspanin 13 (TSPAN13) and tetraspanin 15 (TSPAN15). The group includes cancer-related genes identified in previous research. Primers were designed and are listed in Additional file 1: Table S1. 18S ribosomal RNA expression was used as an internal reference. The reaction system consisted of 2 × Super Real PreMix Plus, forward primer (10 μM), reverse primer (10 μM), cDNA and 50 × ROX Reference Dye, and the volumes of RNase-Free ddH 2 O used were 0.4, 0.6, 1, 7.4 and 10 μl. The PCR amplifications were performed in triplicate wells with initial denaturation at 95°C for 30 s, followed by 40 cycles of 95°C for 5 s and 60°C for 34 s.
The clinical experiment was prepared with the total RNA from the other different 8 pairs of breast cancer tissue and adjacent normal tissue. The first-strand cDNA was synthesized using a PrimeScript RT reagent Kit with gDNA Eraser (Perfect real time PCR). The expression levels of upregulated and downregulated genes that were selected as novel breast cancer-related genes were verified via quantitative real time PCR. The primers, PCR system and amplification conditions were the same as in the validation experiment. The data were analyzed using ABI 7500 HT SDS software 4.1 (Applied Biosystems). DEG expression levels were analyzed using the 2 -ΔΔCT analysis method.

Results
The sequencing and transcriptome data The relevant parameters, including raw reads, clean reads and total mapped rate of breast cancer tissue and normal breast tissue are summarized in Table 1. Based on filtered sequenced reads, we obtained 164,352,319 clean reads in normal tissues and 166,067,405 in tumor tissues. The average Q20, Q30, exonic, intronic and total mapping rates for tumor samples were 96.18, 90.9, 80.37, 15.8 and 92.88%, respectively. All raw reads were submitted to the NCBI SRA database under the accession number PRJNA528582.
In total, 39,649 different genes were annotated from the whole transcriptome. Within these sequences, 4685 lncRNAs, 923 miRNAs, 926 misc. RNAs, 170 rRNAs and 18,800 protein-coding genes were annotated based on the various reference databases. In total, 18, 013 genes were known protein-coding genes and 787 sequences were novel genes that were not mentioned in any database. These unknown protein-coding genes may be novel genes.

Search for known cancer-related genes in breast cancer tissue
In total, 93 different previously reported cancer-related genes were annotated in this study (Additional file 2: Table S2). This included 7 breast cancer-related genes ( Table 2): caspase 8 (CASP8), cadherin 1 type 1 (CDH1), estrogen receptor 1 (ESR1), ETS variant 6 (ETV6), forkhead box A1 (FOXA1), GATA-binding protein 3 (GATA3) and neurotrophic tyrosine kinase receptor type 3 (NTRK3). The expression levels of GATA3 and ESR1, which are both the breast tumor-related genes, showed upregulation in tumor tissues compared with normal tissues. The GATA3 expression level was 15,000 in tumor tissues and 5000 in normal tissues. The expression level was ESR1 4700 in tumor tissues and 1500 in normal tissues.
Of the 93 cancer-related genes, 58 were downregulated in the tumor tissue transcriptome. The WNT inhibitory factor 1 (WIF1) gene, which is related to the tumor type pleomorphic salivary gland adenoma, was the gene with the greatest downregulation (16-fold change), while a member of the ETS family of transcription factors (ETV6), which is related to non-small cell lung cancer, had the joint smallest downregulation Validate the accuracy of the transcriptome expression results using quantitative real time

Identification and analysis of differentially expressed genes (DEGs)
In the tissue collections and sequencing program, the breast tumor and paracancerous tissue samples from 6 patients were treated independently. These 6 different sample pairings were sequenced, mapped, analyzed and annotated. The DESeq R package (1.10.1) and DEGSeq R package (1.12.0) were used to identify the DEGs in the different libraries from the same individual patient. Pairwise comparisons for DEG analysis were performed between the tumor tissues and paracancerous tissues in the six individual groups. Interestingly, due to individual differences, the 6 groups' transcriptomes showed little difference in the number of DEGs (Table 3). In total, 937 DEGs (487 upregulated genes and 450 downregulated genes) were found to be differentially expressed in at least one tumor tissue compared with the paracancerous tissues within the same individual (Table 3). Further analysis shown that only 26.9% of the identified genes (252 of 937 DEGs) have a similar expression pattern among all individual groups, which indicated that the effect of individual differences must be taken into account when we define a universal cancer-related gene for breast tumors. Meanwhile, these 252 DEGs, including 51 upregulated genes and 201 downregulated genes (Fig. 2), showed the same up-or   Of all the DEGs, 51 were upregulated in breast cancer tissues (Additional file 3: Table S3). Ffibronectin-1 (FN1) showed the highest expression level in the tumor tissue transcriptome: 71,967, which was 10-fold higher than that in the paracancerous tissues transcriptome. The vacuole membrane protein 1 (VMP1) gene exhibited the second highest expression level, followed by collagen triple helix repeat containing-1 (CTHRC1), inhibin beta A (INHBA), and matrix metallopeptidase 11 (MMP11). The relative expression levels of these DEGs were higher than 4000 in tumor tissues and less than 2000 in paracancerous tissues (Table 4) . Of these 51 genes, 44 could be cancer-related based on the reference and previous research. Twenty genes were not mentioned in any study about breast cancer.

KEGG pathway enrichment analysis
KEGG is a database for the molecular or system biology study of gene clusters. These genes perform their functions at different levels (e.g., cell and organism levels). KOBAS software was used to test the statistical enrichment of DEGs in the KEGG pathways.  digestion and absorption pathway (18 DEGs), focal adhesion pathway (36 DEGs) and neuroactive ligand-receptor interaction pathway (43 DEGs). The PPAR signaling pathway was annotated as a downregulated DEG enrichment pathway in all different 6 transcriptomes, and the 18 DEGs, including fatty acid binding protein 7 brain (FABP7), solute carrier family 27 (fatty acid transporter) member 6 (SLC27A6), solute carrier family 27 (fatty acid transporter) member 1 (SLC27A1) and collagen domain-containing (ADIPOQ), showed downregulation by 1.5-fold to 6.7-fold in all sequencing groups (Fig. 5).

Search for potential cancer-related genes in DEGs from breast cancer tissue
Only genes that showed the same expression pattern in all 6 transcriptome pairs were taken into consideration. Of these 51 genes, CST2 showed the biggest expression differences between tumor tissues and paracancerous tissues (350-fold upregulation). Only~1 mean relative expression level was detected in the normal tissues. Functional analysis revealed that this gene is a protein-coding gene, 748 bp in length, and located on chromosome 20. The other genes with high fold changes, dystrophin related protein 2 (DRP2) and COL10A1, were also annotated. COL10A1 showed a relative expression level of 3937 in breast tumor tissues and only 21 in paracancerous breast tissues. Of the 201 downregulated genes, DLK1 exhibited a 128-fold downregulation in breast tumor tissues. However, the RPKM values of this gene were not very high in the  tumor tissues). These genes could be potential low expression level breast cancer-related genes.

The clinical experimental with quantitative real time PCR
To determine the clinical effects, we screened 6 high expression level and 6 low expression level genes to determine expression patterns in breast cancer tissues and adjacent tissues from 8 different patients. All quantitative real time PCR primers were designed based on the gene sequences reported in the NCBI database (Additional file 1: Table S1 primers). The results showed that the upregulated CST2, GJB2, UBE2T, NUF2, ORC6 and CCNB1 (Fig. 6), and the downregulated ELF5, cysteine-rich domain 2 (STAC2), BTNL9, CA4, CD300LG and PIGR (Fig. 7) showed the same result in different patients. This also verified the RNA-seq results. These 12 genes could be new cancer-related genes for the clinical treatment of breast cancer.

Discussion
Using next-generation sequencing technology and quantitative real time PCR, we successfully analyzed the DEGs in breast cancer tissues from patients from Inner Mongolia in China. Since the molecular changes in breast cancer tissues are dependent on tumor type, grade, size and receptor status [23][24][25], we limited our study to invasive cases. Transcriptome sequencing techniques play an important role in the identification of cancer-specific genes [3,5,6,19]. We sequenced the transcriptome from 6 pairs of Because the gene expression patterns or the transcriptomes of cancer patients are greatly impacted by multiple factors, including living environment [26] and the severity of the disease [27], there can be considerable inter-patient variation. The DEG results in this study support the fluctuations in the number of DEGs between the breast cancer tissue and paracancerous tissue in different individuals. They also confirm that the expression levels of DEGs display significant differences between breast cancer patients.
At the same time, the statistical results of all DEGs in our study showed that each patient expressed unique DEGs (937 DEGs in total and 253 DEGs in common). The expression patterns of many DEGs in the transcriptome were not stable, which may due to the development of the disease or the genetic background of the individual [7]. This is a hindrance for researchers seeking universal cancer-related genes for breast cancer. Therefore, individual differences must be taken into account when conducting subsequent studies.
The expressions of three tetraspanin family members, TSPAN1, TSPAN13 and TSPAN15, are upregulated.They all function as transmembrane transport proteins, and TSPAN15 is also associated with the notch signaling pathway [28,29]. TSPAN1 has been reported to regulate the progression of many human cancers, including gastric cancer, pancreatic cancer and cervical cancer [30][31][32]. Meanwhile, the expression of TSPAN1 was higher in ER-positive and HER2-positive breast cancer [33]. All samples in this study were collected from ER-positive patients. While the expression of TSPAN13 in prostate cancer and glioma is known to be elevated [34,35], there is only one study on TSPAN13 in breast cancer [36]. It mentioned that TSPAN13 was upregulated in breast cancer cells. There are few studies on TSPAN15, and its effect on cancer was reported less frequently.
In our results, the expression levels of TSPAN1, TSPAN13, and TSPAN15 in breast cancer were all increased. Our TSPAN1 results are consistent with those previously reported [33], so we speculate that TSPAN13 and TSPAN15 may be potential cancerrelated genes for breast cancer. This needs further study.
Previous reports [25] have shown that in invasive breast cancer, upregulated genes are related to cell proliferation and cell movement, while downregulated genes are associated with cell adhesion. Our study showed that ASPM, INHBA, NUF2, ORC6, UBE2T and PKMYT1 are associated with cell proliferation [46][47][48][49][50][51][52], and the expressions of these genes were also elevated in our breast cancer tissue transcriptome. The immune function-related genes CD300LG and PIGR were also detected as downregulated in breast cancer [53,54].
(See figure on previous page.) Fig. 6 The relative expressions of CST2, GJB2, NUF2T, NUF2, ORC6 and CCNB1 in normal tissues and tumor tissues assessed via quantitative real time PCR. Fold changes are expressed as the ratio of gene expression in tumor tissue to that in normal tissue, normalized to 18S rRNA. The gene expression in normal tissue is normalized to 1. *p < 0.05, **p < 0.01 In this study, 7 upregulated and 76 downregulated DEGs were captured and may be the important genes for breast cancer research. Of the 6 upregulated genes, CST2, which belongs to the cystatin superfamily and is an active cysteine protease inhibitor [55], showed a 350-fold change compared to its expression in normal tissue. The protein of this gene is found in a variety of human fluids and secretions [55], which could provide a new detection method for breast cancer. Till now, few studies have focused on CST2 in any tumor type, except to show that is responsive to the anti-growth activity of triptolide in ovarian cancer cells [56]. More study should be performed to confirm the function and character of CST2 in the development of breast cancer.
The other high expression level gene in breast tumors was DRP2, which is associated with paranoid-type schizophrenia [57]. Some study suggests a relationship between DRP2 and lung cancer [58] and brain cancer [59]. The function of this gene in breast cancer is still unknown.
Just like the CST2, GJB2, UBE2T, NUF2 and ORC6 also showed the same high expression level in breast tumors. GJB2 is implicated in the mechanisms of invasion of ductal breast carcinoma [60] and it is a prognosis marker in pancreatic cancer [61]. The downregulation of UBE2T could inhibit the progression of gastric cancer [62] and performed the same function in prostate cancer [63]. Previous study indicated that the downregulation of NUF2 could inhibit the growth of pancreatic cancer [64]. Few studies have focused on the gene function of ORC6 in breast cancer, but single-nucleotide polymorphisms (SNPs) were detected in this breast cancer-related gene [65].
We found more genes with low expression levels in tumors: 63 with at least a 10-fold change, including BTNL9, CA4, GPIHBP1 and PIGR. In total, 6 low expression level genes were confirmed using quantitative real time PCR for 6 transcriptome specimens and 8 clinical specimens.
BTNL genes show expression pattern changes in intestinal inflammation and colon cancer [66] and may be important in tumor immunity [67]. The expression and prognostic significance of PIGR, an immunoglobulin receptor, is similar, in epithelial ovarian cancer [68]. CA4, which is involved in cell proliferation, has been shown to inhibit cell proliferation, invasion and metastasis and was downregulated in our results [69]. The glycosylphosphatidylinositol high-density lipoprotein binding protein 1 (GPIHBP1) acts to chaperone secreted LPL and interact in fatty acids and breast cancer [70]. The role of GPIHBP1 has yet to be studied in cancer.
To the best of our knowledge, the function of these genes in breast cancer has not received much coverage. More study should be performed to explore the role of these genes. An expression pattern like the one we found for these genes may imply a high risk of breast cancer.
The KEGG pathway annotation showed that all DEGs were significantly enriched in 20 pathways, including the ECM-receptor interaction pathway and the protein digestion and absorption pathway, suggesting that there are many DEGs and signaling (See figure on previous page.) Fig. 7 The relative expressions of BTNL9, CA4, CD300LG, ELF5, PIGR and STAC2 in normal tissues and tumor tissues assessed via quantitative real time PCR. Fold changes are expressed as the ratio of gene expression in tumor tissue to that in normal tissue, normalized to 18S rRNA. The gene expression in normal tissue is normalized to 1. *p < 0.05, **p < 0.01