Introduction

We designed and built this database to integrate molecular and clinical data on HCC samples from current public databases. It covers tissue samples of HCC, normal liver, liver cirrhosis, liver fibrosis, and fatty liver, as well as various HCC cell lines. It is also the first multicenter liver cancer database. The data were standardized so that researchers can efficiently analyze large datasets on liver diseases and HCC, and can carry out quantitative analyses across studies and across batches. In addition, users can upload their own data and share it selectively. The database greatly simplifies data collection and analysis for liver cancer researchers and lowers the threshold for bioinformatics analysis. It should help advance research on the genetic characteristics of liver cancer in the context of "precision medicine".

Sample space

  • As of the end of June 2018, we had collected the following numbers of samples:

Type                                  Amount
Hepatocellular carcinoma              868
Normal liver tissue                   402
Viral hepatitis                       217
Cirrhosis                             119
Liver fibrosis                        29
Fatty liver                           72
Mixed diseases                        243
Other (can be regarded as control)    4786

  • So far, we have collected various liver disease datasets from the GEO database. These data are mainly cDNA microarray data together with the characteristic and clinical labels of the samples. In the future, we will continue to collect samples with additional types of molecular data and more complete clinical information.

Standardization and Normalization

  • Every dataset we collected from the GEO database contained clinical information, but most of it was not recorded in a standard form, because submitters follow different conventions. The labels differ in letter case, mix full names with abbreviations, and use synonyms for the same concept. To analyze data from different studies together, we standardized the clinical information of the collected samples using the UMLS and MeSH vocabularies.
  • Affymetrix gene chip expression profiles are commonly processed with the RMA algorithm, which analyzes multiple samples together as a batch and therefore introduces batch effects. To make it possible for researchers to analyze data quantitatively across studies and across batches, we used the fRMA algorithm to process the expression data of each downloaded sample individually, and then combined the results. This removed the batch effects while preserving the biological characteristics obtained with the RMA algorithm.
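
The two steps above can be illustrated with a small Python sketch. It is not the pipeline used to build the database. The synonym table and the reference values are made up for illustration, and the function names are hypothetical. The second part shows only one idea behind fRMA, namely normalizing each sample against a fixed ("frozen") reference distribution instead of one estimated from the current batch. The real algorithm also uses frozen probe-level estimates, which are omitted here.

```python
# Hypothetical synonym table. The actual mapping was derived from UMLS and
# MeSH, whose contents are not reproduced here.
CANONICAL_TERMS = {
    "hcc": "Hepatocellular carcinoma",
    "hepatocellular carcinoma": "Hepatocellular carcinoma",
    "liver cell carcinoma": "Hepatocellular carcinoma",
    "cirrhosis": "Liver cirrhosis",
    "liver cirrhosis": "Liver cirrhosis",
}


def standardize_label(raw):
    """Map a free-text clinical label to a canonical term.

    Case and repeated whitespace are ignored. Unknown labels are returned
    stripped but otherwise unchanged.
    """
    key = " ".join(raw.lower().split())
    return CANONICAL_TERMS.get(key, raw.strip())


def quantile_normalize(sample, reference):
    """Replace each value with the reference value of the same rank.

    `reference` must be sorted ascending and as long as `sample`.
    """
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    normalized = [0.0] * len(sample)
    for rank, index in enumerate(order):
        normalized[index] = reference[rank]
    return normalized


def batch_reference(batch):
    """Reference distribution estimated from the batch itself (batch-style)."""
    sorted_samples = [sorted(s) for s in batch]
    return [sum(values) / len(values) for values in zip(*sorted_samples)]


# Made-up frozen reference; in practice it is estimated once from a large
# training set and then reused for every new sample.
FROZEN_REFERENCE = [1.0, 2.0, 3.0, 4.0]

sample = [5.0, 9.0, 7.0, 6.0]
batch_a = [sample, [1.0, 2.0, 3.0, 4.0]]
batch_b = [sample, [10.0, 20.0, 30.0, 40.0]]

# Batch-style: the result for `sample` depends on its batch mates.
in_a = quantile_normalize(sample, batch_reference(batch_a))
in_b = quantile_normalize(sample, batch_reference(batch_b))

# Frozen-style: the result depends only on the sample and the fixed reference.
frozen = quantile_normalize(sample, FROZEN_REFERENCE)

print(standardize_label("  HCC "))  # Hepatocellular carcinoma
print(in_a)    # [3.0, 6.5, 5.0, 4.0]
print(in_b)    # [7.5, 24.5, 18.5, 13.0]
print(frozen)  # [1.0, 4.0, 3.0, 2.0]
```

The same sample receives different values in `batch_a` and `batch_b`, which is the kind of batch dependence described above. With the frozen reference, each sample can be processed on its own and the results from different studies can be combined afterwards.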

Structure

  • The ontology database was built from the standardized clinical data and the normalized molecular data, using the tranSMART system as its framework and R scripts as extensions. The overall structure is shown below. The front-end website supports simple data searches, and it also provides secure access to the tranSMART-based ontology database for more advanced analysis.

Function

  • On the query page of the front-end website, users can perform simple searches.
  • Through the front-end website or by direct access, users can reach the ontology database built on the tranSMART framework. Structurally, the ontology database consists of the following seven functional modules: (i) Dataset Browse, Organize, and Manage module; (ii) Dataset Statistics module; (iii) Data View module; (iv) Data Export module; (v) Data Analyze module; (vi) Data Upload module; and (vii) User and Group Management module. The main functions of the ontology database are presented to general users as a step-by-step walkthrough.
  • For a quick start, see the Quick Start Guide on the Homepage.
