Technical Reports

After some programming work on my thesis, I had to change some aspects of my approach to statistical analysis. More often than not, I had the impression that statistical analysis can be very repetitive for some tasks. With some datasets this proved true: the work was almost 100% repetitive in terms of what we could analyze.

Data analysis, and statistics in particular, almost always starts with a descriptive analysis of the dataset. An inferential analysis is then needed to better understand the relations between variables. Sometimes certain types of variables (numeric, integer…) in the dataset are good starting points for regression models; other times we need to cluster the instances. We may also need to check how variables group together with Factor Analysis.

Given this repetitive approach, I started using the Statsframe software (https://statsframe.com) for my analyses whenever I came across a subject or dataset of interest. It has proven to be a great tool for learning about a subject of interest quickly and efficiently. With it, we get a fast and clear picture of a dataset. Of course, sometimes we need more, but when approaching new data we often struggle (and lose time) just to get a global idea of how the data behaves and of the possible ways to extract knowledge from it.

This blog post starts a new series of publications (Technical Reports) published on ResearchGate (RG, https://www.researchgate.net/).

The subjects of study are divided into three large groups of areas (click the link to follow the projects and corresponding publications on RG):

As a small example of the work developed in several areas, here is something to look at:

Statsframe ULTRA DATASET REPORT (https://statsframe.com)

Paper Titles WordClouds For:

COVID-19 Open Research Dataset Challenge (CORD-19)

An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House

Dataset Description

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

Accessed online (18-03-2020) at https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

Word Cloud Text Analysis:

About the Text Variables:

Text variables are best analyzed as a whole. In this report on the variable “title”, we analyze word frequency and weighted word frequency. These analyses give the reader a better notion of the keywords present across all instances in the data. When such variables come from survey respondents' answers, the analysis can be complemented by grouping the answers. Grouping is therefore advisable to get a better look at this variable's behavior and its influence on other variables in the dataset. If you have doubts or need further analysis, please contact us; to ask for help, feel free to use our ticket system.
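To make the word-frequency step concrete, here is a minimal Python sketch of how such counts could be produced. It is not the Statsframe implementation: the stop-word list, the `tokenize` and `word_frequencies` helpers, and the three sample titles are all assumptions made for illustration. The punctuation stripping is chosen so that a term like "MERS-CoV" becomes "merscov", matching the style of the words shown in the results.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def word_frequencies(texts):
    """Count how often each word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


# Hypothetical titles, not taken from the CORD-19 data.
titles = [
    "A novel coronavirus outbreak in China",
    "Detection of MERS-CoV in respiratory samples",
    "Respiratory virus infection in children",
]

freq = word_frequencies(titles)
for word, count in freq.most_common(3):
    print(word, count)
```

Sorting the counter with `most_common` gives the same kind of ranking as the word.freq and freq columns of the report.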

Results:

Variable | word.freq | freq | word.tfidf | weight.tf.idf
title | virus | 2199 | virus | 560.8542
title | respiratory | 1399 | infection | 423.4355
title | infection | 1199 | respiratory | 410.5416
title | influenza | 956 | influenza | 381.5456
title | human | 892 | viral | 358.1893
title | viral | 851 | human | 344.3145
title | coronavirus | 838 | coronavirus | 340.2731
title | disease | 737 | viruses | 334.4678
title | viruses | 696 | disease | 304.1021
title | analysis | 635 | health | 293.8622
title | protein | 633 | analysis | 271.9597
title | infectious | 607 | novel | 270.3096
title | health | 603 | infections | 262.4356
title | novel | 586 | infectious | 259.7204
title | cells | 583 | protein | 258.0374
title | study | 563 | china | 240.0059
title | detection | 499 | cells | 239.4819
title | syndrome | 494 | detection | 229.1926
title | infections | 493 | rna | 224.2244
title | acute | 470 | study | 218.6107
title | rna | 453 | clinical | 210.7424
title | china | 442 | diseases | 208.1201
title | clinical | 435 | review | 203.7603
title | porcine | 423 | cell | 198.0351
title | cell | 419 | acute | 193.4945
title | patients | 413 | emerging | 192.4230
title | open | 413 | epidemic | 191.9413
title | diseases | 407 | syndrome | 190.8752
title | using | 406 | patients | 190.6341
title | response | 380 | porcine | 190.4315
title | epidemic | 365 | open | 186.5899
title | review | 343 | research | 184.9458
title | vaccine | 336 | vaccine | 183.3491
title | expression | 316 | response | 182.6597
title | children | 310 | sars | 182.3354
title | molecular | 308 | transmission | 178.9190
title | transmission | 307 | using | 175.5547
title | host | 297 | molecular | 166.3554
title | middle | 292 | article | 165.3193
title | east | 291 | children | 162.8521
title | associated | 290 | outbreak | 159.7631
title | outbreak | 285 | antiviral | 157.7369
title | immune | 281 | host | 156.8507
title | replication | 272 | pandemic | 153.9406
title | antiviral | 267 | replication | 150.7451
title | public | 266 | new | 150.1894
title | severe | 263 | associated | 149.1044
title | identification | 261 | expression | 149.0908
title | emerging | 261 | role | 145.7078
title | development | 256 | identification | 145.4885
title | new | 251 | immune | 143.4217
title | gene | 251 | potential | 140.4981
title | research | 251 | public | 140.0130
title | potential | 250 | characterization | 139.9151
title | characterization | 249 | development | 139.8788
title | risk | 245 | access | 139.3176
title | sars | 242 | surveillance | 137.5564
title | pandemic | 239 | case | 137.2556
title | case | 237 | proteins | 134.7585
title | diarrhea | 234 | risk | 134.0362
title | role | 230 | pneumonia | 133.4136
title | model | 228 | severe | 133.1649
title | surveillance | 227 | east | 131.8575
title | among | 226 | middle | 131.6582
title | proteins | 226 | covid | 129.1168
title | activity | 224 | gene | 129.0419
title | responses | 222 | system | 127.5701
title | system | 222 | activity | 126.6272
title | mice | 222 | genome | 126.0038
title | pneumonia | 221 | epidemiology | 123.6403
title | type | 218 | control | 123.1863
title | data | 216 | global | 122.2663
title | genome | 210 | model | 122.0775
title | control | 205 | information | 121.8169
title | avian | 200 | supplementary | 121.5287
title | evaluation | 199 | pathogens | 121.2429
title | covid | 190 | diarrhea | 120.8964
title | based | 186 | data | 120.1016
title | ebola | 186 | hiv | 119.5215
title | hiv | 184 | hepatitis | 118.3884
title | pathogens | 183 | ebola | 117.6904
title | global | 182 | report | 117.5863
title | receptor | 177 | type | 115.4013
title | hepatitis | 173 | responses | 114.9123
title | journal | 173 | mice | 114.8206
title | factors | 170 | avian | 113.9809
title | epidemiology | 167 | bats | 113.2802
title | infected | 166 | among | 112.0897
title | rapid | 166 | vaccines | 110.7445
title | dna | 163 | rapid | 108.5839
title | care | 162 | evolution | 106.1581
title | sequence | 161 | journal | 105.7244
title | antibody | 159 | sequence | 104.9246
title | treatment | 158 | merscov | 104.2743
title | lung | 155 | dna | 102.7831
title | genetic | 152 | evaluation | 102.6775
title | merscov | 151 | based | 101.7596
title | antibodies | 151 | entry | 100.6673
title | information | 149 | sarscov | 100.1908
title | report | 149 | antibodies | 100.1227

Word Cloud (with Word Freq.) is:

Suppose we have a set of English text documents and wish to rank which document is most relevant to the query, “the white SUV”. A simple way to start out is by eliminating documents that do not contain all three words “the”, “white”, and “SUV”, but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its term frequency. However, in the case where the length of documents varies greatly, adjustments are often made.
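The adjustment for document length mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `term_frequency` helper and the two sample documents are hypothetical, and dividing the raw count by the document length is just one common normalization among several.

```python
from collections import Counter


def term_frequency(tokens, normalize=True):
    """Return the frequency of each term in one document.

    With normalize=True the raw count is divided by the document length,
    so long documents do not dominate simply because they are long.
    """
    counts = Counter(tokens)
    if not normalize:
        return dict(counts)
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


# Two hypothetical documents of very different lengths.
short_doc = "the white suv".split()
long_doc = (
    "the white suv was parked next to the other white car "
    "near the old white house"
).split()

raw_short = term_frequency(short_doc, normalize=False)["white"]
raw_long = term_frequency(long_doc, normalize=False)["white"]
norm_short = term_frequency(short_doc)["white"]
norm_long = term_frequency(long_doc)["white"]

print("raw counts:", raw_short, raw_long)
print("normalized:", round(norm_short, 4), round(norm_long, 4))
```

By raw count the longer document looks more relevant for "white", but once the count is divided by the document length the short document scores higher, which is the effect the adjustment is meant to capture.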

Word Cloud (with Word TF-IDF Weight) is:

In information retrieval, tf–idf or TFIDF (short for term frequency–inverse document frequency) is a numerical metric intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor (more elaborate than the simple word frequency above) in information retrieval searches, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. This helps to adjust for the fact that some words appear more frequently in general and are therefore less important when searching for keywords. tf–idf is one of the most popular term-weighting schemes in today's information retrieval systems for text documents.
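As a rough illustration of this definition, the following Python sketch computes tf–idf with one common textbook variant: tf is the term count divided by the document length, and idf is the natural logarithm of the number of documents divided by the number of documents containing the term. The report does not state which variant Statsframe uses, so the formula, the `tf_idf` and `corpus_weights` helpers, and the sample documents are assumptions. Summing the per-document weights to get a single weight per word is likewise only one plausible way to arrive at a corpus-level ranking.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def corpus_weights(per_doc_weights):
    """Sum the tf-idf weights of each term over all documents (an assumption)."""
    totals = Counter()
    for doc_weights in per_doc_weights:
        for term, value in doc_weights.items():
            totals[term] += value
    return totals


# Three hypothetical tokenized documents.
docs = [
    "the white suv".split(),
    "the white house".split(),
    "the red suv".split(),
]

per_doc = tf_idf(docs)
totals = corpus_weights(per_doc)
for term, weight in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.4f}")
```

In this toy corpus "the" appears in every document, so its idf is ln(1) = 0 and its weight vanishes, while words that occur in a single document receive the highest weights.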

Here is another example, this time about the abstracts of COVID-19 scientific papers:

Statsframe ULTRA DATASET REPORT (https://statsframe.com)

Paper Abstracts WordClouds For:

COVID-19 Open Research Dataset Challenge (CORD-19)

An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House


Word Cloud Text Analysis:

About the Text Variables:

Text variables are best analyzed as a whole. In this report on the variable “abstract”, we analyze word frequency and weighted word frequency. These analyses give the reader a better notion of the keywords present across all instances in the data. When such variables come from survey respondents' answers, the analysis can be complemented by grouping the answers. Grouping is therefore advisable to get a better look at this variable's behavior and its influence on other variables in the dataset. If you have doubts or need further analysis, please contact us; to ask for help, feel free to use our ticket system.

Results:

Variable | word.freq | freq | word.tfidf | weight.tf.idf
abstract | virus | 5009 | virus | 92.69545
abstract | infection | 3800 | patients | 80.09655
abstract | viral | 3695 | infection | 77.75323
abstract | cells | 3086 | respiratory | 73.07172
abstract | study | 2705 | viral | 72.21476
abstract | patients | 2672 | cells | 65.22349
abstract | protein | 2646 | influenza | 62.92086
abstract | respiratory | 2609 | viruses | 60.29487
abstract | disease | 2467 | study | 59.20572
abstract | viruses | 2460 | disease | 57.22941
abstract | human | 2358 | protein | 55.23234
abstract | results | 2325 | human | 53.90775
abstract | using | 2300 | background | 51.88825
abstract | can | 2018 | health | 50.29953
abstract | data | 1978 | data | 49.47286
abstract | also | 1939 | using | 46.69118
abstract | rna | 1923 | results | 46.09648
abstract | cell | 1850 | cases | 44.86466
abstract | used | 1829 | infections | 43.62141
abstract | proteins | 1819 | cell | 42.95411
abstract | background | 1695 | rna | 42.92258
abstract | influenza | 1685 | clinical | 42.55537
abstract | health | 1665 | used | 41.60691
abstract | two | 1598 | methods | 40.70162
abstract | analysis | 1546 | can | 40.04584
abstract | however | 1534 | proteins | 39.95884
abstract | methods | 1532 | analysis | 39.72378
abstract | host | 1512 | children | 39.12819
abstract | expression | 1490 | expression | 38.37127
abstract | infections | 1486 | coronavirus | 37.31657
abstract | clinical | 1409 | samples | 36.24646
abstract | cases | 1401 | also | 35.82233
abstract | may | 1357 | host | 35.77510
abstract | including | 1353 | acute | 35.34221
abstract | different | 1349 | two | 35.28536
abstract | one | 1326 | model | 34.76209
abstract | associated | 1310 | associated | 34.74370
abstract | studies | 1284 | transmission | 34.19541
abstract | identified | 1250 | however | 33.31028
abstract | replication | 1232 | severe | 33.22310
abstract | model | 1223 | infectious | 33.01137
abstract | gene | 1203 | studies | 32.87224
abstract | high | 1180 | risk | 32.86041
abstract | found | 1160 | may | 32.73593
abstract | samples | 1157 | gene | 32.39170
abstract | response | 1146 | diseases | 32.30309
abstract | control | 1122 | control | 31.77444
abstract | diseases | 1105 | one | 31.71887
abstract | transmission | 1099 | different | 31.59783
abstract | severe | 1090 | response | 31.42600
abstract | infectious | 1078 | identified | 31.41202
abstract | immune | 1065 | including | 31.29605
abstract | genes | 1048 | immune | 31.13356
abstract | acute | 1040 | genes | 30.97422
abstract | among | 1039 | outbreak | 30.73294
abstract | infected | 1038 | mice | 30.72850
abstract | role | 1022 | high | 30.70275
abstract | activity | 1012 | among | 30.39152
abstract | number | 1009 | replication | 29.90671
abstract | coronavirus | 1007 | china | 29.84132
abstract | mice | 994 | detection | 29.69279
abstract | important | 986 | number | 29.26682
abstract | showed | 966 | pneumonia | 29.24120
abstract | compared | 962 | pathogens | 29.09911
abstract | new | 951 | infected | 29.03451
abstract | factors | 932 | novel | 28.61948
abstract | potential | 929 | found | 28.44668
abstract | well | 917 | treatment | 28.25726
abstract | treatment | 912 | epidemic | 28.02363
abstract | genome | 907 | sars | 27.69046
abstract | time | 906 | group | 27.55645
abstract | risk | 897 | vaccine | 27.36175
abstract | children | 893 | merscov | 27.33319
abstract | system | 893 | new | 27.30292
abstract | based | 891 | public | 27.28561
abstract | antiviral | 887 | compared | 27.23466
abstract | species | 887 | sarscov | 26.95335
abstract | three | 887 | time | 26.94145
abstract | pathogens | 878 | rsv | 26.90683
abstract | group | 873 | pandemic | 26.83463
abstract | significant | 864 | years | 26.80969
abstract | first | 860 | detected | 26.62188
abstract | use | 859 | factors | 26.35281
abstract | significantly | 837 | activity | 26.32098
abstract | detected | 833 | information | 26.26531
abstract | novel | 828 | use | 26.25516
abstract | levels | 824 | role | 26.24666
abstract | years | 813 | important | 26.10616
abstract | outbreak | 807 | potential | 26.09487
abstract | reported | 804 | system | 26.01824
abstract | increased | 798 | based | 25.98676
abstract | specific | 798 | antiviral | 25.91031
abstract | development | 797 | showed | 25.84423
abstract | detection | 792 | positive | 25.78238
abstract | positive | 782 | species | 25.59791
abstract | within | 779 | days | 25.38459
abstract | higher | 773 | age | 25.37997
abstract | sequence | 772 | assay | 25.05421
abstract | several | 772 | surveillance | 24.91657
abstract | merscov | 768 | genome | 24.50598

Word Cloud (with Word Freq.) is:

Suppose we have a set of English text documents and wish to rank which document is most relevant to the query, “the white SUV”. A simple way to start out is by eliminating documents that do not contain all three words “the”, “white”, and “SUV”, but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its term frequency. However, in the case where the length of documents varies greatly, adjustments are often made.

Word Cloud (with Word TF-IDF Weight) is:

In information retrieval, tf–idf or TFIDF (short for term frequency–inverse document frequency) is a numerical metric intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor (more elaborate than the simple word frequency above) in information retrieval searches, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. This helps to adjust for the fact that some words appear more frequently in general and are therefore less important when searching for keywords. tf–idf is one of the most popular term-weighting schemes in today's information retrieval systems for text documents.