丁智 / forkdataset

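This repository does not declare an open-source license file (LICENSE); before use, check the project description and the upstream dependencies of its code.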
README
annotations_creators: crowdsourced
language_creators: found
language: en
license: unknown
multilinguality: monolingual
size_categories: 100K, 10K
source_datasets: original
task_categories: text-classification
task_ids: text-scoring, sentiment-classification, sentiment-scoring
paperswithcode_id: sst
pretty_name: Stanford Sentiment Treebank
config_names: default, dictionary, ptb

dataset_info:

  • config_name: default
    features: sentence (string), label (float32), tokens (string), tree (string)
    splits: train (2818768 bytes, 8544 examples), validation (366205 bytes, 1101 examples), test (730154 bytes, 2210 examples)
    download_size: 7162356
    dataset_size: 3915127
  • config_name: dictionary
    features: phrase (string), label (float32)
    splits: dictionary (12121843 bytes, 239232 examples)
    download_size: 7162356
    dataset_size: 12121843
  • config_name: ptb
    features: ptb_tree (string)
    splits: train (2185694 bytes, 8544 examples), validation (284132 bytes, 1101 examples), test (566248 bytes, 2210 examples)
    download_size: 7162356
    dataset_size: 3036074

Dataset Card for sst

Dataset Description

Dataset Summary

The Stanford Sentiment Treebank is the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language.

Supported Tasks and Leaderboards

  • sentiment-scoring: Each complete sentence is annotated with a float label from 0.0 to 1.0 that indicates its level of positive sentiment. You can use only the complete sentences or also include the contributions of their sub-sentences (phrases). The label of each phrase is stored in the dictionary configuration; to obtain all the phrases of a sentence, traverse the parse tree included with each example. The ptb configuration, by contrast, provides the fully labelled parse trees directly in Penn Treebank format, with the labels binned into five classes from 0 to 4.
  • sentiment-classification: We can transform the above into a binary sentiment classification task by rounding each label to 0 or 1.
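
The two label transformations described above can be sketched as follows. This is a minimal illustration, not the dataset's own code: the cutoffs 0.2, 0.4, 0.6 and 0.8 (upper bounds inclusive) are an assumption commonly used with SST and are not stated in this card, the 0.5 threshold for the binary task is likewise an assumption, and the function names are hypothetical.

```python
# Assumed cutoffs: [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
CUTOFFS = (0.2, 0.4, 0.6, 0.8)


def to_five_class(label: float) -> int:
    """Map a float label in [0.0, 1.0] to a class from 0 to 4."""
    # Count how many cutoffs the label strictly exceeds.
    return sum(label > cutoff for cutoff in CUTOFFS)


def to_binary(label: float) -> int:
    """Map a float label to 0 or 1 (assumed rule: 0.5 and above is 1)."""
    # An explicit threshold avoids Python's round-half-to-even behaviour.
    return int(label >= 0.5)


# Label of the example sentence shown under "Data Instances".
example_label = 0.7222200036048889
five_class = to_five_class(example_label)  # 3
binary = to_binary(example_label)          # 1
```

Under these assumed cutoffs the example sentence falls into class 3, which agrees with the root label of the same sentence in the ptb configuration.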

Languages

The text in the dataset is in English.

Dataset Structure

Data Instances

For the default configuration:

{'label': 0.7222200036048889,
 'sentence': 'Yet the act is still charming here .',
 'tokens': 'Yet|the|act|is|still|charming|here|.',
 'tree': '15|13|13|10|9|9|11|12|10|11|12|14|14|15|0'}

For the dictionary configuration:

{'label': 0.7361099720001221,
 'phrase': 'still charming'}

For the ptb configuration:

{'ptb_tree': '(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .))))'}
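
To read the labelled phrases out of a ptb_tree string, a small recursive parser is enough. The sketch below is one possible approach, not part of the dataset tooling; the function name is hypothetical, and it assumes that every node has the form "(label child ...)" and that literal parentheses inside words are escaped, as is usual in Penn Treebank files.

```python
def labelled_spans(ptb_tree: str) -> list[tuple[int, str]]:
    """Return (label, phrase) pairs for every node, children before parents."""
    # Pad the brackets with spaces so that a plain split() tokenizes the tree.
    symbols = ptb_tree.replace("(", " ( ").replace(")", " ) ").split()
    spans: list[tuple[int, str]] = []
    pos = 0

    def parse_node() -> list[str]:
        nonlocal pos
        assert symbols[pos] == "(", "expected an opening bracket"
        pos += 1
        label = int(symbols[pos])
        pos += 1
        words: list[str] = []
        if symbols[pos] == "(":
            # Internal node: gather the words of each child subtree.
            while symbols[pos] == "(":
                words.extend(parse_node())
        else:
            # Leaf node: a single word.
            words.append(symbols[pos])
            pos += 1
        assert symbols[pos] == ")", "expected a closing bracket"
        pos += 1
        spans.append((label, " ".join(words)))
        return words

    parse_node()
    return spans


example_tree = (
    "(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) "
    "(4 charming))) (2 here)) (2 .))))"
)
example_spans = labelled_spans(example_tree)
```

For the example tree this yields 15 pairs, ending with the whole sentence and its root label, (3, 'Yet the act is still charming here .').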

Data Fields

  • sentence: a complete sentence expressing an opinion about a film
  • label: the degree of "positivity" of the opinion, on a scale between 0.0 and 1.0
  • tokens: a sequence of tokens that form a sentence
  • tree: a sentence parse tree formatted as a parent pointer tree
  • phrase: a sub-sentence of a complete sentence
  • ptb_tree: a sentence parse tree formatted in Penn Treebank-style, where each component's degree of positive sentiment is labelled on a scale from 0 to 4
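
The parent pointer format of the tree field can be decoded into phrases as sketched below. The sketch assumes a layout that is consistent with the example instance but not spelled out in this card: nodes are numbered from 1, the first len(tokens) nodes are the tokens in sentence order, entry i gives the parent of node i, and 0 marks the parent of the root. The function name is hypothetical, and tokens that themselves contain "|" are not handled.

```python
def phrases_from_parent_pointers(tokens_field: str, tree_field: str) -> list[str]:
    """Return the phrase covered by each node, ordered by node id."""
    tokens = tokens_field.split("|")
    parents = [int(p) for p in tree_field.split("|")]

    # covered[node] holds the 0-based token indices below that node (1-based id).
    covered: dict[int, list[int]] = {n: [] for n in range(1, len(parents) + 1)}
    for leaf in range(1, len(tokens) + 1):
        node = leaf
        # Walk from the leaf up to the root, registering the token at each node.
        while node != 0:
            covered[node].append(leaf - 1)
            node = parents[node - 1]

    return [
        " ".join(tokens[i] for i in sorted(covered[node]))
        for node in sorted(covered)
    ]


example_phrases = phrases_from_parent_pointers(
    "Yet|the|act|is|still|charming|here|.",
    "15|13|13|10|9|9|11|12|10|11|12|14|14|15|0",
)
```

Each resulting phrase, for instance 'still charming' for node 9, can then be looked up in the dictionary configuration to obtain its label.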

Data Splits

The set of complete sentences (both default and ptb configurations) is split into a training, validation and test set. The dictionary configuration has only one split as it is used for reference rather than for learning.

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

Rotten Tomatoes reviewers.

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

[Needs More Information]

Citation Information

@inproceedings{socher-etal-2013-recursive,
    title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank",
    author = "Socher, Richard  and
      Perelygin, Alex  and
      Wu, Jean  and
      Chuang, Jason  and
      Manning, Christopher D.  and
      Ng, Andrew  and
      Potts, Christopher",
    booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing",
    month = oct,
    year = "2013",
    address = "Seattle, Washington, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D13-1170",
    pages = "1631--1642",
}

Contributions

Thanks to @patpizio for adding this dataset.

About

Mirror of https://huggingface.co/datasets/sst
