
Statistical Analysis of Language

 

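Exam: starts 18 1400. Venue: Room 206, Teaching Building 3 (三教206). Please arrive at the exam room 10 minutes early. Bringing a calculator is recommended.

Notice: please check your homework submission records. There may be some omissions; if you find a counting error, contact me promptly.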

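Final project requirements:

Carry out a piece of work that verifies or falsifies a claim. For example: were the first 80 chapters and the last 40 chapters of Dream of the Red Chamber (红楼梦) written by the same person? Were the works of Han Han (韩寒) written by a single person (or by Han Han himself at all)?

State the evidence on which you base your conclusion.

The key is to have both a conclusion and a method.

The requirements are much the same as for the earlier presentation: write a report, and preferably send me your program along with it.

The deadline may be extended: submit before 2013117.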

 

 

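Homework 7 was generally done poorly. Please read through the analysis-of-variance procedure carefully in the textbook or the slides.

Note: many students obtained the homework results directly with R, but you must also be able to do the calculation by hand. Practising the homework problems until they are routine is very important!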

 

Contents

Homework answers (download). Note: the answers may contain some errors; please point them out.

Review outline (download)

 

Note: each lecture covers 2 class periods.

Course introduction (1 lecture)

1st Lecture (9.11)  Introduction to the course

2nd Lecture (9.18) Corpus introduction

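3rd Lecture (9.25) Introduction to R (1)

4th Lecture (10.8) Introduction to R (2)

5th Lecture (10.16) Basic techniques for the statistical analysis of basic language questions

6th Lecture (10.23) Foundations of probability theory

7th Lecture (11.13): Confidence intervals for parameters; estimating from samples (Chap 7)

8th Lecture (11.20): Hypothesis testing (Chap 8)

9th Lecture (11.27): Foundations of information theory

10th Lecture (12.4): Language models and n-grams; testing the fit of models to data

11th Lecture (12.11): Collocations

12th Lecture (12.18): Testing interdependence and difference

13th Lecture (12.25): Analysis of variance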

 

 

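Homework 1 (deadline: 2012.10.23)

1) How would you design a scheme for counting the word-segmentation ambiguities that occur in a text?

• True ambiguity and pseudo-ambiguity

• Overlapping ambiguity and combinational ambiguity

2) Describe the type and format of the data you collected. How can it be counted and analyzed on the basis of existing corpora?

Submission: a sample data file + a report (electronic copy), sent to lisujian@pku.edu.cn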

 

Homework 2

1) Suppose one is interested in a rare syntactic construction, perhaps parasitic gaps, which occurs on average once in 100,000 sentences. Joe Linguist has developed a complicated pattern matcher that attempts to identify sentences with parasitic gaps. It's pretty good, but it's not perfect: if a sentence has a parasitic gap, it will say so with probability 0.95; if it doesn't, it will wrongly say it does with probability 0.005. Suppose the test says that a sentence contains a parasitic gap. What is the probability that this is true?
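The question above is a direct application of Bayes' theorem. The following is a minimal Python sketch (the course itself uses R); the function name `posterior_positive` and its parameter names are illustrative choices, not part of the assignment.

```python
def posterior_positive(prior, sensitivity, false_positive_rate):
    """P(construction present | matcher says yes), by Bayes' theorem."""
    true_pos = sensitivity * prior                     # P(yes and present)
    false_pos = false_positive_rate * (1.0 - prior)    # P(yes and absent)
    return true_pos / (true_pos + false_pos)


# Numbers taken from the problem statement.
p_gap = posterior_positive(prior=1 / 100_000,
                           sensitivity=0.95,
                           false_positive_rate=0.005)
print(f"P(parasitic gap | matcher says yes) = {p_gap:.5f}")
```

Because the construction is so rare, false positives dominate: even with a 95% hit rate, the posterior comes out at only about 0.0019.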

 

 

2) Programming: implement skewness in R
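As a reference for the formula, here is a short Python sketch of the moment coefficient of skewness, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. It is only one common definition (some implementations apply a small-sample correction), and the function name `skewness` is an illustrative choice. The assignment itself still asks for an R version.

```python
def skewness(values):
    """Moment coefficient of skewness g1 = m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


print(skewness([1, 2, 3, 4, 5]))   # symmetric data
print(skewness([0, 0, 0, 4]))      # long right tail, so the value is positive
```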

 

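3) In a corpus, find how the number of occurrences of a word is distributed over spans of a fixed length (you are not limited to the examples below).

For example: the distribution of the number of times words such as "the" and "of" occur per 100 words; the distribution of the number of times words such as “的” and “是” occur per 100 characters.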
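One possible way to tabulate such a distribution is sketched below in Python. The function name `counts_per_window` and the toy sentence are hypothetical; a real run would read tokens from a corpus file and use `window=100`. For Chinese text counted per character, `list(text)` could serve as the token list.

```python
from collections import Counter


def counts_per_window(tokens, target, window=100):
    """Map 'occurrences of target in one window' -> 'number of such windows'."""
    dist = Counter()
    # Only full windows are counted; a trailing partial window is dropped.
    for start in range(0, len(tokens) - window + 1, window):
        chunk = tokens[start:start + window]
        dist[chunk.count(target)] += 1
    return dist


# Toy data: 15 tokens split into three windows of 5 tokens each.
toy_tokens = "the cat sat on the mat and the dog saw the cat of the house".split()
toy_dist = counts_per_window(toy_tokens, "the", window=5)
print(sorted(toy_dist.items()))
```

The resulting table of counts can then be compared with a theoretical distribution such as the Poisson or binomial.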

 

Homework 3 (deadline: 2012.11.27)

1) We want to estimate the average number of words per news story. We take a sample of 900 news stories. The average number of words in the sample is 100. The standard deviation for the sample is 30. Give a 95% confidence interval for the mean number of words per news story in the population.
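With a sample this large, the usual normal-approximation interval is mean ± z · s / √n. The sketch below, in Python rather than R, assumes that approach; `mean_confidence_interval` is an illustrative name.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Large-sample (normal approximation) confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * sample_sd / math.sqrt(n)            # z times the standard error
    return sample_mean - margin, sample_mean + margin


low, high = mean_confidence_interval(sample_mean=100, sample_sd=30, n=900)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

The standard error here is 30 / √900 = 1, so the interval is roughly 100 ± 1.96. For a small sample, such as the 9 stories in the next problem, the normal quantile would no longer be appropriate.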

 

2) We want to estimate the average number of words per news story. We take a sample of 9 news stories. The average number of words in the sample is 100. The standard deviation for the sample is 30. Give the interval that contains 99% of news stories in the population.

 

3) A die is thrown 10000 times. The average score is 3.52 and the sample standard deviation is 3.0. Can we be 95% certain that this is an unfair die? Why?
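One way to approach the die question is a one-sample z-test against the mean of a fair die, (1+2+...+6)/6 = 3.5. The Python sketch below assumes a two-sided test at the 5% level; `one_sample_z` is an illustrative name.

```python
import math
from statistics import NormalDist


def one_sample_z(sample_mean, hypothesised_mean, sample_sd, n):
    """Return the z statistic and its two-sided p-value."""
    z = (sample_mean - hypothesised_mean) / (sample_sd / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


fair_mean = sum(range(1, 7)) / 6   # expected score of a fair die: 3.5
z, p_value = one_sample_z(3.52, fair_mean, 3.0, 10_000)
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

Here z is about 0.67, well inside ±1.96, so under these assumptions the data do not support the claim of an unfair die at the 95% level.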

 

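4) Statistics in Language Studies (《语言研究中的统计学》), p. 131, exercise (1)

 

Homework 4 (deadline: 2012.12.11)

Use a corpus to compute the entropy of Chinese characters or of English characters. (Submission: explain the concrete steps and process of the computation in detail; preferably attach the results of the language model.)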
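The simplest estimate is the unigram character entropy, H = −Σ p(c) log2 p(c), with p(c) taken as the relative frequency of character c. The Python sketch below assumes that model and ignores whitespace; `char_entropy` and the sample strings are illustrative, and a real run would read the corpus from a file.

```python
import math
from collections import Counter


def char_entropy(text):
    """Unigram character entropy in bits per character (whitespace ignored)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(char_entropy("aabb"))                   # two equally likely symbols
print(char_entropy("to be or not to be"))     # English toy example
print(char_entropy("的是的是不了"))            # Chinese toy example
```

Conditioning on previous characters (bigram or trigram models) generally gives lower per-character estimates than this unigram figure.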

 

 

 

Homework 5 (deadline: 2012.12.18)

 

1

 

2

 

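For the two problems above, take the Yates correction into account. Practise using R's chisq.test() and give the commands and the method you used.

 

 

3) p. 152, exercise (2)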
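To make the hand calculation concrete, the Python sketch below computes the Yates-corrected chi-square for a 2×2 table, χ² = N(|ad − bc| − N/2)² / ((a+b)(c+d)(a+c)(b+d)). The table values are invented for illustration and are not the data of problems 1 and 2. In R, the corresponding call would be along the lines of `chisq.test(matrix(c(10, 30, 20, 40), nrow = 2), correct = TRUE)`.

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Yates-corrected chi-square and p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # The continuity correction never pushes the difference below zero.
    diff = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p_value = chi2_yates_2x2(10, 20, 30, 40)
print(f"chi-square = {chi2:.4f}, p = {p_value:.3f}")
```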

 

Homework 6

1) Student 1 gets 2, 8, 7, and 3 points on 4 different tests. Student 2 gets 4, 1, 2, and 5 points on the same tests. Are these scores correlated?
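A hand calculation for this problem would compute Pearson's r and then the statistic t = r·√((n − 2)/(1 − r²)) with n − 2 degrees of freedom. The following Python sketch follows that route; the function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


student1 = [2, 8, 7, 3]
student2 = [4, 1, 2, 5]
r = pearson_r(student1, student2)
t = t_for_r(r, len(student1))
print(f"r = {r:.4f}, t = {t:.3f} (df = {len(student1) - 2})")
```

Although r is strongly negative (about −0.93), with only 2 degrees of freedom the two-sided 5% critical value of t is about 4.30, so |t| ≈ 3.59 falls short of it.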

 

2) Two judges rank participating couples at a dance contest as follows.

couple    a   b   c   d   e   f   g   h   j   k
judge 1   1   2   3   4   5   6   7   8   9   10
judge 2   4   2   7   1   3   10  8   9   6   5

Can we be 95% certain that the two sets of judgments are correlated?
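Since the judges give ranks, Spearman's rank correlation is the natural statistic: ρ = 1 − 6Σd² / (n(n² − 1)), which is valid when there are no tied ranks. A small Python sketch follows; `spearman_rho` is an illustrative name.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


judge1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
judge2 = [4, 2, 7, 1, 3, 10, 8, 9, 6, 5]
rho = spearman_rho(judge1, judge2)
print(f"rho = {rho:.4f}")
```

Here Σd² = 90, which gives ρ ≈ 0.45. This value is then compared with the tabulated critical value for n = 10 at the chosen significance level.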

 

3) Table 0906: Is reduplication equally likely in initial and medial position? (Use a proportion test.)

 

4) Table 1211: Are the means of Form 1 and Form 2 the same or different?

 

5) Exercises, page 193 (6)

 

 

Homework 7

Test whether the groups differ significantly from one another in pairwise comparisons.

Groups   1 Europe   2 South America   3 North Africa   4 Far East
         10         33                26               26
         19         21                25               21
         24         25                19               25
         17         32                31               22
         29         16                15               11
         37         16                25               35
         32         20                23               18
         29         13                32               12
         22         23                20               22
         31         20                15               21
Total    250        219               231              213
Mean     25.0       21.9              23.1             21.3
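Before any pairwise comparison, the hand calculation usually starts with the one-way ANOVA F statistic. The Python sketch below reads the 40 scores above as ten per group, the reading that reproduces the Total and Mean rows; `one_way_anova` is an illustrative name and the code is not a substitute for the worked procedure in the textbook.

```python
def one_way_anova(samples):
    """Return (SS_between, SS_within, df_between, df_within, F)."""
    all_values = [x for s in samples for x in s]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(s) / len(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


groups = {
    "Europe":        [10, 19, 24, 17, 29, 37, 32, 29, 22, 31],
    "South America": [33, 21, 25, 32, 16, 16, 20, 13, 23, 20],
    "North Africa":  [26, 25, 19, 31, 15, 25, 23, 32, 20, 15],
    "Far East":      [26, 21, 25, 22, 11, 35, 18, 12, 22, 21],
}
ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(list(groups.values()))
print(f"SS_between = {ss_b:.3f}, SS_within = {ss_w:.1f}")
print(f"F({df_b}, {df_w}) = {f_stat:.4f}")
```

The F value of roughly 0.55 is far below the 5% critical value for (3, 36) degrees of freedom, which is about 2.87 in standard tables. In R, `aov()` followed by `TukeyHSD()` is one common route for the pairwise step.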

 

 

 

 

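√ indicates that the homework has been submitted. Errors may creep in when the data are updated; if you have submitted a homework and notice a problem, please contact me promptly.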

Student list:

No.   Name     HW1  HW2  HW3  HW4  HW5  HW6  HW7
1     艾琦
2     张佳华
3     尼玛啦
4     孙潇雪
5     郝逸洋
6     马丁
7     赵忱
8     孟骥
9     寇然
10    包新启   -
11    高可言
12    刘阔
13    于昕元
14    张耘昊
15    黄宇钦
16    费跃
17    林舒
18    安传恺
19    邓德重
20    晓畅
21    龚尘
22    颜聪
23    周迪宇
24    黄睿哲
25    王维侬
26
27    胡明达
28    李明阳
29    林海南
30    马俊磊
31    应泽楷
32    赵晓
33    陈垚坤
34    陈桐飞
35    李嫣然
36    孙妍

 

 

 

 

 

 

 

Presentation groups:

郝逸洋/王维侬; 寇然; 安传恺/刘阔; 费跃; 黄睿哲/林舒; 晓畅/; 于昕元/孙潇雪; 胡明达/马俊磊/应泽楷

艾琦/张佳华; 包新启/高可言/马丁; 颜聪/张耘昊; 龚尘/孟骥/赵忱; 李明阳/林海南/赵晓; 陈桐飞/李嫣然

尼玛啦; 黄宇钦; 邓德重; 周迪宇; 陈垚坤

 

 


 

Reference

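Woods, A., Fletcher, P. and Hughes, A., Statistics in Language Studies (语言研究中的统计学), Foreign Language Teaching and Research Press (外语教学与研究出版社).

Manning, C. and Schütze, H. (1999), Foundations of Statistical Natural Language Processing, Cambridge, MA: MIT Press.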

Butler, C. (1985a), Statistics in Linguistics, Oxford: Basil Blackwell.

http://www.ims.uni-stuttgart.de/lehre/teaching/2005-SS/statistics/

Church, K., Gale, W., Hanks, P. and Hindle, D. (1991), "Using Statistics in Lexical Analysis", in Zernik, U. (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, Hillsdale, NJ: Lawrence Erlbaum Associates.

http://www.stat.tamu.edu/stat30x/notes/node3.html

McEnery, A., Baker, P. and Wilson, A. (1994), 'The Role of Corpora in Computer Assisted Language Learning', Computer Assisted Language Learning, 6(3), pp.233-48.

 

For R beginners:

A Framework for Statistical Programming and the Basics of Statistical Analysis with R (统计编程的框架与R语言统计分析基础)

R for Beginners (Chinese version)

 


 

 



© Peking University

Update: 2012.10.11