554.488/688 Computing for Applied Mathematics
Spring 2023 - Final Project Assignment
The aim of this assignment is to give you a chance to exercise your skills at prediction using
Python. You have been sent an email with a link to data collected on a random sample from some
population of Wikipedia pages, to develop prediction models for three different web page attributes.
Each student is provided with their own data drawn from a Wikipedia page population unique
to that student, and this comes in the form of two files:
A training set which is a pickled pandas data frame with 200,000 rows and 44 columns. Each
row corresponds to a distinct Wikipedia page/url drawn at random from a certain population
of Wikipedia pages. The columns are
– URLID in column 0, which gives a unique identifier for each url. You will not be able to
determine the url from the URLID or the rest of the data. (It would be a waste of time
to try, so the only information you have about each url is what is provided in the dataset itself.)
– 40 feature/predictor variable columns in columns 1,...,40, each associated with a particular
word (the word is in the header). For each url/Wikipedia page, the word column gives
the number of times that word appears in the associated page.
– Three response variables in columns 41, 42 and 43
* length = the length of the page, defined as the total number of characters in the
page
* date = the last date when the page was edited
* word present = a binary variable indicating whether at least one of 5 possible words
(using a word list of 5 words specific to each student and not among the 40 feature
words)¹ appears in the page
A test set which is also a pickled pandas data frame with 50,000 rows but with 41 columns
since the response variables (length, date, word present) are not available to you. The rows
of the test dataset also correspond to distinct url/pages drawn from the same Wikipedia
url/page population as the training dataset (with no pages in common with the training set
pages). The response variables have been removed so that the columns that are available are
– URLID in column 0
– the same 40 feature/predictor variable columns corresponding to word counts for the
same 40 words as in the training set
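Both files are pickled pandas data frames, so they load with pd.read_pickle. A minimal sketch, using a fabricated stand-in frame with the same layout since the real file names (from your emailed link) are not known here:

```python
# Sketch of loading a pickled pandas data frame with the assignment's layout
# (URLID, 40 word-count columns, 3 responses).  The file name below is a
# placeholder: substitute the name from your emailed link.
import numpy as np
import pandas as pd

# For illustration only: fabricate and pickle a tiny 10-row stand-in.
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(40)]
demo = pd.DataFrame(rng.integers(0, 5, size=(10, 40)), columns=words)
demo.insert(0, "URLID", range(10))
demo["length"] = rng.integers(1000, 50000, size=10)
demo["date"] = "2023-01-01"
demo["word_present"] = rng.integers(0, 2, size=10)
demo.to_pickle("demo_train.pkl")

train = pd.read_pickle("demo_train.pkl")
feature_cols = train.columns[1:41]   # the 40 word-count predictors
print(train.shape)                   # (10, 44)
```

With the real training file the shape would be (200000, 44), and the test file (50000, 41) since the three response columns are absent.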
Your goal is to use the training data to
predict the length variable for pages in the test dataset
¹ What this list of 5 words is will not be revealed to you, and it would be a waste of time trying to figure out
what it is.
predict the mean absolute error you expect to achieve in your predictions of length in the test
dataset
predict word present for pages in the test dataset, attempting to make the false positive
rate as close as you can to .05², and make the true positive rate as high as you possibly can³,
predict your true positive rate for word present in the test dataset
predict edited 2023 (a binary variable indicating whether the page was last edited in 2023,
derived from the date variable) for pages in the test dataset, attempting to make the false
positive rate as close as you can to .05⁴, and make the true positive rate as high as you possibly can⁵,
predict your true positive rate for edited 2023 in the test dataset
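One standard way to aim for a .05 false positive rate is to tune a score threshold on held-out labeled data: set the threshold at the 95th percentile of the scores of the true negatives. A minimal numpy sketch with synthetic stand-in scores and labels (in the real assignment these would come from your model and a validation split):

```python
# Sketch: choosing a classification threshold to target a 5% false positive
# rate.  Assumes real-valued scores (e.g. predicted probabilities) and true
# 0/1 labels on a held-out validation set; arrays here are synthetic.
import numpy as np

rng = np.random.default_rng(1)
y_val = rng.integers(0, 2, size=2000)                    # true labels
scores = rng.normal(loc=y_val.astype(float), scale=1.0)  # higher for positives

# Threshold = 95th percentile of scores among true negatives, so that
# roughly 5% of negatives fall above it (predicted positive).
neg_scores = scores[y_val == 0]
threshold = np.quantile(neg_scores, 0.95)

pred = (scores > threshold).astype(int)
fpr = pred[y_val == 0].mean()   # should land close to 0.05
tpr = pred[y_val == 1].mean()
print(round(fpr, 3), round(tpr, 3))
```

The same threshold is then applied to the test-set scores; how well the .05 transfers depends on the validation split being representative.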
Since I have the response variable values (length, word present, date) for the pages in your test
dataset, I can determine the performance of your predictions. Since you do not have those variables,
you will need to set aside some data in your training set or use cross-validation to estimate the
performance of your prediction models.
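A minimal sketch of the held-out-data approach, using a deliberately simple baseline (predict the training-half mean of length) on synthetic data; with the real training frame you would split the 200,000 rows the same way:

```python
# Sketch: estimating the mean absolute error of a length predictor by
# holding out 20% of the training rows.  The response is synthetic and the
# "model" is just the training-half mean, for illustration only.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
length = rng.gamma(shape=2.0, scale=5000.0, size=n)  # fake response

idx = rng.permutation(n)                   # 80/20 split of row indices
train_idx, hold_idx = idx[:8000], idx[8000:]

baseline_pred = length[train_idx].mean()
mae_estimate = np.abs(length[hold_idx] - baseline_pred).mean()
print(round(mae_estimate, 1))
```

The held-out MAE is the number you would report in Part 3 as your predicted test-set error; cross-validation (averaging this over several splits) gives a less noisy estimate.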
There are 3 different parts of this assignment, each requiring a submission:
Part 1 (30 points) - a Jupyter notebook containing
– a description (in words, no code) of the steps you followed to arrive at your predictions
and your estimates of prediction quality - including a description of any separation of
your training data into training and testing data, method you used for imputation,
methods you tried to use for making predictions (e.g. regression, logistic regression, ...)
followed by
– the code you used in your calculations
Part 2 (60 points) - a csv file with your predictions - this file should consist of exactly 4
columns, with⁶
– a header row with URLID, length, word present, edited 2023
– 50,000 additional rows
– every URLID in your test dataset appearing in the URLID column - not altered in any
way!
– no missing values
– data type for the length column should be integer or float
– data type for the word present column should be either integer (0 or 1), float (0. or 1.)
or Boolean (False/True)
² false positive rate = proportion of pages for which word present is 0 but predicted to be 1
³ true positive rate = proportion of pages for which word present is 1 and predicted to be 1
⁴ false positive rate = proportion of pages for which edited 2023 is 0 but predicted to be 1
⁵ true positive rate = proportion of pages for which edited 2023 is 1 and predicted to be 1
⁶ a notebook is provided to you for checking that your csv file is properly formatted
– data type for the edited 2023 column should be either integer (0 or 1), float (0. or 1.)
or Boolean (False/True)
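A sketch of writing a csv in the required shape using only the standard library; the two prediction rows are placeholders, and the header names follow the spec above:

```python
# Sketch of the Part 2 csv: exactly 4 columns with the required header row.
# The prediction values below are made-up placeholders.
import csv

predictions = [
    {"URLID": 101, "length": 5234, "word present": 1, "edited 2023": 0},
    {"URLID": 102, "length": 871,  "word present": 0, "edited 2023": 1},
]

with open("predictions.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["URLID", "length", "word present", "edited 2023"]
    )
    writer.writeheader()
    writer.writerows(predictions)
```

In the real submission there would be 50,000 data rows, one per test-set URLID, with the URLIDs copied over unaltered.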
Part 3 (30 points) - providing estimates of the following in a form:
– what do you predict the mean absolute error of your length predictions to be?
– what do you predict the true positive rate for your word present predictions to be?
– what do you predict the true positive rate for your edited 2023 predictions to be?
Your score in this assignment will be based on
Part 1 (30 points)
– evidence of how much effort you put into the assignment (how many different methods
did you try?)
– how well did you document what you did?
– was your method for predicting the quality of your performance prone to over-fitting?
Part 2 (60 points)
– how good are your predictions of length, word present, edited 2023 - I will do predictions
using your training data and I will compare
* your length mean absolute deviation to what I obtained in my predictions
* your true positive rate to what I obtained for the binary variables (assuming you
managed to appropriately control the false positive rate)
– how well did you meet specifications - did you get your false positive rate in predictions
of the binary variables close to .05 (again, compared to how well I was able to do this)
Part 3 (30 points)
– how good is your prediction of the length mean absolute deviation
– how good is your prediction of the true positive rate for the word present variable
– how good is your prediction of the true positive rate for the edited 2023 variable
How the datasets were produced
This is information that will not be of much help to you in completing the assignment, except
maybe to convince you that there would be no point in using one of the other students’ data in
completing this assignment.
I web crawled Wikipedia to arrive at a random sample of around 2,000,000 pages.
I made a list of 100 random words and extracted the length, the word counts, and the last
date edited for each page.
To create one of the student personal datasets, I repeated the following steps for each student
Repeat
Chose 10 random words w0,w1,...,w9 out of the 100 words in the list above
Determined the subsample of pages having w0 and w1 but not w2, w3 or w4.
Used the words w5,w6,w7,w8 and w9 to create the word_present variable
Until
the subsample has at least 250,000 pages
Randomly sampled 40 of 90 unsampled words without replacement
Randomly sampled without replacement 250,000 pages out of the subsample
Retained only the 250,000 pages and
word counts for the 40 words
length
word_present
last date edited
Randomly assigned missing values in the feature (word count) data
Randomly separated the 250,000 pages into
200,000 training pages
50,000 test pages
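The repeat-until subsampling above can be sketched as follows, on scaled-down synthetic data (a 20-word master list, 5,000 pages, and a threshold of 100 instead of 100 words / ~2,000,000 pages / 250,000):

```python
# Rough sketch of the dataset-construction loop described above, at toy scale.
# `pages` maps a page id to the set of words it contains; all sizes are
# illustrative stand-ins for the real 100 words / 250,000 pages.
import random

random.seed(0)
master = [f"w{i}" for i in range(20)]          # stand-in for the 100 words
pages = {i: set(random.sample(master, 8)) for i in range(5000)}

while True:
    w = random.sample(master, 10)              # w0,...,w9
    subsample = [
        p for p, ws in pages.items()
        if w[0] in ws and w[1] in ws and not (ws & set(w[2:5]))
    ]
    if len(subsample) >= 100:                  # scaled-down "250,000 pages"
        break

kept = random.sample(subsample, 100)           # pages, without replacement
# words w5..w9 define the word_present response
word_present = {p: bool(pages[p] & set(w[5:10])) for p in kept}
print(len(kept))
```

The real procedure additionally keeps only 40 of the 90 unsampled words as features and injects missing values into the word counts.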
Computer Networks: Application Layer - Web & HTTP
Computer networks series posts: table of contents
Early 1990s
Internet applications
Components of a Web application
A Web application is composed of objects. An object is a file, such as an HTML file, a JPEG image, a Java applet, or a video clip.
Each object is addressable by a URL.
A Web page usually consists of a base HTML file and several referenced objects.
URL (Uniform Resource Locator), RFC 1738
Used to address Web objects
Consists of the hostname of the server storing the object and the object's path name.
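The host/path split can be seen with the standard library; the URL below is illustrative:

```python
# Splitting a URL into the serving host and the object's path name.
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.edu/someDept/pic.gif")
print(parts.hostname)  # www.example.edu
print(parts.path)      # /someDept/pic.gif
```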
HTTP is implemented by a client program and a server program, which converse by exchanging HTTP messages.
The HTTP specification defines the protocol by which HTTP clients and servers communicate.
Web browsers implement the client side of HTTP: they request, receive, and display Web objects.
Web servers implement the server side of HTTP: they respond to client requests and send objects.
HTTP uses TCP as its underlying transport protocol.
Port: 80
Stateless protocol: the server keeps no information about clients.
The server sends the requested file to the client without storing any state information about that client.
Round-Trip Time (RTT)
The time it takes a small packet to travel from the client to the server and back to the client.
Non-persistent connections: each request/response pair in a client-server session is sent over a separate TCP connection.
HTTP/1.0 uses non-persistent connections.
With serial requests, the client requests one Web object at a time, sending the request for the next object only after the previous object has been fully received.
Time analysis
Browsers usually support parallel TCP connections, typically 5 to 10 at a time.
With parallel connections, the client opens several TCP connections at once to request several Web objects simultaneously.
Time analysis
Persistent connections: all request/response pairs in a client-server session are sent over the same TCP connection.
HTTP/1.1 uses persistent connections by default, though either the client or the server can be configured to use non-persistent connections.
Without pipelining, the client sends a new request only after receiving the previous response
(serial requests within a single TCP connection).
Time analysis
With pipelining, the client issues a request as soon as it encounters a referenced object
(parallel requests within a single TCP connection).
This is the default in HTTP/1.1.
Time analysis
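The time analyses above can be compared with a back-of-the-envelope calculation, under the usual textbook model: each new TCP connection costs 1 RTT of handshake, each request/response costs 1 RTT, and transmission time is ignored. The RTT, object count, and connection count are illustrative assumptions:

```python
# Rough timing comparison of the four connection strategies.
RTT = 0.1          # seconds, assumed
N = 10             # referenced objects on the page, assumed

serial_nonpersistent = N * 2 * RTT                 # 2 RTT per object
parallel_nonpersistent = -(-N // 5) * 2 * RTT      # 5 parallel connections
persistent_no_pipelining = RTT + N * RTT           # 1 handshake + 1 RTT each
persistent_pipelined = RTT + RTT                   # 1 handshake + ~1 RTT total

print(serial_nonpersistent, persistent_pipelined)  # 2.0 0.2
```

The ordering (serial non-persistent slowest, pipelined persistent fastest) holds for any positive RTT under this model.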
TCP three-way handshake
1. The client sends a small TCP segment to the server;
2. The server acknowledges and responds with a small TCP segment;
3. The client returns an acknowledgment together with an HTTP request message;
4. The server returns the corresponding HTML file;
HTTP specifications
RFC 1945, RFC 2616
Written in ASCII text
HTTP has two message types: request messages and response messages
Request line: the first line of an HTTP request message
Method
Header lines: the lines after the request line, carrying session information
Blank line: a carriage return and line feed separating the header lines from the entity body
Entity body
Empty under the GET method
Contains the form data under the POST method
Status line
Common status codes
Header lines
Blank line
Entity body
Contains the requested object
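The request-message anatomy above, spelled out as a minimal GET request assembled from a request line, header lines, a blank line, and an (empty) entity body; the host and path are illustrative:

```python
# A minimal HTTP request message built piece by piece.
request = (
    "GET /someDept/pic.gif HTTP/1.1\r\n"   # request line: method, URL, version
    "Host: www.example.edu\r\n"            # header line
    "Connection: close\r\n"                # header line
    "\r\n"                                 # blank line; GET has no entity body
)

request_line = request.split("\r\n")[0]
method, path, version = request_line.split()
print(method, path)  # GET /someDept/pic.gif
```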
HTTP is a stateless protocol, but cookie technology allows servers to identify users
Cookies build a user session layer on top of stateless HTTP
See [RFC 6265]
Cookie components
The controversy around cookies is that they can leak users' private information
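The server-side component (the Set-Cookie header carrying a server-assigned identifier) can be sketched with the standard library; the cookie name and value are made up:

```python
# Building the Set-Cookie header value a server would send to tag a user.
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["session_id"] = "1678"        # identifier the server assigns the user
cookie["session_id"]["path"] = "/"

header = cookie["session_id"].OutputString()
print(header)  # session_id=1678; Path=/
```

On later requests the browser echoes this identifier back in a Cookie header, which is what lets the stateless server recognize the user.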
Web cache (proxy server): a network entity that responds to HTTP requests on behalf of the origin Web server
Web caches are typically purchased and installed by ISPs
Conditional GET: lets a cache verify that its cached copy is up to date.
If the cache holds the latest version of a Web object, the origin server does not need to resend that object
The HTTP request message declares the date of the version the cache holds
If-modified-since: <date>
If the cached version is up to date, the response message does not include the object
HTTP/1.0 304 Not Modified
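The conditional-GET logic above as a tiny simulation: the "server" returns 304 when the object has not been modified since the date the cache holds. The timestamps and the helper function are illustrative, not a real HTTP client:

```python
# Simulating the server side of a conditional GET with Unix timestamps.
from email.utils import formatdate

def respond(last_modified_ts, if_modified_since_ts):
    """Status line a server would send for a conditional GET."""
    if last_modified_ts <= if_modified_since_ts:
        return "HTTP/1.0 304 Not Modified"   # cache's copy is fresh
    return "HTTP/1.0 200 OK"                 # send the newer object

cached_at = 1_700_000_000                    # when the cache fetched its copy
header = "If-modified-since: " + formatdate(cached_at, usegmt=True)

print(respond(1_690_000_000, cached_at))  # HTTP/1.0 304 Not Modified
print(respond(1_710_000_000, cached_at))  # HTTP/1.0 200 OK
```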
Content Distribution Network (CDN)
Building on caching, CDN companies install many geographically distributed caches across the Internet, localizing much of the traffic.
There are shared CDNs (Akamai, Limelight) and dedicated CDNs (Google, Microsoft)