AIML427 Big Data:

Posted 2023-04-26 yegwy88


2023 AIML427 Big Data: Assignment 2
This assignment has 100 marks and is due at 11:59pm, Monday, 8th May 2023. Please submit
your answers as a single .pdf file. Make sure you read the Assessment section before writing
the report. This assignment contributes 25% to your overall course grade.
Any questions about Parts 1 or 2 should be directed to Bach; any questions about Part 3
should go to Qi.
1 Manifold Learning [40 marks]
In class, we discussed a variety of different manifold learning methods, which we broadly
categorised as “classic” statistical methods or “modern” ML-based methods. In this question,
you are expected to further explore the differences between these classes of methods, and to
compare them to PCA (as a linear dimensionality reduction method). You should use no more
than 3 pages to answer the following questions.
1. Find a reasonably high-dimensional (at least 100 dimensions) dataset that is interesting
to you. It should also have at least 100 instances, but preferably more. Describe the
dataset (name, related task/what it is used for, number of features and instances,
reference) and justify your choice.
2. Using your choice of library, apply PCA to the dataset, and present your results. You
should show visualisation(s) of the PCs found and also comment on the explained
variance.
3. Pick one of the “classic” manifold learning methods and apply it to the same data. Show
visualisation(s) and compare the results to that of PCA (e.g. for an embedding with two
dimensions). Highlight any differences between the two methods, and hypothesise why
they may have occurred.
4. Pick one of the “modern” manifold learning methods and apply it to the same data.
Show visualisation(s), compare and contrast the results to the two previous methods,
with analysis of any differences seen.
5. Finally, pick one of the two manifold learning methods for further analysis. Your method
will have “tunable parameters”: parameters that you can change to get different
results. Pick one such parameter, and explore how sensitive the embedding is to changes
in this parameter. You should explain the role of this parameter in the manifold learning
algorithm, how you tested its effect, and show the results found.
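As a small illustration of what "commenting on the explained variance" in item 2 can involve, the sketch below turns a list of per-component variances into explained-variance ratios and counts how many components are needed to reach a target fraction. It is written in Python 3 with the standard library only; the function names and the toy variances are hypothetical and are not taken from any particular dataset or library.

```python
def explained_variance_ratio(variances):
    """Fraction of the total variance carried by each principal component."""
    total = sum(variances)
    return [v / total for v in variances]


def components_for(variances, target):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio(variances), start=1):
        cumulative += ratio
        if cumulative >= target:
            return k
    return len(variances)


# Hypothetical variances of four principal components, largest first.
toy_variances = [5.0, 3.0, 1.0, 1.0]
ratios = explained_variance_ratio(toy_variances)
needed = components_for(toy_variances, 0.75)
```

In practice the variances would come from whichever PCA routine you use; the same cumulative view helps judge how much structure a 2D plot can show.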
2 Clustering [30 marks]
The NCI60 dataset (from the Stanford NCI60 Cancer Microarray Project) consists of p = 6,830
gene expression measurements for each of n = 64 cancer cell lines. (Sourced from An
Introduction to Statistical Learning).
In this question, you will be clustering the genes, rather than individual cancer cell lines.
This can be seen as a form of feature clustering — i.e. what genes are most related?
For each clustering method, you will need to visualise the clustering results (partition) for
that method. Given that there are 64 dimensions for each gene, for visualising the clustering
results, you should use PCA to reduce the dimensionality to 2D so that you can plot the found
clusters.
It is recommended you use either R or Python for this question, as they both have libraries
to interact with this data, as shown below. You should use no more than 3 pages to answer the
following questions.
2.1 R:
    library(ISLR)
    nci.data = NCI60$data
    X = scale(t(nci.data))
    P = X %*% prcomp(X)$rotation
X will be the data matrix of interest; note that we have transposed our data so our rows
are the genes. P holds the principal components of the data. You will use the first 2 PCs, i.e.
the first 2 columns of P, to visualise the clusters.
2.2 Python:
For Python (or any other language), you will need to first download nci60_data.csv from
the course homepage.
    import pandas as pd
    from sklearn.preprocessing import scale
    from sklearn.decomposition import PCA
    nci_data = pd.read_csv("nci60_data.csv", index_col=0)
    X = scale(nci_data.T)
    P = PCA().fit_transform(X)
X will be the data matrix of interest; note that we have transposed our data so our rows
are the genes. P holds the principal components of the data. You will use the first 2 PCs, i.e.
the first 2 columns of P, to visualise the clusters.
2.3 Tasks:
1. Carry out hierarchical clustering with Euclidean distance and complete linkage.
(a) Describe the resulting clustering for 3 to 6 clusters.
(b) Plot the first 2 principal components against each other with the colour argument
set equal to the cluster labels. What can you deduce/observe about the clustering?
2. Repeat the cluster analysis using correlation-based distance and complete linkage. NB:
you will need to precompute the correlations and pass them into your clustering method.
Compare the clusters with those found above.
3. Finally, carry out K-means clustering for 3 to 6 clusters. Compare the clusters of K-means
with that of the above two approaches. Which of the hierarchical clustering results is
more similar to that of K-means? Why?
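Task 2 asks for a precomputed correlation-based distance. One common choice (an assumption here, since the handout does not fix the formula) is 1 minus the Pearson correlation between two rows. The sketch below computes such a matrix in Python 3 with the standard library only; the function names and toy data are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_distance_matrix(rows):
    """Symmetric matrix of 1 - r for every pair of rows (assumed definition)."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - pearson(rows[i], rows[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Three toy "genes": the second is a scaled copy of the first, the third is reversed.
toy_genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
D = correlation_distance_matrix(toy_genes)
```

Rows that move together get a distance near 0 regardless of scale, and rows that move in opposite directions get a distance near 2. With 6,830 genes a vectorised library routine would be far faster than this double loop; the sketch only shows what is being computed before the matrix is handed to a clustering method that accepts precomputed distances.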
3 Regression [30 marks]
In the lecture, we considered the case in which the features/predictors appeared only linearly
in the regression model. The simplest type of nonlinearity we could add to the model is
pairwise interactions of the features. If x_j and x_k are distinct features, this means we also
consider x_j x_k as a feature. Pairwise interactions are rather straightforward to implement in R:
    X = model.matrix(balance ~ . * ., Credit)[,-1]    (1)
becomes the new design matrix. The construction . * . means consider all pairwise
multiplications of distinct features.
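To make the idea of pairwise interaction features concrete, here is a minimal Python 3 sketch (standard library only) that appends the product of every pair of distinct columns to a numeric table. The function name, column names and data are hypothetical. It handles numeric columns only, whereas R's model.matrix also expands categorical predictors into indicator columns, so the number of columns it produces for a real dataset will generally differ.

```python
from itertools import combinations


def add_pairwise_interactions(rows, names):
    """Append x_j * x_k for every pair of distinct columns j < k."""
    pairs = list(combinations(range(len(names)), 2))
    new_names = list(names) + [f"{names[j]}:{names[k]}" for j, k in pairs]
    new_rows = [list(row) + [row[j] * row[k] for j, k in pairs] for row in rows]
    return new_rows, new_names


# Toy table with three numeric features and two observations.
names = ["x1", "x2", "x3"]
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
X_int, cols = add_pairwise_interactions(rows, names)
```

For m numeric columns this yields m + m(m - 1)/2 columns in total: the original features plus one product per distinct pair.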
Repeat the analysis for the Credit dataset (we did this in the Week 7 lecture) with
pairwise interactions of the features. You will find it convenient to set
grid = 10^seq(3, -1, length = 100) and thresh = 1e-10.
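For readers working in Python rather than R, the tuning-parameter grid can be sketched as follows, assuming the intended grid is 100 values spaced evenly on a log scale from 10^3 down to 10^-1. The helper name log_grid is hypothetical; Python 3, standard library only.

```python
def log_grid(start_exp, stop_exp, num):
    """Return num values 10**e with e evenly spaced from start_exp to stop_exp."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]


# Candidate penalty values, from strongest (1000) to weakest (0.1).
grid = log_grid(3, -1, 100)
```

A log-spaced grid is the usual choice because the fitted coefficients respond to the penalty on a multiplicative scale.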
Answer the following questions:
1. How many predictors are there, i.e. what is p?
2. How did you generate your training and test sets?
3. Select the tuning parameter for the ridge regression model using cross-validation, and
show the process.
4. Select the tuning parameter for the lasso regression model using cross-validation, and
show the process. How many features have been selected by the lasso?
5. Compare and discuss the final form of the model from the linear regression, ridge
regression, and lasso regression.
6. Compare the test errors for the linear model, ridge regression model, and lasso model.
7. Plot a comparison of the test predictions for the three approaches.
NB Please show how you generated your test and training sets – in particular the RNG
seed you used – so we can replicate your results. Remember to report this in the following
questions when applicable.
Assessment
Format: You can use any font to write the report, with a minimum of single spacing and 11
point size (handwriting is not permitted unless with approval from the lecturers). Reports
exceeding the maximum page limit will be penalised. Any additional material such as code or
figures/tables can be included in an appendix, which will not count towards the page limit.
Communication: a key skill required of a scientist is the ability to communicate effectively.
No matter the scientific merit of a report, if it is illegible, grammatically incorrect,
mispunctuated, ambiguous, or contains misspellings, it is less effective and marks will be deducted.
Marking Criteria: The final report will be submitted to Turnitin for a plagiarism check. Late
submissions without a pre-arranged extension will be penalised as per the course outline.
The usual mark checking procedures in place for all assessment apply to this report. The
assessment of the reports will take account of the understanding of big data, clarity and
accuracy of answers, presentation, organisation, layout and referencing.
Submission: You are required to submit a single .pdf report through the web submission
system from the AIML427 course website by the due time.

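Building an AI Chatbot with AIML

With Python's AIML package, it is easy to implement an AI chatbot. AIML stands for Artificial Intelligence Markup Language; it is simply a form of XML (Extensible Markup Language). The sample code in this article gives a first look at how to create your own AI chatbot with Python.

What is AIML?

AIML was created by Richard Wallace. He built a bot named A.L.I.C.E. (Artificial Linguistic Internet Computer Entity), which won several artificial intelligence awards. Interestingly, one form of the Turing test looks for exactly this kind of artificial intelligence: a person chats with a bot through a text interface for several minutes to see whether the bot is mistaken for a human. AIML is an XML format for defining rules that match patterns and determine responses.

For a detailed introduction to AIML, see Alice Bot's AIML Primer (http://www.alicebot.org/documentation/aiml-primer.html). You can also learn more about AIML and what it can do on the AIML Wikipedia page (https://en.wikipedia.org/wiki/AIML). We will first create the AIML files and then bring them to life with Python.

Creating the Standard Startup File

It is standard practice to create a startup file named std-startup.xml as the main entry point for loading AIML files. Here we create a basic file that matches one pattern and takes one action. We want to match the pattern load aiml b and have it load our AIML brain in response. We will create the basic_chat.aiml file in a moment.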

    <aiml version="1.0.1" encoding="UTF-8">
        <category>
            <pattern>LOAD AIML B</pattern>
            <template>
                <learn>basic_chat.aiml</learn>
            </template>
        </category>
    </aiml>

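Creating an AIML File

Above we created an AIML file that handles only one pattern, load aiml b. When we enter that command, the bot will try to load basic_chat.aiml, and loading will fail unless we have actually created that file. The sample below shows what you can put in basic_chat.aiml. We will match two basic patterns and respond to each.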

    <aiml version="1.0.1" encoding="UTF-8">
        <category>
            <pattern>HELLO</pattern>
            <template>
                Well, hello!
            </template>
        </category>

        <category>
            <pattern>WHAT ARE YOU</pattern>
            <template>
                I'm a bot, silly!
            </template>
        </category>
    </aiml>
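Because an AIML file is ordinary XML, you can inspect it with any XML parser before handing it to a bot. The following is a minimal Python 3 sketch using only the standard library; it is not part of the aiml package, and the helper name list_categories is hypothetical. It simply lists each pattern with its template text, which can help catch malformed files early.

```python
import xml.etree.ElementTree as ET

# The same two categories as basic_chat.aiml, inlined so the sketch is self-contained.
AIML_TEXT = """<aiml version="1.0.1" encoding="UTF-8">
    <category>
        <pattern>HELLO</pattern>
        <template>Well, hello!</template>
    </category>
    <category>
        <pattern>WHAT ARE YOU</pattern>
        <template>I'm a bot, silly!</template>
    </category>
</aiml>"""


def list_categories(aiml_text):
    """Map each <pattern> to the plain text of its <template>."""
    root = ET.fromstring(aiml_text)  # raises ParseError if the XML is malformed
    rules = {}
    for category in root.iter("category"):
        pattern = category.findtext("pattern").strip()
        template = "".join(category.find("template").itertext()).strip()
        rules[pattern] = template
    return rules


rules = list_categories(AIML_TEXT)
```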

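Random Responses

You can also add random responses, as in the sample below. The wildcard * matches whatever follows, so this category responds with a randomly chosen reply whenever a message begins with "One time I".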

    <category>
        <pattern>ONE TIME I *</pattern>
        <template>
            <random>
                <li>Go on.</li>
                <li>How old are you?</li>
                <li>Be more specific.</li>
                <li>I did not know that.</li>
                <li>Are you telling the truth?</li>
                <li>I don't know what that means.</li>
                <li>Try to tell me that another way.</li>
                <li>Are you talking about an animal, vegetable or mineral?</li>
                <li>What is it?</li>
            </random>
        </template>
    </category>
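To see what such a category does conceptually, the sketch below imitates it in plain Python 3: the input is normalised to upper case without punctuation, the wildcard is treated as "one or more words", and a reply is picked at random. This is only an illustration of the idea, not how the aiml package is implemented; the helper names and the shortened reply list are hypothetical.

```python
import random
import re


def pattern_to_regex(pattern):
    """Turn an AIML-style pattern into a regex; '*' stands for one or more words."""
    parts = [r"(.+)" if word == "*" else re.escape(word) for word in pattern.split()]
    return re.compile(r"^" + r"\s+".join(parts) + r"$")


def normalise(message):
    """Strip punctuation and upper-case the message, as AIML patterns expect."""
    return re.sub(r"[^\w\s]", "", message).upper().strip()


RESPONSES = ["Go on.", "Be more specific.", "I did not know that."]


def respond(message, rng=random):
    """Return a random reply if the message matches ONE TIME I *, else None."""
    match = pattern_to_regex("ONE TIME I *").match(normalise(message))
    if match is None:
        return None
    return rng.choice(RESPONSES)


reply = respond("One time I saw a whale!")
```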

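Using Existing AIML Files

Writing your own AIML files is certainly fun, but it is also a lot of work. I think it takes at least 10,000 patterns before a bot starts to feel realistic. Fortunately, the ALICE Foundation provides a number of AIML files for free. You can browse them on the Alice Bot website. One of them, std-65-percent.xml, is said to contain the 65% most commonly used phrases. There is also said to be one that lets you play blackjack with the bot.

Using Python

At this point, all the AIML files in XML format are ready. They are an important part of the bot's brain, but for now they are just information. The bot needs to come to life. You could use any language to implement the AIML specification, but some kind person has already done so in Python.

First, install the aiml package with pip.

    pip install aiml

Note that the aiml package only runs under Python 2. For Python 3, an alternative is Py3kAiml on GitHub (https://github.com/huntersan9/Py3kAiml).

The Simplest Python Program

The following is the simplest program to get started with. It creates the aiml Kernel object, learns the startup file, and then loads the rest of the AIML files. After that it is ready to chat, and we enter an infinite loop that keeps prompting the user for a message. You need to enter a pattern the bot recognises. Which patterns are recognised depends on the AIML files you have loaded.

Because we created the startup file as a separate entity, we can add more AIML files to the bot later without touching any of the program's source code. We only need to add more files to learn in the startup XML file.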

    import aiml

    # Create the kernel and learn the AIML files
    kernel = aiml.Kernel()
    kernel.learn("std-startup.xml")
    kernel.respond("load aiml b")

    # Press CTRL-C to break out of this loop
    while True:
        print kernel.respond(raw_input("Enter your message >> "))

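Speeding Up Brain Loading

Once you have many AIML files, the bot can take a long time to learn them. This is where brain files come in. After the bot has learned all the AIML files, it can save its brain directly to a file, which greatly reduces load time on later runs.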

    import aiml
    import os

    kernel = aiml.Kernel()

    # Load the saved brain if it exists; otherwise learn the files and save one
    if os.path.isfile("bot_brain.brn"):
        kernel.bootstrap(brainFile = "bot_brain.brn")
    else:
        kernel.bootstrap(learnFiles = "std-startup.xml", commands = "load aiml b")
        kernel.saveBrain("bot_brain.brn")

    # The kernel is now ready for use
    while True:
        print kernel.respond(raw_input("Enter your message >> "))
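The brain file is an instance of a general "load from cache if present, otherwise build and save" pattern. The sketch below shows that pattern on its own in Python 3 with the standard library, using pickle and a temporary directory. It does not read or write the aiml package's .brn format; the helper load_or_build, the toy rule dictionary and the file name are all hypothetical.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return (data, from_cache): load a cached object or build it and cache it."""
    if os.path.isfile(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh), True
    data = build()
    with open(cache_path, "wb") as fh:
        pickle.dump(data, fh)
    return data, False


def slow_build():
    """Stand-in for the expensive step of parsing many AIML files."""
    return {"HELLO": "Well, hello!"}


cache_dir = tempfile.mkdtemp()
cache_file = os.path.join(cache_dir, "toy_brain.pkl")
first, first_from_cache = load_or_build(cache_file, slow_build)
second, second_from_cache = load_or_build(cache_file, slow_build)
```

The first call builds and saves; the second call finds the file and skips the build, which is the same trade-off the brain file makes: faster startup, at the cost of a cache that goes stale when the source files change.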

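Reloading AIML at Runtime

While the bot is running, you can send it the load message and it will reload the AIML files. Note that if you are using the brain-file approach described above, reloading on the fly will not save the new changes to the brain file. You need to either delete the brain file so that it is rebuilt on the next startup, or modify the code so that the brain is saved at some point after reloading. The next section adds Python commands that let the bot do this.

    load aiml b

Adding Python Commands

If you want to give the bot special commands that run Python functions, intercept the input message and handle it before passing it to kernel.respond(). In the examples above we get the user's input from raw_input, but the input could come from anywhere: perhaps a TCP socket, or a speech-to-text source. There may be some messages you do not want AIML to process, so handle them before they are passed to AIML.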

    while True:
        message = raw_input("Enter your message to the bot: ")
        if message == "quit":
            exit()
        elif message == "save":
            kernel.saveBrain("bot_brain.brn")
        else:
            bot_response = kernel.respond(message)
            # Do something with bot_response
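One way to keep this interception logic testable is to move it into a function that receives the bot's respond and save operations as arguments, so it can be exercised without a real kernel or an input loop. The following Python 3 sketch uses only the standard library; handle_message, fake_respond and fake_save are hypothetical names, and the stand-in functions merely imitate what a kernel would do.

```python
def handle_message(message, respond, save_brain):
    """Route special commands in Python; pass everything else to the bot."""
    command = message.strip().lower()
    if command == "quit":
        return ("quit", None)
    if command == "save":
        save_brain()
        return ("saved", None)
    return ("reply", respond(message))


# Stand-ins for kernel.respond and kernel.saveBrain, used only for illustration.
saved = []


def fake_respond(text):
    return "echo: " + text


def fake_save():
    saved.append(True)


results = [
    handle_message("hello", fake_respond, fake_save),
    handle_message("save", fake_respond, fake_save),
    handle_message("quit", fake_respond, fake_save),
]
```

The calling loop then decides what each action means, for example breaking out of the loop on "quit" and printing the text on "reply".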

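Sessions and Predicates

By specifying a session, AIML can tailor a conversation to each person. For example, if one person tells the bot their name is Alice and another tells it their name is Bob, the bot can tell them apart. To specify which session you are using, pass it as the second argument to respond().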

    sessionId = 12345

    kernel.respond(raw_input(">>>"), sessionId)

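Having a personalised conversation with each client is great. You have to generate your own session IDs and keep track of them. Note that saving the brain file does not save all the session values.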

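    sessionId = 12345

    # Get the session info as a dictionary. It contains the input
    # and output history as well as any known predicates
    sessionData = kernel.getSessionData(sessionId)

    # Each session ID needs to be a unique value
    # A predicate is named after something or someone the bot knows about in the session
    # The bot might know you as "Billy" and that your dog is named "Brandy"
    kernel.setPredicate("dog", "Brandy", sessionId)
    clients_dogs_name = kernel.getPredicate("dog", sessionId)

    kernel.setBotPredicate("hometown", "127.0.0.1")
    bot_hometown = kernel.getBotPredicate("hometown")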
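Conceptually, per-session predicates behave like a dictionary of dictionaries keyed by session ID. The sketch below models that idea in Python 3 with the standard library only; it is not the aiml package's implementation, and the SessionStore class, its method names and its empty-string default are assumptions made for illustration.

```python
class SessionStore:
    """Toy per-session predicate store: session_id -> {predicate name: value}."""

    def __init__(self):
        self._sessions = {}

    def set_predicate(self, name, value, session_id):
        # Create the session's dictionary on first use, then store the value.
        self._sessions.setdefault(session_id, {})[name] = value

    def get_predicate(self, name, session_id):
        # Unknown sessions or predicates yield an empty string in this sketch.
        return self._sessions.get(session_id, {}).get(name, "")


store = SessionStore()
store.set_predicate("name", "Alice", 1)
store.set_predicate("name", "Bob", 2)
store.set_predicate("dog", "Brandy", 1)
```

Because each session ID has its own dictionary, what the bot learns from Alice never leaks into Bob's conversation.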

In AIML, we can set predicates inside a <template> element.

    <aiml version="1.0.1" encoding="UTF-8">
       <category>
          <pattern>MY DOGS NAME IS *</pattern>
          <template>
             That is interesting that you have a dog named <set name="dog"><star/></set>
          </template>
       </category>
       <category>
          <pattern>WHAT IS MY DOGS NAME</pattern>
          <template>
             Your dog's name is <get name="dog"/>.
          </template>
       </category>
    </aiml>

With the AIML above, you can tell the bot:

    My dogs name is Max

The bot will answer:

    That is interesting that you have a dog named Max

Then, if you ask the bot:

    What is my dogs name?

The bot will respond:

    Your dog's name is Max.

Other References

AIML Tag Reference Table
