代码智能：问题与解法

Posted 2021-08-03 Jtag特工

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了代码智能：问题与解法相关的知识，希望对你有一定的参考价值。

代码智能：问题与解法

在基于预训练大模型引发自然语言处理革命的今天，代码智能技术也在迅速跟进发展。
那么，代码智能主要在做一些什么样的事情呢？可能很多同学会有比较科幻的想法，比如程序员要失业了之类的。
但是，其实很多工作并没有那么神秘，非常基础。那么我们用代码智能要解决什么问题呢？

判断两段代码是不是实现相似的功能
搜索跟当前代码段最相似的代码
检测代码是否有bug
自动修复代码中的bug
给一段代码自动写注释
根据文本推荐最相似的代码段
根据文本生成代码

看了之后是不是觉得更玄幻了？这么困难的问题怎么搞得定？
诚实地讲，这其中的每个子问题都很困难，就算是人类学习起来也很困难。
不过，正像是人类也是一步一步学会的一样，机器也在不断地进步。我们需要的不一定是万能的机器神，也是和我们一样普通的机器人，它们有很大的局限，但是它们可以帮助我们减轻不少工作量。

而且，最后一节我们将揭晓，处理这么多如此复杂问题的方法，却非常简单，一把梭哈，我们只用一个模型就能搞定。

codeBert

下面我们就详细看一看这些问题的细节。

问题：克隆检测 Clone Detection

万地高楼平地起，代码智能任务首先从克隆检测做起。
所谓克隆检测，就是寻找写法和功能上相似的代码。
不要小看代码重复，它会显著地降低代码智能训练的有效性。
我们看下图，训练集中有重复，测试集中有重复，它们的交集中仍然有重复，在论文《The Adverse Effects of Code Duplication in Machine Learning Models of Code》中有详细的分析。

code duplicate

预测两段代码是否相似

以下的例子来自BigCloneBench数据集. 论文地址在：https://arxiv.org/pdf/2002.08653.pdf

下面我们举几个例子来看什么算相似：

代码1:

    private StringBuffer encoder(String arg) {
        if (arg == null) {
            arg = "";
        }
        MessageDigest md5 = null;
        try {
            md5 = MessageDigest.getInstance("MD5");
            md5.update(arg.getBytes(SysConstant.charset));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return toHex(md5.digest());
    }

代码2:

    public String kodetu(String testusoila) {
        MessageDigest md = null;
        try {
            md = MessageDigest.getInstance("SHA");
            md.update(testusoila.getBytes("UTF-8"));
        } catch (NoSuchAlgorithmException e) {
            new MezuLeiho("Ez da zifraketa algoritmoa aurkitu", "Ados", "Zifraketa Arazoa", JOptionPane.ERROR_MESSAGE);
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            new MezuLeiho("Errorea kodetzerakoan", "Ados", "Kodeketa Errorea", JOptionPane.ERROR_MESSAGE);
            e.printStackTrace();
        }
        byte raw[] = md.digest();
        String hash = (new BASE64Encoder()).encode(raw);
        return hash;
    }

代码2的字符串是用巴斯克语写的。它们用的算法也有区别，判空和异常处理也有不同，但是我们认为它们是很类似的，属于克隆识别认为相同或高度相似的。

我们再看一对例子：

代码1:

    public static void test(String args[]) {
        int trace;
        int bytes_read = 0;
        int last_contentLenght = 0;
        try {
            BufferedReader reader;
            URL url;
            url = new URL(args[0]);
            URLConnection istream = url.openConnection();
            last_contentLenght = istream.getContentLength();
            reader = new BufferedReader(new InputStreamReader(istream.getInputStream()));
            System.out.println(url.toString());
            String line;
            trace = t2pNewTrace();
            while ((line = reader.readLine()) != null) {
                bytes_read = bytes_read + line.length() + 1;
                t2pProcessLine(trace, line);
            }
            t2pHandleEventPairs(trace);
            t2pSort(trace, 0);
            t2pExportTrace(trace, new String("pngtest2.png"), 1000, 700, (float) 0, (float) 33);
            t2pExportTrace(trace, new String("pngtest3.png"), 1000, 700, (float) 2.3, (float) 2.44);
            System.out.println("Press any key to contiune read from stream !!!");
            System.out.println(t2pGetProcessName(trace, 0));
            System.in.read();
            istream = url.openConnection();
            if (last_contentLenght != istream.getContentLength()) {
                istream = url.openConnection();
                istream.setRequestProperty("Range", "bytes=" + Integer.toString(bytes_read) + "-");
                System.out.println(Integer.toString(istream.getContentLength()));
                reader = new BufferedReader(new InputStreamReader(istream.getInputStream()));
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                    t2pProcessLine(trace, line);
                }
            } else System.out.println("File not changed !");
            t2pDeleteTrace(trace);
        } catch (MalformedURLException e) {
            System.out.println("MalformedURLException !!!");
        } catch (IOException e) {
            System.out.println("File not found " + args[0]);
        }
        ;
    }

代码2:

    private static String loadUrlToString(String a_url) throws IOException {
        URL l_url1 = new URL(a_url);
        BufferedReader br = new BufferedReader(new InputStreamReader(l_url1.openStream()));
        String l_content = "";
        String l_ligne = null;
        l_content = br.readLine();
        while ((l_ligne = br.readLine()) != null) {
            l_content += AA.SL + l_ligne;
        }
        return l_content;
    }

这个虽然没有涉及小语种，但是明显代码长度差异巨大。不过，我们仍然认为它们是相似的。

我们看一对不相似的吧：
代码1:

    private void setNodekeyInJsonResponse(String service) throws Exception {
        String filename = this.baseDirectory + service + ".json";
        Scanner s = new Scanner(new File(filename));
        PrintWriter fw = new PrintWriter(new File(filename + ".new"));
        while (s.hasNextLine()) {
            fw.println(s.nextLine().replaceAll("NODEKEY", this.key));
        }
        s.close();
        fw.close();
        (new File(filename + ".new")).renameTo(new File(filename));
    }

代码2:

    public void transform(String style, String spec, OutputStream out) throws IOException {
        URL url = new URL(rootURL, spec);
        InputStream in = new PatchXMLSymbolsStream(new StripDoctypeStream(url.openStream()));
        transform(style, in, out);
        in.close();
    }

不相似的就不解释了。

BigCloneBench数据集，就是提供了两段代码，以及它们是否相似的人工打标的结果。

数据分为train.txt, valid.txt, test.txt三个集合，它们的格式都是同样的：

idx1 idx2 0/1

其中idx1和idx2是两段代码在data.jsonl中的索引值，最后一个是它们是否相似的人工打标的值。
代码都保存在data.jsonl中，格式为：

{"func":"代码","idx":"idx值"}

我们以训练集train.txt为例，其前两行是这样的：

13988825	8660836	0
80378	18548122	1

13988825在data.jsonl中对应的结构是这样的：

{"func": "    private void setNodekeyInJsonResponse(String service) throws Exception {\\n        String filename = this.baseDirectory + service + \\".json\\";\\n        Scanner s = new Scanner(new File(filename));\\n        PrintWriter fw = new PrintWriter(new File(filename + \\".new\\"));\\n        while (s.hasNextLine()) {\\n            fw.println(s.nextLine().replaceAll(\\"NODEKEY\\", this.key));\\n        }\\n        s.close();\\n        fw.close();\\n        (new File(filename + \\".new\\")).renameTo(new File(filename));\\n    }\\n", "idx": "13988825"}

8660836对应的是：

{"func": "    public void transform(String style, String spec, OutputStream out) throws IOException {\\n        URL url = new URL(rootURL, spec);\\n        InputStream in = new PatchXMLSymbolsStream(new StripDoctypeStream(url.openStream()));\\n        transform(style, in, out);\\n        in.close();\\n    }\\n", "idx": "8660836"}

而它们的结果是不相似。大家看到，这个例子就是刚才上面我们写的第三个例子。

搜索跟当前代码段语义最相似的代码段

这个我们使用北大李戈李师团队的POJ-104数据集。

这个数据集需要到https://drive.google.com/uc?id=0B2i-vWnOu7MxVlJwQXN6eVNONUU去下载。

每个代码段用一个index来描述，然后code字段是完整的代码。我们来看个例子：

{
        "label":"1",
        "index":"0",
        "code":"
int f(int a,int x)
{
 int count=1,i;
 for(i=x;i&lt;a;i++)
  if(a%i==0)
   count+=f(a/i,i);
 if(i==a)
  return count;
 else
  return 0;
}

void main()
{
 int n,a;
 scanf(\\"%d\\",&amp;n);
 for(;n&gt;0;n--)
 {
  scanf(\\"%d\\",&amp;a);
  if(a==1||a==2)
   printf(\\"1\\
\\");
  else
   printf(\\"%d\\
\\",f(a,2));
 }
}
"
    }

然后，这个任务的目的就是求出针对某一段代码最相似的代码段。以取top 2为例：输出的样例如下：

{"index": "0", "answers": ["3", "2"]}
{"index": "1", "answers": ["0", "4"]}
{"index": "2", "answers": ["0", "1"]}
{"index": "4", "answers": ["1", "5"]}
{"index": "3", "answers": ["4", "2"]}
{"index": "5", "answers": ["4", "3"]}

也就是说，针对于代码index 0, 最相似的代码段是 index 3和2.

index 3是这样的：

void qut(int a,int b);                                       //????
int num=0;                                                    //?????????
int main()
{
 int i,n,g[1000];                                         //?????????
 cin>>n;
 for(i=0;i<n;i++)                                         //??????
  cin>>g[i];
 for(i=0;i<n;i++)
 {
     qut(g[i],1);                                         //????
  cout<<num<<endl;
              num=0;
 }
 return 0;
}

void qut(int a,int b)  
{
 int i;
 if (a>=b)  
 {
  num++;  
  if (b==1)                                      
   b++;
  for (i=b;i<=a;i++) 
  {
   if (a%i==0) 
   {
    qut(a/i,i);                                 //??a%i==0,??
   }
  }
 }
}

问题：缺陷检测

缺陷检测的数据集非常简单粗暴，就是一段打标的代码，标识是不是有漏洞。

我们看个有漏洞的例子：

{
        "project":"FFmpeg",
        "commit_id":"aba232cfa9b193604ed98f3fa505378d006b1b3b",
        "target":1,
        "func":"static int r3d_read_rdvo(AVFormatContext *s, Atom *atom)

{

    R3DContext *r3d = s->priv_data;

    AVStream *st = s->streams[0];

    int i;



    r3d->video_offsets_count = (atom->size - 8) / 4;

    r3d->video_offsets = av_malloc(atom->size);

    if (!r3d->video_offsets)

        return AVERROR(ENOMEM);



    for (i = 0; i < r3d->video_offsets_count; i++) {

        r3d->video_offsets[i] = avio_rb32(s->pb);

        if (!r3d->video_offsets[i]) {

            r3d->video_offsets_count = i;

            break;

        }

        av_dlog(s, \\"video offset %d: %#x\\
\\", i, r3d->video_offsets[i]);

    }



    if (st->r_frame_rate.num)

        st->duration = av_rescale_q(r3d->video_offsets_count,

                                    (AVRational){st->r_frame_rate.den,

                                                 st->r_frame_rate.num},

                                    st->time_base);

    av_dlog(s, \\"duration %\\"PRId64\\"\\
\\", st->duration);



    return 0;

}
",
        "idx":5
    }

信息就这么多，至于哪行是什么问题，训练集中没有。

当然，数据集里大部分还是没有漏洞的，比如第一条：

{"project": "FFmpeg", "commit_id": "973b1a6b9070e2bf17d17568cbaf4043ce931f51", "target": 0, "func": "static av_cold int vdadec_init(AVCodecContext *avctx)\\n\\n{\\n\\n    VDADecoderContext *ctx = avctx->priv_data;\\n\\n    struct vda_context *vda_ctx = &ctx->vda_ctx;\\n\\n    OSStatus status;\\n\\n    int ret;\\n\\n\\n\\n    ctx->h264_initialized = 0;\\n\\n\\n\\n    /* init pix_fmts of codec */\\n\\n    if (!ff_h264_vda_decoder.pix_fmts) {\\n\\n        if (kCFCoreFoundationVersionNumber < kCFCoreFoundationVersionNumber10_7)\\n\\n            ff_h264_vda_decoder.pix_fmts = vda_pixfmts_prior_10_7;\\n\\n        else\\n\\n            ff_h264_vda_decoder.pix_fmts = vda_pixfmts;\\n\\n    }\\n\\n\\n\\n    /* init vda */\\n\\n    memset(vda_ctx, 0, sizeof(struct vda_context));\\n\\n    vda_ctx->width = avctx->width;\\n\\n    vda_ctx->height = avctx->height;\\n\\n    vda_ctx->format = 'avc1';\\n\\n    vda_ctx->use_sync_decoding = 1;\\n\\n    vda_ctx->use_ref_buffer = 1;\\n\\n    ctx->pix_fmt = avctx->get_format(avctx, avctx->codec->pix_fmts);\\n\\n    switch (ctx->pix_fmt) {\\n\\n    case AV_PIX_FMT_UYVY422:\\n\\n        vda_ctx->cv_pix_fmt_type = '2vuy';\\n\\n        break;\\n\\n    case AV_PIX_FMT_YUYV422:\\n\\n        vda_ctx->cv_pix_fmt_type = 'yuvs';\\n\\n        break;\\n\\n    case AV_PIX_FMT_NV12:\\n\\n        vda_ctx->cv_pix_fmt_type = '420v';\\n\\n        break;\\n\\n    case AV_PIX_FMT_YUV420P:\\n\\n        vda_ctx->cv_pix_fmt_type = 'y420';\\n\\n        break;\\n\\n    default:\\n\\n        av_log(avctx, AV_LOG_ERROR, \\"Unsupported pixel format: %d\\\\n\\", avctx->pix_fmt);\\n\\n        goto failed;\\n\\n    }\\n\\n    status = ff_vda_create_decoder(vda_ctx,\\n\\n                                   avctx->extradata, avctx->extradata_size);\\n\\n    if (status != kVDADecoderNoErr) {\\n\\n        av_log(avctx, AV_LOG_ERROR,\\n\\n                \\"Failed to init VDA decoder: %d.\\\\n\\", status);\\n\\n        goto failed;\\n\\n    }\\n\\n    avctx->hwaccel_context = vda_ctx;\\n\\n\\n\\n    /* changes callback functions */\\n\\n    avctx->get_format = get_format;\\n\\n    avctx->get_buffer2 = get_buffer2;\\n\\n#if FF_API_GET_BUFFER\\n\\n    // force the old get_buffer to be empty\\n\\n    avctx->get_buffer = NULL;\\n\\n#endif\\n\\n\\n\\n    /* init H.264 decoder */\\n\\n    ret = ff_h264_decoder.init(avctx);\\n\\n    if (ret < 0) {\\n\\n        av_log(avctx, AV_LOG_ERROR, \\"Failed to open H.264 decoder.\\\\n\\");\\n\\n        goto failed;\\n\\n    }\\n\\n    ctx->h264_initialized = 1;\\n\\n\\n\\n    return 0;\\n\\n\\n\\nfailed:\\n\\n    vdadec_close(avctx);\\n\\n    return -1;\\n\\n}\\n", "idx": 0}

推理搞起来也是十分省事了，就是对应每个index给个0或1的结果：

问题：代码自动修复

有了识别代码漏洞的，更进一步就是学习自动修复代码的了。

代码自动修复的题目也很简单，一段是有bug的代码，另一段是修复之后的代码。

我们来看一个例子：

有bug的代码是这样的：

public java.lang.String METHOD_1 ( ) { return new TYPE_1 ( STRING_1 ) . format ( VAR_1 [ ( ( VAR_1 . length ) - 1 ) ] . getTime ( ) ) ; }

修复之后是这样子的：

public java.lang.String METHOD_1 ( ) { return new TYPE_1 ( STRING_1 ) . format ( VAR_1 [ ( ( type ) - 1 ) ] . getTime ( ) ) ; }

也真难为算法了，人看起来都有点费事。

问题：代码互译

比如实现C#语言和Java语言的互译。我们只要有一系列代码的C#写法和Java写法，就可以进行学习进行互译。

我们来看一对例子。
先看C#代码：

public virtual ListSpeechSynthesisTasksResponse ListSpeechSynthesisTasks(ListSpeechSynthesisTasksRequest request){
  var options = new InvokeOptions();
  options.RequestMarshaller = ListSpeechSynthesisTasksRequestMarshaller.Instance;
  options.ResponseUnmarshaller = ListSpeechSynthesisTasksResponseUnmarshaller.Instance;
  return Invoke<ListSpeechSynthesisTasksResponse>(request, options);
}

对应的Java

public ListSpeechSynthesisTasksResult listSpeechSynthesisTasks(ListSpeechSynthesisTasksRequest request) {
  request = beforeClientExecution(request);
  return executeListSpeechSynthesisTasks(request);
}

代码互译

问题：给代码写注释

在训练素材中，有代码和注释，这个任务的目的为新代码写注释。评价指标是对于生成的注释的语言准确度。

这个我们使用CodeSearchNet数据集。

这个数据集中的每条记录的格式如下：

repo: 仓库名

以上是关于代码智能：问题与解法的主要内容，如果未能解决你的问题，请参考以下文章