Building a Crawler Tool from 0 to 1 with TypeScript/Node

Posted by 前端大全

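The singleton pattern is one of the simplest design patterns in Java. It is a creational pattern, and it offers one of the best ways to create an object.

The pattern involves a single class that is responsible for creating its own object while making sure that only one object is ever created. The class exposes a way to reach that sole object, so callers can access it directly without instantiating the class themselves.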
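
As a rough illustration of the description above, here is a minimal TypeScript sketch. The class name SettingsRegistry and its add and count methods are made up for this example and do not come from the project.

```typescript
// Hypothetical class used only to illustrate the pattern.
class SettingsRegistry {
  // Holds the one shared instance once it has been created.
  private static instance: SettingsRegistry | undefined;
  private items: string[] = [];

  // A private constructor prevents `new SettingsRegistry()` from outside.
  private constructor() {}

  // Create the instance on first use, then keep returning the same one.
  static getInstance(): SettingsRegistry {
    if (!SettingsRegistry.instance) {
      SettingsRegistry.instance = new SettingsRegistry();
    }
    return SettingsRegistry.instance;
  }

  add(item: string): void {
    this.items.push(item);
  }

  count(): number {
    return this.items.length;
  }
}

const regA = SettingsRegistry.getInstance();
const regB = SettingsRegistry.getInstance();
regA.add("a");
console.log(regA === regB, regB.count()); // true 1
```

Because both variables point at the same object, state added through one is visible through the other.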

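Application examples:

  • 1. A class has only one head teacher.
  • 2. Windows is multi-process and multi-threaded, so when a file is being worked on, several processes or threads will inevitably touch the same file at once. All file handling therefore has to go through a single instance.
  • 3. Device managers are often designed as singletons. For example, when a computer has two printers, the output stage has to make sure the two printers do not print the same file.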
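
Likewise, we create a singleton folder under src and add two files to it: crawler1.ts and urlAnalyzer.ts.

These two files serve the same purpose as the ones above; only the way the code is written differs.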

crawler1.ts

    import superagent from "superagent";
    import fs from "fs";
    import path from "path";
    import UrlAnalyzer from "./urlAnalyzer";

    // Any analyzer turns raw HTML into the JSON string to be written to filePath.
    export interface Analyzer {
      analyze: (html: string, filePath: string) => string;
    }

    class Crowller {
      private filePath = path.resolve(__dirname, "../../data/url.json");

      // Fetch the page and return its HTML as text.
      async getRawHtml() {
        const result = await superagent.get(this.url);
        return result.text;
      }

      private writeFile(content: string) {
        fs.writeFileSync(this.filePath, content);
      }

      private async initSpiderProcess() {
        const html = await this.getRawHtml();
        const fileContent = this.analyzer.analyze(html, this.filePath);
        // analyze() already returns a JSON string, so write it as-is;
        // stringifying it again would store a quoted string, not an object.
        this.writeFile(fileContent);
      }

      constructor(private analyzer: Analyzer, private url: string) {
        this.initSpiderProcess();
      }
    }

    const url = "https://www.hanju.run/play/39257-1-1.html";

    const analyzer = UrlAnalyzer.getInstance();
    new Crowller(analyzer, url);
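
Since the crawler class only depends on the Analyzer interface, any object with a matching analyze method can be passed in. The sketch below shows that idea in isolation. TitleAnalyzer and runAnalysis are hypothetical names invented for this example, and the network and file access are left out on purpose.

```typescript
// Same shape as the Analyzer interface in crawler1.ts.
interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}

// Hypothetical analyzer, not part of the project: extracts only the <title>.
class TitleAnalyzer implements Analyzer {
  analyze(html: string, _filePath: string): string {
    const match = /<title>([^<]*)<\/title>/i.exec(html);
    return JSON.stringify({ title: match ? match[1].trim() : "" });
  }
}

// Stand-in for the crawler's analysis step, without network or disk access.
function runAnalysis(analyzer: Analyzer, html: string): string {
  return analyzer.analyze(html, "unused.json");
}

const sampleHtml = "<html><head><title> Demo page </title></head></html>";
const output = runAnalysis(new TitleAnalyzer(), sampleHtml);
console.log(output); // {"title":"Demo page"}
```

Swapping analyzers this way leaves the fetching and writing code untouched.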

urlAnalyzer.ts

    import cheerio from "cheerio";
    import fs from "fs";
    // Type-only import, so loading this file does not run crawler1.ts.
    import type { Analyzer } from "./crawler1";

    interface objJson {
      [propName: number]: Info[];
    }

    interface InfoResult {
      time: number;
      data: Info[];
    }

    interface Info {
      name: string;
      url: string;
    }

    export default class UrlAnalyzer implements Analyzer {
      private static instance: UrlAnalyzer;

      // Create the single instance on first use and reuse it afterwards.
      static getInstance() {
        if (!UrlAnalyzer.instance) {
          UrlAnalyzer.instance = new UrlAnalyzer();
        }
        return UrlAnalyzer.instance;
      }

      private getJsonInfo(html: string) {
        const $ = cheerio.load(html);
        const info: Info[] = [];
        const scpt: string = String($(".play>script:nth-child(1)").html());
        // Take the 4th ";"-separated segment of the inline script, pull out the
        // text inside its parentheses, drop escaped quotes, then unescape it.
        const url = unescape(
          scpt.split(";")[3].split("(")[1].split(")")[0].replace(/\\"/g, "")
        );
        const name: string = String($("title").html());
        info.push({
          name,
          url,
        });
        const result = {
          time: new Date().getTime(),
          data: info,
        };
        return result;
      }

      // Merge this run's result into whatever the file already contains.
      private getJsonContent(info: InfoResult, filePath: string) {
        let fileContent: objJson = {};
        if (fs.existsSync(filePath)) {
          fileContent = JSON.parse(fs.readFileSync(filePath, "utf-8"));
        }
        fileContent[info.time] = info.data;
        return fileContent;
      }

      public analyze(html: string, filePath: string) {
        const info = this.getJsonInfo(html);
        console.log(info);
        const fileContent = this.getJsonContent(info, filePath);
        return JSON.stringify(fileContent);
      }

      // Private so that instances can only be obtained through getInstance().
      private constructor() {}
    }
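
The chained split calls in getJsonInfo are easier to follow when run step by step on a sample string. The script text below is invented for illustration and the real page's inline script will differ. This sketch also uses decodeURIComponent where the original uses the older unescape function, which is an assumption that the URL is plain percent-encoded.

```typescript
// Hypothetical inline script text; the real page's script will differ.
const scpt =
  'var a=1;var b=2;var c=3;var d=unescape(\\"https%3A%2F%2Fexample.com%2Fv.m3u8\\");';

// Step 1: split on ";" and take the 4th statement.
const segment = scpt.split(";")[3];
// Step 2: keep only the text between the first "(" and the next ")".
const argument = segment.split("(")[1].split(")")[0];
// Step 3: remove every backslash-quote pair (\").
const cleaned = argument.replace(/\\"/g, "");
// Step 4: decode the percent-encoded URL.
const decoded = decodeURIComponent(cleaned);

console.log(decoded); // https://example.com/v.m3u8
```

The chain relies on the position of the statement inside the script, so it will break if the page changes the order of its statements.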

You can define a shortcut start command in package.json.

    "scripts": {
      "dev-s": "ts-node ./src/singleton/crawler1.ts"
    },

Then start it with npm run dev-s.
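
Each run adds one entry to data/url.json, keyed by the timestamp of that run, because getJsonContent reads the existing file and then assigns the new data under info.time. The sketch below mimics that accumulation in memory. The function mergeRun, the type RunStore, and the sample values are hypothetical, and the file is replaced by a plain string.

```typescript
interface Info {
  name: string;
  url: string;
}

type RunStore = { [time: number]: Info[] };

// Mirrors getJsonContent, but takes the previous file text instead of a path.
function mergeRun(previousJson: string | null, time: number, data: Info[]): string {
  const store: RunStore = previousJson ? JSON.parse(previousJson) : {};
  store[time] = data;
  return JSON.stringify(store);
}

// Two simulated runs with made-up timestamps and URLs.
const afterFirstRun = mergeRun(null, 1000, [
  { name: "Episode 1", url: "https://example.com/1" },
]);
const afterSecondRun = mergeRun(afterFirstRun, 2000, [
  { name: "Episode 2", url: "https://example.com/2" },
]);

const runKeys = Object.keys(JSON.parse(afterSecondRun));
console.log(runKeys.length); // 2
```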

Conclusion

That really is the end this time. Thanks for reading, and I hope it helps.

Full source code:

https://github.com/maomincoding/TsCrawler


    typescript node - server.ts

    import * as http from "http";
    import app from "./app";
    import config from "./config";
    import logger from "./logger";
    
    const httpServer: http.Server = http.createServer(app);
    httpServer.listen(config.port);
    
    httpServer.on("listening", () => {
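        const addr = httpServer.address();
        // address() can return a string (pipe or socket path) or null, so narrow before reading .port
        const port = addr && typeof addr !== "string" ? addr.port : config.port;
        logger.warn(`Express server listening on port ${port} in ${config.env} mode (pid: ${process.pid})`);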
    
        // Graceful start (with PM2)
        // http://pm2.keymetrics.io/docs/usage/signals-clean-restart/#graceful-start
    
        // Sometimes you might need to wait for your application to have established
        // connections with your DBs/caches/workers/whatever.
        // PM2 needs to wait before considering your application as online. To do this:
        // 1. start the app with this flag:
        //          pm2 start app.js --wait-ready
        // 2. from app, send the 'ready' signal to PM2
        //          process.send("ready");
    
        // Here we send the ready signal to PM2
        // (process as any).send("ready"); // hack to skip a ts error - https://github.com/Microsoft/TypeScript/issues/10158
    });
    
    httpServer.on("error", (error: any) => {
        if (error.syscall !== "listen") {
            logger.debug("test123");
            throw error;
        }
    
        // handle specific listen errors with friendly messages
        switch (error.code) {
            case "EACCES":
                logger.error(`Port ${config.port} requires elevated privileges`);
                process.exit(1); // exit with failure code
                break;
            case "EADDRINUSE":
                logger.error(`Port ${config.port} is already in use`);
                process.exit(1); // exit with failure code
                break;
            default:
                throw error;
        }
    });
    httpServer.on("close", () => {
        logger.warn("Server was closed");
    });
    
    // https://nodejs.org/api/process.html#process_event_uncaughtexception
    // https://strongloop.com/strongblog/robust-node-applications-error-handling/
    
    // test:
    // Intentionally cause an exception, but don't catch it.
    // nonexistentFunc();
    process.on("uncaughtException", (err: Error) => {
        logger.error(`Caught exception: ${err.message}`, { err });
        // https://stackoverflow.com/a/40867663
        // The correct use of 'uncaughtException' is to perform synchronous cleanup of allocated
        // resources (e.g. file descriptors, handles, etc) before shutting down the process.
    
        gracefulShutdown("uncaughtException");
        // process.exit(1); // exit with failure code
    });
    
    // https://nodejs.org/api/process.html#process_event_unhandledrejection
    // http://thecodebarbarian.com/unhandled-promise-rejections-in-node.js.html
    // https://www.bennadel.com/blog/3238-logging-and-debugging-unhandled-promise-rejections-in-node-js-v1-4-1-and-later.htm
    
    // test1:
    // somePromise.then((res) => {
    //     return reportToUser(JSON.pasre(res)); // note the typo (`pasre`)
    //   }); // no `.catch` or `.then`
    // test2:
    // Promise.reject(new Error('woops')); // never attach a `catch`
    process.on("unhandledRejection", (reason: Error | any, promise: Promise<any>) => {
        // reason - the object with which the promise was rejected (typically an Error object)
        // promise - the Promise that was rejected
        logger.error(`Caught rejection at ${promise}, reason: ${reason}`, { promise, reason });
        // application specific logging, throwing an error, or other logic here
    
        // throw the error in order to force nodejs to crash
        // https://medium.com/@dtinth/making-unhandled-promise-rejections-crash-the-node-js-process-ffc27cfcc9dd
        // throw reason;
    
        gracefulShutdown("unhandledRejection");
    });
    
    // https://nodejs.org/api/process.html#process_event_warning
    // warning argument is an Error object (with name, message and stack)
    
    // test:
    // $ node
    // > events.defaultMaxListeners = 1;
    // > process.on("foo", () => {});
    // > process.on("foo", () => {});
    
    process.on("warning", (warning: Error) => {
        logger.warn(`Caught warning: ${warning.message}`, { warning });
    });
    
    // https://joseoncode.com/2014/07/21/graceful-shutdown-in-node-dot-js/
    // A signal is an asynchronous notification sent to a process (or to a specific thread within
    // the same process) in order to notify it of an event that occurred.
    
    // SIGTERM is a way to politely ask a program to terminate.
    // The program can either handle this signal, clean up resources and then exit, or it can ignore the signal.
    // The program doesn't exit until it finished processing and serving the last request.
    // After the SIGTERM signal it doesn't handle more requests.
    // Every process manager will send a SIGKILL if the SIGTERM takes too much time.
    
    // SIGKILL is used to cause immediate termination. Unlike SIGTERM it can't be handled or ignored by the process.
    process.on("SIGTERM", () => {
        gracefulShutdown("SIGTERM");
    });
    
    // Graceful shutdown NodeJS HTTP server when using PM2
    // http://www.acuriousanimal.com/2017/08/27/graceful-shutdown-node-processes.html
    // SIGINT - the signal sent by PM to ask a process to shut down
    // this signal is also sent when you Ctrl+C in terminal
    process.on("SIGINT", () => {
        gracefulShutdown("SIGINT");
        // Now pm2 reload will become a gracefulReload.
    });
    
    function gracefulShutdown(eventName: string) {
        const cleanUpAndExit = () => {
            // close db, then exit
            // db.stop(err => {
            //     process.exit(err ? 1 : 0);
            // });
            logger.warn("Cleaned up. Bye!");
            process.exit(0); // exit with success code
        };
    
        logger.warn(`${eventName} received. Closing server...`);
    
        // the http server has a close method that stops the server for receiving new connections
        // and calls the callback once it finished handling all requests
        httpServer.close(() => {
            // logger.warn("Server closed."); we already have such event (httpServer.on("close", ...))
            cleanUpAndExit();
        });
    
        // Force close server after 5 secs
        setTimeout(() => {
            logger.warn("Forcing server to close");
            process.exit(1); // exit with failure code
        }, 5000);
    }
    
