How to parse and read complex Word, Excel, and PPT files on Android
Posted by bit_kaki
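A while ago I was trying to build a "universal player" for Android, one that could open all kinds of file formats, and that included the most common Office formats. After some research I found that the traditional way to parse Word and Excel files directly on Android relies on third-party Java libraries such as tm-extractors-0.4.jar and jxl.jar. The code and screenshots follow.
Reading Word files uses tm-extractors-0.4.jar. The code is as follows: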
    package com.example.readword;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.textmining.text.extraction.WordExtractor;

    import android.app.Activity;
    import android.os.Bundle;
    import android.widget.TextView;

    public class MainActivity extends Activity {
        private TextView text;

        /** Called when the activity is first created. */
        @Override
        public void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_main);
            text = (TextView) findViewById(R.id.text);
            String str = readWord("/storage/emulated/0/ArcGIS/localtilelayer/11.doc");
            // readWord returns null when extraction fails, so guard before trimming
            text.setText(str == null ? "" : str.trim());
        }

        public String readWord(String file) {
            FileInputStream in = null; // input stream used to read the doc file
            String text = null;
            try {
                in = new FileInputStream(new File(file));
                // Create the WordExtractor and extract the text of the doc file
                WordExtractor extractor = new WordExtractor();
                text = extractor.extractText(in);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (in != null) {
                    try {
                        in.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
            return text;
        }
    }
The result looks like this:
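This is just a document I downloaded at random from the web. As the result shows, the text can be read, but the formatting comes out poorly. Only .doc is supported, not .docx, and images inside the .doc cannot be read. In addition, if the .doc has ever been opened with WPS, it can no longer be read (a disaster for me, since WPS is the only office suite I have installed).
Next is the code for reading Excel with jxl. It is not complete: it only does the parsing, extracting the contents of every row and column, which you can then rearrange as you like. The code is as follows: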
    package com.readexl;

    import java.io.FileInputStream;
    import java.io.InputStream;

    import android.app.Activity;
    import android.os.Bundle;
    import android.os.Environment;
    import android.text.method.ScrollingMovementMethod;
    import android.view.Menu;
    import android.widget.TextView;

    import jxl.Sheet;
    import jxl.Workbook;

    public class MainActivity extends Activity {
        TextView txt = null;
        public String filePath_xls = Environment.getExternalStorageDirectory() + "/case.xls";

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_main);
            txt = (TextView) findViewById(R.id.txt_show);
            txt.setMovementMethod(ScrollingMovementMethod.getInstance());
            readExcel();
        }

        @Override
        public boolean onCreateOptionsMenu(Menu menu) {
            // Inflate the menu; this adds items to the action bar if it is present.
            getMenuInflater().inflate(R.menu.main, menu);
            return true;
        }

        public void readExcel() {
            try {
                // Still to be considered: reading pictures and other data types from the Excel file
                InputStream is = new FileInputStream(filePath_xls);
                // Workbook book = Workbook.getWorkbook(new File("mnt/sdcard/test.xls"));
                Workbook book = Workbook.getWorkbook(is);
                int num = book.getNumberOfSheets();
                txt.setText("the num of sheets is " + num + "\n");
                // Get the first worksheet
                Sheet sheet = book.getSheet(0);
                int Rows = sheet.getRows();
                int Cols = sheet.getColumns();
                txt.append("the name of sheet is " + sheet.getName() + "\n");
                txt.append("total rows is " + Rows + "\n");
                txt.append("total cols is " + Cols + "\n");
                // Walk the sheet column by column
                for (int i = 0; i < Cols; ++i) {
                    for (int j = 0; j < Rows; ++j) {
                        // getCell(col, row) returns the cell at that position
                        txt.append("contents:" + sheet.getCell(i, j).getContents() + "\n");
                    }
                }
                book.close();
                is.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }
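The result looks like this:
All right, this is only half-finished, but the approach definitely works.
The point of everything so far is that I am not really satisfied with either of these two approaches. Before going on, let me explain the difference between doc and docx (the differences between xls and xlsx, ppt and pptx, and so on are analogous).
As is well known, doc is the format saved by Word 2003 and earlier, while docx is the format saved by Word 2007 and later. Put simply, doc stores its content in a binary format, whereas docx uses XML: a docx file is actually a ZIP archive. Unpacking a doc yields file fragments without extensions, while unpacking a docx yields an XML file and several folders holding the document's parts. The upshot of the comparison is that docx files are smaller and images are easier to read from them. (Reference: http://www.zhihu.com/question/21547795)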
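The container difference described above can be checked from a file's first few bytes. The following is a minimal sketch, not part of the original post: the class name `OfficeContainerSniffer` and the method `sniff` are hypothetical, and it only distinguishes the OLE2 compound-file signature (used by doc, xls, ppt) from the ZIP local-file-header signature (used by docx, xlsx, pptx).

```java
import java.util.Arrays;

/** Sketch: tell legacy binary Office files from ZIP-based ones by their leading bytes. */
final class OfficeContainerSniffer {
    // OLE2 compound-file signature found at the start of .doc/.xls/.ppt files
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local-file-header signature found at the start of .docx/.xlsx/.pptx files
    private static final byte[] ZIP = { 'P', 'K', 0x03, 0x04 };

    /** Returns "ole2", "zip", or "unknown" for the given leading bytes of a file. */
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2)) return "ole2";
        if (startsWith(header, ZIP)) return "zip";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

In practice the first eight bytes of the file would be read into `header`. A check like this is more robust than trusting the file extension, because a renamed file still carries its real signature.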
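Back to the main topic. How can Word, Excel, and similar files be parsed so that the original formatting is kept and the embedded images, tables, attachments, and so on are extracted as well? The answer, of course, is HTML! HTML really is powerful at rendering pages and tables, and it is hard to get the same result with native views. I collected a lot of material and code from various experts online, modified it further myself, and the result turned out fairly well.
The library used is POI (a powerful set of packages that can parse nearly all Office formats; doc, docx, xls, and xlsx are used as examples here).
After the file is read, it is processed according to its file type.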
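To make the "convert to HTML" idea concrete, here is a minimal sketch of turning rows of cell text into an HTML table. It is not the author's code: `HtmlTableSketch`, `escape`, and `toHtmlTable` are hypothetical names. It also escapes HTML special characters, which matters whenever document text is written into markup, since a cell containing `<` or `&` would otherwise break the page.

```java
import java.util.List;

/** Sketch: render rows of plain-text cells as an HTML table. */
final class HtmlTableSketch {
    /** Escapes the characters that have special meaning in HTML. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds one <tr> per row and one <td> per cell. */
    static String toHtmlTable(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("<table border=\"1\">");
        for (List<String> row : rows) {
            sb.append("<tr>");
            for (String cell : row) {
                sb.append("<td>").append(escape(cell)).append("</td>");
            }
            sb.append("</tr>");
        }
        return sb.append("</table>").toString();
    }
}
```

The resulting string can be written to a file and loaded in a WebView, which is the general shape of the approach taken in the rest of this post.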
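As a side sketch of this dispatch step, separate from the author's listing, the file type can be resolved once into an enum before branching. The names `OfficeKind` and `fromFileName` are hypothetical. Unlike a plain `endsWith` check, this version is case-insensitive, so a file named `REPORT.DOCX` is still recognized.

```java
import java.util.Locale;

/** Sketch: map a file name to the Office format handled by the reader. */
enum OfficeKind {
    DOC, DOCX, XLS, XLSX, UNSUPPORTED;

    /** Looks only at the text after the last dot, ignoring case. */
    static OfficeKind fromFileName(String name) {
        if (name == null) return UNSUPPORTED;
        String lower = name.toLowerCase(Locale.ROOT);
        int dot = lower.lastIndexOf('.');
        if (dot < 0) return UNSUPPORTED;
        switch (lower.substring(dot + 1)) {
            case "doc":  return DOC;
            case "docx": return DOCX;
            case "xls":  return XLS;
            case "xlsx": return XLSX;
            default:     return UNSUPPORTED;
        }
    }
}
```

A caller could then `switch` on the returned value and invoke the matching reader, with `UNSUPPORTED` handled explicitly instead of silently falling through.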
    public void read() {
        if (!myFile.exists()) {
            if (this.nameStr.endsWith(".doc")) {
                this.getRange();
                this.makeFile();
                this.readDOC();
            }
            if (this.nameStr.endsWith(".docx")) {
                this.makeFile();
                this.readDOCX();
            }
            if (this.nameStr.endsWith(".xls")) {
                try {
                    this.makeFile();
                    this.readXLS();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            if (this.nameStr.endsWith(".xlsx")) {
                try {
                    this.makeFile();
                    this.readXLSX();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
        returnPath = "file:///" + myFile;
        // this.view.loadUrl("file:///" + this.htmlPath);
        System.out.println("htmlPath" + this.htmlPath);
    }

First, the shared helper, whose main job is to set where the generated HTML file is saved:
    public void makeFile() {
        // Get the state of external storage
        String sdStateString = android.os.Environment.getExternalStorageState();
        // MEDIA_MOUNTED means external storage is present and mounted
        if (sdStateString.equals(android.os.Environment.MEDIA_MOUNTED)) {
            try {
                // Root directory of external storage
                File sdFile = android.os.Environment.getExternalStorageDirectory();
                // Absolute path of external storage + "/" + "library"
                String path = sdFile.getAbsolutePath() + File.separator + "library";
                File dirFile = new File(path);
                if (!dirFile.exists()) { // create the "library" directory if it is missing
                    dirFile.mkdir();
                }
                // Target file: <filename>.html inside the "library" directory
                File myFile = new File(path + File.separator + filename + ".html");
                if (!myFile.exists()) { // create the file if it is missing
                    myFile.createNewFile();
                }
                this.htmlPath = myFile.getAbsolutePath(); // remember the path
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Next, reading doc:
    private void getRange() {
        FileInputStream in = null;
        POIFSFileSystem pfs = null;
        try {
            in = new FileInputStream(nameStr);
            pfs = new POIFSFileSystem(in);
            hwpf = new HWPFDocument(pfs);
            // Only touch the document once it has been opened successfully
            range = hwpf.getRange();
            pictures = hwpf.getPicturesTable().getAllPictures();
            tableIterator = new TableIterator(range);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (in != null) {
                try {
                    in.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
    public void readDOC() {
        try {
            myFile = new File(htmlPath);
            output = new FileOutputStream(myFile);
            presentPicture = 0;
            String head = "<html><meta charset=\"utf-8\"><body>";
            String tagBegin = "<p>";
            String tagEnd = "</p>";
            output.write(head.getBytes());
            int numParagraphs = range.numParagraphs(); // total number of paragraphs in the document
            for (int i = 0; i < numParagraphs; i++) { // walk through the paragraphs
                Paragraph p = range.getParagraph(i); // current paragraph
                if (p.isInTable()) {
                    int temp = i;
                    if (tableIterator.hasNext()) {
                        String tableBegin = "<table style=\"border-collapse:collapse\" border=1 bordercolor=\"black\">";
                        String tableEnd = "</table>";
                        String rowBegin = "<tr>";
                        String rowEnd = "</tr>";
                        String colBegin = "<td>";
                        String colEnd = "</td>";
                        Table table = tableIterator.next();
                        output.write(tableBegin.getBytes());
                        int rows = table.numRows();
                        for (int r = 0; r < rows; r++) {
                            output.write(rowBegin.getBytes());
                            TableRow row = table.getRow(r);
                            int cols = row.numCells();
                            int rowNumParagraphs = row.numParagraphs();
                            int colsNumParagraphs = 0;
                            for (int c = 0; c < cols; c++) {
                                output.write(colBegin.getBytes());
                                TableCell cell = row.getCell(c);
                                int max = temp + cell.numParagraphs();
                                colsNumParagraphs = colsNumParagraphs + cell.numParagraphs();
                                for (int cp = temp; cp < max; cp++) {
                                    Paragraph p1 = range.getParagraph(cp);
                                    output.write(tagBegin.getBytes());
                                    writeParagraphContent(p1);
                                    output.write(tagEnd.getBytes());
                                    temp++;
                                }
                                output.write(colEnd.getBytes());
                            }
                            int max1 = temp + rowNumParagraphs;
                            for (int m = temp + colsNumParagraphs; m < max1; m++) {
                                temp++;
                            }
                            output.write(rowEnd.getBytes());
                        }
                        output.write(tableEnd.getBytes());
                    }
                    i = temp;
                } else {
                    output.write(tagBegin.getBytes());
                    writeParagraphContent(p);
                    output.write(tagEnd.getBytes());
                }
            }
            String end = "</body></html>";
            output.write(end.getBytes());
            output.close();
        } catch (Exception e) {
            System.out.println("readAndWrite Exception:" + e.getMessage());
            e.printStackTrace();
        }
    }

Reading docx:
    public void readDOCX() {
        String river = "";
        try {
            this.myFile = new File(this.htmlPath); // the target HTML file
            this.output = new FileOutputStream(this.myFile); // stream that writes the HTML file
            presentPicture = 0;
            // Document head; utf-8 is declared here, otherwise the text comes out garbled
            String head = "<!DOCTYPE html><html><meta charset=\"utf-8\"><body>";
            String end = "</body></html>";
            String tagBegin = "<p>"; // paragraph start
            String tagEnd = "</p>"; // paragraph end
            String tableBegin = "<table style=\"border-collapse:collapse\" border=1 bordercolor=\"black\">";
            String tableEnd = "</table>";
            String rowBegin = "<tr>";
            String rowEnd = "</tr>";
            String colBegin = "<td>";
            String colEnd = "</td>";
            String style = "style=\"";
            this.output.write(head.getBytes()); // write the head
            ZipFile xlsxFile = new ZipFile(new File(this.nameStr));
            ZipEntry sharedStringXML = xlsxFile.getEntry("word/document.xml");
            InputStream inputStream = xlsxFile.getInputStream(sharedStringXML);
            XmlPullParser xmlParser = Xml.newPullParser();
            xmlParser.setInput(inputStream, "utf-8");
            int evtType = xmlParser.getEventType();
            boolean isTable = false; // inside a table
            boolean isSize = false; // font size is set
            boolean isColor = false; // color is set
            boolean isCenter = false; // centered
            boolean isRight = false; // right-aligned
            boolean isItalic = false; // italic
            boolean isUnderline = false; // underlined
            boolean isBold = false; // bold
            boolean isR = false; // inside a run (r) element
            boolean isStyle = false;
            int pictureIndex = 1; // image names in the docx archive start at image1, so the index starts at 1
            while (evtType != XmlPullParser.END_DOCUMENT) {
                switch (evtType) {
                // Start tag
                case XmlPullParser.START_TAG:
                    String tag = xmlParser.getName();
                    if (tag.equalsIgnoreCase("r")) {
                        isR = true;
                    }
                    if (tag.equalsIgnoreCase("u")) { // underline
                        isUnderline = true;
                    }
                    if (tag.equalsIgnoreCase("jc")) { // alignment
                        String align = xmlParser.getAttributeValue(0);
                        if (align.equals("center")) {
                            this.output.write("<center>".getBytes());
                            isCenter = true;
                        }
                        if (align.equals("right")) {
                            this.output.write("<div align=\"right\">".getBytes());
                            isRight = true;
                        }
                    }
                    if (tag.equalsIgnoreCase("color")) { // color
                        String color = xmlParser.getAttributeValue(0);
                        this.output.write(("<span style=\"color:" + color + ";\">").getBytes());
                        isColor = true;
                    }
                    if (tag.equalsIgnoreCase("sz")) { // font size
                        if (isR == true) {
                            int size = decideSize(Integer.valueOf(xmlParser.getAttributeValue(0)));
                            this.output.write(("<font size=" + size + ">").getBytes());
                            isSize = true;
                        }
                    }
                    // Table handling starts here
                    if (tag.equalsIgnoreCase("tbl")) { // tbl: table start
                        this.output.write(tableBegin.getBytes());
                        isTable = true;
                    }
                    if (tag.equalsIgnoreCase("tr")) { // row
                        this.output.write(rowBegin.getBytes());
                    }
                    if (tag.equalsIgnoreCase("tc")) { // cell
                        this.output.write(colBegin.getBytes());
                    }
                    if (tag.equalsIgnoreCase("pic")) { // pic: picture
                        String entryName_jpeg = "word/media/image" + pictureIndex + ".jpeg";
                        String entryName_png = "word/media/image" + pictureIndex + ".png";
                        String entryName_gif = "word/media/image" + pictureIndex + ".gif";
                        String entryName_wmf = "word/media/image" + pictureIndex + ".wmf";
                        ZipEntry sharePicture = null;
                        InputStream pictIS = null;
                        // Read the picture from the docx and convert it to a byte array
                        sharePicture = xlsxFile.getEntry(entryName_jpeg);
                        if (sharePicture == null) {
                            sharePicture = xlsxFile.getEntry(entryName_png);
                        }
                        if (sharePicture == null) {
                            sharePicture = xlsxFile.getEntry(entryName_gif);
                        }
                        if (sharePicture == null) {
                            sharePicture = xlsxFile.getEntry(entryName_wmf);
                        }
                        if (sharePicture != null) {
                            pictIS = xlsxFile.getInputStream(sharePicture);
                            ByteArrayOutputStream pOut = new ByteArrayOutputStream();
                            byte[] bt = null;
                            byte[] b = new byte[1000];
                            int len = 0;
                            while ((len = pictIS.read(b)) != -1) {
                                pOut.write(b, 0, len);
                            }
                            pictIS.close();
                            pOut.close();
                            bt = pOut.toByteArray();
                            Log.i("byteArray", "" + bt.length);
                            writeDOCXPicture(bt);
                        }
                        pictureIndex++; // move to the next picture after converting one
                    }
                    if (tag.equalsIgnoreCase("b")) { // bold
                        isBold = true;
                    }
                    if (tag.equalsIgnoreCase("p")) { // paragraph
                        if (isTable == false) { // skip the <p> tag when inside a table
                            this.output.write(tagBegin.getBytes());
                        }
                    }
                    if (tag.equalsIgnoreCase("i")) { // italic
                        isItalic = true;
                    }
                    // t: the text value itself
                    if (tag.equalsIgnoreCase("t")) {
                        if (isBold == true) { // bold: open <b>
                            this.output.write("<b>".getBytes());
                        }
                        if (isUnderline == true) { // underline: open <u>
                            this.output.write("<u>".getBytes());
                        }
                        if (isItalic == true) { // italic: open <i>
                            output.write("<i>".getBytes());
                        }
                        river = xmlParser.nextText();
                        this.output.write(river.getBytes()); // write the text
                        if (isItalic == true) { // close </i> after the text and reset the flag
                            this.output.write("</i>".getBytes());
                            isItalic = false;
                        }
                        if (isUnderline == true) { // close </u> after the text and reset the flag
                            this.output.write("</u>".getBytes());
                            isUnderline = false;
                        }
                        if (isBold == true) { // close </b>
                            this.output.write("</b>".getBytes());
                            isBold = false;
                        }
                        if (isSize == true) { // close the font-size tag
                            this.output.write("</font>".getBytes());
                            isSize = false;
                        }
                        if (isColor == true) { // close the color span
                            this.output.write("</span>".getBytes());
                            isColor = false;
                        }
                        if (isCenter == true) { // close the centering tag
                            this.output.write("</center>".getBytes());
                            isCenter = false;
                        }
                        if (isRight == true) { // there is no <right> tag; a div may misbehave, but it will do for now
                            this.output.write("</div>".getBytes());
                            isRight = false;
                        }
                    }
                    break;
                // End tag
                case XmlPullParser.END_TAG:
                    String tag2 = xmlParser.getName();
                    if (tag2.equalsIgnoreCase("tbl")) { // table end: reset the table state
                        this.output.write(tableEnd.getBytes());
                        isTable = false;
                    }
                    if (tag2.equalsIgnoreCase("tr")) { // row end

(The listing breaks off at this point in the source.)
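The pull-parsing idea in the docx reader can be tried off-device with the JDK's StAX API. This is a swapped-in technique, not the author's code: `javax.xml.stream` stands in for Android's `XmlPullParser`, and `DocxTextSketch` and `extractText` are hypothetical names. The sketch takes the contents of `word/document.xml` as a string, collects the text of every `t` element, and emits a newline at the end of every `p` element. Formatting, tables, and pictures are ignored.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Sketch: extract plain text from the XML of word/document.xml. */
final class DocxTextSketch {
    static String extractText(String documentXml) {
        StringBuilder sb = new StringBuilder();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(documentXml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "t".equals(reader.getLocalName())) {
                    // getElementText reads the text and leaves the cursor on the end tag
                    sb.append(reader.getElementText());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "p".equals(reader.getLocalName())) {
                    sb.append('\n'); // one line per paragraph
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return sb.toString();
    }
}
```

The control flow mirrors the reader in this post: react to start tags, pull the text when a `t` element is reached, and close structures on end tags. The HTML generation would be layered on top of the same loop.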