Getting text position while parsing pdf with Quartz 2D
[Posted]: 2011-04-07 08:45:07
[Question]:
Another question about PDF parsing... I have just read section 5.3.1, "Text-Positioning Operators", of the PDF Reference, version 1.7, and I am a bit confused.
I wrote some code to get the transformation matrix and the initial text position.
CGPDFOperatorTableSetCallback (table, "MP", &op_MP);//Define marked-content point
CGPDFOperatorTableSetCallback (table, "DP", &op_DP);//Define marked-content point with property list
CGPDFOperatorTableSetCallback (table, "BMC", &op_BMC);//Begin marked-content sequence
CGPDFOperatorTableSetCallback (table, "BDC", &op_BDC);//Begin marked-content sequence with property list
CGPDFOperatorTableSetCallback (table, "EMC", &op_EMC);//End marked-content sequence
//Text State operators
CGPDFOperatorTableSetCallback(table, "Tc", &op_Tc);
CGPDFOperatorTableSetCallback(table, "Tw", &op_Tw);
CGPDFOperatorTableSetCallback(table, "Tz", &op_Tz);
CGPDFOperatorTableSetCallback(table, "TL", &op_TL);
CGPDFOperatorTableSetCallback(table, "Tf", &op_Tf);
CGPDFOperatorTableSetCallback(table, "Tr", &op_Tr);
CGPDFOperatorTableSetCallback(table, "Ts", &op_Ts);
//text showing operators
CGPDFOperatorTableSetCallback(table, "TJ", &op_TJ);
CGPDFOperatorTableSetCallback(table, "Tj", &op_Tj);
CGPDFOperatorTableSetCallback(table, "'", &op_apostrof);
CGPDFOperatorTableSetCallback(table, "\"", &op_double_apostrof);
//text positioning operators
CGPDFOperatorTableSetCallback(table, "Td", &op_Td);
CGPDFOperatorTableSetCallback(table, "TD", &op_TD);
CGPDFOperatorTableSetCallback(table, "Tm", &op_Tm);
CGPDFOperatorTableSetCallback(table, "T*", &op_T);
//text object operators
CGPDFOperatorTableSetCallback(table, "BT", &op_BT);//Begin text object
CGPDFOperatorTableSetCallback(table, "ET", &op_ET);//End text object
Here is the output after launching the application:
2010-09-02 15:09:23.041 testSearch[8251:207] op_BT begin
Integer value: 0
2010-09-02 15:09:23.043 testSearch[8251:207] op_BT end
2010-09-02 15:09:23.043 testSearch[8251:207] op_Tf begin
Integer value: 1
2010-09-02 15:09:23.044 testSearch[8251:207] op_Tf end
2010-09-02 15:09:23.044 testSearch[8251:207] op_Tm begin
Float value: 557.364197
2010-09-02 15:09:23.045 testSearch[8251:207] op_Tm end
2010-09-02 15:09:23.045 testSearch[8251:207] op_TJ begin
2010-09-02 15:09:23.046 testSearch[8251:207] Array string value [0]: F
2010-09-02 15:09:23.046 testSearch[8251:207] Array integer value [1]: 94985208
2010-09-02 15:09:23.047 testSearch[8251:207] Array string value [2]: r
2010-09-02 15:09:23.047 testSearch[8251:207] Array integer value [3]: 94985208
2010-09-02 15:09:23.048 testSearch[8251:207] Array string value [4]: o
2010-09-02 15:09:23.048 testSearch[8251:207] Array integer value [5]: 94985208
2010-09-02 15:09:23.049 testSearch[8251:207] Array string value [6]: m s
2010-09-02 15:09:23.049 testSearch[8251:207] Array integer value [7]: 94985208
2010-09-02 15:09:23.049 testSearch[8251:207] Array string value [8]: a
2010-09-02 15:09:23.050 testSearch[8251:207] Array integer value [9]: 94985208
2010-09-02 15:09:23.050 testSearch[8251:207] Array string value [10]: m
2010-09-02 15:09:23.051 testSearch[8251:207] Array integer value [11]: 94985208
2010-09-02 15:09:23.051 testSearch[8251:207] Array string value [12]: p
2010-09-02 15:09:23.052 testSearch[8251:207] Array integer value [13]: 94985208
2010-09-02 15:09:23.053 testSearch[8251:207] Array string value [14]: l
2010-09-02 15:09:23.054 testSearch[8251:207] Array integer value [15]: 94985208
2010-09-02 15:09:23.055 testSearch[8251:207] Array string value [16]: e t
2010-09-02 15:09:23.055 testSearch[8251:207] Array integer value [17]: 94985208
2010-09-02 15:09:23.057 testSearch[8251:207] Array string value [18]: o r
2010-09-02 15:09:23.057 testSearch[8251:207] Array integer value [19]: 94985208
2010-09-02 15:09:23.058 testSearch[8251:207] Array string value [20]: e
2010-09-02 15:09:23.058 testSearch[8251:207] Array integer value [21]: 94985208
2010-09-02 15:09:23.059 testSearch[8251:207] Array string value [22]: s
2010-09-02 15:09:23.059 testSearch[8251:207] Array integer value [23]: 94985208
2010-09-02 15:09:23.060 testSearch[8251:207] Array string value [24]: u
2010-09-02 15:09:23.061 testSearch[8251:207] Array integer value [25]: 94985208
2010-09-02 15:09:23.061 testSearch[8251:207] Array string value [26]: l
2010-09-02 15:09:23.062 testSearch[8251:207] Array integer value [27]: 94985208
2010-09-02 15:09:23.062 testSearch[8251:207] Array string value [28]: t
2010-09-02 15:09:23.063 testSearch[8251:207] op_TJ end
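If anyone is familiar with text matrices and the text-positioning operators, it would be great if you could explain how these things work.
How can I calculate the position of the text (or of each glyph?) using Tm (the transformation matrix and the other data)?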
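As a rough illustration of the bookkeeping described in section 5.3 of the PDF Reference 1.7: BT resets the text matrix and the text line matrix to the identity, Tm replaces both with its six operands, and Td translates the line matrix. The sketch below is plain C and does not call Quartz; the type and function names (PDFMatrix, TextState, pdf_concat, text_begin, text_set_matrix, text_move, text_origin) are hypothetical and only serve the example.

```c
#include <assert.h>

/* PDF matrix [a b c d e f]; a point is the row vector [x y 1]. */
typedef struct { double a, b, c, d, e, f; } PDFMatrix;

/* Only the part of the text state needed for positioning. */
typedef struct {
    PDFMatrix tm;   /* text matrix */
    PDFMatrix tlm;  /* text line matrix */
} TextState;

/* Returns m1 x m2: m1 is applied first, then m2. */
PDFMatrix pdf_concat(PDFMatrix m1, PDFMatrix m2) {
    PDFMatrix r;
    r.a = m1.a * m2.a + m1.b * m2.c;
    r.b = m1.a * m2.b + m1.b * m2.d;
    r.c = m1.c * m2.a + m1.d * m2.c;
    r.d = m1.c * m2.b + m1.d * m2.d;
    r.e = m1.e * m2.a + m1.f * m2.c + m2.e;
    r.f = m1.e * m2.b + m1.f * m2.d + m2.f;
    return r;
}

/* BT: both matrices start as the identity. */
void text_begin(TextState *ts) {
    PDFMatrix identity = {1, 0, 0, 1, 0, 0};
    ts->tm = ts->tlm = identity;
}

/* Tm: replaces both matrices (it does not concatenate onto them). */
void text_set_matrix(TextState *ts, PDFMatrix m) {
    ts->tm = ts->tlm = m;
}

/* Td: moves to the next line, offset by (tx, ty) from the line start. */
void text_move(TextState *ts, double tx, double ty) {
    PDFMatrix t = {1, 0, 0, 1, tx, ty};
    ts->tlm = pdf_concat(t, ts->tlm);
    ts->tm = ts->tlm;
}

/* Origin of the next glyph after mapping through the CTM. */
void text_origin(const TextState *ts, PDFMatrix ctm, double *x, double *y) {
    assert(x && y);
    PDFMatrix m = pdf_concat(ts->tm, ctm);
    *x = m.e;
    *y = m.f;
}
```

In an operator callback, the six Tm operands would be popped from the scanner and passed to text_set_matrix; the e and f components of the resulting matrix are where the next glyph starts, before the current transformation matrix is applied.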
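[Comments]:
Have you figured out how to use these operators? Thanks!
[Answer 1]:
@Koteg: Hi! Did you finally manage to get it working? For Tm I can get all six values, but now I don't see how to get the position of a word within a line...
I have an idea: in the Tj case, simply get the spacing between the letters (it is the same for each jump), and then, with Tm, get the position of the word.
The TJ case is much more complicated: get the horizontal translation value to subtract from the Tm matrix for each part of the array. Searching for a word in that array will also be more complicated than with Tj.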
By the way, for others:
// Walks a TJ array, assuming strings at even indices and numbers at odd indices.
for (size_t n = 0; n < CGPDFArrayGetCount(array); n += 2)
{
    CGPDFStringRef string;
    success = CGPDFArrayGetString(array, n, &string);
    if (success)
    {
        // The copied string is owned by the caller and must be released.
        NSString *data = (NSString *)CGPDFStringCopyTextString(string);
        NSLog(@"array data : %@", data);
        [searcher.currentData appendFormat:@"%@", data];
        [data release];
    }
    // Read the adjustment as a number, so both integers and reals are accepted.
    CGPDFReal real;
    success = CGPDFArrayGetNumber(array, n + 1, &real);
    if (success)
    {
        NSLog(@"array real : %f", real);
    }
}
Thanks
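The per-element adjustment described in this answer can be sketched in plain C, following the displacement formula in section 5.3.3 of the PDF Reference 1.7 for horizontal writing: tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th. This is only a sketch under assumptions: the names (TextParams, advance_x, apply_tj_number, apply_glyph) are hypothetical, and the glyph width has to come from the font's width data, which is not looked up here.

```c
#include <assert.h>

/* PDF matrix [a b c d e f]; a point is the row vector [x y 1]. */
typedef struct { double a, b, c, d, e, f; } PDFMatrix;

/* Text state parameters set by Tf, Tc, Tw and Tz. */
typedef struct {
    double fontSize;     /* Tfs */
    double charSpacing;  /* Tc */
    double wordSpacing;  /* Tw */
    double hScale;       /* Th, that is Tz / 100 */
} TextParams;

/* Tm' = [1 0 0 1 tx 0] x Tm, written out for the two affected entries. */
PDFMatrix advance_x(PDFMatrix tm, double tx) {
    tm.e += tx * tm.a;
    tm.f += tx * tm.b;
    return tm;
}

/* A number in a TJ array is in thousandths of a text space unit and is
   subtracted, so a positive value moves the next glyph to the left. */
PDFMatrix apply_tj_number(PDFMatrix tm, double adjustment, const TextParams *p) {
    assert(p);
    return advance_x(tm, -adjustment / 1000.0 * p->fontSize * p->hScale);
}

/* Advance after showing one glyph. width1000 is the glyph width in
   thousandths of a unit (assumed to be supplied by the caller);
   isSpace is nonzero for the single-byte character code 32. */
PDFMatrix apply_glyph(PDFMatrix tm, double width1000, int isSpace,
                      const TextParams *p) {
    assert(p);
    double tx = (width1000 / 1000.0 * p->fontSize
                 + p->charSpacing
                 + (isSpace ? p->wordSpacing : 0.0)) * p->hScale;
    return advance_x(tm, tx);
}
```

Walking a TJ array then means calling apply_glyph for each character of a string element and apply_tj_number for each number element; the e and f entries of the matrix before each glyph give that glyph's origin in user space. Tj is the same loop without the number elements.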