Hive regex for extracting keywords from url
【Posted】: 2018-04-20 22:57:17
【Question】: The file names look like this:
file:///storage/emulated/0/SHAREit/videos/Dangerous_Hero_(2017)____Latest_South_Indian_Full_Hindi_Dubbed_Movie___2017_.mp4
file:///storage/emulated/0/VidMate/download/%E0%A0_-_Promo_Songs_-_Khiladi_-_Khesari_Lal_-_Bho.mp4
file:///storage/emulated/0/WhatsApp/Media/WhatsApp%20Video/VID-20171222-WA0015.mp4
file:///storage/emulated/0/bluetooth/%5DChitaChola%7B%7D%D8%B9%D8%A7%D9%85%D8%B1%24%20.3gp
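I want to write a Hive regex that extracts the words from each string.
For example, for the first string the output should be: storage,emulated,....
Update
This code gives me the result I want, but I would like a regex instead of the code below.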
package uri_keyword_extractor;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import java.util.ArrayList;

public class UDFUrlKeywordExtractor extends UDF {
    private Text result = new Text();

    public Text evaluate(Text url) {
        if (url == null) {
            return null;
        }
        String keywords = url_keyword_maker(url.toString());
        result.set(keywords);
        return result;
    }

    // Collects every run of two or more ASCII letters and joins the runs with commas.
    private static String url_keyword_maker(String url) {
        ArrayList<String> keywordAr = new ArrayList<String>();
        char[] charAr = url.toCharArray();
        for (int i = 0; i < charAr.length; i++) {
            int current_index = i;
            StringBuilder sb = new StringBuilder();
            // Consume consecutive letters. The bounds check comes first and covers
            // the last index, so a trailing letter is no longer dropped.
            while (current_index < charAr.length && isChar(charAr[current_index])) {
                sb.append(charAr[current_index]);
                current_index = current_index + 1;
            }
            String word = sb.toString();
            if (word.length() >= 2) {
                keywordAr.add(word);
            }
            // Resume after the character that ended the run.
            i = current_index;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keywordAr.size(); i++) {
            String current = keywordAr.get(i);
            sb.append(current);
            if (i < keywordAr.size() - 1) {
                sb.append(",");
            }
        }
        return sb.toString();
    }

    // True only for ASCII letters: A-Z => (65,90), a-z => (97,122).
    private static boolean isChar(char ch) {
        return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
    }

    public static void main(String[] args) {
        String test1 = "file:///storage/emulated/0/SHAREit/videos/Dangerous_Hero_(2017)____Latest_South_Indian_Full_Hindi_Dubbed_Movie___2017_.mp4";
        String test2 = "file:///storage/emulated/0/VidMate/download/%E0%A0_-_Promo_Songs_-_Khiladi_-_Khesari_Lal_-_Bho.mp4";
        String test3 = "file:///storage/emulated/0/bluetooth/%5DChitaChola%7B%7D%D8%B9%D8%A7%D9%85%D8%B1%24%20.3gp";
        System.out.println(url_keyword_maker(test1));
        System.out.println(url_keyword_maker(test2));
        System.out.println(url_keyword_maker(test3));
    }
}
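For reference, the character loop above is intended to collect runs of two or more ASCII letters, which corresponds to the pattern [a-zA-Z]{2,}. The following is a minimal plain-Java sketch of that equivalence, not a Hive query; the class name RegexKeywordSketch and the method name keywords are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; illustrates the regex equivalent of the loop-based UDF.
class RegexKeywordSketch {

    // A keyword is a run of at least two ASCII letters.
    private static final Pattern KEYWORD = Pattern.compile("[a-zA-Z]{2,}");

    // Returns all keywords in order of appearance, joined with commas.
    static String keywords(String url) {
        List<String> found = new ArrayList<>();
        Matcher matcher = KEYWORD.matcher(url);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return String.join(",", found);
    }

    public static void main(String[] args) {
        System.out.println(keywords(
            "file:///storage/emulated/0/VidMate/download/%E0%A0_-_Promo_Songs_-_Khiladi_-_Khesari_Lal_-_Bho.mp4"));
    }
}
```

Unlike a set-based approach, this sketch keeps repeated keywords, which matches the behavior of the ArrayList in the code above.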
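【Comments】:
You should provide the complete expected output; as it stands the question is too ambiguous.
@hlagos see the update.
@wp78de see the update.
【Answer 1】:
Use the split(str, regex_pattern) function, which splits str using the regex as the delimiter pattern and returns an array. Then use lateral view + explode to explode the array, and filter the keywords by length as in your Java code. Then apply collect_set to re-assemble the array of keywords, plus concat_ws(delimiter, array) to convert the array to a delimited string (if necessary).
The regex I passed to the split function is '[^a-zA-Z]'.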
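To make the steps concrete outside Hive, here is a minimal plain-Java sketch of the same pipeline: split on '[^a-zA-Z]', keep parts of length 2 or more, remove duplicates, and join with commas. The class name SplitKeywordSketch and the method name keywords are hypothetical. Keeping first-seen order via LinkedHashSet is an assumption of this sketch; collect_set removes duplicates but should not be relied on for ordering.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical class name; a plain-Java illustration of the steps, not Hive code.
class SplitKeywordSketch {

    // Mirrors split(url, '[^a-zA-Z]') + length filter + collect_set + concat_ws(',').
    static String keywords(String url) {
        // Every non-letter character is a delimiter, so consecutive delimiters
        // yield empty strings; the length filter below discards them.
        String[] parts = url.split("[^a-zA-Z]");
        // Set semantics drop duplicates, as collect_set does.
        // Insertion order is kept here only as an assumption of this sketch.
        Set<String> unique = new LinkedHashSet<>();
        for (String part : parts) {
            if (part.length() >= 2) {
                unique.add(part);
            }
        }
        return String.join(",", unique);
    }

    public static void main(String[] args) {
        System.out.println(keywords(
            "file:///storage/emulated/0/WhatsApp/Media/WhatsApp%20Video/VID-20171222-WA0015.mp4"));
    }
}
```

Note that "WhatsApp" occurs twice in that URL but appears once in the result, because the keywords pass through a set.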
Demo:
select url_nbr, concat_ws(',',collect_set(key_word)) keywords from
(--your URLs example, url_nbr here is just for reference
select 'file:///storage/emulated/0/SHAREit/videos/Dangerous_Hero_(2017)____Latest_South_Indian_Full_Hindi_Dubbed_Movie___2017_.mp4' as url, 1 as url_nbr union all
select 'file:///storage/emulated/0/VidMate/download/%E0%A0_-_Promo_Songs_-_Khiladi_-_Khesari_Lal_-_Bho.mp4' as url, 2 as url_nbr union all
select 'file:///storage/emulated/0/WhatsApp/Media/WhatsApp%20Video/VID-20171222-WA0015.mp4' as url, 3 as url_nbr union all
select 'file:///storage/emulated/0/bluetooth/%5DChitaChola%7B%7D%D8%B9%D8%A7%D9%85%D8%B1%24%20.3gp' as url, 4 as url_nbr)s
lateral view explode(split(url, '[^a-zA-Z]')) v as key_word
where length(key_word)>=2 --filter here
group by url_nbr
;
Output:
OK
1 file,storage,emulated,SHAREit,videos,Dangerous,Hero,Latest,South,Indian,Full,Hindi,Dubbed,Movie,mp
2 file,storage,emulated,VidMate,download,Promo,Songs,Khiladi,Khesari,Lal,Bho,mp
3 file,storage,emulated,WhatsApp,Media,Video,VID,WA,mp
4 file,storage,emulated,bluetooth,DChitaChola,gp
Time taken: 37.767 seconds, Fetched: 4 row(s)
Maybe I missed something from your Java code, but hopefully you get the idea, so you can easily modify my code and add extra processing if necessary.
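One example of such extra processing, which is not part of the original answer: in row 4 the keyword DChitaChola picks up a stray D from the percent-escape %5D. Decoding the URL before splitting avoids this. Below is a minimal plain-Java sketch under that assumption, using java.net.URLDecoder; the class name DecodedKeywordSketch and the method name keywords are hypothetical, and wiring this into a Hive query is left open.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical class name; decodes percent-escapes before extracting keywords.
class DecodedKeywordSketch {

    static String keywords(String url) {
        String decoded;
        try {
            // Turns escapes such as %5D into ']' so their hex digits
            // cannot be mistaken for letters of a keyword.
            decoded = URLDecoder.decode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM.
            throw new IllegalStateException(e);
        }
        // Same steps as before: split on non-letters, filter by length, de-duplicate.
        Set<String> unique = new LinkedHashSet<>();
        for (String part : decoded.split("[^a-zA-Z]")) {
            if (part.length() >= 2) {
                unique.add(part);
            }
        }
        return String.join(",", unique);
    }

    public static void main(String[] args) {
        System.out.println(keywords(
            "file:///storage/emulated/0/bluetooth/%5DChitaChola%7B%7D%D8%B9%D8%A7%D9%85%D8%B1%24%20.3gp"));
    }
}
```

With decoding in place the fourth URL yields ChitaChola rather than DChitaChola. URLDecoder.decode throws IllegalArgumentException on a malformed escape (for example a '%' not followed by two hex digits), so real data may need a fallback to the undecoded string.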