Read and Transform the Data

Data source: Shopee - Price Match Guarantee

df

| Row | title (String) | label_group (Int64) |
| --- | --- | --- |
| 1 | "Paper Bag Victoria Secret" | 249114794 |
| 2 | "Double Tape 3M VHB 12 mm x 4,5 m ORIGINAL / DOUBLE FOAM TAPE" | 2937985045 |
| 3 | "Maling TTS Canned Pork Luncheon Meat 397 gr" | 2395904891 |
| 4 | "Daster Batik Lengan pendek - Motif Acak / Campur - Leher Kancing (DPT001-00) Batik karakter Alhadi" | 4093212188 |
| 5 | "Nescafe \\xc3\\x89clair Latte 220ml" | 3648931069 |
| 6 | "CELANA WANITA  (BB 45-84 KG)Harem wanita (bisa cod)" | 2660605217 |
| 7 | "Jubah anak size 1-12 thn" | 1835033137 |
| 8 | "KULOT PLISKET SALUR /CANDY PLISKET /WISH KULOT PREMIUM /KULOT PELANGI PREMIUM/HIEKA KULOT" | 1565741687 |
| 9 | "[LOGU] Tempelan kulkas magnet angka, tempelan angka magnet" | 2359912463 |
| 10 | "BIG SALE SEPATU PANTOFEL KULIT KEREN KERJA KANTOR LAKI PRIA COWOK DINAS RESMI FORMAL PESTA KICKERS" | 2630990665 |
| ⋮ | ⋮ | ⋮ |
| 34250 | "FLEX TAPE PELAPIS BOCOR / ISOLASI AJAIB / ANTI BOCOR" | 459464107 |

Make all characters lower cased:
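
The cell that performs this step is not shown. A minimal sketch, assuming the titles are available as a plain vector of strings (the `titles` vector here is a hypothetical stand-in for the `title` column):

```julia
# Hypothetical stand-in for the `title` column; two titles copied from df.
titles = ["Paper Bag Victoria Secret",
          "Maling TTS Canned Pork Luncheon Meat 397 gr"]

# Broadcast Base.lowercase over every element.
lowered = lowercase.(titles)
```

With a DataFrame, the same broadcast can presumably be assigned back to the column, for example `df.title = lowercase.(df.title)`.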

Group by label_group:

groups

GroupedDataFrame with 11014 groups based on key: label_group

First Group (2 rows): label_group = 249114794

| Row | title (String) | label_group (Int64) |
| --- | --- | --- |
| 1 | paper bag victoria secret | 249114794 |
| 2 | paper bag victoria secret | 249114794 |

⋮

Last Group (2 rows): label_group = 53836859

| Row | title (String) | label_group (Int64) |
| --- | --- | --- |
| 1 | sprei lady rose 180x200 king terlaris keroppi | 53836859 |
| 2 | sprei king ladyrose size 180x200 kerokeroppi | 53836859 |
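
The output above is a `GroupedDataFrame`, so the notebook presumably calls `groupby` from DataFrames; that cell is not shown. To illustrate what the grouping does, here is a sketch in plain Julia. The `rows` vector and the `group_titles` helper are hypothetical and only mimic the first and last groups.

```julia
# Hypothetical rows as (title, label_group) pairs, taken from the two groups shown.
rows = [("paper bag victoria secret", 249114794),
        ("paper bag victoria secret", 249114794),
        ("sprei lady rose 180x200 king terlaris keroppi", 53836859),
        ("sprei king ladyrose size 180x200 kerokeroppi", 53836859)]

# Collect the titles that share a label_group into one vector per label.
function group_titles(rows)
    groups = Dict{Int64, Vector{String}}()
    for (title, label) in rows
        push!(get!(groups, label, String[]), title)
    end
    return groups
end

groups_sketch = group_titles(rows)
```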

Tokenizer demo

List of tokenize functions demonstrated here (from the WordTokenizers package):

  1. punctuation_space_tokenize

  2. penn_tokenize

  3. nltk_word_tokenize

  4. poormans_tokenize
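
The demo cells and their outputs are not preserved here. A sketch of how the four functions could be tried on one title, assuming the WordTokenizers package is installed; the choice of title is arbitrary and no output is claimed:

```julia
using WordTokenizers  # assumption: the package is installed

# One lower-cased title from the data set.
title = "double tape 3m vhb 12 mm x 4,5 m original / double foam tape"

# The four tokenizers listed above.
tokenizers = (punctuation_space_tokenize, penn_tokenize,
              nltk_word_tokenize, poormans_tokenize)

# Print each tokenizer's name next to the tokens it produces.
for f in tokenizers
    println(rpad(string(f), 28), f(title))
end
```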

Tokenize and Count

tokenize_and_count (generic function with 1 method)
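
The body of `tokenize_and_count` is not shown. The sketch below is a hypothetical stand-in (`tokenize_and_count_sketch`) that assumes the function turns each text into a set of unique tokens and counts the sizes of the sets, their intersection, and their union. The `tokenizer` keyword is also an assumption and defaults to whitespace splitting.

```julia
# Hypothetical stand-in for the notebook's tokenize_and_count.
# `tokenizer` is any function mapping a string to a collection of tokens.
function tokenize_and_count_sketch(text_1, text_2; tokenizer = split)
    tokens_1 = Set(tokenizer(text_1))   # unique tokens of the first text
    tokens_2 = Set(tokenizer(text_2))   # unique tokens of the second text
    return (n_1 = length(tokens_1),
            n_2 = length(tokens_2),
            intersect = length(intersect(tokens_1, tokens_2)),
            union = length(union(tokens_1, tokens_2)))
end

counts = tokenize_and_count_sketch("maling tts canned pork luncheon meat 397 gr",
                                   "maling ham pork luncheon meat tts 397gr")
# counts == (n_1 = 8, n_2 = 7, intersect = 5, union = 10)
```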

Compare results from the two tokenizers (nltk and penn):

| Row | text_1 (String) | text_2 (String) | label_group (Int64) | n_1 (Int64) | n_2 (Int64) | intersect (Int64) | union (Int64) | jaccard (Float64) | overlap (Float64) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | "paper bag victoria secret" | "paper bag victoria secret" | 249114794 | 4 | 4 | 4 | 4 | 1.0 | 1.0 |
| 2 | "double tape 3m vhb 12 mm x 4,5 m original / double foam tape" | "double tape vhb 3m original 12mm x 4.5mm busa perekat" | 2937985045 | 12 | 12 | 9 | 15 | 0.6 | 0.75 |
| 3 | "maling tts canned pork luncheon meat 397 gr" | "maling ham pork luncheon meat tts 397gr" | 2395904891 | 8 | 8 | 7 | 9 | 0.777778 | 0.875 |

| Row | text_1 (String) | text_2 (String) | label_group (Int64) | n_1 (Int64) | n_2 (Int64) | intersect (Int64) | union (Int64) | jaccard (Float64) | overlap (Float64) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | "paper bag victoria secret" | "paper bag victoria secret" | 249114794 | 4 | 4 | 4 | 4 | 1.0 | 1.0 |
| 2 | "double tape 3m vhb 12 mm x 4,5 m original / double foam tape" | "double tape vhb 3m original 12mm x 4.5mm busa perekat" | 2937985045 | 14 | 10 | 6 | 18 | 0.333333 | 0.6 |
| 3 | "maling tts canned pork luncheon meat 397 gr" | "maling ham pork luncheon meat tts 397gr" | 2395904891 | 8 | 7 | 5 | 10 | 0.5 | 0.714286 |
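
The notebook does not show how the `jaccard` and `overlap` columns are computed. Every row above is consistent with jaccard = intersect / union and overlap = intersect / min(n_1, n_2), so the sketch below assumes those two formulas. The function names are hypothetical.

```julia
# Jaccard index: shared tokens divided by all distinct tokens.
jaccard_index(n_intersect, n_union) = n_intersect / n_union

# Overlap coefficient: shared tokens divided by the size of the smaller token set.
overlap_coefficient(n_intersect, n_1, n_2) = n_intersect / min(n_1, n_2)

# Counts from row 2 of the first result: n_1 = 12, n_2 = 12, intersect = 9, union = 15.
j = jaccard_index(9, 15)            # 0.6
o = overlap_coefficient(9, 12, 12)  # 0.75
```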

Histogram

(# of groups with fewer than ten members, # of groups with ten or more members):
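
The cell and its result are not preserved here. A sketch of how such a pair can be computed, assuming the group sizes are held in a vector (`group_sizes` and its values are hypothetical):

```julia
# Hypothetical number of rows in each label_group.
group_sizes = [2, 2, 3, 51, 12, 9, 10]

small = count(<(10), group_sizes)    # groups with fewer than ten members
large = count(>=(10), group_sizes)   # groups with ten or more members

(small, large)                       # (4, 3) for these hypothetical sizes
```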

Larger groups:

| Row | label_group (Int64) | nrow (Int64) |
| --- | --- | --- |
| 1 | 1141798720 | 51 |
| 2 | 3113678103 | 51 |
| 3 | 562358068 | 51 |
| 4 | 3627744656 | 51 |
| 5 | 994676122 | 51 |
| 6 | 1163569239 | 51 |
| 7 | 159351600 | 51 |
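
How the notebook selects these groups is not shown. Since every listed group has 51 rows, one plausible reading is that it keeps the groups with the largest size. A sketch under that assumption, with a hypothetical `sizes` dictionary:

```julia
# Hypothetical label_group => nrow pairs; two large and two small groups.
sizes = Dict(1141798720 => 51, 249114794 => 2,
             3113678103 => 51, 53836859 => 2)

# Largest group size, then every label_group that reaches it.
largest = maximum(values(sizes))
largest_groups = sort([label for (label, n) in sizes if n == largest])
```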

Pick one group as an example:

| Row | title (String) | label_group (Int64) |
| --- | --- | --- |
| 1 | "implora cheek & liptint - implora lip tint original bpom" | 3627744656 |
| 2 | "implora cheek & liptint" | 3627744656 |
| 3 | "lip tint implora cheek & liptint" | 3627744656 |
| 4 | "implora cheek & lip tint model ice cream - set liptint & pemerah pipi" | 3627744656 |
| 5 | "{promo murah} implora cheek&liptint" | 3627744656 |
| 6 | "implora cheek and liptint/ lip tint implora" | 3627744656 |
| 7 | "implora cheek lip tint" | 3627744656 |
| 8 | "new cheek dan liptint/ lip tint by implora bpom" | 3627744656 |
| 9 | "implora cheek & liptint - implora liptint - lip tint implora" | 3627744656 |
| 10 | "new liptint implora / implora cheek & liptint" | 3627744656 |
| ⋮ | ⋮ | ⋮ |
| 51 | "\\xe2\\x9d\\xa4 belia \\xe2\\x9d\\xa4 implora (\\xe2\\x9c\\x94\\xef\\xb8\\x8fbpom)  cheek & liptint 5.5g \| lip tint implora" | 3627744656 |

Samples

sample (generic function with 1 method)

Jaccard index equals 1 (exactly the same):

Jaccard index equals 0 (completely different):

Jaccard index between 0 and 0.2 (only slightly similar):
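
The notebook's `sample` function and its outputs are not shown. The three cases above can be sketched as a filter on the Jaccard index followed by a random draw. The `sample_where` helper and the `records` below are hypothetical, and the Jaccard values are hard-coded toy numbers.

```julia
using Random

# Hypothetical records; each holds a title pair and its Jaccard index.
records = [(text_1 = "paper bag victoria secret", text_2 = "paper bag victoria secret", jaccard = 1.0),
           (text_1 = "a b c", text_2 = "d e f", jaccard = 0.0),
           (text_1 = "a b c d e f", text_2 = "a g h i j k", jaccard = 1 / 11)]

# Keep the records that satisfy `pred`, then draw up to `n` of them at random.
function sample_where(pred, records, n; rng = Random.default_rng())
    candidates = filter(pred, records)
    return shuffle(rng, candidates)[1:min(n, length(candidates))]
end

same      = sample_where(r -> r.jaccard == 1.0, records, 5)     # exactly the same
different = sample_where(r -> r.jaccard == 0.0, records, 5)     # completely different
slight    = sample_where(r -> 0 < r.jaccard < 0.2, records, 5)  # only slightly similar
```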
