
Overview

The glob module finds all paths that match a given pattern, following the rules used by the Unix shell.

It handles the wildcards * and ?, as well as character ranges expressed with [].

It does not perform tilde expansion (~ for the user's home directory), though.

It returns results in arbitrary order rather than in sorted order.

It supports both relative and absolute path matching.
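
The two limitations above (arbitrary order, no tilde expansion) can be worked around with the standard library. The sketch below is illustrative only: it builds a throwaway directory with tempfile, and the file names are made up for the example.

```python
import glob
import os
import tempfile

# Build a throwaway directory so the example does not depend on existing files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("b.txt", "a.txt", "c.txt"):
        open(os.path.join(tmp_dir, name), "w").close()

    # glob makes no promise about ordering, so sort explicitly when order matters.
    matches = glob.glob(os.path.join(tmp_dir, "*.txt"))
    sorted_names = sorted(os.path.basename(path) for path in matches)
    print(sorted_names)

# glob treats "~" as a literal character, so expand it before building a pattern.
home_pattern = os.path.expanduser(os.path.join("~", "*"))
print(home_pattern)
```

Wrapping the call in sorted() is usually enough when a stable order is needed, and os.path.expanduser is the usual way to resolve ~ before handing a pattern to glob.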

In [1]:
import glob
import sys

print('Creating directory structures and files for experimentation purpose.')
%mkdir folder_l1_1
%mkdir folder_l1_2
%mkdir folder_l1_1/folder_l2

!touch temp.txt temp.jpg temp.png a.png b.jpg c.txt d.txt t1.png t2.jpg .temp.png .temp.jpg
!touch folder_l1_1/t.txt folder_l1_1/t2.png folder_l1_1/a.mp4 folder_l1_1/b.mp4 folder_l1_1/c.mpeg folder_l1_1/t3.jpg
!touch folder_l1_2/t2.txt folder_l1_2/t1.png folder_l1_2/b.mp4 folder_l1_2/c.mp4 folder_l1_2/d.mpeg
!touch folder_l1_1/folder_l2/t.txt folder_l1_1/folder_l2/t2.png folder_l1_1/folder_l2/a.mp4

print('\nCurrent directory contents : ')
%ls
print('\nfolder_l1_1 directory contents : ')
%ls folder_l1_1
print('\nfolder_l1_2 directory contents :')
%ls folder_l1_2
print('\nfolder_l1_1/folder_l2 directory contents : ')
%ls folder_l1_1/folder_l2
Creating directory structures and files for experimentation purpose.

Current directory contents :
__notebook_source__.ipynb  c.txt         folder_l1_2/  temp.jpg
a.png                      d.txt         t1.png        temp.png
b.jpg                      folder_l1_1/  t2.jpg        temp.txt

folder_l1_1 directory contents :
a.mp4  b.mp4  c.mpeg  folder_l2/  t.txt  t2.png  t3.jpg

folder_l1_2 directory contents :
b.mp4  c.mp4  d.mpeg  t1.png  t2.txt

folder_l1_1/folder_l2 directory contents :
a.mp4  t.txt  t2.png
  • glob(pathname, recursive=False) - Returns a list of all paths that match the pattern. If recursive is True, the pattern ** matches any files and zero or more directories and subdirectories.
In [3]:
print(glob.glob('*.txt'))
print(glob.glob('*.png'))
print(glob.glob('*.jpg'))
print(glob.glob('*.*g'))
print(glob.glob('.*'))
print(glob.glob('.*.png'))
print(glob.glob('.*.*g'))
print(glob.glob('[a-z]*.txt'))
print(glob.glob('[a-z]+.txt')) ## Regex quantifiers such as + are not supported: the + is matched literally. Only *, ? and [] are special.
['d.txt', 'c.txt', 'temp.txt']
['temp.png', 't1.png', 'a.png']
['temp.jpg', 't2.jpg', 'b.jpg']
['temp.png', 't1.png', 'temp.jpg', 't2.jpg', 'a.png', 'b.jpg']
['.temp.jpg', '.temp.png', '.ipynb_checkpoints']
['.temp.png']
['.temp.jpg', '.temp.png']
['d.txt', 'c.txt', 'temp.txt']
[]
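
The last pattern returns an empty list because glob reads + as a literal character, not as a regular-expression quantifier. When "one or more lowercase letters" is really what is wanted, one option is to list the directory and filter it with the re module. This is a sketch, not part of glob itself; the directory and file names are invented for the example.

```python
import glob
import os
import re
import tempfile

# Regular expression: one or more lowercase letters followed by ".txt".
name_regex = re.compile(r"[a-z]+\.txt")

with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("temp.txt", "c.txt", "t2.txt", "a+.txt"):
        open(os.path.join(tmp_dir, name), "w").close()

    # Filter plain directory listings with the regex.
    regex_matches = sorted(
        name for name in os.listdir(tmp_dir) if name_regex.fullmatch(name)
    )

    # The same text used as a glob pattern only matches a literal "+".
    glob_matches = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(tmp_dir, "[a-z]+.txt"))
    )

print(regex_matches)
print(glob_matches)
```

Here the regex keeps temp.txt and c.txt, while the glob pattern only finds the file whose name actually contains a plus sign.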
In [4]:
print(glob.glob('[a-z][0-9].*'))
print(glob.glob('t[0-9].*'))
print(glob.glob('[a-z][0-9].*'))
print(glob.glob('[a-z][0-9].*g'))
['t1.png', 't2.jpg']
['t1.png', 't2.jpg']
['t1.png', 't2.jpg']
['t1.png', 't2.jpg']
In [5]:
print(glob.glob('*/*.txt'))
print(glob.glob('*/*/*.txt'))
print(glob.glob('*/*.png'))
print(glob.glob('*/*.*g'))
print(glob.glob('folder_l1_1/*.png'))
print(glob.glob('folder_l1_1/folder_l2/*.*'))
print(glob.glob('folder_l1_2/*'))
['folder_l1_1/t.txt', 'folder_l1_2/t2.txt']
['folder_l1_1/folder_l2/t.txt']
['folder_l1_1/t2.png', 'folder_l1_2/t1.png']
['folder_l1_1/t2.png', 'folder_l1_1/c.mpeg', 'folder_l1_1/t3.jpg', 'folder_l1_2/t1.png', 'folder_l1_2/d.mpeg']
['folder_l1_1/t2.png']
['folder_l1_1/folder_l2/t2.png', 'folder_l1_1/folder_l2/a.mp4', 'folder_l1_1/folder_l2/t.txt']
['folder_l1_2/t1.png', 'folder_l1_2/b.mp4', 'folder_l1_2/d.mpeg', 'folder_l1_2/c.mp4', 'folder_l1_2/t2.txt']
In [6]:
print(glob.glob('**/*.txt',recursive=True))
print(glob.glob('**/*.png',recursive=True))
print(glob.glob('**/*.*g',recursive=True))
print(glob.glob('**/*[0-9].txt',recursive=True))
print(glob.glob('*/*/*.txt',recursive=True))
print(glob.glob('*/*/[a-z].txt',recursive=True))
['d.txt', 'c.txt', 'temp.txt', 'folder_l1_1/t.txt', 'folder_l1_1/folder_l2/t.txt', 'folder_l1_2/t2.txt']
['temp.png', 't1.png', 'a.png', 'folder_l1_1/t2.png', 'folder_l1_1/folder_l2/t2.png', 'folder_l1_2/t1.png']
['temp.png', 't1.png', 'temp.jpg', 't2.jpg', 'a.png', 'b.jpg', 'folder_l1_1/t2.png', 'folder_l1_1/c.mpeg', 'folder_l1_1/t3.jpg', 'folder_l1_1/folder_l2/t2.png', 'folder_l1_2/t1.png', 'folder_l1_2/d.mpeg']
['folder_l1_2/t2.txt']
['folder_l1_1/folder_l2/t.txt']
['folder_l1_1/folder_l2/t.txt']
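
The effect of the recursive flag is easiest to see by running the same ** pattern with and without it. The following sketch uses a temporary directory tree with made-up names (level1, level2, top.txt and so on); without recursive=True, ** behaves like a single *.

```python
import glob
import os
import tempfile


def relative_sorted(paths, base):
    """Return paths relative to base, with forward slashes, in sorted order."""
    return sorted(os.path.relpath(p, base).replace(os.sep, "/") for p in paths)


with tempfile.TemporaryDirectory() as tmp_dir:
    os.makedirs(os.path.join(tmp_dir, "level1", "level2"))
    for rel_path in (
        "top.txt",
        os.path.join("level1", "mid.txt"),
        os.path.join("level1", "level2", "deep.txt"),
    ):
        open(os.path.join(tmp_dir, rel_path), "w").close()

    pattern = os.path.join(tmp_dir, "**", "*.txt")

    # With recursive=True, ** spans zero or more directory levels.
    recursive_hits = relative_sorted(glob.glob(pattern, recursive=True), tmp_dir)

    # Without it, ** is treated like *, so exactly one directory level is matched.
    flat_hits = relative_sorted(glob.glob(pattern), tmp_dir)

print(recursive_hits)
print(flat_hits)
```

This also explains why passing recursive=True to a pattern that contains no **, as in the last two calls above, changes nothing.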
  • iglob(pathname, recursive=False) - Returns an iterator over all paths that match the pattern. If recursive is True, ** also looks into subdirectories. An iterator is the better choice when a pattern can match a large number of files, because the full list of matching paths is never held in memory at once.
In [8]:
%time normal_dir_list = glob.glob('**/*.*g',recursive=True) ## This builds the full list of matching paths up front, so it takes more time and memory.
%time iterator_dir_list = glob.iglob('**/*.*g',recursive=True) ## This returns almost immediately because it only creates the iterator and does not build the whole list in memory. Each path is produced when it is requested from the iterator.
print('Size of list : %d bytes'%sys.getsizeof(normal_dir_list))
print('Size of iterator : %d bytes'%sys.getsizeof(iterator_dir_list))
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 2.06 ms
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.78 µs
Size of list : 160 bytes
Size of iterator : 88 bytes
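
Two caveats about the measurement above: the iglob timing only covers creating the iterator, since no matching happens until it is consumed, and sys.getsizeof reports the size of the container object only, not of the path strings it refers to. A small sketch of consuming an iglob iterator lazily, using itertools.islice and invented file names in a temporary directory:

```python
import glob
import itertools
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    for index in range(50):
        open(os.path.join(tmp_dir, f"file_{index}.log"), "w").close()

    # Creating the iterator does not collect any results yet.
    path_iter = glob.iglob(os.path.join(tmp_dir, "*.log"))

    # Take only the first three matches; the rest stay unconsumed.
    first_three = list(itertools.islice(path_iter, 3))

    # The iterator remembers its position, so counting continues from there.
    remaining = sum(1 for _ in path_iter)

print(len(first_three), remaining)
```

Processing paths one at a time like this avoids building a large list when only a few matches, or a single pass over them, is needed.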
  • escape(pathname) - Escapes all special characters (*, ? and [) in pathname. This is useful when you want to match a literal path that may itself contain special characters.
In [9]:
!touch temp?tea.txt
print(glob.escape('temp?tea.txt'))
print(glob.glob(glob.escape('temp?tea.txt')))
temp[?]tea.txt
['temp?tea.txt']
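
A common case for escape is a directory name that contains brackets, combined with a wildcard for the file part. The sketch below assumes a hypothetical directory called data[v1]; only the directory portion is escaped, so the trailing *.txt still works as a wildcard.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical directory whose name contains glob special characters.
    odd_dir = os.path.join(tmp_dir, "data[v1]")
    os.makedirs(odd_dir)
    open(os.path.join(odd_dir, "report.txt"), "w").close()

    # Unescaped: [v1] is read as a character class, so the directory is not found.
    unescaped = glob.glob(os.path.join(odd_dir, "*.txt"))

    # Escaped: the directory name is matched literally, *.txt remains a wildcard.
    escaped = glob.glob(os.path.join(glob.escape(odd_dir), "*.txt"))
    escaped_names = [os.path.basename(path) for path in escaped]

print(unescaped)
print(escaped_names)
```

Escaping only the untrusted or literal part of a path, then joining it with the wildcard part, keeps the pattern predictable.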

Sunny Solanki