Production from Scratch - fastai Lesson 3
import fastbook
fastbook.setup_book()
Retrieve images of our subjects from Bing Image Search; the images we search for are downloaded locally.
key = os.environ.get('AZURE_SEARCH_KEY', 'xyz')
results = search_images_bing(key, 'trilobite')
ims = results.attrgot('content_url')
len(ims)
150
fossil_types = 'trilobite', 'crinoid', 'bivalve'
path = Path('critters')
for o in fossil_types:
    dest = (path/o)
    if not dest.exists():
        dest.mkdir(exist_ok=True)
        results = search_images_bing(key, f'{o} fossil')
        download_images(dest, urls=results.attrgot('content_url'))
Some post-processing cleanup may be required to remove empty files, HTML data, or files encoded in a format that PIL cannot understand (such as VP8). They might claim to be '.jpg' files, but that does NOT mean they are (see the sketch after the verification code below)! fastai has a handy get_image_files() function that recursively identifies images and returns them as a list, and a verify_images() function that returns a list of images that are not up to snuff. Then we use L.map() to unlink (remove) any files in the failed list.
fns = get_image_files(path)
fns
(#437) [Path('critters/trilobite/00000001.jpg'),Path('critters/trilobite/00000000.jpg'),Path('critters/trilobite/00000004.jpg'),Path('critters/trilobite/00000005.jpg'),Path('critters/trilobite/00000002.jpg'),Path('critters/trilobite/00000007.jpg'),Path('critters/trilobite/00000006.jpg'),Path('critters/trilobite/00000013.jpg'),Path('critters/trilobite/00000008.jpg'),Path('critters/trilobite/00000011.jpg')...]
failed = verify_images(fns)
failed
(#0) []
failed.map(Path.unlink);
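As an aside, the "extensions lie" point above is easy to check yourself: PIL reports the format it actually decodes, regardless of the filename. A minimal sketch (the helper name true_format is my own, not fastai's):

from PIL import Image

def true_format(fn):
    # Return the format PIL actually detects (e.g. 'JPEG', 'PNG', 'WEBP'),
    # or None if the file cannot be decoded at all.
    try:
        with Image.open(fn) as im:
            return im.format
    except Exception:
        return None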
Time to take a look at the images we've downloaded. fastai has a handy show_batch() function for its ImageDataLoaders objects, so we can preview the images inside Jupyter. We must specify item_tfms= or else the widget will not be able to render the wide variety of image resolutions that have been downloaded.
dls = ImageDataLoaders.from_folder(path/'crinoid', valid_pct=0.2, item_tfms=Resize(256))
dls.valid_ds.items[:3]
[Path('critters/crinoid/00000087.jpg'),
Path('critters/crinoid/00000065.jpg'),
Path('critters/crinoid/00000045.jpg')]
dls.show_batch(max_n=40, nrows=4)
dls = ImageDataLoaders.from_folder(path/'trilobite', valid_pct=0.2, item_tfms=Resize(256))
dls.show_batch(max_n=40, nrows=4)
dls = ImageDataLoaders.from_folder(path/'bivalve', valid_pct=0.2, item_tfms=Resize(256))
dls.show_batch(max_n=40, nrows=4)
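Why is item_tfms= required? Every tensor in a batch must share the same shape, and Resize(256) guarantees that. If the default center crop cuts off too much of a fossil, fastai's ResizeMethod offers alternatives; a quick sketch of what one might try (dls_squish is my own name):

# Squish distorts the aspect ratio instead of cropping; ResizeMethod.Pad letterboxes.
dls_squish = ImageDataLoaders.from_folder(path/'trilobite', valid_pct=0.2,
                                          item_tfms=Resize(256, ResizeMethod.Squish))
dls_squish.show_batch(max_n=8, nrows=2)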
We are now ready to create a DataBlock. The DataBlock will contain our images and labels; it needs to know how to 'find' the items, how to split them into a training set and a validation set, where to get the dependent variable (the label, e.g. the directory each image is in), and lastly how to transform the images so that they can be batched and run through CUDA.
critters = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=RandomResizedCrop(256, min_scale=0.3))
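Here item_tfms runs per image on the CPU. If we also wanted batched, GPU-side augmentation we could add batch_tfms, which runs on whole collated batches (on CUDA when available). A sketch using fastai's stock aug_transforms(); the critters_aug and dls_aug names are mine:

critters_aug = critters.new(batch_tfms=aug_transforms())
dls_aug = critters_aug.dataloaders(path, bs=64)
# unique=True repeats one image so the augmentations are visible side by side.
dls_aug.train.show_batch(max_n=8, nrows=2, unique=True)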
The DataBlock has not yet seen our data, so let's show it the data.
dls = critters.dataloaders(path, bs=64)
dls.valid.show_batch(max_n=10, nrows=2)
dls.train.show_batch(max_n=10, nrows=2)
learn = cnn_learner(dls, resnet34, metrics=error_rate)
# fine_tune first trains the new head for one frozen epoch, then unfreezes
# and trains the whole network for 3 more (hence the two tables below).
learn.fine_tune(3)
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 1.940702 | 0.609582 | 0.244186 | 00:09 |
epoch | train_loss | valid_loss | error_rate | time |
---|---|---|---|---|
0 | 0.762174 | 0.398268 | 0.197674 | 00:12 |
1 | 0.567270 | 0.371190 | 0.151163 | 00:12 |
2 | 0.462896 | 0.349624 | 0.139535 | 00:12 |
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
interp.plot_top_losses(10, nrows=2)
cleaner = ImageClassifierCleaner(learn)
cleaner
learn.export()
#path.ls(file_exts='.pkl')
p = Path()
p.ls(file_exts='.pkl')
(#1) [Path('export.pkl')]
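The exported export.pkl is what a production app would load back. A quick sanity check one might run (the example filename is just one of the downloads from earlier):

learn_inf = load_learner('export.pkl')
# predict returns the decoded label, its index, and the class probabilities.
pred, pred_idx, probs = learn_inf.predict('critters/trilobite/00000001.jpg')
pred, probs[pred_idx]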
I had some cleanup issues when using the ImageClassifierCleaner() class. After using the widget and then invoking the for loops to unlink and move the images (sketched below), it threw an exception because the target filenames already existed; it appears the name collisions broke the counting and moving of files. It also wasn't clear what to do after cleaning the dataset: simply running learn.fine_tune(1) did not work since the dataset had changed, so I reinitialized the DataBlock and retrained the model. This improved performance from an error rate of about 17% to about 14%.
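For reference, the unlink/move loops in question follow the fastbook pattern, and the retrain is just a rebuild of the dataloaders; a sketch of what I ran (note shutil.move can raise when the destination name already exists, which is the exception mentioned above):

import shutil

# Apply the cleaner's selections: delete the marked files, move the relabeled ones.
for idx in cleaner.delete():
    cleaner.fns[idx].unlink()
for idx, cat in cleaner.change():
    shutil.move(str(cleaner.fns[idx]), path/cat)

# Rebuild the DataLoaders from the cleaned files and retrain.
dls = critters.dataloaders(path, bs=64)
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(3)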