Signup sheet!!
2 problem sets, 1 oral paper presentation, final research project
Scribe system (but not today)
Overall goal: to prepare students to do research in vision or medical imaging
Working definition: extracting useful information from images
In particular, information about image content
Certain formal problems [high-dimensional inverse problems with spatial constraints]
Elements of psychology, engineering, mathematics
From a technical point of view, interplay of statistics and geometry
From a very engineering point of view, getting info out of (typically) 512-by-512 arrays of 8- or 24-bit values
It certainly looks easy (any child can see…), but about 1/3 of your brain is doing it
Ill-posed and ill-defined problems
Inverse problems are ill-posed in the sense of Hadamard (a solution may fail to exist, to be unique, or to depend continuously on the data)
Perception problems are always ill-posed, since the goal is to recover info about the world
Vision is going from the (2D) image to the (3D) scene; graphics is the opposite
Worse still, the problems in vision are ill-defined without a task
There is no formal specification for nearly any vision problem, and few doable tasks
In terms of engineering, to build a vision system requires hooking together unreliable components whose lies cannot be checked
There are many ugly engineering problems, due to e.g. bad cameras, slow computers and buses, etc. (these are slowly being solved).
This is not an exhaustive course – there are many areas of vision you will never hear about in 664
This is mostly by choice – the Cornell vision group (= RDZ) has a fairly strong bias
Pro-algorithms, i.e. computational
Discrete math (much of vision is continuous)
Task-oriented
Try to make minimal assumptions, and reasonable ones (this is partly an AI legacy).
About 30 years old (the MIT Summer Vision Project dates from 1966)
Draws primarily from EE, then CS, then psych (but not in 664!)
About 1 major conference per year w/ 500 people, 2 major journals (see links page)
Originated largely in AI, but is now totally distinct (and proud of it…)
Somewhat related to image processing, 2D signal processing
There is no formal definition; here is an RDZ intuition
Sometimes the fields really do cross, e.g. MPEG-4
Tasks have evolved over time
Classic tasks:
Recognition of tanks
Recognition of characters (OCR)
Industrial inspection
Robotics?
In the last 5-10 years lots of interesting new tasks:
Search engines for images/image databases
Video surveillance
Security applications: identify people via faces, iris, fingerprints (in decreasing order of how much vision is involved)
HCI (Bill Gates’ favorite examples)
Multimedia apps, e.g. compression
Graphics applications!
A new emphasis this year on medical imaging applications
From a technical point of view, a huge amount of overlap with non-medical vision
But also some specific quirks
You can do a final project in any area of vision
There is a list of topics in the 1st day handout.
Order of topics is roughly “low level/early” to “high level/late”
Distinction: low level involves direct operations on the pixels;
high level involves intermediate representations
Most of computer vision is low level/early; high-level vision is perhaps premature as a field
A very brief overview of the 1st half of the course:
There is a classic vision problem (pixel labeling) that is extremely important
It’s vital for almost any application
It provides a nice intro to some of vision’s mathematical tools and techniques
So, we’ll be talking about it in detail, starting today.
Consider taking a point X-ray (photon) source, an object to be x-rayed, and a detector.
Detector counts photons per unit time, which is what we measure
A pixel value tells us the average density (to X-rays) of the material within a small solid angle
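The measurement above can be sketched numerically. This is a minimal model assuming the standard Beer-Lambert attenuation law (my assumption; the notes only say a pixel reflects average density): the photon count falls off exponentially with the integrated density along the ray.

```python
import math

def detected_count(source_rate, densities, step=1.0):
    """Photons per unit time reaching the detector after the beam
    passes through material samples with the given densities.
    (Hypothetical sketch; Beer-Lambert attenuation assumed.)"""
    line_integral = sum(d * step for d in densities)  # density integrated along the ray
    return source_rate * math.exp(-line_integral)

bone_ray = detected_count(1000.0, [0.5, 0.5, 0.5])    # dense path through bone
tissue_ray = detected_count(1000.0, [0.1, 0.1, 0.1])  # less dense soft-tissue path
assert bone_ray < tissue_ray  # denser material -> fewer photons counted
```

Displays conventionally invert this, which is why bone shows up bright even though it transmits fewer photons.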
Now consider a pinhole camera looking at a scene
The geometry is a little more complex, but basically similar
What we are measuring is the brightness of a patch of the world (scene element)
Back to the X-ray. Suppose that bone is very bright and soft tissue/air is very dark.
The picture we get should, perhaps, be 200s (bone) and 50s (other).
Ideally we would see only these two values. BUT, a wide variety of processes, which
we lump together as “noise”, give us slightly and randomly different values.
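A tiny numeric sketch of this situation (the values match the notes, but the additive Gaussian noise model is my illustrative assumption; the notes have not specified the noise yet):

```python
import random

random.seed(0)  # reproducible illustration

# "True" image: 200 at bone pixels, 50 elsewhere.
true_image = [200, 200, 50, 200, 50, 50]

# Observed image: true value plus noise (Gaussian here, an assumption).
observed = [t + random.gauss(0, 10) for t in true_image]
# Each observed value is close to, but almost never exactly, 200 or 50.
```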
Suppose we really want to know which pixels are bone and which are not.
(Why do we care? A good example comes from angiography, where you are looking at an artery into which you’ve injected some radio-opaque dye. To find a stenosis, or to measure its severity, you’d really like to know, for each individual pixel, whether it is blood or vessel.)
So here is our problem, usually known as “image restoration”, sometimes called “denoising”.
There is a “true” value at each pixel, which we are trying to figure out. What we get as input
is the true value plus some noise.
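Given that model, a first-cut restoration is per-pixel: label each pixel with whichever of the two true values is closer to the observation, i.e. threshold at the midpoint 125. This per-pixel maximum-likelihood framing is mine; the notes so far only pose the problem. Noise values are fixed below for illustration.

```python
BONE, OTHER = 200, 50  # the two possible "true" pixel values

true_image = [BONE, OTHER, OTHER, BONE, BONE, OTHER]
# Observed = true + noise; noise values fixed here for illustration.
observed = [212.3, 41.7, 66.0, 189.5, 230.1, 48.8]

midpoint = (BONE + OTHER) / 2  # 125: equidistant from the two values
restored = [BONE if x > midpoint else OTHER for x in observed]
assert restored == true_image  # succeeds when noise is small vs. the gap
```

With heavier noise this independent per-pixel rule makes isolated mistakes, which is where the spatial constraints mentioned earlier come in.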