Selecting the best processes for document imaging – 2

This is the second post on how to select relevant business processes for document imaging. You can read the first post here.

Let’s look at another example of how to decide whether a particular process is suitable for document imaging. Imagine that we have a business objective to speed up our processes in order to reduce the response time to customer queries. The first thing we need to do is list all the processes relevant to customer interactions. Then we need to list the document types for each process and identify the relevant characteristics of each document, such as:

  • Approximately how many documents are accessed / retrieved in a day?
  • How easy is it to find a particular document, and how long does it take?
  • What is the cost involved?
  • What is the impact of reducing that access time?
  • How long do we need to keep that document?

 

| Document Type | Documents accessed per day | Time / cost to access a document | Importance / impact of quick access | Retention period |
| --- | --- | --- | --- | --- |
| Insurance Policy | 150 | 10 mins | High | 7 years |

         

Once we have these details, we can identify whether that document type is a good candidate for document imaging. Of course, there is no fixed rule or formula for selection, as it varies according to your objective. The key is to have specific criteria so that the selection is more objective, and to decide on selection criteria that align with your business goals.

As an example, we can use selection criteria based on the above table, such as:

“If the impact is high and at least one other criterion is high, then that document type is suitable for imaging”
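A rule like this is easy to write down as a small check. The following is only a sketch: the cut-off values that decide when a criterion counts as “high” are hypothetical and would have to come from your own business objective.

```python
from dataclasses import dataclass


@dataclass
class DocumentType:
    name: str
    daily_accesses: int      # documents accessed per day
    access_minutes: float    # time it takes to access one document
    impact: str              # "low", "medium" or "high"
    retention_years: int


# Hypothetical cut-offs for what counts as "high"; tune them to your own goals.
HIGH_DAILY_ACCESSES = 100
HIGH_ACCESS_MINUTES = 5
HIGH_RETENTION_YEARS = 5


def is_suitable_for_imaging(doc: DocumentType) -> bool:
    """Impact must be high AND at least one other criterion must be high."""
    if doc.impact != "high":
        return False
    other_criteria_high = [
        doc.daily_accesses >= HIGH_DAILY_ACCESSES,
        doc.access_minutes >= HIGH_ACCESS_MINUTES,
        doc.retention_years >= HIGH_RETENTION_YEARS,
    ]
    return any(other_criteria_high)


# The example row from the table above.
insurance_policy = DocumentType("Insurance Policy", 150, 10, "high", 7)
print(is_suitable_for_imaging(insurance_policy))
```

Writing the rule this way forces you to state what “high” means for each column, which is exactly what makes the selection objective.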

One important thing to note is that the selection criteria should be aligned with your business objective. Otherwise they may result in an incorrect selection that does not help achieve the business goal. As an example, think about the following criteria:

“If the number of documents accessed per day is higher than 1,000 and the retention period is longer than 3 years, then that document is suitable for imaging”

Digitizing such documents will certainly benefit the organization in terms of archiving, saving working space and so on. But it will not help achieve your original business goal unless the impact of quickly accessing the document is high.

To summarize, the idea here is to have an objective selection process based on criteria that align with the business objectives.

Selecting the best processes for document imaging

When implementing a document imaging solution it is very important to select the processes that will bring the maximum ROI. Not all processes are suitable for document imaging. So we need to evaluate each process carefully in order to select the most relevant ones. Business needs will definitely vary from one process to another and a single product may not be able to cater for all requirements.

Let’s look at an example. Assume that your objective is to reduce the usage of paper. It could be because you need to minimize the cost of printing or because you have a sustainability initiative. It is not possible to reduce the usage of paper in all business processes, so we need to analyze them and identify the best few processes for this. To do this we need to identify the document types involved in a particular process and estimate the number of physical copies being used. When you start doing this, you may be surprised to find that multiple copies of the same document are being printed. There was one project where we discovered that a particular document was being printed 6-8 times by different teams.

You can follow these simple steps to identify such processes:

1. List all the processes that generate paper documents

E.g.

1. Accounts Payable

2. Petty Cash reimbursement

3. Work Order process

2. For each process, identify the associated document types

E.g. for the accounts payable process

– Invoice

– GRN

– Inspection Report etc.

3. For each document type, estimate the daily volume / transactions and the number of copies made of the same document.

| Document Type | Daily Volume | No. of physical copies |
| --- | --- | --- |
| Invoice | 1000 | 2 |
| GRN | 600 | 2 |
| Inspection Report | 800 | 3 |

4. A process with a high number of physical copies is a good candidate for imaging.
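To compare document types, it can help to multiply the daily volume by the number of copies. The sketch below does this for the figures in the table; ranking by this product is my own assumption for illustration, not a fixed rule.

```python
# (daily volume, number of physical copies) taken from the table above.
daily_volume_and_copies = {
    "Invoice": (1000, 2),
    "GRN": (600, 2),
    "Inspection Report": (800, 3),
}


def daily_printed_copies(table):
    """Return the number of physical copies produced per day for each document type."""
    return {doc: volume * copies for doc, (volume, copies) in table.items()}


totals = daily_printed_copies(daily_volume_and_copies)

# Rank the document types by the number of copies printed per day, highest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for doc, printed in ranked:
    print(f"{doc}: {printed} copies per day")
```

With these figures the Inspection Report comes out on top, even though it has a lower daily volume than the Invoice, because three copies are made of each one.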

Once identified, it needs to be discussed in further detail in order to establish whether it is really possible to reduce the number of copies.

  • How many copies of a particular document are printed?
  • If it is just 1 copy, is it practical to reduce that to 0?
  • Do you still want a physical signature?
  • Can the finance team accept electronic copies only? Or are there any supporting documents being printed that we can reduce?

8 Image enhancement techniques in document capture

Two main concerns in any document imaging exercise are image quality and file size. Everyone wants the best possible image quality while keeping the file size to a minimum, for obvious reasons. Thus image enhancement has become an essential step in a well-defined capture workflow. The purpose of image enhancement (also called image cleanup or image processing) is to make the images more readable and to remove unwanted noise, which in turn reduces the storage requirements. This is especially important for forms processing / OCR applications, where it improves character recognition. There are a number of image enhancement techniques available today. Described below are 8 such techniques.

1. Deskewing

In a production scanning setup, document pre-processing is the most time-consuming step. One objective of this step is to arrange the documents correctly by rotating incorrectly filed documents and aligning them together. The de-skew facility in production capture applications helps to reduce this effort by automatically straightening pages that were misaligned during document feeding, within a specified range of degrees.
A more advanced feature, called content-based rotation, is available with Kofax VRS. VRS can analyze the content of the image and correct the orientation accordingly.
Here is a nice illustration called “The Effects of Deskewing a Document” on ScanHelp.com.

2. Black border cropping & removing

Cropping refers to the removal of the outer parts of an image. In document scanning, black border cropping is used to remove the unnecessary black borders from an image. Cropping removes the borders completely, which also reduces the image height and width. However, this does not reduce the resolution of the image. (This is an illustration of border cropping.)
The other technique, called black border removal, replaces the black pixels in the borders with white pixels. Unlike cropping, this does not reduce the image dimensions.
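The difference between the two techniques is easiest to see on a tiny bitonal image. This is a minimal sketch, assuming the image is a list of rows where 1 is a black pixel and 0 is a white pixel, and that the border consists of edge rows and columns that are entirely black. Real capture software has to cope with ragged and skewed borders, which this does not attempt.

```python
BLACK, WHITE = 1, 0


def border_bounds(image):
    """Return (top, bottom, left, right) of the area left after skipping all-black edges."""
    top, bottom, left, right = 0, len(image), 0, len(image[0])
    while top < bottom and all(p == BLACK for p in image[top]):
        top += 1
    while bottom > top and all(p == BLACK for p in image[bottom - 1]):
        bottom -= 1
    while left < right and all(row[left] == BLACK for row in image[top:bottom]):
        left += 1
    while right > left and all(row[right - 1] == BLACK for row in image[top:bottom]):
        right -= 1
    return top, bottom, left, right


def crop_border(image):
    """Black border cropping: cut the border away, so the image gets smaller."""
    top, bottom, left, right = border_bounds(image)
    return [row[left:right] for row in image[top:bottom]]


def remove_border(image):
    """Black border removal: paint the border white, so the dimensions stay the same."""
    top, bottom, left, right = border_bounds(image)
    return [
        [p if top <= y < bottom and left <= x < right else WHITE for x, p in enumerate(row)]
        for y, row in enumerate(image)
    ]


page = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
cropped = crop_border(page)   # 3 x 3 image
cleaned = remove_border(page) # still 5 x 5, border now white
```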

3. De-speckling / Noise reduction

When scanning old documents we usually get unwanted dots (speckles) in the background. These come in two forms: black speckles on a white background and white speckles on a black background. This is also known as salt-and-pepper noise. (This is an example of an image with salt-and-pepper noise.)
Whatever the form, the noise makes image compression less effective and increases the file size. De-speckling (also known as noise reduction) is the process of removing such unwanted speckles from the image background. (Illustration: noise removal)
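One very simple way to de-speckle a bitonal image is to flip any pixel whose neighbours all have the opposite colour. The sketch below assumes a list of rows with 1 for black and 0 for white; it only catches single-pixel speckles, whereas commercial products usually let you set the maximum speckle size.

```python
def despeckle(image):
    """Flip isolated pixels: a pixel is a speckle if every neighbour has the other colour."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if neighbours and all(n != image[y][x] for n in neighbours):
                cleaned[y][x] = neighbours[0]
    return cleaned


noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # a single black speckle
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],  # two connected pixels, treated as real content
    [0, 0, 0, 0, 0],
]
clean = despeckle(noisy)
```

Because the same test works in both directions, it removes black speckles on white as well as white speckles on black.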

4. Colour drop out

Colour dropout is a proven technique for forms processing applications such as census projects. The idea is to discard the text boxes and lines of a scanned form so that only the filled-in data remains, which increases the OCR recognition rate. Earlier scanners used specific coloured lamps to achieve this (e.g. Blue Imaging Color Drop-Out Element for Kodak 9520/9500). Now this has been improved and is achieved by software.
Colour dropout accuracy depends directly on the printing quality of the forms. Only selected colours (shades of red, blue and green) can be dropped, and these vary from scanner to scanner. Therefore it is essential to print the forms in the recommended Pantone colours (e.g. Fujitsu PANTONE Dropout Confirmation Listing).
This is a very informative article on colour dropout by the Document Doctor.

5. Thresholding

Thresholding is a technique used when scanning grayscale images and saving them as black & white. A grayscale image typically has 8 bits per pixel (representing 256 shades of gray), while a black & white image has 1 bit per pixel (representing either black or white). When converting from grayscale to black & white (for example, scanning a photograph in black & white mode), each pixel, whatever its shade of gray, must be converted to either black or white. The point of separation is called the threshold. Changing the threshold value changes the output image quality.
A single threshold value applied to the whole image is called fixed thresholding, which is ideal for separating solid colours (e.g. text) from the background. However, for images with various shades of gray, an advanced version called adaptive thresholding is used. In adaptive thresholding the threshold value is calculated independently for each pixel based on the local contrast. Scanner manufacturers and capture application vendors have come up with many different technologies and algorithms for this, such as Kodak iThresholding, developed on Adaptive Threshold Processing (ATP).
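The difference between the two approaches can be sketched in a few lines. This example assumes 8-bit grayscale values (0 is black, 255 is white) and uses a simple local-mean rule for the adaptive version; the window radius and offset are made-up parameters, and vendor algorithms are certainly more sophisticated than this.

```python
def fixed_threshold(gray, threshold=128):
    """Every pixel darker than one global threshold becomes black (1), the rest white (0)."""
    return [[1 if p < threshold else 0 for p in row] for row in gray]


def adaptive_threshold(gray, radius=1, offset=10):
    """A pixel becomes black only if it is clearly darker than the mean of its neighbourhood."""
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            out_row.append(1 if gray[y][x] < local_mean - offset else 0)
        result.append(out_row)
    return result


# One scan line: bright paper on the left, a shadow on the right.
# The "text" pixels are the 120 (on bright paper) and the 40 (in the shadow).
scan_line = [[200, 120, 200, 150, 100, 40, 100]]

fixed = fixed_threshold(scan_line)
adaptive = adaptive_threshold(scan_line)
```

On this scan line the fixed threshold turns the shadowed background black along with the text, while the adaptive version keeps only the two text pixels.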

6. Line Removal

Line removal is a very useful feature, especially for OCR applications. It is used to remove unwanted lines from scanned images. These lines could be either actual content or noise. Most application forms (credit card applications, account opening forms etc.) contain text boxes. Although such lines are actual content of the document, they interfere with character recognition and hence are unwanted. Also, when scanning folded documents or fax copies, there is a high possibility of getting unwanted horizontal lines in the scanned image. These lines, especially vertical ones, can interfere with the OCR process. If any text intersects with these lines, it appears broken in the scanned image, resulting in incorrect text recognition.
When line removal is used, these unwanted lines are not included in the scanned image, resulting in a clean image optimized for character recognition. Characters that are broken by horizontal lines are also corrected. Furthermore, line removal reduces the file size.
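As a rough idea of how this can work, the sketch below erases any horizontal run of black pixels that is longer than a given length, on the assumption that text strokes are short and form lines are long. The image is a list of rows with 1 for black and 0 for white, and min_length is a hypothetical parameter. Unlike the products described above, this sketch does not repair characters that the line passed through.

```python
def remove_horizontal_lines(image, min_length=20):
    """Erase every horizontal run of black pixels that is at least min_length long."""
    cleaned = [row[:] for row in image]
    for y, row in enumerate(image):
        x = 0
        while x < len(row):
            if row[x] == 1:
                start = x
                while x < len(row) and row[x] == 1:
                    x += 1
                if x - start >= min_length:
                    for i in range(start, x):
                        cleaned[y][i] = 0
            else:
                x += 1
    return cleaned


form = [
    [0, 1, 1, 0, 0, 1, 0, 0],  # short runs: text strokes
    [1, 1, 1, 1, 1, 1, 1, 1],  # long run: a form line
    [0, 1, 0, 0, 0, 0, 0, 0],
]
without_lines = remove_horizontal_lines(form, min_length=6)
```

The same idea applied column by column would handle vertical lines.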

7. Punch Hole filling

When filed documents with punched holes are scanned, most of the images show these holes as black spots. In addition to spoiling the appearance of the image, this causes two main problems. First, if the file contains a large number of documents and the left margin is not adequate, these black spots can interfere with the actual content of the document. Second, black spots on blank pages can interfere with automatic blank page deletion, since they may be recognized as actual content. Earlier, these black marks were removed manually, which required a lot of time and effort. With the advancement of image processing applications such as Kofax VRS, this can now be automated. The feature replaces the colour of such black spots with the surrounding image colour. Most such applications take into consideration the dimensions and locations of the black spots and compare them with the different manufacturer specifications and standards.

8. Blank Page Deletion

Blank page deletion is useful when scanning in duplex mode a batch where only some documents have information on both sides; without it, the scanner operator has to delete the blank pages manually. Automatic blank page deletion deletes pages based on a specified threshold value (in bytes). When the size of a page image is less than the threshold value, it is considered a blank page and is automatically deleted. Selecting this value depends on the document type and the scanner being used, and is usually done after some testing with a few experimental values. For blank page removal to be effective, it is essential to use some of the features described above, such as black border removal, de-speckling, line removal and punch hole filling.
A common issue with blank page deletion is the bleed-through effect, where content on one side of the paper shows through on the other side, especially with very thin paper. Because of this, the blank page is mistakenly recognized as having actual content. Advanced capture applications such as Kofax VRS try to address this by differentiating between actual content and bleed-through.
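The size-based rule itself is very simple, as the sketch below shows. The file names, sizes and the 3,000 byte threshold are all invented for illustration; in practice you would arrive at the threshold by testing with your own documents and scanner.

```python
def split_blank_pages(page_sizes, threshold_bytes):
    """Split (name, size in bytes) pairs into pages to keep and pages treated as blank."""
    kept, blank = [], []
    for name, size in page_sizes:
        if size < threshold_bytes:
            blank.append(name)
        else:
            kept.append(name)
    return kept, blank


# Hypothetical duplex batch: compressed image size per side, in bytes.
pages = [
    ("page_001_front.tif", 48210),
    ("page_001_back.tif", 1530),   # blank reverse side compresses to almost nothing
    ("page_002_front.tif", 51877),
    ("page_002_back.tif", 39640),
]
kept, blank = split_blank_pages(pages, threshold_bytes=3000)
```

This also shows why the clean-up features matter: speckles, borders or punch holes on a blank side make its compressed file larger and can push it above the threshold.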


Kodak acquires Bell + Howell scanner division


Last week, Kodak announced that it has completed the acquisition of the scanner division of BÖWE BELL + HOWELL. The initial announcement was made at the beginning of this year. They were expecting to finish the deal by the end of the first quarter, and it took longer than expected. The financial terms are still not disclosed.

Having experienced scanners from both manufacturers, this is exciting news for me. As we know, Kodak, a pioneer in document imaging, introduced a number of innovative models and features (such as straight path paper handling, perfect page scanning, detachable flatbed etc.). On the other hand, B&H has a good reputation for high-speed, robust scanner models. So we can expect more exciting products as a result of the combined strength of the two organisations.

I am eager to see what Fujitsu, another leader in document imaging, will do in return.

Read the press release here

Selecting the right scanner, 8 things to consider

1. Scanning speed

Most of the time when we are presenting a solution to a client, there is one question that always pops up: “What is the speed of your scanner?” Almost everyone thinks that the higher the speed of the scanner, the higher the output. Well, my personal opinion is that this is not always so. Beyond a certain level, the scanner will sit idle because the operator cannot feed documents at that speed. Especially when scanning mixed documents, it is not possible to use the scanner at full speed.
The scanning speed is measured either in ppm (pages per minute) for simplex scanners or ipm (images per minute) for duplex scanners. Today there are scanners ranging from 20 ppm to 200 ppm. Unless it is for a service bureau (or a similar operation scanning very high volumes of the same document type), I prefer a scanner in the range of 40 ppm or below.
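To put rough numbers on the idle-scanner argument, here is a small calculation. It assumes the operator works in batches of 50 pages and needs 60 seconds to prepare and load each batch; both figures are made up for illustration and will differ in your operation.

```python
def effective_ppm(rated_ppm, batch_pages, handling_seconds_per_batch):
    """Pages per minute actually achieved once per-batch handling time is included."""
    scan_minutes = batch_pages / rated_ppm
    handling_minutes = handling_seconds_per_batch / 60
    return batch_pages / (scan_minutes + handling_minutes)


for rated in (40, 200):
    actual = effective_ppm(rated, batch_pages=50, handling_seconds_per_batch=60)
    print(f"rated {rated} ppm -> about {actual:.1f} ppm in practice")
```

Under these assumptions a scanner rated five times faster delivers less than twice the real output, because the handling time per batch stays the same.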

2. Document size (max & min)

One of the main factors that decide the price of a scanner is the maximum size of a document that can be scanned. Generally there are scanners which can scan either up to

  • A4 / legal
  • A3
  • larger than A3
Most of the documents that we get are either A4 or legal size. But in practice I prefer to go for an A3 scanner, since there are a lot of non-standard size documents. (I wrote a separate post about the importance of document sizes here.)
Also, there are a few scanners that can scan extra-long documents such as

3. Feeder : Flat bed / ADF

Production scanners come either with the auto feeder (ADF) only or with a flatbed as well. Some new scanner models (like the Kodak 1400 series) have a detachable flatbed, which can be very ergonomic depending on the work layout.
A flatbed is required if you need to scan bound documents, books, fragile or very delicate documents, or files whose pages cannot be separated. Especially when it comes to scanning legal documents such as contracts or deeds, a flatbed is required, since these come in double legal size and cannot be separated.

4. Simplex / Duplex

Most of today’s scanners are duplex, meaning that they can scan both sides of the document at the same time. However, there are simplex scanners as well.
Unless you are specifically going to scan single-sided documents, it is always better to go for a duplex scanner.

5. Scanning Mode – Colour / B&W

There are 3 main output formats in document scanning:

  • B&W / bitonal
  • Grayscale
  • Colour

Most of today’s scanners can scan in all 3 formats. However, there could be models that do not support colour scanning. (One such model is the Kodak file master, a book scanner that cannot scan in colour; I think this has been discontinued now.)

Also, there are some scanners (Kodak i150 and most others) that support dual stream output. That is, they can scan a colour document and save 2 images in two formats at the same time, e.g. B&W and colour.
(This article, “To Scan In Color Or Not?” – by Scott Blau, CEO of Datacap, provides an overview of colour and grayscale scanning.)


6. Resolutions (optical & output)

When it comes to selecting a scanner (especially for high quality scanning), resolution is a main factor. There are two resolutions to consider.

1. Optical resolution: This is the actual resolution at which the scanner is capable of scanning. So this is the important measurement.

2. Output resolution: This is the enhanced or maximum resolution that the scanner can produce using interpolation. Output resolution is always greater than the optical resolution. But it does not add any real detail; it just enlarges the image by adding extra pixels artificially. So this is not as important a measurement as the optical resolution.
(This article on scantips.com gives a detailed explanation on image interpolation)
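The point about interpolation can be shown with the simplest possible method, nearest-neighbour upscaling, where every pixel is just repeated. This is only a sketch to illustrate the principle; scanners typically use smoother interpolation methods, but those also compute the extra pixels from the ones that were actually scanned.

```python
def upscale_nearest(image, factor):
    """Enlarge an image by repeating every pixel 'factor' times in both directions."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


# A 2 x 2 "optical" scan of grayscale values.
optical = [
    [10, 200],
    [90, 30],
]
interpolated = upscale_nearest(optical, 2)  # 4 x 4 "output resolution"

# Four times as many pixels, but exactly the same set of values as before.
values_before = {p for row in optical for p in row}
values_after = {p for row in interpolated for p in row}
```

The output has more pixels, and therefore a higher dpi figure and a larger file, yet it contains no information that was not already in the optical scan.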


7. Drivers and capture software

Every scanner today comes bundled with one or several scanning applications and drivers. There are a few things to consider here.

  • Drivers: There are two main driver standards, TWAIN and ISIS. TWAIN is an open and freely available standard intended for consumer-level scanners. ISIS is a proprietary standard recommended for high-speed production scanners. Most scanners support both. However, if you intend to use a separate specialised capture application instead of the one that comes with the scanner, you need to check the driver compatibility. As an example, the recently introduced capture tool Kofax Desktop works with TWAIN drivers only. Also, the famous ScanSnap series by Fujitsu supports neither of these drivers and works only with a ScanSnap-specific driver.
  • Output formats: The combination of the scanner and the capture application decides which file types (TIFF, PDF, PDF/A etc.) can be produced.
  • Image enhancements: There are interesting and very useful image enhancement features associated with different scanners. A good example is the “perfect page” feature in Kodak scanners. (pdf – perfect page matrix)

8. Specialised scanners and accessories

The last area to consider is whether you need specialised scanners such as;



Scanning in DjVu

My new project has a requirement to scan the colour documents in DjVu format, so I thought of writing about this somewhat unfamiliar file format.

Have you heard of “Deja vu”? As I understand it, in French this means something like “familiar” or “already experienced”. It is used to describe the weird feeling that most of us have experienced, where we come across a new situation or person and feel as if it has happened before, although we cannot recall the exact situation. There could be several religious interpretations of this, but as far as I know there is no accepted scientific explanation yet (at least I couldn’t find any).

I don’t know why they used the same name, but DjVu is a file format similar to PDF that produces significantly smaller files. It was developed by AT&T, and the commercial rights were later transferred to Lizard Tech. Last year they were transferred again, to Celartem Technology, the parent company of Lizard Tech. However, DjVu is a free file format, which means the specifications and the reference libraries are freely available. As with PDF, any user can view a DjVu document by installing a freely available browser plug-in. The commercial ownership applies only to the encoding technology.

Below are some interesting comparisons from DjVu.org. (I am yet to test these in practice)

  • Scanned pages at 300 DPI in full color can be compressed down to 30 to 100KB files from 25MB.
  • Black-and-white pages at 300 DPI typically occupy 5 to 30KB when compressed
  • For color document images that contain both text and pictures, DjVu files are typically 5 to 10 times smaller than JPEG at similar quality.
  • For black-and-white pages, DjVu files are typically 10 to 20 times smaller than JPEG and five times smaller than GIF.
  • DjVu files are also about 3 to 8 times smaller than black and white PDF files produced from scanned documents
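The 25MB figure in the first point can be sanity-checked with a little arithmetic. The sketch below assumes a letter-size page (8.5 x 11 inches) stored uncompressed at 24 bits (3 bytes) per pixel, and takes 1 KB as 1,000 bytes; the page size is my assumption, since the comparison does not state it.

```python
def uncompressed_bytes(width_inches, height_inches, dpi, bytes_per_pixel):
    """Size of a raw scanned page: pixels across x pixels down x bytes per pixel."""
    width_px = round(width_inches * dpi)
    height_px = round(height_inches * dpi)
    return width_px * height_px * bytes_per_pixel


raw = uncompressed_bytes(8.5, 11, dpi=300, bytes_per_pixel=3)
print(f"raw page: {raw:,} bytes (about {raw / 1_000_000:.1f} MB)")

# Compression ratios implied by the quoted 30 to 100 KB range.
for compressed_kb in (100, 30):
    ratio = raw / (compressed_kb * 1000)
    print(f"{compressed_kb} KB file -> roughly {ratio:.0f}:1")
```

So a raw colour page at 300 DPI is indeed about 25MB, and the quoted DjVu sizes correspond to compression ratios of very roughly 250:1 to 840:1.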

This is a graphical comparison done by Lizard Tech;

There are several important technologies used in DjVu that make it possible to have very clear images at such small file sizes. The first is the compression technology. Unlike other compression schemes, DjVu compresses a page as 3 images: the foreground image, the background image and the mask image. The mask image, which is in high resolution, stores the text layer and uses a special compression technique. It compresses a particular character shape only once, and instead of recording every other occurrence of the same character it records only the locations of the subsequent occurrences. The other two image layers are stored in colour at low resolution. Thanks to this compression technology, a DjVu file with a lot of text is significantly smaller than a similar file in PDF. Also, the decompression of a DjVu file is done in several steps, so the user gets an initial view very quickly, and the full quality image is displayed a few moments later.
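The “store each character once” idea can be illustrated with a toy encoder. This is not the actual DjVu algorithm, just a sketch of the principle: each distinct character bitmap goes into a dictionary once, and the page is then described as a list of (dictionary index, x, y) placements. A real encoder would also have to match shapes that are similar but not pixel-identical, which this sketch does not attempt.

```python
def encode_symbols(glyphs):
    """glyphs: list of (bitmap, x, y). Returns (dictionary of unique bitmaps, placements)."""
    dictionary = []   # every distinct bitmap is stored exactly once
    index_of = {}     # bitmap -> its position in the dictionary
    placements = []   # (dictionary index, x, y) for every occurrence on the page
    for bitmap, x, y in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return dictionary, placements


def decode_symbols(dictionary, placements):
    """Rebuild the original list of (bitmap, x, y) from the dictionary and placements."""
    return [(dictionary[i], x, y) for i, x, y in placements]


# Tiny made-up bitmaps for the letters "n" and "o" ('#' is a black pixel).
N = ("# #", "## ", "# #")
O = ("###", "# #", "###")

# The word "noon" at four horizontal positions on one text line.
word = [(N, 0, 0), (O, 4, 0), (O, 8, 0), (N, 12, 0)]
dictionary, placements = encode_symbols(word)
```

Four characters are reduced to two stored bitmaps plus four small position records, and the saving grows with the amount of text on the page.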

These features make DjVu an ideal format for scanning colour text documents for electronic distribution. Who knows, DjVu may even replace PDF, especially when it comes to scanned colour documents such as text books. The famous Million Book Collection is an example of using the DjVu format extensively. It offers more than 1.5 million full-text books freely in open formats such as HTML, TIFF and DjVu.


Document Imaging with SharePoint

Last week Kodak announced the incorporation of “direct scan to SharePoint” capability into their Smart Touch button. This move is no surprise, considering that their main competitors implemented it some time back. Fujitsu was the first to incorporate direct SharePoint scanning into all of their scanners, but other players also had this functionality in some of their products. The One Touch button of Visioneer scanners can be configured to scan directly to SharePoint. Xerox also uses the same One Touch feature, while HP scanners are equipped with Smart Document Scan, which is capable of the same. I think Kodak, being the pioneer in document imaging, should have done this a long time back. Anyway, the point is why every scanner manufacturer is moving in this direction, and what the impact will be on SharePoint as well as on specialised imaging applications like Kofax.

As we know, MOSS (Microsoft Office SharePoint Server 2007) consists of 6 major feature areas, as shown below.

[image: the six major feature areas of MOSS 2007]

Of these six pillars, Microsoft describes content management as “The facilities for the creation, publication, and management of content, regardless of whether that content exists in discrete documents or is published as Web pages”. This is further elaborated under 3 areas;

  1. Document management focuses on working with electronic documents (more specifically, MS Office documents). This includes features such as check-in / check-out, versioning, offline sync, content types and templates, search and workflows.
  2. Records management is about keeping and disposing of electronic content at appropriate times. This includes features such as information management policies, auditing, bar-coding and routing.
  3. Web content management provides features such as content publishing and deployment, publishing templates and rendering navigation.

Looking at these, we can clearly see that the SharePoint content management feature area does not include document imaging facilities. MOSS is developed with a focus on working with electronic files, especially MS Office documents. It does not provide facilities to convert physical documents into electronic format before starting to actually work with them. As an example, let us consider the steps involved in getting a paper document into a MOSS document library:

  1. Scan the document using a capture application.
  2. Save the scanned image to a local folder.
  3. Navigate to the relevant document library and folder.
  4. Upload the scanned image using the navigation button or drag and drop.
  5. Enter the metadata (document properties).

Imagine a user working on a document library. Rather than minimizing the browser window and launching the capture application to scan a document, wouldn’t it be nice if the user could scan and save the image directly into the MOSS library?

As I wrote in my previous post, any ECM solution should consider the management of paper documents in the organisation. That is, it should provide the facilities to convert native paper documents into an appropriate electronic format that is suitable for actual processing. MOSS alone does not have this capability and relies on independent software vendors to provide suitable solutions to fill this gap.

There are several products that we can use to scan documents and save in to SharePoint;

I haven’t had an opportunity to use most of these apps, but they look pretty simple and straightforward. Since all of them come at a price, it is no surprise that scanner manufacturers try to gain a strategic advantage by incorporating this facility into their scanners free of charge. This will definitely make it easier for end users to add documents to SharePoint document libraries. The move will also have an impact on production capture applications such as Kofax. I will write about this in a separate post, especially about the Kofax release script for SharePoint.
