Tech Tip: Removing Extra Line Breaks from PDF Articles

The Problem

You find a great article as a PDF. The document is nicely formatted, and allows you to copy out the text. However, when you copy-paste to a word processor, what was once nicely formatted text:



turns into text that either wraps in really stupid places:



Or doesn't fill up the available width, which looks pretty stupid as well:



Why it happens

For whatever stupid reason, text copied out of most PDFs comes with a hard "line break" character at the end of each printed line. This means that if your Word document has different margins or font size than the PDF (as it surely will), the text will have extra line breaks in places it shouldn't. We can see these extra line break characters in a text editor:



The CR and LF characters are interpreted by Word as "Start a new paragraph here." That's the problem.
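
If you don't have a text editor that shows control characters, a few lines of Python can reveal them too. This is only a rough sketch: the sample string is made up, and it assumes the pasted text uses Windows-style CR LF line endings.

```python
# Hypothetical sample: two printed lines as they might arrive from a PDF copy.
pasted = "You find a great article\r\nas a PDF."

# repr() shows the invisible characters as escape sequences:
# CR appears as \r and LF appears as \n.
print(repr(pasted))

# Count how many of each character the text contains.
cr_count = pasted.count("\r")
lf_count = pasted.count("\n")
print("CR:", cr_count, "LF:", lf_count)
```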

The Solution

We need to remove all the extra line breaks from the document. Word and OpenOffice both allow you to do this using the "Find and Replace" dialog, but the technique is not obvious.


  1. Hit CTRL+F or click Edit>Find and Replace

  2. In Word, click the tab at the top of the dialog that says "Replace"

  3. In the Find box, enter ^p for Word or $ for OpenOffice

  4. In OpenOffice, click "More" and then check "Regular Expressions"

  5. In the Replace box, type a single space character (i.e. hit the spacebar once)

  6. Click "Replace All"



This is what the box looks like in Word:



And this is what it looks like in OpenOffice:



After you hit "Replace All", all the line breaks will be replaced with spaces, so your document looks much better.
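
For anyone who would rather script this, the same replace-all idea can be sketched in Python. This is not what Word or OpenOffice do internally, just an equivalent under the assumption that line breaks show up as CR LF, a lone LF, or a lone CR. The function name remove_line_breaks and the sample text are made up for illustration.

```python
import re

def remove_line_breaks(text: str) -> str:
    """Replace every line break (CR LF, lone CR, or lone LF) with one space."""
    # \r\n is listed first so a CR LF pair becomes one space, not two.
    return re.sub(r"\r\n|\r|\n", " ", text)

# Hypothetical sample of text pasted from a PDF.
sample = "text that wraps\r\nin really\r\nstupid places"
cleaned = remove_line_breaks(sample)
print(cleaned)
```

Like "Replace All", this removes every line break, including the ones that separate real paragraphs.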



Notes


  • If you want to replace the line breaks one at a time, use "replace" instead of "replace all"

  • It's usually easier to clean the text in its own document and then copy/paste again to your card-cutting document

  • This technique is also useful for removing extra line breaks after some bozo began a new page by hitting "enter" a bunch of times instead of adding a new page break. Use ^p^p in the find box
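
The last note can be sketched in Python as well. This is a rough example that assumes you want every run of two or more line breaks squeezed down to a single one. The function name collapse_blank_lines and the sample text are hypothetical.

```python
import re

def collapse_blank_lines(text: str) -> str:
    """Collapse each run of two or more line breaks into a single one."""
    # Normalize CR LF and lone CR to LF first, so one CR LF pair
    # is never mistaken for two separate line breaks.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return re.sub(r"\n{2,}", "\n", normalized)

# Hypothetical sample: someone hit "enter" a bunch of times to reach a new page.
sample = "End of page one.\r\n\r\n\r\n\r\nStart of page two."
collapsed = collapse_blank_lines(sample)
print(repr(collapsed))
```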

22 comments:

Unknown said...

CR LF LOL

Rohan said...

Many templates have macros that do that, which might be easier

Anonymous said...

Thank you! I clean up hundreds of documents each month and had looked near and far for an easier way. Your method is MUCH simpler than anything I've found.

Mik said...

You can also use the online tools at texthandler.com, which can remove line breaks. Copy the text from the PDF, select the option "Every paragraph began by capital " and click the "execute" button.

Mik said...

texthandler.com

Anonymous said...

Hooray! Exactly what I was looking for! We have several e-mails we want to print, but didn't want to waste all the paper by printing them with all the line breaks that kept adding up. Thanks for a quick, easy solution!

Anonymous said...

Thank you very much, this was exactly what I was looking for.
Much appreciated.

Anonymous said...

Thank you, I had forgotten what the code was for line break --- ^p

pakistantourism said...

Thank you very much; you have saved a lot of time for me and others.

zinajda said...

:) saved my life

VJ said...

Thanks a lot! This is really helpful. I was using online tool textfixer.com for this but this way is easier.

Anonymous said...

Absolute lifesaver -- currently writing up notes for my dissertation and you've probably just doubled how productive I am.

Thank you!

Anonymous said...

Glad I looked for this now - you're a massive time-saver! THANK YOU!

Anonymous said...

thanks this saved me lots of time!

Anonymous said...

Wonderful, I should have looked for this years ago! Thank you very much,
Claudio

Andrew said...

The texthandler website didn't work well for my text (didn't split paragraphs with either option) so I wrote my own which splits into paragraphs where it finds a '.' at the end of a line: Format Text Page.

Hope it helps someone.

Anonymous said...

I just paste the text to Firefox address box. It removes the extra line breaks automatically. Ctrl+V, Ctrl+A, Ctrl+X

Anonymous said...

It will remove my "real-linebreak" (paragraph break) at the same time. What can I do?

Anonymous said...

Can I suggest an easier way is to load PDF Copy-Paster (http://www.onehourprogramming.com/blog/2010/9/1/fix-copy-and-pasting-in-pdfs.html). It's always available - no need to be on line and it works a treat.
