Yet another task getting data out of a fairly nasty/incoherent web-based table and moving it into LibreOffice for sorting/project planning. Presumably this is all “easier” with scripting but I don’t have the time/patience/inclination to learn, so here is my hacky method!
It worked for 95% of my data, so I had 5 of 100 lines to manually fix at the end. The whole task took only 5 minutes (if you exclude me writing this blog) so that’s a decent return on your time.
Background knowledge
To pop a csv file into LibreOffice so it has columns and rows like a table, it needs to have two things:
- A text delimiter, separating out the single items of text – that should not appear elsewhere in the text
- A line break at the end of every “row” in the table
This method relies on the content of the last column (or last few columns) being identical each time. In this case there are always numbers in the last two columns of my table, but consistent text like “Edit” or “View” would also be fine. If you don’t have that feature in your data, this guide won’t work for you without additional hackyness!
The Problem
I needed to extract a list of blogs/pages from a proprietary website system that does not provide an export method – part of a project moving a large site to Drupal (Hooray, Open Source!). I really only needed column 2 of 7, but it wasn’t possible to select just that column. There are also loads of text items/buttons in the first column, so copy pasting brings quite a lot of unnecessary data.
Step 1: copy the data
You want to select all the data: it’s easy to miss off the first part of the first line or last part of the last line, try your best to avoid that.
TIP: You don’t need to drag-select all the way down the page. Select the first bit, release your mouse, scroll down. Then hold down the shift key and select the last part of the page. This should select the entire table.
Then, paste your data into a text editor such as Gedit (Linux). This will give you something like the text below – basically a long list.
Step two: what you can use to create rows?
You don’t actually make the rows yet, but now is a great time to identify the consistent text/number items we will use create our rows later. In my case it’s two numbers: these match up with the last two columns of the table I want to re-build.
Step three: add a text delimiter
We will now do our first edit: we want to put a symbol or string of symbols at the end of each line which does not appear anywhere else. In this case, there are no @ symbols anywhere in the data (I did a text search to check) so that’s fine. So, what I want is an @ at the end of every line in the text document to say to LibreOffice “this is an discrete item of text making up one cell in my table”.
To do this I do a find and replace, searching for \n and replacing with @.
This turns our text into one big blob:
Step four: Add the row end delimiter
If we imported this to LibreOffice right now it would all go into row 1, where as we want many rows.
To resolve this, we use what we learned in step two to put in the new line break, which tells LibreOffice to put the next bit of text on a new row.
Our text was initially two numbers with line breaks between, but we removed all the line breaks, so now we are looking for number@number@ and we want to replace that who section with a line break/new line.
In a text editor you can find that string with a regex, like [0-9]@[0-9]@ and replace it with \n
TIP: If your final column is full of the world “Edit” you might do something like find: Edit@ replace with \n.
This turns our big blob of text:
Into something resembling a csv file
Step 5: remove trailing delimiters
My scruffy edits have left me with spare delimiters at the end of some lines, they looked like “text@”: I don’t want these as they will create extra columns, so I did one final find and replace for @\n replacing with \n.
Step 6: Save as csv
Save the file with .csv at the end, someplace you can find it again.
Step 7: Open in LibreOffice
Open LibreOffice and go to File > Open. Under Separator Options un-check everything then check Other and put your delimiter of choice in the box, mine, if you recall, was an @.
In the preview box, you should now see a nice, tidy set of rows and columns: don’t worry if you are not sure, opening the file in LibreOffice won’t change/edit the file in anyway, so you can always try again.
Step 8: Tidy up
With most data, this isn’t all going to be 100% – Out of 100 rows I had 5 that the regex didn’t work for, so these needed to be manually moved to the bottom of the list. Probably the reason was they had something else in the last two columns in the original data, but it’s not worth checking for 5 lines. If you have a LOT of problems, I suggest going through the process again and tweaking the find and replace actions you do.
TIP: you can check the spreadsheet against the original table, this can help you check it’s worked right and locate any specific issues you have.
Hope that was helpful! xx
A hello from Brazil, you save my time. Thanks
Glad to be of service, friend! x
Hey,
Thanks for your publish.
If you had GNU/linux you can use the paste command.
You should do the first step ; copy the data from website to a file let’s say called test.md.
After that if your data have for exemple 5 columns open an terminal and write :
paste -d “;” – – – – – > traited_data.csv
-d “;” it will replace all newline with ;
– – – – – : it’s the number of column you had
test.md is your unformatted file
traited_data.csv is the new file with good column