Author Topic: Board move delayed  (Read 159 times)

The Gorn

  • Your agonizer, please. And be sure to keep the batteries charged!
  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 14170
  • Gornix user
Among other crazy poo that I have had to contend with in this export...
« Reply #15 on: February 25, 2010, 02:23:18 pm »
I am now pacing myself and fixing a few issues every night.

Three oddball things I have run into lately, each of which corrupted something in the conversion:

1) I have this entire board set to deliver up to 100 messages per page. In other words, we should have almost no threads that are paged. EXCEPT that the dummy Yuku user ID that I have been using as the user for downloading the test batches still sees 20 messages per page. This is ALMOST no problem except that this guy's stupid script seems to duplicate the first post of the thread for each page of a multipage thread. I think I have a workaround.

2) Apparently I overlooked character encoding issues. See this thread: http://openitforum.yuku.com?topic=7979 In the quoted block: "We tell our vendors, if you’re asking for 15 days" all of the apostrophes were three-byte UTF-8 sequences; it is a DIFFERENT apostrophe character than the normal one. But the SMF database was set for (geez) Latin-1 Swedish (?) encoding. So each of these characters was being displayed on the new board as three gibberish characters. I figured out how to force UTF-8 mode for all imported and exported data, which seems to fix the issue. I've seen the Unicode trash characters (improperly displayed) in several other posts I have checked at random, all of which were copied and pasted from other web sites. The whole domain of compatibility issues with Unicode and different character sets is enough to make my head explode.

3) A few posts contain the same <span> tag that I trigger on to detect the post title. The symptom was that some messages came through blank. I fixed it by making the search for the post title extremely specific, adding more of the surrounding tags. Seems fixed.
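
For item 1, here is a rough Python sketch of one possible workaround, not necessarily the one used here. It assumes the downloaded pages have already been parsed into a hypothetical structure: one list of (post_id, body) tuples per page. The name merge_pages is made up for illustration.

```python
def merge_pages(pages):
    """Flatten per-page post lists into one thread.

    `pages` is a hypothetical structure: one list of (post_id, body)
    tuples per downloaded page. If a later page starts with a copy of
    the thread's first post, that copy is dropped.
    """
    merged = []
    for page_number, posts in enumerate(pages):
        # Only pages after the first can carry the duplicated opener.
        if page_number > 0 and merged and posts and posts[0] == merged[0]:
            posts = posts[1:]
        merged.extend(posts)
    return merged


pages = [
    [(1, "first post"), (2, "reply A")],
    [(1, "first post"), (3, "reply B")],  # page 2 repeats the opener
]
thread = merge_pages(pages)
```

Comparing against the very first post only, rather than removing every duplicate, keeps posts that legitimately repeat the same text later in the thread.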

Some people do this stuff for a living. I guess it falls under the domain of data warehousing. I don't envy 'em. This is grueling, fussy work.
« Last Edit: February 25, 2010, 02:29:22 pm by G0ddard B0lt »
Gornix is protected by the GPL. *

* Gorn Public License. Duplication by inferior sentient species prohibited.


Origisaurus

  • Wise Sage
  • Wise Sage
  • *****
  • Posts: 1675
Board move delayed
« Reply #16 on: February 25, 2010, 03:03:44 pm »
Quote from: G0ddard B0lt
Some people do this stuff for a living. I guess it falls under the domain of data warehousing. I don't envy 'em. This is grueling, fussy work.
In line with the emerging Toyota meltdown, the group that I worked with on safety issues used SAS selection criteria to pick out reports to read by eyeball. It was a lot like a regex, except a little easier to read and maintain.

Avatar is from the cover of the November 2007 National Geographic.  Fair use is assumed.

PhilFromNY

  • Trusted Member
  • Wise Sage
  • ******
  • Posts: 750
FWIW
« Reply #17 on: February 25, 2010, 04:01:25 pm »
Latin-Swedish (latin1 with the latin1_swedish_ci collation) is the default for MySQL. You can change it by editing /etc/mysql/my.cnf (on Ubuntu).

In the [mysqld] section add:
character-set-server=utf8
collation-server=utf8_general_ci

On an individual database you can use the ALTER DATABASE command as defined here:
I've never used this on a database with data in it.

The Gorn

re: FWIW
« Reply #18 on: February 25, 2010, 04:19:36 pm »
Thanks for the tips.

In SMF (forum software) there is an administrative command called "convert database to UTF-8".

I accomplished the same thing by doing an export of the entire database from phpMyAdmin, then editing the SQL file and replacing all charset directives in the table creates with utf-8.
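
A minimal Python sketch of that kind of edit, in case anyone wants to script it instead of doing search and replace by hand. It assumes the dump spells charsets the way phpMyAdmin and mysqldump usually do (DEFAULT CHARSET=latin1, CHARACTER SET latin1, COLLATE latin1_swedish_ci); the function name and the sample table are made up for illustration.

```python
import re

# Assumed spellings of the charset directives in the dump file.
_CHARSET = re.compile(r"(CHARSET=|CHARACTER SET )latin1\b")
_COLLATE = re.compile(r"(COLLATE[= ])latin1_\w+")


def retarget_charset(dump_sql: str) -> str:
    """Point table and column definitions at utf8, leaving data rows alone."""
    out = []
    for line in dump_sql.splitlines(keepends=True):
        # Skip INSERT lines so post bodies that mention "latin1" are untouched.
        if not line.lstrip().upper().startswith("INSERT"):
            line = _CHARSET.sub(r"\g<1>utf8", line)
            line = _COLLATE.sub(r"\g<1>utf8_general_ci", line)
        out.append(line)
    return "".join(out)


sample = (
    "CREATE TABLE `example_posts` (\n"
    "  `body` text CHARACTER SET latin1 COLLATE latin1_swedish_ci NOT NULL\n"
    ") ENGINE=MyISAM DEFAULT CHARSET=latin1;\n"
    "INSERT INTO `example_posts` VALUES ('CHARSET=latin1');\n"
)
converted = retarget_charset(sample)
```

This only changes what the tables declare. The bytes in the INSERT rows are whatever the export wrote, so the dump also has to be exported and re-imported as UTF-8 for the result to be consistent.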

What happens (in general) with the conversion script I am using is that it downloads all of the pages like a normal web client, and the text of each message more or less passes through to the SQL INSERT statements at the back end. Normally all characters that are special to HTML are already escaped in the pages served by Yuku, as they should be. But if a web page contains multi-byte UTF-8 characters, those 2-3 bytes get sent down as-is in the downloaded data. I only saw this when I downloaded a test page to my PC and opened it in hex edit mode in a text editor.

Then (as I found) the database may, without the correct Unicode setting, treat those 2-3 bytes as individual characters when page views are constructed in SMF.
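
The effect is easy to reproduce in a few lines of Python. This is only an illustration of the byte-level mix-up, not the conversion script itself. It uses cp1252 as the stand-in for the database setting, since MySQL's latin1 is documented as cp1252 rather than strict ISO 8859-1.

```python
# The "different" apostrophe is U+2019 (right single quotation mark).
original = "if you\u2019re asking"

# What the web server sends: the apostrophe becomes three UTF-8 bytes.
raw = original.encode("utf-8")
apostrophe_bytes = raw[6:9]

# What a latin1 (cp1252) reader makes of those bytes: one character per byte.
misread = raw.decode("cp1252")

print(apostrophe_bytes.hex(" "))
print(misread)
```

The three bytes e2 80 99 come out as three separate characters, which is exactly the gibberish that shows up in the imported posts.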

I didn't realize how important this aspect of data conversion was, until now.

I guess this is good experience. This is certainly the most difficult data conversion I have ever undertaken. Add data conversion to my repertoire, seriously.


Richardk

  • Global Moderator
  • Wise Sage
  • *****
  • Posts: 3815
Always more fun than you expect
« Reply #19 on: February 25, 2010, 09:25:33 pm »
It seems that Yuku pages are UTF-8 while, as noted, the default MySQL encoding is latin1, which if I recall matches ASCII only for the first 128 characters.

Another area to double check is the settings on your backend for PHP and Apache. Since these pieces talk to each other, they need to know what encoding is being used. For instance, the DB may be passing UTF-8 data while the receiving process assumes it's ASCII.

I think your best bet is to set everything to UTF-8.

Yes, data conversion can be fun. I still remember one developer having problems, and I told him his data was in EBCDIC. You can only guess the blank stare on his face.

Almost forgot:
Quote
In SMF (forum software) there is an administrative command called "convert database to UTF-8".
While that might convert the database, I'm not so sure it will fix your character problems, unless it "sees" the funny characters and concludes each run is really a Unicode character, and even then it still has to decide 'what was it supposed to be'. Unless you can tell it what to look for, I'm not sure it will totally fix your problem. Converting the database and converting the data are two different steps, unless there's some magic that happens in the middle.
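
To make the "converting the data" step concrete, here is a hedged Python sketch of one way to undo the damage after the fact. It assumes the text was UTF-8 that got read as cp1252 exactly once; repair_mojibake is a made-up name, and I have no idea whether SMF's converter does anything like this.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes read as cp1252', if that is what happened.

    Re-encode the characters back to the bytes they came from, then decode
    those bytes as UTF-8. If either step fails, the text was probably not
    damaged this way, so it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return text


damaged = "if you\u00e2\u20ac\u2122re asking"   # three gibberish characters
fixed = repair_mojibake(damaged)                # back to one apostrophe
untouched = repair_mojibake("if you\u2019re asking")
```

This is where the 'what was it supposed to be' question bites: the round trip is only a guess. Text that was stored correctly normally fails the UTF-8 decode and is left alone, but a post that really does contain a sequence like that on purpose would be "repaired" wrongly, so it is worth spot checking the results.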


« Last Edit: February 25, 2010, 09:34:46 pm by Richardk »

