Question about parsing of foreign language strings

Hello. I would appreciate a bit of swarm knowledge about different languages. Maybe you can help me out here.

The problem is that I need to parse some text snippets in an app instead of outputting them. So manual translation is out of the question. (Just to prevent people from posting links to Transifex.)

In the cookbook app, we want to allow updating the ingredient amounts based on the number of servings to be prepared. Unfortunately, the ingredients are stored as plain strings only (for technical reasons). So, for example, a string could be 100 g sugar or 2 carrots. If I double the number of servings, the interface should show 200 g sugar etc. You get it, I guess.

Currently, we are planning to use a very simple parsing routine: assume that the ingredient starts with the amount (200), then a space, followed by the unit (g) and the description of the ingredient. By splitting it up like this, we might be able to just update the amount accordingly (200 → 400).
If an ingredient does not follow this structure (e.g. a pinch of salt), a fallback is used to “multiply” it as well.
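The routine described above could be sketched roughly like this. This is only a minimal illustration, not the cookbook app's actual code: the function name `scale_ingredient`, the pattern `AMOUNT_RE` and the `2 x …` fallback format are assumptions made for the example.

```python
import re

# Hypothetical pattern: a leading integer or decimal amount (decimal point only),
# at least one space, then the rest of the line (unit and description).
AMOUNT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(.*)$")


def scale_ingredient(text: str, factor: float) -> str:
    """Multiply the leading amount of an ingredient string by `factor`."""
    match = AMOUNT_RE.match(text)
    if match is None:
        # Fallback for strings such as "a pinch of salt": the output format
        # here is an assumption, the thread does not specify one.
        return f"{factor:g} x {text}"
    amount = float(match.group(1)) * factor
    # ":g" drops a trailing ".0", so 200.0 is printed as "200".
    return f"{amount:g} {match.group(2)}"


print(scale_ingredient("100 g sugar", 2))      # 200 g sugar
print(scale_ingredient("2 carrots", 2))        # 4 carrots
print(scale_ingredient("a pinch of salt", 2))  # 2 x a pinch of salt
```

Note that this sketch already fails for strings like 100g sugar (no space) or 1,5 l milk (decimal comma), which is exactly the kind of case the questions below are about.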

The problem is that there are many languages out there, and each might have a different grammatical structure. In other grammars, our assumptions might no longer be valid. So when people start parsing non-English content and expect the recalculation to work correctly, it might fail badly. There might also be differences in the typical usage of numbers and units across languages. I will give a few examples I can think of below.

We are going to create a basic structure very soon, but I would like to know whether there are bigger problems to expect with this functionality and, if possible, how to tackle them before releasing the new feature.

Some examples of things I can think of that might cause problems:

  • Is the ordering (amount unit description) always the same for all languages? I know, for example, that in French some adjectives go after the noun (they would not say the green flower but the flower green, la fleur verte). Is there possibly a language that changes the order here as well?
  • How about the decimal separator? In English it is a point, in German it is a comma. Are there any other alternatives? Are there any additional special symbols for numbers in other languages (like the comma used in English as a grouping separator, as in 1,000,000)?
  • I have no clue about right-to-left languages, honestly. That they might cause problems is just a wild guess.
  • I think in the imperial measurement system it is common to have mixed fractions (like 3 1/2 cups of flour). Are there similar structures in other languages? Maybe slightly different structures?
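To make the decimal separator question concrete, here is a minimal sketch that converts a number string once the separator convention is known. The function name `parse_decimal` and its `decimal_sep` parameter are hypothetical, and the sketch assumes that only the point and the comma occur as separators; other grouping characters are not handled.

```python
def parse_decimal(text: str, decimal_sep: str = ".") -> float:
    """Parse a number written with the given decimal separator.

    Assumption: the other one of "." and "," is used as the grouping
    separator and can simply be dropped.
    """
    group_sep = "," if decimal_sep == "." else "."
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


print(parse_decimal("1,000,000"))     # 1000000.0 (English convention)
print(parse_decimal("1,5", ","))      # 1.5       (German convention)
print(parse_decimal("1.000,5", ","))  # 1000.5
```

The catch is that the string alone is ambiguous: 1,000 is one thousand under the English convention and one under the German one, so the parser needs to know (or guess) the language of the recipe.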

Before we make any bad decisions, can you share your experience, especially as translators who can read/speak multiple languages? Do you have any tips on what to do and what to keep an eye on?

Thanks a lot
Christian


Doing something generic will be difficult. Often these adjectives change in the plural forms. Perhaps there are ways to avoid this, e.g. put the ingredient first and then the quantity:
Carrot: 1
Yellow Tomato: 4
Flour: 200 g

Then you can easily change the numeric quantity.
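A minimal sketch of why this layout is easy to scale, assuming the quantity always follows the first colon and uses a decimal point. The function name `scale_labelled_line` is made up for the example.

```python
def scale_labelled_line(line: str, factor: float) -> str:
    """Scale a line of the form "Name: amount [unit]"; return it unchanged otherwise."""
    name, _, quantity = line.partition(":")
    parts = quantity.split(maxsplit=1)
    if not parts:
        return line  # no colon or nothing after it
    try:
        amount = float(parts[0])
    except ValueError:
        return line  # the quantity does not start with a plain number
    unit = f" {parts[1]}" if len(parts) > 1 else ""
    return f"{name.strip()}: {amount * factor:g}{unit}"


print(scale_labelled_line("Carrot: 1", 4))             # Carrot: 4
print(scale_labelled_line("Yellow Tomato: 4", 0.5))    # Yellow Tomato: 2
print(scale_labelled_line("Flour: 200 g", 2))          # Flour: 400 g
```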

That is normally handled by your operating system: it is part of your locale (the formatting conventions that come with the language). That is already the case for date formatting.

There was a feature request for better support (https://github.com/nextcloud/server/issues/31420).

As for the units: going from 1 carrot to 2 carrots is the same problem as 1 file versus 2 files. I think that is already handled by the translation system.

One more topic: countries that don’t use SI units :wink:

The thing is: I cannot dictate the exact strings the users will use. Instead, I have to parse whatever they give me. Until now, I have always seen structures such as 100 ml milk.

The suggested structure (description colon amount unit) would be an additional way to express the ingredients. I will have to look at this.

Nevertheless, the description is not of (much) interest here. The problem is to extract the numeric value in order to do some math on it.

We have no OS involved here. We are just parsing some strings that are ingredients. We have to be able to accept some sort of fractions (be it decimal like 1.5 or pure/mixed fractions like 1 1/2).
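For the three notations mentioned here, Python's standard `fractions.Fraction` already accepts strings like 1.5 and 1/2, so only the mixed form needs extra handling. A minimal sketch, where the name `parse_mixed_amount` and the pattern `MIXED_RE` are assumptions for illustration:

```python
import re
from fractions import Fraction

# Hypothetical pattern for mixed fractions such as "1 1/2" or "3 1/2".
MIXED_RE = re.compile(r"^(\d+)\s+(\d+/\d+)$")


def parse_mixed_amount(text: str) -> Fraction:
    """Parse "2", "1.5", "1/2" or "1 1/2" into an exact Fraction."""
    text = text.strip()
    mixed = MIXED_RE.match(text)
    if mixed:
        return Fraction(mixed.group(1)) + Fraction(mixed.group(2))
    # Fraction() handles integers, decimals with a point, and "a/b".
    # It raises ValueError for anything else (e.g. "1,5").
    return Fraction(text)


print(parse_mixed_amount("1 1/2"))      # 3/2
print(parse_mixed_amount("1.5"))        # 3/2
print(parse_mixed_amount("3 1/2") * 2)  # 7
```

Working with exact fractions avoids rounding noise when scaling, e.g. one third of 1 cup stays 1/3 instead of becoming 0.333….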

Even when outputting, we need to find a generic algorithm, as we are not using native code (OS) but creating/showing a website. So it would be a question of the browser rather than the OS itself.

Again, I am concerned about the parsing, not the exporting/writing/outputting of strings. Is the grammar the same in these languages? How are these languages serialized (which is handled in an LTR manner)? Is there someone here who can give us a few examples to see if our algorithm will work?

In cookbooks, you usually have a list of ingredients. A concise list of ingredients is very useful even for the user, and it will help you to do math on the ingredients, filter recipes for certain ingredients, …

Perhaps it can help to look at other projects, how they structure data (perhaps there are already some recipe formatting “standards”), and perhaps how it might change in the future (e.g. if you’d like to add nutrition information):

Yes, but when it comes to parsing numbers, whether you have a decimal point or a decimal comma is defined in your OS/browser.

For parsing, it shouldn’t be difficult to extract the number values with regular expressions. I would then consider the words to the left and right of the number: from the first word to the right, I’d try to get a unit (mL, ml, g, kg, cups, …), and from the word after that, the ingredient.
If the quantities are written out in words, then you have a problem.
I’d probably go language by language and first try to do it in English. In the end, there will probably always be a few strings that won’t parse correctly. At some stage, it would be helpful to show the extracted quantities along with the original text (with the parts marked).
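The regular expression idea could look roughly like the sketch below. It finds the first number anywhere in the string, checks whether the next word is a known unit, and returns the text with the number marked in brackets for review. The names `extract_quantity`, `Extracted`, `NUMBER_RE` and the short `KNOWN_UNITS` list are assumptions for the example, and only the number (not the unit) is marked.

```python
import re
from typing import NamedTuple, Optional

# Units taken from the post; a real list would be longer and per language.
KNOWN_UNITS = ("ml", "g", "kg", "cups")

# First run of digits, optionally with a decimal point or comma.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


class Extracted(NamedTuple):
    amount: str           # the number exactly as written
    unit: Optional[str]   # the word right of the number, if it is a known unit
    marked: str           # original text with the number in [brackets]


def extract_quantity(text: str) -> Optional[Extracted]:
    match = NUMBER_RE.search(text)
    if match is None:
        return None  # e.g. quantities written out in words
    words_after = text[match.end():].split()
    unit = None
    if words_after and words_after[0].lower() in KNOWN_UNITS:
        unit = words_after[0]
    marked = f"{text[:match.start()]}[{match.group()}]{text[match.end():]}"
    return Extracted(match.group(), unit, marked)


print(extract_quantity("100 ml milk"))
print(extract_quantity("Flour: 200 g"))
print(extract_quantity("a pinch of salt"))  # None
```

Showing the `marked` string next to the scaled result would let users spot the strings that were parsed incorrectly.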

Another idea: with all the fuss about AI, try a bot and ask it to create a list of ingredients. Perhaps some have a decent API that you can use.